We are happy to integrate any copy number caller with CNAqc; do get in touch to discuss how to integrate your caller.
CNAqc provides automatic integration with Sequenza.
Sequenza calls clonal CNAs in a multi-step Python/R pipeline:
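1. process a FASTA file to produce a GC Wiggle track file;
2. process BAM and Wiggle files to produce a seqz file;
3. post-process by binning the original seqz file;
4. use the binned seqz file for normalisation and segmentation;
5. fit cellularity and ploidy;
6. dump the results.
One advantage of Sequenza is that results are computed for both the best solution and possible alternative solutions.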
Here cellularity is the Sequenza term for tumour purity. Steps 1-3 are implemented in Python, steps 4-6 in R. In our experience, steps 1-3 are usually run just once (often with default parameter values), while the last steps are iterated to optimise CNA calling.
We have implemented a calling pipeline (function Sequenza_CNAqc) that revolves around steps 4-6, using CNAqc to drive the fitting steps of Sequenza, especially step 5, to determine better CNA segments and purity/ploidy values. To run this pipeline, steps 1-3 need to be executed beforehand; according to the Sequenza documentation, the default execution is as follows.
# Process a FASTA file to produce a GC Wiggle track file:
sequenza-utils gc_wiggle -w 50 --fasta hg19.fa -o hg19.gc50Base.wig.gz
# Process BAM and Wiggle files to produce a seqz file:
sequenza-utils bam2seqz -n normal.bam -t tumor.bam --fasta hg19.fa \
    -gc hg19.gc50Base.wig.gz -o out.seqz.gz
# Post-process by binning the original seqz file:
sequenza-utils seqz_binning --seqz out.seqz.gz -w 50 -o out_small.seqz.gz
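When steps 1-3 are run for many samples, it can help to generate the three command lines programmatically. The following Python sketch only assembles the commands shown above, using the same sub-commands and flags; the helper name `sequenza_preprocessing_commands`, its arguments and the `out_small.seqz.gz` naming of the binned file are assumptions made for illustration and are not part of Sequenza or CNAqc.

```python
import shlex


def sequenza_preprocessing_commands(fasta, normal_bam, tumour_bam,
                                    window=50, prefix="out"):
    """Build the sequenza-utils command lines for steps 1-3 as argument lists.

    Hypothetical helper: file names are derived from `fasta` and `prefix`
    purely for illustration.
    """
    wig = f"{fasta.rsplit('.', 1)[0]}.gc{window}Base.wig.gz"
    seqz = f"{prefix}.seqz.gz"
    binned = f"{prefix}_small.seqz.gz"  # assumed name for the binned file
    return [
        # Step 1: GC Wiggle track from the reference FASTA
        ["sequenza-utils", "gc_wiggle", "-w", str(window),
         "--fasta", fasta, "-o", wig],
        # Step 2: seqz file from the normal/tumour BAMs and the Wiggle track
        ["sequenza-utils", "bam2seqz", "-n", normal_bam, "-t", tumour_bam,
         "--fasta", fasta, "-gc", wig, "-o", seqz],
        # Step 3: binning of the original seqz file
        ["sequenza-utils", "seqz_binning", "--seqz", seqz,
         "-w", str(window), "-o", binned],
    ]


if __name__ == "__main__":
    for cmd in sequenza_preprocessing_commands("hg19.fa", "normal.bam", "tumor.bam"):
        print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` to execute the steps in order, provided `sequenza-utils` is installed and on the `PATH`.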
The pipeline starts from input ranges for cellularity and ploidy values, as canonically required by Sequenza.
Parameters for segmentation are also passed as input following the standard Sequenza convention, and step 4 is performed once before starting an iterative procedure that uses a caching system to avoid redundant computations. The default Sequenza parameter values we release with CNAqc have been selected to work best with whole-genome sequencing data, a type of data we often use to run this pipeline.
The pipeline fits (Sequenza steps 5-6) cellularity and ploidy values, and dumps results, which are quality controlled by peak detection via CNAqc. The analysis with CNAqc can be carried out using either:
Besides the best solution by Sequenza, at every run up to two alternative solutions are generated:
These alternative solutions are queued in a list of cellularity and ploidy values that should be tested, unless they are already present in a cache (the cache prevents re-testing solutions already visited). Until the list is empty, the point values of cellularity and ploidy are tested by repeating steps 5-6 with small ranges built around the proposed values. For example, if a solution with cellularity c and ploidy p is to be tested, CNAqc runs Sequenza using ranges [c - δc, c + δc] and [p - δp, p + δp], where δc and δp are parameters of the pipeline.
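The queue-and-cache logic described above can be sketched in a few lines of Python. This is a simplified illustration, not the code of Sequenza_CNAqc: the names `search_ranges` and `explore`, the `fit` callback (standing in for Sequenza steps 5-6 followed by CNAqc quality control), the symmetric ranges, the default half-widths, the rounding used to compare solutions and the clipping of cellularity to [0, 1] are all assumptions. The sketch also starts from a single point rather than from user-supplied input ranges.

```python
from collections import deque


def search_ranges(cellularity, ploidy, delta_c, delta_p):
    """Small ranges built around a proposed (cellularity, ploidy) point."""
    # Assumption: cellularity is a fraction, so its range is clipped to [0, 1].
    c_range = (max(0.0, cellularity - delta_c), min(1.0, cellularity + delta_c))
    p_range = (ploidy - delta_p, ploidy + delta_p)
    return c_range, p_range


def explore(initial, fit, delta_c=0.05, delta_p=0.25, digits=2):
    """Test queued (cellularity, ploidy) points until the queue is empty.

    `fit(c_range, p_range)` is a hypothetical callback that returns
    (result, alternatives), where `alternatives` is a list of
    (cellularity, ploidy) points proposed for further testing.
    """
    def key(point):
        # Round so that numerically equal proposals share one cache entry.
        return (round(point[0], digits), round(point[1], digits))

    seen = {key(initial)}      # cache of solutions already queued or visited
    queue = deque([initial])
    results = {}
    while queue:
        c, p = queue.popleft()
        result, alternatives = fit(*search_ranges(c, p, delta_c, delta_p))
        results[key((c, p))] = result
        for alt in alternatives:
            if key(alt) not in seen:   # skip solutions found in the cache
                seen.add(key(alt))
                queue.append(alt)
    return results
```

Because every proposal is checked against the cache before it is queued, the loop terminates as soon as the fits stop proposing unseen points, and each solution is fitted at most once.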
Therefore the pipeline follows two paths to optimise the calls:
Note that when CNAqc flags as PASS a solution that is visited for the first time, any further correction will not be evaluated since the solution will be found in the cache system of the pipeline.
At the end of the runs, all the solutions identified are scored by CNAqc and reported with an associated PASS/FAIL status; the pipeline always indicates the best of the tested solutions, based on the assigned quality score.
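As a minimal illustration of this final ranking, the Python sketch below picks the best of a set of scored solutions. The `Solution` record, its field names and the convention that a smaller absolute score is better are assumptions made for the example; they do not describe how CNAqc stores or orders its results.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    cellularity: float
    ploidy: float
    score: float   # hypothetical quality score; assumed: smaller |score| is better
    status: str    # "PASS" or "FAIL"


def best_solution(solutions):
    """Return the tested solution with the smallest absolute score."""
    return min(solutions, key=lambda s: abs(s.score))


if __name__ == "__main__":
    # Illustrative values only, not real results.
    tested = [
        Solution(0.62, 2.1, -0.08, "FAIL"),
        Solution(0.70, 2.0, 0.01, "PASS"),
        Solution(0.35, 3.9, 0.04, "PASS"),
    ]
    best = best_solution(tested)
    print(best.cellularity, best.ploidy, best.status)
```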