This pipeline implements optimised allele-specific copy number calling with Sequenza and CNAqc, optimising purity estimated from data, and allele-specific segments.
Requirements:
1. (Prerequisites) Run Sequenza utils steps outside R a. Process a FASTA file to produce a GC Wiggle track file b. Process BAM and Wiggle files to produce a seqz file: c. Post-process by binning the original seqz file:
Instructions to customize these steps are available at https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html
2. This pipeline (Sequenza_CNAqc
) optimizes CNA calling and purity
estimations using peak detection from CNAqc. Starting from a broad set of initial conditions
for purity (5% to 100%) the pipeline fits cellularity and ploidy values and dumps results.
Results are then quality controlled by peak detection via CNAqc. The analysis with CNAqc
can be carried out using either somatic mutations called by Sequenza, or an external set
of input calls.
3. At every run the best solution by Sequenza is stored in a cache, and up to two alternative solutions are generated: one optional by Sequenza, which might propose different exact values for cellularity and ploidy; one by CNAqc, adjusting the current cellularity of the best Sequenza solution. The alternative solutions are enqueued in a list L of cellularity and ploidy values to be tested, based on their presence in the cache.
4. Until the list L is empty, the point values of cellularity and ploidy are
tested repeating step 3, using ranges around the proposed values specified by
input parameters. Further refinement steps will restrict the Sequenza purity
range based on the score computed by analyze_peaks
, which determines a purity
gradient.
5. Each run is saved in a folder named run-1
, run-2
, etc.
When the list L is empty the pipeline stops and a symbolic link final
pointing to the best Sequenza solution, based on CNAqc score is created.
Sequenza_CNAqc(
sample_id,
seqz_file,
mutations = NULL,
sex,
cellularity = c(0.05, 1),
ploidy = c(1.8, 5.4),
reference = "GRCh38",
normalization.method = "median",
window = 1e+05,
gamma = 280,
kmin = 300,
min.reads.baf = 50,
min.reads = 50,
min.reads.normal = 15,
max.mut.types = 1,
delta_cellularity = 0.05,
delta_ploidy = 0.25,
verbose = FALSE,
...
)
The id of the sample.
A binned .seqz
file.
If not NULL (default), an external set of mutation calls, in tibble format. Must include colummns "chr", "from", "to", "NV", "NR", "VAF".
Sex of the patient from whom the sample is drawn.
Input to sequenza.fit
, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.
Input to sequenza.fit
, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.
Reference genome among supported by CNAqc (GRCh38 or hg19/GRCh37).
Input to sequenza.extract
, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.
Input to sequenza.extract
, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.
Input to sequenza.extract
, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.
Input to sequenza.extract
, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.
Input to sequenza.extract
, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.
Input to sequenza.extract
, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.
Input to sequenza.extract
, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.
Input to sequenza.extract
, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.
When a proposed solution of purity and ploidy is tested,
Sequenza is run on an interval of purity centered at the proposed value, of half
width delta_cellularity
.
When a proposed solution of purity and ploidy is tested,
Sequenza is run on an interval of ploidy centered at the proposed value, of half
width delta_ploidy
.
If TRUE (default FALSE) at each Sequenza fitting step, print the updated list of purity and ploidy pairs to test.
Optional parameters passed to the analyse_peaks
function
by CNAqc. Tune these to change error tolerance or karyotypes to use for QC.
A tibble
containing for each run its number
run
, the solution values of purity
and ploidy
, the corresponding
CNAqc score
and quality control status QC
, a list sequenza
of the
input and output of the Sequenza fitting procedure, and the object of class CNAqc
created by the init()
and analyze_peaks()
functions.
If at any time a purity proposal made by CNAqc leads
to inconsistent values (i.e., outside $[0,1]$) the pipeline stops prompting to the user
to adjust manually the range of ploidy
.