This pipeline implements optimised allele-specific copy number calling with Sequenza and CNAqc, optimising purity estimated from data, and allele-specific segments.

Requirements:

1. (Prerequisites) Run Sequenza utils steps outside R a. Process a FASTA file to produce a GC Wiggle track file b. Process BAM and Wiggle files to produce a seqz file: c. Post-process by binning the original seqz file:

Instructions to customize these steps are available at https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html

2. This pipeline (Sequenza_CNAqc) optimizes CNA calling and purity estimations using peak detection from CNAqc. Starting from a broad set of initial conditions for purity (5% to 100%) the pipeline fits cellularity and ploidy values and dumps results. Results are then quality controlled by peak detection via CNAqc. The analysis with CNAqc can be carried out using either somatic mutations called by Sequenza, or an external set of input calls.

3. At every run the best solution by Sequenza is stored in a cache, and up to two alternative solutions are generated: one optional by Sequenza, which might propose different exact values for cellularity and ploidy; one by CNAqc, adjusting the current cellularity of the best Sequenza solution. The alternative solutions are enqueued in a list L of cellularity and ploidy values to be tested, based on their presence in the cache.

4. Until the list L is empty, the point values of cellularity and ploidy are tested repeating step 3, using ranges around the proposed values specified by input parameters. Further refinement steps will restrict the Sequenza purity range based on the score computed by analyze_peaks, which determines a purity gradient.

5. Each run is saved in a folder named run-1, run-2, etc. When the list L is empty the pipeline stops and a symbolic link final pointing to the best Sequenza solution, based on CNAqc score is created.

Sequenza_CNAqc(
  sample_id,
  seqz_file,
  mutations = NULL,
  sex,
  cellularity = c(0.05, 1),
  ploidy = c(1.8, 5.4),
  reference = "GRCh38",
  normalization.method = "median",
  window = 1e+05,
  gamma = 280,
  kmin = 300,
  min.reads.baf = 50,
  min.reads = 50,
  min.reads.normal = 15,
  max.mut.types = 1,
  delta_cellularity = 0.05,
  delta_ploidy = 0.25,
  verbose = FALSE,
  ...
)

Arguments

sample_id

The id of the sample.

seqz_file

A binned .seqz file.

mutations

If not NULL (default), an external set of mutation calls, in tibble format. Must include colummns "chr", "from", "to", "NV", "NR", "VAF".

sex

Sex of the patient from whom the sample is drawn.

cellularity

Input to sequenza.fit, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.

ploidy

Input to sequenza.fit, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.

reference

Reference genome among supported by CNAqc (GRCh38 or hg19/GRCh37).

normalization.method

Input to sequenza.extract, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.

window

Input to sequenza.extract, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.

gamma

Input to sequenza.extract, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.

kmin

Input to sequenza.extract, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.

min.reads.baf

Input to sequenza.extract, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.

min.reads

Input to sequenza.extract, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.

min.reads.normal

Input to sequenza.extract, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.

max.mut.types

Input to sequenza.extract, see https://cran.r-project.org/web/packages/sequenza/vignettes/sequenza.html.

delta_cellularity

When a proposed solution of purity and ploidy is tested, Sequenza is run on an interval of purity centered at the proposed value, of half width delta_cellularity.

delta_ploidy

When a proposed solution of purity and ploidy is tested, Sequenza is run on an interval of ploidy centered at the proposed value, of half width delta_ploidy.

verbose

If TRUE (default FALSE) at each Sequenza fitting step, print the updated list of purity and ploidy pairs to test.

...

Optional parameters passed to the analyse_peaks function by CNAqc. Tune these to change error tolerance or karyotypes to use for QC.

Value

A tibble containing for each run its number run, the solution values of purity and ploidy, the corresponding CNAqc score and quality control status QC, a list sequenza of the input and output of the Sequenza fitting procedure, and the object of class CNAqc created by the init() and analyze_peaks() functions.

Note

If at any time a purity proposal made by CNAqc leads to inconsistent values (i.e., outside $[0,1]$) the pipeline stops prompting to the user to adjust manually the range of ploidy.

Examples