With this function, CNAqc computes a CCF for each input mutation after “phasing” its multiplicity. Phasing is the task of computing the number of copies of a mutation that map to a given copy number segment; this task is difficult, and phasing errors can lead to erroneous CCF estimates.
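To make the link between multiplicity and CCF concrete, here is a minimal sketch of the standard purity-adjusted relation between VAF, multiplicity and CCF. The function names `expected_vaf` and `vaf_to_ccf` are hypothetical and not part of CNAqc, and the sketch assumes diploid normal cells. With purity 0.89 and karyotype 2:1 it gives peaks of roughly 0.308 and 0.616, consistent with the values printed in the example output further down.

```r
# Expected VAF of a mutation present in `m` copies of a segment with
# `n_major + n_minor` tumour copies, at tumour purity `purity`.
# Assumption: normal cells are diploid (2 copies).
expected_vaf <- function(m, n_major, n_minor, purity) {
  (m * purity) / (2 * (1 - purity) + purity * (n_major + n_minor))
}

# Invert the relation: CCF from an observed VAF once the multiplicity is known.
vaf_to_ccf <- function(vaf, m, n_major, n_minor, purity) {
  vaf * (2 * (1 - purity) + purity * (n_major + n_minor)) / (m * purity)
}

# Expected peaks for multiplicity 1 and 2 in a 2:1 segment
peaks <- expected_vaf(m = 1:2, n_major = 2, n_minor = 1, purity = 0.89)
print(peaks)

# The same observed VAF maps to different CCFs under the two multiplicities
ccf_m1 <- vaf_to_ccf(0.45, m = 1, n_major = 2, n_minor = 1, purity = 0.89)
ccf_m2 <- vaf_to_ccf(0.45, m = 2, n_major = 2, n_minor = 1, purity = 0.89)
print(c(m1 = ccf_m1, m2 = ccf_m2))
```

The CCF under multiplicity 1 is exactly twice the CCF under multiplicity 2, which is why a wrongly phased mutation shifts the CCF distribution so visibly.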
CNAqc computes CCFs for simple clonal CNA segments, offering two algorithms to phase mutations directly from VAFs.
* Entropy-based method. The entropy-based approach flags mutations whose multiplicity cannot be phased from VAFs with certainty; the CCFs of these mutations should be manually checked and, unless necessary, discarded. To this end, a QC PASS is assigned when fewer than a certain percentage of mutations have uncertain CCFs. The model uses the entropy of a VAF mixture of two Binomial distributions to detect mutations that happened before and after aneuploidy. Assigning multiplicities is difficult where the two densities cross, because mutations there could have multiplicity 1 or 2. If phased incorrectly, these mutations can create artificial peaks in the CCF distribution and compromise downstream subclonal deconvolution.
* Hard-cut method. This method computes CCFs regardless of the entropy. From the 2-class Binomial mixture, CNAqc uses the means of the Binomial parameters to determine a hard split of the data. Since no mutation is left unassigned (NA), the computation is always scored PASS for QC purposes; for this reason it is “rougher” than the entropy-based one.
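The two strategies above can be sketched on toy read counts. This is an illustration under stated assumptions, not CNAqc's implementation: the helper `phase_multiplicity`, the entropy threshold `max_entropy`, and the choice of the midpoint between the two peaks as the hard cut are all hypothetical.

```r
# Hypothetical helper: phase multiplicities (1 or 2) from read counts using a
# two-component Binomial mixture with known peaks and mixing weights.
phase_multiplicity <- function(nv, dp, peaks, weights = c(0.5, 0.5),
                               method = c("ENTROPY", "ROUGH"),
                               max_entropy = 0.9) {
  method <- match.arg(method)

  if (method == "ROUGH") {
    # Assumption: the hard cut sits at the midpoint between the two peaks
    return(ifelse(nv / dp > mean(peaks), 2, 1))
  }

  # Posterior probability of each multiplicity given the read counts
  lik1 <- weights[1] * dbinom(nv, size = dp, prob = peaks[1])
  lik2 <- weights[2] * dbinom(nv, size = dp, prob = peaks[2])
  post1 <- lik1 / (lik1 + lik2)
  post2 <- 1 - post1

  # Shannon entropy (in bits) of the posterior; 0 * log2(0) is treated as 0
  h <- -(ifelse(post1 > 0, post1 * log2(post1), 0) +
         ifelse(post2 > 0, post2 * log2(post2), 0))

  # Most likely multiplicity, set to NA where the assignment is too uncertain
  m <- ifelse(post2 > post1, 2, 1)
  m[h > max_entropy] <- NA
  m
}

nv <- c(30, 46, 62)          # reads supporting the variant
dp <- rep(100, 3)            # total depth
peaks <- c(0.308, 0.616)     # approximate peaks for karyotype 2:1, purity 0.89

entropy_calls <- phase_multiplicity(nv, dp, peaks, method = "ENTROPY")
rough_calls   <- phase_multiplicity(nv, dp, peaks, method = "ROUGH")
print(entropy_calls)
print(rough_calls)
```

The middle mutation lies close to where the two densities cross, so the entropy-based sketch leaves it unassigned while the hard cut forces a multiplicity.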
As for other analyses, this function creates a field `CCF_estimates` inside the returned object, which contains the estimated CCFs.
compute_CCF(
x,
karyotypes = c("1:0", "1:1", "2:0", "2:1", "2:2"),
muts_per_karyotype = 25,
cutoff_QC_PASS = 0.1,
method = "ENTROPY"
)
`x`: A CNAqc object.
`karyotypes`: The karyotypes to use; this package supports only simple clonal CNAs.
`muts_per_karyotype`: Minimum number of mutations that are required to be mapped to a karyotype in order to compute CCF values (default 25).
`cutoff_QC_PASS`: For the entropy-based method, the proportion of mutations that can be not-assigned (NA) in a karyotype (default 0.1). If the karyotype has more than `cutoff_QC_PASS` non-assigned mutations, then the overall set of CCFs is failed for the karyotype.
`method`: Either "ENTROPY" (default) or "ROUGH", to reflect the two different algorithms to compute CCF.
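The `cutoff_QC_PASS` rule can be illustrated with a short sketch on toy data. The helper `qc_by_karyotype` is hypothetical and only mirrors the rule as described here: a karyotype fails when its proportion of NA multiplicities exceeds the cutoff.

```r
# Hypothetical helper: PASS/FAIL per karyotype from phased multiplicities
qc_by_karyotype <- function(karyotype, multiplicity, cutoff_QC_PASS = 0.1) {
  # Proportion of unassigned (NA) multiplicities in each karyotype
  prop_na <- tapply(is.na(multiplicity), karyotype, mean)
  status <- ifelse(as.vector(prop_na) > cutoff_QC_PASS, "FAIL", "PASS")
  names(status) <- names(prop_na)
  status
}

toy <- data.frame(
  karyotype = rep(c("2:1", "2:2"), each = 10),
  multiplicity = c(1, 2, NA, NA, 1, 1, 2, 2, 1, 1,  # 2 of 10 unassigned
                   2, 2, 2, 1, 2, 2, 1, 2, 2, 2)    # none unassigned
)

qc <- qc_by_karyotype(toy$karyotype, toy$multiplicity)
print(qc)
```

With `method = "ROUGH"` no multiplicity is NA, so a rule of this kind would always return PASS.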
A CNAqc object with CCF values.
Getter functions `CCF` and `plot_CCF`.
data('example_dataset_CNAqc')
x = init(
  mutations = example_dataset_CNAqc$mutations,
  cna = example_dataset_CNAqc$cna,
  purity = example_dataset_CNAqc$purity
)
#>
#> ── CNAqc - CNA Quality Check ───────────────────────────────────────────────────
#>
#> ℹ Using reference genome coordinates for: GRCh38.
#> ✔ Found annotated driver mutations: TTN, CTCF, and TP53.
#> ✔ Fortified calls for 12963 somatic mutations: 12963 SNVs (100%) and 0 indels.
#> ! CNAs have no CCF, assuming clonal CNAs (CCF = 1).
#> ✔ Fortified CNAs for 267 segments: 267 clonal and 0 subclonal.
#> ✔ 12963 mutations mapped to clonal CNAs.
x = compute_CCF(x, karyotypes = c('1:0', '1:1', '2:1', '2:0', '2:2'))
#> Warning: Some karyotypes have fewer than25and will not be analysed.
#> ── Computing mutation multiplicity for karyotype 2:1 using the entropy method. ─
#> ℹ Expected Binomial peak(s) for these calls (1 and 2 copies): 0.307958477508651 and 0.615916955017301
#> ℹ Mixing pre/ post aneuploidy: 0.55 and 0.45
#> ℹ Not assignamble area: [0.423423423423423; 0.504504504504504]
#> ── Computing mutation multiplicity for karyotype 2:0 using the entropy method. ─
#> ℹ Expected Binomial peak(s) for these calls (1 and 2 copies): 0.445 and 0.89
#> ℹ Mixing pre/ post aneuploidy: 0.09 and 0.91
#> ℹ Not assignamble area: [0.631578947368421; 0.723684210526316]
#> ── Computing mutation multiplicity for karyotype 2:2 using the entropy method. ─
#> ℹ Expected Binomial peak(s) for these calls (1 and 2 copies): 0.235449735449735 and 0.470899470899471
#> ℹ Mixing pre/ post aneuploidy: 0.09 and 0.91
#> ℹ Not assignamble area: [0.290780141843972; 0.368794326241135]
print(x)
#> ── [ CNAqc ] MySample 12963 mutations in 267 segments (267 clonal, 0 subclonal).
#>
#> ── Clonal CNAs
#>
#> 2:2 [n = 7478, L = 1483 Mb] ■■■■■■■■■■■■■■■■■■■■■■■■■■■ { CTCF }
#> 4:2 [n = 1893, L = 331 Mb] ■■■■■■■
#> 3:2 [n = 1625, L = 357 Mb] ■■■■■■
#> 2:1 [n = 1563, L = 420 Mb] ■■■■■■ { TTN }
#> 3:0 [n = 312, L = 137 Mb] ■
#> 2:0 [n = 81, L = 39 Mb] { TP53 }
#> 16:2 [n = 4, L = 0 Mb]
#> 25:2 [n = 2, L = 1 Mb]
#> 3:1 [n = 2, L = 1 Mb]
#> 106:1 [n = 1, L = 0 Mb]
#>
#> ℹ Sample Purity: 89% ~ Ploidy: 4.
#>
#> ℹ There are 3 annotated driver(s) mapped to clonal CNAs.
#> chr from to ref alt DP NV VAF driver_label is_driver
#> chr2 179431633 179431634 C T 117 77 0.6581197 TTN TRUE
#> chr16 67646006 67646007 C T 120 54 0.4500000 CTCF TRUE
#> chr17 7577106 7577107 G C 84 78 0.9285714 TP53 TRUE
#>
#> ✔ Cancer Cell Fraction (CCF) data available for karyotypes:2:1, 2:0, and 2:2.
#> ✔ PASS CCF via ENTROPY.
#> ✔ PASS CCF via ENTROPY.
#> ✔ PASS CCF via ENTROPY.
# Extract the values with these other functions
CCF(x)
#> # A tibble: 9,122 × 19
#> chr from to ref alt FILTER DP NV VAF ANNOVAR_FUNCTION
#> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 chr2 357969 357970 C A PASS 104 58 0.558 intergenic
#> 2 chr2 909304 909305 A G PASS 15 7 0.467 ncRNA_intronic
#> 3 chr2 1035751 1035752 C T PASS 93 57 0.613 intronic
#> 4 chr2 1326719 1326720 A T PASS 104 55 0.529 intronic
#> 5 chr2 1515962 1515963 C T PASS 90 50 0.556 intronic
#> 6 chr2 2198361 2198362 G T PASS 125 42 0.336 intronic
#> 7 chr2 2898536 2898537 C T PASS 109 58 0.532 downstream
#> 8 chr2 3125481 3125482 A G PASS 134 35 0.261 ncRNA_intronic
#> 9 chr2 3832360 3832361 A T PASS 120 68 0.567 intergenic
#> 10 chr2 3878408 3878409 T A PASS 126 10 0.0794 intergenic
#> # ℹ 9,112 more rows
#> # ℹ 9 more variables: GENE <chr>, is_driver <lgl>, driver_label <chr>,
#> # type <chr>, karyotype <chr>, segment_id <chr>, mutation_multiplicity <dbl>,
#> # CCF <dbl>, cna <chr>
plot_CCF(x)
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
#> Warning: Removed 22 rows containing non-finite outside the scale range (`stat_bin()`).
#> Warning: Removed 5 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
#> Warning: Removed 157 rows containing non-finite outside the scale range (`stat_bin()`).
#> Warning: Removed 5 rows containing missing values or values outside the scale range
#> (`geom_bar()`).