Skip to contents
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(INCOMMON)
#> Warning: replacing previous import 'cli::num_ansi_colors' by
#> 'crayon::num_ansi_colors' when loading 'INCOMMON'

We work with the example INCOMMON data.

A Beta-Binomial mixture model for classification

INCOMMON implements a maximum a posteriori classifier to infer the copy number (\(n_A\) for the major allele, \(n_B\) for the minor) and multiplicity \(m\) of mutations from read-count data.

The classifier is based on a Beta-Binomial mixture model, in which the number of reads with a variant (\(NV\)) is the number of events and the sequencing depth \(DP\) is the total number of trials.

A mutation on a genomic site of ploidy \(p = n_A+n_B\), with multiplicity \(m\leq \max(n_A,n_B)\), in a sample of purity \(\pi\) has an expected VAF of \[ \mathbb{E}[\text{VAF}] = \frac{m\pi}{p\pi + 2\left(1-\pi\right)} \]

In the read counting process this represents the event probability. Therefore, based on mutation data

  • \(NV\) reads with the variant at the locus;
  • \(DP\) coverage at the locus;
  • sample purity \(\pi\);

the likelihood of observing \(NV\) reads with the variant is \[ P(X = NV | \theta_\eta,\rho, DP) = \text{Beta-Binonmial}\left(NV \;\large\mid\;\normalsize \theta_\eta, \rho, DP\right) \]

Setting \(\rho = 0\) corresponds to using a pure Binomial model with no model of the sequencer overdispersion.

Example

In the following example, the classification task is run using the default setting \(\rho = 0.01\).

The input must be prepared using function init:

input = init(mutations = example_data$data,
             sample = example_data$sample,
             purity = example_data$purity,
             tumor_type = example_data$tumor_type,
             gene_roles = cancer_gene_census)
print(input)
#> ── [ INCOMMON ]  Sample P-0002081-T01-IM3 (LUAD), with purity 0.6 ──────────────
#> # A tibble: 4 × 10
#>   chr       from       to ref   alt   gene       NV    DP   VAF gene_role
#>   <chr>    <dbl>    <dbl> <chr> <chr> <chr>   <int> <int> <dbl> <chr>    
#> 1 chr12 25398285 25398285 C     A     KRAS      378   743 0.509 oncogene 
#> 2 chr17  7577139  7577139 G     A     TP53      116   246 0.472 TSG      
#> 3 chr19  1221293  1221293 C     A     STK11     122   260 0.469 TSG      
#> 4 chr19 11141472 11141473 -     C     SMARCA4   133   271 0.491 TSG

This is a lung adenocarcinoma (LUAD) sample “P-0002081-T01-IM3” from the MSK-MetTropsim dataset. The sample has purity \(\pi = 0.6\) and contains mutations called on genes KRAS, TP53, STK11 and SMARCA4. A column gene_role is extracted from the COSMIC Cancer Gene Census (default) catalogue and attached to the input data according to gene names. Other sources for gene roles can be input by the user through option gene_roles of function init.

INCOMMON classification of a sample can be performed using function classify. Here we do not use any prior knowledge and no cut-off on the classification entropy.

out = classify(
  x = input,
  priors = NULL,
  entropy_cutoff = NULL,
  rho = 0.01
)
#> 
#> ── INCOMMON inference of copy number and mutation multiplicity for sample P-0002
#>  Performing classification
#>  Loading CNAqc, 'Copy Number Alteration quality check'. Support : <https://caravagn.github.io/CNAqc/>
print(out)
#> ── [ INCOMMON ]  Sample P-0002081-T01-IM3 (LUAD), with purity 0.6 ──────────────
#> ── [ INCOMMON ]  Classified mutations using Beta-binomial model with overdispers
#> # A tibble: 4 × 15
#>   chr      from     to ref   alt   gene     NV    DP   VAF gene_role id    label
#>   <chr>   <dbl>  <dbl> <chr> <chr> <chr> <int> <int> <dbl> <chr>     <chr> <chr>
#> 1 chr12  2.54e7 2.54e7 C     A     KRAS    378   743 0.509 oncogene  chr1… 3N (…
#> 2 chr17  7.58e6 7.58e6 G     A     TP53    116   246 0.472 TSG       chr1… 3N (…
#> 3 chr19  1.22e6 1.22e6 C     A     STK11   122   260 0.469 TSG       chr1… 3N (…
#> 4 chr19  1.11e7 1.11e7 -     C     SMAR…   133   271 0.491 TSG       chr1… 3N (…
#> # ℹ 3 more variables: state <chr>, posterior <dbl>, entropy <dbl>

The results of the classification can be visualized using the internal plotting function. Here is an example:

plot_classification(out)
#> [[1]]

#> 
#> [[2]]

#> 
#> [[3]]

#> 
#> [[4]]

All mutations are classified as with amplification. However, visual inspection of the plots suggest that only the mutation on KRAS has low entropy \(H(x) \leq 0.2\). In effect, it is the only mutation with high sequencing depth \(DP\geq500\).

Given the lower depth, the number of reads with the variant \(NV\) (vertical line) for the other mutations is compatible with multiple copy number configurations, which are thus assigned similar posterior probabilities.

Using priors and the entropy cut-off.

The classification can be made easier by injecting prior knowledge into the model. In function classify, the user can provide a table as argument prior. INCOMMON provides an internal table of prior distributions obtained from PCAWG whole-genomes:

pcawg_priors
#> # A tibble: 4,466 × 4
#>    gene  tumor_type label                 p
#>    <chr> <chr>      <chr>             <dbl>
#>  1 ALB   HCC        1N (Mutated: 1N) 0.151 
#>  2 ALB   HCC        2N (Mutated: 1N) 0.593 
#>  3 ALB   HCC        2N (Mutated: 2N) 0.0474
#>  4 ALB   HCC        3N (Mutated: 1N) 0.0727
#>  5 ALB   HCC        3N (Mutated: 2N) 0.0485
#>  6 ALB   HCC        4N (Mutated: 1N) 0.0198
#>  7 ALB   HCC        4N (Mutated: 2N) 0.0683
#>  8 APOB  HCC        1N (Mutated: 1N) 0.151 
#>  9 APOB  HCC        2N (Mutated: 1N) 0.593 
#> 10 APOB  HCC        2N (Mutated: 2N) 0.0474
#> # ℹ 4,456 more rows

We know classify mutations using this priors. Since we are biasing the classification with prior knowledge, it is now reasonable also to make the classification stricter by using a cut-off on the entropy. The default value for entropy_cutoff is \(0.20\). Mutations with classification entropy above this cutoff will be flagged as Tier-2.

out = classify(
  x = input,
  priors = pcawg_priors,
  entropy_cutoff = 0.2,
  rho = 0.01
)
#> 
#> ── INCOMMON inference of copy number and mutation multiplicity for sample P-0002
#>  Performing classification
#> → No LUAD-specific prior probability specified for KRAS
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for TP53
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for STK11
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for SMARCA4
#> → Using a pan-cancer prior
print(out)
#> ── [ INCOMMON ]  Sample P-0002081-T01-IM3 (LUAD), with purity 0.6 ──────────────
#> ── [ INCOMMON ]  Classified mutations using Beta-binomial model with overdispers
#> # A tibble: 4 × 15
#>   chr      from     to ref   alt   gene     NV    DP   VAF gene_role id    label
#>   <chr>   <dbl>  <dbl> <chr> <chr> <chr> <int> <int> <dbl> <chr>     <chr> <chr>
#> 1 chr12  2.54e7 2.54e7 C     A     KRAS    378   743 0.509 oncogene  chr1… 3N (…
#> 2 chr17  7.58e6 7.58e6 G     A     TP53    116   246 0.472 TSG       chr1… 1N (…
#> 3 chr19  1.22e6 1.22e6 C     A     STK11   122   260 0.469 TSG       chr1… 1N (…
#> 4 chr19  1.11e7 1.11e7 -     C     SMAR…   133   271 0.491 TSG       chr1… 1N (…
#> # ℹ 3 more variables: state <chr>, posterior <dbl>, entropy <dbl>

Tumour suppressor genes TP53, STK11 and SMARCA4 have high frequency of LOH in PCAWG, and low frequency of amplification. Injecting this prior knowledge into the model pushes the classification outcome towards this states. The effect of using the prior is evident from the classification plots:

plot_classification(out)
#> [[1]]

#> 
#> [[2]]

#> 
#> [[3]]

#> 
#> [[4]]