INCOMMON • INCOMMON

library(INCOMMON)
#> Warning: replacing previous import 'cli::num_ansi_colors' by
#> 'crayon::num_ansi_colors' when loading 'INCOMMON'

INCOMMON is a tool for the INference of COpy number and Mutation Multiplicity in ONcology. INCOMMON infers the copy number and multiplicity of somatic mutations from tumor-only read count data, and can be applied to classify mutations from large-size datasets in an efficient and fast way. Mutations are classified as either Tier-1 (present in 100% cells) without copy-number alterations (heterozygous mutant diploid HMD), with loss of heterozygosity (LOH), copy-neutral LOH (CNLOH), amplification (AM), or Tier-2 (subclonal or with high ploidy and low multiplicity).

In addition, INCOMMON offers a genome interpretation framework, in which the full inactivation of tumor suppressor genes (TSG) through mutations with LOH, and the enhanced activation of oncogenes through mutations with amplification can be detected. These events can then be used to perform augmented analysis of survival and metastatic patterns.

The INCOMMON model is designed to work with high coverage sequencing data such as targeted panels but, in principle, it can be used with any sequencing assay. INCOMMON is helpful also to analyse sequencing data from tumor-only assays, in paricular when alignment files (fastq, sam/bam, etc.) are not availble. However, if one can access higher-resolution whole-exome or whole-genome assays, specific deconvolution methodologies should be used.

Mutation copy number and multiplicity inference

INCOMMON assesses, for every mutation, copy number configurations identified by triples $(n_A,n_B,m)$ where $n_A$ is the major allele copy number, $n_B$ is the minor and the mutation is present in $m$ copies. The supported configurations are:

Loss of heterozygosity in monosomy LOH: $(n_A=1,n_B=0, m=1)$
Copy-neutral loss of heterozygosity CNLOH: $(n_A=2, n_B=0 , m=2)$
Amplification AM: $(n_A=3, n_B=2 , m=2)$ or $(n_A=4, n_B=2 , m=2)$
Heterozygous mutant diploid HMD: $(n_A=2, n_B=1 , m=1)$

INCOMMON implements a classifier based on maximum a posteriori estimation to infer the copy number and multiplicity of mutations from read-count data.

The classifier is based on a Beta-Binomial mixture model, in which the number of reads with a variant ( $n$ ) is the number of events and the sequencing depth $N$ is the total number of trials.

A mutation on a genomic site of ploidy $p = n_A+n_B$ , with multiplicity $m\leq \max(n_A,n_B)$ , in a sample of purity $\pi$ has an expected VAF of $\theta_{m,p}\left(\pi\right) = \frac{m\pi}{p\pi + 2\left(1-\pi\right)}$

Tumour sample purity $\pi$ can be estimated by copy number assessment, pathology assessment, or in general any other bioinformatics approach outside INCOMMON. INCOMMON assumes that the input purity is correct.

In the read counting process this represents the event probability. Therefore, the likelihood of observing:

$n$ reads with the variant at the locus
$N$ coverage at the locus
sample purity $\pi$

given ploidy $p$ and multiplicity $m$ is

$$ P(n | N, \theta_{m,p}\left(\pi\right),\rho) = \text{Beta-Binonmial}\left(n \;\large\mid\;\normalsize N,\theta_{m,p}\left(\pi\right), \rho\right) $$

where $\rho$ models the overdispersion of the sequencer.

Setting $\rho = 0$ corresponds to using a pure Binomial model with no model of the sequencer overdispersion.

Input format

The input required for INCOMMMON classification consists of two data tables:

genomic_data: a data table of annotated mutations with columns indicating, for each mutation, the sample name sample, mutant chromosome chr, start position from, end position to, reference allele ref, alternative allele alt, sequencing depth DP, number of reads with variant NV, variant allele frequency VAF, mutant gene name gene Hugo Symbol, and possibly the protein sequence of the variant in HGVS recommended format (preferably 1-letter amino-acid code HGVSp_Short).
clinical_data: a data table of clinical data with matched sample names sample and purity purity (required), and clinical features like tumor type (ONCOTREE code) tumor_type (required for tumor specific priors), survival data such as OS_STATUS and time OS_MONTHS (required for survival analysis), metastasis data such as SAMPLE_TYPE (Primary or Metastasis), number of metastases MET_COUNT (required for metastatic propensity analysis) and metastatic site METASTATIC_SITE (required for metastatic tropism analysis), plus any other useful covariate.
gene_roles: a data table reporting gene names gene and associated roles gene_role (“oncogene” or “TSG”). INCOMMON provides a set of gene roles extracted from the COSMIC Cancer Gene Census (v98) as default.

The input for downstream analysis is checked and cast in the expected format through the function init.

INCOMMON provides data from the publicly available MSK-MetTropism cohort in the correct format. The following example shows how this input is pre-processed by INCOMMON:

data(MSK_genomic_data)
data(MSK_clinical_data)
data(cancer_gene_census)

x = init(
  genomic_data = MSK_genomic_data,
  clinical_data = MSK_clinical_data,
  gene_roles = cancer_gene_census
)
#> ── INCOMMON - Inference of copy number and mutation multiplicity in oncology ───
#> 
#> ── Genomic data ──
#> 
#> ✔ Found 25659 samples, with 224939 mutations in 491 genes
#> ! No read counts found for 1393 mutations in 1393 samples
#> ! Gene name not provided for 1393 mutations
#> ! 201 genes could not be assigned a role (TSG or oncogene)
#> 
#> ── Clinical data ──
#> 
#> ℹ Provided clinical features:
#> ✔ sample (required for classification)
#> ✔ purity (required for classification)
#> ✔ tumor_type
#> ✔ OS_MONTHS
#> ✔ OS_STATUS
#> ✔ SAMPLE_TYPE
#> ✔ MET_COUNT
#> ✔ METASTATIC_SITE
#> ✔ MET_SITE_COUNT
#> ✔ PRIMARY_SITE
#> ✔ SUBTYPE_ABBREVIATION
#> ✔ GENE_PANEL
#> ✔ SEX
#> ✔ TMB_NONSYNONYMOUS
#> ✔ FGA
#> ✔ AGE_AT_SEQUENCING
#> ✔ RACE
#> ✔ Found 25257 matching samples
#> ✖ Found 513 unmatched samples

print(x)
#> ── [ INCOMMON ]  175054 PASS mutations across 24018 samples, with 290 mutant gen
#> ℹ Average sample purity: 0.4
#> ℹ Average sequencing depth: 649
#> # A tibble: 175,054 × 27
#>    sample    tumor_type purity chr     from     to ref   alt      DP    NV   VAF
#>    <chr>     <chr>       <dbl> <chr>  <dbl>  <dbl> <chr> <chr> <int> <int> <dbl>
#>  1 P-0028912 CHOL          0.3 chr17 7.58e6 7.58e6 G     A       837   133 0.159
#>  2 P-0028912 CHOL          0.3 chrX  5.32e7 5.32e7 G     A       832    85 0.102
#>  3 P-0003698 BLCA          0.2 chr17 7.58e6 7.58e6 C     A       437   109 0.249
#>  4 P-0003698 BLCA          0.2 chr5  1.49e8 1.49e8 C     T       360    36 0.1  
#>  5 P-0003698 BLCA          0.2 chr13 3.29e7 3.29e7 G     C      1027   162 0.158
#>  6 P-0003698 BLCA          0.2 chr13 3.29e7 3.29e7 G     C      1021   182 0.178
#>  7 P-0003698 BLCA          0.2 chr19 1.11e7 1.11e7 G     T       573    98 0.171
#>  8 P-0003698 BLCA          0.2 chr22 4.15e7 4.15e7 G     A       416    45 0.108
#>  9 P-0003698 BLCA          0.2 chrX  4.49e7 4.49e7 C     T       730   194 0.266
#> 10 P-0003823 BLCA          0.6 chr5  1.30e6 1.30e6 G     A       218   138 0.633
#> # ℹ 175,044 more rows
#> # ℹ 16 more variables: gene <chr>, gene_role <chr>, OS_MONTHS <dbl>,
#> #   OS_STATUS <dbl>, SAMPLE_TYPE <chr>, MET_COUNT <dbl>, METASTATIC_SITE <chr>,
#> #   MET_SITE_COUNT <dbl>, PRIMARY_SITE <chr>, SUBTYPE_ABBREVIATION <chr>,
#> #   GENE_PANEL <chr>, SEX <chr>, TMB_NONSYNONYMOUS <dbl>, FGA <dbl>,
#> #   AGE_AT_SEQUENCING <dbl>, RACE <chr>

Genome interprter

Downstream of INCOMMON classification, the mutant genome can be interpreted in terms of full inactivation of tumor suppressor genes (TSG) through mutations with LOH, and enhanced activation of oncogenes through mutations with amplification.

For TSGs, full inactivation states include all the copy number configurations with loss of the wild-type (WT) allele (multiplicity equal to ploidy $m=p$ ):

Mutations with LOH
Mutations with CNLOH

For oncogenes, enhanced activation states include all the copy number configurations with amplification of the mutant allele (multiplicity $m = 2$ ):

Mutations with AM (trisomy or tetrasomy)
Mutations with CNLOH

Even if the intepretation of CNLOH as an oncogene activating event might sound unusual, it is based on the intuition that, for an oncogene, the presence of multiple mutant copies is relevant, whereas the absence of the WT is not.

Survival analysis

If patients’ survival status and time are provided as features in the clinical data table clinical_table, survival analysis can be performed. Downstream of classification, INCOMMON can stratify patients based on the mutational and copy number state of a TSG or an oncogene of interest.

INCOMMON provides the following functions, dedicated to fitting survival data:

kaplan_meier_fit uses the Kaplan-Meier estimator to fit survival data from patients stratified with respect to the status of a specific tumor_type and gene (Mutant TSG with/without LOH for suppressors, Mutant oncogene with/without amplification for oncogenes, with the WT group as reference)
cox_fit uses a Cox proportional hazard ratio model to fit survival data. In addition to arguments tumor_type and gene, it accepts other covariates, given they are provided in the clinical_table.

Metastatic patterns

If information about metastatisation is provided, such as type of the sample (primary tumor or metastasis), whether patients are metastatic or not, and sites of metastatisation for primary tumors, in the clinical data table clinical_table, analysis of metastatic propensity and tropism based on INCOMMON classification and genome interpretation can be performed.

INCOMMON provides the following functions for analysis of metastases:

met_propensity: uses a logistic regression test to compute the odds ratio (OR) of metastatisation between patients identified by the two mutational statuses of a gene (Mutant TSG with versus without LOH and Mutant oncogene with versus without amplification) for specific types of primary tumors tumor_type.
met_tropism: uses a logistic regression test to compute the odds ratio (OR) to metastatise to a specific site METASTATIC_SITE between patients identified by the two mutational statuses of a gene (Mutant TSG with versus without LOH and Mutant oncogene with versus without amplification) for specific types of primary tumors tumor_type.