library(INCOMMON)
#> Warning: replacing previous import 'cli::num_ansi_colors' by
#> 'crayon::num_ansi_colors' when loading 'INCOMMON'
INCOMMON is a tool for the INference of COpy number and Mutation Multiplicity in ONcology. INCOMMON infers the copy number and multiplicity of somatic mutations from tumor-only read count data, and can classify mutations from large datasets efficiently. Mutations are classified as either Tier-1 (present in 100% of cells) or Tier-2 (subclonal, or with high ploidy and low multiplicity). Tier-1 mutations are further classified as without copy-number alterations (heterozygous mutant diploid, HMD), with loss of heterozygosity (LOH), with copy-neutral LOH (CNLOH), or with amplification (AM).
In addition, INCOMMON offers a genome interpretation framework, in which the full inactivation of tumor suppressor genes (TSG) through mutations with LOH, and the enhanced activation of oncogenes through mutations with amplification can be detected. These events can then be used to perform augmented analysis of survival and metastatic patterns.
The INCOMMON model is designed to work with high-coverage sequencing data such as targeted panels but, in principle, it can be used with any sequencing assay. INCOMMON is also helpful for analysing sequencing data from tumor-only assays, in particular when alignment files (fastq, sam/bam, etc.) are not available. However, if higher-resolution whole-exome or whole-genome assays are accessible, specific deconvolution methodologies should be used instead.
Mutation copy number and multiplicity inference
INCOMMON assesses, for every mutation, copy number configurations identified by triples $(n_A, n_B, m)$, where $n_A$ is the major allele copy number, $n_B$ is the minor allele copy number, and the mutation is present in $m$ copies. The supported configurations are:
- Loss of heterozygosity in monosomy LOH: $(1,0,1)$
- Copy-neutral loss of heterozygosity CNLOH: $(2,0,2)$
- Amplification AM: $(2,1,2)$ or $(2,2,2)$
- Heterozygous mutant diploid HMD: $(1,1,1)$
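The configurations listed above can be written down as a small lookup table. This is only an illustrative sketch: it assumes each configuration is a triple (major allele copies, minor allele copies, multiplicity), with values derived from the class definitions (monosomy, copy-neutral LOH, trisomy or tetrasomy, diploid heterozygous). The column names are hypothetical and not part of the INCOMMON API.

```r
# Copy-number configurations as (major, minor, multiplicity) triples.
# ASSUMPTION: values follow from the class definitions; column names are
# illustrative only and are not INCOMMON objects.
configs <- data.frame(
  class        = c("LOH", "CNLOH", "AM", "AM", "HMD"),
  major        = c(1, 2, 2, 2, 1),
  minor        = c(0, 0, 1, 2, 1),
  multiplicity = c(1, 2, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Total copy number (ploidy) at the locus is major + minor.
configs$ploidy <- configs$major + configs$minor

print(configs)
```

Ploidy and multiplicity are the only two quantities of each configuration that enter the expected VAF, which is why the table carries both.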
INCOMMON implements a classifier based on maximum a posteriori estimation to infer the copy number and multiplicity of mutations from read-count data.
The classifier is based on a Beta-Binomial mixture model, in which the number of reads with a variant ($n$) is the number of events and the sequencing depth ($N$) is the total number of trials.
A mutation on a genomic site of ploidy $p$, with multiplicity $m$, in a sample of purity $\pi$ has an expected VAF of
$$ \theta_{m,p}\left(\pi\right) = \frac{m\pi}{p\pi + 2\left(1-\pi\right)} $$
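As a worked illustration, the expected VAF can be computed with a few lines of R. The sketch assumes the standard purity-adjusted form, in which mutant copies in tumor cells are divided by all copies at the locus (tumor cells at ploidy `p`, normal cells diploid). The function name `expected_vaf` is hypothetical and not an INCOMMON function.

```r
# Expected VAF of a mutation with multiplicity m on a locus of ploidy p,
# in a sample of the given purity.
# ASSUMPTION: normal (non-tumor) cells are diploid at the locus.
expected_vaf <- function(m, p, purity) {
  (m * purity) / (p * purity + 2 * (1 - purity))
}

# Heterozygous diploid mutation (m = 1, p = 2): purity / 2
expected_vaf(m = 1, p = 2, purity = 0.6)

# Copy-neutral LOH (m = 2, p = 2): equal to the purity itself
expected_vaf(m = 2, p = 2, purity = 0.6)

# LOH in monosomy (m = 1, p = 1)
expected_vaf(m = 1, p = 1, purity = 0.6)
```

At the same purity, different configurations predict different VAFs, which is what makes classification from read counts possible.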
Tumour sample purity can be estimated by copy number assessment, pathology assessment, or in general any other bioinformatics approach outside INCOMMON. INCOMMON assumes that the input purity is correct.
In the read counting process, the expected VAF $\theta_{m,p}(\pi)$ represents the event probability. Therefore, the likelihood of observing:
- $n$ reads with the variant at the locus
- coverage $N$ at the locus
- sample purity $\pi$
given ploidy $p$ and multiplicity $m$ is
$$ P(n \mid N, \theta_{m,p}\left(\pi\right),\rho) = \text{Beta-Binomial}\left(n \mid N,\theta_{m,p}\left(\pi\right), \rho\right) $$
where $\rho$ models the overdispersion of the sequencer.
Setting $\rho = 0$ corresponds to using a pure Binomial model with no model of the sequencer overdispersion.
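The likelihood and a maximum a posteriori choice over the Tier-1 configurations can be sketched in base R. This is a simplified illustration, not INCOMMON's implementation. It assumes that the overdispersion maps to Beta shape parameters as `a = theta * (1 - rho) / rho` and `b = (1 - theta) * (1 - rho) / rho`, that the prior is flat unless one is supplied, and that Tier-2 is ignored. All function names are hypothetical.

```r
expected_vaf <- function(m, p, purity) m * purity / (p * purity + 2 * (1 - purity))

# Log-likelihood of n variant reads out of N, with mean theta and
# overdispersion rho.
# ASSUMPTION: rho in [0, 1) parameterises the Beta shapes as below;
# theta must lie strictly between 0 and 1 (i.e. purity < 1 for LOH/CNLOH).
log_lik <- function(n, N, theta, rho) {
  if (rho == 0) return(dbinom(n, N, theta, log = TRUE))  # pure Binomial
  a <- theta * (1 - rho) / rho
  b <- (1 - theta) * (1 - rho) / rho
  lchoose(N, n) + lbeta(n + a, N - n + b) - lbeta(a, b)
}

# MAP classification over the Tier-1 configurations (flat prior by default).
classify_map <- function(n, N, purity, rho = 0.01, prior = NULL) {
  configs <- data.frame(
    class = c("LOH", "CNLOH", "AM", "AM", "HMD"),
    p     = c(1, 2, 3, 4, 2),   # ploidy
    m     = c(1, 2, 2, 2, 1),   # multiplicity
    stringsAsFactors = FALSE
  )
  if (is.null(prior)) prior <- rep(1 / nrow(configs), nrow(configs))
  theta    <- expected_vaf(configs$m, configs$p, purity)
  log_post <- vapply(theta, function(th) log_lik(n, N, th, rho), numeric(1)) +
    log(prior)
  post <- exp(log_post - max(log_post))          # stabilise before normalising
  configs$posterior <- post / sum(post)
  configs[which.max(configs$posterior), ]
}

# 150 variant reads out of 1000 at purity 0.3
classify_map(n = 150, N = 1000, purity = 0.3)
```

Working on the log scale and subtracting the maximum before exponentiating avoids numerical underflow at high sequencing depth.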
Input format
The input required for INCOMMON classification consists of the following data tables:
- `genomic_data`: a data table of annotated mutations with columns indicating, for each mutation, the sample name `sample`, mutant chromosome `chr`, start position `from`, end position `to`, reference allele `ref`, alternative allele `alt`, sequencing depth `DP`, number of reads with variant `NV`, variant allele frequency `VAF`, mutant gene name `gene` (Hugo Symbol), and possibly the protein sequence of the variant in HGVS recommended format (preferably 1-letter amino-acid code, `HGVSp_Short`).
- `clinical_data`: a data table of clinical data with matched sample names `sample` and purity `purity` (required), and clinical features like tumor type (ONCOTREE code) `tumor_type` (required for tumor specific priors), survival data such as status `OS_STATUS` and time `OS_MONTHS` (required for survival analysis), metastasis data such as `SAMPLE_TYPE` (Primary or Metastasis), number of metastases `MET_COUNT` (required for metastatic propensity analysis) and metastatic site `METASTATIC_SITE` (required for metastatic tropism analysis), plus any other useful covariate.
- `gene_roles`: a data table reporting gene names `gene` and associated roles `gene_role` (“oncogene” or “TSG”). INCOMMON provides a set of gene roles extracted from the COSMIC Cancer Gene Census (v98) as default.
The input for downstream analysis is checked and cast in the expected format through the function `init`.
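For readers preparing their own data, the three tables might be assembled along these lines. This is a minimal sketch with made-up toy values: sample names, coordinates and read counts are synthetic and serve only to show the expected columns.

```r
# Toy mutation table: all values are synthetic placeholders.
genomic_data <- data.frame(
  sample = c("S1", "S1", "S2"),
  chr    = c("chr17", "chr12", "chr17"),
  from   = c(1000, 2000, 1500),
  to     = c(1000, 2000, 1500),
  ref    = c("G", "C", "C"),
  alt    = c("A", "T", "A"),
  DP     = c(800L, 500L, 400L),   # sequencing depth
  NV     = c(120L, 150L, 100L),   # reads carrying the variant
  gene   = c("TP53", "KRAS", "TP53"),
  stringsAsFactors = FALSE
)
genomic_data$VAF <- genomic_data$NV / genomic_data$DP

# Toy clinical table: sample and purity are the required columns.
clinical_data <- data.frame(
  sample     = c("S1", "S2"),
  purity     = c(0.6, 0.3),
  tumor_type = c("LUAD", "CHOL"),
  stringsAsFactors = FALSE
)

# Toy gene roles table.
gene_roles <- data.frame(
  gene      = c("TP53", "KRAS"),
  gene_role = c("TSG", "oncogene"),
  stringsAsFactors = FALSE
)

# Simple sanity checks before handing the tables over.
required <- c("sample", "chr", "from", "to", "ref", "alt", "DP", "NV", "VAF", "gene")
stopifnot(all(required %in% names(genomic_data)))
stopifnot(all(genomic_data$sample %in% clinical_data$sample))
```

Tables shaped like this would then be passed to `init` through its `genomic_data`, `clinical_data` and `gene_roles` arguments.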
INCOMMON provides data from the publicly available MSK-MetTropism cohort in the correct format. The following example shows how this input is pre-processed by INCOMMON:
data(MSK_genomic_data)
data(MSK_clinical_data)
data(cancer_gene_census)
x = init(
  genomic_data = MSK_genomic_data,
  clinical_data = MSK_clinical_data,
  gene_roles = cancer_gene_census
)
#> ── INCOMMON - Inference of copy number and mutation multiplicity in oncology ───
#>
#> ── Genomic data ──
#>
#> ✔ Found 25659 samples, with 224939 mutations in 491 genes
#> ! No read counts found for 1393 mutations in 1393 samples
#> ! Gene name not provided for 1393 mutations
#> ! 201 genes could not be assigned a role (TSG or oncogene)
#>
#> ── Clinical data ──
#>
#> ℹ Provided clinical features:
#> ✔ sample (required for classification)
#> ✔ purity (required for classification)
#> ✔ tumor_type
#> ✔ OS_MONTHS
#> ✔ OS_STATUS
#> ✔ SAMPLE_TYPE
#> ✔ MET_COUNT
#> ✔ METASTATIC_SITE
#> ✔ MET_SITE_COUNT
#> ✔ PRIMARY_SITE
#> ✔ SUBTYPE_ABBREVIATION
#> ✔ GENE_PANEL
#> ✔ SEX
#> ✔ TMB_NONSYNONYMOUS
#> ✔ FGA
#> ✔ AGE_AT_SEQUENCING
#> ✔ RACE
#> ✔ Found 25257 matching samples
#> ✖ Found 513 unmatched samples
print(x)
#> ── [ INCOMMON ] 175054 PASS mutations across 24018 samples, with 290 mutant gen
#> ℹ Average sample purity: 0.4
#> ℹ Average sequencing depth: 649
#> # A tibble: 175,054 × 27
#> sample tumor_type purity chr from to ref alt DP NV VAF
#> <chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <int> <int> <dbl>
#> 1 P-0028912 CHOL 0.3 chr17 7.58e6 7.58e6 G A 837 133 0.159
#> 2 P-0028912 CHOL 0.3 chrX 5.32e7 5.32e7 G A 832 85 0.102
#> 3 P-0003698 BLCA 0.2 chr17 7.58e6 7.58e6 C A 437 109 0.249
#> 4 P-0003698 BLCA 0.2 chr5 1.49e8 1.49e8 C T 360 36 0.1
#> 5 P-0003698 BLCA 0.2 chr13 3.29e7 3.29e7 G C 1027 162 0.158
#> 6 P-0003698 BLCA 0.2 chr13 3.29e7 3.29e7 G C 1021 182 0.178
#> 7 P-0003698 BLCA 0.2 chr19 1.11e7 1.11e7 G T 573 98 0.171
#> 8 P-0003698 BLCA 0.2 chr22 4.15e7 4.15e7 G A 416 45 0.108
#> 9 P-0003698 BLCA 0.2 chrX 4.49e7 4.49e7 C T 730 194 0.266
#> 10 P-0003823 BLCA 0.6 chr5 1.30e6 1.30e6 G A 218 138 0.633
#> # ℹ 175,044 more rows
#> # ℹ 16 more variables: gene <chr>, gene_role <chr>, OS_MONTHS <dbl>,
#> # OS_STATUS <dbl>, SAMPLE_TYPE <chr>, MET_COUNT <dbl>, METASTATIC_SITE <chr>,
#> # MET_SITE_COUNT <dbl>, PRIMARY_SITE <chr>, SUBTYPE_ABBREVIATION <chr>,
#> # GENE_PANEL <chr>, SEX <chr>, TMB_NONSYNONYMOUS <dbl>, FGA <dbl>,
#> # AGE_AT_SEQUENCING <dbl>, RACE <chr>
Genome interpreter
Downstream of INCOMMON classification, the mutant genome can be interpreted in terms of full inactivation of tumor suppressor genes (TSG) through mutations with LOH, and enhanced activation of oncogenes through mutations with amplification.
For TSGs, full inactivation states include all the copy number configurations with loss of the wild-type (WT) allele (multiplicity equal to ploidy, $m = p$):
- Mutations with LOH
- Mutations with CNLOH
For oncogenes, enhanced activation states include all the copy number configurations with amplification of the mutant allele (multiplicity $m \geq 2$):
- Mutations with AM (trisomy or tetrasomy)
- Mutations with CNLOH
Although interpreting CNLOH as an oncogene-activating event might sound unusual, it rests on the intuition that, for an oncogene, the presence of multiple mutant copies is relevant, whereas the absence of the WT allele is not.
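The interpretation rules above amount to a small mapping from gene role and inferred class to a functional state. The sketch below is illustrative only: the function name `interpret_state` is hypothetical, and it assumes that any class outside the listed ones (for example HMD or Tier-2) falls into the "without" group.

```r
# Map (gene role, INCOMMON class) to an interpreted state.
# ASSUMPTION: classes not listed as inactivating/activating are "without".
interpret_state <- function(gene_role, class) {
  lost_wt   <- class %in% c("LOH", "CNLOH")  # WT allele lost (m = p)
  amplified <- class %in% c("AM", "CNLOH")   # multiple mutant copies
  ifelse(
    gene_role == "TSG",
    ifelse(lost_wt, "Mutant TSG with LOH", "Mutant TSG without LOH"),
    ifelse(
      gene_role == "oncogene",
      ifelse(amplified,
             "Mutant oncogene with amplification",
             "Mutant oncogene without amplification"),
      NA_character_                          # gene without an assigned role
    )
  )
}

interpret_state(
  gene_role = c("TSG", "TSG", "oncogene", "oncogene"),
  class     = c("CNLOH", "HMD", "CNLOH", "LOH")
)
```

Note that CNLOH lands in the "with" group for both roles, matching the two lists above.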
Survival analysis
If patients’ survival status and time are provided as features in the clinical data table `clinical_table`, survival analysis can be performed. Downstream of classification, INCOMMON can stratify patients based on the mutational and copy number state of a TSG or an oncogene of interest.
INCOMMON provides the following functions, dedicated to fitting survival data:
- `kaplan_meier_fit` uses the Kaplan-Meier estimator to fit survival data from patients stratified with respect to the status of a specific `tumor_type` and `gene` (Mutant TSG with/without LOH for suppressors, Mutant oncogene with/without amplification for oncogenes, with the WT group as reference).
- `cox_fit` uses a Cox proportional hazards model to fit survival data. In addition to the arguments `tumor_type` and `gene`, it accepts other `covariates`, provided they are present in the `clinical_table`.
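To make the two estimators concrete, the sketch below fits them directly with the `survival` package on simulated data. It does not call `kaplan_meier_fit` or `cox_fit`, whose full signatures are not shown here; it only illustrates the kind of stratified fit they describe. The toy data frame and its `group` column are hypothetical.

```r
library(survival)

set.seed(1)
groups <- c("WT", "Mutant TSG without LOH", "Mutant TSG with LOH")

# Simulated cohort: 20 patients per group, with WT as the reference level.
toy <- data.frame(
  OS_MONTHS = rexp(60, rate = rep(c(0.05, 0.08, 0.12), each = 20)),
  OS_STATUS = rbinom(60, 1, 0.7),                   # 1 = event observed
  group     = factor(rep(groups, each = 20), levels = groups),
  AGE_AT_SEQUENCING = round(rnorm(60, mean = 60, sd = 10))
)

# Kaplan-Meier curves, one per group.
km <- survfit(Surv(OS_MONTHS, OS_STATUS) ~ group, data = toy)
print(km)

# Cox proportional hazards model with an additional covariate.
cox <- coxph(Surv(OS_MONTHS, OS_STATUS) ~ group + AGE_AT_SEQUENCING, data = toy)
exp(coef(cox))   # hazard ratios relative to the WT group
```

Setting WT as the first factor level makes every hazard ratio read as a comparison against the wild-type group.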
Metastatic patterns
If information about metastatisation is provided in the clinical data table `clinical_table`, such as the type of the sample (primary tumor or metastasis), whether patients are metastatic or not, and sites of metastatisation for primary tumors, analysis of metastatic propensity and tropism based on INCOMMON classification and genome interpretation can be performed.
INCOMMON provides the following functions for analysis of metastases:
- `met_propensity`: uses a logistic regression test to compute the odds ratio (OR) of metastatisation between patients identified by the two mutational statuses of a `gene` (Mutant TSG with versus without LOH and Mutant oncogene with versus without amplification) for specific types of primary tumors `tumor_type`.
- `met_tropism`: uses a logistic regression test to compute the odds ratio (OR) to metastatise to a specific site `METASTATIC_SITE` between patients identified by the two mutational statuses of a `gene` (Mutant TSG with versus without LOH and Mutant oncogene with versus without amplification) for specific types of primary tumors `tumor_type`.
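The odds ratio from a logistic regression can be illustrated with base R on simulated data. This sketch does not call `met_propensity` or `met_tropism`; it only shows the underlying test under simple assumptions. The binary outcome column `metastatic` and the `state` column are hypothetical names.

```r
set.seed(42)
n_per_group <- 100
states <- c("Mutant TSG without LOH", "Mutant TSG with LOH")

# Simulated primary tumors of one tumor type, split by mutational status.
toy <- data.frame(
  state = factor(rep(states, each = n_per_group), levels = states)
)

# Simulated outcome: metastasis is more frequent in the "with LOH" group.
p_met <- ifelse(toy$state == "Mutant TSG with LOH", 0.6, 0.4)
toy$metastatic <- rbinom(nrow(toy), 1, p_met)

# Logistic regression of metastatic status on mutational status.
fit <- glm(metastatic ~ state, data = toy, family = binomial())

# Odds ratio (with vs without LOH) and its Wald 95% confidence interval.
odds_ratio <- unname(exp(coef(fit))[2])
ci <- unname(exp(confint.default(fit))[2, ])

odds_ratio
ci
```

For tropism, the same model would apply with the outcome replaced by an indicator of metastasis to one specific site.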