library(INCOMMON)
#> Warning: replacing previous import 'cli::num_ansi_colors' by
#> 'crayon::num_ansi_colors' when loading 'INCOMMON'
INCOMMON is a tool for the INference of COpy number and Mutation Multiplicity in ONcology. INCOMMON infers the copy number and multiplicity of somatic mutations from tumour-only read count data, quickly and efficiently. For each mutation in a sample, INCOMMON computes a probability distribution over all possible combinations of total copy number and multiplicity, and chooses the one with maximum probability.
In addition, INCOMMON offers a genome interpretation framework, in which the tumour genome is classified based on the mutant dosage of oncogenes and tumour suppressor genes. Clinical outcome analysis (e.g. survival analysis and organotropism) based on such complex genotypes can be performed with functions integrated in the package.
The INCOMMON model is designed to work with high-coverage sequencing data such as targeted panels but, in principle, it can be used with any sequencing assay. INCOMMON is also helpful for analysing sequencing data from tumour-only assays, in particular when raw sequencing or alignment files (fastq, sam/bam, etc.) are not available. However, if higher-resolution whole-exome or whole-genome assays are accessible, specific deconvolution methodologies should be used.
The INCOMMON mutation copy-number caller
INCOMMON is a Bayesian model that can infer, for a tumour mutation, the total copy number at the mutant locus, and the mutation multiplicity (i.e., the number of DNA copies that harbour the mutation). This information provides the allele-specific configuration of the mutant locus.
INCOMMON takes read count data for mutations as input to build the joint likelihood
For every mutation, INCOMMON uses the number of reads with the alternative allele and the total number of reads (depth of sequencing). The model infers two sample-specific parameters, (i) the sample purity and (ii) the rate of reads per chromosome copy, and two mutation-specific parameters, (i) the tumour total copy number at the locus and (ii) the mutation multiplicity, which cannot exceed the total copy number.
The model adopted by INCOMMON relates copy number to coverage linearly, so the expected number of reads for a given number of chromosome copies is the rate of reads per chromosome copy multiplied by that number of copies.
INCOMMON links the number of reads to the multiplicity and total copy number, accounting for tumour/normal admixture
The sequencing depth follows a Poisson distribution with an expected number of reads defined by combining tumour and normal readouts. Given the depth, the number of mutant reads follows a Binomial distribution with success probability determined by the mixing of tumour and normal success rates.
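The read-count model described above can be sketched in a few lines of base R. This is a minimal illustration rather than INCOMMON's implementation: the function names and argument names (`expected_depth`, `expected_vaf`, `simulate_mutation`, `eta`, `purity`, `k`, `m`) are assumptions introduced here, and normal cells are assumed to be diploid and to carry no mutant copies.

```r
# Illustrative sketch only: not INCOMMON's implementation.
# Assumes normal cells are diploid (2 copies) and carry no mutant copies.

# Expected depth: reads per chromosome copy times the average number of
# copies at the locus across the tumour/normal mixture.
expected_depth <- function(eta, purity, k) {
  eta * (purity * k + 2 * (1 - purity))
}

# Expected fraction of reads carrying the variant: mutant copies over
# all copies at the locus in the mixture.
expected_vaf <- function(purity, k, m) {
  (purity * m) / (purity * k + 2 * (1 - purity))
}

# Simulate one mutation: Poisson depth, then Binomial mutant reads.
simulate_mutation <- function(eta, purity, k, m) {
  stopifnot(m >= 1, m <= k)
  DP <- rpois(1, lambda = expected_depth(eta, purity, k))
  NV <- rbinom(1, size = DP, prob = expected_vaf(purity, k, m))
  c(DP = DP, NV = NV)
}

set.seed(1)
sim <- simulate_mutation(eta = 300, purity = 0.6, k = 2, m = 1)
sim
```

With purity 0.6 and a single mutant copy at a two-copy locus (k = 2, m = 1), the expected depth under these assumptions is 600 reads and the expected fraction of mutant reads is 0.3.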
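INCOMMON uses Markov Chain Monte Carlo sampling (implemented in stan) to estimate a posterior distribution over the sample purity, the rate of reads per chromosome copy, the total copy number and the mutation multiplicity.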
To leverage the massive amount of public WGS data of human tumours and
gain precision with targeted assays, by default INCOMMON uses a biologically informed prior distribution for
copy number and multiplicity configurations from the PCAWG and HMF
cohorts.
To support orthogonal estimation of tumour purity (e.g. from histopathological evaluation) while remaining robust to potential error in the input estimate, INCOMMON centres a prior around the purity measurement provided with each sample. After posterior inference, INCOMMON uses posterior predictive checks to monitor the discrepancy between observed values and the inferred posterior distributions.
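As a simplified illustration of picking the most probable configuration, the sketch below evaluates the Poisson and Binomial likelihood on a grid of total copy number and multiplicity values. It is a stand-in for, not a reproduction of, the stan-based MCMC: purity and the rate of reads per copy are held fixed, the prior is flat instead of the PCAWG/HMF-informed prior, normal cells are assumed diploid, and `classify_grid` and its arguments are hypothetical names.

```r
# Illustrative grid evaluation; NOT the MCMC inference run by INCOMMON.
# Assumptions: purity and eta are known, normal cells are diploid,
# and the prior over (k, m) is flat.
classify_grid <- function(DP, NV, purity, eta, k_max = 6) {
  grid <- expand.grid(k = 1:k_max, m = 1:k_max)
  grid <- grid[grid$m <= grid$k, ]              # multiplicity cannot exceed copy number
  copies <- purity * grid$k + 2 * (1 - purity)  # average copies in the mixture
  log_lik <- dpois(DP, lambda = eta * copies, log = TRUE) +
    dbinom(NV, size = DP, prob = purity * grid$m / copies, log = TRUE)
  # Normalise to a probability distribution
  # (flat prior, so the posterior is proportional to the likelihood)
  grid$prob <- exp(log_lik - max(log_lik))
  grid$prob <- grid$prob / sum(grid$prob)
  grid[order(-grid$prob), ]                     # most probable configuration first
}

res <- classify_grid(DP = 600, NV = 180, purity = 0.6, eta = 300)
head(res, 3)
```

For this toy input the top-ranked row is k = 2, m = 1, since that configuration predicts both the observed depth (600) and the observed fraction of mutant reads (0.3) under the stated assumptions.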
Input format
The input required for INCOMMON classification consists of the following data tables:
- genomic_data: a data table of annotated mutations with columns indicating, for each mutation, the sample name (sample), mutant chromosome (chr), start position (from), end position (to), reference allele (ref), alternative allele (alt), sequencing depth (DP), number of reads with variant (NV), variant allele frequency (VAF), mutant gene name as Hugo Symbol (gene), and possibly the protein sequence of the variant in HGVS recommended format, preferably with 1-letter amino-acid code (HGVSp_Short).
- clinical_data: a data table of clinical data with matched sample names (sample) and purity (purity), both required, and clinical features like tumor type as ONCOTREE code (tumor_type, required for tumor-specific priors), survival data such as status (OS_STATUS) and time (OS_MONTHS), required for survival analysis, metastasis data such as sample type (SAMPLE_TYPE, Primary or Metastasis), number of metastases (MET_COUNT, required for metastatic propensity analysis) and metastatic site (METASTATIC_SITE, required for metastatic tropism analysis), plus any other useful covariate.
- gene_roles: a data table reporting gene names (gene) and associated roles (gene_role, “oncogene” or “TSG”). INCOMMON provides a set of gene roles extracted from the COSMIC Cancer Gene Census (v98) as default.
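To make the expected layout concrete, the sketch below builds toy versions of the three tables in base R using the column names listed above. All values, including sample names and genomic positions, are made-up placeholders, and whether the package expects plain data frames or tibbles is not confirmed here.

```r
# Toy input tables with the column names listed above.
# Every value is a made-up placeholder for illustration.
genomic_data <- data.frame(
  sample = c("S1", "S1", "S2"),
  chr    = c("chr17", "chr12", "chr17"),
  from   = c(1000100, 2000200, 1000300),
  to     = c(1000100, 2000200, 1000300),
  ref    = c("C", "C", "G"),
  alt    = c("T", "A", "A"),
  DP     = c(800L, 650L, 420L),
  NV     = c(240L, 130L, 210L),
  gene   = c("TP53", "KRAS", "TP53"),
  stringsAsFactors = FALSE
)
# VAF is the fraction of reads carrying the variant
genomic_data$VAF <- genomic_data$NV / genomic_data$DP

clinical_data <- data.frame(
  sample     = c("S1", "S2"),
  purity     = c(0.6, 0.8),
  tumor_type = c("LUAD", "BLCA"),
  stringsAsFactors = FALSE
)

gene_roles <- data.frame(
  gene      = c("TP53", "KRAS"),
  gene_role = c("TSG", "oncogene"),
  stringsAsFactors = FALSE
)

# Basic consistency checks before handing the tables to the package
stopifnot(all(genomic_data$NV <= genomic_data$DP))
stopifnot(all(genomic_data$sample %in% clinical_data$sample))
stopifnot(all(genomic_data$gene %in% gene_roles$gene))
```

The checks at the end are a design choice of this sketch: catching read counts that exceed depth, or samples without clinical data, before any downstream processing.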
The input for downstream analysis is checked and cast in the expected format through the function init.
INCOMMON provides data from the publicly available MSK-MetTropism cohort in the correct format. The following example shows how this input is pre-processed by INCOMMON:
data(MSK_genomic_data)
data(MSK_clinical_data)
data(cancer_gene_census)
x = init(
genomic_data = MSK_genomic_data,
clinical_data = MSK_clinical_data,
gene_roles = cancer_gene_census
)
#> ── INCOMMON - Inference of copy number and mutation multiplicity in oncology ───
#>
#> ── Genomic data ──
#>
#> ✔ Found 25659 samples, with 224939 mutations in 491 genes
#> ! No read counts found for 1393 mutations in 1393 samples
#> ! Gene name not provided for 1393 mutations
#> ! 201 genes could not be assigned a role (TSG or oncogene)
#>
#> ── Clinical data ──
#>
#> ℹ Provided clinical features:
#> ✔ sample (required for classification)
#> ✔ purity (required for classification)
#> ✔ tumor_type
#> ✔ OS_MONTHS
#> ✔ OS_STATUS
#> ✔ SAMPLE_TYPE
#> ✔ MET_COUNT
#> ✔ METASTATIC_SITE
#> ✔ MET_SITE_COUNT
#> ✔ PRIMARY_SITE
#> ✔ SUBTYPE_ABBREVIATION
#> ✔ GENE_PANEL
#> ✔ SEX
#> ✔ TMB_NONSYNONYMOUS
#> ✔ FGA
#> ✔ AGE_AT_SEQUENCING
#> ✔ RACE
#> ✔ Found 25257 matching samples
#> ✖ Found 513 unmatched samples
print(x)
#> ── [ INCOMMON ] 175054 PASS mutations across 24018 samples, with 290 mutant gen
#> ℹ Average sample purity: 0.4
#> ℹ Average sequencing depth: 649
#> # A tibble: 175,054 × 27
#> sample tumor_type purity chr from to ref alt DP NV VAF
#> <chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <int> <int> <dbl>
#> 1 P-0028912 CHOL 0.3 chr17 7.58e6 7.58e6 G A 837 133 0.159
#> 2 P-0028912 CHOL 0.3 chrX 5.32e7 5.32e7 G A 832 85 0.102
#> 3 P-0003698 BLCA 0.2 chr17 7.58e6 7.58e6 C A 437 109 0.249
#> 4 P-0003698 BLCA 0.2 chr5 1.49e8 1.49e8 C T 360 36 0.1
#> 5 P-0003698 BLCA 0.2 chr13 3.29e7 3.29e7 G C 1027 162 0.158
#> 6 P-0003698 BLCA 0.2 chr13 3.29e7 3.29e7 G C 1021 182 0.178
#> 7 P-0003698 BLCA 0.2 chr19 1.11e7 1.11e7 G T 573 98 0.171
#> 8 P-0003698 BLCA 0.2 chr22 4.15e7 4.15e7 G A 416 45 0.108
#> 9 P-0003698 BLCA 0.2 chrX 4.49e7 4.49e7 C T 730 194 0.266
#> 10 P-0003823 BLCA 0.6 chr5 1.30e6 1.30e6 G A 218 138 0.633
#> # ℹ 175,044 more rows
#> # ℹ 16 more variables: gene <chr>, gene_role <chr>, OS_MONTHS <dbl>,
#> # OS_STATUS <dbl>, SAMPLE_TYPE <chr>, MET_COUNT <dbl>, METASTATIC_SITE <chr>,
#> # MET_SITE_COUNT <dbl>, PRIMARY_SITE <chr>, SUBTYPE_ABBREVIATION <chr>,
#> # GENE_PANEL <chr>, SEX <chr>, TMB_NONSYNONYMOUS <dbl>, FGA <dbl>,
#> # AGE_AT_SEQUENCING <dbl>, RACE <chr>
Gene mutant dosage
Downstream of INCOMMON classification, the tumour genome can be interpreted in terms of gene mutant dosage: inactivation of tumour suppressor genes (typically through mutations with LOH) and enhanced activation of oncogenes (typically through mutations with mutant copy gain).
Mutant dosage classes (low, balanced and high) are defined from the fraction of alleles carrying the mutation (FAM), which is obtained from the full posterior distribution computed by INCOMMON.
Survival analysis
If patients’ survival status and time are provided as features in the clinical data table clinical_table, survival analysis can be performed.
Downstream of classification, INCOMMON can stratify patients based on
the mutant dosage of a TSG or an oncogene of interest.
INCOMMON provides the following functions, dedicated to fitting survival data:
- kaplan_meier_fit uses the Kaplan-Meier estimator to fit survival data from patients stratified by the status (low, balanced or high mutant dosage) of a specific gene in a specific tumor_type.
- cox_fit uses a Cox proportional hazards model to fit survival data. In addition to the arguments tumor_type and gene, it accepts other covariates, provided they are available in the clinical_table.
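The idea behind stratified Kaplan-Meier fitting can be sketched with a hand-written estimator in base R. This is not the code behind kaplan_meier_fit: `km_estimate` is a hypothetical helper, the toy cohort is made up, `class` is an assumed column holding the mutant dosage class, and OS_STATUS is assumed to be coded 1 for death and 0 for censoring.

```r
# Hand-written Kaplan-Meier estimator in base R, for illustration only.
# This is not the code behind kaplan_meier_fit.
km_estimate <- function(time, status) {
  event_times <- sort(unique(time[status == 1]))
  # Patients still under observation just before each event time
  n_at_risk <- vapply(event_times, function(t) sum(time >= t), integer(1))
  # Deaths observed at each event time
  n_events <- vapply(event_times, function(t) sum(time == t & status == 1), integer(1))
  data.frame(
    time = event_times,
    n_at_risk = n_at_risk,
    n_events = n_events,
    survival = cumprod(1 - n_events / n_at_risk)
  )
}

# Toy cohort: OS_STATUS is 1 for death and 0 for censoring (assumed coding);
# `class` is a hypothetical column holding the mutant dosage class.
toy <- data.frame(
  OS_MONTHS = c(5, 8, 12, 20, 24, 3, 6, 9, 10, 15),
  OS_STATUS = c(1, 0, 1, 0, 0, 1, 1, 1, 0, 1),
  class = rep(c("low", "high"), each = 5)
)

# One survival curve per mutant dosage class
km_by_class <- lapply(split(toy, toy$class), function(d) {
  km_estimate(d$OS_MONTHS, d$OS_STATUS)
})
km_by_class
```

Each element of `km_by_class` is a step function of survival probability over time for one dosage class, which is the quantity a Kaplan-Meier plot displays.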
Metastatic patterns
If information about metastatisation is provided in the clinical data table clinical_table, such as the type of sample (primary tumor or metastasis), whether patients are metastatic or not, and the sites of metastatisation for primary tumors, analysis of metastatic propensity and tropism based on INCOMMON classification and genome interpretation can be performed.
INCOMMON provides the following functions for analysis of metastases:
- met_propensity: uses a logistic regression test to compute the odds ratio (OR) of metastatisation between patients identified by the mutant dosage of a gene (low, balanced or high) for specific types of primary tumors (tumor_type).
- met_tropism: uses a logistic regression test to compute the odds ratio (OR) to metastatise to a specific site (METASTATIC_SITE) between patients identified by the two mutational statuses of a gene (Mutant TSG with versus without LOH and Mutant oncogene with versus without amplification) for specific types of primary tumors (tumor_type).
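The odds-ratio computation that both functions rely on can be illustrated with a plain logistic regression in base R. The sketch below is not the code behind met_propensity or met_tropism: the data are made up, and the column names `metastatic` and `class` are assumptions chosen for the example.

```r
# Illustration of an odds ratio from logistic regression in base R.
# Not the code behind met_propensity; data and column names are toy assumptions.
toy <- data.frame(
  # 40 patients per class: 30 of 40 metastatic in "high", 20 of 40 in "low"
  metastatic = c(rep(1, 30), rep(0, 10), rep(1, 20), rep(0, 20)),
  class = factor(rep(c("high", "low"), times = c(40, 40)),
                 levels = c("low", "high"))  # "low" is the reference level
)

fit <- glm(metastatic ~ class, data = toy, family = binomial())

# Exponentiating the coefficient gives the odds ratio of high vs low dosage
odds_ratio <- unname(exp(coef(fit)["classhigh"]))
# Wald test p-value for the same coefficient
p_value <- coef(summary(fit))["classhigh", "Pr(>|z|)"]
odds_ratio
```

In this toy table the odds of metastasis are 30/10 in the high class and 20/20 in the low class, so the fitted odds ratio is approximately 3. Further covariates could be added to the right-hand side of the formula to adjust the estimate.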