2. Inference of copy number and mutation multiplicity

library(INCOMMON)
#> Warning: replacing previous import 'cli::num_ansi_colors' by
#> 'crayon::num_ansi_colors' when loading 'INCOMMON'
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(DT)

In this vignette, we classify the mutations of a single sample from the MSK-MetTropism dataset provided with the package.

2.1 Input preparation

The minimal input for INCOMMON analyses consists of two pieces.

2.1.1 Genomic data

First, we need a table genomic_data of mutations, with the required columns chr, from, to, ref, alt, DP, NV, VAF, and gene.
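As a minimal sketch, the required columns can be checked before going any further. The rows below are copied from the sample analysed later in this vignette. The names required_cols, missing_cols and vaf_consistent are illustrative and not part of INCOMMON, and the consistency check assumes that VAF is the ratio NV / DP, which the printed values agree with.

```r
# Columns that the text lists as required for genomic_data.
required_cols <- c("chr", "from", "to", "ref", "alt", "DP", "NV", "VAF", "gene")

# Toy table built from the four mutations of sample P-0002081.
genomic_data <- data.frame(
  sample = "P-0002081",
  chr    = c("chr12", "chr17", "chr19", "chr19"),
  from   = c(25398285, 7577139, 1221293, 11141472),
  to     = c(25398285, 7577139, 1221293, 11141473),
  ref    = c("C", "G", "C", "-"),
  alt    = c("A", "A", "A", "C"),
  DP     = c(743L, 246L, 260L, 271L),
  NV     = c(378L, 116L, 122L, 133L),
  VAF    = c(0.509, 0.472, 0.469, 0.491),
  gene   = c("KRAS", "TP53", "STK11", "SMARCA4"),
  stringsAsFactors = FALSE
)

# Any required column that is absent from the table.
missing_cols <- setdiff(required_cols, names(genomic_data))

# Assumption: VAF = NV / DP, compared up to the rounding of the printed values.
vaf_consistent <- abs(genomic_data$NV / genomic_data$DP - genomic_data$VAF) < 1e-3

print(missing_cols)
print(vaf_consistent)
```

An empty missing_cols means that the table has every required column.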

An example is the internal dataset MSK_genomic_data, obtained from the MSK-MetTropism cohort.

2.1.2 Clinical data

Second, we need a table of clinical data with at least the columns sample (sample names matching the ones in genomic_data) and purity (purity of each sample). For the INCOMMON classification task, it also helps to have a column tumor_type specifying the tumour type of the sample, which is required for using tumour-specific priors. An example is the internal dataset MSK_clinical_data, obtained from the MSK-MetTropism cohort.
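The requirements on the clinical table can be sketched in the same way. The helper check_clinical_data below is hypothetical (it is not an INCOMMON function), and treating purity as a fraction between 0 and 1 is an assumption of this sketch.

```r
# Hypothetical helper: report problems in a clinical table relative to
# the sample names found in the genomic table.
check_clinical_data <- function(clinical_data, genomic_samples) {
  required <- c("sample", "purity")
  missing_cols <- setdiff(required, names(clinical_data))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  purity <- clinical_data$purity
  list(
    # Genomic samples with no clinical record.
    unmatched_samples = setdiff(unique(genomic_samples), clinical_data$sample),
    # Assumption: a valid purity is a non-missing value in [0, 1].
    invalid_purity = clinical_data$sample[is.na(purity) | purity < 0 | purity > 1],
    # tumor_type is optional, but needed for tumour-specific priors.
    has_tumor_type = "tumor_type" %in% names(clinical_data)
  )
}

clinical_data <- data.frame(
  sample     = "P-0002081",
  tumor_type = "LUAD",
  purity     = 0.6,
  stringsAsFactors = FALSE
)

report <- check_clinical_data(clinical_data, genomic_samples = rep("P-0002081", 4))
str(report)
```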

2.2 Classification of sample ‘P-0002081’

We now focus on a specific sample:

sample = 'P-0002081'

genomic_data = MSK_genomic_data %>% filter(sample == !!sample)
clinical_data = MSK_clinical_data %>% filter(sample == !!sample)

print(genomic_data)
#> # A tibble: 4 × 10
#>   sample    chr       from       to ref   alt      DP    NV   VAF gene   
#>   <chr>     <chr>    <dbl>    <dbl> <chr> <chr> <int> <int> <dbl> <chr>  
#> 1 P-0002081 chr12 25398285 25398285 C     A       743   378 0.509 KRAS   
#> 2 P-0002081 chr17  7577139  7577139 G     A       246   116 0.472 TP53   
#> 3 P-0002081 chr19  1221293  1221293 C     A       260   122 0.469 STK11  
#> 4 P-0002081 chr19 11141472 11141473 -     C       271   133 0.491 SMARCA4
print(clinical_data)
#> # A tibble: 1 × 17
#> # Rowwise: 
#>   sample    tumor_type purity OS_MONTHS OS_STATUS SAMPLE_TYPE MET_COUNT
#>   <chr>     <chr>       <dbl>     <dbl>     <dbl> <chr>           <dbl>
#> 1 P-0002081 LUAD          0.6      0.36         1 Metastasis          6
#> # ℹ 10 more variables: METASTATIC_SITE <chr>, MET_SITE_COUNT <dbl>,
#> #   PRIMARY_SITE <chr>, SUBTYPE_ABBREVIATION <chr>, GENE_PANEL <chr>,
#> #   SEX <chr>, TMB_NONSYNONYMOUS <dbl>, FGA <dbl>, AGE_AT_SEQUENCING <dbl>,
#> #   RACE <chr>

By inspecting the clinical data table, we can see that this is a sample of a metastatic lung adenocarcinoma (LUAD), with purity 0.6, sequenced through the MSK-IMPACT targeted panel version 341.

The genomic data table contains 4 mutations affecting KRAS, TP53, STK11 and SMARCA4 genes.

2.2.1 Initialisation of the input

The first step is to convert the input into INCOMMON format with the function init. This function takes the tables genomic_data and clinical_data as input, plus, optionally, a list of gene roles.

INCOMMON provides a default list cancer_gene_census obtained from the COSMIC Cancer Gene Census v.98. The required format is as follows:

Let’s have a look at the output of function init:

x = init(genomic_data = genomic_data, 
         clinical_data = clinical_data, 
         gene_roles = cancer_gene_census)
#> ── INCOMMON - Inference of copy number and mutation multiplicity in oncology ───
#> 
#> ── Genomic data ──
#> 
#> ✔ Found 1 samples, with 4 mutations in 4 genes
#> 
#> ── Clinical data ──
#> 
#> ℹ Provided clinical features:
#> ✔ sample (required for classification)
#> ✔ purity (required for classification)
#> ✔ tumor_type
#> ✔ OS_MONTHS
#> ✔ OS_STATUS
#> ✔ SAMPLE_TYPE
#> ✔ MET_COUNT
#> ✔ METASTATIC_SITE
#> ✔ MET_SITE_COUNT
#> ✔ PRIMARY_SITE
#> ✔ SUBTYPE_ABBREVIATION
#> ✔ GENE_PANEL
#> ✔ SEX
#> ✔ TMB_NONSYNONYMOUS
#> ✔ FGA
#> ✔ AGE_AT_SEQUENCING
#> ✔ RACE
#> ✔ Found 1 matching samples
#> ✔ No mismatched samples

print(x)
#> ── [ INCOMMON ]  4 PASS mutations across 1 samples, with 4 mutant genes across 1
#> ℹ Average sample purity: 0.6
#> ℹ Average sequencing depth: 380
#> # A tibble: 4 × 27
#>   sample    tumor_type purity chr      from     to ref   alt      DP    NV   VAF
#>   <chr>     <chr>       <dbl> <chr>   <dbl>  <dbl> <chr> <chr> <int> <int> <dbl>
#> 1 P-0002081 LUAD          0.6 chr12  2.54e7 2.54e7 C     A       743   378 0.509
#> 2 P-0002081 LUAD          0.6 chr17  7.58e6 7.58e6 G     A       246   116 0.472
#> 3 P-0002081 LUAD          0.6 chr19  1.22e6 1.22e6 C     A       260   122 0.469
#> 4 P-0002081 LUAD          0.6 chr19  1.11e7 1.11e7 -     C       271   133 0.491
#> # ℹ 16 more variables: gene <chr>, gene_role <chr>, OS_MONTHS <dbl>,
#> #   OS_STATUS <dbl>, SAMPLE_TYPE <chr>, MET_COUNT <dbl>, METASTATIC_SITE <chr>,
#> #   MET_SITE_COUNT <dbl>, PRIMARY_SITE <chr>, SUBTYPE_ABBREVIATION <chr>,
#> #   GENE_PANEL <chr>, SEX <chr>, TMB_NONSYNONYMOUS <dbl>, FGA <dbl>,
#> #   AGE_AT_SEQUENCING <dbl>, RACE <chr>

All the requirements for INCOMMON classification are satisfied. The average sequencing depth is 380. Mutations flagged as PASS are those that satisfy the requirements for INCOMMON classification: an available, non-negative sample purity, integer sequencing depth and number of reads with the variant, character gene names, etc. In this sample, all 4 mutations carry the required information.
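The PASS criteria listed above can be sketched as a vectorised check. The function flag_pass is hypothetical and covers only the criteria named in the text; the package may apply further checks.

```r
# Hypothetical re-implementation of the PASS criteria named in the text.
flag_pass <- function(purity, DP, NV, gene) {
  # TRUE for non-missing whole numbers, FALSE otherwise.
  is_whole <- function(x) !is.na(x) & x == floor(x)
  !is.na(purity) & purity >= 0 &   # available, non-negative purity
    is_whole(DP) & is_whole(NV) &  # integer depth and variant read count
    is.character(gene) & !is.na(gene)
}

# Values of the four mutations of sample P-0002081.
DP   <- c(743, 246, 260, 271)
NV   <- c(378, 116, 122, 133)
gene <- c("KRAS", "TP53", "STK11", "SMARCA4")

pass <- flag_pass(purity = 0.6, DP = DP, NV = NV, gene = gene)
print(pass)

# Average sequencing depth over the four mutations: (743 + 246 + 260 + 271) / 4.
print(mean(DP))
```

All four mutations pass, and the mean depth reproduces the value of 380 reported by print(x).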

2.2.2 Running the INCOMMON classifier

We now run the classification step with the function classify, using the default priors (see the dedicated section), no entropy cutoff, and over-dispersion parameter rho = 0.01.

x = classify(x = x, 
             priors = pcawg_priors, 
             entropy_cutoff = NULL,
             rho = 0.01)
#> 
#> ── INCOMMON inference of copy number and mutation multiplicity for sample  ─────
#> ℹ Performing classification
#> → No LUAD-specific prior probability specified for KRAS
#> → Using a pan-cancer prior
#> ✔ Loading CNAqc, 'Copy Number Alteration quality check'. Support : <https://caravagn.github.io/CNAqc/>
#> → No LUAD-specific prior probability specified for TP53
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for STK11
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for SMARCA4
#> → Using a pan-cancer prior
#> ℹ There are:
#> • N = 0 mutations (HMD)
#> • N = 3 mutations (LOH)
#> • N = 0 mutations (CNLOH)
#> • N = 1 mutations (AM)
#> • N = 0 mutations (Tier-2)
#> ℹ The mean classification entropy is 0.04 (min: 0.01, max: 0.06)

There are 3 mutant genes with loss of heterozygosity (LOH) and 1 with amplification of the mutant allele (AM). The mean entropy of 0.04 indicates high confidence in the classification, with the largest entropy being 0.06.
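To see why a low entropy signals a confident call, consider the sketch below. It assumes that the classification entropy is the Shannon entropy of the posterior probabilities over classes; the package may normalise or define it differently. The posterior vectors are made up for illustration.

```r
# Shannon entropy (natural logarithm) of a discrete probability vector.
shannon_entropy <- function(p) {
  stopifnot(all(p >= 0), abs(sum(p) - 1) < 1e-8)
  p <- p[p > 0]  # terms with p = 0 contribute nothing
  -sum(p * log(p))
}

# Illustrative posteriors over four classes (not taken from the package output).
peaked <- c(0.995, 0.003, 0.001, 0.001)  # one class dominates
flat   <- rep(0.25, 4)                   # no class is preferred

print(shannon_entropy(peaked))
print(shannon_entropy(flat))
```

The peaked posterior yields an entropy close to 0, whereas the flat one reaches the maximum for four classes, log(4).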

print(x)
#> ── [ INCOMMON ]  4 PASS mutations across 1 samples, with 4 mutant genes across 1
#> ℹ Average sample purity: 0.6
#> ℹ Average sequencing depth: 380
#> ── [ INCOMMON ]  Classified mutations with overdispersion parameter 0.01 and ent
#> ℹ There are:
#> • N = 0 mutations (HMD)
#> • N = 3 mutations (LOH)
#> • N = 0 mutations (CNLOH)
#> • N = 1 mutations (AM)
#> • N = 0 mutations (Tier-2)
#> # A tibble: 4 × 18
#>   sample    tumor_type purity chr      from     to ref   alt      DP    NV   VAF
#>   <chr>     <chr>       <dbl> <chr>   <dbl>  <dbl> <chr> <chr> <int> <int> <dbl>
#> 1 P-0002081 LUAD          0.6 chr12  2.54e7 2.54e7 C     A       743   378 0.509
#> 2 P-0002081 LUAD          0.6 chr17  7.58e6 7.58e6 G     A       246   116 0.472
#> 3 P-0002081 LUAD          0.6 chr19  1.22e6 1.22e6 C     A       260   122 0.469
#> 4 P-0002081 LUAD          0.6 chr19  1.11e7 1.11e7 -     C       271   133 0.491
#> # ℹ 7 more variables: gene <chr>, gene_role <chr>, id <chr>, label <chr>,
#> #   state <chr>, posterior <dbl>, entropy <dbl>

The only mutant oncogene in the sample is KRAS, which is mutated with amplification of the mutant allele (AM), whereas all the tumour suppressor genes (TSGs: TP53, SMARCA4, STK11) are in LOH.
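For intuition on these calls, the sketch below computes the expected VAF of a clonal mutation with multiplicity m on a tumour segment of total copy number k, in a sample of purity 0.6 with diploid normal cells. The formula is the standard purity-adjusted expectation. The (k, m) pair attached to each label is an assumption of this sketch (AM in particular covers several configurations), and expected_vaf is not an INCOMMON function.

```r
# Expected VAF for tumour copy number k, multiplicity m and sample purity,
# assuming a clonal mutation and diploid normal cells.
expected_vaf <- function(purity, k, m) {
  m * purity / (k * purity + 2 * (1 - purity))
}

# Assumed (k, m) configurations, one illustrative choice per class.
states <- data.frame(
  label = c("HMD", "LOH", "CNLOH", "AM"),
  k     = c(2, 1, 2, 3),
  m     = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)
states$expected_vaf <- expected_vaf(purity = 0.6, k = states$k, m = states$m)

print(states)
```

The observed VAFs (0.469 to 0.509) lie between the expectations of several of these configurations, which is why the classifier combines the read-count likelihood with prior probabilities rather than picking the nearest expected VAF.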

2.3 Visualising INCOMMON classification

INCOMMON can visualise the maximum a posteriori classification through the function plot_classification.

plot_classification(x, sample = sample, assembly = TRUE)
#> → No LUAD-specific prior probability specified for KRAS
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for KRAS
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for TP53
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for TP53
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for STK11
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for STK11
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for SMARCA4
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for SMARCA4
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for KRAS
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for TP53
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for STK11
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for SMARCA4
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for KRAS
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for TP53
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for STK11
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for SMARCA4
#> → Using a pan-cancer prior

These plots show the posterior mixtures of Beta-Binomial distributions (one for each INCOMMON class) spanning the total sequencing depth of the mutation site. Each colour and marker shape identifies a different combination of ploidy and multiplicity. The vertical dashed line marks the number of reads with the variant, $NV$, whereas the horizontal line marks the value of the classification entropy.
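A single component of such a mixture can be sketched in base R. The sketch assumes a Beta-Binomial parametrised by its expected VAF mu and an over-dispersion rho, with shape parameters a = mu (1 - rho) / rho and b = (1 - mu)(1 - rho) / rho. The function dbetabinom_sketch is hypothetical, and the package's own parametrisation may differ.

```r
# Beta-Binomial probability of observing nv variant reads out of dp,
# with mean mu and over-dispersion rho (assumed parametrisation).
dbetabinom_sketch <- function(nv, dp, mu, rho) {
  a <- mu * (1 - rho) / rho
  b <- (1 - mu) * (1 - rho) / rho
  # Computed on the log scale for numerical stability.
  exp(lchoose(dp, nv) + lbeta(nv + a, dp - nv + b) - lbeta(a, b))
}

# TP53 site of sample P-0002081: DP = 246, NV = 116, purity = 0.6.
dp <- 246
# Expected VAF for one tumour copy carrying the mutation (k = 1, m = 1).
mu_loh <- 0.6 / (1 * 0.6 + 2 * (1 - 0.6))

nv_grid     <- 0:dp
density_loh <- dbetabinom_sketch(nv_grid, dp, mu_loh, rho = 0.01)

# Probability mass at the observed number of variant reads.
print(density_loh[nv_grid == 116])
```

Evaluating the function over 0:dp gives the curve of one class; repeating this for each (k, m) configuration and weighting by the priors would give a mixture of the kind shown in the plots.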