2. Inference of copy number and mutation multiplicity
Source:vignettes/a2_classify_mutations.Rmd
library(INCOMMON)
#> Warning: replacing previous import 'cli::num_ansi_colors' by
#> 'crayon::num_ansi_colors' when loading 'INCOMMON'
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(DT)
In this vignette, we classify mutations from a single sample of the MSK-MetTropism dataset provided within the package.
2.1 Input preparation
The minimal input for INCOMMON analyses consists of two pieces.
2.1.1 Genomic data
First, we need a table of genomic_data (mutations) with required columns chr, from, to, ref, alt, DP, NV, VAF, and gene.
The following example is taken from the internal dataset obtained from the MSK-MetTropism cohort:
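As a minimal sketch of the expected layout, a table with these columns can also be assembled by hand. The two rows below reuse values printed for sample P-0002081 in Section 2.2. The names toy_genomic_data, required_columns and missing_columns are purely illustrative, and a plain data.frame is used here only to keep the example free of dependencies (the package datasets are tibbles).

```r
# Illustrative only: a hand-built table with the required columns plus
# `sample`, which links each mutation to the clinical table.
toy_genomic_data <- data.frame(
  sample = c("P-0002081", "P-0002081"),
  chr    = c("chr12", "chr17"),
  from   = c(25398285, 7577139),
  to     = c(25398285, 7577139),
  ref    = c("C", "G"),
  alt    = c("A", "A"),
  DP     = c(743L, 246L),  # total sequencing depth at the site
  NV     = c(378L, 116L),  # number of reads carrying the variant
  gene   = c("KRAS", "TP53"),
  stringsAsFactors = FALSE
)

# VAF is the fraction of reads carrying the variant.
toy_genomic_data$VAF <- toy_genomic_data$NV / toy_genomic_data$DP

# Check that every required column is present before going further.
required_columns <- c("chr", "from", "to", "ref", "alt", "DP", "NV", "VAF", "gene")
missing_columns <- setdiff(required_columns, colnames(toy_genomic_data))
print(missing_columns)
```

An empty missing_columns vector indicates that the table has the minimal layout described above.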
2.1.2 Clinical data
Second, we need a table of clinical data with at least the columns sample (sample names matching the ones in genomic_data) and purity (purity of each sample). For the INCOMMON classification task, it may also help to include a column tumor_type specifying the tumour type of the sample, which is required for using tumour-specific priors. The following example is taken from the internal dataset obtained from the MSK-MetTropism cohort:
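The matching requirement between the two tables can be checked before any INCOMMON function is called. The sketch below is an assumption-laden illustration in base R: toy_clinical_data, genomic_samples, unmatched and purity_ok are hypothetical names, and the values are those reported for sample P-0002081 in Section 2.2. Treating purity as a fraction between 0 and 1 is also an assumption of this sketch.

```r
# Illustrative only: a minimal clinical table with the columns named above.
toy_clinical_data <- data.frame(
  sample     = "P-0002081",
  tumor_type = "LUAD",
  purity     = 0.6,
  stringsAsFactors = FALSE
)

# Sample names as they would appear in the genomic table (one per mutation).
genomic_samples <- rep("P-0002081", 4)

# Samples with mutations but no clinical record cannot be classified.
unmatched <- setdiff(unique(genomic_samples), toy_clinical_data$sample)

# Purity should be available and, in this sketch, lie between 0 and 1.
purity_ok <- !is.na(toy_clinical_data$purity) &
  toy_clinical_data$purity >= 0 &
  toy_clinical_data$purity <= 1

print(unmatched)
print(purity_ok)
```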
2.2 Classification of sample ‘P-0002081’
We now focus on a specific sample:
sample = 'P-0002081'
genomic_data = MSK_genomic_data %>% filter(sample == !!sample)
clinical_data = MSK_clinical_data %>% filter(sample == !!sample)
print(genomic_data)
#> # A tibble: 4 × 10
#> sample chr from to ref alt DP NV VAF gene
#> <chr> <chr> <dbl> <dbl> <chr> <chr> <int> <int> <dbl> <chr>
#> 1 P-0002081 chr12 25398285 25398285 C A 743 378 0.509 KRAS
#> 2 P-0002081 chr17 7577139 7577139 G A 246 116 0.472 TP53
#> 3 P-0002081 chr19 1221293 1221293 C A 260 122 0.469 STK11
#> 4 P-0002081 chr19 11141472 11141473 - C 271 133 0.491 SMARCA4
print(clinical_data)
#> # A tibble: 1 × 17
#> # Rowwise:
#> sample tumor_type purity OS_MONTHS OS_STATUS SAMPLE_TYPE MET_COUNT
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 P-0002081 LUAD 0.6 0.36 1 Metastasis 6
#> # ℹ 10 more variables: METASTATIC_SITE <chr>, MET_SITE_COUNT <dbl>,
#> # PRIMARY_SITE <chr>, SUBTYPE_ABBREVIATION <chr>, GENE_PANEL <chr>,
#> # SEX <chr>, TMB_NONSYNONYMOUS <dbl>, FGA <dbl>, AGE_AT_SEQUENCING <dbl>,
#> # RACE <chr>
By inspecting the clinical data table, we can see that this is a sample of a metastatic lung adenocarcinoma (LUAD), with purity 0.6, sequenced through the MSK-IMPACT targeted panel version 341.
The genomic data table contains 4 mutations, one in each of the genes KRAS, TP53, STK11 and SMARCA4.
2.2.1 Initialisation of the input
The first step is to initialise the input in INCOMMON format through function init. This function takes as input the tables of genomic_data and clinical_data plus, optionally, a list of gene roles. INCOMMON provides a default list, cancer_gene_census, obtained from the COSMIC Cancer Gene Census v.98. The required format is as follows:
Let’s have a look at the output of function init:
x = init(genomic_data = genomic_data,
         clinical_data = clinical_data,
         gene_roles = cancer_gene_census)
#> ── INCOMMON - Inference of copy number and mutation multiplicity in oncology ───
#>
#> ── Genomic data ──
#>
#> ✔ Found 1 samples, with 4 mutations in 4 genes
#>
#> ── Clinical data ──
#>
#> ℹ Provided clinical features:
#> ✔ sample (required for classification)
#> ✔ purity (required for classification)
#> ✔ tumor_type
#> ✔ OS_MONTHS
#> ✔ OS_STATUS
#> ✔ SAMPLE_TYPE
#> ✔ MET_COUNT
#> ✔ METASTATIC_SITE
#> ✔ MET_SITE_COUNT
#> ✔ PRIMARY_SITE
#> ✔ SUBTYPE_ABBREVIATION
#> ✔ GENE_PANEL
#> ✔ SEX
#> ✔ TMB_NONSYNONYMOUS
#> ✔ FGA
#> ✔ AGE_AT_SEQUENCING
#> ✔ RACE
#> ✔ Found 1 matching samples
#> ✔ No mismatched samples
print(x)
#> ── [ INCOMMON ] 4 PASS mutations across 1 samples, with 4 mutant genes across 1
#> ℹ Average sample purity: 0.6
#> ℹ Average sequencing depth: 380
#> # A tibble: 4 × 27
#> sample tumor_type purity chr from to ref alt DP NV VAF
#> <chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <int> <int> <dbl>
#> 1 P-0002081 LUAD 0.6 chr12 2.54e7 2.54e7 C A 743 378 0.509
#> 2 P-0002081 LUAD 0.6 chr17 7.58e6 7.58e6 G A 246 116 0.472
#> 3 P-0002081 LUAD 0.6 chr19 1.22e6 1.22e6 C A 260 122 0.469
#> 4 P-0002081 LUAD 0.6 chr19 1.11e7 1.11e7 - C 271 133 0.491
#> # ℹ 16 more variables: gene <chr>, gene_role <chr>, OS_MONTHS <dbl>,
#> # OS_STATUS <dbl>, SAMPLE_TYPE <chr>, MET_COUNT <dbl>, METASTATIC_SITE <chr>,
#> # MET_SITE_COUNT <dbl>, PRIMARY_SITE <chr>, SUBTYPE_ABBREVIATION <chr>,
#> # GENE_PANEL <chr>, SEX <chr>, TMB_NONSYNONYMOUS <dbl>, FGA <dbl>,
#> # AGE_AT_SEQUENCING <dbl>, RACE <chr>
All the requirements for INCOMMON classification are satisfied. The average sequencing depth is 380. Mutations flagged as PASS are the ones that satisfy the requirements for INCOMMON classification: available, non-negative sample purity; integer sequencing depth and number of reads with the variant; character gene names; and so on. In this sample, all 4 mutations have all the required information.
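The kind of check behind the PASS flag can be sketched in a few lines of base R. This is not INCOMMON's implementation: is_pass_like is a hypothetical helper that encodes only the requirements listed in the paragraph above, and the input vectors are made up to show one failing purity and one non-integer depth.

```r
# Hypothetical helper, NOT part of INCOMMON: flags mutations that meet the
# requirements listed in the text (purity, integer counts, character genes).
is_pass_like <- function(purity, DP, NV, gene) {
  purity_ok <- !is.na(purity) & purity >= 0
  counts_ok <- !is.na(DP) & !is.na(NV) & DP == round(DP) & NV == round(NV)
  gene_ok   <- is.character(gene) & !is.na(gene)
  purity_ok & counts_ok & gene_ok
}

# Made-up inputs: the third mutation lacks purity, the fourth has a
# non-integer depth, so only the first two would be usable.
flags <- is_pass_like(
  purity = c(0.6, 0.6, NA, 0.6),
  DP     = c(743, 246, 260, 271.5),
  NV     = c(378, 116, 122, 133),
  gene   = c("KRAS", "TP53", "STK11", "SMARCA4")
)
print(flags)

# The average depth reported for the real sample is the mean of its DP values.
mean_depth <- mean(c(743, 246, 260, 271))
print(mean_depth)
```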
2.2.2 Running the INCOMMON classifier
We now run the classification step through function classify, using the default priors (see the dedicated section), no entropy cutoff (entropy_cutoff = NULL), and over-dispersion parameter rho = 0.01.
x = classify(x = x,
             priors = pcawg_priors,
             entropy_cutoff = NULL,
             rho = 0.01)
#>
#> ── INCOMMON inference of copy number and mutation multiplicity for sample ─────
#> ℹ Performing classification
#> → No LUAD-specific prior probability specified for KRAS
#> → Using a pan-cancer prior
#> ✔ Loading CNAqc, 'Copy Number Alteration quality check'. Support : <https://caravagn.github.io/CNAqc/>
#> → No LUAD-specific prior probability specified for TP53
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for STK11
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for SMARCA4
#> → Using a pan-cancer prior
#> ℹ There are:
#> • N = 0 mutations (HMD)
#> • N = 3 mutations (LOH)
#> • N = 0 mutations (CNLOH)
#> • N = 1 mutations (AM)
#> • N = 0 mutations (Tier-2)
#> ℹ The mean classification entropy is 0.04 (min: 0.01, max: 0.06)
There are 3 mutant genes with loss of heterozygosity (LOH) and 1 with amplification of the mutant allele (AM). The average entropy of 0.04 indicates high confidence in the classification, with the largest uncertainty being 0.06.
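To see why a low entropy corresponds to a confident call, consider the Shannon entropy of a vector of posterior class probabilities. The sketch below assumes this plain definition for illustration only; it is not confirmed to be the exact formula or normalisation used by INCOMMON, and both posterior vectors are invented.

```r
# Shannon entropy of a probability vector (natural logarithm).
# Assumption: used here only to illustrate the idea of classification entropy.
shannon_entropy <- function(p) {
  p <- p / sum(p)   # normalise, in case the input does not sum to 1
  p <- p[p > 0]     # treat 0 * log(0) as 0
  -sum(p * log(p))
}

# Invented posteriors over four classes.
confident_posterior <- c(0.99, 0.005, 0.003, 0.002)  # mass on one class
uncertain_posterior <- c(0.25, 0.25, 0.25, 0.25)     # no class preferred

print(shannon_entropy(confident_posterior))
print(shannon_entropy(uncertain_posterior))
```

The entropy is close to zero when nearly all posterior mass sits on one class and reaches its maximum when all classes are equally likely.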
print(x)
#> ── [ INCOMMON ] 4 PASS mutations across 1 samples, with 4 mutant genes across 1
#> ℹ Average sample purity: 0.6
#> ℹ Average sequencing depth: 380
#> ── [ INCOMMON ] Classified mutations with overdispersion parameter 0.01 and ent
#> ℹ There are:
#> • N = 0 mutations (HMD)
#> • N = 3 mutations (LOH)
#> • N = 0 mutations (CNLOH)
#> • N = 1 mutations (AM)
#> • N = 0 mutations (Tier-2)
#> # A tibble: 4 × 18
#> sample tumor_type purity chr from to ref alt DP NV VAF
#> <chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <int> <int> <dbl>
#> 1 P-0002081 LUAD 0.6 chr12 2.54e7 2.54e7 C A 743 378 0.509
#> 2 P-0002081 LUAD 0.6 chr17 7.58e6 7.58e6 G A 246 116 0.472
#> 3 P-0002081 LUAD 0.6 chr19 1.22e6 1.22e6 C A 260 122 0.469
#> 4 P-0002081 LUAD 0.6 chr19 1.11e7 1.11e7 - C 271 133 0.491
#> # ℹ 7 more variables: gene <chr>, gene_role <chr>, id <chr>, label <chr>,
#> # state <chr>, posterior <dbl>, entropy <dbl>
The only mutant oncogene in the sample is KRAS, which is mutated with amplification of the mutant allele, whereas all the tumour suppressor genes (TSGs: TP53, SMARCA4, STK11) show LOH.
2.3 Visualising INCOMMON classification
INCOMMON allows visualising the maximum a posteriori classification through function plot_classification.
plot_classification(x, sample = sample, assembly = TRUE)
#> → No LUAD-specific prior probability specified for KRAS
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for KRAS
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for TP53
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for TP53
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for STK11
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for STK11
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for SMARCA4
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for SMARCA4
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for KRAS
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for TP53
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for STK11
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for SMARCA4
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for KRAS
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for TP53
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for STK11
#> → Using a pan-cancer prior
#> → No LUAD-specific prior probability specified for SMARCA4
#> → Using a pan-cancer prior
These plots show the posterior mixtures of Beta-Binomial distributions (one for each INCOMMON class) spanning the total sequencing depth of the mutation site. Each combination of colour and marker shape identifies a different combination of ploidy and multiplicity. The vertical dashed line corresponds to the number of reads with the variant, whereas the horizontal line corresponds to the value of the classification entropy.
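A single component of such a mixture can be sketched in base R. The code below rests on assumptions that are not taken from the INCOMMON source: the expected VAF of a mutation with multiplicity m in tumour cells of ploidy k at purity π is taken as mπ / (kπ + 2(1 − π)), with diploid normal cells, and the Beta-Binomial is parameterised by its mean and an over-dispersion rho. The functions expected_vaf and dbetabinom_sketch are hypothetical helpers, not package functions.

```r
# Expected VAF for a mutation present in `multiplicity` copies, in tumour
# cells with total copy number `ploidy`, mixed with diploid normal cells.
# Assumption: standard purity-adjusted formula, used for illustration.
expected_vaf <- function(purity, ploidy, multiplicity) {
  multiplicity * purity / (ploidy * purity + 2 * (1 - purity))
}

# Beta-Binomial probability of observing `nv` variant reads out of `dp`,
# with mean `mu` and over-dispersion `rho` (0 < rho < 1).
# Assumption: mean/over-dispersion parameterisation chosen for this sketch.
dbetabinom_sketch <- function(nv, dp, mu, rho) {
  a <- mu * (1 - rho) / rho
  b <- (1 - mu) * (1 - rho) / rho
  exp(lchoose(dp, nv) + lbeta(nv + a, dp - nv + b) - lbeta(a, b))
}

# One component: ploidy 1, multiplicity 1, at purity 0.6 and depth 246
# (the purity of the sample and the depth of its TP53 site).
mu_k1_m1 <- expected_vaf(purity = 0.6, ploidy = 1, multiplicity = 1)
density_over_reads <- dbetabinom_sketch(0:246, dp = 246, mu = mu_k1_m1, rho = 0.01)

# The curve spans every possible read count from 0 to the total depth.
print(mu_k1_m1)
print(which.max(density_over_reads) - 1)  # most likely number of variant reads
```

Repeating the calculation for other ploidy and multiplicity combinations, and weighting each curve by its prior, would give curves of the kind the plots display.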