Process input data into an object of class 'INCOMMON', ready for downstream analyses (e.g. classify).
Usage
init(genomic_data, clinical_data, gene_roles = INCOMMON::cancer_gene_census)
Arguments
- genomic_data
a data table of annotated mutations with columns sample name 'sample', mutant chromosome 'chr', mutation start position 'from', mutation end position 'to', reference allele 'ref', alternative allele 'alt', integer sequencing depth 'DP', integer number of reads with variant 'NV', variant allele frequency 'VAF', gene name 'gene' as Hugo Symbol, and protein sequence of the variant in HGVS recommended format, preferably 1-letter amino-acid code, 'HGVSp_Short'.
- clinical_data
a data table of clinical data with compulsory matching sample names 'sample' and sample purity 'purity', and optional clinical features like tumor type ONCOTREE code 'tumor_type' (required for tumor specific priors), overall survival status 'OS_STATUS' and time 'OS_MONTHS' (required for survival analysis), 'SAMPLE_TYPE' (Primary or Metastasis) and number of metastases 'MET_COUNT' (required for metastatic propensity analysis), metastatic site 'METASTATIC_SITE' (required for metastatic tropism analysis), plus any other useful covariate.
- gene_roles
A data table reporting 'gene' names and associated 'gene_role' ("oncogene" or "TSG"). The default is taken from COSMIC Cancer Gene Census v98.
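The column requirements above can be checked before calling init. The following is a minimal sketch in base R, not part of the package: the helper missing_columns is hypothetical, the sample IDs "S1" and "S2" are invented, and the mutation rows reuse values from the packaged example data. Two assumptions are made and marked in the comments: VAF is derived as NV / DP, and purity is expected to lie between 0 and 1. HGVSp_Short is left out of the required set here because the packaged example data does not carry it.

```r
# Hypothetical helper (not an INCOMMON function): names in `required`
# that are absent from the columns of `data`.
missing_columns <- function(data, required) {
  setdiff(required, colnames(data))
}

# Toy genomic table; values copied from rows of the packaged example data,
# sample IDs invented for illustration.
genomic_data <- data.frame(
  sample = c("S1", "S1", "S2"),
  chr    = c("chr17", "chr13", "chr22"),
  from   = c(7577121, 32913797, 41543840),
  to     = c(7577121, 32913797, 41543840),
  ref    = c("G", "G", "G"),
  alt    = c("A", "C", "A"),
  DP     = c(837L, 1027L, 416L),   # integer sequencing depth
  NV     = c(133L, 162L, 45L),     # integer number of variant reads
  gene   = c("TP53", "BRCA2", "EP300"),
  stringsAsFactors = FALSE
)
# Assumption: VAF is the fraction of reads carrying the variant.
genomic_data$VAF <- genomic_data$NV / genomic_data$DP

# Toy clinical table with the two compulsory columns plus one optional feature.
clinical_data <- data.frame(
  sample     = c("S1", "S2"),
  purity     = c(0.3, 0.2),
  tumor_type = c("CHOL", "BLCA"),
  stringsAsFactors = FALSE
)

required_genomic  <- c("sample", "chr", "from", "to", "ref", "alt",
                       "DP", "NV", "VAF", "gene")
required_clinical <- c("sample", "purity")

# Stop early if a required column is missing or has an unexpected type.
stopifnot(length(missing_columns(genomic_data, required_genomic)) == 0)
stopifnot(length(missing_columns(clinical_data, required_clinical)) == 0)
stopifnot(is.integer(genomic_data$DP), is.integer(genomic_data$NV))
# Assumption: purity is a fraction in [0, 1].
stopifnot(all(clinical_data$purity >= 0 & clinical_data$purity <= 1))

# With the package installed, the tables would then be passed on as in Usage:
# x <- INCOMMON::init(genomic_data = genomic_data, clinical_data = clinical_data)
```

The call to init is left commented out so the sketch runs without the package; whether init accepts a plain data.frame rather than a tibble is not stated in this page.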
Examples
# Example input data from the MSK-MetTropism cohort, released with the package
data(MSK_genomic_data)
print(MSK_genomic_data)
#> # A tibble: 224,939 × 10
#> sample chr from to ref alt DP NV VAF gene
#> <chr> <chr> <dbl> <dbl> <chr> <chr> <int> <int> <dbl> <chr>
#> 1 P-0028912 chr17 7577121 7577121 G A 837 133 0.159 TP53
#> 2 P-0028912 chr6 111983080 111983081 - A 698 141 0.202 FYN
#> 3 P-0028912 chrX 53246994 53246994 G A 832 85 0.102 KDM5C
#> 4 P-0003698 chr17 7576852 7576852 C A 437 109 0.249 TP53
#> 5 P-0003698 chr3 49933259 49933259 C A 591 86 0.146 MST1R
#> 6 P-0003698 chr5 149435631 149435631 C T 360 36 0.1 CSF1R
#> 7 P-0003698 chr13 32913797 32913797 G C 1027 162 0.158 BRCA2
#> 8 P-0003698 chr13 32914259 32914259 G C 1021 182 0.178 BRCA2
#> 9 P-0003698 chr19 11136104 11136104 G T 573 98 0.171 SMARCA4
#> 10 P-0003698 chr22 41543840 41543840 G A 416 45 0.108 EP300
#> # ℹ 224,929 more rows
data(MSK_clinical_data)
print(MSK_clinical_data)
#> # A tibble: 25,368 × 17
#> # Rowwise:
#> sample tumor_type purity OS_MONTHS OS_STATUS SAMPLE_TYPE MET_COUNT
#> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 P-0000004 BRCA 0.5 3.78 1 Primary 2
#> 2 P-0000015 BRCA 0.4 13.9 1 Metastasis 8
#> 3 P-0000024 UCEC 0.4 35.1 1 Metastasis 8
#> 4 P-0000025 UCEC 0.3 46 1 Metastasis 13
#> 5 P-0000026 UCEC 0.1 80.6 0 Metastasis 11
#> 6 P-0000034 BLCA 0.4 0.79 1 Primary 4
#> 7 P-0000037 HCC 0.9 80.9 0 Metastasis 1
#> 8 P-0000039 PLEMESO 0.4 5.62 1 Primary 5
#> 9 P-0000041 BRCA 0.3 13.6 1 Primary 9
#> 10 P-0000042 PLEMESO 0.4 56.9 1 Primary 0
#> # ℹ 25,358 more rows
#> # ℹ 10 more variables: METASTATIC_SITE <chr>, MET_SITE_COUNT <dbl>,
#> # PRIMARY_SITE <chr>, SUBTYPE_ABBREVIATION <chr>, GENE_PANEL <chr>,
#> # SEX <chr>, TMB_NONSYNONYMOUS <dbl>, FGA <dbl>, AGE_AT_SEQUENCING <dbl>,
#> # RACE <chr>
# Initialize the INCOMMON object (note the outputs to screen)
x = init(genomic_data = MSK_genomic_data, clinical_data = MSK_clinical_data)
#> ── INCOMMON - Inference of copy number and mutation multiplicity in oncology ───
#>
#> ── Genomic data ──
#>
#> ✔ Found 25659 samples, with 224939 mutations in 491 genes
#> ! No read counts found for 1393 mutations in 1393 samples
#> ! Gene name not provided for 1393 mutations
#> ! 201 genes could not be assigned a role (TSG or oncogene)
#>
#> ── Clinical data ──
#>
#> ℹ Provided clinical features:
#>
#> ✔ sample (required for classification)
#> ✔ purity (required for classification)
#> ✔ tumor_type
#> ✔ OS_MONTHS
#> ✔ OS_STATUS
#> ✔ SAMPLE_TYPE
#> ✔ MET_COUNT
#> ✔ METASTATIC_SITE
#> ✔ MET_SITE_COUNT
#> ✔ PRIMARY_SITE
#> ✔ SUBTYPE_ABBREVIATION
#> ✔ GENE_PANEL
#> ✔ SEX
#> ✔ TMB_NONSYNONYMOUS
#> ✔ FGA
#> ✔ AGE_AT_SEQUENCING
#> ✔ RACE
#>
#> ✔ Found 25257 matching samples
#> ✖ Found 513 unmatched samples
# An S3 method can be used to report to screen what is in the object
print(x)
#> ── [ INCOMMON ] 175054 PASS mutations across 24018 samples, with 290 mutant gen
#> ℹ Average sample purity: 0.4
#> ℹ Average sequencing depth: 649
#> # A tibble: 175,054 × 27
#> sample tumor_type purity chr from to ref alt DP NV VAF
#> <chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <int> <int> <dbl>
#> 1 P-0028912 CHOL 0.3 chr17 7.58e6 7.58e6 G A 837 133 0.159
#> 2 P-0028912 CHOL 0.3 chrX 5.32e7 5.32e7 G A 832 85 0.102
#> 3 P-0003698 BLCA 0.2 chr17 7.58e6 7.58e6 C A 437 109 0.249
#> 4 P-0003698 BLCA 0.2 chr5 1.49e8 1.49e8 C T 360 36 0.1
#> 5 P-0003698 BLCA 0.2 chr13 3.29e7 3.29e7 G C 1027 162 0.158
#> 6 P-0003698 BLCA 0.2 chr13 3.29e7 3.29e7 G C 1021 182 0.178
#> 7 P-0003698 BLCA 0.2 chr19 1.11e7 1.11e7 G T 573 98 0.171
#> 8 P-0003698 BLCA 0.2 chr22 4.15e7 4.15e7 G A 416 45 0.108
#> 9 P-0003698 BLCA 0.2 chrX 4.49e7 4.49e7 C T 730 194 0.266
#> 10 P-0003823 BLCA 0.6 chr5 1.30e6 1.30e6 G A 218 138 0.633
#> # ℹ 175,044 more rows
#> # ℹ 16 more variables: gene <chr>, gene_role <chr>, OS_MONTHS <dbl>,
#> # OS_STATUS <dbl>, SAMPLE_TYPE <chr>, MET_COUNT <dbl>, METASTATIC_SITE <chr>,
#> # MET_SITE_COUNT <dbl>, PRIMARY_SITE <chr>, SUBTYPE_ABBREVIATION <chr>,
#> # GENE_PANEL <chr>, SEX <chr>, TMB_NONSYNONYMOUS <dbl>, FGA <dbl>,
#> # AGE_AT_SEQUENCING <dbl>, RACE <chr>
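The matching and unmatched sample counts that init reports in the example can be anticipated by comparing the 'sample' columns of the two tables beforehand. The sketch below uses base R only; sample_overlap is a hypothetical helper and the toy tables are invented, so it illustrates the idea rather than the package's own implementation. The figures in the example output are consistent with unmatched samples being counted on both sides, assuming one clinical row per sample: (25659 - 25257) + (25368 - 25257) = 402 + 111 = 513.

```r
# Hypothetical helper (not an INCOMMON function): split sample names into
# those shared by both tables and those found in only one of them.
sample_overlap <- function(genomic_data, clinical_data) {
  genomic_samples  <- unique(genomic_data$sample)
  clinical_samples <- unique(clinical_data$sample)
  list(
    matched       = intersect(genomic_samples, clinical_samples),
    genomic_only  = setdiff(genomic_samples, clinical_samples),
    clinical_only = setdiff(clinical_samples, genomic_samples)
  )
}

# Toy tables, invented for illustration.
genomic_data <- data.frame(
  sample = c("S1", "S1", "S2", "S3"),
  gene   = c("TP53", "BRCA2", "EP300", "TP53"),
  stringsAsFactors = FALSE
)
clinical_data <- data.frame(
  sample = c("S1", "S2", "S4"),
  purity = c(0.3, 0.2, 0.4),
  stringsAsFactors = FALSE
)

overlap <- sample_overlap(genomic_data, clinical_data)
overlap$matched        # "S1" "S2"
overlap$genomic_only   # "S3"
overlap$clinical_only  # "S4"

# Total unmatched samples, counted on both sides.
n_unmatched <- length(overlap$genomic_only) + length(overlap$clinical_only)
n_unmatched            # 2
```

Running such a check first makes it easier to see whether unmatched samples come from the genomic table, the clinical table, or both.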