#> ✔ Loading CNAqc, 'Copy Number Alteration quality check'.
Note: variant annotation should be carried out with dedicated tools. CNAqc functions should only be used to get a preliminary idea of the most important mutations annotated in a sample.
CNAqc can annotate input mutations and flag potential driver mutations, relying on the VariantAnnotation package and the intOGen database.
This functionality works with a CNAqc object.
# Dataset available with the package
data('example_dataset_CNAqc', package = 'CNAqc')
x = CNAqc::init(
mutations = example_dataset_CNAqc$mutations,
cna = example_dataset_CNAqc$cna,
purity = example_dataset_CNAqc$purity,
ref = 'hg19')
# What we annotate
x %>% Mutations
CNAqc uses databases from Bioconductor to annotate the variants; installation of these databases might take a bit of time because ~1GB of data have to be downloaded. This will happen only the first time the annotation is run.
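Since the download is only needed once, it can help to check which annotation packages are still missing before running the installer. The sketch below assumes an hg19 reference (as in the example dataset) and only reports what is missing; the helper name `missing_annotation_pkgs` is hypothetical and not part of CNAqc.

```r
# Hypothetical helper (not part of CNAqc): list the annotation packages
# for a given reference genome that are not yet installed.
missing_annotation_pkgs <- function(reference_genome) {
  pkgs <- c(
    paste0("TxDb.Hsapiens.UCSC.", reference_genome, ".knownGene"),
    paste0("BSgenome.Hsapiens.UCSC.", reference_genome)
  )
  # requireNamespace() returns FALSE (without error) when a package is absent;
  # note that it loads the namespace when the package is present.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Assumption: reads were mapped against hg19
to_install <- missing_annotation_pkgs("hg19")

if (length(to_install) == 0) {
  message("All annotation packages are already installed.")
} else {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install them (requires BiocManager):
  # BiocManager::install(to_install)
}
```

Keeping the install call commented out makes the check safe to run repeatedly; nothing is downloaded until it is explicitly enabled.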
# Reference against which we mapped the reads
reference_genome <- example_dataset_CNAqc$reference
# All those packages are distributed in Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager", repos="")
# We have to install the corresponding txdb package for transcript annotations
paste0("TxDb.Hsapiens.UCSC.",reference_genome, ".knownGene") %>% BiocManager::install()
# We have to install also the BS database for the sequences (it may take some time)
paste0("BSgenome.Hsapiens.UCSC.",reference_genome) %>% BiocManager::install()
# Then these two packages provide useful utilities to deal with biological databases
"Organism.dplyr" %>% BiocManager::install()
"" %>% BiocManager::install()
CNAqc ships a pre-loaded list of 568 driver genes for 66 cancer types, compiled from the intOGen release dated 2020.02.01.
# The available list:
# - gene id
# - tumour code where the gene has been flagged as driver
# - tumour code longname (e.g. Esophageal cancer)
data("intogen_drivers", package = 'CNAqc')
# Unique driver genes (568)
intogen_drivers$gene %>% unique
# Unique tumour types (66)
intogen_drivers$synopsis %>% unique
# Organised table
We plot the genes that are listed as drivers in more than 20 tumour types.
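library(dplyr)
library(ggplot2)
library(RColorBrewer)
# Build a long colour vector by concatenating all qualitative RColorBrewer palettes
qual_col_pals = brewer.pal.info[brewer.pal.info$category == 'qual',]
col_vector = unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
# Keep genes with more than 20 entries, then draw one bar per gene coloured by tumour type
intogen_drivers %>%
  group_by(gene) %>%
  filter(n() > 20) %>%
  ggplot() +
  geom_bar(aes(x = gene, fill = synopsis)) +
  coord_flip() +
  scale_fill_manual(values = col_vector)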
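The filtering step in the plot keeps only genes with enough entries in the driver table. As a minimal base-R sketch of the same idea, the toy table below stands in for `intogen_drivers`; its rows and the threshold `min_types` are made up for illustration, and it assumes one row per gene and tumour type.

```r
# Toy stand-in for the driver table (illustrative values only)
toy_drivers <- data.frame(
  gene   = c("TP53", "TP53", "TP53", "KRAS", "KRAS", "CTCF"),
  tumour = c("OV", "ESCA", "BRCA", "OV", "COREAD", "BRCA"),
  stringsAsFactors = FALSE
)

# Hypothetical threshold; the plot above uses 20 on the real table
min_types <- 2

# Count rows per gene (one row per gene/tumour pair is assumed)
n_types <- table(toy_drivers$gene)

# Genes listed in more than `min_types` tumour types
recurrent_genes <- names(n_types[n_types > min_types])

# Rows of the table belonging to those genes
recurrent <- toy_drivers[toy_drivers$gene %in% recurrent_genes, ]
print(recurrent)
```

This mirrors `group_by(gene) %>% filter(n() > 20)`: the comparison is strict, so a gene with exactly `min_types` entries is dropped.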
# Run default annotation function
x_new <- annotate_variants(x)
#> # A tibble: 3 × 17
#> chr from to ref alt NV DP VAF FILTER ANNOVAR_FUNCTION
#> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 chr2 179431633 1.79e8 C T 77 117 0.658 PASS exonic
#> 2 chr16 67646006 6.76e7 C T 54 120 0.45 PASS exonic
#> 3 chr17 7577106 7.58e6 G C 78 84 0.929 PASS exonic
#> # ℹ 7 more variables: GENE <chr>, is_driver <lgl>, driver_label <chr>,
#> # type <chr>, karyotype <chr>, segment_id <chr>, cna <chr>
#> # A tibble: 5 × 10
Note that there can be multiple locations and consequences for a
single variant. This happens because we annotate the mutations in a
transcript-agnostic manner; consequently, we report all possible effects
and locations for any transcript (separated by ":").
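When a variant carries several annotations joined by a colon, they can be split back into individual values for downstream filtering. The sketch below uses made-up strings rather than real CNAqc output; the vector name `consequence` is only an assumption about how such a column might be called.

```r
# Toy annotation strings (illustrative, not actual CNAqc output)
consequence <- c("missense:synonymous", "nonsense", "missense:missense:splice")

# Split each string on the literal ':' separator
effects <- strsplit(consequence, ":", fixed = TRUE)

# Number of annotations reported for each variant
n_effects <- lengths(effects)

# Distinct effects per variant, dropping repeats across transcripts
unique_effects <- lapply(effects, unique)

print(n_effects)
print(unique_effects)
```

Using `fixed = TRUE` treats the separator as plain text rather than a regular expression, which is the safer default for delimiter splitting.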
Comparison between known and newly-annotated drivers seems
# New driver mutations
x_new %>% print()
#> ── [ CNAqc ] MySample 12963 mutations in 267 segments (267 clonal, 0 subclonal).
#> ── Clonal CNAs
#> 2:2 [n = 7478, L = 1483 Mb] ■■■■■■■■■■■■■■■■■■■■■■■■■■■ { CTCF_->, POLD1_-> }
#> 4:2 [n = 1893, L = 331 Mb] ■■■■■■■ { MECOM_-> }
#> 3:2 [n = 1625, L = 357 Mb] ■■■■■■ { HSPG2_-> }
#> 2:1 [n = 1563, L = 420 Mb] ■■■■■■
#> 3:0 [n = 312, L = 137 Mb] ■
#> 2:0 [n = 81, L = 39 Mb] { TP53_-> }
#> 16:2 [n = 4, L = 0 Mb]
#> 25:2 [n = 2, L = 1 Mb]
#> 3:1 [n = 2, L = 1 Mb]
#> 106:1 [n = 1, L = 0 Mb]
#> ℹ Sample Purity: 89% ~ Ploidy: 4.
#> ℹ There are 5 annotated driver(s) mapped to clonal CNAs.
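To look at the flagged drivers as a table rather than in the printed summary, one option is to subset the mutation table on the `is_driver` column shown in the output above. The sketch below works on a toy data frame with made-up rows; with a real object, the table would presumably come from the `Mutations` getter used earlier.

```r
# Toy mutation table reusing column names from the annotated output
# (chr, from, GENE, is_driver); the rows themselves are invented.
toy_mutations <- data.frame(
  chr       = c("chr2", "chr16", "chr17", "chr1"),
  from      = c(100, 200, 300, 400),
  GENE      = c("GENE_A", "CTCF", "TP53", NA),
  is_driver = c(FALSE, TRUE, TRUE, NA),
  stringsAsFactors = FALSE
)

# which() drops NA flags, so unannotated mutations are not returned
drivers <- toy_mutations[which(toy_mutations$is_driver), c("chr", "from", "GENE")]
print(drivers)
```

Subsetting with `which()` instead of the bare logical vector avoids rows full of `NA` when some mutations have no driver flag.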
One can restrict the list of candidate genes used for driver detection. To this end, it is convenient to use the available database together with the cancer type of the input sample.
# We pretend to work with OV, ovarian cancer
OV_drivers = intogen_drivers %>% dplyr::filter(tumour == 'OV')
OV_drivers$gene %>% unique()
#> [1] "ACKR3" "ARID1A" "ATRX" "BRAF" "BRCA1" "BRCA2" "CACNA1D"
#> [8] "CDK12" "ELL" "EPHA7" "ERBB2" "ERG" "FAT1" "FGFR1"
#> [15] "FOXL2" "JAK1" "KMT2A" "KMT2C" "KMT2D" "KRAS" "LATS1"
#> [22] "LRP1B" "MYH9" "NF1" "NF2" "NOTCH1" "NRAS" "PIK3CA"
#> [29] "PPP2R1A" "PTEN" "PTPRB" "PTPRT" "RB1" "SETD1B" "STAT5B"
#> [36] "TP53" "UGT2B17"
# Run annotation function
x_new_ov <- annotate_variants(x, drivers = OV_drivers)
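The effect of restricting the driver list can be illustrated without running the annotation itself. The sketch below is a conceptual toy, not the internals of `annotate_variants`: it assumes a small driver table with `gene` and `tumour` columns and a made-up set of mutated genes, and shows which of them remain candidate drivers once only one tumour type is considered.

```r
# Toy driver table (illustrative values only)
toy_intogen <- data.frame(
  gene   = c("TP53", "KRAS", "CTCF", "TP53"),
  tumour = c("OV", "OV", "BRCA", "BRCA"),
  stringsAsFactors = FALSE
)

# Hypothetical genes carrying mutations in the sample
mutated_genes <- c("TP53", "CTCF", "GENE_X")

# Candidate drivers using every tumour type in the table
all_hits <- intersect(mutated_genes, unique(toy_intogen$gene))

# Candidate drivers after restricting to one tumour code
ov_genes <- unique(toy_intogen$gene[toy_intogen$tumour == "OV"])
ov_hits  <- intersect(mutated_genes, ov_genes)

print(all_hits)
print(ov_hits)
```

In this toy case the restricted list drops CTCF, which is only listed for another tumour type, so fewer mutations would be flagged.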
#> # A tibble: 3 × 17
#> # A tibble: 1 × 10
#> ── CNAqc - CNA Quality Check ───────────────────────────────────────────────────
#> ── [ CNAqc ] MySample 12963 mutations in 267 segments (267 clonal, 0 subclonal).
#> Warning: Removed 24 rows containing missing values or values outside the scale range
#> (`geom_segment()`).
#> Warning: Removed 24 rows containing missing values or values outside the scale range
#> (`geom_rect()`).
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_hline()`).