1. Introduction
library(tickTack)
require(dplyr)
#> Loading required package: dplyr
#> Error in get(paste0(generic, ".", class), envir = get_method_env()) :
#> object 'type_sum.accel' not found
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
The tickTack package provides tools for timing the occurrence of copy
number alterations (CNA). This vignette introduces the structure of the
input data required by tickTack functions, using the example dataset
tickTack::pcawg_example.
Input Data Structure
The input data consists of three components stored as named elements
in a list: mutations, cna, and metadata. Below is a description of each
component.
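As a minimal sketch of this layout, the list below is assembled by hand from a few values printed in this vignette. The object name toy_input, the placeholder sample identifier, and the use of base data frames instead of tibbles are assumptions made for illustration; tickTack::pcawg_example remains the reference for the real structure.

```r
# Hypothetical toy input mirroring the three-component layout.
# Values are copied from the first rows printed in this vignette; the sample
# identifier is a placeholder because the real one is truncated in the printout.
toy_input <- list(
  mutations = data.frame(
    chr = "chr1", from = 1015594, to = 1015594,
    ref = "C", alt = "C", DP = 99, NV = 16, VAF = 16 / 99,
    sample = "sample_1"
  ),
  cna = data.frame(
    chr = "chr1", from = 10001, to = 790008,
    Major = 2, minor = 2, CCF = 1, total_cn = 4
  ),
  metadata = data.frame(
    sample = "sample_1", purity = 0.406, ploidy = 4.08,
    purity_conf_mad = 0.009, wgd_status = "wgd", wgd_uncertain = FALSE
  )
)

# Check that the three expected components are present.
required <- c("mutations", "cna", "metadata")
missing_components <- setdiff(required, names(toy_input))
stopifnot(length(missing_components) == 0)
```

A check of this kind can be run on any user-built list before passing it to package functions, since a misspelled element name is an easy mistake to make.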
1. Mutations
The mutations component is a tibble containing information about
somatic mutations. Each row represents a mutation, with the following
columns:
- chr: Chromosome where the mutation occurs.
- from and to: Start and end positions of the mutation on the chromosome.
- ref and alt: Reference and alternate alleles.
- DP: Depth of sequencing coverage at the mutation site.
- NV: Number of reads supporting the variant.
- VAF: Variant allele frequency, calculated as NV / DP.
- sample: Unique identifier for the sample.
For example, the first few rows of mutations look like this:
tickTack::pcawg_example$mutations %>% head()
#> # A tibble: 6 × 9
#> chr from to ref alt DP NV VAF sample
#> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 chr1 1015594 1015594 C C 99 16 0.162 00db1b95-8ca3-4cc4-bb46-6…
#> 2 chr1 1866371 1866371 C C 250 67 0.268 00db1b95-8ca3-4cc4-bb46-6…
#> 3 chr1 1921712 1921712 C C 62 20 0.323 00db1b95-8ca3-4cc4-bb46-6…
#> 4 chr1 2049858 2049858 G G 118 15 0.127 00db1b95-8ca3-4cc4-bb46-6…
#> 5 chr1 2357842 2357842 C C 84 12 0.143 00db1b95-8ca3-4cc4-bb46-6…
#> 6 chr1 2771915 2771915 G G 90 27 0.3 00db1b95-8ca3-4cc4-bb46-6…
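The relation VAF = NV / DP can be checked directly on the six rows printed above. The sketch below re-enters those values by hand so that it runs without the package; the names muts, VAF_check, and vaf_matches are hypothetical, and the tolerance assumes the printed VAF values are rounded to three significant digits.

```r
# DP, NV, and VAF for the first six mutations, as printed above.
muts <- data.frame(
  DP  = c(99, 250, 62, 118, 84, 90),
  NV  = c(16, 67, 20, 15, 12, 27),
  VAF = c(0.162, 0.268, 0.323, 0.127, 0.143, 0.3)
)

# Recompute VAF from the read counts.
muts$VAF_check <- muts$NV / muts$DP

# Compare with the printed (rounded) values, allowing for rounding error.
vaf_matches <- abs(muts$VAF_check - muts$VAF) < 5e-4
all(vaf_matches)
```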
2. Copy Number Alterations (CNA)
The cna component is a tibble that describes regions of the genome
with alterations in copy number. Each row represents a genomic segment,
with the following columns:
- chr: Chromosome of the segment.
- from and to: Start and end positions of the segment.
- Major and minor: Major and minor allele copy numbers.
- CCF: Cancer cell fraction for the segment.
- total_cn: Total copy number (sum of Major and minor).
Here is a preview of the cna data:
tickTack::pcawg_example$cna %>% head()
#> # A tibble: 6 × 7
#> chr from to Major minor CCF total_cn
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 chr1 10001 790008 2 2 1 4
#> 2 chr1 790009 13212499 2 2 1 4
#> 3 chr1 13212500 33458785 2 2 1 4
#> 4 chr1 33458786 33564126 2 2 0.194 4
#> 5 chr1 33564127 56834601 2 2 1 4
#> 6 chr1 56834602 121499999 2 2 1 4
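Two simple sanity checks follow from the column definitions: total_cn should equal Major + minor, and consecutive segments on the same chromosome should not overlap. The sketch below re-enters the six printed rows by hand; the names cn_consistent and no_overlap are hypothetical, and the non-overlap check is an assumption about well-formed segment tables rather than a documented tickTack requirement.

```r
# The six chr1 segments printed above.
cna <- data.frame(
  chr      = rep("chr1", 6),
  from     = c(10001, 790009, 13212500, 33458786, 33564127, 56834602),
  to       = c(790008, 13212499, 33458785, 33564126, 56834601, 121499999),
  Major    = rep(2, 6),
  minor    = rep(2, 6),
  CCF      = c(1, 1, 1, 0.194, 1, 1),
  total_cn = rep(4, 6)
)

# total_cn is defined as the sum of the two allele-specific copy numbers.
cn_consistent <- cna$total_cn == cna$Major + cna$minor

# Each segment should end before the next one starts (all rows are on chr1).
no_overlap <- head(cna$to, -1) < tail(cna$from, -1)

all(cn_consistent)
all(no_overlap)
```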
3. Metadata
The metadata component is a tibble containing sample-level
information, with the following columns:
- sample: Unique identifier for the sample.
- purity: Tumor purity, representing the proportion of cancer cells in the sample.
- ploidy: Average ploidy of the sample.
- purity_conf_mad: Confidence interval for the purity estimate.
- wgd_status: Whole genome doubling status (e.g., wgd or no wgd).
- wgd_uncertain: Logical indicating uncertainty in the wgd_status.
An example of the metadata is shown below:
tickTack::pcawg_example$metadata
#> # A tibble: 1 × 6
#> sample purity ploidy purity_conf_mad wgd_status wgd_uncertain
#> <chr> <dbl> <dbl> <dbl> <chr> <lgl>
#> 1 00db1b95-8ca3-4cc4-bb4… 0.406 4.08 0.009 wgd FALSE
Types of Copy Number Alterations (CNAs)
The cna component can include different types of copy number
segments, categorized as follows:
Clonal Simple CNAs
These are straightforward alterations with specific copy number configurations:
- 1:0: Loss of heterozygosity (LOH).
- 2:0: Copy neutral LOH.
- 1:1: Diploid heterozygous (assumed to be the normal reference).
- 2:1: Trisomy.
- 2:2: Tetraploidy.
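The Major:minor labels above can be derived from the Major and minor columns of a segment table. The following is a minimal sketch on invented segments; the column names karyotype and is_simple and the vector simple_karyotypes are hypothetical names chosen for illustration, not part of the tickTack API.

```r
# Invented segments for illustration only, one per row.
segments <- data.frame(
  Major = c(1, 2, 1, 2, 2, 3),
  minor = c(0, 0, 1, 1, 2, 1)
)

# The five simple configurations listed above.
simple_karyotypes <- c("1:0", "2:0", "1:1", "2:1", "2:2")

# Build the "Major:minor" label and flag segments in a simple configuration.
segments$karyotype <- paste(segments$Major, segments$minor, sep = ":")
segments$is_simple <- segments$karyotype %in% simple_karyotypes

segments
```

In this toy table the last segment (3:1) is the only one that falls outside the five listed configurations.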
Reference genome coordinates
tickTack uses a reference genome coordinate system to convert
relative to absolute coordinates, a step required to plot segments
across the whole genome. For instance, if a mutation maps to position
$x$ of chromosome chr2, its absolute coordinate is $x + L_{\mathrm{chr1}}$,
where $L_{\mathrm{chr1}}$ is the length of chr1. The reference system
adopted by tickTack therefore needs to report the length of each
chromosome, plus the boundaries of each centromere.
Note: mapping of mutations onto segments is independent of the reference genome, and it will work as long as both mutations and CNA segments are mapped to the same reference.
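The conversion to absolute coordinates can be sketched in a few lines of base R. The chromosome lengths below are copied from the first three rows of the hg19 table printed in this vignette; the helper to_absolute and the vector chr_offsets are hypothetical names used for illustration and are not tickTack functions.

```r
# Lengths of the first three hg19 chromosomes, as listed in
# tickTack::chr_coordinates_hg19.
chr_lengths <- c(chr1 = 249250621, chr2 = 243199373, chr3 = 198022430)

# The offset of a chromosome is the total length of all preceding ones,
# which corresponds to the `from` column of the coordinate table.
chr_offsets <- cumsum(chr_lengths) - chr_lengths

# Hypothetical helper: shift a within-chromosome position by its offset.
to_absolute <- function(chr, pos, offsets = chr_offsets) {
  stopifnot(all(chr %in% names(offsets)))
  unname(offsets[chr]) + pos
}

# A position on chr1 is unchanged; a position on chr2 is shifted by
# the length of chr1.
to_absolute(c("chr1", "chr2"), c(1015594, 1000000))
```

Storing offsets as a named vector keeps the lookup vectorized, so a whole column of mutation positions can be converted in one call.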
tickTack supports two reference genome coordinate systems:
- hg19 or GRCh37;
- hg38 or GRCh38 (default),
for which two data frames are stored inside the package.
tickTack::chr_coordinates_hg19
#> # A tibble: 24 × 6
#> chr length from to centromerStart centromerEnd
#> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 chr1 249250621 0 249250621 121535434 124535434
#> 2 chr2 243199373 249250621 492449994 341576792 344576792
#> 3 chr3 198022430 492449994 690472424 582954848 585954848
#> 4 chr4 191154276 690472424 881626700 740132541 743132541
#> 5 chr5 180915260 881626700 1062541960 928032341 931032341
#> 6 chr6 171115067 1062541960 1233657027 1121372126 1124372126
#> 7 chr7 159138663 1233657027 1392795690 1291711358 1294711358
#> 8 chr8 146364022 1392795690 1539159712 1436634577 1439634577
#> 9 chr9 141213431 1539159712 1680373143 1586527391 1589527391
#> 10 chr10 135534747 1680373143 1815907890 1719628078 1722628078
#> # ℹ 14 more rows
tickTack::chr_coordinates_GRCh38
#> # A tibble: 24 × 6
#> chr length from to centromerStart centromerEnd
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 chr1 248956422 0 248956422 122026459 124849229
#> 2 chr2 242193529 248956422 491149951 341144567 341144567
#> 3 chr3 198295559 491149951 689445510 581922409 582703370
#> 4 chr4 190214555 689445510 879660065 739157571 739157571
#> 5 chr5 181538259 879660065 1061198324 926145965 929381368
#> 6 chr6 170805979 1061198324 1232004303 1119752212 1119752212
#> 7 chr7 159345973 1232004303 1391350276 1290173956 1293382091
#> 8 chr8 145138636 1391350276 1536488912 1435384020 1435384020
#> 9 chr9 138394717 1536488912 1674883629 1579878547 1579878547
#> 10 chr10 133797422 1674883629 1808681051 1714570311 1716429449
#> # ℹ 14 more rows