4. Timing Clonal Peaks in a hierarchical fashioin

The _fit_h function in the tickTack package estimates the timing of K clonal peaks in cancer genome sequencing data according to which a clustering can be performed. This vignette describes the functionality of the fit_h function, including input requirements, output, and an example analysis using the tickTack::pcawg_example_2 dataset.

Overview of the `fit_h` Function

The fit_h function uses a hierarchical model to fit clonal peaks in sequencing data considering the grouping tructure of the chromosomes segments. It identifies segments of the genome with specific karyotypes and mutations that meet the input criteria, then estimates the timing of the groups of events and assign each segment to a clock.

Key Parameters

x: a CNAqc object with mutations, cna and metadata
max_attempts: Number of times the variational inference is repeated to avoid local minima.
INIT: Logical flag to pass some initialization values tothe variational inference, default is TRURE.
tolerance: tolerance between two value of subsequent iterations of gradient ascent pn elbo, default is 0.01.
possible_k: A character vector of possible karyotypes, defaulting to c("2:1", "2:2", "2:0").
alpha: Significance level, defaulting to 0.05.
min_mutations_number: Minimum number of mutations required for analysis, defaulting to 2.
n_components: If 0, then the strategy to choose the #components follows the default procedure, otherwise the inference is repeated for K equal up to a maximun of n_components.

Output

The function returns a list containing:

data: The data used to perform the inference after selecting the ones that respect the assumptions to be used in the model.
draws_and_summary: List of 3 for each K the inference is performed with. draws are available both for the clocks and for the weights Summary statistics for the estimated timing of clonal peaks.
log_lik_matrix_list: Summary statistics for the estimated timing of clonal peaks.
elbo_iterations: Summary statistics for the estimated timing of clonal peaks.

If no segments meet the criteria, the function returns NULL.

Analyzing `tickTack::pcawg_example_2`

We will use the tickTack::pcawg_example_2 dataset to demonstrate how to use the fit_h function.

Input Data

The tickTack::pcawg_example_2 dataset contains three components:

mutations: Mutation data.
cna: Copy number alterations (CNA).
metadata: Sample metadata, including tumor purity.

Preview the data:

library(tickTack)

# View example dataset components
mutations <- tickTack::pcawg_example_2$mutations
cna <- tickTack::pcawg_example_2$cna
metadata <- tickTack::pcawg_example_2$metadata

head(mutations)

##    chr    from      to ref alt DP NV       VAF
## 1 chr1 1018754 1018754   C   C 51 24 0.4705882
## 2 chr1 1107556 1107556   C   C 53 17 0.3207547
## 3 chr1 1127192 1127192   C   C 78 13 0.1666667
## 4 chr1 1255263 1255263   C   C 66 23 0.3484848
## 5 chr1 1474126 1474126   C   C 68 17 0.2500000
## 6 chr1 1532289 1532289   G   G 39  6 0.1538462
##                                 sample NR Hugo_Symbol Strand
## 1 3b7810f7-f8ff-4d62-b766-3ba06170194c 27    C1orf159      +
## 2 3b7810f7-f8ff-4d62-b766-3ba06170194c 36      TTLL10      +
## 3 3b7810f7-f8ff-4d62-b766-3ba06170194c 65      TTLL10      +
## 4 3b7810f7-f8ff-4d62-b766-3ba06170194c 43      CPSF3L      +
## 5 3b7810f7-f8ff-4d62-b766-3ba06170194c 51     TMEM240      +
## 6 3b7810f7-f8ff-4d62-b766-3ba06170194c 33    C1orf233      +
##   Variant_Classification Variant_Type Tumor_Seq_Allele2    dbSNP_RS
## 1                 Intron          SNP                 T rs188930619
## 2                5'Flank          SNP                 T        <NA>
## 3                 Intron          SNP                 T        <NA>
## 4                 Intron          SNP                 T rs552646507
## 5                 Intron          SNP                 T rs572658610
## 6                    IGR          SNP                 A rs575347755
##   dbSNP_Val_Status          Matched_Norm_Sample_Barcode     Genome_Change
## 1             <NA> 5d5a6488-dc29-4ab5-8a98-e94fd5be113f g.chr1:1018754C>T
## 2             <NA> 5d5a6488-dc29-4ab5-8a98-e94fd5be113f g.chr1:1107556C>T
## 3             <NA> 5d5a6488-dc29-4ab5-8a98-e94fd5be113f g.chr1:1127192C>T
## 4      byFrequency 5d5a6488-dc29-4ab5-8a98-e94fd5be113f g.chr1:1255263C>T
## 5      byFrequency 5d5a6488-dc29-4ab5-8a98-e94fd5be113f g.chr1:1474126C>T
## 6             <NA> 5d5a6488-dc29-4ab5-8a98-e94fd5be113f g.chr1:1532289G>A
##             ref_context gc_content i_1000genomes_AF i_1000genomes_ID
## 1 cagGGGGTTACGTCAGGGTCT      0.602       0.00019968      rs188930619
## 2 ccagctcaggcatcagtggct      0.692               NA             <NA>
## 3 gtatcccttacgggaaacgaa      0.418               NA             <NA>
## 4 gCAAGGGCAGCGCCTGGCAAG      0.652       0.00099840      rs552646507
## 5 GCATGTCACACGCTTAGCAAC      0.587               NA             <NA>
## 6 ACTTCGGGGCGGGACTGGAAG      0.607               NA             <NA>
##                i_Callers i_GERM1000G i_GERMOVLP i_LOWSUPPORT i_NORMALPANEL
## 1 broad,dkfz,muse,sanger          NA         NA           NA            NA
## 2 broad,dkfz,muse,sanger          NA         NA           NA            NA
## 3 broad,dkfz,muse,sanger          NA         NA           NA            NA
## 4 broad,dkfz,muse,sanger          NA         NA           NA            NA
## 5 broad,dkfz,muse,sanger          NA         NA           NA            NA
## 6 broad,dkfz,muse,sanger          NA         NA           NA            NA
##   i_NumCallers i_OXOGFAIL i_REMAPFAIL i_SEXF  i_VAF i_bPcr i_bSeq i_qual
## 1            4         NA          NA     NA 0.4706     NA     NA   None
## 2            4         NA          NA     NA 0.3208     NA     NA   None
## 3            4         NA          NA     NA 0.1667     NA     NA   None
## 4            4         NA          NA     NA 0.3485     NA     NA   None
## 5            4         NA          NA     NA   0.25     NA     NA   None
## 6            4         NA          NA     NA 0.1538     NA     NA   None
##   i_repeat_masker i_signature_N3 i_signature_R1 i_signature_R2 i_snv_near_indel
## 1            <NA>          FALSE          FALSE          FALSE            FALSE
## 2           MLT1D          FALSE          FALSE          FALSE            FALSE
## 3          MER9a1          FALSE          FALSE          FALSE            FALSE
## 4            <NA>          FALSE          FALSE          FALSE            FALSE
## 5            <NA>          FALSE          FALSE          FALSE            FALSE
## 6            <NA>          FALSE          FALSE          FALSE            FALSE
##   i_model_score i_n_vaf Project_Code Donor_ID
## 1            NA      NA   Kidney-RCC  DO46877
## 2            NA      NA   Kidney-RCC  DO46877
## 3            NA      NA   Kidney-RCC  DO46877
## 4            NA      NA   Kidney-RCC  DO46877
## 5            NA      NA   Kidney-RCC  DO46877
## 6            NA      NA   Kidney-RCC  DO46877

head(cna)

##    chr      from        to Major minor CCF total_cn star level methods_agree
## 1 chr1     10001 121499999     1     1   1        2    3     a             6
## 2 chr1 121500000 128899999     1     1   1        2    2     d             4
## 3 chr1 128900000 247247500     1     1   1        2    3     a             6
## 4 chr1 247247501 249250620     1     1   1        2    2     d             4
## 5 chr2     66017  90499999     1     1   1        2    3     a             6
## 6 chr2  90500000  96799999     1     1   1        2    3     a             6
##   absolute_broad_major_cn absolute_broad_minor_cn absolute_broad_het_error
## 1                       1                       1               0.03993165
## 2                       1                       1               0.05000000
## 3                       1                       1               0.04043537
## 4                      NA                      NA                       NA
## 5                       1                       1               0.04002173
## 6                       1                       1               0.05000000
##   absolute_broad_cov_error aceseq_copy_number aceseq_minor_cn aceseq_major_cn
## 1                0.1174128           2.000000        1.000000        1.000000
## 2                0.1000000           2.000000        1.000000        1.000000
## 3                0.1258711           2.000000        1.000000        1.000000
## 4                       NA           2.136427        1.068214        1.068214
## 5                0.1149430           2.000000        1.000000        1.000000
## 6                0.3640664           2.000000        1.000000        1.000000
##   battenberg_nMaj1_A battenberg_nMin1_A battenberg_frac1_A battenberg_nMaj2_A
## 1                  1                  1                  1                 NA
## 2                  1                  1                  1                 NA
## 3                  1                  1                  1                 NA
## 4                  1                  1                  1                 NA
## 5                  1                  1                  1                 NA
## 6                  1                  1                  1                 NA
##   battenberg_nMin2_A battenberg_frac2_A battenberg_SDfrac_A
## 1                 NA                 NA                  NA
## 2                 NA                 NA                  NA
## 3                 NA                 NA                  NA
## 4                 NA                 NA                  NA
## 5                 NA                 NA                  NA
## 6                 NA                 NA                  NA
##   battenberg_SDfrac_A_BS battenberg_frac1_A_0.025 battenberg_frac1_A_0.975
## 1                     NA                       NA                       NA
## 2                     NA                       NA                       NA
## 3                     NA                       NA                       NA
## 4                     NA                       NA                       NA
## 5                     NA                       NA                       NA
## 6                     NA                       NA                       NA
##   clonehd_copy_number clonehd_minor_cn clonehd_major_cn sclust_nMaj1_A
## 1                   2                1                1              1
## 2                  NA               NA               NA             NA
## 3                   2                1                1              1
## 4                   2                1                1              1
## 5                   2                1                1              1
## 6                   2                1                1              1
##   sclust_nMin1_A sclust_frac1_A sclust_nMaj2_A sclust_nMin2_A sclust_frac2_A
## 1              1              1             NA             NA             NA
## 2             NA             NA             NA             NA             NA
## 3              1              1             NA             NA             NA
## 4              1              1             NA             NA             NA
## 5              1              1             NA             NA             NA
## 6              1              1             NA             NA             NA
##   jabba_copy_number jabba_minor_cn jabba_major_cn
## 1                 2              1              1
## 2                 2              1              1
## 3                 2              1              1
## 4                 2              1              1
## 5                 2              1              1
## 6                 2              1              1

metadata

##                                 sample purity ploidy purity_conf_mad wgd_status
## 1 3b7810f7-f8ff-4d62-b766-3ba06170194c  0.611  1.924           0.002     no_wgd
##   wgd_uncertain
## 1         FALSE

Running the `fit_h` function

We can run the fit_h function on the tickTack::pcawg_example_2 data to infer the timing of clonal peaks

# Extract input data
data <- tickTack::pcawg_example_2
tolerance = 0.1

# Run the fit function
data <- fit_h(
  x = data,
  max_attempts = 2,
  INIT = TRUE,
  tolerance = tolerance
)

Results

The results object that is returned together with the CNAqc input object contains four components: data, draws_and_summary, log_lik_matrix_list and elbo_iterations.

# View summary for a specific K, here K = 2
results <- data$results

Interpreting the output

We can inspect the main output of interest to understand the timing of clonal peaks. results$draws_and_summary contains: - draws the draws from the approximate posterior distribution of the taus and weights; - summary a summary with the main statistics of the approximate posterior distributions; - summarized_results represents the clock assignment, a tibble with the estimate of taus for each segment with a copy number event that has been included in the hierarchical inference

# View summary for a specific K, here K = 2
results$draws_and_summary[[2]]$summary

## # A tibble: 22 × 7
##    variable    mean   median      sd      mad         q5     q95
##    <chr>      <dbl>    <dbl>   <dbl>    <dbl>      <dbl>   <dbl>
##  1 tau[1]   0.124   0.119    0.0367  0.0376   0.0696     0.190  
##  2 tau[2]   0.957   0.959    0.0150  0.0135   0.928      0.977  
##  3 w[1,1]   0.887   0.959    0.172   0.0550   0.505      0.999  
##  4 w[2,1]   0.893   0.964    0.169   0.0478   0.519      0.999  
##  5 w[3,1]   0.980   0.990    0.0284  0.00972  0.928      0.999  
##  6 w[4,1]   0.997   0.997    0.00198 0.00143  0.993      0.999  
##  7 w[5,1]   0.0326  0.000683 0.122   0.00100  0.00000255 0.150  
##  8 w[6,1]   0.00120 0.000430 0.00264 0.000499 0.0000405  0.00497
##  9 w[7,1]   0.0260  0.00127  0.0972  0.00183  0.0000118  0.106  
## 10 w[8,1]   0.974   0.990    0.0465  0.0113   0.898      0.999  
## # ℹ 12 more rows

# View detailed summarized results for a specific K, here K = 2
results$draws_and_summary[[2]]$summarized_results

## # A tibble: 10 × 8
##    segment_original_indx segment_name      segment_id karyotype chr   clock_mean
##                    <int> <chr>                  <dbl> <chr>     <chr>      <dbl>
##  1                    38 chr5_75225353_75…          1 2:1       chr5       0.119
##  2                    40 chr5_76857368_76…          2 2:1       chr5       0.119
##  3                    42 chr5_77822720_82…          3 2:1       chr5       0.119
##  4                    44 chr5_83468190_18…          4 2:1       chr5       0.119
##  5                    46 chr6_242501_4335…          5 2:0       chr6       0.959
##  6                    48 chr6_4347801_290…          6 2:0       chr6       0.959
##  7                    50 chr6_29161839_38…          7 2:0       chr6       0.959
##  8                    54 chr7_65454_99371…          8 2:1       chr7       0.119
##  9                    56 chr7_9946564_579…          9 2:1       chr7       0.119
## 10                    58 chr7_61700000_15…         10 2:1       chr7       0.119
## # ℹ 2 more variables: clock_low <dbl>, clock_high <dbl>

Obtain the best K with model_selection_h

W e can run the model_selection_h function to obtain the scores for each inference performed with a different K and take the one with best ICL score if the BIC score prefer 2 components instead of 1, otherwise choose 1 as best K. The function takes as input the results and n_components and outputs the best_K and the corresponding best_fit together with the model_selection_tibble and the entropy_list used to evaluate the ICL score.

results_model_selection <- tickTack::model_selection_h(results, n_components = 0)

## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.

best_K <- results_model_selection$best_K
model_selection_tibble <- results_model_selection$model_selection_tibble
entropy <- results_model_selection$entropy_list

Visulizing the output

The results can be viewed is genome-wise perspective using the tickTack::plot_timing_h function.

tickTack::plot_timing_h(results, 2)

Visualize distributions of draws from the approximate posterior

The approximate posterior distributions can be viewed using the tickTack::plot_posterior_clocks_h and tickTack::plot_posterior_weights_h functions, that internally use functions from Bayesplot.

posterior_clocks <- tickTack::plot_posterior_clocks_h(results, 2)

## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.

posterior_weights <- tickTack::plot_posterior_weights_h(results, 2)

Visualize the behavior of the ELBO during the inference

K = nrow(results_model_selection$model_selection_tibble)

p_elbo <- list()
for (i in 1:K){
  p_elbo[[i]] <- tickTack::plot_elbo_h(results$elbo_iterations[[i]]) + ggplot2::ggtitle(paste0("K = ", i))
}
p_elbo <- gridExtra::grid.arrange(grobs = p_elbo, ncol = 2)  #add global title

p_elbo

## TableGrob (3 x 2) "arrange": 5 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (2-2,1-1) arrange gtable[layout]
## 4 4 (2-2,2-2) arrange gtable[layout]
## 5 5 (3-3,1-1) arrange gtable[layout]

Visualize all the inference results for each K

plot_model_selection_inference <- list()
for (i in 1:K){
  plot_model_selection_inference[[i]] <- tickTack::plot_timing_h(results, i) + ggplot2::ggtitle(paste0("K = ", i))
}
plot_model_selection_inference <- gridExtra::grid.arrange(grobs = plot_model_selection_inference, ncol = 2) #add global title

plot_model_selection_inference

## TableGrob (3 x 2) "arrange": 5 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (2-2,1-1) arrange gtable[layout]
## 4 4 (2-2,2-2) arrange gtable[layout]
## 5 5 (3-3,1-1) arrange gtable[layout]

Overview of the fit_h Function