Skip to contents

Note: This article presents advanced topics on sequencing simulation. Refer to vignette("sequencing") for an introduction on the subject.

Disclaimer: RACES/rRACES internally implements the probability distributions using the C++11 random number distribution classes. The standard does not specify their algorithms, and the class implementations are left free for the compiler. Thus, the simulation output depends on the compiler used to compile RACES, and because of that, the results reported in this article may differ from those obtained by the reader.

The sequencing simulation supports sample partition analoguous to FACS. Users can provide a labelling function for the sampled cells by using the simulate_seq() parameter cell_labelling. This function must take in input an object of the class SampledCell and return a string representing the cell label. The function simulate_seq() applies the labelling function to all the sampled cells and partition them according to the labelling function outputs. The resulting sub-samples are named according the format <sample name>_<cell label>.

Let us assume to have built the phylogenetic forest phylo_forest as detailed in vignette("mutations").

The sampled cells can be grouped by epigenetic state as it follows.

# the labelling function parameter has type `SampledCell`
epi_labelling <- function(cell) {

  # epigenetic state is "+", "-", "". However, "+" and "-"
  # are not supported in R dataframe column names, so, we
  # can replace them by "P" and "N", respectively
  if (cell$epistate == "+") {
    return("P")
  }

  if (cell$epistate == "-") {
    return("N")
  }

  return("")
}

seq_results <- simulate_seq(phylo_forest, coverage = 0.5,
                            cell_labelling = epi_labelling)
#>  [█---------------------------------------] 0% [00m:00s] Found 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:01s] Processing chr. 22 [█---------------------------------------] 0% [00m:05s] Processing chr. 22 [█████████████---------------------------] 32% [00m:06s] Processing chr. 22 [███████████████████---------------------] 47% [00m:07s] Processing chr. 22 [████████████████████--------------------] 48% [00m:08s] Processing chr. 22 [████████████████████████████------------] 68% [00m:09s] Processing chr. 22 [███████████████████████████████---------] 77% [00m:10s] Processing chr. 22 [████████████████████████████████--------] 78% [00m:11s] Processing chr. 22 [███████████████████████████████████-----] 85% [00m:12s] Processing chr. 22 [████████████████████████████████████----] 88% [00m:13s] Processing chr. 22 [████████████████████████████████████----] 88% [00m:14s] Processing chr. 22 [████████████████████████████████████----] 89% [00m:15s] Processing chr. 22 [██████████████████████████████████████--] 93% [00m:16s] Processing chr. 22 [████████████████████████████████████████] 99% [00m:17s] Processing chr. 22 [████████████████████████████████████████] 99% [00m:18s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:22s] Reads simulated

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

seq_results$mutations %>% head()
#>   chr  chr_pos ref   alt causes  classes S_1_1_N.occurrences S_1_1_N.coverage
#> 1  22 16051493   G     A        germinal                   1                1
#> 2  22 16052080   G     A        germinal                   0                0
#> 3  22 16052167   A AAAAC        germinal                   1                1
#> 4  22 16053659   A     C        germinal                   0                0
#> 5  22 16053863   G     A        germinal                   0                0
#> 6  22 16054740   A     G        germinal                   1                1
#>   S_1_1_N.VAF S_1_1_P.occurrences S_1_1_P.coverage S_1_1_P.VAF
#> 1           1                   1                1           1
#> 2         NaN                   0                0         NaN
#> 3           1                   0                0         NaN
#> 4         NaN                   1                1           1
#> 5         NaN                   0                0         NaN
#> 6           1                   0                0         NaN
#>   S_1_2_N.occurrences S_1_2_N.coverage S_1_2_N.VAF S_1_2_P.occurrences
#> 1                   0                0         NaN                   0
#> 2                   1                1           1                   0
#> 3                   2                2           1                   0
#> 4                   2                2           1                   1
#> 5                   0                0         NaN                   2
#> 6                   2                2           1                   0
#>   S_1_2_P.coverage S_1_2_P.VAF S_2_1_N.occurrences S_2_1_N.coverage S_2_1_N.VAF
#> 1                0         NaN                   0                0         NaN
#> 2                0         NaN                   0                0         NaN
#> 3                0         NaN                   0                0         NaN
#> 4                1         1.0                   0                0         NaN
#> 5                4         0.5                   0                0         NaN
#> 6                0         NaN                   1                1           1
#>   S_2_1_P.occurrences S_2_1_P.coverage S_2_1_P.VAF S_2_2_N.occurrences
#> 1                   0                0         NaN                   0
#> 2                   0                1           0                   0
#> 3                   1                1           1                   0
#> 4                   0                0         NaN                   0
#> 5                   0                0         NaN                   0
#> 6                   1                1           1                   0
#>   S_2_2_N.coverage S_2_2_N.VAF S_2_2_P.occurrences S_2_2_P.coverage S_2_2_P.VAF
#> 1                0         NaN                   1                1         1.0
#> 2                2           0                   1                2         0.5
#> 3                0         NaN                   4                4         1.0
#> 4                0         NaN                   0                0         NaN
#> 5                0         NaN                   0                1         0.0
#> 6                0         NaN                   1                1         1.0
#>   normal_sample.occurrences normal_sample.coverage normal_sample.VAF
#> 1                         0                      0               NaN
#> 2                         0                      1                 0
#> 3                         0                      0               NaN
#> 4                         0                      0               NaN
#> 5                         0                      0               NaN
#> 6                         0                      0               NaN

The same approach can be easily applied to group sampled cells by mutant name…

# a mutant-name-based labelling function
mutant_labelling <- function(cell) {
  return(cell$mutant)
}

seq_results <- simulate_seq(phylo_forest, coverage = 0.5,
                            cell_labelling = mutant_labelling)
#>  [█---------------------------------------] 0% [00m:00s] Found 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:01s] Reading 22 [█---------------------------------------] 0% [00m:01s] Processing chr. 22 [█---------------------------------------] 0% [00m:05s] Processing chr. 22 [██████████████--------------------------] 33% [00m:06s] Processing chr. 22 [███████████████████████-----------------] 56% [00m:07s] Processing chr. 22 [████████████████████████████████--------] 78% [00m:08s] Processing chr. 22 [███████████████████████████████████-----] 85% [00m:09s] Processing chr. 22 [█████████████████████████████████████---] 92% [00m:10s] Processing chr. 22 [████████████████████████████████████████] 99% [00m:11s] Processing chr. 22 [████████████████████████████████████████] 99% [00m:12s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:14s] Reads simulated

seq_results$mutations %>% head()
#>   chr  chr_pos ref   alt causes  classes S_1_1_A.occurrences S_1_1_A.coverage
#> 1  22 16051493   G     A        germinal                   0                2
#> 2  22 16052080   G     A        germinal                   0                0
#> 3  22 16052167   A AAAAC        germinal                   0                0
#> 4  22 16053659   A     C        germinal                   0                0
#> 5  22 16054740   A     G        germinal                   1                1
#> 6  22 16055070   G     A        germinal                   1                1
#>   S_1_1_A.VAF S_1_2_A.occurrences S_1_2_A.coverage S_1_2_A.VAF
#> 1           0                   0                0         NaN
#> 2         NaN                   0                1           0
#> 3         NaN                   1                1           1
#> 4         NaN                   2                2           1
#> 5           1                   1                1           1
#> 6           1                   0                0         NaN
#>   S_2_1_A.occurrences S_2_1_A.coverage S_2_1_A.VAF S_2_2_B.occurrences
#> 1                   1                1           1                   0
#> 2                   0                1           0                   0
#> 3                   0                0         NaN                   1
#> 4                   1                1           1                   0
#> 5                   0                0         NaN                   2
#> 6                   0                2           0                   0
#>   S_2_2_B.coverage S_2_2_B.VAF normal_sample.occurrences normal_sample.coverage
#> 1                0         NaN                         0                      0
#> 2                0         NaN                         2                      2
#> 3                1           1                         0                      0
#> 4                0         NaN                         1                      1
#> 5                2           1                         0                      0
#> 6                0         NaN                         1                      1
#>   normal_sample.VAF
#> 1               NaN
#> 2                 1
#> 3               NaN
#> 4                 1
#> 5               NaN
#> 6                 1

…, birth time…

# a birth-time-based labelling function
birth_time_labelling <- function(cell) {
  if (cell$birth_time > 421) {
    return("YOUNG")
  }

  if (cell$birth_time > 321) {
    return("MIDDLE_AGED")
  }

  return("OLD")
}

seq_results <- simulate_seq(phylo_forest, coverage = 0.5,
                            cell_labelling = birth_time_labelling)
#>  [█---------------------------------------] 0% [00m:00s] Found 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:01s] Processing chr. 22 [█---------------------------------------] 0% [00m:05s] Processing chr. 22 [███-------------------------------------] 6% [00m:06s] Processing chr. 22 [█████████-------------------------------] 21% [00m:07s] Processing chr. 22 [████████████████████--------------------] 48% [00m:08s] Processing chr. 22 [█████████████████████-------------------] 51% [00m:09s] Processing chr. 22 [█████████████████████████---------------] 61% [00m:10s] Processing chr. 22 [████████████████████████████████--------] 78% [00m:11s] Processing chr. 22 [████████████████████████████████--------] 78% [00m:12s] Processing chr. 22 [█████████████████████████████████-------] 81% [00m:13s] Processing chr. 22 [████████████████████████████████████----] 88% [00m:14s] Processing chr. 22 [████████████████████████████████████----] 88% [00m:15s] Processing chr. 22 [█████████████████████████████████████---] 92% [00m:16s] Processing chr. 22 [████████████████████████████████████████] 99% [00m:17s] Processing chr. 22 [████████████████████████████████████████] 99% [00m:18s] Processing chr. 22 [████████████████████████████████████████] 99% [00m:19s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:24s] Reads simulated

seq_results$mutations %>% head()
#>   chr  chr_pos ref   alt causes  classes S_1_1_MIDDLE_AGED.occurrences
#> 1  22 16051493   G     A        germinal                             0
#> 2  22 16052080   G     A        germinal                             1
#> 3  22 16052167   A AAAAC        germinal                             0
#> 4  22 16053659   A     C        germinal                             2
#> 5  22 16053863   G     A        germinal                             0
#> 6  22 16054740   A     G        germinal                             0
#>   S_1_1_MIDDLE_AGED.coverage S_1_1_MIDDLE_AGED.VAF S_1_1_OLD.occurrences
#> 1                          1                   0.0                     0
#> 2                          2                   0.5                     0
#> 3                          0                   NaN                     1
#> 4                          2                   1.0                     1
#> 5                          0                   NaN                     0
#> 6                          0                   NaN                     0
#>   S_1_1_OLD.coverage S_1_1_OLD.VAF S_1_2_MIDDLE_AGED.occurrences
#> 1                  0           NaN                             0
#> 2                  0           NaN                             0
#> 3                  1             1                             1
#> 4                  1             1                             0
#> 5                  0           NaN                             0
#> 6                  0           NaN                             1
#>   S_1_2_MIDDLE_AGED.coverage S_1_2_MIDDLE_AGED.VAF S_1_2_OLD.occurrences
#> 1                          1                     0                     2
#> 2                          0                   NaN                     2
#> 3                          1                     1                     3
#> 4                          0                   NaN                     0
#> 5                          0                   NaN                     0
#> 6                          1                     1                     0
#>   S_1_2_OLD.coverage S_1_2_OLD.VAF S_2_1_MIDDLE_AGED.occurrences
#> 1                  2             1                             0
#> 2                  2             1                             1
#> 3                  3             1                             0
#> 4                  0           NaN                             0
#> 5                  0           NaN                             1
#> 6                  0           NaN                             1
#>   S_2_1_MIDDLE_AGED.coverage S_2_1_MIDDLE_AGED.VAF S_2_1_OLD.occurrences
#> 1                          0                   NaN                     0
#> 2                          1                   1.0                     0
#> 3                          0                   NaN                     1
#> 4                          0                   NaN                     2
#> 5                          2                   0.5                     0
#> 6                          1                   1.0                     1
#>   S_2_1_OLD.coverage S_2_1_OLD.VAF S_2_1_YOUNG.occurrences S_2_1_YOUNG.coverage
#> 1                  0           NaN                       1                    1
#> 2                  1             0                       0                    0
#> 3                  1             1                       0                    0
#> 4                  2             1                       0                    0
#> 5                  1             0                       0                    1
#> 6                  1             1                       1                    1
#>   S_2_1_YOUNG.VAF S_2_2_MIDDLE_AGED.occurrences S_2_2_MIDDLE_AGED.coverage
#> 1               1                             0                          0
#> 2             NaN                             0                          0
#> 3             NaN                             0                          0
#> 4             NaN                             0                          0
#> 5               0                             0                          0
#> 6               1                             0                          0
#>   S_2_2_MIDDLE_AGED.VAF S_2_2_YOUNG.occurrences S_2_2_YOUNG.coverage
#> 1                   NaN                       0                    1
#> 2                   NaN                       0                    1
#> 3                   NaN                       0                    0
#> 4                   NaN                       2                    2
#> 5                   NaN                       1                    1
#> 6                   NaN                       0                    0
#>   S_2_2_YOUNG.VAF normal_sample.occurrences normal_sample.coverage
#> 1               0                         0                      1
#> 2               0                         0                      1
#> 3             NaN                         1                      1
#> 4               1                         0                      0
#> 5               1                         0                      0
#> 6             NaN                         1                      1
#>   normal_sample.VAF
#> 1                 0
#> 2                 0
#> 3                 1
#> 4               NaN
#> 5               NaN
#> 6                 1

…, mutations…

# collect all the sample passenger indels
passenger_indels <- phylo_forest$get_sampled_cell_mutations() %>%
  filter(class == "passenger", type == "indel")

# get one of the passenger indels
p_indel <- passenger_indels[sample(seq_len(nrow(passenger_indels)), 1), ]

p_indel
#>     cell_id chr  chr_pos allele ref                   alt  type cause     class
#> 485   28139  22 31512507      1   C CTAAATAAATAAATAAATAAA indel   ID2 passenger

# a mutation-based labelling function that discriminates sampled cells
# containing indel `p_indel` from the other sampled cells
mutations_labelling <- function(cell) {
  has_indel <- nrow(cell$mutations %>%
                      filter(chr == p_indel[["chr"]],
                             chr_pos == p_indel[["chr_pos"]],
                             ref == p_indel[["ref"]],
                             alt == p_indel[["alt"]])) > 0

  if (has_indel) {
    return("HAS_MUTATION");
  }

  return("MISSES_MUTATION")
}

seq_results <- simulate_seq(phylo_forest, coverage = 0.5,
                            cell_labelling = mutations_labelling)
#>  [█---------------------------------------] 0% [00m:00s] Found 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:01s] Processing chr. 22 [█---------------------------------------] 0% [00m:05s] Processing chr. 22 [█████████████---------------------------] 31% [00m:06s] Processing chr. 22 [███████████████████---------------------] 46% [00m:07s] Processing chr. 22 [████████████████████--------------------] 48% [00m:08s] Processing chr. 22 [████████████████████████████------------] 69% [00m:09s] Processing chr. 22 [█████████████████████████████████-------] 81% [00m:10s] Processing chr. 22 [████████████████████████████████████----] 88% [00m:11s] Processing chr. 22 [███████████████████████████████████████-] 96% [00m:12s] Processing chr. 22 [████████████████████████████████████████] 99% [00m:13s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:16s] Reads simulated

seq_results$mutations %>% head()
#>   chr  chr_pos ref   alt causes  classes S_1_1_HAS_MUTATION.occurrences
#> 1  22 16051493   G     A        germinal                              0
#> 2  22 16052080   G     A        germinal                              0
#> 3  22 16052167   A AAAAC        germinal                              0
#> 4  22 16053659   A     C        germinal                              0
#> 5  22 16053863   G     A        germinal                              0
#> 6  22 16054740   A     G        germinal                              2
#>   S_1_1_HAS_MUTATION.coverage S_1_1_HAS_MUTATION.VAF
#> 1                           2                      0
#> 2                           0                    NaN
#> 3                           0                    NaN
#> 4                           0                    NaN
#> 5                           0                    NaN
#> 6                           2                      1
#>   S_1_1_MISSES_MUTATION.occurrences S_1_1_MISSES_MUTATION.coverage
#> 1                                 0                              0
#> 2                                 0                              1
#> 3                                 3                              3
#> 4                                 0                              0
#> 5                                 0                              0
#> 6                                 0                              0
#>   S_1_1_MISSES_MUTATION.VAF S_1_2_MISSES_MUTATION.occurrences
#> 1                       NaN                                 0
#> 2                         0                                 0
#> 3                         1                                 0
#> 4                       NaN                                 1
#> 5                       NaN                                 0
#> 6                       NaN                                 0
#>   S_1_2_MISSES_MUTATION.coverage S_1_2_MISSES_MUTATION.VAF
#> 1                              0                       NaN
#> 2                              0                       NaN
#> 3                              0                       NaN
#> 4                              1                         1
#> 5                              1                         0
#> 6                              0                       NaN
#>   S_2_1_MISSES_MUTATION.occurrences S_2_1_MISSES_MUTATION.coverage
#> 1                                 1                              1
#> 2                                 1                              1
#> 3                                 0                              0
#> 4                                 0                              0
#> 5                                 0                              1
#> 6                                 0                              0
#>   S_2_1_MISSES_MUTATION.VAF S_2_2_HAS_MUTATION.occurrences
#> 1                         1                              0
#> 2                         1                              0
#> 3                       NaN                              0
#> 4                       NaN                              0
#> 5                         0                              1
#> 6                       NaN                              1
#>   S_2_2_HAS_MUTATION.coverage S_2_2_HAS_MUTATION.VAF normal_sample.occurrences
#> 1                           0                    NaN                         0
#> 2                           0                    NaN                         0
#> 3                           0                    NaN                         1
#> 4                           0                    NaN                         0
#> 5                           2                    0.5                         0
#> 6                           1                    1.0                         0
#>   normal_sample.coverage normal_sample.VAF
#> 1                      0               NaN
#> 2                      0               NaN
#> 3                      1                 1
#> 4                      0               NaN
#> 5                      0               NaN
#> 6                      0               NaN

… or combination of these properties.