Simulating Mutations • rRACES

Disclaimer: RACES/rRACES internally implements the probability distributions using the C++11 random number distribution classes. The standard does not specify their algorithms, and the class implementations are left free for the compiler. Thus, the simulation output depends on the compiler used to compile RACES, and because of that, the results reported in this article may differ from those obtained by the reader.

rRACES/RACES can simulate genomic mutations on the cells represented in a SamplesForest according to the specified SBS and indel mutational signatures (see (Alexandrov et al. 2020)). This process is performed by the class MutationEngine which also takes into account the mutation rate of the simulated species and gives the chance to dynamically change the signatures.

Setting Up Mutation Engine

The creation of an object of the type MutationEngine requires to download the reference sequence and the signature files and to build the corresponding two signature indices. The function MutationEngine() performs all these steps in a single call.

This function is quite flexible and it allows to customize the mutation engine in many ways. However, the vast majority of rRACES users aim a standard setup for instance involving the human genome.

Because of this, rRACES provides some predefined setups. Their list can be obtained by invoking the function get_mutation_engine_codes().

library(rRACES)

get_mutation_engine_codes()
#>     name           description
#> 1 GRCh37 Homo sapiens (GRCh37)
#> 2 GRCh38 Homo sapiens (GRCh38)
#> 3   demo  A demostative set-up

Let us build a MutationEngine by using the "demo" setup.

# building a mutation engine by using the "demo" setup
m_engine <- MutationEngine(setup_code = "demo")
#> Downloading reference genome...
#> Reference genome downloaded
#> Decompressing reference file...done
#> Downloading SBS file...
#> SBS file downloaded
#> Downloading indel file...
#> indel file downloaded
#> Downloading driver mutation file...
#> Driver mutation file downloaded
#> Downloading passenger CNAs file...
#> Passenger CNAs file downloaded
#> Downloading germline mutations...
#> Germline mutations downloaded
#> Building context index...
#>  [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [██████████------------------------------] 24% [00m:01s] Processing chr. 22 [████████████████████--------------------] 49% [00m:02s] Processing chr. 22 [██████████████████████████████----------] 73% [00m:03s] Processing chr. 22 [████████████████████████████████████████] 98% [00m:04s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:04s] Context index built
#>  [█---------------------------------------] 0% [00m:00s] Saving context index [████████████████████████████████████████] 100% [00m:00s] Context index saved
#> done
#> Building repeated sequence index...
#>  [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█---------------------------------------] 0% [00m:01s] Processing chr. 22 [█---------------------------------------] 0% [00m:05s] Processing chr. 22 [█---------------------------------------] 0% [00m:07s] Processing chr. 22 [█---------------------------------------] 0% [00m:08s] Processing chr. 22 [█---------------------------------------] 0% [00m:09s] Processing chr. 22 [█---------------------------------------] 0% [00m:11s] Processing chr. 22 [█---------------------------------------] 0% [00m:17s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:17s] RS index built
#>  [█---------------------------------------] 0% [00m:00s] Saving RS index [█---------------------------------------] 0% [00m:00s] Saving RS index [█---------------------------------------] 2% [00m:02s] Saving RS index [███████████████████████-----------------] 57% [00m:03s] Saving RS indexdone
#>  [████████████████████████████████████████] 100% [00m:03s] RS index saved
#>  [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#>  [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved

m_engine
#> MutationEngine
#>  Passenger rates
#> 
#>  Driver mutations
#> 
#>  Timed Exposure
#>    SBS Timed Exposures
#> 
#>    indel Timed Exposures

The above call creates the directory demo, downloads all the data required by the mutation engine, and builds it.

dir.exists("demo")
#> [1] TRUE

list.files("demo")
#>  [1] "context_index_100.cif"     "drivers.txt"              
#>  [3] "germline_data"             "germline_data_demo.tar.gz"
#>  [5] "indel_signatures.txt"      "passenger_CNAs.txt"       
#>  [7] "reference.fasta"           "rs_index_50_500000.rsif"  
#>  [9] "SBS_signatures.txt"        "sources.csv"

The execution of the function MutationEngine() may takes some time, but it is meant to be performed one for all and, as long as the user does not need to change the reference genome or the signature files, it is no more required. In this spirit, any call to the function MutationEngine checks whether all the building process sub-tasks are really needed and, if this is not the case, it avoids to performs them.

# building a mutation engine by using the "demo" set-up configuration
m_engine <- MutationEngine(setup_code = "demo")
#>  [█---------------------------------------] 0% [00m:00s] Loading context index [████████████████████████████████████████] 100% [00m:00s] Context index loaded
#>  [█---------------------------------------] 0% [00m:00s] Loading RS index [██████████------------------------------] 24% [00m:01s] Loading RS index [█████████████████████-------------------] 50% [00m:02s] Loading RS index [█████████████████████████████-----------] 72% [00m:03s] Loading RS index [███████████████████████████████████████-] 95% [00m:04s] Loading RS index [████████████████████████████████████████] 100% [00m:04s] RS index loaded
#>  [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded

Mutant Genetic Characterization

Once the mutation engine has been built, we can define mutant genotype and declare species mutation rates.

Let us consider simulation peformed in Section “Two populations with epigenetic state” of vignette("sampling"). It involves the two mutants A and “B”. Both of them has two possible epigenetic states, + and -, leading to the four the species A+, A-, B+, and B-, respectively. Each of these species has a passenger mutation rate that must be declated to the mutation engine before labeling the samples forest. Let 1e-9, 3e-8, 8e-7, and 5e-8 be the passenger SNV rates for the species A+, A-, B+, and B-, respectively. Let the indel rates be 0 for all the species, but A+ which has indel rate 1e-8. Furthermore, let 0, 1e-11, 0, and 0 be the passenger CNA rates of the same species, respectively.

The two mutants may also be genetically characterized by some driver mutations. The driver mutations associated with each of the mutants must occur in any cell belonging to the mutant itself. Hence, they must be declared to the mutant engine before the labeling.

The method MutationEngine$add_mutant() takes care of all these declarations.

For the sake of example, let us assume that A is characterized by one driver mutation on chromosome 22, while B has three driver mutations on the same chromosome.

The genetic specification of the mutant A can be declared as it follows.

# add the mutant "A" characterized by one driver SNV on the allele 1 of
# chromosome 22, one indel deletion on the same chromosome, two CNAs (a
# deletion on the allele 1 of chromosome 22 and an amplification on a random
# allele of the same chromosome) and, finally, a whole genome doubling event
# (WGD). The mutant has two epigenetic states and its species "A+" and "A-"
# have passenger SNV rates 1e-9 and 3e-8, respectively, passenger indel rates
# 1e-10 and 0, respectively, and passenger CNA rates 0 and 1e-11, respectively.
m_engine$add_mutant(mutant_name = "A",
                    passenger_rates = list("+" = c(SNV = 1e-9, indel = 1e-8),
                                           "-" = c(SNV = 3e-8, CNA = 1e-11)),
                    drivers = list(SNV("22", 10510210, "C", allele = 1),
                                   Mutation("22", 15220157, "GTTTTTTTT", "G"),
                                   CNA(type = "A", chr = "22",
                                       chr_pos = 10303470, len = 200000),
                                   CNA("D", "22", 5010000, 200000,
                                       allele = 1),
                                   WGD))
#>  [█---------------------------------------] 0% [00m:00s] Retrieving "A" SNVs [█---------------------------------------] 0% [00m:00s] Found 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:01s] Reading 22 [█---------------------------------------] 0% [00m:02s] Reading 22 [████████████████████████████████████████] 100% [00m:02s] "A" SNVs retrieved

m_engine
#> MutationEngine
#>  Passenger rates
#>    "A+": {SNV: 1e-09, indel: 1e-08}
#>    "A-": {SNV: 3e-08, CNA: 1e-11}
#> 
#>  Driver mutations
#>    "A":
#>        chr22(10510210)[N>C](allele: 1)
#>        chr22(15220157)[GTTTTTTTT>G](allele: random)
#>        CNA("A",chr22(10303470), len: 200000)
#>        CNA("D",chr22(5010000), len: 200000, allele: 1)
#>        Whole genome duplication
#> 
#>  Timed Exposure
#>    SBS Timed Exposures
#> 
#>    indel Timed Exposures

In the above code, the driver CNAs, SNVs, and indels are declared by calling the functions CNA(), SNV(), and Mutation(), respectively. These function allow to specify the allele in which the mutation must lay. The mutant A is also characterized by a whole genome doubling (WGD): a genomic event that simultaneously duplicates all the chromosome alleles.

The driver mutations are applied according to their specification order. For instance, the following two pieces of code specify two different genomic characterizations for the same mutant E.

m_engine$add_mutant("E", passenger_rates,
                    drivers = list(SNV("22", 10510210, "C", allele = 1),
                                   WGD))

m_engine$add_mutant("E", passenger_rates,
                    drivers = list(WGD,
                                   SNV("22", 10510210, "C", allele = 1)))

The former snippet places an SNV and, afterwards, duplicates all the alleles producing a genome in which two alleles contain the placed SNV. Instead, the latter snippet requires a whole genome doubling event and, then, places the SNV yielding a single occurrence of the SNV in the final genome.

As far as the SNVs and indels concern, rRACES provides the users with a more compact and, sometimes, convenient approach. The mutation engine stores a list of known driver mutations and labels each of them by a code which can be used during the mutant declaration. The corresponding data frame can be obtained by using the method MutationEngine$get_known_drivers().

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

m_engine$get_known_drivers() %>% filter(chr == "22")
#>   chr     from       to ref alt mutation_type tumour_type driver_gene
#> 1  22 30057302 30057302   C   T           SNV          OV         NF2
#> 2  22 30057302 30057302   C   T           SNV          OV         NF2
#>   driver_code
#> 1   NF2 R221*
#> 2   NF2 R262*

The code of the known driver mutations can be used in place of the full specification as follows.

# add the mutant "B" characterized by two driver SNVs on chromosome 22 (no
# CNA) and two epigenetic states. The first SNV is "NF2 R221*" and it must 
# lay in the allele 1. The remaining SNV is specified by using the SNV function
# as done above. The species "B+" and "B-" have passenger SNV rates 8e-8
# and 5e-8, respectively, and CNA rates 0 for both species.
m_engine$add_mutant("B", list("+" = c(SNV = 8e-8), "-" = c(SNV = 5e-8)),
                    list(list("NF2 R221*", allele = 1),   # the first SNV
                         SNV("22", 12028576, "G")))       # the second SNV
#>  [█---------------------------------------] 0% [00m:00s] Retrieving "B" SNVs [█---------------------------------------] 0% [00m:00s] Found 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:01s] Reading 22 [█---------------------------------------] 0% [00m:02s] Reading 22 [████████████████████████████████████████] 100% [00m:02s] "B" SNVs retrieved

m_engine
#> MutationEngine
#>  Passenger rates
#>    "A+": {SNV: 1e-09, indel: 1e-08}
#>    "A-": {SNV: 3e-08, CNA: 1e-11}
#>    "B+": {SNV: 8e-08}
#>    "B-": {SNV: 5e-08}
#> 
#>  Driver mutations
#>    "A":
#>        chr22(10510210)[N>C](allele: 1)
#>        chr22(15220157)[GTTTTTTTT>G](allele: random)
#>        CNA("A",chr22(10303470), len: 200000)
#>        CNA("D",chr22(5010000), len: 200000, allele: 1)
#>        Whole genome duplication
#>    "B":
#>        chr22(30057302)[C>T](allele: 1)
#>        chr22(12028576)[N>G](allele: random)
#> 
#>  Timed Exposure
#>    SBS Timed Exposures
#> 
#>    indel Timed Exposures

Mutational Exposures

The probability for a mutation to occur depends on both its genomic and environmental context.

A signature is a mutation probability distribution over mutation contexts or mutation structure. SBS (single base substitution) signatures provide for any genomic context (i.e., a triplet of bases) the probability for a SNV to occur on that context. On the contrary, ID (indel) signatures associates the probability of an indel to its length and structure (see (Alexandrov et al. 2020)).

The signature depends on the environmental context and, because of that, more than one signature may be active at the same time with different probabilities. An mutational exposure (or exposure) is a discrete probability distribution among signatures.

In order to simulate passeger mutations of a given type, we need to specify a default exposure for that type. This can be achieved as it follows.

# add SNV and indel default exposures. This will be used from simulated time 0
# up to the successive exposure change.
m_engine$add_exposure(coefficients = c(SBS13 = 0.2, SBS1 = 0.8))
m_engine$add_exposure(c(ID2 = 0.6, ID13 = 0.2, ID21 = 0.2))

Further exposures can also be defined by specifying an activation time for each of them, i.e., the new exposures will be used from the specified time up to the next exposure change.

# add a new SNV exposure that will be used from simulated
# time 100 up to the successive exposure change.
m_engine$add_exposure(time = 100, c(SBS5 = 0.3, SBS2 = 0.2, SBS3 = 0.5))

m_engine
#> MutationEngine
#>  Passenger rates
#>    "A+": {SNV: 1e-09, indel: 1e-08}
#>    "A-": {SNV: 3e-08, CNA: 1e-11}
#>    "B+": {SNV: 8e-08}
#>    "B-": {SNV: 5e-08}
#> 
#>  Driver mutations
#>    "A":
#>        chr22(10510210)[N>C](allele: 1)
#>        chr22(15220157)[GTTTTTTTT>G](allele: random)
#>        CNA("A",chr22(10303470), len: 200000)
#>        CNA("D",chr22(5010000), len: 200000, allele: 1)
#>        Whole genome duplication
#>    "B":
#>        chr22(30057302)[C>T](allele: 1)
#>        chr22(12028576)[N>G](allele: random)
#> 
#>  Timed Exposure
#>    SBS Timed Exposures
#>      [0, 100[: {"SBS1": 0.8, "SBS13": 0.2}
#>      [100, ∞[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#> 
#>    indel Timed Exposures
#>      [0, ∞[: {"ID13": 0.2, "ID2": 0.6, "ID21": 0.2}

The default exposures and the simultaneous changes in SNV and ID exposures can be specified by a sigle call to the function MutationEngine$add_exposure() as it follows.

m_engine$add_exposure(time = 120, c(SBS5 = 0.3, SBS2 = 0.2, SBS3 = 0.5,
                                    ID1 = 0.8, ID9 = 0.2))

m_engine
#> MutationEngine
#>  Passenger rates
#>    "A+": {SNV: 1e-09, indel: 1e-08}
#>    "A-": {SNV: 3e-08, CNA: 1e-11}
#>    "B+": {SNV: 8e-08}
#>    "B-": {SNV: 5e-08}
#> 
#>  Driver mutations
#>    "A":
#>        chr22(10510210)[N>C](allele: 1)
#>        chr22(15220157)[GTTTTTTTT>G](allele: random)
#>        CNA("A",chr22(10303470), len: 200000)
#>        CNA("D",chr22(5010000), len: 200000, allele: 1)
#>        Whole genome duplication
#>    "B":
#>        chr22(30057302)[C>T](allele: 1)
#>        chr22(12028576)[N>G](allele: random)
#> 
#>  Timed Exposure
#>    SBS Timed Exposures
#>      [0, 100[: {"SBS1": 0.8, "SBS13": 0.2}
#>      [100, 120[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#>      [120, ∞[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#> 
#>    indel Timed Exposures
#>      [0, 120[: {"ID13": 0.2, "ID2": 0.6, "ID21": 0.2}
#>      [120, ∞[: {"ID1": 0.8, "ID9": 0.2}

Passenger CNAs and Tumour Type

The passenger CNAs applied during the simulation depend on tumour type. The type of tumour can be specified when building the mutation engine by using the parameters tumour_type or tumour_study. For instance, if the passenger CNAs file used to build the mutation engine contains some of the data identified in breast carcinoma in the UK, the specific CNA can be applied during the simulation as it follows.

m_engine <- MutationEngine(setup_code = "demo",
                           tumour_type = "BRCA",
                           tumour_study = "UK")

The complete list of available tumour types and studies in a set-up can be obtained by using the function get_available_tumours_in().

get_available_tumours_in(setup_code = "demo") %>% head()
#>   type study
#> 1 BLCA    US
#> 2 BOCA    UK
#> 3 BRCA    EU
#> 4 BRCA    UK
#> 5 BRCA    US
#> 6 BTCA    SG

Germline Mutations

rRACES allows users to apply the germline mutations of one of the subjects available in the germline data provided in building the mutation engine. This feature will enable users to simulate a specific cancer type’s evolution on an individual with the desired gender and ethnicity.

The available subjects, together with their sex and ethnicity, can be obtained by using the method [MutationEngine$get_germline_subjects()].

subjects <- m_engine$get_germline_subjects()

subjects
#>    sample pop super_pop gender
#> 1 NA18941 JPT       EAS female
#> 2 NA20513 TSI       EUR   male

The column sample contains the names of the available subjects. The columns pop and super_pop report the subjects’ population and super-population codes. The last column, gender, includes the subject gender.

The method MutationEngine$get_population_descriptions() clarifies the meaning of the codes reported in the pop columns.

m_engine$get_population_descriptions()
#>    code          description
#> 1   CHB          Han Chinese
#> 2   JPT             Japanese
#> 3   CHS Southern Han Chinese
#> 4   CDX          Dai Chinese
#> 5   KHV      Kinh Vietnamese
#> 6   CHD       Denver Chinese
#> 7   CEU                 CEPH
#> 8   TSI               Tuscan
#> 9   GBR              British
#> 10  FIN              Finnish
#> 11  IBS              Spanish
#> 12  YRI               Yoruba
#> 13  LWK                Luhya
#> 14  GWD              Gambian
#> 15  MSL                Mende
#> 16  ESN                 Esan
#> 17  ASW  African-American SW
#> 18  ACB    African-Caribbean
#> 19  MXL     Mexican-American
#> 20  PUR         Puerto Rican
#> 21  CLM            Colombian
#> 22  PEL             Peruvian
#> 23  GIH             Gujarati
#> 24  PJL              Punjabi
#> 25  BEB              Bengali
#> 26  STU           Sri Lankan
#> 27  ITU               Indian
#>                                                      long.description
#> 1                                       Han Chinese in Beijing, China
#> 2                                            Japanese in Tokyo, Japan
#> 3                                                   Han Chinese South
#> 4                                 Chinese Dai in Xishuangbanna, China
#> 5                                   Kinh in Ho Chi Minh City, Vietnam
#> 6                          Chinese in Denver, Colorado (pilot 3 only)
#> 7  Utah residents (CEPH) with Northern and Western European ancestry 
#> 8                                                  Toscani in Italia 
#> 9                                    British in England and Scotland 
#> 10                                                Finnish in Finland 
#> 11                                      Iberian populations in Spain 
#> 12                                          Yoruba in Ibadan, Nigeria
#> 13                                             Luhya in Webuye, Kenya
#> 14                           Gambian in Western Division, The Gambia 
#> 15                                              Mende in Sierra Leone
#> 16                                                    Esan in Nigeria
#> 17                                  African Ancestry in Southwest US 
#> 18                                      African Caribbean in Barbados
#> 19                        Mexican Ancestry in Los Angeles, California
#> 20                                        Puerto Rican in Puerto Rico
#> 21                                    Colombian in Medellin, Colombia
#> 22                                             Peruvian in Lima, Peru
#> 23                                     Gujarati Indian in Houston, TX
#> 24                                        Punjabi in Lahore, Pakistan
#> 25                                              Bengali in Bangladesh
#> 26                                         Sri Lankan Tamil in the UK
#> 27                                            Indian Telugu in the UK

The method MutationEngine$get_active_germline() returns the the active germline subject.

m_engine$get_active_germline()
#>    sample pop super_pop gender
#> 1 NA18941 JPT       EAS female

Users can change the germline subject using the method MutationEngine$set_germline_subject().

When a subject is selected for the first time, rRACES builds a binary representation of the subject genome, saving it for future use. This step may take a few minutes. However, all the successive selections of the same subject directly load the binary file.

m_engine$set_germline_subject(subjects[2, "sample"])
#>  [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#>  [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved

m_engine$get_active_germline()
#>    sample pop super_pop gender
#> 1 NA20513 TSI       EUR   male

Building Phylogenetic Forests

The configurated mutation engine can be used to label each node in a samples forest by mutations.

Since the mutation engine has be configurated to deal with the simulation peformed in Section “Two populations with epigenetic state” of vignette("sampling"), we can load its samples forest from the file "samples_forest.sff" in saved that article.

samples_forest <- load_samples_forest("samples_forest.sff")

# place mutations on the sample forest assuming 1000 pre-neoplastic SNVs and
# 500 indels
phylo_forest <- m_engine$place_mutations(samples_forest, 1000, 500)
#>  [█---------------------------------------] 0% [00m:00s] Placing mutations [█████-----------------------------------] 12% [00m:01s] Placing mutations [██████████------------------------------] 24% [00m:02s] Placing mutations [███████████████-------------------------] 37% [00m:03s] Placing mutations [█████████████████████-------------------] 50% [00m:04s] Placing mutations [██████████████████████████--------------] 63% [00m:05s] Placing mutations [███████████████████████████████---------] 75% [00m:06s] Placing mutations [███████████████████████████████████-----] 86% [00m:07s] Placing mutations [██████████████████████████████████████--] 93% [00m:08s] Placing mutations [████████████████████████████████████████] 100% [00m:09s] Mutations placed

phylo_forest
#> PhylogeneticForest
#>   # of trees: 1
#>   # of nodes: 18845
#>   # of leaves: 5396
#>   samples: {"S_1_1", "S_1_2", "S_2_1", "S_2_2"}
#> 
#>   # of emerged SNVs and indels: 113353
#>   # of emerged CNAs: 39

The phylogenetic forest stores all the mutations labeling the sampled cells which are represented by the forest leaves. Users can retrieve such data by using the methods PhylogeneticForest$get_sampled_cell_mutations() and PhylogeneticForest$get_sampled_cell_CNAs().

library(dplyr)

# select the first mutations among all the mutations occuring in
# the genomes of the sampled cells
phylo_forest$get_sampled_cell_mutations() %>% head()
#>   cell_id chr  chr_pos allele       ref     alt  type cause          class
#> 1    2790  22 15220157      0 GTTTTTTTT       G indel     A         driver
#> 2    2790  22 16079553      0         G       A   SNV  SBS1 pre-neoplastic
#> 3    2790  22 16135375      0         C CAAAAAA indel   ID1 pre-neoplastic
#> 4    2790  22 16149260      0         G       A   SNV  SBS1 pre-neoplastic
#> 5    2790  22 16200607      0         G       A   SNV  SBS1 pre-neoplastic
#> 6    2790  22 16251672      0         C  CTTTTT indel   ID1 pre-neoplastic

# select the first CNAs among all the mutations occuring in
# the genomes of the sampled cells
phylo_forest$get_sampled_cell_CNAs() %>% head()
#>   cell_id type chr    begin      end allele src.allele  class
#> 1    2790    A  22 10303470 10503469      2          1 driver
#> 2    2790    D  22  5010000  5209999      1         NA driver
#> 3    5000    A  22 10303470 10503469      2          1 driver
#> 4    5000    D  22  5010000  5209999      1         NA driver
#> 5    8774    A  22 10303470 10503469      2          1 driver
#> 6    8774    D  22  5010000  5209999      1         NA driver

# get the sampled cells
sampled_cells <- phylo_forest$get_nodes() %>%
  filter(!is.na(.data$sample))

# show the first of them
sampled_cells %>% head()
#>   cell_id ancestor mutant epistate sample birth_time
#> 1    2790     2694      A        -  S_1_1   127.8854
#> 2    5000     4671      A        -  S_1_2   152.6275
#> 3    8774     8274      A        -  S_1_1   180.3081
#> 4    9651     8069      A        -  S_1_1   185.6496
#> 5   11533     7757      A        -  S_1_1   195.7661
#> 6   12905    12464      A        -  S_1_1   202.1142

# get the identifier of the 3rd cell in `sampled_cells`
cell_id <- sampled_cells[3, 1]

# get the SNVs and the indels of the 3rd cell in `sampled_cells`
phylo_forest$get_sampled_cell_mutations(cell_id) %>% head()
#>   cell_id chr  chr_pos allele       ref     alt  type cause          class
#> 1    8774  22 15220157      0 GTTTTTTTT       G indel     A         driver
#> 2    8774  22 16079553      0         G       A   SNV  SBS1 pre-neoplastic
#> 3    8774  22 16135375      0         C CAAAAAA indel   ID1 pre-neoplastic
#> 4    8774  22 16149260      0         G       A   SNV  SBS1 pre-neoplastic
#> 5    8774  22 16200607      0         G       A   SNV  SBS1 pre-neoplastic
#> 6    8774  22 16251672      0         C  CTTTTT indel   ID1 pre-neoplastic

# get the CNAs of the 3rd cell in `sampled_cells`
phylo_forest$get_sampled_cell_CNAs(cell_id) %>% head()
#>   cell_id type chr    begin      end allele src.allele  class
#> 1    8774    A  22 10303470 10503469      2          1 driver
#> 2    8774    D  22  5010000  5209999      1         NA driver

The method PhylogeneticForest$get_samples_info() is analogous to DescendantForest$get_samples_info(): it returns a dataframe containing information about the forest samples. However, the former method adds two columns to the result of the latter: “DNA_quantity” and “equivalent_normal_cells”. The former column reports the overall quantity of tumoral DNA in the sample, i.e., the sum of the lengths of all the alleles in the corresponding sample. This quantity is a natural number, nevertheless, it is stored as a real number as it usually usually exceeds the largest natural number that can be natively represented by R. The column “equivalent_normal_cells” instead contains the number of normal cells that contain as much DNA as the sample tumoral cells.

phylo_forest$get_samples_info()
#>    name id xmin ymin xmax ymax tumour_cells tumour_cells_in_bbox     time
#> 1 S_1_1  0  480  480  530  530         2588                 2588 365.9279
#> 2 S_1_2  1  500  500  550  550         1632                 1632 365.9279
#> 3 S_2_1  2  370  498  394  522          572                  572 421.2143
#> 4 S_2_2  3  370  373  394  397          604                  604 421.2143
#>   DNA_quantity equivalent_normal_cells
#> 1 530855879155                5173.573
#> 2 333071264048                3246.020
#> 3 117255398095                1142.738
#> 4 123951831456                1208.000

The method PhylogeneticForest$get_germline_mutations() returns the SNVs and the indels in the germline.

# extract the germline mutation
phylo_forest$get_germline_mutations() %>% head()
#>   chr  chr_pos allele ref   alt  type cause    class
#> 1  22 16051493      0   G     A   SNV       germinal
#> 2  22 16052167      0   A AAAAC indel       germinal
#> 3  22 16053659      0   A     C   SNV       germinal
#> 4  22 16054740      0   A     G   SNV       germinal
#> 5  22 16055942      0   C     T   SNV       germinal
#> 6  22 16058070      0   A     G   SNV       germinal

Users can also identify the cell in which a mutation emerged even when the cell was not sampled.

# select one of the mutations
mutation_row <- phylo_forest$get_sampled_cell_mutations(cell_id)[2, ]

# rebuild the corresponding mutation
mutation <- Mutation(mutation_row["chr"][1, ],
                     mutation_row["chr_pos"][1, ],
                     mutation_row["ref"][1, ],
                     mutation_row["alt"][1, ])

# get the identifier of the oldest cells in which the mutation occurs
phylo_forest$get_first_occurrences(mutation)
#> [[1]]
#> [1] 0

The exposures used in placing the mutations on the cells in the phylogenetic forest can be obtained by using the method PhylogeneticForest$get_exposures().

# get the exposures used in placing the mutations
phylo_forest$get_exposures()
#>    time signature exposure  type
#> 1     0      SBS1      0.8   SNV
#> 2     0     SBS13      0.2   SNV
#> 3   100      SBS2      0.2   SNV
#> 4   100      SBS3      0.5   SNV
#> 5   100      SBS5      0.3   SNV
#> 6   120      SBS2      0.2   SNV
#> 7   120      SBS3      0.5   SNV
#> 8   120      SBS5      0.3   SNV
#> 9     0      ID13      0.2 indel
#> 10    0       ID2      0.6 indel
#> 11    0      ID21      0.2 indel
#> 12  120       ID1      0.8 indel
#> 13  120       ID9      0.2 indel

The method MutationEngine$get_bulk_allelic_fragmentation() returns a dataframe reporting the allelic type per genome fragment.

# get the name of the first sample
sample_name <- phylo_forest$get_samples_info()[["name"]][1]

# print the bulk allelic fragmentation
phylo_forest$get_bulk_allelic_fragmentation(sample_name) %>% head()
#>   chr   begin      end major minor       ratio
#> 1  22       1  5009999     2     2 0.998840804
#> 2  22       1  5009999     3     2 0.001159196
#> 3  22 5010000  5209999     2     0 0.998840804
#> 4  22 5010000  5209999     3     0 0.001159196
#> 5  22 5210000 10303469     2     2 0.998840804
#> 6  22 5210000 10303469     3     2 0.001159196

Instead, the method MutationEngine$get_cell_allelic_fragmentation() returns the allelic fragmentation per cell.

# print the cell allelic fragmentation
phylo_forest$get_cell_allelic_fragmentation() %>% head()
#>   cell_id chr    begin      end major minor
#> 1    2790  22        1  5009999     2     2
#> 2    2790  22  5010000  5209999     2     0
#> 3    2790  22  5210000 10303469     2     2
#> 4    2790  22 10303470 10503469     4     2
#> 5    2790  22 10503470 51304566     2     2
#> 6    5000  22        1  5009999     2     2

The details about the SNV and indel signatures adopted during the evolution are available in the mutation engine and they can be retrieved by using the methods MutationEngine$get_SNV_signatures() and MutationEngine$get_indel_signatures().

# get the SNV signatures used in placing the mutations
m_engine$get_SNV_signatures()[1:6, 1:5]
#>      Type        SBS1        SBS2       SBS3        SBS4
#> 1 A[C>A]A 0.000886157 5.80000e-07 0.02080832 0.042196498
#> 2 A[C>A]C 0.002280405 1.48004e-04 0.01650660 0.033297236
#> 3 A[C>A]G 0.000177031 5.23000e-05 0.00175070 0.015598705
#> 4 A[C>A]T 0.001280227 9.78000e-05 0.01220488 0.029497552
#> 5 A[C>G]A 0.001860330 2.23000e-16 0.01970788 0.006889428
#> 6 A[C>G]C 0.001220217 1.33004e-04 0.01170468 0.002839764

# get the indel signatures used in placing the mutations
m_engine$get_indel_signatures()[1:6, 1:5]
#>        Type          ID1         ID2        ID3         ID4
#> 1 1:Del:C:0 1.598890e-04 0.004824116 0.12472711 0.007249717
#> 2 1:Del:C:1 7.735230e-04 0.000022100 0.20887617 0.002734869
#> 3 1:Del:C:2 3.310000e-18 0.000003110 0.17632422 0.002041063
#> 4 1:Del:C:3 1.907613e-03 0.002472076 0.06404276 0.001112283
#> 5 1:Del:C:4 7.059900e-04 0.003856976 0.04398981 0.001075684
#> 6 1:Del:C:5 3.370115e-03 0.014615424 0.02241749 0.002246178

Finally, the data of the subject whose germline corresponds to wild-type genome in the phylogenetic forest can be obtained by the method Phylogenetic$get_germline_subject().

phylo_forest$get_germline_subject()
#>    sample pop super_pop gender
#> 1 NA20513 TSI       EUR   male

Storing Phylogenetic Forests

As in the case of the samples forests, the phylogenetic forests can be saved by using the method PhylogeneticForest$save() and load by the function load_phylogenetic_forest().

# save the phylogenetic forest in the file "phylo_forest.sff"
phylo_forest$save("phylo_forest.sff")
#>  [█---------------------------------------] 0% [00m:00s] Saving forest [█---------------------------------------] 0% [00m:00s] Saving forest [█---------------------------------------] 0% [00m:01s] Saving forest [█---------------------------------------] 0% [00m:02s] Saving forest [█---------------------------------------] 0% [00m:03s] Saving forest [█---------------------------------------] 0% [00m:04s] Saving forest [█---------------------------------------] 0% [00m:06s] Saving forest [█---------------------------------------] 0% [00m:07s] Saving forest [█---------------------------------------] 0% [00m:08s] Saving forest [█---------------------------------------] 0% [00m:09s] Saving forest [█---------------------------------------] 0% [00m:10s] Saving forest [█---------------------------------------] 0% [00m:11s] Saving forest [█---------------------------------------] 0% [00m:12s] Saving forest [█---------------------------------------] 0% [00m:13s] Saving forest [█---------------------------------------] 0% [00m:14s] Saving forest [█---------------------------------------] 0% [00m:15s] Saving forest [█---------------------------------------] 0% [00m:16s] Saving forest [█---------------------------------------] 1% [00m:17s] Saving forest [█████-----------------------------------] 11% [00m:18s] Saving forest [████████--------------------------------] 18% [00m:19s] Saving forest [███████████-----------------------------] 25% [00m:20s] Saving forest [█████████████---------------------------] 30% [00m:21s] Saving forest [███████████████-------------------------] 35% [00m:22s] Saving forest [█████████████████-----------------------] 42% [00m:23s] Saving forest [████████████████████--------------------] 49% [00m:24s] Saving forest [███████████████████████-----------------] 56% [00m:25s] Saving forest [██████████████████████████--------------] 63% [00m:27s] Saving forest [█████████████████████████████-----------] 70% [00m:28s] Saving forest [███████████████████████████████---------] 77% [00m:29s] Saving forest [██████████████████████████████████------] 84% [00m:30s] Saving forest [█████████████████████████████████████---] 90% [00m:31s] Saving forest [███████████████████████████████████████-] 96% [00m:32s] Saving forest [████████████████████████████████████████] 100% [00m:33s] Forest saved

# loading the saved forest
loaded_phylo_forest <- load_phylogenetic_forest("phylo_forest.sff")
#>  [█---------------------------------------] 0% [00m:00s] Loading forest [███-------------------------------------] 7% [00m:01s] Loading forest [█████-----------------------------------] 11% [00m:02s] Loading forest [███████---------------------------------] 15% [00m:03s] Loading forest [████████--------------------------------] 19% [00m:04s] Loading forest [██████████------------------------------] 23% [00m:05s] Loading forest [███████████-----------------------------] 27% [00m:06s] Loading forest [█████████████---------------------------] 31% [00m:07s] Loading forest [███████████████-------------------------] 35% [00m:08s] Loading forest [████████████████------------------------] 39% [00m:09s] Loading forest [██████████████████----------------------] 43% [00m:10s] Loading forest [███████████████████---------------------] 47% [00m:11s] Loading forest [█████████████████████-------------------] 51% [00m:12s] Loading forest [███████████████████████-----------------] 55% [00m:13s] Loading forest [████████████████████████----------------] 58% [00m:14s] Loading forest [██████████████████████████--------------] 63% [00m:15s] Loading forest [███████████████████████████-------------] 67% [00m:16s] Loading forest [█████████████████████████████-----------] 71% [00m:17s] Loading forest [███████████████████████████████---------] 75% [00m:19s] Loading forest [████████████████████████████████--------] 79% [00m:20s] Loading forest [██████████████████████████████████------] 83% [00m:21s] Loading forest [███████████████████████████████████-----] 87% [00m:22s] Loading forest [█████████████████████████████████████---] 91% [00m:23s] Loading forest [███████████████████████████████████████-] 95% [00m:24s] Loading forest [████████████████████████████████████████] 99% [00m:25s] Loading forest [████████████████████████████████████████] 100% [00m:26s] Forest loaded

loaded_phylo_forest
#> PhylogeneticForest
#>   # of trees: 1
#>   # of nodes: 18845
#>   # of leaves: 5396
#>   samples: {"S_1_1", "S_1_2", "S_2_1", "S_2_2"}
#> 
#>   # of emerged SNVs and indels: 113353
#>   # of emerged CNAs: 39

Getting and Setting the Reference Genome Path

The phylogenetic forest object contains the reference genome FASTA file path. The methods PhylogeneticForest$get_reference_path() and PhylogeneticForest$set_reference_path() can be used to get the path and set it, respectively.

phylo_forest$get_reference_path()
#> [1] "/Users/alberto/Library/CloudStorage/Dropbox/Lavoro/Code/rRACES/rRACES/vignettes/demo/reference.fasta"

phylo_forest$set_reference_path("demo/reference.fasta")

phylo_forest$get_reference_path()
#> [1] "/Users/alberto/Library/CloudStorage/Dropbox/Lavoro/Code/rRACES/rRACES/vignettes/demo/reference.fasta"

Alexandrov, Ludmil B, Jaegil Kim, Nicholas J Haradhvala, Mi Ni Huang, Alvin Wei Tian Ng, Yang Wu, Arnoud Boot, et al. 2020. “The Repertoire of Mutational Signatures in Human Cancer.” Nature 578 (7793): 94–101.