Disclaimer: RACES/rRACES internally implements the probability distributions using the C++11 random number distribution classes. The standard does not specify their algorithms, and the class implementations are left free for the compiler. Thus, the simulation output depends on the compiler used to compile RACES, and because of that, the results reported in this article may differ from those obtained by the reader.
rRACES/RACES can simulate genomic mutations on the cells represented
in a SamplesForest
according to the specified SBS and indel
mutational signatures (see (Alexandrov et al.
2020)). This process is performed by the class
MutationEngine
which also takes into account the mutation
rate of the simulated species and gives the chance to dynamically change
the signatures.
Setting Up Mutation Engine
The creation of an object of the type MutationEngine
requires to download the reference sequence and the signature files and
to build the corresponding two signature indices. The function
MutationEngine()
performs all these steps in a single
call.
This function is quite flexible and it allows to customize the mutation engine in many ways. However, the vast majority of rRACES users aim a standard setup for instance involving the human genome.
Because of this, rRACES provides some predefined setups. Their list
can be obtained by invoking the function
get_mutation_engine_codes()
.
library(rRACES)
get_mutation_engine_codes()
#> name description
#> 1 GRCh37 Homo sapiens (GRCh37)
#> 2 GRCh38 Homo sapiens (GRCh38)
#> 3 demo A demostative set-up
Let us build a MutationEngine
by using the
"demo"
setup.
# building a mutation engine by using the "demo" setup
m_engine <- MutationEngine(setup_code = "demo")
#> Downloading reference genome...
#> Reference genome downloaded
#> Decompressing reference file...done
#> Downloading SBS file...
#> SBS file downloaded
#> Downloading indel file...
#> indel file downloaded
#> Downloading driver mutation file...
#> Driver mutation file downloaded
#> Downloading passenger CNAs file...
#> Passenger CNAs file downloaded
#> Downloading germline mutations...
#> Germline mutations downloaded
#> Building context index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█████████████████-----------------------] 40% [00m:01s] Processing chr. 22 [█████████████████████████████████-------] 81% [00m:02s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:02s] Context index built
#> [█---------------------------------------] 0% [00m:00s] Saving context index [████████████████████████████████████████] 100% [00m:00s] Context index saved
#> done
#> Building repeated sequence index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█---------------------------------------] 0% [00m:03s] Processing chr. 22 [█---------------------------------------] 0% [00m:04s] Processing chr. 22 [█---------------------------------------] 0% [00m:06s] Processing chr. 22 [█---------------------------------------] 0% [00m:07s] Processing chr. 22 [█---------------------------------------] 0% [00m:08s] Processing chr. 22 [█---------------------------------------] 0% [00m:12s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:12s] RS index built
#> [█---------------------------------------] 0% [00m:00s] Saving RS index [███████████████████████-----------------] 57% [00m:01s] Saving RS index [████████████████████████████████████████] 100% [00m:01s] RS index saved
#> done
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#> [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved
m_engine
#> MutationEngine
#> Passenger rates
#>
#> Driver mutations
#>
#> Timed Exposure
#> SBS Timed Exposures
#>
#> indel Timed Exposures
The above call creates the directory demo
, downloads all
the data required by the mutation engine, and builds it.
dir.exists("demo")
#> [1] TRUE
list.files("demo")
#> [1] "context_index_100.cif" "drivers.txt"
#> [3] "germline_data" "germline_data_demo.tar.gz"
#> [5] "indel_signatures.txt" "passenger_CNAs.txt"
#> [7] "reference.fasta" "rs_index_50_500000.rsif"
#> [9] "SBS_signatures.txt" "sources.csv"
The execution of the function MutationEngine()
may takes
some time, but it is meant to be performed one for all and, as long as
the user does not need to change the reference genome or the signature
files, it is no more required. In this spirit, any call to the function
MutationEngine
checks whether all the building process
sub-tasks are really needed and, if this is not the case, it avoids to
performs them.
# building a mutation engine by using the "demo" set-up configuration
m_engine <- MutationEngine(setup_code = "demo")
#> [█---------------------------------------] 0% [00m:00s] Loading context index [████████████████████████████████████████] 100% [00m:00s] Context index loaded
#> [█---------------------------------------] 0% [00m:00s] Loading RS index [███████████████████---------------------] 47% [00m:01s] Loading RS index [█████████████████████████████████████---] 90% [00m:02s] Loading RS index [████████████████████████████████████████] 100% [00m:02s] RS index loaded
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
Mutant Genetic Characterization
Once the mutation engine has been built, we can define mutant genotype and declare species mutation rates.
Let us consider simulation peformed in Section “Two populations
with epigenetic state” of vignette("sampling")
. It
involves the two mutants A
and “B”. Both of them has two
possible epigenetic states, +
and -
, leading
to the four the species A+
, A-
,
B+
, and B-
, respectively. Each of these
species has a passenger mutation rate that must be declated to the
mutation engine before labeling the samples forest. Let 1e-9, 3e-8,
8e-7, and 5e-8 be the passenger SNV rates for the species
A+
, A-
, B+
, and B-
,
respectively. Let the indel rates be 0 for all the species, but
A+
which has indel rate 1e-8. Furthermore, let 0, 1e-11, 0,
and 0 be the passenger CNA rates of the same species, respectively.
The two mutants may also be genetically characterized by some driver mutations. The driver mutations associated with each of the mutants must occur in any cell belonging to the mutant itself. Hence, they must be declared to the mutant engine before the labeling.
The method MutationEngine$add_mutant()
takes care of all
these declarations.
For the sake of example, let us assume that A
is
characterized by one driver mutation on chromosome 22, while
B
has three driver mutations on the same chromosome.
The genetic specification of the mutant A
can be
declared as it follows.
# add the mutant "A" characterized by one driver SNV on the allele 1 of
# chromosome 22, one indel deletion on the same chromosome, and two CNAs:
# a deletion on the allele 1 of chromosome 22 and an amplification on a
# random allele of the same chromosome. The mutant has two epigenetic
# states and its species "A+" and "A-" have passenger SNV rates 1e-9 and
# 3e-8, respectively, passenger indel rates 1e-10 and 0, respectively,
# and passenger CNA rates 0 and 1e-11, respectively.
m_engine$add_mutant(mutant_name = "A",
passenger_rates = list("+" = c(SNV = 1e-9, indel = 1e-8),
"-" = c(SNV = 3e-8, CNA = 1e-11)),
drivers = list(SNV("22", 10510210, "C", allele = 1),
Mutation("22", 15220157, "GTTTTTTTT", "G"),
CNA(type = "A", chr = "22",
chr_pos = 10303470, len = 200000),
CNA("D", "22", 5010000, 200000,
allele = 1)))
#> [█---------------------------------------] 0% [00m:00s] Retrieving "A" SNVs [█---------------------------------------] 0% [00m:00s] Found 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:01s] Reading 22 [█---------------------------------------] 0% [00m:02s] Reading 22 [████████████████████████████████████████] 100% [00m:02s] "A" SNVs retrieved
m_engine
#> MutationEngine
#> Passenger rates
#> "A+": {SNV: 1e-09, indel: 1e-08}
#> "A-": {SNV: 3e-08, CNA: 1e-11}
#>
#> Driver mutations
#> "A" SNVs&indels:
#> chr22(10510210)[N>C](allele: 1)
#> chr22(15220157)[GTTTTTTTT>G](allele: random)
#> "A" CNAs:
#> CNA("D",chr22(5010000), len: 200000, allele: 1)
#> CNA("A",chr22(10303470), len: 200000)
#>
#> Timed Exposure
#> SBS Timed Exposures
#>
#> indel Timed Exposures
In the above code, the driver CNAs, SNVs, and indels are declared by
calling the functions CNA()
, SNV()
, and
Mutation()
, respectively. These function allow to specify
the allele in which the mutation must lay.
As far as the SNVs and indels concern, rRACES provides the users with
a more compact and, sometimes, convenient approach. The mutation engine
stores a list of known driver mutations and labels each of them by a
code which can be used during the mutant declaration. The corresponding
data frame can be obtained by using the method
MutationEngine$get_known_drivers()
.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
m_engine$get_known_drivers() %>% filter(chr == "22")
#> chr from to ref alt mutation_type tumour_type driver_gene
#> 1 22 30057302 30057302 C T SNV OV NF2
#> 2 22 30057302 30057302 C T SNV OV NF2
#> driver_code
#> 1 NF2 R221*
#> 2 NF2 R262*
The code of the known driver mutations can be used in place of the full specification as follows.
# add the mutant "B" characterized by three driver SNVs on chromosome 22 (no
# CNA) and two epigenetic states. The first two SNVs are "NF2 R262", which
# can lay in any available allele, and "NF2 R221*", which instead must lay
# in the allele 1. The remaining SNV is specified by using the SNV function
# as done above. The species "B+" and "B-" have passenger SNV rates 8e-7
# and 5e-8, respectively, and CNA rates 0 for both species.
m_engine$add_mutant("B", list("+" = c(SNV = 8e-7), "-" = c(SNV = 5e-8)),
list(SNV("22", 12028576, "G"), # the first SNV
"NF2 R262*", # the second SNV
list("NF2 R221*", allele = 1))) # the third SNV
#> [█---------------------------------------] 0% [00m:00s] Retrieving "B" SNVs [█---------------------------------------] 0% [00m:00s] Found 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:01s] Reading 22 [████████████████████████████████████████] 100% [00m:01s] "B" SNVs retrieved
m_engine
#> MutationEngine
#> Passenger rates
#> "A+": {SNV: 1e-09, indel: 1e-08}
#> "A-": {SNV: 3e-08, CNA: 1e-11}
#> "B+": {SNV: 8e-07}
#> "B-": {SNV: 5e-08}
#>
#> Driver mutations
#> "A" SNVs&indels:
#> chr22(10510210)[N>C](allele: 1)
#> chr22(15220157)[GTTTTTTTT>G](allele: random)
#> "A" CNAs:
#> CNA("D",chr22(5010000), len: 200000, allele: 1)
#> CNA("A",chr22(10303470), len: 200000)
#> "B" SNVs&indels:
#> chr22(12028576)[N>G](allele: random)
#> chr22(30057302)[C>T](allele: 1)
#> chr22(30057302)[C>T](allele: random)
#>
#> Timed Exposure
#> SBS Timed Exposures
#>
#> indel Timed Exposures
Mutational Exposures
The probability for a mutation to occur depends on both its genomic and environmental context.
A signature is a mutation probability distribution over mutation contexts or mutation structure. SBS (single base substitution) signatures provide for any genomic context (i.e., a triplet of bases) the probability for a SNV to occur on that context. On the contrary, ID (indel) signatures associates the probability of an indel to its length and structure (see (Alexandrov et al. 2020)).
The signature depends on the environmental context and, because of that, more than one signature may be active at the same time with different probabilities. An mutational exposure (or exposure) is a discrete probability distribution among signatures.
In order to simulate passeger mutations of a given type, we need to specify a default exposure for that type. This can be achieved as it follows.
# add SNV and indel default exposures. This will be used from simulated time 0
# up to the successive exposure change.
m_engine$add_exposure(coefficients = c(SBS13 = 0.2, SBS1 = 0.8))
m_engine$add_exposure(c(ID2 = 0.6, ID13 = 0.2, ID21 = 0.2))
Further exposures can also be defined by specifying an activation time for each of them, i.e., the new exposures will be used from the specified time up to the next exposure change.
# add a new SNV exposure that will be used from simulated
# time 100 up to the successive exposure change.
m_engine$add_exposure(time = 100, c(SBS5 = 0.3, SBS2 = 0.2, SBS3 = 0.5))
m_engine
#> MutationEngine
#> Passenger rates
#> "A+": {SNV: 1e-09, indel: 1e-08}
#> "A-": {SNV: 3e-08, CNA: 1e-11}
#> "B+": {SNV: 8e-07}
#> "B-": {SNV: 5e-08}
#>
#> Driver mutations
#> "A" SNVs&indels:
#> chr22(10510210)[N>C](allele: 1)
#> chr22(15220157)[GTTTTTTTT>G](allele: random)
#> "A" CNAs:
#> CNA("D",chr22(5010000), len: 200000, allele: 1)
#> CNA("A",chr22(10303470), len: 200000)
#> "B" SNVs&indels:
#> chr22(12028576)[N>G](allele: random)
#> chr22(30057302)[C>T](allele: 1)
#> chr22(30057302)[C>T](allele: random)
#>
#> Timed Exposure
#> SBS Timed Exposures
#> [0, 100[: {"SBS1": 0.8, "SBS13": 0.2}
#> [100, ∞[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#>
#> indel Timed Exposures
#> [0, ∞[: {"ID13": 0.2, "ID2": 0.6, "ID21": 0.2}
The default exposures and the simultaneous changes in SNV and ID
exposures can be specified by a sigle call to the function
MutationEngine$add_exposure()
as it follows.
m_engine$add_exposure(time = 120, c(SBS5 = 0.3, SBS2 = 0.2, SBS3 = 0.5,
ID1 = 0.8, ID9 = 0.2))
m_engine
#> MutationEngine
#> Passenger rates
#> "A+": {SNV: 1e-09, indel: 1e-08}
#> "A-": {SNV: 3e-08, CNA: 1e-11}
#> "B+": {SNV: 8e-07}
#> "B-": {SNV: 5e-08}
#>
#> Driver mutations
#> "A" SNVs&indels:
#> chr22(10510210)[N>C](allele: 1)
#> chr22(15220157)[GTTTTTTTT>G](allele: random)
#> "A" CNAs:
#> CNA("D",chr22(5010000), len: 200000, allele: 1)
#> CNA("A",chr22(10303470), len: 200000)
#> "B" SNVs&indels:
#> chr22(12028576)[N>G](allele: random)
#> chr22(30057302)[C>T](allele: 1)
#> chr22(30057302)[C>T](allele: random)
#>
#> Timed Exposure
#> SBS Timed Exposures
#> [0, 100[: {"SBS1": 0.8, "SBS13": 0.2}
#> [100, 120[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#> [120, ∞[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#>
#> indel Timed Exposures
#> [0, 120[: {"ID13": 0.2, "ID2": 0.6, "ID21": 0.2}
#> [120, ∞[: {"ID1": 0.8, "ID9": 0.2}
Passenger CNAs and Tumour Type
The passenger CNAs applied during the simulation depend on tumour
type. The type of tumour can be specified when building the mutation
engine by using the tumour_type
parameter. For instance, if
the passenger CNAs file used to build the mutation engine contains some
of the data identified in breast carcinoma in the UK, the specific CNA
can be applied during the simulation as it follows.
m_engine <- MutationEngine(setup_code = "demo",
tumour_type = "BRCA-UK")
Germline Mutations
rRACES allows users to apply the germline mutations of one of the subjects available in the germline data provided in building the mutation engine. This feature will enable users to simulate a specific cancer type’s evolution on an individual with the desired gender and ethnicity.
The available subjects, together with their sex and ethnicity, can be
obtained by using the method
[MutationEngine$get_germline_subjects()]
.
subjects <- m_engine$get_germline_subjects()
subjects
#> sample pop super_pop gender
#> 1 NA18941 JPT EAS female
#> 2 NA20513 TSI EUR male
The column sample
contains the names of the available
subjects. The columns pop
and super_pop
report
the subjects’ population and super-population codes. The last column,
gender
, includes the subject gender.
The method MutationEngine$get_population_descritions()
clarifies the meaning of the codes reported in the pop columns.
m_engine$get_population_descritions()
#> code description
#> 1 CHB Han Chinese
#> 2 JPT Japanese
#> 3 CHS Southern Han Chinese
#> 4 CDX Dai Chinese
#> 5 KHV Kinh Vietnamese
#> 6 CHD Denver Chinese
#> 7 CEU CEPH
#> 8 TSI Tuscan
#> 9 GBR British
#> 10 FIN Finnish
#> 11 IBS Spanish
#> 12 YRI Yoruba
#> 13 LWK Luhya
#> 14 GWD Gambian
#> 15 MSL Mende
#> 16 ESN Esan
#> 17 ASW African-American SW
#> 18 ACB African-Caribbean
#> 19 MXL Mexican-American
#> 20 PUR Puerto Rican
#> 21 CLM Colombian
#> 22 PEL Peruvian
#> 23 GIH Gujarati
#> 24 PJL Punjabi
#> 25 BEB Bengali
#> 26 STU Sri Lankan
#> 27 ITU Indian
#> long.description
#> 1 Han Chinese in Beijing, China
#> 2 Japanese in Tokyo, Japan
#> 3 Han Chinese South
#> 4 Chinese Dai in Xishuangbanna, China
#> 5 Kinh in Ho Chi Minh City, Vietnam
#> 6 Chinese in Denver, Colorado (pilot 3 only)
#> 7 Utah residents (CEPH) with Northern and Western European ancestry
#> 8 Toscani in Italia
#> 9 British in England and Scotland
#> 10 Finnish in Finland
#> 11 Iberian populations in Spain
#> 12 Yoruba in Ibadan, Nigeria
#> 13 Luhya in Webuye, Kenya
#> 14 Gambian in Western Division, The Gambia
#> 15 Mende in Sierra Leone
#> 16 Esan in Nigeria
#> 17 African Ancestry in Southwest US
#> 18 African Caribbean in Barbados
#> 19 Mexican Ancestry in Los Angeles, California
#> 20 Puerto Rican in Puerto Rico
#> 21 Colombian in Medellin, Colombia
#> 22 Peruvian in Lima, Peru
#> 23 Gujarati Indian in Houston, TX
#> 24 Punjabi in Lahore, Pakistan
#> 25 Bengali in Bangladesh
#> 26 Sri Lankan Tamil in the UK
#> 27 Indian Telugu in the UK
The method MutationEngine$get_active_germline()
returns
the the active germline subject.
m_engine$get_active_germline()
#> sample pop super_pop gender
#> 1 NA18941 JPT EAS female
Users can change the germline subject using the method
MutationEngine$set_germline_subject()
.
When a subject is selected for the first time, rRACES builds a binary representation of the subject genome, saving it for future use. This step may take a few minutes. However, all the successive selections of the same subject directly load the binary file.
m_engine$set_germline_subject(subjects[2, "sample"])
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#> [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved
m_engine$get_active_germline()
#> sample pop super_pop gender
#> 1 NA20513 TSI EUR male
Building Phylogenetic Forests
The configurated mutation engine can be used to label each node in a samples forest by mutations.
Since the mutation engine has be configurated to deal with the
simulation peformed in Section “Two populations with epigenetic
state” of vignette("sampling")
, we can load its
samples forest from the file "samples_forest.sff"
in saved
that article.
samples_forest <- load_samples_forest("samples_forest.sff")
# place mutations on the sample forest assuming 1000 pre-neoplastic SNVs and
# 500 indels
phylo_forest <- m_engine$place_mutations(samples_forest, 1000, 500)
#> [█---------------------------------------] 0% [00m:00s] Placing mutations [█████-----------------------------------] 10% [00m:01s] Placing mutations [████████--------------------------------] 19% [00m:02s] Placing mutations [████████████----------------------------] 29% [00m:03s] Placing mutations [█████████████████-----------------------] 40% [00m:04s] Placing mutations [█████████████████████-------------------] 50% [00m:05s] Placing mutations [█████████████████████████---------------] 60% [00m:06s] Placing mutations [████████████████████████████------------] 69% [00m:07s] Placing mutations [█████████████████████████████████-------] 80% [00m:08s] Placing mutations [██████████████████████████████████------] 84% [00m:09s] Placing mutations [███████████████████████████████████-----] 87% [00m:10s] Placing mutations [█████████████████████████████████████---] 90% [00m:12s] Placing mutations [██████████████████████████████████████--] 93% [00m:13s] Placing mutations [████████████████████████████████████████] 100% [00m:13s] Mutations placed
phylo_forest
#> PhylogeneticForest
#> # of trees: 1
#> # of nodes: 20007
#> # of leaves: 5442
#> samples: {"S_1_1", "S_1_2", "S_2_1", "S_2_2"}
#>
#> # of emerged SNVs and indels: 186616
#> # of emerged CNAs: 2
The phylogenetic forest stores all the mutations labeling the sampled
cells which are represented by the forest leaves. Users can retrieve
such data by using the methods
PhylogeneticForest$get_sampled_cell_mutations()
and
PhylogeneticForest$get_sampled_cell_CNAs()
.
library(dplyr)
# select the first mutations among all the mutations occuring in
# the genomes of the sampled cells
phylo_forest$get_sampled_cell_mutations() %>% head()
#> cell_id chr chr_pos allele ref alt type cause class
#> 1 3072 22 15220157 0 GTTTTTTTT G indel A driver
#> 2 3072 22 16125311 0 G A SNV SBS1 pre-neoplastic
#> 3 3072 22 16132695 0 C CAAAAAA indel ID1 pre-neoplastic
#> 4 3072 22 16148349 0 G A SNV SBS1 pre-neoplastic
#> 5 3072 22 16156369 0 G A SNV SBS1 pre-neoplastic
#> 6 3072 22 16192592 0 C CTTTTT indel ID1 pre-neoplastic
# select the first CNAs among all the mutations occuring in
# the genomes of the sampled cells
phylo_forest$get_sampled_cell_CNAs() %>% head()
#> cell_id type chr begin end allele src.allele class
#> 1 3072 D 22 5010000 5209999 1 NA driver
#> 2 3072 A 22 10303470 10503469 2 0 driver
#> 3 6238 D 22 5010000 5209999 1 NA driver
#> 4 6238 A 22 10303470 10503469 2 0 driver
#> 5 7555 D 22 5010000 5209999 1 NA driver
#> 6 7555 A 22 10303470 10503469 2 0 driver
# get the sampled cells
sampled_cells <- phylo_forest$get_nodes() %>%
filter(!is.na(.data$sample))
# show the first of them
sampled_cells %>% head()
#> cell_id ancestor mutant epistate sample birth_time
#> 1 3072 1479 A - S_1_2 147.5783
#> 2 6238 5729 A - S_1_2 178.7256
#> 3 7555 7152 A - S_1_2 187.5936
#> 4 8011 3210 A - S_1_1 190.5756
#> 5 8249 7992 A - S_1_2 191.9463
#> 6 9002 8514 A - S_1_1 196.4062
# get the identifier of the 3rd cell in `sampled_cells`
cell_id <- sampled_cells[3, 1]
# get the SNVs and the indels of the 3rd cell in `sampled_cells`
phylo_forest$get_sampled_cell_mutations(cell_id) %>% head()
#> cell_id chr chr_pos allele ref alt type cause class
#> 1 7555 22 15220157 0 GTTTTTTTT G indel A driver
#> 2 7555 22 16125311 0 G A SNV SBS1 pre-neoplastic
#> 3 7555 22 16132695 0 C CAAAAAA indel ID1 pre-neoplastic
#> 4 7555 22 16148349 0 G A SNV SBS1 pre-neoplastic
#> 5 7555 22 16156369 0 G A SNV SBS1 pre-neoplastic
#> 6 7555 22 16192592 0 C CTTTTT indel ID1 pre-neoplastic
# get the CNAs of the 3rd cell in `sampled_cells`
phylo_forest$get_sampled_cell_CNAs(cell_id) %>% head()
#> cell_id type chr begin end allele src.allele class
#> 1 7555 D 22 5010000 5209999 1 NA driver
#> 2 7555 A 22 10303470 10503469 2 0 driver
The method PhylogeneticForest$get_germline_mutations()
returns the SNVs and the indels in the germline.
# extract the germline mutation
phylo_forest$get_germline_mutations() %>% head()
#> chr chr_pos allele ref alt type cause class
#> 1 22 16051493 0 G A SNV germinal
#> 2 22 16052167 0 A AAAAC indel germinal
#> 3 22 16053659 0 A C SNV germinal
#> 4 22 16054740 0 A G SNV germinal
#> 5 22 16055942 0 C T SNV germinal
#> 6 22 16058070 0 A G SNV germinal
Users can also identify the cell in which a mutation emerged even when the cell was not sampled.
# select one of the mutations
mutation_row <- phylo_forest$get_sampled_cell_mutations(cell_id)[2, ]
# rebuild the corresponding mutation
mutation <- Mutation(mutation_row["chr"][1, ],
mutation_row["chr_pos"][1, ],
mutation_row["ref"][1, ],
mutation_row["alt"][1, ])
# get the identifier of the oldest cells in which the mutation occurs
phylo_forest$get_first_occurrences(mutation)
#> [[1]]
#> [1] 0
The exposures used in placing the mutations on the cells in the
phylogenetic forest can be obtained by using the method
PhylogeneticForest$get_exposures()
.
# get the exposures used in placing the mutations
phylo_forest$get_exposures()
#> time signature exposure type
#> 1 0 SBS1 0.8 SNV
#> 2 0 SBS13 0.2 SNV
#> 3 100 SBS2 0.2 SNV
#> 4 100 SBS3 0.5 SNV
#> 5 100 SBS5 0.3 SNV
#> 6 120 SBS2 0.2 SNV
#> 7 120 SBS3 0.5 SNV
#> 8 120 SBS5 0.3 SNV
#> 9 0 ID13 0.2 indel
#> 10 0 ID2 0.6 indel
#> 11 0 ID21 0.2 indel
#> 12 120 ID1 0.8 indel
#> 13 120 ID9 0.2 indel
The details about the SNV and indel signatures adopted during the
evolution are available in the mutation engine and they can be retrieved
by using the methods MutationEngine$get_SNV_signatures()
and MutationEngine$get_indel_signatures()
.
# get the SNV signatures used in placing the mutations
m_engine$get_SNV_signatures()[1:6, 1:5]
#> Type SBS1 SBS2 SBS3 SBS4
#> 1 A[C>A]A 0.000886157 5.80000e-07 0.02080832 0.042196498
#> 2 A[C>A]C 0.002280405 1.48004e-04 0.01650660 0.033297236
#> 3 A[C>A]G 0.000177031 5.23000e-05 0.00175070 0.015598705
#> 4 A[C>A]T 0.001280227 9.78000e-05 0.01220488 0.029497552
#> 5 A[C>G]A 0.001860330 2.23000e-16 0.01970788 0.006889428
#> 6 A[C>G]C 0.001220217 1.33004e-04 0.01170468 0.002839764
# get the indel signatures used in placing the mutations
m_engine$get_indel_signatures()[1:6, 1:5]
#> Type ID1 ID2 ID3 ID4
#> 1 1:Del:C:0 1.598890e-04 0.004824116 0.12472711 0.007249717
#> 2 1:Del:C:1 7.735230e-04 0.000022100 0.20887617 0.002734869
#> 3 1:Del:C:2 3.310000e-18 0.000003110 0.17632422 0.002041063
#> 4 1:Del:C:3 1.907613e-03 0.002472076 0.06404276 0.001112283
#> 5 1:Del:C:4 7.059900e-04 0.003856976 0.04398981 0.001075684
#> 6 1:Del:C:5 3.370115e-03 0.014615424 0.02241749 0.002246178
The name of the subject whose germline corresponds to wild-type
genome in the phylogenetic forest can be obtained by the method
Phylogenetic$get_germline_subject()
.
phylo_forest$get_germline_subject()
#> [1] "NA20513"
Storing Phylogenetic Forests
As in the case of the samples forests, the phylogenetic forests can
be saved by using the method PhylogeneticForest$save()
and
load by the function load_phylogenetic_forest()
.
# save the phylogenetic forest in the file "phylo_forest.sff"
phylo_forest$save("phylo_forest.sff")
#> [█---------------------------------------] 0% [00m:00s] Saving forest [██--------------------------------------] 4% [00m:01s] Saving forest [███████████████-------------------------] 35% [00m:02s] Saving forest [█████████████████████████████-----------] 70% [00m:03s] Saving forest [████████████████████████████████████████] 100% [00m:03s] Forest saved
# loading the saved forest
loaded_phylo_forest <- load_phylogenetic_forest("phylo_forest.sff")
#> [█---------------------------------------] 0% [00m:00s] Loading forest [████------------------------------------] 9% [00m:01s] Loading forest [█████████-------------------------------] 20% [00m:02s] Loading forest [█████████████---------------------------] 30% [00m:03s] Loading forest [█████████████████-----------------------] 40% [00m:04s] Loading forest [█████████████████████-------------------] 51% [00m:05s] Loading forest [█████████████████████████---------------] 61% [00m:06s] Loading forest [█████████████████████████████-----------] 72% [00m:07s] Loading forest [█████████████████████████████████-------] 82% [00m:08s] Loading forest [█████████████████████████████████████---] 91% [00m:09s] Loading forest [████████████████████████████████████████] 100% [00m:10s] Forest loaded
loaded_phylo_forest
#> PhylogeneticForest
#> # of trees: 1
#> # of nodes: 20007
#> # of leaves: 5442
#> samples: {"S_1_1", "S_1_2", "S_2_1", "S_2_2"}
#>
#> # of emerged SNVs and indels: 186616
#> # of emerged CNAs: 2
Getting and Setting the Reference Genome Path
The phylogenetic forest object contains the reference genome FASTA
file path. The methods
PhylogeneticForest$get_reference_path()
and
PhylogeneticForest$set_reference_path()
can be used to get
the path and set it, respectively.
phylo_forest$get_reference_path()
#> [1] "/Users/alberto/Library/CloudStorage/Dropbox/Lavoro/Code/rRACES/rRACES/vignettes/demo/reference.fasta"
phylo_forest$set_reference_path("demo/reference.fasta")
phylo_forest$get_reference_path()
#> [1] "/Users/alberto/Library/CloudStorage/Dropbox/Lavoro/Code/rRACES/rRACES/vignettes/demo/reference.fasta"