Disclaimer: RACES/rRACES internally implements the probability distributions using the C++11 random number distribution classes. The standard does not specify their algorithms, and the class implementations are left free for the compiler. Thus, the simulation output depends on the compiler used to compile RACES, and because of that, the results reported in this article may differ from those obtained by the reader.
rRACES/RACES can simulate genomic mutations on the cells represented
in a SamplesForest
according to the specified SBS and indel
mutational signatures (see (Alexandrov et al.
2020)). This process is performed by the class
MutationEngine
which also takes into account the mutation
rate of the simulated species and gives the chance to dynamically change
the signatures.
Setting Up Mutation Engine
The creation of an object of the type MutationEngine
requires to download the reference sequence and the signature files and
to build the corresponding two signature indices. The function
MutationEngine()
performs all these steps in a single
call.
This function is quite flexible and it allows to customize the mutation engine in many ways. However, the vast majority of rRACES users aim a standard setup for instance involving the human genome.
Because of this, rRACES provides some predefined setups. Their list
can be obtained by invoking the function
get_mutation_engine_codes()
.
library(rRACES)
get_mutation_engine_codes()
#> name description
#> 1 GRCh37 Homo sapiens (GRCh37)
#> 2 GRCh38 Homo sapiens (GRCh38)
#> 3 demo A demostative set-up
Let us build a MutationEngine
by using the
"demo"
setup.
# building a mutation engine by using the "demo" setup
m_engine <- MutationEngine(setup_code = "demo")
#> Downloading reference genome...
#> Reference genome downloaded
#> Decompressing reference file...done
#> Downloading SBS file...
#> SBS file downloaded
#> Downloading indel file...
#> indel file downloaded
#> Downloading driver mutation file...
#> Driver mutation file downloaded
#> Downloading passenger CNAs file...
#> Passenger CNAs file downloaded
#> Downloading germline mutations...
#> Germline mutations downloaded
#> Building context index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█████████████---------------------------] 32% [00m:01s] Processing chr. 22 [███████████████████████████-------------] 65% [00m:02s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:03s] Context index built
#> [█---------------------------------------] 0% [00m:00s] Saving context index [████████████████████████████████████████] 100% [00m:00s] Context index saved
#> done
#> Building repeated sequence index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█---------------------------------------] 0% [00m:03s] Processing chr. 22 [█---------------------------------------] 0% [00m:04s] Processing chr. 22 [█---------------------------------------] 0% [00m:06s] Processing chr. 22 [█---------------------------------------] 0% [00m:07s] Processing chr. 22 [█---------------------------------------] 0% [00m:08s] Processing chr. 22 [█---------------------------------------] 0% [00m:12s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:12s] RS index built
#> [█---------------------------------------] 0% [00m:00s] Saving RS index [█---------------------------------------] 0% [00m:00s] Saving RS index [█████████████████████████---------------] 61% [00m:02s] Saving RS indexdone
#> [████████████████████████████████████████] 100% [00m:02s] RS index saved
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#> [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved
m_engine
#> MutationEngine
#> Passenger rates
#>
#> Driver mutations
#>
#> Timed Exposure
#> SBS Timed Exposures
#>
#> indel Timed Exposures
The above call creates the directory demo
, downloads all
the data required by the mutation engine, and builds it.
dir.exists("demo")
#> [1] TRUE
list.files("demo")
#> [1] "context_index_100.cif" "drivers.txt"
#> [3] "germline_data" "germline_data_demo.tar.gz"
#> [5] "indel_signatures.txt" "passenger_CNAs.txt"
#> [7] "reference.fasta" "rs_index_50_500000.rsif"
#> [9] "SBS_signatures.txt" "sources.csv"
The execution of the function MutationEngine()
may takes
some time, but it is meant to be performed one for all and, as long as
the user does not need to change the reference genome or the signature
files, it is no more required. In this spirit, any call to the function
MutationEngine
checks whether all the building process
sub-tasks are really needed and, if this is not the case, it avoids to
performs them.
# building a mutation engine by using the "demo" set-up configuration
m_engine <- MutationEngine(setup_code = "demo")
#> [█---------------------------------------] 0% [00m:00s] Loading context index [████████████████████████████████████████] 100% [00m:00s] Context index loaded
#> [█---------------------------------------] 0% [00m:00s] Loading RS index [████████████████------------------------] 38% [00m:01s] Loading RS index [███████████████████████████████---------] 75% [00m:02s] Loading RS index [████████████████████████████████████████] 100% [00m:02s] RS index loaded
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
Mutant Genetic Characterization
Once the mutation engine has been built, we can define mutant genotype and declare species mutation rates.
Let us consider simulation peformed in Section “Two populations
with epigenetic state” of vignette("sampling")
. It
involves the two mutants A
and “B”. Both of them has two
possible epigenetic states, +
and -
, leading
to the four the species A+
, A-
,
B+
, and B-
, respectively. Each of these
species has a passenger mutation rate that must be declated to the
mutation engine before labeling the samples forest. Let 1e-9, 3e-8,
8e-7, and 5e-8 be the passenger SNV rates for the species
A+
, A-
, B+
, and B-
,
respectively. Let the indel rates be 0 for all the species, but
A+
which has indel rate 1e-8. Furthermore, let 0, 1e-11, 0,
and 0 be the passenger CNA rates of the same species, respectively.
The two mutants may also be genetically characterized by some driver mutations. The driver mutations associated with each of the mutants must occur in any cell belonging to the mutant itself. Hence, they must be declared to the mutant engine before the labeling.
The method MutationEngine$add_mutant()
takes care of all
these declarations.
For the sake of example, let us assume that A
is
characterized by one driver mutation on chromosome 22, while
B
has three driver mutations on the same chromosome.
The genetic specification of the mutant A
can be
declared as it follows.
# add the mutant "A" characterized by one driver SNV on the allele 1 of
# chromosome 22, one indel deletion on the same chromosome, two CNAs (a
# deletion on the allele 1 of chromosome 22 and an amplification on a random
# allele of the same chromosome) and, finally, a whole genome doubling event
# (WGD). The mutant has two epigenetic states and its species "A+" and "A-"
# have passenger SNV rates 1e-9 and 3e-8, respectively, passenger indel rates
# 1e-10 and 0, respectively, and passenger CNA rates 0 and 1e-11, respectively.
m_engine$add_mutant(mutant_name = "A",
passenger_rates = list("+" = c(SNV = 1e-9, indel = 1e-8),
"-" = c(SNV = 3e-8, CNA = 1e-11)),
drivers = list(SNV("22", 10510210, "C", allele = 1),
Mutation("22", 15220157, "GTTTTTTTT", "G"),
CNA(type = "A", chr = "22",
chr_pos = 10303470, len = 200000),
CNA("D", "22", 5010000, 200000,
allele = 1),
WGD))
#> [█---------------------------------------] 0% [00m:00s] Retrieving "A" SNVs [█---------------------------------------] 0% [00m:00s] Found 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:01s] Reading 22 [████████████████████████████████████████] 100% [00m:01s] "A" SNVs retrieved
m_engine
#> MutationEngine
#> Passenger rates
#> "A+": {SNV: 1e-09, indel: 1e-08}
#> "A-": {SNV: 3e-08, CNA: 1e-11}
#>
#> Driver mutations
#> "A":
#> chr22(10510210)[N>C](allele: 1)
#> chr22(15220157)[GTTTTTTTT>G](allele: random)
#> CNA("A",chr22(10303470), len: 200000)
#> CNA("D",chr22(5010000), len: 200000, allele: 1)
#> Whole genome duplication
#>
#> Timed Exposure
#> SBS Timed Exposures
#>
#> indel Timed Exposures
In the above code, the driver CNAs, SNVs, and indels are declared by
calling the functions CNA()
, SNV()
, and
Mutation()
, respectively. These function allow to specify
the allele in which the mutation must lay. The mutant A
is
also characterized by a whole genome doubling
(WGD
): a genomic event that simultaneously duplicates all
the chromosome alleles.
The driver mutations are applied according to their specification
order. For instance, the following two pieces of code specify two
different genomic characterizations for the same mutant
E
.
m_engine$add_mutant("E", passenger_rates,
drivers = list(SNV("22", 10510210, "C", allele = 1),
WGD))
m_engine$add_mutant("E", passenger_rates,
drivers = list(WGD,
SNV("22", 10510210, "C", allele = 1)))
The former snippet places an SNV and, afterwards, duplicates all the alleles producing a genome in which two alleles contain the placed SNV. Instead, the latter snippet requires a whole genome doubling event and, then, places the SNV yielding a single occurrence of the SNV in the final genome.
As far as the SNVs and indels concern, rRACES provides the users with
a more compact and, sometimes, convenient approach. The mutation engine
stores a list of known driver mutations and labels each of them by a
code which can be used during the mutant declaration. The corresponding
data frame can be obtained by using the method
MutationEngine$get_known_drivers()
.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
m_engine$get_known_drivers() %>% filter(chr == "22")
#> chr from to ref alt mutation_type tumour_type driver_gene
#> 1 22 30057302 30057302 C T SNV OV NF2
#> 2 22 30057302 30057302 C T SNV OV NF2
#> driver_code
#> 1 NF2 R221*
#> 2 NF2 R262*
The code of the known driver mutations can be used in place of the full specification as follows.
# add the mutant "B" characterized by two driver SNVs on chromosome 22 (no
# CNA) and two epigenetic states. The first SNV is "NF2 R221*" and it must
# lay in the allele 1. The remaining SNV is specified by using the SNV function
# as done above. The species "B+" and "B-" have passenger SNV rates 8e-8
# and 5e-8, respectively, and CNA rates 0 for both species.
m_engine$add_mutant("B", list("+" = c(SNV = 8e-8), "-" = c(SNV = 5e-8)),
list(list("NF2 R221*", allele = 1), # the first SNV
SNV("22", 12028576, "G"))) # the second SNV
#> [█---------------------------------------] 0% [00m:00s] Retrieving "B" SNVs [█---------------------------------------] 0% [00m:00s] Found 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:01s] Reading 22 [████████████████████████████████████████] 100% [00m:01s] "B" SNVs retrieved
m_engine
#> MutationEngine
#> Passenger rates
#> "A+": {SNV: 1e-09, indel: 1e-08}
#> "A-": {SNV: 3e-08, CNA: 1e-11}
#> "B+": {SNV: 8e-08}
#> "B-": {SNV: 5e-08}
#>
#> Driver mutations
#> "A":
#> chr22(10510210)[N>C](allele: 1)
#> chr22(15220157)[GTTTTTTTT>G](allele: random)
#> CNA("A",chr22(10303470), len: 200000)
#> CNA("D",chr22(5010000), len: 200000, allele: 1)
#> Whole genome duplication
#> "B":
#> chr22(30057302)[C>T](allele: 1)
#> chr22(12028576)[N>G](allele: random)
#>
#> Timed Exposure
#> SBS Timed Exposures
#>
#> indel Timed Exposures
Mutational Exposures
The probability for a mutation to occur depends on both its genomic and environmental context.
A signature is a mutation probability distribution over mutation contexts or mutation structure. SBS (single base substitution) signatures provide for any genomic context (i.e., a triplet of bases) the probability for a SNV to occur on that context. On the contrary, ID (indel) signatures associates the probability of an indel to its length and structure (see (Alexandrov et al. 2020)).
The signature depends on the environmental context and, because of that, more than one signature may be active at the same time with different probabilities. An mutational exposure (or exposure) is a discrete probability distribution among signatures.
In order to simulate passeger mutations of a given type, we need to specify a default exposure for that type. This can be achieved as it follows.
# add SNV and indel default exposures. This will be used from simulated time 0
# up to the successive exposure change.
m_engine$add_exposure(coefficients = c(SBS13 = 0.2, SBS1 = 0.8))
m_engine$add_exposure(c(ID2 = 0.6, ID13 = 0.2, ID21 = 0.2))
Further exposures can also be defined by specifying an activation time for each of them, i.e., the new exposures will be used from the specified time up to the next exposure change.
# add a new SNV exposure that will be used from simulated
# time 100 up to the successive exposure change.
m_engine$add_exposure(time = 100, c(SBS5 = 0.3, SBS2 = 0.2, SBS3 = 0.5))
m_engine
#> MutationEngine
#> Passenger rates
#> "A+": {SNV: 1e-09, indel: 1e-08}
#> "A-": {SNV: 3e-08, CNA: 1e-11}
#> "B+": {SNV: 8e-08}
#> "B-": {SNV: 5e-08}
#>
#> Driver mutations
#> "A":
#> chr22(10510210)[N>C](allele: 1)
#> chr22(15220157)[GTTTTTTTT>G](allele: random)
#> CNA("A",chr22(10303470), len: 200000)
#> CNA("D",chr22(5010000), len: 200000, allele: 1)
#> Whole genome duplication
#> "B":
#> chr22(30057302)[C>T](allele: 1)
#> chr22(12028576)[N>G](allele: random)
#>
#> Timed Exposure
#> SBS Timed Exposures
#> [0, 100[: {"SBS1": 0.8, "SBS13": 0.2}
#> [100, ∞[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#>
#> indel Timed Exposures
#> [0, ∞[: {"ID13": 0.2, "ID2": 0.6, "ID21": 0.2}
The default exposures and the simultaneous changes in SNV and ID
exposures can be specified by a sigle call to the function
MutationEngine$add_exposure()
as it follows.
m_engine$add_exposure(time = 120, c(SBS5 = 0.3, SBS2 = 0.2, SBS3 = 0.5,
ID1 = 0.8, ID9 = 0.2))
m_engine
#> MutationEngine
#> Passenger rates
#> "A+": {SNV: 1e-09, indel: 1e-08}
#> "A-": {SNV: 3e-08, CNA: 1e-11}
#> "B+": {SNV: 8e-08}
#> "B-": {SNV: 5e-08}
#>
#> Driver mutations
#> "A":
#> chr22(10510210)[N>C](allele: 1)
#> chr22(15220157)[GTTTTTTTT>G](allele: random)
#> CNA("A",chr22(10303470), len: 200000)
#> CNA("D",chr22(5010000), len: 200000, allele: 1)
#> Whole genome duplication
#> "B":
#> chr22(30057302)[C>T](allele: 1)
#> chr22(12028576)[N>G](allele: random)
#>
#> Timed Exposure
#> SBS Timed Exposures
#> [0, 100[: {"SBS1": 0.8, "SBS13": 0.2}
#> [100, 120[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#> [120, ∞[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#>
#> indel Timed Exposures
#> [0, 120[: {"ID13": 0.2, "ID2": 0.6, "ID21": 0.2}
#> [120, ∞[: {"ID1": 0.8, "ID9": 0.2}
Passenger CNAs and Tumour Type
The passenger CNAs applied during the simulation depend on tumour
type. The type of tumour can be specified when building the mutation
engine by using the parameters tumour_type
or
tumour_study
. For instance, if the passenger CNAs file used
to build the mutation engine contains some of the data identified in
breast carcinoma in the UK, the specific CNA can be applied during the
simulation as it follows.
m_engine <- MutationEngine(setup_code = "demo",
tumour_type = "BRCA",
tumour_study = "UK")
The complete list of available tumour types and studies in a set-up
can be obtained by using the function
get_available_tumours_in()
.
get_available_tumours_in(setup_code = "demo") %>% head()
#> type study
#> 1 BLCA US
#> 2 BOCA UK
#> 3 BRCA EU
#> 4 BRCA UK
#> 5 BRCA US
#> 6 BTCA SG
Germline Mutations
rRACES allows users to apply the germline mutations of one of the subjects available in the germline data provided in building the mutation engine. This feature will enable users to simulate a specific cancer type’s evolution on an individual with the desired gender and ethnicity.
The available subjects, together with their sex and ethnicity, can be
obtained by using the method
[MutationEngine$get_germline_subjects()]
.
subjects <- m_engine$get_germline_subjects()
subjects
#> sample pop super_pop gender
#> 1 NA18941 JPT EAS female
#> 2 NA20513 TSI EUR male
The column sample
contains the names of the available
subjects. The columns pop
and super_pop
report
the subjects’ population and super-population codes. The last column,
gender
, includes the subject gender.
The method MutationEngine$get_population_descritions()
clarifies the meaning of the codes reported in the pop columns.
m_engine$get_population_descritions()
#> code description
#> 1 CHB Han Chinese
#> 2 JPT Japanese
#> 3 CHS Southern Han Chinese
#> 4 CDX Dai Chinese
#> 5 KHV Kinh Vietnamese
#> 6 CHD Denver Chinese
#> 7 CEU CEPH
#> 8 TSI Tuscan
#> 9 GBR British
#> 10 FIN Finnish
#> 11 IBS Spanish
#> 12 YRI Yoruba
#> 13 LWK Luhya
#> 14 GWD Gambian
#> 15 MSL Mende
#> 16 ESN Esan
#> 17 ASW African-American SW
#> 18 ACB African-Caribbean
#> 19 MXL Mexican-American
#> 20 PUR Puerto Rican
#> 21 CLM Colombian
#> 22 PEL Peruvian
#> 23 GIH Gujarati
#> 24 PJL Punjabi
#> 25 BEB Bengali
#> 26 STU Sri Lankan
#> 27 ITU Indian
#> long.description
#> 1 Han Chinese in Beijing, China
#> 2 Japanese in Tokyo, Japan
#> 3 Han Chinese South
#> 4 Chinese Dai in Xishuangbanna, China
#> 5 Kinh in Ho Chi Minh City, Vietnam
#> 6 Chinese in Denver, Colorado (pilot 3 only)
#> 7 Utah residents (CEPH) with Northern and Western European ancestry
#> 8 Toscani in Italia
#> 9 British in England and Scotland
#> 10 Finnish in Finland
#> 11 Iberian populations in Spain
#> 12 Yoruba in Ibadan, Nigeria
#> 13 Luhya in Webuye, Kenya
#> 14 Gambian in Western Division, The Gambia
#> 15 Mende in Sierra Leone
#> 16 Esan in Nigeria
#> 17 African Ancestry in Southwest US
#> 18 African Caribbean in Barbados
#> 19 Mexican Ancestry in Los Angeles, California
#> 20 Puerto Rican in Puerto Rico
#> 21 Colombian in Medellin, Colombia
#> 22 Peruvian in Lima, Peru
#> 23 Gujarati Indian in Houston, TX
#> 24 Punjabi in Lahore, Pakistan
#> 25 Bengali in Bangladesh
#> 26 Sri Lankan Tamil in the UK
#> 27 Indian Telugu in the UK
The method MutationEngine$get_active_germline()
returns
the the active germline subject.
m_engine$get_active_germline()
#> sample pop super_pop gender
#> 1 NA18941 JPT EAS female
Users can change the germline subject using the method
MutationEngine$set_germline_subject()
.
When a subject is selected for the first time, rRACES builds a binary representation of the subject genome, saving it for future use. This step may take a few minutes. However, all the successive selections of the same subject directly load the binary file.
m_engine$set_germline_subject(subjects[2, "sample"])
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#> [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved
m_engine$get_active_germline()
#> sample pop super_pop gender
#> 1 NA20513 TSI EUR male
Building Phylogenetic Forests
The configurated mutation engine can be used to label each node in a samples forest by mutations.
Since the mutation engine has be configurated to deal with the
simulation peformed in Section “Two populations with epigenetic
state” of vignette("sampling")
, we can load its
samples forest from the file "samples_forest.sff"
in saved
that article.
samples_forest <- load_samples_forest("samples_forest.sff")
# place mutations on the sample forest assuming 1000 pre-neoplastic SNVs and
# 500 indels
phylo_forest <- m_engine$place_mutations(samples_forest, 1000, 500)
#> [█---------------------------------------] 0% [00m:00s] Placing mutations [█████████-------------------------------] 20% [00m:01s] Placing mutations [█████████████████-----------------------] 40% [00m:02s] Placing mutations [█████████████████████████---------------] 60% [00m:03s] Placing mutations [████████████████████████████████--------] 79% [00m:04s] Placing mutations [█████████████████████████████████████---] 92% [00m:05s] Placing mutations [████████████████████████████████████████] 100% [00m:05s] Mutations placed
phylo_forest
#> PhylogeneticForest
#> # of trees: 1
#> # of nodes: 18845
#> # of leaves: 5396
#> samples: {"S_1_1", "S_1_2", "S_2_1", "S_2_2"}
#>
#> # of emerged SNVs and indels: 113353
#> # of emerged CNAs: 39
The phylogenetic forest stores all the mutations labeling the sampled
cells which are represented by the forest leaves. Users can retrieve
such data by using the methods
PhylogeneticForest$get_sampled_cell_mutations()
and
PhylogeneticForest$get_sampled_cell_CNAs()
.
library(dplyr)
# select the first mutations among all the mutations occuring in
# the genomes of the sampled cells
phylo_forest$get_sampled_cell_mutations() %>% head()
#> cell_id chr chr_pos allele ref alt type cause class
#> 1 2790 22 15220157 0 GTTTTTTTT G indel A driver
#> 2 2790 22 16079553 0 G A SNV SBS1 pre-neoplastic
#> 3 2790 22 16135375 0 C CAAAAAA indel ID1 pre-neoplastic
#> 4 2790 22 16149260 0 G A SNV SBS1 pre-neoplastic
#> 5 2790 22 16200607 0 G A SNV SBS1 pre-neoplastic
#> 6 2790 22 16251672 0 C CTTTTT indel ID1 pre-neoplastic
# select the first CNAs among all the mutations occuring in
# the genomes of the sampled cells
phylo_forest$get_sampled_cell_CNAs() %>% head()
#> cell_id type chr begin end allele src.allele class
#> 1 2790 A 22 10303470 10503469 2 1 driver
#> 2 2790 D 22 5010000 5209999 1 NA driver
#> 3 5000 A 22 10303470 10503469 2 1 driver
#> 4 5000 D 22 5010000 5209999 1 NA driver
#> 5 8774 A 22 10303470 10503469 2 1 driver
#> 6 8774 D 22 5010000 5209999 1 NA driver
# get the sampled cells
sampled_cells <- phylo_forest$get_nodes() %>%
filter(!is.na(.data$sample))
# show the first of them
sampled_cells %>% head()
#> cell_id ancestor mutant epistate sample birth_time
#> 1 2790 2694 A - S_1_1 127.8854
#> 2 5000 4671 A - S_1_2 152.6275
#> 3 8774 8274 A - S_1_1 180.3081
#> 4 9651 8069 A - S_1_1 185.6496
#> 5 11533 7757 A - S_1_1 195.7661
#> 6 12905 12464 A - S_1_1 202.1142
# get the identifier of the 3rd cell in `sampled_cells`
cell_id <- sampled_cells[3, 1]
# get the SNVs and the indels of the 3rd cell in `sampled_cells`
phylo_forest$get_sampled_cell_mutations(cell_id) %>% head()
#> cell_id chr chr_pos allele ref alt type cause class
#> 1 8774 22 15220157 0 GTTTTTTTT G indel A driver
#> 2 8774 22 16079553 0 G A SNV SBS1 pre-neoplastic
#> 3 8774 22 16135375 0 C CAAAAAA indel ID1 pre-neoplastic
#> 4 8774 22 16149260 0 G A SNV SBS1 pre-neoplastic
#> 5 8774 22 16200607 0 G A SNV SBS1 pre-neoplastic
#> 6 8774 22 16251672 0 C CTTTTT indel ID1 pre-neoplastic
# get the CNAs of the 3rd cell in `sampled_cells`
phylo_forest$get_sampled_cell_CNAs(cell_id) %>% head()
#> cell_id type chr begin end allele src.allele class
#> 1 8774 A 22 10303470 10503469 2 1 driver
#> 2 8774 D 22 5010000 5209999 1 NA driver
The method PhylogeneticForest$get_samples_info()
is
analogous to DescendantForest$get_samples_info()
: it
returns a dataframe containing information about the forest samples.
However, the former method adds two columns to the result of the latter:
“DNA_quantity
” and “equivalent_normal_cells
”.
The former column reports the overall quantity of tumoral DNA in the
sample, i.e., the sum of the lengths of all the alleles in the
corresponding sample. This quantity is a natural number, nevertheless,
it is stored as a real number as it usually usually exceeds the largest
natural number that can be natively represented by R. The column
“equivalent_normal_cells
” instead contains the number of
normal cells that contain as much DNA as the sample tumoral cells.
phylo_forest$get_samples_info()
#> name id xmin ymin xmax ymax tumour_cells tumour_cells_in_bbox time
#> 1 S_1_1 0 480 480 530 530 2588 2588 365.9279
#> 2 S_1_2 1 500 500 550 550 1632 1632 365.9279
#> 3 S_2_1 2 370 498 394 522 572 572 421.2143
#> 4 S_2_2 3 370 373 394 397 604 604 421.2143
#> DNA_quantity equivalent_normal_cells
#> 1 530855879155 5173.573
#> 2 333071264048 3246.020
#> 3 117255398095 1142.738
#> 4 123951831456 1208.000
The method PhylogeneticForest$get_germline_mutations()
returns the SNVs and the indels in the germline.
# extract the germline mutation
phylo_forest$get_germline_mutations() %>% head()
#> chr chr_pos allele ref alt type cause class
#> 1 22 16051493 0 G A SNV germinal
#> 2 22 16052167 0 A AAAAC indel germinal
#> 3 22 16053659 0 A C SNV germinal
#> 4 22 16054740 0 A G SNV germinal
#> 5 22 16055942 0 C T SNV germinal
#> 6 22 16058070 0 A G SNV germinal
Users can also identify the cell in which a mutation emerged even when the cell was not sampled.
# select one of the mutations
mutation_row <- phylo_forest$get_sampled_cell_mutations(cell_id)[2, ]
# rebuild the corresponding mutation
mutation <- Mutation(mutation_row["chr"][1, ],
mutation_row["chr_pos"][1, ],
mutation_row["ref"][1, ],
mutation_row["alt"][1, ])
# get the identifier of the oldest cells in which the mutation occurs
phylo_forest$get_first_occurrences(mutation)
#> [[1]]
#> [1] 0
The exposures used in placing the mutations on the cells in the
phylogenetic forest can be obtained by using the method
PhylogeneticForest$get_exposures()
.
# get the exposures used in placing the mutations
phylo_forest$get_exposures()
#> time signature exposure type
#> 1 0 SBS1 0.8 SNV
#> 2 0 SBS13 0.2 SNV
#> 3 100 SBS2 0.2 SNV
#> 4 100 SBS3 0.5 SNV
#> 5 100 SBS5 0.3 SNV
#> 6 120 SBS2 0.2 SNV
#> 7 120 SBS3 0.5 SNV
#> 8 120 SBS5 0.3 SNV
#> 9 0 ID13 0.2 indel
#> 10 0 ID2 0.6 indel
#> 11 0 ID21 0.2 indel
#> 12 120 ID1 0.8 indel
#> 13 120 ID9 0.2 indel
The method
MutationEngine$get_bulk_allelic_fragmentation()
returns a
dataframe reporting the allelic type per genome fragment.
# get the name of the first sample
sample_name <- phylo_forest$get_samples_info()[["name"]][1]
# print the bulk allelic fragmentation
phylo_forest$get_bulk_allelic_fragmentation(sample_name) %>% head()
#> chr begin end major minor ratio
#> 1 22 1 5009999 2 2 0.998840804
#> 2 22 1 5009999 3 2 0.001159196
#> 3 22 5010000 5209999 2 0 0.998840804
#> 4 22 5010000 5209999 3 0 0.001159196
#> 5 22 5210000 10303469 2 2 0.998840804
#> 6 22 5210000 10303469 3 2 0.001159196
Instead, the method
MutationEngine$get_cell_allelic_fragmentation()
returns the
allelic fragmentation per cell.
# print the cell allelic fragmentation
phylo_forest$get_cell_allelic_fragmentation() %>% head()
#> cell_id chr begin end major minor
#> 1 2790 22 1 5009999 2 2
#> 2 2790 22 5010000 5209999 2 0
#> 3 2790 22 5210000 10303469 2 2
#> 4 2790 22 10303470 10503469 4 2
#> 5 2790 22 10503470 51304566 2 2
#> 6 5000 22 1 5009999 2 2
The details about the SNV and indel signatures adopted during the
evolution are available in the mutation engine and they can be retrieved
by using the methods MutationEngine$get_SNV_signatures()
and MutationEngine$get_indel_signatures()
.
# get the SNV signatures used in placing the mutations
m_engine$get_SNV_signatures()[1:6, 1:5]
#> Type SBS1 SBS2 SBS3 SBS4
#> 1 A[C>A]A 0.000886157 5.80000e-07 0.02080832 0.042196498
#> 2 A[C>A]C 0.002280405 1.48004e-04 0.01650660 0.033297236
#> 3 A[C>A]G 0.000177031 5.23000e-05 0.00175070 0.015598705
#> 4 A[C>A]T 0.001280227 9.78000e-05 0.01220488 0.029497552
#> 5 A[C>G]A 0.001860330 2.23000e-16 0.01970788 0.006889428
#> 6 A[C>G]C 0.001220217 1.33004e-04 0.01170468 0.002839764
# get the indel signatures used in placing the mutations
m_engine$get_indel_signatures()[1:6, 1:5]
#> Type ID1 ID2 ID3 ID4
#> 1 1:Del:C:0 1.598890e-04 0.004824116 0.12472711 0.007249717
#> 2 1:Del:C:1 7.735230e-04 0.000022100 0.20887617 0.002734869
#> 3 1:Del:C:2 3.310000e-18 0.000003110 0.17632422 0.002041063
#> 4 1:Del:C:3 1.907613e-03 0.002472076 0.06404276 0.001112283
#> 5 1:Del:C:4 7.059900e-04 0.003856976 0.04398981 0.001075684
#> 6 1:Del:C:5 3.370115e-03 0.014615424 0.02241749 0.002246178
Finally, the data of the subject whose germline corresponds to
wild-type genome in the phylogenetic forest can be obtained by the
method Phylogenetic$get_germline_subject()
.
phylo_forest$get_germline_subject()
#> sample pop super_pop gender
#> 1 NA20513 TSI EUR male
Storing Phylogenetic Forests
As in the case of the samples forests, the phylogenetic forests can
be saved by using the method PhylogeneticForest$save()
and
load by the function load_phylogenetic_forest()
.
# save the phylogenetic forest in the file "phylo_forest.sff"
phylo_forest$save("phylo_forest.sff")
#> [█---------------------------------------] 0% [00m:00s] Saving forest [█---------------------------------------] 0% [00m:00s] Saving forest [█---------------------------------------] 0% [00m:01s] Saving forest [█---------------------------------------] 0% [00m:02s] Saving forest [█---------------------------------------] 0% [00m:04s] Saving forest [█---------------------------------------] 0% [00m:05s] Saving forest [█---------------------------------------] 0% [00m:06s] Saving forest [█---------------------------------------] 0% [00m:07s] Saving forest [█---------------------------------------] 0% [00m:08s] Saving forest [█---------------------------------------] 0% [00m:09s] Saving forest [████------------------------------------] 9% [00m:10s] Saving forest [█████████-------------------------------] 20% [00m:11s] Saving forest [█████████████---------------------------] 31% [00m:12s] Saving forest [█████████████████-----------------------] 42% [00m:13s] Saving forest [██████████████████████------------------] 52% [00m:14s] Saving forest [██████████████████████████--------------] 64% [00m:15s] Saving forest [███████████████████████████████---------] 75% [00m:16s] Saving forest [███████████████████████████████████-----] 86% [00m:17s] Saving forest [███████████████████████████████████████-] 96% [00m:18s] Saving forest [████████████████████████████████████████] 100% [00m:18s] Forest saved
# loading the saved forest
loaded_phylo_forest <- load_phylogenetic_forest("phylo_forest.sff")
#> [█---------------------------------------] 0% [00m:00s] Loading forest [█████-----------------------------------] 10% [00m:01s] Loading forest [███████---------------------------------] 17% [00m:02s] Loading forest [██████████------------------------------] 24% [00m:03s] Loading forest [█████████████---------------------------] 31% [00m:04s] Loading forest [████████████████------------------------] 38% [00m:05s] Loading forest [███████████████████---------------------] 45% [00m:06s] Loading forest [█████████████████████-------------------] 52% [00m:07s] Loading forest [████████████████████████----------------] 58% [00m:08s] Loading forest [███████████████████████████-------------] 66% [00m:09s] Loading forest [██████████████████████████████----------] 73% [00m:10s] Loading forest [█████████████████████████████████-------] 80% [00m:12s] Loading forest [███████████████████████████████████-----] 87% [00m:13s] Loading forest [██████████████████████████████████████--] 94% [00m:14s] Loading forest [████████████████████████████████████████] 100% [00m:15s] Forest loaded
loaded_phylo_forest
#> PhylogeneticForest
#> # of trees: 1
#> # of nodes: 18845
#> # of leaves: 5396
#> samples: {"S_1_1", "S_1_2", "S_2_1", "S_2_2"}
#>
#> # of emerged SNVs and indels: 113353
#> # of emerged CNAs: 39
Getting and Setting the Reference Genome Path
The phylogenetic forest object contains the reference genome FASTA
file path. The methods
PhylogeneticForest$get_reference_path()
and
PhylogeneticForest$set_reference_path()
can be used to get
the path and set it, respectively.
phylo_forest$get_reference_path()
#> [1] "/Users/alberto/Library/CloudStorage/Dropbox/Lavoro/Code/rRACES/rRACES/vignettes/demo/reference.fasta"
phylo_forest$set_reference_path("demo/reference.fasta")
phylo_forest$get_reference_path()
#> [1] "/Users/alberto/Library/CloudStorage/Dropbox/Lavoro/Code/rRACES/rRACES/vignettes/demo/reference.fasta"