Disclaimer: ProCESS/CLONES internally implements the probability distributions using the C++11 random number distribution classes. The standard does not specify their algorithms, and the class implementations are left free for the compiler. Thus, the simulation output depends on the compiler used to compile CLONES, and because of that, the results reported in this article may differ from those obtained by the reader.
ProCESS/CLONES can simulate genomic mutations in the cells
represented in a SampleForest according to the specified
SBS and indel mutational signatures (see (Alexandrov et al. 2020)). This process is
performed by the class MutationEngine, which also takes
into account the mutation rate of the simulated species and gives the
chance to dynamically change the signatures.
Setting Up Mutation Engine
The creation of an object of the type MutationEngine
requires downloading the reference sequence and the signature files and
building the corresponding two signature indices. The function
MutationEngine() performs all these steps in a single
call.
This function is quite flexible and allows customising the mutation engine in many ways. However, the vast majority of ProCESS users would aim for a standard setup, for instance, involving the human genome.
Because of this, ProCESS provides some predefined setups. Their list
can be obtained by invoking the function
get_mutation_engine_codes().
library(ProCESS)
get_mutation_engine_codes()
#> name description
#> 1 GRCh37 Homo sapiens (GRCh37)
#> 2 GRCh38 Homo sapiens (GRCh38)
#> 3 demo A demonstrative set-upLet us build a MutationEngine by using the
"demo" setup.
# building a mutation engine by using the "demo" setup
m_engine <- MutationEngine(setup_code = "demo")
#> Downloading reference genome...
#> Reference genome downloaded
#> Decompressing reference genome...done
#> Downloading signature files...
#> Signature file downloaded
#> Downloading driver mutation file...
#> Driver mutation file downloaded
#> Decompressing driver mutation file...done
#> Downloading passenger CNAs file...
#> Passenger CNAs file downloaded
#> Decompressing passenger CNAs file...done
#> Downloading germline...
#> Germline downloaded
#> Decompressing mutations...
#> done
#> Building context index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█████████████████-----------------------] 40% [00m:00s] Processing chr. 22 [█████████████████████████████████-------] 81% [00m:01s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:02s] Context index built
#> [█---------------------------------------] 0% [00m:00s] Saving context index [████████████████████████████████████████] 100% [00m:00s] Context index saved
#> done
#> Building repeated sequence index...
#> [█---------------------------------------] 0% [00m:00s] Reading 22 [████████████████████████████████████████] 100% [00m:01s] Reading 22
#> [████████████████████████████████████████] 100% [00m:01s] Reading 22
#> [████████████████████████████████████████] 100% [00m:01s] Reading 22
#> [████████████████████████████████████████] 100% [00m:01s] Reading 22
#> [████████████████████████████████████████] 100% [00m:01s] Reading 22
#> [████████████████████████████████████████] 100% [00m:01s] Reading 22
#> [████████████████████████████████████████] 100% [00m:01s] Reading 22
#> [████████████████████████████████████████] 100% [00m:01s] RS index built
#> [█---------------------------------------] 0% [00m:00s] Saving RS index [█---------------------------------------] 0% [00m:00s] Saving RS index [████████████████████████████------------] 68% [00m:02s] Saving RS indexdone
#> [████████████████████████████████████████] 100% [00m:02s] RS index saved
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#> [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved
m_engine
#> MutationEngine
#> Passenger rates
#>
#> Driver mutations
#>
#> Timed Exposure
#> SBS Timed Exposures
#>
#> indel Timed ExposuresThe above call creates the directory demo, downloads all
the data required by the mutation engine, and builds it.
dir.exists("demo")
#> [1] TRUE
list.files("demo")
#> [1] "context_index_100.cif" "drivers.txt"
#> [3] "germline_data" "indel_signatures.txt"
#> [5] "passenger_CNAs.txt" "reference.fasta"
#> [7] "rs_index_50_500000.rsif" "SBS_signatures.txt"
#> [9] "sources.csv"The execution of the function MutationEngine() may take
some time, but it is meant to be performed one for all and, as long as
the user does not need to change the reference genome or the signature
files, it is no longer required. In this spirit, any call to
MutationEngine() exclusively performs the building process
sub-tasks that have not been already fulfilled.
# building a mutation engine by using the "demo" set-up configuration
m_engine <- MutationEngine(setup_code = "demo")
#> [█---------------------------------------] 0% [00m:00s] Loading context index [████████████████████████████████████████] 100% [00m:00s] Context index loaded
#> [█---------------------------------------] 0% [00m:00s] Loading RS index [██████████████--------------------------] 33% [00m:01s] Loading RS index [███████████████████████████-------------] 65% [00m:02s] Loading RS index [███████████████████████████████████████-] 96% [00m:03s] Loading RS index [████████████████████████████████████████] 100% [00m:03s] RS index loaded
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loadedMutant Genetic Characterization
Once the mutation engine has been built, we can define a mutant genotype and declare species mutation rates.
Let us consider the simulation of Section “Two populations with
epigenetic state” in this article. It
involves the two mutants A and B. Both of them
have two possible epigenetic states, + and -,
leading to the four species A+, A-,
B+, and B-, respectively. Each of these
species has a passenger mutation rate that must be declared to the
mutation engine before the sample forest is labelled. Let 1e-9, 3e-8,
8e-7, and 5e-8 be the passenger SNV rates for the species
A+, A-, B+, and B-,
respectively. Let the indel rates be 0 for all the species, but
A+, which has an indel rate of 1e-8. Furthermore, let 0,
1e-11, 0, and 0 be the passenger CNA rates of the same species,
respectively.
The two mutants may also be genetically characterised by some driver mutations. The driver mutations associated with each mutant must occur in any cell belonging to that mutant. Hence, they must be declared to the mutant engine before the labelling.
The method MutationEngine$add_mutant()
takes care of all these declarations.
For the sake of example, let us assume that A is
characterised by one driver mutation on chromosome 22, while
B has three driver mutations on the same chromosome.
The genetic specification of the mutant A can be
declared as it follows.
# add the mutant "A" characterized by one driver SNV on the allele 1 of
# chromosome 22, one indel deletion on the same chromosome, two CNAs (a
# deletion on the allele 1 of chromosome 22 and an amplification on a random
# allele of the same chromosome) and, finally, a whole genome doubling event
# (WGD). The mutant has two epigenetic states and its species "A+" and "A-"
# have passenger SNV rates 1e-9 and 3e-9, respectively, passenger indel rates
# 1e-9 and 0, respectively, and passenger CNA rates 0 and 1e-11, respectively.
m_engine$add_mutant(mutant_name = "A",
passenger_rates = list("+" = c(SNV = 1e-9, indel = 1e-9),
"-" = c(SNV = 3e-9, CNA = 1e-11)),
drivers = list(SNV("22", 10510210, "C", allele = 1),
Mutation("22", 16085675, "GCCTCCCGA", "G"),
CNA(type = "A", chr = "22",
chr_pos = 10303470, len = 200000),
CNA("D", "22", 5010000, 200000,
allele = 1),
WGD))
#> [█---------------------------------------] 0% [00m:00s] Retrieving "A" SIDs [█---------------------------------------] 0% [00m:00s] Found 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [██████████████--------------------------] 33% [00m:01s] Reading 22 [███████████████████████████-------------] 66% [00m:02s] Reading 22 [████████████████████████████████████████] 100% [00m:03s] Reading 22
#> [████████████████████████████████████████] 100% [00m:03s] "A"'s SIDs validated
m_engine
#> MutationEngine
#> Passenger rates
#> "A+": {SNV: 1e-09, indel: 1e-09}
#> "A-": {SNV: 3e-09, CNA: 1e-11}
#>
#> Driver mutations
#> "A":
#> (chr22(10510210)[N>C]) on allele 1
#> (chr22(16085675)[GCCTCCCGA>G]) on random allele
#> CNA("A",chr22(10303470), len: 200000)
#> CNA("D",chr22(5010000), len: 200000, allele: 1)
#> Whole genome duplication
#>
#> Timed Exposure
#> SBS Timed Exposures
#>
#> indel Timed ExposuresIn the above code, the driver CNAs, SNVs, and indels are declared by
calling the functions CNA(), SNV(), and
Mutation(), respectively. These functions allow specifying
the allele in which the mutation must lie. The mutant A is
also characterised by whole genome doubling (WGD):
a genomic event that simultaneously duplicates all the chromosome
alleles.
The driver mutations are applied in the order specified. For
instance, the following two pieces of code specify two different genomic
characterisations for the same mutant E.
m_engine$add_mutant("E", passenger_rates,
drivers = list(SNV("22", 10510210, "C", allele = 1),
WGD))
m_engine$add_mutant("E", passenger_rates,
drivers = list(WGD,
SNV("22", 10510210, "C", allele = 1)))The former snippet places an SNV and, afterwards, duplicates all the alleles producing a genome in which two alleles contain the placed SNV. Instead, the latter snippet requires a whole genome doubling event and, then, places the SNV yielding a single occurrence of the SNV in the final genome.
As far as SNVs and indels are concerned, ProCESS provides users with
a more compact and, at times, convenient approach. The mutation engine
stores a list of known driver mutations and labels each with a code that
can be used during mutant declaration. The corresponding data frame can
be obtained by using the method MutationEngine$get_known_drivers().
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
m_engine$get_known_drivers() %>% filter(chr == "22") %>% head()
#> chr from to ref alt mutation_type driver_gene driver_code
#> 1 22 20073503 20073503 G A SNV DGCR8 DGCR8 S6N
#> 2 22 20073503 20073503 G A SNV DGCR8 DGCR8 S6N
#> 3 22 20073503 20073503 G A SNV DGCR8 DGCR8 S6N
#> 4 22 20073503 20073503 G A SNV DGCR8 DGCR8 S6N
#> 5 22 20073511 20073511 C T SNV DGCR8 DGCR8 P9S
#> 6 22 20073511 20073511 C T SNV DGCR8 DGCR8 P9S
#> driver_CDS tumour_type
#> 1 c.17G>A DLBCLNOS
#> 2 c.17G>A PLMESO
#> 3 c.17G>A WDTC
#> 4 c.17G>A WT
#> 5 c.25C>T DLBCLNOS
#> 6 c.25C>T PLMESOThe code of the known driver mutations can be used in place of the full specification as follows.
# add the mutant "B" characterized by two driver SNVs on chromosome 22 (no
# CNA) and two epigenetic states. The first SNV is "DGCR8 P26L" and it must
# lay in the allele 1. The remaining SNV is specified by using the SNV function
# as done above. The species "B+" and "B-" have passenger SNV rates 8e-9
# and 2e-8, respectively, and CNA rates 0 for both species.
m_engine$add_mutant("B", list("+" = c(SNV = 8e-9), "-" = c(SNV = 2e-8)),
list(list("DGCR8 P26L", allele = 1), # the first SNV
SNV("22", 12028576, "G"))) # the second SNV
#> [█---------------------------------------] 0% [00m:00s] Retrieving "B" SIDs [████████████████████████████████████████] 100% [00m:00s] "B"'s SIDs validated
m_engine
#> MutationEngine
#> Passenger rates
#> "A+": {SNV: 1e-09, indel: 1e-09}
#> "A-": {SNV: 3e-09, CNA: 1e-11}
#> "B+": {SNV: 8e-09}
#> "B-": {SNV: 2e-08}
#>
#> Driver mutations
#> "A":
#> (chr22(10510210)[N>C]) on allele 1
#> (chr22(16085675)[GCCTCCCGA>G]) on random allele
#> CNA("A",chr22(10303470), len: 200000)
#> CNA("D",chr22(5010000), len: 200000, allele: 1)
#> Whole genome duplication
#> "B":
#> DGCR8 P26L (chr22(20073563)[C>T]) on allele 1
#> (chr22(12028576)[N>G]) on random allele
#>
#> Timed Exposure
#> SBS Timed Exposures
#>
#> indel Timed ExposuresMutational Exposures
The probability of a mutation occurring depends on both its genomic and environmental context.
A signature is a mutation probability distribution over mutation contexts or mutation structure. SBS (single-base substitution) signatures provide for any genomic context (i.e., a triplet of bases) the probability that an SNV occurs on that context. On the contrary, ID (indel) signatures associate the probability of an indel to its length and structure (see (Alexandrov et al. 2020)).
The signature depends on the environmental context and, as a result, more than one signature may be active at the same time with different probabilities. A mutational exposure (or exposure) is a discrete probability distribution among signatures.
To simulate passenger mutations of a given type, we need to specify a default exposure for that type. This can be achieved as follows.
# add SNV and indel default exposures. This will be used from simulated time 0
# up to the successive exposure change.
m_engine$add_exposure(coefficients = c(SBS13 = 0.2, SBS1 = 0.8))
m_engine$add_exposure(c(ID2 = 0.6, ID13 = 0.2, ID21 = 0.2))Further exposures can also be defined by specifying an activation time for each of them, i.e., the new exposures will be used from that until the next exposure change.
# add a new SNV exposure that will be used from simulated
# time 100 up to the successive exposure change.
m_engine$add_exposure(time = 100, c(SBS5 = 0.3, SBS2 = 0.2, SBS3 = 0.5))
m_engine
#> MutationEngine
#> Passenger rates
#> "A+": {SNV: 1e-09, indel: 1e-09}
#> "A-": {SNV: 3e-09, CNA: 1e-11}
#> "B+": {SNV: 8e-09}
#> "B-": {SNV: 2e-08}
#>
#> Driver mutations
#> "A":
#> (chr22(10510210)[N>C]) on allele 1
#> (chr22(16085675)[GCCTCCCGA>G]) on random allele
#> CNA("A",chr22(10303470), len: 200000)
#> CNA("D",chr22(5010000), len: 200000, allele: 1)
#> Whole genome duplication
#> "B":
#> DGCR8 P26L (chr22(20073563)[C>T]) on allele 1
#> (chr22(12028576)[N>G]) on random allele
#>
#> Timed Exposure
#> SBS Timed Exposures
#> [0, 100[: {"SBS1": 0.8, "SBS13": 0.2}
#> [100, ∞[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#>
#> indel Timed Exposures
#> [0, ∞[: {"ID13": 0.2, "ID2": 0.6, "ID21": 0.2}The default exposures and the simultaneous changes in SNV and ID
exposures can be specified by a single call to the function MutationEngine$add_exposure()
as follows.
m_engine$add_exposure(time = 120, c(SBS5 = 0.3, SBS2 = 0.2, SBS3 = 0.5,
ID1 = 0.8, ID9 = 0.2))
m_engine
#> MutationEngine
#> Passenger rates
#> "A+": {SNV: 1e-09, indel: 1e-09}
#> "A-": {SNV: 3e-09, CNA: 1e-11}
#> "B+": {SNV: 8e-09}
#> "B-": {SNV: 2e-08}
#>
#> Driver mutations
#> "A":
#> (chr22(10510210)[N>C]) on allele 1
#> (chr22(16085675)[GCCTCCCGA>G]) on random allele
#> CNA("A",chr22(10303470), len: 200000)
#> CNA("D",chr22(5010000), len: 200000, allele: 1)
#> Whole genome duplication
#> "B":
#> DGCR8 P26L (chr22(20073563)[C>T]) on allele 1
#> (chr22(12028576)[N>G]) on random allele
#>
#> Timed Exposure
#> SBS Timed Exposures
#> [0, 100[: {"SBS1": 0.8, "SBS13": 0.2}
#> [100, 120[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#> [120, ∞[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#>
#> indel Timed Exposures
#> [0, 120[: {"ID13": 0.2, "ID2": 0.6, "ID21": 0.2}
#> [120, ∞[: {"ID1": 0.8, "ID9": 0.2}Passenger CNAs and Tumour Type
The passenger CNAs applied during the simulation depend on tumour
type. The type of tumour can be specified when building the mutation
engine by using the parameters tumour_type or
tumour_study. For instance, if the passenger CNA file used
to build the mutation engine contains some of the data identified in
breast carcinoma in the UK, the specific CNA can be applied during the
simulation as follows.
m_engine <- MutationEngine(setup_code = "demo",
tumour_type = "BRCA",
tumour_study = "UK")The complete list of available tumour types in a set-up can be
obtained by using the function
get_available_tumours_in().
get_available_tumours_in(setup_code = "demo") %>% head()
#> type
#> 1 ACC
#> 2 AML
#> 3 ANGS
#> 4 ANSC
#> 5 BCC
#> 6 BLCAGermline Mutations
ProCESS allows users to apply germline mutations from one of the subjects in the germline data used to build the mutation engine. This feature will enable users to simulate a specific cancer type’s evolution on an individual with the desired gender and ethnicity.
The available subjects, together with their sex and ethnicity, can be
obtained by using the method MutationEngine$get_germline_subjects().
subjects <- m_engine$get_germline_subjects()
subjects
#> sample pop super_pop gender
#> 1 NA18941 JPT EAS female
#> 2 NA20513 TSI EUR maleThe column sample contains the names of the available
subjects. The columns pop and super_pop report
the subjects’ population and super-population codes. The last column,
gender, includes the subject gender.
The method MutationEngine$get_population_descriptions()
clarifies the meaning of the codes reported in the pop columns.
m_engine$get_population_descriptions()
#> code description
#> 1 CHB Han Chinese
#> 2 JPT Japanese
#> 3 CHS Southern Han Chinese
#> 4 CDX Dai Chinese
#> 5 KHV Kinh Vietnamese
#> 6 CHD Denver Chinese
#> 7 CEU CEPH
#> 8 TSI Tuscan
#> 9 GBR British
#> 10 FIN Finnish
#> 11 IBS Spanish
#> 12 YRI Yoruba
#> 13 LWK Luhya
#> 14 GWD Gambian
#> 15 MSL Mende
#> 16 ESN Esan
#> 17 ASW African-American SW
#> 18 ACB African-Caribbean
#> 19 MXL Mexican-American
#> 20 PUR Puerto Rican
#> 21 CLM Colombian
#> 22 PEL Peruvian
#> 23 GIH Gujarati
#> 24 PJL Punjabi
#> 25 BEB Bengali
#> 26 STU Sri Lankan
#> 27 ITU Indian
#> long.description
#> 1 Han Chinese in Beijing, China
#> 2 Japanese in Tokyo, Japan
#> 3 Han Chinese South
#> 4 Chinese Dai in Xishuangbanna, China
#> 5 Kinh in Ho Chi Minh City, Vietnam
#> 6 Chinese in Denver, Colorado (pilot 3 only)
#> 7 Utah residents (CEPH) with Northern and Western European ancestry
#> 8 Toscani in Italia
#> 9 British in England and Scotland
#> 10 Finnish in Finland
#> 11 Iberian populations in Spain
#> 12 Yoruba in Ibadan, Nigeria
#> 13 Luhya in Webuye, Kenya
#> 14 Gambian in Western Division, The Gambia
#> 15 Mende in Sierra Leone
#> 16 Esan in Nigeria
#> 17 African Ancestry in Southwest US
#> 18 African Caribbean in Barbados
#> 19 Mexican Ancestry in Los Angeles, California
#> 20 Puerto Rican in Puerto Rico
#> 21 Colombian in Medellin, Colombia
#> 22 Peruvian in Lima, Peru
#> 23 Gujarati Indian in Houston, TX
#> 24 Punjabi in Lahore, Pakistan
#> 25 Bengali in Bangladesh
#> 26 Sri Lankan Tamil in the UK
#> 27 Indian Telugu in the UKThe method MutationEngine$get_active_germline()
returns the the active germline subject.
m_engine$get_active_germline()
#> sample pop super_pop gender
#> 1 NA18941 JPT EAS femaleUsers can change the germline subject using the method MutationEngine$set_germline_subject().
When a subject is selected for the first time, ProCESS builds a binary representation of the subject’s genome and save it for future use. This step may take a few minutes. However, all the successive selections of the same subject directly load the binary file.
m_engine$set_germline_subject(subjects[2, "sample"])
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#> [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved
m_engine$get_active_germline()
#> sample pop super_pop gender
#> 1 NA20513 TSI EUR maleBuilding Phylogenetic Forests
The configured mutation engine can be used to label each node in a sample forest with mutations.
Since the mutation engine has been configured to deal with the
simulation performed in Section “Two populations with epigenetic
state” in this article, we can load its
sample forest from the file "sample_forest.sff" as saved
there.
sample_forest <- load_sample_forest("sample_forest.sff")
# place mutations on the sample forest assuming 1000 pre-neoplastic SNVs and
# 500 indels
phylo_forest <- m_engine$place_mutations(sample_forest, 1000, 500)
#> [█---------------------------------------] 0% [00m:00s] Placing mutations [██████████████████████████████----------] 74% [00m:01s] Placing mutations [████████████████████████████████████████] 100% [00m:01s] Mutations placed
phylo_forest
#> PhylogeneticForest
#> # of trees: 1
#> # of nodes: 22471
#> # of leaves: 5364
#> samples: {"S_1_1", "S_1_2", "S_2_1", "S_2_2"}
#>
#> # of emerged SNVs and indels: 17147
#> # of emerged CNAs: 32The phylogenetic forest stores all mutations, labelling the sampled
cells represented by its leaves. Users can retrieve such data by using
the methods PhylogeneticForestget_sampled_cell_mutations()](../reference/PhylogeneticForest-cash-get_sampled_cell_mutations.html)</code>
and
<code>[PhylogeneticForestget_sampled_cell_CNAs().
library(dplyr)
# select the first mutations among all the mutations occurring in
# the genomes of the sampled cells
phylo_forest$get_sampled_cell_mutations() %>% head()
#> cell_id chr chr_pos allele ref alt type cause class
#> 1 7622 22 16085675 0 GCCTCCCGA G indel A driver
#> 2 7622 22 16109050 0 GC G indel ID1 pre-neoplastic
#> 3 7622 22 16113256 0 G A SNV SBS1 pre-neoplastic
#> 4 7622 22 16144912 0 A G SNV SBS1 pre-neoplastic
#> 5 7622 22 16160761 0 A G SNV SBS1 pre-neoplastic
#> 6 7622 22 16188578 0 AGAG A indel ID1 pre-neoplastic
# select the first CNAs among all the mutations occurring in
# the genomes of the sampled cells
phylo_forest$get_sampled_cell_CNAs() %>% head()
#> cell_id type chr begin end allele src.allele class
#> 1 7622 A 22 10303470 10503469 2 0 driver
#> 2 7622 D 22 5010000 5209999 1 NA driver
#> 3 11799 A 22 10303470 10503469 2 0 driver
#> 4 11799 D 22 5010000 5209999 1 NA driver
#> 5 12137 A 22 10303470 10503469 2 0 driver
#> 6 12137 D 22 5010000 5209999 1 NA driver
# get the sampled cells
sampled_cells <- phylo_forest$get_nodes() %>%
filter(!is.na(.data$sample))
# show the first of them
sampled_cells %>% head()
#> cell_id ancestor depth mutant epistate sample birth_time
#> 1 7622 1087 22 A - S_1_1 197.1944
#> 2 11799 9949 37 A - S_1_2 226.7489
#> 3 12137 7784 25 A - S_1_1 228.7809
#> 4 14474 8082 20 A - S_1_1 242.1959
#> 5 16113 12528 37 A - S_1_2 250.7551
#> 6 18662 7054 32 A - S_1_1 262.6765
# get the identifier of the 3rd cell in `sampled_cells`
cell_id <- sampled_cells[3, 1]
# get the SNVs and the indels of the 3rd cell in `sampled_cells`
phylo_forest$get_sampled_cell_mutations(cell_id) %>% head()
#> cell_id chr chr_pos allele ref alt type cause class
#> 1 12137 22 16085675 0 GCCTCCCGA G indel A driver
#> 2 12137 22 16109050 0 GC G indel ID1 pre-neoplastic
#> 3 12137 22 16113256 0 G A SNV SBS1 pre-neoplastic
#> 4 12137 22 16144912 0 A G SNV SBS1 pre-neoplastic
#> 5 12137 22 16160761 0 A G SNV SBS1 pre-neoplastic
#> 6 12137 22 16188578 0 AGAG A indel ID1 pre-neoplastic
# get the CNAs of the 3rd cell in `sampled_cells`
phylo_forest$get_sampled_cell_CNAs(cell_id) %>% head()
#> cell_id type chr begin end allele src.allele class
#> 1 12137 A 22 10303470 10503469 2 0 driver
#> 2 12137 D 22 5010000 5209999 1 NA driverThe method PhylogeneticForestget_samples_info()](../reference/PhylogeneticForest-cash-get_samples_info.html)</code>
is analogous to
<code>[SampleForestget_samples_info(): it
returns a data frame containing information about the forest samples.
However, the former method adds two columns to the result of the latter:
“DNA_quantity” and “equivalent_normal_cells”.
The former column reports the total quantity of tumoral DNA in the
sample, i.e., the sum of the lengths of all alleles in the sample. This
quantity is a natural number, nevertheless, it is stored as a real
number as it usually exceeds the largest natural number that can be
natively represented by R. The column
“equivalent_normal_cells” instead contains the number of
normal cells that contain as much DNA as the sample tumour cells.
phylo_forest$get_samples_info()
#> name id xmin ymin xmax ymax tumour_cells tumour_cells_in_bbox time
#> 1 S_1_1 0 480 480 530 530 2560 2560 555.7961
#> 2 S_1_2 1 500 500 550 550 1619 1619 555.7961
#> 3 S_2_1 2 385 509 409 533 572 572 620.9211
#> 4 S_2_2 3 460 334 484 358 613 613 620.9211
#> DNA_quantity equivalent_normal_cells
#> 1 525587676186 5122.231
#> 2 332646308095 3241.878
#> 3 117099208770 1141.216
#> 4 125798795832 1226.000The method PhylogeneticForest$get_germline_mutations()
returns the SNVs and the indels in the germline.
# extract the germline mutation
phylo_forest$get_germline_mutations() %>% head()
#> chr chr_pos allele ref alt type cause class
#> 1 22 16051493 0 G A SNV germinal
#> 2 22 16052167 0 A AAAAC indel germinal
#> 3 22 16053659 0 A C SNV germinal
#> 4 22 16054740 0 A G SNV germinal
#> 5 22 16055942 0 C T SNV germinal
#> 6 22 16058070 0 A G SNV germinalUsers can also identify the cell in which a mutation emerged even when the cell was not sampled.
# select one of the mutations
mutation_row <- phylo_forest$get_sampled_cell_mutations(cell_id)[2, ]
# rebuild the corresponding mutation
mutation <- Mutation(mutation_row["chr"][1, ],
mutation_row["chr_pos"][1, ],
mutation_row["ref"][1, ],
mutation_row["alt"][1, ])
# get the identifier of the oldest cells in which the mutation occurs
phylo_forest$get_first_occurrences(mutation)
#> [[1]]
#> [1] 0The exposures used in placing the mutations on the cells in the
phylogenetic forest can be obtained by using the method PhylogeneticForest$get_exposures().
# get the exposures used in placing the mutations
phylo_forest$get_exposures()
#> time signature exposure type
#> 1 0 SBS1 0.8 SNV
#> 2 0 SBS13 0.2 SNV
#> 3 100 SBS2 0.2 SNV
#> 4 100 SBS3 0.5 SNV
#> 5 100 SBS5 0.3 SNV
#> 6 120 SBS2 0.2 SNV
#> 7 120 SBS3 0.5 SNV
#> 8 120 SBS5 0.3 SNV
#> 9 0 ID13 0.2 indel
#> 10 0 ID2 0.6 indel
#> 11 0 ID21 0.2 indel
#> 12 120 ID1 0.8 indel
#> 13 120 ID9 0.2 indelThe method MutationEngine$get_bulk_allelic_fragmentation()
returns a data frame reporting the allelic type per genome fragment.
# get the name of the first sample
sample_name <- phylo_forest$get_samples_info()[["name"]][1]
# print the bulk allelic fragmentation
phylo_forest$get_bulk_allelic_fragmentation(sample_name) %>% head()
#> chr begin end major minor ratio
#> 1 22 1 5009999 2 2 1.0000000
#> 2 22 5010000 5209999 2 0 1.0000000
#> 3 22 5210000 10303469 2 2 1.0000000
#> 4 22 10303470 10503469 4 2 1.0000000
#> 5 22 10503470 17098700 2 2 1.0000000
#> 6 22 17098701 19484880 2 2 0.9996094Instead, the method MutationEngine$get_cell_allelic_fragmentation()
returns the allelic fragmentation per cell.
# print the cell allelic fragmentation
phylo_forest$get_cell_allelic_fragmentation() %>% head()
#> cell_id chr begin end major minor
#> 1 7622 22 1 5009999 2 2
#> 2 7622 22 5010000 5209999 2 0
#> 3 7622 22 5210000 10303469 2 2
#> 4 7622 22 10303470 10503469 4 2
#> 5 7622 22 10503470 51304566 2 2
#> 6 11799 22 1 5009999 2 2The details about the SNV and indel signatures adopted during the
evolution are available in the mutation engine and they can be retrieved
by using the methods MutationEngineget_SNV_signatures()](../reference/MutationEngine-cash-get_SNV_signatures.html)</code>
and
<code>[MutationEngineget_indel_signatures().
# get the SNV signatures used in placing the mutations
m_engine$get_SNV_signatures()[1:6, 1:5]
#> Type SBS1 SBS2 SBS3 SBS4
#> 1 A[C>A]A 0.01041667 0.01041667 0.01041667 0.01041667
#> 2 A[C>A]C 0.01041667 0.01041667 0.01041667 0.01041667
#> 3 A[C>A]G 0.01041667 0.01041667 0.01041667 0.01041667
#> 4 A[C>A]T 0.01041667 0.01041667 0.01041667 0.01041667
#> 5 A[C>G]A 0.01041667 0.01041667 0.01041667 0.01041667
#> 6 A[C>G]C 0.01041667 0.01041667 0.01041667 0.01041667
# get the indel signatures used in placing the mutations
m_engine$get_indel_signatures()[1:6, 1:5]
#> Type ID1 ID2 ID3 ID4
#> 1 1:Del:C:0 0.01204819 0.01204819 0.01204819 0.01204819
#> 2 1:Del:C:1 0.01204819 0.01204819 0.01204819 0.01204819
#> 3 1:Del:C:2 0.01204819 0.01204819 0.01204819 0.01204819
#> 4 1:Del:C:3 0.01204819 0.01204819 0.01204819 0.01204819
#> 5 1:Del:C:4 0.01204819 0.01204819 0.01204819 0.01204819
#> 6 1:Del:C:5 0.01204819 0.01204819 0.01204819 0.01204819Finally, the data of the subject whose germline corresponds to
wild-type genome in the phylogenetic forest can be obtained by the
method PhylogeneticForest$get_germline_subject().
phylo_forest$get_germline_subject()
#> sample pop super_pop gender
#> 1 NA20513 TSI EUR maleStoring Phylogenetic Forests
As in the case of the sample forests, the phylogenetic forests can be
saved by using the method PhylogeneticForest$save()
and load by the function load_phylogenetic_forest().
# save the phylogenetic forest in the file "phylo_forest.sff"
phylo_forest$save("phylo_forest.sff")
#> [█---------------------------------------] 0% [00m:00s] Saving forest [█---------------------------------------] 0% [00m:00s] Saving forest [█---------------------------------------] 0% [00m:01s] Saving forest [█---------------------------------------] 0% [00m:02s] Saving forest [█---------------------------------------] 0% [00m:03s] Saving forest [█---------------------------------------] 0% [00m:04s] Saving forest [█---------------------------------------] 0% [00m:05s] Saving forest [███-------------------------------------] 6% [00m:07s] Saving forest [█████████-------------------------------] 22% [00m:08s] Saving forest [███████████████-------------------------] 37% [00m:09s] Saving forest [█████████████████████-------------------] 52% [00m:10s] Saving forest [███████████████████████████-------------] 67% [00m:11s] Saving forest [██████████████████████████████████------] 83% [00m:12s] Saving forest [████████████████████████████████████████] 99% [00m:13s] Saving forest [████████████████████████████████████████] 100% [00m:13s] Forest saved
# loading the saved forest
loaded_phylo_forest <- load_phylogenetic_forest("phylo_forest.sff")
#> [█---------------------------------------] 0% [00m:00s] Loading forest [█████-----------------------------------] 10% [00m:01s] Loading forest [████████--------------------------------] 19% [00m:02s] Loading forest [████████████----------------------------] 29% [00m:03s] Loading forest [████████████████------------------------] 38% [00m:04s] Loading forest [████████████████████--------------------] 48% [00m:05s] Loading forest [███████████████████████-----------------] 57% [00m:06s] Loading forest [███████████████████████████-------------] 66% [00m:07s] Loading forest [███████████████████████████████---------] 75% [00m:08s] Loading forest [██████████████████████████████████------] 84% [00m:09s] Loading forest [██████████████████████████████████████--] 93% [00m:10s] Loading forest [████████████████████████████████████████] 100% [00m:11s] Forest loaded
loaded_phylo_forest
#> PhylogeneticForest
#> # of trees: 1
#> # of nodes: 22471
#> # of leaves: 5364
#> samples: {"S_1_1", "S_1_2", "S_2_1", "S_2_2"}
#>
#> # of emerged SNVs and indels: 17147
#> # of emerged CNAs: 32Getting and Setting the Reference Genome Path
The phylogenetic forest object contains the reference genome FASTA
file path. The methods PhylogeneticForestget_reference_path()](../reference/PhylogeneticForest-cash-get_reference_path.html)</code>
and
<code>[PhylogeneticForestset_reference_path()
can be used to get the path and set it, respectively.
phylo_forest$get_reference_path()
#> [1] "/Users/alberto/Library/CloudStorage/Dropbox/Lavoro/Code/ProCESS/ProCESS/vignettes/demo/reference.fasta"
phylo_forest$set_reference_path("demo/reference.fasta")
phylo_forest$get_reference_path()
#> [1] "/Users/alberto/Library/CloudStorage/Dropbox/Lavoro/Code/ProCESS/ProCESS/vignettes/demo/reference.fasta"