Simulating Mutations • ProCESS

Disclaimer: ProCESS/CLONES internally implements the probability distributions using the C++11 random number distribution classes. The standard does not specify their algorithms, and the class implementations are left free for the compiler. Thus, the simulation output depends on the compiler used to compile CLONES, and because of that, the results reported in this article may differ from those obtained by the reader.

ProCESS/CLONES can simulate genomic mutations in the cells represented in a SampleForest according to the specified SBS and indel mutational signatures (see (Alexandrov et al. 2020)). This process is performed by the class MutationEngine, which also takes into account the mutation rate of the simulated species and gives the chance to dynamically change the signatures.

Setting Up Mutation Engine

The creation of an object of the type MutationEngine requires downloading the reference sequence and the signature files and building the corresponding two signature indices. The function MutationEngine() performs all these steps in a single call.

This function is quite flexible and allows customising the mutation engine in many ways. However, the vast majority of ProCESS users would aim for a standard setup, for instance, involving the human genome.

Because of this, ProCESS provides some predefined setups. Their list can be obtained by invoking the function get_mutation_engine_codes().

library(ProCESS)

get_mutation_engine_codes()
#>     name            description
#> 1 GRCh37  Homo sapiens (GRCh37)
#> 2 GRCh38  Homo sapiens (GRCh38)
#> 3   demo A demonstrative set-up

Let us build a MutationEngine by using the "demo" setup.

# building a mutation engine by using the "demo" setup
m_engine <- MutationEngine(setup_code = "demo")
#> Downloading reference genome...
#> Reference genome downloaded
#> Decompressing reference genome...done
#> Downloading signature files...
#> Signature file downloaded
#> Downloading driver mutation file...
#> Driver mutation file downloaded
#> Decompressing driver mutation file...done
#> Downloading passenger CNAs file...
#> Passenger CNAs file downloaded
#> Decompressing passenger CNAs file...done
#> Downloading germline...
#> Germline downloaded
#> Decompressing mutations...
#> done
#> Building context index...
#>  [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█████████████████-----------------------] 40% [00m:00s] Processing chr. 22 [█████████████████████████████████-------] 81% [00m:01s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:02s] Context index built
#>  [█---------------------------------------] 0% [00m:00s] Saving context index [████████████████████████████████████████] 100% [00m:00s] Context index saved
#> done
#> Building repeated sequence index...
#>  [█---------------------------------------] 0% [00m:00s] Reading 22 [████████████████████████████████████████] 100% [00m:01s] Reading 22
#>  [████████████████████████████████████████] 100% [00m:01s] Reading 22
#>  [████████████████████████████████████████] 100% [00m:01s] Reading 22
#>  [████████████████████████████████████████] 100% [00m:01s] Reading 22
#>  [████████████████████████████████████████] 100% [00m:01s] Reading 22
#>  [████████████████████████████████████████] 100% [00m:01s] Reading 22
#>  [████████████████████████████████████████] 100% [00m:01s] Reading 22
#>  [████████████████████████████████████████] 100% [00m:01s] RS index built
#>  [█---------------------------------------] 0% [00m:00s] Saving RS index [█---------------------------------------] 0% [00m:00s] Saving RS index [████████████████████████████------------] 68% [00m:02s] Saving RS indexdone
#>  [████████████████████████████████████████] 100% [00m:02s] RS index saved
#>  [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#>  [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved

m_engine
#> MutationEngine
#>  Passenger rates
#> 
#>  Driver mutations
#> 
#>  Timed Exposure
#>    SBS Timed Exposures
#> 
#>    indel Timed Exposures

The above call creates the directory demo, downloads all the data required by the mutation engine, and builds it.

dir.exists("demo")
#> [1] TRUE

list.files("demo")
#> [1] "context_index_100.cif"   "drivers.txt"            
#> [3] "germline_data"           "indel_signatures.txt"   
#> [5] "passenger_CNAs.txt"      "reference.fasta"        
#> [7] "rs_index_50_500000.rsif" "SBS_signatures.txt"     
#> [9] "sources.csv"

The execution of the function MutationEngine() may take some time, but it is meant to be performed one for all and, as long as the user does not need to change the reference genome or the signature files, it is no longer required. In this spirit, any call to MutationEngine() exclusively performs the building process sub-tasks that have not been already fulfilled.

# building a mutation engine by using the "demo" set-up configuration
m_engine <- MutationEngine(setup_code = "demo")
#>  [█---------------------------------------] 0% [00m:00s] Loading context index [████████████████████████████████████████] 100% [00m:00s] Context index loaded
#>  [█---------------------------------------] 0% [00m:00s] Loading RS index [██████████████--------------------------] 33% [00m:01s] Loading RS index [███████████████████████████-------------] 65% [00m:02s] Loading RS index [███████████████████████████████████████-] 96% [00m:03s] Loading RS index [████████████████████████████████████████] 100% [00m:03s] RS index loaded
#>  [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded

Mutant Genetic Characterization

Once the mutation engine has been built, we can define a mutant genotype and declare species mutation rates.

Let us consider the simulation of Section “Two populations with epigenetic state” in this article. It involves the two mutants A and B. Both of them have two possible epigenetic states, + and -, leading to the four species A+, A-, B+, and B-, respectively. Each of these species has a passenger mutation rate that must be declared to the mutation engine before the sample forest is labelled. Let 1e-9, 3e-8, 8e-7, and 5e-8 be the passenger SNV rates for the species A+, A-, B+, and B-, respectively. Let the indel rates be 0 for all the species, but A+, which has an indel rate of 1e-8. Furthermore, let 0, 1e-11, 0, and 0 be the passenger CNA rates of the same species, respectively.

The two mutants may also be genetically characterised by some driver mutations. The driver mutations associated with each mutant must occur in any cell belonging to that mutant. Hence, they must be declared to the mutant engine before the labelling.

The method MutationEngine$add_mutant() takes care of all these declarations.

For the sake of example, let us assume that A is characterised by one driver mutation on chromosome 22, while B has three driver mutations on the same chromosome.

The genetic specification of the mutant A can be declared as it follows.

# add the mutant "A" characterized by one driver SNV on the allele 1 of
# chromosome 22, one indel deletion on the same chromosome, two CNAs (a
# deletion on the allele 1 of chromosome 22 and an amplification on a random
# allele of the same chromosome) and, finally, a whole genome doubling event
# (WGD). The mutant has two epigenetic states and its species "A+" and "A-"
# have passenger SNV rates 1e-9 and 3e-9, respectively, passenger indel rates
# 1e-9 and 0, respectively, and passenger CNA rates 0 and 1e-11, respectively.
m_engine$add_mutant(mutant_name = "A",
                    passenger_rates = list("+" = c(SNV = 1e-9, indel = 1e-9),
                                           "-" = c(SNV = 3e-9, CNA = 1e-11)),
                    drivers = list(SNV("22", 10510210, "C", allele = 1),
                                   Mutation("22", 16085675, "GCCTCCCGA", "G"),
                                   CNA(type = "A", chr = "22",
                                       chr_pos = 10303470, len = 200000),
                                   CNA("D", "22", 5010000, 200000,
                                       allele = 1),
                                   WGD))
#>  [█---------------------------------------] 0% [00m:00s] Retrieving "A" SIDs [█---------------------------------------] 0% [00m:00s] Found 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [██████████████--------------------------] 33% [00m:01s] Reading 22 [███████████████████████████-------------] 66% [00m:02s] Reading 22 [████████████████████████████████████████] 100% [00m:03s] Reading 22
#>  [████████████████████████████████████████] 100% [00m:03s] "A"'s SIDs validated

m_engine
#> MutationEngine
#>  Passenger rates
#>    "A+": {SNV: 1e-09, indel: 1e-09}
#>    "A-": {SNV: 3e-09, CNA: 1e-11}
#> 
#>  Driver mutations
#>    "A":
#>         (chr22(10510210)[N>C]) on allele 1
#>         (chr22(16085675)[GCCTCCCGA>G]) on random allele
#>        CNA("A",chr22(10303470), len: 200000)
#>        CNA("D",chr22(5010000), len: 200000, allele: 1)
#>        Whole genome duplication
#> 
#>  Timed Exposure
#>    SBS Timed Exposures
#> 
#>    indel Timed Exposures

In the above code, the driver CNAs, SNVs, and indels are declared by calling the functions CNA(), SNV(), and Mutation(), respectively. These functions allow specifying the allele in which the mutation must lie. The mutant A is also characterised by whole genome doubling (WGD): a genomic event that simultaneously duplicates all the chromosome alleles.

The driver mutations are applied in the order specified. For instance, the following two pieces of code specify two different genomic characterisations for the same mutant E.

m_engine$add_mutant("E", passenger_rates,
                    drivers = list(SNV("22", 10510210, "C", allele = 1),
                                   WGD))

m_engine$add_mutant("E", passenger_rates,
                    drivers = list(WGD,
                                   SNV("22", 10510210, "C", allele = 1)))

The former snippet places an SNV and, afterwards, duplicates all the alleles producing a genome in which two alleles contain the placed SNV. Instead, the latter snippet requires a whole genome doubling event and, then, places the SNV yielding a single occurrence of the SNV in the final genome.

As far as SNVs and indels are concerned, ProCESS provides users with a more compact and, at times, convenient approach. The mutation engine stores a list of known driver mutations and labels each with a code that can be used during mutant declaration. The corresponding data frame can be obtained by using the method MutationEngine$get_known_drivers().

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

m_engine$get_known_drivers() %>% filter(chr == "22") %>% head()
#>   chr     from       to ref alt mutation_type driver_gene driver_code
#> 1  22 20073503 20073503   G   A           SNV       DGCR8   DGCR8 S6N
#> 2  22 20073503 20073503   G   A           SNV       DGCR8   DGCR8 S6N
#> 3  22 20073503 20073503   G   A           SNV       DGCR8   DGCR8 S6N
#> 4  22 20073503 20073503   G   A           SNV       DGCR8   DGCR8 S6N
#> 5  22 20073511 20073511   C   T           SNV       DGCR8   DGCR8 P9S
#> 6  22 20073511 20073511   C   T           SNV       DGCR8   DGCR8 P9S
#>   driver_CDS tumour_type
#> 1    c.17G>A    DLBCLNOS
#> 2    c.17G>A      PLMESO
#> 3    c.17G>A        WDTC
#> 4    c.17G>A          WT
#> 5    c.25C>T    DLBCLNOS
#> 6    c.25C>T      PLMESO

The code of the known driver mutations can be used in place of the full specification as follows.

# add the mutant "B" characterized by two driver SNVs on chromosome 22 (no
# CNA) and two epigenetic states. The first SNV is "DGCR8 P26L" and it must
# lay in the allele 1. The remaining SNV is specified by using the SNV function
# as done above. The species "B+" and "B-" have passenger SNV rates 8e-9
# and 2e-8, respectively, and CNA rates 0 for both species.
m_engine$add_mutant("B", list("+" = c(SNV = 8e-9), "-" = c(SNV = 2e-8)),
                    list(list("DGCR8 P26L", allele = 1),  # the first SNV
                         SNV("22", 12028576, "G")))       # the second SNV
#>  [█---------------------------------------] 0% [00m:00s] Retrieving "B" SIDs [████████████████████████████████████████] 100% [00m:00s] "B"'s SIDs validated

m_engine
#> MutationEngine
#>  Passenger rates
#>    "A+": {SNV: 1e-09, indel: 1e-09}
#>    "A-": {SNV: 3e-09, CNA: 1e-11}
#>    "B+": {SNV: 8e-09}
#>    "B-": {SNV: 2e-08}
#> 
#>  Driver mutations
#>    "A":
#>         (chr22(10510210)[N>C]) on allele 1
#>         (chr22(16085675)[GCCTCCCGA>G]) on random allele
#>        CNA("A",chr22(10303470), len: 200000)
#>        CNA("D",chr22(5010000), len: 200000, allele: 1)
#>        Whole genome duplication
#>    "B":
#>        DGCR8 P26L (chr22(20073563)[C>T]) on allele 1
#>         (chr22(12028576)[N>G]) on random allele
#> 
#>  Timed Exposure
#>    SBS Timed Exposures
#> 
#>    indel Timed Exposures

Mutational Exposures

The probability of a mutation occurring depends on both its genomic and environmental context.

A signature is a mutation probability distribution over mutation contexts or mutation structure. SBS (single-base substitution) signatures provide for any genomic context (i.e., a triplet of bases) the probability that an SNV occurs on that context. On the contrary, ID (indel) signatures associate the probability of an indel to its length and structure (see (Alexandrov et al. 2020)).

The signature depends on the environmental context and, as a result, more than one signature may be active at the same time with different probabilities. A mutational exposure (or exposure) is a discrete probability distribution among signatures.

To simulate passenger mutations of a given type, we need to specify a default exposure for that type. This can be achieved as follows.

# add SNV and indel default exposures. This will be used from simulated time 0
# up to the successive exposure change.
m_engine$add_exposure(coefficients = c(SBS13 = 0.2, SBS1 = 0.8))
m_engine$add_exposure(c(ID2 = 0.6, ID13 = 0.2, ID21 = 0.2))

Further exposures can also be defined by specifying an activation time for each of them, i.e., the new exposures will be used from that until the next exposure change.

# add a new SNV exposure that will be used from simulated
# time 100 up to the successive exposure change.
m_engine$add_exposure(time = 100, c(SBS5 = 0.3, SBS2 = 0.2, SBS3 = 0.5))

m_engine
#> MutationEngine
#>  Passenger rates
#>    "A+": {SNV: 1e-09, indel: 1e-09}
#>    "A-": {SNV: 3e-09, CNA: 1e-11}
#>    "B+": {SNV: 8e-09}
#>    "B-": {SNV: 2e-08}
#> 
#>  Driver mutations
#>    "A":
#>         (chr22(10510210)[N>C]) on allele 1
#>         (chr22(16085675)[GCCTCCCGA>G]) on random allele
#>        CNA("A",chr22(10303470), len: 200000)
#>        CNA("D",chr22(5010000), len: 200000, allele: 1)
#>        Whole genome duplication
#>    "B":
#>        DGCR8 P26L (chr22(20073563)[C>T]) on allele 1
#>         (chr22(12028576)[N>G]) on random allele
#> 
#>  Timed Exposure
#>    SBS Timed Exposures
#>      [0, 100[: {"SBS1": 0.8, "SBS13": 0.2}
#>      [100, ∞[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#> 
#>    indel Timed Exposures
#>      [0, ∞[: {"ID13": 0.2, "ID2": 0.6, "ID21": 0.2}

The default exposures and the simultaneous changes in SNV and ID exposures can be specified by a single call to the function MutationEngine$add_exposure() as follows.

m_engine$add_exposure(time = 120, c(SBS5 = 0.3, SBS2 = 0.2, SBS3 = 0.5,
                                    ID1 = 0.8, ID9 = 0.2))

m_engine
#> MutationEngine
#>  Passenger rates
#>    "A+": {SNV: 1e-09, indel: 1e-09}
#>    "A-": {SNV: 3e-09, CNA: 1e-11}
#>    "B+": {SNV: 8e-09}
#>    "B-": {SNV: 2e-08}
#> 
#>  Driver mutations
#>    "A":
#>         (chr22(10510210)[N>C]) on allele 1
#>         (chr22(16085675)[GCCTCCCGA>G]) on random allele
#>        CNA("A",chr22(10303470), len: 200000)
#>        CNA("D",chr22(5010000), len: 200000, allele: 1)
#>        Whole genome duplication
#>    "B":
#>        DGCR8 P26L (chr22(20073563)[C>T]) on allele 1
#>         (chr22(12028576)[N>G]) on random allele
#> 
#>  Timed Exposure
#>    SBS Timed Exposures
#>      [0, 100[: {"SBS1": 0.8, "SBS13": 0.2}
#>      [100, 120[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#>      [120, ∞[: {"SBS2": 0.2, "SBS3": 0.5, "SBS5": 0.3}
#> 
#>    indel Timed Exposures
#>      [0, 120[: {"ID13": 0.2, "ID2": 0.6, "ID21": 0.2}
#>      [120, ∞[: {"ID1": 0.8, "ID9": 0.2}

Passenger CNAs and Tumour Type

The passenger CNAs applied during the simulation depend on tumour type. The type of tumour can be specified when building the mutation engine by using the parameters tumour_type or tumour_study. For instance, if the passenger CNA file used to build the mutation engine contains some of the data identified in breast carcinoma in the UK, the specific CNA can be applied during the simulation as follows.

m_engine <- MutationEngine(setup_code = "demo",
                           tumour_type = "BRCA",
                           tumour_study = "UK")

The complete list of available tumour types in a set-up can be obtained by using the function get_available_tumours_in().

get_available_tumours_in(setup_code = "demo") %>% head()
#>   type
#> 1  ACC
#> 2  AML
#> 3 ANGS
#> 4 ANSC
#> 5  BCC
#> 6 BLCA

Germline Mutations

ProCESS allows users to apply germline mutations from one of the subjects in the germline data used to build the mutation engine. This feature will enable users to simulate a specific cancer type’s evolution on an individual with the desired gender and ethnicity.

The available subjects, together with their sex and ethnicity, can be obtained by using the method MutationEngine$get_germline_subjects().

subjects <- m_engine$get_germline_subjects()

subjects
#>    sample pop super_pop gender
#> 1 NA18941 JPT       EAS female
#> 2 NA20513 TSI       EUR   male

The column sample contains the names of the available subjects. The columns pop and super_pop report the subjects’ population and super-population codes. The last column, gender, includes the subject gender.

The method MutationEngine$get_population_descriptions() clarifies the meaning of the codes reported in the pop columns.

m_engine$get_population_descriptions()
#>    code          description
#> 1   CHB          Han Chinese
#> 2   JPT             Japanese
#> 3   CHS Southern Han Chinese
#> 4   CDX          Dai Chinese
#> 5   KHV      Kinh Vietnamese
#> 6   CHD       Denver Chinese
#> 7   CEU                 CEPH
#> 8   TSI               Tuscan
#> 9   GBR              British
#> 10  FIN              Finnish
#> 11  IBS              Spanish
#> 12  YRI               Yoruba
#> 13  LWK                Luhya
#> 14  GWD              Gambian
#> 15  MSL                Mende
#> 16  ESN                 Esan
#> 17  ASW  African-American SW
#> 18  ACB    African-Caribbean
#> 19  MXL     Mexican-American
#> 20  PUR         Puerto Rican
#> 21  CLM            Colombian
#> 22  PEL             Peruvian
#> 23  GIH             Gujarati
#> 24  PJL              Punjabi
#> 25  BEB              Bengali
#> 26  STU           Sri Lankan
#> 27  ITU               Indian
#>                                                      long.description
#> 1                                       Han Chinese in Beijing, China
#> 2                                            Japanese in Tokyo, Japan
#> 3                                                   Han Chinese South
#> 4                                 Chinese Dai in Xishuangbanna, China
#> 5                                   Kinh in Ho Chi Minh City, Vietnam
#> 6                          Chinese in Denver, Colorado (pilot 3 only)
#> 7  Utah residents (CEPH) with Northern and Western European ancestry 
#> 8                                                  Toscani in Italia 
#> 9                                    British in England and Scotland 
#> 10                                                Finnish in Finland 
#> 11                                      Iberian populations in Spain 
#> 12                                          Yoruba in Ibadan, Nigeria
#> 13                                             Luhya in Webuye, Kenya
#> 14                           Gambian in Western Division, The Gambia 
#> 15                                              Mende in Sierra Leone
#> 16                                                    Esan in Nigeria
#> 17                                  African Ancestry in Southwest US 
#> 18                                      African Caribbean in Barbados
#> 19                        Mexican Ancestry in Los Angeles, California
#> 20                                        Puerto Rican in Puerto Rico
#> 21                                    Colombian in Medellin, Colombia
#> 22                                             Peruvian in Lima, Peru
#> 23                                     Gujarati Indian in Houston, TX
#> 24                                        Punjabi in Lahore, Pakistan
#> 25                                              Bengali in Bangladesh
#> 26                                         Sri Lankan Tamil in the UK
#> 27                                            Indian Telugu in the UK

The method MutationEngine$get_active_germline() returns the the active germline subject.

m_engine$get_active_germline()
#>    sample pop super_pop gender
#> 1 NA18941 JPT       EAS female

Users can change the germline subject using the method MutationEngine$set_germline_subject().

When a subject is selected for the first time, ProCESS builds a binary representation of the subject’s genome and save it for future use. This step may take a few minutes. However, all the successive selections of the same subject directly load the binary file.

m_engine$set_germline_subject(subjects[2, "sample"])
#>  [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#>  [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved

m_engine$get_active_germline()
#>    sample pop super_pop gender
#> 1 NA20513 TSI       EUR   male

Building Phylogenetic Forests

The configured mutation engine can be used to label each node in a sample forest with mutations.

Since the mutation engine has been configured to deal with the simulation performed in Section “Two populations with epigenetic state” in this article, we can load its sample forest from the file "sample_forest.sff" as saved there.

sample_forest <- load_sample_forest("sample_forest.sff")

# place mutations on the sample forest assuming 1000 pre-neoplastic SNVs and
# 500 indels
phylo_forest <- m_engine$place_mutations(sample_forest, 1000, 500)
#>  [█---------------------------------------] 0% [00m:00s] Placing mutations [██████████████████████████████----------] 74% [00m:01s] Placing mutations [████████████████████████████████████████] 100% [00m:01s] Mutations placed

phylo_forest
#> PhylogeneticForest
#>   # of trees: 1
#>   # of nodes: 22471
#>   # of leaves: 5364
#>   samples: {"S_1_1", "S_1_2", "S_2_1", "S_2_2"}
#> 
#>   # of emerged SNVs and indels: 17147
#>   # of emerged CNAs: 32

The phylogenetic forest stores all mutations, labelling the sampled cells represented by its leaves. Users can retrieve such data by using the methods PhylogeneticForestget_sampled_cell_mutations()](../reference/PhylogeneticForest-cash-get_sampled_cell_mutations.html)</code> and <code>[PhylogeneticForestget_sampled_cell_CNAs().

library(dplyr)

# select the first mutations among all the mutations occurring in
# the genomes of the sampled cells
phylo_forest$get_sampled_cell_mutations() %>% head()
#>   cell_id chr  chr_pos allele       ref alt  type cause          class
#> 1    7622  22 16085675      0 GCCTCCCGA   G indel     A         driver
#> 2    7622  22 16109050      0        GC   G indel   ID1 pre-neoplastic
#> 3    7622  22 16113256      0         G   A   SNV  SBS1 pre-neoplastic
#> 4    7622  22 16144912      0         A   G   SNV  SBS1 pre-neoplastic
#> 5    7622  22 16160761      0         A   G   SNV  SBS1 pre-neoplastic
#> 6    7622  22 16188578      0      AGAG   A indel   ID1 pre-neoplastic

# select the first CNAs among all the mutations occurring in
# the genomes of the sampled cells
phylo_forest$get_sampled_cell_CNAs() %>% head()
#>   cell_id type chr    begin      end allele src.allele  class
#> 1    7622    A  22 10303470 10503469      2          0 driver
#> 2    7622    D  22  5010000  5209999      1         NA driver
#> 3   11799    A  22 10303470 10503469      2          0 driver
#> 4   11799    D  22  5010000  5209999      1         NA driver
#> 5   12137    A  22 10303470 10503469      2          0 driver
#> 6   12137    D  22  5010000  5209999      1         NA driver

# get the sampled cells
sampled_cells <- phylo_forest$get_nodes() %>%
  filter(!is.na(.data$sample))

# show the first of them
sampled_cells %>% head()
#>   cell_id ancestor depth mutant epistate sample birth_time
#> 1    7622     1087    22      A        -  S_1_1   197.1944
#> 2   11799     9949    37      A        -  S_1_2   226.7489
#> 3   12137     7784    25      A        -  S_1_1   228.7809
#> 4   14474     8082    20      A        -  S_1_1   242.1959
#> 5   16113    12528    37      A        -  S_1_2   250.7551
#> 6   18662     7054    32      A        -  S_1_1   262.6765

# get the identifier of the 3rd cell in `sampled_cells`
cell_id <- sampled_cells[3, 1]

# get the SNVs and the indels of the 3rd cell in `sampled_cells`
phylo_forest$get_sampled_cell_mutations(cell_id) %>% head()
#>   cell_id chr  chr_pos allele       ref alt  type cause          class
#> 1   12137  22 16085675      0 GCCTCCCGA   G indel     A         driver
#> 2   12137  22 16109050      0        GC   G indel   ID1 pre-neoplastic
#> 3   12137  22 16113256      0         G   A   SNV  SBS1 pre-neoplastic
#> 4   12137  22 16144912      0         A   G   SNV  SBS1 pre-neoplastic
#> 5   12137  22 16160761      0         A   G   SNV  SBS1 pre-neoplastic
#> 6   12137  22 16188578      0      AGAG   A indel   ID1 pre-neoplastic

# get the CNAs of the 3rd cell in `sampled_cells`
phylo_forest$get_sampled_cell_CNAs(cell_id) %>% head()
#>   cell_id type chr    begin      end allele src.allele  class
#> 1   12137    A  22 10303470 10503469      2          0 driver
#> 2   12137    D  22  5010000  5209999      1         NA driver

The method PhylogeneticForestget_samples_info()](../reference/PhylogeneticForest-cash-get_samples_info.html)</code> is analogous to <code>[SampleForestget_samples_info(): it returns a data frame containing information about the forest samples. However, the former method adds two columns to the result of the latter: “DNA_quantity” and “equivalent_normal_cells”. The former column reports the total quantity of tumoral DNA in the sample, i.e., the sum of the lengths of all alleles in the sample. This quantity is a natural number, nevertheless, it is stored as a real number as it usually exceeds the largest natural number that can be natively represented by R. The column “equivalent_normal_cells” instead contains the number of normal cells that contain as much DNA as the sample tumour cells.

phylo_forest$get_samples_info()
#>    name id xmin ymin xmax ymax tumour_cells tumour_cells_in_bbox     time
#> 1 S_1_1  0  480  480  530  530         2560                 2560 555.7961
#> 2 S_1_2  1  500  500  550  550         1619                 1619 555.7961
#> 3 S_2_1  2  385  509  409  533          572                  572 620.9211
#> 4 S_2_2  3  460  334  484  358          613                  613 620.9211
#>   DNA_quantity equivalent_normal_cells
#> 1 525587676186                5122.231
#> 2 332646308095                3241.878
#> 3 117099208770                1141.216
#> 4 125798795832                1226.000

The method PhylogeneticForest$get_germline_mutations() returns the SNVs and the indels in the germline.

# extract the germline mutation
phylo_forest$get_germline_mutations() %>% head()
#>   chr  chr_pos allele ref   alt  type cause    class
#> 1  22 16051493      0   G     A   SNV       germinal
#> 2  22 16052167      0   A AAAAC indel       germinal
#> 3  22 16053659      0   A     C   SNV       germinal
#> 4  22 16054740      0   A     G   SNV       germinal
#> 5  22 16055942      0   C     T   SNV       germinal
#> 6  22 16058070      0   A     G   SNV       germinal

Users can also identify the cell in which a mutation emerged even when the cell was not sampled.

# select one of the mutations
mutation_row <- phylo_forest$get_sampled_cell_mutations(cell_id)[2, ]

# rebuild the corresponding mutation
mutation <- Mutation(mutation_row["chr"][1, ],
                     mutation_row["chr_pos"][1, ],
                     mutation_row["ref"][1, ],
                     mutation_row["alt"][1, ])

# get the identifier of the oldest cells in which the mutation occurs
phylo_forest$get_first_occurrences(mutation)
#> [[1]]
#> [1] 0

The exposures used in placing the mutations on the cells in the phylogenetic forest can be obtained by using the method PhylogeneticForest$get_exposures().

# get the exposures used in placing the mutations
phylo_forest$get_exposures()
#>    time signature exposure  type
#> 1     0      SBS1      0.8   SNV
#> 2     0     SBS13      0.2   SNV
#> 3   100      SBS2      0.2   SNV
#> 4   100      SBS3      0.5   SNV
#> 5   100      SBS5      0.3   SNV
#> 6   120      SBS2      0.2   SNV
#> 7   120      SBS3      0.5   SNV
#> 8   120      SBS5      0.3   SNV
#> 9     0      ID13      0.2 indel
#> 10    0       ID2      0.6 indel
#> 11    0      ID21      0.2 indel
#> 12  120       ID1      0.8 indel
#> 13  120       ID9      0.2 indel

The method MutationEngine$get_bulk_allelic_fragmentation() returns a data frame reporting the allelic type per genome fragment.

# get the name of the first sample
sample_name <- phylo_forest$get_samples_info()[["name"]][1]

# print the bulk allelic fragmentation
phylo_forest$get_bulk_allelic_fragmentation(sample_name) %>% head()
#>   chr    begin      end major minor     ratio
#> 1  22        1  5009999     2     2 1.0000000
#> 2  22  5010000  5209999     2     0 1.0000000
#> 3  22  5210000 10303469     2     2 1.0000000
#> 4  22 10303470 10503469     4     2 1.0000000
#> 5  22 10503470 17098700     2     2 1.0000000
#> 6  22 17098701 19484880     2     2 0.9996094

Instead, the method MutationEngine$get_cell_allelic_fragmentation() returns the allelic fragmentation per cell.

# print the cell allelic fragmentation
phylo_forest$get_cell_allelic_fragmentation() %>% head()
#>   cell_id chr    begin      end major minor
#> 1    7622  22        1  5009999     2     2
#> 2    7622  22  5010000  5209999     2     0
#> 3    7622  22  5210000 10303469     2     2
#> 4    7622  22 10303470 10503469     4     2
#> 5    7622  22 10503470 51304566     2     2
#> 6   11799  22        1  5009999     2     2

The details about the SNV and indel signatures adopted during the evolution are available in the mutation engine and they can be retrieved by using the methods MutationEngineget_SNV_signatures()](../reference/MutationEngine-cash-get_SNV_signatures.html)</code> and <code>[MutationEngineget_indel_signatures().

# get the SNV signatures used in placing the mutations
m_engine$get_SNV_signatures()[1:6, 1:5]
#>      Type       SBS1       SBS2       SBS3       SBS4
#> 1 A[C>A]A 0.01041667 0.01041667 0.01041667 0.01041667
#> 2 A[C>A]C 0.01041667 0.01041667 0.01041667 0.01041667
#> 3 A[C>A]G 0.01041667 0.01041667 0.01041667 0.01041667
#> 4 A[C>A]T 0.01041667 0.01041667 0.01041667 0.01041667
#> 5 A[C>G]A 0.01041667 0.01041667 0.01041667 0.01041667
#> 6 A[C>G]C 0.01041667 0.01041667 0.01041667 0.01041667

# get the indel signatures used in placing the mutations
m_engine$get_indel_signatures()[1:6, 1:5]
#>        Type        ID1        ID2        ID3        ID4
#> 1 1:Del:C:0 0.01204819 0.01204819 0.01204819 0.01204819
#> 2 1:Del:C:1 0.01204819 0.01204819 0.01204819 0.01204819
#> 3 1:Del:C:2 0.01204819 0.01204819 0.01204819 0.01204819
#> 4 1:Del:C:3 0.01204819 0.01204819 0.01204819 0.01204819
#> 5 1:Del:C:4 0.01204819 0.01204819 0.01204819 0.01204819
#> 6 1:Del:C:5 0.01204819 0.01204819 0.01204819 0.01204819

Finally, the data of the subject whose germline corresponds to wild-type genome in the phylogenetic forest can be obtained by the method PhylogeneticForest$get_germline_subject().

phylo_forest$get_germline_subject()
#>    sample pop super_pop gender
#> 1 NA20513 TSI       EUR   male

Storing Phylogenetic Forests

As in the case of the sample forests, the phylogenetic forests can be saved by using the method PhylogeneticForest$save() and load by the function load_phylogenetic_forest().

# save the phylogenetic forest in the file "phylo_forest.sff"
phylo_forest$save("phylo_forest.sff")
#>  [█---------------------------------------] 0% [00m:00s] Saving forest [█---------------------------------------] 0% [00m:00s] Saving forest [█---------------------------------------] 0% [00m:01s] Saving forest [█---------------------------------------] 0% [00m:02s] Saving forest [█---------------------------------------] 0% [00m:03s] Saving forest [█---------------------------------------] 0% [00m:04s] Saving forest [█---------------------------------------] 0% [00m:05s] Saving forest [███-------------------------------------] 6% [00m:07s] Saving forest [█████████-------------------------------] 22% [00m:08s] Saving forest [███████████████-------------------------] 37% [00m:09s] Saving forest [█████████████████████-------------------] 52% [00m:10s] Saving forest [███████████████████████████-------------] 67% [00m:11s] Saving forest [██████████████████████████████████------] 83% [00m:12s] Saving forest [████████████████████████████████████████] 99% [00m:13s] Saving forest [████████████████████████████████████████] 100% [00m:13s] Forest saved

# loading the saved forest
loaded_phylo_forest <- load_phylogenetic_forest("phylo_forest.sff")
#>  [█---------------------------------------] 0% [00m:00s] Loading forest [█████-----------------------------------] 10% [00m:01s] Loading forest [████████--------------------------------] 19% [00m:02s] Loading forest [████████████----------------------------] 29% [00m:03s] Loading forest [████████████████------------------------] 38% [00m:04s] Loading forest [████████████████████--------------------] 48% [00m:05s] Loading forest [███████████████████████-----------------] 57% [00m:06s] Loading forest [███████████████████████████-------------] 66% [00m:07s] Loading forest [███████████████████████████████---------] 75% [00m:08s] Loading forest [██████████████████████████████████------] 84% [00m:09s] Loading forest [██████████████████████████████████████--] 93% [00m:10s] Loading forest [████████████████████████████████████████] 100% [00m:11s] Forest loaded

loaded_phylo_forest
#> PhylogeneticForest
#>   # of trees: 1
#>   # of nodes: 22471
#>   # of leaves: 5364
#>   samples: {"S_1_1", "S_1_2", "S_2_1", "S_2_2"}
#> 
#>   # of emerged SNVs and indels: 17147
#>   # of emerged CNAs: 32

Getting and Setting the Reference Genome Path

The phylogenetic forest object contains the reference genome FASTA file path. The methods PhylogeneticForestget_reference_path()](../reference/PhylogeneticForest-cash-get_reference_path.html)</code> and <code>[PhylogeneticForestset_reference_path() can be used to get the path and set it, respectively.

phylo_forest$get_reference_path()
#> [1] "/Users/alberto/Library/CloudStorage/Dropbox/Lavoro/Code/ProCESS/ProCESS/vignettes/demo/reference.fasta"

phylo_forest$set_reference_path("demo/reference.fasta")

phylo_forest$get_reference_path()
#> [1] "/Users/alberto/Library/CloudStorage/Dropbox/Lavoro/Code/ProCESS/ProCESS/vignettes/demo/reference.fasta"

Alexandrov, Ludmil B, Jaegil Kim, Nicholas J Haradhvala, et al. 2020. “The Repertoire of Mutational Signatures in Human Cancer.” Nature 578 (7793): 94–101.