Disclaimer: RACES/rRACES internally implements the probability distributions using the C++11 random number distribution classes. The standard does not specify their algorithms, and the class implementations are left free for the compiler. Thus, the simulation output depends on the compiler used to compile RACES, and because of that, the results reported in this article may differ from those obtained by the reader.
rRACES/RACES can simulate genomic mutations on the cells represented
in a SamplesForest
according to the specified SBS and indel
mutational signatures (see vignette["mutations"]
). This
task is achieved by the mutation engine which uses two indices on the
reference genome to place the most probable mutations according to
signatures: the context index and the repeted sequence
index.
The function MutationEngine()
builds the mutation engine
and these two indices.
Customizing the mutation engine
While the most convient way to invoke MutationEngine()
is by using the parameter setup_code
(see
vignette["mutations"]
), the are cases in which a custom
set-up is required. In these cases, the function
MutationEngine()
can be called by specifying the name of
the set-up directory, the path or URL of the reference sequence, the
signature files, the driver mutation file, the passenger CNAs file, and
the germline data directory.
library("rRACES")
reference_url <- paste0("https://ftp.ensembl.org/pub/grch37/release-111/",
"fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.",
"dna.chromosome.22.fa.gz")
sbs_url <- paste0("https://cancer.sanger.ac.uk/signatures/documents/2123/",
"COSMIC_v3.4_SBS_GRCh37.txt")
indel_url <- paste0("https://cancer.sanger.ac.uk/signatures/documents/",
"2121/COSMIC_v3.4_ID_GRCh37.txt")
drivers_url <- paste0("https://raw.githubusercontent.com/",
"caravagnalab/rRACES/main/inst/extdata/",
"driver_mutations_hg19.csv")
passenger_cnas_url <- paste0("https://raw.githubusercontent.com/",
"caravagnalab/rRACES/main/inst/extdata/",
"passenger_CNAs_hg19.csv")
germline_url <- paste0("https://zenodo.org/records/13166780/files/",
"germline_data_demo.tar.gz")
# build a mutation engine and place all the files in the directory "Test"
m_engine <- MutationEngine(directory = "Test",
reference_src = reference_url,
SBS_signatures_src = sbs_url,
indel_signatures_src = indel_url,
drivers_src = drivers_url,
passenger_CNAs_src = passenger_cnas_url,
germline_src = germline_url)
#> Downloading reference genome...
#> Reference genome downloaded
#> Decompressing reference file...done
#> Downloading SBS file...
#> SBS file downloaded
#> Downloading indel file...
#> indel file downloaded
#> Downloading driver mutation file...
#> Driver mutation file downloaded
#> Downloading passenger CNAs file...
#> Passenger CNAs file downloaded
#> Downloading germline mutations...
#> Germline mutations downloaded
#> Building context index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█████████████████-----------------------] 40% [00m:01s] Processing chr. 22 [█████████████████████████████████-------] 81% [00m:02s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:02s] Context index built
#> [█---------------------------------------] 0% [00m:00s] Saving context index [████████████████████████████████████████] 100% [00m:00s] Context index saved
#> done
#> Building repeated sequence index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█---------------------------------------] 0% [00m:01s] Processing chr. 22 [█---------------------------------------] 0% [00m:04s] Processing chr. 22 [█---------------------------------------] 0% [00m:05s] Processing chr. 22 [█---------------------------------------] 0% [00m:07s] Processing chr. 22 [█---------------------------------------] 0% [00m:08s] Processing chr. 22 [█---------------------------------------] 0% [00m:11s] Processing chr. 22 [█---------------------------------------] 0% [00m:15s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:15s] RS index built
#> [█---------------------------------------] 0% [00m:00s] Saving RS index [█---------------------------------------] 0% [00m:00s] Saving RS index [███████████████████████-----------------] 57% [00m:02s] Saving RS indexdone
#> [████████████████████████████████████████] 100% [00m:02s] RS index saved
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#> [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved
dir.exists("Test")
#> [1] TRUE
The Context Index
The SBS signatures fixes, for every possible context (i.e., a triplet
of consetutive nucleotides), the probability for a SNV to occur in such
a context (see (Alexandrov et al. 2020)).
Hence, the function MutationEngine()
builds a context index
of the reference sequence to place SNVs on the forest cells. The
complete context index of a sequence whose size is up to 4Gbps takes 4
times the length of the sequence itself both in memory and on the disk
because 4 bytes per nucleotide are required to store the context
position. For instance, the complete contex index of a human genome
takes about 12GBytes. In order to avoid such requirement in memory and
disk space, MutationEngine()
allows to sample the reference
genome contexts and to stores only some of the them in the context
index. This is achieved by the optional parameter
context_sampling
which specifies how many occurences of the
same context must be identified before adding one of them to the context
index. The larger the number of context sampling, the larger the context
index. On the other side, the lower the number of context sampling, the
lower the number of sites in the refernce genome that can be affected by
simulated mutations. The context_sampling
is set to 100 by
default, but it can be specified during the
MutationEngine()
call.
m_engine <- MutationEngine(setup_code = "demo")
#> Downloading reference genome...
#> Reference genome downloaded
#> Decompressing reference file...done
#> Downloading SBS file...
#> SBS file downloaded
#> Downloading indel file...
#> indel file downloaded
#> Downloading driver mutation file...
#> Driver mutation file downloaded
#> Downloading passenger CNAs file...
#> Passenger CNAs file downloaded
#> Downloading germline mutations...
#> Germline mutations downloaded
#> Building context index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█████████████---------------------------] 32% [00m:01s] Processing chr. 22 [███████████████████████████-------------] 65% [00m:02s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:02s] Context index built
#> [█---------------------------------------] 0% [00m:00s] Saving context index [████████████████████████████████████████] 100% [00m:00s] Context index saved
#> done
#> Building repeated sequence index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█---------------------------------------] 0% [00m:03s] Processing chr. 22 [█---------------------------------------] 0% [00m:05s] Processing chr. 22 [█---------------------------------------] 0% [00m:06s] Processing chr. 22 [█---------------------------------------] 0% [00m:08s] Processing chr. 22 [█---------------------------------------] 0% [00m:10s] Processing chr. 22 [█---------------------------------------] 0% [00m:14s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:14s] RS index built
#> [█---------------------------------------] 0% [00m:00s] Saving RS index [█---------------------------------------] 0% [00m:00s] Saving RS index [████████████████████████----------------] 58% [00m:02s] Saving RS indexdone
#> [████████████████████████████████████████] 100% [00m:02s] RS index saved
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#> [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved
# get the size of the context index when `context_sampling` is 100
utils:::format.object_size(file.size("demo/context_index_100.cif"), "auto")
#> [1] "1.3 Mb"
Let us rebuild the context index sampling one context every 50.
# building a mutation engine by using the "demo" set-up configuration
# and setting context_sampling to 50
m_engine <- MutationEngine(setup_code = "demo", context_sampling = 50)
#> Building context index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█████████████████-----------------------] 40% [00m:01s] Processing chr. 22 [██████████████████████████████----------] 73% [00m:02s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:02s] Context index built
#> [█---------------------------------------] 0% [00m:00s] Saving context index [████████████████████████████████████████] 100% [00m:00s] Context index saved
#> done
#> [█---------------------------------------] 0% [00m:00s] Loading RS index [██████████████--------------------------] 34% [00m:01s] Loading RS index [█████████████████████████████-----------] 70% [00m:02s] Loading RS index [████████████████████████████████████████] 100% [00m:02s] RS index loaded
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
Above call uses the already downloaded data in the directory
"demo"
to build the context and produces the file
"demo/context_index_50.cif"
.
# get the size of the context index when `context_sampling` is 50
utils:::format.object_size(file.size("demo/context_index_50.cif"), "auto")
#> [1] "2.7 Mb"
The Repeated Sequence Index
Analoguosly to the SBS signatures, the ID (indel) signatures fixes,
for some kinds of repeated sequences, the probability for an indel to
occur in them (see (Alexandrov et al.
2020)). Thus, the MutationEngine
function also
builds an index for the repeated sequences in the reference. As in the
case of the SBS signatures, the ID signatures can be memory eager. To
mitigate the memory requirement, the function
MutationEngine()
accepts as parameters the maximum size of
the considered indel (the parameter max_index_size
) and the
maximum number of reference position stored per repetition type (the
parameter max_repetition_storage
). The larger the two
parameters, the larger the repetition index. As far the building time
concerns, the larger max_index_size
, the longer takes the
index construction. On the contrary, max_repetition_storage
does not directly affects the computation time. By default, the two
parameters are set to 50 and 500000, respectively. However, their values
can be specified during the MutationEngine()
call.
# get the size of the repeated index when the default parameter values
# are used
utils:::format.object_size(file.size("demo/rs_index_50_500000.rsif"), "auto")
#> [1] "212.5 Mb"
Let us rebuild the context index sampling one context every 50.
# building a mutation engine by using the "demo" set-up configuration
# and setting max_repetition_storage to 10,000,000
m_engine <- MutationEngine(setup_code = "demo",
max_repetition_storage = 10000000)
#> [█---------------------------------------] 0% [00m:00s] Loading context index [████████████████████████████████████████] 100% [00m:00s] Context index loaded
#> Building repeated sequence index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█---------------------------------------] 0% [00m:03s] Processing chr. 22 [█---------------------------------------] 0% [00m:05s] Processing chr. 22 [█---------------------------------------] 0% [00m:06s] Processing chr. 22 [█---------------------------------------] 0% [00m:08s] Processing chr. 22 [█---------------------------------------] 0% [00m:09s] Processing chr. 22 [█---------------------------------------] 0% [00m:11s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:12s] RS index built
#> [█---------------------------------------] 0% [00m:00s] Saving RS index [█---------------------------------------] 0% [00m:00s] Saving RS index [█---------------------------------------] 0% [00m:01s] Saving RS index [█---------------------------------------] 0% [00m:02s] Saving RS index [█---------------------------------------] 0% [00m:03s] Saving RS index [█---------------------------------------] 0% [00m:04s] Saving RS index [███████---------------------------------] 17% [00m:06s] Saving RS index [█████████████---------------------------] 32% [00m:07s] Saving RS index [██████████████████████------------------] 54% [00m:08s] Saving RS index [██████████████████████████████----------] 73% [00m:09s] Saving RS index [█████████████████████████████████████---] 92% [00m:10s] Saving RS indexdone
#> [████████████████████████████████████████] 100% [00m:10s] RS index saved
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
Above call uses the already downloaded data in the directory
"demo"
to build the repeated sequence index and produces
the file "demo/rs_index_50_10000000.rsif"
.
# get the size of the context index when `context_sampling` is 50
utils:::format.object_size(file.size("demo/rs_index_50_10000000.rsif"),
"auto")
#> [1] "732.1 Mb"