Disclaimer: RACES/ProCESS internally implements the probability distributions using the C++11 random number distribution classes. The standard does not specify their algorithms, and the class implementations are left free for the compiler. Thus, the simulation output depends on the compiler used to compile RACES, and because of that, the results reported in this article may differ from those obtained by the reader.
ProCESS/RACES can simulate genomic mutations on the cells represented
in a SampleForest according to the specified SBS and indel
mutational signatures (see vignette["mutations"]). This
task is achieved by the mutation engine which uses two indices on the
reference genome to place the most probable mutations according to
signatures: the context index and the repeated sequence
index.
The function MutationEngine() builds the mutation engine
and these two indices.
Customizing the mutation engine
While the most convenient way to invoke MutationEngine()
is by using the parameter setup_code (see
vignette["mutations"]), the are cases in which a custom
set-up is required. In these cases, the function
MutationEngine() can be called by specifying the name of
the set-up directory, the path or URL of the reference sequence, the
signature files, the driver mutation file, the passenger CNAs file, and
the germline data directory.
library("ProCESS")
reference_url <- paste0("https://ftp.ensembl.org/pub/grch37/release-111/",
"fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.",
"dna.chromosome.22.fa.gz")
sbs_url <- paste0("https://zenodo.org/records/15875185/files/",
"SBS_demo_signatures.txt")
indel_url <- paste0("https://zenodo.org/records/15875185/files/",
"indel_demo_signatures.txt")
drivers_url <- paste0("https://zenodo.org/records/15875185/files/",
"driver_mutations_hg19.csv.bz2")
passenger_cnas_url <- paste0("https://zenodo.org/records/15875185/files/",
"passenger_CNAs_hg19.csv.bz2")
germline_url <- paste0("https://zenodo.org/records/15875185/files/",
"germline_data_demo.tar.bz2")
# build a mutation engine and place all the files in the directory "Test"
m_engine <- MutationEngine(directory = "Test",
reference_src = reference_url,
SBS_signatures_src = sbs_url,
indel_signatures_src = indel_url,
drivers_src = drivers_url,
passenger_CNAs_src = passenger_cnas_url,
germline_src = germline_url)
#> Downloading reference genome...
#> Reference genome downloaded
#> Decompressing reference genome...done
#> Downloading signature files...
#> Signature file downloaded
#> Downloading driver mutation file...
#> Driver mutation file downloaded
#> Decompressing driver mutation file...done
#> Downloading passenger CNAs file...
#> Passenger CNAs file downloaded
#> Decompressing passenger CNAs file...done
#> Downloading germline...
#> Germline downloaded
#> Decompressing mutations...
#> done
#> Building context index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█████████████████-----------------------] 40% [00m:00s] Processing chr. 22 [█████████████████████████████████-------] 81% [00m:02s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:02s] Context index built
#> [█---------------------------------------] 0% [00m:00s] Saving context index [████████████████████████████████████████] 100% [00m:00s] Context index saved
#> done
#> Building repeated sequence index...
#> [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:01s] Reading 22 [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] RS index built
#> [█---------------------------------------] 0% [00m:00s] Saving RS index [█---------------------------------------] 0% [00m:00s] Saving RS index [█████████████████████████---------------] 62% [00m:02s] Saving RS indexdone
#> [████████████████████████████████████████] 100% [00m:02s] RS index saved
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#> [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved
dir.exists("Test")
#> [1] TRUEThe Context Index
The SBS signatures fixes, for every possible context (i.e., a triplet
of consecutive nucleotides), the probability for a SNV to occur in such
a context (see (Alexandrov
et al. 2020)). Hence, the function
MutationEngine() builds a context index of the reference
sequence to place SNVs on the forest cells. The complete context index
of a sequence whose size is up to 4Gbps takes 4 times the length of the
sequence itself both in memory and on the disk because 4 bytes per
nucleotide are required to store the context position. For instance, the
complete context index of a human genome takes about 12GBytes. In order
to avoid such requirement in memory and disk space,
MutationEngine() allows to sample the reference genome
contexts and to stores only some of the them in the context index. This
is achieved by the optional parameter context_sampling
which specifies how many occurrences of the same context must be
identified before adding one of them to the context index. The larger
the number of context sampling, the larger the context index. On the
other side, the lower the number of context sampling, the lower the
number of sites in the reference genome that can be affected by
simulated mutations. The context_sampling is set to 100 by
default, but it can be specified during the
MutationEngine() call.
m_engine <- MutationEngine(setup_code = "demo")
#> Downloading reference genome...
#> Reference genome downloaded
#> Decompressing reference genome...done
#> Downloading signature files...
#> Signature file downloaded
#> Downloading driver mutation file...
#> Driver mutation file downloaded
#> Decompressing driver mutation file...done
#> Downloading passenger CNAs file...
#> Passenger CNAs file downloaded
#> Decompressing passenger CNAs file...done
#> Downloading germline...
#> Germline downloaded
#> Decompressing mutations...
#> done
#> Building context index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█████████████████-----------------------] 40% [00m:00s] Processing chr. 22 [█████████████████████████████████-------] 81% [00m:02s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:02s] Context index built
#> [█---------------------------------------] 0% [00m:00s] Saving context index [████████████████████████████████████████] 100% [00m:00s] Context index saved
#> done
#> Building repeated sequence index...
#> [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:01s] Reading 22 [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] RS index built
#> [█---------------------------------------] 0% [00m:00s] Saving RS index [█---------------------------------------] 0% [00m:00s] Saving RS index [█████████████████████████████-----------] 72% [00m:02s] Saving RS indexdone
#> [████████████████████████████████████████] 100% [00m:02s] RS index saved
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loaded
#> [█---------------------------------------] 0% [00m:00s] Saving germline [████████████████████████████████████████] 100% [00m:00s] Germline saved
# get the size of the context index when `context_sampling` is 100
utils:::format.object_size(file.size("demo/context_index_100.cif"), "auto")
#> [1] "1.3 Mb"Let us rebuild the context index sampling one context every 50.
# building a mutation engine by using the "demo" set-up configuration
# and setting context_sampling to 50
m_engine <- MutationEngine(setup_code = "demo", context_sampling = 50)
#> Building context index...
#> [█---------------------------------------] 0% [00m:00s] Processing chr. 22 [█████████████████-----------------------] 40% [00m:00s] Processing chr. 22 [█████████████████████████████████-------] 81% [00m:02s] Processing chr. 22 [████████████████████████████████████████] 100% [00m:02s] Context index built
#> [█---------------------------------------] 0% [00m:00s] Saving context index [████████████████████████████████████████] 100% [00m:00s] Context index saved
#> done
#> [█---------------------------------------] 0% [00m:00s] Loading RS index [████████████████████--------------------] 48% [00m:01s] Loading RS index [██████████████████████████████████████--] 94% [00m:02s] Loading RS index [████████████████████████████████████████] 100% [00m:02s] RS index loaded
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loadedAbove call uses the already downloaded data in the directory
"demo" to build the context and produces the file
"demo/context_index_50.cif".
# get the size of the context index when `context_sampling` is 50
utils:::format.object_size(file.size("demo/context_index_50.cif"), "auto")
#> [1] "2.7 Mb"The Repeated Sequence Index
Analogously to the SBS signatures, the ID (indel) signatures fixes,
for some kinds of repeated sequences, the probability for an indel to
occur in them (see (Alexandrov et al. 2020)). Thus, the
MutationEngine function also builds an index for the
repeated sequences in the reference. As in the case of the SBS
signatures, the ID signatures can be memory eager. To mitigate the
memory requirement, the function MutationEngine() accepts
as parameters the maximum size of the considered indel (the parameter
max_index_size) and the maximum number of reference
position stored per repetition type (the parameter
max_repetition_storage). The larger the two parameters, the
larger the repetition index. As far the building time concerns, the
larger max_index_size, the longer takes the index
construction. On the contrary, max_repetition_storage does
not directly affects the computation time. By default, the two
parameters are set to 50 and 500000, respectively. However, their values
can be specified during the MutationEngine() call.
# get the size of the repeated index when the default parameter values
# are used
utils:::format.object_size(file.size("demo/rs_index_50_500000.rsif"), "auto")
#> [1] "212.5 Mb"Let us rebuild the context index sampling one context every 50.
# building a mutation engine by using the "demo" set-up configuration
# and setting max_repetition_storage to 10,000,000
m_engine <- MutationEngine(setup_code = "demo",
max_repetition_storage = 10000000)
#> [█---------------------------------------] 0% [00m:00s] Loading context index [████████████████████████████████████████] 100% [00m:00s] Context index loaded
#> Building repeated sequence index...
#> [█---------------------------------------] 0% [00m:00s] Reading 22 [█---------------------------------------] 0% [00m:00s] Reading 22 [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] Reading 22
#> [████████████████████████████████████████] 100% [00m:02s] RS index built
#> [█---------------------------------------] 0% [00m:00s] Saving RS index [█---------------------------------------] 0% [00m:00s] Saving RS index [█---------------------------------------] 0% [00m:01s] Saving RS index [█---------------------------------------] 0% [00m:02s] Saving RS index [█---------------------------------------] 0% [00m:03s] Saving RS index [███-------------------------------------] 5% [00m:05s] Saving RS index [█████████████---------------------------] 30% [00m:06s] Saving RS index [██████████████████████------------------] 54% [00m:07s] Saving RS index [███████████████████████████████---------] 75% [00m:08s] Saving RS index [███████████████████████████████████████-] 95% [00m:09s] Saving RS indexdone
#> [████████████████████████████████████████] 100% [00m:09s] RS index saved
#> [█---------------------------------------] 0% [00m:00s] Loading germline [████████████████████████████████████████] 100% [00m:00s] Germline loadedAbove call uses the already downloaded data in the directory
"demo" to build the repeated sequence index and produces
the file "demo/rs_index_50_10000000.rsif".
# get the size of the context index when `context_sampling` is 50
utils:::format.object_size(file.size("demo/rs_index_50_10000000.rsif"),
"auto")
#> [1] "732.1 Mb"