Synthetic Cohort of Universal Tumours

SCOUT

A next-generation synthetic cancer genomics benchmark for reproducible tumour evolution analysis, whole-genome sequencing validation, and scalable bioinformatics workflow benchmarking.

SCOUT at a glance

A large-scale synthetic benchmarking resource for reproducible cancer genomics analyses, tumour evolution studies, and workflow validation.

223
Total samples
Matched tumour–normal whole-genome sequencing (WGS) datasets encompassing multiple tumour types, multiregion and longitudinal sampling, and pre- and post-treatment conditions.
7
Tumour types
Comprehensive characterisation of seven tumour types, including five solid cancers and two haematological malignancies.
27 TB
Sequencing data
Synthetic 150 bp paired-end FASTQ reads simulating Illumina NovaSeq 6000 sequencing, with realistic base-calling quality profiles.
13
Bioinformatics tools benchmarked
Systematic benchmarking of somatic and germline variant callers, copy number alteration methods, tumour purity estimators, and evolutionary inference frameworks for subclonal reconstruction and phylogenetic modelling.
9
Purity/coverage conditions
Simulated tumour–normal datasets across different tumour purities (0.3, 0.6, and 0.9) and sequencing depths (50x, 100x, and 150x), reflecting common WGS conditions encountered in cancer research.
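
The nine conditions are the full cross of the three purities and three depths listed above. A minimal Python sketch of that grid follows. The variable and function names are illustrative and are not part of SCOUT or ProCESS, and the read-pair estimate assumes a nominal 3.1 Gb human genome, a figure the source does not state.

```python
from itertools import product

PURITIES = (0.3, 0.6, 0.9)   # tumour purities listed above
DEPTHS = (50, 100, 150)      # sequencing depths (x) listed above
READ_LENGTH = 150            # bp per read, paired-end, as stated above
GENOME_SIZE = 3.1e9          # bp; ASSUMPTION: nominal human genome size


def read_pairs_needed(depth, genome_size=GENOME_SIZE, read_length=READ_LENGTH):
    """Approximate read pairs for a target mean depth: depth * G / (2 * L)."""
    return round(depth * genome_size / (2 * read_length))


# Every (purity, depth) combination: 3 x 3 = 9 conditions.
conditions = list(product(PURITIES, DEPTHS))

for purity, depth in conditions:
    pairs = read_pairs_needed(depth)
    print(f"purity={purity:.1f}  depth={depth}x  read pairs ~ {pairs:.2e}")
```

The estimate ignores duplicates, unmapped reads, and uneven coverage, so it is only a rough guide to data volume per sample.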

Core features of SCOUT

SCOUT integrates curated genomic references and cancer-specific knowledge to generate biologically realistic tumour genomes and evolutionary dynamics across multiple scales.

Germline genetic foundation
Simulations are grounded in real human genetic diversity using germline polymorphisms derived from the 1000 Genomes Project, ensuring realistic population-level background variation in all tumour–normal samples.
Cancer Gene Census–driven drivers
Tumour-specific driver alterations are informed by the Cancer Gene Census, enabling realistic oncogenic architectures that reflect known cancer pathways and disease-specific mutational landscapes.
COSMIC mutational processes
Mutational processes are modelled using COSMIC signatures, capturing context-dependent substitution patterns and indel formation mechanisms that reproduce tumour-type-specific and treatment-related mutational spectra.
Empirical copy-number landscapes
Copy-number alterations are derived from PCAWG and Hartwig datasets, reproducing realistic patterns of aneuploidy, focal amplifications, and deletions across diverse cancer types.
Multi-scale evolutionary modelling
SCOUT integrates clonal expansion, spatial growth, and treatment-driven selection into a unified framework, enabling coherent evolution across cellular, genomic, and population scales.
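
To make "context-dependent substitution patterns" concrete: COSMIC single-base substitution signatures are defined over 96 channels, that is, six pyrimidine-centred substitution types in each of 16 flanking-base contexts. The sketch below enumerates those channels and maps a mutation to its channel. The function names are hypothetical and this is not how ProCESS samples mutations, only an illustration of the channel convention.

```python
from itertools import product

BASES = "ACGT"
# Substitutions are reported with a pyrimidine (C or T) reference base.
SUBSTITUTIONS = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def sbs96_channels():
    """Return the 96 channel labels, e.g. 'A[C>T]G'."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]


def channel(five_prime, ref, alt, three_prime):
    """Map a substitution and its flanking bases to its 96-channel label.

    Purine references (A or G) are folded onto the opposite strand by
    reverse-complementing the trinucleotide and the alternate base.
    """
    if ref in "AG":
        five_prime, three_prime = (
            three_prime.translate(COMPLEMENT),
            five_prime.translate(COMPLEMENT),
        )
        ref, alt = ref.translate(COMPLEMENT), alt.translate(COMPLEMENT)
    return f"{five_prime}[{ref}>{alt}]{three_prime}"


channels = sbs96_channels()
# A G>A change in the context 5'-CGT-3' is the same event as C>T in 5'-ACG-3'.
label = channel("C", "G", "A", "T")
```

A signature is then a probability vector over these 96 labels, and a tumour's mutational spectrum can be modelled as a weighted mixture of such vectors.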

Why SCOUT matters

SCOUT is designed to reflect the structured, multi-layered complexity of cancer evolution, enabling benchmarking under realistic biological, technical, and evolutionary constraints rather than isolated perturbations.

Cancer genomes are interconnected systems
Genomic signals such as variant allele frequencies, copy-number states, mutational signatures, and phylogenetic structure arise jointly from shared evolutionary processes, producing correlated patterns across multiple molecular layers.
Beyond synthetic spike-ins
SCOUT replaces simplified spike-in methods with explicit evolutionary ground truth, capturing tumour growth, subclonal diversification, and treatment-driven selection in a unified simulation framework.
Mechanistic evaluation of inference
Methods are assessed against explicit evolutionary mechanisms, including clonal ancestry, spatial structure, mutational processes, and treatment selection, enabling direct comparison between inferred outputs and true tumour histories.
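
The coupling between purity, copy number, and variant allele frequency described above can be illustrated with the standard expected-VAF relation for bulk sequencing. This is a textbook sketch and not SCOUT's or ProCESS's implementation. The function name and parameters are illustrative, and the formula assumes every tumour cell has the same total copy number at the locus.

```python
def expected_vaf(purity, multiplicity, tumour_cn, normal_cn=2, ccf=1.0):
    """Expected VAF of a somatic SNV in a bulk tumour sample.

    purity       : fraction of cells in the sample that are tumour cells
    multiplicity : mutated copies per mutation-carrying tumour cell
    tumour_cn    : total copy number at the locus in tumour cells
    normal_cn    : total copy number in normal cells (2 on autosomes)
    ccf          : fraction of tumour cells carrying the mutation
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if not 0 <= ccf <= 1:
        raise ValueError("ccf must be in [0, 1]")
    if multiplicity > tumour_cn:
        raise ValueError("multiplicity cannot exceed tumour copy number")
    mutant_copies = purity * ccf * multiplicity
    total_copies = purity * tumour_cn + (1 - purity) * normal_cn
    return mutant_copies / total_copies


# Clonal heterozygous SNV in a diploid region at purity 0.6.
clonal_diploid = expected_vaf(purity=0.6, multiplicity=1, tumour_cn=2)

# Same purity, but the mutation sits on two of three copies after a gain.
clonal_gain = expected_vaf(purity=0.6, multiplicity=2, tumour_cn=3)

# Subclonal SNV present in half of the tumour cells at purity 0.3.
subclonal = expected_vaf(purity=0.3, multiplicity=1, tumour_cn=2, ccf=0.5)
```

The same mutation therefore yields a different VAF depending on purity and local copy number, which is why these signals cannot be benchmarked independently of one another.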

What’s next?

Ongoing developments are extending SCOUT toward single-cell resolution, multi-omic integration, and richer models of tumour evolution and phenotype.

Single-cell DNA sequencing benchmarking
Extension of benchmarking to single-cell DNA sequencing technologies. This will enable evaluation of SNV and copy-number calling performance, as well as phylogenetic reconstruction accuracy at single-cell resolution under realistic noise and coverage constraints.
Multi-omic tumour evolution modelling
Integration of phenotypic evolution with genomic changes through coupled simulation of epigenetic regulation and transcriptional programs, enabling joint modelling of genotype–phenotype interplay during tumour progression.
Single-cell transcriptomics and chromatin accessibility
New simulated experimental layers including single-cell RNA-seq and single-cell ATAC-seq, capturing gene expression dynamics and chromatin accessibility landscapes across evolving tumour subclones.

Software & Pipelines

Software tools and analytical workflows used to generate and analyse the SCOUT cohort.

ProCESS
We generated SCOUT using ProCESS (Programmable Cancer Evolution Spatial Simulator), a framework that integrates spatial tumour evolution with synthetic whole-genome sequencing.
nf-core/sarek
We benchmarked somatic and germline variant calling and copy-number detection in tumour–normal WGS using nf-core/sarek, evaluating performance across tumour purity, sequencing depth, and genomic profiles against ProCESS-derived ground truth.
nf-core/tumourevo
We benchmarked tumour evolution inference methods using nf-core/tumourevo, evaluating driver mutation detection, subclonal deconvolution, and mutational signature analysis. Using ProCESS-generated ground-truth evolutionary scenarios, we assessed the accuracy of reconstructed clonal architectures and mutational processes.
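
Comparing calls against a simulated ground truth typically reduces to counting true positives, false positives, and false negatives. The sketch below scores a call set by exact match on (chromosome, position, reference, alternate). It is an illustration only and not the scoring procedure used in the SCOUT study. The function name and the toy variants are hypothetical, and real comparisons usually normalise variant representation first.

```python
def score_calls(truth, calls):
    """Score called variants against ground truth by exact-key matching.

    Both inputs are iterables of (chrom, pos, ref, alt) tuples.
    Returns counts plus precision, recall, and F1.
    """
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # called and truly present
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # present in the truth set but missed
    precision = tp / (tp + fp) if calls else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy example with made-up variants.
truth = [("chr1", 100, "A", "T"), ("chr1", 200, "C", "G"), ("chr2", 50, "G", "A")]
calls = [("chr1", 100, "A", "T"), ("chr2", 50, "G", "A"), ("chr3", 10, "T", "C")]

scores = score_calls(truth, calls)
```

Running the same function per purity and depth condition gives a simple view of how caller performance changes across the benchmark grid.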

Reproducibility

Resources and containers for fully reproducible analyses.

SCOUT Package
An R package providing streamlined access to SCOUT datasets generated via ProCESS. It enables efficient download and querying of processed sequencing data, along with convenient functions to access associated evolutionary ground truth and benchmarking results.
Analysis code
A public GitHub repository containing all scripts required to reproduce the analyses presented in the SCOUT study, including data processing workflows, benchmarking procedures, and figure generation.
ProCESS Container
A containerised computing environment providing full reproducibility of the SCOUT analysis pipeline. The Docker image includes all dependencies, tools, and workflow configurations required to regenerate the analyses and figures.

Data Resources

Public repositories containing raw sequencing data and bioinformatics pipeline outputs.

Paired-end FASTQ files for each sample and purity level at the maximum simulated coverage are available in ENA under the accession number PRJEB97253T.
Simulated ground-truth mutations from ProCESS, somatic SNV and CNA calls from nf-core/sarek, and output files from nf-core/tumourevo are available on Zenodo.

Publications

Research articles and resources associated with the SCOUT project.

2026 Research Article

Tumour evolution as ground truth for cancer whole-genome sequencing

Lucrezia Valeriani, Giorgia Gandolfi, Elena Buscaroli, Katsiaryna Davydzenka, Giovanni Santacatterina, Alice Antonello, Azad Sadr, Virginia Anna Gazziero, Salvatore Milite, Elena Rivaroli, Anna Kabanova, Guido Sanguinetti, Stefano Cozzini, Alberto Cazzaniga, Giovanni Tonon, Trevor Graham, Andrea Sottoriva, Riccardo Bergamin, Nicola Calonaci, Alberto Casagrande and Giulio Caravagna

bioRxiv Preprint

2026 Software Paper

Automated and reproducible tumour evolutionary analysis from cancer genome sequencing

Elena Buscaroli, Katsiaryna Davydzenka, Giorgia Gandolfi, Virginia Anna Gazziero, Lucrezia Valeriani, Rodolfo Tolloi, Brandon T Hastings, Chela T. James, Davide Rambaldi, Andrea Sottoriva, Trevor Graham, Guido Sanguinetti, Giovanni Tonon, Stefano Cozzini, Alberto Cazzaniga, Nicola Calonaci and Giulio Caravagna

bioRxiv Preprint (in progress)