Getting started with SCOUT

The SCOUT package gives you direct access to the Simulated Cohort of Universal Tumours from R.

Source	What is stored there	Key functions
Tables	Cohort metadata, ground truth tables	`get_metadata()`, `get_ground_truth_cna()`, `get_ground_truth_drivers()`, `get_ground_truth_exposures()`, `get_sampling_information()`
Zenodo	Sequencing RDS files, normal and tumour sarek results, tumourevo results	`get_sequencing_data()`, `get_normal_data()`, `get_sarek_results()`, `get_tumourevo_results()`
ENA (PRJEB97253)	Raw paired-end FASTQ files — 144 tumour + 14 normal (200× per sample, subsamplable to 50×/100×/150×)	— see Raw FASTQ data article

Installation

# install.packages("devtools")  # run once if devtools is not already installed
devtools::install_github("caravagnalab/SCOUT")
library(SCOUT)

Tables

Cohort metadata and ground truth tables are stored as public Google Sheets.

Function	Description
`get_metadata()`	Sample-level annotations (tumour type, clonal class, WGD, sex, …)
`get_ground_truth_cna()`	Ground truth copy number segments
`get_ground_truth_drivers()`	Ground truth driver events (SNVs, CNAs, WGD)
`get_ground_truth_exposures()`	Ground truth mutational signature exposures
`get_sampling_information()`	Per-sample clone proportions and sampling time
`get_sample_names()`	Sample names for a given SPN
`get_tumour_type()`	Tumour type for a given SPN
`get_gender()`	Sex chromosome for a given SPN

All table functions accept optional `spn` and `sample` arguments:

get_metadata()
get_ground_truth_cna("SPN01")
get_ground_truth_cna("SPN01", sample = "1.1")
get_ground_truth_drivers("SPN01")
get_ground_truth_exposures("SPN01", type = "SBS")
get_sampling_information("SPN01")

get_sample_names("SPN01")
get_tumour_type("SPN01")
get_gender("SPN01")

See the Tables article for the full column-level reference.

Zenodo

Data on Zenodo are organised as follows:

Data type	Content
Sequencing ground truth	One record per SPN — `SPN0X_sequencing.tar.gz`
Normal sarek outputs	One shared record — `SPN0X_normal.tar.gz` per SPN
Sarek + tumourevo (SPN01–06)	One record per SPN per purity (`0.9`, `0.6`, `0.3`)
SPN07 sarek	One record per purity
SPN07 tumourevo	One record for purity 0.9 + 0.6, one for 0.3

For the full list of record IDs see the Zenodo article.

Files are downloaded once and cached at `~/.cache/SCOUT/<spn>/`. To use a different cache root, set the `SCOUT_CACHE_DIR` environment variable:

Sys.setenv(SCOUT_CACHE_DIR = "/scratch/shared/SCOUT")
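
To check where files will land before downloading anything, the lookup described above can be mimicked in base R. This is a minimal sketch, not the package's own code: the helper `scout_cache_path()` is hypothetical, and it assumes that an unset or empty `SCOUT_CACHE_DIR` falls back to `~/.cache/SCOUT` and that the per-SPN subdirectory is kept under a custom root.

```r
# Hypothetical helper (not part of SCOUT): predict the per-SPN cache directory.
# Assumption: an unset or empty SCOUT_CACHE_DIR means the default ~/.cache/SCOUT,
# and the <spn> subdirectory is appended to whichever root is in use.
scout_cache_path <- function(spn) {
  root <- Sys.getenv("SCOUT_CACHE_DIR", unset = "")
  if (!nzchar(root)) {
    root <- file.path("~", ".cache", "SCOUT")
  }
  file.path(root, spn)
}

Sys.setenv(SCOUT_CACHE_DIR = "/scratch/shared/SCOUT")
scout_cache_path("SPN04")
#> [1] "/scratch/shared/SCOUT/SPN04"
```

Set the variable before the first download call in a session so that all files end up under the same root.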

Download functions

# Tumour sequencing ground truth (all purities and coverages)
get_sequencing_data("SPN04")

# Normal sarek VCF outputs
get_normal_data("SPN04")

# Sarek and tumourevo pipeline results for a given purity
get_sarek_results("SPN04", purity = 0.9)
get_tumourevo_results("SPN04", purity = 0.9)
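
Because pipeline results are stored per purity, fetching everything for one SPN means repeating the call for `0.9`, `0.6` and `0.3`. The wrapper below is a sketch of that loop; `download_all_purities()` is a hypothetical name, and the download function is passed in as an argument so that the wrapper makes no assumption about what each call returns.

```r
# Hypothetical wrapper (not part of SCOUT): call one download function for
# every purity level of an SPN. `fetch` is any function with the signature
# fetch(spn, purity = ...), e.g. get_sarek_results or get_tumourevo_results.
download_all_purities <- function(spn, fetch, purities = c(0.9, 0.6, 0.3)) {
  results <- lapply(purities, function(p) fetch(spn, purity = p))
  names(results) <- as.character(purities)  # label each result by its purity
  results
}

# Usage, left commented out because it triggers real downloads:
# sarek <- download_all_purities("SPN04", fetch = SCOUT::get_sarek_results)
# sarek[["0.6"]]
```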

Getter functions

Once the data are downloaded, dedicated getters resolve file paths for you, so there is no need to navigate the cache directory manually:

# Ground truth mutations
get_mutations("SPN04", type = "tumour", coverage = 100, purity = 0.9)
get_mutations("SPN04", type = "normal")

# Sarek VCF and CNA files
get_sarek_vcf("SPN04", "SPN04_1.1", 100, 0.9, "mutect2", "tumour")
get_sarek_cna("SPN04", "SPN04_1.1", 100, 0.9, "ascat")

# tumourevo outputs
get_tumourevo_snv("SPN04", 50, 0.6, "mutect2", "sequenza", "SPN04_1.1")
get_tumourevo_cna("SPN04", 50, 0.6, "mutect2", "sequenza", "SPN04_1.1")
get_tumourevo_driver("SPN04", 50, 0.6, "mutect2", "sequenza", "SPN04_1.1")
get_tumourevo_subclonal("SPN04", 50, 0.6, "mutect2", "sequenza", "mobster", "SPN04_1.1")
get_tumourevo_qc("SPN04", 50, 0.6, "mutect2", "sequenza", "cnaqc", "SPN04_1.1")
get_tumourevo_signatures("SPN04", 50, 0.6, "mutect2", "sequenza", "BASCULE")
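
The table functions take a short sample label (`"1.1"`), whereas the getters above take a prefixed one (`"SPN04_1.1"`). Assuming the prefixed form is simply the SPN and the label joined by an underscore, which is inferred from the examples on this page rather than documented, a small hypothetical helper can build it:

```r
# Hypothetical helper (not part of SCOUT): build the prefixed sample ID used by
# the getters from a short label. Assumption: the ID is "<spn>_<label>".
full_sample_id <- function(spn, sample) {
  already_prefixed <- startsWith(sample, paste0(spn, "_"))
  # Leave IDs that already carry the prefix untouched
  ifelse(already_prefixed, sample, paste0(spn, "_", sample))
}

full_sample_id("SPN04", c("1.1", "SPN04_1.2"))
#> [1] "SPN04_1.1" "SPN04_1.2"
```

Compare the result against the output of `get_sample_names()` before relying on this convention.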

See the Zenodo article for the full function reference.

Raw FASTQ data

Raw paired-end FASTQ files are available in the European Nucleotide Archive under accession PRJEB97253. There are 79 entries in total: 72 tumour sample × purity combinations across all SPNs, plus 7 normal samples (one per SPN). Each entry contains files named `tXX_{spn}_{sample}.R1.fastq.gz`, where `tXX` runs from `t00` to `t39` (40 bins of 5× each, giving 200× total coverage per sample).

Standard coverage levels can be reproduced by subsetting consecutive bins with SeqKit:

Coverage	Bins
50×	`t00`–`t09`
100×	`t00`–`t19`
150×	`t00`–`t29`
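
The bin ranges in the table follow directly from the 5× depth of each bin: a target coverage of C needs the first C / 5 bins. The sketch below computes the bin labels and the matching R1 file names in base R. Both helper names are hypothetical, and the `spn` and `sample` values are placeholders for whatever the `{spn}` and `{sample}` fields contain in the actual ENA file names.

```r
# Hypothetical helper: bin labels needed for a target coverage, assuming each
# bin contributes 5x and bins are taken consecutively starting from t00.
bins_for_coverage <- function(coverage, bin_depth = 5, n_bins = 40) {
  n <- coverage / bin_depth
  stopifnot(n == round(n), n >= 1, n <= n_bins)
  sprintf("t%02d", seq_len(n) - 1L)
}

# Hypothetical helper: R1 file names for one sample, following the
# tXX_{spn}_{sample}.R1.fastq.gz pattern.
r1_files <- function(coverage, spn, sample) {
  paste0(bins_for_coverage(coverage), "_", spn, "_", sample, ".R1.fastq.gz")
}

bins_for_coverage(50)
#>  [1] "t00" "t01" "t02" "t03" "t04" "t05" "t06" "t07" "t08" "t09"

head(r1_files(100, "SPN01", "1.1"), 2)
#> [1] "t00_SPN01_1.1.R1.fastq.gz" "t01_SPN01_1.1.R1.fastq.gz"
```

The resulting file list can then be handed to whichever tool is used to combine the bins.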

See the Raw FASTQ data article for installation instructions and the full subsampling commands.