The SCOUT package gives you direct access to the Simulated Cohort of Universal Tumours from R.

Source | What is stored there | Key functions
Tables | Cohort metadata, ground truth tables | get_metadata(), get_ground_truth_cna(), get_ground_truth_drivers(), get_ground_truth_exposures(), get_sampling_information()
Zenodo | Sequencing RDS files, normal and tumour sarek results, tumourevo results | get_sequencing_data(), get_normal_data(), get_sarek_results(), get_tumourevo_results()
ENA (PRJEB97253) | Raw paired-end FASTQ files: 144 tumour + 14 normal (200× per sample, subsamplable to 50×/100×/150×) | See the Raw FASTQ data article

Installation

devtools::install_github("caravagnalab/SCOUT")
library(SCOUT)

Tables

Cohort metadata and ground truth tables are stored as public Google Sheets.

Function | Description
get_metadata() | Sample-level annotations (tumour type, clonal class, WGD, sex, …)
get_ground_truth_cna() | Ground truth copy number segments
get_ground_truth_drivers() | Ground truth driver events (SNVs, CNAs, WGD)
get_ground_truth_exposures() | Ground truth mutational signature exposures
get_sampling_information() | Per-sample clone proportions and sampling time
get_sample_names() | Sample names for a given SPN
get_tumour_type() | Tumour type for a given SPN
get_gender() | Sex chromosome for a given SPN

All table functions accept optional spn and sample arguments:

get_metadata()
get_ground_truth_cna("SPN01")
get_ground_truth_cna("SPN01", sample = "1.1")
get_ground_truth_drivers("SPN01")
get_ground_truth_exposures("SPN01", type = "SBS")
get_sampling_information("SPN01")

get_sample_names("SPN01")
get_tumour_type("SPN01")
get_gender("SPN01")

See the Tables article for the full column-level reference.


Zenodo

Data on Zenodo are organised as follows:

Data type | Content
Sequencing ground truth | One record per SPN (SPN0X_sequencing.tar.gz)
Normal sarek outputs | One shared record, with one SPN0X_normal.tar.gz per SPN
Sarek + tumourevo (SPN01–06) | One record per SPN per purity (0.9, 0.6, 0.3)
SPN07 sarek | One record per purity
SPN07 tumourevo | One record for purities 0.9 and 0.6, one for purity 0.3

For the full list of record IDs see the Zenodo article.

Files are downloaded once and cached at ~/.cache/SCOUT/<spn>/. Override the cache root by setting the SCOUT_CACHE_DIR environment variable:

Sys.setenv(SCOUT_CACHE_DIR = "/scratch/shared/SCOUT")
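
The cache lookup described above can be pictured with a small base R sketch. The helper scout_cache_path() is hypothetical and not part of SCOUT; it only mirrors the documented behaviour, and it assumes that the per-SPN subdirectory is kept when the root is overridden.

```r
# Hypothetical helper, NOT a SCOUT function: it illustrates the documented
# rule "use SCOUT_CACHE_DIR if set, otherwise ~/.cache/SCOUT".
scout_cache_path <- function(spn) {
  root <- Sys.getenv("SCOUT_CACHE_DIR", unset = "")
  if (!nzchar(root)) {
    # Default cache root stated in the documentation
    root <- file.path("~", ".cache", "SCOUT")
  }
  # Assumption: files for each SPN live in <root>/<spn>/
  file.path(root, spn)
}

# With the override in place
Sys.setenv(SCOUT_CACHE_DIR = "/scratch/shared/SCOUT")
print(scout_cache_path("SPN04"))

# Without the override, the default root is used
Sys.unsetenv("SCOUT_CACHE_DIR")
print(scout_cache_path("SPN04"))
```

Setting the variable before the first download call is the safer order, since a root changed mid-session would presumably leave earlier files in the old location.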

Download functions

# Tumour sequencing ground truth (all purities and coverages)
get_sequencing_data("SPN04")

# Normal sarek VCF outputs
get_normal_data("SPN04")

# Sarek and tumourevo pipeline results for a given purity
get_sarek_results("SPN04", purity = 0.9)
get_tumourevo_results("SPN04", purity = 0.9)
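
To fetch pipeline results for every purity of one SPN, the calls above can be wrapped in a loop. The following is a minimal sketch, assuming the three purities listed for SPN01–06 (0.9, 0.6, 0.3) and that the download functions are exported by the package; the run_downloads flag is an illustrative guard, not a SCOUT option, and is off by default so that nothing is downloaded by accident.

```r
spn <- "SPN04"
purities <- c(0.9, 0.6, 0.3)  # purities listed for SPN01-06 in the Zenodo table
run_downloads <- FALSE        # illustrative guard: set to TRUE to download

# One row per (SPN, purity) combination to fetch
jobs <- expand.grid(spn = spn, purity = purities, stringsAsFactors = FALSE)
print(jobs)

# Only call SCOUT when explicitly requested and when the package is installed
if (run_downloads && requireNamespace("SCOUT", quietly = TRUE)) {
  for (i in seq_len(nrow(jobs))) {
    SCOUT::get_sarek_results(jobs$spn[i], purity = jobs$purity[i])
    SCOUT::get_tumourevo_results(jobs$spn[i], purity = jobs$purity[i])
  }
}
```

Because files are cached after the first download, rerunning the loop should not fetch the same record twice.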

Getter functions

Once downloaded, dedicated getters resolve file paths without manual directory navigation:

# Ground truth mutations
get_mutations("SPN04", type = "tumour", coverage = 100, purity = 0.9)
get_mutations("SPN04", type = "normal")

# Sarek VCF and CNA files
get_sarek_vcf("SPN04", "SPN04_1.1", 100, 0.9, "mutect2", "tumour")
get_sarek_cna("SPN04", "SPN04_1.1", 100, 0.9, "ascat")

# tumourevo outputs
get_tumourevo_snv("SPN04", 50, 0.6, "mutect2", "sequenza", "SPN04_1.1")
get_tumourevo_cna("SPN04", 50, 0.6, "mutect2", "sequenza", "SPN04_1.1")
get_tumourevo_driver("SPN04", 50, 0.6, "mutect2", "sequenza", "SPN04_1.1")
get_tumourevo_subclonal("SPN04", 50, 0.6, "mutect2", "sequenza", "mobster", "SPN04_1.1")
get_tumourevo_qc("SPN04", 50, 0.6, "mutect2", "sequenza", "cnaqc", "SPN04_1.1")
get_tumourevo_signatures("SPN04", 50, 0.6, "mutect2", "sequenza", "BASCULE")

See the Zenodo article for the full function reference.


Raw FASTQ data

Raw paired-end FASTQ files are available in the European Nucleotide Archive under accession PRJEB97253. There are 79 entries in total: 72 tumour sample × purity combinations across all SPNs, plus 7 normal samples (one per SPN). Each entry contains files named tXX_{spn}_{sample}.R1.fastq.gz, with XX running over bins t00–t39 (40 bins × 5× = 200× total coverage per sample).

Standard coverage levels can be reproduced by subsetting consecutive bins with SeqKit:

Coverage | Bins
50× | t00–t09
100× | t00–t19
150× | t00–t29

See the Raw FASTQ data article for installation instructions and the full subsampling commands.
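
The mapping from a target coverage to its bins is simple arithmetic (5× per bin, starting at t00), and can be sketched in base R. The helpers bins_for_coverage() and fastq_names() are hypothetical and not part of SCOUT; the sample label passed in the example is illustrative, and the exact file names on ENA may differ.

```r
bin_coverage <- 5  # each bin contributes 5x, per the text above
n_bins <- 40       # bins t00 to t39, 200x in total

# Hypothetical helper: bin labels needed to reach a target coverage
bins_for_coverage <- function(coverage) {
  stopifnot(
    coverage > 0,
    coverage %% bin_coverage == 0,
    coverage <= bin_coverage * n_bins
  )
  n <- as.integer(coverage / bin_coverage)
  sprintf("t%02d", seq_len(n) - 1L)
}

# Hypothetical helper: R1 file names following the documented pattern
# tXX_{spn}_{sample}.R1.fastq.gz (the sample label format is an assumption)
fastq_names <- function(spn, sample, coverage) {
  paste0(bins_for_coverage(coverage), "_", spn, "_", sample, ".R1.fastq.gz")
}

print(bins_for_coverage(50))
print(head(fastq_names("SPN04", "1.1", 100), 3))
```

The resulting list only identifies which bin files to combine; the subsetting itself is done with SeqKit as described in the Raw FASTQ data article.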