Raw FASTQ files for all SCOUT samples are deposited in the European Nucleotide Archive (ENA) under accession number PRJEB97253.


ENA record structure

The accession PRJEB97253 contains one file per sample × purity × read direction, plus matched normals:

Data type | Files | Count
Tumour (24 samples × 3 purities × R1+R2) | {spn}_{sample}_{purity}.R1.fastq.gz / .R2.fastq.gz | 144
Normal (7 SPNs × R1+R2) | {spn}_normal.R1.fastq.gz / .R2.fastq.gz | 14
Total | | 158

Examples:

SPN02_1.1_0.9.R1.fastq.gz    SPN02_1.1_0.9.R2.fastq.gz
SPN02_1.1_0.6.R1.fastq.gz    SPN02_1.1_0.6.R2.fastq.gz
SPN02_1.1_0.3.R1.fastq.gz    SPN02_1.1_0.3.R2.fastq.gz
SPN05_normal.R1.fastq.gz      SPN05_normal.R2.fastq.gz
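
The naming scheme and the file count in the table can be sketched as a pair of small shell helpers. This is an illustration only: the function names `tumour_fastq_names` and `normal_fastq_names` and the variable `expected_total` are hypothetical and are not part of the SCOUT tooling.

```shell
# Hypothetical helpers illustrating the ENA naming scheme described above.

# Print the R1 and R2 file names for one tumour sample at one purity.
# Arguments: SPN id (e.g. SPN02), sample id (e.g. 1.1), purity (e.g. 0.9).
tumour_fastq_names() {
  printf '%s_%s_%s.R1.fastq.gz\n' "$1" "$2" "$3"
  printf '%s_%s_%s.R2.fastq.gz\n' "$1" "$2" "$3"
}

# Print the R1 and R2 file names for the matched normal of one SPN.
normal_fastq_names() {
  printf '%s_normal.R1.fastq.gz\n' "$1"
  printf '%s_normal.R2.fastq.gz\n' "$1"
}

# File count from the table: 24 samples x 3 purities x 2 reads,
# plus 7 normals x 2 reads.
expected_total=$(( 24 * 3 * 2 + 7 * 2 ))
```

For example, `tumour_fastq_names SPN02 1.1 0.9` prints the first pair of names listed in the examples, and `expected_total` evaluates to 158.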

Internal structure of each FASTQ file

Each tumour FASTQ file (e.g. SPN02_1.1_0.9.R1.fastq.gz) is a concatenation of 40 bins, t00–t39, where each bin contains reads from one 5× coverage chunk. The reads are named with the bin prefix and sample identifier:

t14_SPN01_1.2.R2.fastq.gz    (individual bin file, tumour)
n3_normal_sample.R2.fastq.gz  (individual bin file, normal)

40 bins × 5× per bin = 200× total coverage per sample.


Coverage bins

Each tXX bin corresponds to exactly 5× sequencing coverage. Consecutive bins are combined to reproduce standard coverage levels:

Coverage | Bins used | Number of bins
50× | t00–t09 | 10
100× | t00–t19 | 20
150× | t00–t29 | 30
200× | t00–t39 | 40
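
The arithmetic behind the table is simply coverage divided by 5, with bins numbered from t00. A minimal sketch follows. The helper names `bins_for_coverage` and `last_bin_for_coverage` are hypothetical, and accepting any multiple of 5 (rather than only the four standard levels) is an assumption that follows from each bin being exactly 5×.

```shell
# Hypothetical helpers: map a target coverage to the bins it needs.

# Number of 5x bins needed for a target coverage.
# Assumption: any multiple of 5 between 5 and 200 is acceptable.
bins_for_coverage() {
  cov=$1
  if [ "$cov" -lt 5 ] || [ "$cov" -gt 200 ] || [ $(( cov % 5 )) -ne 0 ]; then
    echo "coverage must be a multiple of 5 between 5 and 200" >&2
    return 1
  fi
  echo $(( cov / 5 ))
}

# Label of the last bin included, counting from t00.
last_bin_for_coverage() {
  n=$(bins_for_coverage "$1") || return 1
  printf 't%02d\n' $(( n - 1 ))
}
```

For example, `bins_for_coverage 100` prints 20 and `last_bin_for_coverage 100` prints t19, matching the 100× row.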

Subsampling with SeqKit

SeqKit is required to subset FASTQ files by header pattern while preserving valid FASTQ format.

Installation

# Conda (recommended)
conda install bioconda::seqkit

# Docker
docker pull quay.io/biocontainers/seqkit:2.9.0--h9ee0642_0

# Singularity
singularity pull https://depot.galaxyproject.org/singularity/seqkit:2.9.0--h9ee0642_0

The input is the merged FASTQ file from ENA (e.g. SPN02_1.1_0.9.R1.fastq.gz), which contains reads from all 40 bins concatenated together. SeqKit filters by the tXX prefix in each read name to extract a specific coverage level.
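
Before running SeqKit on a large file, the bin patterns can be checked on a handful of read names with plain `grep -E`. The sketch below assumes read names start with `tXX_`; the helper name `bin_pattern` and the read names in the usage note are made up for illustration. These simple patterns should behave the same in `grep -E` and in SeqKit's regular expression engine, but that equivalence is an assumption worth confirming on real data.

```shell
# Hypothetical helper: regular expression selecting the bins for a
# standard coverage level. Assumes read names begin with "tXX_".
bin_pattern() {
  case "$1" in
    50)  echo '^t0[0-9]_' ;;
    100) echo '^t(0[0-9]|1[0-9])_' ;;
    150) echo '^t(0[0-9]|1[0-9]|2[0-9])_' ;;
    *)   echo "unsupported coverage: $1" >&2; return 1 ;;
  esac
}

# Count how many read names on stdin fall inside the selected bins.
# $1 = coverage level (50, 100 or 150).
count_selected() {
  grep -E -c "$(bin_pattern "$1")"
}
```

For instance, `printf 't05_a\nt19_b\nt20_c\n' | count_selected 100` counts the two names in bins t05 and t19 and skips the one in t20.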

50× coverage

Select bins t00–t09 (10 × 5× = 50×):

seqkit grep --threads 2 -n -r -p "^t0[0-9]_" \
  SPN02_1.1_0.9.R1.fastq.gz -o SPN02_1.1_0.9_50x.R1.fastq.gz

seqkit grep --threads 2 -n -r -p "^t0[0-9]_" \
  SPN02_1.1_0.9.R2.fastq.gz -o SPN02_1.1_0.9_50x.R2.fastq.gz

100× coverage

Select bins t00–t19 (20 × 5× = 100×):

seqkit grep --threads 2 -n -r -p "^t(0[0-9]|1[0-9])_" \
  SPN02_1.1_0.9.R1.fastq.gz -o SPN02_1.1_0.9_100x.R1.fastq.gz

seqkit grep --threads 2 -n -r -p "^t(0[0-9]|1[0-9])_" \
  SPN02_1.1_0.9.R2.fastq.gz -o SPN02_1.1_0.9_100x.R2.fastq.gz

150× coverage

Select bins t00–t29 (30 × 5× = 150×):

seqkit grep --threads 2 -n -r -p "^t(0[0-9]|1[0-9]|2[0-9])_" \
  SPN02_1.1_0.9.R1.fastq.gz -o SPN02_1.1_0.9_150x.R1.fastq.gz

seqkit grep --threads 2 -n -r -p "^t(0[0-9]|1[0-9]|2[0-9])_" \
  SPN02_1.1_0.9.R2.fastq.gz -o SPN02_1.1_0.9_150x.R2.fastq.gz

200× coverage (full)

Use the file as downloaded — no filtering needed.
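
The filtering commands for the lower coverage levels all follow one template, so they can be wrapped in a small function that runs the same pattern over both mates. This is a sketch, not part of the SCOUT tooling: the function name `subsample_pair` and its argument order are hypothetical, while the `seqkit grep` options are the ones used in the commands above.

```shell
# Hypothetical wrapper: apply one bin pattern to both mates of a pair.
# $1 = file prefix (e.g. SPN02_1.1_0.9)
# $2 = output label (e.g. 50x)
# $3 = regular expression selecting the bins
subsample_pair() {
  for mate in R1 R2; do
    seqkit grep --threads 2 -n -r -p "$3" \
      "${1}.${mate}.fastq.gz" -o "${1}_${2}.${mate}.fastq.gz" || return 1
  done
}
```

Calling `subsample_pair SPN02_1.1_0.9 100x '^t(0[0-9]|1[0-9])_'` would run the two 100× commands in sequence and stop at the first failure.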


Notes

  • The same bin pattern must be applied consistently to both R1 and R2 to maintain paired-end integrity.
  • Output files are valid FASTQ and can be passed directly to any alignment or variant calling pipeline (e.g. Sarek).
  • SeqKit is recommended for large files due to its speed and low memory usage.
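
One quick sanity check on paired-end integrity is to confirm that the filtered R1 and R2 files hold the same number of reads. The sketch below uses only `gzip` and `wc`, and assumes standard four-line FASTQ records; the function names `count_reads` and `check_pair` are hypothetical. Equal counts are a necessary condition, not proof that the mates are in the same order.

```shell
# Hypothetical helpers for a basic paired-end sanity check.

# Count reads in a gzipped FASTQ file.
# Assumption: every record spans exactly four lines.
count_reads() {
  echo $(( $(gzip -cd "$1" | wc -l) / 4 ))
}

# Print the shared read count if R1 and R2 agree, otherwise fail.
# $1 = R1 file, $2 = R2 file.
check_pair() {
  r1=$(count_reads "$1")
  r2=$(count_reads "$2")
  if [ "$r1" -ne "$r2" ]; then
    echo "read count mismatch: $1=$r1 $2=$r2" >&2
    return 1
  fi
  echo "$r1"
}
```

For example, `check_pair SPN02_1.1_0.9_50x.R1.fastq.gz SPN02_1.1_0.9_50x.R2.fastq.gz` prints the common read count or exits with a non-zero status. `seqkit stats` reports the same counts alongside other summary figures.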