
Raw FASTQ data
Raw FASTQ files for all SCOUT samples are deposited in the European Nucleotide Archive (ENA) under accession number PRJEB97253.
ENA record structure
The accession PRJEB97253 contains one file per sample × purity × read direction, plus matched normals:
| Data type | Files | Count |
|---|---|---|
| Tumour (24 samples × 3 purities × R1+R2) | `{spn}_{sample}_{purity}.R1.fastq.gz` / `.R2.fastq.gz` | 144 |
| Normal (7 SPNs × R1+R2) | `{spn}_normal.R1.fastq.gz` / `.R2.fastq.gz` | 14 |
| Total | | 158 |
Examples:
SPN02_1.1_0.9.R1.fastq.gz SPN02_1.1_0.9.R2.fastq.gz
SPN02_1.1_0.6.R1.fastq.gz SPN02_1.1_0.6.R2.fastq.gz
SPN02_1.1_0.3.R1.fastq.gz SPN02_1.1_0.3.R2.fastq.gz
SPN05_normal.R1.fastq.gz SPN05_normal.R2.fastq.gz
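The naming scheme above can be split back into its fields with plain shell parameter expansion. The following is a minimal sketch, not part of the SCOUT tooling: the function name `parse_scout_name` and the tab-separated output layout (SPN, sample, purity, read direction) are assumptions made for illustration, and normals report `NA` for purity.

```shell
# Hypothetical helper: split a SCOUT FASTQ file name into its fields.
# Prints: SPN <tab> sample <tab> purity <tab> read direction
parse_scout_name() {
  base=${1##*/}            # drop any directory part
  stem=${base%.fastq.gz}   # e.g. SPN02_1.1_0.9.R1
  read_dir=${stem##*.}     # R1 or R2
  id=${stem%.*}            # SPN02_1.1_0.9 or SPN05_normal
  spn=${id%%_*}            # SPN02
  rest=${id#*_}            # 1.1_0.9 or normal
  if [ "$rest" = "normal" ]; then
    # Matched normals carry no sample or purity field
    printf '%s\tnormal\tNA\t%s\n' "$spn" "$read_dir"
  else
    printf '%s\t%s\t%s\t%s\n' "$spn" "${rest%_*}" "${rest##*_}" "$read_dir"
  fi
}

parse_scout_name SPN02_1.1_0.9.R1.fastq.gz
parse_scout_name SPN05_normal.R2.fastq.gz
```

Parsing by position works here only because the SPN, sample, and purity fields are separated by underscores and the read direction by a dot, as in the examples listed above.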
Internal structure of each FASTQ file
Each tumour FASTQ file (e.g. SPN02_1.1_0.9.R1.fastq.gz)
is a concatenation of 40 bins t00–t39, where
each bin contains reads from one 5× coverage chunk. The reads are named
with the bin prefix and sample identifier:
t14_SPN01_1.2.R2.fastq.gz (individual bin file, tumour)
n3_normal_sample.R2.fastq.gz (individual bin file, normal)
With 40 bins at 5× each, every sample totals 200× coverage.
Coverage bins
Each tXX bin corresponds to exactly 5×
sequencing coverage. Consecutive bins are combined to reproduce
standard coverage levels:
| Coverage | Bins used | Number of bins |
|---|---|---|
| 50× | t00–t09 | 10 |
| 100× | t00–t19 | 20 |
| 150× | t00–t29 | 30 |
| 200× | t00–t39 | 40 |
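The mapping in the table is simple arithmetic: a target coverage divided by 5× gives the number of bins, counted from t00. A small sketch of that calculation follows; the function name `bins_for_coverage` is hypothetical, and the restriction to multiples of 5 between 5 and 200 is an assumption derived from the 40 bins described above.

```shell
# Hypothetical helper: list the bin labels needed for a target coverage.
# Each bin is 5x, so coverage / 5 bins are taken, starting at t00.
bins_for_coverage() {
  coverage=$1
  if [ "$coverage" -lt 5 ] || [ "$coverage" -gt 200 ] || [ $(( coverage % 5 )) -ne 0 ]; then
    echo "coverage must be a multiple of 5 between 5 and 200" >&2
    return 1
  fi
  n=$(( coverage / 5 ))
  i=0
  while [ "$i" -lt "$n" ]; do
    printf 't%02d\n' "$i"   # zero-padded label: t00, t01, ...
    i=$(( i + 1 ))
  done
}

# List the ten bins that make up 50x coverage
bins_for_coverage 50
```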
Subsampling with SeqKit
SeqKit is required to subset FASTQ files by header pattern while preserving valid FASTQ format.
Installation
# Conda (recommended)
conda install bioconda::seqkit
# Docker
docker pull quay.io/biocontainers/seqkit:2.9.0--h9ee0642_0
# Singularity
singularity pull https://depot.galaxyproject.org/singularity/seqkit:2.9.0--h9ee0642_0

The input is the merged FASTQ file from ENA
(e.g. SPN02_1.1_0.9.R1.fastq.gz), which contains reads from
all 40 bins concatenated together. SeqKit filters by the
tXX prefix in each read name to extract a specific coverage
level.
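One way this filtering might look is sketched below, using `seqkit grep` with a regular expression (`-r`, `-p`) matched against read IDs. It rests on an assumption not confirmed by the text above: that each read ID begins directly with the two-digit bin label (`t00` to `t39`). Check a few headers of your own files first, for example with `seqkit head`, and adjust the pattern if the prefix is formatted differently. The function names `bin_regex` and `subsample_to_coverage` and the output file name are hypothetical.

```shell
# Hypothetical helper: build a regex matching read IDs that start with any
# of the first (coverage / 5) bin labels, e.g. ^(t00|t01|...|t09) for 50x.
bin_regex() {
  n=$(( $1 / 5 ))
  i=0
  alt=''
  while [ "$i" -lt "$n" ]; do
    label=$(printf 't%02d' "$i")
    alt="${alt:+$alt|}$label"   # join labels with '|'
    i=$(( i + 1 ))
  done
  printf '^(%s)\n' "$alt"
}

# Hypothetical wrapper: keep only reads whose ID matches the bin regex.
# Arguments: target coverage, input FASTQ, output FASTQ.
subsample_to_coverage() {
  seqkit grep -r -p "$(bin_regex "$1")" "$2" -o "$3"
}

# Example (not run here): extract 50x from both mates with the same pattern
# subsample_to_coverage 50 SPN02_1.1_0.9.R1.fastq.gz SPN02_1.1_0.9.50x.R1.fastq.gz
# subsample_to_coverage 50 SPN02_1.1_0.9.R2.fastq.gz SPN02_1.1_0.9.50x.R2.fastq.gz
```

Applying the same pattern to R1 and R2 selects the same bins in both files, which should keep the mates paired as long as both files contain the same reads in the same order.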