This function parses a set of mutations stored in the Variant Call Format (VCF), using the vcfR R package. It takes as input the names of two columns in the parsed VCF data: one with the total read depth at each locus, and one with the number of reads supporting the variant. The two columns are used to compute the variant allele frequency (VAF); if either required column is not found, the function raises an error. This function does not filter any of the variants annotated in the input VCF file.

load_vcf(file, DP_column = "DP", NV_column = "NV")

Arguments

file

A VCF file name.

DP_column

Column with the total depth (DP), i.e., sum of reads at a locus.

NV_column

Column with the number of reads with the variant (NV).

Value

A tibble containing all the content that can be parsed from the VCF file, plus three columns named DP, NV and VAF, where VAF = NV/DP.
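The VAF computation and the missing-column check described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: `add_vaf` and the toy column `gt_NV` are hypothetical names, and the sketch copies the chosen columns into DP and NV, which may differ from what load_vcf does internally.

```r
# Hypothetical helper (not part of the package): derive DP, NV and VAF
# from two user-chosen columns of an already parsed table of calls.
add_vaf <- function(calls, DP_column = "DP", NV_column = "NV") {
  # Fail early if either required column is absent.
  required <- c(DP_column, NV_column)
  missing_cols <- setdiff(required, colnames(calls))
  if (length(missing_cols) > 0) {
    stop("Required column(s) not found: ",
         paste(missing_cols, collapse = ", "))
  }

  # Copy the chosen columns to the standard names and compute VAF = NV / DP.
  calls$DP <- as.numeric(calls[[DP_column]])
  calls$NV <- as.numeric(calls[[NV_column]])
  calls$VAF <- calls$NV / calls$DP
  calls
}

# Toy data standing in for a parsed VCF; the values are made up.
toy <- data.frame(
  CHROM = c("chr1", "chr2"),
  gt_DP = c(40, 25),
  gt_NV = c(10, 5),
  stringsAsFactors = FALSE
)

result <- add_vaf(toy, DP_column = "gt_DP", NV_column = "gt_NV")
print(result[, c("CHROM", "DP", "NV", "VAF")])
```

Calling `add_vaf(toy)` with the default column names stops with an error, because the toy table has no DP or NV column. The sketch does not handle loci with zero depth, where the division yields NaN or Inf.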

Examples

# Example VCF file from the https://github.com/openvax/varcode repository
file = 'https://raw.githubusercontent.com/openvax/varcode/master/test/data/strelka-example.vcf'
download.file(url = file, destfile = 'strelka-example.vcf')

# We pretend that the gt_DP2 field holds the number of reads with the variant,
# which is wrong; the call only shows how to load a VCF file.
load_vcf(file = 'strelka-example.vcf', DP_column = 'gt_DP', NV_column = 'gt_DP2')
#> Loading VCF file strelka-example.vcf with package vcfR, file size 11.6 Kb.
#> # A tibble: 40 x 33
#>    CHROM POS   ID    REF   ALT   QUAL  FILTER INFO    Key   QSI  TQSI NT
#>    <chr> <chr> <chr> <chr> <chr> <chr> <chr>  <chr> <int> <int> <int> <chr>
#>  1 chr1  1638… NA    GC    G     NA    PASS   IC=1…     1    33     1 ref
#>  2 chr1  1708… NA    G     GC    NA    PASS   IC=3…     1    33     1 ref
#>  3 chr2  9662… NA    GC    G     NA    PASS   IC=0…     2    46     1 ref
#>  4 chr4  4911… NA    C     CCCT… NA    PASS   IC=1…     2    46     1 ref
#>  5 chr4  4912… NA    C     CGGA… NA    PASS   IC=1…     3    33     1 ref
#>  6 chr4  1909… NA    C     CAT   NA    PASS   IC=1…     3    33     1 ref
#>  7 chr9  1399… NA    T     TGTT… NA    PASS   IC=1…     4    36     2 ref
#>  8 chr9  1399… NA    TGTT… T     NA    PASS   IC=0…     4    36     2 ref
#>  9 chr10 4236… NA    C     CGGA… NA    PASS   IC=1…     5    40     2 ref
#> 10 chr11 1271… NA    A     AAAG  NA    PASS   IC=1…     5    40     2 ref
#> # … with 30 more rows, and 21 more variables: QSI_NT <int>, TQSI_NT <int>,
#> #   SGT <chr>, RU <chr>, RC <int>, IC <int>, IHP <int>, SVTYPE <chr>,
#> #   SOMATIC <lgl>, OVERLAP <lgl>, Indiv <chr>, DP <dbl>, NV <dbl>,
#> #   gt_TAR <chr>, gt_TIR <chr>, gt_TOR <chr>, gt_DP50 <dbl>, gt_FDP50 <dbl>,
#> #   gt_SUBDP50 <dbl>, gt_GT_alleles <chr>, VAF <dbl>