I’m trying to parse the VEP fields for the v4 data set, and the fields do not seem to match the header format. The problem goes beyond some fields simply being missing at the end of the header. For instance:
bcftools view -h gnomad.genomes.v4.0.sites.chr8.vcf.bgz \
| grep "^##INFO=<ID=vep" | grep -o "Format:.*" | tr '|' '\n' | nl
...
30 CCDS
31 ENSP
32 UNIPROT_ISOFORM
33 SOURCE
34 DOMAINS
35 miRNA
...
zgrep -m1 "rs1304835881" gnomad.genomes.v4.0.sites.chr8.vcf.bgz \
| cut -f8 | tr ';' '\n' | grep "^vep" | cut -d'=' -f2- | tr ',' '\n' | head -1 \
| tr '|' '\n' | nl -ba
...
30 CCDS34792.1
31 ENSP00000318878
32
33 Ensembl
34
35
36 PANTHER:PTHR48002&PANTHER:PTHR48002&Gene3D:1
37
38
Here the PANTHER entries are the affected protein domains, so they should
be in the 34th field (DOMAINS), but they show up in the 36th.
At the UCSC Genome Browser, these VEP annotations are very important because they determine what color each variant is displayed in and how we implement filters, but right now we can’t parse them.
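A quick way to confirm the mismatch is to count the '|'-separated fields in each annotation and compare that with the number of names in the header. This is only a rough sketch: `vep_field_counts` is a helper name I made up, and it assumes annotations are separated by commas and never contain a literal comma or '|' inside a value.

```shell
# Hypothetical helper: read one or more vep= values on stdin and print the
# number of '|'-separated fields in each comma-separated annotation.
# Empty fields, including trailing ones, are counted.
vep_field_counts() {
  tr ',' '\n' | awk -F'[|]' '{ print NF }'
}

# Possible use against the file from this post (needs the VCF, not run here):
#   zgrep -m1 "rs1304835881" gnomad.genomes.v4.0.sites.chr8.vcf.bgz \
#   | cut -f8 | tr ';' '\n' | grep "^vep" | cut -d'=' -f2- \
#   | vep_field_counts | sort | uniq -c
```

If the header and the records agreed, every annotation should report the same count as the header's Format string piped through the same `awk` command.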
Similarly, there appear to be issues with transcription factor binding
site variants, such as one where the VEP field:
T|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00001078304|open_chromatin_region||||||||||1||||SNV||||||||||||||||||||||||||,
T|TF_binding_site_variant|MODIFIER|||MotifFeature|ENSM00195392673|||||||||||1||-1||SNV||||||||||||||||||ENSPFM0508|13|Y|-0.058|SRF||||,
T|TF_binding_site_variant|MODIFIER|||MotifFeature|ENSM00195568423|||||||||||1||-1||SNV||||||||||||||||||ENSPFM0481|22|N|-0.046|RFX3::SRF||||,
T|intergenic_variant|MODIFIER|||Intergenic||||||||||||1||||SNV||||||||||||||||||||||||||
has “-0.058” in the LoF field of the second annotation, but that value
belongs in the MOTIF_SCORE_CHANGE field.
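To see exactly where each value lands, it can help to print every annotation value next to the header name at the same position. The following is a sketch under assumptions: `label_fields` is a made-up helper, the first argument is the '|'-separated list of names taken from the header's Format string, and the second is a single annotation.

```shell
# Hypothetical helper: pair header names with annotation values by position.
#   $1 = '|'-separated field names from the header Format string
#   $2 = one '|'-separated annotation from the vep= value
# Output: position <TAB> header name <TAB> value, one row per position.
label_fields() {
  awk -v names="$1" -v values="$2" 'BEGIN {
    n = split(names, N, "[|]")    # header field names
    m = split(values, V, "[|]")   # annotation values, empty ones kept
    max = (n > m) ? n : m
    for (i = 1; i <= max; i++)
      printf "%d\t%s\t%s\n", i, (i <= n ? N[i] : "<no header name>"), (i <= m ? V[i] : "<missing>")
  }'
}

# Toy call with made-up names, just to show the shape of the input:
#   label_fields 'Allele|Consequence|IMPACT' 'T|TF_binding_site_variant|MODIFIER|extra'
```

Rows labelled `<no header name>` would be values with no corresponding header entry, and any non-empty value sitting next to an unexpected name (such as a motif score next to LoF) points at the offset.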