VEP INFO fields unparseable

christopherlee · February 8, 2024, 6:40pm

I’m trying to parse the VEP fields for the v4 data set, and the values do not seem to match the header format. The problem goes beyond a few fields missing from the end of the header. For instance:

bcftools view -h gnomad.genomes.v4.0.sites.chr8.vcf.bgz \
    | grep "^##INFO=<ID=vep" | grep -o "Format:.*" | tr '|' '\n' | nl
...
    30    CCDS
    31    ENSP
    32    UNIPROT_ISOFORM
    33    SOURCE
    34    DOMAINS
    35    miRNA
...

zgrep -m1 "rs1304835881" gnomad.genomes.v4.0.sites.chr8.vcf.bgz \
    | cut -f8 | tr ';' '\n' | grep "^vep" | cut -d'=' -f2- | tr ',' '\n' | head -1 \
    | tr '|' '\n' | nl -ba
...
    30    CCDS34792.1
    31    ENSP00000318878
    32
    33    Ensembl
    34
    35
    36    PANTHER:PTHR48002&PANTHER:PTHR48002&Gene3D:1
    37
    38

Here the PANTHER entries are the UniProt domains affected, so they should
be in the 34th field (DOMAINS), not the 36th.
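
To make the offset easier to see, the header names and the record values can be printed side by side. This is only a minimal sketch: `pair_vep_fields` is a hypothetical helper name, not part of bcftools or VEP, and the short strings in the usage line are a cut-down stand-in for the real header and record.

```shell
# pair_vep_fields is a hypothetical helper, not a bcftools or VEP command.
# $1: pipe-delimited field names from the header "Format:" string
# $2: one pipe-delimited annotation entry from a record
# Prints: position <TAB> header name <TAB> value, one line per slot.
pair_vep_fields() {
    NAMES="$1" VALUES="$2" awk 'BEGIN {
        n = split(ENVIRON["NAMES"], name, "|")
        m = split(ENVIRON["VALUES"], value, "|")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++)
            printf "%d\t%s\t%s\n", i,
                (i <= n ? name[i] : "(no name)"),
                (i <= m ? value[i] : "(no value)")
    }'
}

# Cut-down illustration (not the full header or record):
pair_vep_fields 'CCDS|ENSP|SOURCE|DOMAINS' \
    'CCDS34792.1|ENSP00000318878|Ensembl|||PANTHER:PTHR48002'
```

Slots that print "(no name)" are values the header does not account for, which is where the shift shows up.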

At the UCSC Genome Browser, these VEP annotations are very important: they determine what color each variant is drawn in and how we implement filters. Right now we can’t parse them.

Similarly, there appear to be issues with transcription factor binding
site variants like:

where the VEP field:
T|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00001078304|open_chromatin_region||||||||||1||||SNV||||||||||||||||||||||||||,
T|TF_binding_site_variant|MODIFIER|||MotifFeature|ENSM00195392673|||||||||||1||-1||SNV||||||||||||||||||ENSPFM0508|13|Y|-0.058|SRF||||,
T|TF_binding_site_variant|MODIFIER|||MotifFeature|ENSM00195568423|||||||||||1||-1||SNV||||||||||||||||||ENSPFM0481|22|N|-0.046|RFX3::SRF||||,
T|intergenic_variant|MODIFIER|||Intergenic||||||||||||1||||SNV||||||||||||||||||||||||||

has “-0.058” in the LoF position for the second annotation, but that value
is the MOTIF_SCORE_CHANGE.
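
A quick consistency check is to count the pipe-delimited slots in every entry and compare that with the number of names in the header. A minimal sketch, assuming entries are separated by commas and that no value contains a literal comma or pipe (the same assumption the `tr` pipelines above make); `count_vep_fields` is a hypothetical helper name.

```shell
# count_vep_fields is a hypothetical helper.
# Reads a vep INFO value on stdin and prints the number of
# pipe-delimited slots in each comma-separated entry, one count per line.
count_vep_fields() {
    tr ',' '\n' | awk -F'|' 'NF > 0 { print NF }'
}

# Toy value with three entries holding 4, 4 and 3 slots:
printf 'T|a|b|c,T|d||,T|e|f\n' | count_vep_fields
```

On a real record, the output of the `cut -d'=' -f2-` step in the pipeline above can be piped straight into `count_vep_fields`; if every entry reports the same count, the extra slots are at least consistent across annotations.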

Daniel_Marten · February 9, 2024, 3:46pm

Hello,

Thank you for reaching out and using our forum. This topic has already been addressed on our forum for the v4.0 release; it is discussed here by production team member Mike Wilson. See his response for details. A fix is coming in gnomAD v4.1, which is actively in the works. Please stay in touch and let us know if you have any other comments, questions, or concerns!

christopherlee · February 9, 2024, 5:06pm

When I read over that post, it seemed like the missing fields were just at the end, whereas I’m describing missing fields in the middle of the annotation. Can I really just discard the final couple of fields? If the same fields are missing for every annotation, I can work around the problem easily enough.

Do you have a rough estimate of when v4.1 will be released?

Daniel_Marten · February 9, 2024, 5:16pm

Our rough estimate for the gnomAD v4.1 release is late March 2024.

Let me also see about looping in the production team member mentioned earlier. They’re much more knowledgeable about this particular VEP export bug than I am.

mike · March 19, 2024, 1:23pm

Hi @christopherlee,

Thank you for your patience. The missing fields are not the last fields; I should have been more specific in my initial response in the other post. We removed the SIFT and PolyPhen annotations from the VEP annotation in our release Hail Table and nested them under our in_silico_predictor struct. That release Hail Table gets converted to a VCF, and we did not account for this in the VEP header, thus the two extra fields. If you remove those two fields from the VCF header, it will parse properly.
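
Whichever side carries the two extra slots in a given copy of the file, the adjustment is the same operation: dropping two positions from a pipe-delimited string. A minimal sketch follows. The name `drop_slots` is hypothetical, and positions 34 and 35 are an assumption taken from the example record at the top of the thread, where those two slots are empty and the domain annotation lands in slot 36.

```shell
# drop_slots is a hypothetical helper.
# Removes the given 1-based positions from each pipe-delimited line on stdin.
# Empty slots that are kept stay empty, so the remaining positions only
# shift by the number of slots dropped before them.
drop_slots() {
    awk -F'|' -v drop=" $* " '{
        out = ""; first = 1
        for (i = 1; i <= NF; i++) {
            if (index(drop, " " i " ")) continue   # skip a dropped position
            out = out (first ? "" : "|") $i
            first = 0
        }
        print out
    }'
}

# Toy example:
printf 'a|b||d|e\n' | drop_slots 2 3    # prints a|d|e
```

For the v4.0 records as shown above, the assumed call would be `drop_slots 34 35` applied to one annotation entry per line (for example after splitting the vep value on commas); check the positions against your own header before relying on it.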
