VEP INFO fields unparseable

I’m trying to parse the VEP fields for the v4 data set, and the fields do not seem to match the format given in the header. This goes beyond some fields missing at the end of the header. For instance:

bcftools view -h gnomad.genomes.v4.0.sites.chr8.vcf.bgz \
    | grep "^##INFO=<ID=vep" | grep -o "Format:.*" | tr '|' '\n' | nl
...
    30    CCDS
    31    ENSP
    32    UNIPROT_ISOFORM
    33    SOURCE
    34    DOMAINS
    35    miRNA
...

zgrep -m1 "rs1304835881" gnomad.genomes.v4.0.sites.chr8.vcf.bgz \
    | cut -f8 | tr ';' '\n' | grep "^vep" | cut -d'=' -f2- | tr ',' '\n' | head -1 \
    | tr '|' '\n' | nl -ba
...
    30    CCDS34792.1
    31    ENSP00000318878
    32
    33    Ensembl
    34
    35
    36    PANTHER:PTHR48002&PANTHER:PTHR48002&Gene3D:1
    37
    38

Here the PANTHER entries are the affected Uniprot domains, so they should
be in the 34th field (DOMAINS), not the 36th.

At the UCSC Genome Browser, these VEP annotations are very important because they determine what color a variant is displayed in and how we implement filters, but right now we can’t parse them.

Similarly, there look to be issues with transcription factor binding
site variants, where the VEP field:
T|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00001078304|open_chromatin_region||||||||||1||||SNV||||||||||||||||||||||||||,
T|TF_binding_site_variant|MODIFIER|||MotifFeature|ENSM00195392673|||||||||||1||-1||SNV||||||||||||||||||ENSPFM0508|13|Y|-0.058|SRF||||,
T|TF_binding_site_variant|MODIFIER|||MotifFeature|ENSM00195568423|||||||||||1||-1||SNV||||||||||||||||||ENSPFM0481|22|N|-0.046|RFX3::SRF||||,
T|intergenic_variant|MODIFIER|||Intergenic||||||||||||1||||SNV||||||||||||||||||||||||||

has “-0.058” in the LoF field for the second annotation, but that value
should be in the MOTIF_SCORE_CHANGE field.

Hello,

Thank you for reaching out and using our forum. This topic has already been addressed on our forum for the v4.0 release, discussed here by production team member Mike Wilson. Read his response for more detail; a fix is coming in gnomAD v4.1, which is actively in the works. Please stay in touch and let us know if you have any other comments, questions, or concerns!

When I read over that post, it seemed like the missing fields were just at the end, whereas I’m describing missing fields in the middle of the annotation. Can I really just discard the final couple fields? If the same fields are missing for every annotation I can work around the problem easily enough.
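
For what it's worth, one rough way to check whether the mismatch is uniform is to tally how many '|'-separated fields each annotation has and compare that with the number of names in the header. This is only a sketch: count_vep_fields is a made-up helper name, and it assumes INFO strings (VCF column 8) arrive on stdin.

```shell
# Hypothetical helper (name is an assumption, not part of any tool):
# read INFO strings (VCF column 8) on stdin and tally how many
# '|'-separated fields each vep annotation contains.
count_vep_fields() {
    tr ';' '\n' | grep '^vep=' | cut -d'=' -f2- | tr ',' '\n' \
        | awk -F'|' '{ print NF }' | sort -n | uniq -c
}
```

Something like "zgrep -v '^#' gnomad.genomes.v4.0.sites.chr8.vcf.bgz | head -1000 | cut -f8 | count_vep_fields" should then print one line per distinct field count. If only a single count shows up, the offset is at least consistent across annotations.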

Do you have a rough estimate of when v4.1 will be released?

Our rough estimate for the gnomAD v4.1 release is late March 2024.

And let me see about looping in that previously mentioned production team member. They’re much more knowledgeable about this particular VEP export bug than I am.

Hi @christopherlee,

Thank you for your patience. The missing fields are not the last fields. I should have been more specific in my initial response in the other post. We removed the SIFT and PolyPhen annotations from the VEP annotation in our release Hail Table and nested them under our in_silico_predictor struct. That release Hail Table gets converted to a VCF, and we did not account for this change in the VEP header, hence the two extra fields in the header. If you remove those two fields from the VCF header, it will parse properly.
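
A minimal sketch of that workaround, not an official gnomAD tool: the helper name vep_field_names is made up for illustration, and the grep pattern for the header line is the same one used earlier in this thread. It reads VCF header lines on stdin and prints the vep field names with SIFT and PolyPhen skipped, wherever they appear in the list.

```shell
# Hypothetical helper (name is an assumption): read VCF header lines on
# stdin and print the vep field names one per line, skipping SIFT and
# PolyPhen so the names line up with the exported annotation values.
vep_field_names() {
    grep '^##INFO=<ID=vep' | grep -o 'Format: [^"]*' | sed 's/^Format: //' \
        | tr '|' '\n' | grep -vx -e 'SIFT' -e 'PolyPhen'
}
```

For example, "bcftools view -h gnomad.genomes.v4.0.sites.chr8.vcf.bgz | vep_field_names | nl" gives the adjusted numbering. It is worth confirming that the number of names printed equals the number of '|'-separated values in an annotation before relying on the pairing; if the two counts differ, the fields are still misaligned.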