Invalid INFO field "vep" in gnomAD v4 VCFs

Hi,

There seems to be an error in the vep field of the INFO column in the gnomAD v4 VCF files (at least in the genome files; I did not look at the exomes).

Files investigated:

64.64 GiB  2023-11-01T00:03:45Z  gs://gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chr1.vcf.bgz

890.03 MiB  2023-11-01T00:03:47Z  gs://gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chrY.vcf.bgz

For “synonymous” variants, the value of the vep field is typically:

G|synonymous_variant|LOW|XKR3|ENSG00000172967|Transcript|ENST00000331428|protein_coding|4/4||ENST00000331428.5:c.985T>C|ENSP00000331704.5:p.Leu329=|1088|985|329|L|Ttg/Ctg|1||-1||SNV|HGNC|HGNC:28778||||1|P1|CCDS42975.1|ENSP00000331704||Ensembl|||PANTHER:PTHR14297&PANTHER:PTHR14297&Pfam:PF09815||||||||||||,G|synonymous_variant|LOW|XKR3|ENSG00000172967|Transcript|ENST00000684488|protein_coding|4/4||ENST00000684488.1:c.985T>C|ENSP00000507478.1:p.Leu329=|1116|985|329|L|Ttg/Ctg|1||-1||SNV|HGNC|HGNC:28778|YES|NM_001386955.1|||P1|CCDS42975.1|ENSP00000507478||Ensembl|||Pfam:PF09815&PANTHER:PTHR14297&PANTHER:PTHR14297||||||||||||,G|synonymous_variant|LOW|XKR3|150165|Transcript|NM_001318251.3|protein_coding|4/4||NM_001318251.3:c.985T>C|NP_001305180.1:p.Leu329=|1091|985|329|L|Ttg/Ctg|1||-1||SNV|EntrezGene|HGNC:28778|||||||NP_001305180.1||RefSeq|||||||||||||||,G|synonymous_variant|LOW|XKR3|150165|Transcript|NM_001386955.1|protein_coding|4/4||NM_001386955.1:c.985T>C|NP_001373884.1:p.Leu329=|1116|985|329|L|Ttg/Ctg|1||-1||SNV|EntrezGene|HGNC:28778|YES|ENST00000684488.1|||||NP_001373884.1||RefSeq|||||||||||||||,G|synonymous_variant|LOW|XKR3|150165|Transcript|NM_001386956.1|protein_coding|4/4||NM_001386956.1:c.985T>C|NP_001373885.1:p.Leu329=|1057|985|329|L|Ttg/Ctg|1||-1||SNV|EntrezGene|HGNC:28778|||||||NP_001373885.1||RefSeq|||||||||||||||,G|synonymous_variant|LOW|XKR3|150165|Transcript|NM_001386957.1|protein_coding|6/6||NM_001386957.1:c.985T>C|NP_001373886.1:p.Leu329=|1511|985|329|L|Ttg/Ctg|1||-1||SNV|EntrezGene|HGNC:28778|||||||NP_001373886.1||RefSeq|||||||||||||||,G|synonymous_variant|LOW|XKR3|150165|Transcript|NM_175878.5|protein_coding|4/4||NM_175878.5:c.985T>C|NP_787074.2:p.Leu329=|1088|985|329|L|Ttg/Ctg|1||-1||SNV|EntrezGene|HGNC:28778|||||||NP_787074.2||RefSeq|||||||||||||||,G|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000301023|CTCF_binding_site||||||||||1||||SNV||||||||||||||||||||||||||,G|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00001057386|TF_binding_site||||||||||1||||SNV||||||||||||||||||||||||||,G|TF_binding_site_variant|MODIFIER|||MotifFeature|ENSM00525037649|||||||||||1||1||SNV||||||||||||||||||ENSPFM0378|7|N|0.059|MEIS2&MEIS3||||,G|TF_binding_site_variant|MODIFIER|||MotifFeature|ENSM00145341180|||||||||||1||-1||SNV||||||||||||||||||ENSPFM0379|1|N|-0.070|MEIS2&MEIS3&TGIF2&TGIF2LX&PKNOX1&PKNOX2&TGIF1||||

You can see that this string contains a number of = signs, for instance ENSP00000331704.5:p.Leu329=

However, a field value in the INFO column of a VCF file cannot contain an = sign, because that sign is reserved for separating a key (here vep) from its value, and VCF does not wrap field values in quotes.
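To see which INFO keys are affected in a given record, something like the following could be used. It is only a minimal sketch: the function name `info_keys_with_equals` is made up for illustration, and the INFO string in the example is a shortened stand-in, not a real gnomAD record. It treats the first = of each ;-separated entry as the key/value separator and reports any key whose value still contains an =. As far as I know, VCF 4.3 expects such characters to be percent-encoded (%3D for =).

```shell
# Print the key of every INFO entry whose value itself contains an "=".
# Argument 1: the INFO column of a single VCF record.
# (Hypothetical helper name, for illustration only.)
info_keys_with_equals() {
  printf '%s\n' "$1" | tr ';' '\n' | awk '{
    eq = index($0, "=")            # the first "=" separates key from value
    if (eq == 0) next              # flag-type entry without a value
    key = substr($0, 1, eq - 1)
    value = substr($0, eq + 1)
    if (index(value, "=") > 0) print key
  }'
}

# Shortened, made-up INFO string; prints: vep
info_keys_with_equals 'AC=1;vep=G|synonymous_variant|LOW|XKR3|p.Leu329=|1;AN=2'
```

To scan a whole file, the eighth tab-separated column of each non-header line would be passed to the function.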

This formatting error crashes parsing programs such as SnpSift, bcftools, etc.

Thank you for your attention and hoping to help,

Best

Christophe

I think there may be a related bug in gnomAD 4.0 involving the VEP header in the VCF. I sent an email about it, but saw this thread on the forum, so I am adding it here.

I discovered the following while working with a customer on importing gnomAD 4.0 into HealthOmics annotation stores. Repasting the email below.

In gnomAD 4.0 (chr11 is what we used to debug), we noticed that there are 46 fields in the VEP header, but 48 in the actual info columns. Alas, I’ve lost a bit of my command line skills over the years, but hopefully you can follow the below. I think there are likely two fields missing in your header.

In gnomAD 4.0 (header vs first variant line):

(base) f84d898eb2d7:temp ajfriedm$ echo "Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|CCDS|ENSP|UNIPROT_ISOFORM|SOURCE|DOMAINS|miRNA|HGVS_OFFSET|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|LoF|LoF_filter|LoF_flags|LoF_info" | awk 'BEGIN{FS="|"}{print NF}'

46

(base) f84d898eb2d7:temp ajfriedm$ echo "T|downstream_gene_variant|MODIFIER|OR4F2P|ENSG00000224777|Transcript|ENST00000424047|unprocessed_pseudogene||||||||||1|147|-1||SNV|HGNC|HGNC:8299|YES||||||||Ensembl|||||||||||||||" | awk 'BEGIN{FS="|"}{print NF}'

48

In gnomAD 2.1 (header vs first variant line):

(base) f84d898eb2d7:temp ajfriedm$ echo "Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|GMAF|AFR_MAF|AMR_MAF|EAS_MAF|EUR_MAF|SAS_MAF|AA_MAF|EA_MAF|ExAC_MAF|ExAC_Adj_MAF|ExAC_AFR_MAF|ExAC_AMR_MAF|ExAC_EAS_MAF|ExAC_FIN_MAF|ExAC_NFE_MAF|ExAC_OTH_MAF|ExAC_SAS_MAF|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info" | awk 'BEGIN{FS="|"}{print NF}'

68

(base) f84d898eb2d7:temp ajfriedm$ echo "G|non_coding_transcript_exon_variant&non_coding_transcript_variant|MODIFIER|LINC01001|ENSG00000230724|Transcript|ENST00000526704|lincRNA|5/7||ENST00000526704.2:n.1522T>C||1522||||||1||-1||SNV|1|HGNC|38540||||||||||||||||||||||||||||||||||||||||||" | awk 'BEGIN{FS="|"}{print NF}'

68
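The same comparison can be scripted so that the header and a record are checked against each other directly. This is a rough sketch, not a tested pipeline: the helper names `count_vep_fields` and `vep_header_format` are made up, the header line is a shortened hypothetical example, and it assumes the field list follows the text "Format: " inside the Description of the vep INFO header line. A real vep value holds several comma-separated annotations, so each one would need to be counted separately.

```shell
# Count the "|"-separated fields in one VEP annotation string.
count_vep_fields() {
  printf '%s\n' "$1" | awk -F'|' '{ print NF }'
}

# Extract the field list from a header line of the ASSUMED form
#   ##INFO=<ID=vep,...,Description="... Format: Allele|Consequence|...">
vep_header_format() {
  printf '%s\n' "$1" | sed -n 's/.*Format: \([^"]*\)".*/\1/p'
}

# Shortened, made-up header and annotation, for illustration only.
header='##INFO=<ID=vep,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT">'
record='T|downstream_gene_variant|MODIFIER||'

expected=$(count_vep_fields "$(vep_header_format "$header")")
observed=$(count_vep_fields "$record")
if [ "$expected" -ne "$observed" ]; then
  # prints: mismatch: header declares 3 fields, record has 5
  echo "mismatch: header declares $expected fields, record has $observed"
fi
```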

Hi @Christophe,

Thank you for your report! @ajfriedman18 is correct that our VCFs have 46 fields in the VEP header and 48 fields in the annotation: both extra annotation fields are simply missing values. This was a bug in our VCF export code and the annotation should have matched the VCF header with 46 fields. We will be fixing this in our v4.1 release.

Apologies for the confusion/frustration this has caused!

Best,
Mike

Thanks, Mike! I’ve let our customer know

Thank you so much for the follow-up, Mike and @ajfriedman18!
And Happy New Year

Best

Chris