Please document VCF INFO changes between versions

David_Lawrence · November 16, 2023, 6:07am

Hi, thanks for gnomAD. Everyone in clinical genetics needs you, so thanks so much. I hope I don’t sound too grumpy…

The VCF INFO fields seem to change almost every release, eg “Non Finnish European” history:

v2 AF_nfe
v3 AF-nfe
v3.1 AF_nfe
v4.0 nfe_AF

This causes scripts to break and takes extra effort to maintain conversion code etc. Ideally, you’d just keep them the same and not break backwards compatibility, but if you feel you need to do this, it would be useful to document it so people can see how the fields have changed over time.

For instance, a table with rows being a label like “Non-Finnish European Allele Frequency”, columns being gnomAD releases, and cells being the field names, as in my example at the start.

This would help people work out how, e.g., some INFO fields disappeared between versions and others were renamed.
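A minimal sketch of what such a table could look like as data, using only the names quoted in the example at the top of this post. The `FIELD_HISTORY` dict and both helper functions are hypothetical, not something gnomAD ships:

```python
# Hypothetical lookup table: label -> {release: INFO field name}.
# Entries are copied from the example in this post, not from gnomAD documentation.
FIELD_HISTORY = {
    "Non-Finnish European Allele Frequency": {
        "v2": "AF_nfe",
        "v3": "AF-nfe",
        "v3.1": "AF_nfe",
        "v4.0": "nfe_AF",
    },
}


def field_name(label, release):
    """Return the INFO field name for a label in a release, or None if unknown."""
    return FIELD_HISTORY.get(label, {}).get(release)


def releases_using(name):
    """List (label, release) pairs that use the given INFO field name."""
    return [
        (label, release)
        for label, by_release in FIELD_HISTORY.items()
        for release, field in by_release.items()
        if field == name
    ]
```

With a table like this, conversion code could look names up per release instead of hard-coding them.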

Thanks!

David_Lawrence · November 16, 2023, 6:17am

I remember things breaking in 3.0 because you switched to using dashes in INFO names:

github.com/samtools/bcftools

query -f and hyphens in field names [feature request: tag renaming]

opened 11:41AM - 03 Nov 20 UTC

closed 11:18AM - 19 Nov 20 UTC

pdl

enhancement D3: Easy

Given a VCF with INFO fields containing hyphens, is it possible to use `bcftools… query -f` to extract those fields?

```
bcftools query -f '%AC-XX'
```

returns values of the form `1-XX`, i.e. `${AC}-XX` not `${AC-XX}`. The documentation doesn't seem from what I've found to indicate how to handle field names that need any sort of quoting/escaping. (Observed on bcftools 1.7 Using htslib 1.7-2 in case it makes a difference).

github.com/broadinstitute/gnomad_qc

Fix v3.1 VCF release files to use underscore field separators instead of dash

opened 07:59PM - 14 Jan 21 UTC

closed 07:49PM - 02 Mar 21 UTC

jkgoodrich

Discussion of problem: https://atgu.slack.com/archives/CNL7NA6H2/p16105714692193…00 There are reports of users trying to use tools on the gnomAD v3.1 VCF and running into problems because we replaced some underscores with a `-`. We have decided that we will keep this as is for the HT, but modify the VCF to replace the `-` with `_`. Since we need to remake the v3.1 release VCF to fix the VEP issue, we can also fix this issue at that time.

And I had hoped 3.1 going back to the v2 names meant that things were stable now…
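For anyone stuck with a dashed file, a rough workaround sketch in plain Python that rewrites the INFO keys of a single data line. The function name is made up for illustration. It deliberately leaves header lines alone, so the matching `##INFO=<ID=...>` header entries would need the same renaming before tools accept the result:

```python
def underscore_info_keys(vcf_line):
    """Replace '-' with '_' in the INFO keys of one VCF data line.

    Values are left untouched. Header lines and lines with fewer than
    8 columns are returned unchanged (minus the trailing newline).
    """
    line = vcf_line.rstrip("\n")
    cols = line.split("\t")
    if line.startswith("#") or len(cols) < 8:
        return line
    entries = []
    for entry in cols[7].split(";"):
        # Flags have no '=', so sep and value are empty strings for them.
        key, sep, value = entry.partition("=")
        entries.append(key.replace("-", "_") + sep + value)
    cols[7] = ";".join(entries)
    return "\t".join(cols)
```

Only the key part of each `key=value` entry is changed, so a value that happens to contain a dash is preserved.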

mike · November 20, 2023, 1:58pm

Hi David,

Thank you for the feedback; we understand how frustrating these breaking changes are. For your example, each version of gnomAD should contain <metric>_<sampling_grouping> for call statistics. In the v4 example you provide, the sample grouping, nfe, comes first. Could you confirm this within the file you're accessing? I am seeing the metric first within the v4 exome chrY VCF.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chrY    2784606 .       C       T       .       AC0     AC=0;AN=0;nhomalt_XX=2147483647;AC_XY=0;AN_XY=0;nhomalt_XY=0;nhomalt=0;nhomalt_afr_XX=2147483647;AC_afr_XY=0;AN_afr_XY=0;nhomalt_afr_XY=0;AC_afr=0;AN_afr=0;...

Each version of gnomAD, and even the v4 exomes and genomes, contains somewhat different sample groupings; e.g. subsets were dropped in v4 exomes, but v4 genomes contain subsets like HGDP and TGP. We only compute statistics for the sample groupings within each dataset, which should explain why a large number of fields disappeared between v3 and v4.
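To check which sample groupings a given file actually carries, something like the following sketch could be run over an INFO string. It assumes the `<metric>_<sampling_grouping>` convention described above; the `METRICS` tuple only lists metric names that appear in this thread and is not complete, and the function names are illustrative:

```python
# Assumed metric prefixes, taken from this thread only; real files have more.
METRICS = ("nhomalt", "AC", "AN", "AF")


def split_key(key):
    """Split an INFO key into (metric, grouping).

    The grouping is '' for the overall value (e.g. 'AC').
    Keys that do not start with a known metric return (None, key).
    """
    for metric in METRICS:
        if key == metric:
            return metric, ""
        if key.startswith(metric + "_"):
            return metric, key[len(metric) + 1:]
    return None, key


def groupings(info):
    """Return the set of sample groupings that appear in an INFO string."""
    found = set()
    for entry in info.split(";"):
        key = entry.partition("=")[0]
        metric, group = split_key(key)
        if metric is not None and group:
            found.add(group)
    return found
```

Taking the set difference of `groupings(...)` for a record from two releases would list the groupings that exist in one and not the other. A key with the grouping first, such as `nfe_AF`, is not recognised by `split_key`, which makes schema mismatches easy to spot.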

Again, thank you for the feedback. We will be more cognizant of format changes that cause headaches for users and will look to update the documentation for these types of changes.

David_Lawrence · November 21, 2023, 1:21am

Hi, I was looking in the Structural Variants VCF, and thought you had changed them everywhere.

So, for my non-Finnish European example, it seems the issue isn’t that gnomAD has changed the name; it’s more that, for that field, the SV file is inconsistent with the genome/exome INFO fields.

Eg:

SV has “nfe_AF” while genomes/exomes is “AF_nfe”

Another one:

SV has “POPMAX_AF” while genomes/exomes is “AF_grpmax”

But some fields did change from v3 to v4, so in general what I’d like is eg documenting:

“nonpar” changed to “non_par”
“X_popmax” changed to “X_grpmax”
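Until that is documented, this is roughly the conversion code I end up maintaining. It is a sketch built only from the renames listed in this post, so the tables are certainly incomplete, and the function names are just mine:

```python
import re

# Renames mentioned in this thread only; not a complete or official list.
TOKEN_RENAMES = {"nonpar": "non_par", "popmax": "grpmax"}   # v3 -> v4
SV_TO_SNV = {"nfe_AF": "AF_nfe", "POPMAX_AF": "AF_grpmax"}  # SV -> genomes/exomes


def v3_to_v4(name):
    """Apply the v3 -> v4 token renames to one INFO field name."""
    # Split on '_' or '-' but keep the separators so the name can be rebuilt.
    tokens = re.split(r"([_-])", name)
    return "".join(TOKEN_RENAMES.get(tok, tok) for tok in tokens)


def sv_to_snv(name):
    """Map an SV INFO field name to its genome/exome equivalent, if known."""
    return SV_TO_SNV.get(name, name)
```

For example, `v3_to_v4("AF_popmax")` (one instance of the “X_popmax” pattern) gives `"AF_grpmax"`, and names with no known rename pass through unchanged.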

mike · December 14, 2023, 5:11pm

Sorry for the delayed response! Yes, the SV and SNV datasets use different schemas, and this is something we will remedy with v4.1 next year.

As for the documentation on field names changing, that is a great suggestion that we will add to future releases.
