Please document VCF INFO changes between versions

Hi, thanks for gnomAD. Everyone in clinical genetics relies on you, so thank you very much. I hope I don’t sound too grumpy…

The VCF INFO fields seem to change almost every release, e.g. the “Non-Finnish European” history:

v2 AF_nfe
v3 AF-nfe
v3.1 AF_nfe
v4.0 nfe_AF

This causes scripts to break and takes extra effort to maintain conversion code. Ideally, you’d just keep the names the same and not break backwards compatibility, but if you feel you need to change them, it would be useful to document the changes so people can see how the fields have changed over time.

For instance, a table with one row per label, like “Non-Finnish European Allele Frequency”, one column per gnomAD release, and the cells holding the field names as in my example at the start.

This would help people find out, e.g., which INFO fields disappeared between versions and which were renamed.
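Until such a table exists, one way to see what changed is to compare the `##INFO` header lines of two releases. The following is a minimal sketch, assuming the header lines have already been read from each VCF. The function names and the two sample headers are made up for illustration, and the sample headers reuse the names listed earlier in this post rather than real gnomAD headers.

```python
import re

# Matches the ID in a VCF header line such as ##INFO=<ID=AF_nfe,Number=A,...>
INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")


def info_ids(header_lines):
    """Collect the INFO field IDs declared in a list of VCF header lines."""
    ids = set()
    for line in header_lines:
        match = INFO_ID.match(line)
        if match:
            ids.add(match.group(1))
    return ids


def diff_info_ids(old_header, new_header):
    """Return (removed, added) INFO IDs between two releases, sorted."""
    old, new = info_ids(old_header), info_ids(new_header)
    return sorted(old - new), sorted(new - old)


# Illustrative header lines only, not copied from real gnomAD files.
old_header = [
    '##INFO=<ID=AC,Number=A,Type=Integer,Description="...">',
    '##INFO=<ID=AF_nfe,Number=A,Type=Float,Description="...">',
]
new_header = [
    '##INFO=<ID=AC,Number=A,Type=Integer,Description="...">',
    '##INFO=<ID=AF-nfe,Number=A,Type=Float,Description="...">',
]

removed, added = diff_info_ids(old_header, new_header)
print("removed:", removed)
print("added:", added)
```

A diff like this shows that something changed, but not whether a field was renamed or dropped, which is why a maintained table would still be needed.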

Thanks!

I remember things breaking in 3.0 because you switched to using dashes in INFO names:

And I had hoped that 3.1 going back to the v2 names meant that things were stable now…

Hi David,

Thank you for the feedback. We understand how frustrating these breaking changes are. For your example, each version of gnomAD should contain <metric>_<sampling_grouping> for call statistics. In the v4 example you provide, the sampling grouping, nfe, comes first. Could you confirm this within the file you’re accessing? I am seeing the metric first within the v4 exome chrY VCF.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chrY    2784606 .       C       T       .       AC0     AC=0;AN=0;nhomalt_XX=2147483647;AC_XY=0;AN_XY=0;nhomalt_XY=0;nhomalt=0;nhomalt_afr_XX=2147483647;AC_afr_XY=0;AN_afr_XY=0;nhomalt_afr_XY=0;AC_afr=0;AN_afr=0;...
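To illustrate the <metric>_<sampling_grouping> convention on a record like the one above, here is a minimal Python sketch. The helper names are hypothetical, and it assumes the metric itself never contains an underscore, which holds for the keys shown here (AC, AN, nhomalt) but has not been checked against every gnomAD field.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag fields (no '=') map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def split_key(key):
    """Split 'AC_afr_XY' into ('AC', 'afr_XY'); a bare 'AC' gives ('AC', '')."""
    # Assumption: everything before the first underscore is the metric.
    metric, _, grouping = key.partition("_")
    return metric, grouping


# A shortened copy of the INFO column from the chrY record above.
info = "AC=0;AN=0;nhomalt_XX=2147483647;AC_XY=0;AN_XY=0;nhomalt_XY=0;nhomalt=0"
for key in parse_info(info):
    print(key, split_key(key))
```

With this split, a key such as nfe_AF from another file would come out with "nfe" in the metric position, which makes a swapped ordering easy to spot.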

Each version of gnomAD, and even the v4 exomes and genomes, contains somewhat different sample groupings, e.g. subsets were dropped in the v4 exomes but the v4 genomes contain subsets like HGDP and TGP. We only compute statistics for the sample groupings within each dataset, so that should explain why a large number of fields disappeared between v3 and v4.

Again, thank you for the feedback. We will be more cognizant of any format changes that will cause headaches for users and will look to update the documentation for these types of changes.

Hi, I was looking in the Structural Variants VCF, and thought you had changed them everywhere.

So, for my non-Finnish European example, it seems the issue isn’t that gnomAD has changed the name over time. It’s that, for that field, the SV file is inconsistent with the genome/exome INFO fields.

Eg:

SV has “nfe_AF” while genomes/exomes is “AF_nfe”

Another one:

SV has “POPMAX_AF” while genomes/exomes is “AF_grpmax”
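For anyone hitting the same mismatch, a minimal Python sketch of normalising SV keys to the genome/exome names is below. The alias table contains only the two pairs mentioned in this thread, and the function and variable names are made up for illustration. Any other SV field would need checking against the VCF header.

```python
# Aliases taken only from the two examples in this thread.
SV_TO_SNV = {
    "nfe_AF": "AF_nfe",
    "POPMAX_AF": "AF_grpmax",
}


def normalise_sv_keys(fields, aliases=SV_TO_SNV):
    """Return a copy of an INFO dict with known SV keys renamed."""
    # Keys without an alias are passed through unchanged.
    return {aliases.get(key, key): value for key, value in fields.items()}


sv_record = {"nfe_AF": "0.01", "POPMAX_AF": "0.02", "SVTYPE": "DEL"}
print(normalise_sv_keys(sv_record))
```

An explicit table is used here rather than a general "swap the two halves" rule, because POPMAX_AF to AF_grpmax shows that the difference is not only a matter of word order.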

But some fields did change from v3 to v4, so in general what I’d like is documentation of renames such as:

“nonpar” changed to “non_par”
“X_popmax” changed to “X_grpmax”
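As a sketch of the conversion code this currently requires, the two renames above can be applied token by token in Python. The names `TOKEN_RENAMES` and `v3_to_v4_name` are hypothetical, and the table covers only the two changes listed here, not a complete v3 to v4 mapping.

```python
# Only the two renames listed above; not a complete v3 -> v4 mapping.
TOKEN_RENAMES = {"popmax": "grpmax", "nonpar": "non_par"}


def v3_to_v4_name(name):
    """Rename a v3 INFO key to its v4 form, one underscore-separated token at a time."""
    return "_".join(TOKEN_RENAMES.get(token, token) for token in name.split("_"))


for v3_name in ["AF_popmax", "nonpar", "AF_nfe"]:
    print(v3_name, "->", v3_to_v4_name(v3_name))
```

Working on whole tokens rather than substrings avoids touching unrelated field names that merely contain "popmax" or "nonpar" inside a longer word.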

Sorry for the delayed response! Yes, the SV and SNV datasets use different schemas, and this is something we will remedy with v4.1 next year.

As for the documentation on field names changing, that is a great suggestion that we will add to future releases.