AN vs coverage values

Is there a way to calculate the allele number (AN) value at each locus from the exome summary coverage file in v4? It appears to be similar to the over_10 column multiplied by the total number of samples (730,947 times 2 for diploid) but not exactly, presumably due to different per-sample filters. A summary coverage file with the AN at all loci would be nice. Additionally, this does not correspond to the >20x coverage track on the browser.
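To make the approximation described above concrete, here is a minimal sketch. It assumes that `over_10` is stored as a fraction between 0 and 1 (the share of samples with depth of at least 10 at that position) and uses the 730,947 sample count quoted above. The function name `approx_an` is hypothetical, and the result is only a rough estimate: it will not match the reported AN exactly because it ignores the other per-genotype filters.

```python
TOTAL_EXOME_SAMPLES = 730_947  # sample count quoted in the post above


def approx_an(over_10: float, n_samples: int = TOTAL_EXOME_SAMPLES, ploidy: int = 2) -> int:
    """Rough AN estimate from the fraction of samples with depth >= 10.

    Assumption: over_10 is a fraction in [0, 1]. This ignores any other
    per-genotype filters, so it only approximates the AN in the VCF.
    """
    if not 0.0 <= over_10 <= 1.0:
        raise ValueError("over_10 is expected to be a fraction between 0 and 1")
    return round(over_10 * n_samples * ploidy)
```

For a position where every sample passes the depth threshold, this gives the maximum possible value of 2 × 730,947 = 1,461,894.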

Also, why are AC=0 and particularly AN=0 variants included in the VCF files? Presumably these are for variants that had low QC and were later filtered out, but if AC=0, it seems unnecessary to include the “variants” in the final output unless there is some other reason for doing so.

Thanks for posting @aca309 — I just wanted to note that the team is working through a large support backlog following the new release. They have seen this question, and will be getting back soon!

@aca309

A summary coverage file with the AN at all loci would be nice.

The summary coverage is included in our downloadable VCFs (we report AN for all variants).

We include AC=0/AN=0 variants to provide all variants we processed and let users filter what they need. When you see AC=0/AN=0 variants, those counts are for the adjusted genotypes after applying criteria on GP/DP/AB, not the raw genotypes.
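For anyone who wants to do that filtering themselves, here is a minimal sketch in plain Python. It assumes standard tab-separated VCF data lines with `AC` and `AN` present in the INFO column (column 8); the helper names `parse_info` and `has_called_alleles` are hypothetical and not part of any gnomAD tooling.

```python
def parse_info(info: str) -> dict:
    """Parse the key=value pairs of a VCF INFO column; flag-only entries are skipped."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            fields[key] = value
    return fields


def has_called_alleles(vcf_line: str) -> bool:
    """Return True if the record has AN > 0 and at least one AC > 0."""
    columns = vcf_line.rstrip("\n").split("\t")
    info = parse_info(columns[7])  # INFO is the 8th VCF column
    an = int(info.get("AN", "0"))
    # AC may be comma-separated when a record carries several alternate alleles
    ac_values = [int(x) for x in info.get("AC", "0").split(",")]
    return an > 0 and any(ac > 0 for ac in ac_values)


def filter_records(lines):
    """Yield header lines unchanged and only those data lines with called alleles."""
    for line in lines:
        if line.startswith("#") or has_called_alleles(line):
            yield line
```

Passing the lines of an uncompressed VCF through `filter_records` keeps the header and drops the AC=0 and AN=0 records.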

Let us know if anything is unclear.

Thanks so much for your response.

I did want to clarify that while AN is included for all variants in the VCF files, it does not appear to be included anywhere for positions at which there is no variant call. There is of course the coverage file, though I’m not entirely sure I am interpreting its columns correctly (note: the link to the v4 readme file on the downloads page is broken). It appears to be more granular, binning the percentage of samples at each depth of coverage, but it does not show the AN (or the number of genotypes called as reference) at each position after quality control. Adding a column with this value to that file would be ideal, as one could then compute the frequency of a novel variant that is not in the gnomAD population. I think the ESP6500 data set may have had something similar.
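As an illustration of why a per-position AN would be useful, here is a minimal sketch of one way it could be used. This is an assumption about the intended calculation rather than anything gnomAD provides: if a variant is observed zero times among AN called alleles, the exact one-sided binomial upper confidence bound on its frequency is 1 - alpha^(1/AN), which is roughly 3/AN at the 95% level. The function name `af_upper_bound` is hypothetical.

```python
def af_upper_bound(an: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on allele frequency when 0 of `an` alleles carry the variant.

    Uses the exact one-sided binomial bound for zero observations:
    1 - (1 - confidence) ** (1 / an). Assumes `an` is the number of
    called alleles at the position after all filtering.
    """
    if an <= 0:
        raise ValueError("AN must be positive to bound the allele frequency")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / an)
```

The bound tightens as AN grows, which is why a position with a low AN after filtering says much less about a variant's rarity than a well-covered one.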

Also, note that if this were included, the AC=0 variant information in the VCF files would be mostly redundant as any variant that is not included in the VCF files could be assumed to have AC=0 across the number of samples defined by AN (after all filtering) in the coverage file.

Thanks again for your efforts and work on this project.

Hi @aca309,

We just wanted to let you know we are planning to calculate and release the AN across all positions within gnomAD as part of our 4.1 release next year.

Mike, that’s great news. That would be helpful. You might also consider removing AN=0 (or even AF=0) “variants” from the VCF files in conjunction with that. This would reduce file sizes without any obvious loss of useful information.

Understanding that you are targeting 2024 for the 4.1 release, do you have any more specific timing expectations?

Thanks!