Variant QC SNV pass count v4

Dear Katherine Chao / gnomAD Production Team,

The blog post “gnomAD v4.0 | gnomAD browser” states that “62,901,592 SNVs and 6,189,261 indels pass all filters in the v4 release.” When trying to reproduce this number, we arrive at 600,294,364 SNVs that pass all filters (out of a total of 786,500,648 SNVs listed at https://gnomad.broadinstitute.org/stats). Is there an error in my reasoning?

Best regards,
Dennis

Hello, my name is Daniel Marten and I’m a member of the gnomAD Production Team:

1 - One thing to note is that parts of the blog post you linked, including the Variant QC number you’re citing, cover only the gnomAD v4 exomes release. This is discussed under the ‘Creating gnomAD v4’ section, linked here. The exome release contains only 167,897,387 SNVs in total (whether high quality or carrying reasons to be filtered out), so I’m going to assume that your number, 600,294,364 SNVs, is from genomes.

2 - As for the number you did arrive at, what methods did you use to get it?

Hello Daniel Marten,

Thank you for your quick reply and thank you for clarifying that the numbers are based on gnomAD v4 exomes only!

600,294,364 is based on the combined exomes/genomes data and counts SNVs for which the exomes QC and the genomes QC are each either ‘no variant’ or ‘pass’.

I understand that, but could you be a bit more specific so I can try and replicate this?

Exactly which tables were you using? And, if applicable, what code did you use to get the numbers?

We’ve created a combined resource using vip/utils/create_gnomad.sh at v7.2.1 · molgenis/vip · GitHub, resulting in https://downloads.molgeniscloud.org/downloads/vip/resources/GRCh38/gnomad.total.v4.0.sites.stripped.tsv.gz

The SNV count results from running: zcat gnomad.total.v4.0.sites.stripped.tsv.gz | awk 'BEGIN { FS=OFS="\t" } NR>1 { if(length($3)==1 && length($4)==1 && ($17 == "NO_VAR" || $17 == "PASS") && ($18 == "NO_VAR" || $18 == "PASS")) print }' | wc -l
600294364
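
For anyone who wants to replicate the count without awk, the same filter can be sketched in Python. This is only a sketch and assumes the column layout implied by the command above: a header line, tab-separated fields, REF and ALT in columns 3 and 4, and the two QC fields in columns 17 and 18. The function names are hypothetical and not part of vip or gnomAD.

```python
import gzip

# QC values accepted by the awk filter above.
PASSING = {"NO_VAR", "PASS"}


def count_passing_snvs(lines):
    """Count SNV rows whose two QC fields are each NO_VAR or PASS.

    `lines` is any iterable of tab-separated text lines, header first.
    Column positions (3, 4, 17, 18, 1-based) are assumed from the awk command.
    """
    count = 0
    for line_number, line in enumerate(lines):
        if line_number == 0:
            continue  # skip the header, like NR>1 in awk
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 18:
            continue  # a short row cannot satisfy the QC conditions
        ref, alt = fields[2], fields[3]
        is_snv = len(ref) == 1 and len(alt) == 1
        if is_snv and fields[16] in PASSING and fields[17] in PASSING:
            count += 1
    return count


def count_passing_snvs_in_file(path):
    """Stream a gzipped TSV from disk and apply the same filter."""
    with gzip.open(path, "rt") as handle:
        return count_passing_snvs(handle)
```

Calling count_passing_snvs_in_file("gnomad.total.v4.0.sites.stripped.tsv.gz") should then apply the same conditions as the awk one-liner. Note that a row counts as passing when both QC fields are acceptable, so an SNV that is ‘pass’ in one data set and absent from the other is included.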

Hello, and sorry for the delay. Thank you for sharing your methods; it’s helpful to know how you got this number and how people are using our resource.

For SNVs that pass all filters, we have 565,523,876 in genomes and 62,901,592 in exomes, of which 537,949,472 are genome-exclusive and 44,108,890 are exome-exclusive.

Any further or more detailed bioinformatics support for outside scripts and their results is a bit beyond the scope of the production team in this forum, but it’s nice to see how you arrived at those numbers and we’re happy to clear up that high-level confusion from the start.

With your first question answered, I’m going to mark this ticket as resolved, but comment on this thread or DM me if you have any further concerns!

Feel free to close. I appreciate your time and effort, thanks!