I have a question regarding the procedure used to compute total allele number (AN) for multiallelic sites in the gnomAD v4.1 dataset.
I have come across multiple apparent discrepancies between the genomic AN values reported for different alleles at the same site. For example, at position chr1:1228635 there are two alternative alleles, T and A, both present in the gnomAD v4.1 genomes dataset (https://gnomad.broadinstitute.org/variant/rs61743559?dataset=gnomad_r4). However, the AN values reported for these two alleles differ slightly (152256 and 152374, respectively). What causes this difference? Both SNPs are called from the same data, so I would expect the sample size to be the same for all alleles at a given locus. Could you please clarify this for me?
This is due to the downcoding of genotypes when splitting multiallelic variants. We use Hail to split multiallelic variants prior to calculating aggregate variant frequency statistics, and the process of downcoding genotypes is nicely explained in their documentation.
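To make the effect of downcoding concrete, here is a minimal pure-Python sketch (it does not call Hail). It assumes that, after splitting, each biallelic PL is the minimum over the multiallelic PLs that collapse to that genotype, that GQ is recomputed from the downcoded PL, and that AN only counts genotypes passing a GQ filter. The sample, its PL values, and the `MIN_GQ` threshold are all invented for illustration, so treat this as one plausible mechanism rather than a description of the actual gnomAD pipeline.

```python
# Simplified model of genotype downcoding when splitting a multiallelic site.
# All names and values here are hypothetical; this is not Hail's implementation.

def pl_genotype_order(n_alleles):
    """Unphased diploid genotypes in VCF PL order: 0/0, 0/1, 1/1, 0/2, ..."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]

def downcode_gt(gt, alt_index):
    """Keep alt_index as allele 1; recode every other allele as reference (0)."""
    return tuple(sorted(1 if a == alt_index else 0 for a in gt))

def downcode_pl(pl, n_alleles, alt_index):
    """For each biallelic genotype, take the minimum PL among the
    multiallelic genotypes that downcode to it."""
    buckets = {(0, 0): [], (0, 1): [], (1, 1): []}
    for gt, value in zip(pl_genotype_order(n_alleles), pl):
        buckets[downcode_gt(gt, alt_index)].append(value)
    return [min(buckets[g]) for g in ((0, 0), (0, 1), (1, 1))]

def gq_from_pl(pl):
    """GQ as the gap between the two smallest PLs, capped at 99."""
    ordered = sorted(pl)
    return min(ordered[1] - ordered[0], 99)

# Hypothetical sample at a site with alleles REF, ALT1, ALT2.
gt = (1, 2)                      # heterozygous for both alternative alleles
pl = [60, 30, 50, 15, 0, 40]     # PLs for 0/0, 0/1, 1/1, 0/2, 1/2, 2/2
MIN_GQ = 20                      # assumed genotype-quality threshold

an_contribution = {}
for alt_index in (1, 2):
    split_gt = downcode_gt(gt, alt_index)
    split_pl = downcode_pl(pl, 3, alt_index)
    split_gq = gq_from_pl(split_pl)
    # A diploid genotype adds 2 to AN only if it passes the filter.
    an_contribution[alt_index] = 2 if split_gq >= MIN_GQ else 0
    print(alt_index, split_gt, split_pl, split_gq, an_contribution[alt_index])
```

In this toy case the same sample is downcoded to 0/1 in both split rows, but its recomputed GQ differs between them, so it would count toward AN for one alternative allele and not the other.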
Note that, as part of gnomAD v4.1, we also released allele numbers calculated across all possible sites. These data, which were calculated prior to splitting multiallelics, are available for download here. In the all-sites AN results, the total genomes AN at chr1:1228635 is 152374.
Would it therefore be reasonable to use allele numbers computed prior to splitting multiallelic variants and ignore AN values provided for individual alleles?
I would like to obtain estimates of the reference allele number and frequency for each multiallelic site. Given that the alternative alleles at such sites are sometimes reported with different AN values, subtracting the AC values from a single AN value might give a slightly incorrect estimate of the reference allele number. Still, to estimate the reference allele frequency, I suppose I could simply subtract the allele frequencies of all alternative alleles from 1. Would that be a reasonable approach?
I would greatly appreciate your comments and suggestions.
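For reference, the two estimates discussed above can be compared with a short sketch. The AN values are the ones quoted in this thread for chr1:1228635; the AC values are hypothetical placeholders, since the actual counts are not given here. The function names are my own, and neither calculation is confirmed as the recommended approach.

```python
# Two ways to estimate the reference allele frequency at a multiallelic site.
# AC values below are HYPOTHETICAL; AN values are those quoted in this thread.

def ref_af_from_site_an(site_an, alt_acs):
    """(site AN - sum of alternative ACs) / site AN, using the AN
    computed before splitting multiallelics."""
    return (site_an - sum(alt_acs)) / site_an

def ref_af_from_alt_afs(alt_acs, alt_ans):
    """1 - sum of per-allele AFs, each AF using its own post-split AN."""
    return 1.0 - sum(ac / an for ac, an in zip(alt_acs, alt_ans))

site_an = 152374              # all-sites AN (pre-split)
alt_ans = [152256, 152374]    # per-allele AN after splitting (T, A)
alt_acs = [1200, 35]          # hypothetical allele counts for T and A

ref_af_site = ref_af_from_site_an(site_an, alt_acs)
ref_af_split = ref_af_from_alt_afs(alt_acs, alt_ans)

print(f"reference AF from site-level AN: {ref_af_site:.6f}")
print(f"reference AF from per-allele AF: {ref_af_split:.6f}")
print(f"absolute difference:             {abs(ref_af_site - ref_af_split):.2e}")
```

With AN values this close to each other, the two estimates differ only slightly for these placeholder counts; the gap would grow at sites where the per-allele AN values diverge more.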