GroupMax genetic ancestry in total differs from genomes and exomes?

Forwarding this question on behalf of a user at Stanford U working on the ClinGen project…

Is it right that both the genome/exome GroupMax values might be from one genetic ancestry group, but the total values (which I thought were genome + exome) could be from another genetic ancestry group?

example variant: 17-7662018-AAG-A

Thanks for posting @Larry_Babb — just noting that the team is discussing this and will reply back soon.

@Larry_Babb This is possible in some edge cases. In this example, the variant is poorly covered in exomes and well covered in genomes, and ancestry representation differs between the exome and genome datasets. Since all the numbers are low, you may want to take the coverage per dataset per ancestry into account during interpretation.

@qin, could you give more guidance on how to know when to ignore the total ancestry group? For example, this variant has the same issue: gnomAD, but it seems to pass all QC metrics for both genomes and exomes. Thanks.

@Christine_Preston Sorry for my late reply; I went back to read the paper and understand the code. In our combined_faf data, there are 77,624 variants for which grpmax_faf comes from the same genetic ancestry group in exomes and genomes but from a different group in the joint data.

I don’t have specific guidance, but this passage might also explain why it happens:
"Usually, this is from the population with the highest nominal allele frequency.
However, because the tightness of a 95% confidence interval in the Poisson distribution depends upon sample size, the stringency of the filter depends upon the allele number (AN). The stringency of the filter therefore varies appropriately according to the size of the sub-population in which the variant is observed, and sequencing coverage at that site, and af_filter is occasionally derived from a population other than the one with the highest nominal allele frequency."

For the example variant you found, the FAF calculated from the observed AC and AN in each ancestry group is as follows:

The FAF value itself is correct, but it depends on the size of the sub-population through AC and AN. In the joint data we added relatively less AC than AN for NFE, so its FAF did not increase as much as it did for AMR (the relationship is not linear). The opposite situation can also occur: when more AC is added relative to the added AN, the FAF grpmax ancestry can switch as well.
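To make the switch concrete, here is a minimal sketch in plain Python. It is not the gnomAD code. It assumes the Poisson formulation from the quoted passage: the FAF is taken as the largest allele frequency for which an observed AC would still exceed the 95% upper bound of a Poisson distribution with mean AN × AF. The function names (`poisson_cdf`, `faf`, `grpmax`), the group labels "A" and "B", and all counts are invented for illustration. Real implementations may also special-case very small counts (for example AC = 1), which this sketch does not do.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed in log space for stability."""
    if k < 0:
        return 0.0
    if lam <= 0.0:
        return 1.0
    log_lam = math.log(lam)
    total = sum(math.exp(i * log_lam - lam - math.lgamma(i + 1))
                for i in range(k + 1))
    return min(1.0, total)

def faf(ac, an, ci=0.95):
    """Illustrative filtering allele frequency (assumed formulation).

    Finds the largest Poisson mean lam with P(X <= ac - 1) >= ci by
    bisection, then divides by AN to express it as a frequency.
    """
    if ac <= 0 or an <= 0:
        return 0.0
    lo, hi = 0.0, float(ac)  # the CDF at lam = ac is well below ci
    for _ in range(100):
        mid = (lo + hi) / 2
        if poisson_cdf(ac - 1, mid) >= ci:
            lo = mid
        else:
            hi = mid
    return lo / an

def grpmax(groups):
    """Return the group label with the highest FAF."""
    return max(groups, key=lambda g: faf(*groups[g]))

# Hypothetical (AC, AN) per group; NOT the counts for any real variant.
counts = {
    "exomes":  {"A": (10, 5000), "B": (3, 800)},
    "genomes": {"A": (2, 300),   "B": (3, 800)},
}
# Joint data: add AC and AN across the two datasets for each group.
joint = {
    g: (counts["exomes"][g][0] + counts["genomes"][g][0],
        counts["exomes"][g][1] + counts["genomes"][g][1])
    for g in ("A", "B")
}

for name, groups in [("exomes", counts["exomes"]),
                     ("genomes", counts["genomes"]),
                     ("joint", joint)]:
    fafs = {g: faf(*groups[g]) for g in groups}
    print(name, {g: f"{v:.3e}" for g, v in fafs.items()}, "grpmax:", grpmax(groups))
```

With these invented counts, group A has the larger FAF in exomes and in genomes taken separately, while group B has the larger FAF once AC and AN are pooled. Group B's pooled AC doubles along with its AN, and because the Poisson bound tightens non-linearly with AC, its FAF rises more than group A's does. In the exome counts, group B also has the higher nominal AF but the lower FAF, which is the effect the quoted passage describes.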

It seems to me that, whether using each dataset alone or the joint data, this variant can be ruled out as Mendelian disease-causing for both AMR and NFE, because its AF is greater than the computed FAF.

Our team is currently short on bandwidth to dig further, but you are welcome to report more examples if this explanation does not apply. If you want to look at how the FAF is calculated, the code is here.