Reference bias in variant calling?

Hello, I appreciate the use of Discourse for this forum, great choice.

I have a set of variant calls that I produced by de novo assembly of individual genomes, construction of a graph genome, and projection of the variation onto GRCh38, which yields a VCF. Specifically, I’m interested in SNVs.

I would like to use gnomAD as a resource of “known variants” I can compare my VCF against as a way to potentially identify false positive SNVs. In other words, if one of my SNVs is in gnomAD with AF > 0, I have more confidence it’s a true variant call.
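
To make the comparison concrete, here is a minimal sketch of what I have in mind, in plain Python. It assumes both VCFs use the same contig names and are left-aligned on GRCh38, and that the gnomAD INFO field carries a per-allele `AF` value; the function names are my own, not part of any tool. For real gnomAD-sized files an indexed tool such as `bcftools isec` would be more practical.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flag-only keys map to True."""
    out = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def iter_snvs(vcf_lines):
    """Yield ((chrom, pos, ref, alt), af) for every SNV allele in the records."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts, info = (
            fields[0], int(fields[1]), fields[3], fields[4], fields[7]
        )
        afs = parse_info(info).get("AF")
        af_values = afs.split(",") if isinstance(afs, str) else []
        # Split multi-allelic records so each ALT allele is compared on its own.
        for i, alt in enumerate(alts.split(",")):
            if len(ref) != 1 or len(alt) != 1 or alt in (".", "*"):
                continue  # keep single-base substitutions only
            af = None
            if i < len(af_values) and af_values[i] not in ("", "."):
                af = float(af_values[i])
            yield (chrom, pos, ref, alt), af


def split_by_gnomad(my_vcf_lines, gnomad_vcf_lines):
    """Return (supported, unsupported) SNV keys; supported means gnomAD AF > 0."""
    known = {
        key for key, af in iter_snvs(gnomad_vcf_lines)
        if af is not None and af > 0
    }
    supported, unsupported = [], []
    for key, _ in iter_snvs(my_vcf_lines):
        (supported if key in known else unsupported).append(key)
    return supported, unsupported
```

The `unsupported` list is the set I am unsure how to interpret, which is what the question below is about.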

But one question I have is how reference bias will affect this. If gnomAD variants are called from alignments against GRCh38, then variants will generally be under-called in regions with lower mappability, whereas this is less true for my assembly-based SNVs. Additionally, regions with systematic misalignments across multiple individuals may produce false positive variant calls in gnomAD.

So my question is whether there is a way to quantify these issues. If a SNV does not appear in gnomAD, is there any way to discriminate between a false positive call on my end versus a poorly called region in gnomAD? And for SNVs that do appear in gnomAD, what measures were taken to avoid calling SNVs from spurious alignments?

Hope this makes sense. Any thoughts on this would be helpful.

Thank you for your patience.

If a SNV does not appear in gnomAD, is there any way to discriminate between a false positive call on my end versus a poorly called region in gnomAD?

One way to check whether a region has alignment issues is to look at its coverage and the all-sites allele number metrics (the latter are described in the v4.1 blog post). Also, note that a handful of genes are not well covered in gnomAD because of false duplications in the GRCh38 reference (e.g., KCNE1).
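
As a rough illustration of how those two metrics could be combined, here is a minimal sketch. It assumes you have already looked up the allele number (AN) and mean coverage for the site, plus the maximum AN for the dataset; the function name, labels, and both thresholds are hypothetical placeholders, not gnomAD recommendations.

```python
def classify_absent_snv(allele_number, mean_coverage, max_allele_number,
                        min_an_fraction=0.5, min_mean_coverage=15.0):
    """Label a site where an SNV has no gnomAD match.

    allele_number / mean_coverage: per-site values looked up beforehand
    (None if the site is missing from the metrics files).
    The two thresholds are illustrative assumptions only.
    """
    if max_allele_number <= 0:
        raise ValueError("max_allele_number must be positive")
    if allele_number is None or mean_coverage is None:
        return "gnomad_no_data"
    # Few genotyped alleles relative to the cohort suggests a hard-to-call site.
    if allele_number / max_allele_number < min_an_fraction:
        return "gnomad_low_an"
    if mean_coverage < min_mean_coverage:
        return "gnomad_low_coverage"
    return "gnomad_callable"
```

An absent SNV at a `gnomad_callable` site is more likely to be a false positive (or a genuinely rare variant) on your side, while the other labels suggest the absence may simply reflect a poorly called region.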

And for SNVs that do appear in gnomAD, what measures were taken to avoid calling SNVs from spurious alignments?

As described in this help page, we do not apply any special filtering. However, we do flag variants that fall within low-complexity regions or overlap segmental duplications.
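
If you want to use those flags to down-weight matches, a minimal sketch could look like the following. It assumes the flags appear in the VCF INFO field under the names `lcr` and `segdup`; please confirm the names against the header of the release you download. The helper names and tier labels are hypothetical.

```python
# Assumed INFO flag names; verify them in the VCF header before relying on this.
REGION_FLAGS = ("lcr", "segdup")


def region_flags(info):
    """Return the region flags present in a VCF INFO string."""
    keys = {item.split("=", 1)[0] for item in info.split(";") if item}
    return [flag for flag in REGION_FLAGS if flag in keys]


def confidence_tier(info):
    """Give a coarse label to a gnomAD match based on its region flags."""
    return "flagged_region" if region_flags(info) else "unflagged"
```

A match labelled `flagged_region` is still a match, but it gives weaker support for your SNV than one in an unflagged region.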