Number of observed variants vs hail query (v4.1.0)

Hello,

I’m interested in the number of pLoF SNVs observed for a particular set of genes, one being “FGA”. I would like to make a table of those specific variants and have tried to follow the steps previously highlighted to do this on the really helpful forum post. ([Constraint metric - #4 by jkgoodrich)]

For the example gene: “FGA” when I run the following script on hail, I only return 13 variants when the observed should be 16 ( After filtering for: * High-confidence LOFTEE variants (without any flags), * Only variants in the MANE Select transcript, * PASS variants that are SNVs with MAF ≤ 0.1%, and * Exome median depth ≥ 30,).

The question is should I see the exact same number of variants (i.e. 16) that is posted on the table for the gene (screenshot below) or is 13 now the correct number?

The script for hail I am using is below (adapted from previous post, thank you! :pray:)

**import hail as hl**

**from gnomad.resources.grch38.gnomad import public_release, coverage**
**from gnomad.utils.vep import filter_vep_transcript_csqs**

# Define the interval for FGA gene
FGA_interval = "chr4:154583128-154590745"

# Get v4.1 exome sites Hail Table and exome coverage.
ht = public_release("exomes").ht()
coverage_ht = coverage("exomes").ht()

# Filter the exome release Hail Table to the FGA gene interval.
ht = hl.filter_intervals(ht, [hl.parse_locus_interval(FGA_interval, reference_genome="GRCh38")])

# Filter to variants in FGA that are LOFTEE high-confidence (with no flags) in the MANE select transcript.
ht = filter_vep_transcript_csqs(
    ht,
    synonymous=False,
    mane_select=True,
    genes=["FGA"],
    match_by_gene_symbol=True,
    additional_filtering_criteria=[lambda x: (x.lof == "HC") & hl.is_missing(x.lof_flags)],
)

# Filter to PASS SNVs with AF <= 0.1% and median exome depth ≥ 30.
ht = ht.filter(
    (hl.len(ht.filters) == 0)
    & (ht.allele_info.allele_type == "snv")
    & (ht.freq[0].AF <= 0.001)
    & (coverage_ht[ht.locus].median_approx >= 30)
)

# Output the number of variants
print(f"Number of variants: {ht.count()}")

# Select and show the filtered variants
ht.select(
    freq=ht.freq[0],
    csq=ht.vep.transcript_consequences[0].consequence_terms,
    coverage=coverage_ht[ht.locus],
).show(-1)

Results

Best wishes,
Alex

Hi Alex, thanks for reaching out! I’d be happy to help you troubleshoot.

What do you get when you do the exact same for Genomes ? The cause of this may be three genome-specific pLoF variants , since our individuals with genome sequencing are different than our individuals with exome sequencing.

EDIT: Scratch that, sorry. We’ll get back to you later with more on this

1 Like

Thank you for pointing this out! We truly appreciate your diligence in bringing this to our attention.

Regarding the discrepancy in the number of variants you’re seeing for the gene “FGA,” we recently discovered a bug that affected the filtering of variants by LOFTEE flag. Specifically, while we intended to filter based on flags, this wasn’t done correctly due to a bug in the gnomad_methods code run at the time of the v4.1 constraint release.

The bug was fixed in a recent update. You can check out the details in this pull request. The relevant fix is: “Allow lof_flags to be missing in addition to lof_flags == "" for lof == "HC" to have no flag penalty” (note that this is a change from previous LOFTEE annotations).

This bug likely explains why you’re seeing 13 variants instead of the 16 listed in the constraint table. The table may include variants that were not properly filtered due to the bug.

Additionally, I’d like to emphasize two points:

  1. The v4.0 constraint metrics are still experimental, as highlighted in our constraint blog post. If you’re looking for a more established version, we recommend continuing to use the gnomAD v2.1.1 constraint metrics.
  2. If you’re interested in retrieving all observed pLoF variants for a gene, you may want to adjust your filtering criteria, as the constraint model applies a coverage threshold of ≥30, which might exclude some variants relevant to your analysis but not to the constraint calculations.

Thank you again for pointing this out, and please let me know if you have any further questions!

1 Like

Ok great, understood. Thanks for taking the time to comprehensively explain this!