Why does the Lof observed not correlate with the number of LOF found in the gene?
Example ASH1L
In the constraint table: pLof, observed 18 and pLof score 1
However there are reported 60 LOF in ASH1L Main transcript. Some might be artifacts, but LOF curration has not been performed.
Best regards, Mari Ann
Hi Mari Ann,
Please see this response:
Thanks. However, non of the LOF in ASH1L (Main select transcript) have a Flag for low confidence. Could that be that the process of LOFTEE quality test has jet not be performed for ASH1L?
Hi Mari Ann,
Thanks for your patience! I did some further digging and wanted to clarify a few points.
As detailed in our gnomAD blog post and the constraint documentation, the observed variant count in the constraint table only includes unique single nucleotide variants (SNVs) with a minor allele frequency (MAF) < 0.1% and a median depth of at least 30 in exome samples. The reasoning behind this is to limit the inclusion of false positives, which can be more common among higher-frequency pLoF variants.
To confirm, I ran a small analysis on the gnomAD v4.1 dataset for the ASH1L gene using these criteria. After filtering for:
- High-confidence LOFTEE variants (without any flags),
- Only variants in the MANE Select transcript,
- PASS variants that are SNVs with MAF ≤ 0.1%, and
- Exome median depth ≥ 30,
import hail as hl
from gnomad.resources.grch38.gnomad import public_release, coverage
from gnomad.utils.vep import filter_vep_transcript_csqs
ASH1L_interval = "chr1:155335268-155563162"
# Get v4.1 exome sites Hail Table and exome coverage.
ht = public_release("exomes").ht()
coverage_ht = coverage("exomes").ht()
# Filter the exome release Hail Table to the ASH1L gene interval.
ht = hl.filter_intervals(ht, [hl.parse_locus_interval(ASH1L_interval, reference_genome="GRCh38")])
# Filter to variants in ASH1L that are LOFTEE high-confidence (with no flags) in the MANE select transcript.
ht = filter_vep_transcript_csqs(
ht,
synonymous=False,
mane_select=True,
genes=["ASH1L"],
match_by_gene_symbol=True,
additional_filtering_criteria=[lambda x: (x.lof == "HC") & hl.is_missing(x.lof_flags)],
)
#Filter to PASS SNVs with AF <= 0.1% and median exome depth ≥ 30.
ht = ht.filter(
(hl.len(ht.filters) == 0)
& (ht.allele_info.allele_type == "snv")
& (ht.freq[0].AF <= 0.001)
& (coverage_ht[ht.locus].median_approx >= 30)
)
print(f"Number of variants: {ht.count()}")
ht.select(
freq=ht.freq[0],
csq=ht.vep.transcript_consequences[0].consequence_terms,
coverage=coverage_ht[ht.locus],
).show(-1)
I found 18 variants, which matches the count observed in the constraint table.
This suggests that the difference between comes from the additional filtering steps applied to the variants included in the constraint model, such as minor allele frequency and depth thresholds.
I hope this provides some clarity! Please feel free to reach out with any more questions.
Thanks for clarifing.