V4: Number of missense variants in VCFs doesn't match the stats

Hi all,
I downloaded the gnomAD v4.1.0 exomes and filtered for GENE_PHENO=Ensembl and missense variants. I get 50,162,936 rows, but the statistics on the website report only 16,412,219. What is the reason for the difference?

Hi @Tehila_Leiman,

We report the number of missense variants found on canonical transcripts. However, even when I don’t filter to canonical transcripts, I see only about 18M sites with “missense_variant” in the consequence_terms array of any entry in the VEP transcript_consequences array. Here is the code I used:

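import hail as hl
from gnomad.resources.grch38.gnomad import public_release

# Load the public exomes release as a Hail Table (one row per variant site).
ht = public_release("exomes").ht()
# Keep sites where at least one transcript consequence includes "missense_variant".
ht = ht.filter(
    hl.any(
        hl.map(
            lambda x: (x.consequence_terms.contains("missense_variant")),
            ht.vep.transcript_consequences,
        )
    )
)
# Count the remaining sites.
ht.count()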

It returns 18,231,426. How exactly are you filtering the release to missense?
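
I don't know yet whether this is what happened in your case, but one common way to end up with a much larger number is to count one row per transcript annotation instead of one row per site. Here is a minimal sketch of the three ways of counting, in plain Python on made-up records. The site and transcript IDs are hypothetical, and I am assuming the canonical flag is stored as 1 on canonical transcripts and is absent or 0 otherwise:

```python
# Toy records (hypothetical, NOT real gnomAD data): one dict per variant site,
# each with a list of per-transcript annotations shaped like VEP output.
SITES = [
    {"site": "s1", "transcript_consequences": [
        {"transcript_id": "T1", "canonical": 1, "consequence_terms": ["missense_variant"]},
        {"transcript_id": "T2", "consequence_terms": ["missense_variant"]},
        {"transcript_id": "T3", "consequence_terms": ["intron_variant"]},
    ]},
    {"site": "s2", "transcript_consequences": [
        {"transcript_id": "T4", "canonical": 1, "consequence_terms": ["synonymous_variant"]},
        {"transcript_id": "T5", "consequence_terms": ["missense_variant"]},
    ]},
    {"site": "s3", "transcript_consequences": [
        {"transcript_id": "T6", "canonical": 1, "consequence_terms": ["missense_variant"]},
        {"transcript_id": "T7", "consequence_terms": ["missense_variant"]},
        {"transcript_id": "T8", "consequence_terms": ["missense_variant"]},
    ]},
    {"site": "s4", "transcript_consequences": [
        {"transcript_id": "T9", "canonical": 1, "consequence_terms": ["intron_variant"]},
    ]},
]


def is_missense(tc):
    """True if this transcript annotation lists missense_variant."""
    return "missense_variant" in tc["consequence_terms"]


def count_annotation_rows(sites):
    """One count per (site, transcript) pair that is missense."""
    return sum(
        1 for s in sites for tc in s["transcript_consequences"] if is_missense(tc)
    )


def count_sites_any_transcript(sites):
    """One count per site with a missense annotation on any transcript."""
    return sum(
        any(is_missense(tc) for tc in s["transcript_consequences"]) for s in sites
    )


def count_sites_canonical(sites):
    """One count per site with a missense annotation on a canonical transcript."""
    return sum(
        any(
            is_missense(tc) and tc.get("canonical") == 1
            for tc in s["transcript_consequences"]
        )
        for s in sites
    )


if __name__ == "__main__":
    print("annotation rows:", count_annotation_rows(SITES))
    print("sites, any transcript:", count_sites_any_transcript(SITES))
    print("sites, canonical only:", count_sites_canonical(SITES))
```

On this toy data the three counts come out as 6, 3 and 2. The ordering (annotation rows, then sites on any transcript, then sites on canonical transcripts) matches the ordering of the three numbers in this thread, which is why I am asking about your exact filter.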