Implausibly large structural variants causing false positive overlaps

There are 1314 SVs in gnomAD v4 SV that span over 80% of their chromosome, and 4728 that span over 50% of their chromosome.

Because these are so large, they overlap almost everything on their chromosome, so together they cause a huge number of false positive overlaps.
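
To make the scale of the problem concrete, here is a rough sketch using made-up coordinates (nothing below is taken from gnomAD apart from the chr1 length). It places a hypothetical SV covering 80% of chr1 and checks how many evenly spaced 10 kb query intervals it overlaps; `overlaps`, `giant_sv` and `queries` are illustrative names only.

```python
# Hypothetical coordinates, for illustration only (not taken from gnomAD).
CHR1_LENGTH = 248_956_422  # GRCh38 chr1


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals on the same contig share at least one base."""
    return a_start <= b_end and b_start <= a_end


# A made-up SV spanning 80% of chr1.
giant_sv = (20_000_000, 20_000_000 + int(CHR1_LENGTH * 0.8))

# Made-up 10 kb query intervals, one every 10 Mb along chr1.
queries = [(pos, pos + 10_000) for pos in range(1, CHR1_LENGTH, 10_000_000)]

hits = [q for q in queries if overlaps(*q, *giant_sv)]
print(f"{len(hits)} of {len(queries)} query intervals overlap the giant SV")
```

A single record of this size matches most queries on the chromosome, and there are thousands of such records.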

Yes, almost all of them are flagged with a FILTER, but anyone who forgets to apply it will get a lot of false positive overlaps (which could be bad if they are using overlaps to discard variants above an AF threshold).

A lot of these are not biologically plausible. For example, gnomAD-SV_v3_DEL_chr1_2a75678c has SVLEN=203,277,062.

chr1 in GRCh38 (NC_000001.11) is 248,956,422 bp long, so this represents a deletion of 82% of chr1.
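
The arithmetic behind that percentage, as a quick check using only the two numbers quoted above (the variable names are just for illustration):

```python
SVLEN = 203_277_062        # SVLEN of gnomAD-SV_v3_DEL_chr1_2a75678c
CHR1_GRCH38 = 248_956_422  # length of NC_000001.11

fraction_deleted = SVLEN / CHR1_GRCH38
print(f"deleted: {fraction_deleted:.1%}, remaining: {1 - fraction_deleted:.1%}")
```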

There are 2 reported homozygotes, but it is not plausible that there are people walking around with only 18% of chromosome 1 left and no ill effects.

Example test program:

import cyvcf2

CONTIG_SIZES_GRCH38 = {
    "chr1": 248956422,
    "chr2": 242193529,
    "chr3": 198295559,
    "chr4": 190214555,
    "chr5": 181538259,
    "chr6": 170805979,
    "chr7": 159345973,
    "chr8": 145138636,
    "chr9": 138394717,
    "chr10": 133797422,
    "chr11": 135086622,
    "chr12": 133275309,
    "chr13": 114364328,
    "chr14": 107043718,
    "chr15": 101991189,
    "chr16": 90338345,
    "chr17": 83257441,
    "chr18": 80373285,
    "chr19": 58617616,
    "chr20": 64444167,
    "chr21": 46709983,
    "chr22": 50818468,
    "chrX": 156040895,
    "chrY": 57227415
}

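# Counters for SVs longer than 50% / 80% of their contig.
# Every SV counted in over_80_percent is also counted in over_50_percent.
over_50_percent = 0
over_80_percent = 0

for v in cyvcf2.Reader("/data/annotation/VEP/annotation_data/GRCh38/gnomad.v4.0.sv.merged.vcf.gz"):
    contig_size = CONTIG_SIZES_GRCH38[v.CHROM]
    # Records without SVLEN (or with SVLEN=0) are skipped.
    if svlen := v.INFO.get("SVLEN"):
        svlen = int(svlen)
        if svlen > (contig_size * .8):
            over_80_percent += 1
        if svlen > (contig_size * .5):
            over_50_percent += 1

# Note: FILTER is deliberately not checked here, so filtered records are counted too.
print(f"{over_80_percent=}, {over_50_percent=}")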

Hey David,
We agree that false positive calls in the gnomAD SV callset could be confusing for interpretation. To ensure high precision in this dataset, our pipeline required support from the sequencing depth profile for CNVs over 5 kb, and we manually reviewed CNVs over 1 Mb. Calls that were reported by the pipeline but failed manual review (e.g. the example you mentioned) were kept in the dataset but flagged with a failure FILTER. They are not displayed in the gnomAD browser by default, unless the “include filtered variants” option is on. For interpretation, we recommend restricting to PASS events.
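
A minimal sketch of what restricting to PASS events could look like before overlap annotation. This is not code from the gnomAD pipeline: `usable_for_overlap` and `max_fraction` are hypothetical, the size check is an optional extra guard rather than something recommended above, and the filter name in the example is a placeholder. It assumes the caller passes FILTER as `None` or `"PASS"` for passing records (as far as I know, cyvcf2's `Variant.FILTER` returns `None` for those).

```python
def usable_for_overlap(filter_value, svlen, contig_size, max_fraction=0.5):
    """Decide whether an SV record should be used for overlap annotation.

    filter_value: FILTER field; None or "PASS" means the record passed
                  (assumption about how the caller encodes it).
    svlen:        SVLEN from INFO, or None if absent.
    contig_size:  length of the record's contig in bp.
    max_fraction: hypothetical extra guard, rejects SVs longer than this
                  fraction of the contig even if they passed.
    """
    if filter_value not in (None, "PASS"):
        return False
    if svlen is not None and abs(svlen) > contig_size * max_fraction:
        return False
    return True


CHR1 = 248_956_422  # GRCh38 chr1

# A passing 5 kb deletion is kept.
print(usable_for_overlap(None, 5_000, CHR1))
# The 203 Mb example is dropped because it carries a failure FILTER
# ("EXAMPLE_FAIL" is a placeholder, not a real gnomAD filter name).
print(usable_for_overlap("EXAMPLE_FAIL", 203_277_062, CHR1))
```

With cyvcf2 this could be called as `usable_for_overlap(v.FILTER, v.INFO.get("SVLEN"), CONTIG_SIZES_GRCH38[v.CHROM])` inside the loop from the test program above.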