Irregularities in the CNV v4.0 control dataset

Dear gnomAD CNV team,

Thank you for publicly releasing the gnomAD v4.0 data. It’s such a valuable resource. I reached out to your team via email, and the reply suggested I post my query in the forum instead. My query concerns some irregularities I noticed. Some of these don’t affect my research, but I thought I’d list them so you can fix them in the next version. Others are a little odd, and I’m hoping to get some clarification before I submit for publication.

The dataset is the “CNV v4.0 non neuro control dataset” Excel file.

  1. The Excel file has a column for gene content. Many CNVs do not seem to list the genes in their region, or list only some of them. For example, variant 432421_Dup should also include the NSD1 gene (this can be seen on the gnomAD website). As a result, filtering the Excel file by NSD1 returns no CNVs involving this gene, which is incorrect. Other examples include variants 224065__DUP and 167825__DUP. I suspect there may be thousands of variants with this issue.

  2. 193076__DUP is mislabelled. I think the label was meant to say 16p12.2_DUP.

  3. Variant 17q11.2 is listed in the Excel sheet twice. One of these should be 17q11.2-NF1__DEL and the other should be 17q11.2-NF1__DUP.

  4. The 22q11.2 distal type 1 and type 2 CNVs (del and dup) on the gnomAD website may be listed with incorrect coordinates (too short). They do not appear to include BCR, which is one of the critical genes. DECIPHER data suggests different coordinates for both type 1 and type 2. Clarifying this is important for my upcoming publication (a systematic review of penetrance estimates for 83 CNVs). Could you please list each CNV (dels and dups) that intersects MAPK1 and BCR, along with its coordinates and ‘site count’? I’d be really grateful for that. Thank you.

  5. The Excel file has a variant labelled 15q11-q13-BP1-BP3__DEL. I think this is mislabelled and should be BP2-BP3. The reciprocal duplication with the same coordinates is listed correctly. Relevant to my work, can I check whether the 4 individuals listed with this del actually carry a BP2-BP3 or a BP1-BP3 deletion?

I love the dataset. Thank you for the work you’ve put into releasing it.

Hi there!

Thank you for your kind words! Please allow me to try to address your questions and comments. Two general comments before I address each of your points in more detail: (a) we will be releasing the v4.1 CNV files, which mostly involve renaming the frequency and count fields to align with the format of the SNV/indel data; and (b) are you referring to the VCF file as the Excel file? I’m unsure where the Excel file would be coming from. Could you point out where you accessed it?

  1. This is actually by design of our annotation algorithm. A deletion that removes >=10% of the CDS of a gene is labeled as a deletion of that gene in the column you refer to, while a duplication that covers >=75% of the CDS of a gene is labeled as a duplication of that gene. Details of this annotation/labeling are outlined a bit more in our blog post, “Rare coding CNVs from exome sequenced individuals in gnomAD v4”. If you would like to annotate genic impact by any overlap, definitely feel free to use the raw coordinates of the deletions and duplications and determine overlap with your preferred gene annotation!

  2. Good observation and deduction. For our curation efforts, we actually do not include the 16p12.2 duplication as a locus, so this labeling was intentional. I’m happy to provide more detail about our GD curation process if you would be interested!

  3. I see 2 different entries for 17q11.2, one for the deletion and one for the duplication. These would be entries 37,178 and 37,179 in the non-neuro control subset. Can you confirm?

  4. What coordinates do you have in mind? We label GDs using a 50% reciprocal overlap setting (type I is 21,562,828-23,306,924 and type II is 22,776,924-23,306,924 in our curated coordinates). Viewing the browser, it does look like our labeled type II del overlaps BCR, while type I does not. If you could provide the coordinates that you have in mind, I can follow up in more detail!

  5. Thank you for pointing this out. I will follow up in detail and should be able to get back to you soon!
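
For anyone who wants to re-annotate from raw coordinates, the two rules mentioned in points 1 and 4 (the CDS-fraction thresholds and the 50% reciprocal overlap) can be sketched roughly as below. This is only an illustration under stated assumptions, not the gnomAD pipeline code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the CDS is assumed to be supplied as a list of non-overlapping exon intervals from whatever gene annotation you prefer.

```python
from typing import List, Tuple

# (start, end) on one chromosome; 1-based inclusive coordinates are an assumption.
Interval = Tuple[int, int]

# Thresholds quoted in point 1: fraction of a gene's CDS that must be covered.
CDS_THRESHOLD = {"DEL": 0.10, "DUP": 0.75}

# Curated 22q11.2 distal coordinates quoted in point 4.
TYPE_I: Interval = (21_562_828, 23_306_924)
TYPE_II: Interval = (22_776_924, 23_306_924)


def overlap_len(a: Interval, b: Interval) -> int:
    """Number of bases shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def cds_fraction(cnv: Interval, cds_exons: List[Interval]) -> float:
    """Fraction of the gene's CDS bases covered by the CNV.

    Assumes cds_exons are non-overlapping coding intervals of one gene.
    """
    total = sum(end - start + 1 for start, end in cds_exons)
    covered = sum(overlap_len(cnv, exon) for exon in cds_exons)
    return covered / total if total else 0.0


def gene_is_labeled(cnv: Interval, svtype: str, cds_exons: List[Interval]) -> bool:
    """True if the CNV would meet the CDS threshold for its type (DEL or DUP)."""
    return cds_fraction(cnv, cds_exons) >= CDS_THRESHOLD[svtype]


def reciprocal_overlap(a: Interval, b: Interval, min_frac: float = 0.5) -> bool:
    """True if the shared span covers at least min_frac of BOTH intervals."""
    shared = overlap_len(a, b)
    len_a = a[1] - a[0] + 1
    len_b = b[1] - b[0] + 1
    return shared >= min_frac * len_a and shared >= min_frac * len_b
```

As a usage note, with a made-up gene whose CDS is `[(100, 199), (300, 399)]`, a CNV at `(150, 320)` covers 71 of 200 coding bases: enough to be labeled for a deletion, but not for a duplication. Similarly, `reciprocal_overlap(TYPE_I, TYPE_II)` is false even though type II lies entirely inside type I, because type II spans only about 30% of type I. This is why a CNV can visibly overlap a gene or region in the browser and still not carry the corresponding label.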