I’m wondering if there is any appetite for gnomAD to publish frequency information on “short range haplotypes”. In our work, we’re seeing more and more requests for this kind of information for use in gene-editing off-target prediction. That is, people don’t want to consider each variant independently, but rather the actual local sequence with all variation included.
For most gene-editing applications we’re looking at sequence windows of ~20-30 bp. The only public datasets that allow for construction of this kind of information are 1KG and HGDP (and thank you for the combined 1KG+HGDP phased VCFs!), where phased genotypes are available for a reasonable number of samples. We take those VCFs, identify the unique set of “short range haplotypes” in each window, and then annotate each haplotype with its frequency in each population.
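To make the procedure above concrete, here is a minimal sketch for illustration only, not the actual pipeline. It assumes the phased genotypes for one window have already been read from the VCF into plain Python structures, that all variants are biallelic and non-overlapping, and that every sample is diploid. The names `apply_alleles` and `haplotype_frequencies`, and all of the example data, are hypothetical.

```python
from collections import Counter, defaultdict


def apply_alleles(ref_seq, window_start, variants, allele_indices):
    """Build the window sequence carried by one haplotype.

    variants: list of (pos, ref, alt), with pos in the same coordinates as window_start.
    allele_indices: 0 (ref) or 1 (alt) for each variant, in the same order.
    Assumes biallelic, non-overlapping variants fully inside the window.
    """
    seq = ref_seq
    paired = sorted(zip(variants, allele_indices), key=lambda x: x[0][0], reverse=True)
    # Apply right to left so indels do not shift offsets of variants still to come.
    for (pos, ref, alt), allele in paired:
        if allele == 0:
            continue
        offset = pos - window_start
        if seq[offset:offset + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at {pos}")
        seq = seq[:offset] + alt + seq[offset + len(ref):]
    return seq


def haplotype_frequencies(ref_seq, window_start, variants, genotypes, populations):
    """Return {population: {haplotype_sequence: frequency}} for one window.

    genotypes: {sample: ["0|1", ...]} with one phased GT string per variant.
    populations: {sample: population_label}.
    """
    counts = defaultdict(Counter)
    for sample, gts in genotypes.items():
        alleles = [gt.split("|") for gt in gts]
        for hap_index in (0, 1):  # the two phased haplotypes of a diploid sample
            idx = [int(a[hap_index]) for a in alleles]
            hap = apply_alleles(ref_seq, window_start, variants, idx)
            counts[populations[sample]][hap] += 1
    return {
        pop: {hap: n / sum(c.values()) for hap, n in c.items()}
        for pop, c in counts.items()
    }


# Hypothetical example: a 10 bp window starting at position 100 with two SNVs.
ref_seq = "ACGTACGTAC"
variants = [(102, "G", "T"), (107, "T", "C")]
genotypes = {
    "s1": ["0|1", "0|1"],
    "s2": ["1|0", "0|0"],
    "s3": ["0|0", "1|1"],
}
populations = {"s1": "POP_A", "s2": "POP_A", "s3": "POP_B"}

freqs = haplotype_frequencies(ref_seq, 100, variants, genotypes, populations)
for pop, haps in sorted(freqs.items()):
    for hap, freq in sorted(haps.items()):
        print(pop, hap, freq)
```

Only the haplotype sequences and their per-population frequencies leave this step, which is the property that would matter for a dataset where individual genotypes cannot be shared.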
We would love to be able to do this from a larger set of genomes (and possibly exomes), such as gnomAD. While the individual-level genotypes in gnomAD obviously cannot be released, would the team be open to constructing and publishing this kind of information? If so, I think we (Fulcrum Genomics) would be open to developing the tooling to process phased VCFs at the scale required for gnomAD.