Ancestry inference v4

We are trying to use the RF / ONNX models from gnomAD v4 to routinely infer the ancestry of our clinical samples.
We have located a Python notebook with some example code: gnomad_qc/gnomad_qc/example_notebooks/ancestry_classification_using_gnomad_rf.ipynb (main branch of the broadinstitute/gnomad_qc repository on GitHub).

The first problem we have run into is that our original data is in VCF format. Is there a conversion tool to transform it into a VDS? It would be great if anybody could point us to a tutorial or more thorough documentation for accomplishing this starting from a VCF.


Hi @biojl,

In place of the VDS code in the notebook’s cells 9 and 11, you could import your VCF directly as a MatrixTable (MT), run Hail’s split_multi_hts, and filter to the variants present in the loading HT. You could then resume at cell 12. The example below is one way to do this (assuming the data is on GRCh38):

mt = hl.import_vcf(path_to_your_vcf, reference_genome="GRCh38")
mt = hl.split_multi_hts(mt)
mt = mt.filter_rows(hl.is_defined(v3_loading_ht[mt.row_key]))
mt = mt.select_entries("GT", "GQ", "DP", "AD")  # Optional but reduces the data being processed

You can read more about importing a VCF in the Hail documentation for hl.import_vcf.
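
Since the snippet above assumes GRCh38, it may be worth checking how the contigs in your VCF are named before importing. Hail's GRCh38 reference uses "chr"-prefixed contigs (chr1, ..., chrX, chrY, chrM), so a VCF with unprefixed names needs a mapping passed to the contig_recoding argument of hl.import_vcf. Below is a minimal standard-library sketch of that check. The helper names vcf_contigs and grch38_contig_recoding are hypothetical (they are not part of Hail or gnomad_qc), and the sketch only handles the primary contigs 1-22, X, Y and MT.

```python
import gzip
import re


def vcf_contigs(path, max_records=10000):
    """Return contig names from the header and the first data lines of a VCF."""
    # bgzip output is a valid multi-member gzip stream, so gzip.open can read it.
    opener = gzip.open if path.endswith((".gz", ".bgz")) else open
    contigs = []
    records = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("##contig="):
                # Header line such as ##contig=<ID=chr1,length=248956422>
                match = re.search(r"ID=([^,>]+)", line)
                name = match.group(1) if match else None
            elif line.startswith("#"):
                continue
            else:
                # Data line: the first tab-separated field is CHROM.
                name = line.split("\t", 1)[0]
                records += 1
            if name and name not in contigs:
                contigs.append(name)
            if records >= max_records:
                break
    return contigs


def grch38_contig_recoding(contigs):
    """Map unprefixed primary contigs to the chr-prefixed names Hail's GRCh38 uses."""
    standard = [str(i) for i in range(1, 23)] + ["X", "Y"]
    recoding = {c: "chr" + c for c in standard if c in contigs}
    if "MT" in contigs:
        recoding["MT"] = "chrM"  # assumption: the mitochondrial contig is named MT
    return recoding
```

If the returned dictionary is non-empty, it could be passed as contig_recoding=... in the hl.import_vcf call. Note that this only renames contigs: it does not change coordinates, so if the data were actually aligned to GRCh37 the variants would need a liftover rather than a rename.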