But the first problem we encounter is that our original data is in VCF format. Is there a conversion tool to transform it into a VDS? It would be superb if anybody could point us to a tutorial or more thorough documentation for accomplishing this task starting from a VCF.
In place of the VDS code in the notebook’s cells 9 and 11, you could import your VCF directly as a MatrixTable (MT), run Hail’s split_multi_hts, and filter to the variants present in the loading HT. You could then resume at cell 12. The example below is one way to do this (assuming the data is on GRCh38):
mt = hl.import_vcf(path_to_your_vcf, reference_genome="GRCh38")
mt = hl.split_multi_hts(mt)
mt = mt.filter_rows(hl.is_defined(v3_loading_ht[mt.row_key]))
mt = mt.select_entries("GT", "GQ", "DP", "AD") # Optional but reduces the data being processed
You can read more about importing a VCF in the Hail documentation for import_vcf.
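If it helps to see why the split comes before the filter, here is a toy sketch in plain Python (not Hail, and not how Hail implements it). It assumes the loading table is keyed by locus plus a single ref/alt pair, so a multiallelic row would not match any key until it is split. The Variant type, the example records, and the loading_keys set are all made up for illustration, and the sketch skips the allele trimming and genotype handling that split_multi_hts also does.

```python
# Toy illustration only: plain Python, not Hail's implementation.
from typing import NamedTuple, Tuple


class Variant(NamedTuple):
    contig: str
    pos: int
    ref: str
    alts: Tuple[str, ...]  # one or more alternate alleles


def split_multi(variants):
    """Yield one biallelic record per alternate allele."""
    for v in variants:
        for alt in v.alts:
            yield Variant(v.contig, v.pos, v.ref, (alt,))


def filter_to_loadings(variants, loading_keys):
    """Keep biallelic variants whose (contig, pos, ref, alt) key is in loading_keys."""
    return [
        v for v in variants
        if (v.contig, v.pos, v.ref, v.alts[0]) in loading_keys
    ]


# Hypothetical call set: one multiallelic site and one biallelic site.
callset = [
    Variant("chr1", 100, "A", ("C", "T")),
    Variant("chr1", 200, "G", ("A",)),
]
# Hypothetical stand-in for the keys of the loading table.
loading_keys = {("chr1", 100, "A", "T"), ("chr1", 300, "C", "G")}

split = list(split_multi(callset))
kept = filter_to_loadings(split, loading_keys)
print(kept)
```

In this toy data, the chr1:100 A>T allele is only recoverable because the A>C,T site was split first; filtering the unsplit records against the same keys would drop it.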