Hello, I was wondering if the code / scripts used to train the ancestry RF classifier were available?
Specifically, we’d like to retrain it on RNASeq-based SNPs and are looking for the original code. Any GitHub repo or other source would be very helpful!
Our genetic ancestry inference utility functions are in this script, and an example of how we use them is in this v4 sample QC script.
Thank you for your response!
I am aware of the utility functions, but how the RF classifier was trained is vitally important to our research, as we would like to use a different approach with RNASeq data and our own set of SNVs. I was hoping that the methods of training the RF classifier would be documented in further detail as described in your article here: https://gnomad.broadinstitute.org/news/2023-11-genetic-ancestry/.
In gnomAD, we use genetic similarity between samples to infer and create genetic ancestry groups. As described previously [3,4,5], we perform a principal component analysis (PCA) on a set of high-quality SNVs to identify clusters of samples based on their genetic similarity, and these clusters roughly correspond to geographic ancestry provided by data contributors. We then train a random forest (RF) classifier on a subset of samples with provided genetic ancestry labels using the principal components from the PCA as features.
How you define “high-quality SNVs”, how the clustering was done, and how the RF classifier was trained are all important for us to be able to train our own model. Are these scripts available?
All of the scripts that we use to generate the gnomAD datasets, including the script that describes how we defined our set of high-quality SNVs, are in the linked repository, gnomad_qc. How those variants were selected is also documented in the v4 blog post.
I see, thank you for pointing me in the right direction!
Are any scripts / functions used for training the RF classifier described here available as well?
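For anyone who wants to prototype the general approach quoted from the blog post above (PCA on SNV genotypes, then a random forest trained on the principal components of labeled samples), here is a minimal sketch. It is not the gnomAD implementation. It assumes NumPy and scikit-learn, and the toy genotype matrix, the group names (`grp_a` etc.), the number of PCs, the labeled fraction, and the `MIN_PROB` cutoff are all hypothetical values chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 groups x 100 samples, 500 biallelic SNVs,
# genotypes coded as alternate-allele counts (0, 1, 2).
n_per_group, n_snvs = 100, 500
groups = np.array(["grp_a", "grp_b", "grp_c"])
freqs = rng.uniform(0.05, 0.95, size=(len(groups), n_snvs))  # per-group allele frequencies
genotypes = np.vstack(
    [rng.binomial(2, f, size=(n_per_group, n_snvs)) for f in freqs]
)
true_labels = np.repeat(groups, n_per_group)

# Step 1: PCA on the genotype matrix; each sample gets n_pcs scores.
n_pcs = 5  # assumption: arbitrary choice for this toy example
pcs = PCA(n_components=n_pcs, random_state=0).fit_transform(genotypes.astype(float))

# Step 2: pretend only a subset of samples came with ancestry labels.
labeled = rng.random(len(true_labels)) < 0.3

# Step 3: train the random forest on the PCs of the labeled samples.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(pcs[labeled], true_labels[labeled])

# Step 4: classify the remaining samples; leave low-confidence calls unassigned.
MIN_PROB = 0.9  # assumption: illustrative cutoff, not a documented gnomAD value
probs = clf.predict_proba(pcs[~labeled])
best = probs.argmax(axis=1)
assigned = np.where(probs.max(axis=1) >= MIN_PROB, clf.classes_[best], "unassigned")

print(dict(zip(*np.unique(assigned, return_counts=True))))
```

Because the classifier only sees PC scores, the same skeleton should apply to genotypes called from RNASeq, provided the labeled and unlabeled samples are projected into the same PC space. Which SNVs to include and how many PCs to use would need to be tuned for your own data.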