Ancestry RF classifier code and scripts

wallbp · February 26, 2025, 12:52pm

Hello, I was wondering if the code / scripts used to train the ancestry RF classifier were available?

wallbp · March 11, 2025, 12:28pm

Specifically, we’d like to retrain on RNASeq-based SNPs, and we are looking for the original code. Any GitHub repo or other source would be very helpful!

kchao · March 12, 2025, 8:52pm

Our genetic ancestry inference utility functions are in this script, and an example of how we use them is in this v4 sample QC script.

wallbp · March 13, 2025, 11:22am

Thank you for your response!

I am aware of the utility functions, but how the RF classifier was trained is vitally important to our research, as we would like to use a different approach with RNASeq data and our own set of SNVs. I was hoping that the methods of training the RF classifier would be documented in further detail as described in your article here: https://gnomad.broadinstitute.org/news/2023-11-genetic-ancestry/.

In gnomAD, we use genetic similarity between samples to infer and create genetic ancestry groups. As described previously^3,4,5, we perform a principal component analysis (PCA) on a set of high-quality SNVs to identify clusters of samples based on their genetic similarity, and these clusters roughly correspond to geographic ancestry provided by data contributors. We then train a random forest (RF) classifier on a subset of samples with provided genetic ancestry labels using the principal components from the PCA as features.
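As a rough illustration of the workflow quoted above (PCA on genotypes, then an RF trained on the PCs of the labelled samples), here is a minimal sketch using scikit-learn on synthetic data. It is not the gnomAD code: the toy genotypes, group names, number of PCs, labelled fraction, and the `min_prob` cutoff are all assumptions made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 groups, each with its own allele frequencies
# at 200 SNVs; genotypes are alt-allele counts (0, 1, or 2).
n_per_group, n_snvs = 100, 200
freqs = rng.uniform(0.05, 0.95, size=(3, n_snvs))
genotypes = np.vstack(
    [rng.binomial(2, f, size=(n_per_group, n_snvs)) for f in freqs]
)
true_group = np.repeat(["grp_a", "grp_b", "grp_c"], n_per_group)

# Step 1: PCA on the genotype matrix (samples x SNVs).
pcs = PCA(n_components=10, random_state=0).fit_transform(genotypes)

# Step 2: pretend only ~30% of samples came with a provided label.
labelled = rng.random(len(true_group)) < 0.3

# Step 3: train the RF on the labelled samples, using PCs as features.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(pcs[labelled], true_group[labelled])

# Step 4: assign the unlabelled samples, leaving low-confidence calls
# unassigned. The 0.9 cutoff is an arbitrary value for this sketch.
min_prob = 0.9
probs = rf.predict_proba(pcs[~labelled])
best = probs.argmax(axis=1)
assigned = np.where(
    probs.max(axis=1) >= min_prob, rf.classes_[best], "unassigned"
)

print(dict(zip(*np.unique(assigned, return_counts=True))))
```

For a real dataset the genotype matrix would come from the chosen SNV set rather than a simulation, and the number of PCs and the probability cutoff would need to be tuned and validated.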

How you define “high-quality SNVs”, how clustering was done, and how the RF classifier was trained are all important for our ability to train our own model. Are these scripts available?

kchao · March 17, 2025, 9:02pm

All of the scripts that we use to generate the gnomAD datasets, including the script that describes how we defined our set of high-quality SNVs, are in the linked repository, gnomad_qc. How those variants were selected was also documented in the v4 blog post.

wallbp · March 18, 2025, 11:29am

I see, thank you for pointing me in the right direction!

Are any scripts / functions used for training the RF classifier described here available as well?

sanjeevgnomad · May 9, 2025, 4:01pm

@kchao
Sorry to hijack the thread.

It’s utterly difficult to follow the script and documentation to perform ancestry analysis.
Is there any argument type tutorial?

kchao · May 12, 2025, 6:52pm

All of the code used in the gnomAD quality control pipelines is available in the gnomAD QC GitHub repository.

re: tutorials – The team created an example notebook here. Note that the code in this notebook may be slightly out of date and is meant to only be a guide to how these functions could be applied.

Please also refer to our blog post, which discusses important caveats around applying gnomAD’s genetic ancestry inference resources.
