Ancestry inference params in v4


I am trying to use this example notebook with the RF model in v4 for ancestry inference. There are version-specific params such as num_pcs and min_probs. Where can I find these values for v4?


Hi @wwong,

In gnomAD v4, we used 20 PCs and a minimum probability of 0.75. However, these numbers are specific to our dataset, and you should examine your own data to determine the appropriate parameters for it.

We typically look at plots of consecutive PCs and look for the point where the plots start to lose distinct clusters. Once that happens, we draw the cutoff on the population PCs there. For example, if the plots of PC1 vs PC2, PC3 vs PC4, … PC9 vs PC10 all show distinct clusters but the plot of PC11 vs PC12 does not, we would use PCs 1-10. The minimum probability is a balance between over-classifying and under-classifying data into these artificially created groups, i.e. too few or too many remaining individuals. You will need to examine the numbers output by the model on your dataset to decide where you would like to draw the line in this clustering process.
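To make the two decisions above concrete, here is a minimal sketch in plain Python. It is not the gnomAD implementation: the function names, the "unassigned" label, and the toy probabilities are all hypothetical. It assumes you have already judged by eye which PC-pair plots show distinct clusters, and that you have per-sample group probabilities from your own model.

```python
from collections import Counter


def pcs_to_keep(pair_has_clusters):
    """Return the number of PCs to keep.

    pair_has_clusters[i] is your own visual judgment (True/False) for the
    plot of PC(2i+1) vs PC(2i+2). We stop at the first plot that lacks
    distinct clusters and keep every PC before it.
    """
    n_pairs = 0
    for has_clusters in pair_has_clusters:
        if not has_clusters:
            break
        n_pairs += 1
    return 2 * n_pairs


def assign_groups(probs, min_prob):
    """Assign each sample to its most probable group, or to "unassigned".

    probs maps sample ID -> {group label: probability}. A sample is only
    assigned when its top probability is at least min_prob.
    """
    assignments = {}
    for sample, group_probs in probs.items():
        best_group = max(group_probs, key=group_probs.get)
        if group_probs[best_group] >= min_prob:
            assignments[sample] = best_group
        else:
            assignments[sample] = "unassigned"  # hypothetical label
    return assignments


def sweep_min_prob(probs, thresholds):
    """Count samples per group at each candidate threshold."""
    return {t: Counter(assign_groups(probs, t).values()) for t in thresholds}


# Toy probabilities, invented purely for illustration.
example_probs = {
    "s1": {"grp_a": 0.95, "grp_b": 0.05},
    "s2": {"grp_a": 0.70, "grp_b": 0.30},
    "s3": {"grp_a": 0.20, "grp_b": 0.80},
    "s4": {"grp_a": 0.55, "grp_b": 0.45},
}

# PC1-PC10 plots look clustered, PC11 vs PC12 does not -> keep 10 PCs.
print(pcs_to_keep([True, True, True, True, True, False]))

for threshold, counts in sweep_min_prob(example_probs, [0.5, 0.75, 0.9]).items():
    print(threshold, dict(counts))
```

Running the sweep on your own model output shows how many individuals fall out of each group as the threshold rises, which is the trade-off to weigh when choosing a minimum probability.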

For a more complete look at our process, you can visit the assign_ancestry module in our gnomad_qc repo.