Hi, I am trying to create a context-dependent mutation model for constraint modelling. I am using the gnomAD v3 paper as a guide.
I want to understand how the mutation rates are calculated for each trinucleotide context. For example, if there are 500 ACA > AGA mutations and ACA occurs 2000 times in the whole genome, is the mutation rate (ACA > AGA) = 500/2000 or is it 500/(2000*3)? I assume *3 because ACA can mutate to 3 different contexts.
Or is it something completely different? I assume I need to normalise the rates, but I don’t know how to go about it. Any help would be appreciated. Thanks.
Hi Agastya, it should be 500/2000, as you are specifically computing the mutation rate for ACA > AGA. Although ACA can mutate to three different contexts, that does not affect the mutation rate of ACA > AGA here. Say you want to compute the rate for ACA > ATA: you would count the ACA > ATA instances and divide that number by 2000.
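As a minimal sketch of that calculation in Python: the names `context_rates`, `mutation_counts` and `context_counts` are made up for illustration, and the ACA > ATA count of 300 is an invented number, not taken from any dataset.

```python
def context_rates(mutation_counts, context_counts):
    """Per-substitution rate = observed substitutions / occurrences of the reference context."""
    rates = {}
    for (ref, alt), n_mutations in mutation_counts.items():
        # The denominator is the count of the reference trinucleotide only;
        # it is NOT multiplied by the 3 possible alternate alleles.
        rates[(ref, alt)] = n_mutations / context_counts[ref]
    return rates


# 500 ACA > AGA is from the example above; 300 ACA > ATA is a hypothetical value.
mutation_counts = {("ACA", "AGA"): 500, ("ACA", "ATA"): 300}
context_counts = {"ACA": 2000}

rates = context_rates(mutation_counts, context_counts)
print(rates[("ACA", "AGA")])  # 500 / 2000
print(rates[("ACA", "ATA")])  # 300 / 2000
```

Each substitution gets its own numerator, and all substitutions from the same reference context share the same denominator.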
Right, thanks, that’s what I have been doing. The problem is that as the number of samples increases (and consequently the number of mutations), the rates also increase. So calculating the rates from just 1000 genomes and then calculating expected variants leads to an underestimation of expected variants compared to the observed variants in my complete dataset (10,000 samples), and consequently to very low constraint.
At present, I am using the rates computed from all 10,000 samples to calculate constraint. But of course, that overestimates the mutation rates of the CG-rich contexts. How can I transform the rates from 1000 genomes so that they can be applied to 10,000 genomes?
Thanks a lot.