How to extract and analyze variant sequences from gnomAD v4 exome VCF

Pharmacogenomics Study of GPCR Gene Family Using gnomAD v4 Exome Database

I am conducting a pharmacogenomics study focusing on the GPCR gene family, analyzing variant sequences in ~700,000 individuals from the gnomAD v4 exome database. My objective is to:

  1. Extract all possible variant sequences for a specific gene.
  2. Map these sequences to populations.
  3. Analyze the sequence variations.

Pipeline for 1000 Genomes Project VCF

The following pipeline works well for the 1000 Genomes dataset, where individual sample information is present in the VCF headers:

# Extract gene-specific region and prepare reference
bcftools view -Oz -r chromosome_number:start-end vcf.gz > gene.1000g.vcf.gz
tabix -p vcf gene.1000g.vcf.gz
sed -e 's/^>chr/>/' GRCh38_full_analysis_set_plus_decoy_hla.fa > out.fa
samtools faidx out.fa

# Generate consensus sequences for each sample
bcftools view -h gene.1000g.vcf.gz | grep "^#CHROM" | cut -f10- | tr '\t' '\n' | while read -r sample; do
  bcftools view -c1 -Oz -s "$sample" -o "1000g.$sample.vcf.gz" gene.1000g.vcf.gz
  samtools faidx out.fa "chromosome_number:start-end" | bcftools consensus "1000g.$sample.vcf.gz" -o "1000g.gene.$sample.fasta"
done

This pipeline generates consensus sequences for individual samples efficiently.

Problem with gnomAD v4 Exome VCF

Challenges when using gnomAD v4 exome VCF include:

  1. The exome VCF contains aggregated variant sites without individual sample information.
  2. Attempting to generate consensus sequences using the following approach results in errors:

Attempted Approach:

bcftools view -Oz -r chr1:23874535-23875617 gnomad.exomes.v4.1.sites.chr1.vcf.bgz > gene.gnomad.vcf.gz
tabix -p vcf gene.gnomad.vcf.gz
samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa chr1:23874535-23875617 > reference.fa
bcftools consensus gene.gnomad.vcf.gz -f reference.fa -o gene.gnomad.fasta

Errors Encountered:

Note: applying REF,ALT variants  
The site chr1:23874536 overlaps with another variant, skipping...  
The site chr1:23874538 overlaps with another variant, skipping...

Questions

  1. How can I effectively extract all possible variant sequences for a GPCR gene from the gnomAD v4 exome database, given that individual-level data is not available?
  2. What is the best strategy to generate consensus sequences representing all possible variant combinations for downstream analysis?
  3. How can I map these sequences to their respective populations or frequency distributions available in gnomAD metadata?

Any advice or suggestions for improving my pipeline or handling this type of genomic data would be greatly appreciated.

Thank you!