VCFs v4.1.0 appear to not follow spec?

I am getting this error from Arrow when trying to convert the VCFs into Parquet files:

Arrow(InvalidArgumentError("Column 'info_AF' is declared as non-nullable but contains null values"))

I’d just like to verify whether the declaration of info_AF in your VCFs, and the data in that column, are as they should be with regard to the VCF spec, or whether this is more of a general problem with Arrow. This also makes me wonder whether it would be possible to offer your data as Parquet in addition to VCF and Hail.
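One way to check this independently of Arrow is to count how many records have a missing AF in the INFO column. The VCF spec uses "." for a missing value and INFO keys are optional, so a nullable AF would not by itself be a spec violation. Below is a minimal sketch using only the Python standard library; the function name count_missing_info_values and the sample records are made up for illustration and are not taken from the actual files.

```python
def count_missing_info_values(lines, key="AF"):
    """Return (total_records, records_missing_key) for an iterable of VCF lines.

    A record counts as missing when the INFO key is absent, or when any of
    its comma-separated values is '.' (the VCF missing-value marker) or empty.
    """
    total = 0
    missing = 0
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, meta-information lines (##) and the header line (#CHROM).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete record; INFO is the 8th column
        total += 1
        # Parse "KEY=VALUE;KEY=VALUE;FLAG" into a dict.
        info = {}
        for item in fields[7].split(";"):
            name, _, value = item.partition("=")
            info[name] = value
        values = info.get(key)
        if values is None or any(v in (".", "") for v in values.split(",")):
            missing += 1
    return total, missing


# Hypothetical sample records, for illustration only.
sample = [
    "##fileformat=VCFv4.1\n",
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">\n',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG\t.\tPASS\tAC=1;AF=0.5\n",
    "1\t200\t.\tC\tT\t.\tPASS\tAC=0;AF=.\n",
    "1\t300\t.\tG\tA\t.\tPASS\tAC=0\n",
]

print(count_missing_info_values(sample))  # (3, 2)
```

For a real file, the same function can be fed an open file handle (for example one from gzip.open(path, "rt") for a compressed VCF). A non-zero second number would mean the column has to be treated as nullable by whatever does the conversion.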

Hello!

Not an official responder here, but I have been looking into VCF->parquet solutions and have encountered this same issue.

My recommendation would be to import the VCF into Hail (which should be able to handle this incorrectly typed field), convert it to a Spark DataFrame, then export to Parquet with df.write.parquet(...).
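The route described above could look roughly like the sketch below. It assumes Hail and its Spark backend are installed; the function name vcf_to_parquet_via_hail, the paths, and the reference genome are placeholders. It keeps only the site-level (row) fields, which may or may not be what you need.

```python
def vcf_to_parquet_via_hail(vcf_path, parquet_path, reference_genome="GRCh37"):
    """Import a VCF with Hail and write its row (site-level) fields as Parquet.

    Hypothetical helper: the arguments are placeholders, and the reference
    genome must match the build the VCF was called against.
    """
    # Imported here so that defining the function does not require Hail.
    import hail as hl

    # force_bgz treats a .gz file as block-gzipped; this is an assumption
    # about how the input file was compressed.
    mt = hl.import_vcf(vcf_path, reference_genome=reference_genome, force_bgz=True)

    # rows() drops the per-sample entries and keeps locus, alleles and INFO.
    rows = mt.rows()

    # to_spark() flattens nested structs by default, so INFO fields become
    # top-level columns in the resulting Spark DataFrame.
    df = rows.to_spark()

    df.write.parquet(parquet_path)
    return parquet_path
```

Calling it would look like vcf_to_parquet_via_hail("input.vcf.bgz", "output.parquet"), with both paths replaced by your own.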

Thank you, Benjamin! I am trying to avoid Hail. However, the problem can be solved by passing -I to the vcf2parquet CLI tool.

Awesome! It looks like that option isn’t documented in the README, but I do see it in the code!
