How do I prepare my input genomes?#
Target genome data requirements#
Note
This workflow works best with the output of an imputation server such as Michigan or TOPMed.
If you’d like to input WGS genomes, some extra preprocessing steps are required.
Only human chromosomes 1 – 22, X, Y, and XY are supported by the pipeline, although sex chromosomes are rarely used in scoring files.
If the input data contain other chromosomes (e.g. patch regions), the pipeline may fail with an error and stop calculating.
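To see in advance whether a VCF contains chromosome names outside the supported set, something like the following sketch may help. It assumes chromosome names are written as 1-22, X, Y, or XY, optionally with a `chr` prefix; numeric plink codes such as 23 would also be reported. The function name `unsupported_chroms` is hypothetical and is not part of the pipeline.

```shell
# Read VCF (or .pvar) text on stdin and print every chromosome name that is
# not 1-22, X, Y, or XY. A leading "chr" prefix is tolerated.
# Hypothetical helper for illustration only.
unsupported_chroms() {
    awk '
        /^#/ { next }                      # skip header lines
        {
            chrom = $1
            sub(/^chr/, "", chrom)         # tolerate names such as chr22
            if (chrom !~ /^([1-9]|1[0-9]|2[0-2]|X|Y|XY)$/) print $1
        }
    ' | sort -u
}

# Usage: gzip -dc <full_path_to_vcf.vcf.gz> | unsupported_chroms
```

An empty result suggests that every variant sits on a supported chromosome.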
Supported file formats#
The following file formats are currently supported:
VCF
Plink 1 file set (.bed / .bim / .fam)
Plink 2 file set (.pgen / .pvar / .psam)
Compressed input is supported and automatically detected. For example, bgzip
compression of VCF files, or zstd compression of plink2 variant information
files (.pvar).
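If you are unsure how a file was compressed, its first bytes usually tell you. The sketch below is only a way to inspect your own files and is not how the pipeline itself detects compression. It relies on the standard magic numbers for gzip (shared by bgzip) and zstd; the function name `detect_compression` is hypothetical.

```shell
# Print "gzip", "zstd", or "none" based on the first four bytes of a file.
# Hypothetical helper for illustration only.
detect_compression() {
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        1f8b*)    echo gzip ;;   # gzip and bgzip both start with 1f 8b
        28b52ffd) echo zstd ;;   # zstd frame magic number
        *)        echo none ;;
    esac
}

# Usage: detect_compression <path_to_file>
```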
VCF from an imputation server#
plink2 --vcf <full_path_to_vcf.vcf.gz> \
    --allow-extra-chr \
    --chr 1-22, X, Y, XY \
    --make-pgen --out <short_name>_axy
Note
Non-standard chromosomes/patches should not cause errors in versions later than v2.0.0-alpha.3; however, they will be filtered out of the list of variants available for PGS scoring.
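Imputation servers often return one VCF per chromosome, in which case the conversion has to be repeated for each file. The following sketch prints one plink2 command per `*.vcf.gz` file in a directory so that you can review the commands before running them. The function name `plink2_commands` and the choice of output name (the VCF file name without its extension) are assumptions made for illustration.

```shell
# Print, without running, one plink2 conversion command for every
# *.vcf.gz file in the given directory.
# Hypothetical helper; assumes paths contain no spaces.
plink2_commands() {
    for vcf in "$1"/*.vcf.gz; do
        [ -e "$vcf" ] || continue            # directory had no matching files
        name=$(basename "$vcf" .vcf.gz)      # e.g. chr1.vcf.gz -> chr1
        printf 'plink2 --vcf %s --allow-extra-chr --chr 1-22, X, Y, XY --make-pgen --out %s_axy\n' \
            "$vcf" "$name"
    done
}

# Usage: plink2_commands <path_to_vcf_directory>
```

Once the printed commands look right, they can be pasted into a terminal or piped to `sh`.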
VCF from WGS#
See PGScatalog/pgsc_calc#123 for discussion about tools to convert the VCF files into ones suitable for calculating PGS.
If you input WGS data to the calculator without following the steps above then you will probably encounter match rate errors. For more information, see: Are your target genomes imputed? Are they WGS?
plink binary fileset (bfile)#
plink2 --bfile <path_to_bfile_prefix> \
    --allow-extra-chr \
    --chr 1-22, X, Y, XY \
    --make-pgen --out <prefix>_axy
plink2 binary fileset (pfile)#
Note
The pipeline will be much faster if you convert your input data to pfile format.
plink2 --pfile <path_to_pfile_prefix> \
    --allow-extra-chr \
    --chr 1-22, X, Y, XY \
    --make-pgen --out <prefix>_axy
Warning
Don’t forget to replace paths in <brackets> with your own data!
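After a conversion, it can be worth confirming that the resulting plink2 file set is complete, since all three files (.pgen / .pvar / .psam) share one prefix. This is a minimal sketch; the function name `missing_pfile_parts` is hypothetical, and it assumes that only the .pvar file may be stored with zstd compression (as .pvar.zst).

```shell
# Print the path of each plink2 file set member that is missing for a prefix.
# Prints nothing when .pgen, .pvar (or .pvar.zst), and .psam are all present.
# Hypothetical helper for illustration only.
missing_pfile_parts() {
    prefix=$1
    for ext in pgen pvar psam; do
        if [ -e "$prefix.$ext" ]; then
            continue
        fi
        # Accept a zstd-compressed variant information file
        if [ "$ext" = pvar ] && [ -e "$prefix.pvar.zst" ]; then
            continue
        fi
        echo "$prefix.$ext"
    done
}

# Usage: missing_pfile_parts <prefix>_axy
```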