How do I prepare my input genomes?#

Target genome data requirements#

Note

This workflow works best with the output of an imputation server such as Michigan or TOPMed.

If you’d like to input WGS genomes, some extra preprocessing steps are required.

  • Only human chromosomes 1 – 22, X, Y, and XY are supported by the pipeline, although sex chromosomes are rarely used in scoring files.

  • If input data contain other chromosomes (e.g. patch regions), the pipeline may stop with an error instead of calculating scores.
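
Before running the pipeline, it can help to check which chromosome names a VCF actually contains. The following is a minimal sketch, assuming a plain-text or gzip/bgzip-compressed VCF; `list_vcf_contigs` is a hypothetical helper name and is not part of the pipeline.

```shell
# List the distinct values of the CHROM column in a VCF.
# Hypothetical helper: the name and behaviour are illustrative only.
list_vcf_contigs() {
    vcf="$1"
    # bgzip output is gzip-compatible, so gzip -cd can read it
    case "$vcf" in
        *.gz) gzip -cd "$vcf" ;;
        *)    cat "$vcf" ;;
    esac | grep -v '^#' | cut -f1 | sort -u
}
```

Calling `list_vcf_contigs <full_path_to_vcf.vcf.gz>` prints one name per line. Any name outside 1 – 22, X, Y, and XY is a candidate for filtering before scoring.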

Supported file formats#

The following file formats are currently supported:

  • VCF

  • Plink 1 file set (.bed / .bim / .fam)

  • Plink 2 file set (.pgen / .pvar / .psam)

Compressed input is supported and automatically detected, for example bgzip-compressed VCF files or zstd-compressed plink2 variant information files (.pvar).
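
If you are unsure how a file was compressed, its leading bytes can be inspected by hand. This is a rough sketch and not how the pipeline itself detects compression; `detect_compression` is a hypothetical helper that compares the first bytes against the standard gzip (`1f 8b`) and zstd (`28 b5 2f fd`) magic numbers.

```shell
# Report the compression of a file based on its first four bytes.
# Hypothetical helper for manual sanity checks only.
detect_compression() {
    # od prints the bytes as hex; tr strips spaces and the newline
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case "$magic" in
        1f8b*)    echo gzip ;;   # also matches bgzip (BGZF) files
        28b52ffd) echo zstd ;;
        *)        echo none ;;
    esac
}
```

Because bgzip output is a valid gzip stream, this check cannot tell the two apart. It only confirms that a file is compressed in one of the expected ways.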

VCF from an imputation server#

plink2 --vcf <full_path_to_vcf.vcf.gz> \
    --allow-extra-chr \
    --chr 1-22, X, Y, XY \
    --make-pgen --out <short_name>_axy
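
Imputation servers often return one VCF per chromosome. The same conversion can then be applied in a loop. The sketch below assumes all input files end in `.vcf.gz` and live in a single directory. `convert_vcfs` and the `PLINK2` variable are hypothetical names introduced for illustration.

```shell
# Path to the plink2 executable. Set PLINK2=echo to preview
# the commands without running them.
PLINK2="${PLINK2:-plink2}"

# Convert every *.vcf.gz in in_dir to a pfile in out_dir,
# keeping only the chromosomes supported by the pipeline.
convert_vcfs() {
    in_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    for vcf in "$in_dir"/*.vcf.gz; do
        [ -e "$vcf" ] || continue          # skip if nothing matched
        name=$(basename "$vcf" .vcf.gz)
        "$PLINK2" --vcf "$vcf" \
            --allow-extra-chr \
            --chr 1-22, X, Y, XY \
            --make-pgen --out "$out_dir/${name}_axy"
    done
}
```

For example, `convert_vcfs <path_to_vcf_directory> <path_to_output_directory>` writes one `_axy` pfile per input VCF.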

Note

Non-standard chromosomes/patches should not cause errors in versions later than v2.0.0-alpha.3; however, they will be filtered out from the list of variants available for PGS scoring.

VCF from WGS#

See PGScatalog/pgsc_calc#123 for discussion about tools to convert the VCF files into ones suitable for calculating PGS.

If you input WGS data to the calculator without following the steps above then you will probably encounter match rate errors. For more information, see: Are your target genomes imputed? Are they WGS?

plink2 binary fileset (pfile)#

Note

The pipeline will be much faster if you convert your input data to pfile format.

plink2 --pfile <path_to_pfile_prefix> \
    --allow-extra-chr \
    --chr 1-22, X, Y, XY \
    --make-pgen --out <prefix>_axy

Warning

Don’t forget to replace paths in <brackets> with your own data!