How do I prepare my input genomes?#

Target genome data requirements#

Note

This workflow works best with the output of an imputation server such as Michigan or TOPMed.

If you’d like to input WGS genomes, some extra preprocessing steps are required.

  • Only human chromosomes 1 – 22, X, Y, and XY are supported by the pipeline, although sex chromosomes are rarely used in scoring files.

  • If the input data contain other chromosomes (e.g. patch regions), the pipeline may raise an error and stop.
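Before running the pipeline, it can help to see which chromosome names a VCF actually contains. The following is a minimal sketch using only standard shell tools. The helper names `list_chroms` and `unsupported_chroms` are hypothetical and not part of the pipeline, and accepting an optional `chr` prefix is an assumption about your data rather than a documented rule.

```shell
# Hypothetical helpers (not part of pgsc_calc or plink2).

# Print the distinct chromosome names found in a VCF.
# gzip -cdf decompresses gzip/bgzip input and passes plain text through unchanged.
list_chroms() {
    gzip -cdf "$1" | awk '!/^#/ && NF { print $1 }' | sort -u
}

# Print chromosome names outside 1-22, X, Y, XY.
# Assumption: an optional "chr" prefix is treated as acceptable here.
# Prints nothing (and grep returns non-zero) when every name is supported.
unsupported_chroms() {
    list_chroms "$1" | grep -Ev '^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y|XY)$'
}

# Usage (the path is a placeholder):
# unsupported_chroms my_genomes.vcf.gz
```

Any names this prints, such as patch regions or unplaced contigs, are candidates for filtering before scoring.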

Supported file formats#

The following file formats are currently supported:

  • VCF

  • Plink 1 file set (.bed / .bim / .fam)

  • Plink 2 file set (.pgen / .pvar / .psam)

Compressed input is supported and automatically detected, for example bgzip compression of VCF files or zstd compression of plink2 variant information files (.pvar).
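If you are unsure how a file was compressed, its leading bytes can be inspected directly. This sketch is not how the pipeline performs its own detection, which is not described here. `detect_compression` is a hypothetical helper that recognises the gzip magic bytes (shared by bgzip) and the zstd frame magic number.

```shell
# Hypothetical helper (not part of pgsc_calc or plink2):
# report a file's compression from its first four bytes.
detect_compression() {
    # od prints the bytes as hex; tr strips spaces and the trailing newline
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case "$magic" in
        1f8b*)    echo "gzip" ;;   # also matches bgzip, which is gzip-compatible
        28b52ffd) echo "zstd" ;;   # zstd frame magic number
        *)        echo "none" ;;
    esac
}

# Usage (the path is a placeholder):
# detect_compression my_genomes.pvar.zst
```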

VCF from an imputation server#

plink2 --vcf <full_path_to_vcf.vcf.gz> \
    --allow-extra-chr \
    --chr 1-22, X, Y, XY \
    --make-pgen --out <short name>_axy

Note

Non-standard chromosomes/patches should not cause errors in versions later than v2.0.0-alpha.3; however, they will be filtered out of the list of variants available for PGS scoring.

VCF from WGS#

See PGScatalog/pgsc_calc#123 for discussion about tools to convert the VCF files into ones suitable for calculating PGS.

plink2 binary fileset (pfile)#

Note

The pipeline will be much faster if you convert your input data to pfile format.

plink2 --pfile <path_to_pfile_prefix> \
    --allow-extra-chr \
    --chr 1-22, X, Y, XY \
    --make-pgen --out <prefix>_axy

Warning

Don’t forget to replace paths in <brackets> with your own data!