Getting started#

pgsc_calc requires Nextflow and one of Docker, Singularity, or Anaconda. You will need a POSIX-compatible system, like Linux or macOS, to run pgsc_calc.

  1. Start by installing nextflow:

$ java -version # Java v8+ required
openjdk 11.0.13 2021-10-19

$ curl -fsSL get.nextflow.io | bash

$ mv nextflow ~/bin/

  2. Next, install Docker, Singularity, or Anaconda.

  3. Finally, check Nextflow is working:

$ nextflow run pgscatalog/pgsc_calc --help
N E X T F L O W  ~  version 21.04.0
Launching `pgscatalog/pgsc_calc` [condescending_stone] - revision: cf3e5c886b [master]
...

Then check that Docker, Singularity, or Anaconda is working by running the workflow with the bundled test data. In the command below, replace <docker/singularity/conda> with the specific container manager you intend to use:

$ nextflow run pgscatalog/pgsc_calc -profile test,<docker/singularity/conda>
... <configuration messages intentionally not shown> ...
------------------------------------------------------
If you use pgscatalog/pgsc_calc for your analysis please cite:

* The Polygenic Score Catalog
  https://doi.org/10.1038/s41588-021-00783-5

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/pgscatalog/pgsc_calc/blob/master/CITATIONS.md
------------------------------------------------------
executor >  local (7)
[06/6462a0] process > PGSCATALOG_PGSCALC:PGSCALC:INPUT_CHECK:SAMPLESHEET_JSON (samplesheet.csv)                         [100%] 1 of 1 ✔
[b3/d80f09] process > PGSCATALOG_PGSCALC:PGSCALC:INPUT_CHECK:COMBINE_SCOREFILES (1)                                     [100%] 1 of 1 ✔
[bd/ad4d8c] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_RELABELBIM (cineca_synthetic_subset chromoso... [100%] 1 of 1 ✔
[-        ] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR                                     -
[-        ] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_VCF                                             -
[09/bda9b3] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:MATCH_VARIANTS (cineca_synthetic_subset)               [100%] 1 of 1 ✔
[23/2decd9] process > PGSCATALOG_PGSCALC:PGSCALC:APPLY_SCORE:PLINK2_SCORE (cineca_synthetic_subset chromosome 22 eff... [100%] 1 of 1 ✔
[25/6b87fc] process > PGSCATALOG_PGSCALC:PGSCALC:APPLY_SCORE:SCORE_REPORT (1)                                           [100%] 1 of 1 ✔
[6b/52087d] process > PGSCATALOG_PGSCALC:PGSCALC:DUMPSOFTWAREVERSIONS (1)                                               [100%] 1 of 1 ✔
-[pgscatalog/pgsc_calc] Pipeline completed successfully-

Note

Replace <docker/singularity/conda> with what you have installed on your computer (e.g., docker, singularity, or conda). These options are mutually exclusive!
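
A quick way to see which of the three tools is already installed is to look for its executable on your PATH. The sketch below is only an illustration: the helper name first_available is hypothetical and not part of pgsc_calc, and it assumes the profile names match the executable names. Finding an executable does not prove the tool is configured correctly, so the test profile run above remains the real check.

```shell
# Hypothetical helper (not part of pgsc_calc): print the first command
# from the argument list that exists on PATH, or fail if none is found.
first_available() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      return 0
    fi
  done
  return 1
}

# Assumption: the profile names match the executable names.
if profile=$(first_available docker singularity conda); then
  echo "Found $profile: try -profile test,$profile"
else
  echo "None of docker, singularity, or conda was found on PATH"
fi
```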

Calculate your first polygenic scores#

If you’ve completed the installation process, you’ve already calculated some polygenic scores 😍 However, these scores were calculated using synthetic data from a single chromosome. Let’s try calculating scores with your genomic data, which are probably genotypes from real people!

Warning

You might need to prepare input genomic data before calculating polygenic scores; see How do I prepare my input genomes?

1. Samplesheet setup#

First, you need to describe the structure of your genomic data in a standardised way. To do this, set up a spreadsheet that looks like one of the examples below:

Example bfile samplesheet#

sampleset               | vcf_path | bfile_path                 | pfile_path | chrom
cineca_synthetic_subset |          | /full/path/to/bfile_prefix |            | 22
cineca_synthetic_subset |          | /full/path/to/bfile_prefix |            | 21

Example multi-chromosome bfile samplesheet#

sampleset               | vcf_path | bfile_path                 | pfile_path | chrom
cineca_synthetic_subset |          | /full/path/to/bfile_prefix |            |

Example split VCF samplesheet#

sampleset                   | vcf_path             | bfile_path | pfile_path | chrom
cineca_synthetic_subset_vcf | /full/path/to/vcf.gz |            |            | 22
cineca_synthetic_subset_vcf | /full/path/to/vcf.gz |            |            | 21
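
As plain CSV text, the first example (a bfile split by chromosome) could be produced with a few lines of shell. This is a minimal sketch: it assumes the CSV header uses the five column names in the order shown in the examples, and it reuses the placeholder path from the examples rather than a real fileset.

```shell
# Placeholder from the examples above: replace with the real prefix of
# your PLINK binary fileset.
bfile_prefix="/full/path/to/bfile_prefix"

# One row per chromosome. vcf_path and pfile_path stay empty because the
# three path columns are mutually exclusive.
{
  printf 'sampleset,vcf_path,bfile_path,pfile_path,chrom\n'
  for chrom in 22 21; do
    printf 'cineca_synthetic_subset,,%s,,%s\n' "$bfile_prefix" "$chrom"
  done
} > samplesheet.csv

cat samplesheet.csv
```

Empty cells are written as nothing between two commas, so every row keeps five fields.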

There are five mandatory columns. Columns that specify genomic data paths (vcf_path, bfile_path, and pfile_path) are mutually exclusive:

  • sampleset: A text string referring to the name of a target dataset of genotyping data containing at least one sample/individual (however cohort datasets will often contain many individuals with combined genotyped/imputed data). Data from a sampleset may be input as a single file, or split across chromosomes into multiple files. Scores generated from files with the same sampleset name are combined in later stages of the analysis.

  • vcf_path: A text string of a file path pointing to a multi-sample VCF file. File names must be unique. It’s best to use full file paths, not relative file paths. By default, hard-called genotypes (GT field) are imported. To import dosages (DS) instead, specify this in the vcf_genotype_field column; see How to set up a samplesheet for additional information.

  • bfile_path: A text string of a file path pointing to the prefix of a plink binary fileset. For example, if a binary fileset consists of plink.bed, plink.bim, and plink.fam then the prefix would be “plink”. Must be unique. It’s best to use full file paths, not relative file paths.

  • pfile_path: Like bfile_path, but for a PLINK2 format fileset (pgen / psam / pvar)

  • chrom: An integer (range 1-22) or string (X, Y). If the target genomic data file contains multiple chromosomes, leave empty. Don’t use a mix of empty and integer chromosomes in the same sampleset.

Save this spreadsheet in CSV format (e.g., samplesheet.csv).
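
The column rules above can be checked mechanically before running the workflow. The sketch below is not part of pgsc_calc: check_samplesheet is a hypothetical helper that assumes a header row, the column order shown in the examples, and no quoted commas inside fields. It reads "the same sample" in the chrom rule as "the same sampleset", which is an interpretation.

```shell
# Hypothetical helper (not part of pgsc_calc). Assumes a header row, the
# column order shown in the examples, and no quoted commas inside fields.
check_samplesheet() {
  awk -F',' '
    NR == 1 { next }   # skip the header row
    {
      # Exactly one of vcf_path ($2), bfile_path ($3), pfile_path ($4).
      filled = ($2 != "") + ($3 != "") + ($4 != "")
      if (filled != 1) {
        printf "line %d: set exactly one path column\n", NR
        bad = 1
      }
      # chrom ($5) must be 1-22, X, Y, or empty.
      if ($5 != "" && $5 !~ /^([1-9]|1[0-9]|2[0-2]|X|Y)$/) {
        printf "line %d: invalid chrom \"%s\"\n", NR, $5
        bad = 1
      }
      # Remember whether each sampleset uses empty or filled chrom values.
      if ($5 == "") { empty[$1] = 1 } else { split_chrom[$1] = 1 }
    }
    END {
      for (s in empty) if (s in split_chrom) {
        printf "%s: mixes empty and filled chrom values\n", s
        bad = 1
      }
      exit bad
    }
  ' "$1"
}

# Demo on a throwaway file with one valid row.
demo=$(mktemp)
printf '%s\n' 'sampleset,vcf_path,bfile_path,pfile_path,chrom' \
  'cineca_synthetic_subset,,/full/path/to/bfile_prefix,,22' > "$demo"
if check_samplesheet "$demo"; then echo "samplesheet looks consistent"; fi
rm -f "$demo"
```

The function exits non-zero and prints one message per problem, so it can gate a later nextflow run command in a script.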

Note

All samplesets must be in the same genome build (either GRCh37 or GRCh38), which is specified using the --target_build [GRCh3#] parameter. All scoring files are downloaded or mapped to match the specified genome build; no liftover/re-mapping of the genotyping data is performed within the pipeline.

2. Select scoring files#

pgsc_calc makes it simple to work with polygenic scores that have been published in the PGS Catalog. You can specify one or more scores using the --pgs_id parameter:

--pgs_id PGS001229 # one score
--pgs_id PGS001229,PGS001405 # many scores separated by , (no spaces)
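
If the score IDs come from a script or a list, joining them by hand is an easy place to introduce a stray space. As a small sketch, the hypothetical helper join_pgs_ids below (not part of pgsc_calc) joins its arguments with commas and no spaces:

```shell
# Hypothetical helper: join all arguments with commas and no spaces.
# "$*" joins the arguments using the first character of IFS; the
# subshell keeps the IFS change from leaking into the caller.
join_pgs_ids() {
  (IFS=,; printf '%s\n' "$*")
}

pgs_ids=$(join_pgs_ids PGS001229 PGS001405)
echo "--pgs_id $pgs_ids"   # prints: --pgs_id PGS001229,PGS001405
```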

If you would like to use a custom scoring file not published in the PGS Catalog, that’s OK too (see How to use a custom scoring file).

You must specify the genome build that your genotype calls are referenced to using the --target_build parameter. The --target_build parameter only supports builds GRCh37 (hg19) and GRCh38 (hg38).

--pgs_id PGS001229,PGS001405 --target_build GRCh38

In the example above, both PGS001229 and PGS001405 are reported in genome build GRCh37. When the build of your genomic data differs from the original build of the PGS Catalog score, the pipeline downloads harmonized versions (remapped rsIDs and/or lifted positions) of the scoring file(s) in the build you specified for the genotyping datasets.

Custom scoring files can be lifted between genome builds using the --liftover flag (see How to liftover scoring files to match your input genome build for more information). An example would look like:

--scorefile MyPGSFile.txt --target_build GRCh38

3. Putting it all together#

For this example, we’ll assume that the input genomes are in build GRCh37 and that they match the scoring file genome build.

$ nextflow run pgscatalog/pgsc_calc \
    -profile <docker/singularity/conda> \
    --input samplesheet.csv --target_build GRCh37 \
    --pgs_id PGS001229

Congratulations, you’ve now (hopefully) calculated some scores! 🥳

After the workflow executes successfully, the calculated scores and a summary report should be available in the results/score/ directory in your current working directory ($PWD) by default. If you’re interested in more information, see pgsc_calc Outputs & results.

If the workflow didn’t execute successfully, have a look at the Troubleshooting section. Remember to replace <docker/singularity/conda> with the software you have installed on your computer.
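
Before launching a long run, it can help to print the exact command first and catch an unsupported build early. The sketch below is a dry run only and never starts Nextflow. The helper pgsc_calc_command is hypothetical, and it assumes --target_build accepts only the literal strings GRCh37 and GRCh38.

```shell
# Hypothetical helper: print the command from this section after checking
# that the build is one of the two values named for --target_build.
pgsc_calc_command() {
  profile=$1 build=$2 pgs_ids=$3
  case "$build" in
    GRCh37|GRCh38) ;;
    *) echo "unsupported --target_build: $build" >&2; return 1 ;;
  esac
  printf 'nextflow run pgscatalog/pgsc_calc -profile %s --input samplesheet.csv --target_build %s --pgs_id %s\n' \
    "$profile" "$build" "$pgs_ids"
}

# Dry run: this only prints the command, it does not start the workflow.
pgsc_calc_command docker GRCh37 PGS001229
```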

4. Next steps & advanced usage#

The pipeline ships with default settings that allow it to run on a personal computer with smaller datasets (e.g. 1000 Genomes, HGDP). The minimum requirements to run on these smaller datasets are:

  • Linux
    • 16GB RAM

    • 2 CPUs

  • macOS
    • 32GB RAM

    • 2 CPUs

Warning

If you use macOS, Docker will use 50% of your memory at most by default. This means that if you have a Mac with 16GB RAM, pgsc_calc may run out of RAM (most likely during the variant matching step).
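
The arithmetic behind this warning can be sketched as follows. Two assumptions are made here that the text does not state outright: that Docker's default limit is exactly half of the host RAM, and that the workflow itself needs the 16GB listed for Linux.

```shell
# Assumptions (not stated by pgsc_calc): Docker is limited to half of the
# host RAM, and the workflow needs the 16GB listed for Linux.
required_gb=16

docker_limit_gb() {
  echo $(( $1 / 2 ))
}

for host_gb in 16 32; do
  limit_gb=$(docker_limit_gb "$host_gb")
  if [ "$limit_gb" -ge "$required_gb" ]; then
    echo "${host_gb}GB Mac: Docker limit ${limit_gb}GB, meets ${required_gb}GB"
  else
    echo "${host_gb}GB Mac: Docker limit ${limit_gb}GB, below ${required_gb}GB"
  fi
done
```

Under these assumptions a 16GB Mac leaves Docker with 8GB, while a 32GB Mac leaves it with 16GB, which is consistent with the higher macOS minimum listed above.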

For information on how to run the pipelines on larger datasets/computers/job-schedulers, see How do I run pgsc_calc on larger datasets and more powerful computers?.

If you are using a newer Mac computer with an M-series chip, see How do I run the pipeline on an M1 Mac?.