Getting started#

pgsc_calc has a few important software dependencies:

  • Nextflow

  • Docker, Singularity, or Anaconda

  • Linux or macOS

Without these dependencies installed you won’t be able to run pgsc_calc.

Step by step setup#

  1. Install nextflow:

$ java -version # Java v8+ required
openjdk 11.0.13 2021-10-19

$ curl -fsSL get.nextflow.io | bash

$ mv nextflow ~/bin/
  1. Next, install Docker, Singularity, or Anaconda (Docker or Singularity are best)

  2. Run the pgsc_calc test profile:

$ nextflow run pgscatalog/pgsc_calc -profile test,<docker|singularity|conda>

Note

Remember to replace <docker|singularity|conda> one of docker, singularity, or conda

Warning

If you have an ARM processor (like Apple sillicon) please check How do I run the pipeline on an M1 Mac?

Please note the test profile genomes are not biologically meaningful, won’t produce valid scores, and aren’t compatible with scores on the PGS Catalog. We provide these genomes to make checking installation and automated testing easier.

Calculate your first polygenic scores#

If you’ve completed the setup guide successfully then you’re ready to calculate scores with your genomic data, which are probably genotypes from real people. Exciting!

Warning

The format of samplesheets changed in v2.0.0 to better accommodate extra file formats in the future

Warning

You might need to prepare input genomic data before calculating polygenic scores, see How do I prepare my input genomes?

1. Set up samplesheet#

First, you need to describe the structure of your genomic data in a standardised way. To do this, set up a spreadsheet that looks like:

Example samplesheet for a combined plink2 file set#

sampleset

path_prefix

chrom

format

cineca

/path/to/target_genomes/cineca_synthetic_subset

pfile

Example samplesheet for a plink2 file set split by chromosome#

sampleset

path_prefix

chrom

format

cineca

target_genomes/cineca_synthetic_subset

22

pfile

Example samplesheet for a combined VCF file#

sampleset

path_prefix

chrom

format

cineca

/path/to/target_genomes/cineca_synthetic_subset

vcf

See How to set up a samplesheet for more details.

2. Select scoring files#

pgsc_calc makes it simple to work with polygenic scores that have been published in the PGS Catalog. You can specify one or more scores using the --pgs_id parameter:

--pgs_id PGS001229 # one score
--pgs_id PGS001229,PGS001405 # many scores separated by , (no spaces)

Note

You can also select scores associated with traits (--trait_efo) and publications (--pgp_id)

If you would like to use a custom scoring file not published in the PGS Catalog, that’s OK too (see How to use a custom scoring file).

Users are required to specify the genome build that to their genotyping calls are in reference to using the --target_build parameter. The --target_build parameter only supports builds GRCh37 (hg19) and GRCh38 (hg38).

--pgs_id PGS001229,PGS001405 --target_build GRCh38

In the case of the example above, both PGS001229 and PGS001405 are reported in genome build GRCh37. In cases where the build of your genomic data are different from the original build of the PGS Catalog score then the pipeline will download a harmonized (remapped rsIDs and/or lifted positions) versions of the scoring file(s) in the user-specified build of the genotyping datasets.

Custom scoring files can be lifted between genome builds using the --liftover flag, (see How to liftover scoring files to match your input genome build for more information). If your custom PGS was in GRCh37 an example would look like:

---scorefile MyPGSFile.txt --target_build GRCh38 --liftover

3. (Optional) Download reference database#

To enable genetic ancestry similarity calculations and PGS normalisation, download one of our pre-built reference databases:

$ wget https://ftp.ebi.ac.uk/pub/databases/spot/pgs/resources/pgsc_HGDP+1kGP_v1.tar.zst

This database contains a merged 1000 Genomes and Human Genome Diversity Project reference panel, and is the recommended default panel.

You may prefer to use 1000 Genomes only:

See How do I normalise calculated scores across different genetic ancestry groups? for more details.

3. Putting it all together#

For this example, we’ll assume that the input genomes are in build GRCh37 and that they match the scoring file genome build.

$ nextflow run pgscatalog/pgsc_calc \
    -profile <docker/singularity/conda> \
    --input samplesheet.csv --target_build GRCh37 \
    --pgs_id PGS001229 \
    --run_ancestry pgsc_HGDP+1kGP_v1.tar.zst

Congratulations, you’ve now (hopefully) calculated some scores! 🥳

Tip

Don’t include --run_ancestry if you didn’t download the reference database

After the workflow executes successfully, the calculated scores and a summary report should be available in the results/score/ directory in your current working directory ($PWD) by default. If you’re interested in more information, see pgsc_calc Outputs & report.

If the workflow didn’t execute successfully, have a look at the Troubleshooting section. Remember to replace <docker/singularity/conda> with the software you have installed on your computer.

4. Next steps & advanced usage#

The pipeline distributes with settings that easily allow for it to be run on a personal computer on smaller datasets (e.g. 1000 Genomes, HGDP). The minimum requirements to run on these smaller datasets are:

  • Linux
    • 16GB RAM

    • 2 CPUs

  • macOS
    • 32GB RAM

    • 2 CPUs

Warning

If you use macOS, Docker will use 50% of your memory at most by default. This means that if you have a Mac with 16GB RAM, pgsc_calc may run out of RAM (most likely during the variant matching step).

For information on how to run the pipelines on larger datasets/computers/job-schedulers, see How do I run pgsc_calc on larger datasets and more powerful computers?.

If you are running the pipeline multiple times on the same dataset (e.g. different sets of PGS) you can speed the pipeline up by cacheing the genotype harmonization and ancestry steps, see How do I speed up pgsc_calc computation times and avoid re-running code?.

If you are using an newer Mac computer with an M-series chip, see How do I run the pipeline on an M1 Mac?.