`pgsc_calc`: a reproducible workflow to calculate polygenic scores#

The pgsc_calc workflow makes it easy to calculate a polygenic score (PGS) using scoring files published in the Polygenic Score (PGS) Catalog 🧬 and/or custom scoring files.

The calculator workflow automates PGS downloads from the Catalog, variant matching between scoring files and target genotyping samplesets, and the parallel calculation of multiple PGS. Genetic ancestry assignment and PGS normalisation methods are also supported.

Workflow summary#

The workflow performs the following steps:

Downloads scoring files from the PGS Catalog API in a specified genome build (GRCh37 or GRCh38).
Reads custom scoring files (and performs a liftover if the genotyping data is in a different build).
Automatically combines and creates scoring files for efficient parallel computation of multiple PGS.
Matches variants in the scoring files against variants in the target dataset (in plink bfile/pfile or VCF format).
Calculates PGS for all samples (linear sum of weights and dosages).
Creates a summary report to visualise score distributions and pipeline metadata (variant matching QC).
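
The matching and scoring steps above can be illustrated with a small Python sketch. This is not the workflow's own code (the log output further down shows scoring running in a `PLINK2_SCORE` process); it only shows the arithmetic of a linear sum of weights and dosages. The variant key layout, the function names `match_variants` and `polygenic_score`, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical variant key: (chromosome, position, effect allele).
Variant = Tuple[str, int, str]


def match_variants(weights: Dict[Variant, float],
                   dosages: Dict[Variant, float]) -> Dict[Variant, float]:
    """Keep only the scoring-file variants that are present in the target data."""
    return {v: w for v, w in weights.items() if v in dosages}


def polygenic_score(weights: Dict[Variant, float],
                    dosages: Dict[Variant, float]) -> float:
    """Linear sum of effect weight x effect-allele dosage over matched variants."""
    matched = match_variants(weights, dosages)
    return sum(w * dosages[v] for v, w in matched.items())


# Toy scoring file: three variants and their effect weights (made-up values).
weights = {("1", 1000, "A"): 0.5, ("2", 2000, "T"): -0.2, ("3", 3000, "G"): 0.1}
# Toy sample: dosages (0 to 2 copies of the effect allele); the third variant is absent.
sample = {("1", 1000, "A"): 2.0, ("2", 2000, "T"): 1.0}

matched = match_variants(weights, sample)
score = polygenic_score(weights, sample)
print(f"matched {len(matched)} of {len(weights)} variants, score = {score:.2f}")
```

In this toy case one of the three variants is missing from the target data, so only two contribute to the sum. The proportion of matched variants is the kind of quantity a variant matching QC report would summarise.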

The workflow can also optionally:

Use a reference panel to obtain genetic ancestry data using PCA, and define the most similar population in the reference panel for each target sample.
Report PGS using methods to adjust for genetic ancestry.
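
As a rough illustration of these two optional steps, the sketch below assigns a target sample to the closest reference population in PC space and then standardises its score against that population's score distribution. Both choices are simplifying assumptions: nearest-centroid assignment and a plain Z-score stand in for whatever methods the workflow actually applies, and the population labels, PC coordinates, and scores are invented.

```python
import math
from statistics import mean, stdev
from typing import Dict, List, Sequence


def nearest_population(target_pcs: Sequence[float],
                       centroids: Dict[str, Sequence[float]]) -> str:
    """Return the label whose PC-space centroid is closest (Euclidean) to the sample."""
    return min(centroids, key=lambda pop: math.dist(target_pcs, centroids[pop]))


def z_score(pgs: float, reference_scores: List[float]) -> float:
    """Standardise a raw score against one population's reference distribution."""
    return (pgs - mean(reference_scores)) / stdev(reference_scores)


# Hypothetical reference panel: PC centroids and raw PGS per population.
centroids = {"POP_A": (0.0, 0.0), "POP_B": (5.0, 5.0)}
reference_pgs = {"POP_A": [1.0, 2.0, 3.0], "POP_B": [4.0, 6.0, 8.0]}

# Hypothetical target sample: projected PCs and raw PGS.
target_pcs, target_pgs = (4.0, 4.5), 8.0

pop = nearest_population(target_pcs, centroids)
z = z_score(target_pgs, reference_pgs[pop])
print(pop, z)
```

The point of the adjustment is that a raw score is only interpretable relative to a comparable distribution, so the reference population is chosen first and the score is expressed relative to it.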

Tip

To enable these optional steps, see How do I normalise calculated scores across different genetic ancestry groups?

See the Current development focus section for information about planned updates.

The workflow relies on open source scientific software. The included software is described in full in Reference: container images.

Quick example#

Install Nextflow
Install Docker or Singularity (minimum v3.8.3) for full reproducibility, or Conda as a fallback
Calculate some polygenic scores using synthetic test data:

$ nextflow run pgscatalog/pgsc_calc -profile test,docker

The workflow should output:

... <configuration messages intentionally not shown> ...
------------------------------------------------------
If you use pgscatalog/pgsc_calc for your analysis please cite:

* The Polygenic Score Catalog
  https://doi.org/10.1038/s41588-021-00783-5

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/pgscatalog/pgsc_calc/blob/master/CITATIONS.md
------------------------------------------------------
executor >  local (7)

[49/d28766] process > PGSC_CALC:PGSCALC:INPUT_CHECK:SAMPLESHEET_JSON (samplesheet.csv)           [100%] 1 of 1 ✔
[c3/a8e0d9] process > PGSC_CALC:PGSCALC:INPUT_CHECK:SCOREFILE_CHECK                              [100%] 1 of 1 ✔
[-        ] process > PGSC_CALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_VCF                               -
[7c/5cca6c] process > PGSC_CALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_BFILE (cineca_synthetic_subset)   [100%] 1 of 1 ✔
[3b/ce0e39] process > PGSC_CALC:PGSCALC:MAKE_COMPATIBLE:MATCH_VARIANTS (cineca_synthetic_subset) [100%] 1 of 1 ✔
[2e/fb3233] process > PGSC_CALC:PGSCALC:APPLY_SCORE:PLINK2_SCORE (cineca_synthetic_subset)       [100%] 1 of 1 ✔
[b5/fc5b1e] process > PGSC_CALC:PGSCALC:APPLY_SCORE:SCORE_REPORT (1)                             [100%] 1 of 1 ✔
[03/009cb6] process > PGSC_CALC:PGSCALC:DUMPSOFTWAREVERSIONS (1)                                 [100%] 1 of 1 ✔
-[pgscatalog/pgsc_calc] Pipeline completed successfully-

Note

The docker profile option can be replaced with singularity or conda, depending on your local environment.

If you want to try the workflow with your own data, have a look at the Getting started section.

Documentation#

Get started: install pgsc_calc and calculate some polygenic scores quickly
How-to guides: step-by-step guides, covering different use cases
Reference guides: technical information about workflow configuration
Explanations: more detailed explanations about PGS calculation and results

Changelog#

The Changelog page describes fixes and enhancements for each version.

Current development focus#

These are some of the features and improvements we’re planning for pgsc_calc:

Further optimisations to the PCA & ancestry similarity analysis steps focused on improving automatic QC
Continued performance improvements for large scoring-file collections and OmicsPred workloads

Credits#

pgscatalog/pgsc_calc is developed as part of the PGS Catalog project, a collaboration between the University of Cambridge’s Department of Public Health and Primary Care (Michael Inouye, Samuel Lambert) and the European Bioinformatics Institute (Helen Parkinson, Laura Harris).

The pipeline seeks to provide a standardised workflow for PGS calculation and ancestry inference, implemented in Nextflow. It is derived from an existing set of tools and scripts developed by the Inouye lab (Rodrigo Canovas, Scott Ritchie, Jingqin Wu) and the PGS Catalog team (Samuel Lambert, Laurent Gil).

The adaptation of the codebase, the Nextflow implementation, and the PGS Catalog features were written by Benjamin Wingfield, Samuel Lambert, and Laurent Gil, with additional input from Aoife McMahon (EBI). Development of new features, testing, and code review are ongoing and involve Inouye lab members (Rodrigo Canovas, Scott Ritchie) and others. A manuscript describing the tool has been published (see Citations), and we welcome ongoing community feedback via our discussion board or issue tracker.

Citations#

If you use pgscatalog/pgsc_calc in your analysis, please cite:

Lambert, Wingfield, et al. (2024) Enhancing the Polygenic Score Catalog with tools for score calculation and ancestry normalization. Nature Genetics. doi:10.1038/s41588-024-01937-x.

In addition, please remember to cite the primary publications for any PGS Catalog scores you use in your analyses, and the underlying data/software tools described in the citations file.

License Information#

This pipeline is distributed under an Apache 2.0 license, but makes use of multiple open-source software tools and datasets (complete list in the citations file) that are distributed under their own licenses. Notably:

Nextflow (Apache 2.0 license) and nf-core (MIT license). See and cite Ewels et al. Nature Biotech (2020) for additional information about the project.
PLINK 1/2 software (GPLv3+)
CINECA synthetic cohort data for test dataset (CC-BY-NC-SA)

We note that it is up to end-users to ensure that their use of the pipeline and test data conforms to the license restrictions.

Funding#

This work has received funding from EMBL-EBI core funds, the Baker Institute, the University of Cambridge, Health Data Research UK (HDRUK), and the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101016775 INTERVENE.
