Getting started#
pgsc_calc requires Nextflow and one of Docker, Singularity, or Anaconda. You will need a POSIX-compatible system, like Linux or macOS, to run pgsc_calc.

Start by installing Nextflow:
$ java -version # Java v8+ required
openjdk 11.0.13 2021-10-19
$ curl -fsSL get.nextflow.io | bash
$ mv nextflow ~/bin/
Next, install Docker, Singularity, or Anaconda.
Finally, check Nextflow is working:
$ nextflow run pgscatalog/pgsc_calc --help
N E X T F L O W ~ version 21.04.0
Launching `pgscatalog/pgsc_calc` [condescending_stone] - revision: cf3e5c886b [master]
...
Then check that Docker, Singularity, or Anaconda is working by running the workflow with bundled test data, replacing <docker/singularity/conda> in the command below with the specific container manager you intend to use:
$ nextflow run pgscatalog/pgsc_calc -profile test,<docker/singularity/conda>
... <configuration messages intentionally not shown> ...
------------------------------------------------------
If you use pgscatalog/pgsc_calc for your analysis please cite:
* The Polygenic Score Catalog
https://doi.org/10.1038/s41588-021-00783-5
* The nf-core framework
https://doi.org/10.1038/s41587-020-0439-x
* Software dependencies
https://github.com/pgscatalog/pgsc_calc/blob/master/CITATIONS.md
------------------------------------------------------
executor > local (7)
[06/6462a0] process > PGSCATALOG_PGSCALC:PGSCALC:INPUT_CHECK:SAMPLESHEET_JSON (samplesheet.csv) [100%] 1 of 1 ✔
[b3/d80f09] process > PGSCATALOG_PGSCALC:PGSCALC:INPUT_CHECK:COMBINE_SCOREFILES (1) [100%] 1 of 1 ✔
[bd/ad4d8c] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_RELABELBIM (cineca_synthetic_subset chromoso... [100%] 1 of 1 ✔
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR -
[- ] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_VCF -
[09/bda9b3] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:MATCH_VARIANTS (cineca_synthetic_subset) [100%] 1 of 1 ✔
[23/2decd9] process > PGSCATALOG_PGSCALC:PGSCALC:APPLY_SCORE:PLINK2_SCORE (cineca_synthetic_subset chromosome 22 eff... [100%] 1 of 1 ✔
[25/6b87fc] process > PGSCATALOG_PGSCALC:PGSCALC:APPLY_SCORE:SCORE_REPORT (1) [100%] 1 of 1 ✔
[6b/52087d] process > PGSCATALOG_PGSCALC:PGSCALC:DUMPSOFTWAREVERSIONS (1) [100%] 1 of 1 ✔
-[pgscatalog/pgsc_calc] Pipeline completed successfully-
Note

Replace <docker/singularity/conda> with what you have installed on your computer (e.g., docker, singularity, or conda). These options are mutually exclusive!
Calculate your first polygenic scores#
If you’ve completed the installation process that means you’ve already calculated some polygenic scores 😍 However, these scores were calculated using synthetic data from a single chromosome. Let’s try calculating scores with your genomic data, which are probably genotypes from real people!
Warning
You might need to prepare input genomic data before calculating polygenic scores; see How do I prepare my input genomes?
1. Samplesheet setup#
First, you need to describe the structure of your genomic data in a standardised way. To do this, set up a spreadsheet that looks like one of the examples below:
| sampleset | vcf_path | bfile_path | pfile_path | chrom |
|---|---|---|---|---|
| cineca_synthetic_subset | | /full/path/to/bfile_prefix | | 22 |
| cineca_synthetic_subset | | /full/path/to/bfile_prefix | | 21 |

| sampleset | vcf_path | bfile_path | pfile_path | chrom |
|---|---|---|---|---|
| cineca_synthetic_subset | | /full/path/to/bfile_prefix | | |

| sampleset | vcf_path | bfile_path | pfile_path | chrom |
|---|---|---|---|---|
| cineca_synthetic_subset_vcf | /full/path/to/vcf.gz | | | 22 |
| cineca_synthetic_subset_vcf | /full/path/to/vcf.gz | | | 21 |
There are five mandatory columns. Columns that specify genomic data paths (vcf_path, bfile_path, and pfile_path) are mutually exclusive:
sampleset: A text string referring to the name of a target dataset of genotyping data containing at least one sample/individual (however cohort datasets will often contain many individuals with combined genotyped/imputed data). Data from a sampleset may be input as a single file, or split across chromosomes into multiple files. Scores generated from files with the same sampleset name are combined in later stages of the analysis.
vcf_path: A text string of a file path pointing to a multi-sample VCF file. File names must be unique. It's best to use full file paths, not relative file paths. By default, hard-called genotypes (the GT field) are imported; if you would like to import dosages (DS) instead, this needs to be specified in the vcf_genotype_field column, see How to set up a samplesheet for additional information.
bfile_path: A text string of a file path pointing to the prefix of a plink binary fileset. For example, if a binary fileset consists of plink.bed, plink.bim, and plink.fam, then the prefix would be "plink". Must be unique. It's best to use full file paths, not relative file paths.
pfile_path: Like bfile_path, but for a PLINK2 format fileset (pgen / psam / pvar).
chrom: An integer (range 1-22) or string (X, Y). If the target genomic data file contains multiple chromosomes, leave empty. Don't use a mix of empty and integer chromosomes in the same sampleset.
Save this spreadsheet in CSV format (e.g., samplesheet.csv).
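For example, the samplesheet for the plink binary fileset split across chromosomes 21 and 22 could be written from the command line like this (the bfile_path values are placeholders you would replace with the paths to your own data):

```shell
# Write a minimal samplesheet for a bfile split across two chromosomes.
# The bfile_path values are placeholders: point them at your own data.
cat > samplesheet.csv << 'EOF'
sampleset,vcf_path,bfile_path,pfile_path,chrom
cineca_synthetic_subset,,/full/path/to/bfile_prefix,,22
cineca_synthetic_subset,,/full/path/to/bfile_prefix,,21
EOF
```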
Note
All samplesets have to be in the same genome build (either GRCh37 or GRCh38), which is specified using the --target_build parameter. All scoring files are downloaded or mapped to match the specified genome build; no liftover/re-mapping of the genotyping data is performed within the pipeline.
2. Select scoring files#
pgsc_calc makes it simple to work with polygenic scores that have been published
in the PGS Catalog. You can specify one or more scores using the --pgs_id
parameter:
--pgs_id PGS001229 # one score
--pgs_id PGS001229,PGS001405 # many scores separated by , (no spaces)
If you would like to use a custom scoring file not published in the PGS Catalog, that’s OK too (see How to use a custom scoring file).
Users are required to specify the genome build that their genotyping calls are in reference to using the --target_build parameter. The --target_build parameter only supports builds GRCh37 (hg19) and GRCh38 (hg38).
--pgs_id PGS001229,PGS001405 --target_build GRCh38
In the example above, both PGS001229 and PGS001405 are reported in genome build GRCh37. When the build of your genomic data differs from the original build of a PGS Catalog score, the pipeline will download a harmonized version (remapped rsIDs and/or lifted positions) of the scoring file in the user-specified build of the genotyping datasets.
Custom scoring files can be lifted between genome builds using the --liftover flag (see How to liftover scoring files to match your input genome build for more information). An example would look like:
--scorefile MyPGSFile.txt --target_build GRCh38
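As a sketch, a custom scoring file is a tab-separated text file with PGS Catalog scoring file column headers; the variant rows below are invented purely for illustration and are not real PGS weights:

```shell
# Create a toy custom scoring file using the PGS Catalog column layout.
# The three variants below are made-up examples, not real score weights.
printf 'chr_name\tchr_position\teffect_allele\tother_allele\teffect_weight\n' > MyPGSFile.txt
printf '22\t17080378\tG\tA\t0.21\n' >> MyPGSFile.txt
printf '22\t17300230\tC\tT\t-0.08\n' >> MyPGSFile.txt
printf '22\t17400450\tT\tC\t0.35\n' >> MyPGSFile.txt
```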
3. Putting it all together#
For this example, we’ll assume that the input genomes are in build GRCh37 and that they match the scoring file genome build.
$ nextflow run pgscatalog/pgsc_calc \
-profile <docker/singularity/conda> \
--input samplesheet.csv --target_build GRCh37 \
--pgs_id PGS001229
Congratulations, you’ve now (hopefully) calculated some scores! 🥳
After the workflow executes successfully, the calculated scores and a summary
report should be available in the results/score/
directory in your current
working directory ($PWD
) by default. If you’re interested in more
information, see pgsc_calc Outputs & results.
If the workflow didn’t execute successfully, have a look at the
Troubleshooting section. Remember to replace <docker/singularity/conda>
with the software you have installed on your computer.
4. Next steps & advanced usage#
The pipeline ships with settings that allow it to be easily run on a personal computer with smaller datasets (e.g., 1000 Genomes, HGDP). The minimum requirements to run on these smaller datasets are:

- Linux: 16GB RAM, 2 CPUs
- macOS: 32GB RAM, 2 CPUs
Warning
If you use macOS, Docker will use at most 50% of your memory by default. This means that if you have a Mac with 16GB RAM, pgsc_calc may run out of RAM (most likely during the variant matching step).
For information on how to run the pipelines on larger datasets/computers/job-schedulers, see How do I run pgsc_calc on larger datasets and more powerful computers?.
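As a rough sketch, default per-process resources can be adjusted with a custom Nextflow configuration file passed to the run via the standard -c option; the values below are illustrative only, and the linked how-to covers pipeline-specific settings:

```shell
# Write a small Nextflow config that raises default process resources.
# The cpus/memory values here are illustrative; tune them to your machine.
cat > resources.config << 'EOF'
process {
    cpus = 4
    memory = 32.GB
}
EOF
# Use it by adding `-c resources.config` to your `nextflow run` command.
```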
If you are using a newer Mac computer with an M-series chip, see How do I run the pipeline on an M1 Mac?.