How to apply multiple scores in parallel#

pgsc_calc makes it simple to scale up polygenic score calculation. If you want to calculate multiple scores for your genetic data it will always be faster to run the pipeline once with the parallel method described in this section. Running the workflow many times, once for each score, will be much slower 🏃‍♂️

1. Samplesheet setup#

Set up a samplesheet as described in: How to set up a samplesheet.

2. Multiple PGS Catalog scores#

As described in How to use scoring files in the PGS Catalog, use a comma separated list to select multiple accessions, traits, or publications. The pipeline will automatically query the PGS Catalog API, download unique scoring files in the correct genome build, and use your target genomes to calculate scores. For example:

--pgs_id PGS001229,PGS000802

Note

If you’d like to calculate hundreds of PGS Catalog scoring files simultaneously, it’s easiest to store parameters in a text file instead of setting --pgs_id in a terminal. See How to set parameters in a file.

3. Multiple custom scorefiles#

You might have set up multiple custom scorefiles (see How to use a custom scoring file). The --scorefile parameter supports multiple scorefile paths by using star characters (*). If your custom scorefiles are in the directory my_custom_scores/:

--scorefile "my_custom_scores/*.txt"

Assuming your scorefiles all have a .txt extension. This will match all files ending with .txt, so be careful not to include other text files that may match the pattern.

Two stars (**) will match across multiple directories. This can be useful if your scoring files have a structure like:

$ tree diabetes/
diabetes/
├── type1
│   └── type1.txt
└── type2
    └── type2.txt

2 directories, 2 files

In this case, using two stars with the scorefile parameter can be helpful:

--scorefile "diabetes/**.txt"

Note

  • Custom scorefiles must have unique filenames

  • The basename of each file (e.g. type1.txt -> type1) is used to label the score in the workflow output

  • Quotes around stars ("*.txt") are important for matching to work as expected

Setting multiple scores in one custom scoring file#

The examples above assume each scoring file contains a single score. A single custom scoring file can contain multiple scores by using a different scoring file template. The final column effect_weight can be repeated if every column has a suffix:

Scorefile with multiple effect weights#

chr_name

effect_weight_type1

effect_weight_type2

22

0.01045457

0.02000000

The columns chr_position, effect_allele, and other_allele are left out (…) in the example table to save space, but are mandatory (see 2. Scorefile setup). Multiple score columns must follow the pattern effect_weight_suffix, where suffix is a label for each score. Suffixes must be unique.

Setting effect types for variants is not supported with this format (see How to set effect type of variants in a scoring file). An example template is available here.

4. Calculate!#

  • If you’re using multiple scores from the PGS Catalog:

$ nextflow run pgscatalog/pgscalc \
    -profile <docker/singularity/conda> \
    --input samplesheet.csv \
    --pgs_id PGS001229,PGS001405
  • Or you might be using multiple scoring files in the same directory:

$ nextflow run pgscatalog/pgscalc \
    -profile <docker/singularity/conda> \
    --input samplesheet.csv \
    --scorefile "my_custom_scores/*.txt"

Congratulations, you’ve now calculated multiple scores in parallel! 🥳

Note

You can set both --pgs_id and --scorefile parameters to combine scores in the PGS Catalog with your own custom scores

After the workflow executes successfully, the calculated scores and a summary report should be available in the results/make/ directory by default. If you’re interested in more information, see pgsc_calc Outputs & report.

If the workflow didn’t execute successfully, have a look at the Troubleshooting section.