How do I run `pgsc_calc` on larger datasets and more powerful computers?

If you want to calculate many polygenic scores for a very large dataset (e.g. UK Biobank), you will likely need to adjust the pipeline settings. You might have access to a powerful workstation, a university cluster, or cloud compute resources. This section shows how to set up pgsc_calc to submit work to these types of systems by creating and editing Nextflow .config files.

Warning

--max_cpus and --max_memory don’t increase the amount of resources for each process. These parameters cap the maximum amount of resources a process can use. You need to edit configuration files to increase process resources, as described below.
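To make the capping behaviour concrete, here is a small shell sketch of the arithmetic involved. It is an illustration only, not code from the pipeline: the helper name `effective_cpus` is hypothetical, and it simply takes the smaller of what a process requests and what the cap allows.

```shell
#!/usr/bin/env bash
# Illustration only: a cap limits a request, it never raises it.
# effective_cpus is a hypothetical helper, not part of pgsc_calc or Nextflow.
effective_cpus() {
    local requested=$1 cap=$2
    if [ "$requested" -lt "$cap" ]; then
        echo "$requested"
    else
        echo "$cap"
    fi
}

# A process that requests 2 CPUs still gets 2, even with a cap of 16.
effective_cpus 2 16
# A process that requests 32 CPUs is limited to the cap of 16.
effective_cpus 32 16
```

In other words, raising the cap alone leaves a small request unchanged; the request itself has to be raised in a configuration file.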

Configuring `pgsc_calc` to use more resources locally

If you have a powerful computer available locally, you can configure the amount of resources that each job in the workflow uses by saving settings like these in a configuration file:

process {
    executor = 'local'

    withLabel:process_low {
        cpus   = 2
        memory = 8.GB
        time   = 1.h
    }
    withLabel:process_medium {
        cpus   = 8
        memory = 64.GB
        time   = 4.h
    }
    withName: PLINK2_SCORE {
        maxForks = 4
    }
}

You should change cpus, memory, and time to match the amount of resources you have available. The values provided are a sensible starting point for very large datasets. Assuming the configuration file you set up is saved as my_custom.config in your current working directory, you’re ready to run pgsc_calc:

$ nextflow run pgscatalog/pgsc_calc \
    -profile <docker/singularity/conda> \
    --input samplesheet.csv \
    --pgs_id PGS001229 \
    -c my_custom.config
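Before settling on the cpus and memory values in my_custom.config, it can help to check what the machine actually offers. The following is a minimal sketch for Linux hosts, assuming GNU coreutils (`nproc`) and a readable `/proc/meminfo`; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch for Linux hosts: report CPU count and total memory so the
# cpus and memory values in my_custom.config can be chosen to fit.

# Number of processing units available to this shell (GNU coreutils).
available_cpus() {
    nproc
}

# Total physical memory in whole GB, read from /proc/meminfo (Linux only).
# MemTotal is reported in kB, so divide twice by 1024.
total_memory_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

echo "CPUs available: $(available_cpus)"
echo "Total memory:   $(total_memory_gb) GB"
```

Leave some headroom for the operating system, and remember that several tasks can run at once: with maxForks = 4, up to four PLINK2_SCORE tasks may run in parallel, each with its own resource request.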

High performance computing cluster

If you have access to an HPC cluster, you’ll need to configure your cluster’s unique parameters to set correct queues, user accounts, and resource limits.

Note

Your institution may already have a Nextflow profile with existing cluster settings that you can adapt instead of setting up a custom config with -c.

Warning

You’ll probably want to use -profile singularity on an HPC. The pipeline requires Singularity v3.7 or later.
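One way to check the version requirement before submitting any work is sketched below. It assumes GNU `sort` (for the `-V` option) and that `singularity --version` prints the version number as its last word; `version_ge` is a hypothetical helper. If your site provides Apptainer instead, its version numbers follow a different scheme and this check will not apply.

```shell
#!/usr/bin/env bash
# Sketch: check that the installed Singularity meets the v3.7 minimum.
# version_ge is a hypothetical helper; it relies on GNU sort's -V option.

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v singularity >/dev/null 2>&1; then
    # Assumes the version is the last word of the output,
    # e.g. "singularity version 3.7.0"; adjust if yours differs.
    installed=$(singularity --version | awk '{ print $NF }')
    if version_ge "$installed" "3.7"; then
        echo "Singularity $installed is new enough"
    else
        echo "Singularity $installed is older than 3.7"
    fi
else
    echo "singularity not found on PATH (try loading its module first)"
fi
```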

Here’s an example configuration running about 100 scores in parallel on UK Biobank with a SLURM cluster:

process {
  errorStrategy = 'retry'
  maxRetries = 3
  maxErrors = '-1'
  executor = 'slurm'

  withName: 'SAMPLESHEET_JSON' {
    cpus = 1
    memory = { 1.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'DOWNLOAD_SCOREFILES' {
    cpus = 1
    memory = { 1.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'COMBINE_SCOREFILES' {
    cpus = 1
    memory = { 16.GB * task.attempt }
    time = { 2.hour * task.attempt }
  }

  withName: 'PLINK2_MAKEBED' {
    cpus = 1
    memory = { 8.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'RELABEL_IDS' {
    cpus = 1
    memory = { 16.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'PLINK2_ORIENT' {
    cpus = 1
    memory = { 8.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'DUMPSOFTWAREVERSIONS' {
    cpus = 1
    memory = { 1.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'ANCESTRY_ANALYSIS' {
    cpus = 1
    memory = { 8.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'SCORE_REPORT' {
    cpus = 1
    memory = { 8.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'EXTRACT_DATABASE' {
    cpus = 1
    memory = { 8.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'PLINK2_RELABELPVAR' {
    cpus = 1
    memory = { 16.GB * task.attempt }
    time = { 2.hour * task.attempt }
  }

  withName: 'INTERSECT_VARIANTS' {
    cpus = 1
    memory = { 8.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'INTERSECT_THINNED' {
    cpus = 1
    memory = { 8.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'MATCH_VARIANTS' {
    cpus = 2
    memory = { 32.GB * task.attempt }
    time = { 6.hour * task.attempt }
  }

  withName: 'FILTER_VARIANTS' {
    cpus = 1
    memory = { 16.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'MATCH_COMBINE' {
    cpus = 2
    memory = { 64.GB * task.attempt }
    time = { 6.hour * task.attempt }
  }

  withName: 'FRAPOSA_PCA' {
    cpus = 2
    memory = { 8.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'PLINK2_SCORE' {
    cpus = 2
    memory = { 8.GB * task.attempt }
    time = { 16.hour * task.attempt }
  }

  withName: 'FRAPOSA_PROJECT' {
    cpus = 1
    memory = { 8.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }

  withName: 'SCORE_AGGREGATE' {
    cpus = 1
    memory = { 16.GB * task.attempt }
    time = { 1.hour * task.attempt }
  }
}

Note

You’ll want to adjust memory usage depending on the complexity of your input scoring files. Allocating more CPUs probably won’t make the workflow complete faster.
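The `task.attempt` multipliers in the configuration mean that each retry asks for more than the previous attempt. The sketch below works through that arithmetic for MATCH_COMBINE (64 GB base), assuming that maxRetries = 3 allows one initial attempt plus three retries; `scaled_request` is a hypothetical helper used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: resources requested on each attempt for a setting such as
#   memory = { 64.GB * task.attempt }
# Assumes maxRetries = 3 allows one initial attempt plus three retries.

# Print the request for a given base amount and attempt number.
scaled_request() {
    local base=$1 attempt=$2
    echo $(( base * attempt ))
}

max_retries=3
for attempt in $(seq 1 $(( max_retries + 1 ))); do
    echo "attempt $attempt: $(scaled_request 64 "$attempt") GB"
done
```

Check that your cluster's limits allow the largest value a process could reach, since a request bigger than any node can provide is unlikely to be scheduled.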

Assuming the configuration file you set up is saved as my_custom.config in your current working directory, you’re ready to run pgsc_calc. Instead of running nextflow directly in the shell, save a bash script (run_pgsc_calc.sh):

#!/bin/bash
#SBATCH -J ukbiobank_pgs
#SBATCH -c 1
#SBATCH -t 24:00:00
#SBATCH --mem=2G

export NXF_ANSI_LOG=false
export NXF_OPTS="-Xms500M -Xmx2G"

module load nextflow-21.10.6-gcc-9.3.0-tkuemwd
module load singularity-3.7.0-gcc-9.3.0-dp5ffrp

nextflow run pgscatalog/pgsc_calc \
    -profile singularity \
    --input samplesheet.csv \
    --pgs_id PGS001229 \
    -c my_custom.config

Note

The names of the Nextflow and Singularity modules will be different in your local environment.
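Because module names vary between sites (`module avail` lists the ones on your system), a quick check that both tools are on your PATH after loading them can save a failed submission. This is a minimal sketch; `require_cmd` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: confirm the tools the batch script relies on are on PATH
# after the module load lines have run.

# Report whether a command is available; returns non-zero if it is missing.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (check the module name for your site)" >&2
        return 1
    fi
}

# Check both tools, continuing past a missing one so both are reported.
for tool in nextflow singularity; do
    require_cmd "$tool" || true
done
```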

Warning

Make sure to copy input data to fast storage, and run the pipeline on the same fast storage area. You might include these steps in your bash script. Ask your sysadmin for help if you’re not sure what this means.
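The staging step might look something like the sketch below. The locations are assumptions for illustration: `FAST_DIR` stands in for whatever fast storage your cluster provides, `INPUT_DIR` for wherever your input data currently lives, and `stage_inputs` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: copy inputs to fast storage and work from there.
# FAST_DIR is a hypothetical location; ask your sysadmin for the real path.

# Copy everything in the source directory into the destination,
# creating the destination if needed.
stage_inputs() {
    local src=$1 dest=$2
    mkdir -p "$dest" && cp -R "$src"/. "$dest"/
}

FAST_DIR="${FAST_DIR:-${TMPDIR:-/tmp}/pgsc_calc_work}"
INPUT_DIR="${INPUT_DIR:-$PWD/input_data}"   # hypothetical input directory

if [ -d "$INPUT_DIR" ]; then
    stage_inputs "$INPUT_DIR" "$FAST_DIR" && cd "$FAST_DIR" && echo "working in $FAST_DIR"
else
    echo "no input directory at $INPUT_DIR, nothing staged"
fi
```

If you stage data this way, the paths in samplesheet.csv need to point at the staged copies.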

Then submit the script to the scheduler:

$ sbatch run_pgsc_calc.sh

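This submits a Nextflow driver job, which in turn submits additional jobs for each process in the workflow. The Nextflow driver can need up to 4GB of RAM and 2 CPUs (see a guide for HPC users here), which is more than the example script requests with -c 1 and --mem=2G, so raise those values if the driver job runs out of resources.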

Cloud deployments

We’ve deployed the calculator to Google Cloud Batch, but some special configuration is required.
