How do I run pgsc_calc on larger datasets and more powerful computers?
If you want to calculate many polygenic scores for a very large dataset (e.g. UK
Biobank), you will likely need to adjust the pipeline settings. You might have
access to a powerful workstation, a University cluster, or some cloud compute
resources. This section will show how to set up pgsc_calc to submit work to
these types of systems by creating and editing nextflow .config files.
Warning
--max_cpus and --max_memory don’t increase the amount of
resources for each process. These parameters cap the maximum
amount of resources a process can use. You need to edit
configuration files to increase process resources, as described
below.
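To make the difference between a cap and a request concrete, here is a toy shell illustration. It is not pgsc_calc's own code, and effective_resource is a hypothetical helper written only for this example: it prints the smaller of the amount a process is configured to request and the cap.

```shell
# Toy illustration of how a cap behaves; this is not pgsc_calc's own code.
# effective_resource REQUESTED CAP prints the smaller of the two values.
effective_resource() {
    requested="$1"
    cap="$2"
    if [ "$requested" -lt "$cap" ]; then
        echo "$requested"
    else
        echo "$cap"
    fi
}

# A process configured for 2 CPUs still gets 2 CPUs when the cap is raised to 16
effective_resource 2 16
# A process configured for 8 CPUs is limited to 4 CPUs when the cap is 4
effective_resource 8 4
```

Raising the cap only matters once the process itself is configured to request more, which is what the configuration files are for.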
Configuring pgsc_calc to use more resources locally
If you have a powerful computer available locally, you can configure the amount of resources that each job in the workflow uses.
process {
    executor = 'local'
    withLabel:process_low {
        cpus = 2
        memory = 8.GB
        time = 1.h
    }
    withLabel:process_medium {
        cpus = 8
        memory = 64.GB
        time = 4.h
    }
    withName: PLINK2_SCORE {
        maxForks = 4
    }
}
You should change cpus, memory, and time to match the amount of
resources you have available. The values provided are a sensible starting point
for very large datasets. Assuming the configuration file you set up is saved as
my_custom.config in your current working directory, you’re ready to run
pgsc_calc:
$ nextflow run pgscatalog/pgsc_calc \
    -profile <docker/singularity/conda> \
    --input samplesheet.csv \
    --pgs_id PGS001229 \
    -c my_custom.config
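Before choosing cpus and memory values for the local configuration, it can help to check what the machine actually has. The following is a minimal sketch that assumes a Linux workstation (nproc and /proc/meminfo are not available on every operating system):

```shell
# Assumes Linux: nproc (GNU coreutils) and /proc/meminfo are Linux-specific.
# Falls back to getconf if nproc is not installed.
cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# MemTotal is reported in kB; convert to whole GB (rounded down)
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_gb=$(( mem_kb / 1024 / 1024 ))

echo "CPUs available: ${cpus}"
echo "Total memory:   ${mem_gb} GB"
```

It is usually worth leaving some headroom below these totals for the operating system and the nextflow driver itself.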
High performance computing cluster
If you have access to an HPC cluster, you’ll need to configure your cluster’s unique parameters to set the correct queues, user accounts, and resource limits.
Note
Your institution may already have a nextflow profile with existing
cluster settings that can be adapted instead of setting up a custom
config using -c.
Warning
You’ll probably want to use -profile singularity on an HPC. The
pipeline requires Singularity v3.7 or later.
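One way to check the installed version against this minimum is sketched below. version_at_least is a hypothetical helper written for this example, and the sketch assumes that GNU sort (for the -V option) is available and that singularity --version prints a dotted version number, such as "singularity version 3.7.0". Apptainer uses a different version numbering scheme, so this comparison would not be meaningful there.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort): the required version must sort first.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required="3.7"

if command -v singularity >/dev/null 2>&1; then
    # Assumption: the first dotted number in the output is the version
    installed=$(singularity --version | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)
    if version_at_least "$installed" "$required"; then
        echo "Singularity ${installed} meets the minimum (${required})"
    else
        echo "Singularity ${installed} is older than ${required}"
    fi
else
    echo "singularity not found on PATH (you may need to load a module first)"
fi
```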
Here’s an example configuration running about 100 scores in parallel on UK Biobank with a SLURM cluster:
process {
    errorStrategy = 'retry'
    maxRetries = 3
    maxErrors = '-1'
    executor = 'slurm'
    withName: 'SAMPLESHEET_JSON' {
        cpus = 1
        memory = { 1.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'DOWNLOAD_SCOREFILES' {
        cpus = 1
        memory = { 1.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'COMBINE_SCOREFILES' {
        cpus = 1
        memory = { 16.GB * task.attempt }
        time = { 2.hour * task.attempt }
    }
    withName: 'PLINK2_MAKEBED' {
        cpus = 1
        memory = { 8.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'RELABEL_IDS' {
        cpus = 1
        memory = { 16.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'PLINK2_ORIENT' {
        cpus = 1
        memory = { 8.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'DUMPSOFTWAREVERSIONS' {
        cpus = 1
        memory = { 1.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'ANCESTRY_ANALYSIS' {
        cpus = 1
        memory = { 8.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'SCORE_REPORT' {
        cpus = 1
        memory = { 8.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'EXTRACT_DATABASE' {
        cpus = 1
        memory = { 8.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'PLINK2_RELABELPVAR' {
        cpus = 1
        memory = { 16.GB * task.attempt }
        time = { 2.hour * task.attempt }
    }
    withName: 'INTERSECT_VARIANTS' {
        cpus = 1
        memory = { 8.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'INTERSECT_THINNED' {
        cpus = 1
        memory = { 8.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'MATCH_VARIANTS' {
        cpus = 2
        memory = { 32.GB * task.attempt }
        time = { 6.hour * task.attempt }
    }
    withName: 'FILTER_VARIANTS' {
        cpus = 1
        memory = { 16.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'MATCH_COMBINE' {
        cpus = 2
        memory = { 64.GB * task.attempt }
        time = { 6.hour * task.attempt }
    }
    withName: 'FRAPOSA_PCA' {
        cpus = 2
        memory = { 8.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'PLINK2_SCORE' {
        cpus = 2
        memory = { 8.GB * task.attempt }
        time = { 16.hour * task.attempt }
    }
    withName: 'FRAPOSA_PROJECT' {
        cpus = 1
        memory = { 8.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
    withName: 'SCORE_AGGREGATE' {
        cpus = 1
        memory = { 16.GB * task.attempt }
        time = { 1.hour * task.attempt }
    }
}
Note
You’ll want to adjust memory usage depending on the complexity of your input scoring files. Allocating more CPUs probably won’t make the workflow complete faster.
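The example configuration scales each request by task.attempt, which Nextflow sets to 1 on the first run of a task and increases by one on each retry. With maxRetries = 3, a task can therefore run up to four times, asking for more memory and time on each attempt. The arithmetic can be sketched in shell (memory_for_attempt is a hypothetical helper for this illustration only):

```shell
# Memory requested on a given attempt when the config says
# memory = { BASE.GB * task.attempt }
memory_for_attempt() {
    base_gb="$1"
    attempt="$2"
    echo $(( base_gb * attempt ))
}

base_gb=8        # e.g. memory = { 8.GB * task.attempt } for PLINK2_SCORE
max_retries=3    # maxRetries = 3 in the example configuration

# First attempt plus up to three retries
attempt=1
while [ "$attempt" -le $(( max_retries + 1 )) ]; do
    echo "attempt ${attempt}: $(memory_for_attempt "$base_gb" "$attempt") GB"
    attempt=$(( attempt + 1 ))
done
```

When choosing base values, check that the largest request (four times the base here) still fits within the limits of your cluster's queues.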
Assuming the configuration file you set up is saved as
my_custom.config in your current working directory, you’re ready
to run pgsc_calc. Instead of running nextflow directly in the shell,
save the commands in a bash script (run_pgsc_calc.sh):
#!/bin/bash
#SBATCH -J ukbiobank_pgs
#SBATCH -c 1
#SBATCH -t 24:00:00
#SBATCH --mem=2G
# Plain-text logging is easier to read in a SLURM log file
export NXF_ANSI_LOG=false
# Limit the JVM heap used by the nextflow driver
export NXF_OPTS="-Xms500M -Xmx2G"
module load nextflow-21.10.6-gcc-9.3.0-tkuemwd
module load singularity-3.7.0-gcc-9.3.0-dp5ffrp
nextflow run pgscatalog/pgsc_calc \
    -profile singularity \
    --input samplesheet.csv \
    --pgs_id PGS001229 \
    -c my_custom.config
Note
The names of the nextflow and singularity modules will be different in your local environment.
Warning
Make sure to copy input data to fast storage, and run the pipeline on the same fast storage area. You might include these steps in your bash script. Ask your sysadmin for help if you’re not sure what this means.
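The staging step mentioned in this warning could look something like the sketch below. The directory names (SOURCE_DIR, FAST_DIR) and the stage_inputs helper are hypothetical placeholders, since the location of fast storage differs between clusters. If your samplesheet refers to genotype files by path, those paths would also need to point at the staged copies.

```shell
# Hypothetical locations: replace with the paths used on your cluster
SOURCE_DIR="${SOURCE_DIR:-$HOME/pgs_inputs}"
FAST_DIR="${FAST_DIR:-${TMPDIR:-/tmp}/pgsc_calc_run}"

# Copy everything in the source directory to the fast storage area
stage_inputs() {
    src="$1"
    dest="$2"
    mkdir -p "$dest"
    cp -R "$src"/. "$dest"/
}

if [ -d "$SOURCE_DIR" ]; then
    stage_inputs "$SOURCE_DIR" "$FAST_DIR"
    # Run the pipeline from the fast storage area
    cd "$FAST_DIR" || exit 1
    echo "Inputs staged in ${FAST_DIR}"
else
    echo "Source directory ${SOURCE_DIR} not found; nothing staged"
fi
```

In a batch script, lines like these would go before the nextflow run command.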
$ sbatch run_pgsc_calc.sh
This will submit a nextflow driver job, which in turn submits a job for each process in the workflow. The nextflow driver can need up to 4GB of RAM and 2 CPUs (see a guide for HPC users here), so consider raising the -c and --mem values in run_pgsc_calc.sh if the driver job runs out of resources.
Cloud deployments
We’ve deployed the calculator to Google Cloud Batch, but some special configuration is required.