How do I speed up pgsc_calc computation times and avoid re-running code?

If you intend to run pgsc_calc multiple times on the same target samples (e.g. on different sets of PGS, or with different variant matching flags), it is worth caching the results of the invariant steps of the pipeline:

  • Genotype harmonization (variant relabeling steps)

  • Steps of --run_ancestry that match variants between the target and reference panel and generate PCA loadings that can be used to adjust the PGS for ancestry.

To do this, specify a directory that can store this information across runs by passing the --genotypes_cache flag to the nextflow command (see also Reference: Workflow parameters). Future runs of the pipeline that use the same cache directory should then skip these steps and run only the steps needed to calculate new PGS. This differs slightly from the -resume option in nextflow, which mainly checks the work directory and is more often used to restart the pipeline after a specific step has failed (e.g. for exceeding memory limits).
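As a rough illustration, the sketch below shows two runs on the same target samples that share one cache directory. Only --genotypes_cache comes from the text above. The samplesheet name, the two PGS IDs, the docker profile, the GRCh38 build and the cache path are placeholder assumptions, and the run_pgsc helper is hypothetical, so adapt them to your own setup. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: samplesheet.csv, the PGS IDs, the docker profile, the
# GRCh38 build and the cache path are placeholder assumptions.
set -eu

# One cache directory for this sampleset, reused by every run on it.
CACHE_DIR="${CACHE_DIR:-$PWD/genotypes_cache_cohortA}"

# Hypothetical helper: build the same command for a given PGS ID.
# Prints the command when DRY_RUN=1 (default), runs it otherwise.
run_pgsc() {
  set -- nextflow run pgscatalog/pgsc_calc \
    -profile docker \
    --input samplesheet.csv \
    --target_build GRCh38 \
    --pgs_id "$1" \
    --genotypes_cache "$CACHE_DIR"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run_pgsc PGS001229   # first run: expected to populate the cache
run_pgsc PGS000018   # later run: should reuse the cached steps
```

The only thing that changes between the two calls is the score being calculated, while the target samples and the cache directory stay the same. The -resume option can still be added to either command when restarting a run that failed partway through.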

Warning

Always use a separate cache directory for each sampleset, as identical names may clash across runs.