pgsc_calc Outputs & results
The pipeline outputs are written to a results directory (--outdir, default is ./results/) that contains three subdirectories:
- score/: calculated PGS with summary report
- match/: scoring files and variant match metadata
- pipeline_info/: nextflow pipeline execution (memory, runtime, etc.)
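A quick way to confirm that a run produced the layout described above is to check for the three subdirectories. The following is a minimal sketch; the helper name `missing_subdirs` is hypothetical and not part of the pipeline.

```python
from pathlib import Path

# Subdirectory names taken from the list above.
EXPECTED_SUBDIRS = ("score", "match", "pipeline_info")


def missing_subdirs(outdir="results"):
    """Return the expected subdirectories that are absent from outdir."""
    root = Path(outdir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

An empty list from `missing_subdirs("results")` means all three directories are present.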
score/
Calculated scores are stored in a gzipped, space-delimited text file called aggregated_scores.txt.gz. Each row represents an individual, and there should be at least three columns with the following headers:
- dataset: the name of the input sampleset
- IID: the identifier of each sample within the dataset
- [PGS NAME]_SUM: the weighted sum of effect_allele dosages multiplied by their effect_weight for each matched variant in the scoring file. The column name depends on the scores you have chosen to use (e.g. PGS000001_SUM).
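The weighted sum reported in the [PGS NAME]_SUM column can be illustrated with a small sketch. It is not the pipeline's implementation (the pipeline calculates scores with PLINK2); the function name, variant IDs, and weights below are made up for illustration, and variants absent from `dosages` are simply treated as unmatched and skipped.

```python
def weighted_sum(dosages, weights):
    """Sum effect_allele dosage * effect_weight over matched variants.

    dosages: variant ID -> effect allele dosage (0, 1, or 2) for one sample
    weights: variant ID -> effect_weight from the scoring file
    """
    total = 0.0
    for variant_id, weight in weights.items():
        if variant_id not in dosages:
            continue  # assumption: unmatched variants do not contribute
        total += dosages[variant_id] * weight
    return total


# Hypothetical scoring file weights and one sample's dosages.
weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 1.0}
dosages = {"rs1": 2, "rs2": 1, "rs3": 0}
print(weighted_sum(dosages, weights))  # 0.75
```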
At least one score must be present in this file (the third column). Extra columns might be
present if you calculated more than one score, or if you calculated the PGS on a dataset with a
small sample size (n < 50). In that case, a column named [PGS NAME]_AVG is added that
normalizes the PGS using the number of non-missing genotypes, to avoid using allele frequency data
from the target sample.
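Because the file is gzipped and space-delimited, it can be read with the Python standard library alone. The sketch below assumes only the layout described above (a header row, then one row per individual); the helper names `read_scores` and `score_columns` are hypothetical.

```python
import csv
import gzip


def read_scores(path):
    """Read a gzipped, space-delimited score table into a list of dict rows."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter=" ")
        return list(reader)


def score_columns(rows):
    """Return the names of the _SUM and _AVG score columns found in the rows."""
    header = list(rows[0].keys()) if rows else []
    sums = [name for name in header if name.endswith("_SUM")]
    avgs = [name for name in header if name.endswith("_AVG")]
    return sums, avgs
```

For example, `rows = read_scores("results/score/aggregated_scores.txt.gz")` followed by `score_columns(rows)` lists which scores were calculated and whether any _AVG columns were added. Values are read as strings, so convert them with `float()` before doing arithmetic.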
Report
A summary report is also available (report.html). The report should open in
a web browser and contains useful information about the PGS that were applied,
how well the variants match with the genotyping data, and some simple graphs
displaying the distribution of scores in your dataset(s) as a density plot.
The first section of the report reproduces the nextflow command and the metadata (imported from the PGS Catalog for each PGS ID) describing the scoring files that were applied to your sampleset(s):

Within the scoring file metadata section are two tables describing how well the variants within
each scoring file match with target sampleset(s). The first table provides a summary of the
number and percentage of variants within each score that have been matched, and whether that
score passed the --min_overlap threshold (Passed Matching column) for calculation. The second
table provides a more detailed summary of variant matches broken down by types of variants (strand ambiguous,
multiallelic, duplicates) for the matched, excluded, and unmatched variants (see match/ section for details):
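The first table's logic can be sketched as a proportion compared against a threshold. This is an illustration only: the function name is hypothetical, the threshold value is supplied by the caller, and whether the pipeline uses an inclusive comparison (`>=`) is an assumption here.

```python
def match_summary(n_matched, n_total, min_overlap):
    """Summarise variant matching for one scoring file.

    n_matched:   number of scoring file variants matched in the target data
    n_total:     total number of variants in the scoring file
    min_overlap: minimum proportion of matched variants (0 to 1)
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    rate = n_matched / n_total
    return {
        "percent_matched": 100.0 * rate,
        # Assumption: a score exactly at the threshold counts as passing.
        "passed_matching": rate >= min_overlap,
    }
```

With an illustrative threshold of 0.75, `match_summary(80, 100, 0.75)` reports a pass and `match_summary(50, 100, 0.75)` reports a fail.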

The final section shows an example of the results table, with the sample identifiers and calculated PGS, in the Score extract panel. The Density plot panel displays the PGS distribution for a set of example scores (up to 6), which can be helpful for comparing the distributions across multiple datasets:

match/
This directory contains information about the matching of scoring file variants to the genotyping data (samplesets). First, a summary file (also displayed in the report) details whether each scoring file passes the minimum variant matching threshold, and the types of variants that were included in the score:
Report Field | | Description |
---|---|---|
Sampleset | | Name of the sampleset/genotyping data. |
Scoring file | | Name of the scoring file. |
Passed matching | | True/False flag to indicate whether the scoring file passes the minimum variant matching threshold. |
Match type | | Indicates whether the variant is matched (included in the final scoring file), excluded (matched but removed based on variant filters), or unmatched. |
Ambiguous | | True/False flag indicating whether the matched variant is strand-ambiguous (e.g. A/T and C/G variants). |
Multiallelic | | True/False flag indicating whether the matched variant is multi-allelic (multiple ALT alleles). |
Multiple potential matches | | True/False flag indicating whether a single scoring file variant has multiple potential matches to the target genome. This usually occurs when the variant has no other_allele, and with variants that have different REF alleles. |
Duplicated matched variants | | True/False flag indicating whether multiple scoring file variants match a single target ID. This usually occurs when scoring files have been lifted across builds and two variants now point to the same position (e.g. rsID mergers). |
n | | Number of variants with this combination of metadata (grouped by: ...). |
% | | Percent of the scoring file’s variants that have the combination of metadata in count. |
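The Ambiguous flag refers to allele pairs that look identical on both DNA strands: an A/T variant read on the reverse strand is T/A, so the strand cannot be inferred from the alleles alone. A minimal sketch of that check follows; the function name is hypothetical and this is not the pipeline's matching code.

```python
# Watson-Crick complements for single-nucleotide alleles.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(ref, alt):
    """Return True when the two alleles are complements of each other (A/T, C/G)."""
    ref, alt = ref.upper(), alt.upper()
    # Multi-base alleles (indels) are not in COMPLEMENT and return False here.
    return COMPLEMENT.get(ref) == alt
```

For example, `is_strand_ambiguous("A", "T")` and `is_strand_ambiguous("G", "C")` are true, while `is_strand_ambiguous("A", "G")` is false.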
The log file is a CSV that contains all possible matches for each variant in the combined input scoring files. This information is useful to debug a score that is causing problems. Columns contain information about how each variant was matched against the target genomes:
 | Description |
---|---|
 | Line number of the variant with reference to the original scoring file (accession). |
 | Name of the scoring file. |
 | Chromosome name/number associated with the variant. |
 | Chromosomal position associated with the variant. |
 | The allele whose dosage is counted (e.g. {0, 1, 2}) and multiplied by the variant’s weight (effect_weight) when calculating the score. The effect allele is also known as the ‘risk allele’. |
 | The other non-effect allele(s) at the locus. |
 | Value of the effect that is multiplied by the dosage of the effect allele (effect_allele) when calculating the score. Additional information on how the effect_weight was derived is in the weight_type field of the header, and score development method in the metadata downloads. |
 | Whether the dosage is calculated as additive ({0, 1, 2}), dominant ({0, 1}) or recessive ({0, 1}). |
 | Identifier of the matched variant. |
 | Matched variant: reference allele. |
 | Matched variant: alternative allele. |
 | Which of the REF/ALT alleles is the effect_allele in the target dataset. |
 | Record of how the scoring file variant |
 | True/False flag indicating whether the matched variant is multi-allelic (multiple ALT alleles). |
 | True/False flag indicating whether the matched variant is strand-ambiguous (e.g. A/T and C/G variants). |
 | True/False flag indicating whether a single scoring file variant has multiple potential matches to the target genome. This usually occurs when the variant has no other_allele, and with variants that have different REF alleles. |
 | True/False flag indicating whether multiple scoring file variants match a single target ID. |
 | Indicates whether the variant is matched (included in the final scoring file), excluded (matched but removed based on variant filters), not_best (a different match candidate was selected for this scoring file variant), or unmatched. |
 | Name of the sampleset/genotyping data. |
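The additive, dominant, and recessive dosage codings mentioned in the table can be sketched as follows. This uses the conventional genetic definitions (dominant: at least one copy of the effect allele; recessive: two copies) and is an illustration only; the pipeline delegates the calculation to PLINK2, and the function name is hypothetical.

```python
def model_dosage(effect_allele_count, effect_type="additive"):
    """Convert an effect allele count (0, 1, or 2) to the dosage used for scoring."""
    if effect_allele_count not in (0, 1, 2):
        raise ValueError("effect_allele_count must be 0, 1, or 2")
    if effect_type == "additive":
        return effect_allele_count  # {0, 1, 2}
    if effect_type == "dominant":
        return 1 if effect_allele_count >= 1 else 0  # {0, 1}
    if effect_type == "recessive":
        return 1 if effect_allele_count == 2 else 0  # {0, 1}
    raise ValueError(f"unknown effect type: {effect_type}")
```

A heterozygous sample (one copy of the effect allele) therefore contributes a dosage of 1 under the additive and dominant codings and 0 under the recessive coding.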
Processed scoring files are also present in this directory. Briefly, variants in
the scoring files are matched against the target genomes. Common variants across
different scores are combined (left joined, so each score is an additional
column). The combined scores are then partially split to overcome PLINK2
technical limitations (e.g. calculating different effect types such as dominant
/ recessive). Once scores are calculated from these partially split scoring
files, scores are aggregated to produce the final results in score/.
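The combining step can be pictured as a left join from the matched variants onto each score, producing one weight column per score. The sketch below is a simplified illustration under stated assumptions: the function name, variant IDs, and score names are hypothetical, and filling a weight of 0.0 for variants that a score does not include is an assumption rather than documented behaviour.

```python
def combine_scores(matched_variants, scores):
    """Left join matched variants onto each score's weights.

    matched_variants: iterable of variant IDs found in the target genomes
    scores:           score name -> {variant ID -> effect_weight}
    Returns variant ID -> {score name -> weight}, one column per score.
    """
    names = sorted(scores)
    return {
        variant_id: {name: scores[name].get(variant_id, 0.0) for name in names}
        for variant_id in matched_variants
    }


# Hypothetical example: rs3 is in a scoring file but was not matched.
combined = combine_scores(
    ["rs1", "rs2"],
    {"SCORE_A": {"rs1": 0.1, "rs3": 0.9}, "SCORE_B": {"rs2": -0.2}},
)
```

Here `combined["rs1"]` holds a weight for SCORE_A and a filled 0.0 for SCORE_B, and the unmatched rs3 is dropped because it is not on the left side of the join.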
pipeline_info/
Summary reports generated by nextflow describe the execution of the pipeline in technical detail (see the nextflow tracing & visualisation docs for more detail). The execution report can be useful to see how long a job takes to execute, and how much memory/CPU has been allocated (or overallocated) to specific jobs. The DAG is a visualization of the pipeline that may be useful to understand how the pipeline processes data and the ordering of the modules.