Glossary


accession

A unique and stable PGS Catalog score identifier (ID). PGS Catalog IDs start with the prefix PGS, e.g. PGS000001.
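As a rough illustration, an accession can be checked with a regular expression. This is a minimal sketch: the six-digit width is an assumption inferred from the example PGS000001, and the names `PGS_ID_PATTERN` and `is_pgs_accession` are hypothetical helpers, not part of the pipeline.

```python
import re

# Assumption: "PGS" followed by exactly six digits, inferred from PGS000001.
PGS_ID_PATTERN = re.compile(r"PGS\d{6}")


def is_pgs_accession(text: str) -> bool:
    """Return True if the whole string looks like a PGS Catalog score ID."""
    return PGS_ID_PATTERN.fullmatch(text) is not None


print(is_pgs_accession("PGS000001"))
print(is_pgs_accession("PGS1"))
```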

CSV

Comma-separated values, a popular plain-text file format. CSVs are good. Please don’t use .xlsx (Excel); it makes bioinformaticians sad.

JSON

JavaScript Object Notation, a popular file and data interchange format.

polygenic score

A polygenic score (PGS) aggregates the effects of many genetic variants into a single number that quantifies an individual’s genetic predisposition for a phenotype. PGS are typically composed of hundreds to millions of genetic variants (usually SNPs) and are calculated as a weighted sum: each variant’s allele dosage is multiplied by its effect size, and the products are summed. The variants and their effect sizes are most often derived from a genome-wide association study (GWAS) using one of many common software tools, including Pruning/Clumping + Thresholding (e.g. PRSice), LDpred, lassosum, and snpnet.
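The weighted sum described above can be sketched in a few lines of Python. This is an illustrative toy, not how the pipeline computes scores: the variant IDs, weights, and dosages are made up, `polygenic_score` is a hypothetical helper, and skipping variants that are absent from the genotypes is a simplifying assumption.

```python
from typing import Mapping


def polygenic_score(dosages: Mapping[str, float],
                    weights: Mapping[str, float]) -> float:
    """Sum effect-allele dosage * effect weight over the scored variants.

    Variants with no dosage are skipped here (a simplifying assumption).
    """
    return sum(w * dosages[v] for v, w in weights.items() if v in dosages)


# Hypothetical effect weights, as they might appear in a scoring file.
weights = {"var1": 0.5, "var2": -0.25, "var3": 1.0}
# Hypothetical effect-allele dosages (0, 1, or 2 copies) for one individual.
dosages = {"var1": 2, "var2": 1, "var3": 0}

score = polygenic_score(dosages, weights)
print(score)  # 0.5*2 + (-0.25)*1 + 1.0*0 = 0.75
```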

polygenic risk score

A polygenic risk score (PRS) is a type of PGS used to estimate the risk of disease or other clinically relevant outcomes (binary or discrete). It is also sometimes referred to as a genetic or genomic risk score (GRS).

PGS Catalog

The Polygenic Score (PGS) Catalog is an open database of published polygenic scores (PGS). If you develop and publish polygenic scores, please consider submitting them to the Catalog so they can be reused and applied to new datasets using this pipeline!

PGS Catalog Calculator

pgsc_calc - a reproducible workflow to calculate one or multiple PGS, implemented in Nextflow.

SNP

A single nucleotide polymorphism. Most PGS contain only this type of variant, along with small, common insertions/deletions (indels).

Scoring file

A file containing risk alleles and derived weights for a specific phenotype. Weights are typically calculated with (1) GWAS summary statistics and (2) a large population of people with known phenotypes (e.g. the UK Biobank). These files are distributed through the PGS Catalog in a standardized format, and are also provided as harmonized scoring files with consistently reported positions in common genome builds (GRCh37 and GRCh38). The pipeline

target dataset

Also referred to as a sampleset within the input samplesheets. The genomes or genotyping data that you want to calculate polygenic scores for. Scores are calculated from an existing scoring file that contains effect alleles and associated weights. These genomes should be distinct from those used to develop the polygenic score originally (i.e., those used to derive the risk alleles and weights), as overlapping samples will inflate common metrics of PGS accuracy.

VCF

Variant Call Format. A standard file format used to store genetic variants and genotypes. By default, the pipeline (and plink) uses the sample genotypes present in the GT field. However, users can import imputed ALT allele dosages by adding a DS flag to the vcf_genotype_field column of the samplesheet; see How to set up a samplesheet for more information.
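To make the GT/DS distinction concrete, the sketch below derives an ALT allele count from a hard-called GT value and compares it with an imputed DS value. It assumes biallelic sites (any non-reference allele is counted as ALT), and `alt_dosage_from_gt` is a hypothetical helper, not the pipeline's or plink's implementation.

```python
import re
from typing import Optional


def alt_dosage_from_gt(gt: str) -> Optional[int]:
    """Count ALT alleles in a VCF GT value such as '0/1' or '1|1'.

    Returns None when any allele is missing ('.').
    Assumes a biallelic site, so every non-'0' allele counts as ALT.
    """
    alleles = re.split(r"[/|]", gt)  # '/' is unphased, '|' is phased
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


hard_call = alt_dosage_from_gt("0/1")  # integer dosage from GT
imputed_ds = 1.2                       # made-up DS value: expected ALT dosage
print(hard_call, imputed_ds)
```

A hard call can only be 0, 1, or 2 for a diploid sample, whereas DS keeps the fractional uncertainty from imputation.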