How do I set up the reference database?#
A reference database is required to run some parts of the workflow:
Automatic genetic ancestry assignment with Principal Component Analysis
PGS normalisation methods that account for genetic ancestry
Note
It’s simplest to download a reference database that we host on the PGS Catalog FTP.
Download reference database#
Reference databases created by the PGS Catalog are available to download here:
https://ftp.ebi.ac.uk/pub/databases/spot/pgs/resources/pgsc_1000G_v1.tar.zst
https://ftp.ebi.ac.uk/pub/databases/spot/pgs/resources/pgsc_HGDP+1kGP_v1.tar.zst
The databases are either 7GB or 16GB and support both GRCh37 and GRCh38 input target genomes.
Once the reference database is downloaded, remember to include the --run_ancestry parameter, which takes the path to the reference database (see Schema).
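As a rough sketch of the steps above, the snippet below builds the download URL for one of the hosted databases and assembles a workflow command that passes the downloaded file to --run_ancestry. It only prints the URL and the command; it does not download or run anything. The pipeline name pgscatalog/pgsc_calc, the --input argument, and the local paths are assumptions for illustration, so adjust them to match your own setup.

```python
import shlex
from pathlib import Path

# Base URL and file names are taken from the download links above.
FTP_BASE = "https://ftp.ebi.ac.uk/pub/databases/spot/pgs/resources"
DATABASES = {
    "1000G": "pgsc_1000G_v1.tar.zst",
    "HGDP+1kGP": "pgsc_HGDP+1kGP_v1.tar.zst",
}


def database_url(name):
    """Return the download URL for a hosted reference database."""
    return f"{FTP_BASE}/{DATABASES[name]}"


def workflow_command(database_path, pipeline="pgscatalog/pgsc_calc", extra_args=()):
    """Assemble a workflow command that points --run_ancestry at the database.

    The default pipeline name is an assumption; extra_args holds whatever
    other parameters your run needs.
    """
    return ["nextflow", "run", pipeline, *extra_args,
            "--run_ancestry", str(database_path)]


# Hypothetical local location for the downloaded database.
local_copy = Path("reference") / DATABASES["1000G"]

print("Download from:", database_url("1000G"))
print("Then run:", shlex.join(
    workflow_command(local_copy, extra_args=["--input", "samplesheet.csv"])
))
```

Keeping the command as a list of arguments makes it straightforward to pass to a process runner later without worrying about shell quoting.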
(Optional) Create reference database#
Warning
Making a reference database from scratch can be slow and frustrating.
It’s easiest to download the published reference database from the PGS Catalog FTP.
You can choose to create the reference database from scratch. The default reference database uses genomes from the 1000 Genomes project.
Download the reference sample sheet.
Update the URL column to point to the most recent URLs listed on the plink 2 resources page. These URLs change over time.
Note
You can also set the URL column to a local file path if you’ve already downloaded the files.
Mandatory files include:
The compressed binary genotype file (pgen.zst)
The compressed unannotated variant information file (pvar.zst)
The sample information file (psam)
The pedigree information file (king)
Run the workflow with the --ref_samplesheet parameter, which is mutually exclusive with the --ref parameter (see Schema).
Note
This approach could also be used to create a custom reference database, for example one that includes genomes from the Human Genome Diversity Project. Please contact us if you’d like to try this.
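Before starting a long build, it may help to sanity-check an edited sample sheet. The sketch below is not part of the workflow: it assumes the sample sheet is a CSV whose URL column is named url (check the header of the file you downloaded) and that each mandatory file can be recognised by its extension appearing in the URL or local path. It reports which mandatory file types are missing, and builds the reference arguments while refusing to combine --ref with --ref_samplesheet.

```python
import csv
import io

# Filename markers for the four mandatory files listed above.
MANDATORY_MARKERS = ("pgen.zst", "pvar.zst", "psam", "king")


def missing_file_types(samplesheet_text, url_column="url"):
    """Return the mandatory file markers that no row's URL or path contains.

    The column name "url" is an assumption about the sample sheet layout.
    """
    rows = csv.DictReader(io.StringIO(samplesheet_text))
    locations = [row[url_column] for row in rows]
    return [m for m in MANDATORY_MARKERS
            if not any(m in loc for loc in locations)]


def reference_args(ref=None, ref_samplesheet=None):
    """Build the reference arguments, refusing to combine the two parameters."""
    if ref is not None and ref_samplesheet is not None:
        raise ValueError("--ref and --ref_samplesheet are mutually exclusive")
    if ref_samplesheet is not None:
        return ["--ref_samplesheet", str(ref_samplesheet)]
    if ref is not None:
        return ["--ref", str(ref)]
    return []


# Hypothetical sample sheet with local paths and no pedigree (king) file.
example = "url\n/data/ref/all.pgen.zst\n/data/ref/all.pvar.zst\n/data/ref/all.psam\n"
print(missing_file_types(example))  # ['king']
print(reference_args(ref_samplesheet="reference.csv"))
```

Matching on a substring rather than a strict file suffix is a deliberate choice here, since hosted URLs can carry query strings or longer file names after the extension.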