How to set up a samplesheet

A samplesheet describes the structure of your input genotyping datasets. It’s needed because the structure of input data can be very different across use cases (e.g. different file formats, directories, and split vs. unsplit by chromosome).

Warning

The format of samplesheets changed in v2.0.0 to better accommodate additional file formats in the future

Samplesheet

A samplesheet can be set up in a spreadsheet program, using the following structure:

Example samplesheet
sampleset	path_prefix	chrom	format
cineca	target_genomes/cineca_synthetic_subset	22	pfile

The file should be in CSV format. A template is available to download here.
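
If you would rather generate the samplesheet from a script than from a spreadsheet program, the following is a minimal sketch using only the Python standard library. The helper name `write_samplesheet` is hypothetical and not part of pgsc_calc; the row is the example shown above.

```python
import csv
import io

# The four mandatory column names, in the order used by the example above.
COLUMNS = ["sampleset", "path_prefix", "chrom", "format"]


def write_samplesheet(rows, handle):
    """Write samplesheet rows (a list of dicts) to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# The example row from the table above.
rows = [
    {
        "sampleset": "cineca",
        "path_prefix": "target_genomes/cineca_synthetic_subset",
        "chrom": "22",
        "format": "pfile",
    }
]

# Write to an in-memory buffer here; pass a real file handle to save to disk.
buffer = io.StringIO()
write_samplesheet(rows, buffer)
samplesheet_text = buffer.getvalue()
print(samplesheet_text)
```

To save the result to disk, open the target file with `open(path, "w", newline="")` and pass that handle instead of the in-memory buffer.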

There are four mandatory columns:

sampleset: A text string (with no spaces and none of the reserved characters ‘.’ or ‘_’) naming a target dataset of genotyping data that contains at least one sample/individual (cohort datasets will often contain many individuals with combined genotyped/imputed data). Data from a sampleset may be input as a single file or split across chromosomes into multiple files. Scores generated from files with the same sampleset name are combined in later stages of the analysis.
Danger
- pgsc_calc works best with cohort data
- Scores calculated for low sample sizes will generate warnings in the output report
- You should merge your genomes if they are split per individual before using pgsc_calc
path_prefix: The path of the target genomes, excluding all file extensions.
- Example path prefix: /home/stuff/data.vcf.gz -> /home/stuff/data
Danger

Always use absolute paths that begin with /, e.g. /home/stuff/...

Note

One plink file set (bed / bim / fam or pgen / pvar / psam) only needs a single path prefix and row in the samplesheet
chrom: An integer (range 1-22) or string (X, Y). If the target genomic data file contains multiple chromosomes, leave this column empty. Don’t use a mix of empty and non-empty chromosome values in the same sampleset.
format: The file format of the target genomes. Currently supports pfile, bfile, or vcf.
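
The column rules above can be checked before launching a run. The sketch below is one possible pre-flight check written for illustration; the function names are hypothetical, and the pipeline's own validation may be stricter or differ in detail.

```python
import re

SUPPORTED_FORMATS = {"pfile", "bfile", "vcf"}
VALID_CHROMS = {str(i) for i in range(1, 23)} | {"X", "Y"}


def check_row(row):
    """Return a list of problems found in one samplesheet row (a dict of strings)."""
    problems = []
    sampleset = row.get("sampleset", "")
    # No whitespace and none of the reserved characters '.' or '_'.
    if not sampleset or re.search(r"[\s._]", sampleset):
        problems.append("sampleset must be non-empty, without spaces, '.' or '_'")
    if not row.get("path_prefix", "").startswith("/"):
        problems.append("path_prefix should be an absolute path beginning with /")
    chrom = row.get("chrom", "")
    if chrom != "" and chrom not in VALID_CHROMS:
        problems.append("chrom must be empty, 1-22, X, or Y")
    if row.get("format", "") not in SUPPORTED_FORMATS:
        problems.append("format must be one of pfile, bfile, or vcf")
    return problems


def check_samplesheet(rows):
    """Check every row, plus the rule against mixing empty and set chrom values."""
    problems = []
    chrom_is_empty = {}  # sampleset name -> set of booleans seen for "chrom is empty"
    for number, row in enumerate(rows, start=1):
        problems.extend(f"row {number}: {p}" for p in check_row(row))
        name = row.get("sampleset", "")
        chrom_is_empty.setdefault(name, set()).add(row.get("chrom", "") == "")
    for name, seen in chrom_is_empty.items():
        if len(seen) > 1:
            problems.append(f"{name}: mix of empty and set chrom values")
    return problems
```

An empty list from `check_samplesheet` means none of the rules described above were violated; it does not confirm that the files themselves exist or are readable.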

Notes

Note

Multiple samplesheet rows are typically only needed if:

The target genomes are split into one file per chromosome
You’re working with multiple cohorts simultaneously
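
For the first case, where the genomes are split into one file per chromosome, the samplesheet needs one row per file. A minimal sketch of generating those rows follows; the naming pattern `/home/stuff/data_chr{chrom}` and the helper name are assumptions for illustration, so adapt the template to however your files are actually named.

```python
def per_chromosome_rows(sampleset, prefix_template, file_format, chroms=range(1, 23)):
    """Build one samplesheet row per chromosome.

    prefix_template is a hypothetical naming pattern containing "{chrom}",
    e.g. "/home/stuff/data_chr{chrom}".
    """
    return [
        {
            "sampleset": sampleset,
            "path_prefix": prefix_template.format(chrom=chrom),
            "chrom": str(chrom),
            "format": file_format,
        }
        for chrom in chroms
    ]


# All rows share one sampleset name, so their scores are combined later.
rows = per_chromosome_rows("cineca", "/home/stuff/data_chr{chrom}", "vcf")
print(len(rows), rows[0]["path_prefix"])
```

Every generated row carries the same sampleset name and a non-empty chrom value, which matches the rule against mixing empty and set chromosomes.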

Danger

All samplesets must be in the same genome build (either GRCh37 or GRCh38), which is specified using the --target_build [GRCh3#] parameter. All scoring files are downloaded or mapped to match the specified genome build; no liftover/re-mapping of the genotyping data is performed within the pipeline.

Note

Your samplesheet can only contain one sampleset name. If you want to run multiple large cohorts (e.g. 1000G and HGDP), either run the workflow separately for each cohort or combine the files.

Setting genotype field

Note

This is an optional process that is only applicable for some types of VCF data

There is one optional column:

vcf_genotype_field: By default, genotypes in VCF files are extracted from the GT field (hard-called genotypes). Genotypes are often imputed from limited sets of genotyped variants (microarrays, low-coverage sequencing) using imputation tools (Michigan or TopMed Imputation Servers) that output dosages for the ALT allele(s); to extract these data, enter DS in this column.

An example of a samplesheet with two VCF datasets where you’d like to import different genotypes from each is below:

Example samplesheet with genotype field set
sampleset	path_prefix	chrom	format	vcf_genotype_field
cineca_sequenced	path/to/vcf	22	vcf	GT
cineca_imputed	path/to/vcf_imputed	22	vcf	DS
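
To make the default concrete, the sketch below reads the example above as CSV and reports which VCF field each row asks for, falling back to GT when the optional column is absent. Treating an empty cell the same as a missing column is an assumption of this sketch, and `genotype_field` is a hypothetical helper rather than part of the pipeline.

```python
import csv
import io

# The example samplesheet above, written out as CSV text.
EXAMPLE = """\
sampleset,path_prefix,chrom,format,vcf_genotype_field
cineca_sequenced,path/to/vcf,22,vcf,GT
cineca_imputed,path/to/vcf_imputed,22,vcf,DS
"""


def genotype_field(row):
    """Return the VCF field named in a row, falling back to the GT default.

    Treating an empty cell like a missing column is an assumption of this sketch.
    """
    return row.get("vcf_genotype_field") or "GT"


reader = csv.DictReader(io.StringIO(EXAMPLE))
fields = {row["sampleset"]: genotype_field(row) for row in reader}
print(fields)
```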
