How to set up a samplesheet

A samplesheet describes the structure of your input genotyping datasets. It’s needed because the structure of input data can be very different across use cases (e.g. different file formats, directories, and split vs. unsplit by chromosome).

Warning

The samplesheet format changed in v2.0.0 to better accommodate additional file formats in the future.

Note

If your genomes are in cloud storage, CSV samplesheets aren’t supported. Instead, see “How do I launch pgsc_calc using cloud executors?”

Samplesheet

A samplesheet can be set up in a spreadsheet program, using the following structure:

Example samplesheet

sampleset  path_prefix                             chrom  format
cineca     target_genomes/cineca_synthetic_subset  22     pfile

The file should be in CSV format. A template is available to download here.
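If you would rather generate the file with a script than a spreadsheet program, a minimal Python sketch using only the standard library might look like the following. The output filename `samplesheet.csv`, the helper name `write_samplesheet`, and the absolute path are illustrative assumptions, not part of pgsc_calc; the row otherwise mirrors the example above.

```python
import csv

# Column order follows the example samplesheet.
FIELDNAMES = ["sampleset", "path_prefix", "chrom", "format"]


def write_samplesheet(rows, out_path):
    """Write a list of row dicts to a CSV samplesheet with a header line."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical absolute path; replace it with the location of your own data.
rows = [
    {
        "sampleset": "cineca",
        "path_prefix": "/home/stuff/target_genomes/cineca_synthetic_subset",
        "chrom": "22",
        "format": "pfile",
    }
]

if __name__ == "__main__":
    write_samplesheet(rows, "samplesheet.csv")
```

Writing through `csv.DictWriter` keeps the header and the row values in the same column order, which is easy to get wrong when editing the file by hand.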

There are four mandatory columns:

  • sampleset: A text string (no spaces and none of the reserved characters . or _) naming a target dataset of genotyping data that contains at least one sample/individual (cohort datasets will often contain many individuals with combined genotyped/imputed data). Data from a sampleset may be input as a single file or split across chromosomes into multiple files. Scores generated from files with the same sampleset name are combined in later stages of the analysis.

    Danger

    • pgsc_calc works best with cohort data

    • Scores calculated for low sample sizes will generate warnings in the output report

    • If your genomes are split per individual, merge them before using pgsc_calc

  • path_prefix: The path of the target genomes, excluding all file extensions.

    • Example path prefix: /home/stuff/data.vcf.gz -> /home/stuff/data

    Danger

    Always use absolute paths that begin with /, e.g. /home/stuff/...

    Note

    One plink file set (bed / bim / fam or pgen / pvar / psam) only needs a single path prefix and row in the samplesheet

  • chrom: An integer (range 1-22) or string (X, Y). If the target genomic data file contains multiple chromosomes, leave this column empty. Don’t mix empty and non-empty chromosome values in the same sampleset.

  • format: The file format of the target genomes. Currently supports pfile, bfile, or vcf.
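The column rules above can be checked before launching the workflow. The sketch below is one possible reading of those rules, not the validation pgsc_calc itself performs. The function names `strip_extensions` and `check_row` are hypothetical, and the list of known extensions is an assumption assembled from the file types named in this guide.

```python
import re

# Formats and chromosome labels named in this guide.
SUPPORTED_FORMATS = {"pfile", "bfile", "vcf"}
VALID_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y", ""}

# Assumed list of extensions to strip; longer suffixes come first.
KNOWN_EXTENSIONS = (".vcf.gz", ".vcf", ".bed", ".bim", ".fam",
                    ".pgen", ".pvar", ".psam")


def strip_extensions(file_path):
    """Turn a genome file path into a path prefix by removing a known extension."""
    for ext in KNOWN_EXTENSIONS:
        if file_path.endswith(ext):
            return file_path[: -len(ext)]
    return file_path


def check_row(row):
    """Return a list of problems found in one samplesheet row (a dict of strings)."""
    problems = []
    sampleset = row.get("sampleset", "")
    # Reject empty names and names containing whitespace, '.' or '_'.
    if not sampleset or re.search(r"[\s._]", sampleset):
        problems.append("sampleset must be non-empty, without spaces, '.' or '_'")
    if not row.get("path_prefix", "").startswith("/"):
        problems.append("path_prefix must be an absolute path starting with /")
    if row.get("chrom", "") not in VALID_CHROMS:
        problems.append("chrom must be 1-22, X, Y, or empty")
    if row.get("format", "") not in SUPPORTED_FORMATS:
        problems.append("format must be pfile, bfile, or vcf")
    return problems
```

For example, `strip_extensions("/home/stuff/data.vcf.gz")` returns `/home/stuff/data`, matching the path prefix example in the list above.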

Notes

Danger

Always include every target genome chromosome in your samplesheet unless you’re certain that the scoring files contain no variants on the missing chromosomes.

Note

Multiple samplesheet rows are typically only needed if the target genomes are split into one file per chromosome.
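When the genomes are split per chromosome, the rows differ only in chrom and path_prefix, so building them in a loop avoids copy-and-paste mistakes. This is a sketch under assumptions: the helper `per_chromosome_rows` and the file naming pattern `/home/stuff/cineca_chr{chrom}` are hypothetical, so adapt the template to however your own files are named.

```python
def per_chromosome_rows(sampleset, prefix_template, file_format, chroms=range(1, 23)):
    """Build one samplesheet row per chromosome.

    prefix_template must contain a {chrom} placeholder, which is filled
    with each chromosome label in turn.
    """
    return [
        {
            "sampleset": sampleset,
            "path_prefix": prefix_template.format(chrom=chrom),
            "chrom": str(chrom),
            "format": file_format,
        }
        for chrom in chroms
    ]


# Hypothetical layout: one VCF per autosome, named cineca_chr1 ... cineca_chr22.
rows = per_chromosome_rows("cineca", "/home/stuff/cineca_chr{chrom}", "vcf")
```

The default covers chromosomes 1 to 22; pass something like `chroms=list(range(1, 23)) + ["X", "Y"]` if your data also include the sex chromosomes.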

Danger

All target genome files have to be in the same genome build (either GRCh37 or GRCh38), which is specified using the --target_build [GRCh3#] parameter. All scoring files are downloaded or mapped to match the specified genome build; no liftover/re-mapping of the genotyping data is performed within the pipeline.

Note

Your samplesheet can only contain one sampleset name. If you want to run multiple large cohorts (e.g. 1000G and HGDP) then run the workflow separately or combine the files.
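Two of the rules in this guide apply to the samplesheet as a whole rather than to single rows: only one sampleset name is allowed, and empty and non-empty chrom values shouldn’t be mixed. A minimal sketch of checking both is shown below; `check_samplesheet` is a hypothetical helper, not part of pgsc_calc, and the paths in the example are made up.

```python
import csv
import io


def check_samplesheet(rows):
    """Return a list of problems that span the whole samplesheet."""
    problems = []
    samplesets = sorted({row.get("sampleset", "") for row in rows})
    if len(samplesets) != 1:
        problems.append(f"expected exactly one sampleset name, found {samplesets}")
    chroms = [row.get("chrom", "") for row in rows]
    # A sheet should use either empty chrom values or explicit ones, not both.
    if any(c == "" for c in chroms) and any(c != "" for c in chroms):
        problems.append("chrom column mixes empty and non-empty values")
    return problems


# Hypothetical samplesheet content, read the same way a file would be.
example = io.StringIO(
    "sampleset,path_prefix,chrom,format\n"
    "cineca,/home/stuff/cineca_chr21,21,pfile\n"
    "cineca,/home/stuff/cineca_chr22,22,pfile\n"
)
rows = list(csv.DictReader(example))
print(check_samplesheet(rows))  # prints []
```

To check a real file, replace the `io.StringIO` object with `open("samplesheet.csv", newline="")`.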

Setting genotype field

Note

This is an optional process that is only applicable for some types of VCF data

There is one optional column:

  • vcf_genotype_field: By default, genotypes in VCF files are extracted from the GT field (hard-called genotypes). Genotypes are often imputed from limited sets of genotyped variants (microarrays, low-coverage sequencing) using imputation tools (e.g. the Michigan or TopMed Imputation Servers) that output dosages for the ALT allele(s). To extract these data, enter DS in this column.

Example samplesheet with genotype field set to hard-calls (default)

sampleset         path_prefix  chrom  format  vcf_genotype_field
cineca_sequenced  path/to/vcf  22     vcf     GT

Example samplesheet with genotype field set to dosage

sampleset       path_prefix          chrom  format  vcf_genotype_field
cineca_imputed  path/to/vcf_imputed  22     vcf     DS
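The optional column can be added to a scripted samplesheet in the same way as the mandatory ones. The sketch below builds a row for imputed VCF data and renders it as CSV text. The helpers `vcf_row` and `to_csv`, the sampleset name `cineca`, and the absolute path are illustrative assumptions. Only the two values described in this section (GT and DS) are accepted by the sketch.

```python
import csv
import io

FIELDNAMES = ["sampleset", "path_prefix", "chrom", "format", "vcf_genotype_field"]
GENOTYPE_FIELDS = {"GT", "DS"}  # hard calls (default) or ALT allele dosages


def vcf_row(sampleset, path_prefix, chrom, genotype_field="GT"):
    """Build a samplesheet row for a VCF file with the optional genotype column."""
    if genotype_field not in GENOTYPE_FIELDS:
        raise ValueError(f"vcf_genotype_field must be GT or DS, got {genotype_field!r}")
    return {
        "sampleset": sampleset,
        "path_prefix": path_prefix,
        "chrom": str(chrom),
        "format": "vcf",
        "vcf_genotype_field": genotype_field,
    }


def to_csv(rows):
    """Render rows as CSV text, header first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical imputed dataset: dosages are read from the DS field.
imputed = vcf_row("cineca", "/home/stuff/vcf_imputed", 22, genotype_field="DS")
print(to_csv([imputed]))
```

Leaving `genotype_field` at its default produces a row equivalent to the hard-call example, since GT is what the pipeline reads when the column is not set.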