How to set up a samplesheet#
A samplesheet describes the structure of your input genotyping datasets. It’s needed because the structure of input data can be very different across use cases (e.g. different file formats, directories, and split vs. unsplit by chromosome).
Warning
The format of samplesheets changed in v2.0.0 to better accommodate additional file formats in the future.
Note
If your genomes are in cloud storage, CSV samplesheets aren’t supported. Instead, check How do I launch pgsc_calc using cloud executors?
Samplesheet#
A samplesheet can be set up in a spreadsheet program, using the following structure:
| sampleset | path_prefix | chrom | format |
|---|---|---|---|
| cineca | target_genomes/cineca_synthetic_subset | 22 | pfile |
The file should be in CSV format. A template is available to download here.
There are four mandatory columns:
sampleset: A text string (no spaces or reserved characters [. or _]) referring to the name of a target dataset of genotyping data containing at least one sample/individual (although cohort datasets will often contain many individuals with combined genotyped/imputed data). Data from a sampleset may be input as a single file, or split across chromosomes into multiple files. Scores generated from files with the same sampleset name are combined in later stages of the analysis.
Danger
pgsc_calc works best with cohort data. Scores calculated for low sample sizes will generate warnings in the output report.
You should merge your genomes if they are split per individual before using pgsc_calc.
path_prefix: The path of the target genomes, excluding all file extensions.
Example path prefix: /home/stuff/data.vcf.gz -> /home/stuff/data
Danger
Always use absolute paths that begin with /, e.g. /home/stuff/...
Note
One plink file set (bed / bim / fam or pgen / pvar / psam) only needs a single path prefix and row in the samplesheet.
chrom: An integer (range 1-22) or string (X, Y). If the target genomic data file contains multiple chromosomes, leave empty. Don’t use a mix of empty and integer chromosomes in the same sampleset.
format: The file format of the target genomes. Currently supports pfile, bfile, or vcf.
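The column rules above can be checked before launching the workflow. The following is a minimal sketch, not part of pgsc_calc itself: the function name `validate_samplesheet` and the wording of the messages are hypothetical, and it only encodes the rules as they are written in this section.

```python
import csv
import io
import re

ALLOWED_FORMATS = {"pfile", "bfile", "vcf"}
ALLOWED_CHROMS = {str(i) for i in range(1, 23)} | {"X", "Y"}


def validate_samplesheet(rows):
    """Return a list of problems found in samplesheet rows (hypothetical helper).

    `rows` is a list of dicts keyed by column name, e.g. from csv.DictReader.
    """
    problems = []
    for n, row in enumerate(rows, start=1):
        sampleset = row.get("sampleset") or ""
        # No spaces and no reserved characters ('.' or '_') in the name
        if not sampleset or re.search(r"[\s._]", sampleset):
            problems.append(f"row {n}: invalid sampleset name {sampleset!r}")
        # Absolute paths begin with '/'
        if not (row.get("path_prefix") or "").startswith("/"):
            problems.append(f"row {n}: path_prefix is not an absolute path")
        # chrom is 1-22, X, Y, or empty (file contains multiple chromosomes)
        chrom = row.get("chrom") or ""
        if chrom and chrom not in ALLOWED_CHROMS:
            problems.append(f"row {n}: unexpected chrom {chrom!r}")
        if (row.get("format") or "") not in ALLOWED_FORMATS:
            problems.append(f"row {n}: unsupported format {row.get('format')!r}")
    # Empty and non-empty chrom values should not be mixed
    chroms = [row.get("chrom") or "" for row in rows]
    if any(chroms) and not all(chroms):
        problems.append("chrom column mixes empty and non-empty values")
    return problems


if __name__ == "__main__":
    # Illustrative samplesheet text; the path is a made-up example
    text = "sampleset,path_prefix,chrom,format\ncineca,/home/stuff/data,22,pfile\n"
    for problem in validate_samplesheet(list(csv.DictReader(io.StringIO(text)))):
        print(problem)
```

An empty list means none of these particular checks failed; it does not guarantee that the workflow will accept the file.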
Notes#
Danger
Always include every target genome chromosome in your samplesheet unless you’re certain that missing chromosomes aren’t in the scoring files.
Note
Multiple samplesheet rows are typically only needed if the target genomes are split into one file per chromosome.
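When the target genomes are split per chromosome, writing one row per file by hand is error-prone. Here is a minimal sketch that generates the rows with Python's standard `csv` module. The helper names and the file naming pattern `/home/stuff/data_chr{chrom}` are assumptions for illustration only; adapt the pattern to how your files are actually named.

```python
import csv
import io

COLUMNS = ["sampleset", "path_prefix", "chrom", "format"]


def per_chromosome_rows(sampleset, prefix_template, file_format, chroms=range(1, 23)):
    """Build one samplesheet row per chromosome (hypothetical helper).

    `prefix_template` must contain a `{chrom}` placeholder and no file extension.
    """
    return [
        {
            "sampleset": sampleset,
            "path_prefix": prefix_template.format(chrom=chrom),
            "chrom": str(chrom),
            "format": file_format,
        }
        for chrom in chroms
    ]


def to_csv(rows):
    """Serialise rows to CSV text with the four mandatory columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # The path pattern below is an assumed naming scheme, not a pgsc_calc requirement
    rows = per_chromosome_rows("cineca", "/home/stuff/data_chr{chrom}", "vcf")
    print(to_csv(rows))
```

Every generated row shares the same sampleset name, so the scores from the separate files are combined later in the analysis.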
Danger
All target genome files have to be in the same genome build (either GRCh37 or
GRCh38), which is specified using the --target_build [GRCh3#]
parameter. All scoring files are downloaded or mapped to match the specified
genome build; no liftover/re-mapping of the genotyping data is performed
within the pipeline.
Note
Your samplesheet can only contain one sampleset name. If you want to run multiple large cohorts (e.g. 1000G and HGDP) then run the workflow separately or combine the files.
Setting genotype field#
Note
This is an optional process that is only applicable for some types of VCF data.
There is one optional column:
vcf_genotype_field: Genotypes present in VCF files are extracted from the GT field (hard-called genotypes) by default. Genotypes are often imputed from limited sets of genotyped variants (microarrays, low-coverage sequencing) using imputation tools (Michigan or TopMed Imputation Servers) that output dosages for the ALT allele(s). To extract these data, users should enter DS in this column.
| sampleset | path_prefix | chrom | format | vcf_genotype_field |
|---|---|---|---|---|
| cineca_sequenced | path/to/vcf | 22 | vcf | |
| sampleset | path_prefix | chrom | format | vcf_genotype_field |
|---|---|---|---|---|
| cineca_imputed | path/to/vcf_imputed | 22 | vcf | DS |
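Before entering DS, it can help to confirm that the VCF actually declares a dosage field. The sketch below is a hypothetical helper, not part of pgsc_calc. It reads the `##FORMAT` meta-information lines of a VCF header and reports which field IDs are declared. It assumes the header follows the standard `##FORMAT=<ID=...,...>` layout and that compressed files end in `.gz`.

```python
import gzip
import re


def vcf_format_fields(path):
    """Return the set of FORMAT field IDs declared in a VCF header.

    Hypothetical helper: only the '##' meta-information lines are read,
    so the genotype records themselves are never parsed.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    fields = set()
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta-information lines end at the '#CHROM' header row
            match = re.match(r"##FORMAT=<ID=([^,>]+)", line)
            if match:
                fields.add(match.group(1))
    return fields


def has_dosage_field(path):
    """True if the header declares a DS FORMAT field."""
    return "DS" in vcf_format_fields(path)
```

A declared DS field only shows that dosages may be present; whether to use them instead of GT still depends on how the data were generated.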