How do I launch pgsc_calc using cloud executors?#
Nextflow provides native support for cloud executors, like:
Amazon Web Services
Google Cloud Platform
Azure
Note
Nextflow also supports data in cloud storage when using local or cloud executors
Your genomes are in cloud storage#
The samplesheet CSV format doesn’t support cloud storage, but you can create a JSON samplesheet:
[
{
"pheno": "gs://bucket/data/all_phase3.psam",
"vcf_import_dosage": false,
"variants": "gs://bucket/data/all_phase3.pvar.zst",
"geno": "gs://bucket/data/all_phase3.pgen",
"sampleset": "test",
"chrom": null,
"format": "pfile"
}
]
The sanplesheet is a JSON array that contains a list of JSON objects. Each row in the CSV samplesheet corresponds to an element in the JSON array.
Warning
Unlike the CSV samplesheet full paths must always be specified, including URIs (s3://...
)
The key differences between the CSV and JSON samplesheet fields are:
pheno
: Path to sample information file: plink 1 fam files, plink 2 fam files, or the VCF pathvariants
: Path to variant information file: plink 1 bim, plink 2 pvar, or the VCF pathgeno
: Path to genoftype file: plink 1 bed, plink 2 pgen, or the VCF pathchrom
: An optional string (not an integer!)
Note
If you’re using VCF input, then the VCF path must be repeated for pheno
, variants
, and geno
Once your samplesheet is ready, you can use it with nextflow:
$ nextflow run pgscatalog/pgsc_calc --input path/to/samplesheet.json --format json ...
Warning
Don’t forget --format json
Note
gs://
is for Google Cloud Storage. Check the Nextflow documentation for other supported cloud storage systems and URIs
Multiple chromosomes example#
Just add elements to the list, each object correponds to a new row in the CSV samplesheet:
[
{
"pheno": "gs://bucket/data/chr1_phase3.vcf.gz",
"vcf_import_dosage": false,
"variants": "gs://bucket/data/chr1_phase3.vcf.gz",
"geno": "gs://bucket/data/chr1_phase3.vcf.gz",
"sampleset": "test",
"chrom": "1",
"format": "bfile"
},
{
"pheno": "gs://bucket/data/chr2_phase3.vcf.gz",
"vcf_import_dosage": false,
"variants": "gs://bucket/data/chr2_phase3.vcf.gz",
"geno": "gs://bucket/data/chr2_phase3.vcf.gz",
"sampleset": "test",
"chrom": "2",
"format": "bfile"
}
]
Warning
If you forget to include every chromosome in the JSON array then score calculation might fail
Why do I need to use JSON?#
The CSV samplesheet parsing does some helpful things like:
Making sure paths exist
Detecting file extensions based on the file format
Finding and using compressed variant information files and preferentially using this data
While aiming to be in a friendly Excel compatible format. Biologists love Excel, and JSON can be a little bit scary. A limitation of the approach is that it only works well with normal file systems, and doesn’t support object storage.
How do I configure my cloud executor?#
We’ve tested and deployed pgsc_calc
on Google Cloud Platform and Seqera Platform using Wave and Fusion file system.
Describing cloud configuration is out of scope for pgsc_calc
documentation. It’s best to check the Nextflow documentation instead.
Please feel free to open an issue or start a discussion if you experience problems running the workflow in the cloud.