This documentation is written in markdown and comes bundled with the Cluster Flow source code.
Cluster Flow is a workflow manager designed to run bioinformatics pipelines.
It is operated through a single command, `cf`, which can be used to launch, configure, monitor and cancel pipelines.
To get you started with Cluster Flow, there are a few tutorial videos on YouTube.
Cluster Flow is designed to work with a computing cluster. It currently supports the Sun GRIDEngine, LSF and SLURM job managers (not PBS, Torque or others).
If you don't have a cluster with a supported manager, you can run Cluster Flow on any command-line machine in `local` mode. This writes a bash script and runs it as a job in the background.
To run analyses, you will also need the required tools to be installed. Cluster Flow is designed to work with the environment module system and load tools as required, but if software is available on the `PATH` it can work without this.
Cluster Flow itself is written in Perl. It has minimal dependencies, all of which are core Perl packages.
If you are a user on a HPC cluster, you may already have Cluster Flow installed on your cluster as an environment module. If so, you may be able to load it using:
module load clusterflow
Cluster Flow is a collection of stand-alone scripts, mostly written in Perl.
wget https://github.com/ewels/clusterflow/archive/v0.5.tar.gz
tar xvzf v0.5.tar.gz
cd clusterflow-0.5
cp clusterflow.config.example clusterflow.config
vi clusterflow.config
You must specify your environment in the config file (`@cluster_environment`: `local`, `GRIDEngine`, `SLURM` or `LSF`); most other things are optional.
The `cf` executable must be in your system `PATH`, so that you can run it easily from any directory. Ensure that you run the Configuration Wizard (described below) so that this config is created in your `~/.bashrc` file.
If you prefer, you can symlink the `cf` executable to `~/bin`.
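As a minimal sketch (assuming Cluster Flow was unpacked to `~/clusterflow-0.5`; adjust the path to your installation), either of the following achieves this:
# Add the Cluster Flow directory to your PATH via ~/.bashrc (path is illustrative)
echo 'export PATH="$HOME/clusterflow-0.5:$PATH"' >> ~/.bashrc
# ...or symlink the cf executable into a directory already on your PATH
ln -s ~/clusterflow-0.5/cf ~/bin/cf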
Once Cluster Flow has been set up site-wide, you need to configure it for your personal use:
cf --setup
This will launch a wizard to write a config file for you, with details such as e-mail address and notification settings.
Most analysis pipelines need a reference genome. This can exist in a central location or in your personal setup (or both).
If you're using the Swedish UPPMAX cluster, please see these instructions.
You can add your reference genome paths with the following wizard:
cf --add_genome
That should be it! Log out of your session and in again to activate any new bash settings. Then try launching a test run:
cf --genome GRCh37 sra_bowtie ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/sralite/ByExp/litesra/SRX/SRX031/SRX031398/SRR1068378/SRR1068378.sra
This will download SRR1068378 (human H3K4me3 ChIP-Seq data), convert it to FastQ, run FastQC and Trim Galore!, and align with Bowtie.
Once Cluster Flow is up and running, you can list the available pipelines, modules and reference genomes using the following commands:
cf --pipelines # List pipelines
cf --modules # List modules
cf --genomes # List reference genomes
To get instructions for how to use Cluster Flow on the command line, use:
cf --help
You can also use this command to find out more information about pipelines and modules:
cf --help [module-name]
cf --help [pipeline-name]
In its most basic form, analyses are run as follows:
cf [pipeline] [files]
Single modules can also be specified instead of a pipeline:
cf [module] *.bam
Most pipelines and modules will need a reference genome, specified using `--genome`:
cf --genome GRCh37 sra_bowtie *.sra
The ID following `--genome` is the ID assigned when adding the reference genome to Cluster Flow. This can be seen when listing genomes with `cf --genomes`.
The default execution of different tools can be modified by using module
parameters. These can be set within pipeline scripts or on the command line.
Specifying `--param [example]` will apply the `[example]` parameter to every module in the pipeline.
Different modules support different parameters. Some are flags, some are key-value pairs. To find out more, see the Modules documentation.
Typical things you can do are to set adapter trimming preferences with TrimGalore!:
cf --genome GRCh37 --param clip_r1=6 --param min_readlength=15 sra_bowtie *sra
or run Bismark in PBAT mode:
cf --genome GRCm38 --param pbat fastq_bismark *.fastq.gz
When setting these in pipeline scripts, simply add the parameters after the module names (tab-delimited). For example, this is the `trim_bowtie_miRNA` pipeline:
#trim_galore adapter=ATGGAATTCTCG
#bowtie mirna
This sets a custom adapter for trimming and tells the bowtie module to use the `mirna` parameter.
When launching Cluster Flow, a number of filename checks are performed. If input
files are FastQ and the filenames look like paired-end files, it launches in
paired-end mode (this can be overridden with `--single`).
If a mixture of file types or paired end / single end FastQ files are found,
Cluster Flow will show an error and exit. This step can be skipped by using the `--no_fn_check` parameter.
If `@merge_regex` is configured in the configuration file, matching input files will be merged before processing.
As well as supplying Cluster Flow with input files, you can give URLs. This will
cause Cluster Flow to add the `cf_download` module to the start of your pipeline to download the data.
Cluster Flow will recognise anything starting with `http`, `https` or `ftp` as a URL. Downloads are processed in series to avoid overwhelming the internet connection.
If using the `--file-list` parameter you can also specify a filename for each download. This should be added after the download URL, separated by a tab character. This is particularly useful when downloading arbitrarily named SRA files and is compatible with the Labrador Dataset Manager. See also the stand-alone SRA-Explorer tool if you don't have Labrador installed.
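For example, a download file list might look like this (one entry per line, URL and filename separated by a tab; the URLs and sample names are purely illustrative):
ftp://ftp.example.org/sra/SRR0000001/SRR0000001.sra    sample_1_input.sra
ftp://ftp.example.org/sra/SRR0000002/SRR0000002.sra    sample_2_H3K4me3.sra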
Cluster Flow has a number of features built in to avoid swamping your cluster with jobs.
Firstly, Cluster Flow limits the number of parallel runs created. Defaults are set in the config file with `@split_files` (default 1) and `@max_runs` (default 12). `@split_files` defines the minimum number of files per run; `@max_runs` defines the maximum number of parallel runs and adds more files per run if needed.
Cluster Flow also tries to intelligently limit the memory usage and number of cores each module uses. The config options `@total_cores` and `@total_mem` specify the maximum resources to be used by each Cluster Flow pipeline. These are split up amongst the maximum simultaneous jobs and presented to each module.
The modules can then request resources, making use of optional parallelisation
where available.
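As a sketch, the relevant lines in `clusterflow.config` might look like this (tab-delimited; the values shown are illustrative rather than recommendations):
@split_files 1
@max_runs 12
@total_cores 64
@total_mem 128G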
Cluster Flow pipelines are launched as follows:
cf [flags] <pipeline> <input-files>
These flags are used to customise run-time parameters for the pipeline that Cluster Flow will launch.
Default: none
Some pipelines which carry out a reference genome alignment require a genome directory path to be set. Requirements for format may vary between modules.
Default: Auto-detect
If specified, Cluster Flow will send two files to each run, assuming that the order that the file list is supplied in corresponds to two read files. If an odd number of files is supplied, the final file is submitted as single end.
Default: Auto-detect
If specified, Cluster Flow will ignore its auto-detection of paired end input files and force the single end processing of each input file.
Default: none
Cluster Flow will make sure that all of the input files have the same file extension to avoid accidentally submitting files that aren’t part of the run. Specifying this parameter disables this check.
Default: none
If specified, you can define a file containing a list of filenames to pass to the pipeline (one per line). This is particularly useful when supplying a list of download URLs.
Default: none
Pipelines and their modules are configured to run with sensible defaults. Some
modules accept parameters which change their behaviour. Typically, these are
set within a pipeline config file. By using `--params`, you can add extra
parameters at run time. These will be set for every module in the pipeline
(though they probably won’t all recognise them).
Default: (config file - typically 1)
Cluster Flow generates multiple parallel runs for the supplied input files. This is typically a good thing: the cluster is designed to run jobs in parallel. However, some jobs may involve many small tasks with a large number of input files, and 1:1 parallelisation may not be practical. In such cases, the number of input files to assign to each run can be set with this flag.
Default: none
It can sometimes be a pain to count the number of input files and work out a sensible number to use with `--split_files`. Cluster Flow can take the `--max_runs` value and divide the input files into this number of runs, setting `--split_files` automatically.
A default can be set for `--max_runs` in the `clusterflow.config` file; this value is set to 12 if no value is found in the config files. Set to 0 to disable. This parameter will override anything set using `--split_files`.
Default: none
Optional custom prefix for run file filenames. This is useful if you are running multiple instances of Cluster Flow with the same input file in the same directory, as it avoids potential clashes / mixups. For example:
cf --runfile_prefix bt1 --genome GRCh37 fastq_bowtie1 my_sample.fq
cf --runfile_prefix bt2 --genome GRCh37 fastq_bowtie2 my_sample.fq
Default: none
Specify a reference genome without adding it to the `genomes.config` file. It should be in the format `<ref_type>=<path>`, e.g.:
cf --ref bowtie=/path/to/bowtie/index <pipeline> <files>
Default: false
Do everything except for actually launching cluster jobs. Useful for testing and checking that jobs will be created properly.
Typically, Cluster Flow settings are set in static configuration files. However, sometimes it can be useful to specify parameters on a one-off basis on the command line.
Default: (config file)
Cluster Flow can send notification e-mails regarding the status of runs.
Typically, the e-mail address should be set using `@email` in `~/clusterflow.config` (see above). This parameter allows you to override that setting on a one-off basis.
Default: (config file - typically -500)
Many cluster managers can use a priority system to manage jobs in the queue. Typically, GRIDEngine priorities can be set ranging from -1000 to 0.
Default: (config file - typically 64)
Override the maximum number of cores allowed for each Cluster Flow pipeline, typically set in the Cluster Flow config file. For more information see Avoiding cluster overload.
Default: (config file - typically 128G)
Setting `--mem` allows you to override the maximum amount of simultaneously assigned memory. For more information see Avoiding cluster overload.
Default: (config file - typically none)
Override the maximum requested time assigned to jobs. For more information, see Avoiding cluster overload.
Default: (config file - typically none)
Specify the project to use on the cluster for this run.
Default: (config file - typically none)
Specify a custom cluster queue to use for this run.
Default: (config file - custom)
Override the default environment to use for this pipeline run. This is useful for testing or for small jobs, which can then run using local bash commands instead of submitting cluster jobs. For example:
cf --environment local test_pipeline *.txt
Default: (config file - typically cea)
Cluster Flow can e-mail you notifications about the progress of your runs. There are several levels of notification that you can choose using this flag. They are:
c - Send a notification when all runs in a pipeline are completed
r - Send a notification when each run is completed
e - Send a notification when a cluster job ends
s - Send a notification if a cluster job is suspended
a - Send a notification if a cluster job is aborted
Setting these options at run time with the `--notifications` flag will override the settings present in your clusterflow.config configuration files.
Note: setting the `s` flag when using many input files with a long pipeline may cause your inbox to be flooded.
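For example, to receive only the pipeline-complete and abort notifications for a single run, the letters can be combined into one string:
cf --notifications ca --genome GRCh37 fastq_bowtie *.fastq.gz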
These flags instruct Cluster Flow to do something other than submit a pipeline.
When you have a lot of jobs running and queued, the qstat summary can get a little overwhelming. To combat this and show the job hierarchy in an intuitive manner, you can run `cf --qstat`. This parses the qstat output and displays it nicely. `cf --qstatall` does the same, but for all jobs by all users.
You'll probably find that you want to run this command quite a lot. To make it a little less clumsy, you can create aliases in your `.bashrc` or `.bash_profile` scripts, which run every time you log in.
alias qs='cf --qstat'
alias qsa='cf --qstatall'
To append these lines to your `.bashrc` script you can use the following command:
echo -e "alias qs='cf --qstat'\nalias qsa='cf --qstatall'" >> ~/.bashrc
Note: These tools don't work with LSF, as I don't have an LSF testing server to work on. Please get in touch if you can help.
Sometimes you may be running multiple pipelines and want to stop just one. It can be a pain to find the job numbers to do this manually, so instead you can use Cluster Flow to kill these jobs. When running `cf --qstat`, ID values are printed for each pipeline. For example:
$ qs
======================================================================
Cluster Flow Pipeline: fastq_bowtie
Submitted: 17 hours, 1 minutes, 46 seconds ago
Working Directory: /path/to/working/dir
Cluster Flow ID: fastq_bowtie_1468357637
Submitted Jobs: 29
Running Jobs: 1
Queued Jobs: 2 (dependencies)
Completed Jobs: 26 (89%)
======================================================================
You can then use this Cluster Flow ID to kill all jobs within that pipeline:
cf --qdel fastq_bowtie_1468357637
Run the Cluster Flow interactive wizard to add new genomes.
Run the interactive setup wizard to create a configuration file for Cluster Flow.
Display the currently installed version of Cluster Flow.
Check online for any available Cluster Flow updates.
Show a help message describing the different command line flags available.
Cluster Flow will search three locations for a config file every time it is run. Variables found in each file can override those read from a previous config file. They are, in order of priority:
<working directory>/clusterflow.config
~/clusterflow.config
<installation directory>/clusterflow.config
Config files contain key: value pairs. Syntax is as follows: `@key value` (tab-delimited, one per line). The Cluster Flow source code comes with an example config file called `clusterflow.config.example`.
Typically, there will be a config file in the installation directory which contains the settings that make Cluster Flow work. Each user will then have a personal configuration file in their home directory containing settings such as a notification e-mail address.
The key things to set up when installing Cluster Flow are the variables that dictate how CF should interact with your cluster - what commands it should use to submit jobs.
Cluster Flow currently supports GRIDEngine (SGE), SLURM and LSF, as well as running locally using background bash jobs. You can specify which environment to use with `@cluster_environment`:
/* Options: local, GRIDEngine, SLURM or LSF */
@cluster_environment SLURM
In most cases, that should be enough to get Cluster Flow to work! However, some people have specific variables that need to be submitted with batch jobs (e.g. project identifiers, time limits, other custom flags). If this is the case, the job submission command can be customised with the `@custom_job_submit_command` config variable.
To use this, enter your typical submission command with the following placeholders which will be replaced at run time:
{{command}}
{{job_id}}
{{outfn}} (the log file that STDOUT is written to)
{{cores}}
{{mem}}
{{time}}
{{priority}}
{{project}}
{{qname}}
{{email}}
{{notifications}}
Simply omit any variables which are not needed on your cluster. For example:
@custom_job_submit_command sbatch -A MY_PROJECT_ID -t 2-00:00:00 -p core -n {{cores}} --open-mode=append -o {{outfn}} -J {{job_id}} {{notifications}} --wrap="{{command}}"
Cluster Flow will generate its own sensible default if this isn't set, so it's worth trying without this setting first.
Note: Cluster Flow will append the system-specific job dependency strings to the end of your custom command, so it's important that `@cluster_environment` is correct.
The following section describes the available variables that can be set in the config file. For an example, see the `clusterflow.config.example` file that comes bundled with Cluster Flow.
Sets your e-mail address, used for e-mail notifications.
Set to true to make the output from `cf --qstat` and `cf --qstatall` colourful (and hopefully easier to read).
@colourful 1
A regex used to automatically merge files before pipeline processing starts. This works by matching a single regex group within a filename. If multiple input files have the same matching group, they will be merged. The regex group is then used to give the output filename.
For example, given the following config regex:
@merge_regex [1-8]_[0-9]{6}_[a-zA-Z0-9]+_(P\d+_\d+_[12]).fastq.gz
These input files:
1_160312_CDSH32SDB3889_P1234_001_1.fastq.gz
1_160312_CDSH32SDB3889_P1234_001_2.fastq.gz
2_160312_CDSH32SDB3889_P1234_001_1.fastq.gz
2_160312_CDSH32SDB3889_P1234_001_2.fastq.gz
Would give the resulting merged files:
P1234_001_1.fastq.gz
P1234_001_2.fastq.gz
The default number of input files to send to each run. Typically set to 1.
The maximum number of parallel runs that cluster flow will set off in one go. Default is 12 to avoid swamping the cluster for all other users.
The total number of cores available to a Cluster Flow pipeline. Modules are given a recommended number of cores so that resources can be allocated without swamping the cluster.
The total amount of memory available to a Cluster Flow pipeline. Modules are given a recommended quota so that resources can be allocated without swamping the cluster.
The maximum time that a job should request in a Cluster Flow pipeline. For example, to prevent jobs from requesting more than 10 days:
@max_time 10-00
If your cluster is running slowly and the default time limits specified in Cluster Flow modules are not enough, jobs will fail due to timing out. `@time_multiplier` is a quick and dirty way to avoid this. Setting `@time_multiplier` to 2 will double the requested time for every job. Note that these times will still be capped by `@max_time`.
The priority to give to cluster jobs.
See above docs: Environment setup.
If you do not use environment modules on your system, you can prevent Cluster Flow from trying to use them (and giving a warning) by adding this line to your config file.
Specify an environment module to always load for every Cluster Flow pipeline. Can be used multiple times.
If using environment modules, you may get some errors claiming that certain tools are not installed. If you think that you do have that tool installed, it could be because of a minor difference in the module name (e.g. `fastqc` versus `FastQC`). You can configure aliases in your configuration file.
You can also use these aliases to specify specific software versions for Cluster Flow. Aliases are added with the `@environment_module_alias` tag. For example:
@environment_module_alias fastqc FastQC/0.11.2
@environment_module_alias trim_galore TrimGalore
To pull out specific highlights or warnings from log files, you can specify search strings with these tags. If found, the e-mail will be highlighted accordingly and the lines from the log file will be displayed at the top of the report e-mail.
For example:
@log_highlight_string at least one reported alignment
@log_warning_string job failed
Multiple `@notification` key pairs can be set with the following values:
complete
run
end
suspend
abort
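For example, to be notified when each pipeline completes and if any cluster job is aborted (one tab-delimited key pair per line):
@notification complete
@notification abort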
Cluster Flow sends the `run` and `complete` notifications using the `cf_run_finished` and `cf_runs_all_finished` modules. These modules handle several tasks, such as cleaning useless warning messages from log files. E-mails contain the contents of all log files, plus a section at the top of highlighted messages, specified within log messages by being prefixed with `###CF`.
Cluster Flow can automatically check for new versions. If an update is
available, it will print a notification each time you run a job. You can
specify how often Cluster Flow should check for updates with this parameter.
The syntax is a number followed by `d`, `w`, `m` or `y` for days, weeks, months or years. Cluster Flow will check for an update at runtime if this period or more has elapsed since you last ran it. You can disable update checks and alerts by setting `@check_updates 0` in your `~/clusterflow.config` file.
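For example, to check at most once a week:
@check_updates 1w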
You can manually get Cluster Flow to check for updates by running
cf --check_updates
Many modules can have their default behaviour modified through the use of Cluster Flow `--params`. These are described below. See the documentation about Module Parameters for more information about how to specify these options.
blacklistFile
Use to define a blacklist file (overrides any set as a genome reference).
cf --params blacklistFile="/path/to/file"
pbat
Use the Bismark `--pbat` flag.
cf --params pbat
unmapped
Save the unmapped reads to a file (Bismark `--unmapped` flag).
cf --params unmapped
bt1
Align with Bowtie1 instead of Bowtie2 (default).
cf --params bt1
single_cell
Use the `--non_directional` Bismark flag.
cf --params single_cell
subsample
Only align the first 1000000 reads.
cf --params subsample
mirna
Use alignment parameters suitable for miRNA alignment against miRBase references, instead of the standard Bowtie1 command. Uses the `-n 0 -l 15 -e 99999 -k 200` bowtie flags instead of the default `-m 1 --strata`.
cf --params mirna
regex
Override any merge regex set in the Cluster Flow configuration and use this instead.
cf --params regex="/REMOVE_([KEEP]+).fastq.gz/"
fragmentLength
Set the fragment length to use for bamCoverage, instead of taking from the phantompeaktools cross correlation analysis or using the default (200).
cf --params fragmentLength=120
fastq_screen_config
Use a specific FastQ Screen config file (with the `--conf` FastQ Screen flag).
cf --params fastq_screen_config="/path/to/config"
nogroup
Use the `--nogroup` option with FastQC to prevent automatic grouping of base pair positions in plots. You can end up with some very large plots if you have long reads!
cf --params nogroup
stranded
Set the `-s 1` flag for featureCounts.
cf --params stranded
stranded_rev
Set the `-s 2` flag for featureCounts.
cf --params stranded_rev
id_tag
Specify the tag to use for counting in the GTF file. If not specified, the module tries to guess by looking for a field called `gene_id` or `ID`.
cf --params id_tag="Gene"
longest
The longest fragment to accept (HiCUP parameter `--longest`). Default: 800
cf --params longest=900
shortest
The shortest fragment to accept (HiCUP parameter `--shortest`). Default: 100
cf --params shortest=50
re1
The restriction enzyme recognition pattern to use. Default: "A^AGCTT,HindIII"
cf --params re1="A^GATCT,BglII"
stranded
Set the `-s yes` flag for HTSeq Counts. Default is to set `-s no`.
cf --params stranded
stranded_rev
Set the `-s reverse` flag for HTSeq Counts. Default is to set `-s no`.
cf --params stranded_rev
id_tag
Specify the tag to use for counting in the GTF file. If not specified, the module tries to guess by looking for a field called `gene_id` or `ID`.
cf --params id_tag="Gene"
estFragmentLength
Specify the estimated fragment length (Kallisto `--fragment-length` option). Default: 200.
cf --params estFragmentLength=300
est_sd
Specify the fragment length standard deviation (Kallisto `--sd` option). Default: 20.
cf --params est_sd=30
template
Specify the MultiQC template to use. Default: default
cf --params template=geo
keep_intermediate
Do not delete the R files used to generate the PDF figures. Useful when running downstream tools such as MultiQC, that use these intermediate files.
cf --params keep_intermediate
byname
Sort by name instead of position (`-n` flag).
cf --params byname
forcesort
Don't skip the sorting step, even if the file already seems to be sorted.
cf --params forcesort
LoadAndRemove
Load and remove genome index (--genomeLoad LoadAndRemove
). Default: NoSharedMemory
.
cf --params LoadAndRemove
LoadAndKeep
Load and keep genome index (--genomeLoad LoadAndKeep
). Default: NoSharedMemory
.
cf --params LoadAndKeep
outSAMattributes
Specify SAM attributes (--outSAMattributes [attr]
). Default: Standard
.
cf --params outSAMattributes="attr"
min_readlength
Minimum read length for trimming to run. If the first file in each run group has reads less than this length, trimming will be skipped. Default: 50
cf --params min_readlength=30
force_trim
Force TrimGalore! to run, even if reads are below minimum read length.
cf --params force_trim
q_cutoff
Specify quality for trimming low-quality ends from reads in addition to adapter removal. Default Phred score: 20.
cf --params q_cutoff=10
stringency
Number of bases of overlap with adapter sequence required to trim a sequence. Default: 1
cf --params stringency=3
adapter
Specify an adapter sequence to trim. Default: Auto-detect (Illumina universal, Nextera transposase or Illumina small RNA adapter).
cf --params adapter=ATACAGCTAGCAGTAC
RRBS
Specifies that the input file was an MspI digested RRBS sample.
cf --params RRBS
nofastqc
Do not run FastQC after trimming is complete.
cf --params nofastqc
To remove a custom number of bases from reads after adapter removal, the following parameters can be set:
cf --params clip_r1=<int>
cf --params clip_r2=<int>
cf --params three_prime_clip_r1=<int>
cf --params three_prime_clip_r2=<int>
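For example, to remove six bases from the 5' end of both reads in a paired-end bisulfite run (pipeline and genome names as used earlier in these docs):
cf --genome GRCm38 --params clip_r1=6 --params clip_r2=6 fastq_bismark *.fastq.gz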
The following params are presets which are easier to remember and use:

Preset | Equivalent trimming parameters |
---|---|
cf --params trim=<int> | clip_r1=<int> clip_r2=<int> |
cf --params pbat | clip_r1 6, clip_r2 6 |
cf --params ATAC | clip_r1 4, clip_r2 4 |
cf --params single_cell | clip_r1 9, clip_r2 9 |
cf --params epignome | clip_r1 7, clip_r2 7, three_prime_clip_r1 7, three_prime_clip_r2 7 |
cf --params accel | clip_r1 10, clip_r2 15, three_prime_clip_r1 10, three_prime_clip_r2 10 |
cf --params cegx | clip_r1 6, clip_r2 6, three_prime_clip_r1 2, three_prime_clip_r2 2 |
All pipelines conform to a standard syntax. The name of the pipeline is given by the filename, which should end in `.config`. The top of the file should contain a title and description surrounded by `/*` and `*/`. Variables can be set using the same `@key value` syntax as in `clusterflow.config` files.
Modules are described using `#` prefixes. Tab indentation denotes dependencies between modules. Syntax is `#module_name parameters`, where there can be any number of space-separated parameters which will be passed on to the module at run time.
Here is an example pipeline, which requires a genome path and uses three modules:
/*
Example Pipeline
================
This pipeline is an example of running three modules which depend on
each other. Module 2 is run with a parameter that modifies its behaviour.
This block of text is used when cf --help example_pipeline is run
*/
#module1
#module2
#module2 parameter
#module3
Remember to run `dos2unix` on your pipeline before you run it if you're working on a Windows machine.
Cluster Flow works by creating `.run` files for each batch of input files. These are a copy of the pipeline file, with filenames appended for each step of the pipeline. These files are used by subsequent steps in the pipeline to know which input files to use.
Inspecting run files is a quick way to see exactly what analysis was done in a directory.
Modules are the heart of Cluster Flow. Each module is a wrapper around a single bioinformatics tool. Each module has three modes of operation: reporting the resources it needs, running the tool within a cluster job, and printing help text.
Modules are executed using system commands, so can be written in any language. However, most existing modules are written in Perl.
Module filenames must be in the format `<module_name>.cfmod.<extension>`, e.g. `mymod.cfmod.pl`. They can be stored in the following locations (chosen in this order of preference):
~/.clusterflow/modules/
<installation_dir>/modules/
An example module comes bundled with Cluster Flow, containing some highly commented pseudocode which you can modify for your own uses. You can see it in your `modules` directory:
example_module.pl
If you have an existing script or tool, it's tempting to try to convert it into a Cluster Flow module. However, I recommend keeping it as a standalone script and creating a Cluster Flow module to launch it instead. In our experience, this is much easier. It also has the advantage that your script can still be run outside Cluster Flow.
At the top of every Cluster Flow module is a hash that defines the resources needed by the tool. It looks something like this:
my %requirements = (
'cores' => $cores,
'memory' => $mem,
'modules' => $modules,
'references'=> $refs,
'time' => $time
);
Each of these variables can be specified as a string, an array specifying a range of appropriate values, or a subroutine to calculate a value based on information specific to the run.
The number of required cores can be specified either as a string or an array.
If your tool always uses a fixed number of CPUs (for example, 1 if it's not multi-threaded), just specify that number in quotes (`'cores' => '1'`). If your tool can be sped up by using multiple CPUs, you can specify a minimum and maximum number in an array (`'cores' => ['3','8']`). Cluster Flow will then allocate a number within that range according to how many jobs are being created in parallel. This way, jobs will run as fast as possible for a handful of files, but not overwhelm the cluster if many are being run at once.
Memory works just like cores, above: either specify a string or an array with a minimum and maximum amount. Numbers with no suffix will be interpreted as bytes; you can use `K`, `M` and `G` suffixes to specify kilobytes, megabytes and gigabytes (`'memory' => '8G'`).
In some cases it can be useful to use a subroutine to dynamically calculate the required memory. For example, you could inspect the filesize of a fasta genome reference to determine the required memory:
'memory' => sub {
my $cf = $_[0];
if (defined($cf->{'refs'}{'fasta'}) && -e $cf->{'refs'}{'fasta'}) {
# Multiply the reference filesize (in bytes) by 1.2
my $mem_usage = int(1.2 * -s $cf->{'refs'}{'fasta'});
return CF::Helpers::bytes_to_human_readable($mem_usage);
} else {
# Sensible default
return '8G';
}
},
A string or array of strings describing environment modules that should be loaded.
Try to keep this as generic as possible. People can specify specific versions or naming in personal config files using `@environment_module_alias`.
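For example, a module wrapping an aligner might request its tools like this (the module names are illustrative; users can map them to local names with `@environment_module_alias`):
'modules' => ['bowtie2', 'samtools'],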
Genome reference and annotation is labelled with a field to describe its type. If a reference is required, you should specify its type here. This prevents Cluster Flow from being launched if the reference genome is not specified.
For example, the bowtie2 module specifies `'references' => 'bowtie2'`; the featureCounts module specifies `'references' => 'gtf'`.
Some HPC clusters require a time limit to be specified when launching jobs. Here you should predict approximately how long your module should run.
Some modules will always take a fixed amount of time to run, in which case this can be specified as a string. For ten minutes, specify `'time' => '10'`.
The execution time for most modules will depend on how many input files they are processing. Modules often run with multiple sets of input files. To cope with this, supply a subroutine to this variable which can flexibly request an amount of time according to how many input files will be processed.
The helper function `minutes_to_timestamp` is useful here: it takes a number of minutes and returns a properly formatted timestamp (see below for more information about helper functions).
If a module typically takes three hours to run, it could request it as follows:
'time' => sub {
my $cf = $_[0];
my $num_files = $cf->{'num_starting_merged_aligned_files'};
return CF::Helpers::minutes_to_timestamp ($num_files * 3 * 60);
}
The `$cf` variable is a hash containing information about the job. See below for a description of the keys available.
Remember to be conservative - high time requests can delay queue priority, but low time requests will result in job failure.
Cluster Flow can request help text from a module if called with `cf --help <module_name>`. You should write some text describing what the module does, including any parameters or customisation available.
my $helptext = "".("-"x15)."\n My awesome module\n".("-"x15)."\n
This module is brilliant and worked first time because the author
read all of the Cluster Flow documentation! What a hero!\n\n";
Once the requirements hash and help text are written, we call a core helper function called `module_start`. If the module is being called to request resource requirements or help, the function will exit. If it is being executed in a cluster job, it will return a hash with useful information such as the input filenames.
Requirements should be passed to the function as a reference:
my %cf = CF::Helpers::module_start(\%requirements, $helptext);
The returned hash contains the following keys: (NB: Not all of these are available in request subroutines)
%cf = {
refs = '<hash>', # Reference annotation for the specified genome. Keys are the reference type, values are the path to the annotation.
prev_job_files = '<array>', # File names resulting from preceding job.
starting_files = '<array>', # File names for the initial files that this thread of the pipeline was started with.
files = '<hash>', # Hash of arrays with all files from this pipeline thread. Keys are the module job IDs, values are arrays of output files.
cores = '<int>', # The number of cores allocated to the module.
memory = '<str>', # The amount of memory allocated to the module.
params = '<hash>', # A hash of key: value pairs. Value is `True` if only a flag.
config = '<hash>', # Hash containing arbitrary key: value configuration pairs from the run file. Always contains hash with key `notifications`.
num_starting_files = '<int>', # Number of files that this thread of the pipeline started with.
num_starting_merged_files = '<int>', # Number of files that this thread of the pipeline started with, after merging if matched merge regex.
num_starting_merged_aligned_files = '<int>', # Guess at number of files after alignment, based on whether pipeline is running in paired end mode or not.
pipeline_id = '<str>', # The unique Cluster Flow ID of this pipeline. Useful for generating filenames.
pipeline_name = '<str>', # The name of the pipeline that was launched.
pipeline_started = '<int>', # A unix timestamp of when the pipeline was started.
job_id = '<str>', # The unique Cluster Flow ID for this job.
prev_job_id = '<str>', # The unique Cluster Flow ID for the previous job in the pipeline.
run_fn = '<str>', # The filename of the run file for this thread of the pipeline.
run_fns = '<array>', # All run file filenames for this pipeline (summary modules only).
modname = '<str>', # Name of this module
mod_fn = '<str>', # Filename of this module
}
Although not necessary, most modules that use genome references do a sanity check to make sure that they have what they need after this point. For example, the STAR module checks that it has the required reference:
# Check that we have a genome defined
if(!defined($cf{'refs'}{'star'})){
die "\n\n###CF Error: No star path found in run file $cf{run_fn} for job $cf{job_id}. Exiting..";
} else {
warn "\nAligning against $cf{refs}{star}\n\n";
}
Again not necessary, but good practice - modules log the version of software that they're about to run for future reference:
warn "---------- < module > version information ----------\n";
warn `MY_COMMAND --version`;
warn "\n------- End of < module > version information ------\n";
Modules are able to customise the way that they run depending on the presence of custom parameters at run time. These are used for a range of reasons, such as customising bowtie alignments for miRNA data, changing trimming settings depending on library preparation type, and many others. You can basically use them however you like, though you'll find many modules doing this sort of thing:
my $extra_flag = (defined($cf{'params'}{'myflag'})) ? '--extra_flag' : '';
my $specific_var = '';
if(defined($cf{'params'}{'myvar'})){
$specific_var = '--myvar '.$cf{'params'}{'myvar'};
}
# ..later..
$cmd = "mycommand --always $extra_flag $specific_var"
Each part of the pipeline has a `.run` file, used by the modules to track the configuration options and output filenames as the pipeline progresses.
Your module will need to open this run file in append mode so that it can add the file names of any output that it creates.
open (RUN,'>>',$cf{'run_fn'}) or die "###CF Error: Can't write to $cf{run_fn}: $!";
Once you have everything ready, you'll want to actually run your tool. Remember that modules typically run with a collection of input files, so you will need to loop through these and process them in sequence.
How you do this looping depends on what input your tool expects. If your tool takes a single file and doesn't care whether it's paired end or single, you can simply loop through all files from the previous job:
foreach my $file (@{$cf{'prev_job_files'}}){
# process $file
}
Most preprocessing and alignment tools need either one single end FastQ file or two paired end FastQ files. To handle this, you can use the `is_paired_end` helper function to separate the input files into single end and paired end:
my ($se_files, $pe_files) = CF::Helpers::is_paired_end(\%cf, @{$cf{'prev_job_files'}});
These files can then be looped over in separate loops:
# Go through each single end file and run Bowtie
if($se_files && scalar(@$se_files) > 0){
foreach my $file (@$se_files){
# process $file
}
}
if($pe_files && scalar(@$pe_files) > 0){
foreach my $files_ref (@$pe_files){
my @files = @$files_ref;
if(scalar(@files) == 2){
# process $files[0] and $files[1]
} else {
warn "\n###CF Error! Bowtie paired end files had ".scalar(@files)." input files instead of 2\n";
}
}
}
Typically, Cluster Flow modules build a system command in a string. This is then printed to STDERR with the `###CFCMD` prefix. This is picked up by Cluster Flow and added to the summary HTML report and e-mail.
my $command = "my_command -i $file -o $output_fn";
warn "\n###CFCMD $command\n\n";
Once built, the command should be executed using the Perl `system` function. This returns the exit code once complete, which can be checked to see whether the module has worked or not (0 is success, which evaluates to false):
if(!system ($command)){
# command worked
} else {
# Command returned a non-zero result, probably went wrong...
warn "\n###CF Error! Example module (SE mode) failed for input file '$file': $? $!\n\n";
}
If your command ran successfully, you should have created a new output file. This should be added to the `.run` file along with the current job ID, so that it can be used by subsequent modules in the pipeline:
if(-e $output_fn){
print RUN "$cf{job_id}\t$output_fn\n";
} else {
warn "\n###CF Error! Example module output file $output_fn not found..\n";
}
A run file is created by Cluster Flow for each batch of files. It describes variables to be used, the pipeline specified and the filenames used by each module. The syntax of variables and pipeline is described in Pipeline syntax.
File names are described by a job identifier followed by a tab then a filename. Each module is provided with its own job ID and the ID of the job that was run previously. By using these identifiers, the module can read which input files to use and write out the resulting filenames to the run file when complete. Example run file syntax:
first_job_938 filename_1.txt
first_job_938 filename_2.txt
second_job_375 filename_1_processed.txt
second_job_375 filename_2_processed.txt
There can be any number of extra parameters; these are specific to the module and are specified in the pipeline configuration.
Any `STDOUT` or `STDERR` that your module produces will be written to the Cluster Flow log file. At the end of each run and pipeline, an e-mail will be sent to the submitter with details of the run results (if specified by the config settings). Because the log file can be very long, Cluster Flow pulls out any lines starting with `###CF`. Typically, such a line should be printed when a module finishes, with a concise summary of whether it worked or not.
Messages including the word `Error` will be highlighted and cause the final e-mail to have warning colours. The configuration options `@log_highlight_string` and `@log_warning_string` can customise this reporting.
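For example, a module might finish by printing a status line such as this (the wording is illustrative):
warn "\n###CF Example module successfully exited for file '$file'\n";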
Modules should print the command that they are going to run to STDERR so that it is recorded in the log file. These lines are also sent in the e-mail notification and should start with `###CFCMD`.
It's likely that your cluster will continue to fire off the dependent jobs as soon as the parent jobs finish, irrespective of their output. If a module fails, the cleanest way to exit is with a success code, but without printing any resulting output filename. The subsequent modules will then not find their input filenames and should immediately exit.
Cluster Flow modules are expected to respond to the following command line flags:
Flag | Step | Description |
---|---|---|
--requirements | 1 | Request the cluster resources needed by the module |
--run_fn | 2 | Path to the Cluster Flow run file(s) for this pipeline |
--job_id | 2 | Cluster job ID for this job |
--prev_job_id | 2 | Cluster job ID for the previous job |
--cores <int> | 2 | Number of cores allocated to the module |
--mem <str> | 2 | Amount of memory allocated to the module |
--param <str> | 2 | Extra parameters to be used |
--help | 3 | Print module help |
The step number refers to whether the module is being executed:

1. When Cluster Flow is working out what cluster resources to request, before any jobs are submitted
2. When the module is running within a cluster job
3. When cf --help <modname> is specified.

If your module is written in Perl, there are some common Cluster Flow packages (Perl modules) that you can use to provide some pre-written functions.
There are currently three packages available to Cluster Flow modules. `Helpers` contains subroutines of general use for most modules. `Constants` and `Headnodehelpers` contain subroutines primarily for use in the main `cf` script. You can include the Helpers package by adding the following to the top of your module file:
use FindBin qw($RealBin);
use lib "$FindBin::RealBin/../source";
use CF::Helpers;
We use the `FindBin` package to add the binary directory to the path (where `cf` is executing from).
Note that there is a Python version of the Helpers script which contains many of the same functions and works in a comparable way.
Handles the initiation of all modules. See above for a description of use.
Parses `.run` files. Called by `module_start()` and not usually run directly.
Used to load environment modules into the PATH. Typically used by the main `cf` script, though occasionally used elsewhere for special occasions.
This function takes an array of file names and returns an array of single end files and an array of arrays of paired end files.
First, it checks the configuration set in the `.run` file. If `@force_paired_end` is set, it sorts the files from the last job into pairs and returns them. If `@force_single_end` is set, it returns all previous files as single end. If neither variable is set, it sorts the files alphabetically, then removes any occurrence of `_[1-4]` from the filenames and compares the list. Identical pairs are returned as paired end.
my ($se_files, $pe_files) = CF::Helpers::is_paired_end(@$files);
foreach my $file (@$se_files){
print "$file is single end\n";
}
foreach my $files_ref (@$pe_files){
my @files = @$files_ref;
print $files[0]." and ".$files[1]." are paired end.\n";
}
Looks at BAM/SAM file headers and tries to determine whether the file was generated using paired end or single end input files. The function reads through the first 1000 reads of the file and counts how many `0x1` flags it finds (denoting a paired read). If >= 800 of those first 1000 reads are paired end, it returns `true`.
if(CF::Helpers::is_bam_paired_end($file)){
## do something with paired end BAM
} else {
## do something with single end BAM
}
Scans a FastQ file and tries to determine the encoding type. Returns the strings `integer`, `solexa`, `phred33`, `phred64`, or `0` if there are too few reads to safely determine. This is done by observing the minimum and maximum quality scores. For more details, see the Wikipedia page on FastQ encoding.
($encoding) = CF::Helpers::fastq_encoding_type($file);
Scans the first 100000 reads of a FastQ file and returns the longest read length that it finds.
my $min_length = CF::Helpers::fastq_min_length($file);
Takes time in seconds as an input and returns a human readable string.
The optional second `$long` variable determines whether to use `h`/`m`/`s` (`0`, false) or `hours`/`minutes`/`seconds` (`1`, true, the default).
my $time = CF::Helpers::parse_seconds($seconds, $long);
Functions to convert between a SLURM / HPC style timestamp and minutes. `timestamp_to_minutes` tries several common timestamp formats in turn:
my $minutes = CF::Helpers::timestamp_to_minutes($timestamp);
my $timestamp = CF::Helpers::minutes_to_timestamp($minutes);
Two functions which convert between human readable memory strings (e.g. `4G` or `100M`) and bytes.
my $bytes = CF::Helpers::human_readable_to_bytes('3G');
my $size = CF::Helpers::bytes_to_human_readable('7728742');
Takes a memory string and returns a number of megabytes, rounding up to the nearest MB.
Takes the suggested number of cores to use, a minimum and maximum number and returns a sensible result.
my $cores = CF::Helpers::allocate_cores($recommended, $min, $max);
Takes the suggested amount of memory to use, a minimum and maximum amount, and returns a sensible result. Input can be human readable strings or bytes. Returns a value in bytes.
my $mem = CF::Helpers::allocate_memory($recommended, $min, $max);
Function to properly compare software version numbers. It correctly reports that `v0.10` is greater than `v0.9`.
If you come across a strange looking error message or find a bug, please do let us know. You can submit new issues here: https://github.com/ewels/clusterflow/issues
If you'd like Cluster Flow to do something it doesn't, log a request! The issue tracker system mentioned above can be used for enhancement requests too.
If you don't want to set up a GitHub account, feel free to drop the author an e-mail at phil.ewels@scilifelab.se
A number of errors can be caused by scripts not having executable file privileges. You can see the file permissions with `ls -l`; you should see something like this:
$ ls -l clusterflow/modules/
total 608
-rwxrwxr-x 1 phil phil 6770 May 20 16:40 bismark_align.cfmod
-rwxrwxr-x 1 phil phil 4291 May 16 12:54 bismark_deduplicate.cfmod
-rwxrwxr-x 1 phil phil 2748 May 16 12:54 bismark_messy.cfmod
-rwxrwxr-x 1 phil phil 5652 May 16 12:54 bismark_methXtract.cfmod
-rwxrwxr-x 1 phil phil 3553 May 16 12:54 bismark_tidy.cfmod
-rwxrwxr-x 1 phil phil 7119 May 16 12:54 bowtie1.cfmod
This example is for the modules directory (all modules should have executable privileges for all users); the same applies to the main `cf` file.
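If the executable bit is missing, something like the following should fix it (adjust the paths to your installation; the glob is illustrative):
chmod a+x clusterflow/cf clusterflow/modules/*.cfmod*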
If you've edited any files, you may get problems due to Windows-based editors putting DOS-style `\r` carriage returns in. Most Linux environments come with a package called `dos2unix` which will clean these up:
dos2unix *
This error probably means that Cluster Flow isn't installed in your environment module system, and you're trying to run `module load clusterflow`. You can skip this step if you have another way of accessing the `cf` file, or see the Installation Instructions for details about how to set Cluster Flow up with environment modules.
This message means that your GRIDEngine setup doesn't have the default `orte` environment set up. If you have different environments set up, you can list them with:
You can get the details of any environment with:
qconf -sp [name]
Look for one which assigns slots to a single node (`allocation_rule` should be `$fill_up`). Once you've found your named environment, set `@cluster_queue_environment` in your Cluster Flow config file.
(Thanks to Simon Andrews for help with this)
There may be other differences in the job submission requests that cause them to fail. If you see errors such as this, you can use the `@custom_job_submit_command` configuration variable to customise the way that jobs are requested.