Task Setup

Task Setup Overview

After running with --mode initiate, CUT&RUN-Flow copies the task configuration template into your current working directory. For a full example of this file, see Task nextflow.config.
# Configure:
$ <vim/nano...> ./my_task/nextflow.config   # Task Input, Steps, etc. Configuration
Task-level inputs such as input files and reference fasta files must be configured here (see Input File Setup). Additional task-specific settings are also configured here, such as output read naming rules and output file locations (see Output Setup).

Note

These settings are provided for user customization, but in the majority of cases the defaults should work well.

Many pipeline settings can justifiably be configured either on a task-specific basis (in Task nextflow.config) or as defaults for the pipeline (in Pipe nextflow.config). These include Nextflow "executor" settings for use of SLURM and Environment Modules, along with associated settings such as memory and CPU usage. These settings are described below in Executor Setup, but can also be set in the Pipe nextflow.config.

Likewise, settings for individual pipeline components, such as Trimmomatic tag-trimming parameters or the q-value (qval) used for MACS2 peak calling, can be provided in either config file, or both (for a description of these parameters, see Workflow).

Note

If any setting is provided in both the Task nextflow.config file above and the Pipe nextflow.config file located in the pipe directory, the task-directory setting takes precedence. For more information on Nextflow configuration precedence, see config.
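
For example, if the same setting appears in both files, the value from the task directory is the one used at runtime. An illustrative sketch using the process.memory setting:

// In the Pipe nextflow.config (pipe directory):
process.memory = '4 GB'

// In the Task nextflow.config (task directory):
process.memory = '16 GB'   // This value takes precedence for the task.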

Reference Files Setup

CUT&RUN-Flow handles reference database preparation with a series of steps utilizing --mode prep_fasta. The location of the fasta used for preparation is provided to the --ref_fasta (params.ref_fasta) parameter as either a file path or URL.
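
For example, the reference fasta can be supplied in the task configuration file as follows (the path shown is a hypothetical placeholder; a URL works equally well):

params {
    // Reference genome fasta, provided as a local path or URL (placeholder shown)
    ref_fasta = "${launchDir}/genome/my_genome.fa.gz"
}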

Reference preparation is then performed using:

$ nextflow run CnR-flow --mode prep_fasta

This will place the prepared reference files in the directory specified by --refs_dir (params.refs_dir) (see Output Setup). Once prepared, the reference name and location can be detected automatically during pipeline execution, depending on the value of the --ref_mode (params.ref_mode) parameter.

Ref Modes:
  • 'fasta' : Get reference name from --ref_fasta (params.ref_fasta) (which must then be set)

  • 'name' : Get reference name from --ref_name (params.ref_name) (which must then be set)

  • 'manual' : Set required parameters manually (see the example sketch after this list):

Ref Required Manual Parameters:
  • --ref_name (params.ref_name) : Reference Name

  • --ref_bt2db_path (params.ref_bt2db_path) : Path to the reference Bowtie2 alignment database

  • --ref_chrom_sizes_path (params.ref_chrom_sizes_path) : Path to <reference>.chrom_sizes file

  • --ref_eff_genome_size (params.ref_eff_genome_size) : Effective genome size for reference.
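
As a hedged sketch, a task configuration using the 'manual' mode might look like the following (the reference name, paths, and genome size below are placeholders, not values required by the pipeline):

params {
    ref_mode             = 'manual'
    ref_name             = 'hg38'                                       // Hypothetical reference name
    ref_bt2db_path       = "${launchDir}/cnr_references/hg38/hg38"      // Hypothetical Bowtie2 index path
    ref_chrom_sizes_path = "${launchDir}/cnr_references/hg38/hg38.chrom_sizes"
    ref_eff_genome_size  = 2913022398                                   // Example effective genome size (hg38)
}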

The --ref_mode (params.ref_mode) parameter also applies to the preparation and location of the fasta used for the normalization reference if --do_norm (params.do_norm) is enabled. These parameters are named in parallel using a norm_[ref...] prefix, and are autodetected from the value of --norm_ref_fasta (params.norm_ref_fasta) or --norm_ref_name (params.norm_ref_name), depending on the value of --ref_mode (params.ref_mode). For details on normalization steps, see Normalization Steps.
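
For example, in 'fasta' mode the normalization (spike-in) reference might be configured alongside the primary reference (the paths below are illustrative placeholders):

params {
    do_norm        = true                                   // Enable spike-in normalization
    ref_fasta      = "${launchDir}/genome/my_genome.fa.gz"   // Hypothetical primary reference fasta
    norm_ref_fasta = "${launchDir}/genome/spike_in.fa.gz"    // Hypothetical normalization reference fasta
}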

Input File Setup

Two mutually exclusive options are provided for supplying input sample fastq[.gz] files to the workflow.

Single Sample Group:
A single group of samples, with zero or one control sample (after optional lane combination) shared by all treatment samples (a filled-in example is given after the warning below).
  • --treat_fastqs (params.treat_fastqs)

  • --ctrl_fastqs (params.ctrl_fastqs)

params {
// CnR-flow Input Files:
//   Provided fastqs must be in glob pattern matching pairs.
//     Example: ['./relpath/to/base*R{1,2}*.fastq']
//     Example: ['/abs/path/to/other*R{1,2}*.fastq']

treat_fastqs   = []    // REQUIRED, Single-group Treatment fastq Pattern
ctrl_fastqs    = []    //           Single-group Control   fastq pattern

}

Note

For convenience, if the same file is provided as both a treatment and a control, the copy passed to treatment is ignored (this facilitates simple pattern matching).

Warning

Input files must be paired-end and in fastq[.gz] format. Nextflow requires the (odd-looking) R{1,2} naming construct, which matches either R1 or R2 and ensures that files are fed into the pipeline as pairs.
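
As a concrete (hypothetical) illustration of the single-group layout, two treatment replicates and one control could be supplied as:

params {
    // Hypothetical single-group layout: two treatment replicates, one control
    treat_fastqs = ['./fastq/H3K4me3_rep*_R{1,2}*.fastq.gz']
    ctrl_fastqs  = ['./fastq/IgG_ctrl*_R{1,2}*.fastq.gz']
}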

Multiple Sample Group:
A multi-group layout, in which samples are provided in groups and each group has its own control sample. (Every group must have a control sample in this mode.)
  • params.fastq_groups

    params {
    // Can specify multiple treat/control groups as Groovy mapping.
    //   Specified INSTEAD of treat_fastqs/ctrl_fastqs parameters.
    //   Note: There should be only one control sample per group 
    //     (after optional lane combination)
    // Example:
    // fastq_groups = [
    //   'group_1_name': ['treat': 'relpath/to/treat1*R{1,2}*',
    //                    'ctrl':  'relpath/to/ctrl1*R{1,2}*'
    //                   ],
    //   'group_2_name': ['treat': ['relpath/to/g2_treat1*R{1,2}*',
    //                              '/abs/path/to/g2_treat2*R{1,2}*'
    //                             ],
    //                    'ctrl':  'relpath/to/g2_ctrl1*R{1,2}*'
    //                   ]
    // ]
    //fastq_groups = []    
    
    }
    

Multiple pairs of files representing the same sample/replicate that were sequenced on different lanes can be automatically recognized and combined (default: true). For more information see: MergeFastqs.
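
For instance (an illustrative sketch; see MergeFastqs for the exact matching behavior), a single glob spanning the lane field will match both lanes of a sample, and the resulting pairs are combined when lane merging is enabled:

params {
    // Hypothetical files: sampleA_L001_R1.fastq.gz, sampleA_L001_R2.fastq.gz,
    //                     sampleA_L002_R1.fastq.gz, sampleA_L002_R2.fastq.gz
    // One pattern matches both lanes; with lane merging enabled (the default),
    // the two pairs are combined into a single 'sampleA' sample.
    treat_fastqs = ['./fastq/sampleA_L00*_R{1,2}*.fastq.gz']
}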

Executor Setup

Nextflow provides extensive options for using cluster-based job schedulers such as SLURM, PBS, etc. These options are worth reviewing in the Nextflow docs: executor. The specific executor is selected with the configuration setting process.executor = 'option'. The default value, process.executor = 'local', runs processes on the local machine.

Specific settings of note:

Option                    Example
----------------------    -----------
process.executor          'slurm'
process.memory            '4 GB'
process.cpus              4
process.time              '1h'
process.clusterOptions    '--qos=low'
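
A hedged example combining the settings above for a SLURM cluster (the memory, CPU, time, and QOS values are placeholders to be adjusted for your site):

process {
    executor       = 'slurm'
    memory         = '4 GB'
    cpus           = 4
    time           = '1h'
    clusterOptions = '--qos=low'   // Passed through to the scheduler; value is site-specific
}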

To facilitate process efficiency (and provide adequate capacity) for different parts of the pipeline, memory-related process labels have been applied to the processes: 'small_mem', 'norm_mem', and 'big_mem'. These are specified within the process scope using withLabel: my_label { key = value }, for example: withLabel: big_mem { memory = '16 GB' }.
A 1n/2n/4n or 1n/2n/8n strategy is recommended for the respective small_mem/norm_mem/big_mem labels (for details on Nextflow process labels, see process). Additionally, multiple-CPU usage is disabled for processes that do not support multiple threads (or do not benefit significantly from them), so the process.cpus setting only applies to the processes within the pipeline that have multiple CPUs enabled.
// Process Settings (For use of PBS, SLURM, etc)
process {
    // --Executor, see: https://www.nextflow.io/docs/latest/executor.html 
    //executor = 'slurm'  // for running processes using SLURM (Default: 'local')
    // Process Walltime, See https://www.nextflow.io/docs/latest/process.html#process-time
    //time = '12h'
    // Process CPUs, See https://www.nextflow.io/docs/latest/process.html#cpus
    //cpus = 8
    // 
    // Memory: See https://www.nextflow.io/docs/latest/process.html#process-memory
    // Set Memory for specific task sizes (1n/2n/4n scheme recommended)
    //withLabel: big_mem   { memory = '16 GB' }
    //withLabel: norm_mem  { memory = '4 GB'  }
    //withLabel: small_mem { memory = '2 GB'  }
    // -*OR*- Set Memory for all processes
    //memory = "16 GB"

    ext.ph = null //Placeholder to prevent errors.
}

Output Setup

Output options can control the quantity, naming, and location of output files from the pipeline.

publish_files:

Three modes are available for selecting which files are output by the pipeline:

  • minimal : Only the final alignments are output. (Trimmed fastqs are excluded.)

  • default : Multiple types of alignments are output. (Trimmed fastqs are included.)

  • all : All files produced by the pipeline (excluding deleted intermediates) are output.

This option is selected with --publish_files (params.publish_files).

publish_mode:

This option selects the value used for the Nextflow publishDir mode when outputting files (for details, see: publishDir). Available options are:

  • 'copy' : Copy output files (from the nextflow working directory) to the output folder.

  • 'symlink' : Link to the output files located in the nextflow working directory.

trim_name_prefix & trim_name_suffix:

params.trim_name_prefix & params.trim_name_suffix : Trim a prefix or suffix from sample names (after any merging steps).
out_dir:

--out_dir (params.out_dir) : Location for pipeline output files.

refs_dir:

--refs_dir (params.refs_dir) : Location for placing and searching for reference directories.

params {
    // ------- General Pipeline Output Parameters --------
    publish_files    = 'default' // Options: ["minimal", "default", "all"]
    publish_mode     = 'copy'    // Options: ["symlink", "copy"]

    // Name trim guide: ( regex-based )
    //    ~/groovy-slashy-string/  ;  "~" denotes groovy pattern type.
    //    ~/^/ matches beginning   ;   ~/$/ matches end    
    trim_name_prefix = ''        // Example: ~/^myprefix./ removes "myprefix." prefix.
    trim_name_suffix = ''        // Example: ~/_mysuffix$/ removes "_mysuffix" suffix.   

    // Workflow Output Default Naming Scheme:
    //   Absolute paths for output:
    out_dir          = "${launchDir}/cnr_output"
    refs_dir         = "${launchDir}/cnr_references"
}