Complete Guide: Building a Reproducible RNA-seq Pipeline with Nextflow

Nextflow’s dataflow paradigm makes it ideal for bioinformatics workflows. Let’s build an RNA-seq quality-control pipeline, with detailed explanations of each component.

1. Pipeline Architecture

Our pipeline will follow this structure:

my_pipeline/
├── main.nf        # Workflow logic
├── nextflow.config # System configuration
└── data/          # Input FASTQs (create this)

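If you are starting from scratch, the layout can be scaffolded in a Unix shell (paths here match the tree above; adjust as needed):

# Create the skeleton
mkdir -p my_pipeline/data
cd my_pipeline
touch main.nf nextflow.config
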
2. Understanding the Main Workflow (main.nf)

// Define input parameters (the _{1,2} glob lets fromFilePairs group reads by sample)
params.reads = "data/*_{1,2}.fastq.gz"

// Process definition
process FastQC {
  tag "FASTQC $sample_id" // Log identifier
  publishDir "results/fastqc", mode: 'copy' // Output directory
  
  input:
  tuple val(sample_id), path(read) // Structured input
  
  output:
  path "*_fastqc.*" // Capture all FastQC outputs
  
  script:
  """
  fastqc -q $read // -q for quiet mode
  """
}

// Workflow definition
workflow {
  // Create input channel of [sample_id, [reads]] tuples
  samples = Channel.fromFilePairs(params.reads)

  // Execute process
  FastQC(samples)
}

// Optional: completion hook (declared outside the workflow block)
workflow.onComplete { log.info "Pipeline completed" }
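
One optional tweak: FastQC supports multithreading via its -t flag, which can be wired to the task’s CPU allocation. A hedged variant of the script block (assumes a cpus directive or config sets task.cpus above 1):

script:
"""
fastqc -q -t ${task.cpus} ${reads}
"""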

Key Improvements:

  • Used fromFilePairs for paired-end readiness (see the channel sketch below)
  • Added tuple input with sample identifiers
  • Included execution hooks for monitoring
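
To see what fromFilePairs actually emits, you can preview the channel with view(). A minimal sketch, assuming files named like sampleA_1.fastq.gz / sampleA_2.fastq.gz:

// Preview the [sample_id, [read1, read2]] tuples
Channel
  .fromFilePairs("data/*_{1,2}.fastq.gz")
  .view() // e.g. [sampleA, [data/sampleA_1.fastq.gz, data/sampleA_2.fastq.gz]]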

3. Configuration Deep Dive (nextflow.config)

profiles {
  docker {
    docker.enabled = true
    process.container = 'staphb/fastqc:0.11.9'
  }
  singularity {
    singularity.enabled = true
    singularity.autoMounts = true
    process.container = 'staphb/fastqc:0.11.9' // Singularity can pull Docker images
  }
}

// Default parameters
params {
  max_memory = '8.GB'
  max_cpus = 4
  max_time = '2.h'
}

// Execution policy
executor {
  queueSize = 100
}

Why This Matters:

  • Multiple containerization options via profiles
  • Resource caps prevent overconsumption once wired into process directives (sketch below)
  • Queue management for large datasets
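
On their own, the max_* params are just values; they only constrain jobs once referenced by process directives. A minimal sketch to place after the params block in nextflow.config (applies the caps to every process; per-process overrides via withName selectors are also possible):

// Apply the parameter caps as default process resources
process {
  cpus   = params.max_cpus
  memory = params.max_memory
  time   = params.max_time
}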

4. Execution with Advanced Options

# Test run with 2 cores
nextflow run main.nf -profile docker --max_cpus 2

# Resume after interruption
nextflow run main.nf -resume

# Inspect task-level details for a previous run
nextflow log $run_id -f name,status,duration

Pro Tips:

  • Use -resume to continue failed runs
  • Monitor resources with -with-report (example below)
  • Test with -entry for complex workflows
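
For example, a run that also writes an HTML execution report and timeline (the file names here are arbitrary):

# Generate report artifacts alongside the run
nextflow run main.nf -profile docker -with-report report.html -with-timeline timeline.html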

5. Extending the Pipeline

process TrimGalore {
  container 'quay.io/biocontainers/trim-galore:0.6.7--0'

  input:
  tuple val(id), path(reads) // Paired FASTQ files per sample

  output:
  tuple val(id), path("*val*.fq.gz"), emit: trimmed // Trimmed ("validated") reads

  script:
  """
  trim_galore --paired ${reads} -o .
  """
}

// Connect processes
workflow {
  raw_data = Channel.fromFilePairs(params.reads)
  TrimGalore(raw_data)
  FastQC(TrimGalore.out.trimmed) // Use the named emit declared above
}

Added Value:

  • Chained quality control steps
  • Demonstrated process communication via named emits (debugging sketch below)
  • Showed Biocontainer integration
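
When chaining processes, it helps to inspect the intermediate channel while wiring things up. A debugging sketch you can drop into the workflow block after the TrimGalore call (remove once verified):

// Print each tuple flowing out of TrimGalore
TrimGalore.out.trimmed.view { id, files -> "Trimmed $id -> $files" }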

Essential Resources:
NF-Core Pipelines · Bioconda Packages · BioContainers Registry

This setup provides reproducible execution, resource management, and a clear pathway for expansion. Always validate with test datasets before production use!
