Complete Guide: Building a Reproducible RNA-seq Pipeline with Nextflow

Nextflow’s dataflow paradigm makes it ideal for bioinformatics workflows. Let’s build an RNA-seq quality-control pipeline, with detailed explanations of each component.

1. Pipeline Architecture

Our pipeline will follow this structure:

my_pipeline/
├── main.nf        # Workflow logic
├── nextflow.config # System configuration
└── data/          # Input FASTQs (create this)

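If you are starting from scratch, the layout can be scaffolded in a Unix shell (paths here match the tree above; adjust as needed):

# Create the skeleton
mkdir -p my_pipeline/data
cd my_pipeline
touch main.nf nextflow.config
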
2. Understanding the Main Workflow (main.nf)

// Define input parameters (the _{1,2} glob lets fromFilePairs group reads by sample)
params.reads = "data/*_{1,2}.fastq.gz"

// Process definition
process FastQC {
  tag "FASTQC $sample_id" // Log identifier
  publishDir "results/fastqc", mode: 'copy' // Output directory
  
  input:
  tuple val(sample_id), path(read) // Structured input
  
  output:
  path "*_fastqc.*" // Capture all FastQC outputs
  
  script:
  """
  fastqc -q $read // -q for quiet mode
  """
}

// Workflow definition
workflow {
  // Create input channel of [sample_id, [reads]] tuples
  samples = Channel.fromFilePairs(params.reads)

  // Execute process
  FastQC(samples)
}

// Optional: completion hook (declared outside the workflow block)
workflow.onComplete { log.info "Pipeline completed" }
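
One optional tweak: FastQC supports multithreading via its -t flag, which can be wired to the task’s CPU allocation. A hedged variant of the script block (assumes a cpus directive or config sets task.cpus above 1):

script:
"""
fastqc -q -t ${task.cpus} ${reads}
"""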

Key Improvements:

  • Used fromFilePairs for paired-end readiness (see the channel sketch below)
  • Added tuple input with sample identifiers
  • Included execution hooks for monitoring
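
To see what fromFilePairs actually emits, you can preview the channel with view(). A minimal sketch, assuming files named like sampleA_1.fastq.gz / sampleA_2.fastq.gz:

// Preview the [sample_id, [read1, read2]] tuples
Channel
  .fromFilePairs("data/*_{1,2}.fastq.gz")
  .view() // e.g. [sampleA, [data/sampleA_1.fastq.gz, data/sampleA_2.fastq.gz]]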

3. Configuration Deep Dive (nextflow.config)

profiles {
  docker {
    docker.enabled = true
    process.container = 'staphb/fastqc:0.11.9'
  }
  singularity {
    singularity.enabled = true
    singularity.autoMounts = true
    process.container = 'staphb/fastqc:0.11.9' // Singularity can pull Docker images
  }
}

// Default parameters
params {
  max_memory = '8.GB'
  max_cpus = 4
  max_time = '2.h'
}

// Execution policy
executor {
  queueSize = 100
}

Why This Matters:

  • Multiple containerization options via profiles
  • Resource caps prevent overconsumption once wired into process directives (sketch below)
  • Queue management for large datasets
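
On their own, the max_* params are just values; they only constrain jobs once referenced by process directives. A minimal sketch to place after the params block in nextflow.config (applies the caps to every process; per-process overrides via withName selectors are also possible):

// Apply the parameter caps as default process resources
process {
  cpus   = params.max_cpus
  memory = params.max_memory
  time   = params.max_time
}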

4. Execution with Advanced Options

# Test run with 2 cores
nextflow run main.nf -profile docker --max_cpus 2

# Resume after interruption
nextflow run main.nf -resume

# Inspect task-level details for a previous run
nextflow log $run_id -f name,status,duration

Pro Tips:

  • Use -resume to continue failed runs
  • Monitor resources with -with-report (example below)
  • Test with -entry for complex workflows
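
For example, a run that also writes an HTML execution report and timeline (the file names here are arbitrary):

# Generate report artifacts alongside the run
nextflow run main.nf -profile docker -with-report report.html -with-timeline timeline.html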

5. Extending the Pipeline

process TrimGalore {
  container 'quay.io/biocontainers/trim-galore:0.6.7--0'

  input:
  tuple val(id), path(reads) // Paired FASTQ files per sample

  output:
  tuple val(id), path("*val*.fq.gz"), emit: trimmed // Trimmed ("validated") reads

  script:
  """
  trim_galore --paired ${reads} -o .
  """
}

// Connect processes
workflow {
  raw_data = Channel.fromFilePairs(params.reads)
  TrimGalore(raw_data)
  FastQC(TrimGalore.out.trimmed) // Use the named emit declared above
}

Added Value:

  • Chained quality control steps
  • Demonstrated process communication via named emits (debugging sketch below)
  • Showed Biocontainer integration
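
When chaining processes, it helps to inspect the intermediate channel while wiring things up. A debugging sketch you can drop into the workflow block after the TrimGalore call (remove once verified):

// Print each tuple flowing out of TrimGalore
TrimGalore.out.trimmed.view { id, files -> "Trimmed $id -> $files" }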

Essential Resources:
NF-Core Pipelines · Bioconda Packages · BioContainers Registry

This setup provides reproducible execution, resource management, and a clear pathway for expansion. Always validate with test datasets before production use!
