Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
In the output directory of a given run of this pipeline, there will be a subdirectory for each sample in the samplesheet where the directory name is the sample name (first field in the samplesheet).
For example, if you are running mammals data and your samplesheet has two samples, like so:
Then the output directory would have the following structure:
If you are running yeast data, then the samplesheet would look the same, except fastq_2 must be present. The output directory will be structured similarly, except that within each sample subdirectory, there will be multiple qbeds/qc files for each of the TFs included in the multiplexed library. Mammals libraries, on the other hand, are not multiplexed and contain data for only 1 TF.
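As a rough illustration of the per-sample layout described above, the following Python sketch lists the sample subdirectories one would expect for a given samplesheet. The function name `expected_sample_dirs`, the header names and the fastq file names are assumptions made for the example. Only the rule that the first field holds the sample name comes from this document.

```python
import csv
import io
from pathlib import Path


def expected_sample_dirs(samplesheet_text, outdir):
    """Return one expected subdirectory path per unique sample name.

    Assumes a comma-separated samplesheet with a header row, where the
    first field of each data row is the sample name.
    """
    reader = csv.reader(io.StringIO(samplesheet_text))
    next(reader, None)  # skip the header row
    names = []
    for row in reader:
        if row and row[0] not in names:
            names.append(row[0])
    return [Path(outdir) / name for name in names]


# Hypothetical two-sample samplesheet; the file names are made up.
example = "sample,fastq_1\nsampleA,a_R1.fastq.gz\nsampleB,b_R1.fastq.gz\n"
dirs = expected_sample_dirs(example, "results")
print([d.name for d in dirs])
```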
For either datatype, you can include the parameters --save_genome_intermediate, --save_sequence_intermediate and/or --save_alignment_intermediate to save the intermediate files generated at each processing stage.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Prepare the Genome
- Prepare the Reads
- FastQC
- Divide Fastq
- UMITools Extract (mammals only)
- Demultiplex (yeast only)
- Concat Fastqs (yeast only)
- Concat QC (for mammals, this occurs in the Align step)
- Trimmomatic
- Align
- Count Hops
- Present QC data
- Pipeline information
Mask Genome
This step produces no output outside the work-dir. However, if a bed file is provided to the regions_mask parameter, it will be used to mask the genome with bedtools maskfasta. The masked genome will be used for the rest of the pipeline.
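To make the effect of masking concrete, here is a minimal Python sketch of hard-masking: every base inside a BED interval (0-based, half-open) is replaced with N, which is the default behaviour of bedtools maskfasta. This is not how the pipeline performs the step (it calls bedtools). The function name `mask_sequence` and the toy genome are assumptions for illustration.

```python
def mask_sequence(sequence, regions):
    """Replace bases in 0-based, half-open (start, end) intervals with 'N'."""
    masked = list(sequence)
    for start, end in regions:
        # Clamp the interval to the sequence bounds before masking.
        start = max(start, 0)
        end = min(end, len(masked))
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


# Toy genome and mask regions keyed by chromosome name (made-up data).
genome = {"chr1": "ACGTACGTAC", "chr2": "TTTTGGGG"}
bed_regions = {"chr1": [(2, 5)]}

masked_genome = {
    chrom: mask_sequence(seq, bed_regions.get(chrom, []))
    for chrom, seq in genome.items()
}
print(masked_genome)
```

Chromosomes with no entry in the mask are left unchanged.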
Concatenate Additional Sequences
This step produces no output outside the work-dir. However, if a fasta file is provided to the additional_sequence parameter, it will be appended to the (possibly masked) genome and used for all subsequent steps.
GTF2BED
Translation of the gtf file to a bed file. This is only output if save_genome_intermediate is true.
Output files
genome/gtf2bed
<gtf_name>.bed: GTF file in bed format
Aligner Index
The index produced by the aligner of your choice. This will only be saved if save_genome_intermediate is true.
Output files
genome/<aligner>
<aligner index output>: The output of the aligner index command
Genome Index
Samtools index of the fasta file (after appending any additional sequences).
Output files
genome/samtools
<genome>.fa[sta].fai: Genome index (includes additional sequences, if they exist).
FastQC
Output files
fastqc/
*_fastqc.html: FastQC report containing quality metrics.
*_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.
NB: The yeast workflow runs FastQC both before and after processing (including demultiplexing) the reads. The mammalian workflow runs FastQC only on unprocessed reads.
Divide Fastq
seqkit split2 is used to divide each of the samples’ fastq files into parts for parallel processing. For the yeast pipeline, these parts are concatenated back together by their respective TFs, while for mammals the divided parts are aligned and processed in parallel.
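The idea of dividing reads into parts can be pictured with the short Python sketch below, which distributes records across a fixed number of parts in round-robin order. It is only a conceptual illustration and does not claim to reproduce seqkit's own partitioning scheme. The function name `split_records` and the example reads are assumptions.

```python
def split_records(records, n_parts):
    """Distribute records across n_parts lists in round-robin order."""
    parts = [[] for _ in range(n_parts)]
    for index, record in enumerate(records):
        parts[index % n_parts].append(record)
    return parts


# Hypothetical FASTQ records as (header, sequence, quality) tuples.
records = [(f"@read{i}", "ACGT", "IIII") for i in range(5)]
parts = split_records(records, 2)
print([len(part) for part in parts])
```

Each part can then be processed independently, and no record is lost or duplicated when the parts are combined again.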
UMItools Extract
Output files
umitools/
*.fastq.gz: The input fastq files with the specified pattern extracted and placed in the header. For example, for single-end reads where the r1_bc_pattern is NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN, the result of this step would turn a FASTQ record like this:
to this:
where the barcode immediately follows the first space-delimited value in the ID line, joined to it by an _.
*.log: Log file generated by the UMI-tools extract command.
UMItools extract is used to extract barcode and other CallingCards specific sequences from single or paired-end reads.
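The header rewrite described above can be sketched in a few lines of Python. This is a simplified picture for a pattern made up entirely of N characters, not UMI-tools' implementation. The function name `extract_barcode`, the 4-base pattern and the example read are all assumptions for illustration.

```python
def extract_barcode(header, seq, qual, pattern):
    """Move the leading len(pattern) bases of a read into its header.

    The barcode is joined with '_' to the first space-delimited value of
    the header, and the same number of positions is removed from both
    the sequence and the quality string.
    """
    n = len(pattern)
    barcode = seq[:n]
    read_id, sep, remainder = header.partition(" ")
    new_header = f"{read_id}_{barcode}{sep}{remainder}"
    return new_header, seq[n:], qual[n:]


# Hypothetical single-end read with a 4-base pattern.
new_header, new_seq, new_qual = extract_barcode(
    "@read1 1:N:0", "ACGTTTGGCA", "IIIIIIIIII", "NNNN"
)
print(new_header, new_seq, new_qual)
```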
Demultiplex
This step is only run for the yeast pipeline. It calls the callingCardsTools utility parse fastq and uses the barcode_details json file to demultiplex the reads according to the TF barcodes. This step will typically be performed on split fastq files (for parallel processing) and is not output unless save_sequence_intermediate is set.
Concat Fastq
This step is only run for the yeast pipeline. It concatenates the splits of each demultiplexed TF fastq file so that there is a single fastq file for each TF for each sample.
Concat QC
This is performed in both the yeast and mammals workflows, but at different points. For yeast, this is performed in the Prepare Reads subworkflow while in mammals it is performed after alignment in the Process Alignment subworkflow.
Output files
sequence/<sample>
fastqcdemux: The trimmed fastq file
fastqcraw: Log file generated by Trimmomatic
*_r1_primer_summary: Given a correct R1 barcode, this file describes how many reads go to each R2 barcode, tallied by edit distance. So, the entry MET31,TGATA,11,4,*,1 means that for the MET31 R1 sequence TGATA, with an R1 transposon sequence at an edit distance of 11, an R2 TF barcode sequence at an edit distance of 4, and an unknown restriction enzyme, there was 1 read.
*_r2_transposon_summary.csv: Similar to above, but rather than tallying by correct R1 sequences, this tallies by correct R2 TF barcode sequences.
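For readers who want to work with these summary files programmatically, the sketch below parses a single entry of the kind shown above. The field names in `SummaryEntry` are assumptions inferred from the description of the example entry, not column names confirmed by the pipeline. Treating `*` as "unknown" likewise follows the wording of the example.

```python
from typing import NamedTuple, Optional


class SummaryEntry(NamedTuple):
    tf: str
    r1_sequence: str
    r1_transposon_edit_dist: int
    r2_barcode_edit_dist: int
    restriction_enzyme: Optional[str]  # None when recorded as '*'
    read_count: int


def parse_entry(line):
    """Parse one comma-separated summary entry into a SummaryEntry."""
    tf, r1_seq, r1_dist, r2_dist, enzyme, count = line.strip().split(",")
    return SummaryEntry(
        tf=tf,
        r1_sequence=r1_seq,
        r1_transposon_edit_dist=int(r1_dist),
        r2_barcode_edit_dist=int(r2_dist),
        restriction_enzyme=None if enzyme == "*" else enzyme,
        read_count=int(count),
    )


entry = parse_entry("MET31,TGATA,11,4,*,1")
print(entry)
```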
Trimmomatic
Output files
trimmomatic/
*.fastq.gz: The trimmed fastq file
*.log: Log file generated by Trimmomatic
Trimmomatic is used to trim reads after extracting known non-genomic Calling Cards specific sequence.
Align
The user may select any one of the following:
If save_intermediate is set, and the selected aligner generates an index, the index will be saved at the main level of the outdir in a directory called genome. In each sample subdirectory, there will also be a directory named after the selected aligner, which will store the aligner’s output.
Output files
genome/
<ALIGNER>: this will store the selected aligner’s index, if save_intermediate is set.
<ALIGNER>/
<SAMPLE>.sorted.bam: Coordinate sorted alignment file generated by the user-selected aligner
<SAMPLE>.sorted.bam.bai: the BAI index for the alignment file
Count Hops
Output files
hops/
*_passing.bam/.bai: the subset of alignments which are considered passing, countable hops, and its BAI index
*_failing.bam/.bai: the subset of alignments which are considered uncountable
*.qbed: A qBed format file which quantifies the number of hops at a given coordinate in the genome
*_summary.tsv: A tally of the number of reads which passed and failed by failure status
*_barcode_qc.tsv: (mammals only) A tally of the number of reads by barcode components
*_srt_count.tsv: (mammals only) A tally of reads with a single and multi SRT sequence
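To illustrate what "quantifying hops at a given coordinate" means, the Python sketch below tallies insertion sites and writes one tab-separated row per site. The row layout used here (chromosome, start, end, depth, strand) is an assumption about the qBed columns, and the function name `tally_hops` and the example coordinates are made up. This is not the pipeline's counting code.

```python
from collections import Counter


def tally_hops(insertions):
    """Count identical (chrom, position, strand) insertion sites.

    Returns tab-separated rows of chrom, start, end, depth and strand,
    sorted by chromosome, position and strand.
    """
    counts = Counter(insertions)
    rows = []
    for (chrom, pos, strand), depth in sorted(counts.items()):
        rows.append(f"{chrom}\t{pos}\t{pos + 1}\t{depth}\t{strand}")
    return rows


# Hypothetical insertion sites taken from passing alignments.
passing = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr2", 50, "-")]
rows = tally_hops(passing)
print("\n".join(rows))
```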
Alignment Metrics
Picard, RSeQC and Samtools provide alignment level quality control metrics. Qualifying alignments are counted as ‘hops’ of a given transcription factor and those hops are quantified in a qBed file. Hop level QC metrics are also generated.
Picard
Picard CollectMultipleMetrics gathers multiple QC metrics from alignment files. These are computed on the post-processed alignment files, after they have been partitioned into passing and failing alignments.
Output files
hops/picard/
*.alignment_summary_metrics
*.base_distribution_by_cycle_metrics/.pdf
*.quality_by_cycle_metrics/.pdf
*.quality_distribution_metrics/.pdf
*.read_length_histogram.pdf
SAMtools
Output files
alignment/
<SAMPLE>.sorted.bam.bai: the BAI index for the alignment file
hops/samtools/
- SAMtools <SAMPLE>.sorted.bam.flagstat, <SAMPLE>.sorted.bam.idxstats and <SAMPLE>.sorted.bam.stats files generated from the alignment files.
The original BAM files generated by the selected alignment algorithm are further processed with SAMtools to sort them by coordinate, for indexing, as well as to generate read mapping statistics.
RSeQC
RSeQC is a package of scripts designed to evaluate the quality of RNA-seq data. This pipeline runs several, but not all, RSeQC scripts. You can choose which of the supported scripts to run by adjusting the --rseqc_modules parameter, which by default will run all of the following: bam_stat.py, inner_distance.py, infer_experiment.py, junction_annotation.py, junction_saturation.py, read_distribution.py and read_duplication.py.
The majority of RSeQC scripts generate output files which can be plotted and summarised in the MultiQC report.
Infer experiment
Output files
hops/rseqc
*.infer_experiment.txt: File containing fraction of reads mapping to given strandedness configurations.
A number of RSeQC modules are available. Only the read distribution
RSeQC documentation: infer_experiment.py
Read distribution
Output files
<ALIGNER>/rseqc/read_distribution/
*.read_distribution.txt: File containing fraction of reads mapping to genome features, e.g. CDS exon, 5’ UTR exon, 3’ UTR exon, Intron, Intergenic regions etc.
This tool calculates how mapped reads are distributed over genomic features. A good result for a standard RNA-seq experiment is generally to have as many exonic reads as possible (CDS_Exons). A large number of intronic reads could be indicative of DNA contamination in your sample but may be expected for a total RNA preparation.
RSeQC documentation: read_distribution.py
BAM stat
Output files
<ALIGNER>/rseqc/bam_stat/
*.bam_stat.txt: Mapping statistics for the BAM file.
This script gives numerous statistics about the aligned BAM files.
MultiQC plots each of these statistics in a dot plot. Each sample in the project is a dot - hover to see the sample highlighted across all fields.
RSeQC documentation: bam_stat.py
MultiQC
Output files
multiqc/
multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
multiqc_plots/: directory containing static images from the report in various formats.
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Pipeline information
Output files
pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.