Count RNA-Seq Reads Using Biopipeline Designer
This example shows how to build a bioinformatics pipeline to count the number of reads mapped to genes identified by Cufflinks using a sample paired-end RNA-Seq data for chromosome 4 of the Drosophila genome.
Open Biopipeline Designer App
At the MATLAB® command line, enter:
biopipelineDesigner
Select Data Files
The example uses chromosome 4 of the Drosophila genome as a reference (Dmel_chr4.fa
). It also uses a sample paired-end data provided with the toolbox (SRR6008575_10k_1.fq
and SRR6008575_10k_2.fq
). Use a FileChooser block for each of these files to load the data into the app.
Create FileChooser Block for Reference Sequence
In the Block Library pane of the app, scroll down to the General section. Drag and drop a FileChooser block onto the diagram.
Run the following command at the MATLAB command line to create a variable that contains the full file path to the provided reference sequence.
refSeq = which("Dmel_chr4.fa");
In the app, click the FileChooser block. In the Pipeline Inspector pane, under FileChooser Properties, click the vertical three-dot menu next to the Files property. Select Assign from workspace.
Select refSeq from the list. Click OK.
Create FileChooser Blocks for Paired-End Data
There are two sample files (SRR6008575_10k_1.fq
and SRR6008575_10k_2.fq
) provided with the toolbox that contain RNA-Seq data for pair-end reads. You need to create a FileChooser block for each file.
First, run the following commands at the MATLAB command line to create two variables that contain the full file path to the provided files.
reads1 = which("SRR6008575_10k_1.fq"); reads2 = which("SRR6008575_10k_2.fq");
In the app, drag and drop two FileChooser blocks. Click FileChooser_2 and set its Files property to the reads1
variable and reads2
for FileChooser_3 (following the similar steps you did for the reference sequence refSeq
previously).
Build Reference Genome Indices
Generate indices for the reference genome files before aligning the reads to it. Use a Bowtie2Build block to build such indices.
From the Block Library pane, under the Alignment section, drag and drop the Bowtie2Build block onto the diagram.
Connect FileChooser_1 to Bowtie2Build_1 blocks as shown next. To connect, place your pointer at the output port of the first block and drag (an arrow) towards the input port of the second block.
Align Reads Using Bowtie2
Use a Bowtie2 block as an aligner to map reads from two sample files (FileChooser_2 and FileChooser_3) against the reference sequence.
Drag and drop a Bowtie2 block from the Block Library pane. The IndexBaseName input port of the block takes in the base name of the index files, which is the output of the previous (or upstream) Bowtie2Build_1 block. The Reads1Files and Reads2Files input ports takes in the first mate and second mate reads, respectively. The IndexBaseName and Reads1Files input ports are required and must be connected, as indicated by solid circles. The Reads2File port is an optional port, indicated by a dotted circle, and you use it only when you have paired-end data, such as in this example. Connect these three blocks as shown next. The Bowtie2 Options section of the Pipeline Inspector pane lists all the available alignment options. For details on each options, see Bowtie2AlignOptions
.
Sort SAM Files
The next step is to use the SamSort block, which sorts the alignment records by the reference sequence name first and then by position within the reference. Sorting is needed because you will use the Cufflinks block next to assemble transcriptomes based on the aligned reads, and the block requires sorted SAM files as inputs.
Drag and drop a SamSort block from the Sequence Utilities section of Block Library onto the diagram. Then connect the Bowtie2_1 and SamSort_1 blocks.
Assemble Reads into Transcriptomes
Create a GTF (Gene Transfer Format) file from the aligned data (SAM files) to quantify transcript expression. Use the Cufflinks block to assemble the sorted SAM files into GTF files, which contains information on gene features, including the start and end positions of transcripts.
Drag and drop a Cufflinks block from the Analysis section of Block Library onto the diagram. Then connect the SamSort_1 and Cufflinks_1 blocks.
Count Reads from Paired-End Data
Use the FeatureCount block to count the number of reads in the sorted SAM file that map onto genomic features in the GTF file (TranscriptsGTFFile) generated by the Cufflinks block. Speciflcally, you will count the number of reads mapped to genes identified by Cufflinks.
Drag and drop a FeatureCount block from the Analysis section of Block Library. Then connect the three blocks (SamSort_1, Cufflinks_1, and FeatureCount_1) as shown next:
Plot Read Counts
As the last step in this pipeline, create a custom function that plots read count results for cufflinks-identified genes.
Go back to the MATLAB desktop. On the Home tab, click New Script. An untitled file opens in the Editor. Copy and paste the following code in the file that defines a custom function called plotCounts
. The function generates two plots. The first plot contains the read counts of each gene identified by Cufflinks. The second plot shows the genomic locations of these counts.
function plotCounts(fcCountsTable,cufflinksGenesFPKMFile) genesFPKMTable = readtable(cufflinksGenesFPKMFile,FileType="text"); % Plot counts of genes identified by Cufflinks. figure geneNames = categorical(fcCountsTable.ID,fcCountsTable.ID); stem(geneNames, log2(fcCountsTable.Aligned_sorted)) xlabel("Cufflinks-identified genes") ylabel("log2 counts") % Plot counts along their respective genomic positions. geneStart = str2double(extractBetween(genesFPKMTable.locus,":","-")); figure stem(geneStart,log2(fcCountsTable.Aligned_sorted)) xlabel("Drosophila Chromosome 4 Genomic Position") ylabel("log2 counts") end
Save the file as plotCounts.m
in the current directory.
Create UserFunction Block to Represent Custom Function
A UserFunction block can represent any existing or custom function and can be used as a block in your pipeline.
Drag and drop a UserFunction block from Block Library. In the Pipeline Inspector pane, under UserFunction Properties, update:
RequiredArguments to
CountsTable,GenesFPKMFile
Function to
plotCounts
The UserFunction_1 block is then updated with two input ports named after the values of RequiredArguments.
Connect three blocks (Cufflinks_1, FeatureCount_1, and UserFunction_1) as shown next.
Run Pipeline
Click Run on the Home tab of the app. The app generates the following two figures.
The first figure shows the log2 counts of each gene identified by Cufflinks.
The second figure shows the individual genomic locations of these counts.
See Also
bioinfo.pipeline.Pipeline
| bioinfo.pipeline.block.Cufflinks
| bioinfo.pipeline.block.Bowtie2
| bioinfo.pipeline.block.Bowtie2Build
| bioinfo.pipeline.block.FeatureCount
| bioinfo.pipeline.block.SamSort
| bioinfo.pipeline.block.FileChooser
| bioinfo.pipeline.block.UserFunction