
Bioinformatics Pipeline SplitDimension

Some of the blocks in a bioinformatics pipeline operate on their input data arrays as one single input while other blocks can operate on individual elements or slices of the input data array independently. The SplitDimension property of a block input controls how to split the block input data (or input array) across multiple runs of the same block in a pipeline. In other words, SplitDimension allows you to control how to parallelize independent runs of the same block (with a different input for each run).

Specify SplitDimension to Select Which Input Array Dimensions to Split

You can specify a vector of integers to indicate which dimensions (such as the row or column dimension) of the input array to split and pass to the block run method. By splitting the input data, you specify how many times the same block runs, each time with a different input.

For example, the bioinfo.pipeline.block.SeqTrim block can apply the same trimming operation on an array of input FASTQ files. To specify that SeqTrim runs on each input file in the array independently, set the SplitDimension property of the block input to a specific dimension (such as 1 for the row dimension or 2 for the column dimension of the array).

You can also specify an empty array [] to perform no dimension splitting of the input data, that is, the block runs one time for all of the input data. Alternatively, specify "all" to pass each element of the input array to the run method of the block independently. For instance, if there are n elements, the block runs n times independently.
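The three kinds of values described above (a vector of dimensions, an empty array, and "all") can be illustrated with a small Python model. This is only a sketch for reasoning about run counts with a single input; it is not toolbox code, and the function names resolve_split_dims and runs_for_one_input are hypothetical.

```python
from math import prod

def resolve_split_dims(spec, ndims):
    """Return the 1-based dimensions to split for a SplitDimension-style value.

    spec is [] (no splitting), "all" (every dimension), one integer,
    or a list of integers.
    """
    if spec == "all":
        return list(range(1, ndims + 1))
    if isinstance(spec, int):
        return [spec]
    return list(spec)

def runs_for_one_input(size, spec):
    """Model the number of runs for a block with a single input of the given size."""
    dims = resolve_split_dims(spec, len(size))
    # A dimension beyond the array's own dimensions has size 1.
    return prod(size[d - 1] if d <= len(size) else 1 for d in dims)

# A 2-by-3 array of input files under different settings.
for spec in ([], 1, 2, [1, 2], "all"):
    print(spec, runs_for_one_input((2, 3), spec))
```

In this model, a 2-by-3 array gives one run for [], two runs when split along dimension 1, three runs when split along dimension 2, and six runs for [1 2] or "all", matching the statement that n elements produce n independent runs.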

For an example of how to use SplitDimension, see Split Input SAM Files and Assemble Transcriptomes Using Bioinformatics Pipeline.

Note

If you are running the Bioinformatics Toolbox Software Support Packages (such as Bowtie2, BWA, or Cufflinks) remotely, ensure that these support packages are installed on the remote clusters where you are running the pipeline.

Provide Compatible Array Sizes

A block can have different split dimensions for each input (port), but inputs that share a split dimension must have compatible sizes in that dimension. As with binary operations on MATLAB arrays, two inputs have compatible sizes for a dimension if their sizes in that dimension are the same or one of them is 1. An input whose size is 1 (or that is scalar) in a split dimension is implicitly expanded along that dimension to match the size of the other inputs. For MATLAB® arrays, dimension one refers to the number of rows and dimension two refers to the number of columns.
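The compatibility rule for one shared split dimension can be written down as a short Python sketch. It is a model of the rule as stated here, not toolbox code, and the helper name expanded_size is hypothetical.

```python
def expanded_size(sizes):
    """Return the common size along one shared split dimension.

    sizes holds the size of each input in that dimension. Sizes are
    compatible when they are all equal, ignoring sizes of 1, which
    expand to match. Returns None when the sizes are incompatible.
    """
    non_singleton = {s for s in sizes if s != 1}
    if len(non_singleton) > 1:
        return None  # e.g. 2 versus 3: a dimension mismatch
    return non_singleton.pop() if non_singleton else 1

print(expanded_size([5, 1]))  # size 1 expands to 5
print(expanded_size([3, 3]))  # equal sizes are compatible
print(expanded_size([2, 3]))  # incompatible
```

The same rule governs implicit expansion in MATLAB binary operations, which is why a 5-by-1 input and a 1-by-1 input are compatible in dimension 1 while a 2-by-2 input and a 3-by-3 input are not.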

The total number of times the block runs within a pipeline is the product of the sizes of the input values in the split dimensions, after implicit expansion. For example, consider a block with two input ports X and Y. The following table shows the total number of runs (or processes) for various values of SplitDimension.

| X array size | Y array size | X.SplitDimension | Y.SplitDimension | Total number of runs |
|---|---|---|---|---|
| 1-by-1 | 2-by-2 | [] | [] | 1⨉1 = 1. This is the default (no dimensional splitting). |
| 1-by-1 | 2-by-3 | [] | 1 | 2⨉1 = 2 |
| 5-by-1 | 1-by-3 | 1 | 2 | 5⨉3 = 15 |
| 2-by-2 | 3-by-3 | 2 | 2 | 0 because of dimension mismatch |
| 2-by-3 | 2-by-4 | 2 | "all" | 0 because of dimension mismatch |
| 3-by-1-by-4 | 1-by-3 | "all" | 2 | 3⨉3⨉4 = 36 |
| 0-by-1 | 1-by-1 | [] | [] | 1⨉1 = 1 |
| 0-by-1 | 1-by-1 | 1 | [] | 0 because of size 0 in dimension 1 |

An input can have a size of 0 only in dimensions that are not split dimensions; a size of 0 in a split dimension results in zero runs. If no inputs specify a SplitDimension, there is always exactly one run, regardless of the input array sizes. You can merge the output results from multiple block runs with cell arrays. For details, see UniformOutput.
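As a way to check the rows of the table by hand, the run-count rule can be modeled in Python. This sketch is an assumption-laden model of the rule as described on this page, not toolbox code: the names split_dims, size_along, and total_runs are hypothetical, and a mismatch is reported as 0 runs only because that is how the table records it.

```python
def split_dims(spec, ndims):
    """Resolve a SplitDimension-style value to a set of 1-based dimensions."""
    if spec == "all":
        return set(range(1, ndims + 1))
    if isinstance(spec, int):
        return {spec}
    return set(spec)  # [] resolves to no split dimensions

def size_along(size, d):
    """Size of an array in dimension d; trailing dimensions have size 1."""
    return size[d - 1] if d <= len(size) else 1

def total_runs(inputs):
    """Model the total number of runs for a list of (size, spec) inputs."""
    resolved = [(size, split_dims(spec, len(size))) for size, spec in inputs]
    all_dims = set().union(*(dims for _, dims in resolved))
    runs = 1
    for d in sorted(all_dims):
        # Only inputs that split along d take part in the size check.
        sizes = {size_along(size, d) for size, dims in resolved if d in dims}
        sizes.discard(1)  # a size of 1 expands to match the others
        if len(sizes) > 1:
            return 0  # dimension mismatch, recorded as 0 runs in the table
        runs *= sizes.pop() if sizes else 1
    return runs

# Rows of the table: (X size, X spec), (Y size, Y spec).
rows = [
    (((1, 1), []), ((2, 2), [])),
    (((1, 1), []), ((2, 3), 1)),
    (((5, 1), 1), ((1, 3), 2)),
    (((2, 2), 2), ((3, 3), 2)),
    (((2, 3), 2), ((2, 4), "all")),
    (((3, 1, 4), "all"), ((1, 3), 2)),
    (((0, 1), []), ((1, 1), [])),
    (((0, 1), 1), ((1, 1), [])),
]
for x, y in rows:
    print(x, y, total_runs([x, y]))
```

Run against the eight rows, the model gives 1, 2, 15, 0, 0, 36, 1, and 0, in that order. The size of 0 in a split dimension needs no special case because it makes the product zero.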

Default Value of SplitDimension for Built-In Pipeline Blocks

Since R2026a

The default value of the SplitDimension property is "all", instead of being empty, for some input ports of built-in pipeline blocks when the expected use case for the blocks is to parallelize across all input data for those input ports.

The table below lists all the built-in blocks with their corresponding SplitDimension values. (The UserFunction and FileChooser blocks have no input ports.)

| Built-in Block | Input Port | SplitDimension Default Value |
|---|---|---|
| BLASTN | QueryFile | "all" |
| | BlastDatabase | [] |
| BLASTP | QueryFile | "all" |
| | BlastDatabase | [] |
| BLASTX | QueryFile | "all" |
| | BlastDatabase | [] |
| BamSort | BAMFile | "all" |
| Bowtie2 | IndexBaseName | [] |
| | Reads1Files | "all" |
| | Reads2Files | "all" |
| Bowtie2Build | ReferenceFASTAFiles | [] |
| | IndexBaseName | [] |
| BwaIndex | ReferenceFASTAFile | "all" |
| BwaMEM | IndexBaseName | [] |
| | Reads1File | "all" |
| | Reads2File | "all" |
| CuffCompare | GenomicAnnotationFiles | [] |
| CuffDiff | GenomicAnnotationFile | [] |
| | GenomicAlignmentFiles | [] |
| CuffMerge | GenomicAnnotationFiles | [] |
| CuffNorm | GenomicAnnotationFile | [] |
| | GenomicAlignmentFiles | [] |
| CuffQuant | GenomicAnnotationFile | [] |
| | GenomicAlignmentFiles | [] |
| Cufflinks | GenomicAlignmentFiles | "all" |
| FeatureCount | GTFFile | [] |
| | GenomicAlignmentFiles | [] |
| GenomicsViewer | Reference | [] |
| | Cytoband | [] |
| | Tracks | [] |
| Load | MatFile | [] |
| MakeBlastDatabase | InputFile | [] |
| SRAFasterqDump | SRRID | "all" |
| SRASAMDump | SRRID | "all" |
| SamSort | SAMFile | "all" |
| Save | Var1 | [] |
| SeqFilter | FASTQFiles | "all" |
| SeqSplit | FASTQFiles | "all" |
| | BarcodeFile | "all" |
| SeqTrim | FASTQFiles | "all" |
| TBLASTN | QueryFile | "all" |
| | BlastDatabase | [] |
| TBLASTX | QueryFile | "all" |
| | BlastDatabase | [] |

Show Split Dimensions in Biopipeline Designer

In Biopipeline Designer, you can see dedicated icons for the split dimension settings of the input ports of your pipeline blocks. To show or hide the icons, open the diagram context menu and select Show split dimension icons.

Diagram context menu and one of the pipeline blocks showing split dimension icons for each input port

Each input port shows one of three icons, which indicate the following:

  • Inputs to this port are not split along any dimension.

  • Inputs to this port are split along one dimension.

  • Inputs to this port are split along more than one dimension.
