Generate Data for Clustering

Generates 2D data for clustering.
Updated 26 Jan 2023

A MATLAB/Octave function which generates 2D data clusters. Data is created along straight lines, which can be more or less parallel depending on the selected input parameters.


[data, clustPoints, idx, centers, angles, lengths] = ...
    generateData(angleMean, angleStd, numClusts, xClustAvgSep, yClustAvgSep, ...
                 lengthMean, lengthStd, lateralStd, totalPoints, ...)

Input parameters

Required parameters

| Parameter | Description |
| --- | --- |
| angleMean | Mean angle in radians of the lines on which clusters are based. Angles are drawn from the normal distribution. |
| angleStd | Standard deviation of line angles. |
| numClusts | Number of clusters (and therefore of lines) to generate. |
| xClustAvgSep | Average separation of line centers along the X axis. |
| yClustAvgSep | Average separation of line centers along the Y axis. |
| lengthMean | Mean length of the lines on which clusters are based. Line lengths are drawn from the folded normal distribution. |
| lengthStd | Standard deviation of line lengths. |
| lateralStd | Cluster "fatness", i.e., the standard deviation of the distance from each point to its projection on the line. The way this distance is obtained is controlled by the optional 'pointOffset' parameter. |
| totalPoints | Total points in generated data. These will be randomly divided between clusters using the half-normal distribution with unit standard deviation. |
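
The sampling rules stated in the parameter descriptions can be sketched as follows. This is a minimal illustration of the named distributions, not the function's actual implementation. The variable `weights` and the rounding step that makes the cluster sizes sum to `totalPoints` are assumptions introduced for the example.

```matlab
% Sketch of the sampling rules described above (not the actual generateData code).
numClusts = 5;
angleMean = pi / 2;  angleStd = pi / 8;
lengthMean = 5;      lengthStd = 1;
totalPoints = 200;

% Line angles: normal distribution with the given mean and standard deviation.
angles = angleMean + angleStd * randn(numClusts, 1);

% Line lengths: folded normal, i.e. the absolute value of a normal draw.
lengths = abs(lengthMean + lengthStd * randn(numClusts, 1));

% Cluster sizes: half-normal weights (unit standard deviation), normalized
% and scaled to totalPoints, then rounded down.
weights = abs(randn(numClusts, 1));
clustPoints = floor(totalPoints * weights / sum(weights));

% ASSUMPTION: points lost to rounding down are handed to the largest cluster
% so that the sizes sum exactly to totalPoints.
[~, largest] = max(clustPoints);
clustPoints(largest) = clustPoints(largest) + (totalPoints - sum(clustPoints));
```

Note that this sketch can yield clusters with zero points, which the function only permits when the optional `allowEmpty` parameter is true.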

Optional named parameters

| Parameter name | Parameter values | Default value | Description |
| --- | --- | --- | --- |
| allowEmpty | true, false | false | Allow empty clusters? |
| pointDist | 'unif', 'norm' | 'unif' | Specifies the distribution of points along lines, with two possible values: 1) 'unif' distributes points uniformly along lines; or, 2) 'norm' distributes points along lines using a normal distribution (line center is the mean and the line length is equal to 3 standard deviations). |
| pointOffset | '1D', '2D' | '2D' | Controls how points are created from their projections on the lines, with two possible values: 1) '1D' places points on a second line perpendicular to the cluster line using a normal distribution centered at their intersection; or, 2) '2D' places points using a bivariate normal distribution centered at the point projection. |
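
The effect of 'pointDist' and 'pointOffset' on a single cluster line can be sketched as below. This is an illustration under stated assumptions rather than the function's own code. The names `theta`, `dirVec`, `perpVec`, `t`, `proj` and `pts` are hypothetical, and the standard deviation used for 'norm' (`len / 3`) is one literal reading of "the line length is equal to 3 standard deviations".

```matlab
% Sketch of the two optional parameters for one cluster line
% (illustrative only; not the actual generateData code).
center = [0, 0];  theta = pi / 4;  len = 10;  lateralStd = 1;  n = 500;
pointDist = 'norm';    % 'unif' or 'norm'
pointOffset = '1D';    % '1D' or '2D'

dirVec = [cos(theta), sin(theta)];     % unit vector along the line
perpVec = [-sin(theta), cos(theta)];   % unit vector perpendicular to it

% Signed distance of each point projection from the line center.
if strcmp(pointDist, 'unif')
    t = len * (rand(n, 1) - 0.5);      % uniform on [-len/2, len/2]
else
    t = (len / 3) * randn(n, 1);       % ASSUMPTION: std = len / 3
end
proj = repmat(center, n, 1) + t * dirVec;   % n x 2 projections on the line

% Offset each point from its projection.
if strcmp(pointOffset, '1D')
    % Normal offset along the perpendicular direction only.
    pts = proj + (lateralStd * randn(n, 1)) * perpVec;
else
    % Isotropic bivariate normal offset around the projection.
    pts = proj + lateralStd * randn(n, 2);
end
```

With '1D', every point lies exactly on the perpendicular through its projection, so the spread along the line is governed by 'pointDist' alone. With '2D', the offset also adds some spread along the line direction.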

Return values

| Value | Description |
| --- | --- |
| data | Matrix (totalPoints x 2) with the generated data. |
| clustPoints | Vector (numClusts x 1) containing number of points in each cluster. |
| idx | Vector (totalPoints x 1) containing the cluster indices of each point. |
| centers | Matrix (numClusts x 2) containing line centers from where clusters were generated. |
| angles | Vector (numClusts x 1) containing the effective angles of the lines used to generate clusters. |
| lengths | Vector (numClusts x 1) containing the effective lengths of the lines used to generate clusters. |
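
As a rough illustration of how these outputs relate to each other, the sketch below uses small stand-in values with the documented shapes (hypothetical data, not real output of generateData) to recount the points per cluster from `idx` and to pull out the points of one cluster. The names `counts` and `cluster3` are introduced only for this example.

```matlab
% Stand-in outputs with the documented shapes (hypothetical values; in real
% use these would come from generateData).
clustPoints = [3; 0; 2];                 % numClusts x 1
idx = [1; 1; 1; 3; 3];                   % totalPoints x 1
data = rand(numel(idx), 2);              % totalPoints x 2

numClusts = numel(clustPoints);

% Recount points per cluster from idx; the result should match clustPoints,
% including the zero for the empty second cluster.
counts = accumarray(idx, 1, [numClusts, 1]);

% Extract the points of one cluster by logical indexing on idx.
cluster3 = data(idx == 3, :);
```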

Usage examples

Basic usage

[data, cp, idx] = generateData(pi / 2, pi / 8, 5, 15, 15, 5, 1, 2, 200);

The previous command creates 5 clusters with a total of 200 points, with a mean angle of π/2 (std=π/8), separated on average by 15 units in both the x and y directions, with a mean line length of 5 units (std=1) and a "fatness" or spread of 2 units.

The following command plots the generated clusters:

scatter(data(:, 1), data(:, 2), 8, idx);

Using optional parameters

The following command generates 7 clusters with a total of 100 000 points. Optional parameters are used to override the defaults.

[data, cp, idx] = generateData(0, pi / 16, 7, 25, 25, 25, 5, 1, 100000, ...
  'pointDist', 'norm', 'pointOffset', '1D', 'allowEmpty', true);

The generated clusters can be visualized with the same scatter command used in the previous example.

Reproducible cluster generation

To make cluster generation reproducible, set the random number generator seed to a specific value (e.g. 123) before generating the data. In MATLAB, this can be done with the built-in rng function, e.g. rng(123).

For GNU Octave, use the following instructions instead:

rand("state", 123);
randn("state", 123);
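
A script meant to run in both environments could select the seeding call at run time. The sketch below is one possible way to do this, not something the function itself provides. It relies on the common idiom of testing for the `OCTAVE_VERSION` built-in, and the variables `seed`, `first` and `second` are illustrative.

```matlab
% Seed the generators in a way that works in both MATLAB and GNU Octave.
seed = 123;
isOctave = exist('OCTAVE_VERSION', 'builtin') ~= 0;

if isOctave
    rand('state', seed);     % GNU Octave: seed uniform generator
    randn('state', seed);    % GNU Octave: seed normal generator
else
    rng(seed);               % MATLAB: seeds the global generator
end
first = [rand(1, 3), randn(1, 3)];

% Re-seed and draw again: the same values should be produced.
if isOctave
    rand('state', seed);
    randn('state', seed);
else
    rng(seed);
end
second = [rand(1, 3), randn(1, 3)];
```

Because generateData draws all of its random values from these generators, seeding immediately before the call should yield the same clusters on each run within the same environment and function version.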

Previous behaviors and reproducibility of results

Before v2.0.0, lines supporting clusters were parameterized with slopes instead of angles. This made line orientation difficult to choose, hence the change to angles, which are much easier to work with. Version v1.3.0 still uses slopes, for those who prefer that behavior.

For reproducing results in studies published before May 2020, use version v1.2.0 instead. Subsequent versions were optimized in a way that changed the order in which the required random values are generated, thus producing slightly different results.


If you use this function in your work, please cite Fachada and Rosa (2020), listed in full in the citation section below.

Multidimensional alternative

The MOCluGen toolbox extends generateData with arbitrary dimensions and statistical distributions. Therefore, generateData offers a limited subset of the functionality provided by MOCluGen, although it's probably simpler to use.


This script is made available under the MIT License.

Cite As

Fachada, N., & Rosa, A. C. (2020). generateData—A 2D data generator. Software Impacts, 4:100017. doi: 10.1016/j.simpa.2020.100017.

Version Published Release Notes

See release notes for this release on GitHub:


- Use angles instead of slopes for line specification
- Update citation reference

See release notes for this release on GitHub:

Small text update

Add paper reference

Link to GitHub page.

- Function returns more information.
- More comments w/ example.
- Fix x/yClustAvgSep to conform with specification in comment
- totalPoints is exact number of points
- No clusters with zero elements

