Lemmatization

What Is Lemmatization?

Lemmatization is a text normalization technique in natural language processing. It uses a vocabulary and morphological analysis to reduce each inflected word to its dictionary form, known as its lemma, rather than simply stripping affixes. For example, “building has floors” can reduce to “build have floor” upon lemmatization.
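As a minimal sketch of the idea, the example above can be reproduced with a plain dictionary lookup. The LEMMAS table and the lemmatize function are hypothetical names invented for this illustration; a real lemmatizer relies on a full vocabulary and morphological analysis rather than a three-entry table.

```python
# Toy lemma table: a hand-written, hypothetical stand-in for a real vocabulary.
LEMMAS = {
    "building": "build",
    "has": "have",
    "floors": "floor",
}


def lemmatize(text):
    """Lowercase each whitespace-separated token and map it to its lemma.

    Tokens missing from the table are returned unchanged (lowercased).
    """
    tokens = text.lower().split()
    return " ".join(LEMMAS.get(token, token) for token in tokens)


print(lemmatize("building has floors"))  # build have floor
```

Unknown words pass through untouched in this sketch, which is one reason practical lemmatizers depend on large dictionaries.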

Lemmatization Applications

Lemmatization is often used for:

  • Information retrieval for expanding search criteria
  • Reducing dimensionality of problems in text classification, sentiment analysis, or topic modeling
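Both uses can be illustrated with a small sketch. It assumes a hypothetical hand-written lemma table (LEMMA_TABLE) and four made-up documents. Mapping inflected forms to one lemma shrinks the vocabulary, and a search on any one form then matches documents containing the others.

```python
from collections import Counter

# Hypothetical lemma table for illustration only.
LEMMA_TABLE = {
    "applied": "apply",
    "applies": "apply",
    "applying": "apply",
    "requirements": "requirement",
}

# Made-up example documents.
docs = [
    "she applied for the job",
    "he applies today",
    "applying meets the requirements",
    "one requirement remains",
]


def to_lemma(token):
    """Return the lemma of a token, or the token itself if it is not in the table."""
    return LEMMA_TABLE.get(token, token)


# Dimensionality reduction: count distinct terms before and after lemmatization.
raw_counts = Counter(tok for doc in docs for tok in doc.split())
lemma_counts = Counter(to_lemma(tok) for doc in docs for tok in doc.split())
print(len(raw_counts), len(lemma_counts))  # 14 11


def search(query, documents):
    """Return the documents that contain any word sharing the query's lemma."""
    target = to_lemma(query)
    return [d for d in documents if target in {to_lemma(t) for t in d.split()}]


# Searching for one inflected form also finds the other forms.
print(search("applied", docs))
```

Each distinct term is a feature in a bag-of-words model, so the smaller vocabulary translates directly into fewer dimensions for a classifier or topic model.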

Lemmatization vs. Stemming

Stemming, a related approach, is based on simple heuristic rules that strip word endings. It often produces roots or word fragments that are not actual words, whereas lemmatization returns valid dictionary words.

Examples of lemmatization and stemming are shown below.

Actual Word    Lemmatization    Stemming
Requirement    Requirement      Requir
Applied        Apply            Appli
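The contrast in these examples can be sketched in a few lines. The suffix list, the crude_stem function, and the lemma table below are assumptions made for illustration. This is a deliberately crude suffix stripper, not the Porter algorithm or any library's stemmer.

```python
# Hypothetical suffix rules, tried in order; the first match is stripped.
SUFFIXES = ("ement", "ing", "ed", "s")

# Hypothetical lemma table; words not listed are treated as their own lemma.
LEMMA_TABLE = {"applied": "apply"}


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lookup_lemma(word):
    """Look the word up in the lemma table, falling back to the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for word in ("Requirement", "Applied"):
    print(word, lookup_lemma(word), crude_stem(word))
```

The stemmer needs no vocabulary but yields fragments such as "requir" and "appli", while the lookup returns real words at the cost of maintaining a dictionary.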

In MATLAB®, lemmatization can be done using the “normalizeWords” function with the 'Style' option set to 'lemma'. To learn more about lemmatization and about building predictive models from text data with MATLAB, see Text Analytics Toolbox™.

See also: natural language processing, sentiment analysis, word2vec, stemming, n-gram, text mining with MATLAB, data science, deep learning, Deep Learning Toolbox™, Statistics and Machine Learning Toolbox™