How to use Unicode numeric values in regexprep?

Vlad Atanasiu on 28 Mar 2024
How can "Häagen-Dasz" be converted to "Haagen-Dasz" using Unicode numeric values? For example,
regexprep('Häagen-Dasz','ä','A')
works fine, but
regexprep('Häagen-Dasz','\x{C4}','a')
does not. Here, the hexadecimal \x{C4} stands for [latin capital letter a] with diaeresis, i.e. [ä].

Accepted Answer

Yash on 28 Mar 2024
Edited: Yash on 28 Mar 2024
Hi Vlad,
'\x{C4}' represents the Unicode character Ä (Latin Capital Letter A with Diaeresis) in hexadecimal notation.
If you want to replace ä (Latin Small Letter A with Diaeresis), you should use \x{E4}, which is its Unicode hexadecimal representation.
Since you want to replace ä with a, use the Unicode numeric value for ä in the regular expression and a as the replacement. Here is the code:
regexprep('Häagen-Dasz','\x{E4}','a')
ans = 'Haagen-Dasz'
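If you are unsure which hexadecimal value belongs to a character, one way to find it is to let MATLAB compute it. The following is a minimal sketch, assuming the character is a single code unit (as ä is); the variable names are made up for illustration.

```matlab
% Look up the numeric value of the character to match.
ch        = 'ä';
codePoint = double(ch);          % decimal code point, 228 for ä
hexValue  = dec2hex(codePoint);  % hexadecimal text, 'E4'

% Build the \x{...} pattern; '\\' in sprintf produces a single backslash.
pattern = sprintf('\\x{%s}', hexValue);

% Use the generated pattern exactly like a hand-written one.
out = regexprep('Häagen-Dasz', pattern, 'a');
disp(out)
```

This makes it easy to check whether a value such as C4 or E4 really corresponds to the character you intend to match.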
Hope this helps!

More Answers (2)

Stephen23 on 28 Mar 2024
inp = 'Häagen-Dasz';
baz = @(v)char(v(1)); % only need the first decomposed character.
out = arrayfun(@(c)baz(py.unicodedata.normalize('NFKD',c)),inp) % remove diacritics.
out = 'Haagen-Dasz'
Read more:
https://docs.python.org/3/library/unicodedata.html
https://stackoverflow.com/questions/16467479/normalizing-unicode
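A possible variation that ties the normalization approach back to Unicode numeric values in regexprep: decompose the whole string once, then delete the combining marks by their code point range. This is only a sketch. It assumes MATLAB can call a configured Python installation (see pyenv), and that all diacritics to remove fall in the Combining Diacritical Marks block, U+0300 to U+036F.

```matlab
inp = 'Häagen-Dasz';

% NFKD splits ä into the base letter a followed by U+0308 (combining diaeresis).
decomposed = char(py.unicodedata.normalize('NFKD', inp));

% Remove every character in the range U+0300..U+036F, leaving the base letters.
out = regexprep(decomposed, '[\x{300}-\x{36F}]', '');
disp(out)
```

Unlike keeping only the first decomposed character, this retains every base character of a multi-character decomposition, which may or may not be what you want.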

VBBV on 28 Mar 2024
regexprep('Häagen-Dasz','ä','A')
ans = 'HAagen-Dasz'
regexprep('Häagen-Dasz','ä','\x{C4}')
ans = 'HÄagen-Dasz'
2 Comments
VBBV on 28 Mar 2024
Moved: VBBV on 28 Mar 2024
regexprep('Häagen-Dasz','\x{e4}','a')
ans = 'Haagen-Dasz'
VBBV on 28 Mar 2024
The Unicode numeric value for the lowercase letter ä is \x{e4}
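To handle both the uppercase and lowercase letters in one call, regexprep also accepts cell arrays of patterns and replacements, pairing them element by element. A short sketch, using a made-up input string that contains both Ä and ä:

```matlab
% Pair each pattern with its replacement: Ä (U+00C4) -> A, ä (U+00E4) -> a.
patterns     = {'\x{C4}', '\x{E4}'};
replacements = {'A', 'a'};

% Hypothetical input chosen only to exercise both patterns.
inp = 'Ärger im Häagen-Dasz';
out = regexprep(inp, patterns, replacements);
disp(out)
```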
