Problem with text analysis
Hi, I'm trying to clean a table containing both Latin and non-Latin strings so that I can plot a word cloud. I used the regexprep function, but without success: I can't remove the Korean strings. Any ideas? Here is an example of the code and the output:
pathName = 'Keyword Aug. 2020 to Oct. 2021_MatlabSmall.xlsx';
T = readtable(pathName,'Range','A:B');
% Convert all Character Vector to Lowercase
T.Keyword = lower(T.Keyword);
% Remove not useful keywords
T(strcmp(T.Keyword, '(not provided)'), :)=[];
T(strcmp(T.Keyword, '(not set)'), :)=[];
% Set lower case
T.Keyword = lower(T.Keyword);
% Remove links
T(contains(T.Keyword, 'http'), :)=[];
T(contains(T.Keyword, '.'), :)=[];
T.Keyword = strrep(T.Keyword, ' ', '_');
display(head(T));
% Replace non alphanumerics
T.Keyword = regexprep(T.Keyword,'^a-z','');
  8×2 table
                 Keyword                 Sessions
    _________________________________    ________
    'stuff'                                390
    'forum'                                128
    'student'                               76
    '재료'                                   59
    'stuff'                                 56
    'uninstall_stuff_license_manager'       52
    'stuff_resource_center'                 43
    'stuff_student_community'               34
Answers (1)
DGM
on 19 Oct 2021
I'm terrible with regex, but this might get you somewhere. It removes every character that is not a lowercase letter (a-z) or an underscore.
A = {'9.banana' 'orange-123_juice' 'ン戦国時' 'apple_sauce' 'abcクルミ' 'peach' 'pear' 'ピラミッド' 'cherry'}.'
B = regexprep(A,'[^a-z_]','')
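For context on why the pattern in the question has no effect: without square brackets, '^a-z' is not a negated character class. It anchors at the start of the string and looks for the literal text "a-z", which none of the keywords contain. Below is a minimal sketch of how the bracketed pattern might be applied to the table itself. The small table is a hypothetical stand-in for the spreadsheet data, and dropping the rows that end up empty is an assumption about what the word cloud needs, not something stated in the question.

```matlab
% Hypothetical stand-in for the table read from the spreadsheet
Keyword  = {'stuff'; 'forum'; '재료'; 'abc재료'; 'stuff_resource_center'};
Sessions = [390; 128; 59; 10; 43];
T = table(Keyword, Sessions);

% Strip every character that is not a lowercase ASCII letter or an underscore
T.Keyword = regexprep(T.Keyword, '[^a-z_]', '');

% Keywords that held only non-Latin text are now empty; drop those rows
% (assumption: empty keywords are not wanted in the word cloud)
T(cellfun(@isempty, T.Keyword), :) = [];

disp(T)
```

Note that mixed entries such as 'abc재료' are kept with only their Latin part. If the cleaned table is then passed to something like wordcloud(T,'Keyword','Sessions'), it may be worth checking first whether such partial keywords should be kept or discarded.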