Extract email information from webpages/URLs using Matlab

Question

Gobert on 13 June 2021


https://fr.mathworks.com/matlabcentral/answers/855095-extract-email-information-from-webpages-urls-using-matlab

Commented: Gobert on 14 June 2021

Hi,

I need your help. When I run the code below, K is a cell array of 143 empty cells; in other words, K does not contain any email address. I tried this with other websites that show emails on some of their pages, but in vain (K always contained only empty cells). Can you please help me find what is wrong in the following code? I want to write code that checks each page in L linked from the main website H and extracts emails matching the expression E.

H = webread("https://edition.cnn.com");
L = regexp(H,'https?://[^"]+','match')';
E ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
K = regexpi(L,E,'match')';


Answer 1

Image Analyst on 13 June 2021


https://fr.mathworks.com/matlabcentral/answers/855095-extract-email-information-from-webpages-urls-using-matlab#answer_723850

Edited: Image Analyst on 13 June 2021

It looks like L is a cell array of URLs, none of which contains an email address (nothing with @ in it), so there is nothing there for the pattern to match. So why do you think it should find an email there? Try searching H instead:

K = regexpi(H,E,'match')';

That will give you email addresses.
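To see why the original call returned only empty cells, here is a small offline sketch. It uses a made-up HTML snippet instead of a live webread call; sampleHtml and the addresses in it are hypothetical, while E and the URL pattern are the ones from the question.

```matlab
% Hypothetical page source standing in for the output of webread.
sampleHtml = ['<a href="https://example.com/contact">Contact</a> ' ...
              '<p>Write to info@example.com or sales@example.org</p>'];
E = '[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';

% Searching the extracted URLs: one result cell per URL, each of them empty.
L = regexp(sampleHtml, 'https?://[^"]+', 'match')';
fromUrls = regexpi(L, E, 'match')';

% Searching the page text itself: the addresses are found.
fromPage = regexpi(sampleHtml, E, 'match')';

% Check that every cell obtained from the URL list is empty.
noneInUrls = all(cellfun(@isempty, fromUrls))
fromPage
```

With a cell array as input, regexpi returns one cell per element, which is why K in the question had as many (empty) cells as there were URLs in L.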

% Retrieve a web page.
%url = 'https://www.mathworks.com/matlabcentral/answers/?term=';
url = 'https://edition.cnn.com';
webPageContents = webread(url);
% Harvest the URLs listed in the page.
listOfWebSites = regexp(webPageContents,'https?://[^"]+','match')';
% Throw out duplicates:
listOfWebSites = unique(listOfWebSites)
% Harvest email addresses listed in the page.
reForEMailAddresses ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
listOfEMails = regexpi(webPageContents,reForEMailAddresses,'match')';
% Throw out duplicates:
listOfEMails = unique(listOfEMails)
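One thing to keep in mind about the pattern above: `{2,4}` limits the last label to four letters, so an address under a longer top-level domain is cut short rather than skipped. A minimal sketch with made-up sample addresses follows; the relaxed pattern reRelaxed is an assumption for illustration, not part of the original answer.

```matlab
% Made-up sample text containing a short and a long top-level domain.
sampleText = 'Contact: anna@example.com, curator@example.museum';

% Pattern from the answer: the last label is limited to 2-4 letters.
reOriginal = '[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
% Hypothetical relaxed variant: the last label may be 2 or more letters.
reRelaxed  = '[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,})';

matchesOriginal = regexpi(sampleText, reOriginal, 'match')'
matchesRelaxed  = regexpi(sampleText, reRelaxed,  'match')'
```

With the original pattern the second address comes back as curator@example.muse; the relaxed variant returns curator@example.museum. Whether that matters depends on the sites being scanned.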


Image Analyst on 14 June 2021

You could just go through each web site found on the main web site and scan each of those web sites for emails:

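reForEMailAddresses ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
allEMails = {};  % Accumulates the matches from every page.
for k = 1 : length(listOfWebSites)
    thisWebSite = listOfWebSites{k};
    try
        webPageContents = webread(thisWebSite);
    catch
        continue;  % Skip links that cannot be read.
    end
    if ~ischar(webPageContents), continue; end  % Skip images, JSON, etc.
    % Harvest email addresses listed in the page.
    listOfEMails = regexpi(webPageContents, reForEMailAddresses, 'match')';
    allEMails = [allEMails; listOfEMails];
end
% Throw out duplicates:
allEMails = unique(allEMails)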

Is that what you want?

Gobert on 14 June 2021

Yes, it is. Thank you!
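Many of the links harvested from a page point to images, stylesheets, or scripts rather than to pages, so a loop like the one above may spend time reading files that are unlikely to contain visible email addresses. Here is a rough sketch of filtering the list first. The sample URLs and the extension list in skipPattern are assumptions for illustration; adjust them to the site being scanned.

```matlab
% Made-up list standing in for listOfWebSites harvested from the main page.
listOfWebSites = {'https://example.com/about'; ...
                  'https://example.com/logo.png'; ...
                  'https://example.com/style.css'; ...
                  'https://example.com/contact.html'};

% Hypothetical set of file extensions that rarely hold page text,
% optionally followed by a query string.
skipPattern = '\.(png|jpe?g|gif|svg|ico|css|js|woff2?)(\?.*)?$';

% regexpi with 'once' returns a start index per URL, or [] when no match.
isAsset = ~cellfun(@isempty, regexpi(listOfWebSites, skipPattern, 'once'));

% Keep only the links that do not look like static assets.
pagesToScan = listOfWebSites(~isAsset)
```

The filtered pagesToScan can then be used in place of listOfWebSites in the loop.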
