Extract email information from webpages/URLs using MATLAB

Hi,
I need your help. When I run the code below, K comes back as 143 empty cells; in other words, K does not contain any email addresses. I tried other websites that show email addresses on some pages, but in vain (K was always a set of empty cells). Can you please help me find what is wrong in the following code? I want to write code that visits each page in L linked from the main website H and extracts the email addresses matching the expression E.
H = webread("https://edition.cnn.com");
L = regexp(H,'https?://[^"]+','match')';
E ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
K = regexpi(L,E,'match')';

Accepted Answer

Image Analyst on 13 Jun 2021
Edited: Image Analyst on 13 Jun 2021
Looks like L is a cell array of the URLs found on the page, and none of those URLs is an email address with @ in it. So why do you think it should find an email there? Try searching H instead:
K = regexpi(H,E,'match')';
That will give you email addresses.
% Retrieve a web page.
%url = 'https://www.mathworks.com/matlabcentral/answers/?term=';
url = 'https://edition.cnn.com';
webPageContents = webread(url);
% Harvest web sites listed in the page, as a column cell array.
listOfWebSites = regexp(webPageContents,'https?://[^"]+','match')';
% Throw out duplicates:
listOfWebSites = unique(listOfWebSites)
% Harvest email addresses listed in the page.
reForEMailAddresses ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
listOfEMails = regexpi(webPageContents,reForEMailAddresses,'match')';
% Throw out duplicates:
listOfEMails = unique(listOfEMails)
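To see the difference without going online, here is a minimal sketch that runs both searches on a small made-up page. The HTML string, the example.com links and the address info@example.com are assumptions for illustration only; the two patterns are the ones from the question.

```matlab
% Hypothetical page source standing in for the output of webread(url).
pageText = ['<a href="https://example.com/contact">Contact</a> ' ...
            '<a href="mailto:info@example.com">info@example.com</a> ' ...
            '<a href="https://example.com/about">About</a>'];

linkPattern  = 'https?://[^"]+';
emailPattern = '[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';

% Links found in the page: a column cell array of URL strings.
links = regexp(pageText, linkPattern, 'match')';

% Searching the URLs themselves returns one cell per URL, each of them empty,
% because the URLs contain no e-mail address.
emailsInLinks = regexpi(links, emailPattern, 'match');

% Searching the page text finds the address (twice here, so remove duplicates).
emailsInPage = unique(regexpi(pageText, emailPattern, 'match'));

disp(all(cellfun(@isempty, emailsInLinks)))
disp(emailsInPage)
```

Note that the {2,4} at the end of the pattern only accepts top-level domains of two to four letters, so an address ending in a longer one would be cut short or missed.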

5 Comments

Gobert on 13 Jun 2021
Edited: Gobert on 13 Jun 2021
Thank you! But here:
listOfEMails = regexpi(webPageContents,reForEMailAddresses,'match')';
Why do you use "webPageContents" instead of "listOfWebSites" when I aim to harvest the content of all the listed/linked websites, not just the single page in "webPageContents"? Or am I missing something about "webPageContents"? Does it include the content of all the webpages linked from it?
Image Analyst on 13 Jun 2021
Edited: Image Analyst on 13 Jun 2021
You were searching the list of web sites that were listed on the page. This list of web sites does not have any email addresses in it. That's why your K was empty.
I assumed you wanted to find email addresses, and the only place to find them is in the original web page contents, not in the small subset of it that you scraped off (i.e. not in the list of web sites, because there are no email addresses there).
If you're interested in how to collect a list of stock prices and drop them onto an Excel workbook with your stock portfolio on it (with Windows), I can show you that too.
Thank you again! Of course the list of websites does not contain any email addresses, but if you open each listed webpage, one by one, you can find something, I think. In short, as you mentioned, I wanted to find email addresses on both the index page (i.e., webPageContents) and all the pages it links to directly (i.e., listOfWebSites(:,1)).
You could just go through each web site found on the main web site and scan each of those web sites for emails:
reForEMailAddresses ='[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';
listOfEMails = {};
for k = 1 : length(listOfWebSites)
    thisWebSite = listOfWebSites{k};
    webPageContents = webread(thisWebSite);
    % Harvest email addresses listed in this page and append them to the running list,
    % so that results from earlier pages are not overwritten.
    emailsOnThisPage = regexpi(webPageContents, reForEMailAddresses, 'match')';
    listOfEMails = [listOfEMails; emailsOnThisPage];
end
% Throw out duplicates:
listOfEMails = unique(listOfEMails)
Is that what you want?
Yes, it is. Thank you!
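One caveat when looping over every harvested link: webread throws an error for a link it cannot retrieve, and it may return non-text data (for example an image array) rather than page source. Below is a minimal sketch of one way to handle both cases while collecting the addresses from all pages. So that it runs offline, a hypothetical lookup table fakeWeb and a function handle fetchPage stand in for webread; these names, the example.com URLs and the addresses are all made up for illustration.

```matlab
% Hypothetical stand-in for the web: maps each URL to its page source.
fakeWeb = containers.Map();
fakeWeb('https://example.com/a') = 'Write to alice@example.com or bob@example.org.';
fakeWeb('https://example.com/b') = 'No addresses here.';
fakeWeb('https://example.com/c') = 'Again alice@example.com.';
fetchPage = @(u) fakeWeb(u);   % in real use this would be @webread

listOfWebSites = {'https://example.com/a'; 'https://example.com/b'; ...
                  'https://example.com/missing'; 'https://example.com/c'};
reForEMailAddresses = '[a-zA-Z0-9._%''+-]+@([a-zA-Z0-9._-])+\.([a-zA-Z]{2,4})';

allEMails   = {};   % addresses collected across all pages
failedSites = {};   % links that could not be read
for k = 1 : numel(listOfWebSites)
    thisWebSite = listOfWebSites{k};
    try
        pageText = fetchPage(thisWebSite);
    catch
        % Remember the link and move on instead of stopping the whole loop.
        failedSites{end+1, 1} = thisWebSite; %#ok<SAGROW>
        continue
    end
    if ~(ischar(pageText) || isstring(pageText))
        continue   % not text, so there is nothing to search
    end
    found = regexpi(char(pageText), reForEMailAddresses, 'match');
    allEMails = [allEMails; found(:)]; %#ok<AGROW>
end
% Throw out duplicates:
allEMails = unique(allEMails)
failedSites
```

Keeping the fetch step behind a function handle is only a convenience for trying the loop without a network connection; with fetchPage = @webread the rest of the loop stays the same.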


Version

R2021a
