Is there a way to pull a specific link after using webread() to get the content from a page?

Essentially, I'm using webread() to obtain the contents of a Google search. If there's a Wikipedia link in the contents, I want to extract it. I've been using regexp(content,exp,'match'), but I'm unsure how to write an expression that matches the Wikipedia link. I know that doing something such as:
regexp(content,'https?://en\.?\w*\.?\w')
will get me the 'https://en.wikipedia.org' portion of the link, but that expression already seems more complicated than it needs to be for just that part. I could keep extending it to cover the whole link, but the number of words in a Wikipedia link varies, so I'm unsure how to capture just the link without accidentally taking the text that follows it.
(e.g., https://en.wikipedia.org/wiki/List_of_landmark_court_decisions_in_the_United_States or https://en.wikipedia.org/wiki/Banana)
In the text that is read, the link appears to be followed by '&amp;'. Perhaps I could take all the characters from 'http' up to '&amp;', but it would be nice to get some tips on how to create an expression for that!
Thanks for the help!
1 Comment
Matthew Cao on 1 May 2018
Edited: Matthew Cao on 1 May 2018
OK, I could simply replace the '\.?\w*\.?\w' part of the expression with \S+, which matches any run of consecutive non-white-space characters. This pulls the Wiki link, but also a lot of what follows it:
https://en.wikipedia.org/wiki/List_of_landmark_court_decisions_in_the_United_States&amp;sa=U&.............
I need the match to stop right at the '&amp;'!
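
One possible way to stop at the ampersand is to swap \S+ for a negated character class that excludes both '&' and white space. The sketch below runs on a made-up sample string; the value of content is an assumption standing in for real webread() output, which may be laid out differently.

```matlab
% Hypothetical sample standing in for the text returned by webread()
content = 'href="/url?q=https://en.wikipedia.org/wiki/Banana&amp;sa=U&amp;ved=0" more text';

% \S+ runs to the next white-space character, so it overshoots the link
greedy = regexp(content, 'https?://en\.\S+', 'match', 'once');

% [^&\s]+ stops at the first '&' (or white space), i.e. at the end of the link
link = regexp(content, 'https?://en\.[^&\s]+', 'match', 'once');

disp(greedy)
disp(link)
```

This relies on the link itself containing no literal '&', which appears to hold for the two example links in the question but is not guaranteed for every page.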


Accepted Answer

Matthew Cao on 1 May 2018
I think I've solved it by putting '\S+' in the expression, followed by the lookahead '(?=&amp;sa)'. That way the expression matches all the characters following 'https?://en.' but stops at the right point, because the lookahead checks for '&amp;sa' without including it in the match.
regexp(content,'https?://en\.\S+(?=&amp;sa)','match')
This will find everything up until the '&amp;sa'! If there's a more efficient way of doing this, let me know!
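
As a hedged variation on the lookahead idea, a lazy quantifier (\S+?) makes the match end at the first '&amp;sa' rather than the last one in a run of non-white-space text. The sketch below uses a made-up content string, not real Google output, and narrowing the pattern to en.wikipedia.org is an assumption about which links are wanted.

```matlab
% Hypothetical sample standing in for webread() output; real markup may differ
content = ['<a href="/url?q=https://en.wikipedia.org/wiki/Banana&amp;sa=U&amp;ved=1">' ...
           '<a href="/url?q=https://en.wikipedia.org/wiki/Plantain&amp;sa=U&amp;ved=2">'];

% \S+? is lazy: it stops at the first position where (?=&amp;sa) succeeds
links = regexp(content, 'https?://en\.wikipedia\.org/\S+?(?=&amp;sa)', 'match');

% Drop repeated links while keeping the order of first appearance
links = unique(links, 'stable');

disp(links)
```

Without the 'once' option, regexp returns every match as a cell array, so several Wikipedia links on one results page can be collected in a single call.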

More Answers (0)
