HTML Page source info
Hello, we often come across a series of numbered webpages such as
basePage.html?page=2
basePage.html?page=3
and so forth, wherein there are several fields identified by their labels:
<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
and so on.
How can the "textOfInterest" of one particular parameter, say Parameter2, be collected for every Name* on every page,
basePage.html?page=1toInf
and exported into one text file, say Parameter2.txt?
The "textOfInterest" is often alphanumeric and may also contain special characters such as !@#$%.
Thanks.
6 comments
Rik
on 1 Dec 2020
Edited: Rik
on 1 Dec 2020
The goal of the Bible downloader is religious (although you can of course also use the text of a Bible translation for non-religious purposes), but the code isn't.
Did you try adapting any of the code? I'll post some code as an answer.
Accepted Answer
Rik
on 1 Dec 2020
One possibility with strfind:
% d is assumed to hold the page source as a char vector (see the example d further down)
close_div=strfind(d,'</div>');
param=1;
pat=sprintf('<label>Parameter%d : </label> <div class="category-related">',param);
position=strfind(d,pat);
position=position+numel(pat);%this will be the start of your text of interest
texts=cell(size(position));
for n=1:numel(position)
    end_of_text=close_div(close_div>=position(n));%>= so an empty text of interest is handled as well
    end_of_text=end_of_text(1)-1;
    texts{n}=d(position(n):end_of_text);
end
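The code above assumes d already holds the source of one page. A minimal sketch of how the same strfind approach might be repeated over all numbered pages follows. It assumes the pages can be fetched with webread, and that a page past the last one either fails to load or contains no matching label; both are guesses about the site. The URL in base_url and the cap in max_pages are placeholders, not values from the question.

```matlab
base_url = 'https://example.com/basePage.html'; % placeholder, replace with the real site
param = 2;                                      % which ParameterN to collect
pat = sprintf('<label>Parameter%d : </label> <div class="category-related">', param);
max_pages = 1000;                               % assumed safety cap, so the loop always ends
opts = weboptions('ContentType', 'text');       % return the raw HTML as a char vector

all_texts = {};
for page = 1:max_pages
    try
        d = webread(base_url, 'page', page, opts); % requests basePage.html?page=<page>
    catch
        break % assumption: a failed request means there are no more pages
    end
    starts = strfind(d, pat) + numel(pat);      % first character of each text of interest
    if isempty(starts)
        break % assumption: a page without matches is past the last page
    end
    close_div = strfind(d, '</div>');
    for n = 1:numel(starts)
        % first closing tag at or after the start of the text
        stop = close_div(find(close_div >= starts(n), 1)) - 1;
        all_texts{end+1} = d(starts(n):stop); %#ok<SAGROW>
    end
end
```

After the loop, all_texts is a 1xN cell with one char vector per match, in page order.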
Or with a regexp:
d=['<h2 class="category-heading">Name1</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name2</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name3</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'];
RE=['<label>Parameter\d',... % \d matches a single digit
' : </label> <div class="category-related">',...
'(',... % use parentheses to capture a token
'[^<]*',... % this matches any number of characters other than <
')',...
'</div>'];
t=regexp(d,RE,'tokens');
clc
celldisp(t)
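The tokens come back as a cell array of 1x1 cells, so they need to be flattened before they can be written to a file. A minimal sketch of that last step, assuming a fixed parameter number and using a shortened sample d (the sample values and the output file name pattern are illustrative):

```matlab
% Shortened sample in the same shape as the pages in the question
d = ['<label>Parameter1 : </label> <div class="category-related">abc</div>' ...
     '<label>Parameter2 : </label> <div class="category-related">x!@#$%1</div>' ...
     '<label>Parameter2 : </label> <div class="category-related">y 2</div>'];

param = 2;
RE = sprintf(['<label>Parameter%d : </label> <div class="category-related">' ...
              '([^<]*)</div>'], param);
t = regexp(d, RE, 'tokens');  % 1xN cell, each element is a 1x1 cell holding one token
texts = [t{:}];               % flatten to a 1xN cell of char vectors

out_file = sprintf('Parameter%d.txt', param);
fid = fopen(out_file, 'w', 'n', 'UTF-8');
if fid < 0
    error('Could not open %s for writing.', out_file);
end
if ~isempty(texts)
    % The text is passed as data for %s, so characters like % are written literally
    fprintf(fid, '%s\n', texts{:});
end
fclose(fid);
```

Each text of interest ends up on its own line of Parameter2.txt. To combine this with several pages, the cell arrays from each page can be concatenated before writing.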
You can also adapt the expression to use a lookahead for </div>, so you can use a lazy .*? instead of [^<]* (a greedy .* would run on to the last </div> in the page).
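A minimal sketch of that lookahead variant, using a made-up sample in which the first text of interest contains a literal < (something [^<]* cannot match across):

```matlab
% Illustrative sample; the first text of interest contains a '<'
d = ['<label>Parameter2 : </label> <div class="category-related">a < b</div>' ...
     '<label>Parameter2 : </label> <div class="category-related">second</div>'];

RE = ['<label>Parameter2 : </label> <div class="category-related">', ...
      '(.*?)', ...       % lazy: capture as few characters as possible
      '(?=</div>)'];     % lookahead: the capture must be followed by </div>
t = regexp(d, RE, 'tokens');
texts = [t{:}];          % flatten to a 1xN cell of char vectors
celldisp(texts)
```

The lazy quantifier is what keeps each match from extending past its own closing tag.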
8 comments
Rik
on 2 Dec 2020
Those arrows are probably newline characters. What release are you using?
I would suggest parsing each element separately. That way you can write an empty char or whatever you prefer in the email field for that person.
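A minimal sketch of what parsing each element separately could look like, assuming every element starts at an h2 heading as in the question. Parameter2 stands in here for the field that may be missing, and the sample values are made up:

```matlab
% Illustrative sample; Name2 has no Parameter2 field
d = ['<h2 class="category-heading">Name1</h2>' ...
     '<label>Parameter1 : </label> <div class="category-related">a1</div>' ...
     '<label>Parameter2 : </label> <div class="category-related">b1</div>' ...
     '<h2 class="category-heading">Name2</h2>' ...
     '<label>Parameter1 : </label> <div class="category-related">a2</div>'];

% Split the page into one block per heading
blocks = regexp(d, '<h2 class="category-heading">', 'split');
blocks(1) = [];  % drop whatever precedes the first heading

param = 2;
RE = sprintf(['<label>Parameter%d : </label> <div class="category-related">' ...
              '([^<]*)</div>'], param);

names  = cell(size(blocks));
values = cell(size(blocks));
for n = 1:numel(blocks)
    % The block starts with the heading text, which runs up to the next tag
    names{n} = regexp(blocks{n}, '^[^<]*', 'match', 'once');
    tok = regexp(blocks{n}, RE, 'tokens', 'once');
    if isempty(tok)
        values{n} = '';      % field missing for this element
    else
        values{n} = tok{1};
    end
end
```

Because names and values always have one entry per heading, the values stay aligned with the names even when a field is absent.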