Hello, we often come across a series of numbered webpages
basePage.html?page=2
basePage.html?page=3
and so forth, wherein there are several fields identified by their labels:
<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
and so on.
How can the "textOfInterest" of one particular parameter, say, Parameter2, of all the Name*, of all the pages,
basePage.html?page=1toInf
be taken (outputted/exported) into one text file, say, Parameter2.txt?
The "textOfInterest" is often alphanumeric with special characters !@#$% also.
Thanks.

6 comments

Rik on 26 Nov 2020
Step by step. You want to parse several pages, so you will probably need a loop. You want to write something to a file, so you will first have to store it in Matlab variables.
What have you tried?
b on 26 Nov 2020
I am not that good with programming. So I just used a brute-force outer loop:
for i=1:100
    try
        d=webread(strcat('basePage.html?page=',num2str(i)));
    catch
    end
end
This put the webpage in d. Clueless after that.
Rik on 26 Nov 2020
Good start.
Now you need to think about how you can extract the text of interest from the webpage content. The strfind function is probably helpful in this context. That is the main thing I used when I had to parse a few thousand webpages for my Bible downloader.
b on 27 Nov 2020
Can you tell how strfind can be used to find the following line :
<label>Parameter1 : </label> <div class="category-related">
b on 1 Dec 2020
Initially, I was hesitant to download this file because I thought it is religious or some such thing. But I am happy to have downloaded it. It is immensely useful and 'on the money' for this thread.
My interest is in the function button_Callback in BibleDownloader.m. The webpage is saved in the variable called 'data'. Since finding <div class="pagination"> is right in the ballpark of my initial query, I was greatly excited to see the output and experiment with the case 'NB2014' inside this function. Unfortunately, the code doesn't seem to reach this part, since I was unable to retrieve either 'data' or the indices idx*. All of these indices, viz. idx, idx2 and idx3, will be useful for me. How can I access and get to this part?
Also, perhaps you can suggest one regexp line to pull out 'textOfInterest' from
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
and better still, if you already have something like the BibleDownloader m-file, with regexp used on extracting text between <div class> and </div> type of structure, that will be great.
Rik on 1 Dec 2020
Edited: Rik on 1 Dec 2020
The goal of Bible downloader is religious (although you can of course use the text of a Bible translation for non-religious purposes as well), but the code isn't.
Did you try adapting any of the code? I'll post some code as an answer.

Accepted Answer

Rik on 1 Dec 2020

One possibility with strfind:
close_div=strfind(d,'</div>');%start indices of every closing div tag
param=1;
pat=sprintf('<label>Parameter%d : </label> <div class="category-related">',param);
position=strfind(d,pat);
position=position+numel(pat);%this will be the start of your text of interest
texts=cell(size(position));
for n=1:numel(position)
    %first closing tag at or after the start (>= also handles an empty text)
    end_of_text=close_div(close_div>=position(n));
    end_of_text=end_of_text(1)-1;
    texts{n}=d(position(n):end_of_text);
end
Or with a regexp:
d=['<h2 class="category-heading">Name1</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name2</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name3</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'];
RE=['<label>Parameter\d',... % \d matches a single digit
' : </label> <div class="category-related">',...
'(',... % use parentheses to capture a token
'[^<]*',... % this matches any number of characters other than <
')',...
'</div>'];
t=regexp(d,RE,'tokens');
clc
celldisp(t)
You can also adapt the expression to look ahead for </div>, so you can use a lazy .*? instead of [^<]* (a greedy .* would run on to the last </div> in the text).
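A minimal sketch of that lookahead variant, in case it helps. The sample string and the variable name texts are only assumptions for illustration; the real page content will differ.

```matlab
% Assumed sample input in the same shape as the example above.
d=['<label>Parameter1 : </label> <div class="category-related">abc!@#</div>'...
   '<label>Parameter2 : </label> <div class="category-related">def$%</div>'];
RE=['<label>Parameter\d',... % \d matches a single digit
    ' : </label> <div class="category-related">',...
    '(.*?)',...    % lazy capture: as few characters as possible
    '(?=</div>)']; % lookahead: the match must be followed by </div>
t=regexp(d,RE,'tokens');
texts=[t{:}]; % unpack the nested cells into a 1-by-N cell array of char
celldisp(texts)
```

The lazy quantifier is what makes this safe: the capture stops at the first </div> that follows, so each match stays inside its own div.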

8 comments

b on 1 Dec 2020
Thank you.
But I have run into problem with the following part:
Trying to take the output of the two parameters simultaneously: Parameter1 and Parameter2. It so happens, that many times, Parameter1 is present, but the Parameter2 is missing. That is, the structure is like this:
<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
Same problem if try to take all the three parameters.
When all three parameters are to be extracted, the objective is to get ' ' (no value) at the place where it is missing, rather than skipping it completely, because skipping it completely would result in a mismatch (so that when it is exported to the output text file, the corresponding entry is simply blank).
In the first (strfind) code, I tried to replicate the 'for loop' three times for the three parameters, but quickly ran into problems.
Rik on 1 Dec 2020
for param=1:3
That should be enough. I have asked you several times what you tried. Show your code.
You can also try to adapt the regular expression to capture two tokens. Did you try that?
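As a rough sketch of what capturing two tokens could look like (the sample string is an assumption, and this is only one way to do it): the first pair of parentheses captures the parameter number, the second the text, so each value stays labelled with the parameter it belongs to.

```matlab
% Assumed sample input: Parameter2 is missing on purpose.
d=['<label>Parameter1 : </label> <div class="category-related">a1</div>'...
   '<label>Parameter3 : </label> <div class="category-related">a3</div>'];
RE=['<label>Parameter(\d)',... % token 1: the parameter number
    ' : </label> <div class="category-related">',...
    '([^<]*)',...              % token 2: the text of interest
    '</div>'];
t=regexp(d,RE,'tokens');
t=vertcat(t{:}); % N-by-2 cell: column 1 is the number, column 2 the text
param_numbers=str2double(t(:,1));
disp(t)
```

With the number stored next to each text, a missing parameter shows up as a gap in param_numbers instead of silently shifting the later entries.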
b on 2 Dec 2020
Unfortunately, this is exactly what I was saying (is not working).
Here is the extension of the first (strfind) approach:
close_h2=strfind(d,'</h2>');
patName=sprintf('<h2 class="category-heading">');
positionName=strfind(d,patName);
positionName=positionName+numel(patName);
textsName=cell(size(positionName));
for n2=1:numel(positionName)
    end_of_text_name=close_h2(close_h2>positionName(n2));
    end_of_text_name=end_of_text_name(1)-1;
    textsName{n2}=d(positionName(n2):end_of_text_name);%curly braces store the char array in the cell
end
If Parameter2 is of interest, then, when the Parameter2 is missing in one of the h2 name headings (as outlined in the comment above), it gives the output in the following format :
Name1 Parameter2(of Name1)
Name2 Parameter2(of Name3)
Name3 Parameter2(of Name4)
This output is erroneous. In actuality, the output should be
Name1 Parameter2(of Name1)
Name2 NULL
Name3 Parameter2(of Name3)
Rik on 2 Dec 2020
Can you post the original data? It sounds like you should go through the data line by line. It sounds like you don't make any attempt at matching the output data of each separately parsed line.
You also didn't read the documentation for the sprintf function.
b on 2 Dec 2020
Yes, that is true. It's just that there are so many combinations of the way the parameters can occur inside the heading name. I will try to resolve this and come up with a unified way so that the 'NULL' can be written at the proper place in the output file.
Meantime, I have attached a sample file, which is a variant of the above structure. (These structures will vary - there is no way to obtain a unified approach to all.) The objective is to get the name and the corresponding fields.
My specific question was that if one parameter of interest is missing in the given name-info (say, mail address), then with both the above approaches (strfind and regexp), there is a mismatch in the indices. It is a question of resolving this mismatch, and putting 'NULL' in the output-file at the index where the mail-field is missing, but the name-field exists.
Rik on 2 Dec 2020
Your actual content doesn't look much like your example data. The code below splits the text file into elements that each contain a single person.
d=readfile('https://www.mathworks.com/matlabcentral/answers/uploaded_files/439448/sample1.txt');
idx1=find(contains(d,'<article>'))+1;
idx2=find(contains(d,'</article>'))-1;
elements=arrayfun(@(s1,s2) d(s1:s2),idx1,idx2,'UniformOutput',false);
You might be able to use something mentioned here to parse the HTML for each person to a struct. The structure looks the same to me, so that level of complication might not be required.
b on 2 Dec 2020
Thanks for the link.
Downloaded the readfile from github. The 'elements' seems promising, except for - what are those ->->-> arrows in front of all the fields of interest?! Anyways, glad that it has brought to this point.
But the same situation with all the three approaches : when the mail-field is missing, then how to write 'NULL' in the output-file and continue with the loop?
Name1 mail1
Name2 missing
Name3 mail3
Name4 mail4
The strfind and regexp approaches give
Name{1}='Name1'
Name{2}='Name2'
Name{3}='Name3'
Name{4}='Name4'
and
Parameter{1}='mail1'
Parameter{2}='mail3'
Parameter{3}='mail4'
How to bypass the 'for loop' and at the same time, print 'NULL' in the corresponding excel row-column entry? In this example, (row=2,col=2) will be 'NULL', and (row=3,col=2) will be Parameter{2}.
It is not the question of 'skipping if not found', because numel(position) has already been evaluated, =4 here for the Name field, and =3 for the Parameter. So it seems to be hardcoded.
Rik on 2 Dec 2020
Those arrows are probably newline characters. What release are you using?
I would suggest parsing each element separately. That way you can write an empty char or whatever you prefer in the email field for that person.
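A minimal sketch of parsing each element separately, using the Name/Parameter structure from the earlier example rather than the attached file. The sample string, the 'NULL' placeholder and the output location in tempdir are all assumptions for illustration.

```matlab
% Assumed sample input: Name2 has no Parameter2.
d=['<h2 class="category-heading">Name1</h2>'...
   '<label>Parameter1 : </label> <div class="category-related">a1</div>'...
   '<label>Parameter2 : </label> <div class="category-related">a2</div>'...
   '<h2 class="category-heading">Name2</h2>'...
   '<label>Parameter1 : </label> <div class="category-related">b1</div>'...
   '<h2 class="category-heading">Name3</h2>'...
   '<label>Parameter2 : </label> <div class="category-related">c2</div>'];
% Split into one block per heading; the piece before the first heading is dropped.
blocks=regexp(d,'<h2 class="category-heading">','split');
blocks(1)=[];
names=cell(numel(blocks),1);
values=cell(numel(blocks),1);
RE_name='^([^<]*)</h2>'; % each block starts with the name
RE_par='<label>Parameter2 : </label> <div class="category-related">([^<]*)</div>';
for n=1:numel(blocks)
    tok=regexp(blocks{n},RE_name,'tokens','once');
    names{n}=tok{1};
    tok=regexp(blocks{n},RE_par,'tokens','once');
    if isempty(tok)
        values{n}='NULL'; % field is missing in this block only
    else
        values{n}=tok{1};
    end
end
% Write one tab-separated line per name.
fid=fopen(fullfile(tempdir,'Parameter2.txt'),'w');
for n=1:numel(names)
    fprintf(fid,'%s\t%s\n',names{n},values{n});
end
fclose(fid);
```

Because the name and the parameter are looked up inside the same block, the two lists always have the same length and the same order. Passing the text as an argument to %s (instead of using it as the format string) keeps characters like % in the text from being interpreted by fprintf.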

More Answers (1)

b on 3 Dec 2020

That is exactly how I am doing it. By parsing it separately, there is no way to correlate which Name-field has the corresponding Mail-field missing. It parses all the Name-fields, then it parses all the mail-fields, as a sequential process.
What modification should be made in the codes, so that they print 'Not Found' when the mail field is missing in the corresponding iteration? Is there a way to get the index values of the missing Mail-fields?

3 comments

Rik on 3 Dec 2020
Can you move this to the comment section (by posting a new comment and deleting this answer)? And please also add the code you're using to parse a single element.
b on 3 Dec 2020
I am overwhelmed by the way you have patiently worked with me on this thread. I think I will close this elaborate thread here only, but not before posting this limerick:
There was once a man named Rik,
Who wrote matlab codes so quick,
To the topic, they were relevant
The codes themselves so elegant,
His m-files, sir, were completely sick!
Enjoy your freedom from this thread.
Rik on 3 Dec 2020
You're welcome (and thanks for the limerick XD).
If you have a follow-up question, feel free to post a link to it here.

Asked: b on 26 Nov 2020
Commented: Rik on 3 Dec 2020