retrieve data from a website with multiple pages

sani on 18 Feb 2022
Commented: Ive J on 22 Feb 2022
Hi all,
I want to pull the data from this website into a table.
It has 185 pages, so I wrote a for loop to go through the entire table.
The problem is that I'm using webread, which seems to read everything into a char array.
What I want is for the table data to be read in each iteration of the for loop. How can this be done?
Thanks
4 comments
Rik on 19 Feb 2022
You can search the html for the text in the table and guess the structure from what you see.
sani on 19 Feb 2022
If I understand correctly, the data is in a <table> element. I tried setting weboptions.ContentType to 'table', but it reports that there is only text.
I'm not sure this is the right way to approach it, though.


Accepted Answer

Ive J on 19 Feb 2022
Edited: Ive J on 20 Feb 2022
My answer doesn't totally solve your problem, but it addresses your main questions (hopefully!). Before any HTML parsing can happen, note that webread does not receive the actual content of the URL, because the website uses some measures against bot attacks (read more: https://stackoverflow.com/questions/53434555/python-requests-enable-cookies-javascript). That needs to be handled first.
url = "https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=1";
opts = weboptions('Timeout', 5e3);
htmlraw = webread(url, opts);
% webread cannot read the contents as the website requests cookies =========
% credits: https://stackoverflow.com/a/53435185
htmlraw = string(htmlraw);
top = htmlraw.split('<script>');
top = top(2);
if contains(top, "Challenge=")
    Challenge = extractBetween(top, "Challenge=", ";");
    challenge_id = extractBetween(top, "ChallengeId=", ";");
    arr = char(Challenge);
    last_digit = str2double(arr(end));
    arr = sort(arr);
    min_digit = str2double(arr(1));
    subvar1 = (2*str2double(arr(3))) + str2double(arr(2));
    subvar2 = string(2 * str2double(arr(3))) + str2double(arr(2));
    power = ((str2double(arr(1)) * 1) + 2)^str2double(arr(2));
    x = double(Challenge) * 3 + subvar1;
    y = cos(pi * subvar1);
    answer = x * y;
    answer = answer - power;
    answer = answer + (min_digit - last_digit);
    answer = string(floor(answer)) + subvar2;
    hdrs = {'X-AA-Challenge' char(Challenge); ...
            'X-AA-Challenge-ID' char(challenge_id); ...
            'X-AA-Challenge-Result' char(answer)};
    % now read the website contents ===========================================
    htmlraw = webwrite(url, hdrs, opts); % content of URL in HTML code
end
% the text markers used below were found by manually inspecting the HTML code
data = htmlTree(htmlraw); % creating an HTML tree from raw content
hdr = findElement(data ,"th").extractHTMLText; % table header
col1 = extractBetween(htmlraw, 'columnLisenceNumber">', "</td>");
col2 = extractBetween(htmlraw, 'HeValue">', "</td>");
col3 = extractBetween(htmlraw, "</table></td><td>", '</td><td class="columnWorkCity');
col4 = extractBetween(htmlraw, 'columnWorkCity">', "</td>");
col5 = extractBetween(htmlraw, 'columnWorkCity">' + ...
    wildcardPattern + "</td><td>", ...
    '</td><td class="gvDescImg showHideImg"');
% description column (last col)
col6_hdr = extractBetween(htmlraw, '<td class="TenderDescription">', '</td>');
col6_hdr = col6_hdr(1:2);
col6 = extractBetween(htmlraw, '<p class="itemDesc TenderDesc">', '</p>');
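% each table row yields two consecutive matches (one per description field),
% so fill a 2-by-N array column-wise and transpose it to N-by-2
col6 = reshape(col6, 2, []).';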
% reorder as a table
% append the header so column 6 can have descriptive info as well
hdr = string([hdr; hdr(end)]); % 7 columns
hdr(6:7) = hdr(6:7) + "_" + col6_hdr;
tab = array2table([col1, col2, col3, col4, col5, col6], ...
    'VariableNames', hdr);
tab = convertvars(tab, 1:width(tab), @string);
tab.(1) = double(tab.(1));
head(tab)
ans = 8×7 table
מספר רישיון    שם יצרן    כתובת    ישוב    מחוז    פרטים_סוג מזון (מהות היצור):    פרטים_קבוצת מזון:
___________    _______    _____    ____    ____    ____________________________    _________________
55678    "א. הקר 2009 גלאט למהדרין בע"מ"    "מרכז ספיר 3 ירושלים"    "ירושלים"    "ירושלים"    "ייצור מוצרי בשר קפואים בלבד: בשר בקר טחון, בשר בעלי כנף טחון ומוצריהם, קישקע ממולא, בשר בקר מעובד, בשר בעלי כנף מעובד, ניסור ואריזת בשר בקר קפוא"    "הסעדה"
68795    "א. כ. התעשיינים בע"מ"    "שד הסנהדרין 3 יבנה"    "יבנה"    "מרכז"    "בשר ומוצריו, לרבות עופות וצייד"    "הסעדה (קיטרינג)"
52319    "א.א בורקס ליאון"    "איתן 24 ראשון לציון"    "ראשון לציון"    "מרכז"    "אחסנה בקירור"    "אחסון מזון בקירור"
69047    "א.א בליסימו בע"מ"    "איתן 3 ראשון לציון"    "ראשון לציון"    "מרכז"    "קרחונים אכילים, כולל שרבט וסורבט"    "מחסן קרור/מחסן בטמ' מבוקרת"
67457    "א.א מטעמים הכי טעים בע"מ"    "מודיעין 8 פתח תקווה"    "פתח תקווה"    "מרכז"    "ייצור בצקים ממולאים, ייצור עוגיות יבשות"    "לחם, לחמניות, עוגות שמרים ומאפים"
52312    "א.א. בליסימו בע"מ"    "לזרוב 3 ראשון לציון"    "ראשון לציון"    "מרכז"    "מוצרי מאפה, תערובות להכנתם ובצקים"    "לחמים ולחמניות מאודים"
50780    "א.א. דרך האוכל (חיפה) בע"מ"    "שנקר אריה 47 חיפה"    "חיפה"    "חיפה"    "אחסנת בצקים קפואים"    "יצור מוצרי בשר בקר וצאן טחון בלבד"
52587    "א.א. לרנר מוצרי מזון העמק בע"מ"    "הפועלים 2 באר שבע"    "באר שבע"    "דרום"    "מחסן קרור/מחסן בטמ' מבוקרת"    "בשר ומוצריו, לרבות עופות וצייד"
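Since the question is about 185 pages, here is a minimal sketch of how the per-page URLs could be generated. It assumes that the PN query parameter at the end of the URL above is the 1-based page number (the PN=1 suggests this, but I have not verified it for every page). The names baseUrl, nPages and urls are only illustrative.

```matlab
% Assumption: PN=<k> at the end of the URL selects page k of the listing.
baseUrl = "https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=";

nPages = 3;                    % use 185 for the full listing mentioned in the question
urls = baseUrl + (1:nPages)';  % string + numeric column gives one URL per page

disp(urls)
```

Each element of urls can then take the place of url in the script above.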
10 comments
sani on 22 Feb 2022
I actually put your entire script in a for loop and changed the URL as i increases. Then, in each iteration, I appended the result of your script to another table using vertcat. If I understand correctly, size(unitab) = (36,7) is the result for pages 1-3? If so, this is the dimension I'm expecting to receive.
Ive J on 22 Feb 2022
Yes, that's for 3 pages.
Feel free to use the function above! Also be aware that if you send too many requests to a website, it may block your IP (temporarily).
To track possible parsing bugs, you can also save each table as a MAT-file. That way, if you expect, say, 120 rows and get only 100, you can inspect each table individually. You can do this by adding these lines:
for i = 1:n
    fprintf('reading page %d of %d\n', i, n)
    tab = readEachPage(i);
    save("tab.page." + i + ".mat", "tab") % e.g. tab.page.10.mat contains table for page 10
    unitab{i} = tab;
end
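Once the loop has filled the cell array, the per-page tables can be stacked into one table with vertcat, as long as they all have the same variable names. The sketch below uses a hypothetical stand-in (fakePage) instead of readEachPage so that it runs without touching the website. The short pause between iterations is just one possible way to space out requests, and the right delay for a real site is a guess.

```matlab
% fakePage is a hypothetical stand-in for readEachPage: it returns a
% one-row table so the example runs without any web access.
fakePage = @(i) table(i, "row " + i, 'VariableNames', {'Page', 'Label'});

n = 3;
unitab = cell(n, 1);           % preallocate one cell per page
for i = 1:n
    unitab{i} = fakePage(i);
    pause(0.1)                 % assumed delay between requests; tune for the real site
end

alltab = vertcat(unitab{:});   % stack all pages row-wise into a single table
disp(size(alltab))             % 3 rows (one per page), 2 variables
```

With the real function, three pages of 12 rows each would give the 36-by-7 size mentioned above.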
