Defining a search query to find combinations of words and numbers in a PDF-file

Sindre Aske on 18 Feb 2022
Commented: DGM on 15 Dec 2023
Hi!
In connection with our master's thesis, we are trying to assess whether an annual report discloses a given set of ESG key performance indicators. We load the annual reports (PDF) into MATLAB and are trying to build a model that analyses the text and looks for a given combination of words and digits within a window of, e.g., 5 words. For instance, a search query could be: ("GHG" OR "CO2" OR "greenhouse gas") AND ("ton" OR "tons" OR "tonne"), where these words must appear together or within a "word window" of, e.g., 5 words. We are using R2021b.
We are able to build a model that searches for "GHG" OR "CO2" in the text, but we cannot figure out how to implement the AND condition or the window size (max 5 words). Current code:
str = extractFileText('MYFILE'); % PDF file
pat = [("GHG"|"CO2"|"greenhouse gas") & ("ton"|"tons"|"tonne")];
ESGD = contains(str,pat);
ESGD1 = extract(str, pat)
disp(ESGD)
We have also tried using the ngram function, without success... Do any of you guys have any suggestions for how we could construct this model? :)
3 comments
Sindre Aske on 21 Feb 2022
Thank you! We have looked into it but we are still struggling to combine words and limit the word window.
Current code:
txt = extractFileText('MYFILE.pdf');
expression = '(ton|tons|tonne)\ (GHG|CO2)';
pat = regexp(txt,expression,"match")
extract(txt,pat)
% contains(txt,pat)
disp(contains)
We are able to write an expression that looks for "ton", "tons" or "tonne" followed by "GHG" or "CO2". However, we want an expression that searches for ("GHG" OR "CO2") AND ("ton" OR "tons") within a word window of, e.g., 5 words, or 25 characters. Do you have any suggestions for how we could build this expression? We are new to working with text in MATLAB :)
DGM on 15 Dec 2023
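The proximity search described above can be sketched with a single regular expression. This is only a sketch under stated assumptions: the names `termA`, `termB`, `maxGap` and `hasMatch` are hypothetical, "tonnes" is added to the term list as an assumption, and a 5-word window is read as "at most 3 words between the two terms". The sample sentences are made up for illustration.

```matlab
% Hypothetical term lists; "tonnes" is added here as an assumption.
termA = '(?:GHG|CO2|greenhouse\s+gas)';
termB = '(?:tonnes?|tons?)';

% Assumption: a 5-word window leaves at most 3 words between the two terms.
maxGap = 3;
gap = sprintf('(?:\\W+\\w+){0,%d}\\W+', maxGap);

% Match A ... B or B ... A; the lookarounds stop matches inside longer words.
expr = ['(?<!\w)' termA gap termB '(?!\w)' ...
        '|(?<!\w)' termB gap termA '(?!\w)'];

% Helper: true when the expression matches anywhere in the text.
hasMatch = @(t) ~isempty(regexp(t, expr, 'once', 'ignorecase'));

% Made-up sample sentences.
near = 'Total emissions were 1200 tons of CO2 in 2021.';
far  = 'CO2 is discussed in the appendix, while the waste chapter lists 300 tons.';

disp(hasMatch(near))
disp(hasMatch(far))

% Return the matched text itself for inspection.
snippet = regexp(near, expr, 'match', 'once', 'ignorecase');
disp(snippet)
```

Because `\w+` treats any run of letters, digits and underscores as a word, a number written as "5,000" counts as two words here, so `maxGap` may need to be a little larger than the nominal window suggests.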
Unless there is something known about the given files, I don't know that I'd expect subscripted cases of CO2 to actually show up in plain text as "CO2". Just feeding chemistry textbooks to extractFileText suggests that it would only work about 50% of the time. A lot show up with extraneous spaces or they're completely obfuscated by the formatting.
Also, case-sensitivity is something to consider.
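Both concerns can be reduced by normalizing the extracted text before searching. The following is a minimal sketch, assuming the damage takes the form of stray whitespace between "CO" and "2" or a Unicode subscript two; other kinds of formatting damage would need their own rules. The sample string is made up.

```matlab
% Made-up sample with a split "CO 2", a subscript two (U+2082), and mixed case.
sub2 = string(char(8322));
raw = "Emissions of CO 2 fell to 900 Tons; CO" + sub2 + " intensity also fell.";

% Lower-case everything so later matching is case-insensitive.
txt = lower(raw);

% Map the Unicode subscript two to a plain "2".
txt = replace(txt, sub2, "2");

% Rejoin "co 2" into "co2"; the lookarounds avoid touching e.g. "co 2021".
txt = regexprep(txt, '(?<!\w)co\s+2(?!\w)', 'co2');

% Collapse runs of whitespace (including line breaks) into single spaces.
txt = regexprep(txt, '\s+', ' ');

disp(txt)
```

After this step the search terms should also be written in lower case. The rejoining rule is a trade-off: it would also merge a genuine "co 2" that is not the chemical formula, which is assumed to be rare in annual reports.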


Answers (1)

Vatsal on 15 Dec 2023
Hi,
I understand that you want to identify a specific pattern within a window of words in a text document. Below is the code that accomplishes this using the concept of a sliding window:
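% Extract text from file
str = extractFileText('MYFILE'); % PDF file
% Define the sets of words
% Note: matching below is exact and case-sensitive, so "CO2." or "Tons" will not match
set1 = ["GHG", "CO2"];
set2 = ["ton", "tons", "tonne"];
% Split the text into words (on whitespace)
words = strsplit(str);
% Window size in words
windowSize = 5;
% Initialize result
ESGD = false;
% Loop over the words with a sliding window
% (max/min keep the loop valid when the text has fewer words than the window)
for i = 1:max(1, numel(words) - windowSize + 1)
    window = words(i:min(i + windowSize - 1, numel(words)));
    if any(ismember(window, set1)) && any(ismember(window, set2))
        ESGD = true;
        break;
    end
end
disp(ESGD)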
I hope this helps!
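A variant of the sliding-window idea compares word positions directly, which also makes it easy to report where the terms co-occur. This is a sketch rather than a definitive implementation: it assumes lower-casing and keeping only runs of ASCII letters and digits is an acceptable tokenization (so "CO2." becomes "co2", and accented letters are dropped), and the sample text is made up.

```matlab
% Made-up sample text.
txt = "In 2021 we emitted 1,200 tons of CO2. Water use fell.";
windowSize = 5;
set1 = ["ghg", "co2"];
set2 = ["ton", "tons", "tonne"];

% Tokenize: lower-case, then keep runs of ASCII letters and digits only.
words = string(regexp(lower(char(txt)), '[a-z0-9]+', 'match'));

% Positions of the words belonging to each set.
idx1 = find(ismember(words, set1));
idx2 = find(ismember(words, set2));

% Pairwise distances between positions (implicit expansion, R2016b or later).
gaps = abs(idx1(:) - idx2(:).');

% Two words share a window of windowSize words if their positions differ
% by less than windowSize.
[r, c] = find(gaps < windowSize);
ESGD = ~isempty(r);

% Collect the text spanned by each co-occurring pair.
snippets = strings(numel(r), 1);
for k = 1:numel(r)
    lo = min(idx1(r(k)), idx2(c(k)));
    hi = max(idx1(r(k)), idx2(c(k)));
    snippets(k) = join(words(lo:hi), " ");
end

disp(ESGD)
disp(snippets)
```

Multi-word terms such as "greenhouse gas" do not fit this word-by-word comparison directly; one option is to replace the phrase with a single token in the text before tokenizing.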
