How to extract text from .json files and combine them?

27 vues (au cours des 30 derniers jours)
Susan
Susan le 28 Mar 2020
Commenté : Ameer Hamza le 31 Mar 2020
Hello everyone,
I've got some questions and any inputs would be greatly appreciated. I have bunch of .json files, say 1000. To read each files I run the following code
fname = 'C:\Users\...\d90f3c62681e.json';
val = jsondecode(fileread(fname));
the output is as follows. For each file the paper_id, the size of abstract, and the size of body_text changes. I am interested in the text data in the "abstract" and the "body text". How can I extract text file in the abstract and body_text, and combine all these .json files into one file?
val =
struct with fields:
paper_id: 'd90f3c62681e'
metadata: [1×1 struct]
abstract: [1×1 struct]
body_text: [4×1 struct]
bib_entries: [1×1 struct]
ref_entries: [1×1 struct]
back_matter: []
val.abstract =
struct with fields:
text: '300 words)
cite_spans: []
ref_spans: []
section: 'Abstract'
val.body_text =
4×1 struct array with fields:
text
cite_spans
ref_spans
section
  4 commentaires
Walter Roberson
Walter Roberson le 28 Mar 2020
Which release are you using? When I try in R2020a, I get
paper_id: '0a43046c154d0e521a6c425df215d90f3c62681e'
>> val.abstract
struct with fields:
text: '300 words) 33 Quantification of aerosolized influenza virus [and a bunch more]
Susan
Susan le 29 Mar 2020
Hi Walter,
Thanks for your reply. I am using R2019a and get the same results as yours. My main question is considering some of this json files don't have any text for abstract, i.e., val.abstract = [], could you please tell me how I can put all the available val.abstract.text and val.body_text.text in 1 file? Do I need a for loop to go through all paper_id and extract text from each paper? If so, how?
Many thanks in advance!!

Connectez-vous pour commenter.

Réponse acceptée

Ameer Hamza
Ameer Hamza le 31 Mar 2020
Modifié(e) : Ameer Hamza le 31 Mar 2020
As I answered in the comment on your other question, the following code will create a struct by combining the fields from individual files. It will then create a combined JSON file
files = dir('JSON files/*.json');
s = struct('abstract', [], 'body_text', []);
for i=1:numel(files)
filename = fullfile(files(i).folder, files(i).name);
data = jsondecode(fileread(filename));
if ~isempty(data.abstract)
s.abstract = [s.abstract; cell2struct({data.abstract.text}, 'text', 1)];
end
if ~isempty(data.body_text)
s.body_text = [s.body_text; cell2struct({data.body_text.text}, 'text', 1)];
end
end
str = jsonencode(s);
f = fopen('filename.json', 'w');
fprintf(f, '%s', str);
fclose(f);
  2 commentaires
Susan
Susan le 31 Mar 2020
Thank you so much! Your answer completely solved my issue. Thanks again!
Ameer Hamza
Ameer Hamza le 31 Mar 2020
Glad to be of help.

Connectez-vous pour commenter.

Plus de réponses (1)

Mohammad Sami
Mohammad Sami le 30 Mar 2020
You can import your data into cell arrays
filelist = {};
vals = cell(length(filelist),1);
haveabstract = false(length(filelist),1);
havebody = false(length(filelist),1);
data = cell(length(filelist),3);
% first col paper_id, second_col abstract, third col body
for i=1:length(filelist)
vals{i} = jsondecode(fileread(filelist{i}));
haveabstract(i) = ~isempty(vals{i}.abstract);
havebody(i) = ~isempty(vals{i}.body_text);
data{i,1} = vals{i}.paper_id;
if haveabstract(i)
data{i,2} = vals{i}.abstract;
end
if havebody(i)
data{i,3} = vals{i}.body_text
end
end

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by