How to extract text from .json files and combine them?

Susan on 28 Mar 2020
Commented: Ameer Hamza on 31 Mar 2020
Hello everyone,
I have a few questions, and any input would be greatly appreciated. I have a bunch of .json files, say 1000. To read each file, I run the following code:
fname = 'C:\Users\...\d90f3c62681e.json';
val = jsondecode(fileread(fname));
The output is shown below. The paper_id, the size of abstract, and the size of body_text differ from file to file. I am interested in the text data in "abstract" and "body_text". How can I extract the text from abstract and body_text, and combine all of these .json files into one file?
val =
  struct with fields:
    paper_id: 'd90f3c62681e'
    metadata: [1×1 struct]
    abstract: [1×1 struct]
    body_text: [4×1 struct]
    bib_entries: [1×1 struct]
    ref_entries: [1×1 struct]
    back_matter: []
val.abstract =
  struct with fields:
    text: '300 words)
    cite_spans: []
    ref_spans: []
    section: 'Abstract'
val.body_text =
  4×1 struct array with fields:
    text
    cite_spans
    ref_spans
    section
  4 Comments
Susan on 29 Mar 2020
Hi Walter,
Thanks for your reply. I am using R2019a and get the same results as you. My main question is this: some of these .json files have no abstract text, i.e., val.abstract = []. Could you please tell me how I can put all the available val.abstract.text and val.body_text.text into one file? Do I need a for loop that goes through every paper_id and extracts the text from each paper? If so, how?
Many thanks in advance!!


Accepted Answer

Ameer Hamza on 31 Mar 2020
Edited: Ameer Hamza on 31 Mar 2020
As I answered in the comment on your other question, the following code builds a single struct by combining the abstract and body_text fields from the individual files, skipping any that are empty. It then writes the result to one combined JSON file:
files = dir('JSON files/*.json');   % all .json files in the "JSON files" folder
s = struct('abstract', [], 'body_text', []);
for i = 1:numel(files)
    filename = fullfile(files(i).folder, files(i).name);
    data = jsondecode(fileread(filename));
    % Skip files whose abstract is empty (val.abstract = [])
    if ~isempty(data.abstract)
        s.abstract = [s.abstract; cell2struct({data.abstract.text}, 'text', 1)];
    end
    % body_text is a struct array; keep only the text of each element
    if ~isempty(data.body_text)
        s.body_text = [s.body_text; cell2struct({data.body_text.text}, 'text', 1)];
    end
end
% Encode the combined struct and write it to a single JSON file
str = jsonencode(s);
f = fopen('filename.json', 'w');
fprintf(f, '%s', str);
fclose(f);
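
If a plain text file is wanted instead of a combined JSON file, the struct built above can be flattened with the same comma-separated-list trick. This is only a rough sketch: the sample strings stand in for the real `s`, and the output name `combined_text.txt` is a made-up placeholder.

```matlab
% Stand-in for the struct produced by the loop above (hypothetical sample data).
s = struct('abstract', [], 'body_text', []);
s.abstract  = cell2struct({'First abstract.', 'Second abstract.'}, 'text', 1);
s.body_text = cell2struct({'Intro.', 'Methods.', 'Results.'}, 'text', 1);

% Collect every available text entry; a field that is still [] is skipped.
parts = {};
if ~isempty(s.abstract)
    parts = [parts, {s.abstract.text}];
end
if ~isempty(s.body_text)
    parts = [parts, {s.body_text.text}];
end

% One entry per line, written to a single text file.
allText = strjoin(parts, newline);
outName = fullfile(tempdir, 'combined_text.txt');   % hypothetical output file
fid = fopen(outName, 'w');
assert(fid ~= -1, 'Could not open %s for writing.', outName);
fprintf(fid, '%s\n', allText);
fclose(fid);
```

Note that this puts all abstracts first and all body text after them, so the link between a paragraph and its paper_id is lost. If that link matters, keep one row per paper instead.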

More Answers (1)

Mohammad Sami
Mohammad Sami on 30 Mar 2020
You can import your data into cell arrays:
filelist = {};   % fill in with the full paths of your .json files
vals = cell(length(filelist),1);
haveabstract = false(length(filelist),1);
havebody = false(length(filelist),1);
data = cell(length(filelist),3);
% first col paper_id, second col abstract, third col body
for i = 1:length(filelist)
    vals{i} = jsondecode(fileread(filelist{i}));
    haveabstract(i) = ~isempty(vals{i}.abstract);
    havebody(i) = ~isempty(vals{i}.body_text);
    data{i,1} = vals{i}.paper_id;
    if haveabstract(i)
        data{i,2} = vals{i}.abstract;
    end
    if havebody(i)
        data{i,3} = vals{i}.body_text;
    end
end
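
The file list can be filled in with dir, and the struct arrays can be collapsed to one string per paper. Here is a minimal sketch under a few assumptions: the two sample papers, the temporary folder, and the choice to join paragraphs with a single space are all made up for illustration, and the sample files carry only the three fields used here.

```matlab
% Write two tiny sample files to a temporary folder (hypothetical data).
folder = tempname;
mkdir(folder);
p1 = struct('paper_id', 'a1', ...
            'abstract', struct('text', 'Abstract one.'), ...
            'body_text', struct('text', {'Intro.'; 'Methods.'}));
p2 = struct('paper_id', 'b2', ...
            'abstract', [], ...                       % paper without an abstract
            'body_text', struct('text', {'Only body.'}));
papers = {p1, p2};
for k = 1:numel(papers)
    fid = fopen(fullfile(folder, [papers{k}.paper_id '.json']), 'w');
    fprintf(fid, '%s', jsonencode(papers{k}));
    fclose(fid);
end

% Build the list of full paths from the folder contents.
listing  = dir(fullfile(folder, '*.json'));
filelist = fullfile({listing.folder}, {listing.name});

% Columns: paper_id, abstract text, body text (empty char if missing).
data = cell(numel(filelist), 3);
for i = 1:numel(filelist)
    val = jsondecode(fileread(filelist{i}));
    data{i,1} = val.paper_id;
    data{i,2} = '';
    data{i,3} = '';
    if ~isempty(val.abstract)
        data{i,2} = strjoin({val.abstract.text}, ' ');
    end
    if ~isempty(val.body_text)
        data{i,3} = strjoin({val.body_text.text}, ' ');
    end
end

rmdir(folder, 's');   % remove the sample files again
```

Each row of data then holds plain character vectors, which keeps the text tied to its paper_id.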
