Incomplete reading of MS Word file

At work I have to read some VERY long Word documents (~300 pages) and analyze the text. However, if I use the commands suggested in https://fr.mathworks.com/matlabcentral/answers/348737-how-to-read-ms-word-file-doc-docx :
word = actxserver('Word.Application');
wdoc = word.Documents.Open(filePath);
text = wdoc.Content.text;
wdoc.Close; % close document
word.Quit; % end application
the resulting "text" variable (1x158745 char) only contains ~25% of the document.
How can I read the whole document using this method? I saw that on newer relaseses there are dedicated functions/toolboxes for reading Word documents, but I don't have access to them as my company only provides R2020b and limited toolboxes.

Réponses (1)

Oguz Kaan Hancioglu
Oguz Kaan Hancioglu le 12 Avr 2023

0 votes

I haven't tried for such a huge file but can you try the open word document with fopen and read the whole text using read(fid, '*char'). Maybe it will work.

1 commentaire

Walter Roberson
Walter Roberson le 12 Avr 2023
That will not work in the form stated. .docx files are zip files that contain a directory of mostly XML files.
You can unzip the .docx file and go through the directory and try to extract things from the XML files; the XML files would be text files.

Connectez-vous pour commenter.

Produits

Version

R2020b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by