Using readtable with Word document and special formatting (superscripts)

When reading a table from a Word document, readtable treats superscript content the same way as any other content.
For example, if a table cell contains text followed by a superscript marker, readtable merges the two and returns something like Lunar21, which is a problem. Superscripts are often used to point to footnotes. Is there an option for readtable to ignore superscripts and other such characters, or is there any other workaround?

6 Comments

I doubt if there is any such option.
Could you share your file? (Use the paperclip button to attach.)
The paperclip doesn't let me attach a doc file - says unsupported type.
If there isn't a way to do this with readtable, perhaps I can preprocess the table using actxserver? I am thinking copy-paste into Excel and invoke Excel with actxserver. There may be a way to identify formatting elements and remove them in the Excel file before using readtable.
My familiarity with actxserver properties & methods is minimal and I am having trouble doing this too. Something like this...
Excel = actxserver('Excel.Application');
w = Excel.Workbooks.Open(filename);
ExlSht = w.Sheets.Item('Sheet1');
RangeContainsTable = ExlSht.UsedRange;
r = RangeContainsTable.Rows.Count;
c = RangeContainsTable.Columns.Count;
for i = 1:r
    for j = 1:c
        % Something here to identify cells that have special
        % formatting/superscript & delete them?
        % Range.Value methods don't work as value already collapses special
        % formatting to normal text
        % I am not sure how to invoke (row,col) indexing into the above
        % range
        t = RangeContainsTable(i,j).Characters.Font.Superscript;
        % Above line doesn't work as RangeContainsTable is just a 1x1
        % object
    end
end
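One way the loop above might be completed is sketched below. It is only a sketch under several assumptions: that `get` with extra arguments reaches the parameterized `Item` and `Characters` properties of an Excel Range, that `filename` (a hypothetical value here) is a full path to the workbook, and that only text cells need cleaning. It has not been run against the attached tables.

```matlab
% Sketch only: strip superscript characters from every text cell on Sheet1.
% Assumes Windows with Excel installed; filename is a hypothetical full path.
filename = fullfile(pwd, 'tables.xlsx');
Excel = actxserver('Excel.Application');
w = Excel.Workbooks.Open(filename);
ExlSht = w.Sheets.Item('Sheet1');
RangeContainsTable = ExlSht.UsedRange;
for i = 1:RangeContainsTable.Rows.Count
    for j = 1:RangeContainsTable.Columns.Count
        % Item(i,j) is a parameterized property, so request it through get
        cellObj = get(RangeContainsTable, 'Item', i, j);
        txt = cellObj.Value;
        if ~ischar(txt)
            continue   % skip numeric and empty cells
        end
        % Walk backwards so deletions do not shift the remaining positions
        for k = numel(txt):-1:1
            ch = get(cellObj, 'Characters', k, 1);   % one character at position k
            if isequal(ch.Font.Superscript, true)
                ch.Delete;
            end
        end
    end
end
w.Save;
w.Close(false);
Excel.Quit;
```

Checking the font one character at a time avoids the problem noted in the comments: `Value` has already flattened the formatting, whereas `Characters` still exposes it.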
You can zip your file and attach the zip
Thanks for the tip on attaching. Here's an example, although there are many more tables that are much larger and harder to work with.
Never mind my original request. I still don't know how to remove special formatting directly while importing with readtable (if it's possible at all), but I did manage to get the actxserver method working. Thanks.

Accepted Answer

Hi AR,
I understand that you are running into problems when the “readtable” function reads text containing superscript characters. To work around this, I suggest using an ActiveX session with Microsoft Word, which lets you remove the superscript characters before reading the table. The steps are as follows:
  • Using MATLAB, start an ActiveX session with Word and load the document.
wordApp = actxserver('Word.Application');
wordApp.Visible = true; % Keep Word visible so that errors are easier to debug
doc = wordApp.Documents.Open('path_to_Temp_tableAttempt.docx');
  • Now loop through the tables and, within each cell, iterate through each character to check for superscripts. If a superscript character is encountered, delete it:
for i = 1:doc.Tables.Count
    tbl = doc.Tables.Item(i);   % avoid shadowing the built-in table, cell, and char
    for row = 1:tbl.Rows.Count
        for col = 1:tbl.Columns.Count
            try
                cellObj = tbl.Cell(row, col);
                cellRange = cellObj.Range;
                % Loop backwards so deletions do not shift the remaining indices
                for charIndex = cellRange.Characters.Count:-1:1
                    ch = cellRange.Characters.Item(charIndex);
                    % Check if the character is superscript
                    if ch.Font.Superscript
                        % Remove the superscript character
                        ch.Delete();
                    end
                end
            catch ME
                % Merged cells make some (row, col) positions invalid
                disp(['Error processing cell at row ' num2str(row) ', column ' num2str(col) ': ' ME.message]);
            end
        end
    end
end
  • Then save this updated file and close the ActiveX session.
doc.SaveAs2('path_to_updated_Temp_tableAttempt.docx'); % no trailing space in the file name
doc.Close();
wordApp.Quit();
By following these steps, you will be able to create an updated document free of superscript characters, making it easier to read the table using the “readtable” function.
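Once the cleaned copy is saved, reading it back could look like the sketch below. The file location is hypothetical, and the `TableIndex` value assumes the table of interest is the first one in the document. Building an absolute path with `fullfile` is suggested because Word tends to resolve relative paths against its own working folder rather than MATLAB's.

```matlab
% Hypothetical location; replace with the real path to the cleaned document.
cleanedFile = fullfile(pwd, 'updated_Temp_tableAttempt.docx');

% TableIndex selects which table in the document to import (1 = first table).
T = readtable(cleanedFile, 'TableIndex', 1);

% Show the first few rows to confirm the footnote markers are gone.
disp(head(T));
```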
To learn more about the “actxserver” function, refer to the MATLAB documentation.
Hope this helps.

3 Comments

As noted previously, I had already solved this problem using the COM server approach. Generally, I dislike having to use it, as I am not fluent in the methods and options available (for example, Cell.Font.Name, etc.), and documentation for them is poor or non-existent. I recognize that such documentation doesn't need to come from MathWorks, but since I can't find an accurate listing of all possible methods, it ends up being a trial-and-error approach for me, which wastes a lot of time.
Possibly calling methods might help.
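As a rough illustration of that suggestion, the sketch below asks MATLAB to list what a COM object exposes. It assumes Windows with Word installed. The same calls should work on any object returned by the server (a document, a table, a cell range), which may reduce some of the trial and error.

```matlab
% Sketch: discover the methods and properties of a COM object from MATLAB.
wordApp = actxserver('Word.Application');
cleanupObj = onCleanup(@() wordApp.Quit());   % close Word when done

methodList = invoke(wordApp);   % struct of method names with their prototypes
propList   = get(wordApp);      % struct of property names with current values

disp(fieldnames(methodList));   % e.g. look for Quit, Activate, ...
disp(fieldnames(propList));     % e.g. look for Documents, Visible, ...

methods(wordApp)                % plain list of method names
```

The prototypes returned by `invoke` show argument types, which can then be cross-checked against Microsoft's Word object model reference.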
Thank you. I will try this out.

Release: R2023b

Asked: AR on 7 Mar 2024

Commented: AR on 18 Oct 2024