- The contents can be compressed and/or encrypted, such that the string cannot be found in clear text inside the file.
- Even without encryption or compression, the text need not be stored contiguously: in a valid PDF each character can be stored with its own position on the page, so the order in the file does not matter.
How to extract data from a PDF file in MATLAB?
I am looking for an algorithm that extracts data from a PDF file. For example, the PDF file contains the sentence "Account# 29", and I want to extract the 29. If this is possible with the fopen() function, please share how. I have tried pdftotext without success. A solution based on fopen() would be preferable, but my own attempts with fopen() have failed as well. Please share your experience with me. Thanks.
Jan on 21 Sep 2014
Assume you have a PDF file that, when displayed, contains the string "Account# 345". Several details impede the extraction of this string:
Consequently, searching for a string in a PDF is not reliable. Therefore OCR software is frequently applied to add an additional layer that contains the contents as searchable strings. But as long as you do not specify any details of your PDFs, we cannot guess whether they contain such strings.
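To make this limitation concrete, here is a minimal sketch of the naive fopen() approach. It reads the file as raw bytes and searches for the pattern with regexp. The function name findAccountNumber and the pattern are only assumptions for illustration. The search succeeds only if the string happens to be stored uncompressed, unencrypted and in one piece; for many real PDFs it will simply return an empty result.

```matlab
function num = findAccountNumber(fileName)
% FINDACCOUNTNUMBER  Naive search for "Account# <digits>" in raw file bytes.
% Hypothetical helper for illustration only. Returns the number as a double,
% or [] if the pattern does not occur literally in the file.

fid = fopen(fileName, 'r');
if fid == -1
    error('Cannot open file: %s', fileName);
end
% Read the whole file as bytes and convert to a 1 x N char row vector
raw = fread(fid, Inf, 'uint8=>char').';
fclose(fid);

% Look for the first literal occurrence of the pattern
tok = regexp(raw, 'Account#\s*(\d+)', 'tokens', 'once');
if isempty(tok)
    % Typical outcome if the text is compressed, encrypted or fragmented
    num = [];
else
    num = str2double(tok{1});
end
end
```

A call such as num = findAccountNumber('statement.pdf') (the file name is a placeholder) then yields either the number or an empty array. An empty result does not mean that the displayed page lacks the string, only that it is not stored literally in the file.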
Please note that your problem is not well defined, so suggesting solutions is still based on guessing, although you have posted several related questions in this forum. Ultimately the main problem is that somebody decided to store the data in PDF files, a format that is not suited to the later extraction of strings. Creating a large and complicated workaround afterwards is inefficient. It would be more stable and faster to obtain the data in a more suitable format, such as a text file.