Why does OCR separate Text into Words?

Question

0 votes

test.mat

Hi all,

I am trying to retrieve specific text from scanned documents reporting tables of numbers. Since the table can change in the amount of column, I use the following approach:

1 - detection of the units of measure through OCR function,

2 - from the units I need (for example, kg/kW.h), calculation of a proper region of interest where OCR function is used to retrieve the needed numbers

This works rather fine but I do not obtain a consistent behaviour of OCR function. In particular, some cases, all the units are well separated into words by OCR function while in others they are grouped together in a single word. In the code below working with the attached data sample, you can see the issue. In particular, the 16th element of txt1.Words reports the units '(kg/kW.h)(kW.h/)' rather than having two Words (one for '(kg/kW.h)' and the other for '(kW.h/)') with their own WordBoundingBoxes. I do not understand why in some case, the units are in the same Word and in other they are bounded together in a single Word. Is it possible to control the generation process of Words in OCR function?

clear all
load('test.mat')
figure
imshow(I)
roi=[250.5	526	1300	142];
Iocr=insertShape(I,'rectangle',roi,'ShapeColor','blue');
hold on
imshow(Iocr)
txt1=ocr(I,roi,CharacterSet=".()kWrpmlhgh/");%,LayoutAnalysis='word');
UnitString=regexp(txt1.Words,'(?<=\()[\w\.\/]*(?=\))','match');
UnitString(cellfun(@isempty,UnitString))=[];
UnitBox=txt1.WordBoundingBoxes(not(cellfun(@isempty,UnitString)),:);

5 commentaires
Afficher 3 commentaires plus anciens Masquer 3 commentaires plus anciens

dpb le 27 Juil 2024

Modifié(e) : dpb le 27 Juil 2024

I think again you would have to provide both a "good" and a "bad" image for folks to have any chance whatever, as it's going to be related to whatever is different between the two in the particular region of interest.

See the Tips section at ocr for some hints about changing unexpected/unwanted behavior; probably the only way you'll be able to learn much more about the internals will be through the references; it looks as though Mathworks is using the open source implementation as their engine. I don't have the TB, so can't do anything here locally, but one last time; without the two images that behave differently, nothing anybody that comes by will be able to do.

BTW, it's been a long time since I've looked at the Nebraska tests, but they're a lot of fun to look at and very informative in a decision-making process if looking at new (to owner, not just brand new) equipment purchase. We're in the SW corner of KS and such is very big data here although a final purchase decision may end up relying more upon the quality and who are the nearby dealerships and less on the test data temselves.

dpb le 27 Juil 2024

"... as it's going to be related to whatever is different between the two in the particular region of interest."

Since the particular issue is splitting an area into two words instead of considering it as a single word/string for two cases of the same string and whether it splits or not is going to be related to how much white space is recognized between characters, it could be as subtle a thing as distortion in the scanner at the particular location in the one image or the page wasn't quite flat or the like.

There is the 'LayoutAnalysis' named parameter that might be able to help if you can isolate the area in the ROI.

dpb le 27 Juil 2024

"... as it's going to be related to whatever is different between the two in the particular region of interest."

I think you would need to start by pulling the ROIs and then doing image comparisons of them to see where they correlate and don't. This likely would require some scaling and maybe skew/rotation corrections to overlay exactly, but I would expect you should be able to find that the "bad" image does show a larger gap between symbols than the "good" one...for whatever reason, would then be the thing to unravel.

Connectez-vous pour commenter.

Connectez-vous pour répondre à cette question.

Follow Question

Answer 1

Karl le 28 Juil 2024

Modifié(e) : Karl le 28 Juil 2024

Ouvrir dans MATLAB Online

0 votes

test.mat

You can have an idea of the amount of white space between characters at a word split by displaying the text image with word blocks superimposed:

load('test.mat')

roi=[250 526 1300 142];

x1 = roi(2);

x2 = roi(2) + roi(4);

y1 = roi(1);

y2 = roi(1) + roi(3);

% Apply initial strategy, and show word bounds.

Iocr1=insertShape(I,'rectangle',roi,'ShapeColor','blue');

txt1=ocr(I,roi,CharacterSet=".()kWrpmlhg/",LayoutAnalysis='Block');

Iocr1 = insertObjectAnnotation(...

Iocr1,'rectangle',txt1.WordBoundingBoxes,txt1.Words,Color='green');

figure(Name='initial');

imshow(Iocr1(x1:x2,y1:y2,:))

disp(txt1.TextLines)

{'r rk l ...' } {'p h mmp mmpm' } {'(kW) p l/hr l/hp.hrp.hr/gll/hr mmphr'} {'rpm (l/h) (kg/kW.h)(kW.W/) (l/h) l' }

On the third line, the units Lb/hp.hr, Hp.hr/gal, Gal/hr aren't split at all. On the third line, the units (kh/kW.h) and (kW.h/l) are joined, but (l/h) is separate. There doesn't seem to be a way to specify to the ocr() function the minimum white space for splitting into words. However, it looks as if splitting can be encouraged by including a space in the character set:

% Include space in CharacterSet.

Iocr2=insertShape(I,'rectangle',roi,'ShapeColor','blue');

txt2=ocr(I,roi,CharacterSet=".()kWrpmlhg/ ",LayoutAnalysis='Block');

Iocr2 = insertObjectAnnotation(...

Iocr2,'rectangle',txt2.WordBoundingBoxes,txt2.Words,Color='green');

figure(Name='space in CharacterSet');

imshow(Iocr2(x1:x2,y1:y2,:))

disp(txt2.TextLines)

{'r rk l ...' } {'p h mmp mmpm' } {'(kW) p l/hr l/hp.hr p.hr/gl l/hr m mphr'} {'rpm (l/h) (kg/kW.h) (kW.W/) (l/h) l' }

The splitting itself now seems okay, but the character recognition (as with the initial approach) is imperfect - (kW.h/l) is interpreted as (kW.W/). I'd guess that this relates to the image resolution. The text in the example image that you attached looks fine at a size suitable for reading by a human, but if I zoom in closer to the pixel level the character edges are blurred, some more than others. If you're able to rescan at a higher resolution (i.e. more pixels per inch), this could help. Otherwise, it could be worth experimenting with some preprocessing, along the lines suggested by dpb.

2 commentaires
Afficher Aucune Masquer Aucune

dpb le 28 Juil 2024

Modifié(e) : dpb le 28 Juil 2024

"...splitting can be encouraged by including a space in the character set:"

The issue is more that it is splitting where not wanted since the units strings are one set of units for the measurement column. Although I suppose if one could get a reproducible splitting across all images, that could be dealt with programmatically.

"the example image that you attached looks fine at a size suitable for reading by a human, but if I zoom in closer ..."

That is a very important point I neglected to add but using a zoom factor from 2X-4X on the image is one of the items mentioned in the Tips section referenced so I let that reference suffice. Since I don't have the TB I didn't download the image to really poke at, but I suspect you've uncovered the root cause. OP may need to do a combination of enlargement and noise and sharpening filtering to stabilize results obtained from just direct application to the images as scanned. The idea of scanning with higher resolution certainly is worth trying if possible.

I am no image processing expert; maybe our resident guru @Image Analyst will come across the Q? and chime in with a real expert's knowledge/experience.

Serbring le 1 Août 2024

Hi all,

thanks for your message. I admit that I was sure that there was a kind of undocumented trick that I was not able to find and so I believed it was easier to fullfil this task. By the way, considering that I have to process more than 1000 files (already scanned), thanks to your suggestions, the results have significantly improved thanks to:

inclusion of empty space in the character set
increment on image contrast
setting of the "LayoutAnalysis" property to "block"
through some manual handling. For example, if there are more than two units, the resulting word bounding box is reduced.

I would like to hear @Image Analyst's opinion.

Connectez-vous pour commenter.

Why does OCR separate Text into Words?

5 commentaires
Afficher 3 commentaires plus anciens Masquer 3 commentaires plus anciens

Réponses (1)

2 commentaires
Afficher Aucune Masquer Aucune

Catégories

Produits

Tags

Community Treasure Hunt

Why does OCR separate Text into Words?

5 commentaires Afficher 3 commentaires plus anciens Masquer 3 commentaires plus anciens

Réponses (1)

2 commentaires Afficher Aucune Masquer Aucune

Catégories

Produits

Tags

Voir également

Community Treasure Hunt

5 commentaires
Afficher 3 commentaires plus anciens Masquer 3 commentaires plus anciens

2 commentaires
Afficher Aucune Masquer Aucune