Image processing is now a widely studied topic in the IT industry. One of its major applications is Optical Character Recognition (OCR). When an object to be matched is presented, our brain, or a recognition system in general, begins by extracting the object's important features, such as color, depth, shape and size. These features are stored in memory. The brain then searches for the closest match to the extracted features among the whole collection of objects it already holds, which we can refer to as the standard library. When it finds a match, it returns the matched object or signal from the standard library as the final result. For humans, character recognition seems a simple task, but making a computer analyze and correctly recognize a character is difficult. Here we deal specifically with "upper case character" recognition.
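The matching idea described above (extract features, then find the closest entry in a standard library) can be sketched as nearest-template matching. This is only an illustration under stated assumptions: the 5x5 binary glyphs, the `TEMPLATES` table and the function names are hypothetical, and the Hamming distance is one simple choice of "closeness", not the method of any particular OCR engine.

```python
# Hypothetical 5x5 binary glyphs standing in for the "standard library".
TEMPLATES = {
    "I": ["11111", "00100", "00100", "00100", "11111"],
    "L": ["10000", "10000", "10000", "10000", "11111"],
    "T": ["11111", "00100", "00100", "00100", "00100"],
}


def extract_features(glyph):
    """Flatten a glyph (list of row strings) into a tuple of 0/1 pixel values."""
    return tuple(int(pixel) for row in glyph for pixel in row)


def hamming_distance(a, b):
    """Count the positions at which two equal-length feature tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def recognize(glyph, templates=TEMPLATES):
    """Return the label of the template closest to the glyph."""
    features = extract_features(glyph)
    return min(
        templates,
        key=lambda label: hamming_distance(features, extract_features(templates[label])),
    )


# A "T" with one stray pixel is still closest to the stored "T" template.
noisy_t = ["11111", "00100", "00110", "00100", "00100"]
print(recognize(noisy_t))
```

Real systems use richer features than raw pixels, but the structure is the same: a feature extractor, a stored library, and a distance measure.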
Optical Character Recognition (OCR)
The goal of Optical Character Recognition (OCR) is to classify optical patterns (often contained in a digital image) corresponding to alphanumeric or other characters. The process of OCR involves several steps, including segmentation, feature extraction and classification. Each of these steps is a field unto itself, and is described briefly here in the context of a MATLAB implementation of OCR.
Text capture is the process of converting analogue text-based resources into digitally recognisable text resources. These digital text resources can be represented in many ways, such as searchable text in indexes that identify documents or page images, or as full text resources. An essential first stage in any text capture process from analogue to digital is to create a scanned image of the page side. This provides the base for all other processes. The next stage may then be to use a technology known as Optical Character Recognition to convert the text content into a machine-readable format.
Optical Character Recognition (OCR) is a type of document image analysis in which a scanned digital image containing either machine-printed or handwritten script is input into an OCR software engine and translated into an editable, machine-readable digital text format (such as ASCII text).
OCR works by first pre-processing the digital page image into its smallest component parts, using layout analysis to find text blocks, sentence/line blocks, word blocks and character blocks. Other features such as lines, graphics and photographs are recognised and discarded. The character blocks are then broken down further into component parts, pattern-recognised and compared to the OCR engine’s large dictionary of characters from various fonts and languages. Once a likely match is made it is recorded, and the characters in the word block are recognised in turn until all likely characters have been found for the word block. The word is then compared to the OCR engine’s large dictionary of complete words that exist for that language.
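The final word-level check described above can be sketched with Python's standard `difflib` module. This is a minimal sketch, assuming a tiny hypothetical `LEXICON` and a hypothetical `correct_word` helper; a real engine's dictionary and matching rules are far larger and are not described in this text.

```python
from difflib import get_close_matches

# Hypothetical stand-in for the engine's dictionary of complete words.
LEXICON = ["CHARACTER", "OPTICAL", "RECOGNITION", "RECOGNIZE"]


def correct_word(raw_word, lexicon=LEXICON, cutoff=0.8):
    """Return the closest dictionary word, or the raw word if nothing is close enough."""
    if raw_word in lexicon:
        return raw_word
    # cutoff is an assumed similarity threshold between 0 and 1.
    matches = get_close_matches(raw_word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else raw_word


# The character stage misread the letter "O" as the digit "0".
print(correct_word("RECOGNITI0N"))
```

Combining the per-character guesses with a word-level lookup like this is what lets a single misread character be repaired, which is the point the text makes about accuracy.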
These two levels of recognition, characters and words, are the key to OCR accuracy: by combining them the OCR engine can deliver much higher levels of accuracy. Modern OCR engines extend this accuracy through more sophisticated pre-processing of source digital images and better algorithms for fuzzy matching, sounds-like matching and grammatical measurements, so that the correct word is established more reliably.
1.1 Different uses for OCR
There are many uses for the output from an OCR engine, and these are not limited to a full text representation online that exactly reproduces the original. Because OCR can, in many circumstances, deliver character recognition accuracy below what a good copy typist would achieve, it is often assumed to have little validity as a process for many historical documents. However, as long as the process is fitted to the information requirement, OCR can have a place even when the accuracy is relatively low (see Accuracy below for more details).
Potential uses include:
Indexing – the OCR text is output into a pure text file that is then imported into a search engine. The text is used as the basis for full text searching of the information resource. However, the user never sees the OCR’d text; they are delivered a page image from the scanned document instead. This allows the OCR accuracy to be quite poor whilst still delivering the document to the user and providing a searching capability. However, this mode of searching just identifies the document, not necessarily the word or page on which the term appears; in other words, it just indexes that those words appear in a specific item.
An example of this is
Full text retrieval – in this mode the OCR text is created as above but further work is done in the delivery system to allow for true full text retrieval. The search results are displayed with hit highlighting within the page image displayed. This is a valuable addition to the indexing option from the perspective of the user. An example of this is the Forced Migration Online Digital Library2.
Full text representation – in this option the OCR’d text is shown to the end user as a representation of the original document. In this case the OCR must be very accurate indeed or the user will lose confidence in the information resource. All sorts of formatting issues in terms of the look and feel of the original are inherent within this option, and it is rarely used without mark-up (see below) of some kind. The key factor is accuracy, and this leads most projects to check and correct the OCR text to ensure it is suitable for publication, with obvious time and cost implications.
Full text representation with XML mark-up – in this option the OCR output is presented to the end user with layout, structure or metadata added via the XML mark-up. In the majority of cases where OCR’d text is to be delivered there will be at least a minimal amount of mark-up done to represent structure or layout. Currently this process normally requires the most human intervention of all the options listed here, as OCR correction is very likely in addition to mark-up of the content in some way. Many examples of digital text resources with XML mark-up may be found through the Text Encoding Initiative website3. The projects listed there also demonstrate the variety of levels of mark-up that are possible, so that activity can be varied to match the project’s intellectual requirements and economic constraints.
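The indexing use in the list above, where OCR text only identifies which documents contain a word, can be sketched as a small inverted index. This is an illustrative sketch only: the file names, the sample OCR text and the `build_index` and `search` helpers are hypothetical and do not describe any particular search engine.

```python
import re
from collections import defaultdict


def build_index(ocr_texts):
    """Map each lower-cased word to the set of document ids whose OCR text contains it."""
    index = defaultdict(set)
    for doc_id, text in ocr_texts.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical OCR output; "charactcr" is a deliberate recognition error.
ocr_texts = {
    "page-001.tif": "Optical charactcr recognition of printed text",
    "page-002.tif": "Forced migration and printed records",
}
index = build_index(ocr_texts)
print(search(index, "printed"))
print(search(index, "optical recognition"))
print(search(index, "character"))
```

The search returns page image identifiers rather than the OCR text itself, so the user never sees the misrecognised word; the cost is that an exact-match query for the damaged word finds nothing.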
1.2 Key issues for whether to use OCR
There are several key issues to consider in deciding whether to use OCR at all, or in choosing between the different appropriate uses for the text output. The main factors to consider are a combination of accuracy, efficiency and the value gained from the process. If the accuracy is below 98%, then the cost in time and effort of proofreading and correcting the resource has to be accounted for if a full text representation is to be made. For instance, see the EEBO production description for how the accuracy issue changed their potential approaches4. If the OCR engine is not capable of delivering the required accuracy then rekeying the text may become viable, but only if the intellectual value to be gained from having the rekeyed text matches the project’s goals and budgets. Otherwise, OCR for indexing and retrieval may be the most viable option.
The majority of OCR software suppliers define accuracy as a percentage figure based on the number of correct characters per volume of characters converted. This is very likely to be a misleading figure, as it is normally based upon the OCR engine attempting to convert a perfect laser-printed text of the modernity and quality of, for instance, the printed version of this document. So, if told that even the better OCR software could get 1 in 10,000 characters wrong, and that it would therefore likely get more than one or two characters wrong in this document, would this seem quite so impressive? It is more useful to know how accurate the OCR engine will be on pre-1950s printed texts of widely varying print and paper quality. In this context, it is highly unlikely that we will get 99.99% accuracy, and we could assume that even the very best quality printed pre-1950s resources will give no more than 98% (and most would be considerably less than that). In these scenarios the accuracy measure given by the software suppliers is not very useful in deciding whether OCR is appropriate to the original printed resource.
It would be more useful to regard accuracy as a measure of the amount of activity likely to be required to make the text output meet the defined requirements.
In this context we might look at the number of words that are incorrect rather than the number of characters. For example, take a page of 500 words with 2,500 characters. If the OCR engine gives a result of 98% accuracy, this equals 50 characters incorrect. However, looked at in word terms this could convert to 50 words incorrect (one character per word) and thus in word accuracy terms would equal 90% accuracy. If 25 words are inaccurate (2 characters on average per word) then this gives 95% in word accuracy terms. If 10 words were inaccurate (average of 5 characters per word) then the word accuracy is 98%. In terms of effort and usefulness the word accuracy matters more than the character accuracy – we can see the possibility of 5 times the effort to correct to 100% across the word accuracy range shown in this simple example. It is essential to remember that correcting OCR or text output is relatively expensive in terms of time and effort, requiring both correction and proofreading activities, so it is best to seek ways to avoid this additional activity if possible. The other consideration might be the usefulness of the text for indexing and retrieval purposes. If it is possible to achieve 90% character accuracy and still get 90% word accuracy, then most search engines utilising fuzzy logic would get in excess of 98% retrieval rate for straightforward prose text. In these cases the OCR accuracy may be of less interest than the potential retrieval rate for the resource (especially as the user will never see the OCR’d text to notice it isn’t perfect). In most prose circumstances significant words and names are repeated, which further improves the chances of retrieval and can enable high retrieval rates for OCR accuracies lower than 90%.
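The word accuracy arithmetic in the example above can be checked with a short calculation. This is a sketch using the figures from the text (500 words, 2,500 characters, 98% character accuracy); the function name `word_accuracy` and its parameters are hypothetical, and it makes the simplifying assumption that every incorrect word contains the same number of wrong characters.

```python
import math


def word_accuracy(total_words, total_chars, char_accuracy, errors_per_bad_word):
    """Return (wrong_words, word_accuracy) for a page.

    Assumes the wrong characters are spread evenly, errors_per_bad_word at a time,
    across the incorrect words.
    """
    wrong_chars = round(total_chars * (1 - char_accuracy))
    wrong_words = min(total_words, math.ceil(wrong_chars / errors_per_bad_word))
    return wrong_words, (total_words - wrong_words) / total_words


# The three cases from the text: 1, 2 and 5 wrong characters per incorrect word.
for errors_per_word in (1, 2, 5):
    wrong, accuracy = word_accuracy(500, 2500, 0.98, errors_per_word)
    print(f"{errors_per_word} per word: {wrong} words wrong, {accuracy:.0%} word accuracy")
```

With the same 98% character accuracy, the number of words needing correction ranges from 10 to 50, which is the five-fold difference in effort noted in the text.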