What Is Optical Character Recognition (OCR)? Cost, Capabilities and Limitations

Has your organization ever had difficulty locating a record, or a specific piece of information within a record? One of the easiest and most effective ways to remedy this common problem is to digitize your paper records. Document imaging converts paper records to a digital format, which frees the space once occupied by boxes of paper and makes it far easier to locate information across multiple documents. Because records are stored electronically with multiple index fields, finding a document takes much less effort than manually sorting through paper. Why not make it even easier to locate specific pieces of information, though? The best way to do this is to add Optical Character Recognition (OCR) software as an overlay to your digitized records. If you’ve heard of OCR before, it’s probably because you have used it in a common application such as Adobe Reader.

How OCR Works

OCR software is an optional feature that you can add when digitizing records. But what exactly is it, and what does it do? In short, OCR software is overlaid on your digitized records so that users can search for specific keywords anywhere within their record database. It achieves this through a multi-step process.

First, paper records are digitized using document imaging. Next, the OCR software scans the entire record database and searches for recognizable characters. It does this by dividing each page into sections of text, images, numeric characters, etc. From here, the OCR software compares the characters it finds with a set of known characters that has already been programmed in. It then determines what each character is and produces a final text version of the page.
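The comparison step described above can be sketched as simple template matching. This is only a toy illustration under stated assumptions: the 3x5 bitmaps, the TEMPLATES table, and the function names are all made up for this example, and real OCR products use far more sophisticated recognition than a pixel-by-pixel comparison.

```python
# Toy template matching: each character is a 3x5 bitmap ("#" = ink, "." = blank).
# TEMPLATES is a hypothetical stand-in for the "known character set";
# it is not taken from any real OCR product.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def similarity(glyph, template):
    """Return the fraction of pixels on which glyph and template agree."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for glyph_pixel, template_pixel in zip(glyph_row, template_row):
            total += 1
            if glyph_pixel == template_pixel:
                same += 1
    return same / total


def recognize(glyph):
    """Return (best_character, score) for one scanned glyph."""
    scored = ((char, similarity(glyph, tpl)) for char, tpl in TEMPLATES.items())
    return max(scored, key=lambda pair: pair[1])


def recognize_line(glyphs):
    """Turn a sequence of scanned glyphs into a text string."""
    return "".join(recognize(glyph)[0] for glyph in glyphs)


# A scanned "L" with one stray pixel of noise in the middle row.
noisy_l = ["#..", "#..", "#.#", "#..", "###"]
char, score = recognize(noisy_l)
print(char, round(score, 2))
```

Because the software picks the closest match rather than demanding an exact one, a slightly smudged character can still be recognized. The same mechanism also hints at why poor legibility causes trouble: the further a scanned shape drifts from every known character, the less reliable the best match becomes.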

Once the software has created this functional text format, the database is stored online via the document imaging provider’s records management software. Users can then log in and search for individual keywords throughout their document database, no matter how large or small the database may be. This is a much easier process than searching through paper records, or through digitized records with no OCR added. The OCR software can also be added at any point, either during digitization or any time after the process has completed.
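To make the keyword search concrete, here is a minimal sketch of how searchable text might be indexed once OCR has produced it. It assumes the OCR output is available as plain strings keyed by file name. The sample file names, the sample text, and the build_index and search functions are hypothetical and do not describe any particular provider’s records management software.

```python
import re
from collections import defaultdict

WORD_PATTERN = re.compile(r"[a-z0-9]+")


def build_index(documents):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in WORD_PATTERN.findall(text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted IDs of documents containing every word in the query."""
    words = WORD_PATTERN.findall(query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


# Hypothetical OCR output for three scanned records.
ocr_text = {
    "invoice-001.tif": "Invoice 4471 for Acme Supply, net 30 days",
    "contract-17.tif": "Service contract between Acme Supply and the client",
    "memo-03.tif": "Internal memo regarding the holiday schedule",
}

index = build_index(ocr_text)
print(search(index, "acme supply"))
print(search(index, "holiday"))
```

The index is built once, so each later search only looks up the query words instead of rereading every document. That is one plausible reason a keyword search can stay fast regardless of how large the database grows.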

OCR Capabilities and Limitations

OCR does more than recognize and analyze the body text of a document. If a document contains tables or images that include text or numeric characters, the OCR software can scan that content and recognize those characters as well.

But what are the limitations of OCR software? In short: handwriting. Although modern OCR software can successfully read many types of handwriting found within documents, a few factors can inhibit its ability to do so. Some of them include:

  • Legibility: If handwriting is difficult for someone to read, it’s probably going to be more difficult for the OCR software to read and convert to text.
  • Type of Writing: While printed handwriting is fairly easy for OCR to recognize, cursive is more difficult.
  • Type of Medium: Handwriting in pencil is often more difficult for OCR to read than handwriting in pen.
  • Quality of Writing: Faded handwriting is more difficult for the OCR software to read and convert.

OCR File Formats

When paper records are first imaged, they are usually saved in image file formats such as .tiff or .jpeg. Once the OCR software is overlaid, however, the content is converted to a text format such as a .txt file, because keywords can only be searched in a text format. By scanning documents to recognize characters, OCR software effectively turns what a computer sees as an image of a document into an actual text document.

OCR Size and Overlay Time

Just as paper records take up physical space, digitized records take up digital space. Thousands of boxes of records can equate to many gigabytes of digital storage. When OCR is overlaid on these digitized records, the required storage space increases to accommodate the additional text. The more room a database initially occupies, the more room OCR is going to take up. This also means that the larger the database, the more time it will take to add OCR.

OCR Cost

For you, the size of the database translates into cost. Most document imaging providers charge clients by the amount of storage space a database occupies, rather than explicitly for whether OCR has been added. Because OCR requires additional space on top of what already exists, your cost will increase along with that storage space.

For example, if a provider is hosting a client’s database that takes up ten gigabytes of room, the client is charged for ten gigabytes. If that same client elects to add OCR to its database, the amount of room it occupies might increase to 13 gigabytes. This will raise the cost to the provider’s rate for 13 gigabytes.

There also might be minimums that a client is required to pay. For example, if a client’s database takes up only one gigabyte, adding OCR might only increase the storage size to 1.2 gigabytes. Although this OCR overlay only increased the database’s size by 200 megabytes, the client might be required to pay a minimum that is equal to two gigabytes.
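The two examples above can be worked through in a few lines. This sketch uses the storage figures from the text, but the per-gigabyte rate of 0.50 is purely an assumed placeholder, and the function names are hypothetical. Actual pricing and minimums vary by provider.

```python
def billable_gb(base_gb, ocr_overhead_gb, minimum_gb=0.0):
    """Storage the client is billed for: actual usage or the minimum, whichever is larger."""
    return max(base_gb + ocr_overhead_gb, minimum_gb)


def storage_cost(base_gb, ocr_overhead_gb, rate_per_gb, minimum_gb=0.0):
    """Cost at a flat per-gigabyte rate (the rate itself is an assumption)."""
    return billable_gb(base_gb, ocr_overhead_gb, minimum_gb) * rate_per_gb


RATE = 0.50  # hypothetical price per gigabyte, for illustration only

# Ten-gigabyte database that grows to 13 gigabytes with OCR.
print(billable_gb(10.0, 3.0))            # billed for 13 GB
print(storage_cost(10.0, 3.0, RATE))

# One-gigabyte database that grows to 1.2 gigabytes, with a two-gigabyte minimum.
print(billable_gb(1.0, 0.2, minimum_gb=2.0))   # billed for the 2 GB minimum
print(storage_cost(1.0, 0.2, RATE, minimum_gb=2.0))
```

In the second case the OCR overlay does not change the bill at all, because the database stays under the minimum both before and after OCR is added.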

Digitizing Documents at Rover Records Management

Adding OCR to your company’s online records database is a convenient way to make locating specific pieces of information even easier. Rover Records Management offers both document imaging and OCR overlay technologies that are tailored to your business schedule and practices. Rover also offers secure, redundant offsite backup of data, which protects your company’s sensitive information by storing it in multiple electronic locations.