Hailed as a way to breathe life into our static paper documents, Optical Character Recognition (OCR) has made great strides in recent decades, becoming a staple module in just about every document management software package, from Nuance’s PaperPort to EMC’s Documentum.
OCR itself can mean various things. Wikipedia offers this definition: "… the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text" (2008).
While many estimate that OCR engines can reach accuracy levels of 98 or 99 percent, it has been my experience that this is very difficult to achieve with most commercially available software suites aimed at small-to-medium businesses (SMBs). Many variables affect the accuracy of the output, ranging from the condition of the document to the readability of its text.
With so many variables in scanning paper-based documents, it is often not possible to achieve high accuracy rates on a small budget. Thus OCR can be a challenge to implement in many SMBs.
When the rubber meets the road: Typical applications of OCR revolve around digitizing documents, turning each one into an image accompanied by usable metadata about the information on the physical page. In essence, the computer reads the document and creates a library of searchable information.
This type of application lets an EDM (electronic document management) solution build a database of text that is tied back to the original images and stored as a layer of the document, or image, itself. Searching for usable information within and across documents becomes much easier. In other words, it gets you in the right neighborhood.
Extremely high accuracy rates are often not an issue in these applications, because the indexes can be combined with this database of textual information, dramatically increasing the findability of information.
Where are the brakes on this thing? Problems begin when OCR is used not just to capture the text contained within the scanned document, but to lift the index values themselves (e.g., customer name or customer number). Why is this so dangerous?
Combined with other technology and processes, OCR itself is a wonderful aid in seeking efficiency within the business. However, with no quality assurance or stop-loss measures in place, it is highly likely a document will be misplaced due to a character being off here or there. In essence, you now have a needle in a haystack.
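To put a rough number on that risk, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every character is recognized independently and with the same accuracy; real OCR errors tend to cluster on poor scans, so treat the figures as indicative only. The function name and the 10-character field length are hypothetical.

```python
def field_accuracy(char_accuracy: float, field_length: int) -> float:
    """Chance that every character of an index field is read correctly.

    Assumption (not from any vendor): per-character errors are
    independent and equally likely for every character.
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if field_length < 0:
        raise ValueError("field_length must not be negative")
    return char_accuracy ** field_length


if __name__ == "__main__":
    # Hypothetical 10-character customer number.
    for accuracy in (0.98, 0.99):
        whole_field = field_accuracy(accuracy, 10)
        print(f"{accuracy:.0%} per character -> "
              f"{whole_field:.1%} of 10-character fields fully correct")
```

Under these assumptions, 98 percent per-character accuracy leaves only about 82 percent of 10-character index values entirely correct, and 99 percent leaves about 90 percent. The rest are candidates for misfiling unless something catches them.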
Know your goals, your threshold of pain, and what you can accept.
Do you need 100 percent accuracy, or is a margin of error acceptable? Ask yourself what will happen if a document is misfiled. The answer will tell you how to design your workflow better.
Document preparation is key to ensuring efficient use of personnel time as well as achieving high levels of accuracy.
While most people turn to OCR to save on document preparation time, consider the entire process against what you are trying to achieve. Isn’t saving 30 minutes in your workflow worth sacrificing an extra 5 minutes in document preparation?
Quality assurance on key information is requisite if high levels of accuracy are required – especially in audit or regulatory scenarios.
OCR technology isn’t perfect, and the quality of the documents being processed greatly affects the level of accuracy. Discussing up front how best to ensure the levels of accuracy and automation you expect will save you a great deal of pain after the launch of your new solution.
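One possible shape for such a quality check is sketched below: an OCR-lifted index value is filed automatically only when it exactly matches a value already known to the business, and anything else goes to a person for review with the closest known values attached. This is a minimal illustration, not a feature of any particular EDM product; the function name, the customer numbers, and the 0.8 similarity cutoff are all hypothetical. It uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib


def check_index_value(ocr_value, known_values, cutoff=0.8):
    """Decide whether an OCR-lifted index value can be filed automatically.

    Returns a (status, value, suggestions) tuple:
      ("accepted", matched_value, [])   on an exact match
      ("review", None, [close matches]) otherwise
    """
    cleaned = ocr_value.strip().upper()
    # Map normalized forms back to the values as stored in the master list.
    lookup = {value.strip().upper(): value for value in known_values}

    if cleaned in lookup:
        return ("accepted", lookup[cleaned], [])

    # No exact match: never guess. Offer the nearest known values to a reviewer.
    close = difflib.get_close_matches(cleaned, list(lookup), n=3, cutoff=cutoff)
    return ("review", None, [lookup[c] for c in close])


if __name__ == "__main__":
    # Hypothetical master list of customer numbers.
    customers = ["ACME-00417", "ACME-00471", "BOLT-10022"]

    print(check_index_value("ACME-00417", customers))
    # Letter "O" read in place of a zero: goes to review, not to a folder.
    print(check_index_value("ACME-0O417", customers))
```

The design choice here is that the software never corrects a value on its own. A near miss is only a suggestion for the person doing quality assurance, which matters most in audit or regulatory scenarios.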
The key to findability of information contained within documents is to enforce process.
While OCR can greatly increase the automation of your processes, ensure that everyone from the executives to the administrative personnel agrees on how to find your documents. Many businesses find themselves wishing they had never wasted the time or money on a document management system when all they were missing was a little understanding and enforcement of process.
Where will you apply OCR in your workflows?
Don’t just say “everywhere” and expect “Google-like” results. Be smart about where OCR will benefit your organization and where you might instead rely on process or other EDM technologies to augment your index structure.
Allot plenty of computing horsepower to drive your OCR needs.
Technology is certainly cheaper than it used to be. However, intensive OCR jobs can take far longer to run than most people expect, simply because they either haven’t provisioned enough computing horsepower or don’t have realistic expectations about what their budget will allow.
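A quick estimate can help set those expectations before any hardware is bought. The sketch below is only a planning aid under stated assumptions: the page count, the seconds-per-page figure, and the worker counts are made-up inputs you would replace with measurements from your own scanner and OCR engine, and it assumes the work splits evenly across workers with no other bottleneck.

```python
import math


def ocr_hours(pages: int, seconds_per_page: float, workers: int = 1) -> float:
    """Rough wall-clock hours to OCR a batch of pages.

    Assumptions: every page takes the same time, and pages are divided
    evenly across `workers` (CPU cores or machines) with no overhead.
    """
    if pages < 0 or seconds_per_page < 0:
        raise ValueError("pages and seconds_per_page must not be negative")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    pages_per_worker = math.ceil(pages / workers)
    return pages_per_worker * seconds_per_page / 3600


if __name__ == "__main__":
    # Hypothetical backlog: 50,000 pages at 4 seconds per page.
    for workers in (1, 4):
        hours = ocr_hours(50_000, 4.0, workers)
        print(f"{workers} worker(s): about {hours:.1f} hours")
```

With these illustrative inputs, a single worker needs about 55.6 hours and four workers about 13.9 hours, which is the difference between a backlog that clears overnight and one that ties up a machine for most of a work week.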
You know your business, but you may not be comfortable with what technologies can be applied in your business to automate workflows and increase findability.
Be sure that your consultants have a vested interest in understanding your business processes and will overlay the technology to fit your needs rather than twisting your culture around a half-baked solution.
Create an accountability structure based on solving issues rather than blaming others.
In high-demand environments, appointing a “scanning czar” is critical. Information assurance is one of the most overlooked functions in an SMB solution. Someone who can identify workflow bottlenecks, potential training issues, interdepartmental miscommunication, and business process misalignments is worth their weight in gold.