Capturing Paper Documents - Best Practices and Common Questions
What is Capture?
Despite advances in technology, most companies continue to struggle with the burden of paper in many important business processes. And while there are many technological approaches to digital transformation, the first step is often scanning. Also known as “capture,” this capability is the scanning of paper documents so that they can be stored and used in digital form instead of on paper. First developed over 30 years ago, capture systems have evolved from simple solutions for basic scanning into sophisticated and expensive systems for enterprise-wide document automation. It is therefore important to understand and leverage scanning as a fundamental tool for business today.
What is Scan and Store? And why is it important?
“Scan and store” begins the process of eliminating paper from paper-intensive operations. In today’s competitive economic environment, companies that continue to conduct business on paper will struggle to control the inherent costs and inefficiencies and find themselves at a disadvantage. Wider and more profitable opportunities exist for organizations to bridge the gap between paper and digital media, especially in traditionally paper-intensive fields such as financial services, healthcare, or government.
What are the common issues with Scanning?
Scanning is important, but it is not easy to do correctly. To a greater or lesser degree, most significant information management initiatives will involve a scanning capability to capture hard copy paper documents as electronic images. To build a scanning capability as part of your implementation, you will need to weigh the benefits of capturing paper documents electronically against the cost and effort of continuing to deal with the ‘paper legacy’.
Two particular scanning issues you will also need to address, as they will influence the type of scanning solution you need, are:
- The number of documents to be captured; and
- The type and quality of documents to be captured.
More specifically, priority for scanning should be given to documents where there are requirements for simultaneous access; documents that are retrieved regularly; or documents that need to be accessed quickly. These considerations will determine what type of hardware and software will be most effective and whether it even makes sense to scan in‐house.
What are Recognition Tools?
There are various types of recognition tools available from many of the document capture and ECM vendors. The most commonly known is “optical character recognition” or OCR. This process reads the scanned image, often in TIFF or PDF format, and uses pattern matching to identify (or make a best guess at) the words and numbers on the imaged document. With good-quality originals, accuracy rates can be high, often well over 90% for machine‐printed documents.
Advanced OCR tools can be set up to recognize zones to automate forms processing activities. For example, the top right corner of a form may be a consistently structured customer address block. The zone OCR settings can zoom in on those predictable sections, identifying and capturing the information and automating the handling of the contract or form. The recognized text can be used not only for full-text search on the imaged documents, but can also be extracted to pre‐populate databases, workflows, or other structured fields.
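As a minimal sketch of the zone idea, assume an OCR engine has already returned each word with the pixel position of its top-left corner. The `Word` and `Zone` types, the `customer_address` field name, and all coordinates below are hypothetical and do not represent any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: int  # left edge of the word, in pixels
    y: int  # top edge of the word, in pixels


@dataclass(frozen=True)
class Zone:
    field_name: str  # structured field this zone feeds
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, word: Word) -> bool:
        # A word belongs to the zone if its top-left corner falls inside it.
        return self.left <= word.x < self.right and self.top <= word.y < self.bottom


def extract_fields(words, zones):
    """Collect the words inside each zone, in reading order, as one string."""
    result = {}
    for zone in zones:
        inside = sorted(
            (w for w in words if zone.contains(w)),
            key=lambda w: (w.y, w.x),  # top-to-bottom, then left-to-right
        )
        result[zone.field_name] = " ".join(w.text for w in inside)
    return result


# Hypothetical OCR output for one form; the address block sits top right.
words = [
    Word("Invoice", 40, 30),
    Word("Jane", 620, 40), Word("Doe", 680, 40),
    Word("12", 620, 70), Word("Main", 650, 70), Word("St", 710, 70),
]
zones = [Zone("customer_address", left=600, top=20, right=800, bottom=120)]

fields = extract_fields(words, zones)
print(fields)  # {'customer_address': 'Jane Doe 12 Main St'}
```

The extracted dictionary is the kind of structured result that could then pre-populate a database record or workflow field.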
Intelligent Character Recognition (ICR) is designed to read and extract text from hand-printed documents. This can be a more difficult operation, with lower accuracy rates, but the technology is constantly improving.
New areas of recognition innovation are improving capture and text extraction from videos, voice recordings, and even photographs. Consider the usefulness of extracting the transcript of a videotaped annual meeting for quick and easy indexing and search. Courts, law firms, media, and translation bureaus are already using this type of recognition regularly.
What are the benefits of scanning?
There are a number of benefits of scanning paper documents. First, once a document is captured electronically, it can be made accessible through the IT infrastructure to others at remote locations. The information is also easily accessible on the desktop, rather than having to identify and go to a filing cabinet.
Once a document is captured electronically, recognition technologies such as OCR and ICR can be used to convert the electronic image of the document into a computer text document. The whole of the text content can then be searched using standard desktop or enterprise search tools. Security and access controls can be applied to enable sharing or to prevent access to information for each user on the system.
Last and certainly not least, the documents, once scanned, take up far less space in offices than filing cabinets used to store paper documents. This only applies, of course, if the scanned image quality and other factors allow original paper documents to be shredded!
What are the costs of scanning and capturing paper electronically?
While there are a number of clear benefits from scanning paper, you should also be aware of the potential costs for capturing paper electronically. To understand the cost implications for capturing legacy paper documents, you will need to assess the amount of work that may be necessary to prepare those legacy paper documents for scanning, as it could have a significant effect on the overall costs. Scanning at scale is not simply a matter of inserting a page and pushing a button.
First, it may not be possible to actually scan some of your paper holdings, as they may be too large or too flimsy to put through the scanner. Secondly, you will need to consider the scale of your overall holdings of legacy paper. Even if they are well organized, you will still need to determine how much you will need to scan.
The average cost per sheet of the scanning process will also depend on the nature and quality of your scanning process. Research indicative costs before you make assumptions for a business case.
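The arithmetic behind a per-sheet estimate can be sketched as follows. This is only an illustration: the function name, the cost model (labor for preparation, scanning, and quality checks plus fixed costs), and every number in the example are assumptions chosen for easy arithmetic, not researched benchmarks.

```python
def estimate_scanning_cost(
    sheets,
    prep_minutes_per_100,
    scan_minutes_per_100,
    qa_minutes_per_100,
    hourly_rate,
    fixed_costs=0.0,
):
    """Return (total_cost, cost_per_sheet) for a scanning job.

    Assumed model: labor time scales linearly with the number of sheets,
    and fixed costs (equipment, software, setup) are added on top.
    """
    if sheets <= 0:
        raise ValueError("sheets must be positive")
    minutes_per_100 = prep_minutes_per_100 + scan_minutes_per_100 + qa_minutes_per_100
    labor_hours = sheets / 100 * minutes_per_100 / 60
    total = labor_hours * hourly_rate + fixed_costs
    return total, total / sheets


# Placeholder figures only: 50,000 sheets, 30 minutes of labor per 100 sheets.
total, per_sheet = estimate_scanning_cost(
    sheets=50_000,
    prep_minutes_per_100=20,   # document preparation often dominates
    scan_minutes_per_100=5,
    qa_minutes_per_100=5,
    hourly_rate=30.0,
    fixed_costs=2_500.0,
)
print(f"total: {total:,.2f}  per sheet: {per_sheet:.2f}")
```

Changing the preparation time in this sketch shows how strongly the condition of the legacy paper can drive the overall cost.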
What are the best practices for a Scanning Process?
If you decide that you do need to scan paper documents and capture them to your system, you will have a number of other considerations to address.
First, you need to decide what to scan. There are generally four approaches to scanning legacy content, also known as backfile conversion.
- Scan everything in the backfile. This is the most expensive because of the volume involved, and because as the paper files get older, they are more likely to be in less‐than‐perfect shape: stapled, with curling or torn pages, dusty, etc. But this might be needed in the case of a full digitization of your archives, for example, or to meet legal requirements.
- Partial conversion. In this approach, the organization would scan only certain holdings: those within the last two years, for example, or only those of a certain type such as contracts or personnel files. This is more cost‐effective but may end up with people needing to look in multiple locations, physical and electronic, to locate a particular document.
- Day‐forward conversion. In this approach, you pick a date and scan everything that comes in after that date. Anything that was received before that date is maintained in whatever format it was received in. This is more cost‐effective yet and has the additional advantage that users know which system to search based on the date.
- Scan on demand. This approach starts with partial or day‐forward conversion, but also includes scanning anything that is specifically requested. For example, the organization begins day‐forward scanning on August 1, but when files are requested from the previous year, those are retrieved, processed, and scanned before being returned to the backfile. This has the additional advantage that only those legacy files that are actually accessed are scanned.
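The four approaches above can be expressed as a simple decision rule. The sketch below is hypothetical: the names are invented for illustration, partial conversion is modeled by document type only, scan on demand is modeled as day-forward plus explicit requests, and documents received on the cutoff date itself are assumed to be scanned.

```python
from datetime import date
from enum import Enum


class Strategy(Enum):
    FULL_BACKFILE = "scan everything"
    PARTIAL = "partial conversion"
    DAY_FORWARD = "day-forward conversion"
    ON_DEMAND = "scan on demand"


def should_scan(strategy, received, doc_type, *, cutoff=None,
                partial_types=frozenset(), requested=False):
    """Decide whether one document gets scanned under the chosen strategy."""
    if strategy is Strategy.FULL_BACKFILE:
        return True
    if strategy is Strategy.PARTIAL:
        # Assumption: the partial scope is defined by document type.
        return doc_type in partial_types
    if cutoff is None:
        raise ValueError("day-forward and on-demand strategies need a cutoff date")
    if strategy is Strategy.DAY_FORWARD:
        return received >= cutoff
    if strategy is Strategy.ON_DEMAND:
        # Day-forward, plus any legacy file that someone actually requests.
        return received >= cutoff or requested
    raise ValueError(f"unknown strategy: {strategy}")


cutoff = date(2024, 8, 1)  # hypothetical start date for day-forward scanning
old_file = date(2023, 11, 2)

print(should_scan(Strategy.DAY_FORWARD, old_file, "invoice", cutoff=cutoff))
print(should_scan(Strategy.ON_DEMAND, old_file, "invoice", cutoff=cutoff, requested=True))
```

Under day-forward conversion the older file stays on paper; under scan on demand the same file is scanned once it is requested.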
Next, you will need to decide the quality level that is required for your scanning process. Some countries have developed standards that may help you decide what is appropriate for your business.
You may need to set up a new, or enhanced, scanning capability to meet your organization’s scanning requirements. You will need to make sure that the scanning hardware and software are capable of coping with your forecast scanning volumes, and that the staff involved are properly trained and can scan to acceptable quality standards. Continual motivation of people involved in a scanning operation is also vital to achieving a high-quality service.
Before you can actually scan your documents, you will need to prepare them: for example, by straightening out pages, removing any staples or clips, and grouping similar-sized documents together. You will also need to determine whether you need to scan documents individually or, if you have a large volume, how to batch documents together for bulk scanning.
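One possible way to batch prepared documents is to group them by paper size and then cap each batch at a number of sheets the scanner can take in one run. The function, the tuple layout, and the 50-sheet limit below are assumptions for illustration only.

```python
from collections import defaultdict


def build_batches(documents, max_sheets_per_batch):
    """Group documents by paper size, then split each group into batches.

    documents: iterable of (doc_id, paper_size, sheet_count) tuples.
    A document is never split across batches; a document larger than the
    limit simply gets a batch of its own.
    """
    by_size = defaultdict(list)
    for doc_id, paper_size, sheet_count in documents:
        by_size[paper_size].append((doc_id, sheet_count))

    batches = []
    for paper_size, docs in by_size.items():
        current, sheets = [], 0
        for doc_id, sheet_count in docs:
            # Close the current batch if this document would overflow it.
            if current and sheets + sheet_count > max_sheets_per_batch:
                batches.append((paper_size, current))
                current, sheets = [], 0
            current.append(doc_id)
            sheets += sheet_count
        if current:
            batches.append((paper_size, current))
    return batches


# Hypothetical prepared documents: (id, paper size, number of sheets)
documents = [
    ("A", "letter", 30),
    ("B", "legal", 10),
    ("C", "letter", 25),
    ("D", "letter", 10),
]
batches = build_batches(documents, max_sheets_per_batch=50)
for paper_size, doc_ids in batches:
    print(paper_size, doc_ids)
```

Keeping each document whole within a batch makes it easier to match scanned images back to their source files during quality checks.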
The actual processes for scanning documents will need to be developed so that, ideally, each scanning process operates to the same quality level across your organization. It is important, therefore, to set effective quality standards (for image quality, for example) and to have suitable monitoring in place to ensure that these standards are met.
Scanned input may come from a variety of devices such as hand‐held scanners, high‐volume systems, or possibly from images scanned in bulk at a specialist agency. Whatever the source of the information that is captured, it must be carefully tracked so that the scanned images can be relied upon later. The capture process must not allow for subsequent tampering with the image unless an audit trail or version control is shown to be in use. Although the scanning and capture process can be shown to be robust, the critical issue is to make sure that the resulting image is credible, and could be used to support a legal argument.
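One common building block for detecting later changes to an image is a cryptographic fingerprint recorded at capture time. The sketch below uses SHA-256 from Python's standard library; the log structure, the function names, and the sample bytes are hypothetical, and a real system would also need to protect the log itself. A matching fingerprint shows only that the bytes are unchanged; whether an image is legally admissible depends on your wider policies and jurisdiction.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(image_bytes: bytes) -> str:
    """Return the SHA-256 digest of the image content as a hex string."""
    return hashlib.sha256(image_bytes).hexdigest()


def record_capture(audit_log, doc_id, image_bytes, source):
    """Append a capture entry (who/what/when plus fingerprint) to the log."""
    entry = {
        "doc_id": doc_id,
        "source": source,  # e.g. which device or agency produced the image
        "sha256": fingerprint(image_bytes),
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.append(entry)
    return entry


def is_unaltered(audit_log, doc_id, image_bytes):
    """Compare the image against the most recent logged fingerprint."""
    for entry in reversed(audit_log):
        if entry["doc_id"] == doc_id:
            return entry["sha256"] == fingerprint(image_bytes)
    raise KeyError(f"no capture recorded for {doc_id}")


audit_log = []
original = b"placeholder bytes standing in for a scanned TIFF"
record_capture(audit_log, "doc-001", original, source="high-volume scanner")

print(is_unaltered(audit_log, "doc-001", original))            # True
print(is_unaltered(audit_log, "doc-001", original + b"edit"))  # False
```

Recording a new entry for every approved change, rather than overwriting the old one, is one way to approximate the version control the capture process calls for.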
By having consistent policies and procedures, including the scanning and capture stages, there is more likelihood that scanned images will be legally admissible, and can be produced as evidence. They are then said to have evidential weight.
There may also be issues with the size and/or condition of the documents. For example, in 1973, there was a fire in the National Personnel Records Center run by the U.S. National Archives and Records Administration (NARA). Millions of military records were lost; many of those that were not lost were damaged by smoke, heat, and water. These documents are now 40 or more years old and very fragile. NARA is attempting to recover the documents through careful scanning using specific tools and processes, but it’s a very laborious undertaking.
Similarly, documents may be too large, like engineering drawings; too small, like business cards; too thin, like onionskin or carbonless forms; too thick, like cardboard; or otherwise not suitable for scanning (for example, pages that are ripped/torn, crumpled, or smudged). In modern capture, we can bring hardware and software to bear on all of these, but they require significant expertise and specialized resources to do effectively.
Whatever technologies and solutions you adopt, it is important that the solution be compatible with all of the office document types you might use, now and in the foreseeable future. Scanning should be supported, and the procedures should cover whatever is required to make the scanned content legally admissible and tamper‐proof. Make sure also that the solution supports a range of scanner types and is therefore capable of handling the volume of scanning you anticipate.
Even though information capture can be made ‘easy to use’, the systems should provide controlled access, and appropriate management controls. Lastly, the system must be able to grow, as the need for more information inputs is identified. The scalability of the solution should cover likely increases in the need for storage and data capture performance, as well as being able to support new and improved recognition software, if appropriate.
About Kevin Craine
Kevin Craine is a professional writer, an internationally respected technology analyst, and an award-winning podcast producer. He was named the #1 Enterprise Content Management Influencer to follow on Twitter and has listeners and readers worldwide. Kevin creates strategic content for the web, marketing, social media, and more. He is the written voice for some of North America's leading brands and his interviews feature today's best thought leaders. His client list includes many well-known global leaders like IBM, Microsoft and Intel, along with a long list of individuals and start-ups from a wide variety of industries. Kevin's podcasts have been heard around the world, including the award-winning weekly business show "Everyday MBA". He is also the host and producer of "Bizcast" on C-Suite Radio and the producer behind podcasts for Epson, Canon, IBM and AIIM International, among others. Prior to starting Craine Communications Group, Kevin was Director of Document Services for Regence BlueCross BlueShield where he managed high volume document processing operations in Seattle, Portland and Salt Lake City. He also spent time at IKON as an Enterprise Content Management consultant working with national and major accounts. He was the founding editor of Document Strategy magazine. Kevin has also been, at one point or another, an adjunct university professor, a black belt martial artist, and a professional guitarist. Kevin holds an MBA in the Management of Science and Technology as well as a BA in Communications and Marketing.