Even though the European General Data Protection Regulation (GDPR) goes into effect in just seven short months, on May 25, 2018, a recently published Gartner report found that 50 percent of the companies surveyed do not expect to be ready to comply with its much more rigorous privacy regime, including its onerous enforcement provisions.
The primary challenge appears to be a lack of sufficient insight into information holdings that may span email, file shares, content repositories, and yes, in many cases, paper-based files. Not surprisingly, the PwC Pulse Survey found that initial investments are expected to be in data discovery best practices and tools. In particular, the Accountability Principle requires organizations to demonstrate that personally identifiable information is “processed lawfully, fairly and in a transparent manner”. Implicit in this is the requirement that organizations must institute policies, processes, and systems that:
With the proliferation of information from multiple channels, it is imperative for organizations to invest in organizational and technical capabilities that ensure compliance with the GDPR accountability principle.
Specifically, a determination must be made as to the location, format, content, legal basis, and access rights associated with the collection and processing of personally identifiable information.
This process can be time-consuming and labor-intensive. However, advances in machine learning technologies can improve the accuracy and reliability of automated data discovery, extraction, and classification processes.
The digitization of all incoming data, as soon as it enters the organization, is an essential step in the data discovery process. Such data may arrive in various formats and through various channels: paper, fax, or email. Using intelligent document capture algorithms based on full-page optical character recognition (OCR), intelligent character recognition (ICR), and text analytics, large volumes of incoming documents can be analyzed, and personally identifiable information can be automatically extracted and classified with a high degree of accuracy.
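As a rough illustration of the extraction step, the sketch below scans text that has already been produced by OCR and pulls out a few kinds of personally identifiable information with regular expressions. The pattern names, the patterns themselves, and the `extract_pii` function are assumptions made for this example; a production capture system would rely on far more robust, locale-aware rules and on the text analytics described above.

```python
import re

# Hypothetical patterns for illustration only; real systems need
# locale-aware rules and validation (e.g. checksum tests for IBANs).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def extract_pii(text):
    """Return a dict mapping each PII category to the matches found in text."""
    found = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
    return found


# Text as it might come out of an OCR pass over a scanned letter.
sample = "Contact Jane Doe at jane.doe@example.com or +44 20 7946 0958."
print(extract_pii(sample))
```

Pattern matching of this kind only finds information with a predictable shape. Names, addresses, and free-text references to individuals generally require the trained classification approach discussed next.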
Simply put, machine learning is an automated process whereby a computer application learns without being explicitly programmed: the system is “trained” on sample documents to extract and classify personally identifiable information. The application gets “smarter” with each document processed, with the aim of minimizing labor-intensive manual data classification.
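To make the idea of training on sample documents concrete, here is a minimal sketch of a naive Bayes text classifier written in plain Python. It is not the method used by any particular product. The class name, the two labels, and the four sample documents are invented for illustration, and a real system would be trained on thousands of labeled documents.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> documents seen
        self.vocabulary = set()

    def train(self, text, label):
        """Update the model with one labeled sample document."""
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def predict(self, text):
        """Return the label with the highest log-probability for the text."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocabulary)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training samples; the labels are assumptions for this sketch.
TRAINING_DOCS = [
    ("Employee name, date of birth and home address", "personal"),
    ("Customer passport number and phone number", "personal"),
    ("Quarterly revenue forecast and sales targets", "non_personal"),
    ("Meeting agenda for the product roadmap review", "non_personal"),
]

classifier = NaiveBayesClassifier()
for text, label in TRAINING_DOCS:
    classifier.train(text, label)

print(classifier.predict("Applicant date of birth and passport number"))
```

Because `train` only updates counts, each newly labeled document can be added without retraining from scratch, which is one simple sense in which such a system improves as more documents are processed.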
Once data is digitized, the next step in the data discovery process is the identification of all information repositories that may contain personally identifiable information. This is not a trivial task. Often, organizations simply do not have visibility into their information holdings. Information may be located in file shares, emails, distributed content repositories, antiquated archival systems, and file cabinets.
Auto-classification engines provide the capability to crawl heterogeneous data sources, extract and classify content based on either standard or user-provided taxonomies, embed metadata tags, and apply business rules for the processing of personally identifiable information in adherence to the GDPR accountability requirement.
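The crawl-classify-tag loop described above can be sketched in a few lines. The example below walks a directory tree, matches each file against a small user-provided taxonomy, and records the resulting tags. The taxonomy, the category names, and the `classify` and `crawl` functions are hypothetical. A commercial engine would also connect to mail servers and content repositories, handle binary formats, and write the tags back as metadata rather than returning them in a dictionary.

```python
import os
import tempfile

# Hypothetical user-provided taxonomy: category -> indicative terms.
TAXONOMY = {
    "hr_record": ["date of birth", "salary", "employee id"],
    "customer_contact": ["email", "phone", "billing address"],
}


def classify(text, taxonomy=TAXONOMY):
    """Return the sorted categories whose terms appear in the text."""
    lowered = text.lower()
    return sorted(
        category
        for category, terms in taxonomy.items()
        if any(term in lowered for term in terms)
    )


def crawl(root, taxonomy=TAXONOMY):
    """Walk a directory tree and return {path: tags} for matching files."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    tags = classify(handle.read(), taxonomy)
            except OSError:
                continue  # unreadable file: skip it rather than abort the crawl
            if tags:
                index[path] = tags
    return index


# Small demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "staff.txt"), "w", encoding="utf-8") as handle:
        handle.write("Employee ID 4471, salary review due in March")
    for path, tags in crawl(root).items():
        print(os.path.basename(path), tags)
```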
Additionally, data mapping tools may be used to gain insight into the data flows involved in processing personally identifiable information: who controls the information, how it is disposed of, and how it is transferred internally between business units and externally to third parties.
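One simple way to picture a data map is as a list of flow records that can be queried. The sketch below is an assumption about how such a record might look, not the schema of any specific tool. The `DataFlow` fields, the helper functions, and the sample flows are all invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFlow:
    """One movement of personal data between two systems or parties."""
    data_category: str
    source: str
    destination: str
    owner: str          # business unit that controls the information
    legal_basis: str
    external: bool      # True when the destination is a third party


# Invented sample flows for demonstration purposes.
FLOWS = [
    DataFlow("employee payroll data", "HR system", "payroll provider",
             "HR", "contract", True),
    DataFlow("customer email addresses", "CRM", "marketing team",
             "Sales", "consent", False),
    DataFlow("customer email addresses", "CRM", "support desk",
             "Sales", "legitimate interest", False),
]


def external_transfers(flows):
    """Return the flows that send personal data to third parties."""
    return [flow for flow in flows if flow.external]


def flows_by_owner(flows):
    """Group flows by the business unit that controls the information."""
    grouped = defaultdict(list)
    for flow in flows:
        grouped[flow.owner].append(flow)
    return dict(grouped)


for flow in external_transfers(FLOWS):
    print(f"{flow.data_category}: {flow.source} -> {flow.destination}")
```

Keeping the legal basis on each record means the same map can be used both to trace transfers and to check that every flow has a documented justification.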
Finally, semantic analysis technologies can considerably improve search relevancy across heterogeneous content repositories. Semantic information retrieval, commonly referred to as concept searching, uses natural language processing techniques to find relevant documents even when the search criteria do not contain the exact keywords that appear in them. Semantic analysis addresses a weakness of keyword searching, whose results may be over-inclusive and lack precision. Using semantic analysis techniques, personally identifiable information can be uncovered, tagged, and processed in compliance with GDPR, thereby mitigating risk while improving organizational efficiency.
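The difference between keyword and concept searching can be shown with a deliberately crude example. The sketch below expands a query through a small hand-built concept map before matching. This is only a stand-in for the natural language processing that real semantic retrieval relies on, and the concept map, function names, and sample documents are all hypothetical.

```python
import re

# Hypothetical hand-built concept map; real semantic search derives such
# relationships from language models or ontologies rather than a fixed table.
CONCEPTS = {
    "contact details": {"email", "phone", "telephone", "address", "mobile"},
    "identity document": {"passport", "licence", "license"},
}


def tokens(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_search(query, documents):
    """Return ids of documents that contain every query word literally."""
    words = tokens(query)
    return [doc_id for doc_id, text in documents.items() if words <= tokens(text)]


def concept_search(query, documents, concepts=CONCEPTS):
    """Return ids of documents that mention the query or any related term."""
    related = concepts.get(query.lower().strip(), set()) | tokens(query)
    return [doc_id for doc_id, text in documents.items() if related & tokens(text)]


DOCUMENTS = {
    "memo1": "Please send the invoice to her mobile or telephone her office.",
    "memo2": "The quarterly roadmap was approved.",
}

print(keyword_search("contact details", DOCUMENTS))
print(concept_search("contact details", DOCUMENTS))
```

Here the keyword search misses the first memo because the words "contact" and "details" never appear in it, while the concept search finds it through the related terms.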
Considering that, according to the PwC Pulse Survey, 68% of US organizations surveyed said they will spend between $1 million and $10 million on GDPR-related compliance initiatives, an assessment of the efficacy of data discovery tools may be a prudent investment.