AIIM - The Global Community of Information Professionals

Guest Post -- GDPR Compliance starts with Data Discovery

Nov 16, 2017 9:00:00 AM by Andrew Pery

This is the seventh post in a series on privacy by Andrew Pery.


50% of organizations not ready for GDPR

Even though the European General Data Protection Regulation (GDPR) will go into effect in just seven short months, on May 25, 2018, a recently published Gartner report found that 50 percent of the companies surveyed do not expect to be ready to comply with its much more rigorous privacy regime, including its onerous enforcement provisions.

The primary challenge appears to be a lack of sufficient insight into information holdings that may span email, file shares, content repositories and, yes, in many cases paper-based files. Not surprisingly, the PwC Pulse Survey found that initial investments are expected to go to data discovery best practices and tools. In particular, the Accountability Principle requires organizations to demonstrate that personally identifiable information is “processed lawfully, fairly and in a transparent manner”. Implicit in this is the requirement that organizations must institute policies, processes and systems that:

With the proliferation of information from multiple channels, it is imperative for organizations to invest in organizational and technical capabilities that ensure compliance with the GDPR accountability principle.   

Compliance starts with data discovery. 

Specifically, a determination must be made as to the location, format, content, legal basis and access rights associated with the collection and processing of personally identifiable information.   

This process can be time-consuming and labor-intensive. However, advances in machine learning technologies have been shown to improve the accuracy and reliability of automated data discovery, extraction and classification processes.

Digitizing all incoming data as soon as it enters the organization is an essential step in the data discovery process. Such data may arrive in various formats and through various channels: paper, fax or email. Using intelligent document capture algorithms based on full-page OCR, ICR and text analytics, large volumes of incoming documents can be analyzed, and personally identifiable information can be automatically extracted and classified with a high degree of accuracy.
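To make the extraction step concrete, here is a minimal sketch of what happens after OCR has turned a document into plain text. It assumes simple regular-expression patterns are enough to spot email addresses and phone numbers; the pattern set and the `extract_pii` helper are illustrative assumptions, not a description of any particular capture product.

```python
import re

# Illustrative patterns only (an assumption for this sketch); real PII
# detection needs far broader coverage and validation.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def extract_pii(text):
    """Return a dict mapping each PII category to the matches found in text."""
    found = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
    return found

# Text as it might come out of an OCR pass over a scanned letter.
sample = "Contact Jane Doe at jane.doe@example.com or +44 20 7946 0958."
print(extract_pii(sample))
```

In practice, pattern matching like this would be one signal among several, combined with the trained classification described next.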

Simply put, machine learning is an automated process whereby a computer application learns without being explicitly programmed: the system is “trained” on sample documents to extract and classify personally identifiable information. The application gets “smarter” with each document processed, with the aim of minimizing labor-intensive manual data classification.
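The idea of training on sample documents can be sketched with a very small naive Bayes text classifier. This is a toy illustration under stated assumptions: the labels `personal` and `non_personal`, the four training sentences and the `NaiveBayes` class are all invented for the example, and a production system would use a far larger training set and a more capable model.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing (toy implementation)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of training docs
        self.vocab = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word] + 1  # Laplace smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training samples; each call refines the model a little more.
model = NaiveBayes()
model.train("Employee name, date of birth and home address", "personal")
model.train("Patient passport number and date of birth", "personal")
model.train("Quarterly revenue forecast by product line", "non_personal")
model.train("Meeting agenda for the product roadmap review", "non_personal")

print(model.classify("Customer date of birth and address"))
```

Because `train` can be called at any time, each newly labeled document updates the word counts, which is the sense in which such a system improves as it processes more documents.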

Once data is digitized, the next step in the data discovery process is to identify all information repositories that may contain personally identifiable information. This is not a trivial task. Often organizations simply do not have visibility into their information holdings. Information may be located in file shares, emails, distributed content repositories, antiquated archival systems and file cabinets.

Auto-classification engines can crawl heterogeneous data sources, extract and classify content based on either standard or user-provided taxonomies, embed metadata tags and apply business rules for the processing of personally identifiable information in adherence to the GDPR accountability requirement.
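As a rough sketch of the crawl-and-classify idea, the snippet below walks a directory tree, matches each text file against a small taxonomy and records the resulting tags. The taxonomy, the category names and the restriction to `.txt` files are assumptions made for the example; a real engine would connect to many repository types and use much richer classification than substring matching.

```python
from pathlib import Path

# Hypothetical user-provided taxonomy: category -> trigger terms.
TAXONOMY = {
    "hr_record": ["date of birth", "salary", "employee id"],
    "customer_contact": ["email", "phone", "billing address"],
}

def classify_text(text, taxonomy=TAXONOMY):
    """Return the sorted list of categories whose terms appear in text."""
    lowered = text.lower()
    return sorted(
        category
        for category, terms in taxonomy.items()
        if any(term in lowered for term in terms)
    )

def crawl(root, taxonomy=TAXONOMY):
    """Walk root recursively and return {relative path: [categories]} for .txt files."""
    index = {}
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        tags = classify_text(text, taxonomy)
        if tags:  # only files that matched at least one category are indexed
            index[str(path.relative_to(root))] = tags
    return index

print(classify_text("Employee ID 1042, salary review, contact phone on file"))
```

The resulting index plays the role of embedded metadata tags: once a file carries a category, business rules such as retention periods or access restrictions can be keyed to that category.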

Additionally, data mapping tools may be used to gain insight into data flows relating to the processing of personally identifiable information: who controls the information, how it is disposed of, and how it is transferred internally between business units and externally to third parties.
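One simple way to picture a data map is as a list of flow records that can be queried. The sketch below is only an illustration: the `DataFlow` fields, the sample flows and the two query helpers are hypothetical and are not drawn from any specific data mapping tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFlow:
    """One movement of personal data between two systems (hypothetical schema)."""
    data_category: str
    source: str
    destination: str
    controller: str   # business unit responsible for the data
    external: bool    # True when the destination is a third party

# Invented sample flows for illustration.
FLOWS = [
    DataFlow("customer email", "web form", "CRM", "Sales", False),
    DataFlow("customer email", "CRM", "mailing vendor", "Marketing", True),
    DataFlow("payroll data", "HR system", "payroll processor", "HR", True),
]

def external_transfers(flows):
    """Return the flows that send personal data to third parties."""
    return [flow for flow in flows if flow.external]

def controllers_of(flows, data_category):
    """Return the business units that control a given category of data."""
    return sorted({f.controller for f in flows if f.data_category == data_category})

for flow in external_transfers(FLOWS):
    print(f"{flow.data_category}: {flow.source} -> {flow.destination}")
```

Keeping flows as structured records rather than free-form diagrams makes questions such as "which third parties receive customer email addresses?" answerable with a query.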

Finally, semantic analysis technologies considerably improve search relevancy across heterogeneous content repositories. Semantic information retrieval, commonly referred to as concept searching, uses natural language processing techniques to find relevant documents without requiring explicit key words in the search criteria. Semantic analysis overcomes a weakness of key word searching, whose results may be over-inclusive and lack precision. Using semantic analysis techniques, personally identifiable information can be uncovered, tagged and processed in compliance with GDPR, thereby mitigating risk while improving organizational efficiency.
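The difference between key word search and concept search can be shown with a deliberately crude stand-in. Real semantic retrieval relies on natural language processing models; the sketch below instead assumes a small hand-written concept map (the `CONCEPTS` dictionary is invented for the example) and simply expands a concept into the terms that express it.

```python
# Hypothetical concept map: concept -> terms that express it.
CONCEPTS = {
    "health": {"diagnosis", "prescription", "patient", "medical"},
    "finance": {"salary", "iban", "invoice", "payroll"},
}

def keyword_search(docs, term):
    """Return names of documents that contain the literal term as a word."""
    return [name for name, text in docs.items() if term in text.lower().split()]

def concept_search(docs, concept):
    """Return names of documents containing any term linked to the concept."""
    terms = CONCEPTS[concept] | {concept}
    return [name for name, text in docs.items() if terms & set(text.lower().split())]

docs = {
    "note1.txt": "Patient received a new prescription on Monday",
    "note2.txt": "Payroll run completed for March",
    "note3.txt": "Office closed for public holiday",
}

print(keyword_search(docs, "health"))   # literal match on the word "health"
print(concept_search(docs, "health"))   # match on any health-related term
```

Here the health-related note never contains the word "health", so a literal search misses it while the concept search finds it, which is the behavior the paragraph above attributes to concept searching.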

Considering that, according to the PwC Pulse Survey, 68% of the US organizations surveyed said they will invest between $1 million and $10 million in GDPR-related compliance initiatives, an assessment of the efficacy of data discovery tools may be a prudent investment.

What should your organization budget for GDPR? You may want to check out this useful resource.

About the author: Andrew Pery is a marketing executive with over 25 years of experience in the high technology sector, focusing on content management and business process automation. Currently, Andrew is CMO of Top Image Systems. Andrew holds a Masters of Law degree with Distinction from Northwestern University and is a Certified Information Privacy Professional (CIPP/C) and a Certified Information Professional (CIP/AIIM).

[Note from JM: All this has me thinking about the privacy challenges of managing increasing volumes of data, and particularly the compliance challenges looming with the pending new European privacy rules, the GDPR. Andrew and I wrote a new eBook on the topic: Information Privacy and Data Protection Regulation: The EU GDPR is Just the Tip of the Iceberg. Check it out.]


Topics: privacy, security, information security, gdpr
