Perspective
Organizations generate reams of data in the course of planning, executing, and operationalizing business strategies. Innovating through information depends upon an organization’s ability to make datasets from multiple information systems available to data analysts for use in their respective analytic environments.
But innovating with regulated or “controlled” data (e.g., PII, PHI, and PFI) presents systemic challenges to any organization. This data resides in some system-of-record from which, by company policy, said data may not be easily removed. These systems are designed to provide the level of assurance mandated by Corporate Security and Privacy Departments to mitigate the risks to the confidentiality, integrity, security, and availability of controlled data.
These systems-of-record are designed for their respective operational purposes. Though they may come with some reporting capability, they typically do not lend themselves naturally to the complex analytical capabilities that analysts need to perform their work.
A variety of analytics and visualization tools, differing in complexity, sophistication, and processing capability, handle the number crunching and insight generation organizations need. However, they exist outside the data storage environment and rarely integrate seamlessly with these controlled data systems-of-record.
To be used in an analysis, the data must be exported, transmitted, and transformed for these tools. The logistics, effort, time, and total cost of moving this data between systems can be daunting when very large datasets are needed. Beyond that, there is a pervasive threat that the controlled data may be compromised in some way during the transfer, a threat that can be mitigated at best.
Yes, there are various encryption, masking, anonymizing, and de-identifying techniques to protect this controlled data outside the system-of-record. Even so, erring on the side of caution is the typical organizational response. The data remains where it is, and the insights it would have revealed go unknown to the organization, denying it competitive advantage.
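As a rough illustration of what masking and de-identifying can look like in practice, here is a minimal Python sketch. The field names, the sample record, and the key are hypothetical, and a keyed hash (HMAC-SHA256) is only one of several possible pseudonymization approaches, not a recommendation for any particular system.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be managed inside the system-of-record.
SECRET_KEY = b"example-key-for-illustration-only"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash so records can still be joined."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability in this sketch


def mask_ssn(ssn: str) -> str:
    """Keep only the last four digits of a Social Security number."""
    digits = [c for c in ssn if c.isdigit()]
    return "***-**-" + "".join(digits[-4:])


# Hypothetical controlled record.
record = {"name": "Jane Doe", "ssn": "123-45-6789", "balance": 1520.75}

# The version that leaves the system-of-record drops the name entirely.
protected = {
    "person_id": pseudonymize(record["ssn"]),
    "ssn": mask_ssn(record["ssn"]),
    "balance": record["balance"],
}
print(protected)
```

Even in this small example, someone must manage the key, decide which fields to drop, and verify that what remains cannot be re-identified, which hints at why organizations treat such exports as an exception process.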
Business leaders are finding that these much-needed restrictions on the movement of data across system and organizational boundaries stifle innovation wherever business insights depend upon what is stored in that data. As a result, only a few options are available to them:
- Forgo the analysis, accepting defeat and the opportunity costs of not having these insights
- Work with small samples, which may deny the insights large datasets can convey
- Endure the onerous expense of protecting the data during transport and outside the core system, which implies it will be an exception process
- Move the analytic tools into a secure data storage environment, built to the standards of controlled data systems-of-record
Data lakes make this last option possible, even preferable, and are game-changing for firms working with controlled data. A data lake is a new architectural design pattern that leverages existing data warehouse, transport, and analytics capabilities, enhanced with the new capabilities made possible by big data tools. Collectively, these data lake solutions provide secure storage, multi-tenancy, robust workflows (essential for operationalizing the lengthy and arduous data preparation tasks that precede actual analytic activities), and analytics, visualization, and reporting capabilities.
All computational and analytic activities are conducted in the lake itself. Only outcomes and insights need be delivered to their respective business stakeholders, a far less challenging endeavor than data movement, preparation, and analysis. This approach also lowers operational costs and leaves risks that are more easily managed.
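The idea of computing in place and delivering only outcomes can be sketched in a few lines of Python. This is an illustrative sketch, not a description of any particular data lake product: the rows, field names, and the minimum group size used to suppress small groups are all hypothetical assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical threshold: groups smaller than this are withheld from the output.
MIN_GROUP_SIZE = 3


def summarize_by_region(rows, min_group_size=MIN_GROUP_SIZE):
    """Aggregate row-level data and return only per-region summaries."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["region"]].append(row["claim_amount"])
    return {
        region: {"count": len(amounts), "mean_claim": round(mean(amounts), 2)}
        for region, amounts in sorted(groups.items())
        if len(amounts) >= min_group_size
    }


# Hypothetical row-level data that stays inside the secure environment.
rows = [
    {"patient_id": "p1", "region": "East", "claim_amount": 100.0},
    {"patient_id": "p2", "region": "East", "claim_amount": 200.0},
    {"patient_id": "p3", "region": "East", "claim_amount": 300.0},
    {"patient_id": "p4", "region": "West", "claim_amount": 50.0},
    {"patient_id": "p5", "region": "West", "claim_amount": 70.0},
]

# Only this summary, with no patient identifiers, is delivered to stakeholders.
summary = summarize_by_region(rows)
print(summary)
```

The patient identifiers never appear in the result, and the two-row West group is withheld, so what crosses the boundary is an insight rather than the controlled data itself.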
Once the data is ingested into the data lake, which ideally is overseen by a mature Data Governance program, it should not need to leave this secure environment. The storage is cheap! Leave the data there for future auditing and reuse. It is safe there, and the monitoring tools allow deeply granular tracking and enforcement of data access, movement, and transformation in Hadoop environments. The cost savings from not repeatedly deleting or moving the data can be reallocated to more valuable business uses.
Data lakes are a natural solution to both the intrinsic risks of moving data about the organization and the opportunity costs of not deriving insights from that data. There is no longer any reason to think that deriving game-changing insights from this data is beyond your organization.
The solution is here.