Data Lakes Mature - How to Choose the Best Platform for your Business

Intellegant
Dec 14, 2017
Updated: Aug 1, 2018

by Ben McPherson, Business Analyst at Intellegant

Tech buzzwords were once reserved for Silicon Valley startups, but today new ways of using data have spread across all industries. ‘Traditional’ outfits in mining, manufacturing, or shipping can gain just as much from integrating smart data into their business plans as the latest smartphone developer. Phone apps aren’t just for games and taking pictures; they can also be used to get RPM, heat, or other maintenance data from industrial machinery in real time. The data revolution in shipping, sales, inventory management, and logistics has made possible behemoths like Amazon and Alibaba, and new technology is combining patient data with smart treatment plans to revolutionize the medical and pharmaceutical industries.

As technologies, platforms, and possibilities expand, integration problems also multiply. Ideally, a robust data management strategy should be in place to head some problems off at the pass, but this seldom happens perfectly. Too often, issues arise: the wrong technology or strategy is chosen from the beginning, wasting valuable time and resources; apps or frameworks are developed in isolation, with suboptimal compliance and poor integration with colleagues’ work; or business needs evolve and standards and technologies change, rendering previous work out of date. The future must also be considered: can your data solution be easily updated or modernized? While it may seem daunting to get up to speed with the latest technologies, there is a multitude of tools and resources that allow business owners to make educated choices.

A prime example of the challenging decisions businesses must make is the choice among a data warehouse, a data lake, and a hybrid solution.

The more established technology is the data warehouse, or its smaller counterpart, the datamart. Data lakes are a newer solution, but that does not mean they are more advanced or should be adopted in all cases; the strengths and weaknesses of each system need careful consideration.

James Dixon, founder of Pentaho, coined the term data lake and describes the difference as follows:

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples. A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed. [1]

A data warehouse holds structured, processed data, typically schema-on-write. Data is curated and modelled when it is stored, and can easily be retrieved later for analysis or other use, assuming the relevant information was kept. The structure is often rigid, and reworking it can take extensive time if the preferred modelling framework changes. It is a highly mature technology, so implementation and, most importantly, security are well understood. Given these advantages, this solution remains ideal for business professionals who need typical reports and don’t expect to need ‘deep dives’ or unexpected new uses of data.
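To make schema-on-write concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a warehouse. The table name, field names, and sample readings are illustrative assumptions, not taken from any real system. Note how a field that the schema did not anticipate is dropped at write time.

```python
import sqlite3

# Hypothetical sensor readings; all field names and values are illustrative.
raw_readings = [
    {"machine_id": "M-1", "rpm": 1200, "temp_c": 71.5,
     "operator_note": "slight vibration"},
    {"machine_id": "M-2", "rpm": 950, "temp_c": 64.0},
]


def load_into_warehouse(readings):
    """Schema-on-write: shape every record to a fixed table before storing it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings ("
        "machine_id TEXT NOT NULL, rpm INTEGER NOT NULL, temp_c REAL NOT NULL)"
    )
    # Only the modelled columns are kept; 'operator_note' is discarded here.
    rows = [(r["machine_id"], r["rpm"], r["temp_c"]) for r in readings]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load_into_warehouse(raw_readings)
# Routine reports are now a simple, fast query against curated data.
avg_temp = conn.execute("SELECT AVG(temp_c) FROM readings").fetchone()[0]
print(avg_temp)
```

Reporting is easy once the data is loaded, but anything left out of the table definition, such as the operator's note, cannot be recovered from the warehouse later.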

The lake, in contrast, holds a vast amount of data in raw format, typically in a non-relational store such as a Hadoop cluster, and occasionally in a relational database management system (RDBMS). This data can be structured, semi-structured, or unstructured, and thus must be modelled or shaped when retrieved, an approach known as schema-on-read. As such, it will often need more curation before it can be used for a given task. On the other hand, the lake can often hold all the relevant data for a given business, with nothing thrown out in processing.
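Schema-on-read can be sketched in a few lines of plain Python. In this hedged example, a list of JSON strings stands in for raw files in a lake, and the helper `read_with_schema` is a hypothetical name introduced only for illustration. The records are stored untouched, and each consumer decides which fields matter at the moment of reading.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (illustrative data).
raw_lines = [
    '{"machine_id": "M-1", "rpm": 1200, "temp_c": 71.5, '
    '"operator_note": "slight vibration"}',
    '{"machine_id": "M-2", "rpm": 950}',
    '{"device": "tablet-7", "event": "login"}',
]


def read_with_schema(lines, fields):
    """Schema-on-read: project raw records onto the fields wanted right now."""
    for line in lines:
        record = json.loads(line)
        # Skip records that cannot satisfy this particular view of the data.
        if all(field in record for field in fields):
            yield {field: record[field] for field in fields}


# Two different questions asked of the same untouched raw data.
rpm_view = list(read_with_schema(raw_lines, ["machine_id", "rpm"]))
notes_view = list(read_with_schema(raw_lines, ["machine_id", "operator_note"]))
print(rpm_view)
print(notes_view)
```

The second view was never planned for when the data was captured, yet it still works because nothing was thrown away. The cost is that every reader has to do its own filtering and shaping.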

This is preferable for agile businesses that may not yet know exactly how they will use their data. To use the ‘bottled water’ analogy above, lots of interesting things in a lake are lost when the water is processed and bottled for easy retrieval. The lake, with its raw scope and depth, allows data scientists to jump in and possibly make new discoveries. The lack of structure also keeps costs low, but the technology is less mature and weaker on security. Currently, adopters need to compensate for Hadoop’s shortcomings in this area with vendor and open-source tools.

A source from a major international bank discusses some of these advantages, which led them to develop a data lake:

Three drivers led us to a Hadoop-based data lake. First, our executives and their data strategy prefer a ‘sourced once, used by everyone’ approach, and the scalability of Hadoop, augmented with the right tools and governance, can enable that. Second, our technical users need to comingle diverse data from many sources for broad self-service exploration and analytics, and that’s where the Hadoop-based data lake excels. Third, the low cost of Hadoop software and commodity hardware met our project’s financial requirements. [2]

A data warehouse suits high-performance, repeated, constant use within an expected and regular framework. The data lake is better for innovation, flexibility, and agility: cases where exploration for new opportunities is needed or expected, where advanced analytics are required, or where particularly diverse forms and quantities of data are involved. In many businesses, the warehouse will suffice and, given its maturity, offer better implementation and security solutions. Lakes are catching up, though: one 2017 study found that around 25% of businesses surveyed already had at least one data lake, with another 25% planning to put one into production soon. [3]

Finally, businesses that foresee a need for both can pursue a hybrid approach. Take a wearable health device, such as a heart rate monitor. The initial readings would be held in local memory and used by the device itself, then transferred to a paired device, which could send the data to a cloud solution via cellular or Wi-Fi transmission (with all the data protection and security issues this implies). Initial large-scale storage could be in a data lake, which allows flexibility and freedom in capturing such a large amount of data from a large number of devices. From there, data could be curated, processed, and sent in batches to a data warehouse set up to provide pre-defined reporting and analytics. Where warranted, this sort of hybrid system would provide the flexibility, speed, low cost, and scale of a lake, while still delivering the precise, regular reports and tools needed from a carefully curated warehouse.
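The lake-to-warehouse flow described above can be sketched as a small batch job. This is only an illustration under stated assumptions: a Python list of JSON strings stands in for the lake, an in-memory SQLite database stands in for the warehouse, and the device IDs, field names, and function names (`curate_batch`, `load_warehouse`) are all hypothetical.

```python
import json
import sqlite3
from collections import defaultdict

# Stand-in for the lake: raw events from many devices, kept as received.
lake = [
    '{"device_id": "hr-001", "bpm": 62, "ts": "2018-08-01T08:00:00"}',
    '{"device_id": "hr-001", "bpm": 70, "ts": "2018-08-01T08:01:00"}',
    '{"device_id": "hr-002", "bpm": 88, "ts": "2018-08-01T08:00:00"}',
    '{"device_id": "hr-002", "battery": 0.41}',  # non-reading event, stays in the lake
]


def curate_batch(raw_lines):
    """Curate raw events into one summary row per device."""
    totals = defaultdict(lambda: [0, 0])  # device_id -> [sum of bpm, count]
    for line in raw_lines:
        event = json.loads(line)
        if "bpm" not in event:
            continue  # not a heart rate reading; irrelevant to this report
        entry = totals[event["device_id"]]
        entry[0] += event["bpm"]
        entry[1] += 1
    return [(dev, total / count, count)
            for dev, (total, count) in sorted(totals.items())]


def load_warehouse(rows):
    """Load curated rows into a fixed reporting table (stand-in warehouse)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE heart_rate_summary ("
        "device_id TEXT PRIMARY KEY, avg_bpm REAL NOT NULL, "
        "readings INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO heart_rate_summary VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


warehouse = load_warehouse(curate_batch(lake))
report = warehouse.execute(
    "SELECT device_id, avg_bpm, readings FROM heart_rate_summary "
    "ORDER BY device_id"
).fetchall()
print(report)
```

The raw events, including the battery reading that this report ignores, remain available in the lake for later exploration, while the warehouse table only ever holds the curated summary that the pre-defined report needs.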

***