In our last post, we dove into data transformation and the difference between data integration and data wrangling, or ETL vs. ELT. To give a quick summary:
Data integration / ETL is a technology-driven process for tying together different sources and systems to collect their data and make it available to an organization. Data is transformed before it’s loaded into a repository (a “data warehouse” in classic BI parlance).
Data wrangling / ELT is the set of human-driven activities involved in taking some raw data and processing it into a state that’s suitable for analysis. In some cases, the data being wrangled lives in a “data lake” where all kinds of data have been dumped. People extract batches of data and transform them as needed for the analysis they’re doing at that time (a quick code sketch of the contrast follows below).
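To make the ETL / ELT distinction concrete, here’s a minimal sketch in Python using pandas. The file names, column names and the single transformation are hypothetical examples rather than any particular product’s pipeline; the only point is where the transformation happens relative to the load.

```python
import pandas as pd

# --- ETL: transform first, then load into the warehouse ---
raw = pd.read_csv("facebook_ads_export.csv")  # hypothetical source extract
transformed = (
    raw.rename(columns={"amount_spent": "spend_usd"})           # conform names to the warehouse schema
       .assign(report_date=lambda d: pd.to_datetime(d["date"]).dt.date)
       [["campaign_id", "report_date", "spend_usd", "clicks"]]  # keep only the agreed columns
)
transformed.to_parquet("warehouse/campaign_performance.parquet")  # load the clean, structured result

# --- ELT: load the raw data as-is, transform later per analysis ---
pd.read_csv("facebook_ads_export.csv").to_parquet("lake/facebook_ads_raw.parquet")
lake_batch = pd.read_parquet("lake/facebook_ads_raw.parquet")   # later, pull a batch out of the lake...
ad_hoc = lake_batch.groupby("campaign_id")["clicks"].sum()      # ...and shape it for one specific analysis
```

In the ETL branch the data is conformed to an agreed structure before anyone queries it; in the ELT branch it lands untouched and each analysis pays the transformation cost later.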
Here, we’ll take a closer look at data warehouses and data lakes. The two metaphors are apt: a warehouse implies structure and a lake does not, and that is exactly the difference between the two approaches.
The term “data warehouse” has usually meant data organized into rows and columns and built on database technology, but that’s not always the case. Structure of some kind remains the common denominator, but today’s data warehouses use different kinds of structures and a variety of underlying technologies. A well-structured data warehouse gives all users a consistent, unified view of the business. In every case, though, the ultimate data structure (or schema) must be designed up front, and that design work is hard: many projects stall or fail during this phase.
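As an illustration of what “designed up front” means in practice, here’s a minimal sketch of a predefined schema for campaign performance. The table, its columns and the use of SQLite are hypothetical examples; a real warehouse would use whatever database technology the organization has standardized on.

```python
import sqlite3

# Illustrative only: a tiny upfront schema for campaign performance.
# The table name, columns and types are hypothetical examples.
conn = sqlite3.connect("marketing_warehouse.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS campaign_performance (
        campaign_id  TEXT    NOT NULL,
        channel      TEXT    NOT NULL,  -- e.g. 'search', 'social', 'display'
        report_date  TEXT    NOT NULL,  -- ISO date
        impressions  INTEGER NOT NULL,
        clicks       INTEGER NOT NULL,
        spend_usd    REAL    NOT NULL,
        conversions  INTEGER NOT NULL,
        PRIMARY KEY (campaign_id, channel, report_date)
    )
""")
conn.commit()
```

Because every source has to land in these columns, every user sees the same structure; the flip side is that adding a new metric or source means revisiting the design.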
The term “data lake” is credited to Pentaho CTO James Dixon, who coined it to describe a repository filled from a variety of (sometimes unpredictable and unknown) streams and sources. A data lake requires less upfront investment than a data warehouse, and since it doesn’t enforce a particular structure, it’s flexible and can accommodate frequent changes.
A typical downside of having no structure is that only experts can find and use the data, so your team of data analysts can become a bottleneck, struggling to keep up with the demands of the business.
Which approach is best for marketing? In nearly every case, it needs to be a combination of the two.
Consider these use cases:
- At least some of the users of the data are businesspeople rather than skilled analysts or data scientists. For example, if marketers need to be able to explore and analyze the data and make decisions themselves, then a good, predefined data schema ensures they can find the right data and explore it efficiently.
- Consistency and governance are important. If you want to judge all your agencies against common criteria, for example, and you want everyone using the same standards and definitions for measuring campaign performance, then it makes sense to embed and enforce those standards in the actual data everyone is using (see the sketch after this list).
- The types of analyses and reports can be somewhat standardized or known in advance. If your only inputs are marketing and related data, and you know (at least roughly) the kinds of things you want to learn and report on, then a clean, standardized dataset will make the process faster from a technology standpoint and easier and more efficient from a user standpoint.
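To illustrate the consistency point above, here’s a minimal sketch of what embedding shared standards in the data can look like. The metric definitions below are hypothetical examples; the idea is simply that they are computed once, in the transformation step, rather than redefined in every agency’s report.

```python
import pandas as pd

# Hypothetical shared metric definitions, computed once in the warehouse
# transformation so every agency and every report inherits the same numbers.
def add_standard_metrics(perf: pd.DataFrame) -> pd.DataFrame:
    return perf.assign(
        ctr=lambda d: d["clicks"] / d["impressions"],         # click-through rate
        cpc_usd=lambda d: d["spend_usd"] / d["clicks"],       # cost per click
        cpa_usd=lambda d: d["spend_usd"] / d["conversions"],  # cost per acquisition
    )

# Applied in the pipeline rather than in individual reports:
# warehouse_table = add_standard_metrics(transformed)
```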
You’ll no doubt recognize each of these factors as an important and familiar ingredient of a marketing analytics or marketing intelligence practice. It’s clear that if you want to be a data-driven marketing organization, then you need centralized, normalized and structured data—like a data warehouse.
However, marketers also need freedom to innovate in an environment of rapidly increasing data volumes. For that you want maximum flexibility, extensibility and scale—like a data lake.
So the ideal solution for marketing is not one or the other. It must combine the advantages of each.