In a recent post, we shared a bit about the basics of data integration. Now, let’s dive deeper into one particular element: data transformation.
Data transformation refers to the ways you convert raw data into a form that's useful for downstream applications and teams. Data often lives in different locations and diverse formats across the enterprise, so in addition to getting data out of its various silos, you need to transform at least some of it before you can glean any useful insights.
Let’s say you have several data sources that include geographical and gender information about your customers. One source contains postal codes while another has geographical coordinates; one source uses numerical codes to identify genders, another uses M and F, and another uses “male” and “female.” When you blend these sources together, you want to standardize the way the resulting dataset expresses location and gender.
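To make that concrete, here's a minimal sketch of that kind of standardization in Python with pandas. The source frames, column names, and code values are made up for illustration; the point is that blending only works once every source maps to one canonical representation.

```python
import pandas as pd

# Hypothetical extracts from three customer sources, each encoding gender differently.
source_a = pd.DataFrame({"customer_id": [1, 2], "gender": ["1", "2"]})          # numeric codes
source_b = pd.DataFrame({"customer_id": [3, 4], "gender": ["M", "F"]})          # single letters
source_c = pd.DataFrame({"customer_id": [5, 6], "gender": ["male", "female"]})  # words

# One mapping to a single canonical representation.
GENDER_MAP = {
    "1": "male", "2": "female",
    "M": "male", "F": "female",
    "male": "male", "female": "female",
}

# Blend the sources, then standardize the gender column in one pass.
customers = pd.concat([source_a, source_b, source_c], ignore_index=True)
customers["gender"] = customers["gender"].str.strip().map(GENDER_MAP)

print(customers)
```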
The example above is a simple one. Data transformation can get very complex. What’s more, there are different approaches, and a number of factors that determine which approach makes the most sense—what the data will be used for, the skills and needs of the users, budgetary constraints, etc.
Depending on these factors and others, transformation can be part of a data integration solution, or an activity within a data wrangling project.
What’s the difference?
Data integration is a technology-driven process for tying different sources and systems together into a persistent solution to collect data and make it available to an organization. For example, say you need to send out regular reports that rely on data from a customer database, blended with data from Google Analytics and DoubleClick, plus some data from your financial system, and finally some data from monthly spreadsheets your agency manages. This calls for a data integration solution with data transformation built in. The term for this is ETL, which stands for Extract, Transform, Load.
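As a rough sketch (not a full integration product), an ETL job might look like the following. The source systems are stubbed with small in-memory frames and SQLite stands in for the reporting database; the key point is that the transform step runs before anything is loaded.

```python
import sqlite3
import pandas as pd

def extract():
    """Pull rows from each source. A real pipeline would call the CRM,
    Google Analytics, DoubleClick, and finance systems; here we stub
    them with small in-memory frames."""
    crm = pd.DataFrame({"customer_id": [1, 2], "postal_code": ["94105", "10001"]})
    analytics = pd.DataFrame({"customer_id": [1, 2], "sessions": [12, 7]})
    return crm, analytics

def transform(crm, analytics):
    """Standardize and blend before loading -- this is the 'T' in ETL."""
    crm["postal_code"] = crm["postal_code"].str.zfill(5)
    return crm.merge(analytics, on="customer_id", how="left")

def load(df, conn):
    """Write the blended, cleaned table into the reporting database."""
    df.to_sql("customer_report", conn, if_exists="replace", index=False)

conn = sqlite3.connect(":memory:")
load(transform(*extract()), conn)
```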
Data wrangling, on the other hand, is the set of human-driven activities involved in taking some raw data and processing it into a state that’s suitable for analysis. This could be a one-off job where someone blends data from a few sources, manually decides which columns or fields to use and which to discard, fixes typos and mismatches, and finds and removes outliers and anomalies, all in order to create a clean and blended dataset for analysis. In some cases, this kind of process is referred to as ELT, for Extract, Load, Transform.
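A wrangling pass like that might be as simple as the following pandas snippet. The raw frame, the typo fixes, and the outlier rule are all hypothetical; what matters is that a person is making these cleanup decisions by hand, case by case.

```python
import pandas as pd

# A one-off wrangling pass over a raw export (hypothetical columns and values).
raw = pd.DataFrame({
    "country": ["USA", "U.S.A.", "United States", "Canada"],
    "revenue": [1200.0, 950.0, 1100.0, 250000.0],   # last row is an obvious outlier
    "notes":   ["ok", "dup?", "", "ok"],
})

# Keep only the columns the analysis needs.
df = raw[["country", "revenue"]].copy()

# Fix mismatched spellings of the same value.
df["country"] = df["country"].replace({"U.S.A.": "USA", "United States": "USA"})

# Drop rows far outside the typical range (a crude outlier rule, for illustration only).
df = df[df["revenue"] < df["revenue"].median() * 10]

print(df)
```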
Notice that ETL and ELT involve the same steps; the difference is where the transformation step falls, before or after the data is loaded.
Here’s another way to think about it: With ETL, the data is transformed before it’s loaded into a repository, which means the ultimate data structure (or schema) needs to be designed up front. A lot of projects stall or fail during this design phase.
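In practice, "designed up front" means something like the following: a target schema that exists before any data arrives, and that every load has to conform to. The table and column names here are hypothetical, and SQLite again stands in for a real warehouse.

```python
import sqlite3

# Schema-on-write: the target structure is decided before any data is loaded.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_report (
        customer_id INTEGER PRIMARY KEY,
        postal_code TEXT NOT NULL,
        gender      TEXT CHECK (gender IN ('male', 'female')),
        sessions    INTEGER
    )
""")
# Every later load must already fit this shape,
# which is why the design work happens up front.
```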
ELT requires less upfront design, because you just dump all kinds of data into a “data lake” without transforming it. However, you can end up with a mess that even skilled analysts have trouble navigating.
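By contrast, an ELT-style landing zone can be as loose as a single column of raw payloads, with the transformation deferred until someone actually reads the data. Again, the payloads and mapping below are made up for illustration.

```python
import json
import sqlite3

# ELT sketch: land raw records untouched, transform later at analysis time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (payload TEXT)")  # no schema design up front

# Hypothetical raw payloads from different sources, loaded exactly as they arrived.
events = [
    {"customer": 1, "gender": "M"},
    {"customer": 2, "gender": "female", "postal_code": "94105"},
]
conn.executemany("INSERT INTO raw_events VALUES (?)",
                 [(json.dumps(e),) for e in events])

# The "T" happens only when an analyst reads the data back out.
rows = [json.loads(p) for (p,) in conn.execute("SELECT payload FROM raw_events")]
genders = {"M": "male", "female": "female"}
cleaned = [{"customer": r["customer"], "gender": genders.get(r.get("gender"))} for r in rows]
print(cleaned)
```

The flexibility is real, but so is the risk: nothing about that raw table stops the next source from encoding gender in yet another way.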
Deciding which is right for you—whether you should transform data up front or at analysis time—requires a deeper understanding of the tradeoffs. That’s what we’ll cover in our next post.