I have been hearing a lot about data lakes lately. Progressive marketers and some large enterprise publishers have been breaking out of traditional data warehouses, mostly used to store structured data, and investing in infrastructure so they can store tons of their first-party data and query it for analytics purposes.
“A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed,” according to Amazon Web Services. “While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.”
A few years ago, data lakes were thought to be limited to Hadoop applications (distributed file storage), but the term is now more broadly applied to an environment in which an enterprise can store both structured and unstructured data and have it organized for fast query processing. In the ad tech and mar tech world, this is almost universally about first-party data. For example, a big airline might want to store transactional data from ecommerce alongside beacon pings to understand how often online ticket buyers in its loyalty program use a certain airport lounge.
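To make the airline example concrete, here is a minimal sketch of that kind of query in Python. The records, field names (`loyalty_id`, `channel`, `location`) and the function name are all assumptions invented for illustration; a real data lake would run something similar through a query engine over far larger data sets.

```python
from collections import Counter

# Hypothetical ecommerce transactions (field names are assumptions).
ticket_purchases = [
    {"loyalty_id": "A1", "channel": "online", "fare": 420.0},
    {"loyalty_id": "B2", "channel": "call_center", "fare": 310.0},
    {"loyalty_id": "C3", "channel": "online", "fare": 150.0},
]

# Hypothetical beacon pings collected in the airport.
beacon_pings = [
    {"loyalty_id": "A1", "location": "lounge_jfk"},
    {"loyalty_id": "A1", "location": "lounge_jfk"},
    {"loyalty_id": "B2", "location": "lounge_jfk"},
    {"loyalty_id": "C3", "location": "gate_12"},
]


def lounge_visits_by_online_buyers(purchases, pings, lounge):
    """Count pings at one lounge per loyalty member who bought online."""
    online_buyers = {
        p["loyalty_id"] for p in purchases if p["channel"] == "online"
    }
    visits = Counter(
        ping["loyalty_id"]
        for ping in pings
        if ping["location"] == lounge and ping["loyalty_id"] in online_buyers
    )
    return dict(visits)


result = lounge_visits_by_online_buyers(
    ticket_purchases, beacon_pings, "lounge_jfk"
)
print(result)
```

The point of the sketch is only that two very different data sets, one transactional and one sensor-based, sit side by side and are joined on a shared identifier when the question is asked.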
However, as we discussed earlier this year, there are many marketers with surprisingly sparse data, like the food marketer who does not get many website visitors or authenticated customers downloading coupons. Today, those marketers face a situation where they want to use data science to do user scoring and modeling but, because they only have enough of their own data to fill a shallow lake, they have trouble justifying the costs of scaling the approach in a way that moves the sales needle.
Figure 1: Marketers with sparse data often do not have enough raw data to create measurable outcomes in audience targeting through modeling. Source: Chris O’Hara.
In the example above, we can think of the marketer’s first-party data – media exposure data, email marketing data, website analytics data, etc. – as the water that fills a data lake. That data is pumped into a data management platform (pictured here as a hydroelectric dam), sent like electricity through ad tech pipes (demand-side platforms, supply-side platforms and ad servers) and finally delivered to places where it is activated (in the town, where people live).
As becomes apparent, this infrastructure can exist with even a tiny bit of water but, at the end of the cycle, not enough electricity will be generated to create decent outcomes and sustain a data-driven approach to marketing. This is a long way of saying that ever-larger amounts of high-quality data are needed to create the potential for better targeting and analytics.
Most marketers today – even those with lots of data – find themselves overly reliant on third-party data to fill in these gaps. However, even if they have the rights to model it in their own environment, there are loads of restrictions on using it for targeting. It is also highly commoditized and can be of questionable provenance. (Is my Ferrari-browsing son really an “auto intender”?) While third-party data can be highly valuable, relying on it is akin to adding sediment to a data lake, which clouds visibility when trying to peer to the bottom for deep insights.
So, how can marketers fill data lakes with large amounts of high-quality data that can be used for modeling? I am starting to see the emergence of peer-to-peer data-sharing agreements that help marketers fill their lakes, deepen their ability to leverage data science and add layers of artificial intelligence through machine learning to their stacks.
Figure 2: Second-party data is simply someone else’s first-party data. When relevant data is added to a data lake, the result is a more robust environment for deeper data-led insights for both targeting and analytics. Source: Chris O’Hara.
In the above example (Figure 2), second-party data deepens the marketer’s data lake, powering the DMP with richer data that can be used for modeling, activation and analytics. Imagine a huge beer company that is launching a country music promotion for its flagship brand. As a CPG company with relatively sparse first-party data, it would traditionally seek out music fans of a certain location and demographic through third-party sources and apply those third-party segments to a programmatic campaign.
But what if the beer manufacturer teamed up with a big online ticket seller and arranged a data subscription for “all viewers or buyers of a Garth Brooks ticket in the last 180 days”? Those are exactly the people I would want to target, and they are unavailable anywhere in the third-party data ecosystem.
The data is also of clear provenance, and I could use it in my own environment, where I could model it against my first-party data, such as site visitors or mobile IDs I gathered when I sponsored free Wi-Fi at the last Country Music Awards. The ability to gather and license those specific data sets and use them for modeling in a data lake should drive much stronger outcomes in my addressable campaigns and give me an edge I cannot get using traditional ad network approaches with third-party segments.
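As a rough illustration of what "modeling it against my first-party data" might start with, the sketch below compares a licensed second-party audience with a marketer's own IDs. The ID values, variable names and the `summarize_overlap` function are hypothetical, and the sketch assumes both parties already share a common identifier, which in practice requires an ID-matching step that is not shown here.

```python
# Hypothetical first-party IDs (site visitors, Wi-Fi sponsorship sign-ins).
first_party_ids = {"u1", "u2", "u3", "u4"}

# Hypothetical second-party IDs licensed from the ticket seller,
# e.g. viewers or buyers of a specific ticket in the last 180 days.
ticket_seller_ids = {"u3", "u4", "u5", "u6", "u7"}


def summarize_overlap(first_party, second_party):
    """Split a second-party audience into known users and net-new reach."""
    overlap = first_party & second_party    # people the marketer already knows
    net_new = second_party - first_party    # people only the partner knows
    return {
        "overlap": sorted(overlap),
        "net_new": sorted(net_new),
        "combined_size": len(first_party | second_party),
    }


summary = summarize_overlap(first_party_ids, ticket_seller_ids)
print(summary)
```

The overlapping users are the natural seed for modeling, since the marketer can see both its own signals and the partner's for the same people, while the net-new users show how much the subscription deepens the lake.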
Moreover, the flexibility around data capture enables marketers to use highly disparate data sets and combine and normalize them with metadata, without having to worry about mapping them to a predefined schema. The associative work happens at query time rather than at ingestion. Because the data does not have to fit a predefined schema to become valuable, the inherent observational bias in traditional approaches (“country music fans love mainstream beer, so I’d better capture that”) never hinders the ability to activate against unforeseen insights.
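This idea of deferring structure until the question is asked is often called schema-on-read. A minimal sketch, assuming raw events are kept as JSON strings with whatever fields each source happened to send (the sources, field names and helper functions below are invented for illustration):

```python
import json

# Raw events stored as-is; each source uses its own shape (all hypothetical).
raw_events = [
    '{"source": "site", "user": "u1", "page": "/promo/country"}',
    '{"source": "wifi", "device_id": "u2", "venue": "awards_show"}',
    '{"source": "tickets", "user": "u1", "artist": "Garth Brooks", "action": "buy"}',
]


def query(raw_lines, predicate):
    """Parse each raw record at read time and keep those matching predicate."""
    for line in raw_lines:
        record = json.loads(line)
        if predicate(record):
            yield record


def user_key(record):
    """Normalize the identifier at query time; sources name it differently."""
    return record.get("user") or record.get("device_id")


# A question nobody planned for at capture time: which other touchpoints
# do ticket buyers for one artist show up in?
ticket_buyers = {
    user_key(r)
    for r in query(raw_events, lambda r: r.get("artist") == "Garth Brooks")
}
touchpoints = [
    r["source"] for r in query(raw_events, lambda r: user_key(r) in ticket_buyers)
]
print(touchpoints)
```

Nothing about the stored events anticipated the artist question; the mapping between `user` and `device_id` and the filter itself are applied only when the data is read.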
Large, sophisticated marketers and publishers are just starting to build their lakes and gather the data assets to deepen them, so we will likely see a great many examples of this approach over the coming months.
It’s a great time to be a data-driven marketer.