Deepening The Data Lake: How Second-Party Data Increases AI For Enterprises


I have been hearing a lot about data lakes lately. Progressive marketers and some large enterprise publishers have been breaking out of traditional data warehouses, mostly used to store structured data, and investing in infrastructure so they can store tons of their first-party data and query it for analytics purposes.

“A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed,” according to Amazon Web Services. “While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.”

A few years ago, data lakes were thought to be limited to Hadoop applications (object storage), but the term is now more broadly applied to an environment in which an enterprise can store both structured and unstructured data and have it organized for fast query processing. In the ad tech and mar tech world, this is almost universally about first-party data. For example, a big airline might want to store transactional data from ecommerce alongside beacon pings to understand how often online ticket buyers in its loyalty program use a certain airport lounge.

However, as we discussed earlier this year, there are many marketers with surprisingly sparse data, like the food marketer who does not get many website visitors or authenticated customers downloading coupons. Today, those marketers face a situation where they want to use data science to do user scoring and modeling but, because they only have enough of their own data to fill a shallow lake, they have trouble justifying the costs of scaling the approach in a way that moves the sales needle.


Figure 1: Marketers with sparse data often do not have enough raw data to create measurable outcomes in audience targeting through modeling. Source: Chris O’Hara.

In the example above, we can think of the marketer’s first-party data – media exposure data, email marketing data, website analytics data, etc. – as the water that fills a data lake. That data is pumped into a data management platform (pictured here as a hydroelectric dam), flows like electricity through ad tech pipes (demand-side platforms, supply-side platforms and ad servers) and is finally delivered to the places where it is activated (in the town, where people live).

As becomes apparent, this infrastructure can exist with even a tiny bit of water but, at the end of the cycle, not enough electricity will be generated to create decent outcomes and sustain a data-driven approach to marketing. This is a long way of saying that the data itself, both in quality and quantity, is needed in ever-larger amounts to create the potential for better targeting and analytics.

Most marketers today – even those with lots of data – find themselves overly reliant on third-party data to fill in these gaps. However, even if they have the rights to model it in their own environment, there are loads of restrictions on using it for targeting. It is also highly commoditized and can be of questionable provenance. (Is my Ferrari-browsing son really an “auto intender”?) While third-party data can be highly valuable, it would be akin to adding sediment to a data lake, creating murky visibility when trying to peer into the bottom for deep insights.

So, how can marketers fill data lakes with large amounts of high-quality data that can be used for modeling? I am starting to see the emergence of peer-to-peer data-sharing agreements that help marketers fill their lakes, deepen their ability to leverage data science and add layers of artificial intelligence through machine learning to their stacks.


Figure 2: Second-party data is simply someone else’s first-party data. When relevant data is added to a data lake, the result is a more robust environment for deeper data-led insights for both targeting and analytics. Source: Chris O’Hara.

In the above example (Figure 2), second-party data deepens the marketer’s data lake, powering the DMP with more rich data that can be used for modeling, activation and analytics. Imagine a huge beer company that was launching a country music promotion for its flagship brand. As a CPG company with relatively sparse amounts of first-party data, the traditional approach would be to seek out music fans of a certain location and demographic through third-party sources and apply those third-party segments to a programmatic campaign.

But what if the beer manufacturer teamed up with a big online ticket seller and arranged a data subscription for “all viewers or buyers of a Garth Brooks ticket in the last 180 days”? Those are exactly the people I would want to target, and they are unavailable anywhere in the third-party data ecosystem.

The data is also of extremely high provenance, and I would also be able to use that data in my own environment, where I could model it against my first-party data, such as site visitors or mobile IDs I gathered when I sponsored free Wi-Fi at the last Country Music Awards. The ability to gather and license those specific data sets and use them for modeling in a data lake is going to create massive outcomes in my addressable campaigns and give me an edge I cannot get using traditional ad network approaches with third-party segments.

Moreover, the flexibility around data capture enables marketers to use highly disparate data sets, combine and normalize them with metadata – and not have to worry about mapping them to a predefined schema. The associative work happens after the query takes place. That means I don’t need a predefined schema in place for that data to become valuable – a way of saying that the inherent observational bias in traditional approaches (“country music fans love mainstream beer, so I’d better capture that”) never hinders the ability to activate against unforeseen insights.
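That schema-on-read idea can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation; the event records and field names are invented, and a real data lake would run such queries over object storage rather than an in-memory list:

```python
import json

# Hypothetical raw event records kept in their native format -- no shared schema.
raw_events = [
    '{"type": "ticket_view", "artist": "Garth Brooks", "user": "u1"}',
    '{"type": "wifi_ping", "venue": "CMA Awards", "device": "u2"}',
    '{"type": "site_visit", "page": "/coupons", "user": "u1"}',
]

def query(predicate):
    """Schema-on-read: parse and filter only when the question is asked."""
    return [rec for rec in (json.loads(e) for e in raw_events) if predicate(rec)]

# An unforeseen question, posed after the fact -- no upfront schema required.
country_fans = query(lambda r: r.get("artist") == "Garth Brooks")
print(country_fans)  # [{'type': 'ticket_view', 'artist': 'Garth Brooks', 'user': 'u1'}]
```

The point is that nothing about "artist" had to be modeled when the data was captured; the association happens at query time.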

Large, sophisticated marketers and publishers are just starting to get their lakes built and begin gathering the data assets to deepen them, so we will likely see a great many examples of this approach over the coming months.

It’s a great time to be a data-driven marketer.

Follow Chris O’Hara (@chrisohara) and AdExchanger (@adexchanger) on Twitter.


Choosing a Data Management Platform

“Big Data” is all the rage right now, and for good reason. Storing tons and tons of data has gotten very inexpensive, while the accessibility of that data has increased substantially in parallel. For the modern marketer, that means having access to literally dozens of disparate data sources, each of which cranks out large volumes of data every day. Collecting, understanding, and taking action against those data sets is going to make or break companies from now on. Luckily, an almost endless variety of companies have sprung up to assist agencies and advertisers with the challenge. When it comes to the largest volumes of data, however, there are some highly specific attributes you should consider when selecting a data management platform (DMP).

Collection and Storage: Scale, Cost, and Ownership
First of all, before you can do anything with large amounts of data, you need a place to keep it. That place is increasingly becoming “the cloud” (i.e., someone else’s servers), but it can also be your own servers. If you think you have a large amount of data now, you will be surprised at how much it will grow. As devices like the iPad proliferate, changing the way we find content, even more data will be generated. Companies that have data solutions with the proven ability to scale at low cost will be best able to extract real value from this data. Make sure to understand how your DMP scales and what kinds of hardware it uses for storage and retrieval.

Speaking of hardware, be on the lookout for companies that formerly sold hardware (servers) getting into the data business so they can sell you more machines. When the data is the “razor,” the servers necessarily become the “blades.” You want a data solution whose architecture enables the easy ingestion of large, new data sets and takes advantage of dynamic cloud provisioning to keep ongoing costs low, not necessarily a hardware partner.

Additionally, your platform should be able to manage extremely high volumes of data quickly, have an architecture that enables other systems to plug in seamlessly, and offer core functionality for multi-dimensional analysis of the stored data at a highly granular level. Your data is going to grow exponentially, so the first rule of data management is making sure that, as it grows, your ability to query it scales as well. Look for a partner that can deliver on those core attributes, and be wary of partners whose expertise is limited to storing narrow data sets.

There are a lot of former ad networks out there with a great deal of experience managing common third party data sets from vendors like Nielsen, IXI, and Datalogix. When it comes to basic audience segmentation, there is a need to manage access to those streams. But if you are planning on capturing and analyzing data that includes CRM and transactional data, social signals, and other large data sets, you should look for a DMP that has experience working with first party data as well as third party data sets.

The concept of ownership is also becoming increasingly important in the world of audience data. While the source of data will continue to be distributed, make sure that whether you choose a hosted or a self-hosted model, your data ultimately belongs to you. This allows you to control the policies around historical storage and enables you to use the data across multiple channels.

Consolidation and Insights: Welcome to the (Second and Third) Party
Third party data (in this context, available audience segments for online targeting and measurement) is the stuff the famous Kawaja logo-vomit map was born from. Look at the map, and you are looking at over 250 companies dedicated to using third party data to define and target audiences. A growing number of platforms help marketers analyze, purchase, and deploy that data for targeting (BlueKai, Exelate, and Legolas being great examples). Other networks (Lotame, Collective, Turn) have leveraged their proprietary data along with their clients’ data to offer audience management tools that combine both with third party data to optimize campaigns. Still others (PulsePoint’s Aperture tool being a great example) leverage all kinds of third party data to measure online audiences so they can be modeled and targeted against.

The key is not having the most third party data, however. Your DMP should marry highly validated first party data with third party data, identifying and matching users across both sources while keeping them anonymized. DMPs must be able to consolidate these sources into as complete a view of your audience as possible, and your DMP solution must be able to enrich that audience information using second and third party data. Second party data is data about your audience generated outside your own network (for example, an ad viewed on a publisher site or a search engine). While you must choose the set of third party providers that offer the best data about your audience, your DMP must also be able to increase reach, both by collecting information about as many relevant users as possible and through lookalike modeling.

First Party Data

  • CRM data, such as user registrations
  • Site-side data, including visitor history
  • Self-declared user data (income, interest in a product)

Second Party Data

  • Ad serving data (clicks, views)
  • Social signals from a hosted solution
  • Keyword search data through an analytics platform or campaign

Third Party Data

  • Audience segments acquired through a data provider
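The three tiers above are only useful once they are consolidated around a common user key. Here is a minimal sketch of that merge step; the anonymized user IDs and attribute names are invented for illustration, and a production DMP would do this matching at far larger scale:

```python
# Hypothetical per-tier records, each keyed on an anonymized user ID.
first_party = {"u1": {"registered": True, "income": "75k+"}}
second_party = {"u1": {"ad_clicks": 3}, "u2": {"ad_clicks": 1}}
third_party = {"u1": {"segments": ["auto intender"]}}

def consolidate(*tiers):
    """Merge every tier into a single unified profile per user ID."""
    profiles = {}
    for tier in tiers:
        for uid, attrs in tier.items():
            profiles.setdefault(uid, {}).update(attrs)
    return profiles

profiles = consolidate(first_party, second_party, third_party)
# profiles["u1"] now holds registration, ad exposure, and segment data together.
```

The unified profile is what makes the enrichment described above possible: second and third party attributes attach to a first party identity rather than floating free.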

For example, if you are selling cars and you discover that your on-site users who register for a test drive most closely match PRIZM’s “Country Squires” audience, it is not enough to buy that Nielsen segment. A good DMP enables you to create your own lookalike segment by leveraging that insight, plus the tons of data you already have. In other words, the right DMP partner can help you leverage third party data to activate your own (first party) data.
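One common way to build such a lookalike segment is to score the wider user pool by similarity to the seed audience of test-drive registrants. The sketch below uses cosine similarity against the seed centroid; the feature vectors and user IDs are made up, and real systems use many more features and proper machine-learned models:

```python
from math import sqrt

# Made-up behavioral features per user: (pages_viewed, rural_zip, income_index).
seed = {"s1": (8, 1, 0.9), "s2": (6, 1, 0.8)}   # users who registered for a test drive
pool = {"u1": (7, 1, 0.85), "u2": (2, 0, 0.3)}  # everyone else

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Average the seed vectors into a centroid, then rank the pool against it.
centroid = tuple(sum(vals) / len(seed) for vals in zip(*seed.values()))
lookalikes = sorted(pool, key=lambda u: cosine(pool[u], centroid), reverse=True)
# u1's behavior resembles the seed audience far more closely than u2's.
```

The top-ranked users become the lookalike segment, which is then pushed out for targeting.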

Make sure your provider leads with management of first party data, has experience mining both types of data to produce the types of insights you need for your campaigns, and can get that data quickly. Data management platforms aren’t just about managing gigantic spreadsheets. They are about finding out who your customers are, and building an audience DNA that you can replicate.

Making it Work
At the end of the day, it’s not just about getting all kinds of nifty insights from the data. It’s valuable to know that your visitors who were exposed to search and display ads converted at a 16% higher rate, or that your customers have an average of two females in the household. But it’s making those insights meaningful that really matters.

So, what to look for in a data management platform in terms of actionability? For the large agency or advertiser, the basic functionality has to be creating an audience segment. In other words, when the blend of data in the platform reveals that showing five display ads and two SEM ads to a household with two women in it creates sales, the platform should be able to seamlessly produce that segment and prepare it for ingestion into a DSP or advertising platform. That means having an extensible architecture that enables the platform to integrate easily with other systems.
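That segment-creation step can be pictured as a simple rule evaluated over exposure data. This is an illustrative sketch only; the household records, field names, and the flat export format are all invented, and real DSP integrations use each platform's own ingestion APIs:

```python
# Hypothetical exposure log: per-household ad counts and composition.
households = [
    {"id": "h1", "display_ads": 5, "sem_ads": 2, "women": 2},
    {"id": "h2", "display_ads": 1, "sem_ads": 0, "women": 2},
]

def build_segment(rule, records):
    """Evaluate a targeting rule and return the matching IDs for DSP export."""
    return [r["id"] for r in records if rule(r)]

# The insight from the text: 5+ display ads and 2+ SEM ads to a two-woman household.
rule = lambda r: r["display_ads"] >= 5 and r["sem_ads"] >= 2 and r["women"] == 2
segment = build_segment(rule, households)
print(segment)  # ['h1']
```

The resulting ID list is the artifact that actually crosses the integration boundary into a DSP or advertising platform.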

Moreover, your DMP should enable you to do a wide range of experimentation with your insights. Marketers often wonder which levers they should pull to create specific results (i.e., if I change my display creative and increase the frequency cap to X for a given audience segment, how much will conversions increase?). Great DMPs can help build those attribution scenarios and help marketers visualize the results. Deploying specific optimizations in a test environment first means less waste and more performance. Optimizing in the cloud first is going to become the new standard in marketing.

Final Thoughts
There are a lot of great data management companies out there, some better suited than others when it comes to specific needs. If you are in the market for one, and you have a lot of first party data to manage, following these three rules will lead to success:

  • Go beyond third party data by choosing a platform that enables you to develop deep audience profiles that leverage first and third party data insights. With ubiquitous access to third party data, using your proprietary data stream for differentiation is key.
  • Choose a platform that makes acting on the data easy and effective. “Shiny, sexy” reports are great, but the right DMP should help you take the beautifully presented insights in your UI and make them work for you.
  • Make sure your platform has an applications layer. DMPs must not only provide the ability to profile your segments, but also assist you with experimentation and attribution, and provide you with the ability to easily perform complicated analyses (churn and closed-loop analysis being two great examples). If your platform can’t make the data dance, find another partner.

Available DMPs, by Type
There are a wide variety of DMPs out there to choose from, depending on your need. Since the space is relatively new, it helps to think about them in terms of their legacy business model:

  • Third Party Data Exchanges / Platforms: Among the most popular DMPs are data aggregators like BlueKai and Exelate, who have made third party data accessible from a single user interface. BlueKai’s exchange approach enables data buyers to bid for cookies (or “stamps”) in a real-time environment, and offers a wide variety of providers to choose from. Exelate also enables access to multiple third party sources, albeit not in a bidded model. Lotame offers a platform called “Crowd Control,” which evolved from social data but now enables management of a broader range of third party data sets.
  • Legacy Networks: Certain networks with experience in audience segmentation have evolved to provide data management capabilities, including Collective, Audience Science, and Turn. Collective is actively acquiring assets (such as the creative optimization provider Tumri) to broaden its “technology stack” in order to offer a complete digital media solution for demand-side customers. Turn is, in fact, a fully featured demand-side platform with advanced data management capabilities, albeit lacking the backend chops to aggregate and handle “Big Data” solutions (although that may rapidly change, considering its deep engagement with Experian). Audience Science boasts the most advanced native categorical audience segmentation capabilities, having created dozens of specific, readily accessible audience segments, and continues to migrate its core capabilities from media sales to data management.
  • Pure Play DMPs: Demdex (Adobe), Red Aril, Krux, and nPario are all pure-play data management platforms, created from the ground up to ingest, aggregate, and analyze large data sets. Unlike legacy networks, or DMPs that specialize in aggregating third party data sets, these DMPs provide three core offerings: a core platform for storage and retrieval of data; analytics technology for getting insights from the data through a reporting interface; and applications that enable marketers to take action against that data, such as audience segment creation or lookalike modeling functionality. Marketers with extremely large sets of structured and unstructured data that go beyond ad serving and audience data (and may include CRM and transactional data, for example) will want to work with a pure-play DMP.

This post is an excerpt of Best Practices in Digital Display Advertising: How to make a complex ecosystem work efficiently for your organization. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording or any information storage and retrieval system, without prior permission in writing from the publisher.

Copyright © Econsultancy.com Ltd 2012