Data Clean Rooms Are Evolving

That strikethrough in the title is deliberate, so bear with me.

I know clean rooms about as well as anyone. I was there very early in the Habu journey, through to the company’s exit to LiveRamp and beyond. Clean rooms were one of the most zeitgeisty adtech and martech platforms I think we’ve ever seen, and that wasn’t without reason.

The notion of being able to combine your data with someone else’s is genuinely powerful. Understanding non-endemic insights, enriching your view of a customer, finding specific target groups… none of that is to be sniffed at. I made the case for clean rooms on plenty of stages back in the day, and I still believe the core promise was real.

But here’s the reality for 90% of the clean rooms that actually got stood up: they only ever had one company’s data in them. Not two. What was really happening was that people used clean rooms to provision access to their first-party data to an external party, an agency or a brand, so that party could run queries with the rails on and only ever see aggregated outputs. There’s loads of value in that and I don’t want to diminish it. But my point is that a lot of the clever stuff under the hood, the double-blind joins, the secure computation, really wasn’t being used. The second you introduced the need to join at a one-to-one level between party A and party B’s datasets, legal and privacy got in the way. It was far easier to start with analytics clean rooms powered by pre-built libraries of queries, outputting nice graphs and aggregations. BI on partner data, essentially.

I said something similar to Digiday back in 2022, when more than half the marketers surveyed had never touched a clean room. The gap between the hype and the practical reality was wide even then.

They reinforced the old way of working

These analytics clean rooms actively reinforced the prior generation of data management. Customers would load their clean room with audience taxonomies and data files. Analytics and data science teams would build SQL templates that kept the outputs on track. Clients would run parameterised insights at will. All perfectly functional. But ultimately, everyone was only ever able to find what had already been predetermined. You can’t query your way to a pattern nobody thought to write a template for.

Fast forward to today, and almost everything underneath has changed. SQL is no longer the bedrock of analysis. Computation is shifting from CPU to GPU as models, deep learning, and ongoing discovery become the high-value workloads. As a result, taxonomies aren’t worth what they used to be. What matters now is high-scale, high-dimensionality, well-categorised data that’s ready for AI.

Deep learning takes an entirely different approach to all of this. It changes how you process source data, how you organise it, and, in the context of collaborating with an outside party, how you expose the aggregate.

A retail media example

Let’s say we’re a retail media network and we want to externalise information about our loyalty customers to one of our largest suppliers.

Historically, we’d spin up a clean room, whether independent software or via a cloud, pre-provision a bunch of queries, and maybe let the supplier bring a thin layer of their own data to match against a subset of users. The client gets some analyses and insights, but they likely don’t learn much genuinely new about their consumers. They certainly don’t discover the underlying temporal or counter-intuitive trends sitting in a dataset that vast. A huge amount of effort for what amounts to a glorified dashboard.

Deep learning does something different. Instead of pre-determining the taxonomy or the segments or the analyses, we let the neural network run unsupervised discovery on the source data. As the retail media network, I could pre-provision a table of raw loyalty information, perhaps confined to a single category or a couple of categories relevant to the supplier, and have the network find the latent patterns and natural clusters inside it.

A few benefits fall out of that straight away:

There’s nothing to pre-build. No template library, no taxonomy to agree on first. Much faster to deploy.
You can operate across the entire catalogue and customer base. With the firepower NVIDIA provides, a greater scale of inputs means more interesting outputs. You’re no longer rationing what you look at.
You strip out the human bias. Focusing on discovery rather than confirmation surfaces opportunities you wouldn’t have gone looking for. Discovery is what unlocks the revenue from these partnerships, as opposed to simply reinforcing assumptions everyone already held.

Where it gets really interesting is that even though the model trains at a raw data level, the network only needs to share the aggregate outputs with the supplier for them to get value. And somewhat paradoxically, those aggregate outputs should offer more insight than the retailer has ever had before. It’s very rare to get that kind of trade-off, where you give up granularity and end up with more, not less. I’ve written before about how the quality of what goes in determines the quality of what comes out, and this is the same principle running in a collaborative setting.
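To make that split between raw-level training and aggregate-level sharing concrete, here is a minimal sketch. It is an illustration under loud assumptions: plain k-means stands in for whatever deep learning model would really do the discovery, the loyalty rows are synthetic, and the field names and the MIN_COHORT_SIZE threshold are hypothetical.

```python
import random
from statistics import mean

MIN_COHORT_SIZE = 50  # hypothetical threshold; the real value is a policy decision


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means, used here only as a stand-in for the real discovery model."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        assignments = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members; leave it alone if empty.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return assignments


def aggregate_report(points, assignments, min_size=MIN_COHORT_SIZE):
    """Return cluster-level aggregates only, dropping clusters below min_size."""
    report = []
    for c in sorted(set(assignments)):
        members = [p for p, a in zip(points, assignments) if a == c]
        if len(members) < min_size:
            continue  # too small to share safely
        report.append({
            "cluster": c,
            "size": len(members),
            "avg_basket_value": round(mean(m[0] for m in members), 2),
            "avg_visits_per_month": round(mean(m[1] for m in members), 2),
        })
    return report


# Synthetic loyalty rows: (basket value, visits per month). Not real data.
rng = random.Random(42)
rows = (
    [(rng.gauss(20, 2), rng.gauss(8, 1)) for _ in range(200)]
    + [(rng.gauss(90, 5), rng.gauss(1.5, 0.3)) for _ in range(150)]
    + [(rng.gauss(300, 5), rng.gauss(12, 1)) for _ in range(5)]
)
labels = kmeans(rows, k=3)        # runs on raw rows, inside the retailer's boundary
shared = aggregate_report(rows, labels)  # the only thing the supplier would see
```

The design point is that `labels` and `rows` never leave the retailer. The supplier only receives `shared`, and any cluster below the threshold is left out of it altogether.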

From insights to activation

So far I’ve only been talking about insights and analytics clean rooms. The 90% of deployments I mentioned at the start. But people want to take action. Activation has always been the great aspiration for clean rooms, and it accounts for an increasing share of the actual activity.

The catch with activation and clean rooms is that it’s never really been very clean. We’d do all this double-blind joining and secure computation, and then at the very end we’d need to output a list of people we wanted to target and send that list to an endpoint. That’s the moment you simply can’t avoid the legal conversation, the privacy officer, and the concerns about moving from aggregate to customer level. All the careful privacy engineering upstream, undone by an export at the finish line.

But in parallel with the rise of deep learning and GPU compute, a number of AI-adjacent initiatives are reshaping that final step. Work from parties like the IAB Tech Lab is beginning to centre on not sending IDs at all, but instead representing people as vector embeddings. It’s worth unpacking why that matters.

An ID, whether it’s a hashed email, a MAID, or some other key, is a direct pointer to one specific person. Its whole job is to be matchable. Hand the same key to two parties and they can deterministically link their records and rebuild a shared view of an individual. That’s precisely the property privacy teams lose sleep over… it’s persistent, it’s re-identifiable, and it travels.
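A tiny sketch of why that matchability is the whole point. The records, field names and email address here are made up, and SHA-256 over a normalised email is just one common way of building a shared key, not a claim about any particular platform.

```python
import hashlib


def hashed_email(email: str) -> str:
    """Normalise, then SHA-256 an email address to produce a shared join key."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


# Hypothetical records held by two separate parties.
retailer = {hashed_email("Ana@example.com"): {"loyalty_tier": "gold"}}
brand = {hashed_email(" ana@example.com "): {"last_purchase": "trainers"}}

# The same key on both sides gives an exact, repeatable link between the records.
linked = {k: {**retailer[k], **brand[k]} for k in retailer.keys() & brand.keys()}
```

Neither side ever sees the raw email, yet the two records snap together into one view of one person, every time.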

A vector embedding behaves differently. Rather than carrying who someone is, it carries what they’re like… a long list of numbers describing where a person, or a group of people, sits in a learned feature space. Two people with similar behaviours land near each other. The embedding holds the signal that’s actually useful for modelling and targeting, while leaving behind the identifier that makes re-identification trivial.

A few things fall out of that:

There’s no clean join key. To “match” on an embedding you’re hunting for nearest neighbours in vector space, not running an exact key lookup. The match becomes probabilistic rather than deterministic, which is a much harder starting point for anyone trying to single an individual out.
A well-constructed embedding is hard to run backwards. You can’t straightforwardly invert it to recover the raw inputs or the underlying identity, especially once it’s been compressed and had a little noise applied.
You can stay at the aggregate. Instead of a row per person, you can share embeddings that represent clusters or cohorts, with minimum-size thresholds around them, so nothing you pass over resolves to one human being.
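
As a rough illustration of the first and third points, here is a sketch of matching by similarity rather than by key, and of rolling individuals up into a cohort vector. The three-dimensional vectors, the cohort names and the min_size of 3 are toy assumptions; real embeddings are far wider and come from a trained model.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest_neighbour(query, candidates):
    """Return (label, similarity) for the closest candidate: a score, not an exact hit."""
    return max(
        ((label, cosine_similarity(query, vec)) for label, vec in candidates.items()),
        key=lambda pair: pair[1],
    )


def cohort_embedding(vectors, min_size=3):
    """Average member vectors into one cohort vector; refuse cohorts below min_size."""
    if len(vectors) < min_size:
        return None
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


# Toy cohort-level embeddings on the partner side.
partner_side = {
    "cohort_a": [0.9, 0.1, 0.0],
    "cohort_b": [0.1, 0.8, 0.3],
}
label, score = nearest_neighbour([0.8, 0.2, 0.1], partner_side)
```

The query lands closest to `cohort_a`, but with a similarity score below 1 rather than a yes-or-no key match, which is the probabilistic starting point described above.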

Worth noting that embeddings aren’t magic. Membership-inference and inversion attacks are an active area of research, so the privacy you genuinely get depends on how the embedding is built, how aggregated it is, and what thresholds you wrap around it. Done carelessly, an embedding can leak more than people assume. Built properly, it changes the shape of the whole problem.

What about accuracy?

Embeddings do cost you some accuracy, so let’s be honest about it. A deterministic ID match is exact… same person, every time, and you know it. A probabilistic match in vector space gives you confidence that two records are similar without certainty they’re the same individual, and if you layer differential privacy or noise on top you concede a little more again.

In the contexts where clean rooms actually earn their keep, though, that cost barely registers. Retail media, and broadly any reach-driven media, is a game of scale and direction rather than surgical one-to-one precision. You’re pointing spend at the right kinds of people in the right kinds of moments, and a probabilistic match does that perfectly well.

An example of a place you’d genuinely want pinpoint accuracy is something like suppression, where being wrong about a single person carries a real cost. But suppression never needed a clean room in the first place. That’s just your own first-party data being pushed to an API endpoint (or increasingly an MCP). Activation today is more and more commoditised, to the point where even Databricks now has Reverse ETL natively in the warehouse. So the use cases that demand precision will be solved elsewhere, and many of the use cases a clean room is genuinely for were not necessarily the ones that needed that precision.

Which brings us back to activation, the thing clean rooms have always aspired to and never quite done cleanly. Rather than shipping a list of 50,000 people to an endpoint, the moment that always pulls the privacy officer into the room, you can describe your target as a region of that embedding space… a centroid, and a radius around it. The activation side resolves which of their own users fall inside that region, and the individual-level list never has to leave. Pull the radius tight to the centroid and you get a smaller, higher-precision audience. Loosen it and you trade some of that precision for reach. The distance becomes a dial the buyer controls, instead of a fixed list someone had to export and hand over.
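
Here is a minimal sketch of that dial, assuming two-dimensional toy embeddings and Euclidean distance. The user IDs, vectors and radii are all hypothetical, and a real system would pick its own distance metric.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def resolve_audience(own_users, centroid, radius):
    """Run by the activation side against its own users; no list leaves this function's owner."""
    return sorted(uid for uid, vec in own_users.items() if euclidean(vec, centroid) <= radius)


# Hypothetical embeddings held by the activation partner, never sent to the buyer.
own_users = {
    "u1": [0.10, 0.10],
    "u2": [0.20, 0.15],
    "u3": [0.60, 0.50],
    "u4": [0.95, 0.90],
}

# All the buyer hands over: a region of the space, not a list of people.
target = {"centroid": [0.15, 0.10], "radius": 0.1}

tight = resolve_audience(own_users, target["centroid"], target["radius"])
loose = resolve_audience(own_users, target["centroid"], radius=0.7)
```

With the tight radius only the two users nearest the centroid qualify. Widening it to 0.7 pulls in a third, and the buyer has changed nothing but a number.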

The concept is changing under our feet

So the idea of a clean room is fundamentally changing to adapt to an AI-first world. It’s obviously not the only piece of the advertising and marketing stack going through this, but it’s one I feel especially close to, with an intrinsic understanding of how it actually worked and where the bodies were buried.

I’m tracking initiatives like agentic audiences, confidential GPU compute, and plenty more very keenly. The pace of change shows no sign of abating, and for once that feels like good news for the people who were always a bit frustrated by what clean rooms promised versus what they delivered.