The boxes nobody opens

In 2024 we spent over three weeks clearing out my father's house.

My father was, at the end of his life, a hoarder. Every cupboard was full. Every drawer was full. The cellar was so full that we didn’t even start emptying it, leaving it all to a junk dealer. Most of what we unpacked had not been touched in years, in some cases in over a decade. There were boxes inside other boxes, art with no remaining value and an extraordinary number of plastic bags filled with smaller plastic bags. Some of the things we unwrapped had clearly been valuable once, the kind of items that would have been bought with care and stored with intent. By the time we found them, that care and that intent had no living context. There was nobody left who remembered why they had been kept.

What surprised me was not the quantity. It was the cost of removing it. The junk dealer cost money. The weeks of labour cost us all energy we had not really set aside, while the objects that once had value had quietly become a liability. Not a small liability either. Cleaning out a life takes more than three weeks, and more than a lump sum.

On one of our drives home, I turned this over for a long time, and I kept arriving at the same uneasy thought: we do exactly the same thing with data.

A pattern at every scale

The pattern is not specific to bedrooms and cellars. It plays out wherever someone has the option to keep more than they can use. Individuals do it with health information, kept for the doctor visit that never quite happens. Public administrations do it with citizen records on the assumption that someone, eventually, will know what to do with them. And businesses do it with everything they can capture about a current or potential customer, because the marginal cost of one more row in the warehouse is, on a clean spreadsheet, almost zero.

I want to be careful here. I am not arguing that storage is unethical, or that data minimization is a moral position. I am arguing something narrower. The arms race to capture and retain raw consumer data, the one most marketing organizations have been engaged in since at least the early days of digital advertising, is producing diminishing returns. And the costs are no longer just storage.

It is worth being precise about what an organization actually needs at the moment a decision is taken. At runtime, when a campaign is launched, a quote is generated, a recommendation is made, the relevant artefact is rarely the full historical archive. It is a current score, a current segment, a current insight, the output of a model that has already been built. The training of such a model, of course, is a different moment. To build the model, someone has to handle the underlying data. That work cannot be wished away, and any honest discussion of data has to keep training and inference apart. But the runtime moment, which is where most decisions actually happen, is far less data-hungry than most data lakes suggest.

The four quiet costs

What is hoarded accumulates four costs that compound over time.

The first is compliance. Each new dataset increases the surface area against which a regulator could ask a question, and each new field expands the catalogue that has to be defended. This is not hypothetical. Anyone who has lived through a data subject access request, or a regulator's information notice, knows that the cost is in the response, not in the storage.

The second is breach surface. The more rows of data you hold, the larger the prize for a malicious actor, and the more painful the disclosure if something goes wrong. There is no version of this calculation in which holding more data reduces breach exposure. There is only the question of whether the value of the data still exceeds the rising risk of holding it.

The third is governance overhead. Every dataset you choose to keep is a dataset that someone, somewhere, has to classify, document, refresh, deprecate, and answer questions about. The overhead does not show up cleanly in any cost centre. It shows up in the calendars of the people who would otherwise be doing something more productive.

The fourth, the one I want to spend the most time on, is staleness. This is the cost most companies underestimate, because it is silent. Data does not announce when it stops describing the world. It simply continues to sit in the data warehouse, looking exactly the same as it did when it was current, while the reality it represents drifts away.

Three short examples

Let me try to make this concrete.

Imagine a health insurer working from segmentations that were carefully built five years ago. The segments captured aging patterns, care utilization, supplemental coverage choices. Five years on, demographic distributions have shifted, care delivery has moved partly online, and supplemental insurance behavior has reorganized around new products. The segmentation still exists, still gets used to design prevention programs, and still informs risk models. The team that built it has moved on, and the new team is busy with new priorities. The numbers come out clean. They are simply pointed at a country that has quietly become a different country.

Imagine a B2B sales organization that spends a meaningful share of its marketing budget on keeping prospect data fresh. Names, roles, organizational changes, recent initiatives, signs of buying intent. Every competitor in the same space buys broadly the same enrichment, often from broadly the same providers. The data is always one or two steps behind reality. Job changes happen faster than the refresh cycle. Reorganizations happen in private long before they appear in public databases. The whole market is paying, redundantly, for an approximation of the truth that the prospect already knows about themselves. I will return to this case in a moment, because it complicates the easy story.

Imagine a retailer with a rich customer model that has been performing well for years. Then a market shifts. A category that used to be defined by brand loyalty becomes defined by price sensitivity, or vice versa. The model continues to score customers as if the shift had not happened, because the patterns it learned are still the patterns it sees, until enough new behavior has accumulated to overwhelm the old. By the time the model retrains, the retailer has spent two quarters speaking the wrong language to its own customers.

Each of these is, in the end, a story about an organization that mistook its archive for its understanding.

The deeper move

This is where I want to make the move that, I think, the conventional discussion of data hoarding misses.

The real problem is not that organizations keep too much data. The real problem is that they keep too many insights past their expiry date. They carry forward conclusions, segmentations, propensity scores, strategic assumptions, narrative frames, and they continue to act on those even when the world underneath has shifted. The boxes in my father's house were not, in the end, a metaphor for raw data. They were a metaphor for crystallized interpretations of a market, a taste, a moment, kept long after the market, the taste, and the moment had moved on.

This reframing matters because it changes what an organization should optimize for. If the problem is data, the answer is retention policy. If the problem is insight, the answer is freshness. The discipline you need to build is not the discipline of throwing away. It is the discipline of regularly asking whether what you currently believe about the world still matches the world.

The answer that already exists

I should be clear that the technical answer to this problem is not new.

For more than a decade, researchers and practitioners have been describing approaches that allow data to remain where it sits, while only insights travel. Federated learning, secure multi-party computation, trusted execution environments, differential privacy, the broader thinking around data fabrics and data meshes. None of this is exotic anymore. Apple uses federated learning for some of its on-device behavior. Healthcare consortia use it for cross-institution model training. There are working systems, working products, and a working literature. Anyone who has been paying attention will have heard of all of this, and rightly so.
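To make "data stays, insights travel" a little more tangible, here is a minimal sketch of federated averaging in Python. It is an illustration under heavy assumptions, not a description of how Apple or any healthcare consortium actually does it: the Site class, the one-parameter model y = w * x, the learning rate and the round counts are all invented for the example. The only thing that leaves a site is a fitted weight and a sample count.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """A hypothetical data holder. xs and ys never leave this object."""
    xs: list[float]
    ys: list[float]

    def local_update(self, w: float, lr: float = 0.01, steps: int = 20) -> tuple[float, int]:
        """Run a few gradient steps on local data for the model y = w * x.

        Returns only the updated weight and the local sample count.
        """
        n = len(self.xs)
        for _ in range(steps):
            # Gradient of the mean squared error with respect to w.
            grad = sum(2 * (w * x - y) * x for x, y in zip(self.xs, self.ys)) / n
            w -= lr * grad
        return w, n


def federated_round(sites: list[Site], w: float) -> float:
    """One round: each site trains locally, the coordinator averages the weights."""
    updates = [site.local_update(w) for site in sites]
    total = sum(n for _, n in updates)
    # Weight each site's result by how many samples it trained on.
    return sum(w_i * n for w_i, n in updates) / total


def train(sites: list[Site], rounds: int = 30, w: float = 0.0) -> float:
    """Repeat federated rounds starting from an initial weight."""
    for _ in range(rounds):
        w = federated_round(sites, w)
    return w


if __name__ == "__main__":
    # Two sites whose private data both follow y = 3 * x.
    sites = [Site([1.0, 2.0, 3.0], [3.0, 6.0, 9.0]), Site([2.0, 4.0], [6.0, 12.0])]
    print(train(sites))
```

The coordinator in this sketch never sees a single row. It sees two numbers per site per round, which is the whole point, and also, as the objections further down make clear, not the end of the story.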

So the interesting question is not whether the conceptual answer exists. The interesting question is why the conceptual answer has not become the default.

I think there are two reasons, and they reinforce each other.

The first is strategic. For two decades, raw consumer data has been treated as a moat. This was a defensible position. The companies that captured the most behavioral data could build the best models, and the best models created compounding advantages over rivals who were still trying to build their first lakehouse. The mistake is to assume that this will continue indefinitely. The competitive frontier is, slowly, moving. It is moving from "who holds the most" to "who has the best mechanism to keep insights fresh without owning the data". Collective intelligence, well-governed, beats private archives on model quality, refresh rate, and the cross-domain patterns that any single organization can simply never see on its own. Hoarders are still right that data is a moat, for now. They may be wrong that it will continue to be the right one.

The second is cultural. Trust is shifting. Consumers, citizens, and patients are not nostalgic for the era of unlimited collection, and the next generation of users will be even less so. An organization that can credibly say "we do not hold your data, we ask permitted questions of it where it sits" will earn a kind of permission that hoarders have been steadily losing. This is not a marketing claim, although it can certainly be turned into one. It is a posture, and postures eventually get tested. The organizations that have already started designing for it will have an answer when the test comes.

Honest objections

I want to take three objections head-on, because the easy version of this argument is the one that does not survive contact with anyone who actually works in the field.

The first objection is legal. An "insight" is not automatically outside the scope of the General Data Protection Regulation. If an insight relates to an identified or identifiable individual, for example because it singles that person out, it remains personal data, and the obligations that come with personal data still apply. So the move from data hoarding to insight hoarding is not, by itself, a compliance escape. The reduction has to be in identifiability as well as in volume.

The second objection is technical. Federated systems are not magic. They are vulnerable to a known class of attacks: model inversion, membership inference, and repeated queries that quietly exhaust the privacy budget. These attacks do not respect good intentions. What protects a federated system against them is governance: who can ask what, how often, with which guarantees on the output, with what audit trail. The honest version of the federated proposition includes the governance, not just the architecture.
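A rough sketch of what "who can ask what, how often, with what audit trail" could look like in code, again in Python. Everything here is hypothetical: the QueryGate name, the allow-list, and the assumption that privacy cost simply adds up as a single epsilon figure per question. It only does the bookkeeping. It does not add noise or answer anything, and a real deployment would need a proper differential privacy mechanism behind it.

```python
from dataclasses import dataclass, field


@dataclass
class QueryGate:
    """Hypothetical gatekeeper in front of a data source that stays where it is."""
    total_epsilon: float            # overall privacy budget (assumed additive)
    allowed_askers: set[str]        # who may ask questions at all
    spent: float = 0.0              # budget consumed so far
    audit_log: list[tuple[str, str, float, bool]] = field(default_factory=list)

    def request(self, asker: str, question: str, epsilon: float) -> bool:
        """Decide whether a question may be asked, and record the decision."""
        permitted = (
            asker in self.allowed_askers
            and epsilon > 0
            and self.spent + epsilon <= self.total_epsilon
        )
        if permitted:
            self.spent += epsilon
        # Refusals are logged too: the audit trail covers every attempt.
        self.audit_log.append((asker, question, epsilon, permitted))
        return permitted


if __name__ == "__main__":
    gate = QueryGate(total_epsilon=1.0, allowed_askers={"analytics"})
    for asker, question, eps in [
        ("analytics", "average basket size", 0.5),
        ("analytics", "churn rate by region", 0.5),
        ("analytics", "one more small question", 0.25),
        ("stranger", "anything at all", 0.25),
    ]:
        print(asker, question, gate.request(asker, question, eps))
```

The design choice worth noticing is that the refusal is as much a product of the system as the answer. A budget that cannot run out is not a budget.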

The third objection is internal to the argument I am making. Insights also age. A propensity score that is six months old is, in some retail contexts, already stale. A segment built last year is already drifting. So replacing data hoarding with insight hoarding does not solve the problem, it relocates it. The real value, the part that is actually durable, is not any specific insight. It is the mechanism that keeps insights aligned with a moving world. That is the asset worth building.
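If insights have an expiry date, the simplest mechanism I can imagine is one that writes that date down. The following Python sketch assumes, purely for illustration, that every insight carries the day it was computed and a maximum age someone has agreed on. The Insight class, the due_for_review helper and the example ages are my inventions, not an established standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Insight:
    """A hypothetical record of something the organization believes."""
    name: str
    computed_on: date       # when the belief was last checked against data
    max_age: timedelta      # how long it may be trusted without a re-check

    def is_stale(self, today: date) -> bool:
        """True once the insight is older than its agreed maximum age."""
        return today - self.computed_on > self.max_age


def due_for_review(insights: list[Insight], today: date) -> list[str]:
    """Names of all insights that have outlived their maximum age."""
    return [insight.name for insight in insights if insight.is_stale(today)]


if __name__ == "__main__":
    portfolio = [
        Insight("propensity_score", date(2024, 1, 1), timedelta(days=180)),
        Insight("customer_segment", date(2024, 6, 1), timedelta(days=365)),
    ]
    print(due_for_review(portfolio, today=date(2024, 9, 1)))
```

A calendar check like this says nothing about whether the world has actually moved. It only guarantees that the question gets asked, which is a lower bar, but one most archives do not clear.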

There is a fourth tension I want to name explicitly, even if I cannot resolve it here. The federated proposition works most naturally where there is already a relationship between the questioner and the source. A bank and its customers, a hospital and its patients, an administration and its citizens. There are existing trust frames, contractual or institutional, that the federated approach can extend rather than invent. B2B prospecting, the second of my three vignettes above, has no such pre-existing relationship. The prospect has not consented to anything. That is precisely why so much of the prospecting market today is built on scraping rather than on asking. I do not think the federated story collapses there, but it does have to do more work, and probably has to be paired with first-party intent signalling that the prospect controls. I flag this not to solve it, but to keep the argument honest. The new posture is not a universal solvent.

Back to the boxes

I want to end where I started.

The point of clearing my father's house was not, in the end, that he should have kept less. He probably had reasons, often good ones, sometimes private, for keeping what he kept. The point was that the assumption underneath the keeping had not held up. Holding more had not turned into knowing more. Knowing more had not turned into using more. By the time we opened the boxes, the inheritance was not the contents. The inheritance was the work of sorting them.

I think most organizations, including the ones I have worked with and the ones I will work with next, are in some version of this position. Their data lakes are full. Their dashboards are detailed. Their models exist. They have, in good faith, kept what they thought would be valuable, and they are surprised when the upkeep turns out to be the dominant cost. The temptation is to respond with another layer of retention policy, another committee, another minimization initiative. The deeper move is to ask whether the things they currently believe about their customers, their citizens, their markets, are still things that match the world.

What if the asset worth building is not the warehouse, and not the model that came out of the last warehouse, but the discipline of asking that question on a schedule that cannot be quietly skipped?

Think about it.

This article appeared first on the blog Exploring The Black Box.