From “Data Is the New Oil” to “Insight Is the New Alpha”
Generative AI has upended a decade of received wisdom about the strategic value of raw data. What mattered pre-GPT (sheer volume, elaborate ETL pipelines, and lightly tuned scikit-learn models) now looks like table stakes. This post explains why, defines two emerging market segments, and offers a roadmap for founders deciding which side of the line they occupy.
1. Three Shifts That Erode the Worth of Raw Data
- Synthetic abundance. Frontier models can fabricate vast amounts of statistically coherent text, code, and images—on demand. Owning piles of similar real-world data is no longer a moat.
- Instant pipelines. Copilot-style agents now write end-to-end ingestion and transformation code from a single prompt. The friction of “mobilising” data has collapsed.
- Commoditised models. Anything that fits in the scikit-learn zoo—or its neural equivalents—is trivial to reproduce. Benchmark accuracy alone cannot justify high margins.
2. Population Data vs. Refined Data
The landscape now cleaves into two camps:
- Population Data companies. Aggregate or resell broad, generically useful datasets (e.g. generic buying journeys without intel on specific people, companies, events, or buying triggers).
- Refined Data companies. Capture narrow, proprietary, behaviour-level signals: insights a public model could not plausibly know with confidence (e.g. “VP Finance at Company X kills any deal over $250k”).
3. Why Population Data Firms Face Rapid Commoditisation
- No scarcity. If a foundation model can generate a statistically equivalent dataset, your competitive moat evaporates.
- Price compression. Buyers will compare your license fee to the near-zero marginal cost of synthetic substitutes.
- Identity crisis. Many “AI platforms” were, in truth, data brokers with a thin model veneer.
4. The Bright Future for Refined Data Providers
Companies with authentic, granular, often relationship-level intelligence are suddenly in high demand. Their advantages are:
- Non-replicability. No public corpus contains your customer-specific or workflow-specific edge cases.
- Complementarity. Refined signals amplify foundation models; they are not displaced by them.
5. A New Litmus Test for “AI” Start-ups
Ask one question: “Could a frontier model generate a functionally identical substitute in one weekend?”
If the answer is “yes,” you are in the Population Data business—expect margin pressure. If “no,” congratulations: you own Refined Data.
6. Outlook
Foundation models have not made data worthless; they have made undifferentiated data worthless. The next wave of outsized returns will accrue to firms that treat insight as alpha, protect it behind defensible interfaces, and feed it back into a virtuous cycle of increasingly personalised intelligence.