Demajh, Inc.

AI Coders Won’t Save You: Value Generation By Data Science in the Era of AI

Most CTOs want to make AI work, and most like the idea of AGI, but the offerings currently available for integrating AI are very limited and arguably wrong.

AI Coders by Themselves Don't Generate Value

Autonomous AI coders don't work. Even if they did, they would raise a question: why have a software company at all in an era when all you have to do is point a commercial AI at a problem? There is no value generation for the company if that is the development process.

A fully autonomous AI coder would effectively reduce software development to "idea generation", but even a brilliant idea generator is not, by itself, a source of value. An idea is worth only the money you can make before one of your competitors notices what you're doing.

Where value used to come from

Data Science has been one of the ways an engineering team at a software company could generate value over the last ten years. But that is changing, because the value of the data Data Science mobilizes is changing too.

Here's a test:

Ask ChatGPT to replicate your dataset. Can it? If so, your data no longer has any value (unless it is absolutely enormous and would require an absurd number of AI queries to replicate). Does it miss some parts? Those are the parts that still have value and are worth modeling by Data Science.

Build on commercial AI—don’t fight it

The models Data Science should be building in the era of AI are the ones that build on answers provided by commercial AI. Consider the example sales call transcript between an AE at SFDC and a Sales Director at an IT company:

Download the Sales Call Transcript (PDF)

Source: internal example transcript.

Have GPT-5 walk through this document and extract and label every fact that, were this conversation real, it would not have been able to guess. You get a much more succinct list:

GTM team composition: 28 AEs and 12 SDRs.
Average deal size: ~$85,000 ARR.
ICP: IT directors/CIOs at companies with 200–1,500 employees.
Primary KPIs this half: forecast accuracy and win rate.
Current forecast variance: ±30%.
Current win rate: ~19%.
Average sales cycle length: 63 days (longer for security deals).
Weekly cadence: Monday pipeline inspection (messy due to late CRM updates).
Custom pricing approvals add multiple days of delay.
Quoting done in Word/PDF documents.
Forecasting maintained in spreadsheets that drift from CRM data.
Current stack: Salesforce Enterprise, Outlook, Gong, Slack; PandaDoc for signatures; Snowflake as data lake; Azure AD for SSO; no CPQ.
Buying committee: VP Sales (economic buyer), RevOps, IT (integrations), Finance (ROI); procurement is straightforward once numbers align.
Timeline goal: pilot in ~6 weeks; broader rollout in Q4.
E-signature preference: keep PandaDoc initially.
ROI assumption: 28 AEs save ~45 minutes/day via auto-capture and less spreadsheet reconciliation.
ROI assumption: loaded hourly cost estimated at ~$100.
ROI target: improve forecast variance from ±30% to ±10–15%.
ROI target: increase win rate by +2–4 percentage points.
Annual pipeline size used for ROI modeling: ~$30 million.
Stakeholder reaction: ROI assumptions deemed reasonable by Jordan.
Change plan: 2-week enablement with champions, office hours, and in-app prompts.
Governance: adoption scorecard co-owned with RevOps.
Adoption metrics: activity coverage, opportunity field completeness, and time-to-update after key meetings.
CPQ phase-one scope: minimal catalog, standard discounts, single approval path.
Security/IT requirement: Azure AD SSO is required (non-negotiable).
Signature tool decision: keep PandaDoc for now.
Decision criteria: ≤±15% forecast variance, measurable rep time saved, fewer approval delays.
Decision process: Pilot → validation → security/finance review → order form.
Champion: Jordan, with RevOps as co-sponsor.
User count for initial scope: ~40 users (AEs + SDR managers).
Pilot implementation duration target: 6–8 weeks.
Commercial ballpark: low six figures ARR for analytics/engagement add-ons.
SI services ballpark: mid–high five-figure one-time cost for pilot.
Commercial next step: send a formal estimate after scope is locked.
Immediate action: 60-minute working session this week with Jordan + RevOps to define three Monday dashboard views, hygiene metrics, and two approval bottlenecks to remove.
Sandbox step: enable Einstein Activity Capture.
Slack step: connect a dedicated Slack pilot channel.
Demo requirement: show Pipeline Inspection/Revenue Intelligence using the company’s existing Salesforce fields.
Pilot cohort: 8–10 reps for a 2-week validation.
Stakeholders to loop in: RevOps and Sales Ops manager.
Finance collateral: one-pager with ROI assumptions and adoption scorecard.

Represent every line as a fact

This representation is much more compressed, and we can derive a structured model from it. Recommended model (works in Postgres, and is portable to a graph database later): represent every line as a fact (subject–predicate–object) with qualifiers.

subject: the entity the fact is about (e.g., org:apex_it).

predicate: the property/relationship (e.g., has_team_composition, uses_tool).

object: typed value or another entity (e.g., {"aes":28,"sdrs":12} or tool:pandadoc).

qualifiers: observed_at, valid_from/valid_to (temporal), confidence, source (transcript + span), created_by, privacy_label.

Minimal tables

-- Postgres sketch
create extension if not exists vector;  -- pgvector, for the embeddings table

-- Entities: orgs, people, tools, and other things facts can be about.
create table entities (
  id uuid primary key default gen_random_uuid(),
  type text,
  name text,
  external_ids jsonb
);

-- Sources: where each fact came from (e.g., a transcript plus a character span).
create table sources (
  id uuid primary key default gen_random_uuid(),
  kind text,
  uri text,
  transcript_id text,
  span int4range,
  checksum text
);

-- Facts: subject–predicate–object statements with temporal and provenance qualifiers.
create table facts (
  id uuid primary key default gen_random_uuid(),
  subject_id uuid references entities(id),
  predicate text,
  object jsonb,
  object_type text,
  confidence numeric,
  source_id uuid references sources(id),
  observed_at timestamptz,
  valid_from timestamptz,
  valid_to timestamptz,
  created_at timestamptz default now()
);

-- Relations: entity-to-entity edges, the part that ports directly to a graph database.
create table relations (
  id uuid primary key default gen_random_uuid(),
  src_id uuid references entities(id),
  rel_type text,
  dst_id uuid references entities(id),
  props jsonb,
  confidence numeric,
  source_id uuid references sources(id),
  observed_at timestamptz,
  valid_from timestamptz,
  valid_to timestamptz
);

-- Embeddings: one vector per entity or fact, for semantic retrieval.
create table embeddings (
  id uuid primary key default gen_random_uuid(),
  owner_type text,
  owner_id uuid,
  vector vector  -- pgvector column type
);

Example fact

{
  "subject": "org:apex_it",
  "predicate": "has_team_composition",
  "object": { "aes": 28, "sdrs": 12 },
  "qualifiers": {
    "observed_at": "2025-08-16T15:00:00Z",
    "source": { "kind": "call", "transcript_id": "call-0816", "span": [1234, 1310] },
    "confidence": 0.95
  }
}
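
To make this concrete, here is a minimal sketch of how that fact could be written into the tables above. The UUIDs are hypothetical placeholders, and the entity name is inferred from the org:apex_it identifier; in practice both would come from your ingestion and entity-resolution steps.

-- Sketch: loading the example fact (UUIDs and entity name are hypothetical)
insert into entities (id, type, name, external_ids)
values ('00000000-0000-0000-0000-000000000001', 'org', 'Apex IT',
        '{"slug": "org:apex_it"}');

insert into sources (id, kind, transcript_id, span)
values ('00000000-0000-0000-0000-000000000002', 'call', 'call-0816',
        int4range(1234, 1310));

insert into facts (subject_id, predicate, object, object_type, confidence,
                   source_id, observed_at)
values ('00000000-0000-0000-0000-000000000001', 'has_team_composition',
        '{"aes": 28, "sdrs": 12}', 'json', 0.95,
        '00000000-0000-0000-0000-000000000002', '2025-08-16T15:00:00Z');

From here, "what do we know about this account, and where did each fact come from?" is a join rather than a prompt.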

Why not just dump text into a vector DB and RAG?

Now the obvious question: why go through this exercise? Why not just dump your unstructured text into a vector DB and query it with RAG when you want to retrieve the information? Or, if there aren't that many transcripts, drop them all into a single prompt and ask questions?

  1. Orgs are paralyzed by the fear of blindly dumping data into commercial LLMs. This leads them to do nothing rather than acting prudently to mitigate the risk of IP theft.
  2. A structured representation is universally queryable by any commercial or open-source LLM, in addition to enabling SQL/noSQL/GraphDB queries.
  3. It lets you answer, “what do we actually own?” What information is actually unique to you versus already baked into widely available AI?

Arguably, #1 and #3 are the most important; #3 in particular can lead to new products, features, or strategic initiatives, or, in the worst case, to a sober reckoning that a business's data is no longer very valuable.
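
To make #2 and #3 concrete, here is one possible query against the schema sketched above (the account name follows the example fact; adjust it to your own entities): everything currently held about a single account, with the transcript span that backs each fact.

-- Sketch: what do we actually own about one account, and where did it come from?
select e.name,
       f.predicate,
       f.object,
       f.confidence,
       s.transcript_id,
       s.span,
       f.observed_at
from facts f
join entities e on e.id = f.subject_id
left join sources s on s.id = f.source_id
where e.name = 'Apex IT'
  and (f.valid_to is null or f.valid_to > now())
order by f.observed_at desc;

The same rows can be handed to any commercial or open-source LLM as context, which is the point of #2: the structured layer is the asset, and the model on top of it is interchangeable.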

Conclusion

Treat frontier LLMs as oracles and your work as building the structured layer that captures the non-guessable facts those models surface. Encode them as typed, time-bounded facts with provenance, and you get three things at once: (1) safer adoption for cautious orgs, (2) multi-modal query paths that include LLMs and databases, and (3) a clear inventory of proprietary signal you can productize. In short: don’t chase “AI coders.” Use AI to reveal what matters, then harden it into durable data assets your competitors can’t trivially replicate.
