AI Strategy
Feb 16, 2025
Strategy for Measuring & Improving AI Products
The moment you plug your product into an LLM, you stop shipping predictable software and start calibrating a probabilistic system. Model upgrades can look like progress while outcomes stay flat, costs rise, and trust erodes in ways SaaS-era metrics can’t see. Your AI product strategy should look like an F1 team’s: horsepower is table stakes; winning comes from setup, telemetry, and constant calibration under changing conditions.
AUTHOR

Arpy Dragffy

LLMs are changing how products are designed, built, and monetized. They make it easier to ship new features, new interfaces, and new tiers. They also make it harder to know whether customers are getting better outcomes, or just getting more novelty.
This strategy gives product leaders a practical advantage: benchmark each release, separate real outcome gains from output glow, catch regressions early, and tune the system toward sustainable value. Done well, it creates compounding returns: fewer blind upgrades, faster learning, clearer competitive positioning, and a tighter trust loop with customers.
1) AI is rewriting how products are designed, built, and monetized
LLMs turn capability into distribution. They convert questions into workflows, and workflows into monetization. Many teams will grow simply by shipping an AI interface. The hard part is proving that growth is tied to durable customer outcomes.
A field study on generative AI in customer support showed productivity gains on average, but with meaningful differences across workers. That variability is the point. Impact is not automatic. [1]
2) Products look simpler to ship, but they are more complex to run
Traditional digital products were bounded. LLM products are probabilistic systems with moving parts that interact in unpredictable ways. The feature is no longer the UI. It is the behavior of a system over time.
Research on hidden technical debt in machine learning systems explains why AI systems become harder to maintain and reason about as they scale. [2]
3) Model improvements can look like progress while outcomes stay unclear
New models often sound better. They handle more. They comply more often. That does not guarantee customers are more successful. It can also mean higher cost and new failure modes.
This is why measurement is treated as a first-class requirement in AI risk management. NIST’s AI RMF centers measurement as a core function of managing AI risk and performance. [3]
Even if you change nothing, the system can still change underneath you. Service LLMs evolve without public changelogs, which makes one-time evaluation brittle and makes continuous benchmarking a prerequisite for reliable improvement. (PLOS ONE, 2026: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0339920)
LLM outputs are also not stable under repeated inference. Recent work shows reliability and safety failure rates can diverge under repeated sampling and temperature changes, even when benchmark-style scores look similar. (arXiv, 2026: https://arxiv.org/abs/2602.11786)
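This divergence can be made concrete with a small offline check. Below is a minimal Python sketch, assuming you have already recorded per-task pass/fail results from repeated sampling of the same prompts; the `repeat_reliability` helper and the toy data are illustrative, not a reference implementation:

```python
from statistics import mean

def repeat_reliability(trials):
    """Given a list of per-task lists of pass/fail booleans from
    repeated sampling, report three rates that can diverge:
    - mean pass rate (what a benchmark-style score resembles)
    - pass@k: fraction of tasks solved at least once
    - pass-all-k: fraction of tasks solved on every attempt (reliability)
    """
    per_task = [(mean(t), any(t), all(t)) for t in trials]
    n = len(per_task)
    return {
        "mean_pass_rate": sum(p for p, _, _ in per_task) / n,
        "pass_at_k": sum(a for _, a, _ in per_task) / n,
        "pass_all_k": sum(c for _, _, c in per_task) / n,
    }

# Two toy systems with the same mean pass rate but different reliability:
flaky  = [[True, False, True, False]] * 10   # every task succeeds half the time
stable = [[True] * 4] * 5 + [[False] * 4] * 5  # half the tasks always succeed
print(repeat_reliability(flaky))   # pass_at_k high, pass_all_k zero
print(repeat_reliability(stable))  # pass_at_k equals pass_all_k
```

The point of the toy data: both systems score 50% on a single-sample benchmark, but one is never reliable on any task. Only repeated sampling exposes the difference.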
4) Traditional products had predictable flows; LLM products create near-infinite paths
In classic products, breakdowns were observable because paths were finite. In LLM chat, users take countless routes: partial context, shifting constraints, mid-task pivots, contradictory instructions. The system must succeed across a broad distribution, not a narrow funnel.
When you widen the distribution of user paths, adoption can rise while inconsistency complaints persist because the telemetry is not capturing the variance. (See Point 6 for why this is structurally hard to observe.)
Multi-turn performance is measurably worse than single-turn in large-scale simulations, and the degradation is driven largely by increased unreliability. When models take a wrong turn early, they often do not recover. (arXiv, 2025: https://arxiv.org/abs/2505.06120)
5) Evals are necessary, but they do not certify real-world success
AI evals can validate required outputs and golden queries. They are good for regression detection. They are not a guarantee that real users across real paths will succeed.
Even LLM-as-judge evaluation can be biased. Position bias has been documented and can distort results if teams treat judge scores as ground truth. [4]
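One cheap guard against position bias is a swap audit: show the judge each answer pair in both orders and count how often the verdict flips when only the order changes. A minimal sketch; `judge`, the verdict labels, and the biased mock are hypothetical stand-ins for a real judge-model call:

```python
def position_swap_audit(judge, pairs):
    """Run a judge on each (a, b) pair in both presentation orders
    and return the fraction of verdicts that flip with the order.
    `judge(x, y)` returns "first" or "second" (which answer wins).
    A position-consistent judge picks the same underlying answer
    regardless of which slot it appears in."""
    flips = 0
    for a, b in pairs:
        v1 = judge(a, b)  # a shown first
        v2 = judge(b, a)  # b shown first
        winner1 = a if v1 == "first" else b
        winner2 = b if v2 == "first" else a
        if winner1 != winner2:
            flips += 1
    return flips / len(pairs)

# Mock judge with a hard first-position bias (always picks slot 1):
biased_judge = lambda x, y: "first"
print(position_swap_audit(biased_judge, [("answer_a", "answer_b")] * 20))  # 1.0
```

A nonzero flip rate is a ceiling on how much you can trust the judge's scores; teams that skip this audit inherit the bias invisibly.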
6) Teams are flying blind on what changed, what improved, and what got worse
Most teams can measure adoption, monetization, and a few support signals. They cannot reliably answer, release over release: Did outcomes improve? Where does the system fail today? Did the upgrade increase cost without increasing success? How do we compare to competitors?
A randomized trial found experienced developers were slower with AI tools while believing they were faster. Perception diverged from outcomes. That is the risk when you only track usage and sentiment. [5]
A common pattern: a model upgrade raises completion rates and lowers complaints, yet verification behavior spikes and repeat delegation drops. The product looks healthier, but customers are working harder to trust it, and long-term value erodes quietly.
7) Reliability is a product constraint, not an edge case
Fluency is not correctness. Completed is not success. Confident wrongness and omission are business risks in document processing, tax, compliance, support, and decision workflows.
A comprehensive ACM survey frames hallucination as a central reliability challenge for LLMs and reviews mitigation approaches because it directly impacts trustworthiness. [6]
This shows up in the wild. A 2025 Scientific Reports analysis of AI mobile app reviews estimated a measurable prevalence of user-reported hallucination indicators and cataloged common types like factual incorrectness and fabricated information. (Scientific Reports, 2025: https://www.nature.com/articles/s41598-025-15416-8)
8) Cost can climb faster than value if you only optimize for capability
Teams chase power and ship bigger models, longer contexts, more tool calls, and more retries. Spend rises while outcomes remain flat. Token burn becomes a margin leak.
OWASP highlights unbounded consumption and denial-of-service style risks as key categories for LLM applications. This is the “it works, but it’s too expensive” failure mode. [7]
9) Security and data exposure risks become universal once you connect to LLMs
If your product ingests user text, documents, tickets, emails, or web content, you inherit new attack surfaces. Prompt injection and sensitive information disclosure are systemic risk categories.
Reprompt is a clear illustration of how a system can appear normal while enabling data exfiltration through indirect prompt injection against an LLM assistant. [8]
10) The strategy: calibrate like an F1 team using a 4-pillar bullseye
You are no longer just shipping digital products. You are calibrating complex systems. That requires a measurement framework that turns performance into something steerable.
The Bullseye:
- Power: capability under real constraints
- Speed: time-to-value and time-to-confidence
- Impact: outcomes that move and stay moved
- Joy: confidence, clarity, control, willingness to delegate again
Operating stance:
- Benchmark every release against prior versions, live variants, and competitors.
- Treat every model swap, prompt change, retrieval tweak, or orchestration update as a calibration event.
- Track cost-per-successful-outcome, not cost-per-request.
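Tracking cost-per-successful-outcome is a small change to most telemetry. A minimal sketch, assuming per-request cost and a success flag are already logged; the helper name and the toy before/after numbers are illustrative:

```python
def cost_per_successful_outcome(requests):
    """requests: list of (cost_usd, succeeded) tuples.
    Cost-per-request hides failed work; dividing total spend by
    successful outcomes surfaces it."""
    total_cost = sum(c for c, _ in requests)
    successes = sum(1 for _, ok in requests if ok)
    return {
        "cost_per_request": total_cost / len(requests),
        "cost_per_success": total_cost / successes if successes else float("inf"),
    }

# A model upgrade that raises per-call cost can still lower true cost
# if it fixes failures:
before = [(0.010, True)] * 50 + [(0.010, False)] * 50  # half the calls fail
after  = [(0.015, True)] * 100                          # pricier, all succeed
print(cost_per_successful_outcome(before))  # cost_per_success ~0.020
print(cost_per_successful_outcome(after))   # cost_per_success ~0.015
```

In the toy data, cost-per-request rises 50% while cost-per-success falls 25%: the metric you pick determines whether this upgrade looks like a margin leak or a win.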
LLM outcomes are not stable by default. Studies show outcomes can shift materially with small prompt variations and decoding settings such as temperature. Continuous benchmarking and calibration are required, not one-time evaluation. [9][10]
Release contract: every change should declare and measure a delta in (1) outcomes and durability, (2) confidence and controllability, (3) cost-per-successful-outcome, and (4) risk exposure. If you cannot show the deltas, you are shipping uncertainty.
Risks if product teams don’t change:
Power-only optimization burns money: Bigger models and more autonomy can drive unbounded consumption patterns and tool-call cascades while outcomes stay flat. [7]
Autonomy creates externalities: ClaudeBot’s crawling controversy shows how automated systems can generate operational fallout when behavior is not governed and constrained. [11]
Agent hype outpaces controls: Moltbot illustrates how powerful automation can create privacy, security, and trust issues when outcomes and controls are not instrumented. [12]
Flying blind turns uncertainty into business events: Liability and reputational damage show up when customers act on incorrect outputs and the company is held accountable. [13]
PH1’s vision
PH1 exists to help product leaders build AI products that customers trust and businesses can sustain (ph1.ca). This is also why we started the Product Impact Podcast: to push the industry toward a higher standard where teams can prove outcomes, not just ship capability.
If you want help installing this measurement discipline, PH1 can help.
Sources
[1] Brynjolfsson, Li, Raymond. NBER Working Paper 31161, “Generative AI at Work” (2023)
[2] Sculley et al. “Hidden Technical Debt in Machine Learning Systems” (NeurIPS 2015)
[4] Study on position bias in LLM-as-a-judge evaluation (2024)
[5] METR. “AI tools slowed experienced developers” RCT report (2025)
[6] “A Survey of Hallucination in Large Language Models” (ACM Computing Surveys, 2024)
[9] Loya et al. Findings of EMNLP 2023. Prompt variations and decoding sensitivity
[11] The Verge. Anthropic’s ClaudeBot crawler controversy (2024)
[14] PH1 website and Product Impact / Design of AI podcast page