20 Lessons I've Learned while being humbled by RL (Field notes for fellow Data/PM folks making the leap)

Huh: Why am I writing about AI research?

Over the last two years, as I’ve gone deeper into building around AI, I’ve become increasingly enamored with training, and especially reinforcement learning. Not because I woke up one morning and thought, “You know what would make my life easier? More Greek letters.” But because, in the simplest terms, RL scratches the exact same itch that pulled me into product, growth, marketing, and creative experimentation for the last twenty years. It is the work of finding a crack in the system, forming a hypothesis about how to use it, testing it against reality, learning what holds up, and then hunting for the next crack. It is experimentation with the difficulty turned up, because the thing you are improving is no longer just a funnel, a campaign, or a product surface. It is behavior.

That is why I’m writing this. As I’ve started building RL-style environments, rewards, verifiers, holdouts, and little ecosystems for models to act inside, I’ve had this nagging feeling that more product and growth people should at least get close enough to smell the machinery. Not everyone needs to become an RL researcher. I’m certainly not pretending to be one. But I do think product managers, growth people, and data leaders have an unfair advantage that too few are using: we are close to real users, real workflows, real data, real incentives, and all the weird ways humans actually interact with products once the slide deck has left the building. That closeness matters. The future of AI will not just be shaped by bigger models. It will be shaped by better environments, better feedback loops, better definitions of success, and better judgment about what humans actually value.

These notes come from weeks of me building exactly those pieces. By the end, I understood my own job (data and product judgment) better than the math, and that’s why I wrote them down. If you’re deep in research, I’d love your corrections and sharper questions; if you’re in product, growth, or data, I hope this is valuable when you hit the same walls, and hit me up if I can help bridge the gap.

Glossary and Context Setting

A few words up front, because words shouldn't be the hard part of this adventure:

Reinforcement-Learning (RL) environment: the world a policy acts inside: what it can observe, what actions it can take, how state changes, and what reward comes back. For product folks: your metric system, under optimization pressure.
Reward: the machine-readable signal of success; sometimes stepwise, sometimes delayed, sometimes separate from the final evaluation score. Your KPI.
Verifier (grader): what computes the reward from the trajectory and evidence.
Rollout: a single attempt, start to finish.
Policy: the strategy the model uses (weights and biases we'll save for another day) to decide what actions to take in different situations.
Trajectories: the ordered sequence each rollout produces: each step’s observation, action, and reward, from start to terminal state. A trajectory is the unit an optimizer actually learns from. For product folks: a single user’s session path, with the outcome attached.
Traces: the replay, meaning every action, observation, tool call, state change, and reward component, with enough metadata to re-execute, interrogate, and grade it. It is the same behavioral data foundation you already mine from session recordings, except you get to see both the action and the thought process.
Reference policy: a known comparison policy (random, greedy, conservative, adversarial, or oracle-like) you run to check whether the reward orders behavior the way you expect. (In RLHF and other training setups, “reference policy” can also mean the policy you constrain against.)
Trainability: the north-star question these checks prepare for: can an optimizer actually improve a policy from the trajectories it samples, or does the score just sit there, saturate, or reward the wrong behavior?

If you own data pipelines, evals, or a roadmap, this should feel familiar. A reward is a metric. A verifier is its business logic. Gaming is Goodhart's law: once a measure becomes a target, it stops being a good measure. You already fight this war. RL just hands the adversary a tireless optimizer and tells it to win.

Here is the single realization that reorganized everything else. The pre-optimizer half of RL work is defined by a sound environment and a discriminating reward, not by code that runs, tests that pass, or a polished UI. Trainability is the next claim, real only when an optimizer climbs that reward. The first time a strong model scored 0.97, I thought I'd won. When customers get through a funnel cleanly, you celebrate. But in training, I'd actually just built a fixture: a thing that runs beautifully and measures nothing. That is the trap: execution validity is not measurement validity. Everything below is the cost of learning that, and the adjustments I made so I'd never get fooled the same way twice.

Act I: What an environment really is

01. A thing that runs is not a thing that measures

The first environment I shipped was deterministic, replayable, clean. A capable model tried it and scored 0.9688 on the very first attempt. I felt great for about ten minutes, until the obvious sank in: if a strong model maxes it on contact, the score isn't separating good models from bad ones. It's separating nothing from nothing.

I used to think "it runs and produces a score" was the finish line. Now I treat that as table stakes — the price of entry, not the product. The real bar, in evaluation and benchmark design, is discrimination: do different policies land at visibly different scores? An environment that everyone aces is a benchmark that ranks no one. A reference-policy spread can prove the ruler can rank policies. It still does not prove a learner can climb the ruler through optimization.

Monday-morning takeaway: before you trust any new metric, run your best and worst performers through it (like your power users, churn-risk accounts, and fraud profiles). If they cluster, the metric is decoration, the same way an A/B test where every variant lands inside the noise told you nothing. Rebuild it harder until the spread appears.

02. "Green" is a lie until a real user sees it

My next humbling came from a scorecard that was 100% green. Every test passed. The build was clean. Then I opened the actual traces and read them like I was a real user, and four of ten expectations were broken. The worst failures were invisible to the test suite by construction, because the tests either ignored the real path or checked something else entirely.

I used to believe "tests pass plus build passes equals ready." I was wrong in a specific, repeatable way: the unit suite was quietly testing clean paths and bypassing the messy real path. Green told me the code I wrote was internally consistent. It told me nothing about what the model would actually see.

Takeaway: redefine "done" as traces observed by a real persona, and make every green claim cite something a human actually watched happen. Staging-green is a hypothesis, not a receipt; every PM who's lived through "QA passed, customer broke" already knows this in their bones.

03. The verifier is the product

The environment, the reward, the data, and the harness each pull real weight. But the verifier is easy to treat as plumbing, and that’s the mistake. For a long time I poured effort into the model-facing surface and let the grader coast. Backwards. A weak verifier that quietly passes subtly-wrong answers isn’t a minor bug; it’s a defective product, because the grader is the main reason anyone would trust your benchmark over their own. Get it wrong and every other part, however polished, is measuring nothing.

The turning point was giving up on my home-grown scoring scheme and adopting Verifiers, a battle-tested contract from Prime Intellect. When I surveyed how others had solved the same problem, my clever bespoke reward turned out to be the weakest design in the comparison set. The lesson: leverage shared systems exactly on the parts they've already solved, and spend your originality only on the axes they structurally can't cover. Never try to out-clever a proven grader on the slice it already owns.

Takeaway: for any measurement system, find who has already solved the hard core and copy their contract exactly. It's the same instinct that makes you reach for Stripe instead of writing your own payments, or use NPS instead of rolling a bespoke loyalty survey. Differentiate on your unique problem, not on re-deriving solved fundamentals.

04. Pick the tool whose mechanics fit, then own the rules yourself

I had a multi-phase task — the legal moves change as you progress — and I forced it into a framework that demanded the whole plan up front, in one shot. Then I spent three rounds "fixing" it with better and better instructions. The exact same failure recurred every time, one run even timing out on the identical error. The problem was never the wording. It was that a one-shot plan framework structurally cannot handle a task whose rules change mid-flight.

I used to reach for whatever framework was popular and bend my problem to it. Now I choose by mechanism fit — does this tool's execution model match the shape of the work? — and I keep the actual rules of the game in my own code, treating the framework as a thin adapter I can swap. No amount of prompt-tuning can fix an architecture mismatch.

Takeaway: When a tool keeps failing the same way despite your best configuration, stop tuning. The mismatch is structural, the way a PLG self-serve playbook might lose a high-touch enterprise sale, no matter how slick the funnel, or a last-click attribution model can't capture brand value, no matter how clean the data. Match the tool to the mechanism, and never let a vendor own your core logic.

Act II: Designing the reward

05. Normalize the score, or you erase your own ranking

A subtle one nearly slipped past me: my raw scores could exceed 1.0, and the harness downstream silently clamped everything above 1.0 to 1.0. Every great result collapsed into a tie. I had built a ranking signal and then quietly shredded it one layer down.

I learned to push normalization — every component weighted, each bounded, summing cleanly to a true 0-to-1 — into the core of the scorer, so no downstream layer ever has to rescue the range. A correct scale is the precondition for every other proof to mean anything. If two different things can report the same number, you've lost the plot before you started.
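
A minimal sketch of what normalizing at the source could look like. The component names and weights are hypothetical, chosen only for illustration, and this is not the scorer described above. Each component is bounded to [0, 1] and the weights must sum to 1, so the total cannot leave [0, 1] and no downstream clamp ever has anything to do.

```python
import math

# Hypothetical component names and weights, for illustration only.
WEIGHTS = {"correctness": 0.6, "efficiency": 0.3, "format": 0.1}


def normalized_score(components: dict[str, float]) -> float:
    """Weighted sum of bounded components; the result always lies in [0, 1]."""
    if not math.isclose(sum(WEIGHTS.values()), 1.0):
        raise ValueError("weights must sum to 1.0")
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    total = 0.0
    for name, weight in WEIGHTS.items():
        bounded = min(1.0, max(0.0, components[name]))  # bound each part at the source
        total += weight * bounded
    return min(1.0, total)  # guards against float rounding only


great = normalized_score({"correctness": 1.0, "efficiency": 0.9, "format": 1.0})
perfect = normalized_score({"correctness": 1.0, "efficiency": 1.0, "format": 1.0})
print(great, perfect)  # two distinct values, so the ranking survives
```

The per-component bound is a guard, not the scale: a component whose natural range runs past 1.0 would still tie at the top, so each one should be defined on 0-to-1 to begin with.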

Takeaway: Make sure your headline metric lives on a single calibrated scale, defined once at the source. The PM trap that mirrors this: comparing MAU across two teams that each define "active" differently, or rolling up revenue across regions without a fixed FX rate. If any layer is silently clamping, rescaling, or truncating, your rankings are lying.

06. Prove the reward discriminates — demand a spread, not a number

The fact that a reward produces a score tells you nothing. The first real question is discriminative validity: does the reward separate known-bad, known-okay, known-good, and adversarial behavior in the order you predicted before running it? I now refuse to trust a reward until I've run a panel of reference policies through it: an oracle-like or known-good one, a random one, a greedy shortcut one, and an adversarial one. Each has to land in the named, correctly ordered place I predicted in advance. The perfect policy should top out. Random should land near the floor. A greedy or cheating policy should be actively punished, not merely middling.

I used to validate a reward by checking that it ran on one good example. Now I treat that as meaningless. The cheap, deterministic reference policies are the proof the mechanic discriminates at all — and I keep them strictly separate from the expensive question of how real models perform, or whether a learner can improve from the reward signal and credit assignment available across its trajectories. One tests the ruler. The other tests whether a learner can actually move through the room. Even a perfect spread across policies I built to differ says nothing about whether an optimizer's own rollouts ever visit the states the reward cares about; that coverage question is what Part 2 lives or dies on.
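
One way to turn "predict the order, then check it" into a test is sketched below. The policy names, the scores, and the `min_gap` threshold are made-up placeholders; in practice each score would be an average over many rollouts of a real reference policy.

```python
def check_spread(scores: dict[str, float], predicted_order: list[str],
                 min_gap: float = 0.05) -> list[str]:
    """Return a list of problems; an empty list means the panel sorted as predicted."""
    problems = []
    # Compare each policy with the one predicted to rank just below it.
    for better, worse in zip(predicted_order, predicted_order[1:]):
        gap = scores[better] - scores[worse]
        if gap < min_gap:
            problems.append(
                f"{better} ({scores[better]:.2f}) does not beat "
                f"{worse} ({scores[worse]:.2f}) by at least {min_gap}"
            )
    return problems


# Prediction written down BEFORE running anything, best to worst.
predicted = ["oracle", "conservative", "random", "greedy_shortcut", "adversarial"]

# Made-up scores standing in for real reference-policy results.
observed = {"oracle": 0.95, "conservative": 0.70, "random": 0.20,
            "greedy_shortcut": 0.05, "adversarial": -0.50}
saturated = {name: 1.0 for name in predicted}  # the "everyone aces it" failure

print(check_spread(observed, predicted))        # no problems reported
print(len(check_spread(saturated, predicted)))  # every adjacent pair fails
```

A passing result here only says the ruler can rank policies built to differ. It says nothing about whether a learner's own rollouts reach the states the reward cares about.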

Takeaway: never sign off on a metric from a single data point. Build a spread of known-good, known-okay, known-bad, and adversarial cases. It's like walking hand-picked segments through your funnel on demand: power users, dormant accounts, grandma on the kids' Netflix account. Predict where each should land, and only trust the metric once it sorts them in that order. Validating a KPI off the "average user" is how broken dashboards survive for years.

07. Sensitivity is a harder claim than it looks, so don't fake it

Here is the trap that humbled me most. A reward can perfectly separate two specific outputs and still be useless, because that is a different claim from being sensitive to policy quality on the target task distribution. I ran two careful A/B comparisons. Both "passed." Neither proved anything: one scored both models as failures, the other saturated both at a perfect 1.0. Same passing-looking result, zero information.

True sensitivity needs three things at once: a grader that isn't pinned at the ceiling or floor, two models that you know differ, and a measured gap pointing in the predicted direction. When I couldn't meet that bar, I did the thing that still feels uncomfortable: I left the claim explicitly unproven and labeled it "watch," for weeks, rather than cite a passing-but-hollow run. It is one of the only claims I never let myself assert.
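
The three conditions can be written as a small classifier that refuses to say "proven" unless all of them hold. This is a sketch under assumptions: the thresholds are arbitrary, the function name is hypothetical, and it takes for granted that the caller already knows which model is the stronger one.

```python
def sensitivity_verdict(strong: float, weak: float, floor: float = 0.0,
                        ceiling: float = 1.0, min_gap: float = 0.05,
                        margin: float = 0.02) -> str:
    """Classify an A/B comparison of two models known in advance to differ.

    `strong` and `weak` are the scores of the model expected to be better
    and the model expected to be worse.
    """
    if min(strong, weak) >= ceiling - margin:
        return "watch: both saturated at the ceiling"
    if max(strong, weak) <= floor + margin:
        return "watch: both pinned at the floor"
    if strong - weak < min_gap:
        return "watch: gap missing or pointing the wrong way"
    return "proven"


print(sensitivity_verdict(1.0, 1.0))  # the saturated comparison
print(sensitivity_verdict(0.0, 0.0))  # the both-failed comparison
print(sensitivity_verdict(0.8, 0.4))  # a usable gap in the predicted direction
```

Both of the hollow comparisons described above come back as "watch", which is the point: a run that completes is not the same as a run that carries information.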

Takeaway: distinguish "my metric can tell these two things apart" (paying vs. fully churned: easy) from "my metric tracks real quality" (a churn-risk score that stays calibrated week after week as engagement slides: much harder). If you can't demonstrate sensitivity cleanly, say so out loud. An honest "unproven" beats a dishonest green.

08. Make difficulty data, and scale by rows, not by code

Early on, every new task meant new grading code. That doesn't scale, and it quietly guarantees inconsistency. I moved difficulty into the data: each task carries a declarative list of rules, and one shared verifier interprets them. Harder tiers simply accumulate more rules against the same interpreter. A whole profession's worth of tasks became rows in a dataset, not a sprawl of one-off code.

This is the unlock that lets you scale generation: an external synthesizer can propose harder and harder tasks against a fixed grader, as long as those tasks still pass reachability, leakage, and discriminative-validity checks, because difficulty is data it can write, not Python it has to author. The pattern proven on one domain then transferred to an entirely different one for free.
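
To make "difficulty is data" concrete, here is a toy version of the pattern. The rule kinds (`field_equals`, `field_at_most`, `field_absent`) and the example task are invented for illustration and are not the rule vocabulary used in the environments described here.

```python
from typing import Any, Callable

# One shared interpreter: each rule kind maps to exactly one check.
RULE_CHECKS: dict[str, Callable[[dict[str, Any], dict[str, Any]], bool]] = {
    "field_equals": lambda state, rule: state.get(rule["field"]) == rule["value"],
    "field_at_most": lambda state, rule: state.get(rule["field"], float("inf")) <= rule["value"],
    "field_absent": lambda state, rule: rule["field"] not in state,
}


def verify(final_state: dict[str, Any], rules: list[dict[str, Any]]) -> bool:
    """A task passes only if every declared rule holds on the final state."""
    # An unknown rule kind raises KeyError instead of passing silently.
    return all(RULE_CHECKS[rule["kind"]](final_state, rule) for rule in rules)


# Tasks are rows: the harder tier is the easy tier plus more rules.
EASY = [{"kind": "field_equals", "field": "status", "value": "filed"}]
HARD = EASY + [
    {"kind": "field_at_most", "field": "steps", "value": 10},
    {"kind": "field_absent", "field": "leaked_ssn"},
]

state = {"status": "filed", "steps": 14}
print(verify(state, EASY), verify(state, HARD))  # passes easy, fails hard
```

A task generator only has to emit new rule lists. The interpreter stays fixed, so every tier is graded by the same code path.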

Takeaway: if every new case in your eval needs new code, you've capped your scale at human authoring speed. This is the same shift product teams make moving from one-off SQL reports to a metrics layer, or from custom-coded A/B tests to a unified experimentation platform. Push the variation into data interpreted by one stable engine, and growth becomes systemic, not a sprint.

Act III: Encoding what you actually value

09. Zero out fake partial success — and prove it with legal play

A partial run must not masquerade as a full win. For final evaluation scoring, I force the score to zero unless the final state genuinely satisfies the terminal goal. An unfinished task reads as failure, full stop. (The same idea, stated as an invariant: every task must be impossible to pass at the start and guaranteed to pass at the goal — if you don't check both ends, your benchmark is silently broken.) For training, though, an all-or-nothing terminal reward can starve credit assignment: until success, the optimizer may see almost no useful learning signal. Pair the hard final scoring rule with carefully designed reward shaping, curricula, intermediate checks, or advantage estimates if a learner must climb it — and that clause is a whole subfield, because densifying a sparse reward without changing the task's true optimum is a core Part 2 problem, not a footnote.
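
A small sketch of the two ideas in this paragraph: a final score gated on the terminal goal, and the start/goal invariant checked per task. The ticket-closing task and all function names are hypothetical.

```python
from typing import Callable


def final_score(partial_credit: float, goal_reached: bool) -> float:
    """Evaluation score: partial credit counts only once the terminal goal holds."""
    return partial_credit if goal_reached else 0.0


def check_task_invariants(is_goal: Callable[[dict], bool],
                          start_state: dict, goal_state: dict) -> None:
    """Every task must fail at the start and pass at the goal."""
    if is_goal(start_state):
        raise ValueError("task already passes at the start state")
    if not is_goal(goal_state):
        raise ValueError("verifier rejects the intended goal state")


# Hypothetical task: all tickets must end up closed.
def all_tickets_closed(state: dict) -> bool:
    return all(status == "closed" for status in state["tickets"])


start = {"tickets": ["open", "open", "open"]}
goal = {"tickets": ["closed", "closed", "closed"]}

# The goal state here is constructed by hand, so this checks the verifier,
# not whether legal play can actually get there.
check_task_invariants(all_tickets_closed, start, goal)
print(final_score(0.8, goal_reached=False), final_score(0.8, goal_reached=True))
```

This is the evaluation-side rule only. A training reward built on top of it would still need some denser signal, which the sketch deliberately leaves out.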

But here's the refinement that caught me cheating myself: the only test that reached the win state did so by directly editing the state to skip the hard part. That proves the plumbing works, not that the goal is actually reachable by playing fairly. I had to hold "this is winnable" as unproven until an honest, rules-only trajectory actually got there. Until then, the project stayed labeled as a prototype.

Takeaway: don't let partial progress score as success, and don't let a setup that forces the success state count as proof that success is achievable. This is the difference between counting sign-ups and counting activations, or trial-starts and paying conversions. Prove reachability by a real run playing legally through the full path, or admit it's unproven.

10. Build the reward as a panel, each part fail-closed

A trustworthy reward is rarely one number. I learned to compose it from distinct components: a deterministic check that's always on and needs no model or human; an expert-judgment component that becomes real only when a real judge is wired up, and otherwise sits inert at zero weight; and a human-verification component that counts only against a validated receipt, never on a casual thumbs-up.

The discipline is that every component fails closed: if the judge isn't configured, it contributes nothing rather than a guess; if the human signal isn't properly verified, it doesn't count. This demotes the thin, easy-to-game part of your score from "the reward" to merely "the always-on headline," and leaves honest, clearly-labeled slots for the richer signals to plug into later.
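
Here is one possible shape for a fail-closed panel. The component names and weights are placeholders, and using `None` to mean "not configured or not verified" is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    name: str
    weight: float
    value: Optional[float]  # None means "not configured" or "not verified"


def panel_reward(components: list[Component]) -> tuple[float, dict[str, str]]:
    """Sum weight * value over live components; a missing one adds exactly 0."""
    total = 0.0
    status: dict[str, str] = {}
    for comp in components:
        if comp.value is None:
            status[comp.name] = "inert"  # fail closed: no guess, no default credit
            continue
        total += comp.weight * min(1.0, max(0.0, comp.value))
        status[comp.name] = "live"
    return total, status


panel = [
    Component("deterministic_check", weight=0.5, value=0.8),
    Component("expert_judge", weight=0.3, value=None),    # judge not wired up yet
    Component("human_receipt", weight=0.2, value=None),   # no validated receipt
]
total, status = panel_reward(panel)
print(total, status)
```

The weights are deliberately not renormalized when a component is inert. The ceiling drops instead, so a missing judge can lower the achievable score but can never inflate it.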

Takeaway: When you blend signals into one score, make each one default to zero contribution until it's genuinely earned. A missing input should never silently inflate the result. The parallel I've seen the most in product is revenue recognition: you don't book revenue until it's actually earned, so an unmet criterion contributes zero, never a hopeful fraction.

11. Reward the right behavior, not the busy-looking one

In one environment, the correct move was often to do nothing: defer, decline, resolve quietly without convening a meeting. My first reward accidentally paid for activity, so a policy that scheduled eighteen meetings and deferred nothing could look productive while being terrible.

I rebuilt the reward so that correct deferral is a first-class positive, and a high-throughput "just hold more meetings" policy scores below a careful, conservative one. Then I proved it through the reference-policy spread: the greedy meeting-maximizer lands beneath the cautious abstainer, by design and by measurement. Throughput had to be actively penalized, not quietly rewarded.
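
A toy version of that rebuilt reward, with invented numbers: the penalty sizes and the 18-request scenario are assumptions made for illustration, not the values used in the real environment.

```python
def calendar_reward(decisions: list[tuple[str, bool]]) -> float:
    """Score (action, meeting_was_needed) pairs; action is 'schedule' or 'defer'."""
    total = 0.0
    for action, needed in decisions:
        if action == "schedule":
            total += 1.0 if needed else -1.0      # unnecessary meetings cost something
        elif action == "defer":
            total += 1.0 if not needed else -0.5  # correct restraint earns full credit
        else:
            raise ValueError(f"unknown action: {action}")
    return total / len(decisions)


# 18 hypothetical requests, only 4 of which truly need a meeting.
needs = [True] * 4 + [False] * 14
greedy = calendar_reward([("schedule", n) for n in needs])
cautious = calendar_reward([("defer", n) for n in needs])
careful = calendar_reward([("schedule" if n else "defer", n) for n in needs])
print(greedy, cautious, careful)  # greedy lands below cautious, careful on top
```

Under these assumed penalties the meeting-maximizer scores negative, the blanket abstainer scores positive, and only the policy that exercises judgment reaches the top.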

Takeaway: audit your success metric for whether it secretly rewards volume over judgment. Sales teams measured on calls made will spam. Engineers measured on PRs shipped will fragment. A model rewarded for calendar activity will schedule 18 meetings that nobody needs. If "doing more" beats "doing the right amount," you've built an incentive to spam. Make the restraint scorable.

12. Make safety a number that dominates everything else

For anything touching humans, "be safe" cannot be a vibe. I encoded it as a numerical invariant: one critical violation carries a penalty large enough to wipe out roughly a dozen perfect runs. A single privacy leak or illegal action doesn't dent the score — it obliterates it. And I checked it against fixed probes: the adversarial reference policy lands deep in the negative, while benign policies sit modestly positive. That is necessary reward shaping, not final robustness; a learner can still shift toward blind spots. The penalty proves the stated invariant inside the probes I wrote. It does not prove the policy cannot find an untested path around it.
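
The arithmetic behind "one violation wipes out roughly a dozen perfect runs" is easy to check directly. The penalty of 12.0 is an assumed value picked to match that phrase, not the constant from the real environment.

```python
PERFECT_RUN = 1.0
CRITICAL_PENALTY = 12.0  # assumed value: roughly a dozen perfect runs


def episode_reward(task_score: float, critical_violations: int) -> float:
    """Task score minus a dominating penalty per critical violation."""
    return task_score - CRITICAL_PENALTY * critical_violations


benign_batch = [episode_reward(PERFECT_RUN, 0) for _ in range(12)]
# Same batch, except the last run leaked private data once.
leaky_batch = benign_batch[:-1] + [episode_reward(PERFECT_RUN, 1)]

print(sum(benign_batch))  # 12.0
print(sum(leaky_batch))   # 0.0: eleven perfect runs cancelled by one violation
```

Like the probes described above, this only demonstrates the invariant on the cases written down. It does not show that a trained policy cannot find a violation the penalty never sees.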

Separately, every control failure — an illegal move, a timeout, a crash, a transport error — gets a graded, fail-closed penalty that forces the run to register as a failure. No autopilot quietly finishes a task the model actually abandoned. I used to state safety qualitatively ("it must dominate"). Now it's a measured gate, proven across every scenario, baked into the rules of the world rather than bolted on as a label.

Takeaway: if a value is non-negotiable (privacy, safety, compliance, brand integrity), give it a number big enough that no amount of ordinary success can buy its way past a single violation. In the same way, a single production data breach should never be net-positive for the quarter, no matter how good growth otherwise looked. Then run an adversary against the rule to confirm the math actually holds.

Act IV: Defending against gaming

13. Treat the policy as an adversary against your grader

This is the mindset shift that changed how I build everything. The model under test is not a cooperative student. More precisely, once you train against the reward, the policy becomes an adaptive search process over your mistakes. It is an optimizer that will do whatever scores well, including lying about its own results. Fixed adversarial probes are necessary unit tests, but a learning policy is worse: it tends to move toward reward blind spots as it trains. The optimizer is pulled toward exactly the places your reward is wrong, then reinforces them, which is why the training-time fix is not a better hand-written probe but a different toolkit entirely. My deepest, most-revised work went into one boundary: the grader must recompute the reward from scratch and never trust a self-reported number.

It tightened over many passes, each closing a hole the last revealed: ignore a submission that claims its own score; replay only state the grader itself owns; bind every submission to the exact task it should be answering, so a result from an easier task can't be smuggled in; reject forged state; remove any fallback the candidate could trigger to leak itself free points. Anti-gaming became a first-class gate with its own named checks, not an afterthought. Trust is a property of re-execution, where you re-run the trajectory yourself, never of a snapshot the candidate handed you.
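
Two of those checks, ignoring self-reported scores and binding a submission to its task, fit in a few lines. Everything here is a stand-in: `score_trajectory` fakes re-execution with a lookup table, and the task ids and submission fields are hypothetical.

```python
def score_trajectory(task_id: str, actions: list[str]) -> float:
    """Stand-in for real re-execution against state the grader owns."""
    targets = {"task-easy": ["a"], "task-hard": ["a", "b", "c"]}  # hypothetical tasks
    return 1.0 if actions == targets[task_id] else 0.0


def grade(submission: dict, expected_task_id: str) -> float:
    """Recompute the reward from the actions; never read a claimed score."""
    # Bind the submission to the task it was issued for.
    if submission.get("task_id") != expected_task_id:
        return 0.0
    # submission["claimed_score"] is deliberately never consulted.
    return score_trajectory(expected_task_id, submission.get("actions", []))


liar = {"task_id": "task-hard", "actions": ["a"], "claimed_score": 1.0}
smuggler = {"task_id": "task-easy", "actions": ["a"], "claimed_score": 1.0}
honest = {"task_id": "task-hard", "actions": ["a", "b", "c"]}

for name, sub in [("liar", liar), ("smuggler", smuggler), ("honest", honest)]:
    print(name, grade(sub, expected_task_id="task-hard"))
```

Only the honest submission scores. The other two fail for different reasons, which is why each check earns its own name rather than living inside one catch-all validation.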

Takeaway: assume whoever is measured will try to game the measurement, and design so self-reported numbers carry zero weight. The comp here is relying on Google and Meta reports for ad performance. Assume the signal is biased and incomplete. Recompute from raw evidence you control. This is Goodhart's law with a faster adversary — and the real test is whether it survives an adversary that learns, not just one you wrote by hand.

14. Provenance and freshness are hard gates, and watch for theater

One environment's entire value was currency: a fresh, real-world signal. Which meant the differentiator was also the easiest thing to fake. So provenance became a hard gate: nothing enters without its source, a collection timestamp, and a locator. Rights got enforced by an auto-excluding blocklist — and when a source was permission-blocked, the right move was to delete the row entirely, not carry it around flagged. I also guarded against "freshness theater": a signal isn't fresh just because it's recent; it has to imply a real, repeatable consequence, not chase sentiment.

Crucially, I ordered the gates cheapest-first, so the freshness check kills a bad batch before any expensive processing runs. I used to treat data quality as a cleanup pass at the end. Now the cheapest disqualifying check runs first.
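
A sketch of cheapest-first gating, assuming a hypothetical blocklist, a seven-day recency window, and a caller-supplied `expensive_step`. The recency test covers only the cheap half of freshness; whether a signal implies a real, repeatable consequence needs a separate check that is not shown.

```python
from datetime import datetime, timedelta, timezone

BLOCKED_SOURCES = {"blocked.example"}  # hypothetical rights blocklist
MAX_AGE = timedelta(days=7)            # hypothetical recency window


def has_provenance(row: dict) -> bool:
    return all(row.get(key) for key in ("source", "collected_at", "locator"))


def rights_ok(row: dict) -> bool:
    return row["source"] not in BLOCKED_SOURCES


def is_recent(row: dict, now: datetime) -> bool:
    return now - row["collected_at"] <= MAX_AGE


def admit(rows: list[dict], now: datetime, expensive_step) -> list[dict]:
    """Run the cheap gates in order; only survivors reach the expensive step."""
    admitted = []
    for row in rows:
        # Short-circuit `and`: the first failing gate stops the rest.
        if has_provenance(row) and rights_ok(row) and is_recent(row, now):
            admitted.append(expensive_step(row))
        # Failing rows are dropped entirely, not carried around flagged.
    return admitted


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
recent = datetime(2025, 1, 8, tzinfo=timezone.utc)
stale = datetime(2024, 12, 1, tzinfo=timezone.utc)
rows = [
    {"source": "ok.example", "collected_at": recent, "locator": "a"},
    {"source": "ok.example", "collected_at": recent},                      # no locator
    {"source": "blocked.example", "collected_at": recent, "locator": "c"},
    {"source": "ok.example", "collected_at": stale, "locator": "d"},
]
expensive_calls = []


def enrich(row: dict) -> dict:
    expensive_calls.append(row["locator"])  # records how often the costly step runs
    return row


kept = admit(rows, now, enrich)
print([row["locator"] for row in kept], expensive_calls)
```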

Takeaway: the thing that makes your data special is the thing most worth faking. It's the same reason any seasoned data team obsesses over UTM hygiene, event schemas, and source-of-truth at intake instead of trying to repair attribution in the warehouse later. Gate it hard at the boundary (source, timestamp, rights) and run the cheapest disqualifier first so garbage dies before it costs you anything.

15. How options are presented matters more than the rules

A surprising one. A risky action in my environment looked too attractive and too repeatable, so the model spammed it, seventy-plus times in a single run, across strategies that should have behaved differently, collapsing its own performance. My instinct was to add a rule forbidding it. Instead I fixed how the action was presented: explicit metadata marking it as risky and costly, and making the sane default obviously sane. The spam dropped to near zero. No new rule, just a clearer surface.

What stings is that the one-shot harness had hidden this failure completely. Only the live, step-by-step loop surfaced it — and a replay even showed the model never saw the guidance I'd carefully added, because the loop never put it in front of the model. "We added guidance" is worthless if the system never surfaces it.

Takeaway: how you present choices shapes behavior more than the rules constraining them. Every PM who's watched users blow past a tooltip to click the brightest button already knows this. Before you write another rule to stop bad behavior, check whether your interface is inviting it. And confirm your guidance is actually reaching the user, not sitting in a help doc no one opens.

16. A rich scaffold makes a model act — but that's not proof it can learn

Raw models flailed in my multi-stage environment, looping and failing. The cure was enriching what the model can see at each step: the current objective, compact descriptions of each available action, the decision policy, a profile of itself, warnings about repeats. Performance climbed from worse-than-random up to matching my best hand-written reference policy.

And then I had to say the hard thing out loud: making a model act reliably is not the same as proving the environment can train a model to be better. Scaffolding is necessary engineering. In RL terms, it is closer to observation design in a partially observable environment: changing what the policy can see, not proving the reward provides a learnable signal. I built all that helpful structure derived only from what the model is legitimately allowed to see — never changing the rewards, never relaxing the rules, never auto-correcting illegal moves (which would just hide the real reliability problem). And I refused to call it trainable on that basis.

Takeaway: good tooling that makes a system usable is not evidence the system works at its core job. A polished onboarding flow is not proof of retention. A pretty dashboard is not proof the model behind it is accurate. Name the boundary explicitly so a demo never gets mistaken for a proof.

17. Don't let your help leak the answer

The scaffold that makes a model act is also where you accidentally teach it the test. My helpful observations originally exposed raw identifiers — some of which literally contained hint words — and internal scoring tokens. If the answer is reconstructable from what you show the model, you're no longer measuring capability; you're measuring how well it mines your prompt.

So I scrubbed it: rename reward-flavored words to neutral ones, hide targets and recommended answers and private fields, and add automated scans asserting that no observation, on any data split, contains obvious scoring language like correct, score, reward, or value. That only catches the cheap leaks; semantic leaks still require adversarial review. The deeper worry cuts further: prescriptive tactical coaching can quietly turn a planning test into a compliance test — you end up measuring whether the model follows your hints, not whether it can think. The fix is to classify every hint by type, label each run by how much coaching it got, and never compare a coached run against an uncoached one as if they were the same benchmark.
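
The automated scan can be as plain as a substring check over every observation on every split. The word list comes from the paragraph above; the split names and sample observations are invented. Substring matching is a deliberate choice in this sketch so that hint words buried inside identifiers are caught, at the cost of some false positives.

```python
FORBIDDEN_WORDS = ("correct", "score", "reward", "value")


def find_leaks(observations: dict[str, list[str]]) -> list[tuple[str, int, str]]:
    """Return (split, index, word) for every forbidden word found in an observation."""
    leaks = []
    for split, texts in observations.items():
        for index, text in enumerate(texts):
            lowered = text.lower()
            for word in FORBIDDEN_WORDS:
                # Substring match, so "item_correct_17" is caught as well.
                if word in lowered:
                    leaks.append((split, index, word))
    return leaks


observations = {
    "train": ["Pick one of: option_a, option_b"],
    "holdout": ["Candidate id: item_correct_17"],
}
leaks = find_leaks(observations)
print(leaks)  # the holdout identifier leaks a hint word
```

In a test suite the natural assertion is that `find_leaks` returns an empty list for every split. As noted above, that covers only the cheap leaks; a paraphrased hint would pass this scan untouched.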

Takeaway: audit everything you show the model for leaked answers and leaked scoring language. It's the difference between a clean user interview and one full of leading questions: ask "don't you love this feature?" and you'll measure agreeableness, not truth. The uncomfortable question to sit with: is my eval measuring the skill I care about, or just measuring obedience to my own hints?

Act V: Telling the truth about what you built

18. Simulate the expensive dependency honestly, with the real schema waiting

I needed expert human review and real-model integrations that were slow and costly to stand up. Instead of blocking the whole pipeline on them, I built simulated stand-ins that write the exact same schema a real one would — so the real version later swaps in without touching anything around it. Because the human component carries zero reward weight until a genuine receipt exists, simulating it does not corrupt the score. It can still create false confidence, so the label has to be loud.

The non-negotiable rule was honest labeling: every artifact records its backend — mock, heuristic, deterministic, or real — so a model's name never implies real execution unless a real trace backs it. I used to feel pressure to fake the impressive version. Now I ship the honest placeholder, clearly labeled, with the real seam built and waiting behind a config flag that fails closed if credentials are missing.
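
A sketch of the labeling rule and the fail-closed flag. The four backend names come from the text; the environment variable `JUDGE_API_KEY`, the function names, and the artifact fields are hypothetical.

```python
import os
from typing import Mapping, Optional

BACKENDS = ("mock", "heuristic", "deterministic", "real")


def resolve_backend(requested: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the backend to use, refusing 'real' when credentials are missing."""
    env = os.environ if env is None else env
    if requested not in BACKENDS:
        raise ValueError(f"unknown backend: {requested!r}")
    # JUDGE_API_KEY is a hypothetical variable name used only in this sketch.
    if requested == "real" and not env.get("JUDGE_API_KEY"):
        # Fail closed: raise instead of silently falling back to a mock.
        raise RuntimeError("real backend requested but credentials are missing")
    return requested


def make_artifact(model_name: str, score: float, backend: str) -> dict:
    """Every artifact carries the backend that actually produced it."""
    return {"model": model_name, "score": score, "backend": backend}


artifact = make_artifact("some-model", 0.42, resolve_backend("heuristic", env={}))
print(artifact)
```

The important behavior is the refusal: asking for the real backend without credentials raises an error, so a mock result can never end up stored under a "real" label.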

Takeaway: unblock yourself with simulated stand-ins, but label them ruthlessly and build them to the real schema. "Heuristic" today plus a real seam beats a convincing fake that quietly rots your trust in your own numbers. Any growth person worth their salt has played the human stand-in for tech that didn't exist yet. The concierge MVP. The one rule: deliver only what the real system will eventually be able to match, or you're building trust on an experience the machine can never reproduce.

19. Verify by re-running, fingerprint everything, gate against drift

The spine of a trustworthy evaluator is reproducibility: determinism where you can get it, and full provenance where you cannot. My replay viewer doesn't trust the recorded outcome — it re-runs the recorded sequence and throws a loud failure banner if the recomputed result diverges from what was stored. Every record carries its full identity: which policy and version, which scenario, the seed, the configuration, the actions, the outcome, a deterministic digest. Change any controlled axis, including the policy version, prompt, seed, scenario, scorer, or config, and the fingerprint must change, so you can never accidentally compare two different things as if they were the same.

Then I drift-gate the committed artifacts: a changed task fails its fingerprint until it's regenerated, so nothing silently mutates underneath me. Determinism and anti-gaming turned out to be the same mechanism — recompute identity from scratch and the cheater and the accidental-drift both get caught.
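
The fingerprint and the re-run check can both be sketched with the standard library. The record fields mirror the axes listed above, while `toy_rerun` is a made-up deterministic environment standing in for real replay.

```python
import hashlib
import json

IDENTITY_KEYS = ("policy", "policy_version", "scenario", "seed",
                 "config", "prompt", "scorer_version", "actions")


def fingerprint(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the controlled axes."""
    identity = {key: record[key] for key in IDENTITY_KEYS}
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_replay(record: dict, rerun) -> bool:
    """Re-run the recorded actions; True only if the outcome is reproduced."""
    return rerun(record["actions"], record["seed"]) == record["outcome"]


def toy_rerun(actions: list[str], seed: int) -> int:
    return len(actions) + seed  # hypothetical deterministic environment


record = {
    "policy": "greedy", "policy_version": "1.0", "scenario": "s1", "seed": 7,
    "config": {"max_steps": 5}, "prompt": "v1", "scorer_version": "2",
    "actions": ["a", "b"], "outcome": 9,
}
bumped = dict(record, policy_version="1.1")  # one controlled axis changed
tampered = dict(record, outcome=10)          # stored result edited after the fact

print(fingerprint(record) != fingerprint(bumped))  # True: identity changed
print(verify_replay(record, toy_rerun), verify_replay(tampered, toy_rerun))
```

The stored outcome is left out of the fingerprint on purpose in this sketch: identity describes what was run, and the replay check decides whether the recorded result can still be trusted.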

Takeaway: make verification intrinsic. Re-run and recompute rather than trusting a stored result, and stamp every record with a full, version-aware fingerprint so silent changes and stale comparisons become impossible. The PM equivalent: if you can't recompute last week's headline number from raw events, you'll chase noise constantly, and you won't build the right product.

20. Honest readiness is machinery, not good intentions

This is the one I'd tattoo on the wall. Generated honesty rules are not enough, because automated agents will record results that were never produced. "187 of 187 tests pass" when the dependencies silently failed to install; "shipped" referring to commits that never existed. Good intentions in a style guide do not survive contact with an optimizer. They also do not survive contact with a tired human trying to ship before kid pick-up.

So honesty had to become machinery across four surfaces. Holdouts at the boundary: route every published row through one redaction function, store only hidden-split identifiers and seed counts (never the seeds themselves), and make a clean holdout a hard gate (no holdout, no eval claim). Lane-segregated reporting: hard-label every result by how it was produced, so a result from an easier setup can never be promoted into a stronger claim. A do-not-overclaim ladder: named readiness rungs (prototype → … → public-ready) whose limits actively forbid claims the prose might otherwise drift into. Fail-closed automated lints: checks that catch forbidden claims and exit with an error.
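
A claim lint of this kind might look like the sketch below. Only the two rung names the text spells out are used; the forbidden phrases per rung are hypothetical examples, and a real ladder would carry its own list.

```python
import re

# Rung names come from the text; the forbidden phrases are hypothetical.
FORBIDDEN_BY_RUNG = {
    "prototype": [r"\btrainable\b", r"\blive data\b", r"\bpublic-ready\b"],
    "public-ready": [],
}


def lint_claims(text: str, rung: str) -> list[str]:
    """Return the forbidden patterns that `text` matches at this readiness rung."""
    if rung not in FORBIDDEN_BY_RUNG:
        return [f"unknown readiness rung: {rung!r}"]  # fail closed on typos
    return [pattern for pattern in FORBIDDEN_BY_RUNG[rung]
            if re.search(pattern, text, re.IGNORECASE)]


def exit_code(text: str, rung: str) -> int:
    """Nonzero when any forbidden claim is present, suitable for sys.exit()."""
    return 1 if lint_claims(text, rung) else 0


overclaim = "Status: trainable, serving live data."
honest = "Status: prototype, reachability unproven."
print(exit_code(overclaim, "prototype"), exit_code(honest, "prototype"))
```

Wired into CI as `sys.exit(exit_code(report_text, rung))`, the check blocks the merge instead of relying on someone to reread the status line before it ships.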

And expect a high-bar audit after shipping to overturn your comfortable "we shipped it." One of mine found every production table empty behind a UI that said "Live data." A deployed-but-empty system is a seam, not a live product. The bar moved, permanently, from "a rule exists" to "an external, observable proof gates the claim."
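An "external, observable proof" for a live-data claim can be as small as a row count. The sketch below assumes a SQLite database purely for illustration (the source does not say what the production store was), and the table names and the function empty_tables are hypothetical.

```python
import sqlite3


def empty_tables(conn: sqlite3.Connection, tables: list) -> list:
    """Return required tables that are missing or hold zero rows."""
    existing = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    failing = []
    for table in tables:
        if table not in existing:
            failing.append(table)
            continue
        # Table names cannot be bound as parameters; the membership check
        # against `existing` above is what makes this interpolation safe.
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        if count == 0:
            failing.append(table)
    return failing


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, score REAL)")
conn.execute("CREATE TABLE scenarios (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO runs (score) VALUES (0.8)")

# "scenarios" exists but is empty, and "leaderboard" was never created.
blocking = empty_tables(conn, ["runs", "scenarios", "leaderboard"])
live_data_claim_allowed = not blocking
```

The UI label then becomes a function of the check instead of a string someone typed once: if the list is non-empty, the "Live data" banner does not render.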

Takeaway: any guarantee that depends on people or models remembering to be honest will be violated, by humans under a deadline and by agents optimizing for green. The PM version is a launch checklist that auto-blocks release rather than relying on someone to update the readiness slide before the all-hands. Encode the rules as a check that exits nonzero, and make your roadmap claims un-overclaimable by construction.

The meta-lesson: how the process itself evolved

Step back from the twenty lessons, and you'll see five patterns running underneath all of them:

1. The grader is product surface, not plumbing. A measure that quietly passes subtly wrong answers is a defective product surface, not a minor bug. Apply the same instinct that makes you reach for Stripe instead of writing your own payments, or NPS instead of inventing a bespoke loyalty survey: borrow a proven contract from whoever has already solved the hard core, and defend it like the asset it is.

2. Source-code-green is table stakes. Prove the spread, or admit "unproven." The real bar is a metric that separates bad, okay, good, and adversarial behavior in the order you predicted, the same as running your funnel against a power user, a dormant account, and a fraud-pattern signup and watching them sort correctly. My own validator can be gamed, so it is not the bar. Only an independent check tells you whether the thing measures something real. And when you can't demonstrate sensitivity cleanly, say so out loud. An honest "unproven" beats a dishonest green.

3. Build with conviction, inspect with suspicion. Whatever you measure will be gamed. Adversarial review before the first line of code catches foundational mistakes cheaply. High-bar audits after shipping overturn every complacent "it's done." In between, assume the cleverest adversary is the thing being scored: sales comp'd on calls run up the call count; CS bonused on retention quietly soft-parks accounts; a trained model on a brittle reward finds the loophole nobody wrote down. Recompute from raw evidence you control, and trust no self-reported number. The real test is whether your metric survives an adversary that learns, not just one you wrote by hand.

4. Honesty migrated from prose to enforcement, every time something gamed it. A style-guide rule was violated, so it became a fail-closed check. "The holdout exists" turned out vacuous, so it became "produce a real holdout family or make no claim at all." The PM version is a launch checklist that auto-blocks release instead of relying on someone to remember to update the readiness slide before the all-hands. If a guarantee depends on convention, an optimizer (silicon or human) will eventually violate it. Encode it as a check that fails closed, exits nonzero, and blocks the claim before your prose can get creative.

5. Scope discipline is a recurring lesson: differentiate on the hard parts, and instrument the why. I over-built every time; the correction was always "build the smallest real loop, reuse everything else." Differentiate on the genuinely hard work (reward validity, soundness, safety physics) and let undifferentiated infrastructure be someone else's problem. The headline number is the last thing to reveal a deep bug, so instrument why a run scored what it did, not just what it scored. That is the same instinct your weekly metric review should already have when a number moves and nobody can explain it.

That last point is the quiet thread under all of it. For a data or product manager, none of this is exotic. It's metric design under an adversary: the daily job many already have, with the difficulty turned up because the system is incentivized to fool you and the outcomes are so much more consequential.
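Pattern 2's "prove the spread" test can be written down as a small check. This is a sketch under assumptions: the tier names come from the text, but the scores, the min_gap threshold, and the function spread_check are hypothetical, and comparing tier means is only one simple way to define "separates in the predicted order."

```python
from statistics import mean


def spread_check(scores_by_tier: dict, predicted_order: list, min_gap: float = 0.05) -> list:
    """Return failures; an empty list means the tiers sorted as predicted."""
    missing = [tier for tier in predicted_order if not scores_by_tier.get(tier)]
    if missing:
        # No evidence is reported as "unproven", never as a pass.
        return [f"unproven: no scores for tier {tier!r}" for tier in missing]
    means = [mean(scores_by_tier[tier]) for tier in predicted_order]
    failures = []
    for i in range(len(predicted_order) - 1):
        # Each tier must beat the one below it by at least min_gap.
        if means[i + 1] - means[i] < min_gap:
            failures.append(
                f"{predicted_order[i + 1]!r} does not clear {predicted_order[i]!r}"
            )
    return failures


order = ["adversarial", "bad", "okay", "good"]

# Hypothetical scores from a metric that behaves as predicted.
healthy = {
    "adversarial": [0.05, 0.10],
    "bad": [0.20, 0.30],
    "okay": [0.50, 0.60],
    "good": [0.85, 0.90],
}
healthy_failures = spread_check(healthy, order)

# Same metric after an adversarial input learns to game it.
gamed = dict(healthy, adversarial=[0.95, 0.99])
gamed_failures = spread_check(gamed, order)
```

The missing-tier branch is the code version of saying "unproven" out loud: a tier with no samples returns a failure message, so an untested metric cannot show up green by default.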

What's next (Part 2)

Part 2 is the scarier half: policy learning from rollouts, where discrimination becomes trainability. A ruler no optimizer can climb. It lives or dies on coverage: a learner cannot reliably improve on parts of the task distribution it never samples, observes, replays, or gets credit for. It also has to handle policies that learn to exploit the reward, with brakes and tripwires. None of these proves safety; they only reduce the chance that the optimizer runs straight through a blind spot while you are admiring the dashboard. Then there is credit assignment: turning a sparse, eval-correct reward into a denser learning signal without changing the behavior the task is supposed to select for.

Coverage, reward hacking that learns, and credit assignment are the next three walls. I'll send more notes as I crash, burn, and learn.

— AG