Flaky Tests Need Evidence Before Edits
A flaky test is not the same thing as a broken feature. If you let an AI agent jump straight to a code change, you can end up fixing the symptom instead of the cause.
That is why flaky test triage should start with evidence. Claude Code and Codex are useful here because they can inspect the failing test, compare recent changes, and help you separate a real regression from a timing issue, environment issue, or test-order problem.
Junction helps by keeping the run local, streaming the output live, and making the resulting diff easy to review before anyone merges it.
First Decide What Kind Of Failure You Have
Before asking an agent to patch anything, classify the failure:
- a real regression in application code,
- a fragile assertion in the test itself,
- a timing or race condition,
- an environment-specific failure,
- or a test that depends on shared state.
That classification tells you whether the fix belongs in the product code, the test, or the surrounding harness.
If you skip this step, the agent may make the test pass without explaining why it failed in the first place.
Ask The Agent For Evidence, Not A Guess
The best first prompt is not "fix the test." It is something like:
Inspect the failing test and the recent code changes.
Tell me whether this looks like a real regression or a flaky test.
If the issue is in the test, propose the smallest safe fix.
If the issue is in the code, explain the observed failure path before editing.

That keeps the run in analysis mode until there is enough context to change code safely.
For an ordinary bug, the agent can move from diagnosis to edit quickly. For a flaky test, it often needs one more pass to prove that the failure is reproducible before the fix is credible.
A Good Triage Sequence
A practical triage sequence looks like this:
- Run the failing test and capture the exact error.
- Re-run it enough times to see whether the failure pattern changes.
- Check whether recent edits touched the same code path.
- Inspect the assertion, setup, and teardown logic.
- Decide whether the test should be stabilized or the product code should change.
- Apply the smallest patch that preserves the test's signal.
The goal is not to silence the test. The goal is to restore trust in it.
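The re-run step in this sequence can be sketched as a tiny harness. This is a minimal illustration, not a real runner: `flaky_test` and its seeded randomness are stand-ins for whatever your framework actually executes. An intermittent pass/fail pattern points at flakiness; a failure on every run points at a real regression.

```python
import random

def rerun(test_fn, attempts=20):
    """Run a test callable repeatedly and record the pass/fail pattern."""
    results = []
    for _ in range(attempts):
        try:
            test_fn()
            results.append("pass")
        except AssertionError:
            results.append("fail")
    return results

# Hypothetical flaky test: a seeded RNG makes it fail on some runs only.
rng = random.Random(42)

def flaky_test():
    assert rng.random() > 0.3  # fails on roughly 30% of draws

pattern = rerun(flaky_test)
print(f"{pattern.count('fail')} failures in {len(pattern)} runs")
```

A mixed pattern like this is the evidence you want before touching any code: it rules out a deterministic regression and shifts attention to timing, ordering, or shared state.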
Example: Timing-Sensitive Assertions
Suppose a test expects a status update to appear immediately after an asynchronous save, but the output is sometimes one tick late.
A good agent-assisted fix might:
- wait for the actual state transition instead of a guessed delay,
- assert on the final state rather than an intermediate render,
- or adjust the test helper so it waits for the event the code really emits.
A bad fix would be to add a broad sleep or loosen the assertion until the test always passes. That turns a flaky test into a weak test.
This is where code review matters. The diff should explain why the failure happened and why the new assertion is still meaningful.
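The good fix above can be made concrete with a polling helper. This is a sketch under assumptions: `Store`, its `save` method, and the timestamp-based "background worker" are invented stand-ins for whatever async machinery your code really has. The point is that the test waits for the actual state transition and then asserts on the final state, instead of sleeping a guessed amount.

```python
import time

def wait_for(condition, timeout=2.0, interval=0.01):
    """Poll until condition() is truthy; raise with a clear message on timeout.

    Unlike a fixed sleep, this waits for the real transition, so the
    assertion still means something when it passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise AssertionError(f"condition not met within {timeout}s")

# Hypothetical async save: status flips to "saved" shortly after save() returns.
class Store:
    def __init__(self):
        self.status = "dirty"

    def save(self):
        # Simulate a background worker that finishes ~50ms later.
        self._done_at = time.monotonic() + 0.05

    @property
    def is_saved(self):
        if self.status == "dirty" and time.monotonic() >= self._done_at:
            self.status = "saved"
        return self.status == "saved"

store = Store()
store.save()
wait_for(lambda: store.is_saved)  # good: wait for the event the code emits
assert store.status == "saved"    # good: assert on the final state
```

Compare this with `time.sleep(1)` followed by the same assertion: that version passes today, hides the race, and fails again the first time the machine is slow.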
When To Fix The Test And When To Fix The Code
Fix the test when:
- the product behavior is correct,
- the assertion is too strict,
- the test depends on timing or order in a way the product does not guarantee,
- or the test setup leaks state between runs.
Fix the code when:
- the test is correctly describing intended behavior,
- the code is genuinely wrong,
- or the failure exposes a real edge case the test was already guarding.
If you cannot tell which bucket you are in, stop and keep gathering evidence. A flaky test is a place where haste produces false confidence.
How Junction Helps During Triage
Junction is useful because it keeps the failing run visible while the agent works. You can watch the output stream, inspect the changed files, and stop the session if the fix starts to drift.
That matters for flaky tests because the first candidate fix is often wrong. The browser control surface gives you a chance to review the evidence before the patch gets too far from the original failure.
If the command the agent wants to run is unclear or too broad, use the guidance in How to Approve AI Agent Actions Safely before you let the run continue.
Tradeoffs
There is a temptation to treat flaky tests as purely mechanical. They are not.
The hard part is deciding whether the test is too brittle or the code is under-specified. An agent can help, but it cannot make that judgment without clear signals from the repo and the run.
The other tradeoff is maintenance. If the fix makes the test more tolerant without improving the underlying timing or state model, the next failure will arrive later and be harder to diagnose.
Where Junction Fits
Junction gives Claude Code and Codex a local place to inspect the repo, patch the narrow failure, and show the diff before merge. That keeps the debugging loop close to the machine where the bug exists instead of pushing the work into a cloud sandbox.
If you are setting this workflow up from scratch, start with the setup guide. For related supervision habits, reviewing AI agent diffs from your phone is a good companion to this post.