A web performance audit has two phases with very different natures. The first is collecting data: measuring LCP, CLS, INP, TTFB, analyzing resources, detecting anti-patterns. The second is interpreting that data: identifying which problems matter most, explaining the mechanism behind each one, and proposing solutions prioritized by impact.
AI helps in both phases, but in completely different ways. Confusing them leads to unreliable results.
Measurement doesn’t improvise
In earlier articles I covered the WebPerf Snippets Agent SKILLs and structured returns for agents. The core idea is that the agent doesn’t generate JavaScript on the fly: it reads a predefined, tested, and validated script, and executes it via Chrome DevTools MCP.
skills/webperf-core-web-vitals/
├── SKILL.md
└── scripts/
    ├── LCP.js   # Returns { value, rating, element, url }
    ├── CLS.js   # Returns { value, rating, sources[] }
    ├── INP.js   # Returns { value, rating, interaction }
    └── ...
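What does a SKILL file contain? As a minimal sketch of the pattern — this is an illustration, not the actual contents of the nucliweb/webperf-snippets files — a SKILL.md might read:

```markdown
## Workflow: Core Web Vitals

1. Run scripts/LCP.js, scripts/CLS.js, and scripts/INP.js via Chrome DevTools MCP.
2. If LCP rating is "needs-improvement" or "poor", run scripts/LCP-Sub-Parts.js.
3. Never generate measurement code inline; only execute the scripts above.
4. Report the raw JSON results verbatim before interpreting them.
```

The point is that the conditional logic is written down once, in prose the agent follows, instead of being re-derived by the model on every run.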
The output of LCP.js is not console text formatted for a human. It’s a structured JSON object the agent processes directly:
{
  "metric": "LCP",
  "value": 3840,
  "rating": "needs-improvement",
  "element": "IMG",
  "url": "https://cdn.perf.reviews/hero.jpg",
  "renderTime": 3840,
  "loadTime": 2100
}
The agent knows that rating: "needs-improvement" triggers the LCP workflow. That workflow tells it the next step is to run LCP-Sub-Parts.js to break down the time into TTFB, load delay, load time, and render delay. The decision logic lives in the SKILL, not in the LLM.
This matters because a measurement that varies depending on how the model interprets the code that day is not a measurement. It’s an estimate. In performance diagnosis, we need the former.
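That SKILL-side decision logic can be pictured as a small lookup: ratings come from the standard Core Web Vitals thresholds, and each degraded metric maps to a follow-up script. A sketch of the idea — the function names and the exact mapping are mine, not the SKILL's actual contents:

```javascript
// Standard Core Web Vitals thresholds (good / needs-improvement boundaries).
const THRESHOLDS = {
  LCP: { good: 2500, poor: 4000 },  // milliseconds
  INP: { good: 200, poor: 500 },    // milliseconds
  CLS: { good: 0.1, poor: 0.25 },   // unitless layout-shift score
  TTFB: { good: 800, poor: 1800 },  // milliseconds
};

// Follow-up scripts triggered when a metric is degraded (illustrative mapping).
const FOLLOW_UPS = {
  LCP: "LCP-Sub-Parts.js",
  TTFB: "TTFB-Sub-Parts.js",
};

function rate(metric, value) {
  const t = THRESHOLDS[metric];
  if (value <= t.good) return "good";
  if (value <= t.poor) return "needs-improvement";
  return "poor";
}

// Returns the list of extra scripts the agent should run next.
function nextScripts(results) {
  return Object.entries(results)
    .filter(([metric, value]) => rate(metric, value) !== "good")
    .map(([metric]) => FOLLOW_UPS[metric])
    .filter(Boolean);
}
```

With the example above, `rate("LCP", 3840)` returns `"needs-improvement"`, so `nextScripts({ LCP: 3840 })` returns `["LCP-Sub-Parts.js"]`: a deterministic trigger, not a model-by-model judgment call.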
From data to diagnosis
With metrics collected reliably, the work begins where AI does have freedom: interpreting, connecting, and prioritizing.
The agent receives, for example, these results after analyzing a product page:
{
  "LCP": { "value": 4200, "rating": "poor", "element": "IMG" },
  "CLS": { "value": 0.08, "rating": "needs-improvement" },
  "INP": { "value": 180, "rating": "good" },
  "TTFB": { "value": 820, "rating": "poor" },
  "render_blocking": ["fonts.googleapis.com/css2", "vendor.min.css"],
  "preload_async_conflicts": 3
}
From here, the AI is not measuring: it’s reasoning. The high TTFB explains part of the poor LCP. Render-blocking resources add latency before painting begins. The preload + async conflicts create priority pressure on resources that don’t deserve it. The elevated CLS is probably related to the blocking font resources.
This kind of cross-analysis, where one data point explains another, is exactly where a well-contextualized agent provides real value: it connects scattered information and generates a coherent diagnosis that would otherwise require several manual iterations.
The prompt for this step is straightforward:
Analyze the audit results for https://perf.reviews/products/item-123
and generate a report with the issues found, ordered by impact on
user experience, with a specific solution for each one.
The agent has the structured data, the SKILL thresholds as reference, and the context of how metrics relate to each other. The resulting report is not generic: it’s specific to the data from that page.
The complete audit workflow
With Claude Code or Gemini, Chrome DevTools MCP, and the WebPerf SKILLs installed, the complete audit workflow goes from a series of scattered manual actions to a process the agent runs end to end:
1. Navigate to the URL (navigate_page)
2. Measure Core Web Vitals → run LCP.js, CLS.js, INP.js
3. If LCP > threshold → run LCP-Sub-Parts.js
4. If TTFB > threshold → run TTFB-Sub-Parts.js
5. Analyze resources → run Render-Blocking.js, Preload-Async-Conflicts.js
6. Collect all JSON results
7. Generate report with diagnosis, prioritization, and concrete solutions
8. Export in the required format (markdown, PDF, Slack)
Steps 1 through 6 are deterministic: the agent runs fixed scripts. Steps 7 and 8 are where the AI generates content from reliable data. The separation is not arbitrary: it guarantees that the final report reflects what is actually on the page, not what the model thinks might be there.
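The deterministic half (steps 1–6) can be sketched as a loop whose only injected dependency is the function that executes a script in the page. Here `runScript` is an assumption standing in for whatever call runs a SKILL script through Chrome DevTools MCP and returns its parsed JSON:

```javascript
// Sketch of the deterministic half of the audit (steps 1–6).
// `runScript(name)` is assumed to execute a SKILL script in the page
// via Chrome DevTools MCP and return its parsed JSON result.
async function collectAudit(runScript) {
  const results = {};

  // Step 2: Core Web Vitals from fixed, pre-validated scripts.
  for (const script of ["LCP.js", "CLS.js", "INP.js", "TTFB.js"]) {
    const r = await runScript(script);
    results[r.metric] = r;
  }

  // Steps 3–4: conditional sub-part breakdowns.
  if (results.LCP.rating !== "good") {
    results.LCP.subParts = await runScript("LCP-Sub-Parts.js");
  }
  if (results.TTFB.rating !== "good") {
    results.TTFB.subParts = await runScript("TTFB-Sub-Parts.js");
  }

  // Step 5: resource analysis, always run.
  results.renderBlocking = await runScript("Render-Blocking.js");
  results.preloadConflicts = await runScript("Preload-Async-Conflicts.js");

  return results; // Step 6: one JSON object handed to the AI for steps 7–8.
}
```

Everything up to the `return` is fixed; the model never decides *how* to measure, only receives the result.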
The report output format is where the agent has the most freedom. The same set of metrics can become a technical document for the engineering team, an executive summary for business stakeholders, or a Slack alert when a metric crosses a threshold. The content is the same; the agent adapts the presentation to the audience.
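As a sketch of that adaptation, two illustrative renderers over the same metrics object (the function names and formats are mine, not part of the SKILLs):

```javascript
// Same metrics, two audiences. Both renderers consume the identical object.
function toMarkdown(metrics) {
  const rows = Object.entries(metrics)
    .map(([name, m]) => `| ${name} | ${m.value} | ${m.rating} |`);
  return ["| Metric | Value | Rating |", "| --- | --- | --- |", ...rows].join("\n");
}

function toSlackAlert(metrics) {
  // Alerts only mention what crossed into "poor"; silence otherwise.
  const bad = Object.entries(metrics).filter(([, m]) => m.rating === "poor");
  if (bad.length === 0) return null;
  return bad.map(([name, m]) => `:warning: ${name} is poor (${m.value})`).join("\n");
}
```

The measurement pipeline doesn't change at all between the two; only the last step does.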
Regression testing: from one-off audits to continuous monitoring
A one-off audit captures the state of a page at a given moment. The real value comes when that audit becomes a baseline against which to compare future changes.
The regression workflow starts from the same set of deterministic scripts:
1. Initial audit → save metrics as baseline (baseline.json)
2. After each deployment → run the same scripts on the same environment
3. Compare new values against baseline
4. If there is a regression in any metric → identify what changed and in which direction
5. Generate alert with change diagnosis
The key point is step 4. The agent doesn’t just detect that LCP worsened from 2.1s to 3.8s: it cross-references that information with recent changes (git log, recent PRs, deployments) and generates a diagnosis like “LCP worsened by 1.7s after introducing a new font resource without font-display: swap in commit abc123”.
// baseline.json
{
  "url": "https://perf.reviews/",
  "date": "2026-03-01",
  "metrics": {
    "LCP": 2100,
    "CLS": 0.04,
    "INP": 120,
    "TTFB": 380
  }
}

// audit-2026-03-23.json
{
  "url": "https://perf.reviews/",
  "date": "2026-03-23",
  "metrics": {
    "LCP": 3800,
    "CLS": 0.04,
    "INP": 125,
    "TTFB": 390
  },
  "regressions": [
    {
      "metric": "LCP",
      "baseline": 2100,
      "current": 3800,
      "delta": "+1700ms",
      "severity": "critical"
    }
  ]
}
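The comparison itself (steps 3–4) is pure data work. A minimal sketch that produces a regressions array of this shape — the 10% noise floor and the 50% "critical" cutoff are assumptions to tune against your own error budget, not part of the workflow:

```javascript
// Compare a fresh audit against the saved baseline (steps 3-4).
// ASSUMPTIONS: deltas within 10% of baseline are treated as noise;
// a degradation of more than 50% is "critical", anything above noise is "warning".
function findRegressions(baseline, current) {
  const regressions = [];
  for (const [metric, base] of Object.entries(baseline.metrics)) {
    const now = current.metrics[metric];
    const delta = now - base;
    if (delta / base <= 0.1) continue; // within noise, ignore
    regressions.push({
      metric,
      baseline: base,
      current: now,
      delta: `+${delta}${metric === "CLS" ? "" : "ms"}`, // CLS is unitless
      severity: delta / base > 0.5 ? "critical" : "warning",
    });
  }
  return regressions;
}
```

Run against the two files above, only LCP clears the noise floor (+1700ms, an 81% degradation), so the output is the single critical entry shown.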
The prompt for regression analysis:
Compare the current audit of https://perf.reviews against the baseline in
baseline.json. Identify regressions, estimate their impact on Core Web Vitals,
and suggest which recent changes may have caused them.
This approach turns the audit into a continuous process: not a quarterly review, but an automatic check that acts as a safety net before a problem reaches production.
AI as a junior auditor that never improvises measurements
There is a distinction worth making explicit: the agent doesn’t replace the judgment of the person doing the audit. It replaces the repetitive, mechanical parts that consume time without adding analytical value.
Navigating to eight different URLs and running the same scripts on each is mechanical work. Detecting that four of them have preload + async conflicts all coming from the same origin is analytical work the agent does well when it has reliable data. Deciding whether the solution is to remove the preloads or add fetchpriority="low" based on the role of those resources on each specific page is work for whoever knows the product.
The combination of Gemini CLI with Chrome DevTools MCP for Long Task debugging follows the same logic: the agent connects directly to the browser instead of guessing from visual charts. Source maps give it the original source context, so the diagnosis is precise. The judgment of which fix to apply remains human.
AI speeds up data collection and report generation. Interpreting what matters most in the context of each product still requires domain knowledge. With the right tools, the person doing the audit spends their time on analysis and prioritization, not running repetitive scripts.
Conclusion
Separating measurement (deterministic, script-based) from analysis (contextual, AI-generated) is the distinction that makes this approach work. The scripts ensure the agent measures the same thing every time, in the same way, regardless of the model or session. The structured data they produce is the input that lets the agent reason over it with precision.
Continuous regression testing is the natural extension: once you have a reliable baseline and reproducible scripts, automating the comparison is trivial. The agent moves from a point-in-time audit tool to an early warning system.
The WebPerf SKILLs are available via npx -y skills add nucliweb/webperf-snippets. The debugging workshop repository with Gemini is at nucliweb/webperf-debugging-devtools-mcp.
If you want us to implement this workflow in your product, at perf.reviews we do web performance audits with this same approach.