Victor Ng | AI Workflows Case Study

50%

Design to Ship Cycle Reduced

10+

Frontend Fixes Shipped in 1 Week

18

Usability Issues Surfaced

Building the testing agent was the easy part. What changed things was owning the whole loop, from automated testing all the way to learning to ship the frontend fixes myself. Once a designer can put fixes into production, product quality stops waiting on engineering bandwidth.

Context

On Ads Manager, every team owns a flow. Mine, App & Gaming, owns the experience advertisers use to promote an app. As the team's Usability Lead, I'm responsible for keeping that flow at the highest quality bar in the org.

That bar is enforced two ways, and both are manual. Internally, my team and our design leads run quarterly dogfooding sessions, testing our own flow the way a customer would. Externally, a central usability team scores every team's flow against a rigorous rubric. Between them, that's the standard I own.

Through H1 2026, as designers across the org began adopting AI, I spent the cycle testing tools and asking a sharper question: how could I streamline this entire process for my team, not just speed up a single step of it?

Before · the manual loop

Manual dogfood → Task every finding → Prioritization session → Usability roadmap → UI fixes cut for bandwidth

Where it leaks

Engineering bandwidth is finite, so the "low priority" work, especially small UI fixes, rarely survives prioritization. Each fix that gets cut becomes usability debt that never gets paid down. Learning to vibe code as a designer closes that gap: I find the issue, then ship the fix myself.

After · with the dogfooding agent

AI dogfoods both platforms → Findings scored & filed automatically → Quick UI fixes shipped by design → Roadmap holds only the deep work

Most of this work lived in systems design, cognitive modeling, and service design, not in Figma.

01

Eliminate inconsistency

Manual dogfooding (testing our own product the way a customer would) varies wildly from tester to tester. I designed a structured CAPTURE → EVALUATE → ACT protocol that evaluates the same surfaces, with the same rigor, every time.
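To illustrate why a fixed protocol removes tester-to-tester variance, the three stages can be sketched as a small pipeline that walks the same surfaces in the same order on every run. Everything in this sketch is hypothetical: the surface names, the `capture`, `evaluate`, and `act` functions, and the single stand-in check are assumptions for illustration, not the agent's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical surface list; the real surfaces live in the skill's flow definition.
SURFACES = ("objective_selection", "app_selection", "ad_set_settings")


@dataclass(frozen=True)
class Finding:
    surface: str
    issue: str


def capture(surface: str, screens: dict[str, dict]) -> dict:
    """CAPTURE: record the observable state of one surface."""
    return {"surface": surface, **screens.get(surface, {})}


def evaluate(snapshot: dict) -> list[Finding]:
    """EVALUATE: apply the same checks to every snapshot."""
    findings = []
    # Stand-in check: a surface with no visible next step is flagged.
    if not snapshot.get("has_next_step", False):
        findings.append(Finding(snapshot["surface"], "no clear next step"))
    return findings


def act(findings: list[Finding]) -> list[str]:
    """ACT: turn findings into task titles ready to file."""
    return [f"[{f.surface}] {f.issue}" for f in findings]


def run_session(screens: dict[str, dict]) -> list[str]:
    """Walk every surface in a fixed order so two runs are comparable."""
    tasks: list[str] = []
    for surface in SURFACES:
        tasks.extend(act(evaluate(capture(surface, screens))))
    return tasks


# Demo input: ad_set_settings is deliberately absent, so it is flagged too.
screens = {
    "objective_selection": {"has_next_step": True},
    "app_selection": {"has_next_step": False},
}
print(run_session(screens))
```

Because the surface list and the checks are fixed in code rather than left to whoever is testing, the same input always produces the same task list.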

02

Scale coverage

One person can now run several sessions a day across different personas and platforms, each one producing a structured report where every issue is already graded against Meta's product quality rubric. What used to take a week takes an afternoon.

03

Embed the user's perspective

The skill doesn't just check whether things work. It plays the part of specific advertiser archetypes, catching confusing defaults, jargon, and missing controls that functional testing walks right past.

Directing the LLM

Building this meant handing an LLM a long task full of judgment calls and getting consistent results. The prompts keep the agent in character for a whole session, get it to tell bugs apart from design problems, and stop it reaching for lazy AI habits like defaulting to "add a tooltip".

Automating a task everyone thinks needs a human

UX evaluation is assumed to need a person in the loop. I broke "judgment" down into concrete questions a model can actually check: can the persona tell what this page is for? Do they know what to do next? Does the language match their technical level?
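One way to picture that decomposition is a checklist where each question gets an explicit pass or fail plus the evidence behind it. The sketch below is an assumption about how such a checklist could be modeled: the three questions come from the text above, while the `Answer` type, the `screen_verdict` function, and the sample evidence strings are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# The three questions are taken from the case study; the data model is an assumption.
QUESTIONS = (
    "Can the persona tell what this page is for?",
    "Do they know what to do next?",
    "Does the language match their technical level?",
)


@dataclass(frozen=True)
class Answer:
    question: str
    passed: bool
    evidence: str  # what on the screen supports the answer


def screen_verdict(answers: list[Answer]) -> dict:
    """Collapse per-question answers into one reviewable verdict."""
    asked = {a.question for a in answers}
    missing = [q for q in QUESTIONS if q not in asked]
    if missing:
        # Refuse to score a screen until every question has been answered.
        raise ValueError(f"unanswered questions: {missing}")
    failed = [a for a in answers if not a.passed]
    return {
        "passed": not failed,
        "failed_questions": [a.question for a in failed],
        "evidence": [a.evidence for a in failed],
    }


# Illustrative answers for one screen; the evidence text is made up for the demo.
answers = [
    Answer(QUESTIONS[0], True, "Page title names the campaign objective"),
    Answer(QUESTIONS[1], False, "Clicking the unconnected app does nothing"),
    Answer(QUESTIONS[2], True, "Labels avoid unexplained acronyms"),
]
verdict = screen_verdict(answers)
print(verdict["failed_questions"])
```

Requiring evidence for every answer keeps each verdict checkable by a human, which matters because every flagged issue still gets verified manually.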

Grounded in internal design knowledge

The personas and the rubric weren't invented from scratch. I distilled them from Meta's own design system, encoding its evaluation frameworks into the agent's prompt: the Product Quality Scorecard (PQS), Meta's rubric for grading UX quality; the Usability Playbook (UPB), a 0 to 100 usability score; and our team's app promotion pattern docs. The agent now grades against the same standards a designer on my team would. The hard part was the translation: taking design judgment that usually lives in people's heads and writing it down as an explicit system an LLM can run the same way every session.

How the agent learned our standards

Five sources of internal design knowledge, the same standards my team already used to judge quality, became a single evaluation protocol the agent runs every session.

Evaluation Criteria

What to inspect at every step of the flow.

Quality Axes (PQS)

The dimensions Meta grades UX on.

Product Quality Process

Severity, effort sizing, and triage rules.

Usability Scoring (UPB)

Meta's 0 to 100 usability score.

Design System Patterns

App promotion pattern docs & conventions.

The Dogfooding Agent

Takes on a persona, checks every screen against those standards, and writes up what it finds.

↓

Filed tasks

Classified tickets on the engineering board.

Prioritized

Severity scored and effort sized.

Screenshots

Captured and attached per issue.

Tracker log

A row per session in our sheet.

Filed straight into the tools the team already uses, ready to triage the moment a run ends.

First Time Priya

Small Business Owner

name: "First Time Priya"
archetype: "Small business, first app campaign on Meta, based in India"
evaluation_lens:
  - "Is the terminology understandable for someone whose only reference point is Google UAC?"
  - "Are the defaults safe for a small budget?"
system_instruction: |
  You must stay in character for the entire session.
  Do NOT evaluate based on standard UI heuristics.
  Evaluate strictly based on Priya's knowledge gaps.

Enterprise Marcus

Senior UA Manager ($2M+/mo)

name: "Enterprise Marcus"
archetype: "Senior UA Manager running $2M+/month in app campaigns, based in US"
evaluation_lens:
  - "Are power-user controls (SKAdNetwork, optimization events) easily accessible?"
  - "Does the system try to automate things I need manual control over (e.g. Advantage+)?"
system_instruction: |
  You must stay in character for the entire session.
  Evaluate strictly based on Marcus's need for control, measurement confidence, and efficiency.

The same screen, two verdicts

I model personas as cognitive states, not demographics. Run the same screen through Marcus and Priya and the criteria flip: one default reads as a power user's frustration and a beginner's relief at the same time. That tension is exactly what product teams have to design around.

Marcus's column is from the real Run 1 session. Priya's column is a projected read of those same screens through her documented model. It stays illustrative until her session runs.

| Real screen (Run 1) | Enterprise Marcus · measured | First Time Priya · projected |
|---|---|---|
| Advantage+ defaults ON | Friction. Hidden automation he'd want to disable. | Relief. The "let the system handle it" she expects from Google. |
| SKAdNetwork not visible | Blocking. Needs full control to protect measurement. | Hidden risk. She'll mismeasure iOS and never know why. |
| Unconnected app, click does nothing | Friction. Knows to leave and use the App Dashboard. | Dead end. The most likely abandonment point in the flow. |

Advantage+ defaulting ON is a retention risk and a retention aid at the same time, depending only on who's looking. The persona system catches that contrast on every run, instead of leaving it to whoever happens to be testing.
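A minimal sketch of how that contrast could be detected mechanically: record each persona's verdict per screen, then flag the screens where one persona reads a problem and the other does not. The verdict labels are copied from the comparison above (Priya's are projected, not measured). The `NEGATIVE` set and the `conflicting_screens` function are assumptions made for this illustration.

```python
from __future__ import annotations

# Verdict labels copied from the comparison; Priya's column is projected, not measured.
VERDICTS = {
    "Advantage+ defaults ON": {
        "Enterprise Marcus": "Friction",
        "First Time Priya": "Relief",
    },
    "SKAdNetwork not visible": {
        "Enterprise Marcus": "Blocking",
        "First Time Priya": "Hidden risk",
    },
    "Unconnected app, click does nothing": {
        "Enterprise Marcus": "Friction",
        "First Time Priya": "Dead end",
    },
}

# Assumption: which labels count as negative is a modeling choice for this sketch.
NEGATIVE = {"Friction", "Blocking", "Hidden risk", "Dead end"}


def conflicting_screens(verdicts: dict[str, dict[str, str]]) -> list[str]:
    """Return screens that are a problem for one persona but not for the other."""
    conflicts = []
    for screen, by_persona in verdicts.items():
        polarity = {label in NEGATIVE for label in by_persona.values()}
        if len(polarity) > 1:  # both a negative and a non-negative reading exist
            conflicts.append(screen)
    return conflicts


print(conflicting_screens(VERDICTS))
```

Screens where both personas struggle are plain defects. Screens returned here are the design trade-offs, where fixing the experience for one archetype could worsen it for the other.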

For its first production run, the agent took Enterprise Marcus through the full App Promotion flow on iOS and Android, back to back, in about 90 minutes. One pass surfaced 18 issues, including things our manual sessions had been walking past. A few mattered enough to show here; the point is how many it caught, and how fast.

Persona: Enterprise Marcus
Flow: E2E App Promotion
Platforms: iOS + Android
Duration: ~90 minutes

Issues surfaced: 18
Launch blocking: 4
iOS usability score: 58/100
Cross platform consistency: 2/5
Screenshots captured: 43

What manual testing missed

When we dogfood our own product, we rarely step all the way into the advertiser's shoes, so issues like these slip right past us. The agent doesn't have that problem. Each issue is filed as a task in the persona's own voice, with the structured record attached for triage.

Task 01 · filed by Enterprise Marcus

Launch Blocking Quality issue · Flow doesn't hold together

SKAdNetwork Configuration Missing

Enterprise advertisers cannot configure their iOS measurement strategy. SKAdNetwork is Apple's privacy-era system for measuring app installs, so wrong defaults quietly degrade campaign reporting. At $2M+/month, that is real measurement risk.

Enterprise Marcus

"I scrolled through the entire ad set for my iOS campaign. Where is SKAdNetwork? My entire iOS strategy depends on getting the conversion schema right. If the system is making SKAN decisions for me, I need to know what those decisions are."

Task 02 · filed by Enterprise Marcus

Launch Blocking Quality issue · Flow doesn't hold together

Android Ad Level Fails to Recognize App

After selecting an app at the ad set level, the ad level shows an error warning: "multiple apps that can't be edited together." This is a flow-breaking bug that blocks Android campaign creation.

Enterprise Marcus

"I selected one app on Google Play. Now the ad level says I have 'multiple apps that can't be edited together.' And it tells me to go back and select a store that I already selected. This is broken."

issue_id: "SKAN_Config_Missing"
quality_axis: "Coherent End to End"
classification: "Launch-Blocking"
rock_size: "Large (6+ months)"
problem: "No SKAdNetwork section exists. For an iOS campaign, SKAN configuration is foundational to measurement."
user_impact: "Enterprise advertisers cannot configure their iOS measurement strategy."
persona_quote: |
  "I scrolled through the entire ad set for my iOS campaign. Where is SKAdNetwork? My entire iOS strategy depends on getting the conversion schema right."
recommended_fix: "Surface current SKAN configuration as read-only summary in ad set."
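A structured record like this is only useful for triage if every field is reliably present. As a hedged sketch, the record could be validated before filing. The field names mirror the record above, but the `IssueRecord` class, the allowed classification values other than "Launch-Blocking", and the `task_title` format are assumptions, not the agent's real schema.

```python
from __future__ import annotations

from dataclasses import dataclass, fields

# Assumption: only "Launch-Blocking" appears in the case study; "Non-Blocking" is a placeholder.
ALLOWED_CLASSIFICATIONS = {"Launch-Blocking", "Non-Blocking"}


@dataclass(frozen=True)
class IssueRecord:
    issue_id: str
    quality_axis: str
    classification: str
    rock_size: str
    problem: str
    user_impact: str
    persona_quote: str
    recommended_fix: str

    def __post_init__(self) -> None:
        # Every field must be filled in before a record can be filed.
        empty = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if empty:
            raise ValueError(f"empty fields: {empty}")
        if self.classification not in ALLOWED_CLASSIFICATIONS:
            raise ValueError(f"unknown classification: {self.classification!r}")


def task_title(record: IssueRecord) -> str:
    """Build a one-line task title for the board (the format is hypothetical)."""
    return f"[{record.classification}] {record.issue_id}: {record.user_impact}"


record = IssueRecord(
    issue_id="SKAN_Config_Missing",
    quality_axis="Coherent End to End",
    classification="Launch-Blocking",
    rock_size="Large (6+ months)",
    problem="No SKAdNetwork section exists.",
    user_impact="Enterprise advertisers cannot configure their iOS measurement strategy.",
    persona_quote="Where is SKAdNetwork?",
    recommended_fix="Surface current SKAN configuration as read-only summary in ad set.",
)
print(task_title(record))
```

Rejecting incomplete records at creation time means anything that reaches the board already carries the severity, sizing, and persona context needed to triage it.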

Why the system beats a manual pass

A single report was never the goal. I wanted a system that runs the same way every time, faster, and catches what a tired tester on one platform would miss, like the cross platform bugs it caught running iOS and Android back to back.

| Dimension | Manual | AI driven |
|---|---|---|
| Issues per session | 3–5 typical | 18 (first session) |
| Time | 45 min – 1 hr, not exhaustive across platforms | ~90 min, full pass on both platforms |
| Cross platform bugs | Rare (needs 2 sessions + a manual diff) | 2, found automatically |
| Output | Manual table with screenshots | Structured report, scored & classified |
| Task creation | Often deferred, filed later if at all | Filed automatically to the board |
| Consistency | Varies with the tester | Identical protocol every run |

What it doesn't replace

The agent isn't a replacement for manual testing. It handles the repetitive parts and hits both platforms every run, so I spend my time on the judgment calls instead of the clicking. A human still catches what it misses, and every issue it flags is still mine to verify.

The skill is a set of modular markdown files, kept deliberately separate. The evaluation engine stays fixed while personas and flows swap in and out, so any team at Meta can fork it for their own surface in a day without touching the core logic.

SKILL.md                      # Setup wizard → launches the session
references/
├── agent.md                  # CAPTURE → EVALUATE → ACT protocol + quality rubric
├── personas/
│   ├── priya_first_time.md
│   └── marcus_enterprise.md
├── flows/
│   └── e2e_app_promotion.md
└── dogfood-connect.sh        # Drives live Ads Manager via Chrome DevTools Protocol
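The separation between a fixed engine and swappable personas and flows can be sketched as a function that assembles one session prompt from three files. The case study does not describe how the skill actually loads its files, so the `build_session_prompt` function and the placeholder file contents below are assumptions. Only the file and folder names come from the layout above.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def build_session_prompt(root: Path, persona: str, flow: str) -> str:
    """Concatenate the fixed engine with one persona file and one flow file."""
    parts = [
        root / "agent.md",                     # fixed evaluation engine
        root / "personas" / f"{persona}.md",   # swappable persona
        root / "flows" / f"{flow}.md",         # swappable flow
    ]
    missing = [str(p) for p in parts if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing skill files: {missing}")
    return "\n\n".join(p.read_text(encoding="utf-8").strip() for p in parts)


# Demo against a throwaway copy of the layout with placeholder contents.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "references"
    (root / "personas").mkdir(parents=True)
    (root / "flows").mkdir()
    (root / "agent.md").write_text(
        "PROTOCOL: CAPTURE -> EVALUATE -> ACT\n", encoding="utf-8"
    )
    (root / "personas" / "marcus_enterprise.md").write_text(
        "PERSONA: Enterprise Marcus\n", encoding="utf-8"
    )
    (root / "flows" / "e2e_app_promotion.md").write_text(
        "FLOW: E2E App Promotion\n", encoding="utf-8"
    )
    prompt = build_session_prompt(root, "marcus_enterprise", "e2e_app_promotion")

print(prompt)
```

Under this arrangement, forking the skill for another surface means adding a persona file and a flow file. The engine file is never edited.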

Iteration: two personas, not three

I started with three personas. Testing showed the middle "agency" persona was producing overlapping noise rather than distinct findings, so I cut it to two extreme archetypes: First Time Priya (catches onboarding gaps) and Enterprise Marcus (catches power user friction). Two clear voices gave far cleaner signal than three that overlapped.

App specific focus areas

I spelled out the areas the agent has to inspect (Objective selection, App selection, SKAdNetwork). Without that, it just evaluates whatever is most visually prominent. This keeps the attention on the surfaces our team actually owns.

Wired into our codebase

Our codebase is connected inside Claude, so I can spin up every app ads flow the agent needs to evaluate, generated straight from the source. That means a usability pass runs against the real states we ship, the full set of screens and edge cases a flow has to cover, not just whatever a test account happens to surface.

The agent gave me a reliable, repeatable way to surface issues. But I didn't stop at filing tickets. I tested every issue it flagged, confirmed it was real, then used vibe coding to write the frontend fix myself. An engineer on the team reviewed every diff before it landed in production.

Learning enough to close the gap between finding a problem and shipping the fix is what changed how I work. 10+ frontend changes landed in a single week, fixed by the person who understood the UX problem best instead of waiting for sprint planning.

10+

Frontend fixes shipped in one week

1

Person: detect, verify, fix, ship

0

Tickets filed to another team's backlog

The loop

Agent finds issue → I verify manually → I vibe code the fix → Engineer reviews → Shipped

↻ Repeated until all issues are resolved

One fix, end to end

A usability pass flagged the Attribution model field in ad set settings: an always-visible dropdown adding clutter to an already dense page, out of step with neighboring fields like Performance goal that stay collapsed until you need them. I converted it to the same progressive disclosure pattern, a read-only summary that expands to the full selector on click, then shipped it. An engineer reviewed the diff before it landed.

The committed diff, "Convert Attribution model to progressive disclosure pattern" (+157, −87), was committed and reviewed: a flagged usability issue going all the way to production code, written and shipped by the person who caught it.

Before · Always-visible dropdown

Attribution model shown as an always-visible dropdown in ad set settings

Open by default, adding height to an already long page.

After · shipped · Progressive disclosure

Attribution model collapsed to a read-only summary that expands on click

Collapses to a read-only summary, matching the Performance goal field. Expands to the full selector on click.

The system now runs the same way for our team's dogfooding sessions. The agent handles detection, I handle verification and fixes, and an engineer handles review. The whole design to ship cycle dropped by half.

Why this matters

When design can ship frontend, product quality stops waiting in someone else's backlog. The small fixes that used to die in prioritization get caught and corrected by the person who cares most about them. The real change here is simple: design now holds the quality bar and enforces it directly in the code.

I've built this twice now, and a third would take about a day. The real result is less a single skill than a template any team that owns a flow can stand up, tuned each session for depth over coverage.

What I'd build next

Today the skill stops once an issue is filed. Next is closing the loop: let it propose the fix, then rerun the same session after the fix ships to confirm it held. That turns a testing tool into regression detection on a schedule.
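The rerun step amounts to comparing the issue IDs from two sessions of the same protocol. A minimal sketch, assuming issue IDs stay stable between runs (which the current skill may not guarantee): `compare_runs` is a hypothetical helper, and apart from `SKAN_Config_Missing` the IDs in the demo are made up.

```python
from __future__ import annotations


def compare_runs(before: set[str], after: set[str]) -> dict[str, list[str]]:
    """Classify issue IDs by comparing a baseline run with a rerun after fixes ship."""
    return {
        "resolved": sorted(before - after),    # flagged before, gone now
        "persisting": sorted(before & after),  # not fixed yet, or the fix did not hold
        "new": sorted(after - before),         # possible regression since the baseline
    }


# "SKAN_Config_Missing" comes from the case study; the other IDs are invented for the demo.
baseline = {
    "SKAN_Config_Missing",
    "Attribution_Dropdown_Clutter",
    "Android_App_Not_Recognized",
}
rerun = {"SKAN_Config_Missing", "Android_Ad_Preview_Error"}

report = compare_runs(baseline, rerun)
print(report)
```

Run on a schedule, the "new" bucket is the regression signal, and the "resolved" bucket confirms that a shipped fix actually held.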

Where it's still thin

n = 1. The headline numbers come from one real run. Signal, not proof.
Personas validated only by me. I built them from internal docs and my own read, and they have never been checked against a real advertiser.

Streamlining Usability Workflows with AI
