Streamlining Usability Workflows with AI

As Usability Lead for Ads Manager, I built an AI workflow that changed how our team dogfoods: from automated persona testing to shipping the frontend fixes myself. It cut our design-to-ship cycle in half.

50%
Design to Ship Cycle Reduced
10+
Frontend Fixes Shipped in 1 Week
18
Usability Issues Surfaced

Building the testing agent was the easy part. What changed things was owning the whole loop, from automated testing all the way to learning to ship the frontend fixes myself. Once a designer can put fixes into production, product quality stops waiting on engineering bandwidth.

Context

On Ads Manager, every team owns a flow. Mine, App & Gaming, owns the experience advertisers use to promote an app. As the team's Usability Lead, I'm responsible for keeping that flow at the highest quality bar in the org.

That bar is enforced two ways, and both are manual. Internally, my team and our design leads run quarterly dogfooding sessions, testing our own flow the way a customer would. Externally, a central usability team scores every team's flow against a rigorous rubric. Between them, that's the standard I own.

Through H1 2026, as designers across the org began adopting AI, I spent the cycle testing tools and asking a sharper question: how could I streamline this entire process for my team, not just speed up a single step of it?

Before · the manual loop
Manual dogfood
Task every finding
Prioritization session
Usability roadmap
UI fixes cut for bandwidth
Where it leaks

Engineering bandwidth is finite, so the "low priority" work, especially small UI fixes, rarely survives prioritization. Those fixes are real usability debt that just never ships. Learning to vibe code as a designer closes that gap. I find the issue, then ship the fix myself.

After · with the dogfooding agent
AI dogfoods both platforms
Findings scored & filed automatically
Quick UI fixes shipped by design
Roadmap holds only the deep work

Automating UX Evaluation

Most of this work lived in systems design, cognitive modeling, and service design, not in Figma.

01

Eliminate inconsistency

Manual dogfooding (testing our own product the way a customer would) varies wildly from tester to tester. I designed a structured CAPTURE → EVALUATE → ACT protocol that evaluates the same surfaces, with the same rigor, every time.

02

Scale coverage

One person can now run several sessions a day across different personas and platforms, each one producing a structured report where every issue is already graded against Meta's product quality rubric. What used to take a week takes an afternoon.

03

Embed the user's perspective

The skill doesn't just check whether things work. It plays the part of specific advertiser archetypes, catching confusing defaults, jargon, and missing controls that functional testing walks right past.
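The CAPTURE → EVALUATE → ACT protocol can be pictured as a fixed loop over the steps of a flow. The sketch below is illustrative only: `Finding`, `run_session`, and the three callables are hypothetical names, not the skill's actual implementation, which lives in markdown prompts rather than Python.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Finding:
    step: str
    summary: str


def run_session(
    steps: list[str],
    capture: Callable[[str], str],
    evaluate: Callable[[str, str], Optional[str]],
    act: Callable[[Finding], None],
) -> list[Finding]:
    """Run the same three phases, in the same order, for every step."""
    findings: list[Finding] = []
    for step in steps:
        observation = capture(step)            # CAPTURE: record what the screen shows
        problem = evaluate(step, observation)  # EVALUATE: judge it; None means no issue
        if problem is not None:
            finding = Finding(step=step, summary=problem)
            act(finding)                       # ACT: file or log the finding
            findings.append(finding)
    return findings


# Stand-in callables; a real run would drive a browser and an LLM instead.
filed: list[Finding] = []
result = run_session(
    steps=["Objective selection", "App selection"],
    capture=lambda step: f"screenshot of {step}",
    evaluate=lambda step, obs: (
        "App picker gives no feedback" if step == "App selection" else None
    ),
    act=filed.append,
)
```

Because the loop itself never changes, any variation between sessions has to come from the persona or the flow passed in, which is the consistency property the protocol is after.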

Beyond Prompt Engineering

Directing the LLM
Building this meant handing an LLM a long task full of judgment calls and getting consistent results. The prompts keep the agent in character for a whole session, get it to tell bugs apart from design problems, and stop it reaching for lazy AI habits like defaulting to "add a tooltip".

Automating a task everyone thinks needs a human
UX evaluation is assumed to need a person in the loop. I broke "judgment" down into concrete questions a model can actually check: can the persona tell what this page is for? Do they know what to do next? Does the language match their technical level?
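As a rough illustration of what "concrete questions a model can check" might look like as data, here is a minimal sketch. The three questions are the ones quoted above; the `Check` class, the short keys, and `failed_checks` are assumptions made for the example, not the agent's real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    key: str       # hypothetical short label, invented for this sketch
    question: str  # the question as worded in the text


CHECKS = [
    Check("purpose", "Can the persona tell what this page is for?"),
    Check("next_step", "Do they know what to do next?"),
    Check("language", "Does the language match their technical level?"),
]


def failed_checks(answers: dict[str, bool]) -> list[str]:
    """Return the questions answered 'no'; a missing answer counts as a failure."""
    return [c.question for c in CHECKS if not answers.get(c.key, False)]


# One screen where the purpose is clear but the next step is not,
# and the language question was never answered.
failures = failed_checks({"purpose": True, "next_step": False})
```

Treating an unanswered question as a failure is a deliberate choice in this sketch: it keeps a skipped check from passing silently.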

Grounded in internal design knowledge
The personas and the rubric weren't invented from scratch. I distilled them from Meta's own design system, encoding its evaluation frameworks into the agent's prompt: the Product Quality Scorecard (PQS), Meta's rubric for grading UX quality; the Usability Playbook (UPB), a 0 to 100 usability score; and our team's app promotion pattern docs. The agent now grades against the same standards a designer on my team would. The hard part was the translation: taking design judgment that usually lives in people's heads and writing it down as an explicit system an LLM can run the same way every session.

Five sources of internal design knowledge, the same standards my team already used to judge quality, became a single evaluation protocol the agent runs every session.

Evaluation Criteria: What to inspect at every step of the flow.
Quality Axes (PQS): The dimensions Meta grades UX on.
Product Quality Process: Severity, effort sizing, and triage rules.
Usability Scoring (UPB): Meta's 0 to 100 usability score.
Design System Patterns: App promotion pattern docs & conventions.

The Dogfooding Agent: Takes on a persona, checks every screen against those standards, and writes up what it finds.

Filed tasks: Classified tickets on the engineering board.
Prioritized: Severity scored and effort sized.
Screenshots: Captured and attached per issue.
Tracker log: A row per session in our sheet.

Filed straight into the tools the team already uses, ready to triage the moment a run ends.

First Time Priya
Small Business Owner
personas/priya_first_time.md
name: "First Time Priya"
archetype: "Small business, first app campaign on Meta, based in India"
evaluation_lens:
  - "Is the terminology understandable for someone whose only reference point is Google UAC?"
  - "Are the defaults safe for a small budget?"
system_instruction: |
  You must stay in character for the entire session.
  Do NOT evaluate based on standard UI heuristics.
  Evaluate strictly based on Priya's knowledge gaps.
Enterprise Marcus
Senior UA Manager ($2M+/mo)
personas/marcus_enterprise.md
name: "Enterprise Marcus"
archetype: "Senior UA Manager running $2M+/month in app campaigns, based in US"
evaluation_lens:
  - "Are power-user controls (SKAdNetwork, optimization events) easily accessible?"
  - "Does the system try to automate things I need manual control over (e.g. Advantage+)?"
system_instruction: |
  You must stay in character for the entire session.
  Evaluate strictly based on Marcus's need for control, measurement confidence, and efficiency.
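To show how a persona file like these could end up in front of the model, here is a minimal sketch that assembles the four fields into a single prompt string. The field names come from the persona files; the `Persona` class, `build_prompt`, and the prompt wording are assumptions for illustration, and the real skill may compose its prompt quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    name: str
    archetype: str
    evaluation_lens: list[str] = field(default_factory=list)
    system_instruction: str = ""


def build_prompt(persona: Persona) -> str:
    """Assemble persona fields into one prompt string (layout is an assumption)."""
    lens = "\n".join(f"- {question}" for question in persona.evaluation_lens)
    return (
        f"You are {persona.name}: {persona.archetype}.\n"
        f"Evaluate every screen through these questions:\n{lens}\n"
        f"{persona.system_instruction}"
    )


# Values copied from the Priya persona file, trimmed to one lens question.
priya = Persona(
    name="First Time Priya",
    archetype="Small business, first app campaign on Meta, based in India",
    evaluation_lens=["Are the defaults safe for a small budget?"],
    system_instruction="You must stay in character for the entire session.",
)
prompt = build_prompt(priya)
```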
The same screen, two verdicts

I model personas as cognitive states, not demographics. Run the same screen through Marcus and Priya and the criteria flip: one default reads as a power user's frustration and a beginner's relief at the same time. That tension is exactly what product teams have to design around.

Marcus's column is from the real Run 1 session. Priya's column is a projected read of those same screens through her documented model. It stays illustrative until her session runs.

| Real screen (Run 1) | Enterprise Marcus · measured | First Time Priya · projected |
| --- | --- | --- |
| Advantage+ defaults ON | Friction. Hidden automation he'd want to disable. | Relief. The "let the system handle it" she expects from Google. |
| SKAdNetwork not visible | Blocking. Needs full control to protect measurement. | Hidden risk. She'll mismeasure iOS and never know why. |
| Unconnected app, click does nothing | Friction. Knows to leave and use the App Dashboard. | Dead end. The most likely abandonment point in the flow. |

Advantage+ defaulting ON is a retention risk and a retention aid at the same time, depending only on who's looking. The persona system catches that contrast on every run, instead of leaving it to whoever happens to be testing.
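One way to surface that contrast mechanically is to flag any screen where the personas' verdicts point in opposite directions. This is a sketch under stated assumptions: the verdict labels are taken from the comparison above, but `contrasting_screens` and the idea of a caller-supplied set of "positive" labels are hypothetical, not how the agent necessarily does it.

```python
def contrasting_screens(
    verdicts: dict[str, dict[str, str]], positive: set[str]
) -> list[str]:
    """Return screens where one persona's verdict is positive and another's is not."""
    flagged = []
    for screen, by_persona in verdicts.items():
        signs = {label in positive for label in by_persona.values()}
        if len(signs) > 1:  # both True and False present: the personas disagree
            flagged.append(screen)
    return flagged


verdicts = {
    "Advantage+ defaults ON": {"Marcus": "Friction", "Priya": "Relief"},
    "SKAdNetwork not visible": {"Marcus": "Blocking", "Priya": "Hidden risk"},
    "Unconnected app, click does nothing": {"Marcus": "Friction", "Priya": "Dead end"},
}
flagged = contrasting_screens(verdicts, positive={"Relief"})
```

With the three rows above, only the Advantage+ default is flagged; the other two screens are negative for both personas, just for different reasons.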

What One Session Surfaced

For its first production run, the agent took Enterprise Marcus through the full App Promotion flow on iOS and Android, back to back, in about 90 minutes. One pass surfaced 18 issues, including things our manual sessions had been walking past. A few mattered enough to show here; the point is how many it caught, and how fast.

The agent mid-run. On the left it narrates the CAPTURE → EVALUATE loop in Marcus's voice (“where is the conversion location picker?”); on the right it's driving the real Ads Manager campaign setup. Every issue it files comes out of a live session like this one.
Persona: Enterprise Marcus
Flow: E2E App Promotion
Platforms: iOS + Android
Duration: ~90 minutes
18
Issues surfaced
4
Launch blocking
58/100
iOS usability score
2/5
Cross platform consistency
43
Screenshots captured
What manual testing missed

When we dogfood our own product, we rarely step all the way into the advertiser's shoes, so issues like these slip right past us. The agent doesn't have that problem. Each one is filed as a task in the persona's own voice, with the structured record attached for triage.

Task 01 · filed by Enterprise Marcus
Launch Blocking · Quality issue · Flow doesn't hold together

SKAdNetwork Configuration Missing

Enterprise advertisers cannot configure their iOS measurement strategy. SKAdNetwork is Apple's privacy-era system for measuring app installs, so wrong defaults quietly degrade campaign reporting. At $2M+/month, that is real measurement risk.

Enterprise Marcus

"I scrolled through the entire ad set for my iOS campaign. Where is SKAdNetwork? My entire iOS strategy depends on getting the conversion schema right. If the system is making SKAN decisions for me, I need to know what those decisions are."

Task 02 · filed by Enterprise Marcus
Launch Blocking · Quality issue · Flow doesn't hold together

Android Ad Level Fails to Recognize App

After selecting an app at the ad set level, the ad level shows an error warning: "multiple apps that can't be edited together." This is a flow-breaking bug that blocks Android campaign creation.

Enterprise Marcus

"I selected one app on Google Play. Now the ad level says I have 'multiple apps that can't be edited together.' And it tells me to go back and select a store that I already selected. This is broken."

agent_output/issue_01.yaml
issue_id: "SKAN_Config_Missing"
quality_axis: "Coherent End to End"
classification: "Launch-Blocking"
rock_size: "Large (6+ months)"
problem: "No SKAdNetwork section exists. For an iOS campaign, SKAN configuration is foundational to measurement."
user_impact: "Enterprise advertisers cannot configure their iOS measurement strategy."
persona_quote: |
  "I scrolled through the entire ad set for my iOS campaign. Where is SKAdNetwork? My entire iOS strategy depends on getting the conversion schema right."
recommended_fix: "Surface current SKAN configuration as read-only summary in ad set."
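A structured record is only useful for triage if every field is actually filled in. As a hedged sketch, a completeness check over the field names seen in `issue_01.yaml` might look like this. The field list is copied from that file; `missing_fields` and the rule that blank values count as missing are assumptions, not part of the agent's documented behavior.

```python
# Field names as they appear in agent_output/issue_01.yaml.
REQUIRED_FIELDS = (
    "issue_id",
    "quality_axis",
    "classification",
    "rock_size",
    "problem",
    "user_impact",
    "persona_quote",
    "recommended_fix",
)


def missing_fields(issue: dict[str, str]) -> list[str]:
    """List required fields that are absent or blank in an issue record."""
    return [name for name in REQUIRED_FIELDS if not str(issue.get(name, "")).strip()]


# A deliberately incomplete record: rock_size is blank, several fields are absent.
partial = {
    "issue_id": "SKAN_Config_Missing",
    "classification": "Launch-Blocking",
    "rock_size": "  ",
}
gaps = missing_fields(partial)
```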
Why the system beats a manual pass

A single report was never the goal. I wanted a system that runs the same way every time, faster, and catches what a tired tester on one platform would miss, like the cross platform bugs it caught running iOS and Android back to back.

| Dimension | Manual | AI driven |
| --- | --- | --- |
| Issues per session | 3–5 typical | 18 (first session) |
| Time | 45 min – 1 hr, not exhaustive across platforms | ~90 min, full pass on both platforms |
| Cross platform bugs | Rare (needs 2 sessions + a manual diff) | 2, found automatically |
| Output | Manual table with screenshots | Structured report, scored & classified |
| Task creation | Often deferred, filed later if at all | Filed automatically to the board |
| Consistency | Varies with the tester | Identical protocol every run |
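The "manual diff" that makes cross platform bugs rare in manual testing is, at its simplest, a set comparison between what each platform's pass reported. The sketch below assumes issues can be keyed by a stable id. `platform_only_issues` is a hypothetical helper, and apart from `SKAN_Config_Missing` the ids are invented for the example.

```python
def platform_only_issues(ios: set[str], android: set[str]) -> dict[str, set[str]]:
    """Split issue ids into those reported on only one platform."""
    return {
        "ios_only": ios - android,
        "android_only": android - ios,
    }


# Illustrative ids only; the second and third are made up for this sketch.
ios_issues = {"SKAN_Config_Missing", "Advantage_Plus_Default_On"}
android_issues = {"Android_App_Not_Recognized", "Advantage_Plus_Default_On"}
diff = platform_only_issues(ios_issues, android_issues)
```

Running both platforms back to back in one session is what makes this comparison cheap: both sets exist at the same moment, produced by the same protocol.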

The agent isn't a replacement for manual testing. It handles the repetitive parts and hits both platforms every run, so I spend my time on the judgment calls instead of the clicking. A human still catches what it misses, and every issue it flags is still mine to verify.

Key Design Decisions

The skill is a set of modular markdown files, kept deliberately separate. The evaluation engine stays fixed while personas and flows swap in and out, so any team at Meta can fork it for their own surface in a day without touching the core logic.

app-promotion-dogfooding/
SKILL.md                      # Setup wizard → launches the session
references/
├── agent.md                  # CAPTURE → EVALUATE → ACT protocol + quality rubric
├── personas/
│   ├── priya_first_time.md
│   └── marcus_enterprise.md
├── flows/
│   └── e2e_app_promotion.md
└── dogfood-connect.sh        # Drives live Ads Manager via Chrome DevTools Protocol
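The "fixed engine, swappable personas and flows" idea can be sketched as a small resolver that always returns the same engine file plus whichever persona and flow were requested. This assumes `agent.md`, `personas/`, and `flows/` all sit under `references/`, as the tree above suggests; `session_files` is a hypothetical helper, not part of the skill.

```python
from pathlib import Path


def session_files(skill_root: Path, persona: str, flow: str) -> list[Path]:
    """Return the fixed engine file plus the swappable persona and flow files."""
    refs = skill_root / "references"
    files = [
        refs / "agent.md",                    # fixed: protocol + quality rubric
        refs / "personas" / f"{persona}.md",  # swappable per session
        refs / "flows" / f"{flow}.md",        # swappable per session
    ]
    missing = [str(path) for path in files if not path.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing skill files: {missing}")
    return files
```

A team forking the skill for another surface would, under this layout, add its own files to `personas/` and `flows/` and leave `agent.md` alone.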

Iteration: two personas, not three

I started with three personas. Testing showed the middle "agency" persona was producing overlapping noise rather than distinct findings, so I cut it to two extreme archetypes: First Time Priya (catches onboarding gaps) and Enterprise Marcus (catches power user friction). Two clear voices gave far cleaner signal than three that overlapped.

App specific focus areas
I spelled out the areas the agent has to inspect (Objective selection, App selection, SKAdNetwork). Without that, it just evaluates whatever is most visually prominent. This keeps the attention on the surfaces our team actually owns.
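A simple guard for this is to compare what a session actually inspected against the required list and report anything skipped. The three focus areas are the ones named above; `uncovered_areas` and the case-insensitive matching are assumptions for this sketch.

```python
# The focus areas named in the text.
FOCUS_AREAS = ("Objective selection", "App selection", "SKAdNetwork")


def uncovered_areas(inspected: list[str]) -> list[str]:
    """Return required focus areas that a session never inspected."""
    seen = {area.casefold() for area in inspected}
    return [area for area in FOCUS_AREAS if area.casefold() not in seen]


# A session that drifted toward a visually prominent element and skipped SKAdNetwork.
skipped = uncovered_areas(["objective selection", "App selection", "Hero banner"])
```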

Wired into our codebase
Our codebase is connected inside Claude, so I can spin up every app ads flow the agent needs to evaluate, generated straight from the source. That means a usability pass runs against the real states we ship, the full set of screens and edge cases a flow has to cover, not just whatever a test account happens to surface.

I Found the Issues, Then Shipped the Fixes Myself.

The agent gave me a reliable, repeatable way to surface issues. But I didn't stop at filing tickets. I tested every issue it flagged, confirmed it was real, then used vibe coding to write the frontend fix myself. An engineer on the team reviewed every diff before it landed in production.

Learning enough to close the gap between finding a problem and shipping the fix is what changed how I work. 10+ frontend changes landed in a single week, fixed by the person who understood the UX problem best instead of waiting for sprint planning.

10+
Frontend fixes shipped in one week
1
Person: detect, verify, fix, ship
0
Tickets filed to another team's backlog
The loop
Agent finds issue → I verify manually → I vibe code the fix → Engineer reviews → Shipped
↻ Repeated until all issues are resolved
One fix, end to end

A usability pass flagged the Attribution model field in ad set settings: an always-visible dropdown adding clutter to an already dense page, out of step with neighboring fields like Performance goal that stay collapsed until you need them. I converted it to the same progressive disclosure pattern, a read-only summary that expands to the full selector on click, then shipped it. An engineer reviewed the diff before it landed.

The committed diff: "Convert Attribution model to progressive disclosure pattern" (+157 −87)
Committed and reviewed. A flagged usability issue going all the way to production code, written and shipped by the person who caught it.
Before · Always-visible dropdown
Attribution model shown as an always-visible dropdown in ad set settings

Open by default, adding height to an already long page.

After · shipped · Progressive disclosure
Attribution model collapsed to a read-only summary that expands on click

Collapses to a read-only summary, matching the Performance goal field. Expands to the full selector on click.

The system now runs the same way for our team's dogfooding sessions. The agent handles detection, I handle verification and fixes, and an engineer handles review. The whole design-to-ship cycle dropped by half.

When design can ship frontend, product quality stops waiting in someone else's backlog. The small fixes that used to die in prioritization get caught and corrected by the person who cares most about them. The real change here is simple: design now holds the quality bar and enforces it directly in the code.

What I'd Watch, and Where It Goes

I've built this twice now, and a third would take about a day. The real result is less a single skill than a template any team that owns a flow can stand up, tuned each session for depth over coverage.

What I'd build next

Today the skill stops once an issue is filed. Next is closing the loop: let it propose the fix, then rerun the same session after the fix ships to confirm it held. That turns a testing tool into regression detection on a schedule.
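The rerun step reduces to comparing issue ids between two runs of the same session. The sketch below is one possible shape for that comparison, assuming issues keep stable ids across runs; `compare_runs` and its three categories are hypothetical, since this part of the skill does not exist yet.

```python
def compare_runs(
    before: set[str], after: set[str], fixed: set[str]
) -> dict[str, set[str]]:
    """Classify issue ids once a fix ships and the same session is rerun."""
    return {
        "held": fixed - after,       # marked fixed and no longer reported
        "regressed": fixed & after,  # marked fixed but reported again
        "new": after - before,       # not present in the earlier run
    }


# Ids are illustrative; only SKAN_Config_Missing appears in the original report.
report = compare_runs(
    before={"SKAN_Config_Missing", "Attribution_Always_Visible"},
    after={"SKAN_Config_Missing", "Budget_Tooltip_Jargon"},
    fixed={"Attribution_Always_Visible"},
)
```

Run on a schedule, a non-empty "regressed" set is the signal that a shipped fix did not hold.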

Where it's still thin
  • n = 1. The headline numbers are one real run. Signal, not proof.
  • Personas validated only by me. Built from internal docs and my own read, never tested alongside a real advertiser.