How We Built a Chat AI Agent Into Live Device Testing Sessions

Published: March 10, 2026 at 04:40 PM EDT
2 min read
Source: Dev.to
What It Does

The agent can see the live device screen. You can ask it in plain English:

  • “What’s the XPath for the equals button?”
  • “Give me a UIAutomator2 selector for the digit 7”
  • “What’s the Accessibility ID of the login button?”

It responds instantly with working locators — in whatever language you’re using (Java, Python, Swift, Kotlin, WebDriverIO). No switching tools, no Appium Inspector. Just ask.

How We Built It

Screen visibility

Our sessions already stream device screens via WebRTC. We grab a single screenshot at the moment the user asks a question, keeping latency low and avoiding a continuous video feed to the model.
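As a rough sketch of that capture path, the idea is to keep only the most recent decoded frame in memory and snapshot it the instant a question arrives (the class and callback names below are illustrative, not our actual pipeline):

```python
import threading

class FrameSnapshotter:
    """Keeps only the latest frame from the stream so a screenshot can be
    taken on demand. Hypothetical sketch; real frame format may differ."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = None  # raw bytes of the most recent frame

    def on_frame(self, frame_bytes: bytes) -> None:
        # Called for every decoded frame; we overwrite rather than buffer,
        # so memory use stays constant and no video is ever stored.
        with self._lock:
            self._latest = frame_bytes

    def snapshot(self):
        # Called when the user asks a question: one frame, no video feed.
        with self._lock:
            return self._latest

snapper = FrameSnapshotter()
snapper.on_frame(b"frame-1")
snapper.on_frame(b"frame-2")
print(snapper.snapshot())
```

Overwriting instead of buffering is what keeps latency and memory flat regardless of session length.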

The model

We send the screenshot plus the user message to a vision‑capable LLM. The prompt is structured to return locators in a specific format; we parse the response and render it with syntax highlighting in the UI.
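A minimal sketch of what that request can look like, assuming an OpenAI-style message shape with the screenshot embedded as a base64 data URL (the model name and prompt wording below are placeholders, not our production values):

```python
import base64
import json

def build_vision_request(screenshot_png: bytes, user_message: str) -> dict:
    """Pairs one screenshot with the user's question in a chat payload."""
    image_b64 = base64.b64encode(screenshot_png).decode("ascii")
    return {
        "model": "vision-capable-model",  # placeholder model name
        "messages": [
            {
                "role": "system",
                "content": (
                    "You are a mobile-testing assistant. Return locators "
                    "as a JSON object, one key per locator format."
                ),
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_message},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_b64}"
                        },
                    },
                ],
            },
        ],
    }

payload = build_vision_request(
    b"\x89PNG...", "What's the XPath for the equals button?"
)
print(json.dumps(payload, indent=2)[:200])
```

The structured system prompt is what lets us parse the reply mechanically instead of scraping code out of prose.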

Locator formats

We support:

  • XPath
  • CSS Selector
  • UIAutomator2 (Android)
  • XCUITest (iOS)
  • Accessibility ID

The model is instructed to return all applicable formats for the visible element, not just one.
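The reply can then be reduced to a per-format dictionary. A sketch, under the assumption that the model answers with one JSON key per applicable format (the reply below is a fabricated example, not real model output):

```python
import json

# Hypothetical model reply: one key per locator format that applies
# to the element visible on screen.
raw_reply = """{
  "xpath": "//android.widget.Button[@content-desc='equals']",
  "uiautomator2": "new UiSelector().description(\\"equals\\")",
  "accessibility_id": "equals"
}"""

SUPPORTED = ["xpath", "css", "uiautomator2", "xcuitest", "accessibility_id"]

def parse_locators(reply: str) -> dict:
    # Keep only the formats we render; drop unknown keys defensively.
    data = json.loads(reply)
    return {k: v for k, v in data.items() if k in SUPPORTED}

locators = parse_locators(raw_reply)
print(locators["accessibility_id"])
```

Note that a web-only format like CSS simply doesn't appear for a native Android element; the parser tolerates any subset.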

Code output

Users pick their language from a dropdown (Java, Python, Swift, Kotlin, WebDriverIO). We wrap the locator in idiomatic framework code for each:

# Python / Appium
from appium.webdriver.common.appiumby import AppiumBy
driver.find_element(AppiumBy.XPATH, "//android.widget.Button[@content-desc='equals']")

// Java / Appium (using Selenium's By)
import org.openqa.selenium.By;
driver.findElement(By.xpath("//android.widget.Button[@content-desc='equals']"));
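Under the hood this is essentially a template lookup keyed by language. A simplified sketch (the template strings are illustrative, not our exact generated output):

```python
# Hypothetical per-language templates for wrapping a raw locator in
# idiomatic client code; the portal's real templates may differ.
TEMPLATES = {
    "python": 'driver.find_element(AppiumBy.XPATH, "{loc}")',
    "java": 'driver.findElement(By.xpath("{loc}"));',
    "kotlin": 'driver.findElement(By.xpath("{loc}"))',
    "webdriverio": "await driver.$('{loc}')",
}

def render(language: str, locator: str) -> str:
    """Wraps a raw locator string in framework code for one language."""
    return TEMPLATES[language].format(loc=locator)

print(render("python", "//android.widget.Button[@content-desc='equals']"))
```

Keeping the locator and the language wrapper separate means adding a new language is one template, not a new prompt.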

UI integration

The panel sits alongside the device stream—it doesn’t overlay the screen. Users can keep testing while asking questions, and the conversation history stays within the session.

What We Learned

The hardest part wasn’t the AI integration—it was the prompt engineering. Getting the model to return clean, parseable locator output (instead of prose with embedded code) required several iterations.
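For illustration, a prompt rule along these lines (an assumption on our part here, not the exact production wording) is the kind of instruction that made the output reliably machine-readable:

```python
# Illustrative output-format rule appended to the system prompt; the
# real prompt went through several iterations beyond this.
LOCATOR_FORMAT_RULES = (
    "Respond with a single JSON object and nothing else. "
    "Use only these keys when applicable: xpath, css, uiautomator2, "
    "xcuitest, accessibility_id. Do not wrap the JSON in markdown "
    "fences and do not add explanatory prose."
)
print(LOCATOR_FORMAT_RULES)
```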

We also found that grounding the model on the visible screen state (rather than a DOM or accessibility tree) made responses feel more natural. Users think in terms of what they see, not what’s in the XML hierarchy.

Try It

The Chat AI Agent is live now in the RobotActions portal. A free trial is available.

We’d love feedback from anyone doing Appium or mobile automation—especially if you’ve built similar tooling. Drop a comment or reach out directly.
