Vision-based agents
Vision-based agents treat the browser as a visual canvas. They look at screenshots, interpret them with multimodal models, and output low-level actions such as “click (210, 260)” or “type ‘Peter Pan’”. This mimics how a human uses a computer: reading visible text, locating buttons visually, and clicking where needed.
The upside is universality: the model doesn’t need structured data, just pixels. The downsides are precision and performance: visual models are slower, must scroll through the entire page to see all of it, and struggle with subtle state changes between screenshots (“Is this button clickable yet?”).
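To make the loop concrete, here is a minimal sketch of a single perceive-decide-act step, assuming Playwright for browser control. The `query_vision_model` function is a hypothetical stand-in for whatever multimodal model the agent calls, and the action grammar (`click X Y`, `type TEXT`, `scroll`) is an illustrative assumption, not a standard.

```python
# pip install playwright && playwright install chromium
import re
from playwright.sync_api import sync_playwright, Page


def query_vision_model(screenshot_png: bytes, goal: str) -> str:
    """Hypothetical multimodal-model call: given a screenshot and a goal,
    return one low-level action string, e.g. 'click 210 260' or 'type Peter Pan'."""
    raise NotImplementedError  # swap in your model/API of choice


def run_step(page: Page, goal: str) -> None:
    # 1. Capture the current viewport as pixels -- the agent's only input.
    screenshot = page.screenshot()
    # 2. Ask the model for the next low-level action.
    action = query_vision_model(screenshot, goal)
    # 3. Replay the action as raw mouse/keyboard events.
    if m := re.fullmatch(r"click (\d+) (\d+)", action):
        page.mouse.click(int(m.group(1)), int(m.group(2)))
    elif m := re.fullmatch(r"type (.+)", action):
        page.keyboard.type(m.group(1))
    elif action == "scroll":
        page.mouse.wheel(0, 600)  # scroll down to reveal content below the fold


with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    run_step(page, "search for Peter Pan")
    browser.close()
```

Note that the model only ever sees pixels and emits coordinates; nothing in the loop inspects the page structure, which is exactly why the approach generalizes to any site but pays for it in speed and precision.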
DOM-based agents
DOM-based agents, by contrast, operate directly on the Document Object Model (DOM), the structured tree that defines every webpage. Instead of interpreting pixels, they reason over textual representations of the page: element tags, attributes, ARIA roles, and labels.
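As an illustration of what “reasoning over textual representations” can look like, the sketch below (again Playwright, with the same caveats) builds the kind of indexed, text-only snapshot a DOM-based agent might feed to its model. The selector list and the line format are assumptions for the example; real agents use richer serializations.

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright, Page


def serialize_dom(page: Page) -> str:
    """Render interactive elements as text: tag, ARIA role, accessible label,
    and visible text, each indexed so the model can refer to it by number."""
    lines = []
    elements = page.query_selector_all("a, button, input, select, textarea, [role]")
    for i, el in enumerate(elements):
        tag = el.evaluate("el => el.tagName.toLowerCase()")
        role = el.get_attribute("role") or ""
        label = el.get_attribute("aria-label") or ""
        text = (el.inner_text() or "").strip()[:60]
        lines.append(f"[{i}] <{tag}> role={role!r} label={label!r} text={text!r}")
    return "\n".join(lines)


with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    # The language model receives this text instead of pixels and can reply
    # with an index-based action, e.g. 'click 3' -> elements[3].click().
    print(serialize_dom(page))
    browser.close()
```

Because actions target elements by index rather than by screen coordinates, there is no ambiguity about where to click and no need to scroll a viewport to discover content.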