Vision-based agents
Vision-based agents treat the browser as a visual canvas. They look at screenshots, interpret them with multimodal models, and output low-level actions such as “click (210, 260)” or “type ‘Peter Pan’”. This mimics how a human uses a computer: reading visible text, locating buttons visually, and clicking where needed.
The upside is universality: the model doesn’t need structured data, just pixels. The downsides are precision and performance: visual models are slower, must scroll through the entire page to see all of it, and struggle with subtle state changes between screenshots (“Is this button clickable yet?”).
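To make the loop concrete, here is a minimal sketch of a single perceive-decide-act step, assuming Playwright for browser control. The `query_vision_model` function is a hypothetical stand-in for whatever multimodal model the agent calls, and the action grammar (`click X Y`, `type TEXT`, `scroll`) is an illustrative assumption, not a standard.

```python
# pip install playwright && playwright install chromium
import re
from playwright.sync_api import sync_playwright, Page


def query_vision_model(screenshot_png: bytes, goal: str) -> str:
    """Hypothetical multimodal-model call: given a screenshot and a goal,
    return one low-level action string, e.g. 'click 210 260' or 'type Peter Pan'."""
    raise NotImplementedError  # swap in your model/API of choice


def run_step(page: Page, goal: str) -> None:
    # 1. Capture the current viewport as pixels -- the agent's only input.
    screenshot = page.screenshot()
    # 2. Ask the model for the next low-level action.
    action = query_vision_model(screenshot, goal)
    # 3. Replay the action as raw mouse/keyboard events.
    if m := re.fullmatch(r"click (\d+) (\d+)", action):
        page.mouse.click(int(m.group(1)), int(m.group(2)))
    elif m := re.fullmatch(r"type (.+)", action):
        page.keyboard.type(m.group(1))
    elif action == "scroll":
        page.mouse.wheel(0, 600)  # scroll down to reveal content below the fold


with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    run_step(page, "search for Peter Pan")
    browser.close()
```

Note that the model only ever sees pixels and emits coordinates; nothing in the loop inspects the page structure, which is exactly why the approach generalizes to any site but pays for it in speed and precision.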
DOM-based agents
DOM-based agents, by contrast, operate directly on the Document Object Model (DOM), the structured tree that defines every webpage. Instead of interpreting pixels, they reason over textual representations of the page: element tags, attributes, ARIA roles, and labels.
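As an illustration of what “reasoning over textual representations” can look like, the sketch below (again Playwright, with the same caveats) builds the kind of indexed, text-only snapshot a DOM-based agent might feed to its model. The selector list and the line format are assumptions for the example; real agents use richer serializations.

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright, Page


def serialize_dom(page: Page) -> str:
    """Render interactive elements as text: tag, ARIA role, accessible label,
    and visible text, each indexed so the model can refer to it by number."""
    lines = []
    elements = page.query_selector_all("a, button, input, select, textarea, [role]")
    for i, el in enumerate(elements):
        tag = el.evaluate("el => el.tagName.toLowerCase()")
        role = el.get_attribute("role") or ""
        label = el.get_attribute("aria-label") or ""
        text = (el.inner_text() or "").strip()[:60]
        lines.append(f"[{i}] <{tag}> role={role!r} label={label!r} text={text!r}")
    return "\n".join(lines)


with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    # The language model receives this text instead of pixels and can reply
    # with an index-based action, e.g. 'click 3' -> elements[3].click().
    print(serialize_dom(page))
    browser.close()
```

Because actions target elements by index rather than by screen coordinates, there is no ambiguity about where to click and no need to scroll a viewport to discover content.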