Control a Browser Using AI

Open source library Browser Use paves the way for AI agents to control browsers

Apr 06, 2025

OpenAI Operator

In January 2025, OpenAI announced a research preview of Operator, a product that can interact with a browser to “handle a wide variety of repetitive browser tasks such as filling out forms, ordering groceries, and even creating memes.” https://openai.com/index/introducing-operator/

The product looks amazing, with user oversight front and center through a “take over” feature that lets the human supervising the work intervene at any time. Operator also pauses intentionally at certain points so the user can handle logins, enter payment info, and solve CAPTCHAs.

The one catch is that, at the time of this writing, Operator is in preview and is only available to US-based Pro subscribers on the $200/month plan. While the rest of us wait for this feature to reach other subscriptions, you might wonder, “Does open source software have agent-based browsing?” The answer is yes!

Browser Use

Browser Use is a Y Combinator-backed company that has published an open source software library that allows agents to control your web browser. https://github.com/browser-use/browser-use While Browser Use offers a paid cloud subscription, the free library is simple to use locally. You’ll need:

A recent version of Python
Some Python packages
A Chromium browser installed through Playwright (Playwright is a web page test automation tool; it’s how Browser Use identifies and interacts with web pages)
An API key for a state-of-the-art language model vendor (OpenAI, Anthropic, etc.)
A tiny amount of Python code
A descriptive prompt for the agent to follow
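The checklist above can be turned into a quick preflight check before running an agent. This is a minimal sketch, not part of Browser Use: the minimum Python version (3.11), the module names, and the OPENAI_API_KEY variable are assumptions, so adjust them to match the project's current README and your model vendor. The check does not cover the Chromium download, which is typically done with `playwright install chromium`.

```python
import importlib.util
import os
import sys

# Assumptions (not from the article): the minimum Python version, the
# importable module names, and the API key variable name.
MIN_PYTHON = (3, 11)
REQUIRED_MODULES = ["browser_use", "playwright", "langchain_openai"]
API_KEY_VAR = "OPENAI_API_KEY"


def preflight(version=None, env=None, has_module=None):
    """Return a list of human-readable problems; an empty list means ready."""
    version = tuple(sys.version_info[:2]) if version is None else version
    env = os.environ if env is None else env
    if has_module is None:
        # find_spec returns None when a top-level module is not installed.
        has_module = lambda name: importlib.util.find_spec(name) is not None

    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ expected, "
            f"found {version[0]}.{version[1]}"
        )
    for name in REQUIRED_MODULES:
        if not has_module(name):
            problems.append(f"missing package: {name}")
    if not env.get(API_KEY_VAR):
        problems.append(f"environment variable {API_KEY_VAR} is not set")
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print(issue)
    if not issues:
        print("Ready to run an agent.")
```

The optional arguments exist only so the check can be exercised without touching the real interpreter or environment.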

There is also a browser based UI you can use to configure agents. https://github.com/browser-use/web-ui

The Test Drive

Get the full project here:

https://github.com/ccozad/ml-reference-designs/tree/master/browser-use

The videos on the GitHub page were promising, but I wanted to try things myself. I modified the getting started script to go find some prices and gave the agent this prompt:

Use autotrader.com to find car prices for a make of toyota and model of corolla in the 95621 zip code. Stop when you have gathered 5 prices and present a summary.
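For readers who want to see roughly what such a script looks like, here is a hedged sketch rather than the exact code from the linked project. It assumes the Agent(task=..., llm=...) and agent.run() interface from the Browser Use getting-started example of early 2025, plus the langchain_openai ChatOpenAI wrapper and a final_result() method on the returned history; the library's API may have changed since. The build_task helper is hypothetical and simply fills the prompt above with an explicit make, model, zip code, and count.

```python
import asyncio
import importlib.util
import os


def build_task(make, model, zip_code, count=5):
    """Fill the article's prompt with an explicit make, model, zip code, and count."""
    return (
        f"Use autotrader.com to find car prices for a make of {make} "
        f"and model of {model} in the {zip_code} zip code. "
        f"Stop when you have gathered {count} prices and present a summary."
    )


async def main():
    # Third-party imports live here so build_task works without them installed.
    # Assumed API: Agent(task=..., llm=...) and agent.run(), as in the
    # Browser Use getting-started example from early 2025.
    from browser_use import Agent
    from langchain_openai import ChatOpenAI

    agent = Agent(
        task=build_task("toyota", "corolla", "95621"),
        llm=ChatOpenAI(model="gpt-4o"),  # reads OPENAI_API_KEY from the environment
    )
    history = await agent.run()
    # Assumption: the returned history object exposes final_result().
    print(history.final_result())


if __name__ == "__main__":
    ready = (
        importlib.util.find_spec("browser_use") is not None
        and importlib.util.find_spec("langchain_openai") is not None
        and bool(os.environ.get("OPENAI_API_KEY"))
    )
    if ready:
        asyncio.run(main())
    else:
        print("Install browser-use and langchain-openai, then set OPENAI_API_KEY.")
```

Keeping the prompt in a small function makes it easy to iterate on the wording between runs without touching the agent setup.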

To reach this final prompt I went through about 10 iterations of running this scenario. These 10 tries consumed about 200K tokens in total, or about 20K tokens per try, with the GPT-4o model. At the rates on my personal account, that cost about $0.50 for all 10 tries, or about $0.05 per try.
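The arithmetic behind those figures is simple enough to sketch, which helps when estimating a longer prompt-tuning session. The function name and the blended per-million-token rate are illustrative; the rate is only what the rounded totals above imply, not a published price, and it does not separate input tokens from output tokens.

```python
def cost_summary(total_tokens, total_cost_usd, tries):
    """Break session totals into per-try figures and an implied blended rate."""
    tokens_per_try = total_tokens / tries
    cost_per_try = total_cost_usd / tries
    # Blended rate: input and output tokens are lumped together here.
    usd_per_million_tokens = total_cost_usd * 1_000_000 / total_tokens
    return tokens_per_try, cost_per_try, usd_per_million_tokens


if __name__ == "__main__":
    tokens, cost, rate = cost_summary(200_000, 0.50, 10)
    print(f"{tokens:,.0f} tokens per try")
    print(f"${cost:.2f} per try")
    print(f"${rate:.2f} per million tokens (blended)")
```

With the numbers from this test drive, the implied blended rate works out to about $2.50 per million tokens, so 100 similar runs would cost roughly $5.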

I needed so many tries because the language model picked up that Toyota was a car make, but it missed that Corolla was a car model. This “confidently wrong” behavior is typical when trying to get repeatable, accurate outcomes with an LLM. Browser Use still navigated through the page to find Corollas, which was impressive in its own right, but it took longer with other car models mixed in.

The final results were quite impressive:

What’s Next?

I am excited to think about new use cases, especially for automating workflows that may not have a tidy API available.

What do you think this technology can be used for?

Charles’s Substack
