NNetNav: Unsupervised Learning of Browser Agents Through Environment Interaction in the Wild

Shikhar Murty1, Hao Zhu1, Dzmitry Bahdanau2, and Christopher Manning1

1Stanford University, 2ServiceNow Research

Abstract

We introduce NNetNav, a method for unsupervised interaction with websites that generates synthetic demonstrations for training browser agents. Given any website, NNetNav produces these demonstrations by retroactively labeling action sequences from an exploration policy. Most work on training browser agents has relied on expensive human supervision, and the limited prior work on such interaction-based techniques has failed to provide effective search through the exponentially large space of exploration. In contrast, NNetNav exploits the hierarchical structure of language instructions to make this search more tractable: Complex instructions are typically decomposable into simpler sub-tasks, allowing NNetNav to automatically prune interaction episodes when an intermediate trajectory cannot be annotated with a meaningful sub-task.

LLaMA-3.1-8B fine-tuned on 10k NNetNav self-generated demonstrations obtains a success rate of over 16% on WebArena and 35% on WebVoyager, improvements of 15 and 31 points, respectively, over zero-shot LLaMA-3.1-8B. It outperforms zero-shot GPT-4 and reaches the state of the art among unsupervised methods on both benchmarks.


Figure 1: Given web URLs (1), NNetNav (2) uses a structured exploration strategy to interact with websites (3) and autonomously discover diverse (instruction, trajectory) demonstrations, as summarized in (4). To effectively prune exploration, the trajectory-so-far is periodically evaluated by a relabeling module, and further exploration continues only if it can be assigned a meaningful language instruction. All components in NNetNav are implemented with the same zero-shot base LLM.
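The exploration-with-pruning loop described in the caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `explore_with_pruning`, `propose_action`, `relabel`, `check_every`, and `max_steps` are hypothetical, the toy policy and relabeler stand in for LLM calls, and keeping every successfully labeled prefix as a demonstration is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Demonstration:
    instruction: str    # language instruction assigned after the fact
    actions: List[str]  # the trajectory prefix that the instruction describes


def explore_with_pruning(
    propose_action: Callable[[List[str]], str],
    relabel: Callable[[List[str]], Optional[str]],
    max_steps: int = 12,
    check_every: int = 4,
) -> List[Demonstration]:
    """Roll out an exploration policy, pruning when the prefix cannot be labeled."""
    actions: List[str] = []
    demos: List[Demonstration] = []
    for step in range(1, max_steps + 1):
        actions.append(propose_action(actions))
        if step % check_every == 0:
            # Periodically ask the relabeler for an instruction that fits
            # the trajectory-so-far; None means "no meaningful instruction".
            instruction = relabel(actions)
            if instruction is None:
                break  # prune: stop exploring this episode
            demos.append(Demonstration(instruction, list(actions)))
    return demos


# Toy stand-ins for the LLM-based components (hypothetical).
SCRIPT = ["open forum", "click post", "type reply", "submit",
          "scroll", "scroll", "scroll", "scroll",
          "click home", "click search", "type query", "submit"]


def toy_policy(actions: List[str]) -> str:
    # Replays a fixed script instead of sampling from an LLM.
    return SCRIPT[len(actions)]


def toy_relabel(actions: List[str]) -> Optional[str]:
    # Treats a window of pure scrolling as unlabelable.
    if all(a == "scroll" for a in actions[-4:]):
        return None
    return "Task covering: " + ", ".join(actions)


demos = explore_with_pruning(toy_policy, toy_relabel)
for d in demos:
    print(d.instruction)
```

In this toy run the episode is cut off at the second checkpoint, because the last four actions are all scrolling and the relabeler returns no instruction. The remaining four scripted actions are never executed, which is the saving that pruning is meant to provide.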

Example Traces from our Models

Below are some example traces from our model on WebArena and WebVoyager. The model is able to follow complex instructions such as "Reply to the first comment on the post about woodworking in the sub-reddit r/woodworking" and "Find driving directions from Stanford to San Francisco". In some of these examples the model shows the ability to back-track. Pushing further on such emergent back-tracking is an active area of research for us.


Benchmarks & Results

We evaluate NNetNav on two standard web navigation benchmarks: WebArena and WebVoyager. Our method achieves significant improvements over existing baselines, including large language models such as zero-shot GPT-4 and zero-shot LLaMA-3.1-8B. Please read our paper for more details on the experimental setup and results.

| Method | Param Count | WebArena | WebVoyager |
|---|---|---|---|
| GPT-4 (Zero-shot) | ? | 14.1% | 33.5% |
| LLaMA-3.1-8B (Zero-shot) | 8B | 1% | 4% |
| Qwen-7B-AgentTrek (Xu et al., 2024) | 7B | 10.5% | - |
| Qwen-32B-AgentTrek (Xu et al., 2024) | 32B | 16.3% | - |
| LLaVa-34B PAE (Zhou et al., 2024) | 34B | - | 33.0% |
| LLaVa-7B PAE (Zhou et al., 2024) | 7B | - | 22.3% |
| LLaMa-8B-NNetNav-WA (Ours) | 8B | 16.3% | 28.1% |
| LLaMa-8B-NNetNav-Live (Ours) | 8B | 9.5% | 35.2% |
| LLaMa-8B-NNetNav-All (Ours) | 8B | 14.9% | 34.1% |
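As a quick check on how the headline gains relate to the table, the short script below computes each NNetNav model's improvement over zero-shot LLaMA-3.1-8B in percentage points. The numbers are copied from the table above; the dictionary layout and the function name `gains_over_baseline` are only illustrative.

```python
from typing import Dict, Tuple

# (WebArena %, WebVoyager %) success rates, copied from the results table.
RESULTS: Dict[str, Tuple[float, float]] = {
    "LLaMA-3.1-8B (Zero-shot)": (1.0, 4.0),
    "LLaMa-8B-NNetNav-WA (Ours)": (16.3, 28.1),
    "LLaMa-8B-NNetNav-Live (Ours)": (9.5, 35.2),
    "LLaMa-8B-NNetNav-All (Ours)": (14.9, 34.1),
}


def gains_over_baseline(
    results: Dict[str, Tuple[float, float]], baseline: str
) -> Dict[str, Tuple[float, float]]:
    """Return the percentage-point gain of every other row over the baseline row."""
    base_wa, base_wv = results[baseline]
    return {
        name: (round(wa - base_wa, 1), round(wv - base_wv, 1))
        for name, (wa, wv) in results.items()
        if name != baseline
    }


gains = gains_over_baseline(RESULTS, "LLaMA-3.1-8B (Zero-shot)")
for name, (wa_gain, wv_gain) in gains.items():
    print(f"{name}: +{wa_gain} pts WebArena, +{wv_gain} pts WebVoyager")
```

The largest gain on each benchmark comes from a different model: NNetNav-WA on WebArena (+15.3 points) and NNetNav-Live on WebVoyager (+31.2 points), which lines up with the roughly 15- and 31-point improvements quoted in the abstract.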

Citation

If you find our work useful, please cite our paper:

@article{Murty2025NNetNav,
  author  = {Shikhar Murty and Hao Zhu and Dzmitry Bahdanau and Christopher Manning},
  title   = {NNetNav: Unsupervised Learning of Browser Agents Through Environment Interaction in the Wild},
  journal = {arXiv preprint arXiv:2410.02907},
  year    = {2025}
}