arXiv Preprint

The World Won't Stay Still:
Programmable Evolution for Agent Benchmarks

Guangrui Li*, Yaochen Xie*, Yi Liu*, Ziwei Dong*, Xingyuan Pan*, Tianqi Zheng*, Jason Choi*,
Michael Morais*, Binit Jha*, Shaunak Mishra*, Bingrou Zhou*, Chen Luo*, Monica Cheng*, Dawn Song†

* Amazon    † UC Berkeley

Overview

Figure: Graph-based programmable environment evolution workflow, from seed graph to versioned environments through controlled graph edits.

TL;DR

ProEvolve models an agent environment as a typed relational graph and evolves it through programmatic graph edits, growing a single seed environment into 200 versioned environments and thousands of task sandboxes for benchmarking agent robustness under change.

Abstract

LLM-powered agents solve user requests by interacting with environments, querying data, and invoking tools in multi-turn workflows. However, most existing benchmarks evaluate agents in static environments with fixed schemas and toolsets, which misses a key real-world challenge: environments evolve over time.

We introduce ProEvolve, a graph-based framework for programmable environment evolution. ProEvolve models data, tools, and schemas as a typed relational graph, where capability updates (adding, removing, or modifying tools and fields) are represented as coherent graph transformations. Based on this representation, the framework can both generate evolving environments automatically and instantiate task sandboxes via subgraph programming. We validate ProEvolve by evolving a single seed environment into 200 environments and 3,000 task sandboxes, then benchmarking representative agents to study robustness under dynamic change.

Core idea: represent data, tools, and schemas as one typed relational graph, so that capability updates (adding, removing, or modifying tools and fields) become coherent, programmable graph transformations, and task sandboxes become extracted subgraphs.
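
The sketch below illustrates this idea in minimal Python. It is not the authors' implementation: all class, field, and operation names here are assumptions made for this example. Nodes stand for data entities, tools, or schema fields; an evolution episode is a list of coherent graph edits that bumps the environment version; a task sandbox is an induced subgraph.

from dataclasses import dataclass, field, replace
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Node:
    id: str
    kind: str                      # "data" | "tool" | "schema_field"
    attrs: Dict[str, str] = field(default_factory=dict)

Edge = Tuple[str, str, str]        # (src_id, relation, dst_id)

@dataclass
class EnvGraph:
    version: int
    nodes: Dict[str, Node]
    edges: List[Edge]

    def apply(self, edits) -> "EnvGraph":
        """Apply one evolution episode (a list of edits) and bump the version."""
        nodes, edges = dict(self.nodes), list(self.edges)
        for op, payload in edits:
            if op == "add_node":
                nodes[payload.id] = payload
            elif op == "remove_node":
                nodes.pop(payload, None)
                # Coherence: drop every edge touching the removed node.
                edges = [e for e in edges if payload not in (e[0], e[2])]
            elif op == "modify_node":
                node_id, new_attrs = payload
                nodes[node_id] = replace(
                    nodes[node_id], attrs={**nodes[node_id].attrs, **new_attrs}
                )
            elif op == "add_edge":
                edges.append(payload)
        return EnvGraph(self.version + 1, nodes, edges)

    def sandbox(self, node_ids) -> "EnvGraph":
        """Instantiate a task sandbox as an induced subgraph (subgraph programming)."""
        keep = set(node_ids)
        return EnvGraph(
            self.version,
            {i: n for i, n in self.nodes.items() if i in keep},
            [e for e in self.edges if e[0] in keep and e[2] in keep],
        )

# One evolution step on a toy seed graph: add a tool and wire it to a field.
seed = EnvGraph(0, {"orders.status": Node("orders.status", "schema_field")}, [])
v1 = seed.apply([
    ("add_node", Node("cancel_order", "tool", {"signature": "cancel_order(order_id)"})),
    ("add_edge", ("cancel_order", "writes", "orders.status")),
])
task = v1.sandbox({"cancel_order", "orders.status"})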

Key Results

200 versioned environments generated
3,000 task sandboxes in evolving settings
384 unique tools across the evolved benchmark
50 evolution episodes from one seed environment

Efficiency and Robustness Trends

Preliminary experiments show clear gaps between models and real adaptation challenges under sequential environment evolution. Memory replay helps in some conditions, whereas reflection replay can degrade robustness, depending on the model and the type of evolution.
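
To make the protocol concrete, here is a hedged sketch of a sequential evaluation loop of the kind described above: the agent is run on each environment version in order, optionally carrying forward past trajectories ("memory replay") or self-critiques ("reflection replay"). The agent interface (run, reflect) is assumed for illustration and is not the paper's API.

from typing import Any, List

def evaluate_sequential(agent: Any, env_versions: List[Any], strategy: str = "none") -> List[bool]:
    carried: List[Any] = []    # trajectories or reflections carried across versions
    successes: List[bool] = []
    for env in env_versions:
        context = carried if strategy in ("memory", "reflection") else []
        trajectory, success = agent.run(env, context=context)
        successes.append(success)
        if strategy == "memory":
            carried.append(trajectory)       # replay raw past trajectories
        elif strategy == "reflection":
            carried.append(agent.reflect(trajectory))  # replay self-generated critiques
    return successes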

Figure: Replay strategy comparison across model families and evolving environment versions.

Citation

If you use this benchmark framework, please cite the project page; a full paper citation will be provided once the arXiv identifier is available.
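
Until then, a provisional BibTeX entry along the following lines can be adapted; the citation key, year, and note are placeholders rather than official metadata, and the arXiv identifier is deliberately left pending.

@misc{proevolve2026,
  title  = {The World Won't Stay Still: Programmable Evolution for Agent Benchmarks},
  author = {Li, Guangrui and Xie, Yaochen and Liu, Yi and Dong, Ziwei and Pan, Xingyuan and Zheng, Tianqi and Choi, Jason and Morais, Michael and Jha, Binit and Mishra, Shaunak and Zhou, Bingrou and Luo, Chen and Cheng, Monica and Song, Dawn},
  year   = {2026},
  note   = {arXiv identifier pending}
}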

Last updated: March 2026.