vis-arena generate --dataset <slug>
“A submission may itself run a multi-step or compute-heavy workflow, but that happens within a fixed evaluation budget. The agent is not autonomously improving itself across intervals; any cross-interval improvement comes from the participant resubmitting an updated system.”