The RL Environment Company

We Rewrote the Browser
To Build Your Gym.

We bring the environment and the rollout system. Our custom browser renderer and W8 async infrastructure deliver the rewards and throughput you need to train agents for real work.

Capabilities

What You Can Build

Frontend Codegen

Give your code generation models eyes. Our renderer generates visual and structural rewards, enabling agents to iterate on UI until it's pixel-perfect.

Agents

Train agents on the live web, not static snapshots. We handle the complexity of modern web apps—auth, popups, and dynamic DOMs—so you can focus on reasoning.

Browser Games

Turn any browser game into a reasoning gym. We expose internal game state and provide deterministic frame stepping for high-fidelity RL training.

Deep Search Evals

Evaluate long-horizon search capabilities. Let agents navigate the open web to find answers, with full trajectory replay and ground-truth validation.

wootz-browser

researcher@lab:~$ # The interface to your RL environment

$ swe-rl rollout --workers 4 --backend gemini

✓ Userspace reboot (10s) ........... OK

✓ Async batching ................... ACTIVE

✓ CDP reward stream ................ CONNECTED

$ swe-rl evaluate --suite swe-bench-verified

Running evaluation on 50 tasks...

[Env 0] Success: 0.82 (Reward: 0.94)

[Env 1] Success: 0.79 (Reward: 0.88)

$ cat metrics.json | grep -i utilization

"concurrency_utilization": 0.92

"gpu_saturation": 0.88

W8 Rollout Spec
Production Ready
Verified

The only browser built for RL. We expose internal renderer signals to generate rewards that standard browsers can't, all while running 10x faster rollouts.

⚡ 10s Reset

Userspace reboot

🔄 Async Infra

No step barriers

💎 Pure Rewards

Browser-native signals

🛡️ Legit Infra

Production stable

The Async Rollout System

W8-RL: The Rollout Engine for Browser-Based RL

A fully async, emulator-centric rollout system designed to saturate GPUs. Features pluggable scheduling algorithms (interface sketched below), userspace reboots, and per-node inference routing.

Pluggable Scheduling

  • SHDS: Short-Horizon Diversified Scheduler for coverage
  • Bandit-Time: Optimize reward/sec via UCB
  • GRPO: Automatic K-rollout grouping for advantage

Emulator-Centric Control

  • Userspace Reboot: 10s resets (kernel stays hot)
  • Browser-Owned CDP: Disconnects are state transitions
  • Direct Injection: DOM signals bypass the wire

WebGym-Style Async

  • No Barriers: Zero blocking on step/episode boundaries
  • Per-Node Routing: Local screenshot loading
  • Op Queues: Navigation/Screenshot isolation
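
Concrete policies plug into a common interface. Below is a minimal sketch, in Python, of what such a pluggable scheduler interface could look like; the class and method names (Scheduler, TaskStats, next_task, report) are illustrative assumptions, not the shipped W8 API.

```python
# Illustrative sketch of a pluggable scheduler interface for an async
# rollout system like the one described above. Names are hypothetical.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class TaskStats:
    episodes: int = 0
    mean_reward: float = 0.0
    rewards: list[float] = field(default_factory=list)


class Scheduler(ABC):
    """Decides which task each idle environment runs next."""

    def __init__(self, task_ids: list[str]):
        self.stats = {tid: TaskStats() for tid in task_ids}

    @abstractmethod
    def next_task(self, env_id: int) -> str:
        """Pick a task for the environment that just went idle."""

    def report(self, task_id: str, reward: float) -> None:
        """Feed a finished episode's reward back into the scheduler."""
        s = self.stats[task_id]
        s.rewards.append(reward)
        s.episodes += 1
        s.mean_reward = sum(s.rewards) / s.episodes
```

Concrete policies such as SHDS or Bandit-Time would subclass Scheduler and implement next_task against these statistics.
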
Our Stance

We Build Environments,
Not Models

We are a software company that builds the gymnasium. You build the athlete. Standard browsers are black boxes. We rewrote the renderer and network stack to generate deterministic, browser-native rewards no one else can.

01

We provide the gymnasium. You build the athlete.

02

Standard browsers are black boxes. We rewrote the renderer and network stack.

03

This lets us generate deterministic, browser-native rewards no one else can.

04

To drive this custom browser, we built the W8 Async Rollout System.

05

Userspace reboots in 10s. Zero synchronous barriers.

06

Our goal: Simulate and automate every task in the knowledge economy.

The Gym

You Train, We Grade

Models are dropped into our environments and tasked with objectives like building features or debugging. We grade their work against task-specific success criteria.
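
As a rough illustration, the loop looks like a standard gym-style interaction: the environment owns rendering and grading, your policy owns decisions and updates. StubBrowserEnv below is a hypothetical stand-in, not a published API.

```python
# Minimal sketch of the "you train, we grade" loop. StubBrowserEnv is a
# hypothetical stand-in for a graded browser environment; real grading
# happens inside the environment, not in your training code.
import random


class StubBrowserEnv:
    def __init__(self, task: str, horizon: int = 5):
        self.task, self.horizon, self.t = task, horizon, 0

    def reset(self) -> dict:
        self.t = 0
        return {"dom": "<html>...</html>", "screenshot": None}

    def step(self, action: str):
        self.t += 1
        reward = random.random()   # placeholder for the graded score
        done = self.t >= self.horizon
        return {"dom": "..."}, reward, done, {}


env = StubBrowserEnv(task="fix-failing-test")
obs, done = env.reset(), False
while not done:
    action = "click #submit"       # your policy decides here
    obs, reward, done, info = env.step(action)
```
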

The Infra

We Bring the Rollout

We provide the rollout system because we own the browser. Our W8 architecture delivers the 10s resets and async inference needed for scale.

The Goal

Automate Everything

We're starting with the hardest problem: software engineering. But our infrastructure is built to scale until every task in the knowledge economy can be simulated, graded, and automated.

Integrated with Infrastructure & Training Platforms

Native integration with Training & Finetuning Partners

We don't train models—we provide the reality they learn from. WootzApp integrates natively with inference providers like Together.ai and orchestration frameworks like Ray and CleanRL. You bring the policy and the compute; we supply the massive-scale, interactive browser simulations required to close the loop.
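
As a sketch of what "closing the loop" can look like, the snippet below fans episodes out with Ray while an OpenAI-compatible inference endpoint (for example, Together.ai's) serves the policy. The run_episode worker and task names are assumptions for illustration.

```python
# Sketch: parallel rollouts with Ray against a remote policy endpoint.
import ray

ray.init()


@ray.remote
def run_episode(task_id: str, endpoint: str) -> float:
    # Hypothetical worker: drive one browser episode, query the policy
    # served at `endpoint` for each action, and return the graded reward.
    return 0.0


futures = [
    run_episode.remote(task, "https://api.together.xyz/v1")  # illustrative endpoint
    for task in ["news-homepage", "fix-failing-test", "checkout-flow"]
]
rewards = ray.get(futures)   # graded episodes, ready for your update step
```
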

Bypass anti-bot protections and capture frame-perfect rendering events via our hardened CDP pipeline.

Advanced Scheduling

Maximize Throughput with Smart Scheduling

Raw speed isn't enough. We provide the scheduling algorithms to make sure every GPU cycle counts.

Short-Horizon Diversified (SHDS)

Maximizes task coverage by mixing easy and hard tasks with adaptive horizons, preventing overfitting while maintaining throughput.

Bandit-Time Scheduling

Optimizes for reward-per-second using UCB scores and variance tracking. Perfect for high-efficiency training runs.
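
A minimal sketch of what UCB-style task selection can look like, assuming per-task reward-per-second and pull counts are tracked; the data layout is illustrative, not the shipped scheduler.

```python
# Sketch of UCB-style task selection for a Bandit-Time scheduler:
# exploit reward-per-second, plus a standard UCB1 exploration bonus.
import math


def ucb_pick(stats: dict, total_pulls: int, c: float = 1.4) -> str:
    """stats[task] = {"reward_per_sec": float, "pulls": int}."""
    best_task, best_score = None, float("-inf")
    for task, s in stats.items():
        if s["pulls"] == 0:
            return task                       # try every task at least once
        bonus = c * math.sqrt(math.log(total_pulls) / s["pulls"])
        score = s["reward_per_sec"] + bonus   # exploit + explore
        if score > best_score:
            best_task, best_score = task, score
    return best_task
```
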

GRPO Grouping

Automatic task grouping ensures K trajectories per task for advantage computation, compatible with modern RL algorithms.
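
The grouping exists so that group-relative advantages can be computed per task. A minimal sketch of the standard GRPO normalization over K rewards from the same task:

```python
# Group-relative advantages over K rollouts of one task:
# A_i = (r_i - mean(r)) / std(r). Scheduling details are assumed.
import statistics


def group_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    return [(r - mean) / std for r in rewards]


# K = 4 graded trajectories of the same task:
print(group_advantages([0.2, 0.9, 0.4, 0.9]))
```
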

Early Stop Policies

DomProgress and visual hash monitoring prevent wasted compute on stuck or looped episodes.
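
As an illustration of the idea (not the shipped DomProgress policy), a visual-hash check can cut an episode once consecutive screenshots stop changing:

```python
# Illustrative early-stop check: hash consecutive screenshots and end the
# episode when the page has not changed for `patience` steps.
import hashlib


class VisualStallDetector:
    def __init__(self, patience: int = 3):
        self.patience, self.last_hash, self.stalled = patience, None, 0

    def should_stop(self, screenshot_bytes: bytes) -> bool:
        h = hashlib.sha256(screenshot_bytes).hexdigest()
        self.stalled = self.stalled + 1 if h == self.last_hash else 0
        self.last_hash = h
        return self.stalled >= self.patience
```
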

ChromiumRL Signals

Rewards from the Metal Up

Because we own the renderer, we can grade layout stability, paint events, and network purity—signals impossible to get from Selenium or Playwright.
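
A rough sketch of how such signals could be composed into one dense score; the signal names and weights below are assumptions for illustration, not the production grader.

```python
# Illustrative composition of browser-native signals into a dense reward.
def dense_reward(layout_shift: float, paint_count: int,
                 blocked_requests: int, structure_match: float) -> float:
    stability = max(0.0, 1.0 - layout_shift)         # CLS-like: lower shift is better
    purity = 1.0 if blocked_requests == 0 else 0.5   # penalize dirty network traffic
    paint_penalty = min(paint_count, 20) / 20 * 0.1  # discourage render thrash
    return 0.5 * structure_match + 0.3 * stability + 0.2 * purity - paint_penalty
```
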

Proof of Superiority

Our rewards show higher monotonicity and better near-miss separation than standard pass/fail tests. We don't just tell you if you failed—we tell you by how much.

Every pixel, every DOM node, and every network request is part of the grade.

Sample Efficiency

Reach target scores with 4x fewer samples using our dense rewards.

Reward Gradient

Monotonic rewards that don't plateau, guiding models through near-misses.

Dense Signals

Feedback on every render, not just sparse pass/fail flags.

Determinism

Locked viewports, fonts, and time for reproducible grading.

Scale

Run 1000s of concurrent environments with minimal overhead.

Example environment (news homepage)

Spec locks in grid (`2fr 1fr`), hero ratios, section order, tokens, and policy boundaries. Reward suites check structure, semantics, responsiveness, accessibility, and compliance.

  • Ship as Dockerized RL APIs or Verifiers-ready packages.
  • Scorecards surface structure, token, and accessibility deltas.
  • Every drop includes spec, DSL, and policy versions for audit trails.
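
For a sense of shape only, a spec like the news-homepage example might encode something like the following; this is hypothetical Python data, not the actual DSL or policy format.

```python
# Hypothetical spec for the news-homepage environment (illustrative only).
NEWS_HOMEPAGE_SPEC = {
    "layout": {
        "grid": "2fr 1fr",
        "hero_ratio": "16:9",
        "section_order": ["hero", "top-stories", "opinion", "promo"],
    },
    "tokens": {"font_family": "Inter", "accent": "#0b57d0", "spacing_px": 8},
    "reward_suites": ["structure", "semantics", "responsive",
                      "accessibility", "compliance"],
    "policy": {"no_third_party_trackers": True, "motion_safe": True},
    "versions": {"spec": "x.y.z", "dsl": "x.y", "policy": "YYYY-MM"},
}
```
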

Responsive layout

Models practice grids, breakpoints, and spacing with human review loops built in.

Design tokens

Reward functions enforce typography, color, and component tokens your systems rely on.

Accessibility

Specs demand semantic structure, focus states, and motion-safe defaults across viewports.

Reusable modules

Environment DSL keeps cards, rails, and promos composable instead of one-off markup.

Validation

Validate Before You Train

80% of rollout bugs are systems issues. Run our scorecard to verify throughput, latency, and reliability before you launch a training run.
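
In the spirit of that scorecard, a pre-flight check can assert reset latency, step throughput, and failure rate against thresholds before any GPUs are committed; the env API and numbers below are assumptions.

```python
# Sketch of a pre-training sanity check: fail fast on systems issues.
import time


def validate(env_factory, episodes: int = 20, max_reset_s: float = 15.0,
             min_steps_per_s: float = 2.0, max_failure_rate: float = 0.05):
    failures, steps, start = 0, 0, time.time()
    for _ in range(episodes):
        t0 = time.time()
        env = env_factory()
        env.reset()
        assert time.time() - t0 < max_reset_s, "reset too slow"
        try:
            for _ in range(10):
                _, _, done, _ = env.step("noop")
                steps += 1
                if done:
                    break
        except Exception:
            failures += 1
    assert steps / (time.time() - start) >= min_steps_per_s, "throughput too low"
    assert failures / episodes <= max_failure_rate, "too many env failures"
```
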

WootzApp W8 — The Rollout Infrastructure for Browser-Based RL