
W8-RL Overview

W8-RL is a Ray-distributed rollout framework for web-agent RL. It drives a real browser (WootzApp) to produce ChromiumRL visual and semantic reward signals, and it runs entirely in Docker.

W8-RL exposes three compatibility paths on top of the same EnvActor core:

  • SkyRL (BaseTextEnv interface, Ray-backed)
  • OpenEnv (HTTP via Ray Serve + FastAPI; see the sketch below)
  • Tinker (token-based RL training via Tinker Cookbook)
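
For example, the OpenEnv path serves the environment over HTTP. Below is a minimal sketch of that pattern using Ray Serve's FastAPI integration; the `EnvServer` class, the `/step` route, and the payload shape are illustrative assumptions, not W8-RL's actual API:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from ray import serve

app = FastAPI()

class StepRequest(BaseModel):
    action: str  # e.g. a click/type/navigate command

@serve.deployment
@serve.ingress(app)
class EnvServer:
    """Hypothetical HTTP wrapper; W8-RL's real routes may differ."""

    @app.post("/step")
    async def step(self, req: StepRequest) -> dict:
        # In the real system this would forward the action to an
        # EnvActor driving the browser and return its reward bundle.
        return {"observation": "...", "reward": 0.0, "done": False}

serve.run(EnvServer.bind())
```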

All paths share the same execution core:

Task container -> EnvActor -> emulator browser -> ChromiumRL signals -> reward bundle
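
The reward bundle at the end of this pipeline combines task reward with the ChromiumRL performance signals listed under Differentiators below. A rough sketch of what such a bundle might contain; all field names here are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class RewardBundle:
    # Hypothetical layout; the actual bundle is W8-RL-internal.
    visual_reward: float    # visual-similarity component
    semantic_reward: float  # semantic/task component
    paint_time_ms: float    # paint time from browser internals
    fcp_ms: float           # First Contentful Paint
    lcp_ms: float           # Largest Contentful Paint
    cls: float              # Cumulative Layout Shift
```
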
Docker-only execution

Everything must run in Docker. Do not run Python or tests on the host. Use the provided scripts or docker compose run.
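
For instance, to run the test suite inside the container instead of on the host (the `w8rl` service name is a placeholder; use whatever your compose file defines):

```bash
docker compose run --rm w8rl pytest tests/
```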

If you want a quick start, see Getting Started.

Tested with

  • Ray 2.40.0
  • Docker 24+

W8-RL Differentiators

What W8-RL does well compared to other web-agent RL systems:

| Capability | W8-RL Approach | Benefit |
| --- | --- | --- |
| Async rollouts | FairScheduler with no synchronization barriers | Eliminates GPU idle time from straggler rollouts |
| Queue scheduling | Per-environment OperationDispatcher with separate queues by operation type | Handles long-tail latency; navigation can't block screenshots |
| Batch collection | Timeout OR batch_size triggers | Smooth GPU utilization without CPU spikes |
| CDP resilience | 8-state connection machine with automatic reconnection | Production-grade reliability for flaky browser connections |
| Browser isolation | Full emulator pool with per-rollout isolation | No cross-contamination between parallel trajectories |
| ChromiumRL signals | Paint time, CLS, LCP, FCP from browser internals | Performance-aware rewards beyond visual similarity |
| GPU pipelining | `screenshot_queue_size=4` streams observations | Overlaps screenshot capture with inference |
| GRPO grouping | K trajectories per task with relative advantage | Variance reduction within task groups |
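
To illustrate the "Timeout OR batch_size" trigger, here is a minimal sketch of the pattern in Python. This is not W8-RL's actual collector; the function name and queue type are assumptions:

```python
import queue
import time

def collect_batch(q: queue.Queue, batch_size: int, timeout_s: float) -> list:
    """Flush when batch_size items arrive OR timeout_s elapses, whichever is first."""
    batch: list = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout trigger: return a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # queue stayed empty until the deadline
    return batch
```

Flushing partial batches on timeout is what keeps GPU utilization smooth: inference never waits indefinitely for a full batch.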

Long-Tail Latency Handling

Browser simulations have brutal long-tail latency: some rollouts take far longer than others while GPUs sit idle waiting. W8-RL addresses this in three ways (see the sketch after this list):

  1. Per-task timeout: Stuck rollouts get killed, not waited on
  2. Non-blocking threads: While one rollout hangs, other threads utilize the GPU
  3. Intervention only when needed: Manual intervention is required only when (a) nearly all threads have finished while a few still hang, or (b) one thread takes unexpectedly long
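
A minimal sketch of the kill-don't-wait pattern using plain Ray primitives; the `rollout` task and the 120-second budget are placeholders, and W8-RL's actual scheduler (FairScheduler) is more involved:

```python
import ray

@ray.remote
def rollout(task_id: int) -> dict:
    # Placeholder for one browser rollout.
    return {"task": task_id, "reward": 0.0}

ray.init()
refs = [rollout.remote(i) for i in range(64)]

# Wait up to the budget, then kill anything still running.
done, pending = ray.wait(refs, num_returns=len(refs), timeout=120.0)
for ref in pending:
    ray.cancel(ref, force=True)  # stuck rollouts get killed, not waited on
results = ray.get(done)
```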

Improvement Roadmap

Enhancements identified through research analysis, tracked by status:

| Gap | Priority | Approach | Status |
| --- | --- | --- | --- |
| Difficulty-aware horizons | High | `max_steps = 10 * ceil(difficulty/3)` based on rubric fact count | Implemented |
| Action repetition detection | High | Perceptual hash (`imagehash.phash`); terminate after 3 consecutive repeats | Implemented |
| Rubric-based evaluation | Medium | LLM-generated fact groups with partial credit scoring | Implemented |
| Episode memory | Medium | Ray Actor-based KV store for cross-step information synthesis | Planned |
| Task decomposition | Deferred | Generate subtasks from fact group subsets for denser rewards | Research |
| Task scale expansion | Deferred | Aggregate benchmarks + synthetic generation (target: 100K+ tasks) | Research |
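
The action-repetition detector can be sketched with the `imagehash` library named in the table. `imagehash.phash` and Hamming distance via `-` are the library's real API; the function, window size, and distance cutoff below are illustrative assumptions:

```python
from PIL import Image
import imagehash

def is_stuck(screenshot_paths: list[str], repeats: int = 3, max_dist: int = 2) -> bool:
    """True if the last `repeats` screenshots are perceptually near-identical."""
    if len(screenshot_paths) < repeats:
        return False
    hashes = [imagehash.phash(Image.open(p)) for p in screenshot_paths[-repeats:]]
    # Subtracting two ImageHash objects yields their Hamming distance.
    return all(hashes[i] - hashes[i + 1] <= max_dist for i in range(len(hashes) - 1))
```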

Difficulty Classification

Tasks are organized by fact count (total evaluation criteria in rubric):

tasks/difficulty/
├── 1/ # Trivial (1 fact, max 10 steps)
├── 2-3/ # Easy (max 10 steps)
├── 4-6/ # Medium (max 20 steps)
└── 7+/ # Hard (max 30 steps)

This enables glob-based parallel sharding (tasks/difficulty/[1-3]/**/*.json) and natural horizon control.
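
Both mechanisms are simple to express. Assuming `fact_count` is the rubric's total fact count, the horizon rule from the roadmap table and the sharding glob look like this:

```python
import glob
import math

def max_steps(fact_count: int) -> int:
    # 10 * ceil(difficulty / 3): 1-3 facts -> 10, 4-6 -> 20, 7-9 -> 30, ...
    return 10 * math.ceil(fact_count / 3)

# Shard using the glob from the text (difficulty directories 1-3).
easy_tasks = glob.glob("tasks/difficulty/[1-3]/**/*.json", recursive=True)
```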

Next Steps