Agent Testing And Delegated Stage Auth

Problem

Coding agents can already create, publish, and update Stage apps, but they still hit hard blockers when a realistic flow requires:

  • browser login to the platform
  • end-user OAuth inside a staged app
  • access to APIs on behalf of the human who launched the app
  • repeatable Playwright or UI tests that should run during push-to-stage

The current platform has strong building blocks:

  • gateway login already supports CLI exchange codes and bearer tokens
  • Stage already stores end-user Google authorizations server-side
  • Stage apps already call a server-side Google proxy instead of handling raw refresh tokens in the browser

The missing piece is a first-class, scoped, auditable way to let an agent test a staged app as a delegated identity without turning off auth or copying long-lived user secrets into CI.

Goals

  • Let agents run unit, integration, E2E, and UI tests without waiting for manual browser login during each run.
  • Preserve least privilege.
  • Keep live third-party tests opt-in and auditable.
  • Make the default path deterministic and cheap.
  • Reuse the existing gateway, Stage, and CLI auth model instead of adding special-case bypasses.

Non-Goals

  • Global "agent can impersonate any user" access.
  • Passing raw refresh tokens, session cookies, or password flows to agents.
  • Using user-agent checks, IP allowlists, or hidden bypass query params as auth.
  • Making all E2E tests hit live third-party providers by default.

Design Principles

  1. Prefer hermetic tests over live tests.
  2. Use delegated credentials, not copied human credentials.
  3. Bind every agent grant to a specific app or Stage session.
  4. Give each grant a TTL, audience, and explicit capabilities.
  5. Record who delegated access, to which agent run, for which deploy, and for what scopes.
  6. Make browser automation consume a bootstrap artifact instead of a human login screen.

Proposed Testing Model

Every app should declare which test modes it supports.

| Mode | Purpose | External auth | Expected frequency |
|---|---|---|---|
| unit | Pure logic and component tests | none | every change |
| integration | app + platform service contracts | mocked or in-process | every change |
| e2e-synthetic | full browser flow on staged app | delegated platform auth, fake provider data | every push to stage |
| e2e-live | real provider flow against staged app | delegated user auth with explicit scopes | opt-in, narrow smoke suite |

The default gate on a stage deploy should be unit + integration + e2e-synthetic.

e2e-live should be reserved for:

  • smoke checks on the exact user path that matters
  • provider regressions
  • debugging sessions where the human explicitly opts in
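
A minimal sketch of how a deploy step might pick the modes to run, assuming the app declares its supported modes as a list. Only the mode identifiers come from the table above; `GateOptions`, `DEFAULT_STAGE_GATE`, and `modesForStageDeploy` are hypothetical names.

```typescript
// Mode identifiers are taken from the table above; everything else is illustrative.
type TestMode = "unit" | "integration" | "e2e-synthetic" | "e2e-live";

interface GateOptions {
  // e2e-live is opt-in: a human must enable it explicitly for this run.
  liveOptIn?: boolean;
}

// Default stage gate: unit + integration + e2e-synthetic.
const DEFAULT_STAGE_GATE: readonly TestMode[] = ["unit", "integration", "e2e-synthetic"];

function modesForStageDeploy(supported: readonly TestMode[], opts: GateOptions = {}): TestMode[] {
  const wanted: TestMode[] = [...DEFAULT_STAGE_GATE];
  if (opts.liveOptIn) wanted.push("e2e-live");
  // Only run modes the app declares it supports.
  return wanted.filter((mode) => supported.indexOf(mode) !== -1);
}
```

With this shape, `e2e-live` never runs unless both the app supports it and the run opts in.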

Core Architecture

1. Delegated Agent Test Grant

Add a new gateway-issued credential type for agent testing.

Implementation choice:

  • expose the grant to clients as an opaque random handle, for example sat_...
  • store only a hash server-side
  • keep the authoritative grant record server-side for listing, revocation, and audit
  • derive signed browser session cookies only after a one-time exchange
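
The handle-and-hash choice could look roughly like the sketch below. It assumes SHA-256 over a random handle and uses an in-memory `Map` as a stand-in for the server-side record; `mintGrantHandle`, `findGrant`, and `grantsByHash` are hypothetical names, not existing gateway code.

```typescript
import { createHash, randomBytes } from "node:crypto";

// Stand-in for the authoritative server-side grant record (assumption).
const grantsByHash = new Map<string, { label: string; revokedAt?: number }>();

function hashHandle(handle: string): string {
  return createHash("sha256").update(handle).digest("hex");
}

// Mint an opaque handle, return it to the client once, and keep only its hash.
function mintGrantHandle(label: string): string {
  const handle = `sat_${randomBytes(32).toString("hex")}`;
  grantsByHash.set(hashHandle(handle), { label });
  return handle;
}

// Look up a presented handle by hash; the raw handle is never stored.
function findGrant(handle: string): { label: string; revokedAt?: number } | undefined {
  return grantsByHash.get(hashHandle(handle));
}
```

Because only the hash is stored, a leaked copy of the grant table does not yield usable handles.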

Suggested stored fields or claims:

  • sub: agent-run:<runId>
  • act: delegating human email
  • aud: stage-test
  • appSid or sessionId
  • deployId
  • capabilities
  • providerScopes
  • exp
  • jti
  • label
  • revokedAt
  • lastUsedAt

Capabilities should be explicit, for example:

  • stage.read
  • stage.write
  • stage.render
  • stage.browser
  • stage.google.proxy
  • app.api

This is not the human's normal CLI token. It is a derived, narrower credential.

Important constraints:

  • default TTL should be minutes, not hours or days
  • a grant must be bound to a specific app, session, or deploy
  • a grant must never be accepted as a general replacement for a human session across the platform
  • direct browser use should go through one-time exchange, not by putting the raw grant into a URL
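
The constraints above can be sketched as a single validity check. The field names follow a subset of the suggested list; the check order, the millisecond `exp`, and the names `GrantRequest`, `GrantCheck`, and `checkGrant` are assumptions for illustration.

```typescript
// Subset of the suggested fields; exp is assumed to be epoch milliseconds.
interface AgentTestGrant {
  sub: string;          // "agent-run:<runId>"
  act: string;          // delegating human email
  aud: string;          // expected to be "stage-test"
  appSid?: string;
  sessionId?: string;
  deployId?: string;
  capabilities: string[];
  exp: number;
  revokedAt?: number;
}

interface GrantRequest {
  appSid?: string;
  sessionId?: string;
  now: number;
}

type GrantCheck = { ok: true } | { ok: false; reason: string };

function checkGrant(grant: AgentTestGrant, req: GrantRequest): GrantCheck {
  if (grant.aud !== "stage-test") return { ok: false, reason: "wrong audience" };
  if (grant.revokedAt !== undefined) return { ok: false, reason: "revoked" };
  if (req.now >= grant.exp) return { ok: false, reason: "expired" };
  // A grant with no app, session, or deploy binding is never accepted.
  if (!grant.appSid && !grant.sessionId && !grant.deployId) {
    return { ok: false, reason: "unbound grant" };
  }
  if (grant.appSid && grant.appSid !== req.appSid) return { ok: false, reason: "app mismatch" };
  if (grant.sessionId && grant.sessionId !== req.sessionId) {
    return { ok: false, reason: "session mismatch" };
  }
  return { ok: true };
}
```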

2. One-Time Browser Bootstrap

Add a one-time exchange flow for browser automation.

Flow:

  1. Human authenticates once with shift-cli login.
  2. Human or CI asks shift-cli to mint an agent test grant for a specific app or session.
  3. Gateway returns:
    • a short-lived API bearer token for HTTP calls
    • a one-time browser bootstrap URL or exchange code
    • metadata about allowed scopes and expiry
  4. Playwright opens the bootstrap URL once before tests.
  5. Gateway validates the exchange code and sets agent-scoped cookies.
  6. Tests run with a real browser session and no manual login screen.

This should mirror the existing /auth/exchange pattern instead of inventing a new login system.

Do not use:

  • GET /auth/...?...token=sat_xxx
  • raw bearer tokens in browser query params
  • the human's existing shift_session cookie
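
The one-time property of the exchange is the important part of this flow. A minimal sketch, assuming an in-memory store and caller-supplied random codes (`issueExchangeCode`, `redeemExchangeCode`, and `pendingExchanges` are hypothetical names):

```typescript
// Stand-in for gateway storage; a real implementation would persist codes with a short TTL.
interface PendingExchange {
  grantId: string;
  expiresAt: number; // epoch milliseconds
}

const pendingExchanges = new Map<string, PendingExchange>();

// The caller is assumed to generate `code` from a cryptographically secure source.
function issueExchangeCode(code: string, grantId: string, ttlMs: number, now: number): void {
  pendingExchanges.set(code, { grantId, expiresAt: now + ttlMs });
}

// Redeem a code at most once: it is deleted before the result is returned,
// so replaying the bootstrap URL yields nothing.
function redeemExchangeCode(code: string, now: number): string | null {
  const pending = pendingExchanges.get(code);
  if (!pending) return null;
  pendingExchanges.delete(code);
  return now < pending.expiresAt ? pending.grantId : null;
}
```

The gateway would set the agent-scoped cookies only when `redeemExchangeCode` returns a grant id.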

3. Separate Agent Browser Session

Do not reuse shift_session or shift_stage_user directly.

Add dedicated agent-test session cookies, for example:

  • shift_agent_session for platform routes
  • shift_stage_agent for Stage end-user routes

That separation gives:

  • clear auditability
  • different TTLs
  • route-level restrictions
  • easier revocation
  • no confusion between a real user session and a delegated test session

4. Delegated Stage End-User Auth

For Stage apps that need Google access, reuse the existing server-side authorization storage pattern.

Instead of copying the human's Stage cookie into the agent browser:

  1. Human authorizes the Stage app once.
  2. Stage stores provider refresh tokens server-side as it already does.
  3. The agent test grant references the existing userAuth record for that session and user.
  4. The Stage Google proxy accepts either:
    • an authenticated end-user cookie
    • an agent-test cookie whose grant is bound to the same session and allowed scopes

This lets the proxy keep issuing access tokens server-side while the agent only holds a narrow delegated session.
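
The proxy's acceptance rule can be sketched as a pure decision function. This is an illustration only: `ProxyCaller` and `mayUseGoogleProxy` are hypothetical names, and it assumes an end-user cookie is already authorized for its own session while an agent-test cookie needs the `stage.google.proxy` capability plus an allowlisted scope set.

```typescript
type ProxyCaller =
  | { kind: "end-user"; sessionId: string }
  | { kind: "agent-test"; sessionId: string; capabilities: string[]; providerScopes: string[] };

function mayUseGoogleProxy(
  caller: ProxyCaller,
  targetSessionId: string,
  requestedScopes: string[],
): boolean {
  // Both cookie types must be bound to the session being accessed.
  if (caller.sessionId !== targetSessionId) return false;
  if (caller.kind === "end-user") return true;
  // Agent-test cookies additionally need the proxy capability...
  if (caller.capabilities.indexOf("stage.google.proxy") === -1) return false;
  // ...and every requested scope must be in the grant's allowlist.
  const allowed = caller.providerScopes;
  return requestedScopes.every((scope) => allowed.indexOf(scope) !== -1);
}
```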

5. Provider Modes

Provider access should be explicit per test run.

| Provider mode | Backing source | Default |
|---|---|---|
| mock | app-defined fixtures | yes |
| replay | recorded provider responses | yes |
| synthetic | test tenant or sandbox workspace | recommended |
| live-delegated | real user authorization through server-side proxy | no |

Stage apps should be able to ask for a provider mode through session metadata so the same app can run:

  • hermetic tests in CI
  • realistic smoke tests on stage
  • live debugging when a human explicitly delegates access
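
One way to resolve the mode from session metadata is sketched below. The metadata field names (`providerMode`, `liveOptIn`) and the fallback to `mock` when nothing is requested are assumptions, not part of the existing session schema.

```typescript
type ProviderMode = "mock" | "replay" | "synthetic" | "live-delegated";

// Hypothetical session metadata shape.
interface SessionMetadata {
  providerMode?: string;
  liveOptIn?: boolean;
}

function resolveProviderMode(meta: SessionMetadata): ProviderMode {
  // Assumed default: fall back to the hermetic mode when nothing is requested.
  const requested = meta.providerMode ?? "mock";
  if (requested === "live-delegated") {
    // Live access is never implicit; a human must opt in for this session.
    if (!meta.liveOptIn) throw new Error("live-delegated requires explicit opt-in");
    return "live-delegated";
  }
  if (requested === "mock" || requested === "replay" || requested === "synthetic") {
    return requested;
  }
  throw new Error(`unknown provider mode: ${requested}`);
}
```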

Proposed Platform Changes

Gateway

Add:

  • delegated grant minting endpoint
  • one-time browser bootstrap exchange endpoint
  • middleware support for agent-test JWTs or cookies
  • route-level capability enforcement
  • revocation and expiry checks
  • audit emission to Pulse and Ledger for every grant issue and redemption

Suggested routes:

  • POST /auth/agent/grants
  • GET /auth/agent/grants
  • DELETE /auth/agent/grants/:idOrLabel
  • POST /auth/agent/bootstrap
  • GET /auth/agent/bootstrap

Practical note:

  • the other plan's idea of opaque random sat_ tokens is good
  • its proposed agent-bridge?token=... shape is not
  • the merged design should keep opaque handles but exchange them for one-time codes before the browser sees anything
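
Route-level capability enforcement could be table-driven with a default-deny fallback. The sketch below is illustrative: the capability names come from this document, but the route prefixes in `ROUTE_RULES` are placeholders rather than real gateway routes, and `isRouteAllowed` is a hypothetical helper.

```typescript
type Capability =
  | "stage.read"
  | "stage.write"
  | "stage.render"
  | "stage.browser"
  | "stage.google.proxy"
  | "app.api";

interface RouteRule {
  method: string;
  prefix: string;
  needs: Capability;
}

// Placeholder route table (assumption): paths are not taken from the gateway.
const ROUTE_RULES: RouteRule[] = [
  { method: "GET", prefix: "/stage/sessions/", needs: "stage.read" },
  { method: "POST", prefix: "/stage/sessions/", needs: "stage.write" },
  { method: "POST", prefix: "/stage/google/", needs: "stage.google.proxy" },
];

function isRouteAllowed(granted: readonly Capability[], method: string, path: string): boolean {
  for (const rule of ROUTE_RULES) {
    if (rule.method === method && path.indexOf(rule.prefix) === 0) {
      return granted.indexOf(rule.needs) !== -1;
    }
  }
  // Default deny: an agent grant cannot reach routes without an explicit rule.
  return false;
}
```

Default deny is what keeps a grant from acting as a general replacement for a human session.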

shift-cli

Currently implemented commands:

  • shift-cli token create --session <id> --ttl 30m --json
  • shift-cli token list --json
  • shift-cli token revoke <grant-id> --json
  • shift-cli test bootstrap --app <sid> --output .shift/e2e-auth.json --json

The bootstrap output should be machine-friendly and include:

  • baseUrl
  • grantId
  • grantLabel
  • expiresAt
  • bootstrapUrl or exchangeCode
  • apiToken
  • providerMode
  • sessionId
  • appSid
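
A test runner could validate that file before starting the browser. The sketch below assumes every listed field is a string (including an ISO-8601 `expiresAt`) and that exactly the fields above are emitted; `BootstrapOutput` and `parseBootstrap` are hypothetical names.

```typescript
// Field names follow the list above; the string types are assumptions.
interface BootstrapOutput {
  baseUrl: string;
  grantId: string;
  grantLabel: string;
  expiresAt: string;
  bootstrapUrl?: string;
  exchangeCode?: string;
  apiToken: string;
  providerMode: string;
  sessionId: string;
  appSid: string;
}

const REQUIRED_FIELDS = [
  "baseUrl", "grantId", "grantLabel", "expiresAt",
  "apiToken", "providerMode", "sessionId", "appSid",
];

function parseBootstrap(json: string): BootstrapOutput {
  const data = JSON.parse(json) as Record<string, unknown>;
  for (const key of REQUIRED_FIELDS) {
    if (typeof data[key] !== "string") throw new Error(`missing field: ${key}`);
  }
  // At least one of the two browser bootstrap artifacts must be present.
  if (typeof data["bootstrapUrl"] !== "string" && typeof data["exchangeCode"] !== "string") {
    throw new Error("missing bootstrapUrl or exchangeCode");
  }
  return data as unknown as BootstrapOutput;
}
```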

Stage Convex Schema

Add tables for test grants and run state, for example:

  • agentTestRuns
  • agentTestGrants
  • providerReplays
  • providerFixtures

Minimum fields:

  • delegating user
  • agent or run identifier
  • app or session binding
  • provider mode
  • allowed scopes
  • TTL
  • revocation state
  • created and redeemed timestamps

If it is simpler to land incrementally, the first table can live alongside the existing auth state in root Convex and later be specialized for Stage test runs.

Stage Runtime

Add a test-context layer exposed to the runtime:

  • current provider mode
  • current test run id
  • deterministic seed
  • optional fixture bundle id

This lets app code and platform adapters select:

  • mock provider
  • replay provider
  • live proxy provider

without branching on ad hoc environment variables.
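
As a rough illustration, the test context and adapter choice might look like this. `TestContext` and `adapterKindFor` are hypothetical names, and routing `synthetic` through the live proxy against a sandbox account is an assumption this document does not settle.

```typescript
type ProviderMode = "mock" | "replay" | "synthetic" | "live-delegated";
type AdapterKind = "mock" | "replay" | "live-proxy";

// Hypothetical shape of the test context exposed to the runtime.
interface TestContext {
  providerMode: ProviderMode;
  testRunId: string;
  seed: number;              // deterministic seed
  fixtureBundleId?: string;  // optional fixture bundle id
}

function adapterKindFor(ctx: TestContext): AdapterKind {
  switch (ctx.providerMode) {
    case "mock":
      return "mock";
    case "replay":
      return "replay";
    // Assumption: synthetic runs use the server-side proxy with a sandbox account.
    case "synthetic":
    case "live-delegated":
      return "live-proxy";
  }
}
```

App code then asks the context for an adapter instead of reading environment variables.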

SDK And Test Helpers

Add a first-class testing surface in packages/sdk/src/testing/.

Recommended modules:

  • client.ts for authenticated API clients
  • session.ts for Stage session lifecycle helpers
  • browser.ts for Playwright bootstrap helpers
  • lifecycle.ts for deploy-and-test orchestration

Recommended helpers:

  • createTestClient()
  • createTestSession()
  • authenticatedPage()
  • deployAndTest()

This part of the other plan is directionally right and should be kept.
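
A possible shape for the browser helper is sketched below. It is not the SDK's actual signature: the local `BrowserLike`, `ContextLike`, and `PageLike` interfaces are structural subsets of Playwright's `Browser`, `BrowserContext`, and `Page`, declared here so the sketch has no dependencies, and `BootstrapInfo` is a hypothetical type.

```typescript
// Structural subsets of Playwright types (only the methods used here).
interface PageLike {
  goto(url: string): Promise<unknown>;
}
interface ContextLike {
  newPage(): Promise<PageLike>;
}
interface BrowserLike {
  newContext(): Promise<ContextLike>;
}

// Hypothetical subset of the bootstrap JSON.
interface BootstrapInfo {
  baseUrl: string;
  bootstrapUrl: string;
}

// Hypothetical shape for authenticatedPage(); the real helper may differ.
async function authenticatedPage(browser: BrowserLike, boot: BootstrapInfo): Promise<PageLike> {
  // Fresh context, so no human cookies leak into the agent session.
  const context = await browser.newContext();
  const page = await context.newPage();
  // Visiting the bootstrap URL once lets the gateway set agent-scoped cookies.
  await page.goto(boot.bootstrapUrl);
  await page.goto(boot.baseUrl);
  return page;
}
```

A Playwright `Browser` instance should satisfy `BrowserLike` structurally, so a test could pass it in directly.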

CI And Push-To-Stage Flow

Recommended flow for a staged deploy:

  1. shift-cli stage push publishes or updates the app.
  2. The push step optionally requests an agent grant for the resulting app/session.
  3. The CLI emits bootstrap JSON for the E2E runner.
  4. The E2E runner:
    • restores browser state via bootstrap
    • runs synthetic smoke tests
    • optionally runs live-delegated smoke tests if explicitly enabled
  5. The gateway revokes the grant when:
    • TTL expires
    • the deploy is replaced
    • the test run ends
    • the human explicitly revokes it
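
The four revocation triggers in step 5 can be sketched as one function. `GrantLifecycle` and `revocationReason` are hypothetical names, and the precedence between triggers is an arbitrary choice for illustration.

```typescript
type RevocationReason = "ttl-expired" | "deploy-replaced" | "run-ended" | "human-revoked";

// Hypothetical view of a grant's lifecycle state.
interface GrantLifecycle {
  expiresAt: number;   // epoch milliseconds
  deployId: string;    // deploy the grant was minted for
  runEnded: boolean;
  humanRevoked: boolean;
}

// Returns why the grant should be revoked, or null if it is still usable.
function revocationReason(
  grant: GrantLifecycle,
  liveDeployId: string,
  now: number,
): RevocationReason | null {
  if (grant.humanRevoked) return "human-revoked";
  if (now >= grant.expiresAt) return "ttl-expired";
  if (grant.deployId !== liveDeployId) return "deploy-replaced";
  if (grant.runEnded) return "run-ended";
  return null;
}
```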

This should work for both:

  • local agent debugging on a developer machine
  • remote CI jobs running after push-to-stage

CI should use the minted delegated grant for the deploy under test.

CI should not use:

  • SHIFT_API_KEY as the primary staged E2E auth mechanism
  • a broad shared static secret to impersonate users

Security Controls

Required controls:

  • grant TTL of minutes, not days
  • one-time browser exchange codes
  • app or session binding on every grant
  • audience restriction to stage testing
  • capability checks in middleware
  • provider-scope allowlist
  • explicit opt-in for live third-party access
  • full issuance and redemption audit trail
  • revocation on demand and on deploy replacement

Recommended controls:

  • require a fresh local CLI session to mint a live-delegated grant
  • require a per-app or per-deploy opt-in flag for e2e-live
  • use sandbox or test-workspace accounts whenever possible
  • attach screenshots, trace artifacts, and audit logs to each run

What To Avoid

  • Reusing a human's full platform bearer token inside CI.
  • Exporting raw Stage refresh tokens to the browser or to the agent.
  • Disabling auth middleware on staging.
  • A hidden query parameter like ?agent=true that bypasses auth gates.
  • Tests that mutate production-like third-party data without isolation.

Implementation Status

Phase 1 — Delegated Platform Auth ✅ Implemented

All Phase 1 deliverables are live:

  • Opaque delegated grant records: sat_... tokens with hash-only server-side storage
  • Browser bootstrap exchange: One-time codes redeemed for shift_agent_session and shift_stage_agent cookies
  • CLI commands: shift-cli token create|list|revoke and shift-cli test bootstrap
  • SDK test helpers: @the-shift/sdk/testing with createTestClient() and authenticatedPage()
  • Synthetic E2E: Runs on Stage deploys via shift-cli test bootstrap

Phase 2 — Provider Replay 🚧 In Progress

  • Session-level provider mode support is partially implemented
  • Fixture bundles and recorded responses are not yet available

Phase 3 — Live Delegated Provider Auth ✅ Implemented

  • Agent-test Stage cookie: shift_stage_agent with capability-scoped access
  • Persistent OAuth via Passport: Refresh tokens stored user-scoped with consent-skip for returning users
  • Stage Google proxy: Accepts both end-user cookies and agent-test cookies with bound session + scope checks
  • Audit trail: Authorization lifecycle events recorded in passport_audit

Phase 4 — Run Orchestration 🔮 Planned

  • Per-app test policy, required checks by environment, grant revocation automation, and flaky test quarantine are not yet implemented.

Browser-Based Eval Mode (Gate L4)

A browser-based evaluation mode has been added to the platform-test workflow. This mode uses browser automation for visual and functional QA of Stage apps:

  • Visual checks — Layout integrity, theme support, responsive behavior
  • Functional checks — Navigation, data operations, error handling, state persistence
  • Scenario format — Declarative criteria sets with pass/fail grading
  • React fiber injection — Test framework can inspect React component tree inside Stage sandboxes

Usage: shift-cli test eval --scenario <name> or shift-cli test eval --app-dir ./my-app

Suggested Rollout

Phase 1 ✅

Implement delegated platform auth for Stage browser automation.

Phase 2 🚧

Add provider replay and fixture support.

Phase 3 ✅

Add live delegated provider auth.

Phase 4 🔮

Add run orchestration and policy.

Repository Touchpoints

The current codebase already contains most of the primitives this design should reuse:

  • gateway CLI exchange flow: packages/gateway/src/auth/routes.ts
  • gateway bearer and cookie auth middleware: packages/gateway/src/auth/middleware.ts
  • CLI session storage: packages/core/src/auth.ts
  • Stage end-user OAuth: packages/gateway/src/auth/stage-oauth.ts
  • Stage Google proxy: packages/gateway/src/auth/stage-google-proxy.ts
  • Stage auth gate UI: stage/src/components/AuthGate.tsx
  • Stage user auth records: stage/convex/stage.ts

Review Of The Other Plan

Keep:

  • opaque random token handles with hash-only storage
  • list and revoke support
  • SDK testing helpers
  • Playwright auth helper
  • CLI token and test ergonomics

Change:

  • bind every token to app, session, deploy, and capability set
  • reduce TTL defaults substantially
  • never pass raw tokens in browser query params
  • never convert delegated tokens into the standard shift_session cookie
  • do not rely on SHIFT_API_KEY for staged E2E identity
  • cover Stage end-user OAuth explicitly through userAuth and the Google proxy

If we want the fastest high-value version, build this first:

  1. Gateway-issued opaque agent test grant limited to one app or session.
  2. One-time browser bootstrap for Playwright using exchange codes.
  3. shift-cli token create|list|revoke commands backed by delegated grant semantics.
  4. SDK testing helpers for API and browser tests.
  5. Stage deploy pipeline that runs e2e-synthetic automatically with the minted grant.
  6. Live delegated Google tests only for explicit smoke paths and only after manual opt-in.

That gets rid of the manual browser blocker without immediately taking on the highest-risk part of the design.