Agent Testing And Delegated Stage Auth

Problem

Coding agents can already create, publish, and update Stage apps, but they still hit hard blockers when a realistic flow requires:

browser login to the platform
end-user OAuth inside a staged app
access to APIs on behalf of the human who launched the app
repeatable Playwright or UI tests that should run during push-to-stage

The current platform has strong building blocks:

gateway login already supports CLI exchange codes and bearer tokens
Stage already stores end-user Google authorizations server-side
Stage apps already call a server-side Google proxy instead of handling raw refresh tokens in the browser

The missing piece is a first-class, scoped, auditable way to let an agent test a staged app as a delegated identity without turning off auth or copying long-lived user secrets into CI.

Goals

Let agents run unit, integration, E2E, and UI tests without waiting for manual browser login during each run.
Preserve least privilege.
Keep live third-party tests opt-in and auditable.
Make the default path deterministic and cheap.
Reuse the existing gateway, Stage, and CLI auth model instead of adding special-case bypasses.

Non-Goals

Global "agent can impersonate any user" access.
Passing raw refresh tokens, session cookies, or password flows to agents.
Using user-agent checks, IP allowlists, or hidden bypass query params as auth.
Making all E2E tests hit live third-party providers by default.

Design Principles

Prefer hermetic tests over live tests.
Use delegated credentials, not copied human credentials.
Bind every agent grant to a specific app or Stage session.
Give each grant a TTL, audience, and explicit capabilities.
Record who delegated access, to which agent run, for which deploy, and for what scopes.
Make browser automation consume a bootstrap artifact instead of a human login screen.

Proposed Testing Model

Every app should declare which test modes it supports.

Mode	Purpose	External auth	Expected frequency
`unit`	pure logic and component tests	none	every change
`integration`	app + platform service contracts	mocked or in-process	every change
`e2e-synthetic`	full browser flow on staged app	delegated platform auth, fake provider data	every push to stage
`e2e-live`	real provider flow against staged app	delegated user auth with explicit scopes	opt-in, narrow smoke suite

The default gate on a stage deploy should be unit + integration + e2e-synthetic.

e2e-live should be reserved for:

smoke checks on the exact user path that matters
provider regressions
debugging sessions where the human explicitly opts in

Core Architecture

1. Delegated Agent Test Grant

Add a new gateway-issued credential type for agent testing.

Implementation choice:

expose the grant to clients as an opaque random handle, for example sat_...
store only a hash server-side
keep the authoritative grant record server-side for listing, revocation, and audit
derive signed browser session cookies only after a one-time exchange

Suggested stored fields or claims:

sub: agent-run:<runId>
act: delegating human email
aud: stage-test
appSid or sessionId
deployId
capabilities
providerScopes
exp
jti
label
revokedAt
lastUsedAt

Capabilities should be explicit, for example:

stage.read
stage.write
stage.render
stage.browser
stage.google.proxy
app.api

This is not the human's normal CLI token. It is a derived, narrower credential.

Important constraints:

default TTL should be minutes, not hours or days
a grant must be bound to a specific app, session, or deploy
a grant must never be accepted as a general replacement for a human session across the platform
direct browser use should go through one-time exchange, not by putting the raw grant into a URL
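
A minimal sketch of the grant record and the checks above, assuming an in-memory Map in place of the real store. The names mintGrant, checkGrant, and revokeGrant, the 15-minute default TTL, and the exact field layout are illustrative assumptions, not the gateway's actual API.

```typescript
import { createHash, randomBytes } from "node:crypto";

// Hypothetical grant record; fields follow the suggested claims above.
interface AgentTestGrant {
  sub: string;            // agent-run:<runId>
  act: string;            // delegating human email
  aud: "stage-test";
  sessionId: string;      // binding to one Stage session
  capabilities: string[];
  exp: number;            // epoch milliseconds
  revokedAt?: number;
}

// Stand-in for the server-side store, keyed by the hash of the handle.
const grants = new Map<string, AgentTestGrant>();

const sha256 = (s: string): string => createHash("sha256").update(s).digest("hex");

function mintGrant(runId: string, act: string, sessionId: string,
                   capabilities: string[], ttlMs = 15 * 60_000, now = Date.now()): string {
  const token = "sat_" + randomBytes(32).toString("hex");
  grants.set(sha256(token), {
    sub: `agent-run:${runId}`, act, aud: "stage-test",
    sessionId, capabilities, exp: now + ttlMs,
  });
  return token; // returned once; only the hash is kept
}

function revokeGrant(token: string, now = Date.now()): void {
  const grant = grants.get(sha256(token));
  if (grant) grant.revokedAt = now;
}

// True only if the grant exists, is live, is bound to this session,
// and carries the requested capability.
function checkGrant(token: string, sessionId: string, capability: string, now = Date.now()): boolean {
  const grant = grants.get(sha256(token));
  if (!grant) return false;
  if (grant.revokedAt !== undefined || now >= grant.exp) return false;
  if (grant.sessionId !== sessionId) return false;
  return grant.capabilities.includes(capability);
}
```

Keying the store by hash means a leaked database row cannot be replayed as a handle, while listing and revocation still work from the record.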

2. One-Time Browser Bootstrap

Add a one-time exchange flow for browser automation.

Flow:

Human authenticates once with shift-cli login.
Human or CI asks shift-cli to mint an agent test grant for a specific app or session.
Gateway returns:
- a short-lived API bearer token for HTTP calls
- a one-time browser bootstrap URL or exchange code
- metadata about allowed scopes and expiry
Playwright opens the bootstrap URL once before tests.
Gateway validates the exchange code and sets agent-scoped cookies.
Tests run with a real browser session and no manual login screen.

This should mirror the existing /auth/exchange pattern instead of inventing a new login system.

Do not use:

GET /auth/...?...token=sat_xxx
raw bearer tokens in browser query params
the human's existing shift_session cookie
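
The single-use property of the exchange code can be sketched as follows. This is an in-memory illustration only; issueExchangeCode, redeemExchangeCode, and the 60-second default lifetime are assumptions, and the real gateway would persist codes and set cookies on redemption.

```typescript
import { randomBytes } from "node:crypto";

interface PendingExchange {
  grantId: string;
  expiresAt: number; // epoch milliseconds
}

// Stand-in for the gateway's exchange-code store.
const pendingExchanges = new Map<string, PendingExchange>();

function issueExchangeCode(grantId: string, ttlMs = 60_000, now = Date.now()): string {
  const code = randomBytes(24).toString("hex");
  pendingExchanges.set(code, { grantId, expiresAt: now + ttlMs });
  return code;
}

// Returns the grant id on the first valid redemption, otherwise null.
function redeemExchangeCode(code: string, now = Date.now()): string | null {
  const entry = pendingExchanges.get(code);
  if (!entry) return null;
  pendingExchanges.delete(code); // consume the code even if it turns out to be expired
  return now < entry.expiresAt ? entry.grantId : null;
}
```

Because the code is deleted on first lookup, a bootstrap URL captured from logs or browser history cannot be replayed.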

3. Separate Agent Browser Session

Do not reuse shift_session or shift_stage_user directly.

Add dedicated agent-test session cookies, for example:

shift_agent_session for platform routes
shift_stage_agent for Stage end-user routes

That separation gives:

clear auditability
different TTLs
route-level restrictions
easier revocation
no confusion between a real user session and a delegated test session

4. Delegated Stage End-User Auth

For Stage apps that need Google access, reuse the existing server-side authorization storage pattern.

Instead of copying the human's Stage cookie into the agent browser:

Human authorizes the Stage app once.
Stage stores provider refresh tokens server-side as it already does.
The agent test grant references the existing userAuth record for that session and user.
The Stage Google proxy accepts either:
- an authenticated end-user cookie
- an agent-test cookie whose grant is bound to the same session and allowed scopes

This lets the proxy keep issuing access tokens server-side while the agent only holds a narrow delegated session.
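
The proxy's accept-either rule could look roughly like the sketch below. The types ProxyCaller and UserAuthRecord, the function canProxy, and the scope strings used in any example call are hypothetical; only the stage.google.proxy capability name comes from this document.

```typescript
// Who is calling the Stage Google proxy, as decoded from the cookie.
type ProxyCaller =
  | { kind: "end-user"; sessionId: string; userId: string }
  | { kind: "agent-test"; sessionId: string; capabilities: string[]; providerScopes: string[] };

// Hypothetical view of the stored userAuth record.
interface UserAuthRecord {
  sessionId: string;
  userId: string;
  grantedScopes: string[]; // scopes the human authorized for this app
}

function canProxy(caller: ProxyCaller, auth: UserAuthRecord, requestedScopes: string[]): boolean {
  // Both caller kinds must be bound to the session that owns the authorization.
  if (caller.sessionId !== auth.sessionId) return false;
  // Nobody may exceed what the human originally granted.
  if (!requestedScopes.every((s) => auth.grantedScopes.includes(s))) return false;
  if (caller.kind === "end-user") return caller.userId === auth.userId;
  // Agent-test sessions additionally need the proxy capability and an
  // allowlist entry for every requested provider scope.
  return (
    caller.capabilities.includes("stage.google.proxy") &&
    requestedScopes.every((s) => caller.providerScopes.includes(s))
  );
}
```

The agent never sees a refresh token in this model; it only holds a cookie that the proxy maps back to the server-side record.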

5. Provider Modes

Provider access should be explicit per test run.

Provider mode	Backing source	Default
`mock`	app-defined fixtures	yes
`replay`	recorded provider responses	yes
`synthetic`	test tenant or sandbox workspace	recommended
`live-delegated`	real user authorization through server-side proxy	no

Stage apps should be able to ask for a provider mode through session metadata so the same app can run:

hermetic tests in CI
realistic smoke tests on stage
live debugging when a human explicitly delegates access
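
One way to resolve the provider mode from session metadata is sketched below. The metadata keys providerMode and liveOptIn and the fallback to mock for unknown values are assumptions made for illustration.

```typescript
type ProviderMode = "mock" | "replay" | "synthetic" | "live-delegated";

// Hypothetical shape of the relevant session metadata.
interface SessionMetadata {
  providerMode?: string;
  liveOptIn?: boolean; // explicit human opt-in for live third-party access
}

const KNOWN_MODES: readonly ProviderMode[] = ["mock", "replay", "synthetic", "live-delegated"];

function resolveProviderMode(meta: SessionMetadata): ProviderMode {
  const requested = KNOWN_MODES.find((m) => m === meta.providerMode);
  // Missing or unknown values fall back to the hermetic default.
  if (requested === undefined) return "mock";
  // Live access is never implied; it must be requested and opted into.
  if (requested === "live-delegated" && meta.liveOptIn !== true) {
    throw new Error("live-delegated requires explicit opt-in");
  }
  return requested;
}
```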

Proposed Platform Changes

Gateway

Add:

delegated grant minting endpoint
one-time browser bootstrap exchange endpoint
middleware support for agent-test JWTs or cookies
route-level capability enforcement
revocation and expiry checks
audit emission to Pulse and Ledger for every grant issue and redemption

Suggested routes:

POST /auth/agent/grants
GET /auth/agent/grants
DELETE /auth/agent/grants/:idOrLabel
POST /auth/agent/bootstrap
GET /auth/agent/bootstrap

Practical note:

the other plan's idea of opaque random sat_ tokens is good
its proposed agent-bridge?token=... shape is not
the merged design should keep opaque handles but exchange them for one-time codes before the browser sees anything

`shift-cli`

Currently implemented commands:

shift-cli token create --session <id> --ttl 30m --json
shift-cli token list --json
shift-cli token revoke <grant-id> --json
shift-cli test bootstrap --app <sid> --output .shift/e2e-auth.json --json

The bootstrap output should be machine-friendly and include:

baseUrl
grantId
grantLabel
expiresAt
bootstrapUrl or exchangeCode
apiToken
providerMode
sessionId
appSid
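
A test runner would likely want to validate this file before launching a browser. The sketch below assumes the field list above, that expiresAt is an ISO 8601 timestamp, and that every field other than bootstrapUrl and exchangeCode is always present; none of that is confirmed by the CLI itself, and parseBootstrap is a hypothetical helper.

```typescript
interface BootstrapArtifact {
  baseUrl: string;
  grantId: string;
  grantLabel: string;
  expiresAt: string;      // assumed ISO 8601
  bootstrapUrl?: string;  // one of bootstrapUrl or exchangeCode is expected
  exchangeCode?: string;
  apiToken: string;
  providerMode: string;
  sessionId: string;
  appSid: string;
}

const REQUIRED_FIELDS = [
  "baseUrl", "grantId", "grantLabel", "expiresAt",
  "apiToken", "providerMode", "sessionId", "appSid",
] as const;

function parseBootstrap(json: string, now: Date = new Date()): BootstrapArtifact {
  const data = JSON.parse(json) as Partial<BootstrapArtifact>;
  for (const key of REQUIRED_FIELDS) {
    const value = data[key];
    if (typeof value !== "string" || value === "") {
      throw new Error(`missing field: ${key}`);
    }
  }
  if (!data.bootstrapUrl && !data.exchangeCode) {
    throw new Error("need bootstrapUrl or exchangeCode");
  }
  // Fail fast on stale artifacts instead of hitting a login screen mid-test.
  const expiry = Date.parse(data.expiresAt as string);
  if (Number.isNaN(expiry) || expiry <= now.getTime()) {
    throw new Error("bootstrap artifact expired");
  }
  return data as BootstrapArtifact;
}
```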

Stage Convex Schema

Add tables for test grants and run state, for example:

agentTestRuns
agentTestGrants
providerReplays
providerFixtures

Minimum fields:

delegating user
agent or run identifier
app or session binding
provider mode
allowed scopes
TTL
revocation state
created and redeemed timestamps

If it is simpler to land incrementally, the first table can live alongside the existing auth state in root Convex and later be specialized for Stage test runs.

Stage Runtime

Add a test-context layer exposed to the runtime:

current provider mode
current test run id
deterministic seed
optional fixture bundle id

This lets app code and platform adapters select:

mock provider
replay provider
live proxy provider

without branching on ad hoc environment variables.

SDK And Test Helpers

Add a first-class testing surface in packages/sdk/src/testing/.

Recommended modules:

client.ts for authenticated API clients
session.ts for Stage session lifecycle helpers
browser.ts for Playwright bootstrap helpers
lifecycle.ts for deploy-and-test orchestration

Recommended helpers:

createTestClient()
createTestSession()
authenticatedPage()
deployAndTest()

This part of the other plan is directionally right and should be kept.
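
As a rough illustration of what a browser helper such as authenticatedPage() might do internally, the sketch below opens the bootstrap URL once and refuses to leave the gateway origin. PageLike is a structural subset of Playwright's Page (which has a goto(url) method), so a real Page can be passed in; openBootstrap, resolveBootstrapTarget, and the example URL shape are hypothetical.

```typescript
// Structural subset of Playwright's Page; a real Page satisfies it.
interface PageLike {
  goto(url: string): Promise<unknown>;
}

// Resolves the bootstrap URL against the gateway base URL and rejects
// anything that would send the one-time code to another origin.
function resolveBootstrapTarget(baseUrl: string, bootstrapUrl: string): string {
  const base = new URL(baseUrl);
  const target = new URL(bootstrapUrl, base);
  if (target.origin !== base.origin) {
    throw new Error("bootstrap URL must stay on the gateway origin");
  }
  return target.toString();
}

// Visits the bootstrap URL once; the gateway is expected to redeem the
// exchange code and set the agent-scoped cookies on this browser context.
async function openBootstrap<P extends PageLike>(
  page: P, baseUrl: string, bootstrapUrl: string,
): Promise<P> {
  await page.goto(resolveBootstrapTarget(baseUrl, bootstrapUrl));
  return page;
}
```

With Playwright, a call might look like `await openBootstrap(await context.newPage(), artifact.baseUrl, artifact.bootstrapUrl)`, after which the same browser context is reused for the tests.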

CI And Push-To-Stage Flow

Recommended flow for a staged deploy:

shift-cli stage push publishes or updates the app.
The push step optionally requests an agent grant for the resulting app/session.
The CLI emits bootstrap JSON for the E2E runner.
The E2E runner:
- restores browser state via bootstrap
- runs synthetic smoke tests
- optionally runs live-delegated smoke tests if explicitly enabled
The gateway revokes the grant when:
- TTL expires
- the deploy is replaced
- the test run ends
- the human explicitly revokes it
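
The four revocation triggers above can be expressed as one pure check that a gateway sweep or middleware could call. The types, the reason strings, and the ordering of the checks are assumptions for illustration.

```typescript
type RevocationReason = "revoked-by-human" | "ttl-expired" | "deploy-replaced" | "run-ended";

// Hypothetical subset of the grant record needed for the decision.
interface GrantLifecycle {
  exp: number;        // epoch milliseconds
  deployId: string;   // deploy the grant was minted for
  revokedAt?: number; // set when a human revokes explicitly
}

interface DeployState {
  now: number;
  activeDeployId: string; // deploy currently serving the app
  runFinished: boolean;   // whether the test run has ended
}

// Returns why the grant should no longer be honored, or null if it is still valid.
function revocationReason(grant: GrantLifecycle, state: DeployState): RevocationReason | null {
  if (grant.revokedAt !== undefined) return "revoked-by-human";
  if (state.now >= grant.exp) return "ttl-expired";
  if (state.activeDeployId !== grant.deployId) return "deploy-replaced";
  if (state.runFinished) return "run-ended";
  return null;
}
```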

This should work for both:

local agent debugging on a developer machine
remote CI jobs running after push-to-stage

CI should use the minted delegated grant for the deploy under test.

CI should not use:

SHIFT_API_KEY as the primary staged E2E auth mechanism
a broad shared static secret to impersonate users

Security Controls

Required controls:

grant TTL of minutes, not days
one-time browser exchange codes
app or session binding on every grant
audience restriction to stage testing
capability checks in middleware
provider-scope allowlist
explicit opt-in for live third-party access
full issuance and redemption audit trail
revocation on demand and on deploy replacement

Recommended controls:

require a fresh local CLI session to mint a live-delegated grant
require a per-app or per-deploy opt-in flag for e2e-live
use sandbox or test-workspace accounts whenever possible
attach screenshots, trace artifacts, and audit logs to each run

What To Avoid

Reusing a human's full platform bearer token inside CI.
Exporting raw Stage refresh tokens to the browser or to the agent.
Disabling auth middleware on staging.
A hidden query parameter like ?agent=true that bypasses auth gates.
Tests that mutate production-like third-party data without isolation.

Implementation Status

Phase 1 — Delegated Platform Auth ✅ Implemented

All Phase 1 deliverables are live:

Opaque delegated grant records — sat_... tokens with hash-only server-side storage
Browser bootstrap exchange — One-time codes redeemed for shift_agent_session and shift_stage_agent cookies
CLI commands — shift-cli token create|list|revoke and shift-cli test bootstrap
SDK test helpers — @the-shift/sdk/testing with createTestClient() and authenticatedPage()
Synthetic E2E — Runs on Stage deploys via shift-cli test bootstrap

Phase 2 — Provider Replay 🚧 In Progress

Session-level provider mode support is partially implemented
Fixture bundles and recorded responses are not yet available

Phase 3 — Live Delegated Provider Auth ✅ Implemented

Agent-test Stage cookie — shift_stage_agent with capability-scoped access
Persistent OAuth via Passport — Refresh tokens stored user-scoped with consent-skip for returning users
Stage Google proxy — Accepts both end-user cookies and agent-test cookies with bound session + scope checks
Audit trail — Authorization lifecycle events recorded in passport_audit

Phase 4 — Run Orchestration 🔮 Planned

Per-app test policy, required checks by environment, grant revocation automation, and flaky test quarantine are not yet implemented.

Browser-Based Eval Mode (Gate L4)

A browser-based evaluation mode has been added to the platform-test workflow. This mode uses browser automation for visual and functional QA of Stage apps:

Visual checks — Layout integrity, theme support, responsive behavior
Functional checks — Navigation, data operations, error handling, state persistence
Scenario format — Declarative criteria sets with pass/fail grading
React fiber injection — Test framework can inspect React component tree inside Stage sandboxes

Usage: shift-cli test eval --scenario <name> or shift-cli test eval --app-dir ./my-app

Suggested Rollout

Phase 1 ✅

Implement delegated platform auth for Stage browser automation.

Phase 2 🚧

Add provider replay and fixture support.

Phase 3 ✅

Add live delegated provider auth.

Phase 4 🔮

Add run orchestration and policy.

Repository Touchpoints

The current codebase already contains most of the primitives this design should reuse:

gateway CLI exchange flow: packages/gateway/src/auth/routes.ts
gateway bearer and cookie auth middleware: packages/gateway/src/auth/middleware.ts
CLI session storage: packages/core/src/auth.ts
Stage end-user OAuth: packages/gateway/src/auth/stage-oauth.ts
Stage Google proxy: packages/gateway/src/auth/stage-google-proxy.ts
Stage auth gate UI: stage/src/components/AuthGate.tsx
Stage user auth records: stage/convex/stage.ts

Review Of The Other Plan

Keep:

opaque random token handles with hash-only storage
list and revoke support
SDK testing helpers
Playwright auth helper
CLI token and test ergonomics

Change:

bind every token to app, session, deploy, and capability set
reduce TTL defaults substantially
never pass raw tokens in browser query params
never convert delegated tokens into the standard shift_session cookie
do not rely on SHIFT_API_KEY for staged E2E identity
cover Stage end-user OAuth explicitly through userAuth and the Google proxy

Recommended First Implementation Slice

If we want the fastest high-value version, build this first:

Gateway-issued opaque agent test grant limited to one app or session.
One-time browser bootstrap for Playwright using exchange codes.
shift-cli token create|list|revoke commands backed by delegated grant semantics.
SDK testing helpers for API and browser tests.
Stage deploy pipeline that runs e2e-synthetic automatically with the minted grant.
Live delegated Google tests only for explicit smoke paths and only after manual opt-in.

That removes the manual browser-login blocker without immediately taking on the highest-risk part of the design.
