Agent Testing And Delegated Stage Auth
Problem
Coding agents can already create, publish, and update Stage apps, but they still hit hard blockers when a realistic flow requires:
- browser login to the platform
- end-user OAuth inside a staged app
- access to APIs on behalf of the human who launched the app
- repeatable Playwright or UI tests that should run during push-to-stage
The current platform has strong building blocks:
- gateway login already supports CLI exchange codes and bearer tokens
- Stage already stores end-user Google authorizations server-side
- Stage apps already call a server-side Google proxy instead of handling raw refresh tokens in the browser
The missing piece is a first-class, scoped, auditable way to let an agent test a staged app as a delegated identity without turning off auth or copying long-lived user secrets into CI.
Goals
- Let agents run unit, integration, E2E, and UI tests without waiting for manual browser login during each run.
- Preserve least privilege.
- Keep live third-party tests opt-in and auditable.
- Make the default path deterministic and cheap.
- Reuse the existing gateway, Stage, and CLI auth model instead of adding special-case bypasses.
Non-Goals
- Global "agent can impersonate any user" access.
- Passing raw refresh tokens, session cookies, or password flows to agents.
- Using user-agent checks, IP allowlists, or hidden bypass query params as auth.
- Making all E2E tests hit live third-party providers by default.
Design Principles
- Prefer hermetic tests over live tests.
- Use delegated credentials, not copied human credentials.
- Bind every agent grant to a specific app or Stage session.
- Give each grant a TTL, audience, and explicit capabilities.
- Record who delegated access, to which agent run, for which deploy, and for what scopes.
- Make browser automation consume a bootstrap artifact instead of a human login screen.
Proposed Testing Model
Every app should declare which test modes it supports.
| Mode | Purpose | External auth | Expected frequency |
|---|---|---|---|
| unit | Pure logic and component tests | none | every change |
| integration | app + platform service contracts | mocked or in-process | every change |
| e2e-synthetic | full browser flow on staged app | delegated platform auth, fake provider data | every push to stage |
| e2e-live | real provider flow against staged app | delegated user auth with explicit scopes | opt-in, narrow smoke suite |
The default gate on a stage deploy should be unit + integration + e2e-synthetic.
e2e-live should be reserved for:
- smoke checks on the exact user path that matters
- provider regressions
- debugging sessions where the human explicitly opts in
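The default gate and the opt-in rule for e2e-live can be sketched as a small policy function. This is a minimal illustration only: AppTestPolicy, its fields, and modesForStageDeploy are hypothetical names, not an existing platform API.

```typescript
// Hypothetical shape: how an app might declare its supported test modes.
type TestMode = "unit" | "integration" | "e2e-synthetic" | "e2e-live";

interface AppTestPolicy {
  supportedModes: TestMode[];
  liveOptIn: boolean; // true only when a human explicitly opted in to e2e-live
}

// Default gate on a stage deploy, as described above.
const DEFAULT_GATE: TestMode[] = ["unit", "integration", "e2e-synthetic"];

// Returns the modes to run for a stage deploy, in gate order.
function modesForStageDeploy(policy: AppTestPolicy): TestMode[] {
  const modes = DEFAULT_GATE.filter((m) => policy.supportedModes.includes(m));
  if (policy.liveOptIn && policy.supportedModes.includes("e2e-live")) {
    modes.push("e2e-live");
  }
  return modes;
}
```

With this shape, an app that supports e2e-live still never runs it unless liveOptIn is set for that run.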
Core Architecture
1. Delegated Agent Test Grant
Add a new gateway-issued credential type for agent testing.
Implementation choice:
- expose the grant to clients as an opaque random handle, for example sat_...
- store only a hash server-side
- keep the authoritative grant record server-side for listing, revocation, and audit
- derive signed browser session cookies only after a one-time exchange
Suggested stored fields or claims:
- sub: agent-run:<runId>
- act: delegating human email
- aud: stage-test
- appSid or sessionId
- deployId
- capabilities
- providerScopes
- exp
- jti
- label
- revokedAt
- lastUsedAt
Capabilities should be explicit, for example:
- stage.read
- stage.write
- stage.render
- stage.browser
- stage.google.proxy
- app.api
This is not the human's normal CLI token. It is a derived, narrower credential.
Important constraints:
- default TTL should be minutes, not hours or days
- a grant must be bound to a specific app, session, or deploy
- a grant must never be accepted as a general replacement for a human session across the platform
- direct browser use should go through one-time exchange, not by putting the raw grant into a URL
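A minimal sketch of minting and checking such a grant, assuming SHA-256 for the stored hash, epoch milliseconds for exp, and a session-only binding. The field names come from the suggested list above; mintGrant, grantAllows, and GrantRequest are hypothetical names, not gateway code.

```typescript
import { createHash, randomBytes } from "node:crypto";

type Capability =
  | "stage.read" | "stage.write" | "stage.render"
  | "stage.browser" | "stage.google.proxy" | "app.api";

// Server-side record. Types and units are assumptions for this sketch.
interface AgentTestGrant {
  jti: string;
  tokenHash: string; // only the hash of the sat_ handle is stored
  sub: string;       // agent-run:<runId>
  act: string;       // delegating human email
  aud: "stage-test";
  sessionId: string;
  deployId: string;
  capabilities: Capability[];
  providerScopes: string[];
  exp: number;       // epoch milliseconds (assumed)
  revokedAt?: number;
}

type GrantRequest = Pick<
  AgentTestGrant,
  "sub" | "act" | "sessionId" | "deployId" | "capabilities" | "providerScopes"
>;

const hashToken = (token: string): string =>
  createHash("sha256").update(token).digest("hex");

// Mint an opaque handle. The raw token is returned once and never stored.
function mintGrant(
  req: GrantRequest,
  ttlMs: number,
  now: number,
): { token: string; grant: AgentTestGrant } {
  const token = `sat_${randomBytes(32).toString("hex")}`;
  const grant: AgentTestGrant = {
    ...req,
    jti: randomBytes(16).toString("hex"),
    tokenHash: hashToken(token),
    aud: "stage-test",
    exp: now + ttlMs,
  };
  return { token, grant };
}

// Accept a token only for its bound session and an explicitly granted capability.
function grantAllows(
  grant: AgentTestGrant,
  token: string,
  sessionId: string,
  needed: Capability,
  now: number,
): boolean {
  return (
    grant.tokenHash === hashToken(token) &&
    grant.revokedAt === undefined &&
    grant.exp > now &&
    grant.sessionId === sessionId &&
    grant.capabilities.includes(needed)
  );
}
```

Because every check is conjunctive, a grant that is expired, revoked, presented for another session, or missing the capability is rejected the same way.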
2. One-Time Browser Bootstrap
Add a one-time exchange flow for browser automation.
Flow:
- Human authenticates once with shift-cli login.
- Human or CI asks shift-cli to mint an agent test grant for a specific app or session.
- Gateway returns:
  - a short-lived API bearer token for HTTP calls
  - a one-time browser bootstrap URL or exchange code
  - metadata about allowed scopes and expiry
- Playwright opens the bootstrap URL once before tests.
- Gateway validates the exchange code and sets agent-scoped cookies.
- Tests run with a real browser session and no manual login screen.
This should mirror the existing /auth/exchange pattern instead of inventing a new login system.
Do not use:
- GET /auth/...?...token=sat_xxx
- raw bearer tokens in browser query params
- the human's existing shift_session cookie
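The one-time property of the exchange code is the part most worth getting right. The sketch below uses an in-memory map purely for illustration; ExchangeCodeStore and its methods are hypothetical, and a real gateway would presumably persist codes alongside the grant record.

```typescript
import { randomBytes } from "node:crypto";

interface PendingExchange {
  grantId: string;
  expiresAt: number; // epoch milliseconds (assumed)
}

// In-memory stand-in for wherever the gateway keeps pending exchange codes.
class ExchangeCodeStore {
  private pending = new Map<string, PendingExchange>();

  // Issue a random single-use code that refers to a grant by id,
  // so the raw sat_ handle never appears in a URL.
  issue(grantId: string, ttlMs: number, now: number): string {
    const code = randomBytes(24).toString("hex");
    this.pending.set(code, { grantId, expiresAt: now + ttlMs });
    return code;
  }

  // Redeem a code. The entry is deleted before the expiry check,
  // so a code can never be presented twice, even if it had expired.
  redeem(code: string, now: number): string | null {
    const entry = this.pending.get(code);
    if (!entry) return null;
    this.pending.delete(code);
    return entry.expiresAt > now ? entry.grantId : null;
  }
}
```

On a successful redeem, the gateway would set the agent-scoped cookies for the returned grant and redirect the browser to the app.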
3. Separate Agent Browser Session
Do not reuse shift_session or shift_stage_user directly.
Add dedicated agent-test session forms, for example:
- shift_agent_session for platform routes
- shift_stage_agent for Stage end-user routes
That separation gives:
- clear auditability
- different TTLs
- route-level restrictions
- easier revocation
- no confusion between a real user session and a delegated test session
4. Delegated Stage End-User Auth
For Stage apps that need Google access, reuse the existing server-side authorization storage pattern.
Instead of copying the human's Stage cookie into the agent browser:
- Human authorizes the Stage app once.
- Stage stores provider refresh tokens server-side as it already does.
- The agent test grant references the existing userAuth record for that session and user.
- The Stage Google proxy accepts either:
  - an authenticated end-user cookie
  - an agent-test cookie whose grant is bound to the same session and allowed scopes
This lets the proxy keep issuing access tokens server-side while the agent only holds a narrow delegated session.
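The proxy's accept-either rule can be sketched as one decision function. ProxyCaller, ProxyRequest, and mayUseGoogleProxy are hypothetical names, and the assumption that a cookie has already been verified and decoded into one of these shapes is mine, not the source's.

```typescript
// A caller the proxy has already authenticated from its cookie (assumed upstream step).
type ProxyCaller =
  | { kind: "end-user"; sessionId: string; userId: string }
  | {
      kind: "agent-test";
      sessionId: string;        // session the grant is bound to
      capabilities: string[];
      providerScopes: string[]; // provider scopes the human delegated
      exp: number;              // epoch milliseconds (assumed)
    };

interface ProxyRequest {
  sessionId: string;     // Stage session the request targets
  requiredScope: string; // provider scope the proxied call needs
}

// End users pass on session match alone; agent-test callers also need an
// unexpired grant, the stage.google.proxy capability, and an allowed scope.
function mayUseGoogleProxy(caller: ProxyCaller, req: ProxyRequest, now: number): boolean {
  if (caller.sessionId !== req.sessionId) return false;
  if (caller.kind === "end-user") return true;
  return (
    caller.exp > now &&
    caller.capabilities.includes("stage.google.proxy") &&
    caller.providerScopes.includes(req.requiredScope)
  );
}
```

In both branches the refresh token stays server-side; the function only decides whether the proxy may mint an access token for this call.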
5. Provider Modes
Provider access should be explicit per test run.
| Provider mode | Backing source | Default |
|---|---|---|
| mock | app-defined fixtures | yes |
| replay | recorded provider responses | yes |
| synthetic | test tenant or sandbox workspace | recommended |
| live-delegated | real user authorization through server-side proxy | no |
Stage apps should be able to ask for a provider mode through session metadata so the same app can run:
- hermetic tests in CI
- realistic smoke tests on stage
- live debugging when a human explicitly delegates access
Proposed Platform Changes
Gateway
Add:
- delegated grant minting endpoint
- one-time browser bootstrap exchange endpoint
- middleware support for agent-test JWTs or cookies
- route-level capability enforcement
- revocation and expiry checks
- audit emission to Pulse and Ledger for every grant issue and redemption
Suggested routes:
- POST /auth/agent/grants
- GET /auth/agent/grants
- DELETE /auth/agent/grants/:idOrLabel
- POST /auth/agent/bootstrap
- GET /auth/agent/bootstrap
Practical note:
- the other plan's idea of opaque random sat_ tokens is good
- its proposed agent-bridge?token=... shape is not
- the merged design should keep opaque handles but exchange them for one-time codes before the browser sees anything
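The route-level capability enforcement listed for the gateway could be written as a small handler wrapper. This sketch is framework-agnostic: AgentContext, Handler, and requireCapability are hypothetical names and do not reflect the existing middleware.

```typescript
// Context the auth middleware would attach after validating an agent-test session (assumed).
interface AgentContext {
  grantId: string;
  capabilities: ReadonlySet<string>;
}

interface HandlerResult {
  status: number;
  body: string;
}

type Handler = (ctx: AgentContext) => HandlerResult;

// Wrap a route handler so it runs only when the grant carries the capability.
function requireCapability(capability: string, handler: Handler): Handler {
  return (ctx) =>
    ctx.capabilities.has(capability)
      ? handler(ctx)
      : { status: 403, body: `missing capability: ${capability}` };
}

// Usage: a render route that needs stage.render.
const renderRoute = requireCapability("stage.render", () => ({
  status: 200,
  body: "rendered",
}));
```

Declaring the capability next to the route keeps the check visible in review instead of buried in a shared allowlist.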
shift-cli
Currently implemented commands:
- shift-cli token create --session <id> --ttl 30m --json
- shift-cli token list --json
- shift-cli token revoke <grant-id> --json
- shift-cli test bootstrap --app <sid> --output .shift/e2e-auth.json --json
The bootstrap output should be machine-friendly and include:
- baseUrl
- grantId
- grantLabel
- expiresAt
- bootstrapUrl or exchangeCode
- apiToken
- providerMode
- sessionId
- appSid
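A test runner would typically validate this file before starting a browser. The sketch below assumes every field is a string and that expiresAt is an ISO 8601 timestamp. Neither is specified above, and parseBootstrap is a hypothetical helper.

```typescript
// Assumed shape of the bootstrap JSON, based on the field list above.
interface BootstrapFile {
  baseUrl: string;
  grantId: string;
  grantLabel: string;
  expiresAt: string; // ISO 8601 (assumption)
  bootstrapUrl?: string;
  exchangeCode?: string;
  apiToken: string;
  providerMode: string;
  sessionId: string;
  appSid: string;
}

const REQUIRED_KEYS = [
  "baseUrl", "grantId", "grantLabel", "expiresAt",
  "apiToken", "providerMode", "sessionId", "appSid",
] as const;

// Parse and validate bootstrap JSON. Throws on missing fields or an expired grant.
function parseBootstrap(json: string, now: Date): BootstrapFile {
  const data = JSON.parse(json) as Record<string, unknown>;
  for (const key of REQUIRED_KEYS) {
    if (typeof data[key] !== "string") {
      throw new Error(`bootstrap file is missing "${key}"`);
    }
  }
  if (typeof data.bootstrapUrl !== "string" && typeof data.exchangeCode !== "string") {
    throw new Error("bootstrap file needs bootstrapUrl or exchangeCode");
  }
  const expires = Date.parse(data.expiresAt as string);
  if (Number.isNaN(expires) || expires <= now.getTime()) {
    throw new Error("bootstrap grant is expired or has an invalid expiresAt");
  }
  return data as unknown as BootstrapFile;
}
```

Failing fast here gives a clearer error than a Playwright timeout on a login screen when the grant has already lapsed.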
Stage Convex Schema
Add tables for test grants and run state, for example:
- agentTestRuns
- agentTestGrants
- providerReplays
- providerFixtures
Minimum fields:
- delegating user
- agent or run identifier
- app or session binding
- provider mode
- allowed scopes
- TTL
- revocation state
- created and redeemed timestamps
If it is simpler to land incrementally, the first table can live alongside the existing auth state in root Convex and later be specialized for Stage test runs.
Stage Runtime
Add a test-context layer exposed to the runtime:
- current provider mode
- current test run id
- deterministic seed
- optional fixture bundle id
This lets app code and platform adapters select:
- mock provider
- replay provider
- live proxy provider
without branching on ad hoc environment variables.
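Selection from the test context might look like the following. TestContext mirrors the four fields listed above, but its exact shape, the ProviderAdapters type, and selectProvider are assumptions. Routing synthetic through the live proxy adapter (pointed at a sandbox tenant) is also only a guess at how that mode would be wired.

```typescript
type ProviderMode = "mock" | "replay" | "synthetic" | "live-delegated";

// Test context exposed to the runtime (hypothetical shape).
interface TestContext {
  providerMode: ProviderMode;
  testRunId: string;
  seed: number;             // deterministic seed
  fixtureBundleId?: string; // optional fixture bundle id
}

// The three adapters named above, generic over the provider interface T.
interface ProviderAdapters<T> {
  mock: T;
  replay: T;
  liveProxy: T;
}

// Choose an adapter from the test context instead of from environment variables.
function selectProvider<T>(ctx: TestContext, adapters: ProviderAdapters<T>): T {
  switch (ctx.providerMode) {
    case "mock":
      return adapters.mock;
    case "replay":
      return adapters.replay;
    case "synthetic":      // assumed: sandbox tenant reached through the proxy
    case "live-delegated":
      return adapters.liveProxy;
  }
}
```

The exhaustive switch means adding a fifth provider mode becomes a compile error until every call site handles it.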
SDK And Test Helpers
Add a first-class testing surface in packages/sdk/src/testing/.
Recommended modules:
- client.ts for authenticated API clients
- session.ts for Stage session lifecycle helpers
- browser.ts for Playwright bootstrap helpers
- lifecycle.ts for deploy-and-test orchestration
Recommended helpers:
- createTestClient()
- createTestSession()
- authenticatedPage()
- deployAndTest()
This part of the other plan is directionally right and should be kept.
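As an illustration of what a browser helper in browser.ts might do, the sketch below performs the one-time bootstrap navigation and then opens the app. bootstrapPage and PageLike are hypothetical; the real authenticatedPage() signature is not described here. PageLike is a minimal structural type so that a Playwright Page, which has a compatible goto method, could be passed in without this sketch importing Playwright.

```typescript
// Minimal structural stand-in for a browser page.
interface PageLike {
  goto(url: string): Promise<unknown>;
}

interface BrowserBootstrap {
  bootstrapUrl: string; // one-time URL issued by the gateway
  baseUrl: string;      // staged app URL to test
}

// Visit the bootstrap URL once so the gateway can set agent-scoped cookies,
// then navigate to the app with the resulting session.
async function bootstrapPage<P extends PageLike>(
  page: P,
  bootstrap: BrowserBootstrap,
): Promise<P> {
  await page.goto(bootstrap.bootstrapUrl);
  await page.goto(bootstrap.baseUrl);
  return page;
}
```

A test suite would call this once in a setup hook and share the resulting browser state, since the bootstrap URL cannot be reused.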
CI And Push-To-Stage Flow
Recommended flow for a staged deploy:
- shift-cli stage push publishes or updates the app.
- The push step optionally requests an agent grant for the resulting app/session.
- The CLI emits bootstrap JSON for the E2E runner.
- The E2E runner:
  - restores browser state via bootstrap
  - runs synthetic smoke tests
  - optionally runs live-delegated smoke tests if explicitly enabled
- The gateway revokes the grant when:
  - TTL expires
  - the deploy is replaced
  - the test run ends
  - the human explicitly revokes it
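The four revocation triggers in the last step can be expressed as one check. GrantLifecycleState and revocationReason are hypothetical names, and the order in which reasons are reported is an arbitrary choice for this sketch.

```typescript
// State the gateway would need to evaluate a grant (assumed fields).
interface GrantLifecycleState {
  exp: number;             // epoch milliseconds (assumed)
  deployId: string;        // deploy the grant was minted for
  runEnded: boolean;       // the test run has reported completion
  revokedByHuman: boolean; // explicit revocation by the delegating human
}

type RevocationReason = "human-revoked" | "expired" | "deploy-replaced" | "run-ended";

// Returns why a grant should be revoked, or null if it is still valid.
function revocationReason(
  grant: GrantLifecycleState,
  currentDeployId: string,
  now: number,
): RevocationReason | null {
  if (grant.revokedByHuman) return "human-revoked";
  if (grant.exp <= now) return "expired";
  if (grant.deployId !== currentDeployId) return "deploy-replaced";
  if (grant.runEnded) return "run-ended";
  return null;
}
```

Returning a reason rather than a boolean lets the audit trail record which trigger ended the grant.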
This should work for both:
- local agent debugging on a developer machine
- remote CI jobs running after push-to-stage
CI should use the minted delegated grant for the deploy under test.
CI should not use:
- SHIFT_API_KEY as the primary staged E2E auth mechanism
- a broad shared static secret to impersonate users
Security Controls
Required controls:
- grant TTL of minutes, not days
- one-time browser exchange codes
- app or session binding on every grant
- audience restriction to stage testing
- capability checks in middleware
- provider-scope allowlist
- explicit opt-in for live third-party access
- full issuance and redemption audit trail
- revocation on demand and on deploy replacement
Recommended controls:
- require a fresh local CLI session to mint a live-delegated grant
- require a per-app or per-deploy opt-in flag for e2e-live
- use sandbox or test-workspace accounts whenever possible
- attach screenshots, trace artifacts, and audit logs to each run
What To Avoid
- Reusing a human's full platform bearer token inside CI.
- Exporting raw Stage refresh tokens to the browser or to the agent.
- Disabling auth middleware on staging.
- A hidden query parameter like ?agent=true that bypasses auth gates.
- Tests that mutate production-like third-party data without isolation.
Implementation Status
Phase 1 — Delegated Platform Auth ✅ Implemented
All Phase 1 deliverables are live:
- Opaque delegated grant records: sat_... tokens with hash-only server-side storage
- Browser bootstrap exchange: one-time codes redeemed for shift_agent_session and shift_stage_agent cookies
- CLI commands: shift-cli token create|list|revoke and shift-cli test bootstrap
- SDK test helpers: @the-shift/sdk/testing with createTestClient() and authenticatedPage()
- Synthetic E2E: runs on Stage deploys via shift-cli test bootstrap
Phase 2 — Provider Replay 🚧 In Progress
- Session-level provider mode support is partially implemented
- Fixture bundles and recorded responses are not yet available
Phase 3 — Live Delegated Provider Auth ✅ Implemented
- Agent-test Stage cookie: shift_stage_agent with capability-scoped access
- Persistent OAuth via Passport: refresh tokens stored user-scoped with consent-skip for returning users
- Stage Google proxy: accepts both end-user cookies and agent-test cookies with bound session + scope checks
- Audit trail: authorization lifecycle events recorded in passport_audit
Phase 4 — Run Orchestration 🔮 Planned
- Per-app test policy, required checks by environment, grant revocation automation, and flaky test quarantine are not yet implemented.
Browser-Based Eval Mode (Gate L4)
A browser-based evaluation mode has been added to the platform-test workflow. This mode uses browser automation for visual and functional QA of Stage apps:
- Visual checks — Layout integrity, theme support, responsive behavior
- Functional checks — Navigation, data operations, error handling, state persistence
- Scenario format — Declarative criteria sets with pass/fail grading
- React fiber injection — Test framework can inspect React component tree inside Stage sandboxes
Usage: shift-cli test eval --scenario <name> or shift-cli test eval --app-dir ./my-app
Suggested Rollout
Phase 1 ✅
Implement delegated platform auth for Stage browser automation.
Phase 2 🚧
Add provider replay and fixture support.
Phase 3 ✅
Add live delegated provider auth.
Phase 4 🔮
Add run orchestration and policy.
Repository Touchpoints
The current codebase already contains most of the primitives this design should reuse:
- gateway CLI exchange flow: packages/gateway/src/auth/routes.ts
- gateway bearer and cookie auth middleware: packages/gateway/src/auth/middleware.ts
- CLI session storage: packages/core/src/auth.ts
- Stage end-user OAuth: packages/gateway/src/auth/stage-oauth.ts
- Stage Google proxy: packages/gateway/src/auth/stage-google-proxy.ts
- Stage auth gate UI: stage/src/components/AuthGate.tsx
- Stage user auth records: stage/convex/stage.ts
Review Of The Other Plan
Keep:
- opaque random token handles with hash-only storage
- list and revoke support
- SDK testing helpers
- Playwright auth helper
- CLI token and test ergonomics
Change:
- bind every token to app, session, deploy, and capability set
- reduce TTL defaults substantially
- never pass raw tokens in browser query params
- never convert delegated tokens into the standard shift_session cookie
- do not rely on SHIFT_API_KEY for staged E2E identity
- cover Stage end-user OAuth explicitly through userAuth and the Google proxy
Recommended First Implementation Slice
If we want the fastest high-value version, build this first:
- Gateway-issued opaque agent test grant limited to one app or session.
- One-time browser bootstrap for Playwright using exchange codes.
- shift-cli token create|list|revoke commands backed by delegated grant semantics.
- SDK testing helpers for API and browser tests.
- Stage deploy pipeline that runs e2e-synthetic automatically with the minted grant.
- Live delegated Google tests only for explicit smoke paths and only after manual opt-in.
That gets rid of the manual browser blocker without immediately taking on the highest-risk part of the design.