The GEAR Framework: Governed Enterprise Agent Responsibility

1. Why this framework exists

The autonomous vehicle industry solved a problem the enterprise AI market has not: it created a shared vocabulary (SAE J3016, Levels 0–5) that tells everyone — engineers, regulators, insurers, buyers — exactly who is responsible for what, and when. Critically, J3016 levels do not describe how smart the vehicle is. They describe how responsibility for the driving task is divided between machine and human, inside an explicitly bounded operating envelope.

Enterprise AI agents today are sold and bought the way cars would be if every vendor claimed "full self-driving" with no definition of conditions, fallback, or liability. The result is predictable: companies either over-trust (deploy unsupervised agents into irreversible workflows) or under-trust (keep humans approving every step forever, destroying the ROI case).

GEAR closes that gap. It gives a company three things:

1• A six-gear autonomy ladder (G0–G5) with precise definitions of who owns which decisions at each gear.
2• A placement methodology that derives the maximum permissible gear from consequence, and the currently achievable gear from readiness — per use case.
3• A gear-shifting protocol with quantitative promotion gates, demotion triggers, and explicit kill criteria — so autonomy is earned, revocable, and economically justified, including the decision to phase the initiative out entirely.

What GEAR deliberately imports from the AV world

AV concept (SAE J3016)	GEAR translation	Why it matters
Levels describe responsibility allocation, not intelligence	Gears describe decision ownership, not model capability	Stops "our model is smarter, so run it autonomously" arguments
Operational Design Domain (ODD)	Operational Task Domain (OTD)	A Waymo is Level 4 in Phoenix. Your agent is G4 inside its OTD — never in general
Dynamic Driving Task split into strategic / tactical / operational decisions	Decision Stack: Strategic / Tactical / Operational layers	Each gear is defined by which layers the agent owns
Fallback-ready user (L3)	Fallback-Ready Owner (FRO)	G3 is only safe if a named human can actually take over in time
Minimal Risk Condition (a stable, stopped state)	Minimal Safe State (MSS)	The agent must always be able to halt safely — no half-completed transactions
Gradual ODD expansion (city by city)	OTD expansion with re-certification	New document type, new value threshold, new market = new certification

What GEAR deliberately rejects from the AV analogy

• Linear progress assumption. Companies should not assume every use case "graduates" to G5. Most enterprise value lives permanently at G2–G4, and some use cases should be parked or killed.
• One level per vehicle. An enterprise runs a portfolio of agents at different gears simultaneously. GEAR is a portfolio governance tool, not a single-system rating.
• G3 as a natural waypoint. The AV industry learned that conditional autonomy with a passive human monitor is the most dangerous configuration — humans are poor passive supervisors of mostly-correct automation. GEAR treats G3 as a high-discipline gear requiring engineered vigilance, not a comfortable middle ground.

2. Foundational constructs

2.1 The Decision Stack

Every agentic workflow decomposes into three decision layers (adapted from the Michon hierarchy used in J3016):

• Strategic — What outcome, and why. Goal setting, prioritization, policy. ("Verify income documents for housing loan applications against policy v4.2.")
• Tactical — How to get there. Task decomposition, tool selection, sequencing, exception classification. ("This payslip conflicts with the bank statement; cross-check, then route.")
• Operational — Doing the steps. Extraction, classification, API calls, drafting, record updates.

A gear is fully defined by which layers the agent owns, and who handles the fallback. This is the framework's load-bearing idea: instead of debating "how autonomous is this agent," you state precisely which decision layers it owns and who catches it when it exits its domain.

2.2 The Operational Task Domain (OTD)

The OTD is the explicit, written envelope inside which an autonomy claim is valid. It must specify, at minimum:

OTD dimension	Example specification
Task scope	Income verification for salaried applicants only
Input envelope	Payslips, bank statements, employment letters; Arabic + English; PDF/JPEG
Tool & action permissions	Read DMS, write to case record; cannot approve or reject the loan
Value / materiality bounds	Loan amounts ≤ 2M; auto-processing only at or above a confidence threshold of 0.92
Data boundary	On-prem only; no external API calls; PII never leaves the enclave
Volume & time bounds	≤ 500 cases/day; business hours only (if FRO coverage is required)
Exclusions	Self-employed applicants, handwritten documents, politically exposed persons → always escalate

Rule: autonomy claims without an OTD are marketing, not engineering. Every gear assignment in GEAR is written as "Gn within OTD-x," never "Gn" alone.
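A minimal sketch of how an OTD could be made machine-checkable, assuming a simplified case record. The class name, field names, and the `admits` check are hypothetical illustrations; only the example values come from the table above, and the sketch assumes auto-processing requires confidence at or above the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationalTaskDomain:
    """Hypothetical, simplified OTD covering a few of the dimensions above."""
    otd_id: str
    applicant_types: frozenset   # task scope
    document_types: frozenset    # input envelope
    max_loan_amount: float       # value / materiality bound
    min_confidence: float        # auto-processing threshold (assumed: at or above)
    exclusions: frozenset        # flags that always escalate

    def admits(self, case: dict) -> bool:
        """True only if every checked dimension is inside the envelope."""
        return (
            case["applicant_type"] in self.applicant_types
            and set(case["document_types"]) <= self.document_types
            and case["loan_amount"] <= self.max_loan_amount
            and case["confidence"] >= self.min_confidence
            and not (set(case["flags"]) & self.exclusions)
        )


def gear_claim(gear: int, otd: OperationalTaskDomain) -> str:
    """Write a gear assignment as 'Gn within OTD-x', never 'Gn' alone."""
    return f"G{gear} within {otd.otd_id}"


otd = OperationalTaskDomain(
    otd_id="OTD-income-verification",
    applicant_types=frozenset({"salaried"}),
    document_types=frozenset({"payslip", "bank_statement", "employment_letter"}),
    max_loan_amount=2_000_000,
    min_confidence=0.92,
    exclusions=frozenset({"handwritten", "pep"}),
)
in_scope = {"applicant_type": "salaried", "document_types": ["payslip"],
            "loan_amount": 850_000, "confidence": 0.97, "flags": []}
out_of_scope = {**in_scope, "flags": ["handwritten"]}

print(otd.admits(in_scope), otd.admits(out_of_scope), gear_claim(3, otd))
```

Any case for which `admits` returns False is, by definition, outside the autonomy claim and should escalate rather than be processed.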

2.3 Minimal Safe State (MSS)

The MSS is the agent's equivalent of pulling onto the shoulder and stopping: a stable, halted condition reachable from any point in execution. A valid MSS must guarantee:

1• No partial side effects — in-flight multi-step actions are rolled back or completed via compensating transactions (saga pattern), never abandoned mid-write.
2• State preservation — full context (inputs, reasoning trace, partial outputs) is packaged into a human-readable case file.
3• Routing — the case lands in a monitored queue with an SLA, not a log file nobody reads.
4• No silent failure — entering MSS always emits an alert and a metric.

MSS reachability is a certification requirement for G3 and above and must be drill-tested (like a fire drill) on a fixed cadence.
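The four guarantees can be sketched with a small saga-style runner. This is an illustration under stated assumptions, not a reference implementation: `run_with_mss`, `CaseFile`, and the in-memory queue and alert lists are hypothetical stand-ins for a real workflow engine, case system, and alerting stack, and the sketch assumes compensations themselves do not fail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaseFile:
    """Human-readable record handed to the review queue (hypothetical shape)."""
    case_id: str
    completed_steps: List[str] = field(default_factory=list)
    rolled_back: List[str] = field(default_factory=list)
    failure: str = ""


def run_with_mss(case_id, steps, review_queue, alerts):
    """Run (name, action, compensate) steps; on failure, undo and halt in MSS.

    Returns True if every step completed, False if the case entered MSS.
    """
    case = CaseFile(case_id)
    done = []  # compensations for steps whose side effects already landed
    for name, action, compensate in steps:
        try:
            action()
            done.append((name, compensate))
            case.completed_steps.append(name)
        except Exception as exc:
            case.failure = f"{name}: {exc}"
            # 1. No partial side effects: compensate in reverse order.
            for done_name, undo in reversed(done):
                undo()
                case.rolled_back.append(done_name)
            # 2 and 3. Preserve state and route it to a monitored queue.
            review_queue.append(case)
            # 4. No silent failure: always emit an alert.
            alerts.append(f"MSS entered for {case_id}")
            return False
    return True


ledger = []  # stands in for the case record being written


def failing_write():
    raise RuntimeError("DMS timeout")


steps = [
    ("write_extraction", lambda: ledger.append("extraction"),
     lambda: ledger.remove("extraction")),
    ("write_checks", failing_write, lambda: None),
]
queue, alerts = [], []
completed = run_with_mss("case-001", steps, queue, alerts)
```

An MSS drill, in these terms, is a test that forces a failure at a chosen step and asserts that the ledger is clean, the queue holds the case file, and an alert fired.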

2.4 Fallback-Ready Owner (FRO)

A named human (or staffed queue) who is contractually available to receive escalations within a defined response time while the agent operates. G3 is invalid without one — exactly as SAE L3 is invalid without a fallback-ready driver. The FRO requirement is what makes G3 operationally expensive and is a key input to the economics in Section 5.

3. The Gear Ladder: G0–G5

Mnemonic for executives: G0 Ask. G1 Assist. G2 Approve. G3 Escalate. G4 Audit. G5 Aspire.

Gear	Name	Agent owns	Human role (per interaction)	Fallback owner	Oversight model
G0	Reference	Nothing — answers questions	Operator	Human	Human does the work; AI informs
G1	Assist	Fragments of Operational	Collaborator	Human	AI drafts/suggests inside a step; human executes
G2	Co-Execute	Operational; proposes Tactical	Approver	Human	Agent performs multi-step work; human approves every consequential action before it lands
G3	Conditional	Operational + Tactical, within OTD	Fallback-Ready Owner	Human (FRO)	Agent runs end-to-end; escalates exceptions; FRO must respond within SLA
G4	Bounded	Operational + Tactical + its own fallback to MSS, within OTD	Auditor / Observer	Agent (to MSS)	No real-time human availability required; humans review samples and incidents after the fact
G5	General	Participates in Strategic; sets sub-goals across domains	Policy setter	Agent	Not an enterprise deployment option today. Research horizon only

Gear profiles: reward, risk, and required controls

G0 — Reference. Reward: knowledge access, faster decisions; near-zero deployment risk. Risk: hallucinated answers informing human decisions; over-reliance without verification. Controls: grounding/RAG with citations, confidence display. Typical residence time: permanent for advisory use cases; a few weeks as a starting point for agentic ones.

G1 — Assist. Reward: 20–40% cycle-time reduction on drafting-heavy work; lowest change-management cost; fastest path to visible value. Risk: quality erosion if humans rubber-stamp suggestions; accountability blur ("the AI wrote it"). Controls: suggestion provenance, mandatory human edit-rate monitoring. Watch metric: if humans accept >95% of suggestions unedited for a sustained period, the use case is signaling it wants G2 — or that review has already collapsed.

G2 — Co-Execute. Reward: the agent absorbs the operational layer entirely; humans shift from doing to deciding. Typically 40–60% effort reduction. This is the default launch gear for any consequential workflow. Risk: the approval-fatigue trap — at high volume, per-action approval becomes a bottleneck and degrades into reflexive clicking, giving you G3 risk at G2 cost. Controls: approval queues with diff views (show what changes, not walls of text), batch approval only for homogeneous low-consequence actions, approval-latency and override-rate dashboards. Economic signature: the oversight tax is at its maximum here; G2 at scale is often more expensive per case than it looks.

G3 — Conditional. Reward: the step-change gear. Humans exit the per-case loop; unit economics improve 5–10x vs G2 because human time is spent only on exceptions. This is where most enterprise value is realized. Risk: the vigilance trap — the AV industry's hardest-learned lesson. When the agent is right 95% of the time, the FRO stops genuinely reviewing, and the 5% failures sail through with a human signature on them. Also: handover quality — an escalation that arrives without context forces the human to redo the whole case. Controls (non-negotiable):

1• Engineered vigilance — inject blind known-error cases into the FRO's queue and measure catch rate; if catch rate decays, the human side of the system has failed certification.
2• Structured handover — every escalation carries the case file from MSS packaging: what was attempted, what conflicted, what the agent recommends.
3• FRO capacity contract — escalation volume forecast vs staffed capacity, reviewed monthly. An FRO at 120% utilization means you are running G4 without admitting it.
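The first and third controls reduce to two ratios that can be tracked per review cycle. The sketch below is illustrative: the function names and inputs are hypothetical, and the 90% catch-rate floor is borrowed from the vigilance gate in Section 6.2 as an assumed default.

```python
def catch_rate(injected_results):
    """Share of blind injected-error cases the FRO flagged (True = caught)."""
    if not injected_results:
        raise ValueError("no injected cases in this window")
    return sum(injected_results) / len(injected_results)


def fro_utilization(forecast_escalations, minutes_per_escalation, staffed_minutes):
    """Forecast escalation workload as a fraction of staffed FRO capacity."""
    return forecast_escalations * minutes_per_escalation / staffed_minutes


def human_side_certified(injected_results, utilization, catch_floor=0.90):
    """The human side holds only if vigilance and capacity both hold."""
    return catch_rate(injected_results) >= catch_floor and utilization <= 1.0


# Illustrative numbers only: 10 injected errors, 9 caught; 60 escalations
# at 12 minutes each against 600 staffed minutes.
results = [True] * 9 + [False]
utilization = fro_utilization(60, 12, 600)
print(catch_rate(results), utilization, human_side_certified(results, utilization))
```

In this example the catch rate is acceptable but utilization is 120%, so the human side fails: the fallback exists on paper only.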

G4 — Bounded. Reward: true lights-out operation inside the OTD; 24/7 throughput; marginal cost per case approaches infrastructure cost; humans move entirely to portfolio-level quality management. Risk: failures compound silently between audits; OTD drift (the world changes — new document formats, new regulation — while the certificate stays frozen); blast radius is bounded only by how honestly the OTD and permissions were engineered. Controls: hard permission boundaries enforced outside the model (policy engine, least-privilege credentials — the agent must be physically unable to exceed its OTD, not just instructed not to); statistical sampling audits with confidence intervals; drift monitors on input distribution; automatic demotion triggers (Section 6); quarterly MSS drills. Honest constraint: G4 is only legitimate for use cases whose worst-case failure inside the OTD is reversible or financially bounded. If you cannot write down the maximum possible loss of a bad day at G4, you have not finished defining the OTD.
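"Statistical sampling audits with confidence intervals" can be made concrete with an interval on the audited error rate. The sketch below uses the Wilson score interval, which is one common choice and an assumption here, since the framework does not prescribe a method. The pass rule (upper bound within tolerance) and the function names are likewise illustrative.

```python
import math


def wilson_interval(errors, sample_size, z=1.96):
    """Wilson score interval for an error rate (z=1.96 is roughly 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = errors / sample_size
    denom = 1 + z**2 / sample_size
    centre = (p + z**2 / (2 * sample_size)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / sample_size + z**2 / (4 * sample_size**2))
        / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def audit_passes(errors, sample_size, max_error_rate):
    """Pass only if the upper bound on the error rate is within tolerance."""
    _, upper = wilson_interval(errors, sample_size)
    return upper <= max_error_rate


# Illustrative audit: 3 errors found in a random sample of 400 cases.
low, high = wilson_interval(3, 400)
print(round(low, 4), round(high, 4), audit_passes(3, 400, max_error_rate=0.03))
```

Judging the audit on the upper bound rather than the observed rate means a small sample cannot certify a tight tolerance, which is the point of attaching an interval at all.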

G5 — General. Included to mark the end of the spectrum and to give executives a disciplined answer to "why aren't we doing what the keynote showed." An agent that sets its own goals across domains has no OTD, and without an OTD none of the certification machinery in this framework can attach to it. Position: not a deployment option; revisit per major capability generation.

4. Placement methodology: the C×R Matrix

Two independent scores decide the gear. This separation is the discipline most ad-hoc approaches lack: consequence sets the ceiling (how high you are allowed to go); readiness sets the achievable gear (how high you can go today, on current evidence). Capability enthusiasm influences neither.

4.1 Consequence Score (C) — sets the Gear Ceiling

Score each dimension 1 (low) to 5 (critical) for the worst plausible failure inside the proposed OTD:

Dimension	Question	1	5
Reversibility	Can a bad action be undone?	Fully reversible in minutes	Irreversible (funds moved, message sent, record destroyed)
Blast radius	How far does one failure propagate?	One internal record	Customers, partners, downstream systems
Financial materiality	Direct loss potential per failure	Negligible	Material to P&L or regulatory capital
Regulatory exposure	Does failure breach law/regulation?	None	Licensed-activity breach, reporting obligation
Data sensitivity	What data does the agent touch/move?	Public	Special-category PII, sovereign-restricted data
External visibility	Who sees the failure?	Internal team	Customers, press, regulator

Gear Ceiling rule (use the maximum of the six scores — risk does not average):

Max consequence score	Tier	Gear Ceiling
5	Critical	G2 — a human approves every consequential action, indefinitely
4	High	G3 — autonomous execution, human-owned fallback
3	Moderate	G4 — within a tightly engineered OTD
1–2	Low	G4 (G5 is never a ceiling)

Two design notes. First, the ceiling applies to the action, not the agent: a single agent can run G4 on data extraction while its loan-decision action is permanently pinned at G2 — gears attach to action classes inside the OTD. Second, consequence can be engineered down: adding reversibility (compensating transactions, holding queues, delayed execution windows) or capping materiality (value thresholds) lowers C and legitimately raises the ceiling. This is usually the highest-ROI architecture work in the whole program.
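The ceiling rule is small enough to state as code. A sketch, assuming the six dimensions are keyed by the hypothetical names below and that the scores shown are purely illustrative:

```python
CONSEQUENCE_DIMENSIONS = (
    "reversibility", "blast_radius", "financial_materiality",
    "regulatory_exposure", "data_sensitivity", "external_visibility",
)


def gear_ceiling(scores: dict) -> int:
    """Map six 1-5 consequence scores to a Gear Ceiling using the maximum."""
    missing = set(CONSEQUENCE_DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    if any(not 1 <= scores[d] <= 5 for d in CONSEQUENCE_DIMENSIONS):
        raise ValueError("scores must be between 1 and 5")
    worst = max(scores[d] for d in CONSEQUENCE_DIMENSIONS)  # risk does not average
    if worst == 5:
        return 2  # Critical
    if worst == 4:
        return 3  # High
    return 4      # Moderate or Low; G5 is never a ceiling


# Illustrative action class: irreversible as built, otherwise moderate.
as_built = {
    "reversibility": 5, "blast_radius": 3, "financial_materiality": 3,
    "regulatory_exposure": 3, "data_sensitivity": 2, "external_visibility": 2,
}
# Same action after adding a holding queue with a delayed execution window.
engineered = {**as_built, "reversibility": 3}

print(gear_ceiling(as_built), gear_ceiling(engineered))
```

The second call shows the "engineer consequence down" effect: changing one score from 5 to 3 moves the ceiling from G2 to G4 without touching the model.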

4.2 Readiness Score (R) — sets the current Achievable Gear

Score 1–5 on evidence, not intention:

Dimension	What "5" looks like
Task performance evidence	Accuracy at or above the tier's threshold (in the 95–99% range) on a golden dataset representative of the OTD, plus production agreement-rate data
Observability	Full reasoning traces, per-step logging, replayable cases, real-time dashboards
Guardrail maturity	Permissions enforced outside the model; policy engine; sandboxed tools; tested MSS
Data & tool reliability	Stable APIs, monitored data quality, versioned prompts/models
Escalation capacity	Named FRO, staffed queue, measured response SLA
Organizational maturity	Agent owner, incident process, audit cadence, sign-off authority exist on paper and in practice

Achievable Gear rule (use the minimum of the six — readiness is a weakest-link property): min ≥ 4 supports G4; min ≥ 3 supports G3; min ≥ 2 supports G2; below that, G0–G1 only.
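A sketch of the weakest-link rule, with hypothetical dimension keys and illustrative scores. Returning 1 for the lowest band is a simplification that stands for "G0–G1 only".

```python
def achievable_gear(scores: dict) -> int:
    """Map 1-5 readiness scores to the highest gear they support (minimum rule)."""
    weakest = min(scores.values())
    if weakest >= 4:
        return 4
    if weakest >= 3:
        return 3
    if weakest >= 2:
        return 2
    return 1  # stands for "G0-G1 only"


def weakest_dimensions(scores: dict) -> list:
    """Dimensions currently holding the achievable gear down."""
    floor = min(scores.values())
    return sorted(d for d, s in scores.items() if s == floor)


# Illustrative scores: strong engineering, thin escalation staffing.
readiness = {
    "task_performance": 4, "observability": 5, "guardrail_maturity": 4,
    "data_tool_reliability": 4, "escalation_capacity": 2,
    "organizational_maturity": 3,
}
print(achievable_gear(readiness), weakest_dimensions(readiness))
```

The second function is the useful one in practice: it names where the next unit of investment has to go before any higher gear is supportable.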

4.3 The placement decision

Target Gear = lower of (Gear Ceiling from C, Achievable Gear from R)
Launch Gear = Target Gear − 1, or Target Gear in shadow mode (Section 6)

	R supports G1	R supports G2	R supports G3	R supports G4
Ceiling G2 (Critical)	Launch G0/G1	Target G2	Target G2 (readiness surplus → spend it on better approval UX, not more autonomy)	Target G2
Ceiling G3 (High)	Launch G1, build readiness	Launch G2	Target G3	Target G3
Ceiling G4 (Mod/Low)	Launch G1	Launch G2	Launch G3	Target G4

A readiness surplus above the ceiling is not wasted — it buys cheaper oversight, faster audits, and resilience. A readiness deficit below an attractive ceiling defines the engineering roadmap: the gap analysis between current R and required R is the build plan.
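A literal reading of the two placement rules, as a sketch. The function name and the returned keys are hypothetical; gears are plain integers (2 for G2, and so on).

```python
def place(ceiling: int, achievable: int, shadow_mode: bool = False) -> dict:
    """Apply Target = lower of (ceiling, achievable) and the launch rule."""
    target = min(ceiling, achievable)
    # Launch one gear below target, or at target if actions are only simulated.
    launch = target if shadow_mode else max(target - 1, 0)
    return {
        "target": target,
        "launch": launch,
        # Gap between what consequence allows and what readiness supports:
        # this is the engineering roadmap.
        "readiness_gap": max(ceiling - achievable, 0),
        # Readiness above the ceiling: spend it on cheaper oversight.
        "readiness_surplus": max(achievable - ceiling, 0),
    }


print(place(ceiling=3, achievable=2))
print(place(ceiling=2, achievable=4, shadow_mode=True))
```

The first call describes a use case with a one-gear build plan ahead of it; the second describes a Critical-tier use case whose surplus readiness cannot buy more autonomy.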

5. The economics: where each gear pays

Net value at a gear is:

NV(G) = Automation dividend (labor, cycle time, throughput, 24/7 coverage)
  − Oversight tax (approvals, FRO staffing, audits)
  − Control cost (guardrails, observability, certification)
  − Risk-adjusted incident cost (probability × bounded worst case)

The curve this produces is the strategic insight:

• G1 delivers modest, fast, low-risk value. It rarely justifies a platform investment on its own.
• G2 often disappoints at scale: the automation dividend grows with volume, but so does the oversight tax — one human approval per case is a linear cost that caps ROI. G2 is a transition gear, economically: right for launch, wrong as a destination for high-volume work.
• G3 is the inflection point. The oversight tax collapses from "every case" to "exceptions only." For most high-volume back-office workflows, G3 is where the business case actually closes.
• G4 maximizes the dividend but adds material control cost and concentrates tail risk. It pays when volume is high, consequence is engineered down, and audits replace queues.

Strategic implication for portfolio planning: when evaluating a use case, model NV at the ceiling gear, not the launch gear. If the ceiling is G2 (Critical tier) and NV(G2) is negative at projected volume — i.e., the approval burden exceeds manual cost and you are never allowed to go higher — the use case fails economically regardless of how impressive the demo is. That is a phase-out signal, and finding it in a spreadsheet is far cheaper than finding it in production.
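The NV formula and the ceiling-gear test can be run as a few lines of arithmetic. All figures below are invented for illustration and carry no empirical weight; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GearEconomics:
    """Annualized inputs to NV(G) for one gear (illustrative units)."""
    automation_dividend: float
    oversight_tax: float
    control_cost: float
    incident_probability: float
    bounded_worst_case: float

    def net_value(self) -> float:
        risk_adjusted_incident_cost = (
            self.incident_probability * self.bounded_worst_case
        )
        return (
            self.automation_dividend
            - self.oversight_tax
            - self.control_cost
            - risk_adjusted_incident_cost
        )


def phase_out_signal(economics_at_ceiling: GearEconomics) -> bool:
    """Negative NV at the ceiling gear means no permitted gear rescues the case."""
    return economics_at_ceiling.net_value() < 0


# Invented numbers: a Critical-tier use case whose ceiling is G2.
nv_g2 = GearEconomics(
    automation_dividend=900_000, oversight_tax=1_000_000,
    control_cost=100_000, incident_probability=0.02,
    bounded_worst_case=500_000,
)
print(nv_g2.net_value(), phase_out_signal(nv_g2))
```

Here the per-case approval burden alone exceeds the dividend, so the use case fails in the spreadsheet before anything is built.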

6. The Gear-Shifting Protocol: earn, hold, lose, kill

Autonomy in GEAR is a license: earned with evidence, held with monitoring, revoked on triggers, and surrendered when the economics die.

6.1 Shadow mode (pre-shift)

Before any promotion, run the agent at the target gear with actions simulated: it executes the full decision flow, but its writes go to a shadow store while humans (or the current gear) still do the real work. Compare agent output to human ground truth at production volume. Shadow mode converts promotion from an opinion into a measurement.
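The comparison step can be sketched as a join between the shadow store and the human record. The dictionaries keyed by case ID and the outcome labels are hypothetical; a real comparison would likely be field-level rather than a single outcome per case.

```python
def shadow_agreement(agent_outputs: dict, human_outputs: dict):
    """Compare shadow-store outputs to human ground truth, case by case.

    Returns (agreement_rate, sorted list of disagreeing case IDs).
    Only cases present in both records are compared.
    """
    shared = agent_outputs.keys() & human_outputs.keys()
    if not shared:
        raise ValueError("no overlapping cases to compare")
    disagreements = sorted(
        case_id for case_id in shared
        if agent_outputs[case_id] != human_outputs[case_id]
    )
    rate = 1 - len(disagreements) / len(shared)
    return rate, disagreements


# Illustrative data: what the agent would have written vs what humans did.
shadow_store = {"c1": "verified", "c2": "verified", "c3": "escalate", "c4": "verified"}
ground_truth = {"c1": "verified", "c2": "rejected", "c3": "escalate", "c4": "verified"}

rate, to_review = shadow_agreement(shadow_store, ground_truth)
print(rate, to_review)
```

The list of disagreeing cases matters as much as the rate: each one is either an agent error or a human error, and the promotion evidence should say which.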

6.2 Promotion gates (shift up)

A shift of one gear requires passing every applicable gate over a defined evidence window (e.g., 60 days or 500 cases, whichever is later):

1• Accuracy gate — agreement with ground truth ≥ threshold set per consequence tier (e.g., ≥98% Critical-adjacent, ≥95% High, ≥92% Moderate).
2• Override gate — human override/correction rate below threshold (e.g., <5%) and trending flat or down.
3• Escalation quality gate (for entry to G3) — escalations are genuinely exceptional (<15% of volume) and handover files rated usable by the FRO.
4• Vigilance gate (for holding G3) — blind injected-error catch rate ≥90%.
5• Incident gate — zero Sev-1, ≤ agreed Sev-2 count in window.
6• MSS gate (for entry to G4) — successful MSS drill from at least three distinct failure points, including mid-transaction.
7• Sign-off — agent owner + risk/compliance counterpart sign a renewed Gear Certificate (Section 7).

One gear per shift. No skipping. A model/prompt/tool-version change of material scope resets the evidence window (a "new vehicle platform" requires re-homologation, even on the same roads).
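The gates can be evaluated mechanically once the evidence is collected. The sketch below uses the example thresholds from the list above; the `Evidence` fields, the `max_sev2` default, and the function name are hypothetical. It omits the override trend check and the vigilance gate, which applies to holding G3 rather than entering it.

```python
from dataclasses import dataclass

# Example accuracy thresholds per consequence tier, taken from gate 1.
ACCURACY_THRESHOLD = {"critical_adjacent": 0.98, "high": 0.95, "moderate": 0.92}


@dataclass
class Evidence:
    days: int
    cases: int
    accuracy: float
    override_rate: float
    escalation_rate: float
    handover_usable: bool
    sev1_incidents: int
    sev2_incidents: int
    mss_drill_points_passed: int
    signed_off: bool


def failed_gates(ev: Evidence, tier: str, target_gear: int, max_sev2: int = 1):
    """Return the names of gates that block a one-gear promotion."""
    failed = []
    # "Whichever is later" means both minimums must be met.
    if ev.days < 60 or ev.cases < 500:
        failed.append("evidence_window")
    if ev.accuracy < ACCURACY_THRESHOLD[tier]:
        failed.append("accuracy")
    if ev.override_rate >= 0.05:
        failed.append("override")
    if target_gear == 3 and (ev.escalation_rate >= 0.15 or not ev.handover_usable):
        failed.append("escalation_quality")
    if ev.sev1_incidents > 0 or ev.sev2_incidents > max_sev2:
        failed.append("incident")
    if target_gear == 4 and ev.mss_drill_points_passed < 3:
        failed.append("mss")
    if not ev.signed_off:
        failed.append("sign_off")
    return failed


ev = Evidence(days=75, cases=620, accuracy=0.96, override_rate=0.03,
              escalation_rate=0.11, handover_usable=True, sev1_incidents=0,
              sev2_incidents=1, mss_drill_points_passed=0, signed_off=True)
print(failed_gates(ev, "high", target_gear=3))
print(failed_gates(ev, "high", target_gear=4))
```

Returning the list of failed gates, rather than a single pass or fail, gives the agent owner the specific evidence still missing.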

6.3 Demotion triggers (shift down — automatic where possible)

• Override or correction rate breaches threshold for N consecutive days
• Input drift monitor fires (the world left the OTD: new document formats, new regulation, new customer segment)
• Any Sev-1 incident → immediate drop to G2 (or MSS-and-halt) pending review
• FRO utilization > capacity for two consecutive cycles (you no longer actually have a fallback)
• Vigilance catch rate decays below floor
• Certification expiry without renewal (certificates are time-bound by default: 6 months)

Demotion is designed to be boring and routine — a gear is a setting, not a status symbol. Organizations that treat demotion as failure will hide the signals that should trigger it.
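A sketch of the triggers as an automated check. Several values are assumptions: the 8% override threshold is borrowed from the worked example in Section 8, N is set to 3 days, the catch-rate floor to 90%, and any trigger other than a Sev-1 is assumed to demote by one gear. All names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Monitoring:
    daily_override_rates: list       # most recent day last
    drift_alarm: bool
    sev1_open: bool
    fro_utilization_by_cycle: list   # most recent cycle last; 1.0 = at capacity
    vigilance_catch_rate: float
    certificate_expiry: date


def demotion_triggers(m: Monitoring, today: date, override_threshold=0.08,
                      n_days=3, catch_floor=0.90):
    """Return the names of every demotion trigger currently firing."""
    fired = []
    recent = m.daily_override_rates[-n_days:]
    if len(recent) == n_days and all(r > override_threshold for r in recent):
        fired.append("override_rate")
    if m.drift_alarm:
        fired.append("input_drift")
    if m.sev1_open:
        fired.append("sev1_incident")
    last_two = m.fro_utilization_by_cycle[-2:]
    if len(last_two) == 2 and all(u > 1.0 for u in last_two):
        fired.append("fro_over_capacity")
    if m.vigilance_catch_rate < catch_floor:
        fired.append("vigilance_decay")
    if today > m.certificate_expiry:
        fired.append("certificate_expired")
    return fired


def gear_after_triggers(current_gear: int, fired: list) -> int:
    """Sev-1 drops to G2 at most; any other trigger drops one gear (assumed)."""
    if "sev1_incident" in fired:
        return min(current_gear, 2)
    return max(current_gear - 1, 0) if fired else current_gear


m = Monitoring([0.05, 0.09, 0.10, 0.11], False, False, [0.9, 1.1, 1.2],
               0.93, date(2025, 12, 31))
fired = demotion_triggers(m, today=date(2025, 9, 1))
print(fired, gear_after_triggers(3, fired))
```

Because the check is a pure function of monitoring data, it can run on a schedule and write its result to the Shift Log without anyone having to decide to demote.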

6.4 Kill / Park criteria (phase out the approach)

Recommend Kill when any of the following holds:

1• Ceiling-locked negative NV — the consequence ceiling caps the use case at a gear whose net value is negative at realistic volume (Section 5), and consequence cannot be engineered down.
2• Structural readiness gap — a required readiness dimension (typically observability or guardrails) has no funded path to the needed level.
3• Chronic exception dominance — escalation rate stays above ~30–40% after two full improvement cycles; the "exceptions" are the job, and the OTD was a fiction.
4• Control cost inversion — certification, audit, and FRO costs exceed the dividend durably (common in low-volume, high-consequence niches).

Recommend Park (kill with a revisit date) when the blocker is external and time-resolving: model capability on the task, upcoming regulatory clarity, or a dependency system being replaced. Parking with a written re-entry condition ("revisit when extraction accuracy on handwritten Arabic exceeds 95% on our golden set") prevents both zombie projects and permanently lost opportunities.
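The Kill and Park rules can be sketched as a single review function. This is one possible reading, with assumptions labeled: Park is returned only when at least one Kill criterion holds and the blocker is external with a written re-entry condition; the 30% exception ceiling is the low end of the range above; and "durably" is reduced to a single comparison. All names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class UseCaseReview:
    nv_at_ceiling: float
    consequence_can_be_engineered_down: bool
    readiness_gap_funded: bool        # True if every gap has a funded path
    escalation_rate: float
    improvement_cycles_completed: int
    control_cost: float
    automation_dividend: float
    external_blocker: Optional[str] = None
    reentry_condition: Optional[str] = None


def recommend(r: UseCaseReview, exception_ceiling: float = 0.30):
    """Return (decision, reasons) where decision is CONTINUE, PARK, or KILL."""
    reasons = []
    if r.nv_at_ceiling < 0 and not r.consequence_can_be_engineered_down:
        reasons.append("ceiling-locked negative NV")
    if not r.readiness_gap_funded:
        reasons.append("structural readiness gap")
    if (r.improvement_cycles_completed >= 2
            and r.escalation_rate > exception_ceiling):
        reasons.append("chronic exception dominance")
    if r.control_cost > r.automation_dividend:
        reasons.append("control cost inversion")
    if not reasons:
        return "CONTINUE", reasons
    if r.external_blocker:
        if not r.reentry_condition:
            raise ValueError("Park requires a written re-entry condition")
        return "PARK", reasons
    return "KILL", reasons


review = UseCaseReview(
    nv_at_ceiling=120_000, consequence_can_be_engineered_down=False,
    readiness_gap_funded=True, escalation_rate=0.38,
    improvement_cycles_completed=2, control_cost=200_000,
    automation_dividend=600_000,
)
parked = replace(
    review, external_blocker="model capability on handwritten documents",
    reentry_condition="revisit when extraction accuracy exceeds 95% on our golden set",
)
print(recommend(review), recommend(parked)[0])
```

Refusing to return PARK without a re-entry condition is the code-level version of the rule that parking needs a written revisit trigger.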

7. Governance artifacts (the minimum set)

1• Gear Certificate — per agent, per OTD: gear, OTD specification, decision-stack ownership table, FRO designation, MSS definition, gate evidence, expiry date, signatories. One page. If it doesn't fit on one page, the OTD is too vague.
2• Autonomy Register — the enterprise inventory: every agent, its current gear, certificate expiry, last drill, last incident. This is the artifact a regulator, auditor, or acquiring entity will ask for first.
3• Shift Log — every promotion/demotion with trigger and evidence. The institution's memory of what autonomy it has trusted, and why.
4• Drill Calendar — MSS drills and vigilance injections, scheduled and attested.
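The first two artifacts are, at bottom, a record type and a list of them. A minimal sketch, assuming a 182-day validity as a stand-in for "6 months" and a 30-day renewal horizon; the field names, agent names, and dates are hypothetical, and a real certificate carries more fields than shown.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class GearCertificate:
    agent: str
    otd_id: str
    gear: int
    fro: str                  # empty string if no FRO is designated
    issued: date
    validity_days: int = 182  # roughly six months (assumed default)

    def __post_init__(self):
        # G3 is invalid without a Fallback-Ready Owner.
        if self.gear == 3 and not self.fro:
            raise ValueError("G3 requires a designated FRO")

    @property
    def expiry(self) -> date:
        return self.issued + timedelta(days=self.validity_days)

    def label(self) -> str:
        return f"G{self.gear} within {self.otd_id}"


def renewals_due(register, today: date, within_days: int = 30):
    """Certificates in the Autonomy Register expiring within the horizon."""
    horizon = today + timedelta(days=within_days)
    return [c for c in register if c.expiry <= horizon]


register = [
    GearCertificate("income-verifier", "OTD-income-verification", 3,
                    "verification-queue", date(2025, 1, 1)),
    GearCertificate("doc-classifier", "OTD-doc-intake", 2, "", date(2025, 5, 1)),
]
due = renewals_due(register, today=date(2025, 6, 15))
print([(c.agent, c.label(), c.expiry.isoformat()) for c in due])
```

Running `renewals_due` on a schedule turns certificate expiry from a surprise into a work item, which is what keeps the expiry-based demotion trigger from firing in the first place.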

8. Worked example: document verification agent (banking)

Use case: agentic verification of income documents (Arabic/English) for housing-loan applications; extraction, cross-document consistency checks, policy validation, case-record updates. The agent does not make the credit decision.

Step 1 — Define the OTD. Salaried applicants; payslips, bank statements, employment letters; loans ≤ 2M; on-prem processing; exclusions: self-employed, handwritten docs, PEPs.

Step 2 — Consequence (C). Reversibility 2 (case record updates reversible; no funds move). Blast radius 3 (wrong verification feeds a downstream credit decision, but a human credit officer remains in that loop). Financial materiality 3. Regulatory 4 (verification quality is examinable by the banking regulator). Data sensitivity 4 (financial PII, residency requirements). External visibility 2. Max = 4 → High tier → Ceiling G3.

Step 3 — Readiness (R). Golden-set accuracy 94% (score 3), observability strong (4), guardrails enforced via policy engine (4), data quality variable on scanned docs (3), FRO queue exists but is understaffed for the expected volume (2), org maturity 3. Min = 2 → currently supports G2.

Step 4 — Placement. Target Gear = min(Ceiling G3, Achievable G2) = G2. Launch at G2, with a funded path to G3: staff the FRO queue (readiness 2→3) and push accuracy past the 95% High-tier gate (3→4).

Step 5 — Economics. At 500 cases/day, NV(G2) is mildly positive but approval throughput caps at ~300 cases/day per reviewer — G2 is a bottleneck within a quarter. NV(G3) closes the business case decisively. Conclusion: proceed, because the ceiling (G3) is economically sufficient; had the ceiling been G2, this volume profile would have triggered a Kill review before build.

Step 6 — Shift plan. 60-day G2 operation → shadow-G3 for 30 days → gates 1–5 → G3 certificate (6-month expiry) with monthly vigilance injection and a demotion trigger on >8% override rate. The credit-decision action class remains outside the OTD permanently.

9. One-page summary

1• Autonomy is allocated, not assumed. A gear states which decision layers (Strategic/Tactical/Operational) the agent owns, and who catches the fallback — nothing else.
2• No OTD, no gear. Every autonomy claim is bounded to a written Operational Task Domain, like an AV certified city by city.
3• Two scores, one decision. Consequence sets the ceiling (G2/G3/G4 by tier). Readiness sets what's achievable now. Target = the lower. Launch one gear below or in shadow.
4• G2 is a launchpad, G3 is the business case, G4 is earned, G5 is a poster. Model the economics at the ceiling gear before building anything.
5• Beware the two traps: approval fatigue at G2 and the vigilance trap at G3 — both give you the risk of the next gear up at the cost of the current one. Engineer against them explicitly.
6• Autonomy is a revocable license. Quantitative promotion gates, automatic demotion triggers, time-boxed certificates, drilled Minimal Safe States.
7• Killing is a valid outcome. Ceiling-locked negative value, structural readiness gaps, or exception-dominated workloads → Kill or Park with a written re-entry condition.