The autonomous vehicle industry solved a problem the enterprise AI market has not: it created a shared vocabulary (SAE J3016, Levels 0–5) that tells everyone — engineers, regulators, insurers, buyers — exactly who is responsible for what, and when. Critically, J3016 levels do not describe how smart the vehicle is. They describe how responsibility for the driving task is divided between machine and human, inside an explicitly bounded operating envelope.
Enterprise AI agents today are sold and bought the way cars would be if every vendor claimed "full self-driving" with no definition of conditions, fallback, or liability. The result is predictable: companies either over-trust (deploy unsupervised agents into irreversible workflows) or under-trust (keep humans approving every step forever, destroying the ROI case).
GEAR closes that gap. It gives a company three things:
| AV concept (SAE J3016) | GEAR translation | Why it matters |
|---|---|---|
| Levels describe responsibility allocation, not intelligence | Gears describe decision ownership, not model capability | Stops "our model is smarter, so run it autonomously" arguments |
| Operational Design Domain (ODD) | Operational Task Domain (OTD) | A Waymo is Level 4 in Phoenix. Your agent is G4 inside its OTD — never in general |
| Dynamic Driving Task split into strategic / tactical / operational decisions | Decision Stack: Strategic / Tactical / Operational layers | Each gear is defined by which layers the agent owns |
| Fallback-ready user (L3) | Fallback-Ready Owner (FRO) | G3 is only safe if a named human can actually take over in time |
| Minimal Risk Condition (a stable, stopped state) | Minimal Safe State (MSS) | The agent must always be able to halt safely — no half-completed transactions |
| Gradual ODD expansion (city by city) | OTD expansion with re-certification | New document type, new value threshold, new market = new certification |
Every agentic workflow decomposes into three decision layers (adapted from the Michon hierarchy used in J3016):
A gear is fully defined by which layers the agent owns, and who handles the fallback. This is the framework's load-bearing idea: instead of debating "how autonomous is this agent," you state precisely which decision layers it owns and who catches it when it exits its domain.
The OTD is the explicit, written envelope inside which an autonomy claim is valid. It must specify, at minimum:
| OTD dimension | Example specification |
|---|---|
| Task scope | Income verification for salaried applicants only |
| Input envelope | Payslips, bank statements, employment letters; Arabic + English; PDF/JPEG |
| Tool & action permissions | Read DMS, write to case record; cannot approve or reject the loan |
| Value / materiality bounds | Loan amounts ≤ 2M; auto-processing only at or above a confidence threshold of 0.92 |
| Data boundary | On-prem only; no external API calls; PII never leaves the enclave |
| Volume & time bounds | ≤ 500 cases/day; business hours only (if FRO coverage is required) |
| Exclusions | Self-employed applicants, handwritten documents, politically exposed persons → always escalate |
Rule: autonomy claims without an OTD are marketing, not engineering. Every gear assignment in GEAR is written as "Gn within OTD-x," never "Gn" alone.
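One way to make the written envelope checkable is to encode the OTD as data and test every case against it before the agent acts. The sketch below is illustrative only: the `OTD` and `Case` classes, their field names, and the `otd_violations` helper are hypothetical, and the values simply mirror the example table above. It covers only a few OTD dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OTD:
    """Hypothetical, simplified encoding of an Operational Task Domain."""
    name: str
    applicant_types: frozenset[str]
    document_types: frozenset[str]
    languages: frozenset[str]
    max_loan_amount: float
    exclusions: frozenset[str]


@dataclass(frozen=True)
class Case:
    """Hypothetical case record; field names are assumptions."""
    applicant_type: str
    document_type: str
    language: str
    loan_amount: float
    flags: frozenset[str] = frozenset()


def otd_violations(case: Case, otd: OTD) -> list[str]:
    """Return every reason the case falls outside the OTD (empty list = inside)."""
    reasons = []
    if case.applicant_type not in otd.applicant_types:
        reasons.append("applicant type out of scope")
    if case.document_type not in otd.document_types:
        reasons.append("document type outside input envelope")
    if case.language not in otd.languages:
        reasons.append("language outside input envelope")
    if case.loan_amount > otd.max_loan_amount:
        reasons.append("loan amount above bound")
    # Sorted so the output order is deterministic.
    for flag in sorted(case.flags & otd.exclusions):
        reasons.append(f"excluded category: {flag}")
    return reasons


# Values taken from the example OTD table; identifiers are illustrative.
INCOME_OTD = OTD(
    name="OTD-income-verification",
    applicant_types=frozenset({"salaried"}),
    document_types=frozenset({"payslip", "bank_statement", "employment_letter"}),
    languages=frozenset({"ar", "en"}),
    max_loan_amount=2_000_000,
    exclusions=frozenset({"self_employed", "handwritten", "pep"}),
)

in_scope = Case("salaried", "payslip", "en", 500_000)
out_of_scope = Case("salaried", "payslip", "en", 3_000_000, frozenset({"pep"}))
```

Any non-empty result would route the case to escalation rather than autonomous handling, which keeps the "Gn within OTD-x" claim testable per case.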
The MSS is the agent's equivalent of pulling onto the shoulder and stopping: a stable, halted condition reachable from any point in execution. A valid MSS must guarantee:
MSS reachability is a certification requirement for G3 and above and must be drill-tested (like a fire drill) on a fixed cadence.
The Fallback-Ready Owner (FRO) is a named human (or staffed queue) who is contractually available to receive escalations within a defined response time while the agent operates. G3 is invalid without one, exactly as SAE L3 is invalid without a fallback-ready driver. The FRO requirement is what makes G3 operationally expensive and is a key input to the economics in Section 5.
Mnemonic for executives: G0 Ask. G1 Assist. G2 Approve. G3 Escalate. G4 Audit. G5 Aspire.
| Gear | Name | Agent owns | Human role (per interaction) | Fallback owner | Oversight model |
|---|---|---|---|---|---|
| G0 | Reference | Nothing — answers questions | Operator | Human | Human does the work; AI informs |
| G1 | Assist | Fragments of Operational | Collaborator | Human | AI drafts/suggests inside a step; human executes |
| G2 | Co-Execute | Operational; proposes Tactical | Approver | Human | Agent performs multi-step work; human approves every consequential action before it lands |
| G3 | Conditional | Operational + Tactical, within OTD | Fallback-Ready Owner | Human (FRO) | Agent runs end-to-end; escalates exceptions; FRO must respond within SLA |
| G4 | Bounded | Operational + Tactical + its own fallback to MSS, within OTD | Auditor / Observer | Agent (to MSS) | No real-time human availability required; humans review samples and incidents after the fact |
| G5 | General | Participates in Strategic; sets sub-goals across domains | Policy setter | Agent | Not an enterprise deployment option today. Research horizon only |
G0 — Reference. Reward: knowledge access, faster decisions; near-zero deployment risk. Risk: hallucinated answers informing human decisions; over-reliance without verification. Controls: grounding/RAG with citations, confidence display. Typical residence time: permanent for advisory use cases; a few weeks as a starting point for agentic ones.
G1 — Assist. Reward: 20–40% cycle-time reduction on drafting-heavy work; lowest change-management cost; fastest path to visible value. Risk: quality erosion if humans rubber-stamp suggestions; accountability blur ("the AI wrote it"). Controls: suggestion provenance, mandatory human edit-rate monitoring. Watch metric: if humans accept >95% of suggestions unedited for a sustained period, the use case is signaling it wants G2 — or that review has already collapsed.
G2 — Co-Execute. Reward: the agent absorbs the operational layer entirely; humans shift from doing to deciding. Typically 40–60% effort reduction. This is the default launch gear for any consequential workflow. Risk: the approval-fatigue trap — at high volume, per-action approval becomes a bottleneck and degrades into reflexive clicking, giving you G3 risk at G2 cost. Controls: approval queues with diff views (show what changes, not walls of text), batch approval only for homogeneous low-consequence actions, approval-latency and override-rate dashboards. Economic signature: the oversight tax is at its maximum here; G2 at scale is often more expensive per case than it looks.
G3 — Conditional. Reward: the step-change gear. Humans exit the per-case loop; unit economics improve 5–10x vs G2 because human time is spent only on exceptions. This is where most enterprise value is realized. Risk: the vigilance trap — the AV industry's hardest-learned lesson. When the agent is right 95% of the time, the FRO stops genuinely reviewing, and the 5% failures sail through with a human signature on them. Also: handover quality — an escalation that arrives without context forces the human to redo the whole case. Controls (non-negotiable):
G4 — Bounded. Reward: true lights-out operation inside the OTD; 24/7 throughput; marginal cost per case approaches infrastructure cost; humans move entirely to portfolio-level quality management. Risk: failures compound silently between audits; OTD drift (the world changes — new document formats, new regulation — while the certificate stays frozen); blast radius is bounded only by how honestly the OTD and permissions were engineered. Controls: hard permission boundaries enforced outside the model (policy engine, least-privilege credentials — the agent must be physically unable to exceed its OTD, not just instructed not to); statistical sampling audits with confidence intervals; drift monitors on input distribution; automatic demotion triggers (Section 6); quarterly MSS drills. Honest constraint: G4 is only legitimate for use cases whose worst-case failure inside the OTD is reversible or financially bounded. If you cannot write down the maximum possible loss of a bad day at G4, you have not finished defining the OTD.
G5 — General. Included to mark the end of the spectrum and to give executives a disciplined answer to "why aren't we doing what the keynote showed." An agent that sets its own goals across domains has no OTD, and without an OTD none of the certification machinery in this framework can attach to it. Position: not a deployment option; revisit per major capability generation.
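The G4 control "statistical sampling audits with confidence intervals" can be sketched as follows. This is a minimal example that assumes a Wilson score interval at roughly 95% confidence; the framework does not prescribe a particular interval method, and `audit_passes` and its `max_error_rate` parameter are hypothetical names.

```python
import math


def wilson_interval(errors: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the true error rate, given an audit sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= errors <= sample_size:
        raise ValueError("errors must be between 0 and sample_size")
    p = errors / sample_size
    denom = 1 + z**2 / sample_size
    centre = (p + z**2 / (2 * sample_size)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)) / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def audit_passes(errors: int, sample_size: int, max_error_rate: float) -> bool:
    """Pass only if the UPPER bound of the interval is within tolerance.

    Judging on the upper bound (an assumption of this sketch) means a small
    clean sample cannot certify a low error rate on its own.
    """
    _, upper = wilson_interval(errors, sample_size)
    return upper <= max_error_rate
```

With these definitions, zero errors in a sample of 100 does not demonstrate an error rate below 2%, while 2 errors in 1,000 does. That is the practical reason to size audit samples from the tolerance rather than from reviewer availability.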
Two independent scores decide the gear. This separation is the discipline most ad-hoc approaches lack: consequence sets the ceiling (how high you're allowed to go); readiness sets the floor of effort (how high you can currently go). Capability enthusiasm influences neither.
Score each dimension 1 (low) to 5 (critical) for the worst plausible failure inside the proposed OTD:
| Dimension | Question | 1 | 5 |
|---|---|---|---|
| Reversibility | Can a bad action be undone? | Fully reversible in minutes | Irreversible (funds moved, message sent, record destroyed) |
| Blast radius | How far does one failure propagate? | One internal record | Customers, partners, downstream systems |
| Financial materiality | Direct loss potential per failure | Negligible | Material to P&L or regulatory capital |
| Regulatory exposure | Does failure breach law/regulation? | None | Licensed-activity breach, reporting obligation |
| Data sensitivity | What data does the agent touch/move? | Public | Special-category PII, sovereign-restricted data |
| External visibility | Who sees the failure? | Internal team | Customers, press, regulator |
Gear Ceiling rule (use the maximum of the six scores — risk does not average):
| Max consequence score | Tier | Gear Ceiling |
|---|---|---|
| 5 | Critical | G2 — a human approves every consequential action, indefinitely |
| 4 | High | G3 — autonomous execution, human-owned fallback |
| 3 | Moderate | G4 — within a tightly engineered OTD |
| 1–2 | Low | G4 (G5 is never a ceiling) |
Two design notes. First, the ceiling applies to the action, not the agent: a single agent can run G4 on data extraction while its loan-decision action is permanently pinned at G2 — gears attach to action classes inside the OTD. Second, consequence can be engineered down: adding reversibility (compensating transactions, holding queues, delayed execution windows) or capping materiality (value thresholds) lowers C and legitimately raises the ceiling. This is usually the highest-ROI architecture work in the whole program.
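The ceiling rule is mechanical enough to express directly. The sketch below assumes the six dimensions are keyed by the hypothetical snake_case names shown; the mapping from maximum score to ceiling follows the table above.

```python
CONSEQUENCE_DIMENSIONS = (
    "reversibility",
    "blast_radius",
    "financial_materiality",
    "regulatory_exposure",
    "data_sensitivity",
    "external_visibility",
)


def gear_ceiling(scores: dict[str, int]) -> str:
    """Map six consequence scores (1-5) to a Gear Ceiling using the maximum."""
    missing = [d for d in CONSEQUENCE_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing consequence scores: {missing}")
    values = [scores[d] for d in CONSEQUENCE_DIMENSIONS]
    if any(v not in range(1, 6) for v in values):
        raise ValueError("each score must be an integer from 1 to 5")
    worst = max(values)  # risk does not average
    if worst == 5:
        return "G2"  # Critical
    if worst == 4:
        return "G3"  # High
    return "G4"      # Moderate or Low; G5 is never a ceiling


# Hypothetical action class: irreversible as first designed.
before = dict.fromkeys(CONSEQUENCE_DIMENSIONS, 2) | {"reversibility": 5}
# Same action after adding a holding queue that makes it reversible.
after = before | {"reversibility": 2}
```

Here `gear_ceiling(before)` yields `"G2"` and `gear_ceiling(after)` yields `"G4"`, which illustrates the second design note: changing the architecture, not the model, is what moves the ceiling.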
Score 1–5 on evidence, not intention:
| Dimension | What "5" looks like |
|---|---|
| Task performance evidence | ≥95–99% on a golden dataset representative of the OTD, plus production agreement-rate data |
| Observability | Full reasoning traces, per-step logging, replayable cases, real-time dashboards |
| Guardrail maturity | Permissions enforced outside the model; policy engine; sandboxed tools; tested MSS |
| Data & tool reliability | Stable APIs, monitored data quality, versioned prompts/models |
| Escalation capacity | Named FRO, staffed queue, measured response SLA |
| Organizational maturity | Agent owner, incident process, audit cadence, sign-off authority exist on paper and in practice |
Achievable Gear rule (use the minimum of the six — readiness is a weakest-link property): min ≥ 4 supports G4; min ≥ 3 supports G3; min ≥ 2 supports G2; below that, G0–G1 only.
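As a minimal sketch of the weakest-link rule, the function below takes the minimum of the six readiness scores and applies the thresholds stated above. The dimension keys are hypothetical, and returning `"G1"` for the lowest band is a simplification of "G0–G1 only".

```python
def achievable_gear(readiness: dict[str, int]) -> str:
    """Map six readiness scores (1-5) to the highest gear they support."""
    if len(readiness) != 6:
        raise ValueError("expected exactly six readiness scores")
    if any(v not in range(1, 6) for v in readiness.values()):
        raise ValueError("each score must be an integer from 1 to 5")
    weakest = min(readiness.values())  # readiness is a weakest-link property
    if weakest >= 4:
        return "G4"
    if weakest >= 3:
        return "G3"
    if weakest >= 2:
        return "G2"
    return "G1"  # simplification: the rule allows G0 or G1 here


# Hypothetical scores: strong everywhere except escalation capacity.
example = {
    "task_performance": 4,
    "observability": 5,
    "guardrail_maturity": 4,
    "data_tool_reliability": 4,
    "escalation_capacity": 2,
    "organizational_maturity": 4,
}
```

In this example a single unstaffed escalation queue holds the agent at `"G2"` despite five strong scores, which is the behavior the minimum is meant to enforce.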
Target Gear = lower of (Gear Ceiling from C, Achievable Gear from R)

Launch Gear = Target Gear − 1, or Target Gear in shadow mode (Section 6)
| | R supports G1 | R supports G2 | R supports G3 | R supports G4 |
|---|---|---|---|---|
| Ceiling G2 (Critical) | Launch G0/G1 | Target G2 | Target G2 (readiness surplus → spend it on better approval UX, not more autonomy) | Target G2 |
| Ceiling G3 (High) | Launch G1, build readiness | Launch G2 | Target G3 | Target G3 |
| Ceiling G4 (Mod/Low) | Launch G1 | Launch G2 | Launch G3 | Target G4 |
A readiness surplus above the ceiling is not wasted — it buys cheaper oversight, faster audits, and resilience. A readiness deficit below an attractive ceiling defines the engineering roadmap: the gap analysis between current R and required R is the build plan.
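The placement formulas can be sketched as a small function. This applies the two formulas literally and returns both launch options; the matrix cells labelled "Launch" appear to name the gear that readiness currently supports, so treat the launch output here as one reading rather than a definitive rule. The function name and return shape are hypothetical.

```python
GEARS = ["G0", "G1", "G2", "G3", "G4"]  # G5 is never a deployment target


def place(ceiling: str, achievable: str) -> dict[str, str]:
    """Return the target gear plus the two launch options from the formulas."""
    for gear in (ceiling, achievable):
        if gear not in GEARS:
            raise ValueError(f"unknown gear: {gear}")
    # Target Gear = lower of (Gear Ceiling, Achievable Gear)
    target = min(ceiling, achievable, key=GEARS.index)
    # Launch Gear = Target Gear - 1 (never below G0) ...
    one_below = GEARS[max(GEARS.index(target) - 1, 0)]
    # ... or the Target Gear itself, but only in shadow mode.
    return {"target": target, "launch_live": one_below, "launch_shadow": target}
```

For a High-tier use case whose readiness supports G2, `place("G3", "G2")` returns a target of `"G2"`, with `"G1"` live or `"G2"` in shadow mode as the launch options.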
Net value at a gear is:
NV(G) = Automation dividend (labor, cycle time, throughput, 24/7 coverage) − Oversight tax (approvals, FRO staffing, audits) − Control cost (guardrails, observability, certification) − Risk-adjusted incident cost (probability × bounded worst case)
The curve this produces is the strategic insight:
Strategic implication for portfolio planning: when evaluating a use case, model NV at the ceiling gear, not the launch gear. If the ceiling is G2 (Critical tier) and NV(G2) is negative at projected volume — i.e., the approval burden exceeds manual cost and you are never allowed to go higher — the use case fails economically regardless of how impressive the demo is. That is a phase-out signal, and finding it in a spreadsheet is far cheaper than finding it in production.
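To make the net value formula concrete, the sketch below encodes its four terms and evaluates one scenario. Every figure is invented for illustration, and the class and field names are hypothetical; only the structure of the calculation comes from the formula above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GearEconomics:
    """Annualised inputs to NV(G) for one use case at one gear (illustrative)."""
    automation_dividend: float   # labor, cycle time, throughput, 24/7 coverage
    oversight_tax: float         # approvals, FRO staffing, audits
    control_cost: float          # guardrails, observability, certification
    incident_probability: float  # probability of the bounded worst case
    bounded_worst_case: float    # maximum loss inside the OTD

    def net_value(self) -> float:
        risk_adjusted_incident_cost = (
            self.incident_probability * self.bounded_worst_case
        )
        return (
            self.automation_dividend
            - self.oversight_tax
            - self.control_cost
            - risk_adjusted_incident_cost
        )


# Made-up numbers: a Critical-tier use case pinned at G2 with heavy approvals.
g2_at_ceiling = GearEconomics(
    automation_dividend=1_000_000,
    oversight_tax=700_000,
    control_cost=200_000,
    incident_probability=0.02,
    bounded_worst_case=10_000_000,
)
```

With these invented inputs the net value is negative (about −100,000), so a use case whose ceiling is G2 would be a phase-out candidate at this volume regardless of demo quality.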
Autonomy in GEAR is a license: earned with evidence, held with monitoring, revoked on triggers, and surrendered when the economics die.
Before any promotion, run the agent at the target gear with actions simulated: it executes the full decision flow, but writes go to a shadow store while humans (or the current gear) still do the real work. Compare agent output to human ground truth at production volume. Shadow mode converts promotion from an opinion into a measurement.
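The comparison step can be sketched as an agreement-rate calculation over cases that both the shadow agent and the humans processed. The function name, the dictionary shape (case ID to outcome), and the outcome labels are assumptions made for illustration.

```python
def shadow_agreement(
    shadow: dict[str, str], human: dict[str, str]
) -> tuple[float, list[str]]:
    """Compare shadow-store outcomes with human ground truth.

    Returns (agreement_rate, disagreeing_case_ids), computed only over
    case IDs present in both records.
    """
    common = sorted(shadow.keys() & human.keys())
    if not common:
        raise ValueError("no overlapping cases to compare")
    disagreements = [cid for cid in common if shadow[cid] != human[cid]]
    return 1 - len(disagreements) / len(common), disagreements


# Hypothetical outcomes for four cases.
shadow_store = {"c1": "verified", "c2": "verified", "c3": "escalate", "c4": "verified"}
human_truth = {"c1": "verified", "c2": "rejected", "c3": "escalate", "c4": "verified"}
rate, diffs = shadow_agreement(shadow_store, human_truth)
```

Here `rate` is 0.75 and `diffs` is `["c2"]`. Returning the disagreeing case IDs alongside the rate matters because reviewing those cases is how a team learns whether the agent or the human was wrong.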
A shift of one gear requires all gates over a defined evidence window (e.g., 60 days or 500 cases, whichever is later):
One gear per shift. No skipping. A model/prompt/tool-version change of material scope resets the evidence window (a "new vehicle platform" requires re-homologation, even on the same roads).
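The evidence-window rule ("whichever is later", with a reset on material change) can be sketched as below. The function and parameter names are hypothetical, the defaults are the example values from the text, and the sketch assumes `cases_in_window` already counts only cases processed after the most recent reset.

```python
from datetime import date
from typing import Optional


def evidence_window_complete(
    window_start: date,
    today: date,
    cases_in_window: int,
    last_material_change: Optional[date] = None,
    min_days: int = 60,
    min_cases: int = 500,
) -> bool:
    """True once BOTH the day count and the case count are satisfied."""
    # A material model/prompt/tool change restarts the clock.
    if last_material_change is not None and last_material_change > window_start:
        window_start = last_material_change
    elapsed_days = (today - window_start).days
    # "Whichever is later" means neither condition alone is sufficient.
    return elapsed_days >= min_days and cases_in_window >= min_cases
```

For example, 60 days of operation with only 450 cases does not complete the window, and neither does 500 cases in 59 days.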
Demotion is designed to be boring and routine — a gear is a setting, not a status symbol. Organizations that treat demotion as failure will hide the signals that should trigger it.
Recommend Kill when any of the following holds:
Recommend Park (kill with a revisit date) when the blocker is external and time-resolving: model capability on the task, upcoming regulatory clarity, or a dependency system being replaced. Parking with a written re-entry condition ("revisit when extraction accuracy on handwritten Arabic exceeds 95% on our golden set") prevents both zombie projects and permanently lost opportunities.
Use case: agentic verification of income documents (Arabic/English) for housing-loan applications; extraction, cross-document consistency checks, policy validation, case-record updates. The agent does not make the credit decision.
Step 1 — Define the OTD. Salaried applicants; payslips, bank statements, employment letters; loans ≤ 2M; on-prem processing; exclusions: self-employed, handwritten docs, PEPs.
Step 2 — Consequence (C). Reversibility 2 (case record updates reversible; no funds move). Blast radius 3 (wrong verification feeds a downstream credit decision, but a human credit officer remains in that loop). Financial materiality 3. Regulatory 4 (verification quality is examinable by the banking regulator). Data sensitivity 4 (financial PII, residency requirements). External visibility 2. Max = 4 → High tier → Ceiling G3.
Step 3 — Readiness (R). Golden-set accuracy 94% (score 3), observability strong (4), guardrails enforced via policy engine (4), data quality variable on scanned docs (3), FRO queue exists but unstaffed for volume (2), org maturity 3. Min = 2 → currently supports G2.
Step 4 — Placement. Target = min(G3, G2) = G2, so launch at G2 with a funded path to G3: staff the FRO queue (readiness 2→3) and push accuracy past the 95% High-tier gate (3→4).
Step 5 — Economics. At 500 cases/day, NV(G2) is mildly positive but approval throughput caps at ~300 cases/day per reviewer — G2 is a bottleneck within a quarter. NV(G3) closes the business case decisively. Conclusion: proceed, because the ceiling (G3) is economically sufficient; had the ceiling been G2, this volume profile would have triggered a Kill review before build.
Step 6 — Shift plan. 60-day G2 operation → shadow-G3 for 30 days → gates 1–5 → G3 certificate (6-month expiry) with monthly vigilance injection and a demotion trigger on >8% override rate. The credit-decision action class remains outside the OTD permanently.
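The override-rate demotion trigger in the shift plan could be monitored with a rolling window, as in the sketch below. The 8% threshold comes from the example; the class name, the window size, and the choice to evaluate only a full window are assumptions.

```python
from collections import deque


class OverrideMonitor:
    """Rolling override-rate monitor for one gear assignment (illustrative)."""

    def __init__(self, window: int = 200, max_override_rate: float = 0.08):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.max_override_rate = max_override_rate
        # deque with maxlen silently drops the oldest outcome when full.
        self._outcomes = deque(maxlen=window)

    def record(self, overridden: bool) -> None:
        """Record whether a human overrode the agent on one case."""
        self._outcomes.append(bool(overridden))

    def override_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_demote(self) -> bool:
        """Trigger only on a full window, and only when the rate EXCEEDS the limit."""
        window_full = len(self._outcomes) == self.window
        return window_full and self.override_rate() > self.max_override_rate
```

Waiting for a full window avoids demoting on the first few cases after a promotion. A production monitor would likely pair this with the other triggers the certificate defines.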
Mnemonic: G0 Ask · G1 Assist · G2 Approve · G3 Escalate · G4 Audit · G5 Aspire.