Fast-or-Verify Pattern

Most NL→SQL demos are impressive… right up to the moment you need real guarantees. In finance, healthcare, or regulated ops, “pretty good” isn’t good enough.

This article lays out a two‑lane architecture that answers known questions instantly, routes novel ones to human‑in‑the‑loop (HITL) verification, and learns from every review.

Think of it like airport security:

TSA Pre✓® for vetted passengers (fast, cheap, repeatable) and manual screening for everyone else (slower, safer, documented). Same airport, two speeds, one goal: trust!


Lane 1 - The Fast Lane (Reuse, don’t regenerate)

1) Semantic lookup inside the database. When a user asks a question, the agent runs a similarity search over a repository of previously approved question⇄SQL pairs, stored as embeddings in Oracle Autonomous Database 23ai with AI Vector Search. If it finds a true match, it reuses the human‑approved SQL - no new generation, no chance to hallucinate fresh SQL (a minimal lookup sketch follows step 2).

2) LLM‑as‑a‑Judge gate. A lightweight judge agent checks whether the new question is semantically identical to the matched, validated one (e.g., the cosine similarity clears a threshold and the schema terms align). If yes → execute the approved SQL and return the answer. If not → route to the Quality Lane (see the judge sketch below).
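To make step 1 concrete, here is a minimal sketch of what the lookup could look like with python-oracledb against a 23ai vector table. The table and column names (approved_queries, question_vec) and the embed() helper are assumptions for illustration, not the demo's actual code.

```python
# Minimal Fast-Lane lookup sketch (step 1). Assumes python-oracledb 2.x, a table like
# APPROVED_QUERIES(question, sql_text, question_vec VECTOR, ...) and an embed() helper
# that turns text into a float vector (e.g. via an OCI Generative AI embedding model).
import array
import oracledb

oracledb.defaults.fetch_lobs = False  # return CLOB columns (sql_text) as plain str

def fast_lane_lookup(conn, question, embed, max_distance=0.15):
    """Return the closest approved (question, sql) pair, or None if nothing is close enough."""
    qvec = array.array("f", embed(question))  # float32 vector bind for the VECTOR column
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT question, sql_text,
                   VECTOR_DISTANCE(question_vec, :qv, COSINE) AS dist
              FROM approved_queries
             ORDER BY dist
             FETCH FIRST 1 ROWS ONLY
            """,
            qv=qvec,
        )
        row = cur.fetchone()
    if row and row[2] <= max_distance:
        return {"question": row[0], "sql": row[1], "distance": row[2]}
    return None  # no safe match: fall through to the Quality Lane
```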

Why this works: you avoid generation entirely for common asks. Over time, your “approved SQL cache” compounds - faster answers, lower cost, higher consistency.
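And the judge gate in step 2 can be as small as a single prompt. In this sketch, llm stands for any text-in/text-out callable (for example one wired to an OCI Generative AI chat model); the prompt wording and the YES/NO protocol are illustrative, not the demo's exact logic.

```python
# Minimal LLM-as-a-Judge sketch (step 2): reuse only if the judge agrees the new
# question asks for exactly the same result set as the stored, approved one.
JUDGE_PROMPT = """You compare two analytics questions asked against the same schema.
Answer strictly YES if they request the same result set (same metrics, filters,
grain and time window); otherwise answer NO.

New question: {new_q}
Approved question: {approved_q}
Answer:"""

def is_same_question(llm, new_q, approved_q):
    verdict = llm(JUDGE_PROMPT.format(new_q=new_q, approved_q=approved_q))
    return verdict.strip().upper().startswith("YES")

def route(match, llm, new_q):
    """Combine both gates: a vector match AND judge agreement before reusing SQL."""
    if match and is_same_question(llm, new_q, match["question"]):
        return "fast_lane"     # execute the stored, human-approved SQL as-is
    return "quality_lane"      # generate fresh SQL and send it to review
```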


Lane 2 - The Quality Lane (Generate, pause, verify, learn)

For novel or ambiguous asks, we generate carefully and verify before execution:

1) Clarify & improve the question (short feedback loop).

2) Generate candidate SQL using Select AI in‑database - keeping data movement minimal and auditability high. Don’t execute yet.

3) HITL checkpoint with interrupts. The agent pauses; an analyst verifies the business logic, schema, joins, filters, and limits, then either Approves → execute, or Rejects → the agent revises and resubmits (an interrupt sketch follows this list).

4) Close the loop. Store the now‑approved question⇄SQL pair as an embedding for future Fast‑Lane reuse, with lineage, reviewer, and timestamp (a write-back sketch follows the reality check below).
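To show what the pause/resume mechanics could look like, here is a stripped-down LangGraph sketch of steps 2-3, assuming a recent LangGraph release with interrupt()/Command support. The select_ai_showsql() helper and the node names are placeholders, not the repo's implementation.

```python
# Quality-Lane checkpoint sketch: generate SQL, pause for review, branch on the verdict.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
from langgraph.types import Command, interrupt

class QState(TypedDict):
    question: str
    sql: str

def select_ai_showsql(question: str) -> str:
    # Placeholder for Select AI, e.g. DBMS_CLOUD_AI.GENERATE(prompt => :q,
    # profile_name => '<your profile>', action => 'showsql') via python-oracledb.
    return "SELECT ... /* generated SQL for: " + question + " */"

def generate_sql(state: QState) -> QState:
    return {"question": state["question"], "sql": select_ai_showsql(state["question"])}

def human_review(state: QState) -> Command:
    # Pause the run; the reviewer sees this payload and later resumes with a verdict.
    verdict = interrupt({"question": state["question"], "sql": state["sql"]})
    if verdict.get("approve"):
        return Command(goto=END)                 # approved: execute and store (step 4)
    return Command(goto="generate_sql",          # rejected: revise and resubmit
                   update={"question": verdict.get("revised_question", state["question"])})

builder = StateGraph(QState)
builder.add_node("generate_sql", generate_sql)
builder.add_node("human_review", human_review)
builder.add_edge(START, "generate_sql")
builder.add_edge("generate_sql", "human_review")
graph = builder.compile(checkpointer=MemorySaver())

# First call runs until the interrupt; the reviewer's decision resumes it later:
# cfg = {"configurable": {"thread_id": "req-1"}}
# graph.invoke({"question": "revenue by region last quarter", "sql": ""}, cfg)
# graph.invoke(Command(resume={"approve": True}), cfg)
```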

Reality check: NL→SQL is powerful but can mis-join, mis-filter or reference non-existent columns/tables. The human checkpoint is your safety valve.
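Step 4, the write-back that closes the loop, might look roughly like this; the table shape matches the Pre-reqs sketch further down, and embed() is again an assumed embedding helper rather than the demo's code.

```python
# Closing-the-loop sketch (step 4): persist an approved question⇄SQL pair with lineage.
import array
import datetime

def store_approved_pair(conn, question, sql, reviewer, embed):
    """Write the approved pair (plus reviewer and timestamp) back for Fast-Lane reuse."""
    qvec = array.array("f", embed(question))   # float32 vector for the VECTOR column
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO approved_queries
                   (question, sql_text, question_vec, reviewer, approved_at)
            VALUES (:q, :s, :v, :r, :t)
            """,
            q=question, s=sql, v=qvec, r=reviewer,
            t=datetime.datetime.now(datetime.timezone.utc),
        )
    conn.commit()
```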

Demo repo & UI

I've published a Gradio demo that shows the main moving parts with a clean UI. What it includes today:

  • LLM-as-a-Judge decisioning (the “is this semantically the same?” gate).
  • Fast Lane behavior for returning pre-approved SQL when there’s a true match (no execution).
  • Quality Lane (partial): up to NL→SQL generation (SQL is produced and displayed, not executed).
  • Confidence label: "High/Medium/Low" classification for the generated SQL.

What it does not include:

  • Human-in-the-loop review UI (approve/reject/edit) and the asynchronous pause/resume cycle.
  • Automatic vectorization & storage of new question⇄SQL approvals (i.e., embedding + metadata write-back).
  • SQL execution against a database.

👉 Repo: https://github.com/shjanjua/fast-or-verify-nl2sql - it’s built with Gradio so you can run and share the demo quickly without front-end heavy lifting.

Pre-reqs: an Oracle Database with Select AI configured, plus a vector table containing the approved Question, SQL, and Embedded-Question for Fast-Lane lookup (one possible table shape is sketched below).
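For reference, here is one possible shape for that vector table; the names, column sizes, and connection placeholders are illustrative, not the demo's required schema.

```python
# Illustrative Fast-Lane table: question + approved SQL + the question's embedding,
# plus lightweight governance columns (reviewer, timestamp).
import oracledb

DDL = """
CREATE TABLE approved_queries (
    id            NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    question      VARCHAR2(4000),
    sql_text      CLOB,
    question_vec  VECTOR,            -- 23ai vector column holding the question embedding
    reviewer      VARCHAR2(200),
    approved_at   TIMESTAMP WITH TIME ZONE DEFAULT SYSTIMESTAMP
)"""

with oracledb.connect(user="...", password="...", dsn="...") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```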

Under the Hood

  • Oracle Autonomous Database 23ai – AI Vector Search: semantic lookup of approved Q⇄SQL right where your data lives.
  • Select AI: in-database NL→SQL + RAG for the Quality Lane.
  • OCI Generative AI: managed LLMs/embeddings for judging, rewriting and confidence scoring.
  • Python + LangGraph: multi-step agent.
  • Gradio: lightweight UI to expose the flow and show the SQL being produced.

Guardrails that matter

  • Least privilege: read-only roles/views; block DDL/DML by policy (a naive check is sketched after this list).
  • Dual thresholds: similarity and judge verdict; escalate to HITL on doubt.
  • Query “unit tests” (row-count bounds, KPI validations, time windows) — not in demo.
  • “Show your work” UX: SQL preview, tables touched, filter summary (the demo previews SQL only).
  • Governance metadata (reviewer, version, glossary tags) — not in demo.
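As an example of the "block DDL/DML by policy" guardrail, a deliberately naive pre-flight check might look like this; real deployments should lean on read-only grants first, and this regex is only illustrative.

```python
# Naive guardrail sketch: allow a single SELECT/WITH statement, reject everything else
# before it ever reaches the database. Defense in depth, not a substitute for grants.
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant|revoke)\b",
    re.IGNORECASE,
)

def is_safe_select(sql: str) -> bool:
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:                                    # more than one statement
        return False
    if not re.match(r"(?is)^\s*(select|with)\b", stmt):
        return False
    return not FORBIDDEN.search(stmt)                  # crude keyword screen
```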

What to measure (once you wire in HITL + execution)

  • Fast-Lane hit rate (% answered via reuse; a toy calculation follows this list).
  • Time-to-answer (p50/p95) and review latency.
  • HITL approval rate and revision count.
  • Escapes (post-hoc corrections) → target near-zero.
  • Cost per answer vs BI baseline.
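As a toy illustration of the first two metrics, assuming you log the lane taken and the end-to-end latency per request (the log records below are made up):

```python
# Toy metrics sketch: Fast-Lane hit rate plus p50/p95 time-to-answer from a request log.
from statistics import quantiles

request_log = [
    {"lane": "fast", "seconds": 0.4},
    {"lane": "quality", "seconds": 95.0},
    {"lane": "fast", "seconds": 0.6},
]

hit_rate = sum(r["lane"] == "fast" for r in request_log) / len(request_log)
cuts = quantiles([r["seconds"] for r in request_log], n=100)   # 99 percentile cut points
p50, p95 = cuts[49], cuts[94]
print(f"Fast-Lane hit rate: {hit_rate:.0%} | p50: {p50:.1f}s | p95: {p95:.1f}s")
```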

Conclusion

This pattern doesn’t replace experts; it scales them. Every human‑approved query becomes a reusable asset, turning your conversational layer into a trust machine over time.