Engineering Text-to-Elasticsearch DSL with LLMs: Constraints, Safety, and Trust

Turning user intent into validated Elasticsearch queries with guardrails.

TL;DR

  • Problem: Vectors are great at meaning, bad at hard constraints (years, owners, statuses, permissions).
  • Approach: LLM translates text → typed IR → deterministic compilation into DSL/ES|QL, under a strict allowlist.
  • Failure modes: Invalid syntax, invented fields, unsafe operators, expensive queries, prompt injection.
  • Guardrails: Schema validation + policy allowlist + query cost budgets + read-only creds + timeouts.
  • Trust: Reverse-parse constraints into UI chips so users can inspect and correct the system’s interpretation.

Users ask for “financial reports from 2024” and mean it literally. Vectors optimize similarity, not constraint satisfaction, so you’ll often get something from 2023 because the content is semantically close. That’s not an embedding bug. It’s soft relevance colliding with hard filters.

We’ll build a Text-to-DSL pipeline with IR → validation → policy budgets → compilation → UI explainability.


1) The Challenge: Vectors vs. Filters

Keyword search is brittle. Vector search is fuzzy. Consider this running example:

“Financial reports from 2024 by Tremblay, exclude drafts.”

  • Keyword search looks for "Tremblay" and "2024" in the text (and often misses "2024" when it lives in metadata).
  • Vector search may return a 2023 draft that talks about 2024 forecasts.

To answer correctly, you must map intent to schema:

  • Tremblay → author (exact/fuzzy match)
  • 2024 → year (hard filter)
  • exclude drafts → status != draft (negative filter)
  • financial reports → body/title fields (relevance)
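What that mapping looks like as a target query, sketched as a Python dict. The field names (`author`, `year`, `status`, `body`) are assumptions for the running example, not a real index mapping:

```python
# Hypothetical target query for the running example, assuming an index
# with `author`, `year`, `status` metadata fields and a `body` text field.
def build_target_query() -> dict:
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"body": "financial reports"}}  # soft relevance
                ],
                "filter": [
                    {"term": {"author": "Tremblay"}},  # exact match
                    {"term": {"year": 2024}},          # hard filter
                ],
                "must_not": [
                    {"term": {"status": "draft"}}      # negative filter
                ],
            }
        }
    }
```

Relevance lives in `must` (scored), hard constraints live in `filter`/`must_not` (unscored, cacheable). That split is exactly what vector-only retrieval can't express.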

2) Model Selection: Constraints Over Creativity

Query generation is a constrained task: you’re mapping text into a small grammar and a known schema. Bigger models can help with ambiguity, but they also cost more and can be more “creative” than you want.

Optimize for:

  1. constraint accuracy
  2. parse success rate
  3. latency/cost

Use a fast model for most queries. Escalate to a larger model only for:

  • query decomposition (multi-intent)
  • clarification generation
  • recovery after repeated validation failures

What to measure

  • Parse success %: valid IR produced?
  • Field hallucination rate: invented field/operator?
  • P95 latency: stable under load?
  • Policy rejection rate: blocked by guardrails?
  • Fallback rate: vector-only due to failures?
  • Constraint precision/recall: inferred filters vs labeled intent

JSON mode note: constrained JSON reduces parse errors. It doesn’t guarantee semantic validity. You still need schema + policy validation.
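The first two metrics can be computed offline against a batch of raw model outputs. A minimal sketch, with illustrative names and no real harness behind it:

```python
# Eval sketch: parse success % and field-hallucination rate over a batch
# of raw LLM outputs. `allowed_fields` stands in for the real schema.
import json

def evaluate(outputs: list[str], allowed_fields: set[str]) -> dict:
    parsed = hallucinated = 0
    for raw in outputs:
        try:
            ir = json.loads(raw)
        except json.JSONDecodeError:
            continue  # counts against parse success
        parsed += 1
        fields = {f["field"] for f in ir.get("filters", [])}
        if fields - allowed_fields:
            hallucinated += 1  # at least one invented field
    return {
        "parse_success": parsed / len(outputs),
        "field_hallucination": hallucinated / max(parsed, 1),
    }
```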


3) Architecture: Typed Intermediate Representation (IR)

Do not let the LLM write raw Elasticsearch DSL. It’s expressive, nested, and full of sharp edges. Instead: LLM outputs a typed IR, and you compile deterministically.

Step 1: LLM output (IR v1)

{
  "query": "financial reports",
  "filters": [
    {"field": "author", "op": "eq", "value": "Tremblay"},
    {"field": "year", "op": "eq", "value": 2024},
    {"field": "status", "op": "neq", "value": "draft"}
  ],
  "limit": 20,
  "sort": [{"field": "_score", "dir": "desc"}]
}

Step 2: Validation (two layers)

  • Schema validation: IR matches a strict type system (e.g., Pydantic). Types, enums, required fields.
  • Policy validation: the query is permissible and safe: allowed fields/operators, bounded complexity, safe defaults.
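A schema-layer sketch in plain stdlib; in production a Pydantic model with `Literal`/enum types does this more cleanly. The operator set is an assumed grammar:

```python
# Schema validation sketch: reject unknown operators, wrong keys, and
# malformed filters before any policy check runs.
from dataclasses import dataclass

ALLOWED_OPS = {"eq", "neq", "gt", "gte", "lt", "lte"}  # assumed IR grammar

@dataclass(frozen=True)
class Filter:
    field: str
    op: str
    value: object

def parse_ir(doc: dict) -> list[Filter]:
    filters = []
    for raw in doc.get("filters", []):
        if set(raw) != {"field", "op", "value"}:
            raise ValueError(f"bad filter keys: {sorted(raw)}")
        if raw["op"] not in ALLOWED_OPS:
            raise ValueError(f"unknown op: {raw['op']}")
        filters.append(Filter(raw["field"], raw["op"], raw["value"]))
    return filters
```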

Step 3: Compilation (deterministic)

Your code converts validated IR into DSL or ES|QL. The model doesn’t get to “discover” script queries, regex tricks, or other creative ways to torch your cluster.

The IR is the contract. Compilation is the enforcement.
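A minimal sketch of such a compiler, handling only `eq`/`neq` and assuming a `body` text field. The point is that only a fixed set of operators can ever appear in the output:

```python
# Deterministic compilation sketch: validated IR in, Query DSL out.
# The model never writes DSL; unsupported ops fail loudly here.

def compile_ir(ir: dict) -> dict:
    must, filt, must_not = [], [], []
    if ir.get("query"):
        must.append({"match": {"body": ir["query"]}})  # assumed text field
    for f in ir["filters"]:
        if f["op"] == "eq":
            filt.append({"term": {f["field"]: f["value"]}})
        elif f["op"] == "neq":
            must_not.append({"term": {f["field"]: f["value"]}})
        else:
            raise ValueError(f"unsupported op: {f['op']}")
    return {
        "query": {"bool": {"must": must, "filter": filt, "must_not": must_not}},
        "size": min(ir.get("limit", 20), 100),  # server-side page cap
    }
```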


4) Security: The Policy Layer

Allowing an LLM to influence queries expands your attack surface. Treat model output like untrusted user input (because it is).

Defense depth 1: Permissions (non-negotiable)

  • Read-only credentials only (no modify/delete/reindex).

  • Auth filters injected server-side at compile time and cannot be overridden.

    • Example: always inject tenant_id = current_tenant / user_id = current_user.

The LLM can propose constraints. It cannot widen permissions.
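A sketch of the injection step: the tenant filter is appended after compilation, so nothing in the IR can remove or override it. `tenant_id` is an assumed field name:

```python
# Auth-injection sketch: server-side filters are added to the compiled
# DSL, outside the model's reach.

def inject_auth(dsl: dict, tenant_id: str) -> dict:
    bool_q = dsl["query"]["bool"]
    bool_q.setdefault("filter", []).append({"term": {"tenant_id": tenant_id}})
    return dsl
```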

Defense depth 2: Query policy (“the budget”)

Before execution, enforce:

  • Allowlist: fields and operators only (e.g., term, match, range)
  • Disallow: script queries, regex, leading wildcards, deep pagination
  • Budgets: max filters/clauses, max query length, max keyword tokens
  • Timeouts: hard execution limits
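A per-field allowlist plus a simple budget, sketched with illustrative fields, operators, and limits:

```python
# Policy-layer sketch: each field allows only specific operators, and
# the total filter count is budgeted. Returns a list of violations.

POLICY = {
    "allowed": {
        "author": {"eq", "neq"},
        "year": {"eq", "gte", "lte"},
        "status": {"eq", "neq"},
    },
    "max_filters": 20,
}

def check_policy(filters: list[dict]) -> list[str]:
    errors = []
    if len(filters) > POLICY["max_filters"]:
        errors.append("too many filters")
    for f in filters:
        ops = POLICY["allowed"].get(f["field"])
        if ops is None:
            errors.append(f"field not allowed: {f['field']}")
        elif f["op"] not in ops:
            errors.append(f"op not allowed: {f['field']}.{f['op']}")
    return errors
```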

Query cost guardrails (reads can still be weaponized)

  • Cap limit and ban deep pagination (from > N)
  • Enforce max boolean clauses (e.g., 20)
  • Enforce max time-range span unless user confirms (e.g., ≤ 10 years)
  • Use ES safety knobs where appropriate: timeout, terminate_after, avoid expensive hit counts when not needed

(Write access isn’t the only way to do damage. Reads can bankrupt you just fine.)
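A cost-cap sketch that clamps pagination and attaches the ES safety knobs (`timeout`, `terminate_after`, `track_total_hits`) to the request body. The numeric budgets are illustrative:

```python
# Cost guardrail sketch: reject deep pagination, clamp page size, and
# attach per-query safety settings before execution.

MAX_FROM, MAX_SIZE = 1000, 100

def apply_cost_caps(body: dict) -> dict:
    body["size"] = min(body.get("size", 20), MAX_SIZE)
    if body.get("from", 0) > MAX_FROM:
        raise ValueError("deep pagination rejected")
    body["timeout"] = "2s"              # per-query execution limit
    body["terminate_after"] = 100_000   # stop collecting after N docs per shard
    body["track_total_hits"] = False    # skip hit counting when not needed
    return body
```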


5) The Trust Loop: Interactive Reverse Parsing

Transparency is not a nice-to-have. It’s how users debug your system.

Implement a reverse parser that turns IR into UI-visible constraints.

User query: “Financial reports from 2024 by Tremblay, exclude drafts”

Generated constraints (from IR):

  • [Author: Tremblay] (x)
  • [Year: 2024] (x)
  • [Status: NOT draft] (x)

This isn’t just explanation. It’s a control panel. Removing a chip regenerates the query without rewriting the prompt.
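The reverse parser itself can be tiny, because the IR is already structured. A sketch producing the chips above; labels and formatting are illustrative:

```python
# Reverse-parsing sketch: render each IR filter as a human-readable chip
# the UI can display (and delete, triggering regeneration).

def to_chips(filters: list[dict]) -> list[str]:
    chips = []
    for f in filters:
        label = f["field"].capitalize()
        value = f"NOT {f['value']}" if f["op"] == "neq" else str(f["value"])
        chips.append(f"[{label}: {value}]")
    return chips
```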

Clarification mode

If the request contains ambiguous constraints (“recent”, “cheap”), the system either:

  • asks a clarification question, or
  • applies a default but surfaces it as an editable chip (e.g., [Recent: last 12 months])

Now the trust loop is actually worthy of the name.


6) Production Patterns

ES|QL vs Query DSL

ES|QL’s linear syntax is often easier for models than nested JSON. Still: don’t pipe raw model text into the engine. Generate IR, validate, compile.

Hybrid orchestration (resilience, not ideology)

Run structured + vector retrieval in parallel:

  • A: Vector → embedding → kNN search (high recall, robust)
  • B: Structured → LLM → IR → validate → compile → policy → execute (high precision, brittle)

Merge: if both succeed, combine (e.g., RRF). If structured fails, fall back to vector silently (and log it).
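Reciprocal Rank Fusion is simple enough to sketch in a few lines; `k = 60` is the commonly used constant:

```python
# RRF sketch: merge ranked lists of doc ids from the vector and
# structured paths. Each list contributes 1 / (k + rank) per document.

def rrf(*rankings: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked highly by both paths rises to the top without any score normalization, which is why RRF is a safe default for merging heterogeneous retrievers.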

Runtime pipeline

  1. Parse: user text → intent candidates
  2. Generate: LLM → IR (constrained JSON)
  3. Validate: schema + policy
  4. Compile: IR → DSL/ES|QL + inject auth filters
  5. Execute: ES query (timeouts enforced)
  6. Explain: IR → UI chips
  7. Fallback/Merge: vector results + fusion if enabled

Conclusion

Text-to-DSL works when you stop treating it like prompting and start treating it like systems engineering: typed contracts, validation layers, policy budgets, and a UI that exposes constraints.

The goal isn’t “the AI made a query.” It’s “the system understood your intent.”
