ss-tools/specs/027-dataset-llm-orchestration/research.md
2026-03-16 23:11:19 +03:00

# Research: LLM Dataset Orchestration
**Feature**: [LLM Dataset Orchestration](./spec.md)
**Branch**: `027-dataset-llm-orchestration`
**Date**: 2026-03-16
## Scope
This document resolves the Phase 0 technical unknowns identified in [`specs/027-dataset-llm-orchestration/plan.md`](./plan.md) and establishes implementation decisions for:
- Superset link parsing and context recovery
- runtime variable and Jinja discovery
- compiled SQL preview strategy
- SQL Lab launch handoff
- semantic source resolution architecture
- clarification session architecture
- typed API scope
- persistence, testing, and rollout strategy
---
## 1. Superset Link Intake and Native Filter Recovery
### Decision
Use a dedicated backend adapter, `SupersetContextExtractor`, to parse supported Superset URLs and recover context in two layers:
1. **URL-level recovery** from dashboard, dataset, and query parameters in the provided link
2. **Superset API enrichment** using the resolved dashboard/dataset identifiers to recover saved filter context and related dataset metadata
The recovered filter set will be stored as an `ImportedFilterSet` with provenance per value:
- `superset_native`
- `superset_url`
- `manual`
- `inferred`
If only partial filter recovery is possible, the session enters `recovery_required` rather than failing closed.
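A minimal sketch of the provenance model, using hypothetical field names (the authoritative schema is defined in Phase 1):

```python
from dataclasses import dataclass, field
from enum import Enum

class FilterProvenance(str, Enum):
    SUPERSET_NATIVE = "superset_native"
    SUPERSET_URL = "superset_url"
    MANUAL = "manual"
    INFERRED = "inferred"

@dataclass
class ImportedFilter:
    column: str
    value: object
    provenance: FilterProvenance

@dataclass
class ImportedFilterSet:
    filters: list[ImportedFilter] = field(default_factory=list)
    unrecovered: list[str] = field(default_factory=list)  # columns we could not recover

    @property
    def recovery_required(self) -> bool:
        # Partial recovery keeps the session alive but flags it for user review
        # instead of failing closed.
        return bool(self.unrecovered)
```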
### Rationale
The UX requires recovery to feel progressive rather than brittle. A pure URL parser would be too fragile and version-specific. A pure API lookup would miss context encoded in the link itself. A two-layer strategy preserves value even when one source is incomplete.
This also matches the spec's requirement that partial recovery remain usable and be explicitly explained.
### Alternatives considered
- **Parse only URL query parameters**
Rejected because it is too dependent on frontend link shape and misses server-side saved context.
- **Resolve everything only through dashboard API after identifying the dashboard**
Rejected because some context lives directly in the URL and should be captured before API reconciliation.
- **Block session start if any filter cannot be recovered**
Rejected because it violates the UX principle of progressive value and FR-025.
---
## 2. Runtime Template Variable and Jinja Discovery
### Decision
Detect runtime execution variables using a hybrid discovery pipeline:
1. Load dataset detail from Superset via existing [`backend/src/core/superset_client.py`](backend/src/core/superset_client.py)
2. Extract `sql`, `is_sqllab_view`, metric expressions, and other query-bearing fields where available
3. Apply backend parsing rules for:
- `{{ ... }}`
- `filter_values('...')`
- common template parameter references
- Jinja variables in SQL text and metric expressions
4. Normalize discovered variables into typed `TemplateVariable` entries with:
- variable name
- source expression
- inferred kind (`native_filter`, `parameter`, `derived`, `unknown`)
- required/optional status
- mapping status
The parser will not execute Jinja locally. It only discovers variables and structure.
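The discovery step can be sketched as follows. The patterns are deliberately rough illustrations of the rules above; the real parser will need to handle nesting and quoting more carefully:

```python
import re
from dataclasses import dataclass

# Rough illustrative patterns for the discovery rules above.
JINJA_VAR = re.compile(r"\{\{\s*([a-zA-Z_][\w.]*)")
FILTER_VALUES = re.compile(r"filter_values\(\s*'([^']+)'\s*\)")

@dataclass
class TemplateVariable:
    name: str
    source_expression: str
    kind: str  # native_filter | parameter | derived | unknown

def discover_variables(sql: str) -> list[TemplateVariable]:
    """Discover template variables in SQL text without executing Jinja."""
    found: dict[str, TemplateVariable] = {}
    for m in FILTER_VALUES.finditer(sql):
        found[m.group(1)] = TemplateVariable(m.group(1), m.group(0), "native_filter")
    for m in JINJA_VAR.finditer(sql):
        name = m.group(1)
        # Skip the filter_values() call itself; keep bare Jinja names as unknown
        # until mapping classifies them.
        if name not in found and not name.startswith("filter_values"):
            found[name] = TemplateVariable(name, m.group(0), "unknown")
    return list(found.values())
```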
### Rationale
The UX and specification both require execution transparency, but also require Superset to remain the source of truth for final SQL compilation. Discovery is safe locally; execution is not. Separating variable discovery from compilation preserves WYSIWWR and avoids local behavior drift.
### Alternatives considered
- **Use only string regex over SQL text**
Rejected because it is simple but insufficient for richer query-bearing payloads.
- **Execute local Jinja rendering to discover required params**
Rejected because it conflicts with the UX principle that only Superset defines final execution.
- **Rely on LLM to infer variables from SQL text**
Rejected because variable discovery must be deterministic and auditable.
---
## 3. Compiled SQL Preview Strategy
### Decision
Compiled preview will be generated only through a dedicated Superset-side compilation call path managed by `SupersetCompilationAdapter`. The adapter will:
1. receive the candidate run context,
2. submit it to a Superset-compatible preview/compilation flow,
3. store the exact returned compiled SQL plus metadata,
4. mark preview state as `ready`, `pending`, `failed`, or `stale`.
Launch remains blocked unless preview state is `ready` and bound to the current mapping/version snapshot.
#### Superset Version Compatibility Matrix
The system will explicitly check `SupersetClient.get_version()` during intake to determine available features:
- **v3.x+**: Supports all features (context recovery, native filter import, SQL compilation preview, SQL Lab launch).
- **v2.x**: Supports best-effort context recovery; SQL compilation preview may require fallback to Manual Launch.
- **Legacy/Unknown**: Automatic recovery disabled; session enters `recovery_required` mode immediately.
#### Fallback Strategy (Preview Availability < 100%)
If Superset-side compilation is unavailable (e.g., version mismatch or unsupported dataset type):
- The system will allow **Manual Launch Approval** by a user with `dataset:session:approve` permission.
- The UI will explicitly display: "Native Preview Unavailable - Manual Validation Required."
- The audit record will mark the run as "Launched without Compiled Preview."
### Rationale
This is a direct requirement of FR-029 and the UX reference's WYSIWWR principle. The compiled preview must come from the same source of truth as the real execution path. Even a “close enough” local approximation would undermine trust and violate the spec.
### Alternatives considered
- **Local best-effort Jinja compilation with warning banner**
Rejected because the spec explicitly forbids fallback to unverified local approximation.
- **No preview, only launch-time validation**
Rejected because it breaks both UX and explicit launch gate requirements.
- **Preview generated by the LLM**
Rejected because the LLM must not write or edit SQL directly.
---
## 4. SQL Lab Launch as Canonical Audited Target
### Decision
Treat Superset SQL Lab as the primary canonical launch target for this feature. `DatasetReviewOrchestrator.launch_dataset` will:
1. verify preview readiness,
2. verify required-value completeness,
3. verify warning approvals,
4. create or bind a SQL Lab execution session in Superset,
5. persist a `DatasetRunContext` snapshot with session reference, approved mappings, semantic decisions, and preview fingerprint.
#### Fallback path: Export Prepared Context
If SQL Lab session creation is unavailable:
- Provide an **Export Prepared Run Context (JSON/SQL)** action.
- This allows users to manually take the compiled SQL (if available) or the effective parameter payload to Superset.
- The system still records the `DatasetRunContext` as "Exported for Manual Launch" to preserve audit integrity.
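The launch gates in steps 1–3 can be sketched as a single readiness check (field names are illustrative, not the final model):

```python
from dataclasses import dataclass, field

@dataclass
class LaunchReadiness:
    preview_state: str                       # ready | pending | failed | stale
    missing_required_values: list[str] = field(default_factory=list)
    unapproved_warnings: list[str] = field(default_factory=list)

def launch_blockers(r: LaunchReadiness) -> list[str]:
    """Collect every reason launch must stay blocked (steps 1-3 above)."""
    blockers = []
    if r.preview_state != "ready":
        blockers.append(f"preview is {r.preview_state}, not ready")
    blockers += [f"missing required value: {v}" for v in r.missing_required_values]
    blockers += [f"unapproved warning: {w}" for w in r.unapproved_warnings]
    return blockers
```

Returning every blocker at once, rather than the first one found, lets the UI show the user the complete remaining work before launch.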
### Rationale
The specification explicitly names SQL Lab as canonical. The system must preserve auditability and replay, not merely trigger execution. Persisting the SQL Lab reference and run context as a single audited boundary supports both operational trust and later reopening.
### Alternatives considered
- **Run query outside SQL Lab through another Superset endpoint**
Rejected because it violates the canonical target clarified in the spec.
- **Redirect user to a generic analytical view with params**
Rejected because it is weaker for audit and replay.
- **Allow fallback execution when SQL Lab creation fails**
Rejected because the feature's safety model depends on the audited target.
---
## 5. Semantic Enrichment Architecture
### Decision
Split semantic enrichment into a dedicated `SemanticSourceResolver` domain module instead of embedding all logic inside `DatasetReviewOrchestrator`.
Planned responsibilities:
- `resolve_from_file`
- `resolve_from_dictionary`
- `resolve_from_reference_dataset`
- `rank_candidates`
- `detect_conflicts`
- `apply_field_decision`
Candidate ranking will follow the required confidence hierarchy:
1. exact dictionary/file match
2. trusted reference dataset match
3. fuzzy semantic match
4. AI-generated draft
Manual overrides are represented as a lockable terminal provenance state and cannot be overwritten implicitly.
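The confidence hierarchy can be encoded as a stable sort key; the candidate shape here is hypothetical (a dict with `source` and `score`), not the final contract:

```python
# Confidence hierarchy from the list above; lower rank = higher priority.
SOURCE_RANK = {
    "exact_dictionary": 0,
    "trusted_reference": 1,
    "fuzzy_semantic": 2,
    "ai_draft": 3,
}

def rank_candidates(candidates: list[dict]) -> list[dict]:
    """Order candidates by source tier first, then by per-candidate score.

    An unknown source sorts last, so a misclassified candidate can never
    outrank an exact dictionary match.
    """
    return sorted(
        candidates,
        key=lambda c: (SOURCE_RANK.get(c["source"], len(SOURCE_RANK)),
                       -c.get("score", 0.0)),
    )
```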
### Rationale
The feature's semantic logic is too substantial to remain an untyped orchestrator subroutine. The reviewer feedback correctly identified this as a likely god-object risk. A dedicated resolver improves testability, contract clarity, and future extensibility.
### Alternatives considered
- **Keep enrichment inside orchestrator only**
Rejected because it centralizes too many responsibilities in a single Complexity 5 module.
- **Push matching into frontend only**
Rejected because provenance, confidence, and audit decisions must be persisted server-side.
- **Use only AI generation when dictionaries are absent**
Rejected because “reuse before invention” is a core UX principle.
---
## 6. Clarification Session Architecture
### Decision
Create a dedicated `ClarificationEngine` service with persistent `ClarificationSession` state. The engine will manage:
- unresolved-item prioritization,
- one-question-at-a-time question generation,
- best-guess suggestion packaging,
- answer recording,
- progress summarization,
- resume-from-last-open-item behavior.
Answers are persisted before session state advancement.
#### LLM Quality & Feedback Mechanism
To maintain high trust in AI-generated summaries, match recommendations, and questions:
- **User Feedback**: UI will include inline feedback (👍/👎) for all AI-drafted content.
- **Audit Logging**: Feedback is persisted in `SemanticFieldDecision` and `ClarificationAnswer` records for prompt engineering improvement.
- **Confidence Triage**: Low-confidence LLM outputs (score < 0.7) will be automatically marked for "expert review" rather than presented as confirmed suggestions.
#### Question Generation Strategy
Question templates and logic will be developed in Milestone 2:
- **Evaluation Criteria**: Questions must be judged on: relevancy, business impact explanation, and actionability of suggestions.
- **Templates**: Standardized prompts for field ambiguity, filter conflict, and missing run-time values.
- **Human-in-the-loop**: High-risk semantic conflicts will prioritize "expert review" options.
### Rationale
Guided clarification is not a simple endpoint action; it is a stateful interaction model with resumability and audit expectations. Treating it as a first-class module aligns with FR-012 to FR-020 and supports deterministic behavior rather than ad hoc agent chat state.
### Alternatives considered
- **Use assistant conversation history as the only clarification state**
Rejected because feature state and conversational transcript are not the same thing.
- **Generate all questions at once**
Rejected because the UX explicitly requires one question at a time.
- **Do not persist skipped questions separately**
Rejected because resume and expert-review flows need explicit unresolved-state semantics.
---
## 7. API Contract Granularity
### Decision
Expand the API beyond the current session/preview/launch skeleton into a typed session contract set with explicit lifecycle and field-level operations.
Required endpoint groups:
1. **Session lifecycle**
- `POST /sessions`
- `GET /sessions`
- `GET /sessions/{id}`
- `PATCH /sessions/{id}`
- `DELETE /sessions/{id}`
2. **Semantic source and field-level semantic actions**
- `POST /sessions/{id}/semantic-source`
- `PATCH /sessions/{id}/fields/{field_id}/semantic`
- `POST /sessions/{id}/fields/{field_id}/lock`
- `POST /sessions/{id}/fields/{field_id}/unlock`
3. **Clarification**
- `GET /sessions/{id}/clarification`
- `POST /sessions/{id}/clarification/answers`
- `POST /sessions/{id}/clarification/resume`
4. **Mapping review**
- `GET /sessions/{id}/mappings`
- `PATCH /sessions/{id}/mappings/{mapping_id}`
- `POST /sessions/{id}/mappings/{mapping_id}/approve`
5. **Preview and launch**
- `POST /sessions/{id}/preview`
- `POST /sessions/{id}/launch`
6. **Exports**
- `GET /sessions/{id}/exports/documentation`
- `GET /sessions/{id}/exports/validation`
Core schemas must be fully typed, especially:
- `ValidationFinding`
- `ImportedFilter`
- `TemplateVariable`
- `MappingDecision`
- `SemanticFieldEntry`
- `ClarificationQuestion`
- `DatasetRunContextSummary`
- `CompiledPreview`
### Rationale
The existing API draft is structurally correct but too loose for backend/frontend parallel work. Typed schemas are a blocker for safe implementation, especially given mapping approvals, field-level overrides, and resumable state.
### Alternatives considered
- **Minimal object blobs in `SessionDetail`**
Rejected because they prevent stable frontend typing and contract validation.
- **Expose only session-level mutations**
Rejected because FR-040 and mapping approval UX require field-level actions.
- **Defer exports until later**
Rejected because FR-035 requires exportable outputs as part of the feature.
---
## 8. Persistence Model
### Decision
Persist dataset review workflow as first-class application entities rather than storing everything in ad hoc task payloads.
Persistence boundaries will include:
- `DatasetReviewSession`
- `DatasetProfileSnapshot`
- `ValidationFindingRecord`
- `SemanticFieldDecision`
- `ImportedFilterRecord`
- `TemplateVariableRecord`
- `MappingApprovalRecord`
- `ClarificationSessionRecord`
- `ClarificationAnswerRecord`
- `CompiledPreviewRecord`
- `DatasetRunContextRecord`
Long-running task execution remains in the existing task system, but tasks reference session identifiers rather than replacing session state.
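The task-to-session boundary can be sketched as follows (entity shapes are illustrative, not the final ORM models):

```python
from dataclasses import dataclass

@dataclass
class DatasetReviewSession:
    session_id: str
    status: str  # e.g. "recovery_required", "clarifying", "ready_to_launch"

@dataclass
class TaskRecord:
    task_id: str
    kind: str         # e.g. "context_recovery", "preview_generation"
    session_id: str   # reference only; canonical state lives on the session
```

A task can complete, retry, or fail without invalidating the session aggregate, which is what allows reopening and replay independently of task lifecycle.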
### Rationale
The current task system is good for observability and async work, but not sufficient as the canonical store for resumable feature state. The feature requires reopening, editing, exporting, and replaying a session independently of task lifecycle.
### Alternatives considered
- **Store all workflow state inside task params/results**
Rejected because tasks are execution records, not editable session aggregates.
- **Use filesystem-only session snapshots**
Rejected because this weakens structured querying and RBAC alignment.
- **Persist only final run context**
Rejected because clarification and review resume need mid-flow state.
---
## 9. Async and Observability Strategy
### Decision
Use `TaskManager` for expensive or multi-step operations:
- initial deep context recovery,
- semantic enrichment imports over large sources,
- preview generation when upstream latency is non-trivial,
- launch handoff if it includes multi-step Superset orchestration.
Short synchronous operations may still exist for simple field edits and approvals, but session state transitions must remain observable and reflected in the session model.
### Rationale
This aligns with the repository constitution and existing operational model. It also supports progressive UX milestones and live readiness changes.
### Alternatives considered
- **Make all session operations synchronous**
Rejected because upstream Superset and enrichment work may exceed acceptable UI latency.
- **Use frontend polling only with no task integration**
Rejected because the repository already has stronger task observability patterns.
---
## 10. Frontend Integration Pattern
### Decision
Implement the feature on top of the existing [`frontend/src/lib/api.js`](frontend/src/lib/api.js) wrapper conventions and Svelte 5 rune-based patterns. Add dedicated wrapper functions under the frontend API layer for dataset orchestration rather than embedding endpoint strings inside components.
The main route will be a dedicated review workspace separate from the current [`frontend/src/routes/datasets/+page.svelte`](frontend/src/routes/datasets/+page.svelte), which remains the dataset hub/list page.
### Rationale
The current dataset page already serves as a listing and bulk-operation view. The new orchestration flow is a multi-state workspace and should not overload the existing hub component. Using wrapper functions preserves constitution compliance and testability.
### Alternatives considered
- **Extend the existing dataset hub page for the whole flow**
Rejected because it would overload a currently unrelated list page with a large state machine.
- **Call endpoints directly with native `fetch` in new components**
Rejected by constitution and existing frontend architecture.
- **Build the whole flow inside assistant chat only**
Rejected because the UX reference defines a structured workspace, not chat-only interaction.
---
## 11. Testing Strategy
### Decision
Use a three-layer testing strategy:
#### Backend
- **Unit tests**
- semantic ranking/conflict rules,
- clarification prioritization,
- mapping approval guards,
- preview stale/ready rules,
- launch blocking conditions.
- **Integration tests**
- session persistence,
- Superset context extraction with fixtures,
- compiled preview orchestration,
- SQL Lab launch handoff,
- export generation.
- **API contract tests**
- typed schema responses,
- RBAC guards,
- session lifecycle endpoints,
- field-level semantic operations,
- mapping approval endpoints.
#### Frontend
- **Component tests**
- source intake states,
- validation triage,
- semantic conflict view,
- clarification dialog,
- mapping review,
- compiled preview stale/error behavior,
- launch summary gating.
- **Route integration tests**
- progressive loading,
- save/resume flows,
- recovery-required state,
- warning approval requirements.
#### External dependency handling
- mock Superset endpoints with version-shaped fixtures,
- mock LLM structured outputs deterministically,
- verify degraded paths explicitly instead of pretending success.
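The launch-blocking unit tests could look like the sketch below, using a hypothetical guard function as a stand-in for the real orchestrator API:

```python
# Hypothetical stand-in for the orchestrator's launch guard.
def can_launch(preview_state: str, missing_values: int, unapproved_warnings: int) -> bool:
    return preview_state == "ready" and missing_values == 0 and unapproved_warnings == 0

def test_stale_preview_blocks_launch():
    assert not can_launch("stale", 0, 0)

def test_unapproved_warning_blocks_launch():
    assert not can_launch("ready", 0, 1)

def test_fully_ready_session_launches():
    assert can_launch("ready", 0, 0)
```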
### Rationale
The feature blends deterministic parsing, stateful orchestration, and external-service dependencies. It needs more than happy-path smoke tests.
### Alternatives considered
- **Rely on manual quickstart only**
Rejected because the feature has too many state transitions and edge cases.
- **Mock everything at unit level only**
Rejected because contract drift with Superset and frontend schemas would go undetected.
- **Run live LLM/Superset tests in CI**
Rejected because determinism and cost would be poor.
---
## 12. Incremental Delivery Strategy
### Decision
Implement in three milestones:
### Milestone 1 — Sessioned Auto Review
Includes:
- source intake,
- session persistence,
- profile/finding generation,
- semantic-source application,
- exportable documentation/validation outputs.
### Milestone 2 — Guided Clarification
Includes:
- clarification engine,
- resumable dialogue state,
- conflict review,
- field-level semantic overrides/locks.
### Milestone 3 — Controlled Execution
Includes:
- imported filter recovery,
- template-variable mapping,
- mapping approval workflow,
- compiled SQL preview,
- SQL Lab launch,
- audited run context.
### Rationale
This delivers user value earlier and reduces risk by separating understanding from execution. It also directly addresses the reviewer feedback that the feature should not be treated as a monolithic drop.
### Alternatives considered
- **Single all-at-once delivery**
Rejected due to risk concentration and hard-to-demonstrate progress.
- **Execution-first milestone**
Rejected because trust, semantics, and readiness are prerequisites for safe launch.
---
## 13. Risk Resolution Outcomes
### Decision
Treat the following as non-negotiable gates:
- missing Superset-side preview support stops execution-scope planning
- an unreliable SQL Lab session handoff blocks launch design
- missing typed API schemas block parallel frontend/backend implementation
- missing explicit semantic/clarification modules block Phase 1 contract completion
### Rationale
These are architecture-shaping constraints, not minor implementation details.
### Alternatives considered
- **Continue with placeholders and fill gaps during coding**
Rejected because the feature is too stateful and contract-heavy for speculative implementation.
---
## 14. Review Feedback Incorporated
### Decision
The next design phase must explicitly address the previously identified review gaps by:
- adding `SemanticSourceResolver`,
- adding `ClarificationEngine`,
- expanding `SupersetCompilationAdapter` into a broader extraction + compilation boundary or adding `SupersetContextExtractor`,
- fully typing `api.yaml`,
- adding mapping and field-level endpoints,
- adding session lifecycle endpoints,
- adding error-path coverage to `quickstart.md`,
- preserving delivery milestones in the plan.
### Rationale
These gaps are valid and materially affect implementation readiness. Folding them into Phase 1 keeps the planning workflow honest and actionable.
### Alternatives considered
- **Ignore review findings until code review**
Rejected because several issues are architectural, not cosmetic.
---
## Final Phase 0 Result
All identified `NEEDS CLARIFICATION` items from the plan are resolved at the planning level.
The implementation should proceed into Phase 1 with:
- expanded typed data model,
- expanded contracts,
- expanded API surface,
- negative-path quickstart scenarios,
- post-design constitution re-check.