# Semantic Module Contracts: LLM Dataset Orchestration

**Feature**: [LLM Dataset Orchestration](../spec.md)

**Branch**: `027-dataset-llm-orchestration`

This document defines the semantic contracts for the core components of the Dataset LLM Orchestration feature, following the [GRACE-Poly Standard](../../../.ai/standards/semantics.md).

---

## 1. Backend Modules

# [DEF:DatasetReviewOrchestrator:Module]
# @COMPLEXITY: 5
# @PURPOSE: Coordinate the full dataset review session lifecycle across intake, recovery, semantic review, clarification, mapping review, preview generation, and launch.
# @LAYER: Domain
# @RELATION: [DEPENDS_ON] ->[DatasetReviewSessionRepository]
# @RELATION: [DEPENDS_ON] ->[SemanticSourceResolver]
# @RELATION: [DEPENDS_ON] ->[ClarificationEngine]
# @RELATION: [DEPENDS_ON] ->[SupersetContextExtractor]
# @RELATION: [DEPENDS_ON] ->[SupersetCompilationAdapter]
# @RELATION: [DEPENDS_ON] ->[TaskManager]
# @PRE: session mutations must execute inside a persisted session boundary scoped to one authenticated user.
# @POST: state transitions are persisted atomically and emit observable progress for long-running steps.
# @SIDE_EFFECT: creates task records, updates session aggregates, triggers upstream Superset calls, persists audit artifacts.
# @DATA_CONTRACT: Input[SessionCommand] -> Output[DatasetReviewSession | CompiledPreview | DatasetRunContext]
# @INVARIANT: Launch is blocked unless a current session has no open blocking findings, all launch-sensitive mappings are approved, and a non-stale Superset-generated compiled preview matches the current input fingerprint.
# @TEST_CONTRACT: start_or_resume_session -> returns persisted session shell with recommended next action
# @TEST_SCENARIO: launch_gate_blocks_stale_preview -> launch rejected when preview fingerprint no longer matches current mapping inputs
# @TEST_EDGE: missing_dataset_ref -> blocking failure
# @TEST_EDGE: stale_preview -> blocking failure
# @TEST_EDGE: sql_lab_launch_failure -> terminal failed launch state with audit record
# @TEST_INVARIANT: launch_gate -> VERIFIED_BY: [launch_gate_blocks_stale_preview]

#### ƒ **start_session**
# @PURPOSE: Initialize a new session from a Superset link or dataset selection and trigger context recovery.
# @PRE: source input is non-empty and environment is accessible.
# @POST: session exists in persisted storage with intake/recovery state and task linkage when async work is required.
# @SIDE_EFFECT: persists session and may enqueue recovery task.

#### ƒ **apply_semantic_source**
# @PURPOSE: Apply a selected semantic source and update field-level candidate/decision state.
# @PRE: source exists and session is not terminal.
# @POST: semantic field entries and findings reflect selected-source outcomes without overwriting locked manual values.
# @SIDE_EFFECT: updates semantic decisions and conflict findings.

#### ƒ **record_clarification_answer**
# @PURPOSE: Persist one clarification answer and re-evaluate profile, findings, and readiness.
# @PRE: target question belongs to the session’s active clarification session.
# @POST: answer is saved before current-question pointer advances.
# @SIDE_EFFECT: updates clarification and finding state.

#### ƒ **prepare_launch_preview**
# @PURPOSE: Assemble effective execution inputs and trigger Superset-side preview compilation.
# @PRE: all required variables have candidate values or explicitly accepted defaults.
# @POST: returns preview artifact in pending, ready, failed, or stale state.
# @SIDE_EFFECT: persists preview attempt and upstream compilation diagnostics.
#### ƒ **launch_dataset**
# @PURPOSE: Start the approved dataset execution through SQL Lab and persist run context for audit/replay.
# @PRE: session is run-ready and compiled preview is current.
# @POST: returns persisted run context with SQL Lab session reference and launch outcome.
# @SIDE_EFFECT: creates SQL Lab execution session and audit snapshot.

# [/DEF:DatasetReviewOrchestrator:Module]

---

# [DEF:DatasetReviewSessionRepository:Module]
# @COMPLEXITY: 5
# @PURPOSE: Persist and retrieve dataset review session aggregates, including readiness, findings, semantic decisions, clarification state, previews, and run contexts.
# @LAYER: Domain
# @RELATION: [DEPENDS_ON] ->[DatasetReviewSession]
# @RELATION: [DEPENDS_ON] ->[DatasetProfile]
# @RELATION: [DEPENDS_ON] ->[ValidationFinding]
# @RELATION: [DEPENDS_ON] ->[CompiledPreview]
# @PRE: repository operations execute within authenticated request or task scope.
# @POST: session aggregate reads are structurally consistent and writes preserve ownership and version semantics.
# @SIDE_EFFECT: reads/writes application persistence layer.
# @DATA_CONTRACT: Input[SessionMutation] -> Output[PersistedSessionAggregate]
# @INVARIANT: answers, mapping approvals, preview artifacts, and launch snapshots are never attributed to the wrong user or session.
# @TEST_CONTRACT: save_then_resume -> persisted session can be reopened without losing semantic/manual/clarification state
# @TEST_SCENARIO: resume_session_preserves_manual_overrides -> locked semantic fields remain active after reload
# @TEST_EDGE: foreign_user_access -> rejected
# @TEST_EDGE: missing_session -> not found
# @TEST_EDGE: partial_preview_snapshot -> preserved but not marked launchable
# @TEST_INVARIANT: ownership_scope -> VERIFIED_BY: [foreign_user_access]

#### ƒ **create_session**
# @PURPOSE: Persist initial session shell.

#### ƒ **load_session_detail**
# @PURPOSE: Return the full session aggregate for API/frontend use.
#### ƒ **save_profile_and_findings**
# @PURPOSE: Persist profile and validation state together.

#### ƒ **save_preview**
# @PURPOSE: Persist compiled preview attempt and mark older fingerprints stale.

#### ƒ **save_run_context**
# @PURPOSE: Persist immutable launch audit snapshot.

# [/DEF:DatasetReviewSessionRepository:Module]

---

# [DEF:SemanticSourceResolver:Module]
# @COMPLEXITY: 4
# @PURPOSE: Resolve, rank, and apply semantic metadata candidates from files, connected dictionaries, reference datasets, and AI generation fallback.
# @LAYER: Domain
# @RELATION: [DEPENDS_ON] ->[LLMProviderService]
# @RELATION: [DEPENDS_ON] ->[SemanticSource]
# @RELATION: [DEPENDS_ON] ->[SemanticFieldEntry]
# @RELATION: [DEPENDS_ON] ->[SemanticCandidate]
# @PRE: selected source and target field set must be known.
# @POST: candidate ranking follows the configured confidence hierarchy and unresolved fuzzy matches remain reviewable.
# @SIDE_EFFECT: may create conflict findings and semantic candidate records.
# @DATA_CONTRACT: Input[SemanticSourceSelection | SemanticFieldSet | ManualFieldDecision] -> Output[SemanticCandidateSet | RankedSemanticResolution | ValidationFindingSet]
# @INVARIANT: Manual overrides are never silently replaced by imported, inferred, or AI-generated values.
# @TEST_CONTRACT: rank_candidates -> exact dictionary beats reference import beats fuzzy beats AI draft
# @TEST_SCENARIO: manual_lock_survives_reimport -> locked field remains active after another source is applied
# @TEST_EDGE: malformed_source_payload -> failed source application with explanatory finding
# @TEST_EDGE: conflicting_sources -> conflict state preserved for review
# @TEST_EDGE: no_trusted_matches -> AI draft fallback only
# @TEST_INVARIANT: confidence_hierarchy -> VERIFIED_BY: [rank_candidates]

#### ƒ **resolve_from_file**
# @PURPOSE: Normalize uploaded semantic file records into field-level candidates.
#### ƒ **resolve_from_dictionary**
# @PURPOSE: Resolve candidates from connected tabular dictionary sources.

#### ƒ **resolve_from_reference_dataset**
# @PURPOSE: Reuse semantic metadata from trusted Superset datasets.

#### ƒ **rank_candidates**
# @PURPOSE: Apply confidence ordering and determine best candidate per field.

#### ƒ **detect_conflicts**
# @PURPOSE: Mark competing candidate sets that require explicit user review.

#### ƒ **apply_field_decision**
# @PURPOSE: Accept, reject, or manually override a field-level semantic value.

# [/DEF:SemanticSourceResolver:Module]

---

# [DEF:ClarificationEngine:Module]
# @COMPLEXITY: 4
# @PURPOSE: Manage one-question-at-a-time clarification sessions, including prioritization, answer persistence, and readiness impact updates.
# @LAYER: Domain
# @RELATION: [DEPENDS_ON] ->[ClarificationSession]
# @RELATION: [DEPENDS_ON] ->[ClarificationQuestion]
# @RELATION: [DEPENDS_ON] ->[ClarificationAnswer]
# @RELATION: [DEPENDS_ON] ->[ValidationFinding]
# @PRE: target session contains unresolved or contradictory review state.
# @POST: every recorded answer updates the clarification session and associated session state deterministically.
# @SIDE_EFFECT: creates clarification questions, persists answers, updates findings/profile state.
# @DATA_CONTRACT: Input[ClarificationSessionState | ClarificationAnswerCommand] -> Output[ClarificationQuestionPayload | ClarificationProgressSnapshot | SessionReadinessDelta]
# @INVARIANT: Clarification answers are persisted before the current question pointer or readiness state is advanced.
# @TEST_CONTRACT: next_question_selection -> returns only one highest-priority unresolved question at a time
# @TEST_SCENARIO: save_and_resume_clarification -> reopening session restores current question and prior answers
# @TEST_EDGE: skipped_question -> unresolved topic remains visible
# @TEST_EDGE: expert_review_marked -> topic deferred without false resolution
# @TEST_EDGE: duplicate_answer_submission -> idempotent or rejected deterministically
# @TEST_INVARIANT: single_active_question -> VERIFIED_BY: [next_question_selection]

#### ƒ **start_or_resume**
# @PURPOSE: Open clarification mode on the highest-priority unresolved question.

#### ƒ **build_question_payload**
# @PURPOSE: Return question, why-it-matters text, current guess, and suggested options.

#### ƒ **record_answer**
# @PURPOSE: Persist one answer and compute state impact.

#### ƒ **summarize_progress**
# @PURPOSE: Produce the clarification change summary shown on exit or pause.

# [/DEF:ClarificationEngine:Module]

---

# [DEF:SupersetContextExtractor:Module]
# @COMPLEXITY: 4
# @PURPOSE: Recover dataset, dashboard, filter, and runtime-template context from Superset links and related API payloads.
# @LAYER: Infra
# @RELATION: [DEPENDS_ON] ->[ImportedFilter]
# @RELATION: [DEPENDS_ON] ->[TemplateVariable]
# @RELATION: [DEPENDS_ON] ->[SupersetClient]
# @DATA_CONTRACT: Input[SupersetLink | DatasetReference | EnvironmentContext] -> Output[RecoveredSupersetContext | ImportedFilterSet | TemplateVariableSet | RecoverySummary]
# @PRE: Superset link or dataset reference must be parseable enough to resolve an environment-scoped target resource.
# @POST: returns the best available recovered context with explicit provenance and partial-recovery markers when necessary.
# @SIDE_EFFECT: performs upstream Superset API reads.
# @INVARIANT: Partial recovery is surfaced explicitly and never misrepresented as fully confirmed context.
# @TEST_CONTRACT: recover_context_from_link -> output distinguishes URL-derived, native-filter-derived, and unresolved context
# @TEST_SCENARIO: partial_filter_recovery_marks_recovery_required -> session remains usable but not falsely complete
# @TEST_EDGE: unsupported_link_shape -> intake failure with actionable finding
# @TEST_EDGE: dataset_without_filters -> successful dataset recovery with empty imported filter set
# @TEST_EDGE: missing_dashboard_binding -> partial recovery only
# @TEST_INVARIANT: provenance_visibility -> VERIFIED_BY: [recover_context_from_link]

#### ƒ **parse_superset_link**
# @PURPOSE: Extract candidate identifiers and query state from supported Superset URLs.

#### ƒ **recover_imported_filters**
# @PURPOSE: Build imported filter entries from URL state and Superset-side saved context.

#### ƒ **discover_template_variables**
# @PURPOSE: Detect runtime variables and Jinja references from dataset query-bearing fields.

#### ƒ **build_recovery_summary**
# @PURPOSE: Summarize recovered, partial, and unresolved context for session state and UX.

# [/DEF:SupersetContextExtractor:Module]

---

# [DEF:SupersetCompilationAdapter:Module]
# @COMPLEXITY: 4
# @PURPOSE: Interact with Superset preview compilation and SQL Lab execution endpoints using the current approved execution context.
# @LAYER: Infra
# @RELATION: [DEPENDS_ON] ->[CompiledPreview]
# @RELATION: [DEPENDS_ON] ->[DatasetRunContext]
# @RELATION: [DEPENDS_ON] ->[SupersetClient]
# @DATA_CONTRACT: Input[ApprovedExecutionContext | PreviewFingerprint | LaunchRequest] -> Output[CompiledPreview | PreviewFailureArtifact | DatasetRunContext | LaunchFailureAudit]
# @PRE: effective template params and dataset execution reference are available.
# @POST: preview and launch calls return Superset-originated artifacts or explicit errors.
# @SIDE_EFFECT: performs upstream Superset preview and SQL Lab calls.
# @INVARIANT: The adapter never fabricates compiled SQL locally; preview truth is delegated to Superset only.
# @TEST_CONTRACT: compile_then_launch -> launch uses the same effective input fingerprint verified in preview
# @TEST_SCENARIO: preview_failure_blocks_launch -> no SQL Lab session is created after failed preview
# @TEST_EDGE: compilation_endpoint_error -> failed preview artifact with readable diagnostics
# @TEST_EDGE: sql_lab_creation_error -> failed launch audit state
# @TEST_EDGE: fingerprint_mismatch -> launch rejected
# @TEST_INVARIANT: superset_truth_source -> VERIFIED_BY: [preview_failure_blocks_launch]

#### ƒ **compile_preview**
# @PURPOSE: Request Superset-side compiled SQL preview for the current effective inputs.

#### ƒ **mark_preview_stale**
# @PURPOSE: Invalidate previous preview after mapping or value changes.

#### ƒ **create_sql_lab_session**
# @PURPOSE: Create the canonical audited execution session after all launch gates pass.

# [/DEF:SupersetCompilationAdapter:Module]

---

## 2. Frontend Components

#### ƒ **handleSourceSubmit**

#### ƒ **handleResumeSession**

#### ƒ **handleLaunch**

---

#### ƒ **submitSupersetLink**

#### ƒ **submitDatasetSelection**

---

#### ƒ **groupFindingsBySeverity**

#### ƒ **jumpToFindingTarget**

---

#### ƒ **applyManualOverride**

#### ƒ **applyCandidateSelection**

---

#### ƒ **submitAnswer**

#### ƒ **skipQuestion**

#### ƒ **pauseClarification**

---

#### ƒ **approveMapping**

#### ƒ **overrideMappingValue**

---

#### ƒ **requestPreview**

#### ƒ **showPreviewErrorTarget**

---

#### ƒ **buildLaunchSummary**

#### ƒ **confirmLaunch**

---

## 3. Contract Coverage Notes

The feature requires:

- dedicated semantic resolution contracts instead of hiding source-ranking logic inside orchestration,
- a first-class clarification engine because guided ambiguity resolution is a persisted workflow, not a simple endpoint,
- a Superset extraction boundary distinct from preview/launch behavior,
- UI contracts that cover the UX state machine rather than only the happy path.

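
As one illustration of how these contracts could be exercised, the launch-gate invariant of `DatasetReviewOrchestrator` (no open blocking findings, all launch-sensitive mappings approved, and a non-stale preview whose fingerprint matches the current inputs) might be sketched as follows. This is a minimal sketch, not the feature's implementation: `SessionState`, `Preview`, `input_fingerprint`, `launch_blockers`, and the SHA-256-over-canonical-JSON fingerprint are all hypothetical names and choices introduced only for this example.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Preview:
    """Hypothetical stand-in for CompiledPreview."""
    fingerprint: str
    state: str  # one of: "pending", "ready", "failed", "stale"


@dataclass
class SessionState:
    """Hypothetical, trimmed-down view of DatasetReviewSession."""
    open_blocking_findings: int
    launch_sensitive_mappings: Dict[str, bool]  # mapping name -> approved?
    effective_inputs: Dict[str, str]
    preview: Optional[Preview] = None


def input_fingerprint(effective_inputs: Dict[str, str]) -> str:
    """Assumed fingerprint scheme: SHA-256 of key-sorted, compact JSON."""
    canonical = json.dumps(effective_inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def launch_blockers(session: SessionState) -> List[str]:
    """Return the reasons launch must be blocked; an empty list means run-ready."""
    blockers: List[str] = []
    if session.open_blocking_findings > 0:
        blockers.append("open_blocking_findings")
    if not all(session.launch_sensitive_mappings.values()):
        blockers.append("unapproved_mappings")
    preview = session.preview
    if preview is None or preview.state != "ready":
        blockers.append("preview_not_ready")
    elif preview.fingerprint != input_fingerprint(session.effective_inputs):
        blockers.append("stale_preview")
    return blockers


# Usage: a ready preview goes stale as soon as an effective input changes.
inputs = {"region": "EU", "start_date": "2024-01-01"}
session = SessionState(0, {"region": True}, inputs, Preview(input_fingerprint(inputs), "ready"))
print(launch_blockers(session))  # []
session.effective_inputs["region"] = "US"
print(launch_blockers(session))  # ['stale_preview']
```

Returning the list of blockers rather than a single boolean is a design choice of this sketch; it keeps each `@TEST_EDGE` case (for example `stale_preview`) individually observable.
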
These contracts are intended to align directly with:

- [`specs/027-dataset-llm-orchestration/spec.md`](../spec.md)
- [`specs/027-dataset-llm-orchestration/ux_reference.md`](../ux_reference.md)
- [`specs/027-dataset-llm-orchestration/research.md`](../research.md)
- [`specs/027-dataset-llm-orchestration/data-model.md`](../data-model.md)
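
For readers who want a concrete picture of the `SemanticSourceResolver` contract, the confidence hierarchy from `rank_candidates` (exact dictionary beats reference import beats fuzzy beats AI draft) and the manual-override invariant could be sketched roughly as below. Everything here is an assumption made for illustration: the `Candidate` and `FieldEntry` shapes, the source-kind strings, the `apply_candidates` helper, and the choice to flag both fuzzy and AI-draft values for review are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Lower rank wins; the order mirrors the rank_candidates test contract.
# The key strings themselves are hypothetical labels.
SOURCE_RANK = {
    "exact_dictionary": 0,
    "reference_import": 1,
    "fuzzy_match": 2,
    "ai_draft": 3,
}


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for SemanticCandidate."""
    field_name: str
    value: str
    source_kind: str


@dataclass
class FieldEntry:
    """Hypothetical stand-in for SemanticFieldEntry."""
    field_name: str
    value: Optional[str] = None
    locked_manual: bool = False
    needs_review: bool = False


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates by the confidence hierarchy (stable for ties)."""
    return sorted(candidates, key=lambda c: SOURCE_RANK[c.source_kind])


def apply_candidates(entries: Dict[str, FieldEntry], candidates: List[Candidate]) -> None:
    """Apply the best candidate per field without touching locked manual values."""
    by_field: Dict[str, List[Candidate]] = {}
    for candidate in candidates:
        by_field.setdefault(candidate.field_name, []).append(candidate)
    for name, group in by_field.items():
        entry = entries.setdefault(name, FieldEntry(name))
        if entry.locked_manual:
            continue  # manual overrides are never silently replaced
        best = rank_candidates(group)[0]
        entry.value = best.value
        # Assumption: low-confidence sources stay flagged for user review.
        entry.needs_review = best.source_kind in ("fuzzy_match", "ai_draft")


# Usage: the locked field survives a re-import; the fuzzy match outranks the AI draft.
entries = {"revenue": FieldEntry("revenue", "Net revenue (manual)", locked_manual=True)}
apply_candidates(entries, [
    Candidate("revenue", "Revenue", "exact_dictionary"),
    Candidate("region", "Sales region (draft)", "ai_draft"),
    Candidate("region", "Region", "fuzzy_match"),
])
print(entries["revenue"].value)  # Net revenue (manual)
print(entries["region"].value, entries["region"].needs_review)  # Region True
```
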