Semantic Module Contracts: LLM Dataset Orchestration
Feature: LLM Dataset Orchestration
Branch: 027-dataset-llm-orchestration
This document defines the semantic contracts for the core components of the Dataset LLM Orchestration feature, following the GRACE-Poly Standard.
1. Backend Modules
[DEF:DatasetReviewOrchestrator:Module]
@COMPLEXITY: 5
@PURPOSE: Coordinate the full dataset review session lifecycle across intake, recovery, semantic review, clarification, mapping review, preview generation, and launch.
@LAYER: Domain
@RELATION: [DEPENDS_ON] ->[DatasetReviewSessionRepository]
@RELATION: [DEPENDS_ON] ->[SemanticSourceResolver]
@RELATION: [DEPENDS_ON] ->[ClarificationEngine]
@RELATION: [DEPENDS_ON] ->[SupersetContextExtractor]
@RELATION: [DEPENDS_ON] ->[SupersetCompilationAdapter]
@RELATION: [DEPENDS_ON] ->[TaskManager]
@PRE: session mutations must execute inside a persisted session boundary scoped to one authenticated user.
@POST: state transitions are persisted atomically and emit observable progress for long-running steps.
@SIDE_EFFECT: creates task records, updates session aggregates, triggers upstream Superset calls, persists audit artifacts.
@DATA_CONTRACT: Input[SessionCommand] -> Output[DatasetReviewSession | CompiledPreview | DatasetRunContext]
@INVARIANT: Launch is blocked unless the current session has no open blocking findings, all launch-sensitive mappings are approved, and a non-stale Superset-generated compiled preview matches the current input fingerprint.
@TEST_CONTRACT: start_or_resume_session -> returns persisted session shell with recommended next action
@TEST_SCENARIO: launch_gate_blocks_stale_preview -> launch rejected when preview fingerprint no longer matches current mapping inputs
@TEST_EDGE: missing_dataset_ref -> blocking failure
@TEST_EDGE: stale_preview -> blocking failure
@TEST_EDGE: sql_lab_launch_failure -> terminal failed launch state with audit record
@TEST_INVARIANT: launch_gate -> VERIFIED_BY: [launch_gate_blocks_stale_preview]
ƒ start_session
@PURPOSE: Initialize a new session from a Superset link or dataset selection and trigger context recovery.
@PRE: source input is non-empty and environment is accessible.
@POST: session exists in persisted storage with intake/recovery state and task linkage when async work is required.
@SIDE_EFFECT: persists session and may enqueue recovery task.
ƒ apply_semantic_source
@PURPOSE: Apply a selected semantic source and update field-level candidate/decision state.
@PRE: source exists and session is not terminal.
@POST: semantic field entries and findings reflect selected-source outcomes without overwriting locked manual values.
@SIDE_EFFECT: updates semantic decisions and conflict findings.
ƒ record_clarification_answer
@PURPOSE: Persist one clarification answer and re-evaluate profile, findings, and readiness.
@PRE: target question belongs to the session’s active clarification session.
@POST: answer is saved before current-question pointer advances.
@SIDE_EFFECT: updates clarification and finding state.
ƒ prepare_launch_preview
@PURPOSE: Assemble effective execution inputs and trigger Superset-side preview compilation.
@PRE: all required variables have candidate values or explicitly accepted defaults.
@POST: returns preview artifact in pending, ready, failed, or stale state.
@SIDE_EFFECT: persists preview attempt and upstream compilation diagnostics.
ƒ launch_dataset
@PURPOSE: Start the approved dataset execution through SQL Lab and persist run context for audit/replay.
@PRE: session is run-ready and compiled preview is current.
@POST: returns persisted run context with SQL Lab session reference and launch outcome.
@SIDE_EFFECT: creates SQL Lab execution session and audit snapshot.
[/DEF:DatasetReviewOrchestrator:Module]
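The launch-gate invariant of DatasetReviewOrchestrator can be illustrated with a minimal sketch. The names `SessionSnapshot`, `Preview`, `launch_blockers`, and the blocker strings are hypothetical and are not taken from the feature's code; the sketch only assumes that a session exposes counts of open blocking findings and unapproved launch-sensitive mappings, plus a preview carrying a status and a fingerprint.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Preview:
    status: str        # "pending", "ready", "failed", or "stale"
    fingerprint: str   # fingerprint of the inputs the preview was compiled from


@dataclass
class SessionSnapshot:
    # Hypothetical flattened view of a session; field names are assumptions.
    open_blocking_findings: int
    unapproved_launch_sensitive_mappings: int
    current_input_fingerprint: str
    preview: Optional[Preview] = None


def launch_blockers(session: SessionSnapshot) -> List[str]:
    """Return the reasons launch must be blocked; an empty list means launch may proceed."""
    blockers: List[str] = []
    if session.open_blocking_findings > 0:
        blockers.append("open_blocking_findings")
    if session.unapproved_launch_sensitive_mappings > 0:
        blockers.append("unapproved_mappings")
    preview = session.preview
    if preview is None or preview.status != "ready":
        blockers.append("preview_not_ready")
    elif preview.fingerprint != session.current_input_fingerprint:
        # The preview compiled fine, but the inputs changed afterwards.
        blockers.append("stale_preview")
    return blockers
```

Returning a list of reasons rather than a boolean is one possible design: it lets a caller report every failed gate at once instead of only the first.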
[DEF:DatasetReviewSessionRepository:Module]
@COMPLEXITY: 5
@PURPOSE: Persist and retrieve dataset review session aggregates, including readiness, findings, semantic decisions, clarification state, previews, and run contexts.
@LAYER: Domain
@RELATION: [DEPENDS_ON] ->[DatasetReviewSession]
@RELATION: [DEPENDS_ON] ->[DatasetProfile]
@RELATION: [DEPENDS_ON] ->[ValidationFinding]
@RELATION: [DEPENDS_ON] ->[CompiledPreview]
@PRE: repository operations execute within authenticated request or task scope.
@POST: session aggregate reads are structurally consistent and writes preserve ownership and version semantics.
@SIDE_EFFECT: reads/writes application persistence layer.
@DATA_CONTRACT: Input[SessionMutation] -> Output[PersistedSessionAggregate]
@INVARIANT: answers, mapping approvals, preview artifacts, and launch snapshots are never attributed to the wrong user or session.
@TEST_CONTRACT: save_then_resume -> persisted session can be reopened without losing semantic/manual/clarification state
@TEST_SCENARIO: resume_session_preserves_manual_overrides -> locked semantic fields remain active after reload
@TEST_EDGE: foreign_user_access -> rejected
@TEST_EDGE: missing_session -> not found
@TEST_EDGE: partial_preview_snapshot -> preserved but not marked launchable
@TEST_INVARIANT: ownership_scope -> VERIFIED_BY: [foreign_user_access]
ƒ create_session
@PURPOSE: Persist initial session shell.
ƒ load_session_detail
@PURPOSE: Return the full session aggregate for API/frontend use.
ƒ save_profile_and_findings
@PURPOSE: Persist profile and validation state together.
ƒ save_preview
@PURPOSE: Persist compiled preview attempt and mark older fingerprints stale.
ƒ save_run_context
@PURPOSE: Persist immutable launch audit snapshot.
[/DEF:DatasetReviewSessionRepository:Module]
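The ownership invariant and the `save_preview` behavior of DatasetReviewSessionRepository can be sketched with an in-memory stand-in. This is an illustration under stated assumptions, not the persistence layer itself: the class `InMemorySessionRepository`, the record types, the exception names, and the extra `user_id` parameter are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class SessionNotFound(Exception):
    """Raised when no session exists for the given id (hypothetical name)."""


class ForeignUserAccess(Exception):
    """Raised when a user touches a session they do not own (hypothetical name)."""


@dataclass
class PreviewRecord:
    fingerprint: str
    status: str  # "pending", "ready", "failed", or "stale"


@dataclass
class SessionRecord:
    session_id: str
    owner_id: str
    previews: List[PreviewRecord] = field(default_factory=list)


class InMemorySessionRepository:
    """Dictionary-backed sketch; a real repository would use the application's database."""

    def __init__(self) -> None:
        self._sessions: Dict[str, SessionRecord] = {}

    def create_session(self, session_id: str, owner_id: str) -> SessionRecord:
        record = SessionRecord(session_id, owner_id)
        self._sessions[session_id] = record
        return record

    def load_session_detail(self, session_id: str, user_id: str) -> SessionRecord:
        record = self._sessions.get(session_id)
        if record is None:
            raise SessionNotFound(session_id)
        if record.owner_id != user_id:
            raise ForeignUserAccess(session_id)
        return record

    def save_preview(self, session_id: str, user_id: str,
                     fingerprint: str, status: str) -> PreviewRecord:
        # Every write goes through the ownership check first.
        record = self.load_session_detail(session_id, user_id)
        # Previews compiled from a different fingerprint are marked stale.
        for old in record.previews:
            if old.fingerprint != fingerprint:
                old.status = "stale"
        preview = PreviewRecord(fingerprint, status)
        record.previews.append(preview)
        return preview
```

Routing writes through `load_session_detail` keeps the ownership check in one place, which is one way to make the `foreign_user_access` edge case hard to bypass.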
[DEF:SemanticSourceResolver:Module]
@COMPLEXITY: 4
@PURPOSE: Resolve, rank, and apply semantic metadata candidates from files, connected dictionaries, reference datasets, and AI generation fallback.
@LAYER: Domain
@RELATION: [DEPENDS_ON] ->[LLMProviderService]
@RELATION: [DEPENDS_ON] ->[SemanticSource]
@RELATION: [DEPENDS_ON] ->[SemanticFieldEntry]
@RELATION: [DEPENDS_ON] ->[SemanticCandidate]
@PRE: selected source and target field set must be known.
@POST: candidate ranking follows the configured confidence hierarchy and unresolved fuzzy matches remain reviewable.
@SIDE_EFFECT: may create conflict findings and semantic candidate records.
@INVARIANT: Manual overrides are never silently replaced by imported, inferred, or AI-generated values.
@TEST_CONTRACT: rank_candidates -> exact dictionary beats reference import beats fuzzy beats AI draft
@TEST_SCENARIO: manual_lock_survives_reimport -> locked field remains active after another source is applied
@TEST_EDGE: malformed_source_payload -> failed source application with explanatory finding
@TEST_EDGE: conflicting_sources -> conflict state preserved for review
@TEST_EDGE: no_trusted_matches -> AI draft fallback only
@TEST_INVARIANT: confidence_hierarchy -> VERIFIED_BY: [rank_candidates]
ƒ resolve_from_file
@PURPOSE: Normalize uploaded semantic file records into field-level candidates.
ƒ resolve_from_dictionary
@PURPOSE: Resolve candidates from connected tabular dictionary sources.
ƒ resolve_from_reference_dataset
@PURPOSE: Reuse semantic metadata from trusted Superset datasets.
ƒ rank_candidates
@PURPOSE: Apply confidence ordering and determine best candidate per field.
ƒ detect_conflicts
@PURPOSE: Mark competing candidate sets that require explicit user review.
ƒ apply_field_decision
@PURPOSE: Accept, reject, or manually override a field-level semantic value.
[/DEF:SemanticSourceResolver:Module]
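The confidence hierarchy named in the `rank_candidates` test contract (exact dictionary, then reference import, then fuzzy, then AI draft) and the manual-override invariant can be sketched as follows. The type names, the `source_kind` strings, and the helper `apply_best_candidate` are assumptions made for illustration; the contract above fixes only the ordering and the rule that locked values are not replaced.

```python
from dataclasses import dataclass
from typing import List, Optional

# Ordering taken from the test contract; the string labels are hypothetical.
CONFIDENCE_ORDER = ["exact_dictionary", "reference_import", "fuzzy", "ai_draft"]


@dataclass
class Candidate:
    value: str
    source_kind: str  # one of CONFIDENCE_ORDER


@dataclass
class FieldEntry:
    name: str
    active_value: Optional[str] = None
    locked: bool = False  # True once a user has set a manual override


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Sort candidates from most to least trusted source kind.

    sorted() is stable, so candidates of the same kind keep their input order.
    An unknown source_kind raises ValueError instead of being ranked silently.
    """
    return sorted(candidates, key=lambda c: CONFIDENCE_ORDER.index(c.source_kind))


def apply_best_candidate(entry: FieldEntry, candidates: List[Candidate]) -> FieldEntry:
    """Apply the top-ranked candidate unless the field is locked by a manual override."""
    if entry.locked or not candidates:
        return entry
    entry.active_value = rank_candidates(candidates)[0].value
    return entry
```

In this sketch a locked field is returned untouched, which mirrors the `manual_lock_survives_reimport` scenario.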
[DEF:ClarificationEngine:Module]
@COMPLEXITY: 4
@PURPOSE: Manage one-question-at-a-time clarification sessions, including prioritization, answer persistence, and readiness impact updates.
@LAYER: Domain
@RELATION: [DEPENDS_ON] ->[ClarificationSession]
@RELATION: [DEPENDS_ON] ->[ClarificationQuestion]
@RELATION: [DEPENDS_ON] ->[ClarificationAnswer]
@RELATION: [DEPENDS_ON] ->[ValidationFinding]
@PRE: target session contains unresolved or contradictory review state.
@POST: every recorded answer updates the clarification session and associated session state deterministically.
@SIDE_EFFECT: creates clarification questions, persists answers, updates findings/profile state.
@INVARIANT: Clarification answers are persisted before the current question pointer or readiness state is advanced.
@TEST_CONTRACT: next_question_selection -> returns only one highest-priority unresolved question at a time
@TEST_SCENARIO: save_and_resume_clarification -> reopening session restores current question and prior answers
@TEST_EDGE: skipped_question -> unresolved topic remains visible
@TEST_EDGE: expert_review_marked -> topic deferred without false resolution
@TEST_EDGE: duplicate_answer_submission -> idempotent or rejected deterministically
@TEST_INVARIANT: single_active_question -> VERIFIED_BY: [next_question_selection]
ƒ start_or_resume
@PURPOSE: Open clarification mode on the highest-priority unresolved question.
ƒ build_question_payload
@PURPOSE: Return question, why-it-matters text, current guess, and suggested options.
ƒ record_answer
@PURPOSE: Persist one answer and compute state impact.
ƒ summarize_progress
@PURPOSE: Produce the clarification change summary shown on exit or pause.
[/DEF:ClarificationEngine:Module]
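A minimal sketch of the ClarificationEngine ordering rule: the answer is stored first, and only then does the current-question pointer advance. The state layout, the convention that a larger `priority` number means more urgent, and the choice to treat a duplicate submission as a no-op (the contract also allows deterministic rejection) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Question:
    question_id: str
    priority: int          # assumption: larger number means more urgent
    status: str = "open"   # "open" or "answered"


@dataclass
class ClarificationState:
    questions: List[Question]
    answers: Dict[str, str] = field(default_factory=dict)
    current_question_id: Optional[str] = None


def next_question(state: ClarificationState) -> Optional[Question]:
    """Return the single highest-priority open question, or None when all are resolved."""
    open_questions = [q for q in state.questions if q.status == "open"]
    if not open_questions:
        return None
    return max(open_questions, key=lambda q: q.priority)


def start_or_resume(state: ClarificationState) -> Optional[str]:
    """Point the session at a question without disturbing an existing pointer."""
    if state.current_question_id is None:
        nxt = next_question(state)
        state.current_question_id = nxt.question_id if nxt else None
    return state.current_question_id


def record_answer(state: ClarificationState, question_id: str, answer: str) -> ClarificationState:
    if question_id in state.answers:
        # Duplicate submission: handled idempotently in this sketch.
        return state
    if question_id != state.current_question_id:
        raise ValueError("question is not the active clarification question")
    # Step 1: persist the answer.
    state.answers[question_id] = answer
    for q in state.questions:
        if q.question_id == question_id:
            q.status = "answered"
    # Step 2: only now advance the pointer.
    nxt = next_question(state)
    state.current_question_id = nxt.question_id if nxt else None
    return state
```

Because the pointer is derived from the stored question statuses, reloading the state after step 1 but before step 2 would still select the correct next question.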
[DEF:SupersetContextExtractor:Module]
@COMPLEXITY: 4
@PURPOSE: Recover dataset, dashboard, filter, and runtime-template context from Superset links and related API payloads.
@LAYER: Infra
@RELATION: [CALLS] ->[SupersetClient]
@RELATION: [DEPENDS_ON] ->[ImportedFilter]
@RELATION: [DEPENDS_ON] ->[TemplateVariable]
@PRE: Superset link or dataset reference must be parseable enough to resolve an environment-scoped target resource.
@POST: returns the best available recovered context with explicit provenance and partial-recovery markers when necessary.
@SIDE_EFFECT: performs upstream Superset API reads.
@INVARIANT: Partial recovery is surfaced explicitly and never misrepresented as fully confirmed context.
@TEST_CONTRACT: recover_context_from_link -> output distinguishes URL-derived, native-filter-derived, and unresolved context
@TEST_SCENARIO: partial_filter_recovery_marks_recovery_required -> session remains usable but not falsely complete
@TEST_EDGE: unsupported_link_shape -> intake failure with actionable finding
@TEST_EDGE: dataset_without_filters -> successful dataset recovery with empty imported filter set
@TEST_EDGE: missing_dashboard_binding -> partial recovery only
@TEST_INVARIANT: provenance_visibility -> VERIFIED_BY: [recover_context_from_link]
ƒ parse_superset_link
@PURPOSE: Extract candidate identifiers and query state from supported Superset URLs.
ƒ recover_imported_filters
@PURPOSE: Build imported filter entries from URL state and Superset-side saved context.
ƒ discover_template_variables
@PURPOSE: Detect runtime variables and Jinja references from dataset query-bearing fields.
ƒ build_recovery_summary
@PURPOSE: Summarize recovered, partial, and unresolved context for session state and UX.
[/DEF:SupersetContextExtractor:Module]
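The provenance invariant of SupersetContextExtractor can be illustrated with a small sketch of a recovery summary. The provenance labels (`url`, `native_filter`, `unresolved`), the `RecoveredItem` type, and the summary shape are hypothetical; the contract above states only that URL-derived, native-filter-derived, and unresolved context must be distinguishable and that partial recovery must be explicit.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RecoveredItem:
    name: str
    provenance: str  # "url", "native_filter", or "unresolved" (assumed labels)


def build_recovery_summary(items: List[RecoveredItem]) -> Dict[str, object]:
    """Group recovered context by provenance and flag partial recovery explicitly."""
    by_provenance: Dict[str, List[str]] = {"url": [], "native_filter": [], "unresolved": []}
    for item in items:
        # An unexpected provenance label raises KeyError rather than being dropped.
        by_provenance[item.provenance].append(item.name)
    # Any unresolved item downgrades the whole summary to "partial".
    status = "partial" if by_provenance["unresolved"] else "recovered"
    return {"status": status, **by_provenance}
```

An empty input yields a `recovered` status with empty lists, which corresponds to the `dataset_without_filters` edge case.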
[DEF:SupersetCompilationAdapter:Module]
@COMPLEXITY: 4
@PURPOSE: Interact with Superset preview compilation and SQL Lab execution endpoints using the current approved execution context.
@LAYER: Infra
@RELATION: [CALLS] ->[SupersetClient]
@RELATION: [DEPENDS_ON] ->[CompiledPreview]
@RELATION: [DEPENDS_ON] ->[DatasetRunContext]
@PRE: effective template params and dataset execution reference are available.
@POST: preview and launch calls return Superset-originated artifacts or explicit errors.
@SIDE_EFFECT: performs upstream Superset preview and SQL Lab calls.
@INVARIANT: The adapter never fabricates compiled SQL locally; preview truth is delegated to Superset only.
@TEST_CONTRACT: compile_then_launch -> launch uses the same effective input fingerprint verified in preview
@TEST_SCENARIO: preview_failure_blocks_launch -> no SQL Lab session is created after failed preview
@TEST_EDGE: compilation_endpoint_error -> failed preview artifact with readable diagnostics
@TEST_EDGE: sql_lab_creation_error -> failed launch audit state
@TEST_EDGE: fingerprint_mismatch -> launch rejected
@TEST_INVARIANT: superset_truth_source -> VERIFIED_BY: [preview_failure_blocks_launch]
ƒ compile_preview
@PURPOSE: Request Superset-side compiled SQL preview for the current effective inputs.
ƒ mark_preview_stale
@PURPOSE: Invalidate previous preview after mapping or value changes.
ƒ create_sql_lab_session
@PURPOSE: Create the canonical audited execution session after all launch gates pass.
[/DEF:SupersetCompilationAdapter:Module]
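The `compile_then_launch` contract relies on an input fingerprint, which this document does not define. One possible approach, shown here purely as an assumption, is a SHA-256 hash over a canonical JSON encoding of the effective inputs. The function names and the choice of `dataset_ref` plus `template_params` as the fingerprinted payload are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def input_fingerprint(dataset_ref: str, template_params: Dict[str, Any]) -> str:
    """Hash the effective inputs; sort_keys makes the result independent of key order."""
    payload = json.dumps(
        {"dataset_ref": dataset_ref, "template_params": template_params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def fingerprint_matches(preview_fingerprint: str, dataset_ref: str,
                        template_params: Dict[str, Any]) -> bool:
    """True when the preview was compiled from exactly the inputs about to be launched."""
    return preview_fingerprint == input_fingerprint(dataset_ref, template_params)
```

With this scheme, any change to a mapping or a value after preview produces a different fingerprint, so the `fingerprint_mismatch` edge case reduces to a string comparison. Values must be JSON-serializable for the sketch to work.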
2. Frontend Components
ƒ handleSourceSubmit
ƒ handleResumeSession
ƒ handleLaunch
ƒ submitSupersetLink
ƒ submitDatasetSelection
ƒ groupFindingsBySeverity
ƒ jumpToFindingTarget
ƒ applyManualOverride
ƒ applyCandidateSelection
ƒ submitAnswer
ƒ skipQuestion
ƒ pauseClarification
ƒ approveMapping
ƒ overrideMappingValue
ƒ requestPreview
ƒ showPreviewErrorTarget
ƒ buildLaunchSummary
ƒ confirmLaunch
3. Contract Coverage Notes
The feature requires:
- dedicated semantic resolution contracts instead of hiding source-ranking logic inside orchestration,
- a first-class clarification engine because guided ambiguity resolution is a persisted workflow, not a simple endpoint,
- a Superset extraction boundary distinct from preview/launch behavior,
- UI contracts that cover the UX state machine rather than only the happy path.
These contracts are intended to align directly with: