ss-tools/specs/027-dataset-llm-orchestration/contracts/modules.md
busya ed3d5f3039 feat(027): Final Phase T038-T043 implementation
- T038: SessionEvent logger and persistence logic
  - Added SessionEventLogger service with explicit audit event persistence
  - Added SessionEvent model with events relationship on DatasetReviewSession
  - Integrated event logging into orchestrator flows and API mutation endpoints

- T039: Semantic source version propagation
  - Added source_version column to SemanticFieldEntry
  - Added propagate_source_version_update() to SemanticResolver
  - Preserves locked/manual field invariants during propagation

- T040: Batch approval API and UI actions
  - Added batch semantic approval endpoint (/fields/semantic/approve-batch)
  - Added batch mapping approval endpoint (/mappings/approve-batch)
  - Added batch approval actions to SemanticLayerReview and ExecutionMappingReview components
  - Aligned batch semantics with single-item approval contracts

- T041: Superset compatibility matrix tests
  - Added test_superset_matrix.py with preview and SQL Lab fallback coverage
  - Tests verify client method preference and matrix fallback behavior

- T042: RBAC audit sweep on session-mutation endpoints
  - Added _require_owner_mutation_scope() helper
  - Applied owner guards to update_session, delete_session, and all mutation endpoints
  - Ensured no bypass of existing permission checks

- T043: i18n coverage for dataset-review UI
  - Added workspace state labels (empty/importing/review) to en.json and ru.json
  - Added batch action labels for semantics and mappings
  - Fixed workspace state comparison to lowercase strings
  - Removed hardcoded workspace state display strings

Signed-off-by: Implementation Specialist <impl@ss-tools>
2026-03-17 14:29:33 +03:00


Semantic Module Contracts: LLM Dataset Orchestration

Feature: LLM Dataset Orchestration
Branch: 027-dataset-llm-orchestration

This document defines the semantic contracts for the core components of the Dataset LLM Orchestration feature, following the GRACE-Poly Standard.


1. Backend Modules

[DEF:DatasetReviewOrchestrator:Module]

@COMPLEXITY: 5

@PURPOSE: Coordinate the full dataset review session lifecycle across intake, recovery, semantic review, clarification, mapping review, preview generation, and launch.

@LAYER: Domain

@RELATION: [DEPENDS_ON] ->[DatasetReviewSessionRepository]

@RELATION: [DEPENDS_ON] ->[SemanticSourceResolver]

@RELATION: [DEPENDS_ON] ->[ClarificationEngine]

@RELATION: [DEPENDS_ON] ->[SupersetContextExtractor]

@RELATION: [DEPENDS_ON] ->[SupersetCompilationAdapter]

@RELATION: [DEPENDS_ON] ->[TaskManager]

@PRE: session mutations must execute inside a persisted session boundary scoped to one authenticated user.

@POST: state transitions are persisted atomically and emit observable progress for long-running steps.

@SIDE_EFFECT: creates task records, updates session aggregates, triggers upstream Superset calls, persists audit artifacts.

@DATA_CONTRACT: Input[SessionCommand] -> Output[DatasetReviewSession | CompiledPreview | DatasetRunContext]

@INVARIANT: Launch is blocked unless the current session has no open blocking findings, all launch-sensitive mappings are approved, and a non-stale Superset-generated compiled preview matches the current input fingerprint.

@TEST_CONTRACT: start_or_resume_session -> returns persisted session shell with recommended next action

@TEST_SCENARIO: launch_gate_blocks_stale_preview -> launch rejected when preview fingerprint no longer matches current mapping inputs

@TEST_EDGE: missing_dataset_ref -> blocking failure

@TEST_EDGE: stale_preview -> blocking failure

@TEST_EDGE: sql_lab_launch_failure -> terminal failed launch state with audit record

@TEST_INVARIANT: launch_gate -> VERIFIED_BY: [launch_gate_blocks_stale_preview]

ƒ start_session

@PURPOSE: Initialize a new session from a Superset link or dataset selection and trigger context recovery.

@PRE: source input is non-empty and environment is accessible.

@POST: session exists in persisted storage with intake/recovery state and task linkage when async work is required.

@SIDE_EFFECT: persists session and may enqueue recovery task.

ƒ apply_semantic_source

@PURPOSE: Apply a selected semantic source and update field-level candidate/decision state.

@PRE: source exists and session is not terminal.

@POST: semantic field entries and findings reflect selected-source outcomes without overwriting locked manual values.

@SIDE_EFFECT: updates semantic decisions and conflict findings.

ƒ record_clarification_answer

@PURPOSE: Persist one clarification answer and re-evaluate profile, findings, and readiness.

@PRE: target question belongs to the session's active clarification session.

@POST: answer is saved before current-question pointer advances.

@SIDE_EFFECT: updates clarification and finding state.

ƒ prepare_launch_preview

@PURPOSE: Assemble effective execution inputs and trigger Superset-side preview compilation.

@PRE: all required variables have candidate values or explicitly accepted defaults.

@POST: returns preview artifact in pending, ready, failed, or stale state.

@SIDE_EFFECT: persists preview attempt and upstream compilation diagnostics.

ƒ launch_dataset

@PURPOSE: Start the approved dataset execution through SQL Lab and persist run context for audit/replay.

@PRE: session is run-ready and compiled preview is current.

@POST: returns persisted run context with SQL Lab session reference and launch outcome.

@SIDE_EFFECT: creates SQL Lab execution session and audit snapshot.

[/DEF:DatasetReviewOrchestrator:Module]
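
The launch-gate invariant above can be read as a pure check over session state. The following is a minimal sketch under stated assumptions, not the orchestrator's actual code: `SessionSnapshot`, `Preview`, `launch_blockers`, and the blocker strings are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Preview:
    fingerprint: str
    status: str  # one of "pending", "ready", "failed", "stale"


@dataclass
class SessionSnapshot:
    # Hypothetical flattened view of the session aggregate.
    open_blocking_findings: int
    unapproved_launch_sensitive_mappings: int
    current_input_fingerprint: str
    preview: Optional[Preview] = None


def launch_blockers(session: SessionSnapshot) -> List[str]:
    """Return the reasons launch is blocked; an empty list means the gate passes."""
    blockers: List[str] = []
    if session.open_blocking_findings > 0:
        blockers.append("open_blocking_findings")
    if session.unapproved_launch_sensitive_mappings > 0:
        blockers.append("unapproved_mappings")
    preview = session.preview
    if preview is None or preview.status != "ready":
        blockers.append("preview_not_ready")
    elif preview.fingerprint != session.current_input_fingerprint:
        # Preview exists but was compiled for older inputs.
        blockers.append("stale_preview")
    return blockers
```

Returning every blocker rather than failing on the first one lets the caller show the user all outstanding launch conditions at once, which fits the launch_gate_blocks_stale_preview scenario.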


[DEF:DatasetReviewSessionRepository:Module]

@COMPLEXITY: 5

@PURPOSE: Persist and retrieve dataset review session aggregates, including readiness, findings, semantic decisions, clarification state, previews, and run contexts.

@LAYER: Domain

@RELATION: [DEPENDS_ON] ->[DatasetReviewSession]

@RELATION: [DEPENDS_ON] ->[DatasetProfile]

@RELATION: [DEPENDS_ON] ->[ValidationFinding]

@RELATION: [DEPENDS_ON] ->[CompiledPreview]

@PRE: repository operations execute within authenticated request or task scope.

@POST: session aggregate reads are structurally consistent and writes preserve ownership and version semantics.

@SIDE_EFFECT: reads/writes application persistence layer.

@DATA_CONTRACT: Input[SessionMutation] -> Output[PersistedSessionAggregate]

@INVARIANT: answers, mapping approvals, preview artifacts, and launch snapshots are never attributed to the wrong user or session.

@TEST_CONTRACT: save_then_resume -> persisted session can be reopened without losing semantic/manual/clarification state

@TEST_SCENARIO: resume_session_preserves_manual_overrides -> locked semantic fields remain active after reload

@TEST_EDGE: foreign_user_access -> rejected

@TEST_EDGE: missing_session -> not found

@TEST_EDGE: partial_preview_snapshot -> preserved but not marked launchable

@TEST_INVARIANT: ownership_scope -> VERIFIED_BY: [foreign_user_access]

ƒ create_session

@PURPOSE: Persist initial session shell.

ƒ load_session_detail

@PURPOSE: Return the full session aggregate for API/frontend use.

ƒ save_profile_and_findings

@PURPOSE: Persist profile and validation state together.

ƒ save_preview

@PURPOSE: Persist compiled preview attempt and mark older fingerprints stale.

ƒ save_run_context

@PURPOSE: Persist immutable launch audit snapshot.

[/DEF:DatasetReviewSessionRepository:Module]
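
To illustrate the ownership invariant and the save_preview contract, here is a small in-memory sketch. It assumes a dictionary-backed store in place of the real persistence layer; the class names, exception names, and method signatures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class SessionNotFound(Exception):
    """Raised for the missing_session edge case."""


class ForeignUserAccess(Exception):
    """Raised for the foreign_user_access edge case."""


@dataclass
class StoredPreview:
    fingerprint: str
    status: str = "ready"


@dataclass
class StoredSession:
    session_id: str
    owner_id: str
    previews: List[StoredPreview] = field(default_factory=list)


class InMemorySessionRepository:
    """Illustrative stand-in for the real persistence layer."""

    def __init__(self) -> None:
        self._sessions: Dict[str, StoredSession] = {}

    def create_session(self, session_id: str, owner_id: str) -> StoredSession:
        session = StoredSession(session_id, owner_id)
        self._sessions[session_id] = session
        return session

    def _load_owned(self, session_id: str, user_id: str) -> StoredSession:
        session = self._sessions.get(session_id)
        if session is None:
            raise SessionNotFound(session_id)
        if session.owner_id != user_id:
            raise ForeignUserAccess(session_id)
        return session

    def save_preview(self, session_id: str, user_id: str, fingerprint: str) -> StoredPreview:
        session = self._load_owned(session_id, user_id)
        # Older fingerprints become stale once a newer attempt is stored.
        for old in session.previews:
            if old.fingerprint != fingerprint:
                old.status = "stale"
        preview = StoredPreview(fingerprint)
        session.previews.append(preview)
        return preview
```

Routing every mutation through a single `_load_owned` check is one way to keep the ownership_scope invariant from depending on each caller remembering to verify the user.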


[DEF:SemanticSourceResolver:Module]

@COMPLEXITY: 4

@PURPOSE: Resolve, rank, and apply semantic metadata candidates from files, connected dictionaries, reference datasets, and AI generation fallback.

@LAYER: Domain

@RELATION: [DEPENDS_ON] ->[LLMProviderService]

@RELATION: [DEPENDS_ON] ->[SemanticSource]

@RELATION: [DEPENDS_ON] ->[SemanticFieldEntry]

@RELATION: [DEPENDS_ON] ->[SemanticCandidate]

@PRE: selected source and target field set must be known.

@POST: candidate ranking follows the configured confidence hierarchy and unresolved fuzzy matches remain reviewable.

@SIDE_EFFECT: may create conflict findings and semantic candidate records.

@DATA_CONTRACT: Input[SemanticSourceSelection | SemanticFieldSet | ManualFieldDecision] -> Output[SemanticCandidateSet | RankedSemanticResolution | ValidationFindingSet]

@INVARIANT: Manual overrides are never silently replaced by imported, inferred, or AI-generated values.

@TEST_CONTRACT: rank_candidates -> exact dictionary beats reference import beats fuzzy beats AI draft

@TEST_SCENARIO: manual_lock_survives_reimport -> locked field remains active after another source is applied

@TEST_EDGE: malformed_source_payload -> failed source application with explanatory finding

@TEST_EDGE: conflicting_sources -> conflict state preserved for review

@TEST_EDGE: no_trusted_matches -> AI draft fallback only

@TEST_INVARIANT: confidence_hierarchy -> VERIFIED_BY: [rank_candidates]

ƒ resolve_from_file

@PURPOSE: Normalize uploaded semantic file records into field-level candidates.

ƒ resolve_from_dictionary

@PURPOSE: Resolve candidates from connected tabular dictionary sources.

ƒ resolve_from_reference_dataset

@PURPOSE: Reuse semantic metadata from trusted Superset datasets.

ƒ rank_candidates

@PURPOSE: Apply confidence ordering and determine best candidate per field.

ƒ detect_conflicts

@PURPOSE: Mark competing candidate sets that require explicit user review.

ƒ apply_field_decision

@PURPOSE: Accept, reject, or manually override a field-level semantic value.

[/DEF:SemanticSourceResolver:Module]
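
The confidence hierarchy from the rank_candidates test contract and the manual-override invariant can be sketched as follows. The source-kind strings, `Candidate`, `FieldEntry`, and `apply_best_candidate` are assumptions made for this example; only the ordering (exact dictionary, then reference import, then fuzzy, then AI draft) comes from the contract above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Lower rank wins. The order follows the rank_candidates test contract;
# the key names themselves are hypothetical.
SOURCE_RANK = {
    "exact_dictionary": 0,
    "reference_import": 1,
    "fuzzy": 2,
    "ai_draft": 3,
}


@dataclass(frozen=True)
class Candidate:
    source_kind: str
    value: str


@dataclass
class FieldEntry:
    name: str
    value: Optional[str] = None
    locked: bool = False  # True once a user has set the value manually


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates from most to least trusted source kind."""
    return sorted(candidates, key=lambda c: SOURCE_RANK[c.source_kind])


def apply_best_candidate(entry: FieldEntry, candidates: List[Candidate]) -> FieldEntry:
    """Apply the top-ranked candidate unless the field is locked."""
    if entry.locked or not candidates:
        return entry  # manual overrides are never silently replaced
    entry.value = rank_candidates(candidates)[0].value
    return entry
```

Because `sorted` is stable, candidates of the same source kind keep their input order, so any tie-breaking policy would need to be added explicitly.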


[DEF:ClarificationEngine:Module]

@COMPLEXITY: 4

@PURPOSE: Manage one-question-at-a-time clarification sessions, including prioritization, answer persistence, and readiness impact updates.

@LAYER: Domain

@RELATION: [DEPENDS_ON] ->[ClarificationSession]

@RELATION: [DEPENDS_ON] ->[ClarificationQuestion]

@RELATION: [DEPENDS_ON] ->[ClarificationAnswer]

@RELATION: [DEPENDS_ON] ->[ValidationFinding]

@PRE: target session contains unresolved or contradictory review state.

@POST: every recorded answer updates the clarification session and associated session state deterministically.

@SIDE_EFFECT: creates clarification questions, persists answers, updates findings/profile state.

@DATA_CONTRACT: Input[ClarificationSessionState | ClarificationAnswerCommand] -> Output[ClarificationQuestionPayload | ClarificationProgressSnapshot | SessionReadinessDelta]

@INVARIANT: Clarification answers are persisted before the current question pointer or readiness state is advanced.

@TEST_CONTRACT: next_question_selection -> returns only one highest-priority unresolved question at a time

@TEST_SCENARIO: save_and_resume_clarification -> reopening session restores current question and prior answers

@TEST_EDGE: skipped_question -> unresolved topic remains visible

@TEST_EDGE: expert_review_marked -> topic deferred without false resolution

@TEST_EDGE: duplicate_answer_submission -> idempotent or rejected deterministically

@TEST_INVARIANT: single_active_question -> VERIFIED_BY: [next_question_selection]

ƒ start_or_resume

@PURPOSE: Open clarification mode on the highest-priority unresolved question.

ƒ build_question_payload

@PURPOSE: Return question, why-it-matters text, current guess, and suggested options.

ƒ record_answer

@PURPOSE: Persist one answer and compute state impact.

ƒ summarize_progress

@PURPOSE: Produce the clarification change summary shown on exit or pause.

[/DEF:ClarificationEngine:Module]
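
A minimal sketch of the single-active-question rule and the persist-before-advance invariant is shown below. It assumes that a higher `priority` number means more urgent and that a duplicate submission is rejected rather than treated as idempotent; both are illustrative choices, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Question:
    question_id: str
    priority: int  # assumption: higher number means more urgent


@dataclass
class ClarificationState:
    questions: List[Question]
    answers: Dict[str, str] = field(default_factory=dict)
    current_question_id: Optional[str] = None


def next_question(state: ClarificationState) -> Optional[Question]:
    """Return the single highest-priority question that has no answer yet."""
    unresolved = [q for q in state.questions if q.question_id not in state.answers]
    if not unresolved:
        return None
    return max(unresolved, key=lambda q: q.priority)


def start_or_resume(state: ClarificationState) -> Optional[str]:
    """Point the session at a question only if none is currently active."""
    if state.current_question_id is None:
        upcoming = next_question(state)
        state.current_question_id = upcoming.question_id if upcoming else None
    return state.current_question_id


def record_answer(state: ClarificationState, question_id: str, answer: str) -> Optional[Question]:
    """Store the answer first, then advance the current-question pointer."""
    if question_id != state.current_question_id:
        raise ValueError("answer does not target the active question")
    state.answers[question_id] = answer  # persisted before the pointer moves
    upcoming = next_question(state)
    state.current_question_id = upcoming.question_id if upcoming else None
    return upcoming
```

With this ordering, a failure after the answer is stored leaves the pointer on an already-answered question, which can be recomputed on resume, rather than advancing past an answer that was never saved.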


[DEF:SupersetContextExtractor:Module]

@COMPLEXITY: 4

@PURPOSE: Recover dataset, dashboard, filter, and runtime-template context from Superset links and related API payloads.

@LAYER: Infra

@RELATION: [DEPENDS_ON] ->[ImportedFilter]

@RELATION: [DEPENDS_ON] ->[TemplateVariable]

@RELATION: [DEPENDS_ON] ->[backend.src.core.superset_client.SupersetClient]

@DATA_CONTRACT: Input[SupersetLink | DatasetReference | EnvironmentContext] -> Output[RecoveredSupersetContext | ImportedFilterSet | TemplateVariableSet | RecoverySummary]

@PRE: Superset link or dataset reference must be parseable enough to resolve an environment-scoped target resource.

@POST: returns the best available recovered context with explicit provenance and partial-recovery markers when necessary.

@SIDE_EFFECT: performs upstream Superset API reads.

@INVARIANT: Partial recovery is surfaced explicitly and never misrepresented as fully confirmed context.

@TEST_CONTRACT: recover_context_from_link -> output distinguishes URL-derived, native-filter-derived, and unresolved context

@TEST_SCENARIO: partial_filter_recovery_marks_recovery_required -> session remains usable but not falsely complete

@TEST_EDGE: unsupported_link_shape -> intake failure with actionable finding

@TEST_EDGE: dataset_without_filters -> successful dataset recovery with empty imported filter set

@TEST_EDGE: missing_dashboard_binding -> partial recovery only

@TEST_INVARIANT: provenance_visibility -> VERIFIED_BY: [recover_context_from_link]

@PURPOSE: Extract candidate identifiers and query state from supported Superset URLs.

ƒ recover_imported_filters

@PURPOSE: Build imported filter entries from URL state and Superset-side saved context.

ƒ discover_template_variables

@PURPOSE: Detect runtime variables and Jinja references from dataset query-bearing fields.

ƒ build_recovery_summary

@PURPOSE: Summarize recovered, partial, and unresolved context for session state and UX.

[/DEF:SupersetContextExtractor:Module]
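
The partial-recovery invariant can be illustrated with a sketch that tags each recovered value with its provenance. This example only reads query-string parameters and uses a made-up URL; the real extractor also consults Superset-side saved context, which is not modelled here. `RecoveredItem`, `RecoverySummary`, and `recover_from_link` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import parse_qs, urlparse


@dataclass(frozen=True)
class RecoveredItem:
    name: str
    value: Optional[str]
    provenance: str  # "url" or "unresolved" in this simplified sketch


@dataclass(frozen=True)
class RecoverySummary:
    items: List[RecoveredItem]
    partial: bool  # True when at least one expected item is unresolved


def recover_from_link(link: str, expected: List[str]) -> RecoverySummary:
    """Recover expected filter names from the URL query and mark what is missing."""
    query = parse_qs(urlparse(link).query)
    items: List[RecoveredItem] = []
    for name in expected:
        values = query.get(name)
        if values:
            items.append(RecoveredItem(name, values[0], "url"))
        else:
            # Never guess: a missing value stays explicitly unresolved.
            items.append(RecoveredItem(name, None, "unresolved"))
    partial = any(item.provenance == "unresolved" for item in items)
    return RecoverySummary(items, partial)
```

Keeping unresolved entries in the result, instead of dropping them, is what lets the session state and UX show that recovery is incomplete.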


[DEF:SupersetCompilationAdapter:Module]

@COMPLEXITY: 4

@PURPOSE: Interact with Superset preview compilation and SQL Lab execution endpoints using the current approved execution context.

@LAYER: Infra

@RELATION: [DEPENDS_ON] ->[CompiledPreview]

@RELATION: [DEPENDS_ON] ->[DatasetRunContext]

@RELATION: [DEPENDS_ON] ->[backend.src.core.superset_client.SupersetClient]

@DATA_CONTRACT: Input[ApprovedExecutionContext | PreviewFingerprint | LaunchRequest] -> Output[CompiledPreview | PreviewFailureArtifact | DatasetRunContext | LaunchFailureAudit]

@PRE: effective template params and dataset execution reference are available.

@POST: preview and launch calls return Superset-originated artifacts or explicit errors.

@SIDE_EFFECT: performs upstream Superset preview and SQL Lab calls.

@INVARIANT: The adapter never fabricates compiled SQL locally; preview truth is delegated to Superset only.

@TEST_CONTRACT: compile_then_launch -> launch uses the same effective input fingerprint verified in preview

@TEST_SCENARIO: preview_failure_blocks_launch -> no SQL Lab session is created after failed preview

@TEST_EDGE: compilation_endpoint_error -> failed preview artifact with readable diagnostics

@TEST_EDGE: sql_lab_creation_error -> failed launch audit state

@TEST_EDGE: fingerprint_mismatch -> launch rejected

@TEST_INVARIANT: superset_truth_source -> VERIFIED_BY: [preview_failure_blocks_launch]

ƒ compile_preview

@PURPOSE: Request Superset-side compiled SQL preview for the current effective inputs.

ƒ mark_preview_stale

@PURPOSE: Invalidate previous preview after mapping or value changes.

ƒ create_sql_lab_session

@PURPOSE: Create the canonical audited execution session after all launch gates pass.

[/DEF:SupersetCompilationAdapter:Module]
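
The contracts above do not specify how the input fingerprint is computed. One possible approach, shown here purely as an assumption, is a SHA-256 hash over a canonical JSON encoding of the effective inputs; the function names and the input fields (`dataset_ref`, `template_params`, `filters`) are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


class FingerprintMismatch(Exception):
    """Raised for the fingerprint_mismatch edge case."""


def input_fingerprint(dataset_ref: str, template_params: Dict[str, Any], filters: Dict[str, Any]) -> str:
    """Hash the effective inputs in a key-order-independent way."""
    canonical = json.dumps(
        {"dataset_ref": dataset_ref, "template_params": template_params, "filters": filters},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_launchable(preview_fingerprint: str, current_inputs: Dict[str, Any]) -> str:
    """Reject launch when the inputs changed after the preview was compiled."""
    current = input_fingerprint(**current_inputs)
    if current != preview_fingerprint:
        raise FingerprintMismatch("preview was compiled for different inputs")
    return current
```

Sorting keys before hashing means that reordering parameters does not invalidate a preview, while any change to a value does. Note that this sketch only compares fingerprints; it does not compile SQL, which stays on the Superset side as the invariant requires.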


2. Frontend Components

ƒ handleSourceSubmit

ƒ handleResumeSession

ƒ handleLaunch


ƒ submitDatasetSelection


ƒ groupFindingsBySeverity

ƒ jumpToFindingTarget


ƒ applyManualOverride

ƒ applyCandidateSelection


ƒ submitAnswer

ƒ skipQuestion

ƒ pauseClarification


ƒ approveMapping

ƒ overrideMappingValue


ƒ requestPreview

ƒ showPreviewErrorTarget


ƒ buildLaunchSummary

ƒ confirmLaunch


3. Contract Coverage Notes

The feature requires:

  • dedicated semantic resolution contracts instead of hiding source-ranking logic inside orchestration,
  • a first-class clarification engine because guided ambiguity resolution is a persisted workflow, not a simple endpoint,
  • a Superset extraction boundary distinct from preview/launch behavior,
  • UI contracts that cover the UX state machine rather than only the happy path.

These contracts are intended to align directly with: