ss-tools/specs/027-dataset-llm-orchestration/data-model.md
busya 7c85552132 feat(ui): add chat-driven dataset review flow
Move dataset review clarification into the assistant workspace and
rework the review page into a chat-centric layout with execution rails.

Add session-scoped assistant actions for mappings, semantic fields,
and SQL preview generation. Introduce optimistic locking for dataset
review mutations, propagate session versions through API responses,
and mask imported filter values before assistant exposure.

Refresh tests, i18n, and spec artifacts to match the new workflow.

BREAKING CHANGE: dataset review mutation endpoints now require the
X-Session-Version header, and clarification is no longer handled
through ClarificationDialog-based flows
2026-03-26 13:33:12 +03:00


Data Model: LLM Dataset Orchestration

Feature: LLM Dataset Orchestration
Branch: 027-dataset-llm-orchestration
Date: 2026-03-16

Overview

This document defines the domain entities, relationships, lifecycle states, and validation rules for the dataset review, semantic enrichment, clarification, preview, and launch workflow described in spec.md and grounded by the decisions in research.md.

The model is intentionally split into:

  • session aggregate entities for resumable workflow state,
  • semantic/provenance entities for enrichment and conflict handling,
  • execution entities for mapping, preview, and launch audit,
  • export projections for sharing outputs.

1. Core Aggregate: DatasetReviewSession

Entity: SessionCollaborator

| Field | Type | Required | Description |
|---|---|---|---|
| user_id | string | yes | Collaborating user ID |
| role | enum | yes | viewer, reviewer, approver |
| added_at | datetime | yes | When they were added |

Entity: DatasetReviewSession

Represents the top-level resumable workflow container for one dataset review/execution effort.

| Field | Type | Required | Description |
|---|---|---|---|
| session_id | string (UUID) | yes | Stable unique identifier for the review session |
| user_id | string | yes | Authenticated User ID of the session owner |
| collaborators | list[SessionCollaborator] | no | Shared access and roles |
| environment_id | string | yes | Superset environment context |
| source_kind | enum | yes | Origin kind: superset_link, dataset_selection |
| source_input | string | yes | Original link or selected dataset reference |
| dataset_ref | string | yes | Canonical dataset reference used by the feature |
| dataset_id | integer \| null | no | Superset dataset id when resolved |
| dashboard_id | integer \| null | no | Superset dashboard id if imported from dashboard link |
| readiness_state | enum | yes | Current workflow readiness state |
| recommended_action | enum | yes | Explicit next recommended action |
| version | integer | yes | Optimistic-lock version incremented on every persisted session mutation |
| status | enum | yes | Session lifecycle status |
| current_phase | enum | yes | Active workflow phase |
| active_task_id | string \| null | no | Linked long-running task if one is active |
| last_preview_id | string \| null | no | Most recent preview snapshot |
| last_run_context_id | string \| null | no | Most recent launch audit record |
| created_at | datetime | yes | Session creation timestamp |
| updated_at | datetime | yes | Last mutation timestamp |
| last_activity_at | datetime | yes | Last user/system activity timestamp |
| closed_at | datetime \| null | no | Terminal close/archive timestamp |

Validation rules

  • session_id must be globally unique.
  • source_input must be non-empty.
  • environment_id must resolve to a configured environment.
  • readiness_state and recommended_action must always be present.
  • version starts at 0 on session creation and increments monotonically after every successful session mutation.
  • user_id ownership must be enforced for all mutations, unless collaborator roles allow otherwise.
  • dataset_id becomes required before preview or launch phases.
  • last_preview_id must refer to a preview generated from the same session.
  • Mutating requests must include the caller's last observed session version; mismatches are rejected as optimistic-lock conflicts rather than silently merged.
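
The version rules above can be sketched as a compare-then-increment check. This is a minimal in-memory illustration, not the feature's actual persistence code: the `Session` dataclass, `VersionConflict`, and `apply_mutation` are hypothetical names, and a real implementation would perform the check atomically in the database.

```python
from dataclasses import dataclass


class VersionConflict(Exception):
    """Raised when the caller's observed version is stale (hypothetical name)."""


@dataclass
class Session:
    session_id: str
    current_phase: str = "intake"
    version: int = 0  # starts at 0 on session creation


def apply_mutation(session: Session, observed_version: int, **changes) -> Session:
    # Reject stale writers instead of silently merging their changes.
    if observed_version != session.version:
        raise VersionConflict(
            f"observed version {observed_version}, current is {session.version}"
        )
    for name, value in changes.items():
        setattr(session, name, value)
    session.version += 1  # bump only after the mutation succeeded
    return session
```

A caller that still holds version 0 after another writer has moved the session to version 1 gets a `VersionConflict` and must reload before retrying.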

Enums

SessionStatus

  • active
  • paused
  • completed
  • archived
  • cancelled

SessionPhase

  • intake
  • recovery
  • review
  • semantic_review
  • clarification
  • mapping_review
  • preview
  • launch
  • post_run

ReadinessState

  • empty
  • importing
  • review_ready
  • semantic_source_review_needed
  • clarification_needed
  • clarification_active
  • mapping_review_needed
  • compiled_preview_ready
  • partially_ready
  • run_ready
  • run_in_progress
  • completed
  • recovery_required

RecommendedAction

  • import_from_superset
  • review_documentation
  • apply_semantic_source
  • start_clarification
  • answer_next_question
  • approve_mapping
  • generate_sql_preview
  • complete_required_values
  • launch_dataset
  • resume_session
  • export_outputs

2. Dataset Profile and Review State

Entity: DatasetProfile

Consolidated interpretation of dataset meaning, semantics, filters, assumptions, and readiness.

| Field | Type | Required | Description |
|---|---|---|---|
| profile_id | string (UUID) | yes | Unique profile id |
| session_id | string | yes | Parent session |
| dataset_name | string | yes | Display dataset name |
| schema_name | string \| null | no | Schema if available |
| database_name | string \| null | no | Database if available |
| business_summary | text | yes | Human-readable summary |
| business_summary_source | enum | yes | Provenance of summary |
| description | text \| null | no | Dataset-level description |
| dataset_type | enum \| null | no | table, virtual, sqllab_view, unknown |
| is_sqllab_view | boolean | yes | Whether dataset is SQL Lab derived |
| completeness_score | number \| null | no | Optional normalized completeness score |
| confidence_state | enum | yes | Overall confidence posture |
| has_blocking_findings | boolean | yes | Derived summary flag |
| has_warning_findings | boolean | yes | Derived summary flag |
| manual_summary_locked | boolean | yes | Protects user-entered summary |
| created_at | datetime | yes | Created timestamp |
| updated_at | datetime | yes | Updated timestamp |

Validation rules

  • business_summary must always contain a usable string; if weak, it may be skeletal but not null.
  • manual_summary_locked=true prevents later automatic overwrite.
  • session_id must be unique if only one active profile snapshot is stored per session, or versioned if snapshots are retained.
  • confidence_state must reflect highest unresolved-risk posture, not just optimistic confidence.

BusinessSummarySource

  • confirmed
  • imported
  • inferred
  • ai_draft
  • manual_override

ConfidenceState

  • confirmed
  • mostly_confirmed
  • mixed
  • low_confidence
  • unresolved

3. Validation Findings

Entity: ValidationFinding

Represents a blocking issue, warning, or informational observation.

| Field | Type | Required | Description |
|---|---|---|---|
| finding_id | string (UUID) | yes | Unique finding id |
| session_id | string | yes | Parent session |
| area | enum | yes | Affected domain area |
| severity | enum | yes | blocking, warning, informational |
| code | string | yes | Stable machine-readable finding code |
| title | string | yes | Short label |
| message | text | yes | Actionable human-readable explanation |
| resolution_state | enum | yes | Current resolution status |
| resolution_note | text \| null | no | Optional explanation or approval note |
| caused_by_ref | string \| null | no | Related field/filter/mapping/question id |
| created_at | datetime | yes | Creation timestamp |
| resolved_at | datetime \| null | no | Resolution timestamp |

Validation rules

  • severity must be one of the allowed values.
  • resolution_state=resolved or approved requires either a system resolution event or user action.
  • launch is blocked if any open blocking finding remains.
  • warning findings tied to mapping transformations require explicit approval before launch if marked launch-sensitive.
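
The two launch-related rules above could be checked roughly as follows. This is a sketch under stated assumptions: `Finding` is a trimmed stand-in for ValidationFinding, the `launch_sensitive` flag is hypothetical (the model does not name a field for it), and treating a resolved warning as cleared is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    code: str
    severity: str          # "blocking", "warning", "informational"
    resolution_state: str  # "open", "resolved", "approved", ...
    launch_sensitive: bool = False  # hypothetical flag, not a documented field


def launch_blockers(findings):
    """Return the codes of findings that currently prevent launch."""
    blockers = []
    for f in findings:
        if f.severity == "blocking" and f.resolution_state == "open":
            blockers.append(f.code)
        elif (
            f.severity == "warning"
            and f.launch_sensitive
            # Assumption: a resolved warning no longer needs approval.
            and f.resolution_state not in ("approved", "resolved")
        ):
            blockers.append(f.code)
    return blockers
```

Returning the codes rather than a boolean lets the caller explain why launch is blocked.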

FindingArea

  • source_intake
  • dataset_profile
  • semantic_enrichment
  • clarification
  • filter_recovery
  • template_mapping
  • compiled_preview
  • launch
  • audit

ResolutionState

  • open
  • resolved
  • approved
  • skipped
  • deferred
  • expert_review

4. Semantic Source and Field Decisions

Entity: SemanticSource

Represents a trusted or candidate source of semantic metadata.

| Field | Type | Required | Description |
|---|---|---|---|
| source_id | string (UUID) | yes | Unique source id |
| session_id | string | yes | Parent session |
| source_type | enum | yes | Origin kind |
| source_ref | string | yes | External reference, dataset ref, or uploaded artifact ref |
| source_version | string | yes | Version/snapshot for propagation tracking |
| display_name | string | yes | Human-readable source name |
| trust_level | enum | yes | Source trust tier |
| schema_overlap_score | number \| null | no | Optional overlap signal |
| status | enum | yes | Availability/applicability status |
| created_at | datetime | yes | Creation timestamp |

SemanticSourceType

  • uploaded_file
  • connected_dictionary
  • reference_dataset
  • neighbor_dataset
  • ai_generated

TrustLevel

  • trusted
  • recommended
  • candidate
  • generated

SemanticSourceStatus

  • available
  • selected
  • applied
  • rejected
  • partial
  • failed

Entity: SemanticFieldEntry

Canonical semantic state for one dataset field or metric.

| Field | Type | Required | Description |
|---|---|---|---|
| field_id | string (UUID) | yes | Unique field semantic id |
| session_id | string | yes | Parent session |
| field_name | string | yes | Physical field/metric name |
| field_kind | enum | yes | column, metric, filter_dimension, parameter |
| verbose_name | string \| null | no | Display label |
| description | text \| null | no | Human-readable description |
| display_format | string \| null | no | Formatting metadata such as d3 format |
| provenance | enum | yes | Final chosen source class |
| source_id | string \| null | no | Winning source |
| confidence_rank | integer \| null | no | Final applied ranking |
| is_locked | boolean | yes | Manual override protection |
| has_conflict | boolean | yes | Whether competing candidates exist |
| needs_review | boolean | yes | Whether user review is still needed |
| last_changed_by | enum | yes | system, user, agent |
| user_feedback | enum \| null | no | User feedback: up or down; null when none was given |
| created_at | datetime | yes | Creation timestamp |
| updated_at | datetime | yes | Updated timestamp |

Validation rules

  • field_name must be unique per session_id + field_kind.
  • is_locked=true prevents automatic overwrite.
  • provenance=manual_override implies is_locked=true.
  • has_conflict=true requires at least one competing candidate record.
  • Applied values that came from fuzzy or inferred matches must keep needs_review=true until confirmed, if policy requires explicit review.

FieldKind

  • column
  • metric
  • filter_dimension
  • parameter

FieldProvenance

  • dictionary_exact
  • reference_imported
  • fuzzy_inferred
  • ai_generated
  • manual_override
  • unresolved

Entity: SemanticCandidate

Stores competing candidate values before or alongside final field decision.

| Field | Type | Required | Description |
|---|---|---|---|
| candidate_id | string (UUID) | yes | Unique candidate id |
| field_id | string | yes | Parent semantic field |
| source_id | string \| null | no | Candidate source |
| candidate_rank | integer | yes | Lower is stronger |
| match_type | enum | yes | exact, reference, fuzzy, generated |
| confidence_score | number | yes | Normalized score |
| proposed_verbose_name | string \| null | no | Candidate verbose name |
| proposed_description | text \| null | no | Candidate description |
| proposed_display_format | string \| null | no | Candidate display format |
| status | enum | yes | Candidate lifecycle |
| created_at | datetime | yes | Creation timestamp |

CandidateMatchType

  • exact
  • reference
  • fuzzy
  • generated

CandidateStatus

  • proposed
  • accepted
  • rejected
  • superseded

5. Imported Filters and Runtime Variables

Entity: ImportedFilter

Represents one recovered or user-supplied filter value.

| Field | Type | Required | Description |
|---|---|---|---|
| filter_id | string (UUID) | yes | Unique filter id |
| session_id | string | yes | Parent session |
| filter_name | string | yes | Source filter name |
| display_name | string \| null | no | User-facing label |
| raw_value | json | yes | Original recovered value |
| raw_value_masked | boolean | yes | Whether the stored or exposed raw value has been masked/redacted for assistant or LLM-facing use |
| normalized_value | json \| null | no | Optional transformed value |
| source | enum | yes | Origin of the filter |
| confidence_state | enum | yes | Confidence/provenance class |
| requires_confirmation | boolean | yes | Whether explicit review is needed |
| recovery_status | enum | yes | Recovery completeness |
| notes | text \| null | no | Recovery explanation |
| created_at | datetime | yes | Creation timestamp |
| updated_at | datetime | yes | Updated timestamp |

FilterSource

  • superset_native
  • superset_url
  • manual
  • inferred

FilterConfidenceState

  • confirmed
  • imported
  • inferred
  • ai_draft
  • unresolved

FilterRecoveryStatus

  • recovered
  • partial
  • missing
  • conflicted

Validation rules

  • raw_value may be stored for audit and replay, but any context passed into assistant or LLM-facing orchestration must use a masked/redacted representation when the value may contain PII or other sensitive identifiers.
  • raw_value_masked=true is required whenever the exported assistant context omits or redacts sensitive substrings from the original filter payload.
  • Masking policy must preserve enough structure for mapping and clarification, for example key shape, value type, cardinality hints, and non-sensitive tokens.
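
One way to read the structure-preserving masking rule is a recursive redaction that keeps keys, container shapes, and value types while replacing leaf values. The sketch below is an assumption about how such a policy might look, not the project's masking code: `mask_value` and the placeholder strings are hypothetical, and it conservatively masks every string and number instead of trying to detect non-sensitive tokens.

```python
def mask_value(value):
    """Redact leaf values but keep key shape, value types, and list lengths."""
    if isinstance(value, dict):
        return {key: mask_value(item) for key, item in value.items()}
    if isinstance(value, list):
        # Keeping one placeholder per element preserves the cardinality hint.
        return [mask_value(item) for item in value]
    if isinstance(value, bool) or value is None:
        return value  # assumption: booleans and nulls are not sensitive
    if isinstance(value, (int, float)):
        return "<number>"
    return "<string>"
```

For example, `mask_value({"region": ["EU", "US"], "customer_id": 1042})` keeps both keys and the two-element list but exposes no original value.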

Entity: TemplateVariable

Represents a runtime variable discovered from dataset execution logic.

| Field | Type | Required | Description |
|---|---|---|---|
| variable_id | string (UUID) | yes | Unique variable id |
| session_id | string | yes | Parent session |
| variable_name | string | yes | Canonical runtime variable name |
| expression_source | text | yes | Raw expression or snippet where variable was found |
| variable_kind | enum | yes | Detected variable class |
| is_required | boolean | yes | Whether launch requires a mapped value |
| default_value | json \| null | no | Optional default |
| mapping_status | enum | yes | Current mapping state |
| created_at | datetime | yes | Creation timestamp |
| updated_at | datetime | yes | Updated timestamp |

VariableKind

  • native_filter
  • parameter
  • derived
  • unknown

MappingStatus

  • unmapped
  • proposed
  • approved
  • overridden
  • invalid

6. Mapping Review and Warning Approvals

Entity: ExecutionMapping

Represents the mapping between a recovered filter and a runtime variable.

| Field | Type | Required | Description |
|---|---|---|---|
| mapping_id | string (UUID) | yes | Unique mapping id |
| session_id | string | yes | Parent session |
| filter_id | string | yes | Source imported filter |
| variable_id | string | yes | Target template variable |
| mapping_method | enum | yes | How mapping was produced |
| raw_input_value | json | yes | Original input |
| effective_value | json \| null | no | Value to send to preview/launch |
| transformation_note | text \| null | no | Explanation of normalization |
| warning_level | enum \| null | no | Warning classification if transformation is risky |
| requires_explicit_approval | boolean | yes | Whether launch gate applies |
| approval_state | enum | yes | Approval lifecycle |
| approved_by_user_id | string \| null | no | Approver if approved |
| approved_at | datetime \| null | no | Approval timestamp |
| created_at | datetime | yes | Creation timestamp |
| updated_at | datetime | yes | Updated timestamp |

Validation rules

  • filter_id + variable_id must be unique per session unless versioning is used.
  • requires_explicit_approval=true implies launch is blocked while approval_state != approved.
  • effective_value is required before preview when variable is required.
  • user override should set mapping_method=manual_override.
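
The approval and effective-value rules above can be expressed as two small checks. This is a sketch only: the `Mapping` dataclass flattens the target variable's name and required flag onto the mapping for brevity (in the model they live on TemplateVariable), and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Mapping:
    variable_name: str       # flattened from TemplateVariable for this sketch
    variable_required: bool  # flattened from TemplateVariable.is_required
    effective_value: Optional[Any]
    requires_explicit_approval: bool
    approval_state: str  # "pending", "approved", "rejected", "not_required"


def missing_preview_values(mappings):
    """Names of required variables that still lack an effective value."""
    return [
        m.variable_name
        for m in mappings
        if m.variable_required and m.effective_value is None
    ]


def blocks_launch(mapping):
    """True while a mapping that needs explicit approval is not approved."""
    return mapping.requires_explicit_approval and mapping.approval_state != "approved"
```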

MappingMethod

  • direct_match
  • heuristic_match
  • semantic_match
  • manual_override

MappingWarningLevel

  • low
  • medium
  • high

ApprovalState

  • pending
  • approved
  • rejected
  • not_required

7. Clarification Workflow

Entity: ClarificationSession

Stores resumable clarification flow state for one review session.

| Field | Type | Required | Description |
|---|---|---|---|
| clarification_session_id | string (UUID) | yes | Unique clarification session id |
| session_id | string | yes | Parent review session |
| status | enum | yes | Clarification lifecycle |
| current_question_id | string \| null | no | Current active question |
| resolved_count | integer | yes | Count of answered/resolved items |
| remaining_count | integer | yes | Count of unresolved items |
| summary_delta | text \| null | no | Human-readable change summary |
| started_at | datetime | yes | Start time |
| updated_at | datetime | yes | Last update |
| completed_at | datetime \| null | no | End time |

ClarificationStatus

  • pending
  • active
  • paused
  • completed
  • cancelled

Entity: ClarificationQuestion

Represents one focused question in the clarification flow.

| Field | Type | Required | Description |
|---|---|---|---|
| question_id | string (UUID) | yes | Unique question id |
| clarification_session_id | string | yes | Parent clarification session |
| topic_ref | string | yes | Related field/finding/mapping id |
| question_text | text | yes | Focused question |
| why_it_matters | text | yes | Business significance explanation |
| current_guess | text \| null | no | Best guess if available |
| priority | integer | yes | Order score |
| state | enum | yes | Question lifecycle |
| created_at | datetime | yes | Creation timestamp |
| updated_at | datetime | yes | Updated timestamp |

QuestionState

  • open
  • answered
  • skipped
  • expert_review
  • superseded

Entity: ClarificationOption

Suggested selectable answer option for a question.

| Field | Type | Required | Description |
|---|---|---|---|
| option_id | string (UUID) | yes | Unique option id |
| question_id | string | yes | Parent question |
| label | string | yes | UI label |
| value | string | yes | Stored answer payload |
| is_recommended | boolean | yes | Whether this is the recommended option |
| display_order | integer | yes | UI ordering |

Entity: ClarificationAnswer

Stores user response to one clarification question.

| Field | Type | Required | Description |
|---|---|---|---|
| answer_id | string (UUID) | yes | Unique answer id |
| question_id | string | yes | Parent question |
| answer_kind | enum | yes | How user responded |
| answer_value | text \| null | no | Selected/custom answer |
| answered_by_user_id | string | yes | Responding user |
| impact_summary | text \| null | no | Optional summary of resulting state changes |
| created_at | datetime | yes | Answer timestamp |

AnswerKind

  • selected
  • custom
  • skipped
  • expert_review

Validation rules

  • Each active question may have at most one current answer.
  • custom answers require non-empty answer_value.
  • selected answers must correspond to a valid option or normalized payload.
  • expert_review leaves the related topic unresolved but marked intentionally deferred.
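
The answer rules above might be validated along these lines. The function name and error strings are hypothetical, and the sketch simplifies "valid option or normalized payload" to a direct match against the question's option values.

```python
def validate_answer(answer_kind, answer_value, option_values):
    """Return an error message, or None when the answer is acceptable."""
    if answer_kind == "custom":
        if not (answer_value and answer_value.strip()):
            return "custom answers require non-empty answer_value"
    elif answer_kind == "selected":
        # Simplification: only exact option values count as valid.
        if answer_value not in option_values:
            return "selected answer must match a valid option"
    elif answer_kind not in ("skipped", "expert_review"):
        return f"unknown answer_kind: {answer_kind}"
    return None
```

Skipped and expert_review answers pass without a value, which matches answer_value being nullable.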

8. Preview and Launch Audit

Entity: CompiledPreview

Stores the exact Superset-returned compiled SQL preview.

| Field | Type | Required | Description |
|---|---|---|---|
| preview_id | string (UUID) | yes | Unique preview id |
| session_id | string | yes | Parent session |
| preview_status | enum | yes | Preview lifecycle state |
| compiled_sql | text \| null | no | Exact compiled SQL if successful |
| preview_fingerprint | string | yes | Snapshot hash of mapping/inputs used |
| compiled_by | enum | yes | Must be superset |
| error_code | string \| null | no | Optional failure code |
| error_details | text \| null | no | Readable preview error |
| compiled_at | datetime \| null | no | Successful compile timestamp |
| created_at | datetime | yes | Record creation timestamp |

Validation rules

  • compiled_by must be superset.
  • compiled_sql is required when preview_status=ready.
  • compiled_sql must be null when preview_status=failed; any partial diagnostics that are intentionally kept must be stored elsewhere.
  • preview_fingerprint must be compared against current session inputs before launch.
  • Launch requires preview_status=ready and matching current fingerprint.
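
The fingerprint comparison above could be implemented by hashing a canonical serialization of the inputs. The model does not specify the hash or which inputs it covers, so SHA-256 over sorted-key JSON of the filters and template parameters is an assumption here, as are both function names.

```python
import hashlib
import json


def preview_fingerprint(effective_filters, template_params):
    """Hash the launch inputs in a key-order-independent way."""
    payload = json.dumps(
        {"filters": effective_filters, "params": template_params},
        sort_keys=True,          # same inputs give the same hash regardless of key order
        separators=(",", ":"),   # no whitespace differences
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def can_launch(preview_status, stored_fingerprint, effective_filters, template_params):
    """Launch needs a ready preview whose fingerprint matches the current inputs."""
    current = preview_fingerprint(effective_filters, template_params)
    return preview_status == "ready" and stored_fingerprint == current
```

Any change to a filter value or template parameter after the preview was compiled changes the hash, so `can_launch` returns False until a new preview is generated.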

PreviewStatus

  • pending
  • ready
  • failed
  • stale

Entity: DatasetRunContext

Audited execution snapshot created at launch.

| Field | Type | Required | Description |
|---|---|---|---|
| run_context_id | string (UUID) | yes | Unique run context id |
| session_id | string | yes | Parent review session |
| dataset_ref | string | yes | Canonical dataset identity |
| environment_id | string | yes | Execution environment |
| preview_id | string | yes | Bound compiled preview |
| sql_lab_session_ref | string | yes | Canonical SQL Lab reference |
| effective_filters | json | yes | Final filter payload |
| template_params | json | yes | Final template parameter object |
| approved_mapping_ids | json array | yes | Explicit approvals used for launch |
| semantic_decision_refs | json array | yes | Applied semantic decision references |
| open_warning_refs | json array | yes | Warnings that remained visible at launch |
| launch_status | enum | yes | Launch outcome |
| launch_error | text \| null | no | Error if launch failed |
| created_at | datetime | yes | Launch record timestamp |

Validation rules

  • preview_id must reference a CompiledPreview with ready status.
  • sql_lab_session_ref is mandatory for successful launch.
  • effective_filters and template_params must match the preview fingerprint used.
  • launch_status=started or success requires a non-empty SQL Lab reference.

LaunchStatus

  • started
  • success
  • failed

9. Export Projections

Entity: ExportArtifact

Tracks generated exports for sharing documentation and validation outputs.

| Field | Type | Required | Description |
|---|---|---|---|
| artifact_id | string (UUID) | yes | Unique artifact id |
| session_id | string | yes | Parent session |
| artifact_type | enum | yes | Export type |
| format | enum | yes | File/output format |
| storage_ref | string | yes | Storage/file reference |
| created_by_user_id | string | yes | Requesting user |
| created_at | datetime | yes | Artifact creation time |

ArtifactType

  • documentation
  • validation_report
  • run_summary

ArtifactFormat

  • json
  • markdown
  • csv
  • pdf

10. Relationships

One-to-one / aggregate-root relationships

  • DatasetReviewSession → DatasetProfile (current active profile view)
  • DatasetReviewSession → ClarificationSession (current or latest)
  • DatasetReviewSession → CompiledPreview (latest/current preview)
  • DatasetReviewSession → DatasetRunContext (latest/current launch audit)

One-to-many relationships

  • DatasetReviewSession → many ValidationFinding
  • DatasetReviewSession → many SemanticSource
  • DatasetReviewSession → many SemanticFieldEntry
  • SemanticFieldEntry → many SemanticCandidate
  • DatasetReviewSession → many ImportedFilter
  • DatasetReviewSession → many TemplateVariable
  • DatasetReviewSession → many ExecutionMapping
  • ClarificationSession → many ClarificationQuestion
  • ClarificationQuestion → many ClarificationOption
  • ClarificationQuestion → zero/one current ClarificationAnswer
  • DatasetReviewSession → many ExportArtifact
  • DatasetReviewSession → many SessionEvent

11. Derived Rules and Invariants

Run readiness invariant

A session is run_ready only if:

  • no open blocking findings remain,
  • all required template variables have approved/effective mappings,
  • all launch-sensitive mapping warnings have been explicitly approved,
  • a non-stale CompiledPreview exists for the current fingerprint.
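
The four conditions above combine into a single readiness check. In this sketch the inputs are pre-computed counts and fingerprints, the function name is hypothetical, and "non-stale" is read as a preview in ready status with a matching fingerprint, which mirrors the launch rule on CompiledPreview but is still an assumption.

```python
def readiness_blockers(
    open_blocking_findings,
    unmapped_required_variables,
    unapproved_sensitive_warnings,
    preview_status,
    preview_fingerprint,
    current_fingerprint,
):
    """Return the reasons a session is not run_ready; empty means run_ready."""
    reasons = []
    if open_blocking_findings > 0:
        reasons.append("open blocking findings")
    if unmapped_required_variables > 0:
        reasons.append("required variables without approved/effective mappings")
    if unapproved_sensitive_warnings > 0:
        reasons.append("launch-sensitive warnings not approved")
    if preview_status != "ready" or preview_fingerprint != current_fingerprint:
        reasons.append("no current compiled preview")
    return reasons
```

A session would be reported as run_ready only when the returned list is empty.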

Manual intent invariant

If a field is manually overridden:

  • SemanticFieldEntry.is_locked = true
  • SemanticFieldEntry.provenance = manual_override
  • later imports or inferred candidates may be recorded, but cannot replace the active value automatically.
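
The manual intent invariant amounts to a lock check before any automatic write. The sketch below uses a trimmed stand-in for SemanticFieldEntry; `manual_override` and `apply_candidate` are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldEntry:
    field_name: str
    verbose_name: Optional[str] = None
    provenance: str = "unresolved"
    is_locked: bool = False


def manual_override(entry: FieldEntry, verbose_name: str) -> None:
    """A user edit sets the value, the provenance, and the lock together."""
    entry.verbose_name = verbose_name
    entry.provenance = "manual_override"
    entry.is_locked = True


def apply_candidate(entry: FieldEntry, verbose_name: str, provenance: str) -> bool:
    """Apply an automatic candidate unless the field is locked."""
    if entry.is_locked:
        return False  # the candidate can still be recorded, just not applied
    entry.verbose_name = verbose_name
    entry.provenance = provenance
    return True
```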

Progressive recovery invariant

Partial Superset recovery must preserve usable state:

  • imported filters may be partial,
  • unresolved variables may remain unmapped,
  • findings must explain what is still missing,
  • session remains resumable.

Clarification persistence invariant

Clarification answers must be persisted before:

  • finding severity is downgraded,
  • profile state is updated,
  • current question pointer advances.
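
The ordering requirement above can be illustrated by persisting first and only then touching derived state. This is a sketch with assumed names: `persist` stands for whatever storage call saves the answer, and `state` is a plain dict holding the counters and pending question ids. Finding and profile updates would follow the same ordering and are omitted.

```python
def record_answer(persist, state, question_id, answer_value):
    """Persist the answer, then advance clarification state."""
    # Persist first. If this raises, nothing below runs and state is untouched.
    persist(question_id, answer_value)

    state["resolved_count"] += 1
    state["remaining_count"] -= 1
    pending = state["pending_question_ids"]
    pending.remove(question_id)
    # Advance the pointer to the next pending question, if any.
    state["current_question_id"] = pending[0] if pending else None
```

If the storage call fails, the current question pointer still refers to the unanswered question, so the session can resume from the same place.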

Preview truth invariant

Compiled preview must be:

  • generated by Superset,
  • tied to the exact current effective inputs,
  • treated as invalid if mappings/values change afterward.

12. Migration & Evolution Strategy

  • Baseline: The initial implementation (Milestone 1) will include the core session and profile entities.
  • Incremental Growth: Subsequent milestones will add clarification, mapping, and launch audit entities via standard SQLAlchemy migrations.
  • Compatibility: The DatasetReviewSession aggregate root will remain the stable entry point for all sub-entities to ensure forward compatibility with saved user state.

13. Suggested Backend DTO Grouping

The future API and persistence layers should group models roughly as follows:

Session DTOs

  • SessionSummary
  • SessionDetail
  • SessionListItem

SessionSummary and SessionDetail should both surface the current version so frontend workspace state, collaborator actions, and assistant-driven mutations can use the same optimistic-lock boundary.

Review DTOs

  • DatasetProfileDto
  • ValidationFindingDto
  • ReadinessChecklistDto

Semantic DTOs

  • SemanticSourceDto
  • SemanticFieldEntryDto
  • SemanticCandidateDto

Clarification DTOs

  • ClarificationSessionDto
  • ClarificationQuestionDto
  • ClarificationAnswerRequest

Execution DTOs

  • ImportedFilterDto
  • TemplateVariableDto
  • ExecutionMappingDto
  • CompiledPreviewDto
  • LaunchSummaryDto

Export DTOs

  • ExportArtifactDto

14. Open Modeling Notes Resolved

The Phase 0 research questions are considered resolved for design purposes:

  • SQL preview is modeled as a first-class persisted artifact.
  • SQL Lab is modeled as the only canonical launch target.
  • semantic resolution and clarification are modeled as separate domain boundaries.
  • field-level overrides and mapping approvals are first-class entities.
  • session persistence is separate from task execution state.

This model is ready to drive: