Data Model: LLM Dataset Orchestration
Feature: LLM Dataset Orchestration
Branch: 027-dataset-llm-orchestration
Date: 2026-03-16
Overview
This document defines the domain entities, relationships, lifecycle states, and validation rules for the dataset review, semantic enrichment, clarification, preview, and launch workflow described in spec.md and grounded by the decisions in research.md.
The model is intentionally split into:
- session aggregate entities for resumable workflow state,
- semantic/provenance entities for enrichment and conflict handling,
- execution entities for mapping, preview, and launch audit,
- export projections for sharing outputs.
1. Core Aggregate: DatasetReviewSession
Entity: SessionCollaborator
| Field | Type | Required | Description |
|---|---|---|---|
| `user_id` | string | yes | Collaborating user ID |
| `role` | enum | yes | viewer, reviewer, approver |
| `added_at` | datetime | yes | When they were added |
Entity: DatasetReviewSession
Represents the top-level resumable workflow container for one dataset review/execution effort.
| Field | Type | Required | Description |
|---|---|---|---|
| `session_id` | string (UUID) | yes | Stable unique identifier for the review session |
| `user_id` | string | yes | Authenticated User ID of the session owner |
| `collaborators` | list[SessionCollaborator] | no | Shared access and roles |
| `environment_id` | string | yes | Superset environment context |
| `source_kind` | enum | yes | Origin kind: superset_link, dataset_selection |
| `source_input` | string | yes | Original link or selected dataset reference |
| `dataset_ref` | string | yes | Canonical dataset reference used by the feature |
| `dataset_id` | integer \| null | no | Superset dataset id when resolved |
| `dashboard_id` | integer \| null | no | Superset dashboard id if imported from dashboard link |
| `readiness_state` | enum | yes | Current workflow readiness state |
| `recommended_action` | enum | yes | Explicit next recommended action |
| `status` | enum | yes | Session lifecycle status |
| `current_phase` | enum | yes | Active workflow phase |
| `active_task_id` | string \| null | no | Linked long-running task if one is active |
| `last_preview_id` | string \| null | no | Most recent preview snapshot |
| `last_run_context_id` | string \| null | no | Most recent launch audit record |
| `created_at` | datetime | yes | Session creation timestamp |
| `updated_at` | datetime | yes | Last mutation timestamp |
| `last_activity_at` | datetime | yes | Last user/system activity timestamp |
| `closed_at` | datetime \| null | no | Terminal close/archive timestamp |
Validation rules
- `session_id` must be globally unique.
- `source_input` must be non-empty.
- `environment_id` must resolve to a configured environment.
- `readiness_state` and `recommended_action` must always be present.
- `user_id` ownership must be enforced for all mutations, unless collaborator roles allow otherwise.
- `dataset_id` becomes required before preview or launch phases.
- `last_preview_id` must refer to a preview generated from the same session.
Enums
SessionStatus
- `active`
- `paused`
- `completed`
- `archived`
- `cancelled`
SessionPhase
- `intake`
- `recovery`
- `review`
- `semantic_review`
- `clarification`
- `mapping_review`
- `preview`
- `launch`
- `post_run`
ReadinessState
- `empty`
- `importing`
- `review_ready`
- `semantic_source_review_needed`
- `clarification_needed`
- `clarification_active`
- `mapping_review_needed`
- `compiled_preview_ready`
- `partially_ready`
- `run_ready`
- `run_in_progress`
- `completed`
- `recovery_required`
RecommendedAction
- `import_from_superset`
- `review_documentation`
- `apply_semantic_source`
- `start_clarification`
- `answer_next_question`
- `approve_mapping`
- `generate_sql_preview`
- `complete_required_values`
- `launch_dataset`
- `resume_session`
- `export_outputs`
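A few of the session validation rules can be sketched as a plain check function. This is only an illustration, not the feature's implementation: `SessionDraft`, `validate_session`, and the `known_environments` set are hypothetical names, and only a subset of the fields and rules is covered.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Set


class SessionPhase(str, Enum):
    INTAKE = "intake"
    RECOVERY = "recovery"
    REVIEW = "review"
    SEMANTIC_REVIEW = "semantic_review"
    CLARIFICATION = "clarification"
    MAPPING_REVIEW = "mapping_review"
    PREVIEW = "preview"
    LAUNCH = "launch"
    POST_RUN = "post_run"


@dataclass
class SessionDraft:
    """Hypothetical subset of DatasetReviewSession used for validation."""
    session_id: str
    user_id: str
    environment_id: str
    source_input: str
    current_phase: SessionPhase
    dataset_id: Optional[int] = None


def validate_session(session: SessionDraft, known_environments: Set[str]) -> List[str]:
    """Return human-readable rule violations (empty list means valid)."""
    errors: List[str] = []
    if not session.source_input.strip():
        errors.append("source_input must be non-empty")
    if session.environment_id not in known_environments:
        errors.append("environment_id must resolve to a configured environment")
    # dataset_id becomes required once the session reaches preview or launch.
    needs_dataset = session.current_phase in (SessionPhase.PREVIEW, SessionPhase.LAUNCH)
    if needs_dataset and session.dataset_id is None:
        errors.append("dataset_id is required before preview or launch")
    return errors
```

Returning a list of violations rather than raising on the first one fits the resumable workflow, since a session may be saved while still incomplete.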
2. Dataset Profile and Review State
Entity: DatasetProfile
Consolidated interpretation of dataset meaning, semantics, filters, assumptions, and readiness.
| Field | Type | Required | Description |
|---|---|---|---|
| `profile_id` | string (UUID) | yes | Unique profile id |
| `session_id` | string | yes | Parent session |
| `dataset_name` | string | yes | Display dataset name |
| `schema_name` | string \| null | no | Schema if available |
| `database_name` | string \| null | no | Database if available |
| `business_summary` | text | yes | Human-readable summary |
| `business_summary_source` | enum | yes | Provenance of summary |
| `description` | text \| null | no | Dataset-level description |
| `dataset_type` | enum \| null | no | table, virtual, sqllab_view, unknown |
| `is_sqllab_view` | boolean | yes | Whether dataset is SQL Lab derived |
| `completeness_score` | number \| null | no | Optional normalized completeness score |
| `confidence_state` | enum | yes | Overall confidence posture |
| `has_blocking_findings` | boolean | yes | Derived summary flag |
| `has_warning_findings` | boolean | yes | Derived summary flag |
| `manual_summary_locked` | boolean | yes | Protects user-entered summary |
| `created_at` | datetime | yes | Created timestamp |
| `updated_at` | datetime | yes | Updated timestamp |
Validation rules
- `business_summary` must always contain a usable string; if weak, it may be skeletal but not null.
- `manual_summary_locked=true` prevents later automatic overwrite.
- `session_id` must be unique if only one active profile snapshot is stored per session, or versioned if snapshots are retained.
- `confidence_state` must reflect highest unresolved-risk posture, not just optimistic confidence.
BusinessSummarySource
- `confirmed`
- `imported`
- `inferred`
- `ai_draft`
- `manual_override`
ConfidenceState
- `confirmed`
- `mostly_confirmed`
- `mixed`
- `low_confidence`
- `unresolved`
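The "highest unresolved-risk posture" rule and the summary lock can be illustrated with a small sketch. It assumes that the `ConfidenceState` values are ordered from least to most risky in the order listed above; that ordering, and the helper names `overall_confidence` and `apply_generated_summary`, are assumptions made for illustration.

```python
from typing import Dict, Iterable

# Assumed risk ordering: later entries represent a riskier posture.
RISK_ORDER = ["confirmed", "mostly_confirmed", "mixed", "low_confidence", "unresolved"]


def overall_confidence(states: Iterable[str]) -> str:
    """Pick the riskiest state among the parts, never the most optimistic one."""
    states = list(states)
    if not states:
        return "unresolved"  # nothing known yet, so nothing is confirmed
    return max(states, key=RISK_ORDER.index)


def apply_generated_summary(profile: Dict, summary: str, source: str = "ai_draft") -> bool:
    """Overwrite the summary automatically unless the user has locked it."""
    if profile.get("manual_summary_locked"):
        return False  # manual text is protected from automatic overwrite
    if not summary.strip():
        return False  # business_summary must stay a usable string
    profile["business_summary"] = summary
    profile["business_summary_source"] = source
    return True
```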
3. Validation Findings
Entity: ValidationFinding
Represents a blocking issue, warning, or informational observation.
| Field | Type | Required | Description |
|---|---|---|---|
| `finding_id` | string (UUID) | yes | Unique finding id |
| `session_id` | string | yes | Parent session |
| `area` | enum | yes | Affected domain area |
| `severity` | enum | yes | blocking, warning, informational |
| `code` | string | yes | Stable machine-readable finding code |
| `title` | string | yes | Short label |
| `message` | text | yes | Actionable human-readable explanation |
| `resolution_state` | enum | yes | Current resolution status |
| `resolution_note` | text \| null | no | Optional explanation or approval note |
| `caused_by_ref` | string \| null | no | Related field/filter/mapping/question id |
| `created_at` | datetime | yes | Creation timestamp |
| `resolved_at` | datetime \| null | no | Resolution timestamp |
Validation rules
- `severity` must be one of the allowed values.
- `resolution_state=resolved` or `approved` requires either a system resolution event or user action.
- `launch` is blocked if any open `blocking` finding remains.
- `warning` findings tied to mapping transformations require explicit approval before launch if marked launch-sensitive.
FindingArea
- `source_intake`
- `dataset_profile`
- `semantic_enrichment`
- `clarification`
- `filter_recovery`
- `template_mapping`
- `compiled_preview`
- `launch`
- `audit`
ResolutionState
- `open`
- `resolved`
- `approved`
- `skipped`
- `deferred`
- `expert_review`
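The launch-blocking rules for findings could be checked roughly as follows. This is a sketch under assumptions: findings are plain dictionaries, `launch_sensitive` is a hypothetical flag (the table above does not define where that marking is stored), and only `resolution_state == "open"` counts as open.

```python
from typing import Dict, List


def finding_launch_blockers(findings: List[Dict]) -> List[str]:
    """Return the codes of findings that currently block launch."""
    blockers: List[str] = []
    for finding in findings:
        state = finding["resolution_state"]
        if finding["severity"] == "blocking" and state == "open":
            blockers.append(finding["code"])
        elif (
            finding["severity"] == "warning"
            and finding.get("launch_sensitive")  # hypothetical marker
            and state != "approved"
        ):
            # Launch-sensitive warnings need explicit approval first.
            blockers.append(finding["code"])
    return blockers
```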
4. Semantic Source and Field Decisions
Entity: SemanticSource
Represents a trusted or candidate source of semantic metadata.
| Field | Type | Required | Description |
|---|---|---|---|
| `source_id` | string (UUID) | yes | Unique source id |
| `session_id` | string | yes | Parent session |
| `source_type` | enum | yes | Origin kind |
| `source_ref` | string | yes | External reference, dataset ref, or uploaded artifact ref |
| `source_version` | string | yes | Version/Snapshot for propagation tracking |
| `display_name` | string | yes | Human-readable source name |
| `trust_level` | enum | yes | Source trust tier |
| `schema_overlap_score` | number \| null | no | Optional overlap signal |
| `status` | enum | yes | Availability/applicability status |
| `created_at` | datetime | yes | Creation timestamp |
SemanticSourceType
- `uploaded_file`
- `connected_dictionary`
- `reference_dataset`
- `neighbor_dataset`
- `ai_generated`
TrustLevel
- `trusted`
- `recommended`
- `candidate`
- `generated`
SemanticSourceStatus
- `available`
- `selected`
- `applied`
- `rejected`
- `partial`
- `failed`
Entity: SemanticFieldEntry
Canonical semantic state for one dataset field or metric.
| Field | Type | Required | Description |
|---|---|---|---|
| `field_id` | string (UUID) | yes | Unique field semantic id |
| `session_id` | string | yes | Parent session |
| `field_name` | string | yes | Physical field/metric name |
| `field_kind` | enum | yes | column, metric, filter_dimension, parameter |
| `verbose_name` | string \| null | no | Display label |
| `description` | text \| null | no | Human-readable description |
| `display_format` | string \| null | no | Formatting metadata such as d3 format |
| `provenance` | enum | yes | Final chosen source class |
| `source_id` | string \| null | no | Winning source |
| `confidence_rank` | integer \| null | no | Final applied ranking |
| `is_locked` | boolean | yes | Manual override protection |
| `has_conflict` | boolean | yes | Whether competing candidates exist |
| `needs_review` | boolean | yes | Whether user review is still needed |
| `last_changed_by` | enum | yes | system, user, agent |
| `user_feedback` | enum | no | User feedback: up, down, null |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |
Validation rules
- `field_name` must be unique per `session_id + field_kind`.
- `is_locked=true` prevents automatic overwrite.
- `provenance=manual_override` implies `is_locked=true`.
- `has_conflict=true` requires at least one competing candidate record.
- Fuzzy/applied inferred values must keep `needs_review=true` until confirmed if policy requires explicit review.
FieldKind
- `column`
- `metric`
- `filter_dimension`
- `parameter`
FieldProvenance
- `dictionary_exact`
- `reference_imported`
- `fuzzy_inferred`
- `ai_generated`
- `manual_override`
- `unresolved`
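The lock and review rules for `SemanticFieldEntry` might look like this in code. It is a minimal sketch: fields are plain dictionaries, the helper names are hypothetical, and treating both `fuzzy_inferred` and `ai_generated` as review-requiring is an assumed policy.

```python
from typing import Dict

# Assumed policy: these provenance classes need explicit user review.
REVIEW_REQUIRED = {"fuzzy_inferred", "ai_generated"}


def apply_automatic_value(field: Dict, verbose_name: str, provenance: str) -> bool:
    """Apply a system-chosen value unless the field is locked."""
    if field["is_locked"]:
        return False  # is_locked=true prevents automatic overwrite
    field["verbose_name"] = verbose_name
    field["provenance"] = provenance
    field["needs_review"] = provenance in REVIEW_REQUIRED
    field["last_changed_by"] = "system"
    return True


def apply_manual_override(field: Dict, verbose_name: str) -> None:
    """A user override always wins and locks the field."""
    field["verbose_name"] = verbose_name
    field["provenance"] = "manual_override"
    field["is_locked"] = True  # manual_override implies is_locked=true
    field["needs_review"] = False
    field["last_changed_by"] = "user"
```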
Entity: SemanticCandidate
Stores competing candidate values before or alongside final field decision.
| Field | Type | Required | Description |
|---|---|---|---|
| `candidate_id` | string (UUID) | yes | Unique candidate id |
| `field_id` | string | yes | Parent semantic field |
| `source_id` | string \| null | no | Candidate source |
| `candidate_rank` | integer | yes | Lower is stronger |
| `match_type` | enum | yes | exact, reference, fuzzy, generated |
| `confidence_score` | number | yes | Normalized score |
| `proposed_verbose_name` | string \| null | no | Candidate verbose name |
| `proposed_description` | text \| null | no | Candidate description |
| `proposed_display_format` | string \| null | no | Candidate display format |
| `status` | enum | yes | Candidate lifecycle |
| `created_at` | datetime | yes | Creation timestamp |
CandidateMatchType
- `exact`
- `reference`
- `fuzzy`
- `generated`
CandidateStatus
- `proposed`
- `accepted`
- `rejected`
- `superseded`
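Since a lower `candidate_rank` is stronger, selecting a winning candidate could be sketched as below. The tie-break on `confidence_score` and the restriction to `proposed` candidates are assumptions; the helper name `pick_winner` is hypothetical.

```python
from typing import Dict, List, Optional


def pick_winner(candidates: List[Dict]) -> Optional[Dict]:
    """Return the strongest proposed candidate, or None if there is none."""
    live = [c for c in candidates if c["status"] == "proposed"]
    if not live:
        return None
    # Lower rank wins; a higher confidence score breaks ties (assumed policy).
    return min(live, key=lambda c: (c["candidate_rank"], -c["confidence_score"]))
```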
5. Imported Filters and Runtime Variables
Entity: ImportedFilter
Represents one recovered or user-supplied filter value.
| Field | Type | Required | Description |
|---|---|---|---|
| `filter_id` | string (UUID) | yes | Unique filter id |
| `session_id` | string | yes | Parent session |
| `filter_name` | string | yes | Source filter name |
| `display_name` | string \| null | no | User-facing label |
| `raw_value` | json | yes | Original recovered value |
| `normalized_value` | json \| null | no | Optional transformed value |
| `source` | enum | yes | Origin of the filter |
| `confidence_state` | enum | yes | Confidence/provenance class |
| `requires_confirmation` | boolean | yes | Whether explicit review is needed |
| `recovery_status` | enum | yes | Recovery completeness |
| `notes` | text \| null | no | Recovery explanation |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |
FilterSource
- `superset_native`
- `superset_url`
- `manual`
- `inferred`
FilterConfidenceState
- `confirmed`
- `imported`
- `inferred`
- `ai_draft`
- `unresolved`
FilterRecoveryStatus
- `recovered`
- `partial`
- `missing`
- `conflicted`
Entity: TemplateVariable
Represents a runtime variable discovered from dataset execution logic.
| Field | Type | Required | Description |
|---|---|---|---|
| `variable_id` | string (UUID) | yes | Unique variable id |
| `session_id` | string | yes | Parent session |
| `variable_name` | string | yes | Canonical runtime variable name |
| `expression_source` | text | yes | Raw expression or snippet where variable was found |
| `variable_kind` | enum | yes | Detected variable class |
| `is_required` | boolean | yes | Whether launch requires a mapped value |
| `default_value` | json \| null | no | Optional default |
| `mapping_status` | enum | yes | Current mapping state |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |
VariableKind
- `native_filter`
- `parameter`
- `derived`
- `unknown`
MappingStatus
- `unmapped`
- `proposed`
- `approved`
- `overridden`
- `invalid`
6. Mapping Review and Warning Approvals
Entity: ExecutionMapping
Represents the mapping between a recovered filter and a runtime variable.
| Field | Type | Required | Description |
|---|---|---|---|
| `mapping_id` | string (UUID) | yes | Unique mapping id |
| `session_id` | string | yes | Parent session |
| `filter_id` | string | yes | Source imported filter |
| `variable_id` | string | yes | Target template variable |
| `mapping_method` | enum | yes | How mapping was produced |
| `raw_input_value` | json | yes | Original input |
| `effective_value` | json \| null | no | Value to send to preview/launch |
| `transformation_note` | text \| null | no | Explanation of normalization |
| `warning_level` | enum \| null | no | Warning classification if transformation is risky |
| `requires_explicit_approval` | boolean | yes | Whether launch gate applies |
| `approval_state` | enum | yes | Approval lifecycle |
| `approved_by_user_id` | string \| null | no | Approver if approved |
| `approved_at` | datetime \| null | no | Approval timestamp |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |
Validation rules
- `filter_id + variable_id` must be unique per session unless versioning is used.
- `requires_explicit_approval=true` implies launch is blocked while `approval_state != approved`.
- `effective_value` is required before preview when variable is required.
- user override should set `mapping_method=manual_override`.
MappingMethod
- `direct_match`
- `heuristic_match`
- `semantic_match`
- `manual_override`
MappingWarningLevel
- `low`
- `medium`
- `high`
ApprovalState
- `pending`
- `approved`
- `rejected`
- `not_required`
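The two mapping gates (values needed before preview, approvals needed before launch) can be sketched as two small checks. Dictionaries stand in for the entities, the function names are hypothetical, and `default_value` is deliberately ignored here because the rules above do not say whether a default satisfies the requirement.

```python
from typing import Dict, List


def missing_required_values(variables: List[Dict], mappings: List[Dict]) -> List[str]:
    """Names of required variables that have no effective value yet."""
    by_variable = {m["variable_id"]: m for m in mappings}
    missing: List[str] = []
    for variable in variables:
        if not variable["is_required"]:
            continue
        mapping = by_variable.get(variable["variable_id"])
        if mapping is None or mapping.get("effective_value") is None:
            missing.append(variable["variable_name"])
    return missing


def unapproved_mappings(mappings: List[Dict]) -> List[str]:
    """Ids of mappings whose approval gate still blocks launch."""
    return [
        m["mapping_id"]
        for m in mappings
        if m["requires_explicit_approval"] and m["approval_state"] != "approved"
    ]
```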
7. Clarification Workflow
Entity: ClarificationSession
Stores resumable clarification flow state for one review session.
| Field | Type | Required | Description |
|---|---|---|---|
| `clarification_session_id` | string (UUID) | yes | Unique clarification session id |
| `session_id` | string | yes | Parent review session |
| `status` | enum | yes | Clarification lifecycle |
| `current_question_id` | string \| null | no | Current active question |
| `resolved_count` | integer | yes | Count of answered/resolved items |
| `remaining_count` | integer | yes | Count of unresolved items |
| `summary_delta` | text \| null | no | Human-readable change summary |
| `started_at` | datetime | yes | Start time |
| `updated_at` | datetime | yes | Last update |
| `completed_at` | datetime \| null | no | End time |
ClarificationStatus
- `pending`
- `active`
- `paused`
- `completed`
- `cancelled`
Entity: ClarificationQuestion
Represents one focused question in the clarification flow.
| Field | Type | Required | Description |
|---|---|---|---|
| `question_id` | string (UUID) | yes | Unique question id |
| `clarification_session_id` | string | yes | Parent clarification session |
| `topic_ref` | string | yes | Related field/finding/mapping id |
| `question_text` | text | yes | Focused question |
| `why_it_matters` | text | yes | Business significance explanation |
| `current_guess` | text \| null | no | Best guess if available |
| `priority` | integer | yes | Order score |
| `state` | enum | yes | Question lifecycle |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |
QuestionState
- `open`
- `answered`
- `skipped`
- `expert_review`
- `superseded`
Entity: ClarificationOption
Suggested selectable answer option for a question.
| Field | Type | Required | Description |
|---|---|---|---|
| `option_id` | string (UUID) | yes | Unique option id |
| `question_id` | string | yes | Parent question |
| `label` | string | yes | UI label |
| `value` | string | yes | Stored answer payload |
| `is_recommended` | boolean | yes | Whether this is the recommended option |
| `display_order` | integer | yes | UI ordering |
Entity: ClarificationAnswer
Stores user response to one clarification question.
| Field | Type | Required | Description |
|---|---|---|---|
| `answer_id` | string (UUID) | yes | Unique answer id |
| `question_id` | string | yes | Parent question |
| `answer_kind` | enum | yes | How user responded |
| `answer_value` | text \| null | no | Selected/custom answer |
| `answered_by_user_id` | string | yes | Responding user |
| `impact_summary` | text \| null | no | Optional summary of resulting state changes |
| `created_at` | datetime | yes | Answer timestamp |
AnswerKind
- `selected`
- `custom`
- `skipped`
- `expert_review`
Validation rules
- Each active question may have at most one current answer.
- `custom` answers require non-empty `answer_value`.
- `selected` answers must correspond to a valid option or normalized payload.
- `expert_review` leaves the related topic unresolved but marked intentionally deferred.
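The answer validation rules above might be expressed as a single check. This sketch simplifies the "valid option or normalized payload" rule to a membership test against the question's option values; `validate_answer` and its parameters are hypothetical names.

```python
from typing import List, Optional

ANSWER_KINDS = {"selected", "custom", "skipped", "expert_review"}


def validate_answer(
    answer_kind: str, answer_value: Optional[str], option_values: List[str]
) -> List[str]:
    """Return rule violations for one clarification answer."""
    errors: List[str] = []
    if answer_kind not in ANSWER_KINDS:
        errors.append(f"unknown answer_kind: {answer_kind}")
    elif answer_kind == "custom" and not (answer_value or "").strip():
        errors.append("custom answers require a non-empty answer_value")
    elif answer_kind == "selected" and answer_value not in option_values:
        # Simplification: normalized payloads are not considered here.
        errors.append("selected answers must match a valid option")
    return errors
```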
8. Preview and Launch Audit
Entity: CompiledPreview
Stores the exact Superset-returned compiled SQL preview.
| Field | Type | Required | Description |
|---|---|---|---|
| `preview_id` | string (UUID) | yes | Unique preview id |
| `session_id` | string | yes | Parent session |
| `preview_status` | enum | yes | Preview lifecycle state |
| `compiled_sql` | text \| null | no | Exact compiled SQL if successful |
| `preview_fingerprint` | string | yes | Snapshot hash of mapping/inputs used |
| `compiled_by` | enum | yes | Must be superset |
| `error_code` | string \| null | no | Optional failure code |
| `error_details` | text \| null | no | Readable preview error |
| `compiled_at` | datetime \| null | no | Successful compile timestamp |
| `created_at` | datetime | yes | Record creation timestamp |
Validation rules
- `compiled_by` must be `superset`.
- `compiled_sql` is required when `preview_status=ready`.
- `compiled_sql` must be null when `preview_status=failed` unless partial diagnostics are intentionally stored elsewhere.
- `preview_fingerprint` must be compared against current session inputs before launch.
- Launch requires `preview_status=ready` and matching current fingerprint.
PreviewStatus
- `pending`
- `ready`
- `failed`
- `stale`
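One way to compute and compare a preview fingerprint is to hash a canonical JSON serialization of the inputs. The sketch below assumes the fingerprint covers exactly the effective filters and template parameters and uses SHA-256; the document does not specify either choice, and the function names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def preview_fingerprint(
    effective_filters: Dict[str, Any], template_params: Dict[str, Any]
) -> str:
    """Hash the preview inputs in a key-order-independent way."""
    payload = json.dumps(
        {"effective_filters": effective_filters, "template_params": template_params},
        sort_keys=True,  # same inputs in a different key order hash identically
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def preview_allows_launch(preview: Dict[str, Any], current_fingerprint: str) -> bool:
    """Launch requires a ready preview whose fingerprint matches current inputs."""
    return (
        preview["preview_status"] == "ready"
        and preview["preview_fingerprint"] == current_fingerprint
    )
```

Any change to a mapping or value produces a different fingerprint, so a stored preview can be detected as stale by recomputing the hash from the current session inputs.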
Entity: DatasetRunContext
Audited execution snapshot created at launch.
| Field | Type | Required | Description |
|---|---|---|---|
| `run_context_id` | string (UUID) | yes | Unique run context id |
| `session_id` | string | yes | Parent review session |
| `dataset_ref` | string | yes | Canonical dataset identity |
| `environment_id` | string | yes | Execution environment |
| `preview_id` | string | yes | Bound compiled preview |
| `sql_lab_session_ref` | string | yes | Canonical SQL Lab reference |
| `effective_filters` | json | yes | Final filter payload |
| `template_params` | json | yes | Final template parameter object |
| `approved_mapping_ids` | json array | yes | Explicit approvals used for launch |
| `semantic_decision_refs` | json array | yes | Applied semantic decision references |
| `open_warning_refs` | json array | yes | Warnings that remained visible at launch |
| `launch_status` | enum | yes | Launch outcome |
| `launch_error` | text \| null | no | Error if launch failed |
| `created_at` | datetime | yes | Launch record timestamp |
Validation rules
- `preview_id` must reference a `CompiledPreview` with `ready` status.
- `sql_lab_session_ref` is mandatory for successful launch.
- `effective_filters` and `template_params` must match the preview fingerprint used.
- `launch_status=started` or `success` requires a non-empty SQL Lab reference.
LaunchStatus
- `started`
- `success`
- `failed`
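The run-context rules can be illustrated with a check that takes the run context and its bound preview as dictionaries. This is a sketch only: `validate_run_context` is a hypothetical helper, and the fingerprint comparison is left out because it depends on how the fingerprint is computed.

```python
from typing import Dict, List


def validate_run_context(ctx: Dict, preview: Dict) -> List[str]:
    """Return rule violations for a DatasetRunContext record."""
    errors: List[str] = []
    if ctx["preview_id"] != preview["preview_id"]:
        errors.append("preview_id does not reference the given preview")
    elif preview["preview_status"] != "ready":
        errors.append("bound preview must have ready status")
    # started/success launches must carry a non-empty SQL Lab reference.
    if ctx["launch_status"] in ("started", "success") and not ctx.get("sql_lab_session_ref"):
        errors.append("sql_lab_session_ref is required for started/success launches")
    return errors
```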
9. Export Projections
Entity: ExportArtifact
Tracks generated exports for sharing documentation and validation outputs.
| Field | Type | Required | Description |
|---|---|---|---|
| `artifact_id` | string (UUID) | yes | Unique artifact id |
| `session_id` | string | yes | Parent session |
| `artifact_type` | enum | yes | Export type |
| `format` | enum | yes | File/output format |
| `storage_ref` | string | yes | Storage/file reference |
| `created_by_user_id` | string | yes | Requesting user |
| `created_at` | datetime | yes | Artifact creation time |
ArtifactType
- `documentation`
- `validation_report`
- `run_summary`
ArtifactFormat
- `json`
- `markdown`
- `csv`
- `pdf`
10. Relationships
One-to-one / aggregate-root relationships
- `DatasetReviewSession` → `DatasetProfile` (current active profile view)
- `DatasetReviewSession` → `ClarificationSession` (current or latest)
- `DatasetReviewSession` → `CompiledPreview` (latest/current preview)
- `DatasetReviewSession` → `DatasetRunContext` (latest/current launch audit)
One-to-many relationships
- `DatasetReviewSession` → many `ValidationFinding`
- `DatasetReviewSession` → many `SemanticSource`
- `DatasetReviewSession` → many `SemanticFieldEntry`
- `SemanticFieldEntry` → many `SemanticCandidate`
- `DatasetReviewSession` → many `ImportedFilter`
- `DatasetReviewSession` → many `TemplateVariable`
- `DatasetReviewSession` → many `ExecutionMapping`
- `ClarificationSession` → many `ClarificationQuestion`
- `ClarificationQuestion` → many `ClarificationOption`
- `ClarificationQuestion` → zero/one current `ClarificationAnswer`
- `DatasetReviewSession` → many `ExportArtifact`
- `DatasetReviewSession` → many `SessionEvent`
11. Derived Rules and Invariants
Run readiness invariant
A session is run_ready only if:
- no open blocking findings remain,
- all required template variables have approved/effective mappings,
- all launch-sensitive mapping warnings have been explicitly approved,
- a non-stale `CompiledPreview` exists for the current fingerprint.
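The four conditions above can be combined into one readiness check. The following is a sketch under assumptions: entities are plain dictionaries, `readiness_gaps` is a hypothetical helper, and "approved/effective mappings" is read as "a mapping with a non-null `effective_value` exists".

```python
from typing import Dict, List, Optional


def readiness_gaps(
    findings: List[Dict],
    variables: List[Dict],
    mappings: List[Dict],
    preview: Optional[Dict],
    current_fingerprint: str,
) -> List[str]:
    """Return the reasons a session is not run_ready (empty list means ready)."""
    gaps: List[str] = []
    if any(f["severity"] == "blocking" and f["resolution_state"] == "open" for f in findings):
        gaps.append("open blocking findings")
    mapped = {m["variable_id"] for m in mappings if m.get("effective_value") is not None}
    if any(v["is_required"] and v["variable_id"] not in mapped for v in variables):
        gaps.append("required variables without effective values")
    if any(m["requires_explicit_approval"] and m["approval_state"] != "approved" for m in mappings):
        gaps.append("unapproved launch-sensitive mappings")
    if (
        preview is None
        or preview["preview_status"] != "ready"
        or preview["preview_fingerprint"] != current_fingerprint
    ):
        gaps.append("no current compiled preview")
    return gaps
```

Returning the list of gaps rather than a boolean makes it straightforward to derive both `readiness_state` and a `recommended_action` from the same evaluation.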
Manual intent invariant
If a field is manually overridden:
- `SemanticFieldEntry.is_locked = true`
- `SemanticFieldEntry.provenance = manual_override`
- later imports or inferred candidates may be recorded, but cannot replace the active value automatically.
Progressive recovery invariant
Partial Superset recovery must preserve usable state:
- imported filters may be `partial`,
- unresolved variables may remain `unmapped`,
- findings must explain what is still missing,
- session remains resumable.
Clarification persistence invariant
Clarification answers must be persisted before:
- finding severity is downgraded,
- profile state is updated,
- current question pointer advances.
Preview truth invariant
Compiled preview must be:
- generated by Superset,
- tied to the exact current effective inputs,
- treated as invalid if mappings/values change afterward.
12. Migration & Evolution Strategy
- Baseline: The initial implementation (Milestone 1) will include the core session and profile entities.
- Incremental Growth: Subsequent milestones will add clarification, mapping, and launch audit entities via standard SQLAlchemy migrations.
- Compatibility: The `DatasetReviewSession` aggregate root will remain the stable entry point for all sub-entities to ensure forward compatibility with saved user state.
13. Suggested Backend DTO Grouping
The future API and persistence layers should group models roughly as follows:
Session DTOs
- `SessionSummary`
- `SessionDetail`
- `SessionListItem`
Review DTOs
- `DatasetProfileDto`
- `ValidationFindingDto`
- `ReadinessChecklistDto`
Semantic DTOs
- `SemanticSourceDto`
- `SemanticFieldEntryDto`
- `SemanticCandidateDto`
Clarification DTOs
- `ClarificationSessionDto`
- `ClarificationQuestionDto`
- `ClarificationAnswerRequest`
Execution DTOs
- `ImportedFilterDto`
- `TemplateVariableDto`
- `ExecutionMappingDto`
- `CompiledPreviewDto`
- `LaunchSummaryDto`
Export DTOs
- `ExportArtifactDto`
14. Open Modeling Notes Resolved
The Phase 0 research questions are considered resolved for design purposes:
- SQL preview is modeled as a first-class persisted artifact.
- SQL Lab is modeled as the only canonical launch target.
- semantic resolution and clarification are modeled as separate domain boundaries.
- field-level overrides and mapping approvals are first-class entities.
- session persistence is separate from task execution state.
This model is ready to drive: