# Data Model: LLM Dataset Orchestration

**Feature**: [LLM Dataset Orchestration](./spec.md)
**Branch**: `027-dataset-llm-orchestration`
**Date**: 2026-03-16

## Overview

This document defines the domain entities, relationships, lifecycle states, and validation rules for the dataset review, semantic enrichment, clarification, preview, and launch workflow described in [`spec.md`](./spec.md) and grounded by the decisions in [`research.md`](./research.md).

The model is intentionally split into:

- **session aggregate** entities for resumable workflow state,
- **semantic/provenance** entities for enrichment and conflict handling,
- **execution** entities for mapping, preview, and launch audit,
- **export** projections for sharing outputs.

---

## 1. Core Aggregate: DatasetReviewSession

### Entity: `SessionCollaborator`

| Field | Type | Required | Description |
|---|---|---:|---|
| `user_id` | string | yes | Collaborating user ID |
| `role` | enum | yes | `viewer`, `reviewer`, `approver` |
| `added_at` | datetime | yes | When they were added |

### Entity: `DatasetReviewSession`

Represents the top-level resumable workflow container for one dataset review/execution effort.

| Field | Type | Required | Description |
|---|---|---:|---|
| `session_id` | string (UUID) | yes | Stable unique identifier for the review session |
| `user_id` | string | yes | Authenticated User ID of the session owner |
| `collaborators` | list[SessionCollaborator] | no | Shared access and roles |
| `environment_id` | string | yes | Superset environment context |
| `source_kind` | enum | yes | Origin kind: `superset_link`, `dataset_selection` |
| `source_input` | string | yes | Original link or selected dataset reference |
| `dataset_ref` | string | yes | Canonical dataset reference used by the feature |
| `dataset_id` | integer \| null | no | Superset dataset id when resolved |
| `dashboard_id` | integer \| null | no | Superset dashboard id if imported from dashboard link |
| `readiness_state` | enum | yes | Current workflow readiness state |
| `recommended_action` | enum | yes | Explicit next recommended action |
| `status` | enum | yes | Session lifecycle status |
| `current_phase` | enum | yes | Active workflow phase |
| `active_task_id` | string \| null | no | Linked long-running task if one is active |
| `last_preview_id` | string \| null | no | Most recent preview snapshot |
| `last_run_context_id` | string \| null | no | Most recent launch audit record |
| `created_at` | datetime | yes | Session creation timestamp |
| `updated_at` | datetime | yes | Last mutation timestamp |
| `last_activity_at` | datetime | yes | Last user/system activity timestamp |
| `closed_at` | datetime \| null | no | Terminal close/archive timestamp |

### Validation rules

- `session_id` must be globally unique.
- `source_input` must be non-empty.
- `environment_id` must resolve to a configured environment.
- `readiness_state` and `recommended_action` must always be present.
- `user_id` ownership must be enforced for all mutations, unless collaborator roles allow otherwise.
- `dataset_id` becomes required before preview or launch phases.
- `last_preview_id` must refer to a preview generated from the same session.

### Enums

#### `SessionStatus`

- `active`
- `paused`
- `completed`
- `archived`
- `cancelled`

#### `SessionPhase`

- `intake`
- `recovery`
- `review`
- `semantic_review`
- `clarification`
- `mapping_review`
- `preview`
- `launch`
- `post_run`

#### `ReadinessState`

- `empty`
- `importing`
- `review_ready`
- `semantic_source_review_needed`
- `clarification_needed`
- `clarification_active`
- `mapping_review_needed`
- `compiled_preview_ready`
- `partially_ready`
- `run_ready`
- `run_in_progress`
- `completed`
- `recovery_required`

#### `RecommendedAction`

- `import_from_superset`
- `review_documentation`
- `apply_semantic_source`
- `start_clarification`
- `answer_next_question`
- `approve_mapping`
- `generate_sql_preview`
- `complete_required_values`
- `launch_dataset`
- `resume_session`
- `export_outputs`

---

## 2. Dataset Profile and Review State

### Entity: `DatasetProfile`

Consolidated interpretation of dataset meaning, semantics, filters, assumptions, and readiness.

| Field | Type | Required | Description |
|---|---|---:|---|
| `profile_id` | string (UUID) | yes | Unique profile id |
| `session_id` | string | yes | Parent session |
| `dataset_name` | string | yes | Display dataset name |
| `schema_name` | string \| null | no | Schema if available |
| `database_name` | string \| null | no | Database if available |
| `business_summary` | text | yes | Human-readable summary |
| `business_summary_source` | enum | yes | Provenance of summary |
| `description` | text \| null | no | Dataset-level description |
| `dataset_type` | enum \| null | no | `table`, `virtual`, `sqllab_view`, `unknown` |
| `is_sqllab_view` | boolean | yes | Whether dataset is SQL Lab derived |
| `completeness_score` | number \| null | no | Optional normalized completeness score |
| `confidence_state` | enum | yes | Overall confidence posture |
| `has_blocking_findings` | boolean | yes | Derived summary flag |
| `has_warning_findings` | boolean | yes | Derived summary flag |
| `manual_summary_locked` | boolean | yes | Protects user-entered summary |
| `created_at` | datetime | yes | Created timestamp |
| `updated_at` | datetime | yes | Updated timestamp |

### Validation rules

- `business_summary` must always contain a usable string; if weak, it may be skeletal but not null.
- `manual_summary_locked=true` prevents later automatic overwrite.
- `session_id` must be unique if only one active profile snapshot is stored per session, or versioned if snapshots are retained.
- `confidence_state` must reflect highest unresolved-risk posture, not just optimistic confidence.

#### `BusinessSummarySource`

- `confirmed`
- `imported`
- `inferred`
- `ai_draft`
- `manual_override`

#### `ConfidenceState`

- `confirmed`
- `mostly_confirmed`
- `mixed`
- `low_confidence`
- `unresolved`

---

## 3. Validation Findings

### Entity: `ValidationFinding`

Represents a blocking issue, warning, or informational observation.

| Field | Type | Required | Description |
|---|---|---:|---|
| `finding_id` | string (UUID) | yes | Unique finding id |
| `session_id` | string | yes | Parent session |
| `area` | enum | yes | Affected domain area |
| `severity` | enum | yes | `blocking`, `warning`, `informational` |
| `code` | string | yes | Stable machine-readable finding code |
| `title` | string | yes | Short label |
| `message` | text | yes | Actionable human-readable explanation |
| `resolution_state` | enum | yes | Current resolution status |
| `resolution_note` | text \| null | no | Optional explanation or approval note |
| `caused_by_ref` | string \| null | no | Related field/filter/mapping/question id |
| `created_at` | datetime | yes | Creation timestamp |
| `resolved_at` | datetime \| null | no | Resolution timestamp |

### Validation rules

- `severity` must be one of the allowed values.
- `resolution_state=resolved` or `approved` requires either a system resolution event or user action.
- `launch` is blocked if any open `blocking` finding remains.
- `warning` findings tied to mapping transformations require explicit approval before launch if marked launch-sensitive.

#### `FindingArea`

- `source_intake`
- `dataset_profile`
- `semantic_enrichment`
- `clarification`
- `filter_recovery`
- `template_mapping`
- `compiled_preview`
- `launch`
- `audit`

#### `ResolutionState`

- `open`
- `resolved`
- `approved`
- `skipped`
- `deferred`
- `expert_review`

---

## 4. Semantic Source and Field Decisions

### Entity: `SemanticSource`

Represents a trusted or candidate source of semantic metadata.

| Field | Type | Required | Description |
|---|---|---:|---|
| `source_id` | string (UUID) | yes | Unique source id |
| `session_id` | string | yes | Parent session |
| `source_type` | enum | yes | Origin kind |
| `source_ref` | string | yes | External reference, dataset ref, or uploaded artifact ref |
| `source_version` | string | yes | Version/Snapshot for propagation tracking |
| `display_name` | string | yes | Human-readable source name |
| `trust_level` | enum | yes | Source trust tier |
| `schema_overlap_score` | number \| null | no | Optional overlap signal |
| `status` | enum | yes | Availability/applicability status |
| `created_at` | datetime | yes | Creation timestamp |

#### `SemanticSourceType`

- `uploaded_file`
- `connected_dictionary`
- `reference_dataset`
- `neighbor_dataset`
- `ai_generated`

#### `TrustLevel`

- `trusted`
- `recommended`
- `candidate`
- `generated`

#### `SemanticSourceStatus`

- `available`
- `selected`
- `applied`
- `rejected`
- `partial`
- `failed`

---

### Entity: `SemanticFieldEntry`

Canonical semantic state for one dataset field or metric.

| Field | Type | Required | Description |
|---|---|---:|---|
| `field_id` | string (UUID) | yes | Unique field semantic id |
| `session_id` | string | yes | Parent session |
| `field_name` | string | yes | Physical field/metric name |
| `field_kind` | enum | yes | `column`, `metric`, `filter_dimension`, `parameter` |
| `verbose_name` | string \| null | no | Display label |
| `description` | text \| null | no | Human-readable description |
| `display_format` | string \| null | no | Formatting metadata such as d3 format |
| `provenance` | enum | yes | Final chosen source class |
| `source_id` | string \| null | no | Winning source |
| `confidence_rank` | integer \| null | no | Final applied ranking |
| `is_locked` | boolean | yes | Manual override protection |
| `has_conflict` | boolean | yes | Whether competing candidates exist |
| `needs_review` | boolean | yes | Whether user review is still needed |
| `last_changed_by` | enum | yes | `system`, `user`, `agent` |
| `user_feedback` | enum \| null | no | User feedback: `up` or `down`; null when no feedback was given |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |

### Validation rules

- `field_name` must be unique per `session_id + field_kind`.
- `is_locked=true` prevents automatic overwrite.
- `provenance=manual_override` implies `is_locked=true`.
- `has_conflict=true` requires at least one competing candidate record.
- Applied fuzzy or inferred values must keep `needs_review=true` until confirmed when policy requires explicit review.

#### `FieldKind`

- `column`
- `metric`
- `filter_dimension`
- `parameter`

#### `FieldProvenance`

- `dictionary_exact`
- `reference_imported`
- `fuzzy_inferred`
- `ai_generated`
- `manual_override`
- `unresolved`

---

### Entity: `SemanticCandidate`

Stores competing candidate values before or alongside final field decision.

| Field | Type | Required | Description |
|---|---|---:|---|
| `candidate_id` | string (UUID) | yes | Unique candidate id |
| `field_id` | string | yes | Parent semantic field |
| `source_id` | string \| null | no | Candidate source |
| `candidate_rank` | integer | yes | Lower is stronger |
| `match_type` | enum | yes | Exact, imported, fuzzy, generated |
| `confidence_score` | number | yes | Normalized score |
| `proposed_verbose_name` | string \| null | no | Candidate verbose name |
| `proposed_description` | text \| null | no | Candidate description |
| `proposed_display_format` | string \| null | no | Candidate display format |
| `status` | enum | yes | Candidate lifecycle |
| `created_at` | datetime | yes | Creation timestamp |

#### `CandidateMatchType`

- `exact`
- `reference`
- `fuzzy`
- `generated`

#### `CandidateStatus`

- `proposed`
- `accepted`
- `rejected`
- `superseded`

---

## 5. Imported Filters and Runtime Variables

### Entity: `ImportedFilter`

Represents one recovered or user-supplied filter value.

| Field | Type | Required | Description |
|---|---|---:|---|
| `filter_id` | string (UUID) | yes | Unique filter id |
| `session_id` | string | yes | Parent session |
| `filter_name` | string | yes | Source filter name |
| `display_name` | string \| null | no | User-facing label |
| `raw_value` | json | yes | Original recovered value |
| `normalized_value` | json \| null | no | Optional transformed value |
| `source` | enum | yes | Origin of the filter |
| `confidence_state` | enum | yes | Confidence/provenance class |
| `requires_confirmation` | boolean | yes | Whether explicit review is needed |
| `recovery_status` | enum | yes | Recovery completeness |
| `notes` | text \| null | no | Recovery explanation |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |

#### `FilterSource`

- `superset_native`
- `superset_url`
- `manual`
- `inferred`

#### `FilterConfidenceState`

- `confirmed`
- `imported`
- `inferred`
- `ai_draft`
- `unresolved`

#### `FilterRecoveryStatus`

- `recovered`
- `partial`
- `missing`
- `conflicted`

---

### Entity: `TemplateVariable`

Represents a runtime variable discovered from dataset execution logic.

| Field | Type | Required | Description |
|---|---|---:|---|
| `variable_id` | string (UUID) | yes | Unique variable id |
| `session_id` | string | yes | Parent session |
| `variable_name` | string | yes | Canonical runtime variable name |
| `expression_source` | text | yes | Raw expression or snippet where variable was found |
| `variable_kind` | enum | yes | Detected variable class |
| `is_required` | boolean | yes | Whether launch requires a mapped value |
| `default_value` | json \| null | no | Optional default |
| `mapping_status` | enum | yes | Current mapping state |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |

#### `VariableKind`

- `native_filter`
- `parameter`
- `derived`
- `unknown`

#### `MappingStatus`

- `unmapped`
- `proposed`
- `approved`
- `overridden`
- `invalid`

---

## 6. Mapping Review and Warning Approvals

### Entity: `ExecutionMapping`

Represents the mapping between a recovered filter and a runtime variable.

| Field | Type | Required | Description |
|---|---|---:|---|
| `mapping_id` | string (UUID) | yes | Unique mapping id |
| `session_id` | string | yes | Parent session |
| `filter_id` | string | yes | Source imported filter |
| `variable_id` | string | yes | Target template variable |
| `mapping_method` | enum | yes | How mapping was produced |
| `raw_input_value` | json | yes | Original input |
| `effective_value` | json \| null | no | Value to send to preview/launch |
| `transformation_note` | text \| null | no | Explanation of normalization |
| `warning_level` | enum \| null | no | Warning classification if transformation is risky |
| `requires_explicit_approval` | boolean | yes | Whether launch gate applies |
| `approval_state` | enum | yes | Approval lifecycle |
| `approved_by_user_id` | string \| null | no | Approver if approved |
| `approved_at` | datetime \| null | no | Approval timestamp |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |

### Validation rules

- `filter_id + variable_id` must be unique per session unless versioning is used.
- `requires_explicit_approval=true` implies launch is blocked while `approval_state != approved`.
- `effective_value` is required before preview when variable is required.
- User override should set `mapping_method=manual_override`.

#### `MappingMethod`

- `direct_match`
- `heuristic_match`
- `semantic_match`
- `manual_override`

#### `MappingWarningLevel`

- `low`
- `medium`
- `high`

#### `ApprovalState`

- `pending`
- `approved`
- `rejected`
- `not_required`

---

## 7. Clarification Workflow

### Entity: `ClarificationSession`

Stores resumable clarification flow state for one review session.

| Field | Type | Required | Description |
|---|---|---:|---|
| `clarification_session_id` | string (UUID) | yes | Unique clarification session id |
| `session_id` | string | yes | Parent review session |
| `status` | enum | yes | Clarification lifecycle |
| `current_question_id` | string \| null | no | Current active question |
| `resolved_count` | integer | yes | Count of answered/resolved items |
| `remaining_count` | integer | yes | Count of unresolved items |
| `summary_delta` | text \| null | no | Human-readable change summary |
| `started_at` | datetime | yes | Start time |
| `updated_at` | datetime | yes | Last update |
| `completed_at` | datetime \| null | no | End time |

#### `ClarificationStatus`

- `pending`
- `active`
- `paused`
- `completed`
- `cancelled`

---

### Entity: `ClarificationQuestion`

Represents one focused question in the clarification flow.

| Field | Type | Required | Description |
|---|---|---:|---|
| `question_id` | string (UUID) | yes | Unique question id |
| `clarification_session_id` | string | yes | Parent clarification session |
| `topic_ref` | string | yes | Related field/finding/mapping id |
| `question_text` | text | yes | Focused question |
| `why_it_matters` | text | yes | Business significance explanation |
| `current_guess` | text \| null | no | Best guess if available |
| `priority` | integer | yes | Order score |
| `state` | enum | yes | Question lifecycle |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |

#### `QuestionState`

- `open`
- `answered`
- `skipped`
- `expert_review`
- `superseded`

---

### Entity: `ClarificationOption`

Suggested selectable answer option for a question.

| Field | Type | Required | Description |
|---|---|---:|---|
| `option_id` | string (UUID) | yes | Unique option id |
| `question_id` | string | yes | Parent question |
| `label` | string | yes | UI label |
| `value` | string | yes | Stored answer payload |
| `is_recommended` | boolean | yes | Whether this is the recommended option |
| `display_order` | integer | yes | UI ordering |

---

### Entity: `ClarificationAnswer`

Stores user response to one clarification question.

| Field | Type | Required | Description |
|---|---|---:|---|
| `answer_id` | string (UUID) | yes | Unique answer id |
| `question_id` | string | yes | Parent question |
| `answer_kind` | enum | yes | How user responded |
| `answer_value` | text \| null | no | Selected/custom answer |
| `answered_by_user_id` | string | yes | Responding user |
| `impact_summary` | text \| null | no | Optional summary of resulting state changes |
| `created_at` | datetime | yes | Answer timestamp |

#### `AnswerKind`

- `selected`
- `custom`
- `skipped`
- `expert_review`

### Validation rules

- Each active question may have at most one current answer.
- `custom` answers require non-empty `answer_value`.
- `selected` answers must correspond to a valid option or normalized payload.
- `expert_review` leaves the related topic unresolved but marked intentionally deferred.

---

## 8. Preview and Launch Audit

### Entity: `CompiledPreview`

Stores the exact Superset-returned compiled SQL preview.

| Field | Type | Required | Description |
|---|---|---:|---|
| `preview_id` | string (UUID) | yes | Unique preview id |
| `session_id` | string | yes | Parent session |
| `preview_status` | enum | yes | Preview lifecycle state |
| `compiled_sql` | text \| null | no | Exact compiled SQL if successful |
| `preview_fingerprint` | string | yes | Snapshot hash of mapping/inputs used |
| `compiled_by` | enum | yes | Must be `superset` |
| `error_code` | string \| null | no | Optional failure code |
| `error_details` | text \| null | no | Readable preview error |
| `compiled_at` | datetime \| null | no | Successful compile timestamp |
| `created_at` | datetime | yes | Record creation timestamp |

### Validation rules

- `compiled_by` must be `superset`.
- `compiled_sql` is required when `preview_status=ready`.
- `compiled_sql` must be null when `preview_status=failed` unless partial diagnostics are intentionally stored elsewhere.
- `preview_fingerprint` must be compared against current session inputs before launch.
- Launch requires `preview_status=ready` and matching current fingerprint.

#### `PreviewStatus`

- `pending`
- `ready`
- `failed`
- `stale`

---

### Entity: `DatasetRunContext`

Audited execution snapshot created at launch.

| Field | Type | Required | Description |
|---|---|---:|---|
| `run_context_id` | string (UUID) | yes | Unique run context id |
| `session_id` | string | yes | Parent review session |
| `dataset_ref` | string | yes | Canonical dataset identity |
| `environment_id` | string | yes | Execution environment |
| `preview_id` | string | yes | Bound compiled preview |
| `sql_lab_session_ref` | string | yes | Canonical SQL Lab reference |
| `effective_filters` | json | yes | Final filter payload |
| `template_params` | json | yes | Final template parameter object |
| `approved_mapping_ids` | json array | yes | Explicit approvals used for launch |
| `semantic_decision_refs` | json array | yes | Applied semantic decision references |
| `open_warning_refs` | json array | yes | Warnings that remained visible at launch |
| `launch_status` | enum | yes | Launch outcome |
| `launch_error` | text \| null | no | Error if launch failed |
| `created_at` | datetime | yes | Launch record timestamp |

### Validation rules

- `preview_id` must reference a `CompiledPreview` with `ready` status.
- `sql_lab_session_ref` is mandatory for successful launch.
- `effective_filters` and `template_params` must match the preview fingerprint used.
- `launch_status=started` or `success` requires a non-empty SQL Lab reference.

#### `LaunchStatus`

- `started`
- `success`
- `failed`

---

## 9. Export Projections

### Entity: `ExportArtifact`

Tracks generated exports for sharing documentation and validation outputs.

| Field | Type | Required | Description |
|---|---|---:|---|
| `artifact_id` | string (UUID) | yes | Unique artifact id |
| `session_id` | string | yes | Parent session |
| `artifact_type` | enum | yes | Export type |
| `format` | enum | yes | File/output format |
| `storage_ref` | string | yes | Storage/file reference |
| `created_by_user_id` | string | yes | Requesting user |
| `created_at` | datetime | yes | Artifact creation time |

#### `ArtifactType`

- `documentation`
- `validation_report`
- `run_summary`

#### `ArtifactFormat`

- `json`
- `markdown`
- `csv`
- `pdf`

---

## 10. Relationships

### One-to-one / aggregate-root relationships

- `DatasetReviewSession` → `DatasetProfile` (current active profile view)
- `DatasetReviewSession` → `ClarificationSession` (current or latest)
- `DatasetReviewSession` → `CompiledPreview` (latest/current preview)
- `DatasetReviewSession` → `DatasetRunContext` (latest/current launch audit)

### One-to-many relationships

- `DatasetReviewSession` → many `ValidationFinding`
- `DatasetReviewSession` → many `SemanticSource`
- `DatasetReviewSession` → many `SemanticFieldEntry`
- `SemanticFieldEntry` → many `SemanticCandidate`
- `DatasetReviewSession` → many `ImportedFilter`
- `DatasetReviewSession` → many `TemplateVariable`
- `DatasetReviewSession` → many `ExecutionMapping`
- `ClarificationSession` → many `ClarificationQuestion`
- `ClarificationQuestion` → many `ClarificationOption`
- `ClarificationQuestion` → zero/one current `ClarificationAnswer`
- `DatasetReviewSession` → many `ExportArtifact`
- `DatasetReviewSession` → many `SessionEvent`

---

## 11. Derived Rules and Invariants

### Run readiness invariant

A session is `run_ready` only if:

- no open blocking findings remain,
- all required template variables have approved/effective mappings,
- all launch-sensitive mapping warnings have been explicitly approved,
- a non-stale `CompiledPreview` exists for the current fingerprint.
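
The four run-readiness conditions can be sketched as a pure check function. This is a minimal illustration under stated assumptions, not the feature's implementation: the dataclasses are trimmed stand-ins for the entities defined in this document, `readiness_blockers` and `compute_fingerprint` are hypothetical names, hashing canonical JSON with SHA-256 is an assumed fingerprint scheme, and treating only `resolution_state = open` as "open" is an assumption.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Finding:  # trimmed stand-in for ValidationFinding
    severity: str          # "blocking" | "warning" | "informational"
    resolution_state: str  # "open", "resolved", "approved", ...


@dataclass
class Variable:  # trimmed stand-in for TemplateVariable
    variable_id: str
    is_required: bool


@dataclass
class Mapping:  # trimmed stand-in for ExecutionMapping
    variable_id: str
    effective_value: Any
    requires_explicit_approval: bool
    approval_state: str    # "pending" | "approved" | "rejected" | "not_required"


@dataclass
class Preview:  # trimmed stand-in for CompiledPreview
    preview_status: str    # "pending" | "ready" | "failed" | "stale"
    preview_fingerprint: str


def compute_fingerprint(effective_filters: dict, template_params: dict) -> str:
    """Assumed scheme: SHA-256 over key-sorted, whitespace-free JSON."""
    payload = json.dumps(
        {"filters": effective_filters, "params": template_params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def readiness_blockers(
    findings: List[Finding],
    variables: List[Variable],
    mappings: List[Mapping],
    preview: Optional[Preview],
    current_fingerprint: str,
) -> List[str]:
    """Return the reasons a session is not run_ready; an empty list means ready."""
    blockers: List[str] = []

    # 1. No open blocking findings may remain.
    if any(f.severity == "blocking" and f.resolution_state == "open" for f in findings):
        blockers.append("open_blocking_finding")

    # 2. Every required variable needs a mapping with an effective value.
    mapped = {m.variable_id for m in mappings if m.effective_value is not None}
    if any(v.is_required and v.variable_id not in mapped for v in variables):
        blockers.append("required_variable_unmapped")

    # 3. Launch-sensitive mappings must be explicitly approved.
    if any(m.requires_explicit_approval and m.approval_state != "approved" for m in mappings):
        blockers.append("mapping_approval_pending")

    # 4. A ready preview must exist for the current fingerprint.
    if (
        preview is None
        or preview.preview_status != "ready"
        or preview.preview_fingerprint != current_fingerprint
    ):
        blockers.append("preview_missing_or_stale")

    return blockers
```

Returning the list of blockers instead of a boolean is one possible design: the caller could use it to pick a `recommended_action` or to explain to the user why launch is still gated.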
### Manual intent invariant

If a field is manually overridden:

- `SemanticFieldEntry.is_locked = true`
- `SemanticFieldEntry.provenance = manual_override`
- later imports or inferred candidates may be recorded, but cannot replace the active value automatically.

### Progressive recovery invariant

Partial Superset recovery must preserve usable state:

- imported filters may be `partial`,
- unresolved variables may remain `unmapped`,
- findings must explain what is still missing,
- session remains resumable.

### Clarification persistence invariant

Clarification answers must be persisted before:

- finding severity is downgraded,
- profile state is updated,
- current question pointer advances.

### Preview truth invariant

Compiled preview must be:

- generated by Superset,
- tied to the exact current effective inputs,
- treated as invalid if mappings/values change afterward.

---

## 12. Migration & Evolution Strategy

- **Baseline**: The initial implementation (Milestone 1) will include the core session and profile entities.
- **Incremental Growth**: Subsequent milestones will add clarification, mapping, and launch audit entities via standard SQLAlchemy migrations.
- **Compatibility**: The `DatasetReviewSession` aggregate root will remain the stable entry point for all sub-entities to ensure forward compatibility with saved user state.

## 13. Suggested Backend DTO Grouping

The future API and persistence layers should group models roughly as follows:

### Session DTOs

- `SessionSummary`
- `SessionDetail`
- `SessionListItem`

### Review DTOs

- `DatasetProfileDto`
- `ValidationFindingDto`
- `ReadinessChecklistDto`

### Semantic DTOs

- `SemanticSourceDto`
- `SemanticFieldEntryDto`
- `SemanticCandidateDto`

### Clarification DTOs

- `ClarificationSessionDto`
- `ClarificationQuestionDto`
- `ClarificationAnswerRequest`

### Execution DTOs

- `ImportedFilterDto`
- `TemplateVariableDto`
- `ExecutionMappingDto`
- `CompiledPreviewDto`
- `LaunchSummaryDto`

### Export DTOs

- `ExportArtifactDto`

---

## 14. Open Modeling Notes Resolved

The Phase 0 research questions are considered resolved for design purposes:

- SQL preview is modeled as a first-class persisted artifact.
- SQL Lab is modeled as the only canonical launch target.
- Semantic resolution and clarification are modeled as separate domain boundaries.
- Field-level overrides and mapping approvals are first-class entities.
- Session persistence is separate from task execution state.

This model is ready to drive:

- [`contracts/modules.md`](./contracts/modules.md)
- [`contracts/api.yaml`](./contracts/api.yaml)
- [`quickstart.md`](./quickstart.md)