# Data Model: LLM Dataset Orchestration

**Feature**: [LLM Dataset Orchestration](./spec.md)
**Branch**: `027-dataset-llm-orchestration`
**Date**: 2026-03-16

## Overview

This document defines the domain entities, relationships, lifecycle states, and validation rules for the dataset review, semantic enrichment, clarification, preview, and launch workflow described in [`spec.md`](./spec.md) and grounded by the decisions in [`research.md`](./research.md).

The model is intentionally split into:
- **session aggregate** entities for resumable workflow state,
- **semantic/provenance** entities for enrichment and conflict handling,
- **execution** entities for mapping, preview, and launch audit,
- **export** projections for sharing outputs.

---

## 1. Core Aggregate: DatasetReviewSession

### Entity: `SessionCollaborator`

| Field | Type | Required | Description |
|---|---|---:|---|
| `user_id` | string | yes | Collaborating user ID |
| `role` | enum | yes | `viewer`, `reviewer`, `approver` |
| `added_at` | datetime | yes | When the collaborator was added |

### Entity: `DatasetReviewSession`

Represents the top-level resumable workflow container for one dataset review/execution effort.

| Field | Type | Required | Description |
|---|---|---:|---|
| `session_id` | string (UUID) | yes | Stable unique identifier for the review session |
| `user_id` | string | yes | Authenticated user ID of the session owner |
| `collaborators` | list[SessionCollaborator] | no | Shared access and roles |
| `environment_id` | string | yes | Superset environment context |
| `source_kind` | enum | yes | Origin kind: `superset_link`, `dataset_selection` |
| `source_input` | string | yes | Original link or selected dataset reference |
| `dataset_ref` | string | yes | Canonical dataset reference used by the feature |
| `dataset_id` | integer \| null | no | Superset dataset id when resolved |
| `dashboard_id` | integer \| null | no | Superset dashboard id if imported from dashboard link |
| `readiness_state` | enum | yes | Current workflow readiness state |
| `recommended_action` | enum | yes | Explicit next recommended action |
| `status` | enum | yes | Session lifecycle status |
| `current_phase` | enum | yes | Active workflow phase |
| `active_task_id` | string \| null | no | Linked long-running task if one is active |
| `last_preview_id` | string \| null | no | Most recent preview snapshot |
| `last_run_context_id` | string \| null | no | Most recent launch audit record |
| `created_at` | datetime | yes | Session creation timestamp |
| `updated_at` | datetime | yes | Last mutation timestamp |
| `last_activity_at` | datetime | yes | Last user/system activity timestamp |
| `closed_at` | datetime \| null | no | Terminal close/archive timestamp |

### Validation rules
- `session_id` must be globally unique.
- `source_input` must be non-empty.
- `environment_id` must resolve to a configured environment.
- `readiness_state` and `recommended_action` must always be present.
- Mutations must be restricted to the owning `user_id`, except where a collaborator's role grants access.
- `dataset_id` becomes required before the preview or launch phases.
- `last_preview_id` must refer to a preview generated from the same session.

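A minimal Python sketch of two of the rules above: the ownership check and the `dataset_id` requirement. The set of roles allowed to mutate (`MUTATING_ROLES`) and the helper names `can_mutate` and `validate_session` are assumptions for illustration; this document does not state which collaborator roles may write, and the dataclasses carry only the fields the checks need.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: which collaborator roles may mutate a session is not defined here.
MUTATING_ROLES = {"reviewer", "approver"}
# Phases that need a resolved Superset dataset id.
PHASES_REQUIRING_DATASET_ID = {"preview", "launch"}


@dataclass
class SessionCollaborator:
    user_id: str
    role: str  # "viewer", "reviewer", or "approver"


@dataclass
class DatasetReviewSession:
    session_id: str
    user_id: str  # session owner
    source_input: str
    current_phase: str
    dataset_id: Optional[int] = None
    collaborators: List[SessionCollaborator] = field(default_factory=list)


def can_mutate(session: DatasetReviewSession, actor_id: str) -> bool:
    """Owner may always mutate; collaborators only with a mutating role."""
    if actor_id == session.user_id:
        return True
    return any(
        c.user_id == actor_id and c.role in MUTATING_ROLES
        for c in session.collaborators
    )


def validate_session(session: DatasetReviewSession) -> List[str]:
    """Return human-readable rule violations (empty list means valid)."""
    errors = []
    if not session.source_input.strip():
        errors.append("source_input must be non-empty")
    if (
        session.current_phase in PHASES_REQUIRING_DATASET_ID
        and session.dataset_id is None
    ):
        errors.append("dataset_id is required before preview or launch")
    return errors
```

Returning a list of violations rather than raising on the first one lets a caller surface every problem at once, which fits a resumable review workflow.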
### Enums

#### `SessionStatus`
- `active`
- `paused`
- `completed`
- `archived`
- `cancelled`

#### `SessionPhase`
- `intake`
- `recovery`
- `review`
- `semantic_review`
- `clarification`
- `mapping_review`
- `preview`
- `launch`
- `post_run`

#### `ReadinessState`
- `empty`
- `importing`
- `review_ready`
- `semantic_source_review_needed`
- `clarification_needed`
- `clarification_active`
- `mapping_review_needed`
- `compiled_preview_ready`
- `partially_ready`
- `run_ready`
- `run_in_progress`
- `completed`
- `recovery_required`

#### `RecommendedAction`
- `import_from_superset`
- `review_documentation`
- `apply_semantic_source`
- `start_clarification`
- `answer_next_question`
- `approve_mapping`
- `generate_sql_preview`
- `complete_required_values`
- `launch_dataset`
- `resume_session`
- `export_outputs`

---

## 2. Dataset Profile and Review State

### Entity: `DatasetProfile`

Consolidated interpretation of dataset meaning, semantics, filters, assumptions, and readiness.

| Field | Type | Required | Description |
|---|---|---:|---|
| `profile_id` | string (UUID) | yes | Unique profile id |
| `session_id` | string | yes | Parent session |
| `dataset_name` | string | yes | Display dataset name |
| `schema_name` | string \| null | no | Schema if available |
| `database_name` | string \| null | no | Database if available |
| `business_summary` | text | yes | Human-readable summary |
| `business_summary_source` | enum | yes | Provenance of summary |
| `description` | text \| null | no | Dataset-level description |
| `dataset_type` | enum \| null | no | `table`, `virtual`, `sqllab_view`, `unknown` |
| `is_sqllab_view` | boolean | yes | Whether dataset is SQL Lab derived |
| `completeness_score` | number \| null | no | Optional normalized completeness score |
| `confidence_state` | enum | yes | Overall confidence posture |
| `has_blocking_findings` | boolean | yes | Derived summary flag |
| `has_warning_findings` | boolean | yes | Derived summary flag |
| `manual_summary_locked` | boolean | yes | Protects user-entered summary |
| `created_at` | datetime | yes | Created timestamp |
| `updated_at` | datetime | yes | Updated timestamp |

### Validation rules
- `business_summary` must always contain a usable string; if evidence is weak, it may be skeletal but never null.
- `manual_summary_locked=true` prevents later automatic overwrite.
- `session_id` must be unique when only one active profile snapshot is stored per session; if snapshots are retained, each must be versioned.
- `confidence_state` must reflect the highest unresolved-risk posture, not the most optimistic confidence.

#### `BusinessSummarySource`
- `confirmed`
- `imported`
- `inferred`
- `ai_draft`
- `manual_override`

#### `ConfidenceState`
- `confirmed`
- `mostly_confirmed`
- `mixed`
- `low_confidence`
- `unresolved`

---

## 3. Validation Findings

### Entity: `ValidationFinding`

Represents a blocking issue, warning, or informational observation.

| Field | Type | Required | Description |
|---|---|---:|---|
| `finding_id` | string (UUID) | yes | Unique finding id |
| `session_id` | string | yes | Parent session |
| `area` | enum | yes | Affected domain area |
| `severity` | enum | yes | `blocking`, `warning`, `informational` |
| `code` | string | yes | Stable machine-readable finding code |
| `title` | string | yes | Short label |
| `message` | text | yes | Actionable human-readable explanation |
| `resolution_state` | enum | yes | Current resolution status |
| `resolution_note` | text \| null | no | Optional explanation or approval note |
| `caused_by_ref` | string \| null | no | Related field/filter/mapping/question id |
| `created_at` | datetime | yes | Creation timestamp |
| `resolved_at` | datetime \| null | no | Resolution timestamp |

### Validation rules
- `severity` must be one of the allowed values.
- `resolution_state=resolved` or `approved` requires either a system resolution event or user action.
- `launch` is blocked if any open `blocking` finding remains.
- `warning` findings tied to mapping transformations require explicit approval before launch if marked launch-sensitive.

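The launch-related rules above could be checked roughly as follows. This is a sketch under stated assumptions: the entity table has no field that marks a warning as launch-sensitive, so `launch_sensitive` is a hypothetical flag, and only `resolution_state == "open"` is treated as "open" (how `skipped`, `deferred`, or `expert_review` blocking findings should gate launch is not specified here).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ValidationFinding:
    code: str
    severity: str          # "blocking", "warning", "informational"
    resolution_state: str  # "open", "resolved", "approved", ...
    launch_sensitive: bool = False  # hypothetical flag, not in the entity table


def launch_blockers(findings: List[ValidationFinding]) -> List[str]:
    """Return the codes of findings that currently prevent launch."""
    blockers = []
    for f in findings:
        if f.severity == "blocking" and f.resolution_state == "open":
            # Any open blocking finding stops launch.
            blockers.append(f.code)
        elif (
            f.severity == "warning"
            and f.launch_sensitive
            and f.resolution_state != "approved"
        ):
            # Launch-sensitive warnings need explicit approval.
            blockers.append(f.code)
    return blockers
```

An empty result means findings alone do not block launch; mapping approvals and preview freshness are separate gates.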
#### `FindingArea`
- `source_intake`
- `dataset_profile`
- `semantic_enrichment`
- `clarification`
- `filter_recovery`
- `template_mapping`
- `compiled_preview`
- `launch`
- `audit`

#### `ResolutionState`
- `open`
- `resolved`
- `approved`
- `skipped`
- `deferred`
- `expert_review`

---

## 4. Semantic Source and Field Decisions

### Entity: `SemanticSource`

Represents a trusted or candidate source of semantic metadata.

| Field | Type | Required | Description |
|---|---|---:|---|
| `source_id` | string (UUID) | yes | Unique source id |
| `session_id` | string | yes | Parent session |
| `source_type` | enum | yes | Origin kind |
| `source_ref` | string | yes | External reference, dataset ref, or uploaded artifact ref |
| `source_version` | string | yes | Version or snapshot identifier for propagation tracking |
| `display_name` | string | yes | Human-readable source name |
| `trust_level` | enum | yes | Source trust tier |
| `schema_overlap_score` | number \| null | no | Optional overlap signal |
| `status` | enum | yes | Availability/applicability status |
| `created_at` | datetime | yes | Creation timestamp |

#### `SemanticSourceType`
- `uploaded_file`
- `connected_dictionary`
- `reference_dataset`
- `neighbor_dataset`
- `ai_generated`

#### `TrustLevel`
- `trusted`
- `recommended`
- `candidate`
- `generated`

#### `SemanticSourceStatus`
- `available`
- `selected`
- `applied`
- `rejected`
- `partial`
- `failed`

---

### Entity: `SemanticFieldEntry`

Canonical semantic state for one dataset field or metric.

| Field | Type | Required | Description |
|---|---|---:|---|
| `field_id` | string (UUID) | yes | Unique field semantic id |
| `session_id` | string | yes | Parent session |
| `field_name` | string | yes | Physical field/metric name |
| `field_kind` | enum | yes | `column`, `metric`, `filter_dimension`, `parameter` |
| `verbose_name` | string \| null | no | Display label |
| `description` | text \| null | no | Human-readable description |
| `display_format` | string \| null | no | Formatting metadata such as d3 format |
| `provenance` | enum | yes | Final chosen source class |
| `source_id` | string \| null | no | Winning source |
| `confidence_rank` | integer \| null | no | Final applied ranking |
| `is_locked` | boolean | yes | Manual override protection |
| `has_conflict` | boolean | yes | Whether competing candidates exist |
| `needs_review` | boolean | yes | Whether user review is still needed |
| `last_changed_by` | enum | yes | `system`, `user`, `agent` |
| `user_feedback` | enum \| null | no | User feedback: `up`, `down`, or null when none was given |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |

### Validation rules
- `field_name` must be unique per `session_id + field_kind`.
- `is_locked=true` prevents automatic overwrite.
- `provenance=manual_override` implies `is_locked=true`.
- `has_conflict=true` requires at least one competing candidate record.
- Applied fuzzy or inferred values must keep `needs_review=true` until confirmed, if policy requires explicit review.

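The locking rules above can be illustrated with a small sketch. The helper names `apply_manual_override` and `apply_automatic_value` are hypothetical. Two further assumptions: a manual override clears `needs_review`, and `ai_generated` values are flagged for review alongside `fuzzy_inferred` ones (the rules only mention fuzzy or inferred values).

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: provenance classes that stay flagged for review once applied.
REVIEW_REQUIRED_PROVENANCE = {"fuzzy_inferred", "ai_generated"}


@dataclass
class SemanticFieldEntry:
    field_name: str
    description: Optional[str] = None
    provenance: str = "unresolved"
    is_locked: bool = False
    needs_review: bool = True


def apply_manual_override(entry: SemanticFieldEntry, description: str) -> None:
    """A user edit always wins and locks the entry."""
    entry.description = description
    entry.provenance = "manual_override"
    entry.is_locked = True  # manual_override implies is_locked=true
    entry.needs_review = False  # assumption: a manual value counts as reviewed


def apply_automatic_value(
    entry: SemanticFieldEntry, description: str, provenance: str
) -> bool:
    """Apply an imported/inferred value unless locked; return True if applied."""
    if entry.is_locked:
        # The candidate may still be recorded elsewhere, but it must not
        # replace the active value.
        return False
    entry.description = description
    entry.provenance = provenance
    entry.needs_review = provenance in REVIEW_REQUIRED_PROVENANCE
    return True
```

Returning a boolean from `apply_automatic_value` lets the caller decide whether to store the rejected value as a `SemanticCandidate` instead.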
#### `FieldKind`
- `column`
- `metric`
- `filter_dimension`
- `parameter`

#### `FieldProvenance`
- `dictionary_exact`
- `reference_imported`
- `fuzzy_inferred`
- `ai_generated`
- `manual_override`
- `unresolved`

---

### Entity: `SemanticCandidate`

Stores competing candidate values before or alongside final field decision.

| Field | Type | Required | Description |
|---|---|---:|---|
| `candidate_id` | string (UUID) | yes | Unique candidate id |
| `field_id` | string | yes | Parent semantic field |
| `source_id` | string \| null | no | Candidate source |
| `candidate_rank` | integer | yes | Lower is stronger |
| `match_type` | enum | yes | Exact, imported, fuzzy, generated |
| `confidence_score` | number | yes | Normalized score |
| `proposed_verbose_name` | string \| null | no | Candidate verbose name |
| `proposed_description` | text \| null | no | Candidate description |
| `proposed_display_format` | string \| null | no | Candidate display format |
| `status` | enum | yes | Candidate lifecycle |
| `created_at` | datetime | yes | Creation timestamp |

#### `CandidateMatchType`
- `exact`
- `reference`
- `fuzzy`
- `generated`

#### `CandidateStatus`
- `proposed`
- `accepted`
- `rejected`
- `superseded`

---

## 5. Imported Filters and Runtime Variables

### Entity: `ImportedFilter`

Represents one recovered or user-supplied filter value.

| Field | Type | Required | Description |
|---|---|---:|---|
| `filter_id` | string (UUID) | yes | Unique filter id |
| `session_id` | string | yes | Parent session |
| `filter_name` | string | yes | Source filter name |
| `display_name` | string \| null | no | User-facing label |
| `raw_value` | json | yes | Original recovered value |
| `normalized_value` | json \| null | no | Optional transformed value |
| `source` | enum | yes | Origin of the filter |
| `confidence_state` | enum | yes | Confidence/provenance class |
| `requires_confirmation` | boolean | yes | Whether explicit review is needed |
| `recovery_status` | enum | yes | Recovery completeness |
| `notes` | text \| null | no | Recovery explanation |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |

#### `FilterSource`
- `superset_native`
- `superset_url`
- `manual`
- `inferred`

#### `FilterConfidenceState`
- `confirmed`
- `imported`
- `inferred`
- `ai_draft`
- `unresolved`

#### `FilterRecoveryStatus`
- `recovered`
- `partial`
- `missing`
- `conflicted`

---

### Entity: `TemplateVariable`

Represents a runtime variable discovered from dataset execution logic.

| Field | Type | Required | Description |
|---|---|---:|---|
| `variable_id` | string (UUID) | yes | Unique variable id |
| `session_id` | string | yes | Parent session |
| `variable_name` | string | yes | Canonical runtime variable name |
| `expression_source` | text | yes | Raw expression or snippet where variable was found |
| `variable_kind` | enum | yes | Detected variable class |
| `is_required` | boolean | yes | Whether launch requires a mapped value |
| `default_value` | json \| null | no | Optional default |
| `mapping_status` | enum | yes | Current mapping state |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |

#### `VariableKind`
- `native_filter`
- `parameter`
- `derived`
- `unknown`

#### `MappingStatus`
- `unmapped`
- `proposed`
- `approved`
- `overridden`
- `invalid`

---

## 6. Mapping Review and Warning Approvals

### Entity: `ExecutionMapping`

Represents the mapping between a recovered filter and a runtime variable.

| Field | Type | Required | Description |
|---|---|---:|---|
| `mapping_id` | string (UUID) | yes | Unique mapping id |
| `session_id` | string | yes | Parent session |
| `filter_id` | string | yes | Source imported filter |
| `variable_id` | string | yes | Target template variable |
| `mapping_method` | enum | yes | How mapping was produced |
| `raw_input_value` | json | yes | Original input |
| `effective_value` | json \| null | no | Value to send to preview/launch |
| `transformation_note` | text \| null | no | Explanation of normalization |
| `warning_level` | enum \| null | no | Warning classification if transformation is risky |
| `requires_explicit_approval` | boolean | yes | Whether launch gate applies |
| `approval_state` | enum | yes | Approval lifecycle |
| `approved_by_user_id` | string \| null | no | Approver if approved |
| `approved_at` | datetime \| null | no | Approval timestamp |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |

### Validation rules
- `filter_id + variable_id` must be unique per session unless versioning is used.
- `requires_explicit_approval=true` implies launch is blocked while `approval_state != approved`.
- `effective_value` is required before preview when the target variable is required.
- A user override should set `mapping_method=manual_override`.

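A short sketch of the two gates described above, one for preview and one for launch. The `variable_required` field is an assumption made for brevity: in the model it would be read from the linked `TemplateVariable.is_required`. The function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ExecutionMapping:
    mapping_id: str
    variable_required: bool  # assumption: copied from TemplateVariable.is_required
    effective_value: Optional[Any]
    requires_explicit_approval: bool
    approval_state: str  # "pending", "approved", "rejected", "not_required"


def blocks_preview(m: ExecutionMapping) -> bool:
    """A required variable without an effective value cannot be previewed."""
    return m.variable_required and m.effective_value is None


def blocks_launch(m: ExecutionMapping) -> bool:
    """Launch needs a previewable mapping plus any required explicit approval."""
    needs_approval = m.requires_explicit_approval and m.approval_state != "approved"
    return blocks_preview(m) or needs_approval
```

Keeping the preview gate a subset of the launch gate reflects that a launch always depends on a successful preview.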
#### `MappingMethod`
- `direct_match`
- `heuristic_match`
- `semantic_match`
- `manual_override`

#### `MappingWarningLevel`
- `low`
- `medium`
- `high`

#### `ApprovalState`
- `pending`
- `approved`
- `rejected`
- `not_required`

---

## 7. Clarification Workflow

### Entity: `ClarificationSession`

Stores resumable clarification flow state for one review session.

| Field | Type | Required | Description |
|---|---|---:|---|
| `clarification_session_id` | string (UUID) | yes | Unique clarification session id |
| `session_id` | string | yes | Parent review session |
| `status` | enum | yes | Clarification lifecycle |
| `current_question_id` | string \| null | no | Current active question |
| `resolved_count` | integer | yes | Count of answered/resolved items |
| `remaining_count` | integer | yes | Count of unresolved items |
| `summary_delta` | text \| null | no | Human-readable change summary |
| `started_at` | datetime | yes | Start time |
| `updated_at` | datetime | yes | Last update |
| `completed_at` | datetime \| null | no | End time |

#### `ClarificationStatus`
- `pending`
- `active`
- `paused`
- `completed`
- `cancelled`

---

### Entity: `ClarificationQuestion`

Represents one focused question in the clarification flow.

| Field | Type | Required | Description |
|---|---|---:|---|
| `question_id` | string (UUID) | yes | Unique question id |
| `clarification_session_id` | string | yes | Parent clarification session |
| `topic_ref` | string | yes | Related field/finding/mapping id |
| `question_text` | text | yes | Focused question |
| `why_it_matters` | text | yes | Business significance explanation |
| `current_guess` | text \| null | no | Best guess if available |
| `priority` | integer | yes | Order score |
| `state` | enum | yes | Question lifecycle |
| `created_at` | datetime | yes | Creation timestamp |
| `updated_at` | datetime | yes | Updated timestamp |

#### `QuestionState`
- `open`
- `answered`
- `skipped`
- `expert_review`
- `superseded`

---

### Entity: `ClarificationOption`

Suggested selectable answer option for a question.

| Field | Type | Required | Description |
|---|---|---:|---|
| `option_id` | string (UUID) | yes | Unique option id |
| `question_id` | string | yes | Parent question |
| `label` | string | yes | UI label |
| `value` | string | yes | Stored answer payload |
| `is_recommended` | boolean | yes | Whether this is the recommended option |
| `display_order` | integer | yes | UI ordering |

---

### Entity: `ClarificationAnswer`

Stores user response to one clarification question.

| Field | Type | Required | Description |
|---|---|---:|---|
| `answer_id` | string (UUID) | yes | Unique answer id |
| `question_id` | string | yes | Parent question |
| `answer_kind` | enum | yes | How user responded |
| `answer_value` | text \| null | no | Selected/custom answer |
| `answered_by_user_id` | string | yes | Responding user |
| `impact_summary` | text \| null | no | Optional summary of resulting state changes |
| `created_at` | datetime | yes | Answer timestamp |

#### `AnswerKind`
- `selected`
- `custom`
- `skipped`
- `expert_review`

### Validation rules
- Each active question may have at most one current answer.
- `custom` answers require non-empty `answer_value`.
- `selected` answers must correspond to a valid option or normalized payload.
- `expert_review` leaves the related topic unresolved but marked intentionally deferred.

---

## 8. Preview and Launch Audit

### Entity: `CompiledPreview`

Stores the exact Superset-returned compiled SQL preview.

| Field | Type | Required | Description |
|---|---|---:|---|
| `preview_id` | string (UUID) | yes | Unique preview id |
| `session_id` | string | yes | Parent session |
| `preview_status` | enum | yes | Preview lifecycle state |
| `compiled_sql` | text \| null | no | Exact compiled SQL if successful |
| `preview_fingerprint` | string | yes | Snapshot hash of mapping/inputs used |
| `compiled_by` | enum | yes | Must be `superset` |
| `error_code` | string \| null | no | Optional failure code |
| `error_details` | text \| null | no | Readable preview error |
| `compiled_at` | datetime \| null | no | Successful compile timestamp |
| `created_at` | datetime | yes | Record creation timestamp |

### Validation rules
- `compiled_by` must be `superset`.
- `compiled_sql` is required when `preview_status=ready`.
- `compiled_sql` must be null when `preview_status=failed` unless partial diagnostics are intentionally stored elsewhere.
- `preview_fingerprint` must be compared against current session inputs before launch.
- Launch requires `preview_status=ready` and matching current fingerprint.

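One way to realize the fingerprint comparison above is to hash a canonical JSON serialization of the inputs. The sketch below is an assumption, not the specified algorithm: this document only says the fingerprint is a snapshot hash of the mapping and inputs used, so the choice of SHA-256, the exact fields hashed, and the function names are illustrative.

```python
import hashlib
import json
from typing import Any, Dict


def preview_fingerprint(
    dataset_ref: str,
    effective_filters: Dict[str, Any],
    template_params: Dict[str, Any],
) -> str:
    """Hash the preview inputs in a key-order-independent way."""
    payload = {
        "dataset_ref": dataset_ref,
        "effective_filters": effective_filters,
        "template_params": template_params,
    }
    # sort_keys plus fixed separators gives one canonical string per payload.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def preview_allows_launch(
    preview_status: str, stored_fingerprint: str, current_fingerprint: str
) -> bool:
    """Launch requires a ready preview whose fingerprint matches current inputs."""
    return preview_status == "ready" and stored_fingerprint == current_fingerprint
```

Sorting keys before hashing means that reordering filters in the payload does not invalidate a preview, while any change to a value does.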
#### `PreviewStatus`
- `pending`
- `ready`
- `failed`
- `stale`

---

### Entity: `DatasetRunContext`

Audited execution snapshot created at launch.

| Field | Type | Required | Description |
|---|---|---:|---|
| `run_context_id` | string (UUID) | yes | Unique run context id |
| `session_id` | string | yes | Parent review session |
| `dataset_ref` | string | yes | Canonical dataset identity |
| `environment_id` | string | yes | Execution environment |
| `preview_id` | string | yes | Bound compiled preview |
| `sql_lab_session_ref` | string | yes | Canonical SQL Lab reference |
| `effective_filters` | json | yes | Final filter payload |
| `template_params` | json | yes | Final template parameter object |
| `approved_mapping_ids` | json array | yes | Explicit approvals used for launch |
| `semantic_decision_refs` | json array | yes | Applied semantic decision references |
| `open_warning_refs` | json array | yes | Warnings that remained visible at launch |
| `launch_status` | enum | yes | Launch outcome |
| `launch_error` | text \| null | no | Error if launch failed |
| `created_at` | datetime | yes | Launch record timestamp |

### Validation rules
- `preview_id` must reference a `CompiledPreview` with `ready` status.
- `sql_lab_session_ref` is mandatory for successful launch.
- `effective_filters` and `template_params` must match the preview fingerprint used.
- `launch_status=started` or `success` requires a non-empty SQL Lab reference.

#### `LaunchStatus`
- `started`
- `success`
- `failed`

---

## 9. Export Projections

### Entity: `ExportArtifact`

Tracks generated exports for sharing documentation and validation outputs.

| Field | Type | Required | Description |
|---|---|---:|---|
| `artifact_id` | string (UUID) | yes | Unique artifact id |
| `session_id` | string | yes | Parent session |
| `artifact_type` | enum | yes | Export type |
| `format` | enum | yes | File/output format |
| `storage_ref` | string | yes | Storage/file reference |
| `created_by_user_id` | string | yes | Requesting user |
| `created_at` | datetime | yes | Artifact creation time |

#### `ArtifactType`
- `documentation`
- `validation_report`
- `run_summary`

#### `ArtifactFormat`
- `json`
- `markdown`
- `csv`
- `pdf`

---

## 10. Relationships

### One-to-one / aggregate-root relationships
- `DatasetReviewSession` → `DatasetProfile` (current active profile view)
- `DatasetReviewSession` → `ClarificationSession` (current or latest)
- `DatasetReviewSession` → `CompiledPreview` (latest/current preview)
- `DatasetReviewSession` → `DatasetRunContext` (latest/current launch audit)

### One-to-many relationships
- `DatasetReviewSession` → many `ValidationFinding`
- `DatasetReviewSession` → many `SemanticSource`
- `DatasetReviewSession` → many `SemanticFieldEntry`
- `SemanticFieldEntry` → many `SemanticCandidate`
- `DatasetReviewSession` → many `ImportedFilter`
- `DatasetReviewSession` → many `TemplateVariable`
- `DatasetReviewSession` → many `ExecutionMapping`
- `ClarificationSession` → many `ClarificationQuestion`
- `ClarificationQuestion` → many `ClarificationOption`
- `ClarificationQuestion` → zero/one current `ClarificationAnswer`
- `DatasetReviewSession` → many `ExportArtifact`
- `DatasetReviewSession` → many `SessionEvent`

---

## 11. Derived Rules and Invariants

### Run readiness invariant
A session is `run_ready` only if:
- no open blocking findings remain,
- all required template variables have approved/effective mappings,
- all launch-sensitive mapping warnings have been explicitly approved,
- a non-stale `CompiledPreview` exists for the current fingerprint.

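The four conditions above can be combined into a single predicate. The sketch below assumes the counts and fingerprints have already been computed elsewhere; `ReadinessInputs`, `unmet_conditions`, and `is_run_ready` are hypothetical names, and "non-stale" is interpreted as `preview_status == "ready"` with a matching fingerprint.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReadinessInputs:
    open_blocking_findings: int
    unmapped_required_variables: int
    unapproved_launch_sensitive_warnings: int
    preview_status: str       # "pending", "ready", "failed", "stale"
    preview_fingerprint: str  # fingerprint stored on the CompiledPreview
    current_fingerprint: str  # fingerprint of the session's current inputs


def unmet_conditions(i: ReadinessInputs) -> List[str]:
    """List every run-readiness condition that does not hold yet."""
    unmet = []
    if i.open_blocking_findings > 0:
        unmet.append("open blocking findings remain")
    if i.unmapped_required_variables > 0:
        unmet.append("required template variables lack effective mappings")
    if i.unapproved_launch_sensitive_warnings > 0:
        unmet.append("launch-sensitive mapping warnings are not approved")
    if i.preview_status != "ready" or i.preview_fingerprint != i.current_fingerprint:
        unmet.append("no current compiled preview for these inputs")
    return unmet


def is_run_ready(i: ReadinessInputs) -> bool:
    """A session is run_ready only when no condition is unmet."""
    return not unmet_conditions(i)
```

Exposing the list of unmet conditions, rather than only a boolean, would let a readiness checklist explain why a session is not yet `run_ready`.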
### Manual intent invariant
If a field is manually overridden:
- `SemanticFieldEntry.is_locked = true`
- `SemanticFieldEntry.provenance = manual_override`
- later imports or inferred candidates may be recorded, but cannot replace the active value automatically.

### Progressive recovery invariant
Partial Superset recovery must preserve usable state:
- imported filters may be `partial`,
- unresolved variables may remain `unmapped`,
- findings must explain what is still missing,
- session remains resumable.

### Clarification persistence invariant
Clarification answers must be persisted before:
- finding severity is downgraded,
- profile state is updated,
- current question pointer advances.

### Preview truth invariant
Compiled preview must be:
- generated by Superset,
- tied to the exact current effective inputs,
- treated as invalid if mappings/values change afterward.

---

## 12. Migration & Evolution Strategy

- **Baseline**: The initial implementation (Milestone 1) will include the core session and profile entities.
- **Incremental Growth**: Subsequent milestones will add clarification, mapping, and launch audit entities via standard SQLAlchemy migrations.
- **Compatibility**: The `DatasetReviewSession` aggregate root will remain the stable entry point for all sub-entities to ensure forward compatibility with saved user state.

## 13. Suggested Backend DTO Grouping

The future API and persistence layers should group models roughly as follows:

### Session DTOs
- `SessionSummary`
- `SessionDetail`
- `SessionListItem`

### Review DTOs
- `DatasetProfileDto`
- `ValidationFindingDto`
- `ReadinessChecklistDto`

### Semantic DTOs
- `SemanticSourceDto`
- `SemanticFieldEntryDto`
- `SemanticCandidateDto`

### Clarification DTOs
- `ClarificationSessionDto`
- `ClarificationQuestionDto`
- `ClarificationAnswerRequest`

### Execution DTOs
- `ImportedFilterDto`
- `TemplateVariableDto`
- `ExecutionMappingDto`
- `CompiledPreviewDto`
- `LaunchSummaryDto`

### Export DTOs
- `ExportArtifactDto`

---

## 14. Open Modeling Notes Resolved

The Phase 0 research questions are considered resolved for design purposes:
- SQL preview is modeled as a first-class persisted artifact.
- SQL Lab is modeled as the only canonical launch target.
- Semantic resolution and clarification are modeled as separate domain boundaries.
- Field-level overrides and mapping approvals are first-class entities.
- Session persistence is separate from task execution state.

This model is ready to drive:
- [`contracts/modules.md`](./contracts/modules.md)
- [`contracts/api.yaml`](./contracts/api.yaml)
- [`quickstart.md`](./quickstart.md)