ss-tools/specs/027-dataset-llm-orchestration/spec.md

Feature Specification: LLM Dataset Orchestration

Feature Branch: 027-dataset-llm-orchestration
Created: 2026-03-16
Status: Draft
Input: User description (translated from Russian): "I want to work out a mechanism for LLM-based documentation and validation of datasets, both in automatic mode and in a dialogue with an agent that clarifies attributes and other implicit details. We also need a mechanism for running datasets on the Superset side, with support for Jinja templates. Ideally, the user should feed in a Superset link with saved native filters, and ss-tools should extract all the filters and assemble them for the dataset."

Clarifications

Session 2026-03-16

  • Q: Which execution target should be canonical for approved dataset launch? → A: Superset SQL Lab session is the canonical audited launch target.
  • Q: What user action should be required to clear mapping warnings before launch? → A: Any mapping warning requires explicit user approval, but manual edit is optional.
  • Q: What should happen if Superset-side SQL compilation is unavailable before launch? → A: Launch stays blocked until Superset-side compiled preview succeeds.

User Scenarios & Testing (mandatory)

User Story 1 - Recover, enrich, and explain dataset context automatically (Priority: P1)

A data engineer or analytics engineer submits a dataset or a Superset link and immediately receives a readable explanation of what the dataset is, which filters were recovered, which semantic labels were reused from trusted sources, and what still needs review.

Why this priority: The first user need is fast understanding with minimal reinvention. Without an immediate and trustworthy first-pass interpretation, neither clarification nor execution provides value.

Independent Test: Can be fully tested by submitting a dataset with partial metadata or a Superset link with saved filters and verifying that the system produces a business-readable summary, distinguishes source confidence, searches trusted semantic sources before generating new labels, and shows the next recommended action without requiring manual dialogue.

Acceptance Scenarios:

  1. Given a dataset with partial technical metadata, When the user starts automatic review, Then the system generates a business-readable documentation draft, groups known and unresolved attributes, and presents a current readiness state.
  2. Given a valid Superset link with reusable saved native filters, When the user imports it, Then the system recovers the available filter context and presents imported values separately from inferred or user-provided values.
  3. Given connected dictionaries, spreadsheet sources, or trusted reference datasets are available, When automatic review runs, Then the system attempts semantic enrichment from those sources before creating AI-generated labels from scratch.
  4. Given multiple semantic candidates exist for a field, When the first summary is shown, Then the system clearly indicates the provenance and confidence level of the chosen or suggested semantic value.
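
The provenance-and-confidence display in scenario 4 follows the hierarchy later formalized in FR-010. A minimal sketch of how candidate semantic values could be ranked is shown below; the names `Provenance`, `FieldSemantics`, and `best_candidate` are illustrative, not part of the spec:

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional

class Provenance(IntEnum):
    """FR-010 confidence hierarchy, strongest first (numeric rank = priority)."""
    EXACT_DICTIONARY_MATCH = 1
    TRUSTED_REFERENCE_MATCH = 2
    FUZZY_SEMANTIC_MATCH = 3
    AI_GENERATED_DRAFT = 4
    UNRESOLVED = 5

@dataclass
class FieldSemantics:
    field_name: str
    verbose_name: Optional[str]
    provenance: Provenance

def best_candidate(candidates: List[FieldSemantics]) -> Optional[FieldSemantics]:
    """Choose the candidate with the strongest provenance rank."""
    return min(candidates, key=lambda c: c.provenance, default=None)
```

Keeping the rank explicit on every candidate is what lets the first summary show provenance per field instead of a single merged value.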

User Story 2 - Resolve ambiguities through guided clarification (Priority: P2)

A data steward, analytics engineer, or domain expert works with an agent to resolve ambiguous business meanings, conflicting metadata, conflicting semantic sources, and missing run-time values one issue at a time.

Why this priority: Real datasets often contain implicit semantics that cannot be derived safely from source metadata alone. Guided clarification converts uncertainty into auditable decisions.

Independent Test: Can be fully tested by opening clarification mode for a dataset with ambiguous attributes or conflicting semantic sources and verifying that the system asks focused questions, explains why each question matters, stores answers, and updates readiness and validation outcomes in real time.

Acceptance Scenarios:

  1. Given a dataset has blocking ambiguities, When the user starts guided clarification, Then the system asks one focused question at a time and explains the significance of the question in business terms.
  2. Given the system already has a current guess for an unresolved attribute, When the question is shown, Then the system presents that guess along with selectable answers, a custom-answer option, and a skip option.
  3. Given semantic source reuse is likely, When the system detects a strong match with a trusted dictionary or reference dataset, Then the agent can proactively suggest that source as the preferred basis for semantic enrichment.
  4. Given fuzzy semantic matches were found from a selected dictionary or dataset, When the system presents them, Then the user can approve them in bulk, review them individually, or keep only exact matches.
  5. Given the user confirms or edits an answer, When the response is saved, Then the system updates the dataset profile, validation findings, and readiness state without losing prior context.
  6. Given the user exits clarification before all issues are resolved, When the session is saved, Then the system preserves answered questions, unresolved questions, and the current recommended next action.
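
The one-question-at-a-time flow with guesses, skipping, and resumability (scenarios 1, 2, and 6) could be modeled roughly as below; all class and method names are illustrative, not mandated by the spec:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ClarificationQuestion:
    attribute: str
    why_it_matters: str                          # shown in business terms (FR-014)
    best_guess: Optional[str] = None             # system's current guess, if any
    options: List[str] = field(default_factory=list)

@dataclass
class ClarificationSession:
    """Resumable session state: answers persist, skipped items stay queued."""
    pending: List[ClarificationQuestion]
    answers: Dict[str, str] = field(default_factory=dict)

    def next_question(self) -> Optional[ClarificationQuestion]:
        """One focused question at a time (FR-013)."""
        return self.pending[0] if self.pending else None

    def answer(self, value: Optional[str]) -> None:
        """Record an answer; None means skip, which keeps the item for later."""
        question = self.pending.pop(0)
        if value is None:
            self.pending.append(question)
        else:
            self.answers[question.attribute] = value
```

Because answered and unresolved items live in separate collections, saving and reloading this structure is enough to satisfy the exit-and-resume scenario without losing context.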

User Story 3 - Prepare and launch a controlled dataset run (Priority: P3)

A BI engineer reviews the assembled run context, verifies filters and placeholders, understands any remaining warnings, reviews the compiled SQL preview, and launches the dataset with confidence that the execution can be reproduced later.

Why this priority: Execution is the final high-value outcome, but it must feel controlled and auditable rather than opaque.

Independent Test: Can be fully tested by preparing a dataset run from imported or manually confirmed filter context and verifying that the system blocks launch on missing required values and on unmet preview-approval conditions, allows review and editing, and records the exact run context used.

Acceptance Scenarios:

  1. Given an assembled dataset context contains required filters and placeholders, When the user opens run preparation, Then the system shows the effective filters, unresolved assumptions, semantic provenance signals, and current run readiness in one place.
  2. Given required values are still missing, When the user attempts to launch, Then the system blocks launch and highlights the specific values that must be completed.
  3. Given warning-level mapping transformations are present, When the user reviews run preparation, Then the system requires explicit approval for each warning before launch while still allowing optional manual edits.
  4. Given Superset-side SQL compilation preview is unavailable or fails, When the user attempts to launch, Then the system blocks launch until a successful compiled preview is available.
  5. Given the dataset is run-ready, When the user confirms launch, Then the system creates or starts a Superset SQL Lab session as the canonical execution target and records the dataset identity, effective filters, parameter values, outstanding warnings, and execution outcome for later audit or replay.
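
The launch gating in scenarios 2–4 (missing required values, per-warning approval, mandatory compiled preview) can be sketched as a single check. Function and parameter names here are hypothetical:

```python
from typing import Dict, List, Optional

def launch_blockers(
    required_values: Dict[str, Optional[str]],
    warning_approvals: Dict[str, bool],
    compiled_preview_ok: bool,
) -> List[str]:
    """Return every reason launch must stay blocked; an empty list means run-ready."""
    blockers = [
        f"missing required value: {name}"
        for name, value in required_values.items()
        if value is None
    ]
    blockers += [
        f"unapproved mapping warning: {warning_id}"
        for warning_id, approved in warning_approvals.items()
        if not approved
    ]
    if not compiled_preview_ok:
        # Scenario 4: launch cannot fall back to a local approximation of the SQL.
        blockers.append("Superset-side compiled SQL preview has not succeeded")
    return blockers
```

Returning all blockers at once, rather than failing on the first, matches the requirement that the system highlight the specific values and approvals that must be completed.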

Edge Cases

  • What happens when a dataset has enough structural metadata to document technically but not enough business context to explain its meaning?
  • How does the system handle a Superset link that identifies the dataset but contains no reusable native filters?
  • What happens when imported filters conflict with previously saved defaults or with the dataset's documented business meaning?
  • How does the system handle parameterized placeholders that exist in the run context but do not yet have values?
  • What happens when a user skips clarification questions and proceeds with warnings?
  • How does the system present cases where one attribute is confirmed by a user, inferred from metadata, and contradicted by imported filter context?
  • What happens when a user leaves during clarification or run preparation and returns later?
  • What happens when a semantic label exists in a spreadsheet dictionary, a reference dataset, and an AI proposal with different values?
  • How does the system handle fuzzy semantic matches where source and target names are similar in meaning but not identical in form?
  • What happens when a user manually edits a semantic value and a higher-confidence imported source becomes available later?

Requirements (mandatory)

Functional Requirements

  • FR-001: The system MUST allow users to start dataset review and execution preparation from the frontend workspace by selecting a dataset source or providing a Superset link.
  • FR-002: The system MUST generate an initial dataset profile that distinguishes confirmed metadata, inferred metadata, imported metadata, unresolved metadata, and AI-draft metadata where applicable.
  • FR-003: The system MUST produce human-readable dataset documentation that explains the dataset purpose, business meaning, major attributes, filters, and known limitations in language suitable for operational stakeholders.
  • FR-004: The system MUST assign and display a current readiness state for the dataset review so users can immediately understand whether the dataset is review-ready, semantic-source-review-needed, clarification-needed, partially ready, or run-ready.
  • FR-005: The system MUST validate dataset completeness and consistency across attributes, business semantics, semantic enrichment sources, filters, assumptions, and execution readiness.
  • FR-006: The system MUST classify validation findings into blocking issues, warnings, and informational findings.
  • FR-007: The system MUST allow users to inspect the provenance of important dataset values, including whether each value was confirmed from a connected dictionary, imported from a trusted dataset, inferred from fuzzy matching, generated as an AI draft, manually edited by a user, or still unresolved.
  • FR-008: The system MUST search connected semantic sources during automatic review, including supported external dictionaries and trusted reference datasets, before creating AI-generated semantic values from scratch.
  • FR-009: The system MUST support semantic enrichment for at least verbose_name, description, and display formatting metadata for dataset fields and metrics when such metadata is available from a trusted source.
  • FR-010: The system MUST apply a visible confidence hierarchy to semantic enrichment candidates in this order: exact dictionary/file match, trusted reference dataset match, fuzzy semantic match, AI-generated draft.
  • FR-011: The system MUST allow users to choose and apply a semantic source from the frontend workspace using supported source types, including uploaded files, connected tabular dictionaries, and existing trusted Superset datasets.
  • FR-012: The system MUST allow users to start a guided clarification flow for unresolved or contradictory dataset details.
  • FR-013: The guided clarification flow MUST present one focused question at a time rather than an unstructured list of unresolved items.
  • FR-014: Each clarification question MUST explain why the answer matters and, when available, show the system's current best guess.
  • FR-015: The system MUST allow the user to answer with a suggested option, provide a custom answer, skip the question, or mark the item for later expert review.
  • FR-016: The system MUST allow the agent to proactively recommend a semantic source when schema overlap or semantic similarity with a trusted source is strong enough to justify reuse.
  • FR-017: The system MUST distinguish exact semantic matches from fuzzy semantic matches and MUST require user review before fuzzy matches are applied.
  • FR-018: The system MUST preserve answers provided during clarification and immediately update the dataset profile, validation findings, and readiness state when those answers affect review outcomes.
  • FR-019: The system MUST allow users to pause and resume a clarification session without losing prior answers, unresolved items, or progress state.
  • FR-020: The system MUST summarize what changed when a clarification session ends, including resolved ambiguities, remaining ambiguities, and impact on run readiness.
  • FR-021: (Consolidated with FR-001)
  • FR-022: The system MUST extract reusable saved native filters from a provided Superset link whenever such filters are present and accessible.
  • FR-023: The system MUST detect and expose runtime template variables referenced by the dataset execution logic so they can be mapped from imported or user-provided filter values.
  • FR-024: The system MUST present extracted filters with their current value, source, confidence state, and whether user confirmation is required.
  • FR-025: The system MUST preserve partially recovered values when a Superset import is incomplete and MUST explain which parts were recovered successfully and which still require manual or guided completion.
  • FR-026: The system MUST support dataset execution contexts that include parameterized placeholders so users can complete required run-time values before launch.
  • FR-027: The system MUST provide a dedicated pre-run review that presents the effective dataset identity, selected filters, required placeholders, unresolved assumptions, and current warnings in one place before launch.
  • FR-028: The system MUST require explicit user approval for each warning-level mapping transformation before launch, while allowing the user to manually edit the mapped value instead of approving it.
  • FR-029: The system MUST require a successful Superset-side compiled SQL preview before launch and MUST keep launch blocked if the preview is unavailable or compilation fails.
  • FR-030: The system MUST prevent dataset launch when required values, required execution attributes, required warning approvals, or a required compiled preview are missing and MUST explain what must be completed.
  • FR-031: The system MUST allow users to review and adjust the assembled filter set before starting a dataset run.
  • FR-032: The system MUST use a Superset SQL Lab session as the canonical audited execution target for approved dataset launch.
  • FR-033: The system MUST record the dataset run context, including dataset identity, selected filters, parameter values, unresolved assumptions, the associated SQL Lab session reference, mapping approvals, semantic-source decisions, and execution outcome, so that users can audit or repeat the run later.
  • FR-034: The system MUST support a workflow where automatic review, semantic enrichment, guided clarification, and dataset execution can be used independently or in sequence on the same dataset.
  • FR-035: The system MUST provide exportable outputs for dataset documentation and validation results so users can share them outside the immediate workflow.
  • FR-036: The system MUST preserve a usable frontend session state when a user stops mid-flow so they can resume review, clarification, semantic enrichment review, or run preparation without reconstructing prior work.
  • FR-037: The system MUST make the recommended next action explicit at each major state of the workflow.
  • FR-038: The system MUST provide side-by-side comparison when multiple semantic sources disagree for the same field and MUST NOT silently overwrite a user-entered value with imported or AI-generated metadata.
  • FR-039: The system MUST preserve manual semantic overrides unless the user explicitly replaces them.
  • FR-040: The system MUST allow users to apply semantic enrichment selectively at field level rather than only as an all-or-nothing operation.
  • FR-041: The system MUST provide an inline feedback mechanism (thumbs up/down) for AI-generated content to support continuous improvement of semantic matching and summarization.
  • FR-042: The system MUST support multi-user collaboration on review sessions, allowing owners to invite collaborators with specific roles (viewer, reviewer, approver).
  • FR-043: The system MUST provide batch approval actions for mapping warnings and fuzzy semantic matches to reduce manual effort for experienced users.
  • FR-044: The system MUST capture and persist a structured event log of all session-related actions (e.g., source intake, answer submission, approval, launch) to support audit, replay, and collaboration visibility.
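
For FR-023, a real implementation would likely use Jinja's own parser (`jinja2.meta.find_undeclared_variables`) to detect runtime template variables. The stdlib-only sketch below approximates that for bare `{{ variable }}` placeholders and is illustrative only; it does not cover filters, loops, or macros:

```python
import re

# Simplified approximation of Jinja variable discovery: matches only plain
# {{ variable }} expressions in the dataset's execution SQL.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")

def find_template_variables(sql: str) -> set:
    """Return the runtime template variables referenced by the dataset SQL."""
    return set(PLACEHOLDER.findall(sql))

def unmapped_variables(sql: str, filter_values: dict) -> set:
    """Variables still needing an imported or user-provided value before launch."""
    return find_template_variables(sql) - set(filter_values)
```

The second function is the bridge between FR-022 (imported filters) and FR-026/FR-030: any variable it returns is a required run-time value that blocks launch until completed.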

Key Entities (include if feature involves data)

  • Dataset Profile: The consolidated representation of a dataset, including business purpose, attributes, filters, assumptions, readiness state, validation state, provenance of each important fact, and semantic enrichment status.
  • Validation Finding: A blocking issue, warning, or informational observation raised during dataset review, including severity, explanation, affected area, and resolution state.
  • Clarification Session: A resumable interaction record that stores unresolved questions, user answers, system guesses, expert-review flags, and remaining ambiguities for a dataset.
  • Semantic Source: A reusable origin of semantic metadata, such as an uploaded file, connected tabular dictionary, or trusted reference dataset, used to enrich field- and metric-level business meaning.
  • Semantic Mapping Decision: A recorded choice about which semantic source or proposed value was accepted, rejected, edited, or left unresolved for a field or metric.
  • Imported Filter Set: The collection of reusable filters extracted from a Superset link, including source context, mapped dataset fields, current values, confidence state, and confirmation status.
  • Dataset Run Context: The execution-ready snapshot of dataset inputs, selected filters, parameterized placeholders, unresolved assumptions, warnings, mapping approvals, semantic-source decisions, the associated SQL Lab session reference, and launch outcome used for auditing or replay.
  • Readiness State: The current workflow status that tells the user whether the dataset is still being recovered, ready for review, needs semantic-source review, needs clarification, is partially ready, or is ready to run.
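
The Readiness State entity (together with FR-004) implies a small state vocabulary. The sketch below shows one possible derivation from validation counts; both the enum member names and the derivation rule are assumptions, not mandated by the spec:

```python
from enum import Enum

class ReadinessState(Enum):
    """Workflow states named by the Readiness State entity; identifiers illustrative."""
    RECOVERING = "recovering"
    REVIEW_READY = "review-ready"
    SEMANTIC_SOURCE_REVIEW_NEEDED = "semantic-source-review-needed"
    CLARIFICATION_NEEDED = "clarification-needed"
    PARTIALLY_READY = "partially-ready"
    RUN_READY = "run-ready"

def derive_readiness(blocking: int, warnings: int, unresolved_semantics: int) -> ReadinessState:
    """One possible derivation from validation-finding counts (a sketch, not the spec's rule)."""
    if blocking:
        return ReadinessState.CLARIFICATION_NEEDED
    if unresolved_semantics:
        return ReadinessState.SEMANTIC_SOURCE_REVIEW_NEEDED
    if warnings:
        return ReadinessState.PARTIALLY_READY
    return ReadinessState.RUN_READY
```

Deriving the state from validation findings, rather than storing it independently, keeps the readiness display consistent with FR-018's requirement that answers immediately update both findings and readiness.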

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: At least 90% of datasets submitted with standard source metadata produce an initial documentation draft without requiring manual reconstruction from scratch.
  • SC-002: Users can reach a first readable validation and documentation summary for a newly submitted dataset in under 5 minutes for the primary workflow.
  • SC-003: At least 70% of eligible semantic fields are populated from trusted external dictionaries or trusted reference datasets before AI-generated drafting is needed.
  • SC-004: At least 85% of clarification questions shown in guided mode are judged by pilot users as relevant and helpful to resolving ambiguity (measured via the built-in feedback mechanism).
  • SC-005: At least 80% of Superset links containing reusable saved native filters result in an imported filter set that users can review without rebuilding the context manually.
  • SC-006: At least 85% of pilot users correctly identify which values are confirmed versus imported versus inferred versus AI-generated during moderated usability review.
  • SC-007: At least 90% of dataset runs started from an imported or clarified context include a complete recorded run context that can be reopened later.
  • SC-008: Pilot users successfully complete the end-to-end flow of import, review, semantic enrichment, clarification, and launch on their first attempt in at least 75% of observed sessions.
  • SC-009: Support requests caused by missing or unclear dataset attributes decrease by at least 40% within the target pilot group after adoption.

Assumptions

  • Users already have permission to access the datasets and Superset artifacts they submit to ss-tools.
  • Saved native filters embedded in a Superset link are considered the preferred reusable source of analytical context when available.
  • Users need both self-service automation and a guided conversational path because dataset semantics are often incomplete, implicit, conflicting, or distributed across multiple semantic sources.
  • The feature is intended for internal operational use where clarity, traceability, semantic consistency, and repeatable execution are more important than raw execution speed.
  • Exportable documentation and validation outputs are required for collaboration, review, and audit use cases.
  • Users may choose to proceed with warnings, but not with missing required execution inputs, missing required mapping approvals, or missing required compiled preview.
  • Superset SQL Lab session creation is the canonical audited launch path for approved execution.
  • Warning-level mapping transformations require explicit user approval before launch, while manual correction remains optional.
  • Launch requires a successful Superset-side compiled preview and cannot fall back to an unverified local approximation.
  • Trusted semantic sources already exist or can be introduced incrementally through frontend-managed files, connected dictionaries, or reference datasets without requiring organizations to discard existing semantic workflows.