code written

2026-03-10 12:00:18 +03:00
parent a127aa07df
commit 0078c1ae05
57 changed files with 53951 additions and 4909 deletions


@@ -0,0 +1,40 @@
# Specification Quality Checklist: Dashboard Health Windows
**Purpose**: Validate specification completeness and quality before proceeding to planning
**Created**: 2026-03-10
**Feature**: [spec.md](../spec.md)
## Content Quality
- [x] No implementation details (languages, frameworks, APIs)
- [x] Focused on user value and business needs
- [x] Written for non-technical stakeholders
- [x] All mandatory sections completed
## UX Consistency
- [x] Functional requirements fully support the 'Happy Path' in ux_reference.md
- [x] Error handling requirements match the 'Error Experience' in ux_reference.md
- [x] No requirements contradict the defined User Persona or Context
## Requirement Completeness
- [x] No [NEEDS CLARIFICATION] markers remain
- [x] Requirements are testable and unambiguous
- [x] Success criteria are measurable
- [x] Success criteria are technology-agnostic (no implementation details)
- [x] All acceptance scenarios are defined
- [x] Edge cases are identified
- [x] Scope is clearly bounded
- [x] Dependencies and assumptions identified
## Feature Readiness
- [x] All functional requirements have clear acceptance criteria
- [x] User scenarios cover primary flows
- [x] Feature meets measurable outcomes defined in Success Criteria
- [x] No implementation details leak into specification
## Notes
- Items marked incomplete require spec updates before `/speckit.clarify` or `/speckit.plan`


@@ -0,0 +1,140 @@
# Module Contracts: Dashboard Health Windows
## Backend Contracts
### [DEF:ValidationPolicyModel:Class]
# @TIER: STANDARD
# @SEMANTICS: SQLAlchemy, Model
# @PURPOSE: Database model for storing validation scheduling rules.
# @LAYER: Domain
# @RELATION: DEPENDS_ON -> SQLAlchemy declarative base
# @PRE: Policy payload contains valid window_start and window_end strings.
# @POST: Policy is persisted with default `is_active=True`.
[/DEF:ValidationPolicyModel]
### [DEF:ValidationPolicySchema:Class]
# @TIER: TRIVIAL
# @SEMANTICS: Pydantic, Schema
# @PURPOSE: API contract for reading and writing Validation Policies.
# @LAYER: API
# @RELATION: DEPENDS_ON -> ValidationPolicyModel
[/DEF:ValidationPolicySchema]
### [DEF:HealthService:Class]
# @TIER: CRITICAL
# @SEMANTICS: Service, Aggregation
# @PURPOSE: Aggregates dashboard metadata with the latest ValidationRecord to produce health summaries.
# @LAYER: Domain
# @RELATION: CALLS -> SupersetClient.get_dashboards_summary
# @RELATION: DEPENDS_ON -> ValidationRecord
# @PRE: Valid environment_id is provided.
# @POST: Returns an aggregated list of DashboardHealthItems with status, issues, and timestamp.
# @TEST_CONTRACT: Input(environment_id: str) -> List[DashboardHealthItem]
# @TEST_SCENARIO: test_health_aggregation_success -> Returns latest record per dashboard.
# @TEST_FIXTURE: test_health_aggregation_success -> mock_dashboards_and_records
# @TEST_EDGE: test_health_aggregation_no_records -> Returns UNKNOWN status for dashboards without records.
# @TEST_INVARIANT: latest_record_wins -> VERIFIED_BY: [test_health_aggregation_success]
[/DEF:HealthService]
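The `@POST` and `latest_record_wins` invariant above imply a simple reduction: keep only the most recent `ValidationRecord` per dashboard, and fall back to `UNKNOWN` when none exists. A minimal sketch of that aggregation (the dataclass field names mirror the contract but are assumptions, not the real models):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# Hypothetical shapes mirroring the contract; real models live in the backend.
@dataclass
class ValidationRecord:
    dashboard_id: str
    status: str               # PASS | WARN | FAIL
    issues: List[str]
    validated_at: datetime

@dataclass
class DashboardHealthItem:
    dashboard_id: str
    status: str
    issues: List[str]
    last_validation_time: Optional[datetime]

def aggregate_health(dashboard_ids: List[str],
                     records: List[ValidationRecord]) -> List[DashboardHealthItem]:
    """Latest record wins; dashboards with no record get UNKNOWN status."""
    latest = {}
    for rec in records:
        cur = latest.get(rec.dashboard_id)
        if cur is None or rec.validated_at > cur.validated_at:
            latest[rec.dashboard_id] = rec
    items = []
    for dash_id in dashboard_ids:
        rec = latest.get(dash_id)
        if rec is None:
            items.append(DashboardHealthItem(dash_id, "UNKNOWN", [], None))
        else:
            items.append(DashboardHealthItem(dash_id, rec.status,
                                             rec.issues, rec.validated_at))
    return items
```

This directly encodes both `test_health_aggregation_success` (newer record replaces older) and `test_health_aggregation_no_records` (UNKNOWN fallback).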
### [DEF:HealthRouter:Module]
# @TIER: STANDARD
# @SEMANTICS: FastAPI, Route
# @PURPOSE: API endpoints for the Health Center UI.
# @LAYER: API
# @RELATION: CALLS -> HealthService
# @PRE: Request includes valid authentication token.
# @POST: Returns JSON list of dashboard health statuses.
[/DEF:HealthRouter]
### [DEF:NotificationService:Class]
# @TIER: CRITICAL
# @SEMANTICS: Service, PubSub
# @PURPOSE: Evaluates policies and routes formatted notifications to dynamically resolved owners and explicit channels.
# @LAYER: Domain
# @RELATION: DEPENDS_ON -> ValidationResult
# @RELATION: DEPENDS_ON -> ValidationPolicyModel
# @RELATION: CALLS -> ProfileService (to map Superset owners to local contacts)
# @PRE: Receives a completed ValidationResult and its trigger Policy.
# @POST: Dispatches messages asynchronously via BackgroundTasks to configured providers.
# @TEST_CONTRACT: Input(ValidationResult, Policy) -> List[DispatchedAlert]
# @TEST_SCENARIO: test_owner_routing -> Matches Superset owner 'ivan' to profile Telegram ID and sends.
# @TEST_FIXTURE: test_owner_routing -> mock_profile_with_telegram
# @TEST_EDGE: test_missing_profile_contact -> Skips owner if no valid contact info is found without crashing.
# @TEST_INVARIANT: async_dispatch -> VERIFIED_BY: [test_owner_routing]
[/DEF:NotificationService]
### [DEF:NotificationProvider:Interface]
# @TIER: STANDARD
# @SEMANTICS: Interface, Strategy
# @PURPOSE: Base abstraction for sending formatted alerts to external systems (SMTP, Slack, Telegram).
# @LAYER: Infra
# @PRE: Receives a standardized AlertPayload (text + image links).
# @POST: Delivers the payload to the external system.
[/DEF:NotificationProvider]
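As a Strategy-pattern interface, the contract above could be sketched as an abstract base class with one concrete provider (the `AlertPayload` fields and `SlackProvider` name are illustrative assumptions; real delivery would call the external API instead of printing):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List

@dataclass
class AlertPayload:
    """Standardized payload per @PRE: text plus image links."""
    text: str
    image_links: List[str] = field(default_factory=list)

class NotificationProvider(ABC):
    """Strategy base: each provider delivers a standardized payload."""
    @abstractmethod
    def send(self, payload: AlertPayload) -> None: ...

class SlackProvider(NotificationProvider):
    def __init__(self, webhook_url: str):
        self.webhook_url = webhook_url

    def send(self, payload: AlertPayload) -> None:
        # Real code would POST to the webhook; print keeps the sketch runnable.
        print(f"slack -> {self.webhook_url}: {payload.text}")
```

`NotificationService` would then iterate resolved targets and call `provider.send(...)` without knowing the medium.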
### [DEF:ThrottledSchedulerConfigurator:Module]
# @TIER: CRITICAL
# @SEMANTICS: Scheduler, APScheduler
# @PURPOSE: Reads active ValidationPolicies and generates distributed cron/date triggers within the defined execution window.
# @LAYER: Core
# @RELATION: DEPENDS_ON -> ValidationPolicyModel
# @RELATION: CALLS -> APScheduler
# @PRE: Receives a list of active policies and current time.
# @POST: N tasks are scheduled within [window_start, window_end] at roughly equal intervals.
# @TEST_CONTRACT: Input(List[ValidationPolicy], current_time) -> List[Trigger]
# @TEST_SCENARIO: test_window_distribution -> 60 tasks in 60 mins -> 1 min interval.
# @TEST_FIXTURE: test_window_distribution -> policy_1hr_60tasks
# @TEST_EDGE: test_window_too_small -> 100 tasks in 1 min -> Falls back to minimum safe interval or logs warning.
# @TEST_INVARIANT: distribution_bounds -> VERIFIED_BY: [test_window_distribution, test_window_too_small]
[/DEF:ThrottledSchedulerConfigurator]
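The `@TEST_CONTRACT` and `@TEST_EDGE` above reduce to a small interval calculation; a sketch of that math (the `MIN_SAFE_INTERVAL` value is an assumed fallback floor, not a decided constant):

```python
from datetime import datetime, timedelta
from typing import List

MIN_SAFE_INTERVAL = timedelta(seconds=30)  # assumed floor for the edge case

def distribute_runs(window_start: datetime, window_end: datetime,
                    n_tasks: int) -> List[datetime]:
    """Spread n_tasks evenly across [window_start, window_end].

    When the window is too small (@TEST_EDGE), fall back to the minimum
    safe interval, so some runs may land past window_end; real code would
    also log a warning.
    """
    if n_tasks <= 0:
        return []
    interval = (window_end - window_start) / n_tasks
    if interval < MIN_SAFE_INTERVAL:
        interval = MIN_SAFE_INTERVAL
    return [window_start + i * interval for i in range(n_tasks)]
```

Each returned datetime would be fed to APScheduler as a `DateTrigger`, satisfying `test_window_distribution` (60 tasks / 60 min → 1-minute spacing).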
## Frontend Contracts
### [DEF:HealthMatrix:Component]
# @TIER: STANDARD
# @SEMANTICS: Svelte, UI
# @PURPOSE: Displays traffic light summary cards (Pass, Warn, Fail).
# @LAYER: UI
# @RELATION: READS_FROM -> Health API Response
# @PRE: Receives array of dashboard health data.
# @POST: Renders three numeric summary cards with distinct colors.
# @UX_STATE: Loading -> Skeleton cards.
# @UX_STATE: Idle -> Distinct colored counts.
[/DEF:HealthMatrix]
### [DEF:HealthCenterPage:Page]
# @TIER: CRITICAL
# @SEMANTICS: SvelteKit, Route
# @PURPOSE: Main monitoring view for dashboard validation health.
# @LAYER: UI
# @RELATION: CALLS -> requestApi(/api/dashboards/health)
# @RELATION: BINDS_TO -> HealthMatrix
# @PRE: User has permission to view dashboards.
# @POST: Renders Environment selector, Health Matrix, and detailed data table.
# @UX_STATE: Loading -> Page skeleton.
# @UX_STATE: Error -> Toast notification and empty state.
# @UX_STATE: Success -> Matrix and populated table.
# @UX_TEST: FilterClick -> {click: "Fail only", expected: table shows only RED rows}
[/DEF:HealthCenterPage]
### [DEF:AutomationPoliciesPage:Page]
# @TIER: STANDARD
# @SEMANTICS: SvelteKit, Route
# @PURPOSE: Settings view to create and manage validation policies.
# @LAYER: UI
# @RELATION: CALLS -> requestApi(/api/settings/automation/policies)
# @PRE: User is Admin.
# @POST: Renders list of policies and 'Create Rule' modal.
# @UX_STATE: Creating -> Modal is open with form.
# @UX_FEEDBACK: SaveSuccess -> Toast "Policy scheduled".
[/DEF:AutomationPoliciesPage]
### [DEF:SidebarHealthBadge:Store]
# @TIER: STANDARD
# @SEMANTICS: Svelte Store
# @PURPOSE: Derived or fetched state to show `[🔴 N]` badge on the sidebar.
# @LAYER: UI-State
# @RELATION: DEPENDS_ON -> ActivityStore or specific Health poll
# @PRE: User is logged in.
# @POST: Provides integer count of currently failing dashboards.
[/DEF:SidebarHealthBadge]


@@ -0,0 +1,138 @@
# Data Model: Dashboard Health Windows
## Entities
### `ValidationPolicy` (New)
**Layer**: Domain (Backend DB)
**Purpose**: Defines a scheduled rule for validating a group of dashboards within an execution window.
| Field | Type | Description |
|-------|------|-------------|
| `id` | UUID (PK) | Unique policy identifier |
| `name` | String | Human-readable name for the policy (e.g., "Nightly Critical Prod") |
| `environment_id` | String | Foreign Key to configured Superset environment |
| `is_active` | Boolean | Whether the policy is currently enabled (default: true) |
| `dashboard_ids` | JSON | Array of dashboard string identifiers/slugs targeted by this policy |
| `schedule_days` | JSON | Array of integers representing days of week (0=Mon, 6=Sun) |
| `window_start` | Time | Start time of the execution window (e.g., 01:00) |
| `window_end` | Time | End time of the execution window (e.g., 05:00) |
| `notify_owners` | Boolean | Whether to route alerts dynamically based on Superset owners |
| `custom_channels` | JSON | List of external channels (e.g., `[{"type": "slack", "target": "#alerts"}]`) |
| `alert_condition` | Enum | Trigger condition: `FAIL_ONLY`, `WARN_AND_FAIL`, `ALWAYS` |
| `created_at` | DateTime | Audit timestamp |
| `updated_at` | DateTime | Audit timestamp |
**Relationships**:
* Targets multiple Dashboards (implied via JSON array of IDs).
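One way the table above could map onto SQLAlchemy (column names follow the table; the string-UUID primary key, table name, and exact types are assumptions, not final decisions):

```python
import uuid
from datetime import datetime, timezone

from sqlalchemy import JSON, Boolean, Column, DateTime, Enum, String, Time
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ValidationPolicy(Base):
    __tablename__ = "validation_policies"

    id = Column(String, primary_key=True, default=lambda: str(uuid.uuid4()))
    name = Column(String, nullable=False)
    environment_id = Column(String, nullable=False)
    is_active = Column(Boolean, default=True, nullable=False)
    dashboard_ids = Column(JSON, default=list)
    schedule_days = Column(JSON, default=list)       # 0=Mon .. 6=Sun
    window_start = Column(Time, nullable=False)
    window_end = Column(Time, nullable=False)
    notify_owners = Column(Boolean, default=False)
    custom_channels = Column(JSON, default=list)
    alert_condition = Column(
        Enum("FAIL_ONLY", "WARN_AND_FAIL", "ALWAYS", name="alert_condition"),
        default="FAIL_ONLY",
    )
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))
    updated_at = Column(DateTime, default=lambda: datetime.now(timezone.utc),
                        onupdate=lambda: datetime.now(timezone.utc))
```

The JSON columns keep the schema simple for V1's explicit `dashboard_ids` list; a join table would only pay off if policies needed relational queries against dashboards.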
### `NotificationConfig` (New)
**Layer**: Domain (Backend DB - `AppConfigRecord` or standalone)
**Purpose**: Global settings for external notification providers (configured by Admins).
| Field | Type | Description |
|-------|------|-------------|
| `id` | UUID (PK) | Unique provider config identifier |
| `type` | Enum | Provider type: `SMTP`, `SLACK`, `TELEGRAM` |
| `name` | String | Display name for the config |
| `credentials` | JSON | Encrypted connection details (Host, Token, Password) |
| `is_active` | Boolean | Whether the provider is enabled |
### `UserDashboardPreference` (Extended)
**Layer**: Domain (Backend DB)
**Purpose**: Extended to hold user contact details for smart routing.
| Field | Type | Description |
|-------|------|-------------|
| ... existing ... | | |
| `telegram_id` | String | User's Telegram Chat ID for direct messages |
| `email_address` | String | Override email address for direct notifications |
| `notify_on_fail` | Boolean | Opt-out toggle for automated owner alerts |
### `ValidationRecord` (Updated/Extended)
**Layer**: Domain (Backend DB)
**Purpose**: Represents the outcome of a single dashboard validation task.
*Note: This entity likely already exists partially; this defines the required shape for this feature.*
| Field | Type | Description |
|-------|------|-------------|
| `id` | UUID (PK) | Unique record identifier |
| `task_id` | UUID (FK) | Reference to the underlying execution task |
| `dashboard_id` | String (Index) | Identifier of the dashboard checked |
| `environment_id` | String (Index) | Environment where the dashboard lives |
| `status` | Enum | Validation status: `PASS`, `WARN`, `FAIL`, `UNKNOWN` |
| `issues` | JSON | Array of structured issue tags/types found (e.g., `["SQL_ERROR", "TIMEOUT"]`) |
| `llm_summary` | Text | Human-readable explanation of the findings from the LLM |
| `screenshot_path` | String (Optional)| Path to the visual artifact in storage |
| `validated_at` | DateTime | Timestamp of the check |
**Relationships**:
* `task_id` points to `TaskRecord`.
## API Contracts
### `DashboardItem` (Extension)
The existing `DashboardItem` schema returned by `GET /api/dashboards` will be extended to support the Health Center view.
```json
// Extended fields added to existing response
{
"validation_status": "PASS | WARN | FAIL | UNKNOWN",
"validation_issues": ["Broken Chart", "Timeout"],
"last_validation_time": "2026-03-10T02:15:00Z",
"last_validation_task_id": "uuid-string"
}
```
### Validation Policy Endpoints
#### `GET /api/settings/automation/policies`
Returns a list of all validation policies.
#### `POST /api/settings/automation/policies`
Creates a new validation policy.
**Request Body**:
```json
{
"name": "Nightly Check",
"environment_id": "ss-prod",
"dashboard_ids": ["123", "456"],
"schedule_days": [0, 1, 2, 3, 4],
"window_start": "01:00",
"window_end": "05:00",
"notify_owners": true,
"custom_channels": [
{"type": "slack", "target": "#data-team"}
],
"alert_condition": "FAIL_ONLY"
}
```
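A hedged sketch of how this request body could be validated server-side (Pydantic v2 assumed; field names follow the JSON above, and the same-day window check is a deliberate simplification — an overnight window like 22:00–02:00 would need extra handling):

```python
from datetime import time
from enum import Enum
from typing import List

from pydantic import BaseModel, Field, model_validator

class AlertCondition(str, Enum):
    FAIL_ONLY = "FAIL_ONLY"
    WARN_AND_FAIL = "WARN_AND_FAIL"
    ALWAYS = "ALWAYS"

class ChannelSpec(BaseModel):
    type: str      # e.g. "slack" | "telegram" | "smtp"
    target: str    # e.g. "#data-team"

class ValidationPolicyCreate(BaseModel):
    name: str
    environment_id: str
    dashboard_ids: List[str] = Field(min_length=1)
    schedule_days: List[int] = Field(default_factory=list)  # 0=Mon .. 6=Sun
    window_start: time
    window_end: time
    notify_owners: bool = False
    custom_channels: List[ChannelSpec] = Field(default_factory=list)
    alert_condition: AlertCondition = AlertCondition.FAIL_ONLY

    @model_validator(mode="after")
    def check_window(self):
        # Assumes same-day windows for V1.
        if self.window_end <= self.window_start:
            raise ValueError("window_end must be after window_start")
        return self
```

FastAPI would accept this model directly as the `POST` body, rejecting empty dashboard lists and inverted windows before they reach the scheduler.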
#### Global Notification Settings
- `GET /api/settings/notifications/providers`
- `POST /api/settings/notifications/providers`
- `PUT /api/settings/notifications/providers/{id}`
#### `PUT /api/settings/automation/policies/{policy_id}`
Updates an existing policy.
#### `DELETE /api/settings/automation/policies/{policy_id}`
Deletes a policy.
### AI Assistant Output
The existing Assistant endpoints will need to return deep links to the new Health Center and detailed reports.
```json
// Example Action returned by Assistant
{
"action_type": "NAVIGATE",
"url": "/dashboards/health?filter=fail",
"label": "View Failed Dashboards"
}
```


@@ -0,0 +1,88 @@
# Implementation Plan: Dashboard Health Windows
**Branch**: `026-dashboard-health-windows` | **Date**: 2026-03-10 | **Spec**: [spec.md](./spec.md)
**Input**: Feature specification from `/specs/026-dashboard-health-windows/spec.md`
## Summary
Implement automated LLM validation policies with "Execution Windows" to distribute database load, and create a centralized "Health Center" UI to monitor the latest validation status (Pass/Warn/Fail) of all dashboards, integrating status badges into the main sidebar and AI assistant.
## Technical Context
**Language/Version**: Python 3.9+ (Backend), Node.js 18+ / Svelte 5.x (Frontend)
**Primary Dependencies**: FastAPI, SQLAlchemy, APScheduler (Backend) | SvelteKit, Tailwind CSS, existing UI components (Frontend)
**Storage**: PostgreSQL / SQLite (existing database for `ValidationRecord` and new `ValidationPolicy`)
**Testing**: pytest (Backend), vitest (Frontend)
**Target Platform**: Linux server (Docker), Web Browser
**Project Type**: Full-stack web application (Internal Tool)
**Performance Goals**: Schedule and distribute 100+ validation tasks within a 1-hour window without database CPU spikes > 80%; Health Center loads in < 2 seconds.
**Constraints**: Must integrate with existing `TaskManager` and `LLMAnalysisPlugin`; UI must reuse existing Tailwind patterns.
**Scale/Scope**: Dozens of policies, hundreds of dashboards, generating daily validation records.
## Constitution Check
*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
- **Semantic Protocol Compliance**: `[DEF]`, `@PRE`, `@POST`, `@UX_STATE` tags will be strictly used for new modules and components.
- **Modular Plugin Architecture**: The validation logic will continue to reside in the existing `DashboardValidationPlugin`. The scheduling logic (`ThrottledScheduler`) will be integrated into the core `SchedulerService` or built as a specialized orchestrator.
- **Unified Frontend Experience**: The new Health Center will use Tailwind CSS and standard API wrappers (`fetchApi`/`requestApi`). No native `fetch`.
- **Security & RBAC**: The Health Center and Settings will use existing permission guards (e.g., Admin or specific dashboard view rights).
- **Independent Testability**: The scheduler logic (time distribution calculation) and Health Center API data aggregation will be independently testable.
- **Asynchronous Execution**: The background validation tasks are inherently async via `TaskManager`.
## Project Structure
### Documentation (this feature)
```text
specs/026-dashboard-health-windows/
├── plan.md
├── research.md
├── data-model.md
├── quickstart.md
├── contracts/
│   └── modules.md
└── tasks.md
```
### Source Code
```text
backend/
├── src/
│   ├── api/
│   │   └── routes/
│   │       ├── health.py                # New endpoint for Health Center data
│   │       └── settings.py              # Update for validation policies CRUD
│   ├── core/
│   │   └── throttled_scheduler.py       # NEEDS CLARIFICATION: Should this be a standalone orchestrator or extend APScheduler?
│   ├── models/
│   │   └── llm.py                       # Update ValidationRecord, add ValidationPolicy
│   └── services/
│       └── health_service.py            # Aggregation logic for Health Center
frontend/
├── src/
│   ├── lib/
│   │   ├── api/
│   │   │   └── health.js                # API client for Health Center
│   │   └── components/
│   │       └── health/
│   │           ├── HealthMatrix.svelte
│   │           └── PolicyForm.svelte
│   └── routes/
│       ├── dashboards/
│       │   └── health/
│       │       └── +page.svelte         # Health Center view
│       └── settings/
│           └── automation/
│               └── +page.svelte         # Validation policies view
```
**Structure Decision**: A standard full-stack slice adding a new feature view (Health Center) and a settings view (Automation Policies), supported by backend API routes, models, and a custom scheduling component.
## Complexity Tracking
| Violation | Why Needed | Simpler Alternative Rejected Because |
|-----------|------------|-------------------------------------|
| Custom Scheduler wrapper | APScheduler handles strict cron, but we need dynamic, distributed execution within a window. | Writing a monolithic cron job that sleeps/loops blocks the worker thread and is hard to observe. |


@@ -0,0 +1,38 @@
# Quickstart: Dashboard Health Windows
## Overview
This feature introduces "Execution Windows" for automated LLM dashboard validations, allowing administrators to schedule checks over a period of time (e.g., 1am-5am) rather than at a single exact minute. It also adds a centralized "Health Center" UI to easily monitor which dashboards are currently failing their checks.
## Implementation Steps
1. **Backend Database Updates**:
- Create `ValidationPolicy` SQLAlchemy model (in `backend/src/models/llm.py` or new file).
- Generate Alembic migration for the new table.
2. **Backend API Routes**:
- Create CRUD routes for `ValidationPolicy` (e.g., `backend/src/api/routes/settings.py` or new `automation.py`).
- Enhance `backend/src/api/routes/dashboards.py` (or create `health.py`) to provide the aggregated health projection (joining Superset dashboard list with the latest `ValidationRecord`).
3. **Scheduler Integration**:
- Implement `ThrottledSchedulerConfigurator` logic inside or alongside `backend/src/core/scheduler.py`.
- Ensure it can parse active `ValidationPolicy` rows and translate them into distinct `apscheduler` triggers spaced evenly across the `[window_start, window_end]` interval.
4. **Frontend Settings (Automation)**:
- Add a new tab/page under Settings for "Automation Policies".
- Implement list view and Create/Edit modal for policies.
- Include intuitive time pickers for the Execution Window.
5. **Frontend Health Center**:
- Create `/dashboards/health` route in SvelteKit.
- Implement the `HealthMatrix` traffic light summary component.
- Build the data table to display the dashboard names, status badges, and issue summaries.
6. **UI Integration**:
- Add a derived store to `frontend/src/lib/stores/sidebar.js` (or similar) to fetch/calculate the `[🔴 N]` badge for the Dashboards navigation item.
- Update `backend/src/api/routes/assistant.py` so the LLM assistant can answer questions about "failing dashboards" and provide deep links to the new Health Center.
## Testing Strategy
- **Backend Unit Tests**: Verify the math of `ThrottledSchedulerConfigurator` (e.g., 60 tasks in 60 minutes = 1 minute intervals). Test boundary conditions (100 tasks in 1 minute).
- **Backend Route Tests**: Verify the API correctly aggregates the *latest* validation record for the Health Center, ignoring older passing/failing records.
- **Frontend Unit Tests**: Test `HealthMatrix` rendering with different combinations of Pass/Warn/Fail counts.
- **E2E/Integration**: Create a policy, wait for it to schedule, run a mock validation, and verify it appears in the Health Center.


@@ -0,0 +1,30 @@
# Phase 0: Research & Clarifications
## Needs Clarification Resolution
### 1. ThrottledScheduler Architecture
**Context**: We need to schedule N tasks evenly across a time window (e.g., 100 tasks between 01:00 and 05:00) rather than at a single exact time, to avoid database overload.
**Decision**: Instead of a completely standalone orchestrator, we will enhance the existing `SchedulerService` (which wraps `APScheduler`) with a specific policy type for "Windowed Execution". When the scheduler evaluates a `ValidationPolicy`, it will dynamically generate N distinct job triggers spread across the configured window using `CronTrigger` or `DateTrigger`.
**Rationale**: `APScheduler` is already running as a reliable background process in our FastAPI app. Building a custom orchestrator loop would duplicate persistence and recovery logic. By calculating the distributed execution times at the point of policy evaluation (or via a daily setup job), we can feed those exact times into the existing robust scheduler.
**Alternatives considered**:
1. **Queue-based throttling**: Push all 100 tasks to a queue and use a rate-limited worker. *Rejected* because we want users to predictably know *when* a task will run (e.g., "Sometime between 1am and 5am"), not just randomly delay it.
2. **Standalone orchestrator thread**: A loop that sleeps and triggers tasks. *Rejected* due to complexity in managing state if the server restarts.
### 2. Health Center Data Aggregation
**Context**: The Health Center needs to display the *latest* validation status for each dashboard.
**Decision**: We will extend the existing `ResourceService.get_dashboards_with_status` to include the aggregated LLM validation outcome (derived from the most recent `ValidationRecord` for each dashboard). The frontend `DashboardHub` already has grid capabilities; we will create a specialized "Health View" projection of this grid, optimized for showing the `ValidationRecord` structured issues and statuses.
**Rationale**: Reusing the existing dashboard hub fetching logic (`get_dashboards`) ensures consistency with RBAC, environment filtering, and Git status. It prevents duplicating the heavy lifting of joining Superset dashboards with local SQLite metadata.
**Alternatives considered**:
1. **Dedicated `/health` endpoint**: Querying only `ValidationRecord` and joining backward to Superset. *Rejected* because Superset is the source of truth for dashboard existence and ownership; querying SQLite first might show deleted dashboards.
### 3. Policy Execution Scope
**Context**: How do we define which dashboards are in a policy?
**Decision**: A `ValidationPolicy` will store a JSON list of explicit `dashboard_id`s, or dynamic tags (e.g., "tag:production", "owner:data-team"), and an `environment_id`. For V1, to simplify the UI and ensure predictable scheduling math, we will support explicit selection of dashboards (saving an array of IDs).
**Rationale**: Explicit IDs map perfectly to the requirement "Select 15 dashboards". It allows the scheduler to exactly know N = 15.
**Alternatives considered**: Purely dynamic tag evaluation at runtime. *Rejected for V1* because if a tag applies to 1000 dashboards, the scheduler wouldn't know until the moment of execution, making it harder to pre-calculate the execution window intervals.


@@ -0,0 +1,106 @@
# Feature Specification: Dashboard Health Windows
**Feature Branch**: `026-dashboard-health-windows`
**Created**: 2026-03-10
**Status**: Draft
**Input**: User description: "init_task.md about Dashboard Validation Policies, Health Center, and Smart Notifications"
## User Scenarios & Testing *(mandatory)*
### User Story 1 - Create Validation Policy with Execution Window (Priority: P1)
As a System Administrator or Data Engineer, I want to create validation policies with an "Execution Window" rather than strict cron times, so that the system can automatically distribute the checks and avoid overloading the database with simultaneous requests.
**Why this priority**: Without this, scheduling many dashboards will cause database spikes and system instability, breaking the core value proposition of automated checking.
**Independent Test**: Can be tested by creating a policy with a 1-hour window for 60 dashboards and verifying that the system schedules 60 distinct validations spread roughly 1 minute apart.
**Acceptance Scenarios**:
1. **Given** I am on the "Settings -> Automation" page, **When** I click "Create Rule", select 15 dashboards, and set a schedule window from 01:00 to 05:00, **Then** the policy is saved and the system confirms the checks will be distributed.
2. **Given** a saved policy with an execution window, **When** the system processes it, **Then** it calculates the optimal interval and schedules individual validations sequentially rather than concurrently.
---
### User Story 2 - Dashboard Health Center Monitoring (Priority: P2)
As a Data Analyst or Data Team Lead, I want to view a "Health Center" dashboard that shows a traffic-light summary of the latest validation results, so that I can immediately identify and investigate broken dashboards from last night's checks.
**Why this priority**: Users need a high-level, action-oriented view to monitor system health; a flat log of tasks makes it impossible to find failures efficiently.
**Independent Test**: Can be tested by running several validation tasks (some pass, some fail) and viewing the Health Center to ensure it correctly aggregates and displays only the *latest* status for each unique dashboard.
**Acceptance Scenarios**:
1. **Given** several validations ran overnight, **When** I open the Health Center, **Then** I see an aggregated summary (e.g., 🟢 145 | 🟡 5 | 🔴 2).
2. **Given** I am looking at the Health Center table, **When** I click "View Report" on a failed dashboard row, **Then** I am navigated to the detailed LLM report for that specific run.
3. **Given** a dashboard has been checked multiple times, **When** I view the Health Center, **Then** I only see the result of the most recent check for that dashboard.
---
### User Story 3 - Quick Navigation and Integrations (Priority: P3)
As a daily user, I want to see notification badges in the sidebar and be able to ask the AI assistant about failing dashboards, so that I am proactively alerted to issues without having to dig through menus.
**Why this priority**: This provides quick wins for usability and engagement, integrating the new health data into the core user experience.
**Independent Test**: Can be tested by having failing validations in the database and checking that the sidebar navigation displays a red badge with the correct count, and querying the assistant returns accurate results.
**Acceptance Scenarios**:
1. **Given** there are 2 currently failing dashboards, **When** I view the main sidebar, **Then** the "Dashboards" menu item displays a badge like `[🔴 2]`.
2. **Given** I open the AI assistant, **When** I ask "Show dashboards that failed the night check", **Then** the assistant returns a list of failing dashboards with their issues and links to the Health Center.
---
### User Story 4 - Smart Notifications Routing (Priority: P2)
As a Dashboard Owner, I want to receive automated alerts (via Email, Slack, or Telegram) when a dashboard I own fails validation, so that I don't have to constantly check the Health Center manually.
**Why this priority**: Shifting from a "pull" to a "push" model is essential for alerting at scale and preventing alert fatigue by routing only relevant failures to specific owners.
**Independent Test**: Can be tested by configuring a user profile with a Telegram ID, running a validation task that fails for a dashboard owned by that user in Superset, and verifying the `NotificationService` dispatches a payload to the Telegram provider.
**Acceptance Scenarios**:
1. **Given** I have added my Telegram ID to my profile settings, **When** a validation policy with "Auto-notify Owners" enabled finds a failure in my dashboard, **Then** I receive a Telegram message with the LLM summary and deep links.
2. **Given** a policy is configured to alert a general Slack channel, **When** the policy executes and finds failures, **Then** a summarized alert is dispatched to the configured Slack webhook.
### Edge Cases
- What happens when the selected execution window is too short for the number of dashboards selected (e.g., 5 minutes for 100 dashboards)?
- How does the system handle a policy that targets dashboards that are subsequently deleted?
- What happens if a scheduled check fails to execute entirely (e.g., scheduler crash) instead of just failing the validation?
- How does the Health Center display a dashboard that has never been checked yet?
## Requirements *(mandatory)*
### Functional Requirements
- **FR-001**: System MUST allow users to define Validation Policies targeting specific environments and sets of dashboards (selected individually, by tag, or all).
- **FR-002**: System MUST allow configuring an "Execution Window" (Start Time and End Time) and target Days of the Week for a policy.
- **FR-003**: System MUST calculate throttled execution times within the execution window to evenly distribute validation tasks.
- **FR-004**: System MUST allow configuring alert routing in Validation Policies (e.g., Auto-notify Owners, send to specific Channels) when checks result in configurable statuses (e.g., FAIL_ONLY).
- **FR-005**: System MUST provide a "Dashboard Health Center" view that aggregates the *latest* validation status for each dashboard.
- **FR-006**: System MUST record the dashboard identifier, environment identifier, validation status, specific issues found, summary text, and a screenshot for each validation check.
- **FR-007**: System MUST display a summary of statuses (Pass, Warn, Fail counts) at the top of the Health Center.
- **FR-008**: System MUST display a red notification badge in the main sidebar next to "Dashboards" if there are currently failing dashboards.
- **FR-009**: System MUST allow the AI Assistant to query the latest validation statuses and return a formatted list of failing dashboards.
- **FR-010**: System MUST allow administrators to configure global notification providers (SMTP, Slack Webhooks, Telegram Bots).
- **FR-011**: System MUST match dashboard owners from Superset with local user profiles to perform smart routing of notifications.
- **FR-012**: System MUST format notifications specifically for the target medium (rich HTML with embedded images for email, compact text with links for messengers).
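FR-012's per-medium formatting could be sketched as a single dispatch point (function and parameter names are illustrative assumptions; real email output would use a proper HTML template):

```python
from typing import Optional

def format_alert(status: str, dashboard: str, summary: str,
                 screenshot_url: Optional[str], medium: str) -> str:
    """Render one alert per FR-012: rich HTML for email, compact text otherwise."""
    if medium == "email":
        img = f'<img src="{screenshot_url}">' if screenshot_url else ""
        return f"<h2>{status}: {dashboard}</h2><p>{summary}</p>{img}"
    # Messengers (Slack/Telegram): compact one-liner with an optional link.
    link = f" ({screenshot_url})" if screenshot_url else ""
    return f"[{status}] {dashboard}: {summary}{link}"
```

Keeping formatting out of the providers means `NotificationProvider` implementations only deliver payloads and never branch on medium themselves.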
### Key Entities
- **ValidationPolicy**: Defines the scope (dashboards, environment), schedule (execution window, days), and alert settings (notify owners, channel destinations).
- **ValidationRecord**: The result of a single dashboard check, extended with dashboard/environment identifiers, status, structured issues, and artifact indicators.
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: Users can schedule checks for 100+ dashboards without causing database CPU/Memory spikes over 80% during the execution window.
- **SC-002**: Users can identify all failing dashboards within 5 seconds of opening the Health Center.
- **SC-003**: The backend correctly distributes N scheduled jobs across the specified time window with a maximum variance of 10% from the optimal interval.
- **SC-004**: Users receive accurate responses from the AI Assistant regarding failing dashboards 100% of the time.
- **SC-005**: Failing validations trigger smart notifications to the correct dashboard owners (who have configured profiles) within 30 seconds of task completion.


@@ -0,0 +1,117 @@
# Implementation Tasks: Dashboard Health Windows
**Feature Branch**: `026-dashboard-health-windows`
**Documentation**: [plan.md](./plan.md) | [spec.md](./spec.md) | [ux_reference.md](./ux_reference.md) | [data-model.md](./data-model.md) | [contracts/modules.md](./contracts/modules.md)
## Dependencies & Execution Order
1. **Phase 1 & 2**: Setup and Foundation (DB changes, extending scheduler). Blocks all user stories.
2. **Phase 3**: User Story 1 (Policies). Needs Phase 2.
3. **Phase 4**: User Story 2 (Health Center). Can be built in parallel with Phase 3 on the frontend, but backend depends on data model changes in Phase 2.
4. **Phase 5**: User Story 3 (Quick Nav). Depends on Phase 4 Health Center backend API.
5. **Phase 6**: User Story 4 (Smart Notifications). Depends on Phase 2 (Profile schema) and Phase 3 (Policies).
6. **Phase 7**: Polish. Depends on all previous phases.
**Parallel Execution Opportunities**:
- `[US1]` (Policies UI) can be built in parallel with `[US2]` (Health Matrix UI) and `[US4]` (Notification settings UI).
- Backend and frontend tasks within a user story can generally be developed in parallel up to integration.
## Implementation Strategy
We will deliver this feature incrementally:
1. First, we establish the database schema for policies, profiles, and extended validation records.
2. Then, we build the "Execution Window" logic in the scheduler (`[US1]`).
3. Next, we build the Health Center projection API and UI (`[US2]`), giving visibility into the tasks.
4. We weave the health data into the global app shell (Sidebar, Assistant) (`[US3]`).
5. Finally, we implement the `NotificationService` and integrate it into the task execution pipeline to shift to a "push" model (`[US4]`).
---
## Phase 1: Setup
Goal: Initialize the project structure for the new feature modules.
- [x] T001 Scaffold `health.py` router in `backend/src/api/routes/health.py`
- [x] T002 Scaffold `health_service.py` in `backend/src/services/health_service.py`
- [x] T003 Scaffold `HealthMatrix.svelte` component in `frontend/src/lib/components/health/HealthMatrix.svelte`
- [x] T004 Scaffold `PolicyForm.svelte` component in `frontend/src/lib/components/health/PolicyForm.svelte`
- [x] T005 Create empty `+page.svelte` for Health Center in `frontend/src/routes/dashboards/health/+page.svelte`
- [x] T006 Create empty `+page.svelte` for Automation Policies in `frontend/src/routes/settings/automation/+page.svelte`
## Phase 2: Foundational
Goal: Implement the core data model and generic backend updates required by all stories.
- [x] T007 [P] Create `ValidationPolicy` model and update `ValidationRecord` model with new fields in `backend/src/models/llm.py`
- [x] T008 [P] Add `telegram_id`, `email_address`, and `notify_on_fail` fields to `UserDashboardPreference` in `backend/src/models/profile.py`
- [x] T009 [P] Create `NotificationConfig` model for global provider settings in `backend/src/models/config.py` (or as a JSON block in `AppConfigRecord`)
- [x] T010 Generate and apply Alembic migration for all DB changes in `backend/alembic/versions/`
- [x] T011 Update `TaskReport` schema to support the extended `ValidationRecord` shape in `backend/src/models/report.py`
## Phase 3: User Story 1 - Create Validation Policy with Execution Window (P1)
Goal: Allow users to define validation policies with execution windows and have the backend distribute tasks within that window.
**Independent Test**: Can be tested by creating a policy with a 1-hour window for 60 dashboards and verifying that the system schedules 60 distinct validations spread roughly 1 minute apart.
- [x] T012 [P] [US1] Create Pydantic schemas for `ValidationPolicy` (Create/Update/Response) in `backend/src/schemas/settings.py` (or new schema file)
- [x] T013 [US1] Implement CRUD endpoints for validation policies in `backend/src/api/routes/settings.py` (or new automation router)
- [x] T014 [US1] Implement `ThrottledSchedulerConfigurator` logic. (CRITICAL: PRE: active policies + time, POST: N tasks scheduled evenly. TESTS: window distribution, too small window fallback) in `backend/src/core/scheduler.py` (or new core module)
- [x] T015 [P] [US1] Create `PolicyForm` component for editing/creating policies in `frontend/src/lib/components/health/PolicyForm.svelte`
- [x] T016 [US1] Implement Automation Policies page list and modal management in `frontend/src/routes/settings/automation/+page.svelte`
- [x] T017 [US1] Add "Automation" tab to the global settings navigation in `frontend/src/routes/settings/+page.svelte`
- [x] T018 [US1] Verify implementation matches `ux_reference.md` (Happy Path & Errors) for the Automation Settings view.
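The distribution rule named in T014 can be sketched as follows. The spec does not show `ThrottledSchedulerConfigurator` internals, so the function name, the `MIN_INTERVAL` value, and the fallback behaviour below are assumptions, not the actual implementation:

```python
from datetime import datetime, timedelta

# Assumed minimum spacing between two checks to avoid DB spikes (hypothetical value).
MIN_INTERVAL = timedelta(seconds=30)

def distribute_runs(window_start: datetime, window_end: datetime, n_tasks: int) -> list[datetime]:
    """Spread n_tasks evenly across [window_start, window_end].

    Falls back to MIN_INTERVAL spacing when the window is too narrow,
    mirroring the "too small window fallback" test named in T014.
    """
    if n_tasks <= 0:
        return []
    if n_tasks == 1:
        return [window_start]
    interval = (window_end - window_start) / n_tasks
    if interval < MIN_INTERVAL:
        interval = MIN_INTERVAL  # window too narrow: spacing wins over the end boundary
    return [window_start + interval * i for i in range(n_tasks)]
```

Under this sketch, 60 dashboards in a 1-hour window come out exactly 1 minute apart (matching the independent test above), while a 5-minute window for 100 dashboards triggers the minimum-spacing fallback.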
## Phase 4: User Story 2 - Dashboard Health Center Monitoring (P2)
Goal: Provide a traffic-light summary view of the *latest* validation status for all dashboards.
**Independent Test**: Can be tested by running several validation tasks (some pass, some fail) and viewing the Health Center to ensure it correctly aggregates and displays only the *latest* status for each unique dashboard.
- [x] T019 [P] [US2] Implement `HealthService.get_health_summary`. (CRITICAL: PRE: environment_id, POST: aggregated List[DashboardHealthItem]. TESTS: aggregation success, no-records fallback) in `backend/src/services/health_service.py`
- [x] T020 [US2] Implement `GET /api/dashboards/health` endpoint in `backend/src/api/routes/health.py` (or extend `dashboards.py`)
- [x] T021 [P] [US2] Implement `HealthMatrix` UI component (Pass/Warn/Fail traffic lights) in `frontend/src/lib/components/health/HealthMatrix.svelte`
- [x] T022 [US2] Implement Health Center page fetching data and rendering the matrix and data table in `frontend/src/routes/dashboards/health/+page.svelte`
- [x] T023 [US2] Link "View Report" button in Health Center table to the detailed LLM report route `frontend/src/routes/reports/llm/[taskId]/+page.svelte`
- [x] T024 [US2] Verify implementation matches `ux_reference.md` (Happy Path & Errors) for the Health Center view.
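The latest-record aggregation described in T019 and the independent test can be sketched minimally; the field names here are illustrative dicts, not the real `ValidationRecord` schema:

```python
from collections import defaultdict

def latest_status_summary(records: list[dict]) -> dict:
    """Aggregate validation records into a traffic-light summary.

    Keeps only the most recent record per dashboard_id, since the Health
    Center shows the *latest* status for each unique dashboard.
    """
    latest: dict = {}
    for rec in records:
        key = rec["dashboard_id"]
        if key not in latest or rec["checked_at"] > latest[key]["checked_at"]:
            latest[key] = rec
    counts: dict = defaultdict(int)
    for rec in latest.values():
        counts[rec["status"]] += 1
    return {"pass": counts["PASS"], "warn": counts["WARN"], "fail": counts["FAIL"]}
```

A dashboard that failed and then passed on a later check counts as green, which is exactly the aggregation behaviour the independent test exercises.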
## Phase 5: User Story 3 - Quick Navigation and Integrations (P3)
Goal: Integrate the failing dashboard counts into the sidebar and the AI Assistant.
**Independent Test**: Can be tested by having failing validations in the database and checking that the sidebar navigation displays a red badge with the correct count, and querying the assistant returns accurate results.
- [x] T025 [US3] Create `SidebarHealthBadge` store (fetches or derives failing count) in `frontend/src/lib/stores/health.js` (or extend `activity.js`)
- [x] T026 [US3] Update `Sidebar` component to display the red `[N🔴]` badge next to the Dashboards menu item in `frontend/src/lib/components/layout/Sidebar.svelte`
- [x] T027 [US3] Update `backend/src/api/routes/assistant.py` (or the underlying tool catalog) to resolve queries about "failing dashboards" by querying the `HealthService` and returning deep links.
- [x] T028 [US3] Verify implementation matches `ux_reference.md` (Happy Path & Errors) for Sidebar and Assistant integration.
## Phase 6: User Story 4 - Smart Notifications Routing (P2)
Goal: Implement the backend routing logic and external providers to send push notifications.
**Independent Test**: Can be tested by configuring a user profile with a Telegram ID, running a validation task that fails for a dashboard owned by that user in Superset, and verifying the `NotificationService` dispatches a payload to the Telegram provider.
- [x] T029 [P] [US4] Update `frontend/src/routes/profile/+page.svelte` to include Telegram ID and email override inputs.
- [x] T030 [P] [US4] Implement `NotificationProvider` abstractions (SMTP, Slack, Telegram) in `backend/src/services/notifications/providers.py`
- [x] T031 [US4] Implement `NotificationService` for routing logic. (CRITICAL: PRE: ValidationResult, POST: Dispatches to providers via BackgroundTasks. TESTS: owner routing, missing profile resilience) in `backend/src/services/notifications/service.py`
- [x] T032 [US4] Wire `NotificationService.dispatch_report` into the end of `DashboardValidationPlugin.execute` in `backend/src/plugins/llm_analysis/plugin.py`
- [x] T033 [US4] Implement Global Settings UI for configuring Notification Providers in `frontend/src/routes/settings/notifications/+page.svelte`
- [x] T034 [US4] Verify implementation matches `ux_reference.md` (Happy Path & Errors) for Notification payloads and UI.
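The owner-routing and missing-profile-resilience contract named in T031 can be sketched like this; the dict shapes and tuple dispatch format are assumptions (the real service works with models and provider objects):

```python
def route_notifications(result: dict, profiles: dict, custom_channels: list) -> list:
    """Decide where a failing validation should be dispatched.

    Owners without a configured profile are silently skipped rather than
    raising, per the "missing profile resilience" test in T031.
    """
    dispatches: list = []
    if result["status"] != "FAIL":
        return dispatches
    for owner in result.get("owners", []):
        profile = profiles.get(owner)
        if not profile or not profile.get("notify_on_fail"):
            continue  # resilience: owner never configured notifications
        if profile.get("telegram_id"):
            dispatches.append(("telegram", profile["telegram_id"]))
        if profile.get("email_address"):
            dispatches.append(("smtp", profile["email_address"]))
    # Structured custom channels {type, target}, per fix R002 below.
    dispatches.extend((ch["type"], ch["target"]) for ch in custom_channels)
    return dispatches
```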
## Phase 7: Polish & Cross-Cutting
Goal: Finalize styling, error handling, and end-to-end flow.
- [x] T035 Ensure consistent Tailwind styling and dark mode support across `HealthMatrix` and `PolicyForm`.
- [x] T036 Add comprehensive error toasts for policy creation failures (e.g., overlapping windows, invalid IDs).
- [x] T037 Write/update unit tests for `ThrottledSchedulerConfigurator` bounds handling.
- [x] T038 Write/update unit tests for `HealthService` latest-record aggregation logic.
## Post-Review Fix Batch (2026-03-10)
- [x] R001 Persist `task_id` and `environment_id` in `ValidationRecord` creation path (`llm_analysis/plugin.py`).
- [x] R002 Align policy channel schema contract: `custom_channels` migrated to structured objects `{type,target}` in settings schemas.
- [x] R003 Tighten health status regex to strict grouped anchors `^(PASS|WARN|FAIL|UNKNOWN)$`.
- [x] R004 Resolve weekday convention drift to `0=Sunday ... 6=Saturday` consistently across backend schema description and policy form UI.
- [x] R005 Add regression tests for schema contracts and plugin persistence context fields.
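Fixes R003 and R004 pin down a status regex and a weekday convention; a minimal sketch makes both contracts concrete (the validator function itself is illustrative):

```python
import re

# Strict grouped anchors, per fix R003.
STATUS_RE = re.compile(r"^(PASS|WARN|FAIL|UNKNOWN)$")

# Weekday convention per fix R004: 0=Sunday ... 6=Saturday.
WEEKDAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

def is_valid_status(value: str) -> bool:
    """Accept only the four canonical statuses, case-sensitively."""
    return STATUS_RE.match(value) is not None
```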

# UX Conformance Verification: Phase 5 (Quick Navigation and Integrations)
**Feature**: Dashboard Health Windows (026)
**Phase**: 5
**Date**: 2026-03-10
## 1. Sidebar Badge Integration (T026)
- [x] **Requirement**: Red badge `[N🔴]` next to "Dashboards" menu item.
- [x] **Implementation**:
- Added `failingCount` derived store in `frontend/src/lib/stores/health.js`.
- Updated `Sidebar.svelte` to render a red numeric badge when `isExpanded` is true and `failingCount > 0`.
- Added a red dot indicator on the icon when `isExpanded` is false (collapsed sidebar) to maintain visibility.
- Added `onMount` refresh logic with 5-minute interval.
- [x] **UX Match**: Matches `ux_reference.md` section 3 (Dashboard Health Center).
## 2. AI Assistant Support (T027)
- [x] **Requirement**: Resolve queries about "failing dashboards" and provide deep links.
- [x] **Implementation**:
- Added `get_health_summary` intent to `backend/src/api/routes/assistant.py`.
- Implemented regex-based parsing for "здоровье", "health", "ошибки", "failing", "проблемы".
- Added `get_health_summary` to `_SAFE_OPS` for immediate execution.
- Dispatcher calls `HealthService.get_health_summary` and formats a detailed response.
- Added `AssistantAction` for "Open Health Center" (UI label "Открыть Health Center") and individual report links for failing dashboards.
- [x] **UX Match**: Matches `ux_reference.md` section 3 (Chat Assistant Interaction).
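The regex-based intent detection described above can be sketched minimally; the keyword list is taken verbatim from the T027 bullet, but the compiled pattern and function name are illustrative, not the actual `assistant.py` code:

```python
import re

# Keywords from the T027 implementation notes (Russian and English).
HEALTH_INTENT_RE = re.compile(r"здоровье|health|ошибки|failing|проблемы", re.IGNORECASE)

def detect_health_intent(message: str) -> bool:
    """Return True when a chat message should trigger get_health_summary."""
    return HEALTH_INTENT_RE.search(message) is not None
```

Because `get_health_summary` is registered in `_SAFE_OPS`, a positive match can dispatch immediately without a confirmation step.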
## 3. Technical Standards Compliance
- [x] **Semantics**: Followed `[DEF:...]` patterns in `health.js` and `assistant.py`.
- [x] **RBAC**: `get_health_summary` intent is protected by `plugin:migration:READ` permission, consistent with the Health API.
- [x] **I18n**: Sidebar labels use `$t` (though badge itself is numeric/iconic).
## 4. Verification Results
- **Frontend**: Store correctly fetches data from `/api/health/summary`. Badge appears/disappears based on `fail_count`.
- **Backend**: Assistant correctly identifies health queries and returns aggregated counts + deep links.
**Status**: ✅ PASS

# UX Reference: Dashboard Health Windows
**Feature Branch**: `026-dashboard-health-windows`
**Created**: 2026-03-10
**Status**: Phase 6 Verified
## 1. User Persona & Context
* **Who is the user?**: Data Engineer, System Administrator, or Data Analyst (Data Team).
* **What is their goal?**: Schedule automated LLM checks for many dashboards without overloading the database, and quickly monitor which dashboards are currently broken.
* **Context**: Managing dozens or hundreds of dashboards across environments (e.g., production, dev) via a web interface, and needing a high-level overview of system health every morning.
## 2. The "Happy Path" Narrative
The user navigates to "Settings -> Automation" and creates a new validation policy for 15 critical production dashboards. Instead of writing complex cron jobs, they simply set an "Execution Window" from 01:00 to 05:00. They check the box to "Auto-notify Owners". The system automatically spaces out the 15 checks within those 4 hours.
During the night, a dashboard owned by the user fails. The user immediately receives a Telegram message: "🚨 Validation Failed: Sales Executive (Broken Chart). [View in ss-tools]".
In the morning, the user opens the app and immediately notices a red badge `[2🔴]` next to "Dashboards" in the sidebar. They click it to open the "Health Center", which displays a clear "traffic light" summary: 13 Green, 0 Yellow, 2 Red. The two broken dashboards are listed at the top with a summary of the issues. They can also just ask the AI Assistant: "Which dashboards failed last night?" and get a direct link.
## 3. Interface Mockups
### UI Layout & Flow
**Screen/Component**: Automation Settings (Validation Policies)
* **Layout**: List view of existing policies with a "Create Rule" modal.
* **Key Elements**:
* **Policy List**: Shows Active/Paused status, target count, schedule window, and active notification channels.
* **Create Modal**:
* Scope selection (Environment dropdown, Filter by selected/tags/all).
* Schedule configuration (Days of week checkboxes, Start/End time for Execution Window).
* Alerts configuration:
* Trigger condition (e.g., "Only FAIL").
* Toggle: "Auto-notify Owners (Email / Messenger)".
* Add custom Channels (Slack webhook, General Email).
* **States**:
* **Info Message**: When selecting times, a helpful hint appears: "💡 System will automatically distribute 15 checks within this 4-hour window to avoid peak database load."
**Screen/Component**: Dashboard Health Center
* **Layout**: Top summary cards (Traffic lights), followed by a data table.
* **Key Elements**:
* **Environment Selector**: Dropdown to switch between `ss-prod`, `ss-dev`, etc.
* **Health Matrix Summary**: `🟢 145 Pass | 🟡 5 Warn | 🔴 2 Fail`
* **Data Table**: Columns for Dashboard Name, Status (Badge), Detected Issues (Text), Checked At (Time), and a "View Report" action button.
* **Sidebar Badge**: The main navigation sidebar has a badge `[2🔴]` next to the Dashboards menu item.
**Screen/Component**: Global Settings -> Notifications
* **Layout**: Form fields grouped by provider type (Email, Slack, Telegram).
* **Key Elements**:
* **Email (SMTP)**: Host, Port, Login, Password, "From" Address.
* **Slack**: Webhook URL list.
* **Telegram**: Bot Token.
### Notification Payloads
**Email Example (Rich HTML)**:
* **Subject**: 🔴 [ss-tools] Validation failed: Sales Executive Dashboard
* **Body**: 🔴 **FAIL:** Sales Executive (Env: ss-prod). "The Revenue YTD chart is broken. Error: column profit not found." [Embedded Screenshot] -> [Button: View Full Report].
**Messenger Example (Compact Text)**:
```text
🚨 **Dashboard Validation Failed**
**Dashboard:** [Sales Executive](link)
**Env:** ss-prod | **Time:** 03:15 AM
🤖 **LLM Summary:**
An SQL error was detected in the "Revenue" chart. Data is not displayed.
🔗 [View in ss-tools]
```
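The compact payload above can be rendered by a small formatter; the function signature, parameter names, and the decision to attach the report URL to "View in ss-tools" are assumptions (messengers also differ in which Markdown flavour they parse):

```python
def format_messenger_alert(dashboard: str, link: str, env: str, time_str: str, summary: str) -> str:
    """Render the compact messenger payload shown in the mockup above."""
    return (
        "🚨 **Dashboard Validation Failed**\n"
        f"**Dashboard:** [{dashboard}]({link})\n"
        f"**Env:** {env} | **Time:** {time_str}\n"
        "🤖 **LLM Summary:**\n"
        f"{summary}\n"
        f"🔗 [View in ss-tools]({link})"
    )
```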
### Chat Assistant Interaction
```text
User: "Show dashboards that failed the night check"
Assistant: "I found 2 dashboards that failed their last validation in production:
1. 🔴 Sales Executive (Issues: Broken Chart, Timeout)
2. 🔴 Marketing Funnel (Issues: SQL Syntax Error)
[View detailed reports in Health Center]"
```
## 4. The "Error" Experience
**Philosophy**: Guide the user to resolve failing dashboard validations quickly and prevent misconfiguration of schedules.
### Scenario A: Execution Window Too Small
* **User Action**: Selects an execution window of 5 minutes for 100 dashboards.
* **System Response**:
* (UI) Warning message appears below the time selection: "⚠️ The selected window is too narrow for 100 dashboards. This may cause database spikes. Recommended window: at least 60 minutes."
* **Recovery**: User adjusts the end time to broaden the window before saving.
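The warning's "at least 60 minutes for 100 dashboards" recommendation implies roughly 36 seconds per check; a sketch of how the UI hint might derive its suggestion (the per-check constant is an assumption inferred from those numbers, not a documented value):

```python
# Assumed budget per check, back-calculated from "100 dashboards -> 60 minutes".
RECOMMENDED_SECONDS_PER_CHECK = 36

def recommended_window_minutes(n_dashboards: int) -> int:
    """Minimum execution-window length (minutes) to suggest in the warning."""
    return max(1, (n_dashboards * RECOMMENDED_SECONDS_PER_CHECK + 59) // 60)
```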
### Scenario B: Dashboard Fails Validation
* **User Action**: Navigates to Health Center to view a failed dashboard.
* **System Response**: Row is highlighted in red. Clicking "View Report" takes them directly to the detailed LLM analysis page showing the screenshot and specific errors.
## 5. Tone & Voice
* **Style**: Professional, monitoring-focused, reassuring.
* **Terminology**: Use "Execution Window" for time ranges, "Health Center" for the monitoring view, and "Policies" for automated rules. Avoid overly technical scheduling terms like "Cron expressions" in the UI.
## 6. Phase 6 Implementation Verification
- [x] **T029**: Profile UI updated with Telegram ID and Email Override. Matches "Smart Notifications" section in User Preferences.
- [x] **T030/T031**: Backend routing logic implemented. Supports owner routing and custom channels.
- [x] **T032**: Dispatch wired into LLM analysis plugin.
- [x] **T033**: Global Notification Settings UI implemented with SMTP, Telegram, and Slack sections.