# Feature Specification: Dashboard Health Windows

**Feature Branch**: `026-dashboard-health-windows`
**Created**: 2026-03-10
**Status**: Draft
**Input**: User description: "init_task.md about Dashboard Validation Policies, Health Center, and Smart Notifications"

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Create Validation Policy with Execution Window (Priority: P1)

As a System Administrator or Data Engineer, I want to create validation policies with an "Execution Window" rather than strict cron times, so that the system can automatically distribute the checks and avoid overloading the database with simultaneous requests.

**Why this priority**: Without this, scheduling many dashboards will cause database spikes and system instability, breaking the core value proposition of automated checking.

**Independent Test**: Can be tested by creating a policy with a 1-hour window for 60 dashboards and verifying that the system schedules 60 distinct validations spread roughly 1 minute apart.

**Acceptance Scenarios**:

1. **Given** I am on the "Settings -> Automation" page, **When** I click "Create Rule", select 15 dashboards, and set a schedule window from 01:00 to 05:00, **Then** the policy is saved and the system confirms the checks will be distributed.
2. **Given** a saved policy with an execution window, **When** the system processes it, **Then** it calculates the optimal interval and schedules individual validations sequentially rather than concurrently.
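
The throttled distribution described above can be sketched roughly as follows. This is an illustrative snippet, not the actual implementation; the function name, signature, and dashboard identifiers are assumptions:

```python
from datetime import datetime, timedelta


def distribute_validations(dashboards: list[str],
                           window_start: datetime,
                           window_end: datetime) -> list[tuple[str, datetime]]:
    """Spread one validation per dashboard evenly across the window.

    With 60 dashboards and a 1-hour window this yields start times
    roughly 1 minute apart, matching the Independent Test above.
    """
    if not dashboards:
        return []
    # Divide the window into equal slots, one per dashboard.
    interval = (window_end - window_start) / len(dashboards)
    return [(d, window_start + i * interval)
            for i, d in enumerate(dashboards)]


# Example: 60 dashboards in a 1-hour window -> starts 60 seconds apart
slots = distribute_validations(
    [f"dash-{i}" for i in range(60)],
    datetime(2026, 3, 10, 1, 0),
    datetime(2026, 3, 10, 2, 0),
)
```

Sequential rather than concurrent execution then falls out naturally: each dashboard gets its own start time, and no two validations are scheduled for the same instant.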
---

### User Story 2 - Dashboard Health Center Monitoring (Priority: P2)

As a Data Analyst or Data Team Lead, I want to view a "Health Center" dashboard that shows a traffic-light summary of the latest validation results, so that I can immediately identify and investigate broken dashboards from last night's checks.

**Why this priority**: Users need a high-level, action-oriented view to monitor system health; a flat log of tasks makes it impossible to find failures efficiently.

**Independent Test**: Can be tested by running several validation tasks (some pass, some fail) and viewing the Health Center to ensure it correctly aggregates and displays only the *latest* status for each unique dashboard.

**Acceptance Scenarios**:

1. **Given** several validations ran overnight, **When** I open the Health Center, **Then** I see an aggregated summary (e.g., 🟢 145 | 🟡 5 | 🔴 2).
2. **Given** I am looking at the Health Center table, **When** I click "View Report" on a failed dashboard row, **Then** I am navigated to the detailed LLM report for that specific run.
3. **Given** a dashboard has been checked multiple times, **When** I view the Health Center, **Then** I only see the result of the most recent check for that dashboard.
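
The latest-only aggregation can be sketched as below. This is a minimal in-memory illustration assuming a simple record shape; production storage would typically do the same thing with a window function or `GROUP BY` query:

```python
from collections import Counter


def latest_statuses(records: list[dict]) -> dict[str, dict]:
    """Keep only the most recent validation record per dashboard.

    Each record is assumed to carry `dashboard_id`, `checked_at`
    (sortable), and `status` keys.
    """
    latest: dict[str, dict] = {}
    for rec in records:
        cur = latest.get(rec["dashboard_id"])
        if cur is None or rec["checked_at"] > cur["checked_at"]:
            latest[rec["dashboard_id"]] = rec
    return latest


records = [
    {"dashboard_id": "sales", "checked_at": 1, "status": "FAIL"},
    {"dashboard_id": "sales", "checked_at": 2, "status": "PASS"},
    {"dashboard_id": "ops", "checked_at": 1, "status": "FAIL"},
]
latest = latest_statuses(records)
# "sales" shows only its newest result; the counts feed the 🟢/🟡/🔴 header
summary = Counter(r["status"] for r in latest.values())
```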
---

### User Story 3 - Quick Navigation and Integrations (Priority: P3)

As a daily user, I want to see notification badges in the sidebar and be able to ask the AI assistant about failing dashboards, so that I am proactively alerted to issues without having to dig through menus.

**Why this priority**: This provides quick wins for usability and engagement, integrating the new health data into the core user experience.

**Independent Test**: Can be tested by having failing validations in the database and checking that the sidebar navigation displays a red badge with the correct count, and that querying the assistant returns accurate results.

**Acceptance Scenarios**:

1. **Given** there are 2 currently failing dashboards, **When** I view the main sidebar, **Then** the "Dashboards" menu item displays a badge like `[2🔴]`.
2. **Given** I open the AI assistant, **When** I ask "Show dashboards that failed the night check", **Then** the assistant returns a list of failing dashboards with their issues and links to the Health Center.

---

### User Story 4 - Smart Notifications Routing (Priority: P2)

As a Dashboard Owner, I want to receive automated alerts (via Email, Slack, or Telegram) when a dashboard I own fails validation, so that I don't have to constantly check the Health Center manually.

**Why this priority**: Shifting from a "pull" to a "push" model is essential for alerting at scale and preventing alert fatigue by routing only relevant failures to specific owners.

**Independent Test**: Can be tested by configuring a user profile with a Telegram ID, running a validation task that fails for a dashboard owned by that user in Superset, and verifying the `NotificationService` dispatches a payload to the Telegram provider.

**Acceptance Scenarios**:

1. **Given** I have added my Telegram ID to my profile settings, **When** a validation policy with "Auto-notify Owners" enabled finds a failure in my dashboard, **Then** I receive a Telegram message with the LLM summary and deep links.
2. **Given** a policy is configured to alert a general Slack channel, **When** the policy executes and finds failures, **Then** a summarized alert is dispatched to the configured Slack webhook.
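
The owner-to-profile matching behind this routing could look roughly like the sketch below. Function and field names (`route_failure`, `telegram_id`) are hypothetical, not the actual `NotificationService` API:

```python
def route_failure(dashboard_owners: list[str],
                  profiles: dict[str, dict],
                  channels: list[str]) -> list[tuple[str, str]]:
    """Resolve notification targets for one failed dashboard.

    `dashboard_owners` are Superset usernames; `profiles` maps a
    username to its locally configured contacts. Owners without a
    matching profile are skipped silently, while policy-level
    channels are always included.
    """
    targets: list[tuple[str, str]] = []
    for owner in dashboard_owners:
        profile = profiles.get(owner)
        if profile and profile.get("telegram_id"):
            targets.append(("telegram", profile["telegram_id"]))
    for channel in channels:
        targets.append(("slack", channel))
    return targets


targets = route_failure(
    ["alice", "bob"],                     # Superset owners of the dashboard
    {"alice": {"telegram_id": "12345"}},  # only alice has a local profile
    ["#data-alerts"],                     # policy-level channel destination
)
```

Note how "bob" receives nothing: routing only to owners with configured profiles is exactly what keeps the push model from degenerating into broadcast noise.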

---

### Edge Cases

- What happens when the selected execution window is too short for the number of dashboards selected (e.g., 5 minutes for 100 dashboards)?
- How does the system handle a policy that targets dashboards that are subsequently deleted?
- What happens if a scheduled check fails to execute entirely (e.g., scheduler crash) instead of just failing the validation?
- How does the Health Center display a dashboard that has never been checked yet?

## Requirements *(mandatory)*

### Functional Requirements

- **FR-001**: System MUST allow users to define Validation Policies targeting specific environments and sets of dashboards (selected individually, by tag, or all).
- **FR-002**: System MUST allow configuring an "Execution Window" (Start Time and End Time) and target Days of the Week for a policy.
- **FR-003**: System MUST calculate throttled execution times within the execution window to evenly distribute validation tasks.
- **FR-004**: System MUST allow configuring alert routing in Validation Policies (e.g., Auto-notify Owners, send to specific Channels) when checks result in configurable statuses (e.g., FAIL_ONLY).
- **FR-005**: System MUST provide a "Dashboard Health Center" view that aggregates the *latest* validation status for each dashboard.
- **FR-006**: System MUST record the dashboard identifier, environment identifier, validation status, specific issues found, summary text, and a screenshot for each validation check.
- **FR-007**: System MUST display a summary of statuses (Pass, Warn, Fail counts) at the top of the Health Center.
- **FR-008**: System MUST display a red notification badge in the main sidebar next to "Dashboards" if there are currently failing dashboards.
- **FR-009**: System MUST allow the AI Assistant to query the latest validation statuses and return a formatted list of failing dashboards.
- **FR-010**: System MUST allow administrators to configure global notification providers (SMTP, Slack Webhooks, Telegram Bots).
- **FR-011**: System MUST match dashboard owners from Superset with local user profiles to perform smart routing of notifications.
- **FR-012**: System MUST format notifications specifically for the target medium (rich HTML with embedded images for email, compact text with links for messengers).
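
As one way to satisfy FR-012, formatting could branch on the target medium. The helper below is a hypothetical sketch; names and markup are illustrative assumptions, not the final templates:

```python
def format_alert(medium: str, dashboard: str, summary: str, link: str) -> str:
    """Render one failure alert for its target medium.

    Email gets rich HTML; messengers (Slack, Telegram) get compact
    text with a link instead of embedded content.
    """
    if medium == "email":
        return (f"<h2>Validation failed: {dashboard}</h2>"
                f"<p>{summary}</p>"
                f"<a href='{link}'>Open report</a>")
    # Slack / Telegram: compact plain text
    return f"🔴 {dashboard} failed validation: {summary}\n{link}"


msg = format_alert("telegram", "Sales KPI",
                   "3 charts return empty data",
                   "https://example.com/health")
```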
### Key Entities

- **ValidationPolicy**: Defines the scope (dashboards, environment), schedule (execution window, days), and alert settings (notify owners, channel destinations).
- **ValidationRecord**: The result of a single dashboard check, extended with dashboard/environment identifiers, status, structured issues, and artifact indicators.

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: Users can schedule checks for 100+ dashboards without database CPU/memory spikes exceeding 80% during the execution window.
- **SC-002**: Users can identify all failing dashboards within 5 seconds of opening the Health Center.
- **SC-003**: The backend correctly distributes N scheduled jobs across the specified time window with a maximum variance of 10% from the optimal interval.
- **SC-004**: Users receive accurate responses from the AI Assistant regarding failing dashboards 100% of the time.
- **SC-005**: Failing validations trigger smart notifications to the correct dashboard owners (who have configured profiles) within 30 seconds of task completion.