# Research: LLM Table Translation Service

**Feature Branch**: `028-llm-datasource-supeset`
**Date**: 2026-05-08
## R1: Plugin Placement — New Plugin vs. Extending LLMAnalysisPlugin

### Decision

Create a new standalone plugin `TranslationPlugin` at `backend/src/plugins/translate/` rather than extending the existing `LLMAnalysisPlugin`.

### Rationale

- `LLMAnalysisPlugin` is focused on dashboard validation and documentation generation — a different domain from batch table translation.
- Translation requires new ORM models (`TranslationJob`, `TranslationRun`, `TranslationRecord`, `TerminologyDictionary`, `TranslationSchedule`, `TranslationEvent`), new API routes (`/api/translate/*`), new Svelte components, and new scheduler integration — scope that warrants a dedicated plugin.
- The existing plugin system (`PluginBase`) already supports multiple independent plugins and lazy discovery via `plugin_loader.py`.
- Separation avoids further bloating `llm_analysis/plugin.py` (already 481 lines, beyond the fractal limit of INV_7: <400 lines per module) and keeps the new code within that limit.
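As a rough sketch of what the decision implies (the `PluginBase` hook names and attributes below are assumptions, so a stand-in base class is defined to keep the example self-contained):

```python
# Hypothetical sketch: the real PluginBase lives in the plugin system and its
# exact hook names may differ. A stand-in base class keeps this example runnable.
class PluginBase:
    name: str = "base"

    def register(self) -> None: ...
    def unregister(self) -> None: ...


class TranslationPlugin(PluginBase):
    """Standalone plugin for batch table translation (assumed shape)."""

    name = "translate"

    def __init__(self) -> None:
        self.routes_prefix = "/api/translate"  # matches the planned route module
        self.registered = False

    def register(self) -> None:
        # Real code would mount routes, ORM models, and scheduler hooks here.
        self.registered = True

    def unregister(self) -> None:
        self.registered = False


plugin = TranslationPlugin()
plugin.register()
```

The point of the sketch is the boundary: the plugin owns its own routes, models, and lifecycle, so nothing leaks into `llm_analysis/plugin.py`.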

### Alternatives Considered

- **Extend LLMAnalysisPlugin**: Rejected because it would conflate two distinct feature domains, increase module size beyond the fractal limit, and complicate RBAC permission boundaries.
- **Create as a standalone service in `backend/src/services/translate/`**: Rejected because the plugin lifecycle (register, unregister, configuration persistence, API exposure) is already standardized via `PluginBase`. A standalone service would duplicate plugin machinery.

### Impact

- New directory: `backend/src/plugins/translate/` with `plugin.py`, `orchestrator.py`, `preview.py`, `executor.py`, `dictionary.py`, `sql_generator.py`, `scheduler.py`, `events.py`, `metrics.py`, `__tests__/`.
- New route module: `backend/src/api/routes/translate.py`.
- New model module: `backend/src/models/translate.py`.
- New schema module: `backend/src/schemas/translate.py`.
- Registered in the `backend/src/api/routes/__init__.py` `__all__` list.

---

## R2: LLM Prompt Construction Strategy

### Decision

Construct prompts using a layered template approach: base system prompt → dictionary glossary (per-batch filtered) → context columns → translation column values. Leverage the existing `llm_prompt_templates.py` for template rendering and `LLMProviderService` for provider selection.

### Rationale

- The existing `llm_prompt_templates.py` already supports `render_prompt()` with Jinja2-like substitution and multimodal detection.
- Per-batch dictionary filtering (FR-044): scan batch rows for substring matches against dictionary `source_term` values; only include matched entries in the prompt. This keeps token usage proportional to batch content, not dictionary size.
- Context columns are appended as structured fields (e.g., `Category: {category_name}\nDescription: {product_description}`) before the translation column value.
- The system prompt explicitly instructs the LLM: "Use the provided glossary for exact matches. For partial matches, prefer glossary translations. For terms not in the glossary, translate naturally."
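The per-batch filtering step above can be sketched as follows (the free-function form and the entry shape `{"source_term": ..., "target_term": ...}` are assumptions; the real logic would live on `DictionaryManager`):

```python
# Sketch of the per-batch glossary filter described above (FR-044).
# Entry shape and case-insensitive matching are assumptions.

def filter_for_batch(rows: list[str], dictionary: list[dict]) -> list[dict]:
    """Return only dictionary entries whose source_term appears in any batch row.

    Substring matching keeps the prompt glossary proportional to batch
    content rather than total dictionary size.
    """
    lowered_rows = [row.lower() for row in rows]
    return [
        entry
        for entry in dictionary
        if any(entry["source_term"].lower() in row for row in lowered_rows)
    ]


dictionary = [
    {"source_term": "ledger", "target_term": "Hauptbuch"},
    {"source_term": "invoice", "target_term": "Rechnung"},
]
batch = ["Quarterly ledger summary", "Shipment manifest"]
matched = filter_for_batch(batch, dictionary)
# → only the "ledger" entry matches this batch
```

A 5000-term dictionary thus contributes only the handful of matched entries to each prompt, bounding token cost per batch.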

### Alternatives Considered

- **Full dictionary injection**: Rejected — would exceed the LLM context window for dictionaries >5000 terms.
- **Semantic embedding search**: Rejected — adds unnecessary complexity (a vector DB dependency) when substring matching is sufficient for glossary use cases.
- **Separate LLM call for glossary matching**: Rejected — doubles API cost and latency without a proportional quality gain.

### Impact

- `DictionaryManager` must implement `filter_for_batch(rows: list[str]) -> list[dict]` returning matched entries.
- The prompt template includes `{{ glossary }}` and `{{ context }}` placeholder blocks.

---

## R3: SQL Generation — Dialect-Aware INSERT/UPSERT for Superset API

### Decision

Detect the target database dialect from the Superset datasource's connection configuration at job save time. Generate dialect-appropriate safe SQL: `INSERT INTO ... VALUES (...)` for ClickHouse; `INSERT INTO ... VALUES (...)` or `INSERT ... ON CONFLICT ...` for PostgreSQL/Greenplum. Submit generated SQL to Superset via `/api/v1/sqllab/execute/`.

### Rationale

- Different databases use different UPSERT syntax: PostgreSQL has `ON CONFLICT`; ClickHouse has no standard UPSERT (use INSERT with deduplication or `ALTER TABLE ... UPDATE`). Greenplum is PostgreSQL-compatible.
- Superset knows the database backend via the connection's `backend`/`engine` field — the system queries this at configuration time and caches the dialect on the `TranslationJob`.
- For ClickHouse, the `insert` strategy generates a plain INSERT; `skip_existing` is not natively supported (the system warns the user); `overwrite` uses `ALTER TABLE ... UPDATE` or INSERT with ReplacingMergeTree semantics (a documented limitation).
- For PostgreSQL/Greenplum, full UPSERT support: `ON CONFLICT DO NOTHING` (skip_existing) and `ON CONFLICT DO UPDATE` (overwrite).
- Identifier quoting: PostgreSQL/Greenplum use `"identifier"`; ClickHouse uses `` `identifier` `` or `"identifier"` depending on settings.
- Values are safely encoded per dialect: strings escaped, NULLs rendered as `NULL`.
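A minimal sketch of the dialect dispatch, under the assumption that `SQLGenerator` reduces to helper functions like these (the names `quote_ident`, `quote_value`, and `build_insert` are illustrative, not the real module's API):

```python
# Sketch of dialect-aware INSERT/UPSERT generation as described above.
# For brevity, all values are rendered as quoted strings; a real
# implementation would render numerics unquoted and handle more types.

def quote_ident(name: str, dialect: str) -> str:
    """Quote an identifier: backticks for ClickHouse, double quotes otherwise."""
    if dialect == "clickhouse":
        return f"`{name}`"
    return '"' + name.replace('"', '""') + '"'


def quote_value(value) -> str:
    """Render a value as a SQL literal: escape single quotes, map None to NULL."""
    if value is None:
        return "NULL"
    return "'" + str(value).replace("'", "''") + "'"


def build_insert(table: str, row: dict, dialect: str, strategy: str = "insert") -> str:
    cols = ", ".join(quote_ident(c, dialect) for c in row)
    vals = ", ".join(quote_value(v) for v in row.values())
    sql = f"INSERT INTO {quote_ident(table, dialect)} ({cols}) VALUES ({vals})"
    if dialect in ("postgresql", "greenplum") and strategy == "skip_existing":
        sql += " ON CONFLICT DO NOTHING"
    # ClickHouse has no native skip_existing; the caller warns the user instead.
    return sql


stmt = build_insert(
    "product_i18n",
    {"product_id": 7, "name_de": "Stuhl"},
    dialect="postgresql",
    strategy="skip_existing",
)
# → INSERT INTO "product_i18n" ("product_id", "name_de")
#   VALUES ('7', 'Stuhl') ON CONFLICT DO NOTHING
```

Caching the dialect on the job means this dispatch is decided once at save time, not re-queried from Superset on every run.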

### Alternatives Considered

- **PostgreSQL-only**: Rejected — the user's Superset instances may use ClickHouse as the primary analytical database. Dialect detection from the connection is the correct source of truth.
- **Manual SQL Lab copy/paste**: Rejected — Superset API execution is the canonical path.
- **UPDATE statements**: Rejected — source data is append-only (new-key-only strategy); UPSERT covers the overwrite case.

### Impact

- A `TranslationJob.database_dialect` field caches the detected dialect at save time.
- `SQLGenerator` dispatches to dialect-specific formatters.
- Dialect-specific SQL syntax tests are required for PostgreSQL and ClickHouse (SC-003).
- Unsupported dialects are rejected at configuration time with a clear error message.

---

## R4: Schedule Execution — APScheduler Integration

### Decision

Extend the existing `SchedulerService` (`backend/src/core/scheduler.py`) with a new job type, `translate_scheduled_run`. Each translation job's schedule configuration is stored in the `TranslationSchedule` model and loaded into APScheduler on service start and on schedule create/update.

### Rationale

- The existing `SchedulerService` already manages `BackgroundScheduler`, cron triggers, start/stop lifecycle, and task manager integration.
- Translation schedules are distinct from backup schedules — stored in `translate` models, loaded via a registration callback pattern.
- Schedule trigger: APScheduler fires → `run_scheduled_translation(job_id)` → creates a `TranslationRun` → the orchestrator processes new-key-only rows → generates INSERT statements.
- The concurrency policy (skip/queue) is enforced in the trigger handler before orchestrator invocation.
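The loading flow can be sketched as below. A stand-in scheduler keeps the example self-contained; in the real system the call would go through `SchedulerService` wrapping APScheduler, roughly `scheduler.add_job(run_scheduled_translation, CronTrigger.from_crontab(expr), id=...)`. The dataclass fields and job-id scheme are assumptions:

```python
# Sketch of loading TranslationSchedule rows into the scheduler on startup
# and on create/update. FakeScheduler stands in for SchedulerService.
from dataclasses import dataclass


@dataclass
class TranslationSchedule:
    job_id: str
    cron: str
    enabled: bool = True
    concurrency_policy: str = "skip"  # or "queue"


class FakeScheduler:
    """Stand-in for SchedulerService; records registrations by job id."""

    def __init__(self) -> None:
        self.jobs: dict[str, str] = {}

    def add_job(self, job_id: str, cron: str) -> None:
        self.jobs[job_id] = cron


def register_translation_schedules(schedules, scheduler) -> int:
    """Load enabled schedules into the scheduler; return how many were registered."""
    count = 0
    for sched in schedules:
        if not sched.enabled:
            continue  # disabled schedules stay out of APScheduler entirely
        scheduler.add_job(f"translate_scheduled_run:{sched.job_id}", sched.cron)
        count += 1
    return count


scheduler = FakeScheduler()
registered = register_translation_schedules(
    [
        TranslationSchedule(job_id="job-1", cron="0 2 * * *"),
        TranslationSchedule(job_id="job-2", cron="*/30 * * * *", enabled=False),
    ],
    scheduler,
)
# → registered == 1; only the enabled schedule is loaded
```

Namespacing job ids (`translate_scheduled_run:<job_id>`) lets schedule updates replace the existing APScheduler entry rather than accumulating duplicates.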

### Alternatives Considered

- **Separate scheduler instance**: Rejected — creates resource contention (two APScheduler instances) and complicates Docker deployment.
- **Celery/Redis-based scheduling**: Rejected — adds an infrastructure dependency; APScheduler is already proven in this codebase.
- **Cron-based external scheduling**: Rejected — requires OS-level cron configuration and loses programmatic control over pause/resume and concurrency policies.

### Impact

- `backend/src/plugins/translate/scheduler.py` registers translation job schedules with `SchedulerService`.
- A new trigger function `_execute_scheduled_translation(job_id: str)` is imported by `SchedulerService`.
- The existing `SchedulerService.load_schedules()` is extended to discover and register translation schedules alongside backup schedules.

---

## R5: Observability — Structured Event Log + MetricSnapshot

### Decision

Implement a dedicated `TranslationEvent` ORM model with a type-specific JSON payload for structured event logging (FR-046). Events are written synchronously within the orchestrator flow. Per-job cumulative metrics (FR-047) are computed from live `TranslationEvent` rows (recent data, <90 days) combined with `MetricSnapshot` rows (historical data, >90 days). At pruning time, a `MetricSnapshot` is persisted capturing cumulative tokens, cost, and run counts before events are deleted.

### Rationale

- Structured events provide queryability for audit, trend analysis, and the admin dashboard without coupling to log-parsing infrastructure.
- The JSON payload allows type-specific data.
- `MetricSnapshot` persistence before pruning ensures cumulative metrics survive the 90-day retention window (SC-014).
- The metrics dashboard reads: latest `MetricSnapshot` + events `WHERE timestamp > snapshot.covers_events_before`.
- Synchronous event writes within the run transaction ensure no event loss during crashes.
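The snapshot-plus-live-events aggregation can be sketched as follows (the field names `covers_events_before`, `tokens`, and `cost` are assumptions about the ORM models, mirrored here as dataclasses):

```python
# Sketch of combining the latest MetricSnapshot with post-cutoff events.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MetricSnapshot:
    covers_events_before: datetime
    total_tokens: int
    total_cost: float


@dataclass
class TranslationEvent:
    timestamp: datetime
    tokens: int
    cost: float


def cumulative_metrics(snapshot: MetricSnapshot, events: list[TranslationEvent]) -> dict:
    """Combine the latest snapshot with events newer than its cutoff.

    Events at or before covers_events_before are already folded into the
    snapshot (and may have been pruned), so they must not be double-counted.
    """
    live = [e for e in events if e.timestamp > snapshot.covers_events_before]
    return {
        "total_tokens": snapshot.total_tokens + sum(e.tokens for e in live),
        "total_cost": snapshot.total_cost + sum(e.cost for e in live),
    }


snap = MetricSnapshot(datetime(2026, 2, 1), total_tokens=10_000, total_cost=1.50)
events = [
    TranslationEvent(datetime(2026, 1, 20), tokens=500, cost=0.05),  # pre-cutoff: ignored
    TranslationEvent(datetime(2026, 3, 1), tokens=2_000, cost=0.25),
]
metrics = cumulative_metrics(snap, events)
# → {"total_tokens": 12000, "total_cost": 1.75}
```

The strict `>` on the cutoff is what makes the daily "persist snapshot, then prune" job safe: no event is counted in both the snapshot and the live window.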

### Alternatives Considered

- **Application log (stdout) only**: Rejected — not queryable for dashboards or audit.
- **Separate metrics table with counters (dual-write)**: Rejected — dual-write consistency risk; event-sourced + snapshot is simpler.
- **Events-only (no snapshots)**: Rejected — cumulative metrics would be lost after 90-day pruning (FR-049).

### Impact

- `TranslationEvent` model with a nullable `run_id` for pre-run events.
- `MetricSnapshot` model with a `covers_events_before` timestamp for a correct cutoff.
- `MetricsService` aggregates metrics from events plus snapshots.
- An APScheduler daily job persists a snapshot, then prunes expired events/records.

---

## R6: Frontend Architecture — Svelte 5 Runes Pattern

### Decision

Use Svelte 5 runes (`$state`, `$derived`, `$effect`) for all reactive state management in translation components. The store layer uses a dedicated `translate.js` Svelte store module with `$state` runes for the job list, current job config, preview state, and run progress. API calls use the existing `requestApi`/`fetchApi` wrapper pattern.

### Rationale

- Svelte 5 runes are the canonical reactivity model for this codebase (Svelte 5.43+ in package.json).
- A dedicated store per feature domain follows existing patterns (`frontend/src/lib/stores/` houses the auth, settings, and task stores).
- WebSocket run progress reuses the existing `TaskManager` WebSocket infrastructure — translation runs emit progress events on the same channel.
- Components follow existing layout patterns: Tailwind CSS, `@UX_STATE`/`@UX_FEEDBACK`/`@UX_RECOVERY`/`@UX_REACTIVITY` contract tags.

### Alternatives Considered

- **Svelte 4 stores (writable/derived)**: Rejected — the codebase has already migrated to Svelte 5 runes; mixing patterns creates inconsistency.
- **Separate WebSocket channel**: Rejected — the existing Task Drawer WebSocket infrastructure handles progress events generically; translation runs fit the same pattern.

### Impact

- New SvelteKit route: `frontend/src/routes/translate/` with sub-routes for job config, dictionaries, and history.
- New component library: `frontend/src/lib/components/translate/` with 8 components.
- New store: `frontend/src/lib/stores/translate.js` with `$state` runes.
- New API client: `frontend/src/lib/api/translate.js`.

---

## R7: RBAC Permission Model Integration

### Decision

Define 13 permission strings per the Access Control Matrix in spec.md and enforce them via the existing `PermissionChecker` dependency in FastAPI route handlers. No new database tables are needed — the existing `permissions` and `role_permissions` tables store string-based permissions.

### Rationale

- The existing RBAC model stores permissions as strings in the `role_permissions.permission` column, checked via dependency injection in route handlers.
- Granular permissions per resource type align with the existing pattern.
- Ownership constraints (owner OR admin) are enforced in route handlers alongside permission checks.
- Missing from the original design: `translate.job.view`, `translate.dictionary.view`, `translate.schedule.view`, and `translate.metrics.view` were added for read-only access scenarios.
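A minimal sketch of string-based permission checking. In the real code this would be a FastAPI dependency, roughly `Depends(require_permission("translate.job.view"))`; the checker shape and the `PermissionDenied` exception below are assumptions, while the permission strings come from this section:

```python
# Sketch of a require_permission factory as used via dependency injection.
class PermissionDenied(Exception):
    pass


def require_permission(permission: str):
    """Return a checker that raises unless the caller's role grants the permission."""

    def checker(user_permissions: set[str]) -> None:
        if permission not in user_permissions:
            raise PermissionDenied(permission)

    return checker


# Default analyst grants, per the Impact list for this section.
ANALYST = {
    "translate.job.view",
    "translate.job.execute",
    "translate.dictionary.view",
    "translate.history.view",
}

require_permission("translate.job.view")(ANALYST)  # passes silently

try:
    require_permission("translate.job.delete")(ANALYST)
    denied = False
except PermissionDenied:
    denied = True
# → denied is True: analysts cannot delete jobs
```

Because permissions are plain strings, adding the four read-only permissions is a seed-data change, not a schema migration.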
### Alternatives Considered

- **Resource-level ownership only (no granular permissions)**: Rejected — the spec explicitly requires granular permissions (FR-043).
- **Separate permission table per resource**: Rejected — over-engineered; string-based permissions are sufficient.

### Impact

- 13 permission strings registered in the RBAC seed.
- Route handlers annotated with `Depends(require_permission(...))` plus ownership checks.
- The admin UI displays the new permission strings for role assignment.
- Default analyst role: `translate.job.view`, `translate.job.execute`, `translate.dictionary.view`, `translate.history.view`.

---

## R8: Testing Strategy

### Decision

Multi-layer testing: (1) pytest unit tests for the orchestrator, executor, dictionary manager, SQL generator, scheduler, and event log; (2) pytest integration tests for API routes against a test database; (3) vitest component tests for Svelte components using @testing-library/svelte; (4) manual verification via `quickstart.md` for the end-to-end flow.

### Rationale

- Unit tests with mocked LLM responses and a mocked Superset client ensure fast feedback for business logic.
- Integration tests verify the API contract, database schema, RBAC enforcement, and schedule trigger behavior.
- Component tests validate Svelte 5 rune reactivity, UX state transitions, and error recovery paths.
- The manual quickstart provides a human-verifiable happy path that catches integration issues between backend and frontend.
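Layer (1) can be sketched as below, with `unittest.mock` standing in for the project's pytest fixtures. The `translate_batch` function is a toy stand-in for the orchestrator's batch step, not the real API:

```python
# Sketch of a unit test with a mocked LLM provider (layer 1 above).
from unittest.mock import Mock


def translate_batch(rows: list[str], llm) -> list[str]:
    """Toy stand-in for the orchestrator's batch step: one LLM call per batch."""
    response = llm.complete("\n".join(rows))
    return response.split("\n")


mock_llm = Mock()
mock_llm.complete.return_value = "Stuhl\nTisch"

result = translate_batch(["chair", "table"], mock_llm)
mock_llm.complete.assert_called_once_with("chair\ntable")
# → result == ["Stuhl", "Tisch"]
```

Asserting on the prompt passed to the mock (not just the parsed output) is what keeps these tests fast while still pinning the orchestrator's contract with the provider.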

### Alternatives Considered

- **E2E tests with Playwright**: Deferred to a future iteration — adds maintenance overhead; manual quickstart verification is sufficient for initial delivery.

### Impact

- Test files: `backend/src/plugins/translate/__tests__/`, `backend/tests/test_translate_api.py`, `frontend/src/lib/components/translate/__tests__/`.
- Fixtures: mock LLM provider responses, a mock Superset client, test dictionary data, and test translation job configuration.