# Research: LLM Table Translation Service
**Feature Branch**: `028-llm-datasource-supeset`
**Date**: 2026-05-08
## R1: Plugin Placement — New Plugin vs. Extending LLMAnalysisPlugin
### Decision
Create a new standalone plugin `TranslationPlugin` at `backend/src/plugins/translate/` rather than extending the existing `LLMAnalysisPlugin`.
### Rationale
- `LLMAnalysisPlugin` is focused on dashboard validation and documentation generation — a different domain from batch table translation.
- Translation requires new ORM models (`TranslationJob`, `TranslationRun`, `TranslationRecord`, `TerminologyDictionary`, `TranslationSchedule`, `TranslationEvent`), new API routes (`/api/translate/*`), new Svelte components, and new scheduler integration — scope that warrants a dedicated plugin.
- Existing plugin system (`PluginBase`) already supports multiple independent plugins and lazy discovery via `plugin_loader.py`.
- Separation avoids bloating `llm_analysis/plugin.py` (already 481 lines) and maintains the fractal limit (INV_7: <400 lines per module).
### Alternatives Considered
- **Extend LLMAnalysisPlugin**: Rejected — it would conflate two distinct feature domains, push the module size beyond the fractal limit, and complicate RBAC permission boundaries.
- **Create as a standalone service in `backend/src/services/translate/`**: Rejected because the plugin lifecycle (register, unregister, configuration persistence, API exposure) is already standardized via `PluginBase`. A standalone service would duplicate plugin machinery.
### Impact
- New directory: `backend/src/plugins/translate/` with `plugin.py`, `orchestrator.py`, `preview.py`, `executor.py`, `dictionary.py`, `sql_generator.py`, `scheduler.py`, `events.py`, `metrics.py`, `__tests__/`.
- New route module: `backend/src/api/routes/translate.py`.
- New model module: `backend/src/models/translate.py`.
- New schema module: `backend/src/schemas/translate.py`.
- Registered in `backend/src/api/routes/__init__.py` `__all__` list.
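The plugin shell can be sketched as follows. This is a minimal illustration, not the real interface: `PluginBase`'s actual method names and metadata fields live in the host codebase, and `PluginMeta` is a hypothetical stand-in.

```python
# Hypothetical sketch: a standalone TranslationPlugin registered via the
# existing plugin machinery. PluginMeta and the method names are assumptions.
from dataclasses import dataclass


@dataclass
class PluginMeta:
    name: str
    version: str
    routes_module: str


class PluginBase:
    """Stand-in for the shared base class discovered by plugin_loader.py."""
    meta: PluginMeta

    def register(self) -> None: ...
    def unregister(self) -> None: ...


class TranslationPlugin(PluginBase):
    meta = PluginMeta(
        name="translate",
        version="0.1.0",
        routes_module="backend.src.api.routes.translate",
    )

    def register(self) -> None:
        # On discovery: wire up ORM models, /api/translate/* routes,
        # and scheduler hooks.
        print(f"registered plugin {self.meta.name}")

    def unregister(self) -> None:
        print(f"unregistered plugin {self.meta.name}")
```

Keeping the plugin self-contained this way means `llm_analysis/plugin.py` is never touched and the fractal limit applies per new module.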
---
## R2: LLM Prompt Construction Strategy
### Decision
Construct prompts using a layered template approach: base system prompt → dictionary glossary (per-batch filtered) → context columns → translation column value. Leverage the existing `llm_prompt_templates.py` for template rendering and `LLMProviderService` for provider selection.
### Rationale
- Existing `llm_prompt_templates.py` already supports `render_prompt()` with Jinja2-like substitution and multimodal detection.
- Per-batch dictionary filtering (FR-044): scan batch rows for substring matches against dictionary `source_term` values; only include matched entries in the prompt. This keeps token usage proportional to batch content, not dictionary size.
- Context columns are appended as structured fields (e.g., `Category: {category_name}\nDescription: {product_description}`) before the translation column value.
- The system prompt explicitly instructs the LLM: "Use the provided glossary for exact matches. For partial matches, prefer glossary translations. For terms not in the glossary, translate naturally."
### Alternatives Considered
- **Full dictionary injection**: Rejected — would exceed the LLM context window for dictionaries larger than 5000 terms.
- **Semantic embedding search**: Rejected — adds unnecessary complexity (vector DB dependency) when substring matching is sufficient for glossary use cases.
- **Separate LLM call for glossary matching**: Rejected — doubles API cost and latency without proportional quality gain.
### Impact
- `DictionaryManager` must implement `filter_for_batch(rows: list[str]) -> list[dict]` returning matched entries.
- Prompt template includes `{{ glossary }}` and `{{ context }}` placeholder blocks.
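The per-batch filter and glossary rendering can be sketched as below. This is an illustrative sketch, assuming a dictionary entry shape of `{"source_term", "target_term"}`; the real `DictionaryManager` lives in `dictionary.py` and may differ.

```python
# Sketch of per-batch dictionary filtering (FR-044): only entries whose
# source_term appears as a substring in the batch reach the prompt, so
# token usage scales with batch content, not dictionary size.
class DictionaryManager:
    def __init__(self, entries: list[dict]):
        # each entry: {"source_term": str, "target_term": str}
        self.entries = entries

    def filter_for_batch(self, rows: list[str]) -> list[dict]:
        """Return entries whose source_term occurs (case-insensitively)
        as a substring in at least one batch row."""
        haystack = "\n".join(rows).lower()
        return [e for e in self.entries
                if e["source_term"].lower() in haystack]


def render_glossary_block(entries: list[dict]) -> str:
    # Rendered into the {{ glossary }} placeholder of the prompt template.
    return "\n".join(f'{e["source_term"]} => {e["target_term"]}'
                     for e in entries)


dictionary = DictionaryManager([
    {"source_term": "SKU", "target_term": "артикул"},
    {"source_term": "warehouse", "target_term": "склад"},
    {"source_term": "invoice", "target_term": "счёт"},
])
batch = ["Warehouse stock report", "SKU-level margins"]
print(render_glossary_block(dictionary.filter_for_batch(batch)))
# SKU => артикул
# warehouse => склад
```

Substring matching keeps the filter O(entries × batch text) with no extra infrastructure, which is the trade-off the semantic-embedding alternative was rejected over.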
---
## R3: SQL Generation — Dialect-Aware INSERT/UPSERT for Superset API
### Decision
Detect the target database dialect from the Superset datasource's connection configuration at job save time. Generate dialect-appropriate, safely escaped SQL: plain `INSERT INTO ... VALUES (...)` for ClickHouse; `INSERT INTO ... VALUES (...)` with optional `ON CONFLICT ...` clauses for PostgreSQL/Greenplum. Submit the generated SQL to Superset via `/api/v1/sqllab/execute/`.
### Rationale
- Different databases use different UPSERT syntax: PostgreSQL has `ON CONFLICT`, ClickHouse has no standard UPSERT (use INSERT with deduplication or ALTER TABLE UPDATE). Greenplum is PostgreSQL-compatible.
- Superset knows the database backend via the connection's `backend`/`engine` field — the system queries this at configuration time and caches the dialect on the TranslationJob.
- For ClickHouse, the `insert` strategy generates plain INSERT; `skip_existing` is not natively supported (the system warns the user); `overwrite` uses ALTER TABLE UPDATE or INSERT with ReplacingMergeTree semantics (documented limitation).
- For PostgreSQL/Greenplum, full UPSERT support: `ON CONFLICT DO NOTHING` (skip_existing) and `ON CONFLICT DO UPDATE` (overwrite).
- Identifier quoting: PostgreSQL/Greenplum uses `"identifier"`; ClickHouse uses `` `identifier` `` or `"identifier"` depending on settings.
- Values are safely encoded per dialect: strings escaped, NULLs rendered as `NULL`.
### Alternatives Considered
- **PostgreSQL-only**: Rejected — user's Superset instances may use ClickHouse as the primary analytical database. Dialect detection from the connection is the correct source of truth.
- **Manual SQL Lab copy/paste**: Rejected — Superset API execution is the canonical path.
- **UPDATE statements**: Rejected — source data is append-only (new-key-only strategy). UPSERT covers the overwrite case.
### Impact
- `TranslationJob.database_dialect` field caches the detected dialect at save time.
- `SQLGenerator` dispatches to dialect-specific formatters.
- Dialect-specific SQL syntax tests required for PostgreSQL and ClickHouse (SC-003).
- Unsupported dialects are rejected at configuration time with a clear error message.
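The dialect dispatch can be sketched as below. Function names are illustrative; for simplicity all values are rendered as quoted literals, whereas the real `SQLGenerator` would type the literals per column.

```python
# Sketch of dialect-aware INSERT/UPSERT generation. Quoting and the
# ON CONFLICT forms follow the decision above; names are assumptions.
def quote_ident(name: str, dialect: str) -> str:
    if dialect in ("postgresql", "greenplum"):
        return '"' + name.replace('"', '""') + '"'
    if dialect == "clickhouse":
        return "`" + name.replace("`", "``") + "`"
    raise ValueError(f"unsupported dialect: {dialect}")


def encode_value(value) -> str:
    # Strings escaped, NULLs rendered as NULL (per the rationale above).
    if value is None:
        return "NULL"
    return "'" + str(value).replace("'", "''") + "'"


def build_insert(table, columns, row, dialect, strategy="insert", key=None):
    cols = ", ".join(quote_ident(c, dialect) for c in columns)
    vals = ", ".join(encode_value(v) for v in row)
    sql = f"INSERT INTO {quote_ident(table, dialect)} ({cols}) VALUES ({vals})"
    # UPSERT clauses only exist on the PostgreSQL-family dialects;
    # ClickHouse jobs fall through with a plain INSERT.
    if dialect in ("postgresql", "greenplum") and strategy != "insert":
        conflict = quote_ident(key, dialect)
        if strategy == "skip_existing":
            sql += f" ON CONFLICT ({conflict}) DO NOTHING"
        elif strategy == "overwrite":
            updates = ", ".join(
                f"{quote_ident(c, dialect)} = EXCLUDED.{quote_ident(c, dialect)}"
                for c in columns if c != key
            )
            sql += f" ON CONFLICT ({conflict}) DO UPDATE SET {updates}"
    return sql


print(build_insert("t", ["id", "name"], [1, "O'Brien"], "postgresql",
                   strategy="skip_existing", key="id"))
# INSERT INTO "t" ("id", "name") VALUES ('1', 'O''Brien') ON CONFLICT ("id") DO NOTHING
```

The `ValueError` raised for an unknown dialect corresponds to the configuration-time rejection noted in the impact list.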
---
## R4: Schedule Execution — APScheduler Integration
### Decision
Extend the existing `SchedulerService` (`backend/src/core/scheduler.py`) with a new job type `translate_scheduled_run`. Each translation job's schedule configuration is stored in the `TranslationSchedule` model and loaded into APScheduler on service start and on schedule create/update.
### Rationale
- Existing `SchedulerService` already manages `BackgroundScheduler`, cron triggers, start/stop lifecycle, and task manager integration.
- Translation schedules are distinct from backup schedules — stored in `translate` models, loaded via a registration callback pattern.
- Schedule trigger: APScheduler fires → `run_scheduled_translation(job_id)` → creates `TranslationRun` → orchestrator processes new-key-only rows → generates INSERT statements.
- Concurrency policy (skip/queue) enforced in the trigger handler before orchestrator invocation.
### Alternatives Considered
- **Separate scheduler instance**: Rejected — creates resource contention (two APScheduler instances) and complicates Docker deployment.
- **Celery/Redis-based scheduling**: Rejected — adds infrastructure dependency; APScheduler is already proven in this codebase.
- **Cron-based external scheduling**: Rejected — requires OS-level cron configuration, loses programmatic control over pause/resume and concurrency policies.
### Impact
- `backend/src/plugins/translate/scheduler.py` registers translation job schedules with `SchedulerService`.
- New trigger function `_execute_scheduled_translation(job_id: str)` imported by `SchedulerService`.
- Existing `SchedulerService.load_schedules()` extended to discover and register translation schedules alongside backup schedules.
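The skip/queue concurrency policy in the trigger handler can be sketched as plain decision logic. The policy names and run-state shapes here are assumptions standing in for the real `TranslationSchedule` fields.

```python
# Sketch of the concurrency-policy check performed in the trigger handler
# before the orchestrator is invoked. Names are illustrative.
def on_schedule_fire(job_id: str, policy: str, active_runs: set, queue: list) -> str:
    """Decide what a schedule trigger does when a run is already active.

    policy "skip"  -> drop this trigger (emit a schedule_skipped event)
    policy "queue" -> defer until the active run finishes
    otherwise the orchestrator starts a new TranslationRun immediately.
    """
    if job_id in active_runs:
        if policy == "skip":
            return "skipped"
        if policy == "queue":
            queue.append(job_id)   # drained when the active run completes
            return "queued"
    active_runs.add(job_id)        # creates the TranslationRun, then
    return "started"               # hands off to the orchestrator


active, pending = set(), []
print(on_schedule_fire("job-1", "queue", active, pending))  # started
print(on_schedule_fire("job-1", "queue", active, pending))  # queued
print(on_schedule_fire("job-1", "skip", active, pending))   # skipped
```

Keeping this check outside APScheduler itself means the same logic also guards manual run triggers from the API.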
---
## R5: Observability — Structured Event Log + MetricSnapshot
### Decision
Implement a dedicated `TranslationEvent` ORM model with a type-specific JSON payload for structured event logging (FR-046). Events are written synchronously within the orchestrator flow. Per-job cumulative metrics (FR-047) combine live `TranslationEvent` rows (data newer than 90 days) with `MetricSnapshot` rows (data older than 90 days). At pruning time, a `MetricSnapshot` is persisted that captures cumulative tokens, cost, and run counts before events are deleted.
### Rationale
- Structured events provide queryability for audit, trend analysis, and the admin dashboard without coupling to log parsing infrastructure.
- A JSON payload lets each event type carry its own fields without per-type schema migrations.
- MetricSnapshot persistence before pruning ensures cumulative metrics survive the 90-day retention window (SC-014).
- The metrics dashboard reads: `latest MetricSnapshot + events WHERE timestamp > snapshot.covers_events_before`.
- Synchronous event writes within the run transaction ensure no event loss during crashes.
### Alternatives Considered
- **Application log (stdout) only**: Rejected — not queryable for dashboards or audit.
- **Separate metrics table with counters (dual-write)**: Rejected — dual-write consistency risk; event-sourced + snapshot is simpler.
- **Events-only (no snapshots)**: Rejected — cumulative metrics would be lost after 90-day pruning (FR-049).
### Impact
- `TranslationEvent` model with nullable `run_id` for pre-run events.
- `MetricSnapshot` model with `covers_events_before` timestamp for correct cutoff.
- `MetricsService` queries aggregation from events + snapshots.
- APScheduler daily job: persist snapshot → prune expired events/records.
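The snapshot-plus-events aggregation can be sketched as follows. Field names mirror the models above but are assumptions; the point is that `covers_events_before` gives an exact cutoff, so no event is counted twice.

```python
# Sketch of cumulative-metric aggregation (FR-047): the persisted snapshot
# covers pruned history, live events cover everything after its cutoff.
from datetime import datetime


def cumulative_tokens(snapshot: dict, events: list[dict]) -> int:
    """snapshot: {"total_tokens": int, "covers_events_before": datetime}
    events:   [{"timestamp": datetime, "tokens": int}, ...]"""
    cutoff = snapshot["covers_events_before"]
    # Only events strictly after the cutoff are live; anything earlier
    # is already folded into the snapshot total.
    live = sum(e["tokens"] for e in events if e["timestamp"] > cutoff)
    return snapshot["total_tokens"] + live


snap = {"total_tokens": 120_000, "covers_events_before": datetime(2026, 2, 1)}
events = [
    {"timestamp": datetime(2026, 1, 15), "tokens": 500},  # pre-cutoff: in snapshot
    {"timestamp": datetime(2026, 3, 1), "tokens": 800},
    {"timestamp": datetime(2026, 4, 2), "tokens": 700},
]
print(cumulative_tokens(snap, events))  # 121500
```

The same cutoff pattern extends to cost and run counts; the daily pruning job must write the snapshot before deleting events or the invariant breaks.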
---
## R6: Frontend Architecture — Svelte 5 Runes Pattern
### Decision
Use Svelte 5 runes (`$state`, `$derived`, `$effect`) for all reactive state management in translation components. Store layer uses a dedicated `translate.js` Svelte store module with `$state` runes for job list, current job config, preview state, and run progress. API calls use the existing `requestApi`/`fetchApi` wrapper pattern.
### Rationale
- Svelte 5 runes are the canonical reactivity model for this codebase (Svelte 5.43+ in package.json).
- Dedicated store per feature domain follows existing patterns (`frontend/src/lib/stores/` houses auth, settings, task stores).
- WebSocket for run progress reuses the existing `TaskManager` WebSocket infrastructure — translation runs emit progress events on the same channel.
- Components follow existing layout patterns: Tailwind CSS, `@UX_STATE`/`@UX_FEEDBACK`/`@UX_RECOVERY`/`@UX_REACTIVITY` contract tags.
### Alternatives Considered
- **Svelte 4 stores (writable/derived)**: Rejected — codebase has already migrated to Svelte 5 runes; mixing patterns creates inconsistency.
- **Separate WebSocket channel**: Rejected — existing Task Drawer WebSocket infrastructure handles progress events generically; translation runs fit the same pattern.
### Impact
- New SvelteKit route: `frontend/src/routes/translate/` with sub-routes for job config, dictionaries, history.
- New component library: `frontend/src/lib/components/translate/` with 8 components.
- New store: `frontend/src/lib/stores/translate.js` with `$state` runes.
- New API client: `frontend/src/lib/api/translate.js`.
---
## R7: RBAC Permission Model Integration
### Decision
Define 13 permission strings per the Access Control Matrix in spec.md and enforce them via the existing `PermissionChecker` dependency in FastAPI route handlers. No new database tables needed — the existing `permissions` and `role_permissions` tables store string-based permissions.
### Rationale
- Existing RBAC model stores permissions as strings in `role_permissions.permission` column, checked via dependency injection in route handlers.
- Granular permissions per resource type align with the existing pattern.
- Ownership constraints (owner OR admin) are enforced in route handlers alongside permission checks.
- Missing from the original design: `translate.job.view`, `translate.dictionary.view`, `translate.schedule.view`, and `translate.metrics.view` were added for read-only access scenarios.
### Alternatives Considered
- **Resource-level ownership only (no granular permissions)**: Rejected — spec explicitly requires granular permissions (FR-043).
- **Separate permission table per resource**: Rejected — over-engineered; string-based permissions are sufficient.
### Impact
- 13 permission strings registered in RBAC seed.
- Route handlers annotated with `Depends(require_permission(...))` + ownership checks.
- Admin UI displays new permission strings for role assignment.
- Default analyst role: `translate.job.view`, `translate.job.execute`, `translate.dictionary.view`, `translate.history.view`.
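The dependency-plus-ownership pattern can be sketched without FastAPI specifics. `PermissionChecker`'s real interface lives in the host codebase; the factory and user shape below are illustrative.

```python
# Sketch of the require_permission factory used with Depends(...), plus the
# owner-OR-admin check performed alongside it. Names are assumptions.
class PermissionDenied(Exception):
    pass  # mapped to HTTP 403 by the API layer


def require_permission(permission: str):
    """Returns a checker callable suitable for FastAPI's Depends(...)."""
    def check(user: dict) -> None:
        # user: {"id": str, "is_admin": bool, "permissions": set[str]}
        if user["is_admin"] or permission in user["permissions"]:
            return
        raise PermissionDenied(permission)
    return check


def check_ownership(user: dict, resource_owner_id: str) -> None:
    # Ownership constraint: owner OR admin, enforced in the route handler
    # after the string-permission check passes.
    if not (user["is_admin"] or user["id"] == resource_owner_id):
        raise PermissionDenied("ownership")


analyst = {"id": "u1", "is_admin": False,
           "permissions": {"translate.job.view", "translate.job.execute"}}
require_permission("translate.job.view")(analyst)        # passes silently
try:
    require_permission("translate.job.delete")(analyst)
except PermissionDenied as e:
    print(f"denied: {e}")  # denied: translate.job.delete
```

Because permissions are plain strings in `role_permissions`, seeding the 13 new ones requires no schema change, only new rows.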
---
## R8: Testing Strategy
### Decision
Multi-layer testing: (1) pytest unit tests for orchestrator, executor, dictionary manager, SQL generator, scheduler, event log; (2) pytest integration tests for API routes with test database; (3) vitest component tests for Svelte components using @testing-library/svelte; (4) manual verification via `quickstart.md` for end-to-end flow.
### Rationale
- Unit tests with mocked LLM responses and Superset client ensure fast feedback for business logic.
- Integration tests verify API contract, database schema, RBAC enforcement, and schedule trigger behavior.
- Component tests validate Svelte 5 rune reactivity, UX state transitions, and error recovery paths.
- Manual quickstart provides a human-verifiable happy path that catches integration issues between backend and frontend.
### Alternatives Considered
- **E2E tests with Playwright**: Deferred to future iteration — adds maintenance overhead; quickstart manual verification is sufficient for initial delivery.
### Impact
- Test files: `backend/src/plugins/translate/__tests__/`, `backend/tests/test_translate_api.py`, `frontend/src/lib/components/translate/__tests__/`.
- Fixtures: mock LLM provider responses, mock Superset client, test dictionary data, test translation job configuration.
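The mocked-LLM unit-test fixture can be sketched as below. `FakeLLMProvider` and `translate_batch` are illustrative stand-ins for the real provider interface and the orchestrator's per-batch loop.

```python
# Sketch of a pytest-style unit test with a mocked LLM provider: business
# logic is exercised without network calls. Interfaces are assumptions.
class FakeLLMProvider:
    def __init__(self, responses: dict[str, str]):
        self.responses = responses
        self.calls = 0

    def translate(self, text: str, glossary: list[dict]) -> str:
        self.calls += 1
        return self.responses[text]


def translate_batch(rows: list[str], provider, glossary: list[dict]) -> list[str]:
    # Minimal stand-in for the orchestrator's per-batch translation loop.
    return [provider.translate(r, glossary) for r in rows]


def test_batch_uses_mock_and_translates_every_row():
    provider = FakeLLMProvider({"cat": "кот", "dog": "пёс"})
    out = translate_batch(["cat", "dog"], provider, glossary=[])
    assert out == ["кот", "пёс"]
    assert provider.calls == 2  # one LLM call per row


test_batch_uses_mock_and_translates_every_row()
print("ok")
```

The same fake can record prompts to assert that glossary filtering (R2) actually injected the matched terms, which keeps the unit layer fast while the integration layer covers the real Superset and LLM clients.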