
Research: LLM Table Translation Service

Feature Branch: 028-llm-datasource-supeset
Date: 2026-05-08

R1: Plugin Placement — New Plugin vs. Extending LLMAnalysisPlugin

Decision

Create a new standalone plugin, TranslationPlugin, at backend/src/plugins/translate/ rather than extending the existing LLMAnalysisPlugin.

Rationale

  • LLMAnalysisPlugin is focused on dashboard validation and documentation generation — a different domain from batch table translation.
  • Translation requires new ORM models (TranslationJob, TranslationRun, TranslationRecord, TerminologyDictionary, TranslationSchedule, TranslationEvent), new API routes (/api/translate/*), new Svelte components, and new scheduler integration — scope that warrants a dedicated plugin.
  • Existing plugin system (PluginBase) already supports multiple independent plugins and lazy discovery via plugin_loader.py.
  • Separation avoids further bloating llm_analysis/plugin.py (already 481 lines, beyond the INV_7 fractal limit of <400 lines per module).

Alternatives Considered

  • Extend LLMAnalysisPlugin: Rejected because it would conflate two distinct feature domains, increase module size beyond fractal limit, and complicate RBAC permission boundaries.
  • Create as a standalone service in backend/src/services/translate/: Rejected because the plugin lifecycle (register, unregister, configuration persistence, API exposure) is already standardized via PluginBase. A standalone service would duplicate plugin machinery.

Impact

  • New directory: backend/src/plugins/translate/ with plugin.py, orchestrator.py, preview.py, executor.py, dictionary.py, sql_generator.py, scheduler.py, events.py, metrics.py, __tests__/.
  • New route module: backend/src/api/routes/translate.py.
  • New model module: backend/src/models/translate.py.
  • New schema module: backend/src/schemas/translate.py.
  • Registered in backend/src/api/routes/__init__.py __all__ list.
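
As a rough sketch of the intended plugin shape (the PluginBase hook names and import paths here are assumptions inferred from the lifecycle described above, not the confirmed interface):

```python
# Hypothetical skeleton; PluginBase's real hooks and the import path
# are assumptions based on the lifecycle described in this document.
from backend.src.plugins.base import PluginBase  # assumed location


class TranslationPlugin(PluginBase):
    """Standalone plugin owning the /api/translate/* surface."""

    name = "translate"

    def register(self, app) -> None:
        # Mount the dedicated route module and load persisted schedules.
        from backend.src.api.routes import translate
        app.include_router(translate.router, prefix="/api/translate")

    def unregister(self, app) -> None:
        # Tear down registered schedules and release plugin resources.
        ...
```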

R2: LLM Prompt Construction Strategy

Decision

Construct prompts using a layered template approach: base system prompt → dictionary glossary (per-batch filtered) → context columns → translation column values. Leverage existing llm_prompt_templates.py for template rendering and LLMProviderService for provider selection.

Rationale

  • Existing llm_prompt_templates.py already supports render_prompt() with Jinja2-like substitution and multimodal detection.
  • Per-batch dictionary filtering (FR-044): scan batch rows for substring matches against dictionary source_term values; only include matched entries in the prompt. This keeps token usage proportional to batch content, not dictionary size.
  • Context columns are appended as structured fields (e.g., Category: {category_name}\nDescription: {product_description}) before the translation column value.
  • The system prompt explicitly instructs the LLM: "Use the provided glossary for exact matches. For partial matches, prefer glossary translations. For terms not in the glossary, translate naturally."

Alternatives Considered

  • Full dictionary injection: Rejected — would exceed the LLM context window for dictionaries with >5000 terms.
  • Semantic embedding search: Rejected — adds unnecessary complexity (vector DB dependency) when substring matching is sufficient for glossary use cases.
  • Separate LLM call for glossary matching: Rejected — doubles API cost and latency without proportional quality gain.

Impact

  • DictionaryManager must implement filter_for_batch(rows: list[str]) -> list[dict] returning matched entries (sketched below).
  • Prompt template includes {{ glossary }} and {{ context }} placeholder blocks.
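
A minimal sketch of the filter_for_batch contract above, assuming dictionary entries expose source_term and target_term fields; real matching may need normalization beyond simple lowercasing:

```python
class DictionaryManager:
    """Holds TerminologyDictionary entries as dicts with
    source_term / target_term keys (field names per the model above)."""

    def __init__(self, entries: list[dict]):
        self.entries = entries

    def filter_for_batch(self, rows: list[str]) -> list[dict]:
        # FR-044: include only entries whose source_term occurs as a
        # substring anywhere in the batch, keeping prompt size bounded
        # by batch content rather than dictionary size.
        batch_text = "\n".join(rows).lower()
        return [e for e in self.entries
                if e["source_term"].lower() in batch_text]

    def render_glossary(self, matched: list[dict]) -> str:
        # Fills the {{ glossary }} placeholder block in the prompt template.
        return "\n".join(f"{e['source_term']} -> {e['target_term']}"
                         for e in matched)
```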

R3: SQL Generation — Dialect-Aware INSERT/UPSERT for Superset API

Decision

Detect the target database dialect from the Superset datasource's connection configuration at job save time. Generate dialect-appropriate SQL with quoted identifiers and escaped literals: plain INSERT INTO ... VALUES (...) for ClickHouse; INSERT INTO ... VALUES (...) with optional INSERT ... ON CONFLICT ... handling for PostgreSQL/Greenplum. Submit the generated SQL to Superset via /api/v1/sqllab/execute/.

Rationale

  • Different databases use different UPSERT syntax: PostgreSQL has ON CONFLICT, ClickHouse has no standard UPSERT (use INSERT with deduplication or ALTER TABLE UPDATE). Greenplum is PostgreSQL-compatible.
  • Superset knows the database backend via the connection's backend/engine field — the system queries this at configuration time and caches the dialect on the TranslationJob.
  • For ClickHouse, the insert strategy generates plain INSERT; skip_existing is not natively supported (the system warns the user); overwrite uses ALTER TABLE UPDATE or INSERT with ReplacingMergeTree semantics (documented limitation).
  • For PostgreSQL/Greenplum, full UPSERT support: ON CONFLICT DO NOTHING (skip_existing) and ON CONFLICT DO UPDATE (overwrite).
  • Identifier quoting: PostgreSQL/Greenplum uses "identifier"; ClickHouse uses `identifier` or "identifier" depending on settings.
  • Values are safely encoded per dialect: strings escaped, NULLs rendered as NULL.

Alternatives Considered

  • PostgreSQL-only: Rejected — the user's Superset instances may use ClickHouse as the primary analytical database. Dialect detection from the connection is the correct source of truth.
  • Manual SQL Lab copy/paste: Rejected — Superset API execution is the canonical path.
  • UPDATE statements: Rejected — source data is append-only (new-key-only strategy). UPSERT covers the overwrite case.

Impact

  • TranslationJob.database_dialect field caches the detected dialect at save time.
  • SQLGenerator dispatches to dialect-specific formatters (sketched below).
  • Dialect-specific SQL syntax tests required for PostgreSQL and ClickHouse (SC-003).
  • Unsupported dialects are rejected at configuration time with a clear error message.
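
A minimal sketch of the dialect dispatch, with quoting and escaping per the rules above; the overwrite strategy is omitted here (in practice ON CONFLICT ... DO UPDATE also needs the conflict target columns):

```python
# Illustrative formatter; the real SQLGenerator in sql_generator.py
# also handles the overwrite path and per-dialect limitations.
def quote_ident(name: str, dialect: str) -> str:
    if dialect == "clickhouse":
        return "`" + name.replace("`", "``") + "`"
    # postgresql / greenplum: double-quote, doubling embedded quotes
    return '"' + name.replace('"', '""') + '"'


def encode_value(value) -> str:
    if value is None:
        return "NULL"
    return "'" + str(value).replace("'", "''") + "'"


def build_insert(table: str, row: dict, dialect: str, strategy: str) -> str:
    cols = ", ".join(quote_ident(c, dialect) for c in row)
    vals = ", ".join(encode_value(v) for v in row.values())
    sql = f"INSERT INTO {quote_ident(table, dialect)} ({cols}) VALUES ({vals})"
    if strategy == "skip_existing" and dialect in ("postgresql", "greenplum"):
        sql += " ON CONFLICT DO NOTHING"
    return sql
```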

R4: Schedule Execution — APScheduler Integration

Decision

Extend the existing SchedulerService (backend/src/core/scheduler.py) with a new job type translate_scheduled_run. Each translation job's schedule configuration is stored in the TranslationSchedule model and loaded into APScheduler on service start and on schedule create/update.

Rationale

  • Existing SchedulerService already manages BackgroundScheduler, cron triggers, start/stop lifecycle, and task manager integration.
  • Translation schedules are distinct from backup schedules — stored in translate models, loaded via a registration callback pattern.
  • Schedule trigger: APScheduler fires → _execute_scheduled_translation(job_id) → creates a TranslationRun → orchestrator processes new-key-only rows → generates INSERT statements.
  • Concurrency policy (skip/queue) enforced in the trigger handler before orchestrator invocation.

Alternatives Considered

  • Separate scheduler instance: Rejected — creates resource contention (two APScheduler instances) and complicates Docker deployment.
  • Celery/Redis-based scheduling: Rejected — adds infrastructure dependency; APScheduler is already proven in this codebase.
  • Cron-based external scheduling: Rejected — requires OS-level cron configuration, loses programmatic control over pause/resume and concurrency policies.

Impact

  • backend/src/plugins/translate/scheduler.py registers translation job schedules with SchedulerService.
  • New trigger function _execute_scheduled_translation(job_id: str) imported by SchedulerService (sketched below).
  • Existing SchedulerService.load_schedules() extended to discover and register translation schedules alongside backup schedules.
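
A minimal sketch of the registration path (the add_job call is standard APScheduler; the TranslationSchedule field names and service wiring are assumptions):

```python
from apscheduler.triggers.cron import CronTrigger


def register_translation_schedules(scheduler, schedules) -> None:
    # Called on service start and on schedule create/update; schedules
    # are TranslationSchedule rows (cron_expression is an assumed field).
    for sched in schedules:
        scheduler.add_job(
            _execute_scheduled_translation,
            trigger=CronTrigger.from_crontab(sched.cron_expression),
            args=[sched.job_id],
            id=f"translate_scheduled_run:{sched.job_id}",
            replace_existing=True,  # idempotent across reloads
        )


def _execute_scheduled_translation(job_id: str) -> None:
    # Enforce the skip/queue concurrency policy here, then create a
    # TranslationRun and hand off to the orchestrator.
    ...
```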

R5: Observability — Structured Event Log + MetricSnapshot

Decision

Implement a dedicated TranslationEvent ORM model with a type-specific JSON payload for structured event logging (FR-046). Events are written synchronously within the orchestrator flow. Per-job cumulative metrics (FR-047) are computed from live TranslationEvent rows (data newer than 90 days) combined with MetricSnapshot rows (data older than 90 days). At pruning time, a MetricSnapshot is persisted capturing cumulative tokens, cost, and run counts before events are deleted.

Rationale

  • Structured events provide queryability for audit, trend analysis, and the admin dashboard without coupling to log parsing infrastructure.
  • A JSON payload lets each event type carry its own fields without a schema migration per type.
  • MetricSnapshot persistence before pruning ensures cumulative metrics survive the 90-day retention window (SC-014).
  • The metrics dashboard reads: latest MetricSnapshot + events WHERE timestamp > snapshot.covers_events_before.
  • Synchronous event writes within the run transaction ensure no event loss during crashes.

Alternatives Considered

  • Application log (stdout) only: Rejected — not queryable for dashboards or audit.
  • Separate metrics table with counters (dual-write): Rejected — dual-write consistency risk; event-sourced + snapshot is simpler.
  • Events-only (no snapshots): Rejected — cumulative metrics would be lost after 90-day pruning (FR-049).

Impact

  • TranslationEvent model with nullable run_id for pre-run events.
  • MetricSnapshot model with covers_events_before timestamp for correct cutoff.
  • MetricsService aggregates metrics from live events plus the latest snapshot (sketched below).
  • APScheduler daily job: persist snapshot → prune expired events/records.
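
A minimal sketch of the snapshot-plus-live-events aggregation (SQLAlchemy 2.0 style; cumulative_tokens and tokens_used are assumed field names, while covers_events_before follows the model field named above):

```python
from sqlalchemy import func, select

# MetricSnapshot and TranslationEvent are the ORM models introduced above.


def cumulative_tokens(session, job_id: str) -> int:
    snapshot = session.execute(
        select(MetricSnapshot)
        .where(MetricSnapshot.job_id == job_id)
        .order_by(MetricSnapshot.covers_events_before.desc())
        .limit(1)
    ).scalar_one_or_none()

    stmt = select(func.coalesce(func.sum(TranslationEvent.tokens_used), 0)).where(
        TranslationEvent.job_id == job_id
    )
    if snapshot is not None:
        # Count only events after the snapshot cutoff; older events are
        # already folded into the snapshot's cumulative totals.
        stmt = stmt.where(TranslationEvent.timestamp > snapshot.covers_events_before)
    live = session.execute(stmt).scalar_one()
    return (snapshot.cumulative_tokens if snapshot else 0) + live
```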

R6: Frontend Architecture — Svelte 5 Runes Pattern

Decision

Use Svelte 5 runes ($state, $derived, $effect) for all reactive state management in translation components. Store layer uses a dedicated translate.js Svelte store module with $state runes for job list, current job config, preview state, and run progress. API calls use the existing requestApi/fetchApi wrapper pattern.

Rationale

  • Svelte 5 runes are the canonical reactivity model for this codebase (Svelte 5.43+ in package.json).
  • Dedicated store per feature domain follows existing patterns (frontend/src/lib/stores/ houses auth, settings, task stores).
  • WebSocket for run progress reuses the existing TaskManager WebSocket infrastructure — translation runs emit progress events on the same channel.
  • Components follow existing layout patterns: Tailwind CSS, @UX_STATE/@UX_FEEDBACK/@UX_RECOVERY/@UX_REACTIVITY contract tags.

Alternatives Considered

  • Svelte 4 stores (writable/derived): Rejected — codebase has already migrated to Svelte 5 runes; mixing patterns creates inconsistency.
  • Separate WebSocket channel: Rejected — existing Task Drawer WebSocket infrastructure handles progress events generically; translation runs fit the same pattern.

Impact

  • New SvelteKit route: frontend/src/routes/translate/ with sub-routes for job config, dictionaries, history.
  • New component library: frontend/src/lib/components/translate/ with 8 components.
  • New store: frontend/src/lib/stores/translate.js with $state runes.
  • New API client: frontend/src/lib/api/translate.js.

R7: RBAC Permission Model Integration

Decision

Define 13 permission strings per the Access Control Matrix in spec.md and enforce them via the existing PermissionChecker dependency in FastAPI route handlers. No new database tables needed — the existing permissions and role_permissions tables store string-based permissions.

Rationale

  • Existing RBAC model stores permissions as strings in role_permissions.permission column, checked via dependency injection in route handlers.
  • Granular permissions per resource type align with the existing pattern.
  • Ownership constraints (owner OR admin) are enforced in route handlers alongside permission checks.
  • Four view permissions missing from the original design (translate.job.view, translate.dictionary.view, translate.schedule.view, translate.metrics.view) were added for read-only access scenarios.

Alternatives Considered

  • Resource-level ownership only (no granular permissions): Rejected — spec explicitly requires granular permissions (FR-043).
  • Separate permission table per resource: Rejected — over-engineered; string-based permissions are sufficient.

Impact

  • 13 permission strings registered in RBAC seed.
  • Route handlers annotated with Depends(require_permission(...)) plus ownership checks (sketched below).
  • Admin UI displays new permission strings for role assignment.
  • Default analyst role: translate.job.view, translate.job.execute, translate.dictionary.view, translate.history.view.
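
A minimal sketch of the enforcement pattern on a destructive route (require_permission is the existing dependency named above; the translate.job.delete string, the loader, and the user fields are hypothetical):

```python
from fastapi import APIRouter, Depends, HTTPException

router = APIRouter()


@router.delete("/jobs/{job_id}")
async def delete_job(
    job_id: str,
    user=Depends(require_permission("translate.job.delete")),  # assumed string
):
    job = await get_translation_job(job_id)  # hypothetical loader
    if job is None:
        raise HTTPException(status_code=404, detail="Translation job not found")
    # Ownership constraint: owner OR admin, enforced alongside the permission.
    if job.owner_id != user.id and not user.is_admin:
        raise HTTPException(status_code=403, detail="Owner or admin required")
    await delete_translation_job(job)  # hypothetical delete helper
    return {"status": "deleted"}
```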

R8: Testing Strategy

Decision

Multi-layer testing: (1) pytest unit tests for orchestrator, executor, dictionary manager, SQL generator, scheduler, event log; (2) pytest integration tests for API routes with test database; (3) vitest component tests for Svelte components using @testing-library/svelte; (4) manual verification via quickstart.md for end-to-end flow.

Rationale

  • Unit tests with mocked LLM responses and Superset client ensure fast feedback for business logic.
  • Integration tests verify API contract, database schema, RBAC enforcement, and schedule trigger behavior.
  • Component tests validate Svelte 5 rune reactivity, UX state transitions, and error recovery paths.
  • Manual quickstart provides a human-verifiable happy path that catches integration issues between backend and frontend.

Alternatives Considered

  • E2E tests with Playwright: Deferred to a future iteration — adds maintenance overhead; manual quickstart verification is sufficient for initial delivery.

Impact

  • Test files: backend/src/plugins/translate/__tests__/, backend/tests/test_translate_api.py, frontend/src/lib/components/translate/__tests__/.
  • Fixtures: mock LLM provider responses, mock Superset client, test dictionary data, and a test translation job configuration (sketched below).
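
A minimal sketch of the unit-test fixtures (fixture names, the complete/execute_sql method names, and the stubbed responses are illustrative):

```python
import pytest
from unittest.mock import AsyncMock


@pytest.fixture
def mock_llm_provider():
    provider = AsyncMock()
    provider.complete.return_value = "Translated text"  # canned LLM reply
    return provider


@pytest.fixture
def mock_superset_client():
    client = AsyncMock()
    client.execute_sql.return_value = {"status": "success"}
    return client


def test_prompt_includes_matched_glossary(mock_llm_provider, mock_superset_client):
    # Drive the orchestrator with a small batch and assert the prompt
    # sent to the provider contains the matched glossary entries.
    ...
```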