# Quickstart: LLM Table Translation Service
**Feature Branch**: `028-llm-datasource-supeset`
**Date**: 2026-05-08
## Prerequisites
- Running ss-tools instance (Docker Compose or local)
- Superset connection configured in ss-tools settings
- At least one LLM provider configured (Settings → LLM)
- Target insertable PostgreSQL physical table exists in Superset with compatible schema
- User has appropriate RBAC permissions (admin by default)
## 1. Start the Application
```bash
# Docker (recommended)
cd <path-to>/ss-tools
docker compose up --build

# Or local development
# Terminal 1 — Backend
cd backend
source .venv/bin/activate
python -m uvicorn src.app:app --reload --port 8001

# Terminal 2 — Frontend
cd frontend
npm run dev -- --port 5173
```
- Frontend: http://localhost:5173
- Backend API: http://localhost:8001
- API Docs: http://localhost:8001/docs
## 2. Create a Terminology Dictionary
### Via UI
1. Navigate to http://localhost:5173/translate/dictionaries
2. Click **[+ New Dictionary]**
3. Enter name: `Product Terms`, language: `ru`
4. Add entries inline or click **[Import CSV]**
5. Save
### Via API
```bash
curl -X POST http://localhost:8001/api/translate/dictionaries \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "name": "Product Terms",
    "target_language": "ru",
    "entries": [
      {"source_term": "invoice", "target_term": "накладная"},
      {"source_term": "widget", "target_term": "виджет"},
      {"source_term": "backorder", "target_term": "предзаказ"}
    ]
  }'
```
**Expected**: 201 Created with dictionary ID and entry count = 3.
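The CSV import from the UI flow can also be scripted. A minimal sketch, assuming a two-column CSV laid out as `source_term,target_term` (the helper name is ours; the payload shape mirrors the request body above):

```python
import csv
import io
import json

def csv_to_dictionary_payload(csv_text: str, name: str, target_language: str) -> str:
    """Convert a two-column CSV (source_term,target_term) into the
    JSON body shown above for POST /api/translate/dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    entries = [
        {"source_term": row["source_term"], "target_term": row["target_term"]}
        for row in reader
    ]
    return json.dumps(
        {"name": name, "target_language": target_language, "entries": entries},
        ensure_ascii=False,
    )

sample = "source_term,target_term\ninvoice,накладная\nwidget,виджет\n"
payload = csv_to_dictionary_payload(sample, "Product Terms", "ru")
```

Pipe the result into the `curl -d @-` form, or send it with any HTTP client.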
## 3. Create a Translation Job
### Via UI
1. Navigate to http://localhost:5173/translate
2. Click **[+ New Translation Job]**
3. Select Superset datasource → columns auto-populate
4. Set:
- Translation column: `product_name`
- Context columns: `category_name`, `product_description`
- Key columns: `product_id`
- Target table: `products_i18n`
- Target column: `translated_name`
- Target language: `Russian`
- Attach dictionary: `Product Terms`
5. Click **[Save & Preview]**
### Via API
```bash
curl -X POST http://localhost:8001/api/translate/jobs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "name": "Products RU Translation",
    "datasource_id": "<datasource-uuid>",
    "source_table": "products",
    "translation_col": "product_name",
    "context_cols": ["category_name", "product_description"],
    "source_key_cols": ["product_id"],
    "target_key_cols": ["product_id"],
    "target_table": "products_i18n",
    "target_col": "translated_name",
    "target_language": "ru",
    "batch_size": 50,
    "dictionary_ids": ["<dictionary-uuid>"]
  }'
```
**Expected**: 201 Created with job ID. Validation passes (columns exist, target table accessible).
**Error case**: 422 if translation column is empty; 400 if target table not found.
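The 422 case can be caught before the POST with a trivial pre-flight check; the 400 case (target table existence) can only be verified server-side. A sketch (field names mirror the request body above; the function name is ours):

```python
def preflight_errors(job: dict) -> list[str]:
    """Return the client-side-detectable problems the API would reject.
    Target-table existence (the 400 case) still requires the server."""
    errors = []
    if not job.get("translation_col"):
        errors.append("translation_col must not be empty (API returns 422)")
    if not job.get("source_key_cols"):
        errors.append("source_key_cols must not be empty")
    if not job.get("target_table"):
        errors.append("target_table must be set")
    return errors
```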
## 4. Preview Translations
### Via UI
1. Open the saved job → click **[Preview]**
2. System shows ~10 rows with source, context, and LLM translation
3. Approve good translations, edit or reject bad ones
4. Click **[Approve All]** or handle individually
### Via API
```bash
curl -X POST http://localhost:8001/api/translate/jobs/<job-id>/preview \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{"sample_size": 10}'
```
**Expected**: 200 with array of PreviewRow objects (source_text, context, llm_translation, status=pending).
**Error case**: 503 if LLM provider unreachable; error message includes provider name and reason.
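Scripted review of the preview response might partition rows before bulk approval. A sketch: the field names `source_text`, `llm_translation`, and `status` come from the PreviewRow description above, while the flagging heuristic is purely illustrative:

```python
def partition_preview(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split PreviewRow objects into (looks_ok, needs_review).
    Illustrative heuristic only: flag rows whose translation is empty
    or identical to the source text."""
    ok, review = [], []
    for row in rows:
        translation = row.get("llm_translation", "")
        if not translation or translation == row.get("source_text"):
            review.append(row)
        else:
            ok.append(row)
    return ok, review
```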
## 5. Execute Full Translation Run
### Via UI
1. After preview approval, click **[Start Full Run]**
2. Confirm cost estimate dialog
3. Watch live progress bar (WebSocket-driven)
4. On completion: view run summary with translation status, insert status, Superset query reference, and generated SQL (audit).
### Via API
```bash
curl -X POST http://localhost:8001/api/translate/jobs/<job-id>/runs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{"upsert_strategy": "insert"}'
```
**Expected**: 202 Accepted with run ID. WebSocket messages stream progress. Final GET returns run with `status=completed`, `translated_rows=N`, `insert_sql=<SQL>`.
**Partial failure**: `status=partial`, `failed_rows>0`. **[Retry Failed]** available.
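A client consuming the WebSocket stream only needs to turn each progress message into a display line. A sketch, assuming messages carry `translated_rows`, `total_rows`, and `status` fields (the exact message schema is not specified in this quickstart):

```python
def render_progress(msg: dict) -> str:
    """Format one assumed progress message as a CLI progress line."""
    done = msg.get("translated_rows", 0)
    total = msg.get("total_rows", 0)
    pct = (100 * done // total) if total else 0
    return f"[{msg.get('status', 'running')}] {done}/{total} rows ({pct}%)"
```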
## 6. Execute INSERT through Superset SQL Lab API
### Via UI
1. After translation completes, the system automatically submits SQL to Superset
2. Progress indicator shows: «📤 Submitting to Superset...»
3. On success: «✅ Insert succeeded · 1,241 rows affected · Query #a7f3b2c»
4. Click **[View SQL]** to audit the generated statement
### Via API
```bash
# Trigger full run (backend handles Superset submission automatically)
curl -X POST http://localhost:8001/api/translate/jobs/<job-id>/runs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{"upsert_strategy": "insert"}'

# Check run status (includes insert_status and superset_query_id)
curl http://localhost:8001/api/translate/runs/<run-id> \
  -H "Authorization: Bearer <token>"
```
**Expected**: Run response includes `insert_status: "succeeded"`, `superset_query_id`, `rows_affected`.
**Insert failure**: `insert_status: "failed"`, `insert_error_message` populated. **[Retry Insert]** re-submits without re-translating.
### Verify in Target Table
```sql
-- Run directly in Superset SQL Lab to verify
SELECT * FROM products_i18n WHERE translated_name IS NOT NULL;
```
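The audited `insert_sql` is a single multi-row `INSERT ... VALUES` statement. A simplified sketch of how such a statement can be assembled; the real service presumably relies on driver-side quoting, and the naive escaping here is for illustration only:

```python
def build_insert_sql(table: str, columns: list[str], rows: list[tuple]) -> str:
    """Build a multi-row INSERT ... VALUES statement, one tuple per row.
    Illustrative only: single quotes are escaped by doubling, which is
    no substitute for proper parameter binding in production code."""
    def lit(value) -> str:
        if value is None:
            return "NULL"
        return "'" + str(value).replace("'", "''") + "'"

    values = ",\n".join(
        "(" + ", ".join(lit(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} ({', '.join(columns)})\nVALUES\n{values};"

sql = build_insert_sql(
    "products_i18n",
    ["product_id", "translated_name"],
    [("1", "накладная"), ("2", "виджет")],
)
```

A run over 50 rows would produce 50 such VALUES tuples, matching the smoke-test expectation later in this document.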
## 7. Feedback Loop — Correct a Translation
### Via UI
1. Open run results → find a mistranslated word
2. Highlight the word → **[Correct this term]** popup
3. Enter correction → select dictionary → submit
4. Re-run preview to verify correction is used
### Via API
```bash
curl -X POST http://localhost:8001/api/translate/corrections \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "record_id": "<record-uuid>",
    "source_term": "Monitor Stand",
    "incorrect_target_term": "Мониторная стойка",
    "corrected_target_term": "Подставка для монитора",
    "dictionary_id": "<dictionary-uuid>"
  }'
```
**Expected**: 201. Term pair added to dictionary. Conflict dialog if term already exists.
## 8. Configure Schedule
### Via UI
1. Open job → **Schedule** tab
2. Set type: Cron → `0 6 * * 1` (every Monday 06:00)
3. Toggle auto-INSERT: ON
4. Verify next 3 execution times
5. Enable schedule
### Via API
```bash
curl -X PUT http://localhost:8001/api/translate/jobs/<job-id>/schedule \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{
    "schedule_type": "cron",
    "cron_expression": "0 6 * * 1",
    "timezone": "Europe/Moscow",
    "concurrency": "skip"
  }'
```
**Expected**: 200 with schedule config including `next_run_at`.
**Verify**: Check APScheduler jobs (backend log) or wait for next trigger and check run history.
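The "next 3 execution times" for `0 6 * * 1` can be cross-checked by hand with the standard library. A sketch that handles this one expression only (Mondays at 06:00), not general cron parsing:

```python
from datetime import datetime, timedelta

def next_monday_six(start: datetime, count: int = 3) -> list[datetime]:
    """Next `count` fire times for cron `0 6 * * 1` (Mondays at 06:00),
    strictly after `start`. In Python, Monday is weekday() == 0."""
    runs = []
    candidate = start.replace(hour=6, minute=0, second=0, microsecond=0)
    while len(runs) < count:
        if candidate.weekday() == 0 and candidate > start:
            runs.append(candidate)
        candidate += timedelta(days=1)
    return runs
```

Compare the output against the "next 3 execution times" shown in the Schedule tab.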
## 9. View History and Metrics
### Via UI
1. Navigate to http://localhost:5173/translate/history
2. Filter by datasource, target table, or date range
3. Click a run for details: config snapshot, prompt, translations, INSERT SQL
### Via API
```bash
# List runs (quote the URL so the shell does not interpret ? and <>)
curl "http://localhost:8001/api/translate/runs?job_id=<job-id>" \
  -H "Authorization: Bearer <token>"

# Get metrics
curl http://localhost:8001/api/translate/jobs/<job-id>/metrics \
  -H "Authorization: Bearer <token>"
```
**Expected**: Run list with status and row counts. Metrics with cumulative tokens and cost.
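Cumulative cost in the metrics response is a sum over runs. A sketch of the arithmetic, assuming per-run `prompt_tokens` and `completion_tokens` fields and per-million-token prices (the price figures in the test are placeholders, not the service's actual rates):

```python
def cumulative_cost(runs: list[dict], price_in_per_m: float, price_out_per_m: float) -> float:
    """Total cost across runs, priced per million input/output tokens."""
    total = 0.0
    for run in runs:
        total += run.get("prompt_tokens", 0) / 1e6 * price_in_per_m
        total += run.get("completion_tokens", 0) / 1e6 * price_out_per_m
    return round(total, 4)
```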
## 10. Verification Checklist
### Backend Tests
```bash
cd backend
source .venv/bin/activate
# Unit tests for translation plugin
pytest src/plugins/translate/__tests__/ -v
# Integration tests for translate API
pytest tests/test_translate_api.py -v
# All backend tests
pytest -v
```
### Frontend Tests
```bash
cd frontend
npm run test -- --run
```
### Linting
```bash
# Python
cd backend && ruff check src/plugins/translate/ src/api/routes/translate.py src/models/translate.py src/schemas/translate.py
# Svelte
cd frontend && npm run build # build includes type checking
```
### Manual Smoke Test
1. Create dictionary with 3 terms → verify in list
2. Import CSV with 50 terms → verify no duplicates (check conflict dialog)
3. Create job → verify column list populates from datasource
4. Preview with empty dictionary → verify LLM still translates
5. Preview with attached dictionary → verify glossary terms used (check `invoice` → `накладная`)
6. Full run with 50 rows → verify INSERT SQL has 50 VALUES tuples
7. Scheduled run (set to every 5 min for test) → verify run appears in history
8. Feedback loop: correct 1 term → re-preview → verify correction reflected
9. Delete dictionary attached to active job → verify blocked
10. Check metrics dashboard → verify run counts and token totals