semantic cleanup

2026-05-08 10:07:05 +03:00
parent 505864438e
commit d8df1fff59
90 changed files with 148541 additions and 2251 deletions


@@ -0,0 +1,374 @@
# [DEF:Report:Vectorization:Root:Module]
# @COMPLEXITY 5
# @PURPOSE Explain the current vectorization technology used by the Rust semantic index, step by step, in a contract-oriented format suitable for downstream LLM analysis.
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:EmbedText]
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:Normalize]
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonSerialize]
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonDeserialize]
# @RELATION DEPENDS_ON -> [Axiom:DB:Store:UpsertEmbedding]
# @RELATION DEPENDS_ON -> [Axiom:Services:Contract:Rebuild:SemanticIndex]
# @RATIONALE The report is structured as semantic contracts so another LLM can reason about the implementation without reverse-engineering code first.
# @REJECTED Free-form prose without @PRE/@POST was rejected because it weakens machine analysis and obscures invariants.
# Vectorization Technology Report
## 1. Executive Summary
The current system uses a **deterministic local fallback embedding pipeline**.
It is **not model-based** and **does not call any external embedding provider**. Instead, it computes a **128-dimensional vector** from raw text using **character-frequency hashing**, then **L2-normalizes** the vector and stores it in DuckDB as a **JSON array string** in the `embeddings` table.
This design is optimized for:
- deterministic rebuilds
- offline operation
- zero external dependencies at inference time
- reproducible semantic indexing across agent sessions
It is intentionally simpler than transformer-based embedding pipelines.
---
## 2. Primary Production Contracts
### [DEF:Report:Vectorization:ContractMap:Block]
### @COMPLEXITY 4
### @PURPOSE Map the production contracts that implement the vectorization pipeline.
### @PRE Reader needs direct traceability from report steps to repository anchors.
### @POST Each critical stage is linked to a concrete production contract.
### @SIDE_EFFECT None.
| Stage | Contract ID | Responsibility |
|---|---|---|
| Vector generation | `Axiom:Embedding:VSS:EmbedText` | Build a 128-dim vector from text via character hashing |
| Normalization | `Axiom:Embedding:VSS:Normalize` | L2-normalize the vector |
| Similarity | `Axiom:Embedding:VSS:CosineSimilarity` | Compute cosine similarity between normalized vectors |
| Serialization | `Axiom:Embedding:VSS:JsonSerialize` | Encode vector as JSON string |
| Deserialization | `Axiom:Embedding:VSS:JsonDeserialize` | Decode JSON string back to `[f64; 128]` |
| Persistence | `Axiom:DB:Store:UpsertEmbedding` | Store embedding row in DuckDB |
| Retrieval | `Axiom:DB:Store:GetEmbedding` | Load embedding row from DuckDB |
| Rebuild orchestration | `Axiom:Services:Contract:Rebuild:SemanticIndex` | Trigger workspace reindex and optionally persist to DuckDB |
---
## 3. Step-by-Step Technology Flow
### [DEF:Report:Vectorization:Step1:Block]
### @COMPLEXITY 5
### @PURPOSE Define the text source that becomes embedding input.
### @PRE A semantic contract has already been parsed from workspace source and its `body` is available.
### @POST The system has a deterministic text payload suitable for embedding generation.
### @SIDE_EFFECT None directly; this step only defines input selection.
### @DATA_CONTRACT `ContractNode.body -> embed_text(text)`
### @INVARIANT The embedding source text is the contract body persisted by the indexer, not an external summary.
**Implementation reality**
- During rebuild, the system iterates over indexed contracts.
- For each contract, it calls `embed_text(&contract.body)`.
- Therefore the vector represents the lexical content of the full `[DEF]...[/DEF]` body, including header metadata and body text.
**Important consequence**
- Similarity is influenced by both semantic tags (`@PURPOSE`, `@RELATION`, etc.) and implementation text.
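A compact Rust sketch of this input selection; `ContractNode` is reduced here to the single field the report references, and `embedding_input` is a hypothetical helper name (production passes the body straight into `embed_text`, sketched under Steps 2 and 4):
```rust
// Stand-in type: the production ContractNode carries more fields.
struct ContractNode {
    body: String, // full `[DEF]...[/DEF]` body, header tags included
}

fn embedding_input(contract: &ContractNode) -> &str {
    // The persisted lexical body is the embedding payload;
    // no summary or stripped-down view is substituted.
    &contract.body
}
```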
---
### [DEF:Report:Vectorization:Step2:Block]
### @COMPLEXITY 5
### @PURPOSE Describe the deterministic vector construction algorithm.
### @PRE Input text is available as UTF-8 Rust `&str`.
### @POST A dense 128-dimensional floating-point vector is produced before normalization.
### @SIDE_EFFECT None.
### @DATA_CONTRACT `&str -> [f64; 128]`
### @INVARIANT No network, no stochastic model weights, and no external provider are involved.
### @RATIONALE Deterministic hashing is fast, portable, and reproducible.
### @REJECTED Transformer-based embeddings were rejected due to runtime cost and external dependency coupling.
**Algorithm**
1. Initialize `vector = [0.0; 128]`.
2. Iterate through `text.chars().take(2048)`.
3. For each character `ch`, compute `idx = (ch as usize) % 128`.
4. Increment `vector[idx] += 1.0`.
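A minimal Rust sketch of this counting stage (normalization follows in Step 4). The name `raw_char_buckets` is a stand-in, since the report describes the combined pipeline only as `embed_text`:
```rust
// Stand-in name for the pre-normalization stage of embed_text.
fn raw_char_buckets(text: &str) -> [f64; 128] {
    let mut vector = [0.0_f64; 128];
    // Bounded scan: at most the first 2048 characters (see Step 3).
    for ch in text.chars().take(2048) {
        // Every Unicode scalar folds into one of 128 buckets.
        let idx = (ch as usize) % 128;
        vector[idx] += 1.0;
    }
    vector
}
```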
**Interpretation**
- This is a **character-bucket frequency sketch**.
- It is closer to a hashed lexical fingerprint than a learned semantic embedding.
**Strengths**
- deterministic
- cheap to compute
- stable across platforms
- robust enough for coarse lexical similarity
**Weaknesses**
- collisions are guaranteed because all characters map into 128 buckets
- no contextual semantics beyond lexical distribution
- weak synonym/generalization behavior compared with learned embeddings
---
### [DEF:Report:Vectorization:Step3:Block]
### @COMPLEXITY 4
### @PURPOSE Explain input bounding and its effect on reproducibility.
### @PRE Raw contract body may be arbitrarily long.
### @POST Embedding computation uses at most the first 2048 characters.
### @SIDE_EFFECT Truncates effective semantic coverage for long contracts.
### @INVARIANT Runtime cost remains bounded and reproducible for every rebuild.
**Mechanism**
- The generator uses `text.chars().take(2048)`.
**Why it exists**
- keeps rebuild cost bounded
- prevents very large contracts from dominating runtime
- ensures deterministic maximum work per contract
**Trade-off**
- content after the first 2048 characters does not affect the vector
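Reusing the `raw_char_buckets` sketch from Step 2, the truncation property can be checked directly; the inputs here are illustrative:
```rust
fn main() {
    let short = "a".repeat(2048);
    let long = format!("{}{}", short, "z".repeat(4096));
    // Characters past position 2048 never enter the counting loop,
    // so both inputs produce identical vectors.
    assert_eq!(raw_char_buckets(&short), raw_char_buckets(&long));
}
```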
---
### [DEF:Report:Vectorization:Step4:Block]
### @COMPLEXITY 5
### @PURPOSE Define the normalization stage that converts raw counts into a unit vector.
### @PRE Raw 128-dim vector has non-negative frequency counts.
### @POST Output vector has unit Euclidean norm unless the raw vector is all zeros.
### @SIDE_EFFECT Mutates the vector in place.
### @DATA_CONTRACT `[f64; 128] -> normalized [f64; 128]`
### @INVARIANT Similarity scoring assumes normalized vectors.
**Algorithm**
1. Compute `sum_sq = Σ(x_i^2)`.
2. Compute `norm = sqrt(sum_sq)`.
3. If `norm > 0.0`, divide each component by `norm`.
**Why normalization matters**
- removes bias from absolute text length
- enables cosine similarity as a direct dot product
**Operational note**
- for any non-empty contract body, at least one bucket receives a positive count, so the norm is positive and normalization always succeeds
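A Rust sketch of the in-place normalization described above; the name `normalize` is a stand-in for the code behind `Axiom:Embedding:VSS:Normalize`:
```rust
fn normalize(vector: &mut [f64; 128]) {
    // Euclidean norm of the raw frequency counts.
    let sum_sq: f64 = vector.iter().map(|x| x * x).sum();
    let norm = sum_sq.sqrt();
    // Guard: an all-zero vector (empty input) is left untouched.
    if norm > 0.0 {
        for x in vector.iter_mut() {
            *x /= norm;
        }
    }
}
```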
---
### [DEF:Report:Vectorization:Step5:Block]
### @COMPLEXITY 4
### @PURPOSE Explain persistence encoding for DuckDB storage.
### @PRE A normalized `[f64; 128]` vector exists in memory.
### @POST The vector is serialized into a compact JSON array string.
### @SIDE_EFFECT None.
### @DATA_CONTRACT `[f64; 128] -> String(vector_json)`
### @INVARIANT Stored vectors must remain length-128 after round-trip decoding.
**Mechanism**
- `vector_to_json` uses `serde_json::to_string(&vector.to_vec())`.
- Result is stored in DuckDB column `embeddings.vector_json TEXT`.
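A sketch of the encode helper; the `serde_json::to_string(&vector.to_vec())` expression comes from the report, while the exact signature is an assumption:
```rust
fn vector_to_json(vector: &[f64; 128]) -> serde_json::Result<String> {
    // Serializing via a Vec yields a plain JSON array of 128 numbers,
    // e.g. "[0.08,0.0,...]", easy to inspect in DuckDB directly.
    serde_json::to_string(&vector.to_vec())
}
```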
**Why JSON was chosen**
- simple and portable
- easy to inspect manually
- no custom binary format needed
**Cost**
- larger on disk than binary
- slower than native vector column types
---
### [DEF:Report:Vectorization:Step6:Block]
### @COMPLEXITY 5
### @PURPOSE Describe how vectors are written to DuckDB during rebuild.
### @PRE Rebuild runs with `use_duckdb=true`; schema bootstrap has succeeded; contracts are available in memory.
### @POST Each indexed contract receives an embedding row in `embeddings` when `refresh_embeddings=true`.
### @SIDE_EFFECT Inserts or replaces rows in DuckDB.
### @DATA_CONTRACT `ContractNode -> embeddings(contract_id, provider_id, vector_json, source_text)`
### @INVARIANT Embedding row identity is keyed by `contract_id`.
**Implementation path**
1. `rebuild_semantic_index(...)` reindexes the workspace.
2. If `use_duckdb=true`, it opens `graph.duckdb`.
3. `DuckDbIndexStore::populate_from_index(...)` clears/repopulates tables.
4. If `refresh_embeddings=true`, each contract body is embedded.
5. `upsert_embedding(...)` stores:
- `contract_id`
- `provider_id` (currently `local-fallback`)
- `vector_json`
- `source_text`
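A hedged sketch of the upsert, assuming the `duckdb` crate's rusqlite-style API; the actual `DuckDbIndexStore` implementation may bind parameters differently:
```rust
use duckdb::{params, Connection, Result};

fn upsert_embedding(
    conn: &Connection,
    contract_id: &str,
    provider_id: &str, // currently "local-fallback"
    vector_json: &str,
    source_text: &str,
) -> Result<()> {
    // Row identity is keyed by contract_id, so re-embedding replaces
    // the previous vector instead of accumulating duplicates.
    conn.execute(
        "INSERT OR REPLACE INTO embeddings \
         (contract_id, provider_id, vector_json, source_text) \
         VALUES (?, ?, ?, ?)",
        params![contract_id, provider_id, vector_json, source_text],
    )?;
    Ok(())
}
```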
**Current provider identity**
- storage path marks the provider as `local-fallback`
- rebuild response payload separately reports `embedding_provider_id = lexical-graph`
**Interpretation for downstream analysis**
- both labels refer to the same local deterministic embedding strategy, but naming is currently inconsistent across layers
---
### [DEF:Report:Vectorization:Step7:Block]
### @COMPLEXITY 4
### @PURPOSE Explain how stored vectors are loaded back from DuckDB.
### @PRE A row exists in `embeddings` for the target `contract_id`.
### @POST The vector round-trips back into Rust as `[f64; 128]`.
### @SIDE_EFFECT Reads DuckDB state.
### @DATA_CONTRACT `contract_id -> Option<[f64; 128]>`
### @INVARIANT Invalid JSON or non-128 vectors are treated as errors, not silently accepted.
**Mechanism**
- `get_embedding(contract_id)` loads `vector_json`
- `vector_from_json(json_str)` parses `Vec<f64>`
- parser enforces exact length `128`
**Safety property**
- malformed stored vectors fail loudly instead of contaminating similarity logic
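A sketch of the decode path with the length guard; the `String` error type is a simplification of whatever the production contract uses:
```rust
fn vector_from_json(json_str: &str) -> Result<[f64; 128], String> {
    let values: Vec<f64> =
        serde_json::from_str(json_str).map_err(|e| e.to_string())?;
    // Enforce exact dimensionality: anything other than 128 values
    // is rejected rather than silently padded or truncated.
    values
        .try_into()
        .map_err(|v: Vec<f64>| format!("expected 128 dims, got {}", v.len()))
}
```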
---
### [DEF:Report:Vectorization:Step8:Block]
### @COMPLEXITY 4
### @PURPOSE Define the similarity metric expected by the vector system.
### @PRE Both vectors are already L2-normalized and lengths are equal.
### @POST Cosine similarity is computed as a dot product in `[-1, 1]`.
### @SIDE_EFFECT None.
### @DATA_CONTRACT `[f64; 128] x [f64; 128] -> f64`
### @INVARIANT The similarity function assumes normalized inputs and does not renormalize them itself.
**Mechanism**
- `cosine_similarity(left, right) = Σ(left_i * right_i)`
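In Rust this is a one-line fold; because inputs are already unit-length, the dot product is the cosine similarity and no renormalization happens here:
```rust
fn cosine_similarity(left: &[f64; 128], right: &[f64; 128]) -> f64 {
    // Dot product only; callers must uphold the normalized-input invariant.
    left.iter().zip(right.iter()).map(|(l, r)| l * r).sum()
}
```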
**Important note**
- the primitive exists and is correct for the current representation
- but the similarity-search surface over DuckDB embeddings is still minimal; no ANN or vector-index layer exists yet
---
## 4. Storage Schema Relevant to Vectorization
### [DEF:Report:Vectorization:Schema:Block]
### @COMPLEXITY 4
### @PURPOSE Describe the DuckDB schema fields directly involved in vectorization.
### @PRE Reader needs storage-level understanding for independent analysis.
### @POST The embedding persistence surface is explicitly documented.
### @SIDE_EFFECT None.
Relevant table:
```sql
CREATE TABLE IF NOT EXISTS embeddings (
    contract_id TEXT PRIMARY KEY,
    provider_id TEXT,
    vector_json TEXT NOT NULL,
    source_text TEXT
);
```
Field meaning:
- `contract_id`: stable logical owner of the vector
- `provider_id`: identifier of embedding strategy
- `vector_json`: serialized 128-dim normalized vector
- `source_text`: source body used to create the vector
---
## 5. Rebuild-Time Operational Sequence
### [DEF:Report:Vectorization:RebuildSequence:Block]
### @COMPLEXITY 5
### @PURPOSE Provide the operational sequence for vector generation during semantic index rebuild.
### @PRE Workspace root and policy are valid; operational directories are creatable.
### @POST JSON snapshot is updated and DuckDB embeddings are refreshed when enabled.
### @SIDE_EFFECT Reads source files, writes JSON snapshot, writes DuckDB tables.
### @INVARIANT JSON snapshot and DuckDB content are derived from the same in-memory index rebuild.
Sequence:
1. Reindex workspace contracts and edges in memory.
2. Persist JSON semantic snapshot.
3. If `use_duckdb=true`, open/create `graph.duckdb`.
4. Bootstrap schema if absent.
5. Clear/repopulate files, contracts, edges.
6. If `refresh_embeddings=true`, compute a 128-dim vector for each contract body.
7. Serialize vectors to JSON.
8. Upsert rows into `embeddings`.
9. Save `IndexStatus` with embedding counts.
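Composing the earlier stage sketches, steps 6 through 8 of this sequence reduce to a loop like the one below; the `ContractNode` fields (extending the Step 1 stand-in with an `id`) and the error handling are assumptions:
```rust
struct ContractNode {
    id: String,
    body: String,
}

fn refresh_embeddings(
    conn: &duckdb::Connection,
    contracts: &[ContractNode],
) -> Result<(), Box<dyn std::error::Error>> {
    for contract in contracts {
        // Steps 6-7: bucket-count the body, normalize, serialize to JSON.
        let mut vector = raw_char_buckets(&contract.body);
        normalize(&mut vector);
        let vector_json = vector_to_json(&vector)?;
        // Step 8: persist, keyed by contract_id.
        upsert_embedding(
            conn,
            &contract.id,
            "local-fallback",
            &vector_json,
            &contract.body,
        )?;
    }
    Ok(())
}
```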
---
## 6. Design Characteristics for Another LLM to Evaluate
### [DEF:Report:Vectorization:EvaluationHints:Block]
### @COMPLEXITY 4
### @PURPOSE Highlight the key architectural properties another LLM should evaluate.
### @PRE Reader wants actionable analysis targets rather than raw implementation trivia.
### @POST The main trade-offs and audit points are explicit.
### @SIDE_EFFECT None.
### Strengths
- deterministic and reproducible
- offline-safe
- cheap rebuild cost
- no model-serving dependency
- transparent storage format
### Weaknesses
- not semantically deep like transformer embeddings
- collisions from modulo-128 hashing
- truncation at 2048 characters
- JSON storage instead of typed vector columns
- provider naming inconsistency (`local-fallback` vs `lexical-graph`)
### Questions worth analyzing
1. Should metadata and code body be embedded together or separately?
2. Should bucket count remain 128 or be increased?
3. Should similarity search be exposed as a first-class tool/API?
4. Should `provider_id` naming be normalized across rebuild response and storage?
5. Should long contracts use chunking instead of hard truncation at 2048 chars?
---
## 7. Exact Minimal Pseudocode
### [DEF:Report:Vectorization:Pseudocode:Block]
### @COMPLEXITY 3
### @PURPOSE Give another LLM a language-agnostic reproduction of the current embedding pipeline.
### @PRE Reader needs a faithful abstract form of the implementation.
### @POST The algorithm can be reimplemented without inspecting Rust syntax.
### @SIDE_EFFECT None.
```text
function embed_text(text):
    vector = [0.0] * 128
    for ch in first_2048_characters(text):
        idx = ord(ch) mod 128
        vector[idx] += 1.0
    norm = sqrt(sum(x*x for x in vector))
    if norm > 0:
        for i in range(128):
            vector[i] /= norm
    return vector

function store_embedding(contract_id, text):
    vector = embed_text(text)
    vector_json = json_encode(vector)
    upsert into embeddings(contract_id, "local-fallback", vector_json, text)
```
---
## 8. Current Truth Statement
### [DEF:Report:Vectorization:CurrentTruth:Block]
### @COMPLEXITY 4
### @PURPOSE Provide a final machine-readable summary of what is true today.
### @PRE All previous sections have been read or can be ignored for a compact summary.
### @POST Another LLM can extract the operative facts in one pass.
### @SIDE_EFFECT None.
- Vectorization technology: **deterministic character-frequency hashing**
- Embedding dimensionality: **128**
- Input cap: **first 2048 characters**
- Normalization: **L2 normalization**
- Storage encoding: **JSON array in DuckDB `embeddings.vector_json`**
- Similarity metric: **cosine similarity via dot product of normalized vectors**
- External model/provider dependency: **none**
- Primary objective: **cheap, deterministic, offline lexical-semantic approximation**
# [/DEF:Report:Vectorization:Root:Module]