semantic cleanup

2026-05-08 10:07:05 +03:00
parent 505864438e
commit d8df1fff59
90 changed files with 148541 additions and 2251 deletions


@@ -0,0 +1,374 @@
# [DEF:Report:Vectorization:Root:Module]
# @COMPLEXITY 5
# @PURPOSE Explain the current vectorization technology used by the Rust semantic index, step by step, in a contract-oriented format suitable for downstream LLM analysis.
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:EmbedText]
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:Normalize]
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonSerialize]
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonDeserialize]
# @RELATION DEPENDS_ON -> [Axiom:DB:Store:UpsertEmbedding]
# @RELATION DEPENDS_ON -> [Axiom:Services:Contract:Rebuild:SemanticIndex]
# @RATIONALE The report is structured as semantic contracts so another LLM can reason about the implementation without reverse-engineering code first.
# @REJECTED Free-form prose without @PRE/@POST was rejected because it weakens machine analysis and obscures invariants.
# Vectorization Technology Report
## 1. Executive Summary
The current system uses a **deterministic local fallback embedding pipeline**.
It is **not model-based** and **does not call any external embedding provider**. Instead, it computes a **128-dimensional vector** from raw text using **character-frequency hashing**, then **L2-normalizes** the vector and stores it in DuckDB as a **JSON array string** in the `embeddings` table.
This design is optimized for:
- deterministic rebuilds
- offline operation
- zero external dependencies at inference time
- reproducible semantic indexing across agent sessions
It is intentionally simpler than transformer-based embedding pipelines.
---
## 2. Primary Production Contracts
### [DEF:Report:Vectorization:ContractMap:Block]
### @COMPLEXITY 4
### @PURPOSE Map the production contracts that implement the vectorization pipeline.
### @PRE Reader needs direct traceability from report steps to repository anchors.
### @POST Each critical stage is linked to a concrete production contract.
### @SIDE_EFFECT None.
| Stage | Contract ID | Responsibility |
|---|---|---|
| Vector generation | `Axiom:Embedding:VSS:EmbedText` | Build a 128-dim vector from text via character hashing |
| Normalization | `Axiom:Embedding:VSS:Normalize` | L2-normalize the vector |
| Similarity | `Axiom:Embedding:VSS:CosineSimilarity` | Compute cosine similarity between normalized vectors |
| Serialization | `Axiom:Embedding:VSS:JsonSerialize` | Encode vector as JSON string |
| Deserialization | `Axiom:Embedding:VSS:JsonDeserialize` | Decode JSON string back to `[f64; 128]` |
| Persistence | `Axiom:DB:Store:UpsertEmbedding` | Store embedding row in DuckDB |
| Retrieval | `Axiom:DB:Store:GetEmbedding` | Load embedding row from DuckDB |
| Rebuild orchestration | `Axiom:Services:Contract:Rebuild:SemanticIndex` | Trigger workspace reindex and optionally persist to DuckDB |
---
## 3. Step-by-Step Technology Flow
### [DEF:Report:Vectorization:Step1:Block]
### @COMPLEXITY 5
### @PURPOSE Define the text source that becomes embedding input.
### @PRE A semantic contract has already been parsed from workspace source and its `body` is available.
### @POST The system has a deterministic text payload suitable for embedding generation.
### @SIDE_EFFECT None directly; this step only defines input selection.
### @DATA_CONTRACT `ContractNode.body -> embed_text(text)`
### @INVARIANT The embedding source text is the contract body persisted by the indexer, not an external summary.
**Implementation reality**
- During rebuild, the system iterates over indexed contracts.
- For each contract, it calls `embed_text(&contract.body)`.
- Therefore the vector represents the lexical content of the full `[DEF]...[/DEF]` body, including header metadata and body text.
**Important consequence**
- Similarity is influenced by both semantic tags (`@PURPOSE`, `@RELATION`, etc.) and implementation text.
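A compact Rust sketch of this input selection; `ContractNode` is reduced here to the single field the report references, and `embedding_input` is a hypothetical helper name (production passes the body straight into `embed_text`, sketched under Steps 2 and 4):
```rust
// Stand-in type: the production ContractNode carries more fields.
struct ContractNode {
    body: String, // full `[DEF]...[/DEF]` body, header tags included
}

fn embedding_input(contract: &ContractNode) -> &str {
    // The persisted lexical body is the embedding payload;
    // no summary or stripped-down view is substituted.
    &contract.body
}
```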
---
### [DEF:Report:Vectorization:Step2:Block]
### @COMPLEXITY 5
### @PURPOSE Describe the deterministic vector construction algorithm.
### @PRE Input text is available as UTF-8 Rust `&str`.
### @POST A dense 128-dimensional floating-point vector is produced before normalization.
### @SIDE_EFFECT None.
### @DATA_CONTRACT `&str -> [f64; 128]`
### @INVARIANT No network, no stochastic model weights, and no external provider are involved.
### @RATIONALE Deterministic hashing is fast, portable, and reproducible.
### @REJECTED Transformer-based embeddings were rejected due to runtime cost and external dependency coupling.
**Algorithm**
1. Initialize `vector = [0.0; 128]`.
2. Iterate through `text.chars().take(2048)`.
3. For each character `ch`, compute `idx = (ch as usize) % 128`.
4. Increment `vector[idx] += 1.0`.
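A minimal Rust sketch of this counting stage (normalization follows in Step 4). The name `raw_char_buckets` is a stand-in, since the report describes the combined pipeline only as `embed_text`:
```rust
// Stand-in name for the pre-normalization stage of embed_text.
fn raw_char_buckets(text: &str) -> [f64; 128] {
    let mut vector = [0.0_f64; 128];
    // Bounded scan: at most the first 2048 characters (see Step 3).
    for ch in text.chars().take(2048) {
        // Every Unicode scalar folds into one of 128 buckets.
        let idx = (ch as usize) % 128;
        vector[idx] += 1.0;
    }
    vector
}
```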
**Interpretation**
- This is a **character-bucket frequency sketch**.
- It is closer to a hashed lexical fingerprint than a learned semantic embedding.
**Strengths**
- deterministic
- cheap to compute
- stable across platforms
- robust enough for coarse lexical similarity
**Weaknesses**
- collisions are guaranteed because all characters map into 128 buckets
- no contextual semantics beyond lexical distribution
- weak synonym/generalization behavior compared with learned embeddings
---
### [DEF:Report:Vectorization:Step3:Block]
### @COMPLEXITY 4
### @PURPOSE Explain input bounding and its effect on reproducibility.
### @PRE Raw contract body may be arbitrarily long.
### @POST Embedding computation uses at most the first 2048 characters.
### @SIDE_EFFECT Truncates effective semantic coverage for long contracts.
### @INVARIANT Runtime cost remains bounded and reproducible for every rebuild.
**Mechanism**
- The generator uses `text.chars().take(2048)`.
**Why it exists**
- keeps rebuild cost bounded
- prevents very large contracts from dominating runtime
- ensures deterministic maximum work per contract
**Trade-off**
- content after the first 2048 characters does not affect the vector
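Reusing the `raw_char_buckets` sketch from Step 2, the truncation property can be checked directly; the inputs here are illustrative:
```rust
fn main() {
    let short = "a".repeat(2048);
    let long = format!("{}{}", short, "z".repeat(4096));
    // Characters past position 2048 never enter the counting loop,
    // so both inputs produce identical vectors.
    assert_eq!(raw_char_buckets(&short), raw_char_buckets(&long));
}
```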
---
### [DEF:Report:Vectorization:Step4:Block]
### @COMPLEXITY 5
### @PURPOSE Define the normalization stage that converts raw counts into a unit vector.
### @PRE Raw 128-dim vector has non-negative frequency counts.
### @POST Output vector has unit Euclidean norm unless the raw vector is all zeros.
### @SIDE_EFFECT Mutates the vector in place.
### @DATA_CONTRACT `[f64; 128] -> normalized [f64; 128]`
### @INVARIANT Similarity scoring assumes normalized vectors.
**Algorithm**
1. Compute `sum_sq = Σ(x_i^2)`.
2. Compute `norm = sqrt(sum_sq)`.
3. If `norm > 0.0`, divide each component by `norm`.
**Why normalization matters**
- removes bias from absolute text length
- enables cosine similarity as a direct dot product
**Operational note**
- for any non-empty contract body, at least one bucket receives a positive count, so the norm is positive and normalization always succeeds
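A Rust sketch of the in-place normalization described above; the name `normalize` is a stand-in for the code behind `Axiom:Embedding:VSS:Normalize`:
```rust
fn normalize(vector: &mut [f64; 128]) {
    // Euclidean norm of the raw frequency counts.
    let sum_sq: f64 = vector.iter().map(|x| x * x).sum();
    let norm = sum_sq.sqrt();
    // Guard: an all-zero vector (empty input) is left untouched.
    if norm > 0.0 {
        for x in vector.iter_mut() {
            *x /= norm;
        }
    }
}
```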
---
### [DEF:Report:Vectorization:Step5:Block]
### @COMPLEXITY 4
### @PURPOSE Explain persistence encoding for DuckDB storage.
### @PRE A normalized `[f64; 128]` vector exists in memory.
### @POST The vector is serialized into a compact JSON array string.
### @SIDE_EFFECT None.
### @DATA_CONTRACT `[f64; 128] -> String(vector_json)`
### @INVARIANT Stored vectors must remain length-128 after round-trip decoding.
**Mechanism**
- `vector_to_json` uses `serde_json::to_string(&vector.to_vec())`.
- Result is stored in DuckDB column `embeddings.vector_json TEXT`.
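A sketch of the encode helper; the `serde_json::to_string(&vector.to_vec())` expression comes from the report, while the exact signature is an assumption:
```rust
fn vector_to_json(vector: &[f64; 128]) -> serde_json::Result<String> {
    // Serializing via a Vec yields a plain JSON array of 128 numbers,
    // e.g. "[0.08,0.0,...]", easy to inspect in DuckDB directly.
    serde_json::to_string(&vector.to_vec())
}
```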
**Why JSON was chosen**
- simple and portable
- easy to inspect manually
- no custom binary format needed
**Cost**
- larger on disk than binary
- slower than native vector column types
---
### [DEF:Report:Vectorization:Step6:Block]
### @COMPLEXITY 5
### @PURPOSE Describe how vectors are written to DuckDB during rebuild.
### @PRE Rebuild runs with `use_duckdb=true`; schema bootstrap has succeeded; contracts are available in memory.
### @POST Each indexed contract receives an embedding row in `embeddings` when `refresh_embeddings=true`.
### @SIDE_EFFECT Inserts or replaces rows in DuckDB.
### @DATA_CONTRACT `ContractNode -> embeddings(contract_id, provider_id, vector_json, source_text)`
### @INVARIANT Embedding row identity is keyed by `contract_id`.
**Implementation path**
1. `rebuild_semantic_index(...)` reindexes the workspace.
2. If `use_duckdb=true`, it opens `graph.duckdb`.
3. `DuckDbIndexStore::populate_from_index(...)` clears/repopulates tables.
4. If `refresh_embeddings=true`, each contract body is embedded.
5. `upsert_embedding(...)` stores:
- `contract_id`
- `provider_id` (currently `local-fallback`)
- `vector_json`
- `source_text`
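A hedged sketch of the upsert, assuming the `duckdb` crate's rusqlite-style API; the actual `DuckDbIndexStore` implementation may bind parameters differently:
```rust
use duckdb::{params, Connection, Result};

fn upsert_embedding(
    conn: &Connection,
    contract_id: &str,
    provider_id: &str, // currently "local-fallback"
    vector_json: &str,
    source_text: &str,
) -> Result<()> {
    // Row identity is keyed by contract_id, so re-embedding replaces
    // the previous vector instead of accumulating duplicates.
    conn.execute(
        "INSERT OR REPLACE INTO embeddings \
         (contract_id, provider_id, vector_json, source_text) \
         VALUES (?, ?, ?, ?)",
        params![contract_id, provider_id, vector_json, source_text],
    )?;
    Ok(())
}
```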
**Current provider identity**
- storage path marks the provider as `local-fallback`
- rebuild response payload separately reports `embedding_provider_id = lexical-graph`
**Interpretation for downstream analysis**
- both labels refer to the same local deterministic embedding strategy, but naming is currently inconsistent across layers
---
### [DEF:Report:Vectorization:Step7:Block]
### @COMPLEXITY 4
### @PURPOSE Explain how stored vectors are loaded back from DuckDB.
### @PRE A row exists in `embeddings` for the target `contract_id`.
### @POST The vector round-trips back into Rust as `[f64; 128]`.
### @SIDE_EFFECT Reads DuckDB state.
### @DATA_CONTRACT `contract_id -> Option<[f64; 128]>`
### @INVARIANT Invalid JSON or non-128 vectors are treated as errors, not silently accepted.
**Mechanism**
- `get_embedding(contract_id)` loads `vector_json`
- `vector_from_json(json_str)` parses `Vec<f64>`
- parser enforces exact length `128`
**Safety property**
- malformed stored vectors fail loudly instead of contaminating similarity logic
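A sketch of the decode path with the length guard; the `String` error type is a simplification of whatever the production contract uses:
```rust
fn vector_from_json(json_str: &str) -> Result<[f64; 128], String> {
    let values: Vec<f64> =
        serde_json::from_str(json_str).map_err(|e| e.to_string())?;
    // Enforce exact dimensionality: anything other than 128 values
    // is rejected rather than silently padded or truncated.
    values
        .try_into()
        .map_err(|v: Vec<f64>| format!("expected 128 dims, got {}", v.len()))
}
```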
---
### [DEF:Report:Vectorization:Step8:Block]
### @COMPLEXITY 4
### @PURPOSE Define the similarity metric expected by the vector system.
### @PRE Both vectors are already L2-normalized and lengths are equal.
### @POST Cosine similarity is computed as a dot product in `[-1, 1]`.
### @SIDE_EFFECT None.
### @DATA_CONTRACT `[f64; 128] x [f64; 128] -> f64`
### @INVARIANT The similarity function assumes normalized inputs and does not renormalize them itself.
**Mechanism**
- `cosine_similarity(left, right) = Σ(left_i * right_i)`
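In Rust this is a one-line fold; because inputs are already unit-length, the dot product is the cosine similarity and no renormalization happens here:
```rust
fn cosine_similarity(left: &[f64; 128], right: &[f64; 128]) -> f64 {
    // Dot product only; callers must uphold the normalized-input invariant.
    left.iter().zip(right.iter()).map(|(l, r)| l * r).sum()
}
```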
**Important note**
- the primitive exists and is correct for the current representation
- but the similarity-search surface over DuckDB embeddings is still minimal; no ANN or vector-index layer exists yet
---
## 4. Storage Schema Relevant to Vectorization
### [DEF:Report:Vectorization:Schema:Block]
### @COMPLEXITY 4
### @PURPOSE Describe the DuckDB schema fields directly involved in vectorization.
### @PRE Reader needs storage-level understanding for independent analysis.
### @POST The embedding persistence surface is explicitly documented.
### @SIDE_EFFECT None.
Relevant table:
```sql
CREATE TABLE IF NOT EXISTS embeddings (
    contract_id TEXT PRIMARY KEY,
    provider_id TEXT,
    vector_json TEXT NOT NULL,
    source_text TEXT
);
```
Field meaning:
- `contract_id`: stable logical owner of the vector
- `provider_id`: identifier of embedding strategy
- `vector_json`: serialized 128-dim normalized vector
- `source_text`: source body used to create the vector
---
## 5. Rebuild-Time Operational Sequence
### [DEF:Report:Vectorization:RebuildSequence:Block]
### @COMPLEXITY 5
### @PURPOSE Provide the operational sequence for vector generation during semantic index rebuild.
### @PRE Workspace root and policy are valid; operational directories are creatable.
### @POST JSON snapshot is updated and DuckDB embeddings are refreshed when enabled.
### @SIDE_EFFECT Reads source files, writes JSON snapshot, writes DuckDB tables.
### @INVARIANT JSON snapshot and DuckDB content are derived from the same in-memory index rebuild.
Sequence:
1. Reindex workspace contracts and edges in memory.
2. Persist JSON semantic snapshot.
3. If `use_duckdb=true`, open/create `graph.duckdb`.
4. Bootstrap schema if absent.
5. Clear/repopulate files, contracts, edges.
6. If `refresh_embeddings=true`, compute a 128-dim vector for each contract body.
7. Serialize vectors to JSON.
8. Upsert rows into `embeddings`.
9. Save `IndexStatus` with embedding counts.
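Composing the earlier stage sketches, steps 6 through 8 of this sequence reduce to a loop like the one below; the `ContractNode` fields (extending the Step 1 stand-in with an `id`) and the error handling are assumptions:
```rust
struct ContractNode {
    id: String,
    body: String,
}

fn refresh_embeddings(
    conn: &duckdb::Connection,
    contracts: &[ContractNode],
) -> Result<(), Box<dyn std::error::Error>> {
    for contract in contracts {
        // Steps 6-7: bucket-count the body, normalize, serialize to JSON.
        let mut vector = raw_char_buckets(&contract.body);
        normalize(&mut vector);
        let vector_json = vector_to_json(&vector)?;
        // Step 8: persist, keyed by contract_id.
        upsert_embedding(
            conn,
            &contract.id,
            "local-fallback",
            &vector_json,
            &contract.body,
        )?;
    }
    Ok(())
}
```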
---
## 6. Design Characteristics for Another LLM to Evaluate
### [DEF:Report:Vectorization:EvaluationHints:Block]
### @COMPLEXITY 4
### @PURPOSE Highlight the key architectural properties another LLM should evaluate.
### @PRE Reader wants actionable analysis targets rather than raw implementation trivia.
### @POST The main trade-offs and audit points are explicit.
### @SIDE_EFFECT None.
### Strengths
- deterministic and reproducible
- offline-safe
- cheap rebuild cost
- no model-serving dependency
- transparent storage format
### Weaknesses
- not semantically deep like transformer embeddings
- collisions from modulo-128 hashing
- truncation at 2048 characters
- JSON storage instead of typed vector columns
- provider naming inconsistency (`local-fallback` vs `lexical-graph`)
### Questions worth analyzing
1. Should metadata and code body be embedded together or separately?
2. Should bucket count remain 128 or be increased?
3. Should similarity search be exposed as a first-class tool/API?
4. Should `provider_id` naming be normalized across rebuild response and storage?
5. Should long contracts use chunking instead of hard truncation at 2048 chars?
---
## 7. Exact Minimal Pseudocode
### [DEF:Report:Vectorization:Pseudocode:Block]
### @COMPLEXITY 3
### @PURPOSE Give another LLM a language-agnostic reproduction of the current embedding pipeline.
### @PRE Reader needs a faithful abstract form of the implementation.
### @POST The algorithm can be reimplemented without inspecting Rust syntax.
### @SIDE_EFFECT None.
```text
function embed_text(text):
    vector = [0.0] * 128
    for ch in first_2048_characters(text):
        idx = ord(ch) mod 128
        vector[idx] += 1.0
    norm = sqrt(sum(x*x for x in vector))
    if norm > 0:
        for i in range(128):
            vector[i] /= norm
    return vector

function store_embedding(contract_id, text):
    vector = embed_text(text)
    vector_json = json_encode(vector)
    upsert into embeddings(contract_id, "local-fallback", vector_json, text)
```
---
## 8. Current Truth Statement
### [DEF:Report:Vectorization:CurrentTruth:Block]
### @COMPLEXITY 4
### @PURPOSE Provide a final machine-readable summary of what is true today.
### @PRE All previous sections have been read or can be ignored for a compact summary.
### @POST Another LLM can extract the operative facts in one pass.
### @SIDE_EFFECT None.
- Vectorization technology: **deterministic character-frequency hashing**
- Embedding dimensionality: **128**
- Input cap: **first 2048 characters**
- Normalization: **L2 normalization**
- Storage encoding: **JSON array in DuckDB `embeddings.vector_json`**
- Similarity metric: **cosine similarity via dot product of normalized vectors**
- External model/provider dependency: **none**
- Primary objective: **cheap, deterministic, offline lexical-semantic approximation**
# [/DEF:Report:Vectorization:Root:Module]