# [DEF:Report:Vectorization:Root:Module]
# @COMPLEXITY 5
# @PURPOSE Explain the current vectorization technology used by the Rust semantic index, step by step, in a contract-oriented format suitable for downstream LLM analysis.
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:EmbedText]
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:Normalize]
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonSerialize]
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonDeserialize]
# @RELATION DEPENDS_ON -> [Axiom:DB:Store:UpsertEmbedding]
# @RELATION DEPENDS_ON -> [Axiom:Services:Contract:Rebuild:SemanticIndex]
# @RATIONALE The report is structured as semantic contracts so another LLM can reason about the implementation without reverse-engineering code first.
# @REJECTED Free-form prose without @PRE/@POST was rejected because it weakens machine analysis and obscures invariants.

# Vectorization Technology Report

## 1. Executive Summary

The current system uses a **deterministic local fallback embedding pipeline**. It is **not model-based** and **does not call any external embedding provider**. Instead, it computes a **128-dimensional vector** from raw text using **character-frequency hashing**, then **L2-normalizes** the vector and stores it in DuckDB as a **JSON array string** in the `embeddings` table.

This design is optimized for:

- deterministic rebuilds
- offline operation
- zero external dependencies at inference time
- reproducible semantic indexing across agent sessions

It is intentionally simpler than transformer embeddings.

---

## 2. Primary Production Contracts

### [DEF:Report:Vectorization:ContractMap:Block]
### @COMPLEXITY 4
### @PURPOSE Map the production contracts that implement the vectorization pipeline.
### @PRE Reader needs direct traceability from report steps to repository anchors.
### @POST Each critical stage is linked to a concrete production contract.
### @SIDE_EFFECT None.

| Stage | Contract ID | Responsibility |
|---|---|---|
| Vector generation | `Axiom:Embedding:VSS:EmbedText` | Build a 128-dim vector from text via character hashing |
| Normalization | `Axiom:Embedding:VSS:Normalize` | L2-normalize the vector |
| Similarity | `Axiom:Embedding:VSS:CosineSimilarity` | Compute cosine similarity between normalized vectors |
| Serialization | `Axiom:Embedding:VSS:JsonSerialize` | Encode vector as JSON string |
| Deserialization | `Axiom:Embedding:VSS:JsonDeserialize` | Decode JSON string back to `[f64; 128]` |
| Persistence | `Axiom:DB:Store:UpsertEmbedding` | Store embedding row in DuckDB |
| Retrieval | `Axiom:DB:Store:GetEmbedding` | Load embedding row from DuckDB |
| Rebuild orchestration | `Axiom:Services:Contract:Rebuild:SemanticIndex` | Trigger workspace reindex and optionally persist to DuckDB |

---

## 3. Step-by-Step Technology Flow

### [DEF:Report:Vectorization:Step1:Block]
### @COMPLEXITY 5
### @PURPOSE Define the text source that becomes embedding input.
### @PRE A semantic contract has already been parsed from workspace source and its `body` is available.
### @POST The system has a deterministic text payload suitable for embedding generation.
### @SIDE_EFFECT None directly; this step only defines input selection.
### @DATA_CONTRACT `ContractNode.body -> embed_text(text)`
### @INVARIANT The embedding source text is the contract body persisted by the indexer, not an external summary.

**Implementation reality**

- During rebuild, the system iterates over indexed contracts.
- For each contract, it passes `contract.body` into `embed_text(&contract.body)`.
- Therefore the vector represents the lexical content of the full `[DEF]...[/DEF]` body, including header metadata and body text.

**Important consequence**

- Similarity is influenced by both semantic tags (`@PURPOSE`, `@RELATION`, etc.) and implementation text.
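For concreteness, a minimal Rust sketch of this selection step follows. `ContractNode` is named by the data contract above, but the field layout and the `embed_all_contracts` helper shown here are illustrative; only the `embed_text(&contract.body)` call is confirmed. `embed_text` itself is sketched in full under Step 2.

```rust
// Hypothetical rebuild-loop shape; only the call
// `embed_text(&contract.body)` is confirmed by this report.
struct ContractNode {
    id: String,
    body: String, // full `[DEF]...[/DEF]` text, header tags included
}

fn embed_all_contracts(contracts: &[ContractNode]) -> Vec<(String, [f64; 128])> {
    contracts
        .iter()
        // The embedding input is the persisted body itself, so
        // `@PURPOSE`/`@RELATION` lines contribute to similarity.
        .map(|c| (c.id.clone(), embed_text(&c.body)))
        .collect()
}

// Placeholder; the full algorithm is sketched under Step 2 below.
fn embed_text(_text: &str) -> [f64; 128] {
    todo!()
}
```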
---

### [DEF:Report:Vectorization:Step2:Block]
### @COMPLEXITY 5
### @PURPOSE Describe the deterministic vector construction algorithm.
### @PRE Input text is available as a UTF-8 Rust `&str`.
### @POST A dense 128-dimensional floating-point vector is produced before normalization.
### @SIDE_EFFECT None.
### @DATA_CONTRACT `&str -> [f64; 128]`
### @INVARIANT No network, no stochastic model weights, and no external provider are involved.
### @RATIONALE Deterministic hashing is fast, portable, and reproducible.
### @REJECTED Transformer-based embeddings were rejected due to runtime cost and external dependency coupling.

**Algorithm**

1. Initialize `vector = [0.0; 128]`.
2. Iterate through `text.chars().take(2048)`.
3. For each character `ch`, compute `idx = (ch as usize) % 128`.
4. Add `1.0` to `vector[idx]`.

**Interpretation**

- This is a **character-bucket frequency sketch**.
- It is closer to a hashed lexical fingerprint than a learned semantic embedding.

**Strengths**

- deterministic
- cheap to compute
- stable across platforms
- robust enough for coarse lexical similarity

**Weaknesses**

- collisions are guaranteed because all characters map into 128 buckets
- no contextual semantics beyond lexical distribution
- weak synonym/generalization behavior compared with learned embeddings

---

### [DEF:Report:Vectorization:Step3:Block]
### @COMPLEXITY 4
### @PURPOSE Explain input bounding and its effect on reproducibility.
### @PRE Raw contract body may be arbitrarily long.
### @POST Embedding computation uses at most the first 2048 characters.
### @SIDE_EFFECT Truncates effective semantic coverage for long contracts.
### @INVARIANT Runtime cost remains bounded and reproducible for every rebuild.

**Mechanism**

- The generator uses `text.chars().take(2048)`.

**Why it exists**

- keeps rebuild cost bounded
- prevents very large contracts from dominating runtime
- ensures deterministic maximum work per contract

**Trade-off**

- content after the first 2048 characters does not affect the vector

---

### [DEF:Report:Vectorization:Step4:Block]
### @COMPLEXITY 5
### @PURPOSE Define the normalization stage that converts raw counts into a unit vector.
### @PRE Raw 128-dim vector has non-negative frequency counts.
### @POST Output vector has unit Euclidean norm unless the raw vector is all zeros.
### @SIDE_EFFECT Mutates the vector in place.
### @DATA_CONTRACT `[f64; 128] -> normalized [f64; 128]`
### @INVARIANT Similarity scoring assumes normalized vectors.

**Algorithm**

1. Compute `sum_sq = Σ(x_i^2)`.
2. Compute `norm = sqrt(sum_sq)`.
3. If `norm > 0.0`, divide each component by `norm`.

**Why normalization matters**

- removes bias from absolute text length
- enables cosine similarity as a direct dot product

**Operational note**

- any contract body containing at least one character produces a non-zero raw vector, so normalization succeeds for every non-empty contract
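Steps 2 through 4 combine into one small function. A minimal Rust sketch follows, assuming the signature `fn embed_text(text: &str) -> [f64; 128]`; the production code may keep normalization in a separate helper (per `Axiom:Embedding:VSS:Normalize`):

```rust
/// Deterministic 128-bucket character-frequency embedding (Steps 2-4).
/// A sketch; the production code may split normalization out.
fn embed_text(text: &str) -> [f64; 128] {
    let mut vector = [0.0_f64; 128];

    // Steps 2-3: bucket the first 2048 characters by code point modulo 128.
    for ch in text.chars().take(2048) {
        let idx = (ch as usize) % 128;
        vector[idx] += 1.0;
    }

    // Step 4: L2-normalize in place; an all-zero vector is left untouched.
    let norm = vector.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm > 0.0 {
        for x in vector.iter_mut() {
            *x /= norm;
        }
    }
    vector
}
```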
---

### [DEF:Report:Vectorization:Step5:Block]
### @COMPLEXITY 4
### @PURPOSE Explain persistence encoding for DuckDB storage.
### @PRE A normalized `[f64; 128]` vector exists in memory.
### @POST The vector is serialized into a compact JSON array string.
### @SIDE_EFFECT None.
### @DATA_CONTRACT `[f64; 128] -> String(vector_json)`
### @INVARIANT Stored vectors must remain length-128 after round-trip decoding.

**Mechanism**

- `vector_to_json` uses `serde_json::to_string(&vector.to_vec())`.
- the result is stored in the DuckDB column `embeddings.vector_json TEXT`

**Why JSON was chosen**

- simple and portable
- easy to inspect manually
- no custom binary format needed

**Cost**

- larger on disk than binary
- slower than native vector column types

---

### [DEF:Report:Vectorization:Step6:Block]
### @COMPLEXITY 5
### @PURPOSE Describe how vectors are written to DuckDB during rebuild.
### @PRE Rebuild runs with `use_duckdb=true`; schema bootstrap has succeeded; contracts are available in memory.
### @POST Each indexed contract receives an embedding row in `embeddings` when `refresh_embeddings=true`.
### @SIDE_EFFECT Inserts or replaces rows in DuckDB.
### @DATA_CONTRACT `ContractNode -> embeddings(contract_id, provider_id, vector_json, source_text)`
### @INVARIANT Embedding row identity is keyed by `contract_id`.

**Implementation path**

1. `rebuild_semantic_index(...)` reindexes the workspace.
2. If `use_duckdb=true`, it opens `graph.duckdb`.
3. `DuckDbIndexStore::populate_from_index(...)` clears and repopulates the tables.
4. If `refresh_embeddings=true`, each contract body is embedded.
5. `upsert_embedding(...)` stores:
   - `contract_id`
   - `provider_id` (currently `local-fallback`)
   - `vector_json`
   - `source_text`

**Current provider identity**

- storage path marks the provider as `local-fallback`
- rebuild response payload separately reports `embedding_provider_id = lexical-graph`

**Interpretation for downstream analysis**

- both labels refer to the same local deterministic embedding strategy, but naming is currently inconsistent across layers

---

### [DEF:Report:Vectorization:Step7:Block]
### @COMPLEXITY 4
### @PURPOSE Explain how stored vectors are loaded back from DuckDB.
### @PRE A row exists in `embeddings` for the target `contract_id`.
### @POST The vector round-trips back into Rust as `[f64; 128]`.
### @SIDE_EFFECT Reads DuckDB state.
### @DATA_CONTRACT `contract_id -> Option<[f64; 128]>`
### @INVARIANT Invalid JSON or non-128 vectors are treated as errors, not silently accepted.

**Mechanism**

- `get_embedding(contract_id)` loads `vector_json`
- `vector_from_json(json_str)` parses a `Vec<f64>`
- the parser enforces an exact length of `128`

**Safety property**

- malformed stored vectors fail loudly instead of contaminating similarity logic

---

### [DEF:Report:Vectorization:Step8:Block]
### @COMPLEXITY 4
### @PURPOSE Define the similarity metric expected by the vector system.
### @PRE Both vectors are already L2-normalized and lengths are equal.
### @POST Cosine similarity is computed as a dot product in `[-1, 1]`.
### @SIDE_EFFECT None.
### @DATA_CONTRACT `[f64; 128] x [f64; 128] -> f64`
### @INVARIANT The similarity function assumes normalized inputs and does not renormalize them itself.

**Mechanism**

- `cosine_similarity(left, right) = Σ(left_i * right_i)`

**Important note**

- the primitive exists and is correct for the current representation
- however, similarity search over DuckDB embeddings is still a minimal surface: there is no rich ANN or vector-index layer yet
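The serialization, validation, and similarity primitives from Steps 5, 7, and 8 can be sketched in a few lines of Rust. The function names (`vector_to_json`, `vector_from_json`, `cosine_similarity`) come from this report; the error types are assumptions (the production code may use its own error enum), and a `serde_json` dependency is assumed:

```rust
/// Step 5: encode a normalized vector as a JSON array string.
fn vector_to_json(vector: &[f64; 128]) -> Result<String, serde_json::Error> {
    serde_json::to_string(&vector.to_vec())
}

/// Step 7: decode a stored JSON string back to `[f64; 128]`,
/// rejecting any payload that is not exactly 128 numbers.
fn vector_from_json(json_str: &str) -> Result<[f64; 128], String> {
    let values: Vec<f64> = serde_json::from_str(json_str).map_err(|e| e.to_string())?;
    values
        .try_into()
        .map_err(|v: Vec<f64>| format!("expected 128 dimensions, got {}", v.len()))
}

/// Step 8: cosine similarity as a plain dot product.
/// Both inputs are assumed to be L2-normalized already.
fn cosine_similarity(left: &[f64; 128], right: &[f64; 128]) -> f64 {
    left.iter().zip(right.iter()).map(|(l, r)| l * r).sum()
}
```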
---

## 4. Storage Schema Relevant to Vectorization

### [DEF:Report:Vectorization:Schema:Block]
### @COMPLEXITY 4
### @PURPOSE Describe the DuckDB schema fields directly involved in vectorization.
### @PRE Reader needs storage-level understanding for independent analysis.
### @POST The embedding persistence surface is explicitly documented.
### @SIDE_EFFECT None.

Relevant table:

```sql
CREATE TABLE IF NOT EXISTS embeddings (
    contract_id TEXT PRIMARY KEY,
    provider_id TEXT,
    vector_json TEXT NOT NULL,
    source_text TEXT
);
```

Field meaning:

- `contract_id`: stable logical owner of the vector
- `provider_id`: identifier of the embedding strategy
- `vector_json`: serialized 128-dim normalized vector
- `source_text`: source body used to create the vector

---

## 5. Rebuild-Time Operational Sequence

### [DEF:Report:Vectorization:RebuildSequence:Block]
### @COMPLEXITY 5
### @PURPOSE Provide the operational sequence for vector generation during semantic index rebuild.
### @PRE Workspace root and policy are valid; operational directories are creatable.
### @POST JSON snapshot is updated and DuckDB embeddings are refreshed when enabled.
### @SIDE_EFFECT Reads source files, writes JSON snapshot, writes DuckDB tables.
### @INVARIANT JSON snapshot and DuckDB content are derived from the same in-memory index rebuild.

Sequence:

1. Reindex workspace contracts and edges in memory.
2. Persist the JSON semantic snapshot.
3. If `use_duckdb=true`, open/create `graph.duckdb`.
4. Bootstrap the schema if absent.
5. Clear and repopulate the files, contracts, and edges tables.
6. If `refresh_embeddings=true`, compute a 128-dim vector for each contract body.
7. Serialize vectors to JSON.
8. Upsert rows into `embeddings`.
9. Save `IndexStatus` with embedding counts.

---

## 6. Design Characteristics for Another LLM to Evaluate

### [DEF:Report:Vectorization:EvaluationHints:Block]
### @COMPLEXITY 4
### @PURPOSE Highlight the key architectural properties another LLM should evaluate.
### @PRE Reader wants actionable analysis targets rather than raw implementation trivia.
### @POST The main trade-offs and audit points are explicit.
### @SIDE_EFFECT None.

### Strengths

- deterministic and reproducible
- offline-safe
- cheap rebuild cost
- no model-serving dependency
- transparent storage format

### Weaknesses

- not semantically deep like transformer embeddings
- collisions from modulo-128 hashing
- truncation at 2048 characters
- JSON storage instead of typed vector columns
- provider naming inconsistency (`local-fallback` vs `lexical-graph`)

### Questions worth analyzing

1. Should metadata and code body be embedded together or separately?
2. Should the bucket count remain 128 or be increased?
3. Should similarity search be exposed as a first-class tool/API?
4. Should `provider_id` naming be normalized across the rebuild response and storage?
5. Should long contracts use chunking instead of hard truncation at 2048 characters?

---

## 7. Exact Minimal Pseudocode

### [DEF:Report:Vectorization:Pseudocode:Block]
### @COMPLEXITY 3
### @PURPOSE Give another LLM a language-agnostic reproduction of the current embedding pipeline.
### @PRE Reader needs a faithful abstract form of the implementation.
### @POST The algorithm can be reimplemented without inspecting Rust syntax.
### @SIDE_EFFECT None.

```text
function embed_text(text):
    vector = [0.0] * 128
    for ch in first_2048_characters(text):
        idx = ord(ch) mod 128
        vector[idx] += 1.0
    norm = sqrt(sum(x*x for x in vector))
    if norm > 0:
        for i in range(128):
            vector[i] /= norm
    return vector

function store_embedding(contract_id, text):
    vector = embed_text(text)
    vector_json = json_encode(vector)
    upsert into embeddings(contract_id, provider_id, vector_json, source_text)
```
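As a concrete counterpart to `store_embedding`, a Rust sketch of the persistence step against the Section 4 schema follows. It assumes the rusqlite-style API of the `duckdb` crate; the `upsert_embedding` body shown here is an illustration, not the exact production wrapper:

```rust
use duckdb::{params, Connection};

/// Sketch of the persistence step, assuming the rusqlite-style
/// `duckdb` crate API; the production wrapper may differ.
fn upsert_embedding(
    conn: &Connection,
    contract_id: &str,
    vector_json: &str,
    source_text: &str,
) -> duckdb::Result<()> {
    // INSERT OR REPLACE keys on the PRIMARY KEY (contract_id),
    // matching the row-identity invariant in Step 6.
    conn.execute(
        "INSERT OR REPLACE INTO embeddings
             (contract_id, provider_id, vector_json, source_text)
         VALUES (?, ?, ?, ?)",
        params![contract_id, "local-fallback", vector_json, source_text],
    )?;
    Ok(())
}
```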
---

## 8. Current Truth Statement

### [DEF:Report:Vectorization:CurrentTruth:Block]
### @COMPLEXITY 4
### @PURPOSE Provide a final machine-readable summary of what is true today.
### @PRE All previous sections have been read or can be ignored for a compact summary.
### @POST Another LLM can extract the operative facts in one pass.
### @SIDE_EFFECT None.

- Vectorization technology: **deterministic character-frequency hashing**
- Embedding dimensionality: **128**
- Input cap: **first 2048 characters**
- Normalization: **L2 normalization**
- Storage encoding: **JSON array in DuckDB `embeddings.vector_json`**
- Similarity metric: **cosine similarity via dot product of normalized vectors**
- External model/provider dependency: **none**
- Primary objective: **cheap, deterministic, offline lexical-semantic approximation**

# [/DEF:Report:Vectorization:Root:Module]