[DEF:Report:Vectorization:Root:Module]
@COMPLEXITY 5
@PURPOSE Explain the current vectorization technology used by the Rust semantic index, step by step, in a contract-oriented format suitable for downstream LLM analysis.
@RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:EmbedText]
@RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:Normalize]
@RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonSerialize]
@RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonDeserialize]
@RELATION DEPENDS_ON -> [Axiom:DB:Store:UpsertEmbedding]
@RELATION DEPENDS_ON -> [Axiom:Services:Contract:Rebuild:SemanticIndex]
@RATIONALE The report is structured as semantic contracts so another LLM can reason about the implementation without reverse-engineering code first.
@REJECTED Free-form prose without @PRE/@POST was rejected because it weakens machine analysis and obscures invariants.
Vectorization Technology Report
1. Executive Summary
The current system uses a deterministic local fallback embedding pipeline.
It is not model-based and does not call any external embedding provider. Instead, it computes a 128-dimensional vector from raw text using character-frequency hashing, then L2-normalizes the vector and stores it in DuckDB as a JSON array string in the embeddings table.
This design is optimized for:
- deterministic rebuilds
- offline operation
- zero external dependencies at inference time
- reproducible semantic indexing across agent sessions
It is intentionally simpler than transformer embeddings.
2. Primary Production Contracts
[DEF:Report:Vectorization:ContractMap:Block]
@COMPLEXITY 4
@PURPOSE Map the production contracts that implement the vectorization pipeline.
@PRE Reader needs direct traceability from report steps to repository anchors.
@POST Each critical stage is linked to a concrete production contract.
@SIDE_EFFECT None.
| Stage | Contract ID | Responsibility |
|---|---|---|
| Vector generation | Axiom:Embedding:VSS:EmbedText | Build a 128-dim vector from text via character hashing |
| Normalization | Axiom:Embedding:VSS:Normalize | L2-normalize the vector |
| Similarity | Axiom:Embedding:VSS:CosineSimilarity | Compute cosine similarity between normalized vectors |
| Serialization | Axiom:Embedding:VSS:JsonSerialize | Encode the vector as a JSON string |
| Deserialization | Axiom:Embedding:VSS:JsonDeserialize | Decode the JSON string back to [f64; 128] |
| Persistence | Axiom:DB:Store:UpsertEmbedding | Store the embedding row in DuckDB |
| Retrieval | Axiom:DB:Store:GetEmbedding | Load the embedding row from DuckDB |
| Rebuild orchestration | Axiom:Services:Contract:Rebuild:SemanticIndex | Trigger workspace reindex and optionally persist to DuckDB |
3. Step-by-Step Technology Flow
[DEF:Report:Vectorization:Step1:Block]
@COMPLEXITY 5
@PURPOSE Define the text source that becomes embedding input.
@PRE A semantic contract has already been parsed from workspace source and its body is available.
@POST The system has a deterministic text payload suitable for embedding generation.
@SIDE_EFFECT None directly; this step only defines input selection.
@DATA_CONTRACT ContractNode.body -> embed_text(text)
@INVARIANT The embedding source text is the contract body persisted by the indexer, not an external summary.
Implementation reality
- During rebuild, the system iterates over indexed contracts.
- For each contract, it passes `contract.body` into `embed_text(&contract.body)` (see the sketch after this step's notes).
- Therefore the vector represents the lexical content of the full `[DEF]...[/DEF]` body, including header metadata and body text.
Important consequence
- Similarity is influenced by both semantic tags (`@PURPOSE`, `@RELATION`, etc.) and implementation text.
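A minimal Rust sketch of this input-selection step. The `ContractNode` shape and the `embed_all` helper are hypothetical illustrations; only `contract.body` and `embed_text` come from the production contracts, and `embed_text` is stubbed here because its real pipeline is sketched in Steps 2-4:

```rust
// Hypothetical shapes; field names follow this report's contracts.
struct ContractNode {
    id: String,
    body: String, // full [DEF]...[/DEF] body, header tags included
}

// Stub standing in for the Step 2-4 pipeline sketched later.
fn embed_text(_text: &str) -> [f64; 128] {
    [0.0; 128]
}

/// Rebuild-time input selection: the embedding source is the persisted
/// contract body itself, never an external summary.
fn embed_all(contracts: &[ContractNode]) -> Vec<(String, [f64; 128])> {
    contracts
        .iter()
        .map(|c| (c.id.clone(), embed_text(&c.body)))
        .collect()
}
```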
[DEF:Report:Vectorization:Step2:Block]
@COMPLEXITY 5
@PURPOSE Describe the deterministic vector construction algorithm.
@PRE Input text is available as UTF-8 Rust &str.
@POST A dense 128-dimensional floating-point vector is produced before normalization.
@SIDE_EFFECT None.
@DATA_CONTRACT &str -> [f64; 128]
@INVARIANT No network, no stochastic model weights, and no external provider are involved.
@RATIONALE Deterministic hashing is fast, portable, and reproducible.
@REJECTED Transformer-based embeddings were rejected due to runtime cost and external dependency coupling.
Algorithm (see the Rust sketch at the end of this step)
- Initialize `vector = [0.0; 128]`.
- Iterate through `text.chars().take(2048)`.
- For each character `ch`, compute `idx = (ch as usize) % 128`.
- Increment `vector[idx] += 1.0`.
Interpretation
- This is a character-bucket frequency sketch.
- It is closer to a hashed lexical fingerprint than a learned semantic embedding.
Strengths
- deterministic
- cheap to compute
- stable across platforms
- robust enough for coarse lexical similarity
Weaknesses
- collisions are guaranteed because all characters map into 128 buckets
- no contextual semantics beyond lexical distribution
- weak synonym/generalization behavior compared with learned embeddings
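A minimal Rust sketch of the construction stage described above. The helper name `embed_text_raw` is hypothetical; the production `embed_text` additionally applies the Step 4 normalization before returning:

```rust
/// Deterministic 128-bucket character-frequency sketch.
/// The same input text always yields the same raw counts:
/// no model weights, no network, no randomness.
fn embed_text_raw(text: &str) -> [f64; 128] {
    let mut vector = [0.0_f64; 128];
    // Only the first 2048 characters contribute (see Step 3).
    for ch in text.chars().take(2048) {
        // Modulo bucketing: collisions are guaranteed because all of
        // Unicode folds into 128 buckets.
        let idx = (ch as usize) % 128;
        vector[idx] += 1.0;
    }
    vector
}
```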
[DEF:Report:Vectorization:Step3:Block]
@COMPLEXITY 4
@PURPOSE Explain input bounding and its effect on reproducibility.
@PRE Raw contract body may be arbitrarily long.
@POST Embedding computation uses at most the first 2048 characters.
@SIDE_EFFECT Truncates effective semantic coverage for long contracts.
@INVARIANT Runtime cost remains bounded and reproducible for every rebuild.
Mechanism
- The generator uses `text.chars().take(2048)` (illustrated below).
Why it exists
- keeps rebuild cost bounded
- prevents very large contracts from dominating runtime
- ensures deterministic maximum work per contract
Trade-off
- content after the first 2048 characters does not affect the vector
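One reimplementation subtlety follows from `chars()`: the 2048 cap counts Unicode scalar values, not bytes. A small self-contained illustration:

```rust
fn main() {
    // 3000 characters but 6000 bytes in UTF-8.
    let text = "é".repeat(3000);
    let used = text.chars().take(2048).count();
    assert_eq!(used, 2048); // the cap is a character cap, not a byte cap
    println!("characters embedded: {used}");
}
```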
[DEF:Report:Vectorization:Step4:Block]
@COMPLEXITY 5
@PURPOSE Define the normalization stage that converts raw counts into a unit vector.
@PRE Raw 128-dim vector has non-negative frequency counts.
@POST Output vector has unit Euclidean norm unless the raw vector is all zeros.
@SIDE_EFFECT Mutates the vector in place.
@DATA_CONTRACT [f64; 128] -> normalized [f64; 128]
@INVARIANT Similarity scoring assumes normalized vectors.
Algorithm (sketched in Rust below)
- Compute `sum_sq = Σ(x_i^2)`.
- Compute `norm = sqrt(sum_sq)`.
- If `norm > 0.0`, divide each component by `norm`.
Why normalization matters
- removes bias from absolute text length
- enables cosine similarity as a direct dot product
Operational note
- any non-empty input increments at least one bucket, so the raw vector is non-zero and normalization succeeds for every non-empty contract body
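A minimal Rust sketch of this stage; the in-place mutation matches the @SIDE_EFFECT above and the zero-norm guard matches the @POST:

```rust
/// L2-normalize the vector in place; an all-zero vector is left
/// untouched rather than dividing by zero.
fn normalize(vector: &mut [f64; 128]) {
    let sum_sq: f64 = vector.iter().map(|x| x * x).sum();
    let norm = sum_sq.sqrt();
    if norm > 0.0 {
        for x in vector.iter_mut() {
            *x /= norm;
        }
    }
}
```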
[DEF:Report:Vectorization:Step5:Block]
@COMPLEXITY 4
@PURPOSE Explain persistence encoding for DuckDB storage.
@PRE A normalized [f64; 128] vector exists in memory.
@POST The vector is serialized into a compact JSON array string.
@SIDE_EFFECT None.
@DATA_CONTRACT [f64; 128] -> String(vector_json)
@INVARIANT Stored vectors must remain length-128 after round-trip decoding.
Mechanism
- `vector_to_json` uses `serde_json::to_string(&vector.to_vec())` (see the sketch below).
- The result is stored in the DuckDB column `embeddings.vector_json` (TEXT).
Why JSON was chosen
- simple and portable
- easy to inspect manually
- no custom binary format needed
Cost
- larger on disk than binary
- slower than native vector column types
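A minimal sketch of the encoding step, assuming the serde_json crate named above:

```rust
/// Encode the normalized vector as a compact JSON array string,
/// suitable for the embeddings.vector_json TEXT column.
fn vector_to_json(vector: &[f64; 128]) -> serde_json::Result<String> {
    // The fixed-size array is routed through Vec for serialization,
    // mirroring serde_json::to_string(&vector.to_vec()) above.
    serde_json::to_string(&vector.to_vec())
}
```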
[DEF:Report:Vectorization:Step6:Block]
@COMPLEXITY 5
@PURPOSE Describe how vectors are written to DuckDB during rebuild.
@PRE Rebuild runs with use_duckdb=true; schema bootstrap has succeeded; contracts are available in memory.
@POST Each indexed contract receives an embedding row in embeddings when refresh_embeddings=true.
@SIDE_EFFECT Inserts or replaces rows in DuckDB.
@DATA_CONTRACT ContractNode -> embeddings(contract_id, provider_id, vector_json, source_text)
@INVARIANT Embedding row identity is keyed by contract_id.
Implementation path
- `rebuild_semantic_index(...)` reindexes the workspace.
- If `use_duckdb=true`, it opens `graph.duckdb`.
- `DuckDbIndexStore::populate_from_index(...)` clears and repopulates the tables.
- If `refresh_embeddings=true`, each contract body is embedded.
- `upsert_embedding(...)` stores `contract_id`, `provider_id` (currently `local-fallback`), `vector_json`, and `source_text`.
Current provider identity
- The storage path marks the provider as `local-fallback`.
- The rebuild response payload separately reports `embedding_provider_id = lexical-graph`.
Interpretation for downstream analysis
- Both labels refer to the same local deterministic embedding strategy, but the naming is currently inconsistent across layers.
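A minimal sketch of the persistence call, assuming the duckdb crate (duckdb-rs) and the schema shown in Section 4; the production wrapper may differ in signature and error handling:

```rust
use duckdb::{params, Connection};

/// Insert or replace one embedding row; identity is keyed by the
/// contract_id primary key, matching the @INVARIANT above.
fn upsert_embedding(
    conn: &Connection,
    contract_id: &str,
    provider_id: &str, // currently "local-fallback"
    vector_json: &str,
    source_text: &str,
) -> duckdb::Result<()> {
    conn.execute(
        "INSERT OR REPLACE INTO embeddings
             (contract_id, provider_id, vector_json, source_text)
         VALUES (?, ?, ?, ?)",
        params![contract_id, provider_id, vector_json, source_text],
    )?;
    Ok(())
}
```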
[DEF:Report:Vectorization:Step7:Block]
@COMPLEXITY 4
@PURPOSE Explain how stored vectors are loaded back from DuckDB.
@PRE A row exists in embeddings for the target contract_id.
@POST The vector round-trips back into Rust as [f64; 128].
@SIDE_EFFECT Reads DuckDB state.
@DATA_CONTRACT contract_id -> Option<[f64; 128]>
@INVARIANT Invalid JSON or non-128 vectors are treated as errors, not silently accepted.
Mechanism
- `get_embedding(contract_id)` loads `vector_json`.
- `vector_from_json(json_str)` parses a `Vec<f64>`.
- The parser enforces an exact length of 128.
Safety property
- malformed stored vectors fail loudly instead of contaminating similarity logic
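A minimal sketch of the decoding stage; the String error type is a simplification (the production code may use its own error enum), but the strict length check is the property the @INVARIANT requires:

```rust
/// Decode a stored JSON array back into [f64; 128].
/// Malformed JSON or a wrong-length vector fails loudly.
fn vector_from_json(json_str: &str) -> Result<[f64; 128], String> {
    let values: Vec<f64> = serde_json::from_str(json_str)
        .map_err(|e| format!("invalid vector JSON: {e}"))?;
    let len = values.len();
    values
        .try_into()
        .map_err(|_| format!("expected 128 dimensions, got {len}"))
}
```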
[DEF:Report:Vectorization:Step8:Block]
@COMPLEXITY 4
@PURPOSE Define the similarity metric expected by the vector system.
@PRE Both vectors are already L2-normalized and lengths are equal.
@POST Cosine similarity is computed as a dot product in [-1, 1].
@SIDE_EFFECT None.
@DATA_CONTRACT [f64; 128] x [f64; 128] -> f64
@INVARIANT The similarity function assumes normalized inputs and does not renormalize them itself.
Mechanism
- `cosine_similarity(left, right) = Σ(left_i * right_i)`
Important note
- the primitive exists and is correct for the current representation
- a production similarity-search API over the DuckDB embeddings remains minimal; there is no rich ANN or vector-index layer yet
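A minimal Rust sketch of the primitive; note that it trusts its @PRE and performs no renormalization of its inputs:

```rust
/// Dot product of two already-normalized 128-dim vectors.
/// With unit-norm inputs this is exactly cosine similarity; for this
/// pipeline the result lands in [0, 1], since raw counts are non-negative.
fn cosine_similarity(left: &[f64; 128], right: &[f64; 128]) -> f64 {
    left.iter().zip(right.iter()).map(|(l, r)| l * r).sum()
}
```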
4. Storage Schema Relevant to Vectorization
[DEF:Report:Vectorization:Schema:Block]
@COMPLEXITY 4
@PURPOSE Describe the DuckDB schema fields directly involved in vectorization.
@PRE Reader needs storage-level understanding for independent analysis.
@POST The embedding persistence surface is explicitly documented.
@SIDE_EFFECT None.
Relevant table:
```sql
CREATE TABLE IF NOT EXISTS embeddings (
    contract_id TEXT PRIMARY KEY,
    provider_id TEXT,
    vector_json TEXT NOT NULL,
    source_text TEXT
);
```
Field meaning:
- `contract_id`: stable logical owner of the vector
- `provider_id`: identifier of the embedding strategy
- `vector_json`: serialized 128-dim normalized vector
- `source_text`: source body used to create the vector
5. Rebuild-Time Operational Sequence
[DEF:Report:Vectorization:RebuildSequence:Block]
@COMPLEXITY 5
@PURPOSE Provide the operational sequence for vector generation during semantic index rebuild.
@PRE Workspace root and policy are valid; operational directories are creatable.
@POST JSON snapshot is updated and DuckDB embeddings are refreshed when enabled.
@SIDE_EFFECT Reads source files, writes JSON snapshot, writes DuckDB tables.
@INVARIANT JSON snapshot and DuckDB content are derived from the same in-memory index rebuild.
Sequence:
- Reindex workspace contracts and edges in memory.
- Persist the JSON semantic snapshot.
- If `use_duckdb=true`, open or create `graph.duckdb`.
- Bootstrap the schema if absent.
- Clear and repopulate files, contracts, edges.
- If `refresh_embeddings=true`, compute a 128-dim vector for each contract body.
- Serialize vectors to JSON.
- Upsert rows into `embeddings`.
- Save `IndexStatus` with embedding counts.
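A condensed Rust-flavored sketch of this sequence. Apart from the contract and flag names used throughout this report, every helper here is hypothetical glue; `embed_text`, `vector_to_json`, and `upsert_embedding` refer to the sketches in Steps 2-6 above, and the remaining helpers are stubbed for brevity:

```rust
use std::path::Path;

// Hypothetical in-memory index shapes.
struct ContractNode { id: String, body: String }
struct Index { contracts: Vec<ContractNode> }

// Stubs for stages outside this report's scope.
fn reindex_workspace(_root: &Path) -> Index { Index { contracts: Vec::new() } }
fn persist_json_snapshot(_index: &Index) { /* writes the JSON snapshot */ }
fn save_index_status(_index: &Index) { /* records embedding counts */ }

fn rebuild_semantic_index(
    root: &Path,
    use_duckdb: bool,
    refresh_embeddings: bool,
) -> Result<(), Box<dyn std::error::Error>> {
    let index = reindex_workspace(root);   // contracts + edges in memory
    persist_json_snapshot(&index);         // JSON snapshot first
    if use_duckdb {
        let conn = duckdb::Connection::open(root.join("graph.duckdb"))?;
        // Schema bootstrap and clear/repopulate of files, contracts,
        // and edges elided; see DuckDbIndexStore::populate_from_index.
        if refresh_embeddings {
            for c in &index.contracts {
                let vector = embed_text(&c.body);           // Steps 1-4
                let vector_json = vector_to_json(&vector)?; // Step 5
                upsert_embedding(&conn, &c.id, "local-fallback",
                                 &vector_json, &c.body)?;   // Step 6
            }
        }
    }
    save_index_status(&index); // IndexStatus with embedding counts
    Ok(())
}
```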
6. Design Characteristics for Another LLM to Evaluate
[DEF:Report:Vectorization:EvaluationHints:Block]
@COMPLEXITY 4
@PURPOSE Highlight the key architectural properties another LLM should evaluate.
@PRE Reader wants actionable analysis targets rather than raw implementation trivia.
@POST The main trade-offs and audit points are explicit.
@SIDE_EFFECT None.
Strengths
- deterministic and reproducible
- offline-safe
- cheap rebuild cost
- no model-serving dependency
- transparent storage format
Weaknesses
- not semantically deep like transformer embeddings
- collisions from modulo-128 hashing
- truncation at 2048 characters
- JSON storage instead of typed vector columns
- provider naming inconsistency (`local-fallback` vs `lexical-graph`)
Questions worth analyzing
- Should metadata and code body be embedded together or separately?
- Should bucket count remain 128 or be increased?
- Should similarity search be exposed as a first-class tool/API?
- Should `provider_id` naming be normalized across the rebuild response and storage?
- Should long contracts use chunking instead of hard truncation at 2048 chars?
7. Exact Minimal Pseudocode
[DEF:Report:Vectorization:Pseudocode:Block]
@COMPLEXITY 3
@PURPOSE Give another LLM a language-agnostic reproduction of the current embedding pipeline.
@PRE Reader needs a faithful abstract form of the implementation.
@POST The algorithm can be reimplemented without inspecting Rust syntax.
@SIDE_EFFECT None.
```
function embed_text(text):
    vector = [0.0] * 128
    for ch in first_2048_characters(text):
        idx = ord(ch) mod 128
        vector[idx] += 1.0
    norm = sqrt(sum(x*x for x in vector))
    if norm > 0:
        for i in range(128):
            vector[i] /= norm
    return vector

function store_embedding(contract_id, text):
    vector = embed_text(text)
    vector_json = json_encode(vector)
    upsert into embeddings(contract_id, provider_id, vector_json, source_text)
```
8. Current Truth Statement
[DEF:Report:Vectorization:CurrentTruth:Block]
@COMPLEXITY 4
@PURPOSE Provide a final machine-readable summary of what is true today.
@PRE All previous sections have been read or can be ignored for a compact summary.
@POST Another LLM can extract the operative facts in one pass.
@SIDE_EFFECT None.
- Vectorization technology: deterministic character-frequency hashing
- Embedding dimensionality: 128
- Input cap: first 2048 characters
- Normalization: L2 normalization
- Storage encoding: JSON array in DuckDB `embeddings.vector_json`
- Similarity metric: cosine similarity via dot product of normalized vectors
- External model/provider dependency: none
- Primary objective: cheap, deterministic, offline lexical-semantic approximation