ss-tools/.opencode/reports/vectorization-technology-report.md
2026-05-08 10:07:05 +03:00

[DEF:Report:Vectorization:Root:Module]

@COMPLEXITY 5

@PURPOSE Explain the current vectorization technology used by the Rust semantic index, step by step, in a contract-oriented format suitable for downstream LLM analysis.

@RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:EmbedText]

@RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:Normalize]

@RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonSerialize]

@RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonDeserialize]

@RELATION DEPENDS_ON -> [Axiom:DB:Store:UpsertEmbedding]

@RELATION DEPENDS_ON -> [Axiom:Services:Contract:Rebuild:SemanticIndex]

@RATIONALE The report is structured as semantic contracts so another LLM can reason about the implementation without reverse-engineering code first.

@REJECTED Free-form prose without @PRE/@POST was rejected because it weakens machine analysis and obscures invariants.

Vectorization Technology Report

1. Executive Summary

The current system uses a deterministic local fallback embedding pipeline.

It is not model-based and does not call any external embedding provider. Instead, it computes a 128-dimensional vector from raw text using character-frequency hashing, then L2-normalizes the vector and stores it in DuckDB as a JSON array string in the embeddings table.

This design is optimized for:

  • deterministic rebuilds
  • offline operation
  • zero external dependencies at inference time
  • reproducible semantic indexing across agent sessions

It is intentionally simpler than transformer embeddings.


2. Primary Production Contracts

[DEF:Report:Vectorization:ContractMap:Block]

@COMPLEXITY 4

@PURPOSE Map the production contracts that implement the vectorization pipeline.

@PRE Reader needs direct traceability from report steps to repository anchors.

@POST Each critical stage is linked to a concrete production contract.

@SIDE_EFFECT None.

| Stage | Contract ID | Responsibility |
| --- | --- | --- |
| Vector generation | Axiom:Embedding:VSS:EmbedText | Build a 128-dim vector from text via character hashing |
| Normalization | Axiom:Embedding:VSS:Normalize | L2-normalize the vector |
| Similarity | Axiom:Embedding:VSS:CosineSimilarity | Compute cosine similarity between normalized vectors |
| Serialization | Axiom:Embedding:VSS:JsonSerialize | Encode vector as JSON string |
| Deserialization | Axiom:Embedding:VSS:JsonDeserialize | Decode JSON string back to [f64; 128] |
| Persistence | Axiom:DB:Store:UpsertEmbedding | Store embedding row in DuckDB |
| Retrieval | Axiom:DB:Store:GetEmbedding | Load embedding row from DuckDB |
| Rebuild orchestration | Axiom:Services:Contract:Rebuild:SemanticIndex | Trigger workspace reindex and optionally persist to DuckDB |

3. Step-by-Step Technology Flow

[DEF:Report:Vectorization:Step1:Block]

@COMPLEXITY 5

@PURPOSE Define the text source that becomes embedding input.

@PRE A semantic contract has already been parsed from workspace source and its body is available.

@POST The system has a deterministic text payload suitable for embedding generation.

@SIDE_EFFECT None directly; this step only defines input selection.

@DATA_CONTRACT ContractNode.body -> embed_text(text)

@INVARIANT The embedding source text is the contract body persisted by the indexer, not an external summary.

Implementation reality

  • During rebuild, the system iterates over indexed contracts.
  • For each contract, it passes contract.body into embed_text(&contract.body).
  • Therefore the vector represents the lexical content of the full [DEF]...[/DEF] body, including header metadata and body text.

Important consequence

  • Similarity is influenced by both semantic tags (@PURPOSE, @RELATION, etc.) and implementation text.

[DEF:Report:Vectorization:Step2:Block]

@COMPLEXITY 5

@PURPOSE Describe the deterministic vector construction algorithm.

@PRE Input text is available as UTF-8 Rust &str.

@POST A dense 128-dimensional floating-point vector is produced before normalization.

@SIDE_EFFECT None.

@DATA_CONTRACT &str -> [f64; 128]

@INVARIANT No network, no stochastic model weights, and no external provider are involved.

@RATIONALE Deterministic hashing is fast, portable, and reproducible.

@REJECTED Transformer-based embeddings were rejected due to runtime cost and external dependency coupling.

Algorithm

  1. Initialize vector = [0.0; 128].
  2. Iterate through text.chars().take(2048).
  3. For each character ch, compute idx = (ch as usize) % 128.
  4. Increment vector[idx] += 1.0.
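
The four steps above can be sketched directly in Rust. The function name raw_char_vector is illustrative; the production embed_text folds this counting step together with the normalization described in Step 4.

```rust
/// Build the raw 128-bucket character-frequency vector (pre-normalization).
/// Illustrative sketch of Step 2; not the production function name.
fn raw_char_vector(text: &str) -> [f64; 128] {
    let mut vector = [0.0f64; 128];
    // Bound work at 2048 characters, as in Step 3.
    for ch in text.chars().take(2048) {
        let idx = (ch as usize) % 128;
        vector[idx] += 1.0;
    }
    vector
}

fn main() {
    // 'a' = 97, 'b' = 98, 'c' = 99; "abca" contains 'a' twice.
    let v = raw_char_vector("abca");
    println!("bucket 97 = {}", v[97]); // 2
}
```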

Interpretation

  • This is a character-bucket frequency sketch.
  • It is closer to a hashed lexical fingerprint than a learned semantic embedding.

Strengths

  • deterministic
  • cheap to compute
  • stable across platforms
  • robust enough for coarse lexical similarity

Weaknesses

  • collisions are guaranteed because all characters map into 128 buckets
  • no contextual semantics beyond lexical distribution
  • weak synonym/generalization behavior compared with learned embeddings
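
The collision weakness is easy to demonstrate: any two code points congruent modulo 128 share a bucket, so for example 'a' (U+0061 = 97) and 'á' (U+00E1 = 225) are indistinguishable to the sketch.

```rust
/// Bucket index used by the frequency sketch.
fn bucket(ch: char) -> usize {
    (ch as usize) % 128
}

fn main() {
    // 225 % 128 = 97, so 'á' collides with 'a'.
    assert_eq!(bucket('a'), bucket('á'));
    println!("'a' -> {}, 'á' -> {}", bucket('a'), bucket('á'));
}
```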

[DEF:Report:Vectorization:Step3:Block]

@COMPLEXITY 4

@PURPOSE Explain input bounding and its effect on reproducibility.

@PRE Raw contract body may be arbitrarily long.

@POST Embedding computation uses at most the first 2048 characters.

@SIDE_EFFECT Truncates effective semantic coverage for long contracts.

@INVARIANT Runtime cost remains bounded and reproducible for every rebuild.

Mechanism

  • The generator uses text.chars().take(2048).

Why it exists

  • keeps rebuild cost bounded
  • prevents very large contracts from dominating runtime
  • ensures deterministic maximum work per contract

Trade-off

  • content after the first 2048 characters does not affect the vector

[DEF:Report:Vectorization:Step4:Block]

@COMPLEXITY 5

@PURPOSE Define the normalization stage that converts raw counts into a unit vector.

@PRE Raw 128-dim vector has non-negative frequency counts.

@POST Output vector has unit Euclidean norm unless the raw vector is all zeros.

@SIDE_EFFECT Mutates the vector in place.

@DATA_CONTRACT [f64; 128] -> normalized [f64; 128]

@INVARIANT Similarity scoring assumes normalized vectors.

Algorithm

  1. Compute sum_sq = Σ(x_i^2).
  2. Compute norm = sqrt(sum_sq).
  3. If norm > 0.0, divide each component by norm.
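
A minimal Rust sketch of this stage, matching the three steps above (function name is illustrative):

```rust
/// L2-normalize in place; an all-zero vector is left untouched.
fn normalize(vector: &mut [f64; 128]) {
    let sum_sq: f64 = vector.iter().map(|x| x * x).sum();
    let norm = sum_sq.sqrt();
    if norm > 0.0 {
        for x in vector.iter_mut() {
            *x /= norm;
        }
    }
}

fn main() {
    let mut v = [0.0f64; 128];
    v[0] = 3.0;
    v[1] = 4.0;
    normalize(&mut v);
    // 3-4-5 triangle: components become 0.6 and 0.8, Euclidean norm becomes 1.
    println!("{} {}", v[0], v[1]);
}
```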

Why normalization matters

  • removes bias from absolute text length
  • enables cosine similarity as a direct dot product

Operational note

  • for non-empty textual contracts, the vector should normally be non-zero and therefore normalized successfully

[DEF:Report:Vectorization:Step5:Block]

@COMPLEXITY 4

@PURPOSE Explain persistence encoding for DuckDB storage.

@PRE A normalized [f64; 128] vector exists in memory.

@POST The vector is serialized into a compact JSON array string.

@SIDE_EFFECT None.

@DATA_CONTRACT [f64; 128] -> String(vector_json)

@INVARIANT Stored vectors must remain length-128 after round-trip decoding.

Mechanism

  • vector_to_json uses serde_json::to_string(&vector.to_vec()).
  • Result is stored in DuckDB column embeddings.vector_json TEXT.
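
The encoding shape can be sketched without pulling in serde_json. The production code uses serde_json::to_string(&vector.to_vec()); the hand-rolled formatter below is only a dependency-free illustration of the same "[x0,x1,...]" array shape, and its float formatting details (e.g. "0" vs "0.0") may differ from serde_json's.

```rust
/// Dependency-free stand-in for `serde_json::to_string(&vector.to_vec())`.
/// Illustrates the stored JSON array shape; not the production encoder.
fn vector_to_json(vector: &[f64; 128]) -> String {
    let parts: Vec<String> = vector.iter().map(|x| x.to_string()).collect();
    format!("[{}]", parts.join(","))
}

fn main() {
    let mut v = [0.0f64; 128];
    v[0] = 0.5;
    let json = vector_to_json(&v);
    println!("{}...", &json[..12]); // prefix of "[0.5,0,0,..."
}
```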

Why JSON was chosen

  • simple and portable
  • easy to inspect manually
  • no custom binary format needed

Cost

  • larger on disk than binary
  • slower than native vector column types

[DEF:Report:Vectorization:Step6:Block]

@COMPLEXITY 5

@PURPOSE Describe how vectors are written to DuckDB during rebuild.

@PRE Rebuild runs with use_duckdb=true; schema bootstrap has succeeded; contracts are available in memory.

@POST Each indexed contract receives an embedding row in embeddings when refresh_embeddings=true.

@SIDE_EFFECT Inserts or replaces rows in DuckDB.

@DATA_CONTRACT ContractNode -> embeddings(contract_id, provider_id, vector_json, source_text)

@INVARIANT Embedding row identity is keyed by contract_id.

Implementation path

  1. rebuild_semantic_index(...) reindexes the workspace.
  2. If use_duckdb=true, it opens graph.duckdb.
  3. DuckDbIndexStore::populate_from_index(...) clears/repopulates tables.
  4. If refresh_embeddings=true, each contract body is embedded.
  5. upsert_embedding(...) stores:
    • contract_id
    • provider_id (currently local-fallback)
    • vector_json
    • source_text
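
One plausible shape for the upsert statement, keyed on the contract_id primary key (the exact SQL in the production store may differ; placeholders and the literal provider value are illustrative):

```sql
-- Illustrative sketch of upsert_embedding; replaces any existing row for contract_id.
INSERT OR REPLACE INTO embeddings (contract_id, provider_id, vector_json, source_text)
VALUES (?, 'local-fallback', ?, ?);
```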

Current provider identity

  • storage path marks the provider as local-fallback
  • rebuild response payload separately reports embedding_provider_id = lexical-graph

Interpretation for downstream analysis

  • both labels refer to the same local deterministic embedding strategy, but naming is currently inconsistent across layers

[DEF:Report:Vectorization:Step7:Block]

@COMPLEXITY 4

@PURPOSE Explain how stored vectors are loaded back from DuckDB.

@PRE A row exists in embeddings for the target contract_id.

@POST The vector round-trips back into Rust as [f64; 128].

@SIDE_EFFECT Reads DuckDB state.

@DATA_CONTRACT contract_id -> Option<[f64; 128]>

@INVARIANT Invalid JSON or non-128 vectors are treated as errors, not silently accepted.

Mechanism

  • get_embedding(contract_id) loads vector_json
  • vector_from_json(json_str) parses Vec<f64>
  • parser enforces exact length 128

Safety property

  • malformed stored vectors fail loudly instead of contaminating similarity logic
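
A dependency-free sketch of this round-trip guard. The production vector_from_json parses with serde_json; the minimal parser below only illustrates the fail-loudly behavior and the exact length-128 enforcement.

```rust
/// Minimal stand-in for `vector_from_json`: parse a JSON number array and
/// enforce exact length 128, returning an error otherwise.
fn vector_from_json(json: &str) -> Result<[f64; 128], String> {
    let inner = json
        .trim()
        .strip_prefix('[')
        .and_then(|s| s.strip_suffix(']'))
        .ok_or_else(|| "not a JSON array".to_string())?;
    let values: Result<Vec<f64>, _> =
        inner.split(',').map(|s| s.trim().parse::<f64>()).collect();
    let values = values.map_err(|e| format!("bad number: {e}"))?;
    // Length check: a Vec of any other size fails the array conversion.
    let array: [f64; 128] = values
        .try_into()
        .map_err(|v: Vec<f64>| format!("expected 128 dims, got {}", v.len()))?;
    Ok(array)
}

fn main() {
    // A 3-element vector must be rejected, not silently padded.
    assert!(vector_from_json("[1.0,2.0,3.0]").is_err());
    let ok = vector_from_json(&format!("[{}]", vec!["0"; 128].join(",")));
    println!("round-trip ok: {}", ok.is_ok());
}
```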

[DEF:Report:Vectorization:Step8:Block]

@COMPLEXITY 4

@PURPOSE Define the similarity metric expected by the vector system.

@PRE Both vectors are already L2-normalized and lengths are equal.

@POST Cosine similarity is computed as a dot product in [-1, 1].

@SIDE_EFFECT None.

@DATA_CONTRACT [f64; 128] x [f64; 128] -> f64

@INVARIANT The similarity function assumes normalized inputs and does not renormalize them itself.

Mechanism

  • cosine_similarity(left, right) = Σ(left_i * right_i)
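
Because both inputs are already unit vectors, the whole metric reduces to a dot product, as a short Rust sketch shows:

```rust
/// Dot product of two already-normalized vectors; no renormalization here,
/// per the invariant above.
fn cosine_similarity(left: &[f64; 128], right: &[f64; 128]) -> f64 {
    left.iter().zip(right.iter()).map(|(l, r)| l * r).sum()
}

fn main() {
    let mut a = [0.0f64; 128];
    let mut b = [0.0f64; 128];
    a[0] = 1.0;  // unit vector along bucket 0
    b[0] = 0.6;
    b[1] = 0.8;  // unit vector in the bucket-0/1 plane
    println!("{}", cosine_similarity(&a, &b)); // 0.6
    println!("{}", cosine_similarity(&a, &a)); // 1
}
```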

Important note

  • the primitive exists and is correct for the current representation
  • however, the production similarity-search surface over DuckDB embeddings is still minimal: there is no rich ANN/vector-index system yet

4. Storage Schema Relevant to Vectorization

[DEF:Report:Vectorization:Schema:Block]

@COMPLEXITY 4

@PURPOSE Describe the DuckDB schema fields directly involved in vectorization.

@PRE Reader needs storage-level understanding for independent analysis.

@POST The embedding persistence surface is explicitly documented.

@SIDE_EFFECT None.

Relevant table:

CREATE TABLE IF NOT EXISTS embeddings (
    contract_id TEXT PRIMARY KEY,
    provider_id TEXT,
    vector_json TEXT NOT NULL,
    source_text TEXT
);

Field meaning:

  • contract_id: stable logical owner of the vector
  • provider_id: identifier of embedding strategy
  • vector_json: serialized 128-dim normalized vector
  • source_text: source body used to create the vector

5. Rebuild-Time Operational Sequence

[DEF:Report:Vectorization:RebuildSequence:Block]

@COMPLEXITY 5

@PURPOSE Provide the operational sequence for vector generation during semantic index rebuild.

@PRE Workspace root and policy are valid; operational directories are creatable.

@POST JSON snapshot is updated and DuckDB embeddings are refreshed when enabled.

@SIDE_EFFECT Reads source files, writes JSON snapshot, writes DuckDB tables.

@INVARIANT JSON snapshot and DuckDB content are derived from the same in-memory index rebuild.

Sequence:

  1. Reindex workspace contracts and edges in memory.
  2. Persist JSON semantic snapshot.
  3. If use_duckdb=true, open/create graph.duckdb.
  4. Bootstrap schema if absent.
  5. Clear/repopulate files, contracts, edges.
  6. If refresh_embeddings=true, compute a 128-dim vector for each contract body.
  7. Serialize vectors to JSON.
  8. Upsert rows into embeddings.
  9. Save IndexStatus with embedding counts.

6. Design Characteristics for Another LLM to Evaluate

[DEF:Report:Vectorization:EvaluationHints:Block]

@COMPLEXITY 4

@PURPOSE Highlight the key architectural properties another LLM should evaluate.

@PRE Reader wants actionable analysis targets rather than raw implementation trivia.

@POST The main trade-offs and audit points are explicit.

@SIDE_EFFECT None.

Strengths

  • deterministic and reproducible
  • offline-safe
  • cheap rebuild cost
  • no model-serving dependency
  • transparent storage format

Weaknesses

  • not semantically deep like transformer embeddings
  • collisions from modulo-128 hashing
  • truncation at 2048 characters
  • JSON storage instead of typed vector columns
  • provider naming inconsistency (local-fallback vs lexical-graph)

Questions worth analyzing

  1. Should metadata and code body be embedded together or separately?
  2. Should bucket count remain 128 or be increased?
  3. Should similarity search be exposed as a first-class tool/API?
  4. Should provider_id naming be normalized across rebuild response and storage?
  5. Should long contracts use chunking instead of hard truncation at 2048 chars?

7. Exact Minimal Pseudocode

[DEF:Report:Vectorization:Pseudocode:Block]

@COMPLEXITY 3

@PURPOSE Give another LLM a language-agnostic reproduction of the current embedding pipeline.

@PRE Reader needs a faithful abstract form of the implementation.

@POST The algorithm can be reimplemented without inspecting Rust syntax.

@SIDE_EFFECT None.

function embed_text(text):
    vector = [0.0] * 128
    for ch in first_2048_characters(text):
        idx = ord(ch) mod 128
        vector[idx] += 1.0

    norm = sqrt(sum(x*x for x in vector))
    if norm > 0:
        for i in range(128):
            vector[i] /= norm

    return vector

function store_embedding(contract_id, text):
    vector = embed_text(text)
    vector_json = json_encode(vector)
    upsert into embeddings(contract_id, provider_id, vector_json, source_text)

8. Current Truth Statement

[DEF:Report:Vectorization:CurrentTruth:Block]

@COMPLEXITY 4

@PURPOSE Provide a final machine-readable summary of what is true today.

@PRE All previous sections have been read or can be ignored for a compact summary.

@POST Another LLM can extract the operative facts in one pass.

@SIDE_EFFECT None.

  • Vectorization technology: deterministic character-frequency hashing
  • Embedding dimensionality: 128
  • Input cap: first 2048 characters
  • Normalization: L2 normalization
  • Storage encoding: JSON array in DuckDB embeddings.vector_json
  • Similarity metric: cosine similarity via dot product of normalized vectors
  • External model/provider dependency: none
  • Primary objective: cheap, deterministic, offline lexical-semantic approximation

[/DEF:Report:Vectorization:Root:Module]