[DEF:Report:Vectorization:Root:Module]
@COMPLEXITY 5
@PURPOSE Explain the current vectorization technology used by the Rust semantic index, step by step, in a contract-oriented format suitable for downstream LLM analysis.
@RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:EmbedText]
@RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:Normalize]
@RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonSerialize]
@RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonDeserialize]
@RELATION DEPENDS_ON -> [Axiom:DB:Store:UpsertEmbedding]
@RELATION DEPENDS_ON -> [Axiom:Services:Contract:Rebuild:SemanticIndex]
@RATIONALE The report is structured as semantic contracts so another LLM can reason about the implementation without reverse-engineering code first.
@REJECTED Free-form prose without @PRE/@POST was rejected because it weakens machine analysis and obscures invariants.
Vectorization Technology Report
1. Executive Summary
The current system uses a deterministic local fallback embedding pipeline.
It is not model-based and does not call any external embedding provider. Instead, it computes a 128-dimensional vector from raw text using character-frequency hashing, then L2-normalizes the vector and stores it in DuckDB as a JSON array string in the embeddings table.
This design is optimized for:
- deterministic rebuilds
- offline operation
- zero external dependencies at inference time
- reproducible semantic indexing across agent sessions
It is intentionally simpler than transformer embeddings.
2. Primary Production Contracts
[DEF:Report:Vectorization:ContractMap:Block]
@COMPLEXITY 4
@PURPOSE Map the production contracts that implement the vectorization pipeline.
@PRE Reader needs direct traceability from report steps to repository anchors.
@POST Each critical stage is linked to a concrete production contract.
@SIDE_EFFECT None.
| Stage | Contract ID | Responsibility |
|---|---|---|
| Vector generation | Axiom:Embedding:VSS:EmbedText | Build a 128-dim vector from text via character hashing |
| Normalization | Axiom:Embedding:VSS:Normalize | L2-normalize the vector |
| Similarity | Axiom:Embedding:VSS:CosineSimilarity | Compute cosine similarity between normalized vectors |
| Serialization | Axiom:Embedding:VSS:JsonSerialize | Encode the vector as a JSON string |
| Deserialization | Axiom:Embedding:VSS:JsonDeserialize | Decode the JSON string back to [f64; 128] |
| Persistence | Axiom:DB:Store:UpsertEmbedding | Store the embedding row in DuckDB |
| Retrieval | Axiom:DB:Store:GetEmbedding | Load the embedding row from DuckDB |
| Rebuild orchestration | Axiom:Services:Contract:Rebuild:SemanticIndex | Trigger workspace reindex and optionally persist to DuckDB |
3. Step-by-Step Technology Flow
[DEF:Report:Vectorization:Step1:Block]
@COMPLEXITY 5
@PURPOSE Define the text source that becomes embedding input.
@PRE A semantic contract has already been parsed from workspace source and its body is available.
@POST The system has a deterministic text payload suitable for embedding generation.
@SIDE_EFFECT None directly; this step only defines input selection.
@DATA_CONTRACT ContractNode.body -> embed_text(text)
@INVARIANT The embedding source text is the contract body persisted by the indexer, not an external summary.
Implementation reality
- During rebuild, the system iterates over indexed contracts.
- For each contract, it passes `contract.body` into `embed_text(&contract.body)` (see the sketch after this step's notes).
- Therefore the vector represents the lexical content of the full `[DEF]...[/DEF]` body, including header metadata and body text.
Important consequence
- Similarity is influenced by both semantic tags (`@PURPOSE`, `@RELATION`, etc.) and implementation text.
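A minimal Rust sketch of this input-selection step. The `ContractNode` shape and the `embed_all` helper are hypothetical illustrations; only `contract.body` and `embed_text` come from the production contracts, and `embed_text` is stubbed here because its real pipeline is sketched in Steps 2-4:

```rust
// Hypothetical shapes; field names follow this report's contracts.
struct ContractNode {
    id: String,
    body: String, // full [DEF]...[/DEF] body, header tags included
}

// Stub standing in for the Step 2-4 pipeline sketched later.
fn embed_text(_text: &str) -> [f64; 128] {
    [0.0; 128]
}

/// Rebuild-time input selection: the embedding source is the persisted
/// contract body itself, never an external summary.
fn embed_all(contracts: &[ContractNode]) -> Vec<(String, [f64; 128])> {
    contracts
        .iter()
        .map(|c| (c.id.clone(), embed_text(&c.body)))
        .collect()
}
```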
[DEF:Report:Vectorization:Step2:Block]
@COMPLEXITY 5
@PURPOSE Describe the deterministic vector construction algorithm.
@PRE Input text is available as UTF-8 Rust &str.
@POST A dense 128-dimensional floating-point vector is produced before normalization.
@SIDE_EFFECT None.
@DATA_CONTRACT &str -> [f64; 128]
@INVARIANT No network, no stochastic model weights, and no external provider are involved.
@RATIONALE Deterministic hashing is fast, portable, and reproducible.
@REJECTED Transformer-based embeddings were rejected due to runtime cost and external dependency coupling.
Algorithm (see the Rust sketch at the end of this step)
- Initialize `vector = [0.0; 128]`.
- Iterate through `text.chars().take(2048)`.
- For each character `ch`, compute `idx = (ch as usize) % 128`.
- Increment `vector[idx] += 1.0`.
Interpretation
- This is a character-bucket frequency sketch.
- It is closer to a hashed lexical fingerprint than a learned semantic embedding.
Strengths
- deterministic
- cheap to compute
- stable across platforms
- robust enough for coarse lexical similarity
Weaknesses
- collisions are guaranteed because all characters map into 128 buckets
- no contextual semantics beyond lexical distribution
- weak synonym/generalization behavior compared with learned embeddings
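A minimal Rust sketch of the construction stage described above. The helper name `embed_text_raw` is hypothetical; the production `embed_text` additionally applies the Step 4 normalization before returning:

```rust
/// Deterministic 128-bucket character-frequency sketch.
/// The same input text always yields the same raw counts:
/// no model weights, no network, no randomness.
fn embed_text_raw(text: &str) -> [f64; 128] {
    let mut vector = [0.0_f64; 128];
    // Only the first 2048 characters contribute (see Step 3).
    for ch in text.chars().take(2048) {
        // Modulo bucketing: collisions are guaranteed because all of
        // Unicode folds into 128 buckets.
        let idx = (ch as usize) % 128;
        vector[idx] += 1.0;
    }
    vector
}
```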
[DEF:Report:Vectorization:Step3:Block]
@COMPLEXITY 4
@PURPOSE Explain input bounding and its effect on reproducibility.
@PRE Raw contract body may be arbitrarily long.
@POST Embedding computation uses at most the first 2048 characters.
@SIDE_EFFECT Truncates effective semantic coverage for long contracts.
@INVARIANT Runtime cost remains bounded and reproducible for every rebuild.
Mechanism
- The generator uses `text.chars().take(2048)` (illustrated below).
Why it exists
- keeps rebuild cost bounded
- prevents very large contracts from dominating runtime
- ensures deterministic maximum work per contract
Trade-off
- content after the first 2048 characters does not affect the vector
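One reimplementation subtlety follows from `chars()`: the 2048 cap counts Unicode scalar values, not bytes. A small self-contained illustration:

```rust
fn main() {
    // 3000 characters but 6000 bytes in UTF-8.
    let text = "é".repeat(3000);
    let used = text.chars().take(2048).count();
    assert_eq!(used, 2048); // the cap is a character cap, not a byte cap
    println!("characters embedded: {used}");
}
```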
[DEF:Report:Vectorization:Step4:Block]
@COMPLEXITY 5
@PURPOSE Define the normalization stage that converts raw counts into a unit vector.
@PRE Raw 128-dim vector has non-negative frequency counts.
@POST Output vector has unit Euclidean norm unless the raw vector is all zeros.
@SIDE_EFFECT Mutates the vector in place.
@DATA_CONTRACT [f64; 128] -> normalized [f64; 128]
@INVARIANT Similarity scoring assumes normalized vectors.
Algorithm (sketched in Rust below)
- Compute `sum_sq = Σ(x_i^2)`.
- Compute `norm = sqrt(sum_sq)`.
- If `norm > 0.0`, divide each component by `norm`.
Why normalization matters
- removes bias from absolute text length
- enables cosine similarity as a direct dot product
Operational note
- any non-empty input increments at least one bucket, so the raw vector is non-zero and normalization succeeds for every non-empty contract body
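A minimal Rust sketch of this stage; the in-place mutation matches the @SIDE_EFFECT above and the zero-norm guard matches the @POST:

```rust
/// L2-normalize the vector in place; an all-zero vector is left
/// untouched rather than dividing by zero.
fn normalize(vector: &mut [f64; 128]) {
    let sum_sq: f64 = vector.iter().map(|x| x * x).sum();
    let norm = sum_sq.sqrt();
    if norm > 0.0 {
        for x in vector.iter_mut() {
            *x /= norm;
        }
    }
}
```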
[DEF:Report:Vectorization:Step5:Block]
@COMPLEXITY 4
@PURPOSE Explain persistence encoding for DuckDB storage.
@PRE A normalized [f64; 128] vector exists in memory.
@POST The vector is serialized into a compact JSON array string.
@SIDE_EFFECT None.
@DATA_CONTRACT [f64; 128] -> String(vector_json)
@INVARIANT Stored vectors must remain length-128 after round-trip decoding.
Mechanism
- `vector_to_json` uses `serde_json::to_string(&vector.to_vec())` (see the sketch below).
- The result is stored in the DuckDB column `embeddings.vector_json` (TEXT).
Why JSON was chosen
- simple and portable
- easy to inspect manually
- no custom binary format needed
Cost
- larger on disk than binary
- slower than native vector column types
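A minimal sketch of the encoding step, assuming the serde_json crate named above:

```rust
/// Encode the normalized vector as a compact JSON array string,
/// suitable for the embeddings.vector_json TEXT column.
fn vector_to_json(vector: &[f64; 128]) -> serde_json::Result<String> {
    // The fixed-size array is routed through Vec for serialization,
    // mirroring serde_json::to_string(&vector.to_vec()) above.
    serde_json::to_string(&vector.to_vec())
}
```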
[DEF:Report:Vectorization:Step6:Block]
@COMPLEXITY 5
@PURPOSE Describe how vectors are written to DuckDB during rebuild.
@PRE Rebuild runs with use_duckdb=true; schema bootstrap has succeeded; contracts are available in memory.
@POST Each indexed contract receives an embedding row in embeddings when refresh_embeddings=true.
@SIDE_EFFECT Inserts or replaces rows in DuckDB.
@DATA_CONTRACT ContractNode -> embeddings(contract_id, provider_id, vector_json, source_text)
@INVARIANT Embedding row identity is keyed by contract_id.
Implementation path
- `rebuild_semantic_index(...)` reindexes the workspace.
- If `use_duckdb=true`, it opens `graph.duckdb`.
- `DuckDbIndexStore::populate_from_index(...)` clears and repopulates the tables.
- If `refresh_embeddings=true`, each contract body is embedded.
- `upsert_embedding(...)` stores `contract_id`, `provider_id` (currently `local-fallback`), `vector_json`, and `source_text`.
Current provider identity
- The storage path marks the provider as `local-fallback`.
- The rebuild response payload separately reports `embedding_provider_id = lexical-graph`.
Interpretation for downstream analysis
- Both labels refer to the same local deterministic embedding strategy, but the naming is currently inconsistent across layers.
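A minimal sketch of the persistence call, assuming the duckdb crate (duckdb-rs) and the schema shown in Section 4; the production wrapper may differ in signature and error handling:

```rust
use duckdb::{params, Connection};

/// Insert or replace one embedding row; identity is keyed by the
/// contract_id primary key, matching the @INVARIANT above.
fn upsert_embedding(
    conn: &Connection,
    contract_id: &str,
    provider_id: &str, // currently "local-fallback"
    vector_json: &str,
    source_text: &str,
) -> duckdb::Result<()> {
    conn.execute(
        "INSERT OR REPLACE INTO embeddings
             (contract_id, provider_id, vector_json, source_text)
         VALUES (?, ?, ?, ?)",
        params![contract_id, provider_id, vector_json, source_text],
    )?;
    Ok(())
}
```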
[DEF:Report:Vectorization:Step7:Block]
@COMPLEXITY 4
@PURPOSE Explain how stored vectors are loaded back from DuckDB.
@PRE A row exists in embeddings for the target contract_id.
@POST The vector round-trips back into Rust as [f64; 128].
@SIDE_EFFECT Reads DuckDB state.
@DATA_CONTRACT contract_id -> Option<[f64; 128]>
@INVARIANT Invalid JSON or non-128 vectors are treated as errors, not silently accepted.
Mechanism
- `get_embedding(contract_id)` loads `vector_json`.
- `vector_from_json(json_str)` parses a `Vec<f64>`.
- The parser enforces an exact length of 128.
Safety property
- malformed stored vectors fail loudly instead of contaminating similarity logic
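A minimal sketch of the decoding stage; the String error type is a simplification (the production code may use its own error enum), but the strict length check is the property the @INVARIANT requires:

```rust
/// Decode a stored JSON array back into [f64; 128].
/// Malformed JSON or a wrong-length vector fails loudly.
fn vector_from_json(json_str: &str) -> Result<[f64; 128], String> {
    let values: Vec<f64> = serde_json::from_str(json_str)
        .map_err(|e| format!("invalid vector JSON: {e}"))?;
    let len = values.len();
    values
        .try_into()
        .map_err(|_| format!("expected 128 dimensions, got {len}"))
}
```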
[DEF:Report:Vectorization:Step8:Block]
@COMPLEXITY 4
@PURPOSE Define the similarity metric expected by the vector system.
@PRE Both vectors are already L2-normalized and lengths are equal.
@POST Cosine similarity is computed as a dot product in [-1, 1].
@SIDE_EFFECT None.
@DATA_CONTRACT [f64; 128] x [f64; 128] -> f64
@INVARIANT The similarity function assumes normalized inputs and does not renormalize them itself.
Mechanism
- `cosine_similarity(left, right) = Σ(left_i * right_i)`
Important note
- the primitive exists and is correct for the current representation
- a production similarity-search API over the DuckDB embeddings remains minimal; there is no rich ANN or vector-index layer yet
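A minimal Rust sketch of the primitive; note that it trusts its @PRE and performs no renormalization of its inputs:

```rust
/// Dot product of two already-normalized 128-dim vectors.
/// With unit-norm inputs this is exactly cosine similarity; for this
/// pipeline the result lands in [0, 1], since raw counts are non-negative.
fn cosine_similarity(left: &[f64; 128], right: &[f64; 128]) -> f64 {
    left.iter().zip(right.iter()).map(|(l, r)| l * r).sum()
}
```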
4. Storage Schema Relevant to Vectorization
[DEF:Report:Vectorization:Schema:Block]
@COMPLEXITY 4
@PURPOSE Describe the DuckDB schema fields directly involved in vectorization.
@PRE Reader needs storage-level understanding for independent analysis.
@POST The embedding persistence surface is explicitly documented.
@SIDE_EFFECT None.
Relevant table:
```sql
CREATE TABLE IF NOT EXISTS embeddings (
    contract_id TEXT PRIMARY KEY,
    provider_id TEXT,
    vector_json TEXT NOT NULL,
    source_text TEXT
);
```
Field meaning:
- `contract_id`: stable logical owner of the vector
- `provider_id`: identifier of the embedding strategy
- `vector_json`: serialized 128-dim normalized vector
- `source_text`: source body used to create the vector
5. Rebuild-Time Operational Sequence
[DEF:Report:Vectorization:RebuildSequence:Block]
@COMPLEXITY 5
@PURPOSE Provide the operational sequence for vector generation during semantic index rebuild.
@PRE Workspace root and policy are valid; operational directories are creatable.
@POST JSON snapshot is updated and DuckDB embeddings are refreshed when enabled.
@SIDE_EFFECT Reads source files, writes JSON snapshot, writes DuckDB tables.
@INVARIANT JSON snapshot and DuckDB content are derived from the same in-memory index rebuild.
Sequence:
- Reindex workspace contracts and edges in memory.
- Persist the JSON semantic snapshot.
- If `use_duckdb=true`, open or create `graph.duckdb`.
- Bootstrap the schema if absent.
- Clear and repopulate files, contracts, edges.
- If `refresh_embeddings=true`, compute a 128-dim vector for each contract body.
- Serialize vectors to JSON.
- Upsert rows into `embeddings`.
- Save `IndexStatus` with embedding counts.
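A condensed Rust-flavored sketch of this sequence. Apart from the contract and flag names used throughout this report, every helper here is hypothetical glue; `embed_text`, `vector_to_json`, and `upsert_embedding` refer to the sketches in Steps 2-6 above, and the remaining helpers are stubbed for brevity:

```rust
use std::path::Path;

// Hypothetical in-memory index shapes.
struct ContractNode { id: String, body: String }
struct Index { contracts: Vec<ContractNode> }

// Stubs for stages outside this report's scope.
fn reindex_workspace(_root: &Path) -> Index { Index { contracts: Vec::new() } }
fn persist_json_snapshot(_index: &Index) { /* writes the JSON snapshot */ }
fn save_index_status(_index: &Index) { /* records embedding counts */ }

fn rebuild_semantic_index(
    root: &Path,
    use_duckdb: bool,
    refresh_embeddings: bool,
) -> Result<(), Box<dyn std::error::Error>> {
    let index = reindex_workspace(root);   // contracts + edges in memory
    persist_json_snapshot(&index);         // JSON snapshot first
    if use_duckdb {
        let conn = duckdb::Connection::open(root.join("graph.duckdb"))?;
        // Schema bootstrap and clear/repopulate of files, contracts,
        // and edges elided; see DuckDbIndexStore::populate_from_index.
        if refresh_embeddings {
            for c in &index.contracts {
                let vector = embed_text(&c.body);           // Steps 1-4
                let vector_json = vector_to_json(&vector)?; // Step 5
                upsert_embedding(&conn, &c.id, "local-fallback",
                                 &vector_json, &c.body)?;   // Step 6
            }
        }
    }
    save_index_status(&index); // IndexStatus with embedding counts
    Ok(())
}
```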
6. Design Characteristics for Another LLM to Evaluate
[DEF:Report:Vectorization:EvaluationHints:Block]
@COMPLEXITY 4
@PURPOSE Highlight the key architectural properties another LLM should evaluate.
@PRE Reader wants actionable analysis targets rather than raw implementation trivia.
@POST The main trade-offs and audit points are explicit.
@SIDE_EFFECT None.
Strengths
- deterministic and reproducible
- offline-safe
- cheap rebuild cost
- no model-serving dependency
- transparent storage format
Weaknesses
- not semantically deep like transformer embeddings
- collisions from modulo-128 hashing
- truncation at 2048 characters
- JSON storage instead of typed vector columns
- provider naming inconsistency (`local-fallback` vs `lexical-graph`)
Questions worth analyzing
- Should metadata and code body be embedded together or separately?
- Should bucket count remain 128 or be increased?
- Should similarity search be exposed as a first-class tool/API?
- Should `provider_id` naming be normalized across the rebuild response and storage?
- Should long contracts use chunking instead of hard truncation at 2048 chars?
7. Exact Minimal Pseudocode
[DEF:Report:Vectorization:Pseudocode:Block]
@COMPLEXITY 3
@PURPOSE Give another LLM a language-agnostic reproduction of the current embedding pipeline.
@PRE Reader needs a faithful abstract form of the implementation.
@POST The algorithm can be reimplemented without inspecting Rust syntax.
@SIDE_EFFECT None.
```
function embed_text(text):
    vector = [0.0] * 128
    for ch in first_2048_characters(text):
        idx = ord(ch) mod 128
        vector[idx] += 1.0
    norm = sqrt(sum(x*x for x in vector))
    if norm > 0:
        for i in range(128):
            vector[i] /= norm
    return vector

function store_embedding(contract_id, text):
    vector = embed_text(text)
    vector_json = json_encode(vector)
    upsert into embeddings(contract_id, provider_id, vector_json, source_text)
```
8. Current Truth Statement
[DEF:Report:Vectorization:CurrentTruth:Block]
@COMPLEXITY 4
@PURPOSE Provide a final machine-readable summary of what is true today.
@PRE All previous sections have been read or can be ignored for a compact summary.
@POST Another LLM can extract the operative facts in one pass.
@SIDE_EFFECT None.
- Vectorization technology: deterministic character-frequency hashing
- Embedding dimensionality: 128
- Input cap: first 2048 characters
- Normalization: L2 normalization
- Storage encoding: JSON array in DuckDB `embeddings.vector_json`
- Similarity metric: cosine similarity via dot product of normalized vectors
- External model/provider dependency: none
- Primary objective: cheap, deterministic, offline lexical-semantic approximation