# [DEF:Report:Vectorization:Root:Module]
# @COMPLEXITY 5
# @PURPOSE Explain the current vectorization technology used by the Rust semantic index, step by step, in a contract-oriented format suitable for downstream LLM analysis.
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:EmbedText]
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:Normalize]
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonSerialize]
# @RELATION DEPENDS_ON -> [Axiom:Embedding:VSS:JsonDeserialize]
# @RELATION DEPENDS_ON -> [Axiom:DB:Store:UpsertEmbedding]
# @RELATION DEPENDS_ON -> [Axiom:Services:Contract:Rebuild:SemanticIndex]
# @RATIONALE The report is structured as semantic contracts so another LLM can reason about the implementation without reverse-engineering code first.
# @REJECTED Free-form prose without @PRE/@POST was rejected because it weakens machine analysis and obscures invariants.
# Vectorization Technology Report

## 1. Executive Summary

The current system uses a **deterministic local fallback embedding pipeline**.

It is **not model-based** and **does not call any external embedding provider**. Instead, it computes a **128-dimensional vector** from raw text using **character-frequency hashing**, then **L2-normalizes** the vector and stores it in DuckDB as a **JSON array string** in the `embeddings` table.

This design is optimized for:

- deterministic rebuilds
- offline operation
- zero external dependencies at inference time
- reproducible semantic indexing across agent sessions

It is intentionally simpler than transformer embeddings.

---
## 2. Primary Production Contracts

### [DEF:Report:Vectorization:ContractMap:Block]
### @COMPLEXITY 4
### @PURPOSE Map the production contracts that implement the vectorization pipeline.
### @PRE Reader needs direct traceability from report steps to repository anchors.
### @POST Each critical stage is linked to a concrete production contract.
### @SIDE_EFFECT None.

| Stage | Contract ID | Responsibility |
|---|---|---|
| Vector generation | `Axiom:Embedding:VSS:EmbedText` | Build a 128-dim vector from text via character hashing |
| Normalization | `Axiom:Embedding:VSS:Normalize` | L2-normalize the vector |
| Similarity | `Axiom:Embedding:VSS:CosineSimilarity` | Compute cosine similarity between normalized vectors |
| Serialization | `Axiom:Embedding:VSS:JsonSerialize` | Encode vector as JSON string |
| Deserialization | `Axiom:Embedding:VSS:JsonDeserialize` | Decode JSON string back to `[f64; 128]` |
| Persistence | `Axiom:DB:Store:UpsertEmbedding` | Store embedding row in DuckDB |
| Retrieval | `Axiom:DB:Store:GetEmbedding` | Load embedding row from DuckDB |
| Rebuild orchestration | `Axiom:Services:Contract:Rebuild:SemanticIndex` | Trigger workspace reindex and optionally persist to DuckDB |

---
## 3. Step-by-Step Technology Flow

### [DEF:Report:Vectorization:Step1:Block]
### @COMPLEXITY 5
### @PURPOSE Define the text source that becomes embedding input.
### @PRE A semantic contract has already been parsed from workspace source and its `body` is available.
### @POST The system has a deterministic text payload suitable for embedding generation.
### @SIDE_EFFECT None directly; this step only defines input selection.
### @DATA_CONTRACT `ContractNode.body -> embed_text(text)`
### @INVARIANT The embedding source text is the contract body persisted by the indexer, not an external summary.

**Implementation reality**

- During rebuild, the system iterates over indexed contracts.
- For each contract, it passes `contract.body` into `embed_text(&contract.body)`.
- Therefore the vector represents the lexical content of the full `[DEF]...[/DEF]` body, including header metadata and body text.

**Important consequence**

- Similarity is influenced by both semantic tags (`@PURPOSE`, `@RELATION`, etc.) and implementation text.

---
### [DEF:Report:Vectorization:Step2:Block]
### @COMPLEXITY 5
### @PURPOSE Describe the deterministic vector construction algorithm.
### @PRE Input text is available as UTF-8 Rust `&str`.
### @POST A dense 128-dimensional floating-point vector is produced before normalization.
### @SIDE_EFFECT None.
### @DATA_CONTRACT `&str -> [f64; 128]`
### @INVARIANT No network, no stochastic model weights, and no external provider are involved.
### @RATIONALE Deterministic hashing is fast, portable, and reproducible.
### @REJECTED Transformer-based embeddings were rejected due to runtime cost and external dependency coupling.

**Algorithm**

1. Initialize `vector = [0.0; 128]`.
2. Iterate through `text.chars().take(2048)`.
3. For each character `ch`, compute `idx = (ch as usize) % 128`.
4. Increment `vector[idx] += 1.0`.

**Interpretation**

- This is a **character-bucket frequency sketch**.
- It is closer to a hashed lexical fingerprint than a learned semantic embedding.

**Strengths**

- deterministic
- cheap to compute
- stable across platforms
- robust enough for coarse lexical similarity

**Weaknesses**

- collisions are guaranteed because all characters map into 128 buckets
- no contextual semantics beyond lexical distribution
- weak synonym/generalization behavior compared with learned embeddings
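The raw-count stage above can be sketched in Rust. This is a minimal reconstruction from the algorithm description, not the production source; `embed_text_raw` is an illustrative name, and the real `embed_text` additionally normalizes the result (see Step 4).

```rust
/// Build the raw (pre-normalization) 128-dim character-frequency vector.
/// Sketch reconstructed from the algorithm description above.
fn embed_text_raw(text: &str) -> [f64; 128] {
    let mut vector = [0.0_f64; 128];
    // Only the first 2048 characters contribute (input bounding, Step 3).
    for ch in text.chars().take(2048) {
        let idx = (ch as usize) % 128;
        vector[idx] += 1.0;
    }
    vector
}

fn main() {
    let v = embed_text_raw("abcabc");
    // 'a' = 97, 'b' = 98, 'c' = 99; each character appears twice.
    assert_eq!(v[97], 2.0);
    assert_eq!(v[98], 2.0);
    assert_eq!(v[99], 2.0);
    println!("bucket counts: a={} b={} c={}", v[97], v[98], v[99]);
}
```

Note how the modulo maps every character, including non-ASCII, into the same 128 buckets, which is exactly where the guaranteed collisions listed under weaknesses come from.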
---

### [DEF:Report:Vectorization:Step3:Block]
### @COMPLEXITY 4
### @PURPOSE Explain input bounding and its effect on reproducibility.
### @PRE Raw contract body may be arbitrarily long.
### @POST Embedding computation uses at most the first 2048 characters.
### @SIDE_EFFECT Truncates effective semantic coverage for long contracts.
### @INVARIANT Runtime cost remains bounded and reproducible for every rebuild.

**Mechanism**

- The generator uses `text.chars().take(2048)`.

**Why it exists**

- keeps rebuild cost bounded
- prevents very large contracts from dominating runtime
- ensures deterministic maximum work per contract

**Trade-off**

- content after the first 2048 characters does not affect the vector
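The trade-off can be checked directly: two texts that diverge only after character 2048 produce identical embedding input. `bounded_input` below is an illustrative helper mirroring the `take(2048)` call, not a production identifier.

```rust
/// Bound the embedding input to its first 2048 characters, mirroring the
/// `text.chars().take(2048)` mechanism described above (name illustrative).
fn bounded_input(text: &str) -> String {
    text.chars().take(2048).collect()
}

fn main() {
    // Two texts that differ only after character 2048 yield the same
    // embedding input, so their vectors are identical by construction.
    let a = format!("{}AAA", "y".repeat(2048));
    let b = format!("{}BBB", "y".repeat(2048));
    assert_eq!(bounded_input(&a), bounded_input(&b));
    assert_eq!(bounded_input(&a).chars().count(), 2048);
    println!("inputs identical after bounding: true");
}
```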
---

### [DEF:Report:Vectorization:Step4:Block]
### @COMPLEXITY 5
### @PURPOSE Define the normalization stage that converts raw counts into a unit vector.
### @PRE Raw 128-dim vector has non-negative frequency counts.
### @POST Output vector has unit Euclidean norm unless the raw vector is all zeros.
### @SIDE_EFFECT Mutates the vector in place.
### @DATA_CONTRACT `[f64; 128] -> normalized [f64; 128]`
### @INVARIANT Similarity scoring assumes normalized vectors.

**Algorithm**

1. Compute `sum_sq = Σ(x_i^2)`.
2. Compute `norm = sqrt(sum_sq)`.
3. If `norm > 0.0`, divide each component by `norm`.

**Why normalization matters**

- removes bias from absolute text length
- enables cosine similarity as a direct dot product

**Operational note**

- for non-empty textual contracts, the vector should normally be non-zero and therefore normalized successfully
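A sketch of the three-step algorithm above, written with the in-place mutation that @SIDE_EFFECT describes (the function name is illustrative, not the production identifier):

```rust
/// L2-normalize a 128-dim vector in place, per the algorithm above.
fn normalize(vector: &mut [f64; 128]) {
    let sum_sq: f64 = vector.iter().map(|x| x * x).sum();
    let norm = sum_sq.sqrt();
    // Guard: an all-zero vector (e.g. from empty input) is left untouched.
    if norm > 0.0 {
        for x in vector.iter_mut() {
            *x /= norm;
        }
    }
}

fn main() {
    let mut v = [0.0_f64; 128];
    v[0] = 3.0;
    v[1] = 4.0;
    normalize(&mut v);
    // 3-4-5 triangle: components become 0.6 and 0.8, giving unit norm.
    assert!((v[0] - 0.6).abs() < 1e-12);
    assert!((v[1] - 0.8).abs() < 1e-12);
    let norm: f64 = v.iter().map(|x| x * x).sum::<f64>().sqrt();
    assert!((norm - 1.0).abs() < 1e-12);
    println!("norm after normalization: {norm}");
}
```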
---

### [DEF:Report:Vectorization:Step5:Block]
### @COMPLEXITY 4
### @PURPOSE Explain persistence encoding for DuckDB storage.
### @PRE A normalized `[f64; 128]` vector exists in memory.
### @POST The vector is serialized into a compact JSON array string.
### @SIDE_EFFECT None.
### @DATA_CONTRACT `[f64; 128] -> String(vector_json)`
### @INVARIANT Stored vectors must remain length-128 after round-trip decoding.

**Mechanism**

- `vector_to_json` uses `serde_json::to_string(&vector.to_vec())`.
- Result is stored in DuckDB column `embeddings.vector_json TEXT`.

**Why JSON was chosen**

- simple and portable
- easy to inspect manually
- no custom binary format needed

**Cost**

- larger on disk than binary
- slower than native vector column types
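The encoding shape can be illustrated with a stdlib-only stand-in for the `serde_json::to_string` call. This helper is purely illustrative; the exact float formatting produced by `serde_json` in production may differ, but the JSON-array structure is the same.

```rust
/// Encode a vector as a JSON array string. Illustrative stand-in for
/// the production `serde_json::to_string(&vector.to_vec())` call.
fn vector_to_json_sketch(vector: &[f64]) -> String {
    let parts: Vec<String> = vector.iter().map(|x| x.to_string()).collect();
    format!("[{}]", parts.join(","))
}

fn main() {
    let v = [0.0_f64, 0.5, 1.0];
    let json = vector_to_json_sketch(&v);
    // A plain JSON array of numbers, human-inspectable in the TEXT column.
    assert_eq!(json, "[0,0.5,1]");
    println!("{json}");
}
```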
---

### [DEF:Report:Vectorization:Step6:Block]
### @COMPLEXITY 5
### @PURPOSE Describe how vectors are written to DuckDB during rebuild.
### @PRE Rebuild runs with `use_duckdb=true`; schema bootstrap has succeeded; contracts are available in memory.
### @POST Each indexed contract receives an embedding row in `embeddings` when `refresh_embeddings=true`.
### @SIDE_EFFECT Inserts or replaces rows in DuckDB.
### @DATA_CONTRACT `ContractNode -> embeddings(contract_id, provider_id, vector_json, source_text)`
### @INVARIANT Embedding row identity is keyed by `contract_id`.

**Implementation path**

1. `rebuild_semantic_index(...)` reindexes the workspace.
2. If `use_duckdb=true`, it opens `graph.duckdb`.
3. `DuckDbIndexStore::populate_from_index(...)` clears/repopulates tables.
4. If `refresh_embeddings=true`, each contract body is embedded.
5. `upsert_embedding(...)` stores:
   - `contract_id`
   - `provider_id` (currently `local-fallback`)
   - `vector_json`
   - `source_text`

**Current provider identity**

- storage path marks the provider as `local-fallback`
- rebuild response payload separately reports `embedding_provider_id = lexical-graph`

**Interpretation for downstream analysis**

- both labels refer to the same local deterministic embedding strategy, but naming is currently inconsistent across layers
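Given the schema in Section 4 and the `contract_id` key invariant, the upsert step can be expressed as a single statement. This is a sketch assuming DuckDB's `INSERT OR REPLACE` form; the values are illustrative, and the production code issues the equivalent statement from Rust.

```sql
-- Sketch: replace any existing embedding row for the same contract_id.
-- Column names follow the schema in Section 4; values are illustrative.
INSERT OR REPLACE INTO embeddings (contract_id, provider_id, vector_json, source_text)
VALUES ('Axiom:Embedding:VSS:EmbedText', 'local-fallback', '[0.0, 0.1]', 'contract body text');
```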
---

### [DEF:Report:Vectorization:Step7:Block]
### @COMPLEXITY 4
### @PURPOSE Explain how stored vectors are loaded back from DuckDB.
### @PRE A row exists in `embeddings` for the target `contract_id`.
### @POST The vector round-trips back into Rust as `[f64; 128]`.
### @SIDE_EFFECT Reads DuckDB state.
### @DATA_CONTRACT `contract_id -> Option<[f64; 128]>`
### @INVARIANT Invalid JSON or non-128 vectors are treated as errors, not silently accepted.

**Mechanism**

- `get_embedding(contract_id)` loads `vector_json`
- `vector_from_json(json_str)` parses `Vec<f64>`
- parser enforces exact length `128`

**Safety property**

- malformed stored vectors fail loudly instead of contaminating similarity logic
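The "fail loudly" length contract can be sketched as the final validation step, assuming JSON parsing has already yielded a `Vec<f64>` (the helper name and error type here are illustrative):

```rust
/// Convert a parsed JSON vector into the fixed-size embedding type,
/// rejecting any length other than 128 (the Step 7 invariant).
fn to_embedding(parsed: Vec<f64>) -> Result<[f64; 128], String> {
    let len = parsed.len();
    // TryFrom<Vec<T>> for [T; N] fails unless the length matches exactly.
    parsed
        .try_into()
        .map_err(|_| format!("expected 128-dim vector, got {len}"))
}

fn main() {
    assert!(to_embedding(vec![0.0; 128]).is_ok());
    // A truncated or corrupted row is rejected, never padded or trimmed.
    assert!(to_embedding(vec![0.0; 64]).is_err());
    println!("length validation behaves as specified");
}
```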
---

### [DEF:Report:Vectorization:Step8:Block]
### @COMPLEXITY 4
### @PURPOSE Define the similarity metric expected by the vector system.
### @PRE Both vectors are already L2-normalized and lengths are equal.
### @POST Cosine similarity is computed as a dot product in `[-1, 1]`.
### @SIDE_EFFECT None.
### @DATA_CONTRACT `[f64; 128] x [f64; 128] -> f64`
### @INVARIANT The similarity function assumes normalized inputs and does not renormalize them itself.

**Mechanism**

- `cosine_similarity(left, right) = Σ(left_i * right_i)`

**Important note**

- the primitive exists and is correct for the current representation
- but the production similarity-search surface over DuckDB embeddings is still minimal; it is not yet a rich ANN/vector-index system
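The primitive above reduces to a plain dot product precisely because both inputs are unit vectors; a minimal sketch of that mechanism:

```rust
/// Cosine similarity as a plain dot product, valid only because both
/// inputs are already L2-normalized (this function does not renormalize).
fn cosine_similarity(left: &[f64; 128], right: &[f64; 128]) -> f64 {
    left.iter().zip(right.iter()).map(|(l, r)| l * r).sum()
}

fn main() {
    let mut a = [0.0_f64; 128];
    let mut b = [0.0_f64; 128];
    a[0] = 1.0; // unit vector along bucket 0
    b[0] = 1.0; // identical direction
    assert!((cosine_similarity(&a, &b) - 1.0).abs() < 1e-12);
    b[0] = 0.0;
    b[1] = 1.0; // orthogonal direction
    assert!(cosine_similarity(&a, &b).abs() < 1e-12);
    println!("identical -> 1.0, orthogonal -> 0.0");
}
```

Since character counts are non-negative, raw vectors from this pipeline actually land in `[0, 1]` rather than the full `[-1, 1]` range of general cosine similarity.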
---

## 4. Storage Schema Relevant to Vectorization

### [DEF:Report:Vectorization:Schema:Block]
### @COMPLEXITY 4
### @PURPOSE Describe the DuckDB schema fields directly involved in vectorization.
### @PRE Reader needs storage-level understanding for independent analysis.
### @POST The embedding persistence surface is explicitly documented.
### @SIDE_EFFECT None.

Relevant table:

```sql
CREATE TABLE IF NOT EXISTS embeddings (
    contract_id TEXT PRIMARY KEY,
    provider_id TEXT,
    vector_json TEXT NOT NULL,
    source_text TEXT
);
```

Field meaning:

- `contract_id`: stable logical owner of the vector
- `provider_id`: identifier of embedding strategy
- `vector_json`: serialized 128-dim normalized vector
- `source_text`: source body used to create the vector

---
## 5. Rebuild-Time Operational Sequence

### [DEF:Report:Vectorization:RebuildSequence:Block]
### @COMPLEXITY 5
### @PURPOSE Provide the operational sequence for vector generation during semantic index rebuild.
### @PRE Workspace root and policy are valid; operational directories are creatable.
### @POST JSON snapshot is updated and DuckDB embeddings are refreshed when enabled.
### @SIDE_EFFECT Reads source files, writes JSON snapshot, writes DuckDB tables.
### @INVARIANT JSON snapshot and DuckDB content are derived from the same in-memory index rebuild.

Sequence:

1. Reindex workspace contracts and edges in memory.
2. Persist JSON semantic snapshot.
3. If `use_duckdb=true`, open/create `graph.duckdb`.
4. Bootstrap schema if absent.
5. Clear/repopulate files, contracts, edges.
6. If `refresh_embeddings=true`, compute a 128-dim vector for each contract body.
7. Serialize vectors to JSON.
8. Upsert rows into `embeddings`.
9. Save `IndexStatus` with embedding counts.

---
## 6. Design Characteristics for Another LLM to Evaluate

### [DEF:Report:Vectorization:EvaluationHints:Block]
### @COMPLEXITY 4
### @PURPOSE Highlight the key architectural properties another LLM should evaluate.
### @PRE Reader wants actionable analysis targets rather than raw implementation trivia.
### @POST The main trade-offs and audit points are explicit.
### @SIDE_EFFECT None.

### Strengths

- deterministic and reproducible
- offline-safe
- cheap rebuild cost
- no model-serving dependency
- transparent storage format

### Weaknesses

- not as semantically deep as transformer embeddings
- collisions from modulo-128 hashing
- truncation at 2048 characters
- JSON storage instead of typed vector columns
- provider naming inconsistency (`local-fallback` vs `lexical-graph`)

### Questions worth analyzing

1. Should metadata and code body be embedded together or separately?
2. Should bucket count remain 128 or be increased?
3. Should similarity search be exposed as a first-class tool/API?
4. Should `provider_id` naming be normalized across rebuild response and storage?
5. Should long contracts use chunking instead of hard truncation at 2048 chars?

---
## 7. Exact Minimal Pseudocode

### [DEF:Report:Vectorization:Pseudocode:Block]
### @COMPLEXITY 3
### @PURPOSE Give another LLM a language-agnostic reproduction of the current embedding pipeline.
### @PRE Reader needs a faithful abstract form of the implementation.
### @POST The algorithm can be reimplemented without inspecting Rust syntax.
### @SIDE_EFFECT None.

```text
function embed_text(text):
    vector = [0.0] * 128
    for ch in first_2048_characters(text):
        idx = ord(ch) mod 128
        vector[idx] += 1.0

    norm = sqrt(sum(x*x for x in vector))
    if norm > 0:
        for i in range(128):
            vector[i] /= norm

    return vector

function store_embedding(contract_id, text):
    vector = embed_text(text)
    vector_json = json_encode(vector)
    upsert into embeddings(contract_id, provider_id, vector_json, source_text)
```

---
## 8. Current Truth Statement

### [DEF:Report:Vectorization:CurrentTruth:Block]
### @COMPLEXITY 4
### @PURPOSE Provide a final machine-readable summary of what is true today.
### @PRE All previous sections have been read or can be ignored for a compact summary.
### @POST Another LLM can extract the operative facts in one pass.
### @SIDE_EFFECT None.

- Vectorization technology: **deterministic character-frequency hashing**
- Embedding dimensionality: **128**
- Input cap: **first 2048 characters**
- Normalization: **L2 normalization**
- Storage encoding: **JSON array in DuckDB `embeddings.vector_json`**
- Similarity metric: **cosine similarity via dot product of normalized vectors**
- External model/provider dependency: **none**
- Primary objective: **cheap, deterministic, offline lexical-semantic approximation**

# [/DEF:Report:Vectorization:Root:Module]