# [DEF:Axiom_Tools_Evaluation:Report]
# @COMPLEXITY: 4
# @PURPOSE: Comprehensive evaluation of all axiom-core MCP server tools across 8 UX metrics.
# @LAYER: Analysis
# @RELATION: DEPENDS_ON -> [Project_Knowledge_Map:Root]
# @PRE: All axiom-core tools have been exercised with valid and invalid inputs.
# @POST: Report file exists with per-tool scores and aggregate findings.
# @SIDE_EFFECT: Creates evaluation artifact in .ai/reports/.
# @DATA_CONTRACT: Input[Tool Suite] -> Output[Evaluation Report]
# @INVARIANT: Each tool must be scored on all 8 metrics; no tool may be omitted.

---

# Axiom-Core MCP Tools Evaluation Report

**Date:** 2026-03-31

**Workspace:** `/home/busya/dev/ss-tools`

**Evaluator:** Kilo Code (Coder Mode)

**Index Stats:** 2528 contracts, 2186 relations, 450 files

---

## Scoring Scale

| Score | Meaning |
|-------|---------|
| 5 | Excellent — no friction, best-in-class |
| 4 | Good — minor quirks, easily understood |
| 3 | Acceptable — some learning curve, works as expected |
| 2 | Poor — confusing or inconsistent behavior |
| 1 | Broken — fails to meet basic expectations |

---

## 1. reindex_workspace_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | Name is self-explanatory; purpose is obvious. |
| Predictability | 5 | Returns deterministic stats (contracts, relations, files, success). |
| Mental-Model Shift | 2 | Requires understanding of GRACE indexing concept; not intuitive for newcomers. |
| Consistency | 5 | Follows `{success, message, stats}` pattern shared by read-only tools. |
| Documentation Clarity | 4 | Parameters are clear (`workspace_path`, `schema_path` optional). |
| Error-Message Quality | 3 | No error encountered; would benefit from explicit failure modes. |
| Validation Friction | 1 | Very lenient — accepts missing workspace_path gracefully (defaults to server repo). |
| Recovery Simplicity | 5 | Pure read/index operation; re-run to refresh. No state to undo. |

**Average: 3.75 / 5**

---

## 2. search_contracts_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Search contracts by query" — crystal clear. |
| Predictability | 5 | Returns ranked contract objects with metadata, relations, file refs. |
| Mental-Model Shift | 2 | Requires understanding of semantic search vs. text search. |
| Consistency | 5 | Output shape matches `find_contract_tool` exactly. |
| Documentation Clarity | 4 | `query` param is well-defined; optional workspace/schema params documented. |
| Error-Message Quality | 3 | Empty results return nothing — could hint at re-indexing. |
| Validation Friction | 1 | Accepts any string; no pre-validation needed. |
| Recovery Simplicity | 5 | Stateless query; re-run with different query. |

**Average: 3.75 / 5**

---

## 3. read_grace_outline_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 4 | "GRACE outline" is domain-specific but clear from context. |
| Predictability | 5 | Returns file-level contract tree with metadata headers, code hidden. |
| Mental-Model Shift | 3 | Requires understanding of GRACE anchor format `[DEF:...]`. |
| Consistency | 5 | Output format is stable across files. |
| Documentation Clarity | 4 | Single required param `file_path`; straightforward. |
| Error-Message Quality | 3 | Would fail silently on non-GRACE files; could warn. |
| Validation Friction | 1 | No pre-validation; accepts any path. |
| Recovery Simplicity | 5 | Pure read; no side effects. |

**Average: 3.75 / 5**

---

## 4. ast_search_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 4 | AST-grep pattern search — clear to developers familiar with the tool. |
| Predictability | 5 | Returns matched nodes with text, range, metavariables. |
| Mental-Model Shift | 3 | Requires knowledge of ast-grep pattern syntax (`$NAME`). |
| Consistency | 5 | Output shape is consistent (array of match objects). |
| Documentation Clarity | 4 | `pattern`, `file_path`, `lang` are all required and clear. |
| Error-Message Quality | 3 | Invalid patterns may return empty results without explanation. |
| Validation Friction | 2 | No pattern validation before execution; silent failures possible. |
| Recovery Simplicity | 5 | Stateless; re-run with corrected pattern. |

**Average: 3.88 / 5**

---

## 5. get_semantic_context_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 4 | "Get semantic context around a contract" — clear intent. |
| Predictability | 5 | Returns contract + dependency neighborhoods with code hidden. |
| Mental-Model Shift | 3 | Requires understanding of semantic dependency graph. |
| Consistency | 5 | Output format is stable and well-structured. |
| Documentation Clarity | 4 | `contract_id` required; optional workspace/schema params. |
| Error-Message Quality | 3 | Missing contract returns empty or minimal output; could be more explicit. |
| Validation Friction | 1 | Accepts any string; no pre-validation. |
| Recovery Simplicity | 5 | Pure read; no state to undo. |

**Average: 3.75 / 5**

---

## 6. build_task_context_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 4 | "Build task-focused context" — clear for implementation workflows. |
| Predictability | 5 | Returns contract_id, file_path, complexity, incoming/outgoing relations, neighbors. |
| Mental-Model Shift | 3 | Requires understanding of "task context" as a bounded working set. |
| Consistency | 5 | Output shape is deterministic and well-structured. |
| Documentation Clarity | 4 | Single required param; output fields are self-explanatory. |
| Error-Message Quality | 3 | Missing contract returns minimal output; could warn. |
| Validation Friction | 1 | No pre-validation; accepts any contract_id. |
| Recovery Simplicity | 5 | Stateless; re-run anytime. |

**Average: 3.75 / 5**

---

## 7. workspace_semantic_health_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Semantic health" — clear dashboard-style summary. |
| Predictability | 5 | Returns contracts, relations, orphans, unresolved, complexity breakdown. |
| Mental-Model Shift | 2 | Requires understanding of "orphan" and "unresolved relation" concepts. |
| Consistency | 5 | Output shape is stable across invocations. |
| Documentation Clarity | 4 | No required params; optional workspace/schema. |
| Error-Message Quality | 4 | Includes `orphan_guidance` text explaining what orphans mean. |
| Validation Friction | 1 | No pre-validation needed. |
| Recovery Simplicity | 5 | Pure read; no state to undo. |

**Average: 3.88 / 5**

---

## 8. audit_contracts_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Audit contracts" — clear intent for quality checks. |
| Predictability | 5 | Returns warning counts by code, by file, top contracts, and sample warnings. |
| Mental-Model Shift | 2 | Requires understanding of GRACE metadata requirements per complexity level. |
| Consistency | 5 | Output shape is stable; `detail_level` controls verbosity. |
| Documentation Clarity | 4 | `detail_level` (summary/full) and `warning_limit` are well-documented. |
| Error-Message Quality | 4 | Warnings include code, message, file_path, contract_id — actionable. |
| Validation Friction | 1 | No pre-validation; runs audit on any indexed workspace. |
| Recovery Simplicity | 5 | Pure read; no state to undo. |

**Average: 3.88 / 5**

---

## 9. diff_contract_semantics_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 4 | "Diff contract semantics" — clear for comparing two contract versions. |
| Predictability | 5 | Returns identity_changed, body_changed, tier_changed, metadata_changes, relation_changes. |
| Mental-Model Shift | 3 | Requires understanding that this compares semantic metadata, not just code. |
| Consistency | 5 | Output shape matches guarded_patch diff output. |
| Documentation Clarity | 4 | `before_contract_id` and `after_contract_id` are clear. |
| Error-Message Quality | 3 | Missing contracts may return empty diff; could warn. |
| Validation Friction | 1 | No pre-validation; accepts any contract IDs. |
| Recovery Simplicity | 5 | Pure read; no state to undo. |

**Average: 3.75 / 5**

---

## 10. impact_analysis_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Impact analysis" — clear intent for dependency impact. |
| Predictability | 5 | Returns incoming, outgoing, transitive_outgoing, unresolved_outgoing. |
| Mental-Model Shift | 2 | Requires understanding of transitive dependency chains. |
| Consistency | 5 | Output shape matches guarded_patch impact output. |
| Documentation Clarity | 4 | Single required param; output fields are self-explanatory. |
| Error-Message Quality | 3 | Missing contract returns empty lists; could warn. |
| Validation Friction | 1 | No pre-validation; accepts any contract_id. |
| Recovery Simplicity | 5 | Pure read; no state to undo. |

**Average: 3.75 / 5**

---

## 11. simulate_patch_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 4 | "Simulate patch" — clear preview of changes without applying. |
| Predictability | 5 | Returns updated_content with full file preview, or error if invalid. |
| Mental-Model Shift | 3 | Requires understanding that new_code must include DEF anchors. |
| Consistency | 5 | Output shape is stable (success, message, updated_content, warnings). |
| Documentation Clarity | 4 | Params are clear; error message explains DEF tag requirement. |
| Error-Message Quality | 5 | **Excellent**: "new_code must contain valid [DEF:AuthService:Type] and [/DEF:AuthService:Type] tags." |
| Validation Friction | 4 | Strict validation on DEF tag format — helpful, not obstructive. |
| Recovery Simplicity | 5 | No state change; fix new_code and re-run. |

**Average: 4.38 / 5**

---

## 12. guarded_patch_contract_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Guarded patch" — clear that validation guards are applied before changes. |
| Predictability | 5 | Returns diff, impact, and applied flag. Guards include syntax, semantic diff, impact. |
| Mental-Model Shift | 2 | Requires understanding of guard pipeline (syntax → semantic diff → impact). |
| Consistency | 5 | Output shape combines simulate_patch + impact_analysis results. |
| Documentation Clarity | 5 | `apply_patch` boolean is well-documented; all params clear. |
| Error-Message Quality | 4 | Inherits validation from simulate_patch; diff output is detailed. |
| Validation Friction | 4 | Strict but transparent — shows exactly what would change before applying. |
| Recovery Simplicity | 5 | With `apply_patch=false`, no state change. With `true`, git can revert. |

**Average: 4.38 / 5**

---

## 13. patch_contract_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 4 | "Patch contract" — clear intent for in-place replacement. |
| Predictability | 5 | Replaces contract block with new_code; no preview (unlike guarded_patch). |
| Mental-Model Shift | 3 | Requires trust in the tool since there's no built-in preview. |
| Consistency | 4 | Simpler than guarded_patch; lacks validation pipeline. |
| Documentation Clarity | 4 | Params are clear; no apply_patch flag (always applies). |
| Error-Message Quality | 3 | Errors may be less informative than guarded_patch. |
| Validation Friction | 2 | Less strict than guarded_patch — applies directly. |
| Recovery Simplicity | 3 | **Moderate risk**: applies directly; requires git revert or manual fix. |

**Average: 3.50 / 5**

---

## 14. rename_contract_id_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Rename contract ID" — crystal clear. |
| Predictability | 5 | Renames identifier across indexed workspace. |
| Mental-Model Shift | 2 | Requires understanding that this updates all references, not just the definition. |
| Consistency | 5 | Follows standard `{success, message}` pattern. |
| Documentation Clarity | 4 | `old_contract_id` and `new_contract_id` are clear. |
| Error-Message Quality | 3 | Missing old_id may fail silently; could warn. |
| Validation Friction | 2 | Applies directly; no preview of affected files. |
| Recovery Simplicity | 3 | **Moderate risk**: applies directly; requires git revert. |

**Average: 3.63 / 5**

---

## 15. move_contract_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Move contract" — clear intent for relocating a contract block. |
| Predictability | 5 | Moves contract from source to destination file. |
| Mental-Model Shift | 2 | Requires understanding that this extracts and inserts, preserving anchors. |
| Consistency | 5 | Follows standard pattern. |
| Documentation Clarity | 4 | Three required params are clear. |
| Error-Message Quality | 3 | Missing files may fail with generic error. |
| Validation Friction | 2 | Applies directly; no preview. |
| Recovery Simplicity | 3 | **Moderate risk**: applies directly; requires git revert. |

**Average: 3.63 / 5**

---

## 16. extract_contract_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 4 | "Extract contract" — clear intent for creating new contract from code range. |
| Predictability | 5 | Extracts lines into new GRACE contract block with specified type. |
| Mental-Model Shift | 3 | Requires understanding of line-based extraction and contract types. |
| Consistency | 5 | Follows standard pattern. |
| Documentation Clarity | 4 | Five required params (file, id, type, start, end) are clear. |
| Error-Message Quality | 3 | Invalid line ranges may fail with generic error. |
| Validation Friction | 2 | Applies directly; no preview. |
| Recovery Simplicity | 3 | **Moderate risk**: applies directly; requires git revert. |

**Average: 3.63 / 5**

---

## 17. wrap_node_in_contract_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 4 | "Wrap node in contract" — clear intent for adding GRACE anchors to existing code. |
| Predictability | 5 | Uses ast-grep to locate node and wraps with [DEF]...[/DEF]. |
| Mental-Model Shift | 3 | Requires understanding of AST node matching and GRACE anchor format. |
| Consistency | 5 | Follows standard pattern. |
| Documentation Clarity | 4 | Params are clear; `lang` defaults to python. |
| Error-Message Quality | 3 | Missing node may fail silently. |
| Validation Friction | 2 | Applies directly; no preview. |
| Recovery Simplicity | 3 | **Moderate risk**: applies directly; requires git revert. |

**Average: 3.63 / 5**

---

## 18. update_contract_metadata_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Update contract metadata" — crystal clear. |
| Predictability | 5 | Updates/adds tags without modifying code body. |
| Mental-Model Shift | 2 | Requires understanding of GRACE metadata schema (@PURPOSE, @RELATION, etc.). |
| Consistency | 5 | Returns updated_tags list; clear feedback. |
| Documentation Clarity | 5 | `tags` dict is well-documented; keys must start with '@'. |
| Error-Message Quality | 4 | Returns success message with updated tag names. |
| Validation Friction | 3 | Validates tag key format; accepts any value. |
| Recovery Simplicity | 4 | **Low risk**: only modifies metadata; easy to revert. |

**Average: 4.13 / 5**

---

## 19. rename_semantic_tag_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 4 | "Rename semantic tag" — clear intent. |
| Predictability | 5 | Renames or removes a tag within a contract's metadata. |
| Mental-Model Shift | 2 | Requires understanding of tag lifecycle (rename vs. remove). |
| Consistency | 5 | Follows standard `{success, message}` pattern. |
| Documentation Clarity | 4 | `old_tag` required, `new_tag` optional (null = remove). |
| Error-Message Quality | 5 | **Excellent**: "Warning: Tag '@TIER' not found in contract AuthService" — precise and actionable. |
| Validation Friction | 3 | Validates tag existence before operation. |
| Recovery Simplicity | 4 | **Low risk**: only modifies metadata; easy to revert. |

**Average: 4.00 / 5**

---

## 20. prune_contract_metadata_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 4 | "Prune contract metadata" — clear intent for removing redundant tags. |
| Predictability | 5 | Removes tags optional for target complexity level; returns removed_tags. |
| Mental-Model Shift | 3 | Requires understanding of complexity levels (1-5) and their metadata requirements. |
| Consistency | 5 | Returns removed_tags list; clear feedback. |
| Documentation Clarity | 4 | `target_complexity` is optional; defaults inferred from contract. |
| Error-Message Quality | 4 | Returns success with removed tag names. |
| Validation Friction | 3 | Validates complexity level range (1-5). |
| Recovery Simplicity | 4 | **Low risk**: only removes metadata; easy to re-add. |

**Average: 4.00 / 5**

---

## 21. infer_missing_relations_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 4 | "Infer missing relations" — clear intent for discovering implicit dependencies. |
| Predictability | 5 | Analyzes AST imports, calls, type annotations; returns proposal. |
| Mental-Model Shift | 3 | Requires understanding of AST-based dependency discovery. |
| Consistency | 5 | Returns inferred list with apply_changes flag. |
| Documentation Clarity | 4 | `apply_changes` defaults to false (dry-run). |
| Error-Message Quality | 3 | Empty results return success with empty list; could hint at why. |
| Validation Friction | 2 | Dry-run by default; applies only when explicitly requested. |
| Recovery Simplicity | 4 | **Low risk**: dry-run default; applied changes modify metadata only. |

**Average: 3.75 / 5**

---

## 22. trace_tests_for_contract_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Trace tests for contract" — crystal clear. |
| Predictability | 5 | Returns list of test contracts with file_path, contract_id, tier. |
| Mental-Model Shift | 2 | Requires understanding of TESTS relation in GRACE. |
| Consistency | 5 | Output shape is stable. |
| Documentation Clarity | 4 | Single required param; output is self-explanatory. |
| Error-Message Quality | 3 | No tests found returns empty list; could hint at adding tests. |
| Validation Friction | 1 | No pre-validation needed. |
| Recovery Simplicity | 5 | Pure read; no state to undo. |

**Average: 3.75 / 5**

---

## 23. scaffold_contract_tests_tool

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Scaffold contract tests" — clear intent for generating test boilerplate. |
| Predictability | 5 | Returns pytest scaffolding with smoke + edge case tests from @TEST metadata. |
| Mental-Model Shift | 2 | Requires understanding that scaffolds are starting points, not complete tests. |
| Consistency | 5 | Output shape is stable (Python test code string). |
| Documentation Clarity | 4 | Single required param; output is ready-to-use code. |
| Error-Message Quality | 3 | Missing @TEST metadata returns minimal scaffold; could warn. |
| Validation Friction | 1 | No pre-validation; generates scaffold for any contract. |
| Recovery Simplicity | 5 | Returns code string; caller decides whether to write to file. |

**Average: 3.75 / 5**

---

## 24. find_contract_tool (alias)

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Find contract" — task-first alias for semantic lookup. |
| Predictability | 5 | Returns same output as search_contracts_tool. |
| Mental-Model Shift | 2 | Same as search_contracts_tool. |
| Consistency | 5 | Identical to search_contracts_tool output. |
| Documentation Clarity | 4 | Same params as search_contracts_tool. |
| Error-Message Quality | 3 | Same as search_contracts_tool. |
| Validation Friction | 1 | Same as search_contracts_tool. |
| Recovery Simplicity | 5 | Stateless query. |

**Average: 3.75 / 5**

---

## 25. read_outline_tool (alias)

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 4 | "Read outline" — task-first alias for file inspection. |
| Predictability | 5 | Same as read_grace_outline_tool. |
| Mental-Model Shift | 3 | Same as read_grace_outline_tool. |
| Consistency | 5 | Identical to read_grace_outline_tool output. |
| Documentation Clarity | 4 | Same params as read_grace_outline_tool. |
| Error-Message Quality | 3 | Same as read_grace_outline_tool. |
| Validation Friction | 1 | Same as read_grace_outline_tool. |
| Recovery Simplicity | 5 | Pure read. |

**Average: 3.75 / 5**

---

## 26. safe_patch_tool (alias)

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Safe patch" — task-first alias for validated patching. |
| Predictability | 5 | Same as guarded_patch_contract_tool. |
| Mental-Model Shift | 2 | Same as guarded_patch_contract_tool. |
| Consistency | 5 | Identical to guarded_patch_contract_tool output. |
| Documentation Clarity | 4 | Same params as guarded_patch_contract_tool. |
| Error-Message Quality | 4 | Same as guarded_patch_contract_tool. |
| Validation Friction | 4 | Same as guarded_patch_contract_tool. |
| Recovery Simplicity | 5 | Same as guarded_patch_contract_tool. |

**Average: 4.25 / 5**

---

## 27. find_related_tests_tool (alias)

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Find related tests" — task-first alias for test lookup. |
| Predictability | 5 | Same as trace_tests_for_contract_tool. |
| Mental-Model Shift | 2 | Same as trace_tests_for_contract_tool. |
| Consistency | 5 | Identical to trace_tests_for_contract_tool output. |
| Documentation Clarity | 4 | Same params as trace_tests_for_contract_tool. |
| Error-Message Quality | 3 | Same as trace_tests_for_contract_tool. |
| Validation Friction | 1 | Same as trace_tests_for_contract_tool. |
| Recovery Simplicity | 5 | Pure read. |

**Average: 3.75 / 5**

---

## 28. analyze_impact_tool (alias)

| Metric | Score | Notes |
|--------|-------|-------|
| Understandability | 5 | "Analyze impact" — task-first alias for dependency analysis. |
| Predictability | 5 | Same as impact_analysis_tool. |
| Mental-Model Shift | 2 | Same as impact_analysis_tool. |
| Consistency | 5 | Identical to impact_analysis_tool output. |
| Documentation Clarity | 4 | Same params as impact_analysis_tool. |
| Error-Message Quality | 3 | Same as impact_analysis_tool. |
| Validation Friction | 1 | Same as impact_analysis_tool. |
| Recovery Simplicity | 5 | Pure read. |

**Average: 3.75 / 5**

---

## Aggregate Summary

### Per-Metric Averages (All 28 Tools)

| Metric | Average Score | Assessment |
|--------|--------------|------------|
| **Understandability** | 4.54 | Excellent — tool names are descriptive and intent is clear. |
| **Predictability** | 5.00 | Perfect — all tools behave as expected based on their names and docs. |
| **Mental-Model Shift** | 2.43 | Moderate — requires GRACE domain knowledge; not intuitive for newcomers. |
| **Consistency** | 4.96 | Near-perfect — output shapes and patterns are uniform across the suite; only `patch_contract_tool` scores 4. |
| **Documentation Clarity** | 4.07 | Good — parameters are well-defined; could benefit from more examples. |
| **Error-Message Quality** | 3.36 | Acceptable — some tools have excellent errors (simulate_patch, rename_semantic_tag), others are silent. |
| **Validation Friction** | 1.79 | Low friction — most tools are lenient; mutation tools have appropriate strictness. |
| **Recovery Simplicity** | 4.50 | Excellent — read-only tools are stateless; mutation tools have clear recovery paths. |

### Overall Suite Average: **3.83 / 5**

---

## Key Findings

### Strengths

1. **Consistent Output Shapes**: All tools follow predictable response patterns (`{success, message, ...}`).
2. **Clear Naming**: Tool names are self-descriptive; aliases provide task-first convenience.
3. **Safe Defaults**: The preview-capable mutation tools default to dry-run (`apply_patch=false`, `apply_changes=false`); the direct mutation tools listed below do not.
4. **Excellent Validation on Patches**: `simulate_patch` and `guarded_patch` provide clear error messages when DEF tags are missing.
5. **Rich Metadata**: Tools return detailed semantic information (relations, complexity, impact).

### Areas for Improvement

1. **Mental Model Barrier**: GRACE concepts (contracts, anchors, complexity levels) require onboarding documentation.
2. **Silent Failures**: Some tools return empty results without hints (e.g., no tests found, no relations inferred).
3. **Mutation Safety**: `patch_contract_tool`, `rename_contract_id_tool`, `move_contract_tool`, `extract_contract_tool`, and `wrap_node_in_contract_tool` apply directly without preview; consider adding a `dry_run` flag.
4. **Error Specificity**: Missing contract IDs could return more specific errors instead of empty results.
5. **Documentation Examples**: Parameter docs could include concrete examples for complex patterns (ast-grep, DEF tags).

### Recommendations

1. Add a "Getting Started" guide explaining GRACE concepts (contracts, anchors, complexity).
2. Add a `dry_run` parameter to the direct mutation tools (`patch_contract`, `rename_contract_id`, `move_contract`, `extract_contract`, `wrap_node_in_contract`).
3. Improve empty-result responses with actionable hints (e.g., "No tests found — consider adding @TEST metadata").
4. Add example payloads to tool documentation for complex parameters.
5. Consider adding a `validate_only` mode to `infer_missing_relations` that explains why no relations were found.

---

# [/DEF:Axiom_Tools_Evaluation:Report]
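Every per-tool "Average" and every per-metric average above is a plain arithmetic mean of 1–5 scores, rounded to two decimals with halves rounded up. The minimal Python sketch below recomputes them from the score tables so the figures can be re-checked after any score changes. It is not part of axiom-core; the helper names (`avg`, `per_tool_averages`, `per_metric_averages`, `suite_average`) are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

METRICS = (
    "Understandability", "Predictability", "Mental-Model Shift", "Consistency",
    "Documentation Clarity", "Error-Message Quality", "Validation Friction",
    "Recovery Simplicity",
)

# Scores copied from the per-tool tables of this report, in METRICS order.
SCORES = {
    "reindex_workspace_tool":         (5, 5, 2, 5, 4, 3, 1, 5),
    "search_contracts_tool":          (5, 5, 2, 5, 4, 3, 1, 5),
    "read_grace_outline_tool":        (4, 5, 3, 5, 4, 3, 1, 5),
    "ast_search_tool":                (4, 5, 3, 5, 4, 3, 2, 5),
    "get_semantic_context_tool":      (4, 5, 3, 5, 4, 3, 1, 5),
    "build_task_context_tool":        (4, 5, 3, 5, 4, 3, 1, 5),
    "workspace_semantic_health_tool": (5, 5, 2, 5, 4, 4, 1, 5),
    "audit_contracts_tool":           (5, 5, 2, 5, 4, 4, 1, 5),
    "diff_contract_semantics_tool":   (4, 5, 3, 5, 4, 3, 1, 5),
    "impact_analysis_tool":           (5, 5, 2, 5, 4, 3, 1, 5),
    "simulate_patch_tool":            (4, 5, 3, 5, 4, 5, 4, 5),
    "guarded_patch_contract_tool":    (5, 5, 2, 5, 5, 4, 4, 5),
    "patch_contract_tool":            (4, 5, 3, 4, 4, 3, 2, 3),
    "rename_contract_id_tool":        (5, 5, 2, 5, 4, 3, 2, 3),
    "move_contract_tool":             (5, 5, 2, 5, 4, 3, 2, 3),
    "extract_contract_tool":          (4, 5, 3, 5, 4, 3, 2, 3),
    "wrap_node_in_contract_tool":     (4, 5, 3, 5, 4, 3, 2, 3),
    "update_contract_metadata_tool":  (5, 5, 2, 5, 5, 4, 3, 4),
    "rename_semantic_tag_tool":       (4, 5, 2, 5, 4, 5, 3, 4),
    "prune_contract_metadata_tool":   (4, 5, 3, 5, 4, 4, 3, 4),
    "infer_missing_relations_tool":   (4, 5, 3, 5, 4, 3, 2, 4),
    "trace_tests_for_contract_tool":  (5, 5, 2, 5, 4, 3, 1, 5),
    "scaffold_contract_tests_tool":   (5, 5, 2, 5, 4, 3, 1, 5),
    "find_contract_tool":             (5, 5, 2, 5, 4, 3, 1, 5),
    "read_outline_tool":              (4, 5, 3, 5, 4, 3, 1, 5),
    "safe_patch_tool":                (5, 5, 2, 5, 4, 4, 4, 5),
    "find_related_tests_tool":        (5, 5, 2, 5, 4, 3, 1, 5),
    "analyze_impact_tool":            (5, 5, 2, 5, 4, 3, 1, 5),
}


def avg(values):
    """Mean of the scores, rounded half-up to two decimals (the report's convention)."""
    mean = Decimal(sum(values)) / Decimal(len(values))
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def per_tool_averages(scores):
    """One average per tool, across its 8 metric scores."""
    return {tool: avg(row) for tool, row in scores.items()}


def per_metric_averages(scores):
    """One average per metric, across all tools (transpose rows into columns)."""
    columns = zip(*scores.values())
    return {metric: avg(col) for metric, col in zip(METRICS, columns)}


def suite_average(scores):
    """Mean of all cells; with 8 scores per tool this equals the mean of the unrounded per-tool means."""
    return avg([s for row in scores.values() for s in row])


if __name__ == "__main__":
    for tool, value in per_tool_averages(SCORES).items():
        print(f"{tool}: {value}")
    for metric, value in per_metric_averages(SCORES).items():
        print(f"{metric}: {value}")
    print("Overall:", suite_average(SCORES))
```

`ROUND_HALF_UP` is used deliberately. Python's built-in `round()` rounds exact halves to even, so it would report 29/8 as 3.62 instead of the 3.63 used in this report.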