ss-tools/.ai/reports/axiom-tools-evaluation.md
2026-04-01 13:29:41 +03:00

[DEF:Axiom_Tools_Evaluation:Report]

@COMPLEXITY: 4

@PURPOSE: Comprehensive evaluation of all axiom-core MCP server tools across 8 UX metrics.

@LAYER: Analysis

@RELATION: DEPENDS_ON -> [Project_Knowledge_Map:Root]

@PRE: All axiom-core tools have been exercised with valid and invalid inputs.

@POST: Report file exists with per-tool scores and aggregate findings.

@SIDE_EFFECT: Creates evaluation artifact in .ai/reports/.

@DATA_CONTRACT: Input[Tool Suite] -> Output[Evaluation Report]

@INVARIANT: Each tool must be scored on all 8 metrics; no tool may be omitted.


Axiom-Core MCP Tools Evaluation Report

Date: 2026-03-31
Workspace: /home/busya/dev/ss-tools
Evaluator: Kilo Code (Coder Mode)
Index Stats: 2528 contracts, 2186 relations, 450 files


Scoring Scale

Score Meaning
5 Excellent — no friction, best-in-class
4 Good — minor quirks, easily understood
3 Acceptable — some learning curve, works as expected
2 Poor — confusing or inconsistent behavior
1 Broken — fails to meet basic expectations

1. reindex_workspace_tool

Metric Score Notes
Understandability 5 Name is self-explanatory; purpose is obvious.
Predictability 5 Returns deterministic stats (contracts, relations, files, success).
Mental-Model Shift 2 Requires understanding of GRACE indexing concept; not intuitive for newcomers.
Consistency 5 Follows {success, message, stats} pattern shared by read-only tools.
Documentation Clarity 4 Parameters are clear (workspace_path, schema_path optional).
Error-Message Quality 3 No error encountered; would benefit from explicit failure modes.
Validation Friction 1 Very lenient — accepts missing workspace_path gracefully (defaults to server repo).
Recovery Simplicity 5 Pure read/index operation; re-run to refresh. No state to undo.

Average: 3.75 / 5
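
Each per-tool average in this report is the unweighted mean of the eight metric scores. The following is a minimal sketch of that arithmetic using the scores from the table above; the function and variable names are illustrative and not part of axiom-core.

```python
from statistics import mean

# Scores for reindex_workspace_tool, copied from the table above.
scores = {
    "Understandability": 5,
    "Predictability": 5,
    "Mental-Model Shift": 2,
    "Consistency": 5,
    "Documentation Clarity": 4,
    "Error-Message Quality": 3,
    "Validation Friction": 1,
    "Recovery Simplicity": 5,
}


def tool_average(metric_scores: dict[str, int]) -> float:
    """Unweighted mean of the metric scores; all 8 metrics are required."""
    if len(metric_scores) != 8:
        raise ValueError("each tool must be scored on all 8 metrics")
    return mean(metric_scores.values())


print(f"Average: {tool_average(scores):.2f} / 5")  # Average: 3.75 / 5
```

The length check mirrors the report's invariant that no metric may be omitted for any tool.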


2. search_contracts_tool

Metric Score Notes
Understandability 5 "Search contracts by query" — crystal clear.
Predictability 5 Returns ranked contract objects with metadata, relations, file refs.
Mental-Model Shift 2 Requires understanding of semantic search vs. text search.
Consistency 5 Output shape matches find_contract_tool exactly.
Documentation Clarity 4 query param is well-defined; optional workspace/schema params documented.
Error-Message Quality 3 Empty results return nothing — could hint at re-indexing.
Validation Friction 1 Accepts any string; no pre-validation needed.
Recovery Simplicity 5 Stateless query; re-run with different query.

Average: 3.75 / 5


3. read_grace_outline_tool

Metric Score Notes
Understandability 4 "GRACE outline" is domain-specific but clear from context.
Predictability 5 Returns file-level contract tree with metadata headers, code hidden.
Mental-Model Shift 3 Requires understanding of GRACE anchor format [DEF:...].
Consistency 5 Output format is stable across files.
Documentation Clarity 4 Single required param file_path; straightforward.
Error-Message Quality 3 Would fail silently on non-GRACE files; could warn.
Validation Friction 1 No pre-validation; accepts any path.
Recovery Simplicity 5 Pure read; no side effects.

Average: 3.75 / 5


4. ast_search_tool

Metric Score Notes
Understandability 4 AST-grep pattern search — clear to developers familiar with the tool.
Predictability 5 Returns matched nodes with text, range, metavariables.
Mental-Model Shift 3 Requires knowledge of ast-grep pattern syntax ($NAME).
Consistency 5 Output shape is consistent (array of match objects).
Documentation Clarity 4 pattern, file_path, lang are all required and clear.
Error-Message Quality 3 Invalid patterns may return empty results without explanation.
Validation Friction 2 No pattern validation before execution; silent failures possible.
Recovery Simplicity 5 Stateless; re-run with corrected pattern.

Average: 3.88 / 5
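
Because ast_search_tool does not validate patterns before execution, a caller may want to inspect a pattern locally first. The sketch below assumes a request shaped after the three documented parameters (pattern, file_path, lang); the file path and the `metavariables` helper are hypothetical. In ast-grep syntax, `$NAME` binds a single node and `$$$ARGS` binds zero or more nodes.

```python
import re

# Hypothetical arguments for ast_search_tool. Only the three parameter
# names (pattern, file_path, lang) come from the report.
request = {
    "pattern": "$NAME($$$ARGS)",             # any call expression
    "file_path": "backend/src/example.py",   # hypothetical path
    "lang": "python",
}


def metavariables(pattern: str) -> list[str]:
    """List the metavariable names that appear in an ast-grep pattern."""
    return re.findall(r"\${1,3}([A-Z_][A-Z0-9_]*)", pattern)


print(metavariables(request["pattern"]))  # ['NAME', 'ARGS']
```

A pattern that yields no metavariables is not necessarily wrong, but listing them is a cheap sanity check before a search that might otherwise return an unexplained empty result.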


5. get_semantic_context_tool

Metric Score Notes
Understandability 4 "Get semantic context around a contract" — clear intent.
Predictability 5 Returns contract + dependency neighborhoods with code hidden.
Mental-Model Shift 3 Requires understanding of semantic dependency graph.
Consistency 5 Output format is stable and well-structured.
Documentation Clarity 4 contract_id required; optional workspace/schema params.
Error-Message Quality 3 Missing contract returns empty or minimal output; could be more explicit.
Validation Friction 1 Accepts any string; no pre-validation.
Recovery Simplicity 5 Pure read; no state to undo.

Average: 3.75 / 5


6. build_task_context_tool

Metric Score Notes
Understandability 4 "Build task-focused context" — clear for implementation workflows.
Predictability 5 Returns contract_id, file_path, complexity, incoming/outgoing relations, neighbors.
Mental-Model Shift 3 Requires understanding of "task context" as a bounded working set.
Consistency 5 Output shape is deterministic and well-structured.
Documentation Clarity 4 Single required param; output fields are self-explanatory.
Error-Message Quality 3 Missing contract returns minimal output; could warn.
Validation Friction 1 No pre-validation; accepts any contract_id.
Recovery Simplicity 5 Stateless; re-run anytime.

Average: 3.75 / 5


7. workspace_semantic_health_tool

Metric Score Notes
Understandability 5 "Semantic health" — clear dashboard-style summary.
Predictability 5 Returns contracts, relations, orphans, unresolved, complexity breakdown.
Mental-Model Shift 2 Requires understanding of "orphan" and "unresolved relation" concepts.
Consistency 5 Output shape is stable across invocations.
Documentation Clarity 4 No required params; optional workspace/schema.
Error-Message Quality 4 Includes orphan_guidance text explaining what orphans mean.
Validation Friction 1 No pre-validation needed.
Recovery Simplicity 5 Pure read; no state to undo.

Average: 3.88 / 5
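
To make the "orphan" and "unresolved relation" concepts concrete, here is a small sketch under stated assumptions: an orphan is taken to be a contract that appears in no relation, and an unresolved relation is one whose target is not in the index. These definitions, the contract IDs, and the `health` function are illustrative; the actual tool may define them differently.

```python
# Hypothetical index contents.
contracts = {"AuthService", "TokenStore", "LegacyHelper"}
relations = [
    ("AuthService", "DEPENDS_ON", "TokenStore"),
    ("AuthService", "DEPENDS_ON", "MissingRepo"),  # target not indexed
]


def health(contracts: set[str], relations: list[tuple[str, str, str]]) -> dict:
    """Summarize orphans and unresolved relations for a small index."""
    linked: set[str] = set()
    unresolved = []
    for source, kind, target in relations:
        linked.add(source)
        if target in contracts:
            linked.add(target)
        else:
            unresolved.append((source, kind, target))
    return {
        "contracts": len(contracts),
        "relations": len(relations),
        "orphans": sorted(contracts - linked),
        "unresolved": unresolved,
    }


print(health(contracts, relations))
```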


8. audit_contracts_tool

Metric Score Notes
Understandability 5 "Audit contracts" — clear intent for quality checks.
Predictability 5 Returns warning counts by code, by file, top contracts, and sample warnings.
Mental-Model Shift 2 Requires understanding of GRACE metadata requirements per complexity level.
Consistency 5 Output shape is stable; detail_level controls verbosity.
Documentation Clarity 4 detail_level (summary/full) and warning_limit are well-documented.
Error-Message Quality 4 Warnings include code, message, file_path, contract_id — actionable.
Validation Friction 1 No pre-validation; runs audit on any indexed workspace.
Recovery Simplicity 5 Pure read; no state to undo.

Average: 3.88 / 5


9. diff_contract_semantics_tool

Metric Score Notes
Understandability 4 "Diff contract semantics" — clear for comparing two contract versions.
Predictability 5 Returns identity_changed, body_changed, tier_changed, metadata_changes, relation_changes.
Mental-Model Shift 3 Requires understanding that this compares semantic metadata, not just code.
Consistency 5 Output shape matches guarded_patch diff output.
Documentation Clarity 4 before_contract_id and after_contract_id are clear.
Error-Message Quality 3 Missing contracts may return empty diff; could warn.
Validation Friction 1 No pre-validation; accepts any contract IDs.
Recovery Simplicity 5 Pure read; no state to undo.

Average: 3.75 / 5


10. impact_analysis_tool

Metric Score Notes
Understandability 5 "Impact analysis" — clear intent for dependency impact.
Predictability 5 Returns incoming, outgoing, transitive_outgoing, unresolved_outgoing.
Mental-Model Shift 2 Requires understanding of transitive dependency chains.
Consistency 5 Output shape matches guarded_patch impact output.
Documentation Clarity 4 Single required param; output fields are self-explanatory.
Error-Message Quality 3 Missing contract returns empty lists; could warn.
Validation Friction 1 No pre-validation; accepts any contract_id.
Recovery Simplicity 5 Pure read; no state to undo.

Average: 3.75 / 5
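
Transitive dependency chains can be illustrated with a breadth-first walk over outgoing relations. This is a sketch, not the tool's implementation: the graph and contract IDs are hypothetical, `transitive_outgoing` is assumed to exclude direct dependencies, and `unresolved_outgoing` is assumed to mean reachable targets that are missing from the index.

```python
from collections import deque

# Hypothetical outgoing relations: contract -> contracts it depends on.
# "Database" is referenced but not indexed, so it counts as unresolved.
outgoing = {
    "AuthService": ["TokenStore", "UserRepo"],
    "TokenStore": ["Crypto"],
    "UserRepo": ["Database"],
    "Crypto": [],
}


def analyze(contract_id: str) -> dict[str, list[str]]:
    """Collect direct, transitive, and unresolved dependencies of a contract."""
    direct = outgoing.get(contract_id, [])
    seen: set[str] = set()
    queue = deque(direct)
    while queue:
        node = queue.popleft()
        if node in seen or node == contract_id:
            continue  # skip already-visited nodes and cycles back to the start
        seen.add(node)
        queue.extend(outgoing.get(node, []))
    incoming = [src for src, targets in outgoing.items() if contract_id in targets]
    return {
        "incoming": sorted(incoming),
        "outgoing": sorted(direct),
        "transitive_outgoing": sorted(seen - set(direct)),
        "unresolved_outgoing": sorted(n for n in seen if n not in outgoing),
    }


print(analyze("AuthService"))
```

Note that an unknown contract_id yields four empty lists here, which reproduces the silent-empty behavior the Error-Message Quality row criticizes.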


11. simulate_patch_tool

Metric Score Notes
Understandability 4 "Simulate patch" — clear preview of changes without applying.
Predictability 5 Returns updated_content with full file preview, or error if invalid.
Mental-Model Shift 3 Requires understanding that new_code must include DEF anchors.
Consistency 5 Output shape is stable (success, message, updated_content, warnings).
Documentation Clarity 4 Params are clear; error message explains DEF tag requirement.
Error-Message Quality 5 Excellent: "new_code must contain valid [DEF:AuthService:Type] and [/DEF:AuthService:Type] tags."
Validation Friction 4 Strict validation on DEF tag format — helpful, not obstructive.
Recovery Simplicity 5 No state change; fix new_code and re-run.

Average: 4.38 / 5
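
The DEF tag requirement quoted in the error message can be checked locally before calling the tool. The sketch below is an approximation, not the tool's own validator: the `has_valid_def_tags` helper, the `Class` type, and the `#` comment prefix in front of the anchors are assumptions.

```python
import re


def has_valid_def_tags(new_code: str, contract_id: str) -> bool:
    """Return True if new_code opens and then closes a DEF block for contract_id."""
    opening = re.search(rf"\[DEF:{re.escape(contract_id)}:(\w+)\]", new_code)
    if opening is None:
        return False
    closing = f"[/DEF:{contract_id}:{opening.group(1)}]"
    # The closing tag must appear after the opening tag.
    return closing in new_code[opening.end():]


good = (
    "# [DEF:AuthService:Class]\n"
    "class AuthService:\n"
    "    pass\n"
    "# [/DEF:AuthService:Class]\n"
)
bad = "class AuthService:\n    pass\n"

print(has_valid_def_tags(good, "AuthService"))  # True
print(has_valid_def_tags(bad, "AuthService"))   # False
```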


12. guarded_patch_contract_tool

Metric Score Notes
Understandability 5 "Guarded patch" — clear that validation guards are applied before changes.
Predictability 5 Returns diff, impact, and applied flag. Guards include syntax, semantic diff, impact.
Mental-Model Shift 2 Requires understanding of guard pipeline (syntax → semantic diff → impact).
Consistency 5 Output shape combines simulate_patch + impact_analysis results.
Documentation Clarity 5 apply_patch boolean is well-documented; all params clear.
Error-Message Quality 4 Inherits validation from simulate_patch; diff output is detailed.
Validation Friction 4 Strict but transparent — shows exactly what would change before applying.
Recovery Simplicity 5 With apply_patch=false, no state change. With true, git can revert.

Average: 4.38 / 5
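
The guard pipeline (syntax, then semantic diff, then impact) might be pictured roughly as follows. This is a simplified sketch under loud assumptions: the `guarded_patch` function and its signature are hypothetical, the semantic diff is reduced to comparing `# @TAG` comment lines, the impact step simply echoes a caller-supplied list of dependents, and the actual file write is omitted.

```python
import ast


def _tags(code: str) -> set[str]:
    """Collect metadata lines of the assumed form '# @TAG: value'."""
    return {line.strip() for line in code.splitlines() if line.strip().startswith("# @")}


def guarded_patch(old_code: str, new_code: str, dependents: list[str],
                  apply_patch: bool = False) -> dict:
    # Guard 1: syntax. Reject replacement code that does not parse.
    try:
        ast.parse(new_code)
    except SyntaxError as exc:
        return {"success": False, "message": f"syntax error: {exc.msg}", "applied": False}

    # Guard 2: semantic diff, reduced here to body and metadata changes.
    diff = {
        "body_changed": old_code != new_code,
        "metadata_changes": sorted(_tags(old_code) ^ _tags(new_code)),
    }

    # Guard 3: impact, here just the contracts that depend on the target.
    impact = {"incoming": sorted(dependents)}

    # A real tool would write the file here when apply_patch is True.
    return {"success": True, "diff": diff, "impact": impact, "applied": apply_patch}


preview = guarded_patch("x = 1\n", "x = 2\n", dependents=["LoginView"])
print(preview["applied"], preview["diff"]["body_changed"])  # False True
```

Returning the diff and impact even when nothing is applied is what makes the default call safe to run as a preview.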


13. patch_contract_tool

Metric Score Notes
Understandability 4 "Patch contract" — clear intent for in-place replacement.
Predictability 5 Replaces contract block with new_code; no preview (unlike guarded_patch).
Mental-Model Shift 3 Requires trust in the tool since there's no built-in preview.
Consistency 4 Simpler than guarded_patch; lacks validation pipeline.
Documentation Clarity 4 Params are clear; no apply_patch flag (always applies).
Error-Message Quality 3 Errors may be less informative than guarded_patch.
Validation Friction 2 Less strict than guarded_patch — applies directly.
Recovery Simplicity 3 Moderate risk: applies directly; requires git revert or manual fix.

Average: 3.50 / 5


14. rename_contract_id_tool

Metric Score Notes
Understandability 5 "Rename contract ID" — crystal clear.
Predictability 5 Renames identifier across indexed workspace.
Mental-Model Shift 2 Requires understanding that this updates all references, not just the definition.
Consistency 5 Follows standard {success, message} pattern.
Documentation Clarity 4 old_contract_id and new_contract_id are clear.
Error-Message Quality 3 Missing old_id may fail silently; could warn.
Validation Friction 2 Applies directly; no preview of affected files.
Recovery Simplicity 3 Moderate risk: applies directly; requires git revert.

Average: 3.63 / 5


15. move_contract_tool

Metric Score Notes
Understandability 5 "Move contract" — clear intent for relocating a contract block.
Predictability 5 Moves contract from source to destination file.
Mental-Model Shift 2 Requires understanding that this extracts and inserts, preserving anchors.
Consistency 5 Follows standard pattern.
Documentation Clarity 4 Three required params are clear.
Error-Message Quality 3 Missing files may fail with generic error.
Validation Friction 2 Applies directly; no preview.
Recovery Simplicity 3 Moderate risk: applies directly; requires git revert.

Average: 3.63 / 5


16. extract_contract_tool

Metric Score Notes
Understandability 4 "Extract contract" — clear intent for creating new contract from code range.
Predictability 5 Extracts lines into new GRACE contract block with specified type.
Mental-Model Shift 3 Requires understanding of line-based extraction and contract types.
Consistency 5 Follows standard pattern.
Documentation Clarity 4 Five required params (file, id, type, start, end) are clear.
Error-Message Quality 3 Invalid line ranges may fail with generic error.
Validation Friction 2 Applies directly; no preview.
Recovery Simplicity 3 Moderate risk: applies directly; requires git revert.

Average: 3.63 / 5


17. wrap_node_in_contract_tool

Metric Score Notes
Understandability 4 "Wrap node in contract" — clear intent for adding GRACE anchors to existing code.
Predictability 5 Uses ast-grep to locate node and wraps with [DEF]...[/DEF].
Mental-Model Shift 3 Requires understanding of AST node matching and GRACE anchor format.
Consistency 5 Follows standard pattern.
Documentation Clarity 4 Params are clear; lang defaults to python.
Error-Message Quality 3 Missing node may fail silently.
Validation Friction 2 Applies directly; no preview.
Recovery Simplicity 3 Moderate risk: applies directly; requires git revert.

Average: 3.63 / 5


18. update_contract_metadata_tool

Metric Score Notes
Understandability 5 "Update contract metadata" — crystal clear.
Predictability 5 Updates/adds tags without modifying code body.
Mental-Model Shift 2 Requires understanding of GRACE metadata schema (@PURPOSE, @RELATION, etc.).
Consistency 5 Returns updated_tags list; clear feedback.
Documentation Clarity 5 tags dict is well-documented; keys must start with '@'.
Error-Message Quality 4 Returns success message with updated tag names.
Validation Friction 3 Validates tag key format; accepts any value.
Recovery Simplicity 4 Low risk: only modifies metadata; easy to revert.

Average: 4.13 / 5
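
The key-format rule (keys must start with '@', values are unconstrained) is simple enough to check before sending a request. A minimal sketch, assuming a plain dict payload; the `invalid_tag_keys` helper is hypothetical and the tag values are invented for illustration.

```python
def invalid_tag_keys(tags: dict[str, str]) -> list[str]:
    """Return the keys that violate the '@' prefix rule."""
    return [key for key in tags if not key.startswith("@")]


ok = {"@PURPOSE": "Issue and verify session tokens.", "@LAYER": "Service"}
broken = {"PURPOSE": "Missing the '@' prefix.", "@LAYER": "Service"}

print(invalid_tag_keys(ok))      # []
print(invalid_tag_keys(broken))  # ['PURPOSE']
```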


19. rename_semantic_tag_tool

Metric Score Notes
Understandability 4 "Rename semantic tag" — clear intent.
Predictability 5 Renames or removes a tag within a contract's metadata.
Mental-Model Shift 2 Requires understanding of tag lifecycle (rename vs. remove).
Consistency 5 Follows standard {success, message} pattern.
Documentation Clarity 4 old_tag required, new_tag optional (null = remove).
Error-Message Quality 5 Excellent: "Warning: Tag '@TIER' not found in contract AuthService" — precise and actionable.
Validation Friction 3 Validates tag existence before operation.
Recovery Simplicity 4 Low risk: only modifies metadata; easy to revert.

Average: 4.00 / 5


20. prune_contract_metadata_tool

Metric Score Notes
Understandability 4 "Prune contract metadata" — clear intent for removing redundant tags.
Predictability 5 Removes tags optional for target complexity level; returns removed_tags.
Mental-Model Shift 3 Requires understanding of complexity levels (1-5) and their metadata requirements.
Consistency 5 Returns removed_tags list; clear feedback.
Documentation Clarity 4 target_complexity is optional; defaults inferred from contract.
Error-Message Quality 4 Returns success with removed tag names.
Validation Friction 3 Validates complexity level range (1-5).
Recovery Simplicity 4 Low risk: only removes metadata; easy to re-add.

Average: 4.00 / 5


21. infer_missing_relations_tool

Metric Score Notes
Understandability 4 "Infer missing relations" — clear intent for discovering implicit dependencies.
Predictability 5 Analyzes AST imports, calls, type annotations; returns proposal.
Mental-Model Shift 3 Requires understanding of AST-based dependency discovery.
Consistency 5 Returns inferred list with apply_changes flag.
Documentation Clarity 4 apply_changes defaults to false (dry-run).
Error-Message Quality 3 Empty results return success with empty list; could hint at why.
Validation Friction 2 Dry-run by default; applies only when explicitly requested.
Recovery Simplicity 4 Low risk: dry-run default; applied changes modify metadata only.

Average: 3.75 / 5


22. trace_tests_for_contract_tool

Metric Score Notes
Understandability 5 "Trace tests for contract" — crystal clear.
Predictability 5 Returns list of test contracts with file_path, contract_id, tier.
Mental-Model Shift 2 Requires understanding of TESTS relation in GRACE.
Consistency 5 Output shape is stable.
Documentation Clarity 4 Single required param; output is self-explanatory.
Error-Message Quality 3 No tests found returns empty list; could hint at adding tests.
Validation Friction 1 No pre-validation needed.
Recovery Simplicity 5 Pure read; no state to undo.

Average: 3.75 / 5


23. scaffold_contract_tests_tool

Metric Score Notes
Understandability 5 "Scaffold contract tests" — clear intent for generating test boilerplate.
Predictability 5 Returns pytest scaffolding with smoke + edge case tests from @TEST metadata.
Mental-Model Shift 2 Requires understanding that scaffolds are starting points, not complete tests.
Consistency 5 Output shape is stable (Python test code string).
Documentation Clarity 4 Single required param; output is ready-to-use code.
Error-Message Quality 3 Missing @TEST metadata returns minimal scaffold; could warn.
Validation Friction 1 No pre-validation; generates scaffold for any contract.
Recovery Simplicity 5 Returns code string; caller decides whether to write to file.

Average: 3.75 / 5


24. find_contract_tool (alias)

Metric Score Notes
Understandability 5 "Find contract" — task-first alias for semantic lookup.
Predictability 5 Returns same output as search_contracts_tool.
Mental-Model Shift 2 Same as search_contracts_tool.
Consistency 5 Identical to search_contracts_tool output.
Documentation Clarity 4 Same params as search_contracts_tool.
Error-Message Quality 3 Same as search_contracts_tool.
Validation Friction 1 Same as search_contracts_tool.
Recovery Simplicity 5 Stateless query.

Average: 3.75 / 5


25. read_outline_tool (alias)

Metric Score Notes
Understandability 4 "Read outline" — task-first alias for file inspection.
Predictability 5 Same as read_grace_outline_tool.
Mental-Model Shift 3 Same as read_grace_outline_tool.
Consistency 5 Identical to read_grace_outline_tool output.
Documentation Clarity 4 Same params as read_grace_outline_tool.
Error-Message Quality 3 Same as read_grace_outline_tool.
Validation Friction 1 Same as read_grace_outline_tool.
Recovery Simplicity 5 Pure read.

Average: 3.75 / 5


26. safe_patch_tool (alias)

Metric Score Notes
Understandability 5 "Safe patch" — task-first alias for validated patching.
Predictability 5 Same as guarded_patch_contract_tool.
Mental-Model Shift 2 Same as guarded_patch_contract_tool.
Consistency 5 Identical to guarded_patch_contract_tool output.
Documentation Clarity 4 Same params as guarded_patch_contract_tool.
Error-Message Quality 4 Same as guarded_patch_contract_tool.
Validation Friction 4 Same as guarded_patch_contract_tool.
Recovery Simplicity 5 Same as guarded_patch_contract_tool.

Average: 4.25 / 5


27. Find related tests (alias of trace_tests_for_contract_tool)

Metric Score Notes
Understandability 5 "Find related tests" — task-first alias for test lookup.
Predictability 5 Same as trace_tests_for_contract_tool.
Mental-Model Shift 2 Same as trace_tests_for_contract_tool.
Consistency 5 Identical to trace_tests_for_contract_tool output.
Documentation Clarity 4 Same params as trace_tests_for_contract_tool.
Error-Message Quality 3 Same as trace_tests_for_contract_tool.
Validation Friction 1 Same as trace_tests_for_contract_tool.
Recovery Simplicity 5 Pure read.

Average: 3.75 / 5


28. analyze_impact_tool (alias)

Metric Score Notes
Understandability 5 "Analyze impact" — task-first alias for dependency analysis.
Predictability 5 Same as impact_analysis_tool.
Mental-Model Shift 2 Same as impact_analysis_tool.
Consistency 5 Identical to impact_analysis_tool output.
Documentation Clarity 4 Same params as impact_analysis_tool.
Error-Message Quality 3 Same as impact_analysis_tool.
Validation Friction 1 Same as impact_analysis_tool.
Recovery Simplicity 5 Pure read.

Average: 3.75 / 5


Aggregate Summary

Per-Metric Averages (All 28 Tools)

Metric Average Score Assessment
Understandability 4.54 Excellent: tool names are descriptive and intent is clear.
Predictability 5.00 Perfect: all tools behave as expected based on their names and docs.
Mental-Model Shift 2.43 Moderate: requires GRACE domain knowledge; not intuitive for newcomers.
Consistency 4.96 Near-perfect: output shapes and patterns are uniform across the suite; only patch_contract_tool scored below 5.
Documentation Clarity 4.07 Good: parameters are well-defined; could benefit from more examples.
Error-Message Quality 3.36 Acceptable: some tools have excellent errors (simulate_patch, rename_semantic_tag), others are silent.
Validation Friction 1.79 Low: most tools are lenient; mutation tools have appropriate strictness.
Recovery Simplicity 4.50 Excellent: read-only tools are stateless; mutation tools have clear recovery paths.

Overall Suite Average: 3.83 / 5


Key Findings

Strengths

  1. Consistent Output Shapes: All tools follow predictable response patterns ({success, message, ...}).
  2. Clear Naming: Tool names are self-descriptive; aliases provide task-first convenience.
  3. Safe Defaults: The mutation tools that offer a dry-run flag default to it (guarded_patch_contract_tool with apply_patch=false, infer_missing_relations_tool with apply_changes=false).
  4. Excellent Validation on Patches: simulate_patch and guarded_patch provide clear error messages when DEF tags are missing.
  5. Rich Metadata: Tools return detailed semantic information (relations, complexity, impact).

Areas for Improvement

  1. Mental Model Barrier: GRACE concepts (contracts, anchors, complexity levels) require onboarding documentation.
  2. Silent Failures: Some tools return empty results without hints (e.g., no tests found, no relations inferred).
  3. Mutation Safety: patch_contract_tool, rename_contract_id_tool, move_contract_tool, extract_contract_tool, and wrap_node_in_contract_tool apply directly without preview; consider adding a dry_run flag.
  4. Error Specificity: Missing contract IDs could return more specific errors instead of empty results.
  5. Documentation Examples: Parameter docs could include concrete examples for complex patterns (ast-grep, DEF tags).

Recommendations

  1. Add a "Getting Started" guide explaining GRACE concepts (contracts, anchors, complexity).
  2. Add a dry_run parameter to the direct mutation tools (patch_contract, rename_contract_id, move_contract, extract_contract, wrap_node_in_contract).
  3. Improve empty-result responses with actionable hints (e.g., "No tests found — consider adding @TEST metadata").
  4. Add example payloads to tool documentation for complex parameters.
  5. Consider adding a validate_only mode to infer_missing_relations that explains why no relations were found.
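
Recommendations 2 and 3 could take a shape like the sketch below. Everything here is hypothetical: the function signatures, the in-memory `files` and `tests_by_contract` stand-ins for the index, and the message wording are assumptions, and the rename uses naive string replacement purely for illustration.

```python
def trace_tests(contract_id: str, tests_by_contract: dict[str, list[str]]) -> dict:
    """Distinguish an unknown contract from a known contract with no tests."""
    tests = tests_by_contract.get(contract_id)
    if tests is None:
        return {"success": False, "tests": [],
                "message": f"Contract '{contract_id}' not found in index; try re-indexing."}
    if not tests:
        return {"success": True, "tests": [],
                "message": "No tests found; consider adding @TEST metadata."}
    return {"success": True, "tests": tests, "message": f"{len(tests)} test(s) found."}


def rename_contract_id(files: dict[str, str], old_id: str, new_id: str,
                       dry_run: bool = True) -> dict:
    """Preview affected files by default; rewrite only when dry_run is False."""
    affected = sorted(path for path, text in files.items() if old_id in text)
    if not affected:
        return {"success": False, "applied": False, "affected_files": [],
                "message": f"Contract '{old_id}' not found; nothing to rename."}
    if not dry_run:
        for path in affected:
            files[path] = files[path].replace(old_id, new_id)
    return {"success": True, "applied": not dry_run, "affected_files": affected,
            "message": f"{len(affected)} file(s) reference '{old_id}'."}


files = {"auth.py": "# [DEF:AuthService:Class]", "util.py": "# no anchors"}
print(rename_contract_id(files, "AuthService", "AuthManager"))
print(trace_tests("AuthService", {"AuthService": []})["message"])
```

Defaulting dry_run to True matches the convention already used by apply_patch and apply_changes, so callers must opt in to any change.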

[/DEF:Axiom_Tools_Evaluation:Report]