busya/ss-tools

busya 1e46073dd6 mcp tuning

2026-04-01 13:29:41 +03:00

[DEF:Axiom_Tools_Evaluation:Report]

@COMPLEXITY: 4

@PURPOSE: Comprehensive evaluation of all axiom-core MCP server tools across 8 UX metrics.

@LAYER: Analysis

@RELATION: DEPENDS_ON -> [Project_Knowledge_Map:Root]

@PRE: All axiom-core tools have been exercised with valid and invalid inputs.

@POST: Report file exists with per-tool scores and aggregate findings.

@SIDE_EFFECT: Creates evaluation artifact in .ai/reports/.

@DATA_CONTRACT: Input[Tool Suite] -> Output[Evaluation Report]

@INVARIANT: Each tool must be scored on all 8 metrics; no tool may be omitted.

Axiom-Core MCP Tools Evaluation Report

Date: 2026-03-31

Workspace: /home/busya/dev/ss-tools

Evaluator: Kilo Code (Coder Mode)

Index Stats: 2528 contracts, 2186 relations, 450 files

Scoring Scale

Score	Meaning
5	Excellent — no friction, best-in-class
4	Good — minor quirks, easily understood
3	Acceptable — some learning curve, works as expected
2	Poor — confusing or inconsistent behavior
1	Broken — fails to meet basic expectations

1. reindex_workspace_tool

Metric	Score	Notes
Understandability	5	Name is self-explanatory; purpose is obvious.
Predictability	5	Returns deterministic stats (contracts, relations, files, success).
Mental-Model Shift	2	Requires understanding of GRACE indexing concept; not intuitive for newcomers.
Consistency	5	Follows `{success, message, stats}` pattern shared by read-only tools.
Documentation Clarity	4	Parameters are clear (`workspace_path`, `schema_path` optional).
Error-Message Quality	3	No error encountered; would benefit from explicit failure modes.
Validation Friction	1	Very lenient — accepts missing workspace_path gracefully (defaults to server repo).
Recovery Simplicity	5	Pure read/index operation; re-run to refresh. No state to undo.

Average: 3.75 / 5
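
Each per-tool average is the unweighted arithmetic mean of the eight metric scores. A minimal sketch using the scores from the table above; the function and variable names are illustrative and not part of axiom-core:

```python
from statistics import mean

# Scores copied from the reindex_workspace_tool table above.
reindex_scores = {
    "Understandability": 5,
    "Predictability": 5,
    "Mental-Model Shift": 2,
    "Consistency": 5,
    "Documentation Clarity": 4,
    "Error-Message Quality": 3,
    "Validation Friction": 1,
    "Recovery Simplicity": 5,
}


def tool_average(scores: dict[str, int]) -> float:
    """Unweighted mean of the eight metric scores, rounded to two decimals."""
    if len(scores) != 8:
        raise ValueError("each tool must be scored on all 8 metrics")
    return round(mean(scores.values()), 2)


print(tool_average(reindex_scores))  # 3.75
```

The length check mirrors the report's invariant that no metric may be omitted.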

2. search_contracts_tool

Metric	Score	Notes
Understandability	5	"Search contracts by query" — crystal clear.
Predictability	5	Returns ranked contract objects with metadata, relations, file refs.
Mental-Model Shift	2	Requires understanding of semantic search vs. text search.
Consistency	5	Output shape matches `find_contract_tool` exactly.
Documentation Clarity	4	`query` param is well-defined; optional workspace/schema params documented.
Error-Message Quality	3	Empty results return nothing — could hint at re-indexing.
Validation Friction	1	Accepts any string; no pre-validation needed.
Recovery Simplicity	5	Stateless query; re-run with different query.

Average: 3.75 / 5

3. read_grace_outline_tool

Metric	Score	Notes
Understandability	4	"GRACE outline" is domain-specific but clear from context.
Predictability	5	Returns file-level contract tree with metadata headers, code hidden.
Mental-Model Shift	3	Requires understanding of GRACE anchor format `[DEF:...]`.
Consistency	5	Output format is stable across files.
Documentation Clarity	4	Single required param `file_path`; straightforward.
Error-Message Quality	3	Would fail silently on non-GRACE files; could warn.
Validation Friction	1	No pre-validation; accepts any path.
Recovery Simplicity	5	Pure read; no side effects.

Average: 3.75 / 5
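
For readers new to the anchor format, the sketch below lists opening `[DEF:...]` anchors in a text. The anchor shape `[DEF:<id>:<type>]` is inferred from this report's own wrapper; the regular expression (which assumes IDs and types contain no colons or closing brackets) and the helper name are assumptions, not the tool's actual parser:

```python
import re

# Assumed anchor shape: [DEF:<id>:<type>], inferred from this report's wrapper.
DEF_OPEN = re.compile(r"\[DEF:([^:\]]+):([^:\]]+)\]")


def list_anchors(text: str) -> list[tuple[str, str]]:
    """Return (contract_id, contract_type) for every opening DEF anchor."""
    return DEF_OPEN.findall(text)


sample = """
[DEF:Axiom_Tools_Evaluation:Report]
@COMPLEXITY: 4
[/DEF:Axiom_Tools_Evaluation:Report]
"""
print(list_anchors(sample))  # [('Axiom_Tools_Evaluation', 'Report')]
```

Closing anchors start with `[/DEF:` and are therefore not matched by this pattern.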

4. ast_search_tool

Metric	Score	Notes
Understandability	4	AST-grep pattern search — clear to developers familiar with the tool.
Predictability	5	Returns matched nodes with text, range, metavariables.
Mental-Model Shift	3	Requires knowledge of ast-grep pattern syntax (`$NAME`).
Consistency	5	Output shape is consistent (array of match objects).
Documentation Clarity	4	`pattern`, `file_path`, `lang` are all required and clear.
Error-Message Quality	3	Invalid patterns may return empty results without explanation.
Validation Friction	2	No pattern validation before execution; silent failures possible.
Recovery Simplicity	5	Stateless; re-run with corrected pattern.

Average: 3.88 / 5
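
A hedged example of the three required arguments. Only the parameter names come from this report; the file path is hypothetical, and how the arguments are sent depends on the MCP client. In ast-grep pattern syntax, `$MSG` is a metavariable that matches a single AST node:

```python
# Hypothetical arguments for ast_search_tool; the file path is illustrative.
ast_search_args = {
    "pattern": "logger.info($MSG)",  # $MSG: ast-grep metavariable (one node)
    "file_path": "src/example.py",
    "lang": "python",
}

REQUIRED = ("pattern", "file_path", "lang")


def missing_params(args: dict) -> list[str]:
    """Client-side check: required parameter names that are absent or empty."""
    return [name for name in REQUIRED if not args.get(name)]


print(missing_params(ast_search_args))          # []
print(missing_params({"pattern": "$A + $B"}))   # ['file_path', 'lang']
```

A check like this on the caller's side catches missing arguments, but it cannot catch a syntactically invalid pattern, which is the silent-failure case noted in the table.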

5. get_semantic_context_tool

Metric	Score	Notes
Understandability	4	"Get semantic context around a contract" — clear intent.
Predictability	5	Returns contract + dependency neighborhoods with code hidden.
Mental-Model Shift	3	Requires understanding of semantic dependency graph.
Consistency	5	Output format is stable and well-structured.
Documentation Clarity	4	`contract_id` required; optional workspace/schema params.
Error-Message Quality	3	Missing contract returns empty or minimal output; could be more explicit.
Validation Friction	1	Accepts any string; no pre-validation.
Recovery Simplicity	5	Pure read; no state to undo.

Average: 3.75 / 5

6. build_task_context_tool

Metric	Score	Notes
Understandability	4	"Build task-focused context" — clear for implementation workflows.
Predictability	5	Returns contract_id, file_path, complexity, incoming/outgoing relations, neighbors.
Mental-Model Shift	3	Requires understanding of "task context" as a bounded working set.
Consistency	5	Output shape is deterministic and well-structured.
Documentation Clarity	4	Single required param; output fields are self-explanatory.
Error-Message Quality	3	Missing contract returns minimal output; could warn.
Validation Friction	1	No pre-validation; accepts any contract_id.
Recovery Simplicity	5	Stateless; re-run anytime.

Average: 3.75 / 5

7. workspace_semantic_health_tool

Metric	Score	Notes
Understandability	5	"Semantic health" — clear dashboard-style summary.
Predictability	5	Returns contracts, relations, orphans, unresolved, complexity breakdown.
Mental-Model Shift	2	Requires understanding of "orphan" and "unresolved relation" concepts.
Consistency	5	Output shape is stable across invocations.
Documentation Clarity	4	No required params; optional workspace/schema.
Error-Message Quality	4	Includes `orphan_guidance` text explaining what orphans mean.
Validation Friction	1	No pre-validation needed.
Recovery Simplicity	5	Pure read; no state to undo.

Average: 3.88 / 5

8. audit_contracts_tool

Metric	Score	Notes
Understandability	5	"Audit contracts" — clear intent for quality checks.
Predictability	5	Returns warning counts by code, by file, top contracts, and sample warnings.
Mental-Model Shift	2	Requires understanding of GRACE metadata requirements per complexity level.
Consistency	5	Output shape is stable; `detail_level` controls verbosity.
Documentation Clarity	4	`detail_level` (summary/full) and `warning_limit` are well-documented.
Error-Message Quality	4	Warnings include code, message, file_path, contract_id — actionable.
Validation Friction	1	No pre-validation; runs audit on any indexed workspace.
Recovery Simplicity	5	Pure read; no state to undo.

Average: 3.88 / 5

9. diff_contract_semantics_tool

Metric	Score	Notes
Understandability	4	"Diff contract semantics" — clear for comparing two contract versions.
Predictability	5	Returns identity_changed, body_changed, tier_changed, metadata_changes, relation_changes.
Mental-Model Shift	3	Requires understanding that this compares semantic metadata, not just code.
Consistency	5	Output shape matches guarded_patch diff output.
Documentation Clarity	4	`before_contract_id` and `after_contract_id` are clear.
Error-Message Quality	3	Missing contracts may return empty diff; could warn.
Validation Friction	1	No pre-validation; accepts any contract IDs.
Recovery Simplicity	5	Pure read; no state to undo.

Average: 3.75 / 5

10. impact_analysis_tool

Metric	Score	Notes
Understandability	5	"Impact analysis" — clear intent for dependency impact.
Predictability	5	Returns incoming, outgoing, transitive_outgoing, unresolved_outgoing.
Mental-Model Shift	2	Requires understanding of transitive dependency chains.
Consistency	5	Output shape matches guarded_patch impact output.
Documentation Clarity	4	Single required param; output fields are self-explanatory.
Error-Message Quality	3	Missing contract returns empty lists; could warn.
Validation Friction	1	No pre-validation; accepts any contract_id.
Recovery Simplicity	5	Pure read; no state to undo.

Average: 3.75 / 5
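
The transitive dependency chain mentioned above can be pictured as a breadth-first walk over outgoing relations. This is a sketch under assumptions: the contract IDs are hypothetical, and whether the tool's `transitive_outgoing` field includes direct neighbors (as this version does) is not confirmed by this evaluation:

```python
from collections import deque


def transitive_outgoing(graph: dict[str, list[str]], start: str) -> list[str]:
    """Breadth-first walk over outgoing edges, excluding the start node."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, []):
            if target not in seen:  # guards against cycles and duplicates
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


# Hypothetical contract IDs and relations.
relations = {
    "AuthService": ["TokenStore", "UserRepo"],
    "TokenStore": ["Clock"],
    "UserRepo": ["Database"],
}
print(transitive_outgoing(relations, "AuthService"))
# ['TokenStore', 'UserRepo', 'Clock', 'Database']
```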

11. simulate_patch_tool

Metric	Score	Notes
Understandability	4	"Simulate patch" — clear preview of changes without applying.
Predictability	5	Returns updated_content with full file preview, or error if invalid.
Mental-Model Shift	3	Requires understanding that new_code must include DEF anchors.
Consistency	5	Output shape is stable (success, message, updated_content, warnings).
Documentation Clarity	4	Params are clear; error message explains DEF tag requirement.
Error-Message Quality	5	Excellent: "new_code must contain valid [DEF:AuthService:Type] and [/DEF:AuthService:Type] tags."
Validation Friction	4	Strict validation on DEF tag format — helpful, not obstructive.
Recovery Simplicity	5	No state change; fix new_code and re-run.

Average: 4.38 / 5
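
The DEF tag requirement quoted in the error message can be pre-checked by the caller before invoking the tool. A minimal sketch: the helper name, the `Class` contract type, and the placement of anchors inside comments are assumptions for illustration, and the tool's own validation may be stricter:

```python
def has_def_tags(new_code: str, contract_id: str, contract_type: str) -> bool:
    """True if new_code has the opening tag followed later by the closing tag."""
    opening = f"[DEF:{contract_id}:{contract_type}]"
    closing = f"[/DEF:{contract_id}:{contract_type}]"
    start = new_code.find(opening)
    return start != -1 and new_code.find(closing, start + len(opening)) != -1


good = (
    "# [DEF:AuthService:Class]\n"
    "class AuthService:\n"
    "    pass\n"
    "# [/DEF:AuthService:Class]\n"
)
bad = "class AuthService:\n    pass\n"

print(has_def_tags(good, "AuthService", "Class"))  # True
print(has_def_tags(bad, "AuthService", "Class"))   # False
```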

12. guarded_patch_contract_tool

Metric	Score	Notes
Understandability	5	"Guarded patch" — clear that validation guards are applied before changes.
Predictability	5	Returns diff, impact, and applied flag. Guards include syntax, semantic diff, impact.
Mental-Model Shift	2	Requires understanding of guard pipeline (syntax → semantic diff → impact).
Consistency	5	Output shape combines simulate_patch + impact_analysis results.
Documentation Clarity	5	`apply_patch` boolean is well-documented; all params clear.
Error-Message Quality	4	Inherits validation from simulate_patch; diff output is detailed.
Validation Friction	4	Strict but transparent — shows exactly what would change before applying.
Recovery Simplicity	5	With `apply_patch=false`, no state change. With `true`, git can revert.

Average: 4.38 / 5
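
The guard order noted above (syntax, then semantic diff, then impact) can be pictured as a short-circuiting pipeline. The sketch below is an illustration only, assuming the pipeline stops at the first failing guard: `run_guards`, the guard callables, and the `passed`/`applied` response keys are hypothetical stand-ins, and only the syntax guard does real work here (via Python's built-in `compile`):

```python
from typing import Callable

Guard = Callable[[str], tuple[bool, str]]


def run_guards(new_code: str, guards: list[tuple[str, Guard]],
               apply_patch: bool = False) -> dict:
    """Run guards in order; stop at the first failure; apply only if all pass."""
    passed = []
    for name, guard in guards:
        ok, message = guard(new_code)
        if not ok:
            return {"success": False, "message": f"{name}: {message}",
                    "passed": passed, "applied": False}
        passed.append(name)
    return {"success": True, "message": "all guards passed",
            "passed": passed, "applied": apply_patch}


def syntax_guard(code: str) -> tuple[bool, str]:
    """Real check: does the code compile as Python?"""
    try:
        compile(code, "<patch>", "exec")
        return True, "ok"
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"


# The last two guards are placeholders that always pass.
guards = [
    ("syntax", syntax_guard),
    ("semantic_diff", lambda code: (True, "ok")),
    ("impact", lambda code: (True, "ok")),
]

result = run_guards("x = 1\n", guards)
print(result["passed"], result["applied"])
# ['syntax', 'semantic_diff', 'impact'] False
```

With `apply_patch` left at `False`, the call is a pure preview, which matches the no-state-change behavior described in the Recovery Simplicity row.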

13. patch_contract_tool

Metric	Score	Notes
Understandability	4	"Patch contract" — clear intent for in-place replacement.
Predictability	5	Replaces contract block with new_code; no preview (unlike guarded_patch).
Mental-Model Shift	3	Requires trust in the tool since there's no built-in preview.
Consistency	4	Simpler than guarded_patch; lacks validation pipeline.
Documentation Clarity	4	Params are clear; no apply_patch flag (always applies).
Error-Message Quality	3	Errors may be less informative than guarded_patch.
Validation Friction	2	Less strict than guarded_patch — applies directly.
Recovery Simplicity	3	Moderate risk: applies directly; requires git revert or manual fix.

Average: 3.50 / 5

14. rename_contract_id_tool

Metric	Score	Notes
Understandability	5	"Rename contract ID" — crystal clear.
Predictability	5	Renames identifier across indexed workspace.
Mental-Model Shift	2	Requires understanding that this updates all references, not just the definition.
Consistency	5	Follows standard {success, message} pattern.
Documentation Clarity	4	`old_contract_id` and `new_contract_id` are clear.
Error-Message Quality	3	Missing old_id may fail silently; could warn.
Validation Friction	2	Applies directly; no preview of affected files.
Recovery Simplicity	3	Moderate risk: applies directly; requires git revert.

Average: 3.63 / 5

15. move_contract_tool

Metric	Score	Notes
Understandability	5	"Move contract" — clear intent for relocating a contract block.
Predictability	5	Moves contract from source to destination file.
Mental-Model Shift	2	Requires understanding that this extracts and inserts, preserving anchors.
Consistency	5	Follows standard pattern.
Documentation Clarity	4	Three required params are clear.
Error-Message Quality	3	Missing files may fail with generic error.
Validation Friction	2	Applies directly; no preview.
Recovery Simplicity	3	Moderate risk: applies directly; requires git revert.

Average: 3.63 / 5

16. extract_contract_tool

Metric	Score	Notes
Understandability	4	"Extract contract" — clear intent for creating new contract from code range.
Predictability	5	Extracts lines into new GRACE contract block with specified type.
Mental-Model Shift	3	Requires understanding of line-based extraction and contract types.
Consistency	5	Follows standard pattern.
Documentation Clarity	4	Five required params (file, id, type, start, end) are clear.
Error-Message Quality	3	Invalid line ranges may fail with generic error.
Validation Friction	2	Applies directly; no preview.
Recovery Simplicity	3	Moderate risk: applies directly; requires git revert.

Average: 3.63 / 5

17. wrap_node_in_contract_tool

Metric	Score	Notes
Understandability	4	"Wrap node in contract" — clear intent for adding GRACE anchors to existing code.
Predictability	5	Uses ast-grep to locate node and wraps with [DEF]...[/DEF].
Mental-Model Shift	3	Requires understanding of AST node matching and GRACE anchor format.
Consistency	5	Follows standard pattern.
Documentation Clarity	4	Params are clear; `lang` defaults to python.
Error-Message Quality	3	Missing node may fail silently.
Validation Friction	2	Applies directly; no preview.
Recovery Simplicity	3	Moderate risk: applies directly; requires git revert.

Average: 3.63 / 5

18. update_contract_metadata_tool

Metric	Score	Notes
Understandability	5	"Update contract metadata" — crystal clear.
Predictability	5	Updates/adds tags without modifying code body.
Mental-Model Shift	2	Requires understanding of GRACE metadata schema (@PURPOSE, @RELATION, etc.).
Consistency	5	Returns updated_tags list; clear feedback.
Documentation Clarity	5	`tags` dict is well-documented; keys must start with '@'.
Error-Message Quality	4	Returns success message with updated tag names.
Validation Friction	3	Validates tag key format; accepts any value.
Recovery Simplicity	4	Low risk: only modifies metadata; easy to revert.

Average: 4.13 / 5
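
The key-format rule for the `tags` dict (keys must start with '@') is easy to check before calling the tool. A minimal sketch; the helper name and the tag values are illustrative:

```python
def invalid_tag_keys(tags: dict[str, str]) -> list[str]:
    """Return keys that do not start with '@' and would fail key validation."""
    return [key for key in tags if not key.startswith("@")]


# Hypothetical tags; the second key is missing its '@' prefix.
tags = {"@PURPOSE": "Issue and verify session tokens.", "LAYER": "Domain"}
print(invalid_tag_keys(tags))  # ['LAYER']
```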

19. rename_semantic_tag_tool

Metric	Score	Notes
Understandability	4	"Rename semantic tag" — clear intent.
Predictability	5	Renames or removes a tag within a contract's metadata.
Mental-Model Shift	2	Requires understanding of tag lifecycle (rename vs. remove).
Consistency	5	Follows standard {success, message} pattern.
Documentation Clarity	4	`old_tag` required, `new_tag` optional (null = remove).
Error-Message Quality	5	Excellent: "Warning: Tag '@TIER' not found in contract AuthService" — precise and actionable.
Validation Friction	3	Validates tag existence before operation.
Recovery Simplicity	4	Low risk: only modifies metadata; easy to revert.

Average: 4.00 / 5
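
The rename-versus-remove behavior (`new_tag` null means remove) can be modeled on a plain dict. This is a sketch, not the tool's implementation: the tag names `@OWNER` and `@MAINTAINER` are hypothetical, and returning `success: False` for a missing tag is an assumption, since the evaluation only records the warning text:

```python
from typing import Optional


def rename_tag(metadata: dict[str, str], old_tag: str,
               new_tag: Optional[str] = None) -> dict:
    """Rename old_tag to new_tag, or remove it when new_tag is None."""
    if old_tag not in metadata:
        # Wording modeled on the warning quoted in the table above.
        return {"success": False,
                "message": f"Warning: Tag '{old_tag}' not found"}
    value = metadata.pop(old_tag)
    if new_tag is None:
        return {"success": True, "message": f"Removed {old_tag}"}
    metadata[new_tag] = value
    return {"success": True, "message": f"Renamed {old_tag} to {new_tag}"}


meta = {"@OWNER": "platform-team", "@PURPOSE": "Issue session tokens."}
print(rename_tag(meta, "@OWNER", "@MAINTAINER")["message"])  # Renamed @OWNER to @MAINTAINER
print(rename_tag(meta, "@PURPOSE")["message"])               # Removed @PURPOSE
print(sorted(meta))                                          # ['@MAINTAINER']
```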

20. prune_contract_metadata_tool

Metric	Score	Notes
Understandability	4	"Prune contract metadata" — clear intent for removing redundant tags.
Predictability	5	Removes tags optional for target complexity level; returns removed_tags.
Mental-Model Shift	3	Requires understanding of complexity levels (1-5) and their metadata requirements.
Consistency	5	Returns removed_tags list; clear feedback.
Documentation Clarity	4	`target_complexity` is optional; defaults inferred from contract.
Error-Message Quality	4	Returns success with removed tag names.
Validation Friction	3	Validates complexity level range (1-5).
Recovery Simplicity	4	Low risk: only removes metadata; easy to re-add.

Average: 4.00 / 5

21. infer_missing_relations_tool

Metric	Score	Notes
Understandability	4	"Infer missing relations" — clear intent for discovering implicit dependencies.
Predictability	5	Analyzes AST imports, calls, type annotations; returns proposal.
Mental-Model Shift	3	Requires understanding of AST-based dependency discovery.
Consistency	5	Returns inferred list with apply_changes flag.
Documentation Clarity	4	`apply_changes` defaults to false (dry-run).
Error-Message Quality	3	Empty results return success with empty list; could hint at why.
Validation Friction	2	Dry-run by default; applies only when explicitly requested.
Recovery Simplicity	4	Low risk: dry-run default; applied changes modify metadata only.

Average: 3.75 / 5
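
As a rough picture of AST-based dependency discovery, the sketch below collects imported module names with Python's standard `ast` module. It covers only the import part of what the table describes (calls and type annotations are omitted), and it is not the tool's actual inference logic:

```python
import ast


def imported_modules(source: str) -> list[str]:
    """Collect module names from import and from-import statements, sorted."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return sorted(names)


source = (
    "import json\n"
    "from pathlib import Path\n"
    "\n"
    "def load(p: Path):\n"
    "    return json.loads(p.read_text())\n"
)
print(imported_modules(source))  # ['json', 'pathlib']
```

Mapping module names to contract IDs would be a separate step that depends on the workspace index.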

22. trace_tests_for_contract_tool

Metric	Score	Notes
Understandability	5	"Trace tests for contract" — crystal clear.
Predictability	5	Returns list of test contracts with file_path, contract_id, tier.
Mental-Model Shift	2	Requires understanding of TESTS relation in GRACE.
Consistency	5	Output shape is stable.
Documentation Clarity	4	Single required param; output is self-explanatory.
Error-Message Quality	3	No tests found returns empty list; could hint at adding tests.
Validation Friction	1	No pre-validation needed.
Recovery Simplicity	5	Pure read; no state to undo.

Average: 3.75 / 5

23. scaffold_contract_tests_tool

Metric	Score	Notes
Understandability	5	"Scaffold contract tests" — clear intent for generating test boilerplate.
Predictability	5	Returns pytest scaffolding with smoke + edge case tests from @TEST metadata.
Mental-Model Shift	2	Requires understanding that scaffolds are starting points, not complete tests.
Consistency	5	Output shape is stable (Python test code string).
Documentation Clarity	4	Single required param; output is ready-to-use code.
Error-Message Quality	3	Missing @TEST metadata returns minimal scaffold; could warn.
Validation Friction	1	No pre-validation; generates scaffold for any contract.
Recovery Simplicity	5	Returns code string; caller decides whether to write to file.

Average: 3.75 / 5

24. find_contract_tool (alias)

Metric	Score	Notes
Understandability	5	"Find contract" — task-first alias for semantic lookup.
Predictability	5	Returns same output as search_contracts_tool.
Mental-Model Shift	2	Same as search_contracts_tool.
Consistency	5	Identical to search_contracts_tool output.
Documentation Clarity	4	Same params as search_contracts_tool.
Error-Message Quality	3	Same as search_contracts_tool.
Validation Friction	1	Same as search_contracts_tool.
Recovery Simplicity	5	Stateless query.

Average: 3.75 / 5

25. read_outline_tool (alias)

Metric	Score	Notes
Understandability	4	"Read outline" — task-first alias for file inspection.
Predictability	5	Same as read_grace_outline_tool.
Mental-Model Shift	3	Same as read_grace_outline_tool.
Consistency	5	Identical to read_grace_outline_tool output.
Documentation Clarity	4	Same params as read_grace_outline_tool.
Error-Message Quality	3	Same as read_grace_outline_tool.
Validation Friction	1	Same as read_grace_outline_tool.
Recovery Simplicity	5	Pure read.

Average: 3.75 / 5

26. safe_patch_tool (alias)

Metric	Score	Notes
Understandability	5	"Safe patch" — task-first alias for validated patching.
Predictability	5	Same as guarded_patch_contract_tool.
Mental-Model Shift	2	Same as guarded_patch_contract_tool.
Consistency	5	Identical to guarded_patch_contract_tool output.
Documentation Clarity	4	Same params as guarded_patch_contract_tool.
Error-Message Quality	4	Same as guarded_patch_contract_tool.
Validation Friction	4	Same as guarded_patch_contract_tool.
Recovery Simplicity	5	Same as guarded_patch_contract_tool.

Average: 4.25 / 5

27. find_related_tests_tool (alias)

Metric	Score	Notes
Understandability	5	"Find related tests" — task-first alias for test lookup.
Predictability	5	Same as trace_tests_for_contract_tool.
Mental-Model Shift	2	Same as trace_tests_for_contract_tool.
Consistency	5	Identical to trace_tests_for_contract_tool output.
Documentation Clarity	4	Same params as trace_tests_for_contract_tool.
Error-Message Quality	3	Same as trace_tests_for_contract_tool.
Validation Friction	1	Same as trace_tests_for_contract_tool.
Recovery Simplicity	5	Pure read.

Average: 3.75 / 5

28. analyze_impact_tool (alias)

Metric	Score	Notes
Understandability	5	"Analyze impact" — task-first alias for dependency analysis.
Predictability	5	Same as impact_analysis_tool.
Mental-Model Shift	2	Same as impact_analysis_tool.
Consistency	5	Identical to impact_analysis_tool output.
Documentation Clarity	4	Same params as impact_analysis_tool.
Error-Message Quality	3	Same as impact_analysis_tool.
Validation Friction	1	Same as impact_analysis_tool.
Recovery Simplicity	5	Pure read.

Average: 3.75 / 5

Aggregate Summary

Per-Metric Averages (All 28 Tools)

Metric	Average Score	Assessment
Understandability	4.54	Excellent: tool names are descriptive and intent is clear.
Predictability	5.00	Perfect: all tools behave as expected based on their names and docs.
Mental-Model Shift	2.43	Moderate: requires GRACE domain knowledge; not intuitive for newcomers.
Consistency	4.96	Near-perfect: output shapes and patterns are uniform across the suite; only patch_contract_tool scores below 5.
Documentation Clarity	4.07	Good: parameters are well-defined; could benefit from more examples.
Error-Message Quality	3.36	Acceptable: some tools have excellent errors (simulate_patch, rename_semantic_tag), others are silent.
Validation Friction	1.79	Low: most tools are lenient (scores here track strictness, with 1 the most lenient); mutation tools have appropriate strictness.
Recovery Simplicity	4.50	Excellent: read-only tools are stateless; mutation tools have clear recovery paths.

Overall Suite Average: 3.83 / 5

Key Findings

Strengths

Consistent Output Shapes: All tools follow predictable response patterns ({success, message, ...}).
Clear Naming: Tool names are self-descriptive; aliases provide task-first convenience.
Safe Defaults: Tools with an explicit apply flag default to dry-run (apply_patch=false on guarded_patch_contract_tool, apply_changes=false on infer_missing_relations_tool).
Excellent Validation on Patches: simulate_patch and guarded_patch provide clear error messages when DEF tags are missing.
Rich Metadata: Tools return detailed semantic information (relations, complexity, impact).

Areas for Improvement

Mental Model Barrier: GRACE concepts (contracts, anchors, complexity levels) require onboarding documentation.
Silent Failures: Some tools return empty results without hints (e.g., no tests found, no relations inferred).
Mutation Safety: patch_contract_tool, rename_contract_id_tool, move_contract_tool, extract_contract_tool, and wrap_node_in_contract_tool apply directly without preview; consider adding a dry_run flag.
Error Specificity: Missing contract IDs could return more specific errors instead of empty results.
Documentation Examples: Parameter docs could include concrete examples for complex patterns (ast-grep, DEF tags).

Recommendations

Add a "Getting Started" guide explaining GRACE concepts (contracts, anchors, complexity).
Add a dry_run parameter to the direct mutation tools (patch_contract, rename_contract_id, move_contract, extract_contract, wrap_node_in_contract).
Improve empty-result responses with actionable hints (e.g., "No tests found — consider adding @TEST metadata").
Add example payloads to tool documentation for complex parameters.
Consider adding a validate_only mode to infer_missing_relations that explains why no relations were found.
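
The empty-result recommendation above could look roughly like the following. This is a sketch of one possible response shape: the `hint` field, the `tests` key, and the function name are hypothetical and do not exist in the evaluated tools:

```python
def trace_tests_response(contract_id: str, tests: list[dict]) -> dict:
    """Build a response that adds an actionable hint when no tests are found."""
    response = {
        "success": True,
        "message": f"{len(tests)} test(s) found for {contract_id}",
        "tests": tests,
    }
    if not tests:
        # Hypothetical field: tells the caller what to try next.
        response["hint"] = (
            "No tests found; consider adding @TEST metadata "
            "or re-indexing the workspace."
        )
    return response


empty = trace_tests_response("AuthService", [])
print(empty["message"])   # 0 test(s) found for AuthService
print("hint" in empty)    # True
```

Keeping `success` true while adding a separate hint preserves the existing `{success, message, ...}` pattern for callers that already depend on it.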

[/DEF:Axiom_Tools_Evaluation:Report]
