Instruments that help people leverage AI
Context engineering. Retrieval workflows. Reproducible benchmarking. Spec management for the humans and agents doing the work.
One sentence of tool-selection guidance eliminated a 13-point accuracy penalty from over-tooling.
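What that looks like in practice: a one-sentence steer inside the tool's description telling the model when not to reach for it. Below is a minimal sketch in the style of an MCP tool manifest; the tool name and the guidance sentence are hypothetical stand-ins, not the wording from the study.

```python
# Hypothetical MCP-style tool entry. The final sentence is the kind of
# tool-selection guidance the finding refers to; the exact wording used
# in the study is not reproduced here.
MODEL_SEARCH_TOOL = {
    "name": "model_search",
    "description": (
        "Full-text search over SysML v2 model elements. "
        # One sentence of tool-selection guidance:
        "Use this only when you do not already know the element's "
        "qualified name; otherwise read the element directly."
    ),
}
```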
Pre-rendered model views scored 0.893 vs 0.558 for agent-assembled context. d=1.01, N=10. 4x cheaper.
Exploratory study, single corpus, N=3-10 replications. Full methodology and threats to validity →
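For readers weighing the d=1.01 above: that is Cohen's d, the difference in group means divided by the pooled standard deviation. A minimal sketch of the conventional computation (the function is ours, not part of the benchmark harness):

```python
from statistics import mean, stdev

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d: difference in means over the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5
```

By the usual rule of thumb, d above 0.8 counts as a large effect, though with N=10 the interval around the estimate is wide.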
What we build
Structural retrieval, graph traversal, and completeness checking for SysML v2 models. Rust. 14 commands, 10 MCP tools.
Reproducible evaluation of tool-augmented LLMs on structured engineering tasks. Python. 132 tasks, 5 models.
Tracks work through an orient-plan-agree-execute-reflect-report lifecycle. Task DAGs, propagation, stakeholder dispositions. Go.
Four primitives for LLM-correct codebases. Derived obligations, prescriptive failure (see the sketch after this list), bundled enforcement, vacuity detection.
Tree-sitter grammar for SysML v2. 192 tests, 89% external file coverage. 6 language bindings.
Converts OMG KeBNF specs to ANTLR4 and tree-sitter. Parses all 640 KerML + SysML v2 rules. Rust.
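One of those primitives deserves a concrete illustration. Prescriptive failure, as we use the term, means an error that tells the agent what to do next rather than only what broke. A minimal Python sketch; the names here are invented for illustration, not taken from the library:

```python
class PrescriptiveError(ValueError):
    """A failure message that names the fix, not just the fault."""

def resolve_part(parts: dict[str, dict], name: str) -> dict:
    """Look up a model part, failing prescriptively on a miss."""
    if name not in parts:
        known = ", ".join(sorted(parts)) or "<none>"
        # A descriptive failure would stop at "part not found".
        # A prescriptive failure names the valid next moves.
        raise PrescriptiveError(
            f"No part named {name!r}. Known parts: {known}. "
            "Retry with one of those exact names, or define the part first."
        )
    return parts[name]
```

The payoff is that an agent reading the message can recover in one step instead of guessing.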
How this started
This work started as an academic exploration of how AI interacts with structured engineering artifacts. We built tools, ran benchmarks, wrote papers. Along the way, we found common ground with GitLab's Knowledge Graph team, who are solving related problems in context engineering and retrieval at production scale. We've been contributing findings on prescriptive failure patterns and tool-description effectiveness to their eval methodology.
Everything is MIT-licensed and on GitLab. If you work with engineering models and are curious about how AI performs on them, we'd like to talk.