Microsoft has released an open-source framework called ASSERT (short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing) that lets developers build AI behavior tests from written descriptions instead of hand-coding every evaluation. The pitch is straightforward: describe what correct behavior looks like in text, and the framework generates the scoring logic to check for it.

The practical problem here is real. Testing AI systems is harder than testing traditional software because outputs aren't deterministic and "correct" is often a judgment call. Teams typically end up writing brittle, one-off eval scripts or relying on manual spot-checks that don't scale. A spec-driven approach aims to standardize how you define expected behavior so the same criteria can be reused across runs and models.

The regression-testing angle is the part worth paying attention to. When you swap models, tune a prompt, or upgrade a dependency, behavior can quietly drift. A framework that turns specs into repeatable scoring lets you re-run the same battery of checks and catch regressions before they ship—similar to how unit tests guard a codebase, but applied to model behavior.
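To make the regression idea concrete, here is a minimal sketch in Python of comparing two scored runs of the same checks. It does not use ASSERT's actual API: the check names, the 0-to-1 score scale, the `find_regressions` helper, and the 0.05 tolerance are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regression:
    check: str
    baseline: float
    candidate: float


def find_regressions(baseline, candidate, tolerance=0.05):
    """Return checks whose score dropped by more than `tolerance`.

    A check missing from the candidate run is treated as a score of 0.0,
    so silently dropped checks also surface as regressions.
    """
    regressions = []
    for check, base_score in sorted(baseline.items()):
        cand_score = candidate.get(check, 0.0)
        if base_score - cand_score > tolerance:
            regressions.append(Regression(check, base_score, cand_score))
    return regressions


# Hypothetical per-check scores in [0, 1] from two runs of the same spec:
# one before a model or prompt change, one after.
baseline_scores = {"cites_source": 0.92, "no_speculation": 0.88, "under_100_words": 0.95}
candidate_scores = {"cites_source": 0.90, "no_speculation": 0.71, "under_100_words": 0.96}

for r in find_regressions(baseline_scores, candidate_scores):
    print(f"REGRESSION {r.check}: {r.baseline:.2f} -> {r.candidate:.2f}")
```

The tolerance is there because model scores fluctuate from run to run; without it, ordinary noise would show up as a failure on every run.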

For builders, the immediate move is to evaluate ASSERT against whatever you're using now, whether that's custom evals, LLM-as-judge setups, or libraries like promptfoo and DeepEval. Because it's open source, you can inspect how it converts text specs into scores. That matters because the credibility of any eval framework rests on whether its scoring reflects the behavior you care about, not just surface-level text matches.

Start small. Pick one workflow where you already feel uncertain about quality—say a summarization or extraction task—write the behavior spec in plain language, and see how the generated tests perform against known good and bad outputs. That gives you a concrete read on whether the adaptive scoring holds up before you wire it into CI.
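One way to run that check is to measure how often a scorer agrees with human labels on a handful of known good and bad outputs. The sketch below is plain Python and assumes nothing about ASSERT itself: `agreement_report` is a hypothetical helper, and `toy_scorer` is a deliberately naive stand-in for whatever scorer the framework generates.

```python
def agreement_report(scorer, labeled_outputs):
    """Compare a pass/fail scorer against human labels.

    `labeled_outputs` is a list of (text, expected_pass) pairs.
    Returns the agreement rate plus the outputs the scorer got wrong.
    """
    false_pass = []  # scorer passed an output a human marked bad
    false_fail = []  # scorer failed an output a human marked good
    for text, expected_pass in labeled_outputs:
        got = bool(scorer(text))
        if got and not expected_pass:
            false_pass.append(text)
        elif not got and expected_pass:
            false_fail.append(text)
    total = len(labeled_outputs)
    agree = total - len(false_pass) - len(false_fail)
    return {
        "agreement": agree / total if total else 0.0,
        "false_pass": false_pass,
        "false_fail": false_fail,
    }


# Stand-in scorer for illustration only: passes summaries that mention the
# word "total" and stay under 30 words. A generated scorer would replace it.
def toy_scorer(summary):
    return "total" in summary.lower() and len(summary.split()) < 30


# Hand-labeled summaries of a made-up invoice (True = acceptable summary).
examples = [
    ("Invoice 114 from Acme, total $4,200, due March 3.", True),
    ("The total is unknown but probably large.", False),
    ("Invoice 114 from Acme is due March 3.", False),
    ("Acme billed $4,200 on invoice 114, due March 3.", True),
]

report = agreement_report(toy_scorer, examples)
print(f"agreement: {report['agreement']:.2f}")
print("false passes:", report["false_pass"])
print("false fails:", report["false_fail"])
```

The keyword-based stand-in agrees with the labels on only half of these examples. It passes a vague summary that happens to contain "total" and fails a correct one that states the amount without using that word. This is the surface-matching failure to look for before trusting any scorer in CI.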