Weekend Warrior: Making Our AI Tool Rock-Solid 🚀
Hey folks! After Friday’s demo showed some stability issues with our AI investigation tool, I spent the weekend getting it into shape. Here’s the real deal on what went down:
The Big Wins 💪
Stability First!
The biggest pain point was the tool’s instability, so I tackled that head-on:
- Smarter Agent Coordination: Completely revamped how our supervisor tracks agent calls. Added explicit counters in the supervisor state and better validation rules. The supervisor now keeps track of every investigator that’s been called and won’t let them be called again - no more infinite loops! Also added clear rules in the prompt about treating agent usage as global across the entire investigation.
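For the curious, here's roughly what that coordination fix looks like. This is a minimal sketch rather than our actual code: the state shape, field names, and the AgentAlreadyCalledError are illustrative stand-ins.

```python
from dataclasses import dataclass, field


class AgentAlreadyCalledError(RuntimeError):
    """Raised when the supervisor tries to re-dispatch an investigator."""


@dataclass
class SupervisorState:
    """Illustrative slice of the supervisor state: per-agent call counters."""
    agent_call_counts: dict[str, int] = field(default_factory=dict)
    max_calls_per_agent: int = 1  # usage is global across the whole investigation

    def validate_call(self, agent_name: str) -> None:
        """Reject a repeat dispatch up front instead of looping forever."""
        if self.agent_call_counts.get(agent_name, 0) >= self.max_calls_per_agent:
            raise AgentAlreadyCalledError(
                f"{agent_name} was already called in this investigation"
            )

    def record_call(self, agent_name: str) -> None:
        self.agent_call_counts[agent_name] = self.agent_call_counts.get(agent_name, 0) + 1


# Rough flow inside the supervisor loop:
# state.validate_call("datadog_investigator")
# result = run_agent("datadog_investigator", task)
# state.record_call("datadog_investigator")
```

The key design choice is that validation happens before dispatch, so a bad delegation decision from the LLM gets rejected instead of spinning the loop.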
Test All The Things
Added a bunch of tests, especially around edge cases. The best example? Our DataDog investigator would break on special characters in queries. Now we’ve got solid test coverage that catches these issues early. The test suite has already caught several bugs that would have hit prod.
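To give a flavour of those edge-case tests, here's a trimmed-down sketch. The query builder is a simplified stand-in with made-up escaping rules, not the real investigator code or DataDog's exact special-character set.

```python
import re

import pytest

# Simplified stand-in for the investigator's real query builder; the escaping
# rules here are illustrative only.
_SPECIAL_CHARS = re.compile(r'(["\\:()*?])')


def build_logs_query(user_input: str) -> str:
    """Backslash-escape characters that previously broke the generated query."""
    return _SPECIAL_CHARS.sub(r"\\\1", user_input)


@pytest.mark.parametrize(
    "user_input",
    [
        'service:"my-app"',       # embedded quotes
        "error (timeout) @ 50%",  # parentheses and symbols
        "path:/api/v1/*",         # wildcards
    ],
)
def test_build_logs_query_handles_special_characters(user_input):
    # The old investigator raised on inputs like these; the regression test just
    # checks we get a usable query string back instead of an exception.
    query = build_logs_query(user_input)
    assert isinstance(query, str) and query
```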
Better State Management
Switched from JSON to YAML for our summarizer and cleaned up our state handling. The investigation journal is way cleaner now, and we’re properly tracking tool invocations across the entire investigation lifecycle.
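For context, the journal now gets rendered for the summarizer with PyYAML instead of JSON. A sketch, with illustrative entry fields:

```python
import yaml  # PyYAML

# Illustrative journal entry: one record per tool invocation across the investigation.
journal = [
    {
        "step": 1,
        "agent": "datadog_investigator",
        "tool": "search_logs",
        "status": "ok",
        "summary": "502s spiking on /checkout since 14:05 UTC",
    },
]

# YAML drops the braces-and-quotes noise of pretty-printed JSON, which trims
# tokens and keeps the journal readable for the summarizer.
print(yaml.safe_dump(journal, sort_keys=False))
```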
Performance Boost 🏃‍♂️
Made some sweet optimizations that really paid off:
- Memory Diet: Our LLM context was getting huge! Now we only keep raw data for the latest tool call and use summaries for everything else. The supervisor's investigation journal is way more efficient - it only includes what the LLM actually needs. (Sketch below.)
- Smart Sampling: Added output sampling for large result dictionaries. When tools return huge amounts of data (looking at you, DataDog logs), we now sample it intelligently before passing it to the summarizer. No more LLM timeouts! (Sketch below.)
- Faster Tools: Cut down tool timeouts and added better error handling. Tools now fail fast when they're stuck, and we've got proper error propagation throughout the system. (Sketch below.)
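Here's the gist of the Memory Diet change as a sketch (field names are illustrative): only the newest journal entry keeps its raw tool output, and everything older is reduced to its summary before the journal goes into the prompt.

```python
def build_llm_context(journal: list[dict]) -> list[dict]:
    """Keep raw output only for the most recent tool call; summaries for the rest."""
    if not journal:
        return []
    pruned = []
    for entry in journal[:-1]:
        # Older entries lose their raw payload but keep their summary fields.
        pruned.append({k: v for k, v in entry.items() if k != "raw_output"})
    pruned.append(journal[-1])  # the latest call keeps its raw data
    return pruned
```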
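The Smart Sampling piece boils down to something like this; the threshold is an arbitrary placeholder, not our real number.

```python
import random

MAX_ITEMS_PER_KEY = 25  # placeholder threshold


def sample_large_result(result: dict, max_items: int = MAX_ITEMS_PER_KEY) -> dict:
    """Downsample oversized list values so the summarizer prompt stays bounded."""
    sampled = {}
    for key, value in result.items():
        if isinstance(value, list) and len(value) > max_items:
            sampled[key] = random.sample(value, max_items)
            sampled[f"{key}_total_count"] = len(value)  # keep the real size visible
        else:
            sampled[key] = value
    return sampled
```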
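For Faster Tools, the fail-fast execution is sketched below with a worker thread and a hard deadline; the timeout value and error type are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class ToolExecutionError(RuntimeError):
    """Surfaced to the supervisor instead of letting a call hang silently."""


def run_tool_with_timeout(tool_fn, *args, timeout_s: float = 30.0, **kwargs):
    """Run a tool call in a worker thread and give up once the deadline passes."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(tool_fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout as exc:
        raise ToolExecutionError(
            f"{getattr(tool_fn, '__name__', 'tool')} timed out after {timeout_s}s"
        ) from exc
    finally:
        # Don't block on a stuck worker: Python can't kill the thread, but the
        # caller stops waiting and the error propagates up to the supervisor.
        pool.shutdown(wait=False, cancel_futures=True)
```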
LLM Improvements 🤖
The LLM side got some serious love:
- Better Prompts: Major refactoring of our prompts. The supervisor now has explicit rules about agent usage and better examples of what a good final report looks like. Check out the new prompt - it's like a mini-SRE playbook! (Sketch below.)
- Smarter Tracking: Added Langfuse scoring throughout the system. We're now tracking everything from tool execution times to LLM generation quality. This helps us catch performance issues early. (Sketch below.)
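Paraphrasing rather than quoting the actual prompt, the new supervisor rules read roughly like this:

```python
SUPERVISOR_RULES = """
Rules for delegating to investigators:
1. Agent usage is GLOBAL across the entire investigation: if an investigator
   has already been called, do not call it again.
2. Before delegating, check the investigation journal for agents already used.
3. Once every relevant investigator has reported back, stop delegating and
   write the final report (root cause, evidence, impact, next steps).
"""
```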
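On the Smarter Tracking side, scores go to Langfuse via its Python SDK. A minimal sketch using the v2-style client (the score names and the timing wrapper are our own convention, and exact call shapes vary between SDK versions):

```python
import time

from langfuse import Langfuse  # Langfuse Python SDK, v2-style client shown here

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / SECRET_KEY / HOST from the env


def timed_tool_call(trace, tool_name, tool_fn, *args, **kwargs):
    """Run a tool and attach its execution time to the investigation trace as a score."""
    start = time.monotonic()
    result = tool_fn(*args, **kwargs)
    elapsed = time.monotonic() - start
    # The score name is our convention, not something Langfuse prescribes.
    trace.score(name=f"{tool_name}_execution_seconds", value=elapsed)
    return result


# One trace per investigation, e.g.:
# trace = langfuse.trace(name="investigation")
# logs = timed_tool_call(trace, "datadog_search_logs", search_logs, query)
```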
What’s Next? 🎯
While the tool is much more stable now, there are still some exciting improvements on the horizon:
- More caching strategies (because who doesn’t love faster responses?)
- Even better test coverage (can never have too many tests!)
- More investigator superpowers
Shout out to test-driven development - it really saved my bacon this weekend! Who knew writing tests first would actually make the whole process smoother? 😅
Key Takeaways 🎯
This weekend was a great reminder of some fundamental engineering principles:
- Test-driven development isn’t just a buzzword - it actually saved us from several production issues and made the refactoring process much smoother
- When working with LLMs, managing context size and prompt engineering are just as important as the core functionality
- Complex systems need robust state management and clear rules for component interaction
Looking ahead, I’m excited to explore:
- More sophisticated caching strategies for AI systems
- Advanced patterns for managing multi-agent AI architectures
- Ways to make testing AI systems more deterministic
If you’re working on similar challenges with AI tools, I’d love to connect and share experiences. The field is moving so fast, and there’s always something new to learn!
Feel free to reach out if you want to discuss AI architecture, testing strategies, or just geek out about building reliable AI systems. Always happy to chat with fellow engineers! 🙌