Notes from AI That Works Camp by AI Tinkerers at SF
If you are looking to get past the hype and the flood of information out there about working with LLMs, you will find this post quite useful. I would also encourage you to attend more camps like these. It was worth the eight hours I spent.
Dex (author of 12-Factor Agents, which stayed on top of HN for a day), Vaibhav (who's tackled hallucinations in AR at Microsoft and Google and is CEO of Boundary, which created BAML), and Philip from Baseten.co hosted an eight-hour deep dive into LLM engineering this Saturday. They've been solving hard AI problems long before LLMs went mainstream, so their insights were both practical and battle-tested. Below you'll find my favorite takeaways. Bonus: they had a pretty cool office space on the 16th floor with a view of boats cruising under the Bay Bridge.


Categorization at Scale
Using Large Label Sets
Feeding an LLM thousands of categories in one prompt leads to "attention drift," where early labels get overweighted and middle-of-the-pack options get ignored. The result is inconsistent or incorrect tagging. You need to trim that list before you ever show it to the model.
Filtering Strategies?
What would you choose?
- Hierarchy
- LLM Quorum
- Cosine Similarity
Winner - Cosine Similarity Clustering
Encode all your categories as embeddings and use cosine similarity to shortlist, say, the top 100 candidates. This lightweight vector search slashes your prompt size and hands the LLM a manageable list for final selection.
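The shortlisting step can be sketched in a few lines. This is a toy illustration, not a production pipeline: the 3-dimensional vectors and category names below are made up, and a real system would get its embeddings from an embedding model.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def shortlist(query_vec, category_vecs, k=2):
    # Rank all categories by similarity to the input, keep only the top-k
    # to hand to the LLM for final selection.
    scored = sorted(
        category_vecs.items(),
        key=lambda kv: cosine(query_vec, kv[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Toy 3-d "embeddings"; a real pipeline would embed thousands of labels.
cats = {
    "billing": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "returns": [0.2, 0.8, 0.1],
}
print(shortlist([0.15, 0.85, 0.05], cats, k=2))  # → ['shipping', 'returns']
```

With thousands of categories you would swap the linear scan for a vector index, but the prompt-side effect is the same: the LLM sees a short candidate list instead of the full label set.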
Hybrid Local vs. LLM Pipelines
The previous approach is a kind of cascading classifier.
Another approach, which worked for classifying medical insurance codes, is to train or use a small, on-device model to auto-classify the 80% of "easy" cases, then hand the remaining 20%—the real edge cases—to an LLM. This split reduces API costs and speeds up processing by keeping the LLM focused on truly ambiguous inputs.
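A minimal sketch of that routing logic, assuming the local model reports a confidence score. The classifiers and the ICD-style codes below are stand-ins, not real models or real codes.

```python
def local_classify(text):
    # Stand-in for a small on-device model; returns (label, confidence).
    if "knee surgery" in text:
        return "ICD-XX", 0.95  # hypothetical code: an "easy" case
    return "unknown", 0.40

def llm_classify(text):
    # Stand-in for an LLM call that handles the ambiguous tail.
    return "ICD-YY"  # hypothetical code

def classify(text, threshold=0.8):
    # Route confident cases locally; escalate the rest to the LLM.
    label, conf = local_classify(text)
    if conf >= threshold:
        return label, "local"
    return llm_classify(text), "llm"

print(classify("knee surgery follow-up"))  # handled locally
print(classify("ambiguous clinical note"))  # escalated to the LLM
```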
Identifying the Confident Set
- Manual Analysis
- Add simple user-feedback loops to track real-world performance (e.g., buttons whose actions depend on the classification, like "cancel order"). Use those signals to adjust confidence thresholds or retrain your local classifier, so your pipeline keeps improving over time.
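One simple way to turn those feedback signals into threshold adjustments (the update rule and step sizes here are illustrative assumptions, not a recommendation from the camp):

```python
def update_threshold(threshold, feedback, step=0.02, lo=0.5, hi=0.95):
    # feedback: list of (confidence, was_correct) pairs gathered from
    # user actions (e.g., the user accepted or overrode the result).
    wrong_confident = sum(1 for c, ok in feedback if c >= threshold and not ok)
    if wrong_confident:
        # The local model was confidently wrong: tighten the threshold
        # so more cases get escalated to the LLM.
        threshold = min(hi, threshold + step * wrong_confident)
    else:
        # No confident mistakes this batch: relax slightly to keep
        # more traffic on the cheap local path.
        threshold = max(lo, threshold - step)
    return round(threshold, 4)

print(update_threshold(0.8, [(0.9, False), (0.85, False)]))  # → 0.84
print(update_threshold(0.8, [(0.9, True)]))                  # → 0.78
```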
Prompt Engineering
Roll Your Own Prompts
While LLMs can auto-generate prompts, they often churn out verbose, catch-all instructions that waste tokens. Writing prompts yourself lets you leverage domain expertise for precise, context-specific guidance and keeps your costs down.
Keep Identifiers Simple
Long UUIDs and noisy keys gobble tokens and distract the model. Instead, use concise labels like category1, category2, etc. You'll see more accurate outputs and slimmer prompts almost immediately.
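A small remapping layer makes this painless: give the prompt short labels, then translate the model's pick back to the real identifier. The truncated UUIDs below are placeholders for illustration only.

```python
def compact_ids(real_ids):
    # Map long identifiers to short prompt-friendly labels, and back.
    to_short = {rid: f"category{i + 1}" for i, rid in enumerate(real_ids)}
    to_real = {short: rid for rid, short in to_short.items()}
    return to_short, to_real

uuids = ["9f1c4e2a-example-a1", "77b0d9c3-example-b2"]  # illustrative IDs
to_short, to_real = compact_ids(uuids)
# The prompt sees category1/category2; when the model answers
# "category2", to_real maps it back to the original UUID.
print(to_real["category2"])
```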
Inline Few-Shot Examples
Rather than dumping examples at the top, tuck each sample right next to the data it illustrates. This localized approach prevents the model from scrolling through unrelated context and makes few-shot prompting more reliable.
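One way to sketch this layout (the field names and examples are my own illustrations; the point is the placement, each example sitting beside its datum rather than in a block at the top):

```python
def build_prompt(fields):
    # fields: list of (field_name, inline_example, value_to_process).
    lines = []
    for name, example, value in fields:
        # Tuck the example right next to the data it illustrates,
        # instead of one big example section at the top of the prompt.
        lines.append(f"{name} (e.g. {example}): {value}")
    return "\n".join(lines)

prompt = build_prompt([
    ("diagnosis", "sprained ankle", "persistent knee pain"),
    ("severity", "mild", "moderate"),
])
print(prompt)
```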
Example Contamination & Hallucinations
Vaibhav hit a case where, when a patient with an issue similar to the example condition finally showed up, the model copied details from the example into its output. If you're wondering why your LLM is making stuff up, check the examples you're feeding it.
Skip the Obvious
Don't waste tokens describing fields like name in an Employee schema—anyone (and any LLM) can infer that. Let the model use its common-sense understanding and save your prompt budget for the real instructions.
Order Instructions Logically
Experiment with placing either your data or your instructions first, depending on which yields clearer reasoning. You can also break complex tasks into intermediate steps—e.g., list probable categories, then pick the final one—to guide the model's workflow.
Prioritization Mantra
Borrowing from aviate/navigate/communicate, try "make it run, make it right, make it fast, make it cheap." Tackle features in that sequence—or swap "fast" and "cheap" depending on your project phase—to stay focused and avoid premature optimization.
Reasoning & Chain of Thought

<thinking> Blocks
Wrapping a <thinking> section in your prompt gives the model a clear space to outline its reasoning steps. That transparency makes it much easier to debug flawed logic paths and understand where the model is getting stuck.
Guided Reasoning Steps
You can pre-fill part of the <thinking> block with bullet points or partial logic to nudge the model toward your desired reasoning strategy. This subtle steering often yields more consistent chains of thought.
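A sketch of how such a prefilled prompt might be assembled (the task and outline steps are illustrative; the model would continue writing from inside the open block):

```python
def guided_thinking_prompt(task, steps):
    # Pre-fill the <thinking> block with the reasoning outline we want
    # the model to follow; generation continues after the last bullet.
    outline = "\n".join(f"- {s}" for s in steps)
    return (
        f"{task}\n\n"
        "<thinking>\n"
        f"{outline}\n"
    )

p = guided_thinking_prompt(
    "Classify this support ticket.",
    ["List the 3 most plausible categories", "Eliminate mismatches", "Pick one"],
)
print(p)
```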
When to Prune CoT
Explicit chain-of-thought increases accuracy but adds latency and cost. Once your prompt structure is solid, experiment with removing or shortening the reasoning step—or switch to a smaller, cheaper model for production.
Context Engineering
Garbage In, Garbage Out
If you feed messy or unstructured data into your prompt, you'll get messy outputs. Investing time in cleaning, normalizing, and structuring inputs pays off immediately in model performance.
Frameworks vs. Raw Control
Libraries like LangChain can jump-start your project but may abstract away key prompt-engineering knobs. When you need fine-grained control over attention strategies or token budgets, drop to raw prompt manipulation. The same applies to the other frameworks available for building LLM applications.
Why BAML
Decouple Logic from Formatting
Instead of crowding your prompt with strict output schemas, let the LLM generate content naturally, then use a BAML translation layer to enforce your desired structure. This two-step approach cuts token waste and reduces format-related errors, and it shrinks the scope of the problem the LLM has to solve.
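The two-step idea can be sketched without BAML itself. The translation layer below is a deliberately minimal stand-in (a regex plus JSON parse); BAML's actual schema-aligned parsing is far more robust than this.

```python
import json
import re

def coerce_to_schema(raw, fields):
    # Minimal stand-in for a translation layer: pull the first JSON-ish
    # object out of free-form model output, then keep only schema fields.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    data = json.loads(match.group(0)) if match else {}
    return {k: data.get(k) for k in fields}

# The model answered naturally, with chatter and an extra field.
raw_output = 'Sure! Here is the result:\n{"name": "Ada", "role": "engineer", "note": "extra"}'
print(coerce_to_schema(raw_output, ["name", "role"]))
# → {'name': 'Ada', 'role': 'engineer'}
```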
Test-Driven Prompt Development
BAML includes a playground and unit-test framework so you can write tests against your prompts—just like you test code. Catch edge cases, add error handlers, and ensure prompt stability before deployment.
Token Visualization Tools
BAML's tokenizer view highlights exactly how many tokens each segment consumes, making it easy to identify wasteful language and optimize your prompts.
Agent Architecture
Modular Run Loop
A typical agent cycle checks for a break condition, prompts the LLM for an action, runs that action in deterministic code, updates context, and repeats. Keeping each step isolated makes your orchestrator code both simpler and more debuggable.
Worth noting that this is how most game loops, and decision loops generally, work; it's a fairly conventional control-flow architecture, not specific to LLMs.
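A minimal sketch of that cycle. Here fake_llm and the lookup tool are toy stand-ins for a real model call and real tools; the structure (break check, LLM decision, deterministic execution, context update) is the point.

```python
def run_agent(llm_step, tools, context, max_turns=10):
    # Modular agent loop: break condition → LLM picks an action →
    # deterministic tool execution → context update → repeat.
    for _ in range(max_turns):
        action, args = llm_step(context)   # LLM decides the next action
        if action == "end_run":            # break condition
            return context
        context = context + [tools[action](args)]  # deterministic code
    return context

def fake_llm(context):
    # Toy policy: look something up once, then stop.
    return ("end_run", None) if context else ("lookup", "order-42")

result = run_agent(fake_llm, {"lookup": lambda q: f"found {q}"}, [])
print(result)  # → ['found order-42']
```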
Structured Action Enums
Defining a closed set of actions—such as end_run, get_login_details, or insufficient_information—avoids the ambiguity of free-form text commands. Your orchestrator simply switches on the enum, making execution predictable.
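A sketch using the action names from above (the handler return values are placeholders; a real orchestrator would call actual functions):

```python
from enum import Enum

class Action(Enum):
    END_RUN = "end_run"
    GET_LOGIN_DETAILS = "get_login_details"
    INSUFFICIENT_INFORMATION = "insufficient_information"

def dispatch(action_name):
    # Parsing into the enum rejects anything outside the closed set:
    # Action("rm -rf /") raises ValueError instead of being executed.
    action = Action(action_name)
    if action is Action.END_RUN:
        return "stopping"
    if action is Action.GET_LOGIN_DETAILS:
        return "asking user for credentials"
    return "requesting clarification"

print(dispatch("end_run"))  # → stopping
```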
Pause, Persist & Resume
Build a persistence layer so your agent can checkpoint its state, wait for human input, and then pick up exactly where it left off. This capability is crucial for long-running or human-in-the-loop processes.
An additional benefit, if you build this right, is the ability to replay runs.
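A bare-bones persistence sketch, assuming the state is JSON-serializable (real agents would also version the state and store it somewhere durable, not a temp file):

```python
import json
import os
import tempfile

def checkpoint(state, path):
    # Persist enough to resume later: where we stopped plus context so far.
    with open(path, "w") as f:
        json.dump(state, f)

def resume(path):
    # Reload the saved state and pick up exactly where we left off.
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "agent_state.json")
state = {"step": 3, "context": ["looked up order", "awaiting human approval"]}
checkpoint(state, path)   # agent pauses here for human input
restored = resume(path)   # later: resume from the checkpoint
print(restored == state)  # → True
```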
Models & Side Notes
Qwen3 & the /nothink Command
Philip from Baseten praised Qwen3's ability to toggle internal reasoning on or off, plus its smaller siblings for speculative decoding. The /nothink magic command disables chain-of-thought to boost speed, making Qwen3 a compelling choice on modest hardware.
Prompt Tones & Bias
Small tweaks like "you'll do a great job" can subtly steer the model's output style or content. Always validate any tone or bias changes against your evaluation pipeline to catch unintended side effects.
Demo 1: Mercoa - AI Powered Invoicing Agent
AI-Powered Invoicing Agent
Mercoa's agent parses invoice PDFs (even smartphone HEIC photos) with Python libraries, then runs a verification loop that matches line-item math against extracted totals. This automated check catches errors before invoices go out.
Temperature and Determinism
Even with the temperature set to 0, results can still vary, because floating-point arithmetic is not associative: parallel reductions on GPUs can sum values in different orders across runs. Groq's custom hardware, however, delivers a much higher degree of determinism.
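A two-line demonstration of the underlying issue: the same numbers summed in a different order produce different floats, which is how identical logits reduced in a different order can occasionally flip an argmax.

```python
# Floating-point addition is not associative.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # → False (0.6000000000000001 vs 0.6)
```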
PDF Extraction
LLMs can act as coding agents here, often using Python libraries to extract information from PDFs and similar formats.
Demo 2: Farhan - browse.dev
AI Browser Prototype
Farhan's browse.dev demo navigates Hacker News, summarizes articles, and writes those summaries directly into Google Docs. It also revealed DOM-parsing quirks—like bullet formatting issues—that highlight the need for AI-native browser functions.
Sensor Fusion in AR -> Browser Automation
Vaibhav explained how, in Computer Vision + SLAM, they merge high-frequency accelerometer data with slower visual frames by pausing one stream to correct it against the slower but more accurate source - an approach you can adapt for agentic error correction and accuracy synchronization. In other words, use screenshots to correct your running understanding of the DOM state.
Automated Booking Flow
In a live demo, the browser logged itself into OpenTable, opened the booking dialog, prompted the user for login information, and completed a reservation live in the browser, demonstrating a full end-to-end action sequence.
Verbs/Actions in Browser UI
My own thoughts: AI-first browsers need to expose functions that can be called. Not literal functions, but verbs such as 'login' or 'reserve_dialog'. UI designers will then need to tag UI elements that are actionable. The basic tech for that already exists: accessibility attributes. And as the tech-savvy start using tools such as Operator and other browser AI tools, pressure to expose those verbs will only grow.
Resources
AI-That-Works Repo
The code we worked through in the camp. It has content, links to videos from past camps and talks. A must read.
12-Factor-Agents Repo
Dex's landmark 12 factor agents, which codified best practices for building agents.
Reasoning Models
A video on reasoning models from the camp's repo.
12-Factor Agents Background
Origins of 12-Factor Principles
Heroku's "12-Factor App" manifesto laid out best practices for building scalable, cloud-native services—everything from strict dependency management to treating logs as event streams. Dex codified similar rules for AI agents, yielding agents that are easier to test, deploy, and scale.
I had to duck out at 4 PM for another commitment, but the camp ran until 7 PM and I regretted missing a single minute. If you ever get a chance to attend a similar LLM engineering workshop, block off your calendar—these deep dives are absolutely worth your time.