Notes from AI That Works Camp by AI Tinkerers at SF
If you are looking to get past the hype and the flood of information out there about working with LLMs, you will find this post quite useful. I would also encourage you to attend more camps like these. It was worth the eight hours I spent.
Dex (author of 12-Factor Agents, which stayed on top of HN for a day), Vaibhav (who's tackled hallucinations in AR at Microsoft and Google and is CEO of Boundary, which created BAML), and Philip from Baseten.co hosted an eight-hour deep dive into LLM engineering this Saturday. They've been solving hard AI problems long before LLMs went mainstream, so their insights were both practical and battle-tested. Below you'll find my favorite takeaways. Bonus: they had a pretty cool office space on the 16th floor with a view of boats cruising under the Bay Bridge.


Categorization at Scale
Using Large Label Sets
Feeding an LLM thousands of categories in one prompt leads to "attention drift," where early labels get overweighted and middle-of-the-pack options get ignored. The result is inconsistent or incorrect tagging. You need to trim that list before you ever show it to the model.
Filtering Strategies?
What would you choose?
- Hierarchy
- LLM Quorum
- Cosine Similarity
Winner - Cosine Similarity Clustering
Encode all your categories as embeddings and use cosine similarity to shortlist, say, the top 100 candidates. This lightweight vector search slashes your prompt size and hands the LLM a manageable list for final selection.
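The shortlisting step can be sketched in a few lines. This is a toy illustration, not a production pipeline: the 3-dimensional vectors and category names below are made up, and a real system would get its embeddings from an embedding model.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def shortlist(query_vec, category_vecs, k=2):
    # Rank all categories by similarity to the input, keep only the top-k
    # to hand to the LLM for final selection.
    scored = sorted(
        category_vecs.items(),
        key=lambda kv: cosine(query_vec, kv[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Toy 3-d "embeddings"; a real pipeline would embed thousands of labels.
cats = {
    "billing": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "returns": [0.2, 0.8, 0.1],
}
print(shortlist([0.15, 0.85, 0.05], cats, k=2))  # → ['shipping', 'returns']
```

With thousands of categories you would swap the linear scan for a vector index, but the prompt-side effect is the same: the LLM sees a short candidate list instead of the full label set.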
Hybrid Local vs. LLM Pipelines
The previous approach is a kind of cascading classifier.
Another approach, which worked for classifying medical insurance codes, is to train or use a small, on-device model to auto-classify the 80% of "easy" cases, then hand the remaining 20%—the real edge cases—to an LLM. This split reduces API costs and speeds up processing by keeping the LLM focused on truly ambiguous inputs.
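A minimal sketch of that routing logic, assuming the local model reports a confidence score. The classifiers and the ICD-style codes below are stand-ins, not real models or real codes.

```python
def local_classify(text):
    # Stand-in for a small on-device model; returns (label, confidence).
    if "knee surgery" in text:
        return "ICD-XX", 0.95  # hypothetical code: an "easy" case
    return "unknown", 0.40

def llm_classify(text):
    # Stand-in for an LLM call that handles the ambiguous tail.
    return "ICD-YY"  # hypothetical code

def classify(text, threshold=0.8):
    # Route confident cases locally; escalate the rest to the LLM.
    label, conf = local_classify(text)
    if conf >= threshold:
        return label, "local"
    return llm_classify(text), "llm"

print(classify("knee surgery follow-up"))  # handled locally
print(classify("ambiguous clinical note"))  # escalated to the LLM
```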
Identifying the Confident Set
- Manual Analysis
- Add simple user-feedback loops to track real-world performance (e.g., buttons whose actions depend on the classification, like "cancel order"). Use those signals to adjust confidence thresholds or retrain your local classifier, so your pipeline keeps improving over time.
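One simple way to turn those feedback signals into threshold adjustments (the update rule and step sizes here are illustrative assumptions, not a recommendation from the camp):

```python
def update_threshold(threshold, feedback, step=0.02, lo=0.5, hi=0.95):
    # feedback: list of (confidence, was_correct) pairs gathered from
    # user actions (e.g., the user accepted or overrode the result).
    wrong_confident = sum(1 for c, ok in feedback if c >= threshold and not ok)
    if wrong_confident:
        # The local model was confidently wrong: tighten the threshold
        # so more cases get escalated to the LLM.
        threshold = min(hi, threshold + step * wrong_confident)
    else:
        # No confident mistakes this batch: relax slightly to keep
        # more traffic on the cheap local path.
        threshold = max(lo, threshold - step)
    return round(threshold, 4)

print(update_threshold(0.8, [(0.9, False), (0.85, False)]))  # → 0.84
print(update_threshold(0.8, [(0.9, True)]))                  # → 0.78
```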
Prompt Engineering
Roll Your Own Prompts
While LLMs can auto-generate prompts, they often churn out verbose, catch-all instructions that waste tokens. Writing prompts yourself lets you leverage domain expertise for precise, context-specific guidance and keeps your costs down.
Keep Identifiers Simple
Long UUIDs and noisy keys gobble tokens and distract the model. Instead, use concise labels like category1, category2, etc. You'll see more accurate outputs and slimmer prompts almost immediately.
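A small remapping layer makes this painless: give the prompt short labels, then translate the model's pick back to the real identifier. The truncated UUIDs below are placeholders for illustration only.

```python
def compact_ids(real_ids):
    # Map long identifiers to short prompt-friendly labels, and back.
    to_short = {rid: f"category{i + 1}" for i, rid in enumerate(real_ids)}
    to_real = {short: rid for rid, short in to_short.items()}
    return to_short, to_real

uuids = ["9f1c4e2a-example-a1", "77b0d9c3-example-b2"]  # illustrative IDs
to_short, to_real = compact_ids(uuids)
# The prompt sees category1/category2; when the model answers
# "category2", to_real maps it back to the original UUID.
print(to_real["category2"])
```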
Inline Few-Shot Examples
Rather than dumping examples at the top, tuck each sample right next to the data it illustrates. This localized approach prevents the model from scrolling through unrelated context and makes few-shot prompting more reliable.
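One way to sketch this layout (the field names and examples are my own illustrations; the point is the placement, each example sitting beside its datum rather than in a block at the top):

```python
def build_prompt(fields):
    # fields: list of (field_name, inline_example, value_to_process).
    lines = []
    for name, example, value in fields:
        # Tuck the example right next to the data it illustrates,
        # instead of one big example section at the top of the prompt.
        lines.append(f"{name} (e.g. {example}): {value}")
    return "\n".join(lines)

prompt = build_prompt([
    ("diagnosis", "sprained ankle", "persistent knee pain"),
    ("severity", "mild", "moderate"),
])
print(prompt)
```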
Example Contamination & Hallucinations
Vaibhav hit a case where, when a patient with an issue similar to the example condition finally showed up, the model copied details from the example into its output. If you're wondering why your LLM is making stuff up, check the examples you're feeding it.
Skip the Obvious
Don't waste tokens describing fields like name in an Employee schema—anyone (and any LLM) can infer that. Let the model use its common-sense understanding and save your prompt budget for the real instructions.
Order Instructions Logically
Experiment with placing either your data or your instructions first, depending on which yields clearer reasoning. You can also break complex tasks into intermediate steps—e.g., list probable categories, then pick the final one—to guide the model's workflow.
Prioritization Mantra
Borrowing from aviate/navigate/communicate, try "make it run, make it right, make it fast, make it cheap." Tackle features in that sequence—or swap "fast" and "cheap" depending on your project phase—to stay focused and avoid premature optimization.
Reasoning & Chain of Thought

<thinking> Blocks
Wrapping a <thinking> section in your prompt gives the model a clear space to outline its reasoning steps. That transparency makes it much easier to debug flawed logic paths and understand where the model is getting stuck.
Guided Reasoning Steps
You can pre-fill part of the <thinking> block with bullet points or partial logic to nudge the model toward your desired reasoning strategy. This subtle steering often yields more consistent chains of thought.
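A sketch of how such a prefilled prompt might be assembled (the task and outline steps are illustrative; the model would continue writing from inside the open block):

```python
def guided_thinking_prompt(task, steps):
    # Pre-fill the <thinking> block with the reasoning outline we want
    # the model to follow; generation continues after the last bullet.
    outline = "\n".join(f"- {s}" for s in steps)
    return (
        f"{task}\n\n"
        "<thinking>\n"
        f"{outline}\n"
    )

p = guided_thinking_prompt(
    "Classify this support ticket.",
    ["List the 3 most plausible categories", "Eliminate mismatches", "Pick one"],
)
print(p)
```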
When to Prune CoT
Explicit chain-of-thought increases accuracy but adds latency and cost. Once your prompt structure is solid, experiment with removing or shortening the reasoning step—or switch to a smaller, cheaper model for production.
Context Engineering
Garbage In, Garbage Out
If you feed messy or unstructured data into your prompt, you'll get messy outputs. Investing time in cleaning, normalizing, and structuring inputs pays off immediately in model performance.
Frameworks vs. Raw Control
Libraries like LangChain can jump-start your project but may abstract away key prompt-engineering knobs. When you need fine-grained control over attention strategies or token budgets, drop to raw prompt manipulation. The same applies to the other frameworks available for building LLM applications.
Why BAML
Decouple Logic from Formatting
Instead of crowding your prompt with strict output schemas, let the LLM generate content naturally, then use a BAML translation layer to enforce your desired structure. This two-step approach cuts token waste and reduces format-related errors, and it shrinks the scope of the problem the LLM has to solve.
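The two-step idea can be sketched without BAML itself. The translation layer below is a deliberately minimal stand-in (a regex plus JSON parse); BAML's actual schema-aligned parsing is far more robust than this.

```python
import json
import re

def coerce_to_schema(raw, fields):
    # Minimal stand-in for a translation layer: pull the first JSON-ish
    # object out of free-form model output, then keep only schema fields.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    data = json.loads(match.group(0)) if match else {}
    return {k: data.get(k) for k in fields}

# The model answered naturally, with chatter and an extra field.
raw_output = 'Sure! Here is the result:\n{"name": "Ada", "role": "engineer", "note": "extra"}'
print(coerce_to_schema(raw_output, ["name", "role"]))
# → {'name': 'Ada', 'role': 'engineer'}
```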
Test-Driven Prompt Development
BAML includes a playground and unit-test framework so you can write tests against your prompts—just like you test code. Catch edge cases, add error handlers, and ensure prompt stability before deployment.
Token Visualization Tools
BAML's tokenizer view highlights exactly how many tokens each segment consumes, making it easy to identify wasteful language and optimize your prompts.
Agent Architecture
Modular Run Loop
A typical agent cycle checks for a break condition, prompts the LLM for an action, runs that action in deterministic code, updates context, and repeats. Keeping each step isolated makes your orchestrator code both simpler and more debuggable.
Worth noting that this is how most game loops, and decision loops generally, work; it's a fairly conventional control-flow architecture, not specific to LLMs.
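A minimal sketch of that cycle. Here fake_llm and the lookup tool are toy stand-ins for a real model call and real tools; the structure (break check, LLM decision, deterministic execution, context update) is the point.

```python
def run_agent(llm_step, tools, context, max_turns=10):
    # Modular agent loop: break condition → LLM picks an action →
    # deterministic tool execution → context update → repeat.
    for _ in range(max_turns):
        action, args = llm_step(context)   # LLM decides the next action
        if action == "end_run":            # break condition
            return context
        context = context + [tools[action](args)]  # deterministic code
    return context

def fake_llm(context):
    # Toy policy: look something up once, then stop.
    return ("end_run", None) if context else ("lookup", "order-42")

result = run_agent(fake_llm, {"lookup": lambda q: f"found {q}"}, [])
print(result)  # → ['found order-42']
```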
Structured Action Enums
Defining a closed set of actions—such as end_run, get_login_details, or insufficient_information—avoids the ambiguity of free-form text commands. Your orchestrator simply switches on the enum, making execution predictable.
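A sketch using the action names from above (the handler return values are placeholders; a real orchestrator would call actual functions):

```python
from enum import Enum

class Action(Enum):
    END_RUN = "end_run"
    GET_LOGIN_DETAILS = "get_login_details"
    INSUFFICIENT_INFORMATION = "insufficient_information"

def dispatch(action_name):
    # Parsing into the enum rejects anything outside the closed set:
    # Action("rm -rf /") raises ValueError instead of being executed.
    action = Action(action_name)
    if action is Action.END_RUN:
        return "stopping"
    if action is Action.GET_LOGIN_DETAILS:
        return "asking user for credentials"
    return "requesting clarification"

print(dispatch("end_run"))  # → stopping
```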
Pause, Persist & Resume
Build a persistence layer so your agent can checkpoint its state, wait for human input, and then pick up exactly where it left off. This capability is crucial for long-running or human-in-the-loop processes.
An additional benefit, if you build this right, is the ability to replay runs.
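A bare-bones persistence sketch, assuming the state is JSON-serializable (real agents would also version the state and store it somewhere durable, not a temp file):

```python
import json
import os
import tempfile

def checkpoint(state, path):
    # Persist enough to resume later: where we stopped plus context so far.
    with open(path, "w") as f:
        json.dump(state, f)

def resume(path):
    # Reload the saved state and pick up exactly where we left off.
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "agent_state.json")
state = {"step": 3, "context": ["looked up order", "awaiting human approval"]}
checkpoint(state, path)   # agent pauses here for human input
restored = resume(path)   # later: resume from the checkpoint
print(restored == state)  # → True
```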
Models & Side Notes
Qwen3 & the /nothink Command
Philip from Baseten praised Qwen3's ability to toggle internal reasoning on or off, plus its smaller siblings for speculative decoding. The /nothink magic command disables chain-of-thought to boost speed, making Qwen3 a compelling choice on modest hardware.
Prompt Tones & Bias
Small tweaks like "you'll do a great job" can subtly steer the model's output style or content. Always validate any tone or bias changes against your evaluation pipeline to catch unintended side effects.
Demo 1: Mercoa - AI Powered Invoicing Agent
AI-Powered Invoicing Agent
Mercoa's agent parses invoice PDFs (even smartphone HEIC photos) with Python libraries, then runs a verification loop that matches line-item math against extracted totals. This automated check catches errors before invoices go out.
Temperature and Determinism
Even with the temperature set to 0, results can still vary, because floating-point arithmetic is not associative: parallel reductions on GPUs can sum values in different orders across runs. Groq's custom hardware, however, delivers a much higher degree of determinism.
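A two-line demonstration of the underlying issue: the same numbers summed in a different order produce different floats, which is how identical logits reduced in a different order can occasionally flip an argmax.

```python
# Floating-point addition is not associative.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # → False (0.6000000000000001 vs 0.6)
```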
PDF Extraction
LLMs can act as coding agents here, often using Python libraries to extract information from PDFs and similar formats.
Demo 2: Farhan - browse.dev
AI Browser Prototype
Farhan's browse.dev demo navigates Hacker News, summarizes articles, and writes those summaries directly into Google Docs. It also revealed DOM-parsing quirks—like bullet formatting issues—that highlight the need for AI-native browser functions.
Sensor Fusion in AR -> Browser Automation
Vaibhav explained how, in Computer Vision + SLAM, they merge high-frequency accelerometer data with slower visual frames by pausing one stream to correct it against the slower but more accurate source - an approach you can adapt for agentic error correction and accuracy synchronization. In other words, use screenshots to correct your running understanding of the DOM state.
Automated Booking Flow
In a live demo, the browser logged itself into OpenTable, opened the booking dialog, prompted the user for login information, and completed a reservation live in the browser, demonstrating a full end-to-end action sequence.
Verbs/Actions in Browser UI
My own thoughts: AI-first browsers need to expose functions that can be called. Not literal functions, but verbs such as 'login' or 'reserve_dialog'. UI designers will then need to tag UI elements that are actionable. The basic tech for that already exists: accessibility attributes. And as the tech-savvy start using tools such as Operator and other browser AI tools, pressure to expose those verbs will only grow.
Resources
AI-That-Works Repo
The code we worked through in the camp. It has content, links to videos from past camps and talks. A must read.
12-Factor-Agents Repo
Dex's landmark 12 factor agents, which codified best practices for building agents.
Reasoning Models
A video on reasoning models from the camp's repo.
12-Factor Agents Background
Origins of 12-Factor Principles
Heroku's "12-Factor App" manifesto laid out best practices for building scalable, cloud-native services—everything from strict dependency management to treating logs as event streams. Dex codified similar rules for AI agents, yielding agents that are easier to test, deploy, and scale.
I had to duck out at 4 PM for another commitment, but the camp ran until 7 PM and I regretted missing a single minute. If you ever get a chance to attend a similar LLM engineering workshop, block off your calendar—these deep dives are absolutely worth your time.