Building Agentic Workflows
Llamas have completely changed what programs can do. Previously programs were deterministic and reliable things, but rather narrow in scope: they weren’t able to deal well with ambiguity. This lead to all sorts of standards (yaml, toml…) and a balance between legibility, verbosity, and compiler inference in programming languages (c, cpp, go, rust…).
With llamas, we’re able to make programs that can handle ambiguity in both their inputs and outputs. This is a pretty big shift, since it allows programs to exist in places where they couldn’t have before, notably directly interacting with humans and external environments. To denote this, I’ll be calling them “agentic workflows” as opposed to older programs which are “deterministic workflows”. This is just to emphasize the shift from static machinery to a more general process of getting an output from an (or several) inputs.
To clarify, I’m not talking about vibe coding, that deserves a different blog, I’m talking about a deployed workflow (app, tool, whatever) that’s running without expert interaction. You may vibecode a deterministic or agentic workflow, the difference is the destination not the journey.
Hot Tips
Having made a few of these, here are my hot tips (like hot takes but tips!). These are specific to agentic workflows. Note that many of the tips for deterministic workflows apply to agentic ones too, such as testing, modulation, and clarity. We won’t be covering those here tho.
Centralize your agentic calls
Your workflow should only call agents through a set of functions implemented in
a single module. What I mean by this is that you should avoid having direct
calls to /chat/completions with requests scattered throughout your code. You
should make an llm.py which will export a “response” function or whatever.
This allows you to debug any issues with the agentic side of things much faster.
Try to avoid kwargs in this module as well, for a similar reason.
Case study: temperature parameters. One situation I’ve run into several times is to do with temperature parameters. Newer reasoning models tend to not support passing in the temperature, for instance the gpt-5+ series. Having to hunt down stray temperature parameters was no fun. In general I’ve found an uptick in small bugs like this when you don’t centralize the agent calls.
Separate out your prompts
There’s a tendency to put in-line prompts in your code. You should avoid this.
Put all your prompts in ./prompts as separate .md files.
Despite feeling like code, prompts are prose and meant to be human-legible. It’s really awkward to read prompts scatted between """ in the middle of code. I find it’s so much easier to debug things and review my prompting when they’re all in a separate directory.
For more advanced prompts (think f""") you can build a simple template engine
around them. For instance have {{firstName}} directly in the .md prompt
file. This is pretty simple to code up, especially when you’ve already
centralized your llm calls as above! If you want to go really advanced, template
engines like jinja are there, but I don’t think that’s necessary in most cases.
Case study: llm wiki. I was building a large unstructured dataset being ingested
as a wiki, largely inspired by karpathy’s llm
wiki, but
fully autonomous and as a way to compile unstructured data. Once the workflow
for actually generating the wiki was finished, I spent hours doing prompt tuning
entirely in the ./prompts directory, reviewing the wording and how that
impacted the final wiki. Since all the prompts were centralized, I could quickly
pick out which ones were using overly-strong wording for rapid iteration. I
didn’t need to touch the code even once.
LangGraph
You need to choose an orchestration framework for your agentic workflow. There are quite a few options, including Agent Development SDK pushed by google, and one of the first ones: LangChain. There are also low-code ones like n8n and zapier, but those are quite obtuse, expensive, and silly for anyone who knows how code works.
All these frameworks make it easier to integrate agentic actions with standard
code in your workflow. For very basic things you could just import litellm and
call it a day, but when you start dealing with loops and larger orchestration,
the clarity sort of falls apart and you end up reinventing the wheel.
I’ve found LangGraph to be the best version of this. It doesn’t go bonkers with abstractions like LangChain but still provides a very nice and clean view of what’s going on. You have to get over the “returning functions from nodes” bit, but after that it’s a very enjoyable experience. Decision nodes and dataflow is super clear, and it scales so well for complicated graphs.
Litellm
Litellm is a proxy and python package whose purpose is essentially unifying
provider endpoints. While a lot of providers support /chat/completions,
increasingly there is branching out from this.
There are two options for litellm: the python library and the proxy. Generally speaking, the proxy has several advantages: being cross-application/languages, doubling as centralized monitoring, virtual key control, spending limits, and being somewhat easier to configure for multiple concurrent providers/fallbacks. The big drawback is that you need to proxy all your traffic through this… so it can be slower and risk being a single point of failure. The python library only works for python and doesn’t have these other niceties, but doesn’t add another point of failure to your infra.
I’ve never used the python library myself since my vps providers have been sufficiently stable for the proxy option, but I don’t think the library doesn’t have its place either.
Case study: Openai. One app I was building was mainly using Azure Model
Foundary as the model provider. I needed an unusual combination: kimi k2.6 for
slow reasoning (its tone was preferred) but then gpt-5.4 for no-thinking fast
answers. Azure makes this so complicated, since they have Azure Openai and Azure
Direct, which are both hosted in azure in the same region and account, but
inexplicably have different endpoints. So kimi, which is not an openai model,
required a different endpoint from gpt-5.4! Using litellm as a proxy, I was able
to unify both and just call them using the same /chat/completions endpoint.
All my code had to do was import openai.
Openai themselves has moved away from /chat/completions to the /responses
api, so I would not rely on them any longer. Even worse providers include the
mess that is vertexai or gemini enterprise or whatever google’s renamed it to
this week. Litellm makes it so you don’t need to spend time vendor locking
yourself and can just focus on choosing whatever provider works best at the
moment.
Prompt debugging
One of the saviours of debugging agentic workflows is having a really good prompt logger. This is something that records the inputs and outputs going to your agents.
The most basic version of this is dumping the input and output of each call into
an .md file on disk (and parameters as an adjacent .json). If you’ve
followed the first tip and centralized all your agent calls into one area, this
is as simple as a with open(... f.write() before and after your actual llm
call. For quick debugging for early-stage workflows, this works shockingly well.
Once you have a live app or your workflow is very complex, you may want a more advanced logging system. Langfuse is one of these. I don’t like it but it’s the best one I’ve tried. If you’ve got a better opensource one, please send me an email since they all seem to suck. Remember to disable telemetry if you’re using langfuse.
Whatever you use, it should log full inputs and outputs, as well as the parameters and application doing the call.
Always prefer deterministic code
As it stands, agentic calls are not very reliable and costly. This may continue to change in the future, but you’ll never have llm calls more reliable than hard logic… since hard logic is as reliable as it gets. I also highly doubt calls to llamas will ever be cheaper than normal code.
That said, we’re making agentic workflows for a reason: traditional code cannot deal with “soft” cases well. But you should always start by trying to use hard logic to solve the problem. For instance, if you’re taking in user input to a y/n question, you don’t need to call a llama for that: you need a regex! There’s a tendency to rely too much on llamas when making an agentic workflow, I advise you try to rely on them as little as possible.
The mindset
This is related to the previous point, but this time it’s about you. When writing agentic workflows, I find the best mindset to be in is the same mindset you’re in for traditional workflows. You should not be looking to add calls to llamas or relying on them, you should be looking to get a reliable workflow done.
Inevitably, you’ll find a node in your graph that simply won’t work without an agent involved. That’s okay, but when you’re thinking up the architecture of your workflow, it should still look a lot like a deterministic workflow. Your agents are just special functions within this workflow after all, not something defining it.