Building Agentic Workflows

Llamas have completely changed what programs can do. Previously programs were deterministic and reliable things, but rather narrow in scope: they weren’t able to deal well with ambiguity. This lead to all sorts of standards (yaml, toml…) and a balance between legibility, verbosity, and compiler inference in programming languages (c, cpp, go, rust…).

With llamas, we’re able to make programs that can handle ambiguity in both their inputs and outputs. This is a pretty big shift, since it allows programs to exist in places where they couldn’t have before, notably directly interacting with humans and external environments. To denote this, I’ll be calling them “agentic workflows” as opposed to older programs which are “deterministic workflows”. This is just to emphasize the shift from static machinery to a more general process of getting an output from an (or several) inputs.

To clarify, I’m not talking about vibe coding, that deserves a different blog, I’m talking about a deployed workflow (app, tool, whatever) that’s running without expert interaction. You may vibecode a deterministic or agentic workflow, the difference is the destination not the journey.

Hot Tips

Having made a few of these, here are my hot tips (like hot takes but tips!). These are specific to agentic workflows. Note that many of the tips for deterministic workflows apply to agentic ones too, such as testing, modulation, and clarity. We won’t be covering those here tho.

Centralize your agentic calls

Your workflow should only call agents through a set of functions implemented in a single module. What I mean by this is that you should avoid having direct calls to /chat/completions with requests scattered throughout your code. You should make an llm.py which will export a “response” function or whatever. This allows you to debug any issues with the agentic side of things much faster.

Try to avoid kwargs in this module as well, for a similar reason.

Case study: temperature parameters. One situation I’ve run into several times is to do with temperature parameters. Newer reasoning models tend to not support passing in the temperature, for instance the gpt-5+ series. Having to hunt down stray temperature parameters was no fun. In general I’ve found an uptick in small bugs like this when you don’t centralize the agent calls.

Separate out your prompts

There’s a tendency to put in-line prompts in your code. You should avoid this. Put all your prompts in ./prompts as separate .md files.

Despite feeling like code, prompts are prose and meant to be human-legible. It’s really awkward to read prompts scatted between """ in the middle of code. I find it’s so much easier to debug things and review my prompting when they’re all in a separate directory.

For more advanced prompts (think f""") you can build a simple template engine around them. For instance have {{firstName}} directly in the .md prompt file. This is pretty simple to code up, especially when you’ve already centralized your llm calls as above! If you want to go really advanced, template engines like jinja are there, but I don’t think that’s necessary in most cases.

Case study: llm wiki. I was building a large unstructured dataset being ingested as a wiki, largely inspired by karpathy’s llm wiki, but fully autonomous and as a way to compile unstructured data. Once the workflow for actually generating the wiki was finished, I spent hours doing prompt tuning entirely in the ./prompts directory, reviewing the wording and how that impacted the final wiki. Since all the prompts were centralized, I could quickly pick out which ones were using overly-strong wording for rapid iteration. I didn’t need to touch the code even once.

LangGraph

You need to choose an orchestration framework for your agentic workflow. There are quite a few options, including Agent Development SDK pushed by google, and one of the first ones: LangChain. There are also low-code ones like n8n and zapier, but those are quite obtuse, expensive, and silly for anyone who knows how code works.

All these frameworks make it easier to integrate agentic actions with standard code in your workflow. For very basic things you could just import litellm and call it a day, but when you start dealing with loops and larger orchestration, the clarity sort of falls apart and you end up reinventing the wheel.

I’ve found LangGraph to be the best version of this. It doesn’t go bonkers with abstractions like LangChain but still provides a very nice and clean view of what’s going on. You have to get over the “returning functions from nodes” bit, but after that it’s a very enjoyable experience. Decision nodes and dataflow is super clear, and it scales so well for complicated graphs.

Litellm

Litellm is a proxy and python package whose purpose is essentially unifying provider endpoints. While a lot of providers support /chat/completions, increasingly there is branching out from this.

There are two options for litellm: the python library and the proxy. Generally speaking, the proxy has several advantages: being cross-application/languages, doubling as centralized monitoring, virtual key control, spending limits, and being somewhat easier to configure for multiple concurrent providers/fallbacks. The big drawback is that you need to proxy all your traffic through this… so it can be slower and risk being a single point of failure. The python library only works for python and doesn’t have these other niceties, but doesn’t add another point of failure to your infra.

I’ve never used the python library myself since my vps providers have been sufficiently stable for the proxy option, but I don’t think the library doesn’t have its place either.

Case study: Openai. One app I was building was mainly using Azure Model Foundary as the model provider. I needed an unusual combination: kimi k2.6 for slow reasoning (its tone was preferred) but then gpt-5.4 for no-thinking fast answers. Azure makes this so complicated, since they have Azure Openai and Azure Direct, which are both hosted in azure in the same region and account, but inexplicably have different endpoints. So kimi, which is not an openai model, required a different endpoint from gpt-5.4! Using litellm as a proxy, I was able to unify both and just call them using the same /chat/completions endpoint. All my code had to do was import openai.

Openai themselves has moved away from /chat/completions to the /responses api, so I would not rely on them any longer. Even worse providers include the mess that is vertexai or gemini enterprise or whatever google’s renamed it to this week. Litellm makes it so you don’t need to spend time vendor locking yourself and can just focus on choosing whatever provider works best at the moment.

Prompt debugging

One of the saviours of debugging agentic workflows is having a really good prompt logger. This is something that records the inputs and outputs going to your agents.

The most basic version of this is dumping the input and output of each call into an .md file on disk (and parameters as an adjacent .json). If you’ve followed the first tip and centralized all your agent calls into one area, this is as simple as a with open(... f.write() before and after your actual llm call. For quick debugging for early-stage workflows, this works shockingly well.

Once you have a live app or your workflow is very complex, you may want a more advanced logging system. Langfuse is one of these. I don’t like it but it’s the best one I’ve tried. If you’ve got a better opensource one, please send me an email since they all seem to suck. Remember to disable telemetry if you’re using langfuse.

Whatever you use, it should log full inputs and outputs, as well as the parameters and application doing the call.

Always prefer deterministic code

As it stands, agentic calls are not very reliable and costly. This may continue to change in the future, but you’ll never have llm calls more reliable than hard logic… since hard logic is as reliable as it gets. I also highly doubt calls to llamas will ever be cheaper than normal code.

That said, we’re making agentic workflows for a reason: traditional code cannot deal with “soft” cases well. But you should always start by trying to use hard logic to solve the problem. For instance, if you’re taking in user input to a y/n question, you don’t need to call a llama for that: you need a regex! There’s a tendency to rely too much on llamas when making an agentic workflow, I advise you try to rely on them as little as possible.

The mindset

This is related to the previous point, but this time it’s about you. When writing agentic workflows, I find the best mindset to be in is the same mindset you’re in for traditional workflows. You should not be looking to add calls to llamas or relying on them, you should be looking to get a reliable workflow done.

Inevitably, you’ll find a node in your graph that simply won’t work without an agent involved. That’s okay, but when you’re thinking up the architecture of your workflow, it should still look a lot like a deterministic workflow. Your agents are just special functions within this workflow after all, not something defining it.