Hey {{first name | there}},
Over the weekend, I fell down another rabbit hole trying to make sense of the parts of artificial intelligence that would make a difference in my day-to-day and, career-wise, earn me some bonus points.
And so my topic of choice was inference.

I rolled my eyes a couple more times than I should have; however, I think there are some interesting concepts that can help you thrive should your company decide to "go all in on AI."
Housekeeping:
To make sure you don’t miss future emails, here are two quick GIFs showing how to move this email to your Primary tab and add this address to your contacts.


What is inferencing?
Inferencing is not a new concept, but it has been gaining ground as more companies rush to offer "AI-powered experiences." Inferencing is exactly what powers those experiences: the process of taking a trained model and running new data through it to get a useful output.
Every time you ask ChatGPT a question, every time a product recommends something to you, every time a support chatbot responds, that is inference.
Training is where a model learns patterns from massive datasets. Inference is where it puts that learning to work. Training happens once (or periodically). Inference happens every single time a user interacts with the product.
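To make the split concrete, here is a minimal, dependency-free sketch. The "model" is just a single learned weight for a toy line fit; everything here (the function names, the numbers) is illustrative, not any real library's API.

```python
# Toy illustration: training happens once, inference runs on every request.

def train(examples):
    # Training: learn a parameter from data. Here, a trivial least-squares
    # fit for y = w * x, to keep the sketch dependency-free.
    num = sum(x * y for x, y in examples)
    den = sum(x * x for x, _ in examples)
    return num / den  # the learned weight

def infer(weight, x):
    # Inference: apply the already-trained parameter to new input.
    return weight * x

# Train once (or periodically)...
w = train([(1, 2), (2, 4), (3, 6)])

# ...then run inference on every user interaction.
print(infer(w, 10))  # → 20.0
```

Real systems swap the toy math for a neural network and the function call for a model server, but the shape is the same: expensive learning up front, cheap-ish application of the result on every request.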
Why it matters
Open source models have gotten very good. DeepSeek, Llama, Mistral, Qwen, and others are now competitive with proprietary offerings for a growing number of use cases.
As models get better, a key differentiator becomes how well you can serve them. The way I see it, agentic AI is driving inference volume up significantly. Agents do not make a single API call and stop. They walk through multi-step tasks, which means more tokens per interaction. If your company is building AI features, the discussion eventually changes from "which model do we use?" to "how do we serve this without burning through our budget?"
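A quick back-of-envelope run makes the budget point tangible. Every number below is an assumption I picked for illustration (token counts, price, step count), not a benchmark or any provider's actual pricing.

```python
# Back-of-envelope sketch: single chat call vs. multi-step agent loop.
# All constants are hypothetical assumptions for illustration only.

TOKENS_PER_CALL = 1_500       # assumed prompt + completion size
PRICE_PER_1K_TOKENS = 0.002   # assumed blended $/1K tokens

def daily_cost(calls_per_day, tokens_per_call=TOKENS_PER_CALL):
    # Total tokens per day, priced per 1K tokens.
    return calls_per_day * tokens_per_call / 1_000 * PRICE_PER_1K_TOKENS

chat = daily_cost(10_000)       # one model call per interaction
agent = daily_cost(10_000 * 8)  # an assumed 8-step agent loop per interaction

print(f"chat: ${chat:.2f}/day, agent: ${agent:.2f}/day")
```

Same user traffic, eight times the inference bill — which is why serving efficiency starts to matter as soon as agents enter the picture.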
Where this fits in (Infrastructure wise)
A quick API call to OpenAI or Anthropic works fine for prototyping and low-volume use cases. But for real-time applications with strict latency requirements, or high-volume workloads where you are making thousands of inference calls per minute, that approach starts to break down, and you need dedicated inference infrastructure.
While researching over the weekend, I came across a post from Jimmy Song that put it well. He frames AI inference as essentially retracing the path that cloud native microservices already walked, except the underlying compute shifted from CPU to GPU. The core requirements are the same things Kubernetes has spent a decade solving: elasticity for traffic spikes, low latency for response times, cost control for expensive resources, canary releases for frequent model iterations, and multi-tenancy for different teams sharing clusters.
For teams already running cloud native infrastructure, inference fits naturally into what you are already managing.
What you can do about it
Well, it's one thing to talk about a topic, but the bigger question is how you even approach it: learning about it, and ultimately building projects that demonstrate your knowledge of it.
If you find this interesting enough, we might just do a series with more actionable steps and projects.
Until next time.
Jubril Oyentunji
Chief Technology Officer, EverythingDevOps
