Building AI things with Cloudflare

I recently found myself staring at a giant legal document. Not because I was in any trouble, but because a client asked me to. They wanted to build a documentation site around it, and I thought, “it would be nice to just chat with this thing.” But that would be hard, right? LLMs, RAG, agents, vector databases, and other buzzwords I can’t think of right now. Too much hassle just for an experiment. Right?
Wrong. We could just use Cloudflare. With Workers, KV, Vectorize, and an open-source framework called Hono, we built a globally available API that lets people ask questions in natural language and get meaningful answers.
Here’s how it happened.
Why Cloudflare?
You have options when it comes to semantic search. But I wanted something that was:
- Simple to set up: no complicated infrastructure
- Fast: no one likes to wait
- Global: available wherever users are
Cloudflare just made sense:
- Workers are serverless functions that run near your users, with fast startup times and zero server management
- KV is a distributed key-value store that lets me persist documents and cache answers across all Cloudflare locations
- Vectorize does the heavy lifting of semantic search by comparing vector embeddings
Together, these gave us a solid, scalable backend with minimal hassle.
The stack
Hono
Hono is an open-source framework created by a developer who works at Cloudflare, but it’s not a Cloudflare product. It’s lightweight, easy to pick up, and built for serverless runtimes like Workers. If you’ve used Express before, you’ll feel right at home.
AI SDK
The AI SDK by Vercel abstracts away the complexities of talking to models like Gemini. It makes generating natural language responses with TypeScript a breeze.
Learning the hard way
At first, I tried the naive approach: dump the entire legal document into the prompt every time a question came in. That worked… until it didn’t, and I realized how many tokens I was burning.
Turns out, feeding a giant document to a model on every request isn’t scalable. Shocker, I know.
With some TypeScript and regex magic, I broke the document into logical chunks, each tagged with metadata like titles and URLs. Then, I generated vector embeddings for each chunk and stored them in Vectorize, while the original text and metadata were saved in KV.
This meant I could:
- Search by semantic similarity to find relevant chunks
- Feed only the most relevant chunks to the model for context
- Keep costs and token usage in check
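The chunking step can be sketched roughly like this. The heading-based regex and the slug-derived URLs are assumptions for illustration; the real document's structure (and therefore the splitting logic) may differ:

```typescript
// One chunk of the document, with the metadata we store alongside it.
export type Chunk = {
  id: string;
  title: string;
  url: string;
  text: string;
};

// Turn a heading into a URL fragment, e.g. "Liability Terms" -> "liability-terms".
const slugify = (s: string) =>
  s.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/(^-|-$)/g, "");

export function chunkDocument(doc: string, baseUrl: string): Chunk[] {
  // Split on markdown-style headings, keeping each heading with its section.
  const sections = doc.split(/^(?=#{1,3}\s)/m).filter((s) => s.trim());
  return sections.map((section, i) => {
    const match = section.match(/^#{1,3}\s+(.+)/);
    const title = match ? match[1].trim() : `Section ${i + 1}`;
    return {
      id: `chunk-${i}`,
      title,
      url: `${baseUrl}#${slugify(title)}`,
      text: section.trim(),
    };
  });
}
```

Each resulting chunk's text gets embedded and pushed to Vectorize under its `id`, while the full object lands in KV under the same key, so a vector match can always be traced back to its source text and link.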
How queries work
When you ask a question, the API first checks if it’s seen it before by hashing the query and looking for a cached answer in KV.
If there’s a cache hit, it returns immediately.
If not, it:
- Converts your question into a vector embedding
- Searches Vectorize for the most relevant chunks
- Pulls those chunks from KV
- Sends them with your question to Gemini via the AI SDK
- Gets back a concise, context-aware answer
- Stores the answer in KV for next time
All of this happens within a single Worker request running near the user, so the experience feels smooth.
The secret sauce
Caching isn’t just about speed. It’s about saving money and making your app more reliable.
Here’s how I use KV:
- I hash every user query into a cache key
- If there’s a cached response, I return it immediately with no extra API calls
- If not, I generate a new response and store it with a short expiration
- Since KV is global, cached answers are available anywhere
It’s a simple way to make the API fast, cost-effective, and scalable.
A tiny, embeddable frontend
To interact with the content, people need an interface for asking questions. I like to keep things simple, so I built a tiny, portable frontend with Preact.
Why Preact?
- It’s lightweight and fast
- It requires no build step — just a single <script type="module"> you can drop anywhere
- Perfect for embedding on any webpage without adding bloat
This minimal UI handles user input, shows loading states, and renders answers by talking directly to the API. Bingo, bango, bongo.
What I learned
- Platforms like Cloudflare let you build powerful APIs with minimal overhead
- Vectorize is a great way to add semantic search without managing your own vector DB
- Chunking large documents is crucial to keep LLM costs manageable
- Caching with KV is a simple but powerful way to speed up responses and reduce load
- Small, zero-build frontends make demos and integrations effortless
What’s next?
There’s still plenty to improve. Better chunking, tweaking system prompts, cleaner ranking logic. All of it aimed at making responses faster, more accurate, and more useful.
I’m also experimenting with adapting the API into an MCP server. But that’s a whole other post.
Need something like this?
We are a digital product agency that helps companies turn AI ideas into real, working products.
If you have a dense dataset, a complex document, or a rough concept you want to bring to life, we can build it. Fast, scalable, and meaningful.