Building AI things with Cloudflare

I recently found myself staring at a giant legal document. Not because I was in any trouble, but because a client asked me to. They wanted to build a documentation site around it, and I thought, “it would be nice to just chat with this thing.” But that would be hard, right? LLMs, RAG, agents, vector databases, and other buzzwords I can’t think of right now. Too much hassle just for an experiment. Right?
Wrong. We could just use Cloudflare. With Workers, KV, Vectorize, and an open-source framework called Hono, we built a globally available API that lets people ask questions in natural language and get meaningful answers.
Here’s how it happened.
Why Cloudflare?
You have options when it comes to semantic search. But I wanted something that was:
- Simple to set up: no complicated infrastructure
- Fast: no one likes to wait
- Global: available wherever users are
Cloudflare just made sense:
- Workers are serverless functions that run near your users, with fast startup times and zero server management
- KV is a distributed key-value store that lets me persist documents and cache answers across all Cloudflare locations
- Vectorize does the heavy lifting of semantic search by comparing vector embeddings
Together, these gave us a solid, scalable backend with minimal hassle.
The stack
Hono
Hono is an open-source framework created by a developer who works at Cloudflare, but it’s not a Cloudflare product. It’s lightweight, easy to pick up, and built for serverless runtimes like Workers. If you’ve used Express before, you’ll feel right at home.
AI SDK
The AI SDK by Vercel abstracts away the complexities of talking to models like Gemini. It makes generating natural language responses with TypeScript a breeze.
Learning the hard way
At first, I tried the naive approach: dump the entire legal document into the prompt every time a question came in. That worked… until it didn’t, and I realized how many tokens I was burning.
Turns out, feeding a giant document to a model on every request isn’t scalable. Shocker, I know.
With some TypeScript and regex magic, I broke the document into logical chunks, each tagged with metadata like titles and URLs. Then, I generated vector embeddings for each chunk and stored them in Vectorize, while the original text and metadata were saved in KV.
This meant I could:
- Search by semantic similarity to find relevant chunks
- Feed only the most relevant chunks to the model for context
- Keep costs and token usage in check
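The chunking step can be sketched roughly like this. The heading-based regex and the slug-derived URLs are assumptions for illustration; the real document's structure (and therefore the splitting logic) may differ:

```typescript
// One chunk of the document, with the metadata we store alongside it.
export type Chunk = {
  id: string;
  title: string;
  url: string;
  text: string;
};

// Turn a heading into a URL fragment, e.g. "Liability Terms" -> "liability-terms".
const slugify = (s: string) =>
  s.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/(^-|-$)/g, "");

export function chunkDocument(doc: string, baseUrl: string): Chunk[] {
  // Split on markdown-style headings, keeping each heading with its section.
  const sections = doc.split(/^(?=#{1,3}\s)/m).filter((s) => s.trim());
  return sections.map((section, i) => {
    const match = section.match(/^#{1,3}\s+(.+)/);
    const title = match ? match[1].trim() : `Section ${i + 1}`;
    return {
      id: `chunk-${i}`,
      title,
      url: `${baseUrl}#${slugify(title)}`,
      text: section.trim(),
    };
  });
}
```

Each resulting chunk's text gets embedded and pushed to Vectorize under its `id`, while the full object lands in KV under the same key, so a vector match can always be traced back to its source text and link.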
How queries work
When you ask a question, the API first checks if it’s seen it before by hashing the query and looking for a cached answer in KV.
If there’s a cache hit, it returns immediately.
If not, it:
- Converts your question into a vector embedding
- Searches Vectorize for the most relevant chunks
- Pulls those chunks from KV
- Sends them with your question to Gemini via the AI SDK
- Gets back a concise, context-aware answer
- Stores the answer in KV for next time
All of this happens within a single Worker request running near the user, so the experience feels smooth.
The secret sauce
Caching isn’t just about speed. It’s about saving money and making your app more reliable.
Here’s how I use KV:
- I hash every user query into a cache key
- If there’s a cached response, I return it immediately with no extra API calls
- If not, I generate a new response and store it with a short expiration
- Since KV is global, cached answers are available anywhere
It’s a simple way to make the API fast, cost-effective, and scalable.
A tiny, embeddable frontend
To interact with the content, people need an interface for asking questions. I like to keep things simple, so I built a tiny, portable frontend with Preact.
Why Preact?
- It’s lightweight and fast
- It requires no build step — just a single <script type="module"> you can drop anywhere
- Perfect for embedding on any webpage without adding bloat
This minimal UI handles user input, shows loading states, and renders answers by talking directly to the API. Bingo, bango, bongo.
What I learned
- Platforms like Cloudflare let you build powerful APIs with minimal overhead
- Vectorize is a great way to add semantic search without managing your own vector DB
- Chunking large documents is crucial to keep LLM costs manageable
- Caching with KV is a simple but powerful way to speed up responses and reduce load
- Small, zero-build frontends make demos and integrations effortless
What’s next?
There’s still plenty to improve. Better chunking, tweaking system prompts, cleaner ranking logic. All of it aimed at making responses faster, more accurate, and more useful.
I’m also experimenting with adapting the API into an MCP server. But that’s a whole other post.
Need something like this?
We are a digital product agency that helps companies turn AI ideas into real, working products.
If you have a dense dataset, a complex document, or a rough concept you want to bring to life, we can build it. Fast, scalable, and meaningful.