The value in AI isn’t in the model. It’s in what your system learns over time.
Every agency I speak to is doing something with AI. Most of them are doing the same thing: taking a powerful model, wrapping an interface around it, and calling it a product. It works, up to a point. But the value lives in the model, not in the wrapper. And the model belongs to someone else.
What we’re building at Temper is different. Not because we’re using better models (we’re using many of the same ones) but because of where we’ve chosen to put the intelligence.
The wrapper problem
When you build on top of a foundation model without adding anything that compounds, you have a product that is only as good as the model underneath it. The moment the model improves, your competitors get the same upgrade you do. The moment a new model arrives, anyone can rebuild what you built in a few weeks.
The question worth asking is: what does your system know after a year of use that it didn’t know on day one? If the answer is nothing, you’ve built a wrapper.
Where the value actually accumulates
The systems that become genuinely hard to replicate are the ones that get smarter through use, not because the underlying model improves, but because the system itself is accumulating something proprietary.
For us, that something is editorial judgement. The pattern of decisions made by real knowledge managers at real organisations, over time, about what good knowledge management looks like in innovation and public sector contexts.
Here’s a concrete example. When an editor publishes a new article on a Temper Knowledge site, the system doesn’t just index it. It asks what the article changes. If the site has theme pages (landing pages that represent the organisation’s current thinking on a subject), the system compares what the new article says against what those theme pages currently say, and suggests updates where the new content meaningfully moves the thinking forward.
The editor reviews the suggestion. They accept it, modify it, or reject it. That decision is signal. The delta between what the system suggested and what the human chose encodes something a generic model will never have: an understanding of what this specific organisation’s content should sound like, and what threshold of new evidence actually warrants changing a theme page versus just adding a new article.
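That delta can be captured as a structured record. The sketch below is a plain-Python illustration, not our actual schema: the names `EditorDecision` and `suggestion_delta` are hypothetical, and the similarity measure is a stand-in for whatever a real pipeline would use.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class EditorDecision:
    """One editor review of a system suggestion: the unit of signal."""
    theme_page_id: str
    suggested_text: str   # what the system proposed
    final_text: str       # what the editor actually published
    action: str           # "accept" | "modify" | "reject"

def suggestion_delta(decision: EditorDecision) -> float:
    """How far the editor moved from the suggestion.
    0.0 = accepted verbatim, 1.0 = rejected outright."""
    if decision.action == "reject":
        return 1.0
    ratio = SequenceMatcher(
        None, decision.suggested_text, decision.final_text
    ).ratio()
    return 1.0 - ratio

# An accepted suggestion carries zero delta; a heavy edit carries a large one.
accepted = EditorDecision("themes/net-zero", "Updated summary.",
                          "Updated summary.", "accept")
edited = EditorDecision("themes/net-zero", "Updated summary.",
                        "A substantially different framing.", "modify")
assert suggestion_delta(accepted) == 0.0
assert suggestion_delta(edited) > suggestion_delta(accepted)
```

Accumulated over months, records like these are exactly the organisation-specific signal a generic model lacks.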
A second example, which we’re building into the Gemini Framework for the Digital Twin Hub: the guided questions that help users navigate into a complex knowledge estate. Rather than a static set of entry points curated once by an editor, the questions update as the content changes. New research gets published, a gap gets filled, a theme develops: the system recognises that the questions it was surfacing six months ago are no longer the right ones, and suggests replacements that reflect what the estate can now actually answer.
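One piece of that refresh logic is a staleness check: a question whose supporting articles have mostly been superseded is probably no longer the right entry point. A minimal sketch, with a hypothetical structure and an illustrative 50% coverage threshold (the real signal would also account for new content the questions don’t yet cover):

```python
from dataclasses import dataclass

@dataclass
class GuidedQuestion:
    text: str
    supporting_articles: set  # article ids the answer currently draws on

def stale_questions(questions, live_article_ids):
    """Flag questions whose supporting evidence has thinned: fewer than
    half the articles they draw on are still part of the estate."""
    flagged = []
    for q in questions:
        still_live = q.supporting_articles & live_article_ids
        coverage = len(still_live) / max(len(q.supporting_articles), 1)
        if coverage < 0.5:
            flagged.append(q.text)
    return flagged

questions = [
    GuidedQuestion("What is a digital twin?", {"a1", "a2"}),
    GuidedQuestion("How do twins share data?", {"a3", "a4"}),
]
live = {"a1", "a2"}  # a3 and a4 have been superseded
assert stale_questions(questions, live) == ["How do twins share data?"]
```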
Both of these are the same thing expressed differently. The system is continuously reconciling what users see first with the actual state of the knowledge underneath. And every time a human makes a decision about one of its suggestions, it learns something.
The stack, and why we’ve chosen it
We use Weaviate as our vector database. It sits at the centre of the knowledge layer — storing not just content but the semantic relationships between content, which is what makes retrieval meaningful rather than just fast.
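The idea behind that design, stripped to stdlib Python (this is not Weaviate’s API, and the documents, vectors, and theme links are invented for illustration): store each item with its embedding and its relationships, rank by semantic similarity, then follow the links so a match brings its context with it.

```python
import math

# Toy "embeddings": in production a model produces these; here they are
# hand-made so the example is self-contained.
docs = {
    "article:flood-sensors": ([0.9, 0.1, 0.0], {"theme:resilience"}),
    "article:grant-funding": ([0.0, 0.2, 0.9], {"theme:innovation"}),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query_vec):
    """Rank by meaning, then return the linked theme pages too: the
    relationships, not just the match, make the result useful."""
    ranked = sorted(docs.items(),
                    key=lambda kv: cosine(query_vec, kv[1][0]),
                    reverse=True)
    best_id, (_vec, themes) = ranked[0]
    return best_id, themes

best, themes = retrieve([1.0, 0.0, 0.0])  # a query "about" flooding
assert best == "article:flood-sensors"
assert "theme:resilience" in themes
```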
We use DSPy to manage our prompt logic. Most teams write prompts as strings, which means they’re hard to version, hard to test, and hard to improve systematically. DSPy turns prompt logic into code. That matters for us because the suggestions the system makes — theme page updates, question recommendations, contradiction flags — need to improve over time based on real feedback. DSPy gives us a path to doing that systematically.
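The difference between a prompt-as-string and prompt-logic-as-code is easiest to see in miniature. This is a plain-Python sketch of the idea, not DSPy’s actual API (DSPy expresses it with Signatures and modules); the class and fields here are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ThemeUpdatePrompt:
    """Prompt logic as a versioned, testable unit rather than a loose
    string: it can be diffed, unit-tested, and improved against feedback."""
    version: str
    instruction: str

    def render(self, theme_text: str, article_text: str) -> str:
        return (f"{self.instruction}\n\n"
                f"THEME PAGE:\n{theme_text}\n\n"
                f"NEW ARTICLE:\n{article_text}")

v1 = ThemeUpdatePrompt(
    version="1.0",
    instruction="Suggest updates only where the article moves the thinking forward.",
)
prompt = v1.render("Current position on X.", "New evidence about X.")
assert v1.version == "1.0"
assert "THEME PAGE:" in prompt and "NEW ARTICLE:" in prompt
```

Because the prompt is an object with a version, a change to the instruction is a reviewable diff, and a regression in suggestion quality can be traced to it.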
We use LangGraph to orchestrate the more complex agent workflows: the decisions about when to draw from the knowledge base, when to reach outside it, when to surface a suggestion versus stay silent. Orchestration is where systems become brittle if the decision logic isn’t designed carefully. LangGraph gives us the control we need over those decision points.
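Those decision points reduce to explicit routing logic. A plain-Python sketch (LangGraph expresses the same thing as nodes with conditional edges; the confidence threshold here is illustrative, not a real parameter of ours):

```python
def route(kb_confidence: float, allow_external: bool) -> str:
    """Decide, per query, whether to answer from the knowledge base,
    reach outside it, or stay silent."""
    if kb_confidence >= 0.75:
        return "answer_from_kb"
    if allow_external:
        return "search_external"
    # Staying silent is a deliberate branch: a wrong answer is worse
    # than no answer.
    return "stay_silent"

assert route(0.9, allow_external=False) == "answer_from_kb"
assert route(0.3, allow_external=True) == "search_external"
assert route(0.3, allow_external=False) == "stay_silent"
```

Making every branch explicit, including the "do nothing" branch, is what keeps the orchestration from becoming brittle as the workflows grow.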
For model inference, we use a combination of AWS Bedrock for hosted inference and open source models — Mistral, Qwen, Gemma — via Ollama for cases where we want the option to fine-tune on domain-specific data. The open source choice isn’t about cost. It’s about ownership. A fine-tuned model trained on accumulated editorial decisions from Temper Knowledge clients is an asset. A hosted API call is an expense.
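Turning accumulated editorial decisions into that asset means exporting them as supervised training pairs. A hedged sketch, assuming a simplified decision record (the schema is invented here, and the exact fine-tuning format depends on the model and tooling):

```python
import json

# Hypothetical review records, as a real pipeline might accumulate them.
decisions = [
    {"suggested": "Theme summary v1.", "final": "Theme summary v1.", "action": "accept"},
    {"suggested": "Draft update.", "final": "Editor's rewrite.", "action": "modify"},
    {"suggested": "Spurious change.", "final": "", "action": "reject"},
]

def to_training_rows(decisions):
    """Turn accepted and modified reviews into prompt/completion pairs.
    Rejected suggestions are excluded here; a fuller pipeline would use
    them as negative signal."""
    return [
        {"prompt": d["suggested"], "completion": d["final"]}
        for d in decisions
        if d["action"] in ("accept", "modify")
    ]

rows = to_training_rows(decisions)
jsonl = "\n".join(json.dumps(r) for r in rows)  # one example per line
assert len(rows) == 2
assert rows[1]["completion"] == "Editor's rewrite."
```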
What this means in practice
A client who has been running on Temper Knowledge for two years has something a new client doesn’t. Their system has learned their editorial judgement. The suggestions it makes are calibrated to how their organisation thinks, what their content should do, and what good looks like in their specific context.
That’s not lock-in by friction. It’s switching cost built on genuine value. The platform has become something the client helped build, whether they thought of it that way or not.
The model is a component. The intelligence is in the system. And the system gets better because real people are using it to make real decisions about real knowledge, every day.
That’s what we’re building. And it’s why the architecture matters.