Artificial intelligence (AI) is high on the agenda in WebDev's day-to-day work. But AI also has its limitations: large language models (LLMs) are a constant source of discussion, for example around data protection and reliability (keyword: hallucinations). That is one more reason why we designed and built our own Retrieval Augmented Generation (RAG) system in Engineering.
At the denkwerk TechTalk, Tobias Kaulfuß (Lead Software Developer) and Karam Amara (Senior Software Developer) explained how they approached building denkwerk's own RAG.
What exactly is RAG?
Retrieval Augmented Generation is a method for feeding an artificial intelligence (AI), i.e. a large language model (LLM), with data from external sources: the generating part of the AI is supplemented by a search (retrieval) across those data sources. Our two developers Tobias and Karam built their own RAG for denkwerk with great enthusiasm.
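Reduced to code, the idea boils down to three steps. Here is a framework-independent sketch; the Retriever and Generator interfaces are illustrative placeholders, not part of any real library:

```java
import java.util.List;

// Illustrative placeholders, not a real library: something that searches the
// external data sources, and something that wraps the language model.
interface Retriever { List<String> findRelevant(String question); }
interface Generator { String answer(String prompt); }

class RagPipeline {
    private final Retriever retriever;
    private final Generator generator;

    RagPipeline(Retriever retriever, Generator generator) {
        this.retriever = retriever;
        this.generator = generator;
    }

    String ask(String question) {
        // 1. Retrieval: find passages related to the question in the external data.
        List<String> context = retriever.findRelevant(question);
        // 2. Augmentation: put the retrieved passages into the prompt.
        String prompt = "Context:\n" + String.join("\n", context) + "\n\nQuestion: " + question;
        // 3. Generation: the LLM answers on the basis of that context.
        return generator.answer(prompt);
    }
}
```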
The ingredients for the dw-RAG
Karam whetted the appetite for more and satisfied the hunger for knowledge with plenty of detail. He presented the ingredients for the RAG: first, there is proprietary data, i.e. data that was not available to the AI at training time and that the AI cannot access on its own. Second, you of course need the AI itself. And last but not least, the connecting ingredient: an application that orchestrates the whole thing. We specifically selected these components (a wiring sketch follows after the list):
VectorStore/ChromaDB (Retrieval)
Ollama (Generation)
SpringBoot/Spring AI application (Augmentation)
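In a Spring Boot application, these three ingredients map roughly onto beans that the Spring AI starters can auto-configure. A minimal wiring sketch, assuming the Ollama and Chroma starters are on the classpath (the class name DwRagService is illustrative):

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service
public class DwRagService {

    private final ChatClient chatClient;   // Generation: backed by an Ollama chat model
    private final VectorStore vectorStore; // Retrieval: backed by ChromaDB

    // With the Ollama and Chroma starters on the classpath, Spring AI
    // auto-configures a ChatClient.Builder and a VectorStore bean.
    public DwRagService(ChatClient.Builder chatClientBuilder, VectorStore vectorStore) {
        this.chatClient = chatClientBuilder.build();
        this.vectorStore = vectorStore;
    }
}
```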
The advantages of the existing components
First, our two experts looked at the underlying AI technologies. The focus was on Ollama, a runtime environment for various open-source LLMs. It allows different models to be run and provides an API through which language models can be flexibly integrated into applications. During development, LLMs can thus be used directly on local machines. Besides Ollama, other LLM providers such as OpenAI can also be connected; Spring AI provides an abstracted interface over the various LLM APIs.
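Thanks to that abstraction, application code talks to a generic ChatModel rather than to a specific provider. A hedged sketch (the class SummaryService and its prompt are illustrative; the exact fluent API varies slightly between Spring AI versions):

```java
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.stereotype.Service;

@Service
public class SummaryService {

    // The concrete bean behind this interface is decided by configuration:
    // during development it can be a local Ollama model, elsewhere an
    // OpenAI (or other) model, without changing this class.
    private final ChatModel chatModel;

    public SummaryService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String summarize(String text) {
        // ChatModel offers a simple String-in/String-out convenience call.
        return chatModel.call("Summarize the following text in two sentences:\n" + text);
    }
}
```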
A vector store is used for semantic search in large volumes of text. With the help of embedding models from Ollama, texts are converted into high-dimensional vectors and stored in ChromaDB. This vectorization makes it possible to efficiently find contextually similar content in the vector store for a given user prompt, so relevant information can be extracted precisely even from large amounts of data.
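With Spring AI's VectorStore abstraction, storing and searching such embeddings takes only a few lines; the embedding model is applied behind the scenes when documents are added. A minimal sketch with made-up example texts:

```java
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;

import java.util.List;

public class VectorStoreDemo {

    // Adds two example texts and runs a semantic search against them.
    public static List<Document> demo(VectorStore vectorStore) {
        // On add(), the configured embedding model (here: an Ollama embedding model)
        // turns each text into a high-dimensional vector that ChromaDB stores.
        vectorStore.add(List.of(
                new Document("Example text about topic A ..."),
                new Document("Example text about topic B ...")));

        // The query is embedded the same way; the store returns the most similar chunks.
        return vectorStore.similaritySearch("question about topic A");
    }
}
```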
The major advantages of RAG for companies can be derived from these two components alone, especially compared with a stand-alone AI:
Resource and data sovereignty
Data protection
Lower latency
Local hosting
Polyvalence/customizability
Data sovereignty and data protection in particular are values that are under pressure in the age of AI. But where does the greater reliability mentioned above come from? The answer lies in the proprietary data and in the way it is processed. After all, the orchestrating application still had to be built, and what sounds nice and simple in theory posed some real challenges during development.
"VectorStore/ChromaDB is available off-the-shelf, Ollama is available off-the-shelf, but the data from our chosen database did not come out normalized, the API did not work."
Karam Amara, Senior Software Developer at denkwerk
Heroes for the infrastructure: Hand in hand with the admins
As is so often the case, our system administrators were there to help in times of need. They provided the infrastructure, so Ollama could be installed on a server with two models: the first model translates text into vectors for the database (Ollama model for embeddings), the second one powers the chatbot (Ollama model for generation). You heard right: the knowledge query naturally includes a prompt-based search, just as conventional chatbots offer it. The application server, which hosts the various services for the front end, the VectorStore and, of course, the back end, was also set up as part of the RAG infrastructure.
A hero for the backend: SpringBoot/Spring AI application
SpringBoot and Spring AI, the ecosystem's still young AI framework, became the core of the project. Spring AI addresses the fundamental challenge of AI integration: connecting company data and APIs with AI models. Tobias sums up why it became our denkwerkers' favorite:
“The Spring AI Framework is a fairly young component in the Spring ecosystem that is currently undergoing rapid development. This project has given us the opportunity to familiarize ourselves intensively with the possibilities of Spring AI. As we rely heavily on Spring in our customer projects, we can use it to seamlessly integrate AI functionality into our business applications.”
Tobias Kaulfuß, Lead Software Developer at denkwerk
Spring AI was used to connect Ollama and ChromaDB. Ollama itself is interchangeable; alternatives include Claude or OpenAI. The database is interchangeable as well, which once again illustrates the polyvalence and customizability of RAG.
And: no answers without data
We remember: the data from the selected (internal) database was not normalized and the Ollama API did not work ... Here we could rely on our admins again, who were able to export the data. What remained was the hard work: the data had to be prepared and stored in the database. To do this, it was first split into so-called “tokens/chunks”; relevant data is easiest to retrieve when the chunks roughly match the size of the prompts they are compared against. The creators were thus faced with a huge mountain of text data that had to be broken up and processed as strings.
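A hedged sketch of what such an ingestion pipeline can look like with Spring AI's TokenTextSplitter; the raw text input is a placeholder for the exported data, and the splitter defaults may need tuning to hit the desired chunk size:

```java
import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;

import java.util.List;

public class IngestionPipeline {

    private final VectorStore vectorStore;
    private final TokenTextSplitter splitter = new TokenTextSplitter(); // default chunk sizes, tune as needed

    public IngestionPipeline(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    // rawTexts stands in for the exported, normalized strings from the internal database.
    public void ingest(List<String> rawTexts) {
        // Wrap each exported string as a Document ...
        List<Document> documents = rawTexts.stream().map(text -> new Document(text)).toList();
        // ... split it into token-based chunks sized for retrieval ...
        List<Document> chunks = splitter.apply(documents);
        // ... and embed and store the chunks in ChromaDB via the VectorStore.
        vectorStore.add(chunks);
    }
}
```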
“RAG stands and falls with the data ... Normalizing data and bringing it into the pipeline for embedding is the actual task of RAG.”
Karam Amara, Senior Software Developer at denkwerk
The next step involved a series of instructions; after all, the chatbot has to find the stored strings again. This is the retrieval part of RAG. Normally the framework makes the query on its own, but it can be fine-tuned through prompt engineering. Tobias and his colleagues therefore gave the model context and background information from the dw data source so that it finds reliable and relevant information, and the user is told when it cannot answer a question. This largely prevents the hallucinations that are unfortunately still very common with chatbots on the market.
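Building on the wiring sketched earlier, such a guardrail can be expressed as a system prompt around the retrieved context. A minimal sketch with illustrative prompt wording (Spring AI also ships a ready-made QuestionAnswerAdvisor for exactly this pattern):

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;

import java.util.List;
import java.util.stream.Collectors;

public class RagQueryService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public RagQueryService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder.build();
        this.vectorStore = vectorStore;
    }

    public String ask(String question) {
        // Retrieval: semantically similar chunks from the dw data source.
        List<Document> hits = vectorStore.similaritySearch(question);
        String context = hits.stream()
                .map(Document::getText) // getContent() in older Spring AI versions
                .collect(Collectors.joining("\n---\n"));

        // Augmentation plus guardrail: the model may only answer from the given
        // context and must say so when the context does not contain the answer.
        return chatClient.prompt()
                .system("""
                        Answer only on the basis of the following context.
                        If the context does not contain the answer, say that you cannot answer the question.

                        Context:
                        """ + context)
                .user(question)
                .call()
                .content();
    }
}
```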
"Mainly text, mainly strings, super little code, logic [...] a few parameters for connection that the framework provides us with."
Tobias Kaulfuß, Lead Software Developer at denkwerk
A wonderful understatement from Tobias Kaulfuß at the TechTalk. Without the expertise, the innovative solutions and, above all, the evaluation of the systems on the market, the dw-RAG would not have come into being. And of course, the future business applications of the tool were the subject of a lively exchange after the TechTalk ...