Building an LLM with a retrieval-augmented generation stack, locally

Tijs van der Velden
9 min read · Aug 11, 2024


When you want to leverage the power of large language models on your personal data, without sharing a single byte!

As a data scientist and data engineer, I find the generative AI hype to be a bittersweet phenomenon. Yes, it creates awareness for data-driven solutions, and there is an obvious overlap with the data science domain, but the amount of misinformation being pushed by self-proclaimed gurus on LinkedIn is absolutely unbearable. Still, the resulting unrealistic expectations can be managed; the privacy and data-related risks of rushing these solutions into organizations are far more difficult to address once it's too late.

Because of this, I wanted to create a proof of concept in which I could leverage the power of the latest Llama 3.1 model and add my personal documents as context without going through the hassle of fine-tuning, all of that on my own MacBook Pro without sharing a single byte. (It was either that, or manually sorting my messy administration. ;-))

Adding new information to an LLM

Before we can dive into the nitty-gritty, let's get an understanding of the most popular ways we can actually add new information to an LLM. (Note that I'm consciously avoiding the term "new knowledge", as I believe one should refrain from using humanizing language in the context of LLMs, as it would contribute to the incorrect framing of these models.)

Fine-tuning

Fine-tuning involves continuing the training of an LLM on a specific dataset that is representative of the new knowledge you want the model to learn. This method adjusts the pre-existing weights of the model to better reflect the new information.

Advantages:

  • Enables the model to specialize in specific domains or topics.
  • Retains the general knowledge learned during pre-training while integrating new, specialized information.

Challenges:

  • Requires substantial computational resources.
  • Risk of “catastrophic forgetting,” where the model might lose some of its previously learned knowledge.
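
As a rough illustration of what fine-tuning can look like in practice, here is a minimal sketch using Hugging Face's peft library to attach LoRA adapters to a small causal language model. This is not part of the stack built below; the model name and hyperparameters are illustrative assumptions.

# Minimal sketch (not used in this post): wrap a small causal LM with LoRA
# adapters via Hugging Face peft, so only a fraction of the weights are trained.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the adapter updates
    target_modules=["c_attn"],  # attention projection layer in GPT-2
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base weights

# From here you would run a normal training loop (e.g. transformers.Trainer)
# on your domain-specific dataset.

Only the adapter weights are trained, which keeps the resource requirements more manageable than full fine-tuning, but you still need a representative dataset and a training pipeline.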

Prompt Engineering and Few-Shot Learning

This method involves crafting prompts that guide the LLM to generate responses that incorporate new knowledge without altering the model itself. Few-shot learning uses examples within the prompt to provide context or introduce new concepts.

Advantages:

  • No need for model retraining.
  • Can be done with any LLM without the need for additional resources.

Challenges:

  • The model is not actually “learning” in the traditional sense but is instead being guided.
  • May not always be reliable for complex or highly specialized knowledge.
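
To make this concrete, here is a minimal sketch of few-shot prompting using the same ollama Python client used later in this post; the task and examples are made up for illustration.

# Minimal sketch: few-shot prompting with the ollama client. The examples
# embedded in the prompt steer the model without changing any weights.
import ollama

few_shot_prompt = """Classify the sentiment of the last review as positive or negative.

Review: "The delivery was fast and the product works great." -> positive
Review: "Broke after two days and support never replied." -> negative
Review: "Setup was painless and it does exactly what I need." ->"""

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response["message"]["content"])

The model is guided purely by the examples in the prompt; nothing about the model itself changes.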

Retrieval-Augmented Generation (RAG)

This technique involves integrating an external knowledge base (e.g., a database, document store, or search engine) that the LLM can query during inference to retrieve relevant information and include it in its responses.

Advantages:

  • Allows the model to access the most up-to-date information without retraining.
  • Reduces the need for the model to store all knowledge internally, which can be inefficient.

Challenges:

  • Requires an additional system to manage and query the external knowledge base.
  • Latency and complexity can increase due to the need to retrieve and integrate information in real-time.
  • As with prompt engineering, the model isn’t actually learning.

With fine-tuning under some scrutiny in terms of its effectiveness, and prompt engineering not being viable in a situation where you don't want to search for the extra context yourself, I decided to use RAG.

The RAG process flow

The data scientist in me was slightly disappointed when learning about the actual RAG process, as I found it a rather crude solution, very sensitive to traditional data engineering pitfalls in terms of data quality. It also doesn't actually improve the model and is better described as a more sophisticated form of prompt engineering. Still, it does provide a clear benefit, so let's get crackin'!

The diagram describes two separate processes:

Offline process

Here we prepare the vector database with our own data. This can be a periodic job that runs as frequently as new data needs to be incorporated into the RAG process.

Main process

Here we go from prompt to response, adding relevant context using RAG along the way.

Processes for RAG

With our processes clear we can define our ingredient list. As mentioned, the main requirement is that this whole stack should run locally, to prevent any form of data leakage to external entities.

Ingredients:

  • Postgres with the pgvector extension (our vector store)
  • ollama running the Llama 3.1 model (our local LLM)
  • The sentence-transformers/all-MiniLM-L6-v2 model via Hugging Face transformers (our embedding model)
  • Streamlit with streamlit_chat (our chat interface)
  • Python glue: psycopg2, pgvector, PyPDF2 and pysmb

Note: I’m aware that with libraries like langchain you can have a one stop shop where a lot of this is abstracted away, but that takes away some of the transparency of truly understanding what is going on, and where is the fun in that? It goes without saying that this code is very rough around the edges, just to illustrate how to get everything running for educational purposes.

Make sure you have Postgres and ollama running and have the model of your preference pulled. I am assuming you know how to do this, as well as how to run a Streamlit environment locally. If not, there are numerous guides explaining how to do this, so let Google (or AI) be your little helper.
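
If you want a quick sanity check that both services are reachable, a minimal sketch (connection details are placeholders) could look like this:

# Quick optional check: confirm that ollama and Postgres are reachable before
# building anything on top of them.
import ollama
import psycopg2

print(ollama.list())  # the pulled model (e.g. llama3.1) should show up here

conn = psycopg2.connect('host=localhost dbname=postgres user=xxx')
conn.close()
print("Postgres is reachable")

Next, we can set up our database and get pgvector going.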

import psycopg2
from pgvector.psycopg2 import register_vector
from psycopg2 import sql

DB_NAME = "vectorstore"

#connect to the default database first so we can create our own
conn = psycopg2.connect('host=localhost dbname=postgres user=xxx')
conn.autocommit = True
cursor = conn.cursor()

#create the database if it does not exist
cursor.execute("SELECT 1 FROM pg_database WHERE datname=%s", (DB_NAME,))
exists = cursor.fetchone()

if not exists:
    cursor.execute(sql.SQL("CREATE DATABASE {}").format(sql.Identifier(DB_NAME)))

cursor.close()
conn.close()

#reconnect to the new database
conn = psycopg2.connect(f'host=localhost dbname={DB_NAME} user=xxx')
cursor = conn.cursor()

#install pgvector
cursor.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()

#register the vector type with psycopg2
register_vector(conn)

#create table to store embeddings and metadata
table_create_command = """
CREATE TABLE IF NOT EXISTS text_embeddings (
    id SERIAL PRIMARY KEY,
    text TEXT,
    embedding VECTOR(384)
)
"""

cursor.execute(table_create_command)
conn.commit()
cursor.close()

The most important thing to take into account here is that the vector dimension (384) must match the dimension of the embeddings we'll create later: all-MiniLM-L6-v2 produces 384-dimensional vectors.
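
If you want to double check, here is a quick sketch that embeds a test string with the same model used below and prints the resulting dimension:

# Quick check: the embedding dimension must match the VECTOR(384) column above.
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("dimension check", return_tensors="pt")
outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1).detach().numpy()[0]
print(embedding.shape)  # (384,)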

With that out of the way, we can start scraping our documents. I'm running a local SMB server on my NAS, so I used the pysmb library in Python. You can also just read the files from your local machine if needed.

import psycopg2
from transformers import AutoTokenizer, AutoModel
import os
from smb.SMBConnection import SMBConnection
import PyPDF2
import io
import re

#create connection and clean table
conn = psycopg2.connect('host=localhost dbname=vectorstore user=xxx')
cursor = conn.cursor()

cursor.execute("TRUNCATE TABLE text_embeddings")
conn.commit()


#load transformer model for embedding
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def extract_text_from_pdf(pdf_bytes):
    text = ""
    try:
        reader = PyPDF2.PdfReader(io.BytesIO(pdf_bytes))
        for page in reader.pages:
            text += page.extract_text()
    except Exception as e:
        print(f"Error reading PDF: {e}")
    return clean_text(text)

def clean_text(text):
    text = re.sub(r'\s+', ' ', text)  #normalize whitespace
    text = re.sub(r'[^a-zA-Z0-9\s\.,]', '', text)  #keep only alphanumerics, spaces, dots and commas
    #of course in a real scenario, more cleaning is needed!
    return text

def chunk_text(text, max_length=512):
    #tokenize without truncation so the full document gets chunked, then slice
    #into pieces that fit the embedding model's limit (512 tokens for MiniLM)
    tokens = tokenizer(text, return_tensors='pt', truncation=False, padding=False)
    chunks = []
    for i in range(0, len(tokens['input_ids'][0]), max_length):
        chunk = tokenizer.decode(tokens['input_ids'][0][i:i+max_length], skip_special_tokens=True)
        chunks.append(chunk)
    return chunks

def get_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1).detach().numpy()
    return embeddings[0]

def store_embedding(text):
    chunks = chunk_text(text)

    for chunk in chunks:
        embedding = get_embedding(chunk)

        sql_query = """
        INSERT INTO text_embeddings (text, embedding)
        VALUES (%s, %s)
        """

        cursor.execute(sql_query, (chunk, embedding.tolist()))

    conn.commit()

def read_pdfs_from_smb_share(server_name, share_name, path, username, password, server_ip):
    conn = SMBConnection(username, password, my_name="llmpreprocessor", remote_name=server_name, use_ntlm_v2=True)
    assert conn.connect(server_ip, 139)

    pdf_texts = []

    def traverse_smb_folder(folder_path):
        shared_files = conn.listPath(share_name, folder_path)
        for shared_file in shared_files:
            if shared_file.filename not in ['.', '..']:
                file_path = os.path.join(folder_path, shared_file.filename)
                if shared_file.isDirectory:
                    traverse_smb_folder(file_path)
                elif shared_file.filename.lower().endswith('.pdf'):
                    print(f"Processing file: {file_path}")
                    file_obj = io.BytesIO()
                    conn.retrieveFile(share_name, file_path, file_obj)
                    file_obj.seek(0)
                    pdf_bytes = file_obj.read()
                    text = extract_text_from_pdf(pdf_bytes)
                    pdf_texts.append(text)

    traverse_smb_folder(path)
    conn.close()
    return pdf_texts


#read pdfs from smb share
server_name = "NAS"
server_ip = "192.168.xx.xx"
share_name = "xxx"
path = "/xxx/"
username = "xxx"
password = "xxx"

pdf_texts = read_pdfs_from_smb_share(server_name, share_name, path, username, password, server_ip)

for text in pdf_texts:
    store_embedding(text)

cursor.close()
conn.close()
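
Optionally, here is a quick sketch (reusing the placeholder connection string) to verify the ingestion run actually stored something:

# Optional: verify the ingestion run stored embeddings.
import psycopg2

conn = psycopg2.connect('host=localhost dbname=vectorstore user=xxx')
cursor = conn.cursor()
cursor.execute("SELECT count(*) FROM text_embeddings")
print(f"Stored chunks: {cursor.fetchone()[0]}")
cursor.close()
conn.close()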

So with our data scraped, cleaned (well, sort of ;-) ), chunked and embedded, we're ready to implement the actual RAG process. Note that you should play around with k (the number of retrieved chunks) to find the best balance between the relevance of the context and how much of it you stuff into the prompt. There is also a lot of value in perfecting the prompt template itself.
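
Here is a rough sketch for eyeballing the retrieval side: run a test query for a few values of k and print the cosine distances of the retrieved chunks (the test query and connection details are placeholders):

# Rough sketch: inspect what the similarity search returns for different k.
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector
from transformers import AutoTokenizer, AutoModel

conn = psycopg2.connect('host=localhost dbname=vectorstore user=xxx')
cursor = conn.cursor()
register_vector(conn)

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def get_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).detach().numpy()[0]

query = "What was my WOZ waarde in 2022?"  #placeholder test query
query_vec = np.array(get_embedding(query))

for k in (3, 5, 10):
    cursor.execute(
        """SELECT left(text, 80), (embedding <=> %s) AS distance
           FROM text_embeddings ORDER BY distance ASC LIMIT %s""",
        (query_vec, k),
    )
    print(f"--- k={k} ---")
    for snippet, distance in cursor.fetchall():
        print(f"{distance:.3f}  {snippet}")

With retrieval behaving sensibly, the chat application itself looks like this: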

import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector
from transformers import AutoTokenizer, AutoModel
import ollama
import streamlit as st
from streamlit_chat import message

conn = psycopg2.connect('host=localhost dbname=vectorstore user=xxx')
cursor = conn.cursor()
register_vector(conn)

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


def page():
    if len(st.session_state) == 0:
        st.session_state["messages"] = []

    st.session_state["ingestion_spinner"] = st.empty()

    display_messages()
    st.text_input("Message", key="user_input", on_change=process_input)


def process_input():
    if st.session_state["user_input"] and len(st.session_state["user_input"].strip()) > 0:
        user_text = st.session_state["user_input"].strip()
        with st.session_state["thinking_spinner"], st.spinner("Thinking"):
            relevant_texts = fetch_relevant_texts(user_text)
            agent_text = generate_response(user_text, relevant_texts)
        st.session_state["messages"].append((user_text, True))
        st.session_state["messages"].append((agent_text['message']['content'], False))

def fetch_relevant_texts(query, top_k=5):
    #register_vector lets us pass the numpy array directly as a pgvector value
    query_embedding = np.array(get_embedding(query))
    sql_query = """
    SELECT text, embedding, (embedding <=> %s) AS distance
    FROM text_embeddings
    ORDER BY distance ASC
    LIMIT %s;
    """
    cursor.execute(sql_query, (query_embedding, top_k))
    results = cursor.fetchall()
    print(results)
    return results

def get_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1).detach().numpy()
    return embeddings[0]

def generate_response(query, relevant_texts):
    SYS_PROMPT = """You are an assistant for answering questions.
    You are given the extracted parts of a long document and a question. Provide a conversational answer.
    If you don't know the answer, just say "I do not know." Don't make up an answer."""

    PROMPT = f"Question:{query}\nContext:"
    for text, _, _ in relevant_texts:
        PROMPT += f"{text}\n"

    messages = [{"role": "system", "content": SYS_PROMPT}, {"role": "user", "content": PROMPT}]
    output = ollama.chat(
        model="llama3.1",
        messages=messages,
    )

    return output


def display_messages():
    st.subheader("Tijs's Personal Assistant")
    for i, (msg, is_user) in enumerate(st.session_state["messages"]):
        message(msg, is_user=is_user, key=str(i))
    st.session_state["thinking_spinner"] = st.empty()


if __name__ == "__main__":
    page()

When running our application, you can see that it's able to answer my prompt, even from Dutch documents!

Getting answers to questions that can only be answered with extra context

Not a bad result for less than an hour of work, with no risk of my data ending up on the streets. Now I never have to go through several folders to find what my “WOZ Waarde” was in 2022! ;-)

Takeaways

As shown, it’s very simple to build a solution similar to this, and it’s a great way to gain a basic understanding of how the infrastructure works. However, we should also understand that RAG is just an automated fancy way of prompt engineering. The model isn’t learning anything new; we just automated the context searching process, and this also exposes the weakness of a solution like this. The search process, although pretty nifty using embeddings and cosine similarity, is prone to all basic data quality problems you encounter in classic data engineering/analytics/science projects. Scraped PDF text tends to be very messy, with unexpected line breaks, white spaces and other encodings resulting of formatting. Besides that, the dataset itself should be reviewed thoroughly as irrelevant and duplicate data will severely reduce the similarity search accuracy. So, as always, it’s not a fix for lazy data governance, and there is no free lunch.

That said, I do believe it's a viable solution if you want to set up an infrastructure where you incorporate your own documents/data to use with large language models, without having to upload or share your data with another company, especially when fine-tuning is not an option. However, I still feel that the first step should (and probably always will) be getting your data quality under control, because RAG is no get-out-of-jail-free card!
