Secure AI Bot for Private Data

Devesh Paragiri
Python in Plain English
9 min read · Sep 19, 2023

Make your personal AI with Langchain + LLaMA 2 + Flask

LLMs (Large Language Models) are currently the center of attention in the AI community. With the advent of GPT-4, LLMs have become so mainstream that developers are integrating these models into all kinds of applications. While traditional LLMs are great for most use cases, they tend to fall short when you want to use them out of the box with private data. You can integrate your private data with GPT-4 through its APIs, but that is not the best idea since you don't want sensitive data being sent to third-party servers.

Tackling the security issue

The ONLY way you can be sure that your data is securely processed is to make use of open-source LLMs which give you total control and flexibility.

What are the advantages?

  1. You can get away with using even a 7B-parameter model with proper prompt engineering and fine-tuning
  2. Lower-parameter LLMs are computationally faster, and quantizing the model makes inference even more efficient, leading to lower operating costs
  3. You have end-to-end control over the entire pipeline, so the data stays on YOUR servers
  4. The open-source nature makes it easy to customize as your requirements grow

Contents

  1. Creating a simple front-end chat interface with 🐍 Flask
  2. Downloading the 🦙 LLM
  3. Collating and processing private data with 🦜🔗 Langchain
  4. 🦜🔗 Langchain Retrieval QA object for vectorDB similarity search
  5. Creating a custom prompt template

Step 1: Creating a simple front-end chat interface with 🐍 Flask

🐍 Flask interface

For our user interface, we will use Flask to create a simple web app. There will only be one dynamic page where you interact with the LLM. To get started, go ahead and clone this GitHub repo for the entire code. I suggest you only use the code for the front end and write everything else yourself based on your needs!

The code is very simple since we only use the front-end for getting the user query and returning the result. If you want to understand more about the formatting of the response, check out the main.js file under the static/ directory.

# Main script
from flask import Flask, render_template, request
from utils import setup_dbqa

app = Flask(__name__)


@app.route("/")
def index():
    return render_template("index.html")


@app.route("/get", methods=["GET", "POST"])
def chat():
    msg = request.form["msg"]
    try:
        return get_chat_response(msg)
    except ValueError:
        return "You have exceeded the token limit! Sorry for the inconvenience!"


# Gets the response by passing the prompt to the QA object
def get_chat_response(query):
    response = dbqa({"query": query})
    return response["result"]


if __name__ == "__main__":
    dbqa = setup_dbqa()  # This is the RetrievalQA object
    app.run(debug=True, port=2000)

Step 2: Downloading the LLM

The LLM of choice for this use case will be LLaMA-2's 7B parameter chat model. This is a good choice for inference over a generalized set of private data. Feel free to explore other LLMs which you think might suit your use case better.

For instance, LLaMA fine-tuned models like Vicuna are a good option if you have a lot of instruction-based tasks, and Koala is a good fit for dialogue-based tasks. If you want to know more about which LLM fits your needs best, check out this LLM Index from Sapling.ai.

In this blog, we will be performing CPU-based inference. To do so, we will use the GGML format of the LLM, since it significantly increases computational efficiency through various optimization techniques (primarily quantization).

What is GGML? Quantization?

GGML is a tensor library written in C that supports integer quantization (and also ships training optimizers such as ADAM and L-BFGS), enabling efficient LLM inference in a CPU-based environment.

The main optimization lies in quantization, where the model weights, which are floating-point numbers, are compressed to 4-bit or 8-bit integer formats. This reduces the precision of the weights, which slightly hurts output quality, but drastically improves efficiency since RAM and disk usage are significantly lower. If you want to know more about quantization, check out this page.
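
To build intuition for what quantization does, here is a minimal, illustrative sketch of symmetric 8-bit quantization of a small weight matrix. This is a simplification for clarity, not the exact scheme GGML uses:

# Illustrative 8-bit symmetric quantization (simplified; not GGML's exact scheme)
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)    # stand-in for a block of model weights

scale = np.abs(weights).max() / 127.0                  # map the largest weight to +/-127
q_weights = np.round(weights / scale).astype(np.int8)  # 1 byte per weight instead of 4

dequantized = q_weights.astype(np.float32) * scale     # approximate reconstruction at inference time
print("max reconstruction error:", np.abs(weights - dequantized).max())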

Downloading the LLM in GGML format

You can download the LLM of your choice from here. For this blog, I will be using the basic LLaMA-7B-Chat version. If the LLM you want to use is not already present in GGML format on Hugging Face, you can always convert it to GGML locally by following this video and this GitHub repo.
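
If you prefer to script the download, the huggingface_hub library works well. The repo ID and filename below are placeholders; substitute whichever GGML model you actually picked:

# Download a GGML model file from the Hugging Face Hub
# (repo_id and filename are placeholders -- use the model you chose)
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGML",
    filename="llama-2-7b-chat.ggmlv3.q8_0.bin",
    local_dir="models",
)
print(model_path)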

Using the LLM in python with 🦜🔗 Langchain

To use the LLM from Python, we need Python bindings that let us pass data and call functions between Python and the underlying C/C++ code. To do this, we will be using the CTransformers wrapper from Langchain.

#======= llm.py ===========
from langchain.llms import CTransformers

# Local CTransformers wrapper for Llama-2-7B-Chat
llm = CTransformers(
    model="/Users/deveshparagiri/Downloads/models/sage-v2-q8_0.bin",
    model_type="llama",  # Model type Llama
    config={"max_new_tokens": 256, "temperature": 0.5},
)
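
As a quick sanity check, you can call the wrapped model directly before wiring it into the retrieval chain. The prompt here is just an arbitrary example:

# Quick smoke test for the wrapped model (example prompt only)
from llm import llm

print(llm("Briefly explain what a vector database is."))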

Step 3: Collating and processing private data with 🦜🔗 Langchain

Now that the LLM has been downloaded, the next step is to create our vector database with our personal documents. Create a data/ directory in your project folder and upload all your personal documents there. First, the documents need to be loaded, and then split into proper chunks before creating the vector embeddings based on them.

Chunking

Chunking the data is crucial for performing a good semantic similarity search. The most basic approach is splitting the text into pieces of a fixed length (fixed-size chunking). However, this is not ideal, since we want chunks that preserve some of the surrounding context. One way to do this is recursive chunking, where chunks end up roughly similar in size while respecting natural boundaries in the text. To know more about the intricacies of chunking and its impact on inference, click here.
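
As a small illustration, the recursive splitter below tries to break on paragraph and sentence boundaries first and only falls back to hard character cuts when it has to. The sample text and sizes are arbitrary:

# Toy example of recursive chunking (sample text and sizes are arbitrary)
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample_text = (
    "Devesh listens to a lot of indie rock on Spotify.\n\n"
    "He also keeps his transcripts and offer letters as PDFs, which is exactly "
    "the kind of private data we want the assistant to reason over."
)

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
for i, chunk in enumerate(splitter.split_text(sample_text)):
    print(i, repr(chunk))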

Vector Database and Embeddings

Once chunked, the documents are converted into vector embeddings with the help of our all-MiniLM-L6-v2 model and stored in Milvus, a vector DB. If you do not want to set up a full vector DB for now, you can also make use of the FAISS wrapper in Langchain to index and store the embeddings locally. For setting up Milvus, make sure to pip install pymilvus and have a Milvus instance running (for example, via Docker).
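
If you go the local FAISS route instead of Milvus, a minimal sketch looks like this, assuming corpus_processed and embeddings are the split documents and embedding model created in db_build.py below:

# Local FAISS alternative to Milvus (requires: pip install faiss-cpu)
# Assumes corpus_processed and embeddings come from db_build.py below
from langchain.vectorstores import FAISS

vector_db = FAISS.from_documents(corpus_processed, embedding=embeddings)
vector_db.save_local("faiss_index")            # persist the index to disk

# Later, reload it instead of connecting to Milvus
vector_db = FAISS.load_local("faiss_index", embeddings)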

The code below only needs to be run once to create the vector database, which our LLM will use as a reference to answer our queries.

#========= db_build.py ===========
from langchain.vectorstores import Milvus
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import (
    PyPDFLoader,
    DirectoryLoader,
    Docx2txtLoader,
    CSVLoader,
)
from langchain.embeddings import HuggingFaceEmbeddings
from dbconfig import CONNECTION_HOST, CONNECTION_PORT, COLLECTION_NAME

# Embedding model loading
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)

# Recursive text splitting
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)


def load_data(directory):
    """This function loads and splits the files.

    Parameters:
    directory (str): Path where the data is located

    Returns:
    List: Returns a list of processed Langchain Document objects
    """

    # Initiating all loaders
    pdf_loader = DirectoryLoader(directory, glob="*.pdf", loader_cls=PyPDFLoader)
    docx_loader = DirectoryLoader(directory, glob="*.docx", loader_cls=Docx2txtLoader)
    spotify_loader = CSVLoader(file_path=f"{directory}spotify.csv")

    insta_following_loader = CSVLoader(file_path=f"{directory}insta_following.csv")
    insta_followers_loader = CSVLoader(file_path=f"{directory}insta_followers.csv")

    # Loading all documents
    pdf_documents = pdf_loader.load()
    docx_documents = docx_loader.load()
    spotify_documents = spotify_loader.load()
    insta_following_documents = insta_following_loader.load()
    insta_followers_documents = insta_followers_loader.load()

    # Adding all loaded documents to one single list of Documents
    corpus = pdf_documents
    corpus.extend(docx_documents)
    corpus.extend(spotify_documents)
    corpus.extend(insta_following_documents)
    corpus.extend(insta_followers_documents)

    # Resetting metadata for all document types to make them compatible with the vector DB
    for document in corpus:
        document.metadata = {"source": document.metadata["source"]}

    # Splitting all documents
    corpus_processed = text_splitter.split_documents(corpus)
    return corpus_processed


def vectordb_store(corpus_processed):
    """This function takes in the split documents,
    creates vector embeddings, indexes and stores them in Milvus.

    Parameters:
    corpus_processed (List): List of Langchain Document objects

    Returns: Milvus Vector DB Object
    """

    vector_db = Milvus.from_documents(
        corpus_processed,
        embedding=embeddings,
        connection_args={"host": CONNECTION_HOST, "port": CONNECTION_PORT},
        collection_name=COLLECTION_NAME,
    )
    return vector_db


if __name__ == "__main__":
    vectordb_store(load_data("data/"))
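
For reference, dbconfig.py is just a small constants module. The values below are assumptions based on a default local Milvus instance and the collection name used later in utils.py:

#========= dbconfig.py (assumed contents) ===========
# Connection details for a default local Milvus instance and the collection name
CONNECTION_HOST = "127.0.0.1"
CONNECTION_PORT = "19530"
COLLECTION_NAME = "mystore"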

Step 4: 🦜🔗 Langchain Retrieval QA Object for vectorDB similarity search

Our vector database has been created based on our private documents. Now, we need to be able to search that vector database based on our query to the LLM, find the most relevant data and send it back to our LLM for inference. To better understand the entire workflow, check out the architecture of the whole process below.

Workflow

To enable searching the vector DB, we will instantiate a RetrievalQA object in Langchain. We pass the LLM object, the vector DB retriever, and the prompt template to this object; at query time it fetches the most relevant chunks for the user's question and feeds them to the LLM. In my case, I am only retrieving the single most relevant chunk, but you can get the top K results by modifying the search_kwargs parameter.

#======= utils.py ============
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Milvus
from llm import llm
from prompts import set_qa_prompt  # assuming the prompt code from Step 5 lives in prompts.py
from pymilvus import connections


def build_retrieval_qa(llm, prompt, vectordb):
    """This function builds the RetrievalQA object.

    Parameters:
    llm (Object): The llm object
    prompt (Object): The prompt template
    vectordb (Object): The vector store

    Returns:
    RetrievalQA Object: Returns the best result
    """

    # Only retrieving the first best result
    dbqa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectordb.as_retriever(search_kwargs={"k": 1}),
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt},
    )
    return dbqa


def setup_dbqa():
    """This function instantiates the RetrievalQA object.

    Returns:
    RetrievalQA Object: Returns created object
    """

    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={"device": "cpu"},
    )

    connections.connect("default", host="localhost", port="19530")

    vector_db: Milvus = Milvus(
        embedding_function=embeddings,
        connection_args={"host": "127.0.0.1", "port": "19530"},
        collection_name="mystore",
    )
    qa_prompt = set_qa_prompt()
    dbqa = build_retrieval_qa(llm, qa_prompt, vector_db)
    return dbqa
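
If a single chunk is not enough context, bump k inside build_retrieval_qa and inspect which files the answer came from. The query below is just an example:

# Example: change search_kwargs={"k": 1} to {"k": 3} in build_retrieval_qa,
# then inspect the retrieved sources for any query
response = dbqa({"query": "What songs do I listen to the most?"})  # example query
print(response["result"])
for doc in response["source_documents"]:   # available because return_source_documents=True
    print("retrieved from:", doc.metadata["source"])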

Step 5: Creating a custom prompt template

A custom prompt template is very helpful for telling the model what kind of query it will receive from the user and how it should respond. For example, I have used the prompt template below, but you can write your own based on your requirements.

# ================================================================================
# Creating the template based on which the model will reply
# ================================================================================
from langchain import PromptTemplate

qa_template = """
You are Dev's personal A.I assistant named S.A.G.E.
You are a helpful and honest assistant who has access to my personal information. Please ensure that your responses are socially unbiased and positive in nature.
Censor any explicit content.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
Only answer based on what is presented.
Use the following context to answer the user's question.
Context: {context}
Question: {question}
Only return the answer and nothing else.
Answer:
"""


def set_qa_prompt():
    """This function wraps the prompt template in a PromptTemplate object.

    Returns:
    PromptTemplate Object: Returns the prompt template object
    """

    prompt = PromptTemplate(
        template=qa_template, input_variables=["context", "question"]
    )
    return prompt

This is just the most basic example of a prompt template. You can really make use of prompt engineering to design curated responses. For example, you could use a Few Shot Prompt template. This is a way of steering the model toward a specific style or domain by providing a few examples directly in the prompt, without any actual training.

from langchain import FewShotPromptTemplate, PromptTemplate

# create our examples
examples = [
    {
        "query": "How are you?",
        "answer": "I can't complain but sometimes I still do.",
    },
    {
        "query": "What time is it?",
        "answer": "It's time to get a watch.",
    },
]

# create an example template
example_template = """
User: {query}
AI: {answer}
"""

# create a prompt example from the above template
example_prompt = PromptTemplate(
    input_variables=["query", "answer"],
    template=example_template,
)

# now break our previous prompt into a prefix and suffix
# the prefix is our instructions
prefix = """The following are excerpts from conversations with an AI
assistant. The assistant is typically sarcastic and witty, producing
creative and funny responses to the user's questions. Here are some
examples:
"""

# and the suffix is our user input and output indicator
suffix = """
User: {query}
AI: """

# now create the few shot prompt template
few_shot_prompt_template = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    prefix=prefix,
    suffix=suffix,
    input_variables=["query"],
    example_separator="\n\n",
)
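
To see exactly what would be sent to the model, you can render the few-shot template with a query (the question here is arbitrary):

# Render the few-shot prompt for an arbitrary query to inspect what the LLM would see
print(few_shot_prompt_template.format(query="What's the meaning of life?"))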

For more rigorous adaptation to specific data, you should fine-tune the model on your private data. This can be done with a single T4 GPU on Google Colab (it will take some time depending on the size of the dataset), provided you make use of PEFT (Parameter-Efficient Fine-Tuning) techniques like QLoRA. For the fine-tuning code, check out my Google Colab notebook here. Check out this resource if you want to know more about QLoRA.
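
The core of a QLoRA setup is loading the base model in 4-bit and attaching low-rank adapters. Here is a minimal sketch with the transformers and peft libraries; the model name and hyperparameters are illustrative, not the exact values from my notebook:

# Minimal QLoRA setup sketch (model name and hyperparameters are illustrative)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",        # example base model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                   # rank of the low-rank adapter matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],    # attach adapters to the attention projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable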

That’s about it for making your own personalized and secure A.I assistant. Click here to download the entire project code.
