How to Code a Project like ChatPDF?

Josh Lin
4 min read · Mar 8, 2023


What is ChatPDF?

It is an awesome tool that lets you chat with a PDF document as if it were a person, and that means a lot.

For example, if it’s a cookbook, you can ask it about cooking; if it’s a technical book, it becomes an expert on the subject!

screenshot of ChatPDF

If you haven’t played with ChatPDF yet, here’s the site:

Technologies and Concepts

Some people need everything to be explainable and want to know what happens behind the scenes. I am one of them: if you told me this amazing tool simply runs on some superpower, I’d go crazy.

So the first thing behind the scenes is OpenAI’s ChatGPT, which I guess is the best chat AI for now.

But chat AI has its limits. The gpt-3.5-turbo model cannot handle more than about 4,096 tokens, shared between the prompt and the answer. That means:

  • you can only send it a few thousand words at most, so you cannot feed the whole PDF content in with the question
  • you can try to make it remember the whole PDF content, and maybe fine-tuning can do that, but it might not remember all of the content
https://community.openai.com/t/fine-tuning-vs-embedding/35813/2
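
To get a feel for the limit, here is a rough sketch that estimates whether a text fits into one prompt. It is not the real tokenizer: the ratio of about 4 characters per token is only a common rule of thumb for English text, and `estimate_tokens`, `fits_in_prompt`, `TOKEN_LIMIT`, and the reserved answer size are hypothetical names and values I picked for illustration.

```python
# Assumption: roughly the model limit discussed above, not an exact figure.
TOKEN_LIMIT = 4000


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumes ~4 characters per token)."""
    return int(len(text) / chars_per_token) + 1


def fits_in_prompt(text: str, reserved_for_answer: int = 500) -> bool:
    """True if the text probably leaves room for the model's answer."""
    return estimate_tokens(text) <= TOKEN_LIMIT - reserved_for_answer


short_question = "How long should I boil an egg?"
whole_book = "word " * 100_000  # stand-in for the text of a full PDF

print(fits_in_prompt(short_question))
print(fits_in_prompt(whole_book))
```

A short question fits easily, while the text of a whole book does not, which is why the rest of this article only sends the relevant pieces.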

This is where embeddings come to help. In simple words, embeddings are texts in vector form: you can compute the distance between two texts, and the distance is short when they have similar meanings or are closely related.

So we take the question, find the related parts of the document, combine them into a ChatGPT prompt, and then we get the answer.
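
A tiny example may make the idea of distance more concrete. The vectors below are hand-made, hypothetical 3-dimensional values, not output from any model (real embeddings have many more dimensions); the sketch only shows that related texts end up closer together than unrelated ones.

```python
import math

# Hand-made toy "embeddings" (hypothetical values for illustration only).
toy_embeddings = {
    "how to bake bread": [0.9, 0.1, 0.0],
    "bread baking recipe": [0.8, 0.2, 0.1],
    "install a Python package": [0.0, 0.1, 0.9],
}


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two vectors of the same length."""
    return math.dist(a, b)


query = toy_embeddings["how to bake bread"]
for text, vector in toy_embeddings.items():
    # Smaller numbers mean the text is closer to the query.
    print(f"{euclidean_distance(query, vector):.3f}  {text}")
```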
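
The flow can be tried offline with a minimal sketch. It assumes a fake embedding (`toy_embed`, a bag-of-words count) in place of a real embedding model and stops at building the prompt instead of calling ChatGPT; `toy_embed`, `cosine`, and `build_prompt` are hypothetical helper names, not part of any library.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(question: str, pieces: list[str], top_k: int = 2) -> str:
    """Pick the pieces most similar to the question and put them in a prompt."""
    q = toy_embed(question)
    ranked = sorted(pieces, key=lambda p: cosine(q, toy_embed(p)), reverse=True)
    context = "\n".join(ranked[:top_k])
    return (
        "Answer the question based on the following context:\n\n"
        f"context: {context}\n\nQ: {question}\n\nA:"
    )


pieces = [
    "Boil the eggs for ten minutes.",
    "Knead the dough until it is smooth.",
    "Preheat the oven to 200 degrees.",
]
prompt = build_prompt("How long do I knead the dough?", pieces, top_k=1)
print(prompt)
```

Only the piece about kneading makes it into the prompt, so the model would answer from that piece instead of from the whole document.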

process flow of ChatPDF (I guess)

The Coding Part

Get embedding from string


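    import openai  # this article uses the pre-1.0 interface of the openai package

    # Assumed value: the article does not say which embedding model was used.
    EMBEDDING_MODEL = "text-embedding-ada-002"

    def get_embedding(text: str, model: str = EMBEDDING_MODEL) -> list[float]:
        # Ask the OpenAI API for the embedding vector of a single string.
        result = openai.Embedding.create(
            model=model,
            input=text
        )
        return result["data"][0]["embedding"]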

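Break the full content into pieces, and get the embedding of each piece

    # content holds the full text extracted from the PDF.
    embeddings = []
    sources = []
    for source in content.split('\n'):
        if source.strip() == '':
            continue  # skip empty lines
        embeddings.append(get_embedding(source))
        sources.append(source)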
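
Splitting on every newline can produce many tiny pieces (a title, a single short line), and each piece costs one embedding call. As a possible refinement, and not what the code above does, here is a sketch that groups consecutive lines into larger chunks; `split_into_chunks` and `max_chars` are hypothetical names, and the size limit is an arbitrary value to tune.

```python
def split_into_chunks(content: str, max_chars: int = 500) -> list[str]:
    """Group consecutive non-empty lines into chunks of at most max_chars.

    A single line longer than max_chars is kept as its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for line in content.split("\n"):
        line = line.strip()
        if not line:
            continue  # skip empty lines
        candidate = f"{current} {line}" if current else line
        if len(candidate) > max_chars and current:
            # The current chunk is full: store it and start a new one.
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Title\n\nFirst short line.\nSecond short line.\n\n" + "x" * 40
for chunk in split_into_chunks(sample, max_chars=30):
    print(repr(chunk))
```

Each chunk would then go through `get_embedding` in place of a single line.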

Calculate the similarity between two embeddings

    import numpy as np

    def vector_similarity(x: list[float], y: list[float]) -> float:
        """
        Returns the similarity between two vectors.

        Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
        """
        return np.dot(np.array(x), np.array(y))
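
The claim in the docstring is easy to check on small vectors. The sketch below uses plain Python instead of NumPy so it runs anywhere; `dot`, `norm`, `normalize`, and `cosine_similarity` are hypothetical helper names, and the vectors are made-up values.

```python
import math


def dot(x: list[float], y: list[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def norm(x: list[float]) -> float:
    return math.sqrt(dot(x, x))


def normalize(x: list[float]) -> list[float]:
    """Scale a vector to length 1."""
    n = norm(x)
    return [a / n for a in x]


def cosine_similarity(x: list[float], y: list[float]) -> float:
    return dot(x, y) / (norm(x) * norm(y))


raw_x, raw_y = [3.0, 4.0], [1.0, 2.0]
x, y = normalize(raw_x), normalize(raw_y)

# For unit-length vectors the two values agree (up to rounding).
print(math.isclose(dot(x, y), cosine_similarity(x, y)))
# For vectors that are not normalized they differ.
print(dot(raw_x, raw_y), cosine_similarity(raw_x, raw_y))
```

So the dot product is only a shortcut for cosine similarity when the embeddings have length 1; with other embeddings you would have to divide by the norms.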

Sort by similarity to the user’s question

    def order_document_sections_by_query_similarity(query: str, embeddings) -> list[tuple[float, int]]:
        """
        Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
        to find the most relevant sections.

        Return a list of (similarity, document index) pairs, sorted by relevance in descending order.
        """
        query_embedding = get_embedding(query)

        document_similarities = sorted([
            (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in enumerate(embeddings)
        ], reverse=True, key=lambda x: x[0])

        return document_similarities

When a question is asked, get the nearest information as context and ask ChatGPT

    # Assumed value: the article does not give the limit it used.
    CONTEXT_TOKEN_LIMIT = 1500

    def ask(question: str, embeddings, sources):
        ordered_candidates = order_document_sections_by_query_similarity(question, embeddings)
        # Add the most relevant pieces until the context budget is used up.
        # Note: len() counts characters, not tokens, so this is only a rough limit.
        ctx = ""
        for _, doc_index in ordered_candidates:
            candidate = ctx + " " + sources[doc_index]
            if len(candidate) > CONTEXT_TOKEN_LIMIT:
                break
            ctx = candidate
        if len(ctx) == 0:
            return ""

        prompt = "".join([
            "Answer the question based on the following context:\n\n",
            "context:" + ctx + "\n\n",
            "Q:" + question + "\n\n",
            "A:"
        ])

        completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}])
        return [prompt, completion.choices[0].message.content]

Of course, making it work still takes a lot more code. I got the code working and pushed it to a GitHub repo:

The link is broken because GitHub flagged my account. If somebody can help unflag my account (postor@gmail.com), it would be very much appreciated.

Hope you have fun playing with it!

screenshot of postor/chatpdf-minimal-demo
