What is ChatPDF?
It is an awesome tool that makes a PDF document chat with you like a person, and that means a lot.
For example, if it's a cookbook, you can ask it about cooking; if it's a technical book, it answers like an expert!
If you haven’t played with ChatPDF yet, here’s the site:
Technologies and Concepts
Some people need everything to be explainable; they need to know what's going on behind the scenes. I'm one of them: if you told me this amazing tool was just some kind of superpower, it would drive me crazy.
So the first thing behind the scenes is OpenAI's ChatGPT, which I'd say is the best chat AI around right now.
But a chat AI has its limits: it cannot take more than roughly 4,000 tokens of input (4,096 for gpt-3.5-turbo; see the token-counting sketch after this list), which means:
- you can only ask it a few hundred words at a time, so you cannot feed the whole PDF content inside the question
- you could try to make it remember the whole PDF content, maybe with fine-tuning, but it might not remember all of it
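If you're curious how fast a document burns through that budget, you can count tokens with OpenAI's tiktoken library. This is just an illustrative sketch; the file name is made up and the exact count depends on the model's encoding.

import tiktoken

# Count how many gpt-3.5-turbo tokens a piece of text uses.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = open("my_book.txt").read()  # hypothetical file holding the PDF's text
print(len(enc.encode(text)))       # a whole book is far beyond ~4,000 tokens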
Embeddings come to the rescue. In simple words, an embedding is a text in vector form: you can compute the distance between texts, and the distance will be short if they have similar meanings or are closely related (there's a toy illustration below).
So we take the question, find the related parts of the document, combine them into a ChatGPT prompt, and get the answer.
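Here is a toy illustration of that distance idea, using made-up 3-dimensional vectors instead of real embeddings (real OpenAI embeddings have over a thousand dimensions):

import numpy as np

# Made-up vectors standing in for real embeddings, normalized to length 1.
roast_chicken = np.array([0.9, 0.4, 0.2]); roast_chicken /= np.linalg.norm(roast_chicken)
grill_turkey  = np.array([0.8, 0.5, 0.3]); grill_turkey  /= np.linalg.norm(grill_turkey)
tcp_handshake = np.array([0.1, 0.2, 0.9]); tcp_handshake /= np.linalg.norm(tcp_handshake)

print(np.dot(roast_chicken, grill_turkey))   # close to 1 -- similar meaning
print(np.dot(roast_chicken, tcp_handshake))  # much lower -- unrelated topics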
The Coding Part
Get embedding from string
import openai

# Assumed value -- the original post doesn't show which embedding model it used.
EMBEDDING_MODEL = "text-embedding-ada-002"

def get_embedding(text: str, model: str = EMBEDDING_MODEL) -> list[float]:
    # Ask the OpenAI Embeddings API (pre-1.0 openai library) for the vector of this text.
    result = openai.Embedding.create(
        model=model,
        input=text
    )
    return result["data"][0]["embedding"]
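For example (assuming your OpenAI API key is already configured), a quick sanity check of what comes back; with text-embedding-ada-002 the vector has 1536 numbers:

emb = get_embedding("How long should I roast a chicken?")
print(len(emb))   # 1536 for text-embedding-ada-002
print(emb[:3])    # the first few floats of the vector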
Break the full content into pieces and get the embedding of each piece
# `content` holds the full text of the PDF; one piece per non-empty line.
sources = []
embeddings = []
for source in content.split('\n'):
    if source.strip() == '':
        continue
    embeddings.append(get_embedding(source))
    sources.append(source)
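The snippet above assumes the content variable already holds the PDF's text. The original post doesn't show that step, but one possible way to get it, sketched with the pypdf library, is:

from pypdf import PdfReader

# Pull the plain text out of every page and join it into one string.
reader = PdfReader("cookbook.pdf")  # hypothetical file name
content = "\n".join(page.extract_text() or "" for page in reader.pages)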
Calculate distance between two strings
import numpy as np

def vector_similarity(x: list[float], y: list[float]) -> float:
    """
    Returns the similarity between two vectors.
    Because OpenAI embeddings are normalized to length 1, the cosine similarity
    is the same as the dot product.
    """
    return np.dot(np.array(x), np.array(y))
Sort by distance from user’s question
def order_document_sections_by_query_similarity(query: str, embeddings) -> list[tuple[float, int]]:
    """
    Find the embedding for the supplied query and compare it against all of the
    pre-calculated document embeddings to find the most relevant sections.
    Return (similarity, document index) pairs, sorted by relevance in descending order.
    """
    query_embedding = get_embedding(query)
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index)
        for doc_index, doc_embedding in enumerate(embeddings)
    ], reverse=True, key=lambda x: x[0])
    return document_similarities
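As a quick, hypothetical check (embeddings and sources are the lists built earlier), the top few results for a question should be the pieces of the document that actually talk about it:

ranked = order_document_sections_by_query_similarity("How do I roast a chicken?", embeddings)
for similarity, doc_index in ranked[:3]:
    print(round(similarity, 3), sources[doc_index])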
When a question is asked, gather the most relevant pieces as context and ask ChatGPT
# Assumed value -- the original post doesn't show this constant. Note that the
# check below compares character counts, which is only a rough stand-in for tokens.
CONTEXT_TOKEN_LIMIT = 1500

def ask(question: str, embeddings, sources):
    ordered_candidates = order_document_sections_by_query_similarity(question, embeddings)

    # Greedily add the most relevant pieces until the context budget is used up.
    ctx = ""
    for candidate in ordered_candidates:
        next_ctx = ctx + " " + sources[candidate[1]]
        if len(next_ctx) > CONTEXT_TOKEN_LIMIT:
            break
        ctx = next_ctx
    if len(ctx) == 0:
        return ""

    prompt = (
        "Answer the question based on the following context:\n\n"
        "context:" + ctx + "\n\n"
        "Q:" + question + "\n\n"
        "A:"
    )
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return [prompt, completion.choices[0].message.content]
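Putting it all together, a hypothetical call looks like this (the question is just an example):

result = ask("How long should I roast a chicken?", embeddings, sources)
if result:  # ask() returns "" when no context fits
    prompt, answer = result
    print(answer)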
Of course, making this actually work takes quite a bit more code. I got it all working and pushed it to a GitHub repo:
The link is broken because GitHub flagged my account; if somebody can help unflag my account (postor@gmail.com), it would be very much appreciated.
Hope you have fun playing with it!