How to Code a Project like ChatPDF?

Josh Lin

4 min read · Mar 8, 2023

What is ChatPDF?

It is an awesome tool that lets you chat with a PDF document as if it were a person, and that means a lot.

For example, if it’s a cookbook, you can ask it about cooking; if it’s a technical book, it becomes an expert on the subject!

If you haven’t played with ChatPDF yet, here’s the site:

Chat with any PDF

It works great to quickly extract information from large PDF files. Try talking to manuals, essays, legal contracts…

www.chatpdf.com

Technologies and Concepts

Some people need everything to be explainable; they need to know what happens behind the scenes. I am one of them: if you told me this amazing tool was just some superpower, I’d go crazy.

So the first thing behind the scenes is OpenAI’s ChatGPT, which I guess is the best chat AI for now.

Introducing ChatGPT

Contributors: John Schulman, Barret Zoph, Christina Kim, Jacob Hilton, Jacob Menick, Jiayi Weng, Juan Felipe Ceron…

openai.com

But a chat AI has its limits: it cannot take more than about 4,000 tokens of input (the limit for gpt-3.5-turbo at the time of writing was 4,096 tokens, prompt and answer combined). That means:

A prompt can hold only a few thousand words, so you cannot feed the whole PDF content in with the question.
You could try to make it remember the whole PDF content, and fine-tuning might do that, but it may not remember all of the content.
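
To get a feel for the limit, here is a minimal sketch that estimates whether a text fits in one prompt. It assumes the common rule of thumb of roughly 4 characters of English text per token, which is only an approximation; a real tokenizer gives exact counts. The names `estimate_tokens`, `fits_in_prompt`, and both constants are hypothetical and chosen for illustration.

```python
# Rough heuristic only: about 4 characters of English text per token.
CHARS_PER_TOKEN = 4        # assumption, varies by language and content
MODEL_TOKEN_LIMIT = 4096   # assumed limit for gpt-3.5-turbo at the time of writing


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate based on character count."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_prompt(text: str, reserved_for_answer: int = 500) -> bool:
    """Check whether text plausibly fits, leaving room for the reply."""
    return estimate_tokens(text) <= MODEL_TOKEN_LIMIT - reserved_for_answer


short_question = "How long should I boil an egg?"
whole_book = "word " * 100_000  # stand-in for the text of a large PDF

print(fits_in_prompt(short_question))
print(fits_in_prompt(whole_book))
```

A short question fits easily, while the stand-in for a whole book is far over the budget, which is why the document has to be cut into pieces first.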

https://community.openai.com/t/fine-tuning-vs-embedding/35813/2

OpenAI API — fine-tuning

An API for accessing new AI models developed by OpenAI

platform.openai.com

Embeddings come to the rescue. In simple words, embeddings are texts in vector form: you can compute the distance between two texts, and the distance is short if they have similar meanings or are closely related.

OpenAI API — embedding

An API for accessing new AI models developed by OpenAI

platform.openai.com
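
To make the idea of "distance between texts" concrete, here is a toy sketch with hand-made 3-dimensional vectors. Real embeddings come from the API and have many more dimensions; the vectors and the helper name `cosine_similarity` below are invented for illustration. A higher similarity corresponds to a shorter distance.

```python
import math


def cosine_similarity(x: list[float], y: list[float]) -> float:
    """Cosine of the angle between x and y; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hand-made 3-dimensional "embeddings", purely illustrative.
toy_embeddings = {
    "how to bake bread": [0.9, 0.1, 0.0],
    "bread baking tips": [0.8, 0.2, 0.1],
    "install a linux kernel": [0.0, 0.2, 0.9],
}

query = toy_embeddings["how to bake bread"]
for text, vector in toy_embeddings.items():
    print(f"{cosine_similarity(query, vector):.3f}  {text}")
```

The two bread-related texts score close to each other, while the unrelated one scores much lower.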

So we take the question, find the related parts of the document, combine them into a ChatGPT prompt, and then get the answer.

The Coding Part

Get embedding from string


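import openai  # written against the pre-1.0 openai Python SDK

# Assumed model name; the article does not say which embedding model it uses.
EMBEDDING_MODEL = "text-embedding-ada-002"

def get_embedding(text: str, model: str = EMBEDDING_MODEL) -> list[float]:
    """Return the embedding vector for a piece of text."""
    result = openai.Embedding.create(
        model=model,
        input=text
    )
    return result["data"][0]["embedding"]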

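Break the full content into pieces, and get the embedding of each piece

embeddings = []
sources = []
# content is the text extracted from the PDF; every non-empty line becomes one piece.
for source in content.split('\n'):
    if source.strip() == '':
        continue
    embeddings.append(get_embedding(source))
    sources.append(source)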
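
Splitting on every newline can produce very short pieces, such as a lone heading. As one possible refinement, not something the original demo does, the sketch below merges consecutive non-empty lines into chunks up to a character budget. The function name `split_into_chunks` and the `max_chars` parameter are hypothetical.

```python
def split_into_chunks(content: str, max_chars: int = 500) -> list[str]:
    """Group consecutive non-empty lines into chunks of at most max_chars.

    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for line in content.split("\n"):
        line = line.strip()
        if not line:
            continue
        candidate = f"{current} {line}" if current else line
        if current and len(candidate) > max_chars:
            # The current chunk is full: store it and start a new one.
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "First line.\n\nSecond line.\nThird line is here."
for chunk in split_into_chunks(sample, max_chars=25):
    print(repr(chunk))
```

Each chunk would then be passed to `get_embedding` in place of a single line.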

Calculate the similarity between two embeddings

import numpy as np

def vector_similarity(x: list[float], y: list[float]) -> float:
    """
    Returns the similarity between two vectors.

    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(x), np.array(y))
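
The docstring's claim can be spelled out in one line. For two vectors x and y with angle θ between them, and assuming both have length 1 as the docstring states, the cosine similarity reduces to the dot product, and a larger dot product means a smaller Euclidean distance:

```latex
\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
\quad\Longrightarrow\quad
\cos\theta = x \cdot y = \sum_i x_i y_i
\quad \text{when } \lVert x \rVert = \lVert y \rVert = 1,
\qquad
\lVert x - y \rVert^2 = \lVert x \rVert^2 + \lVert y \rVert^2 - 2\, x \cdot y = 2 - 2\, x \cdot y .
```

So ranking pieces by descending dot product is the same as ranking them by ascending distance from the question.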

Sort by similarity to the user’s question

def order_document_sections_by_query_similarity(query: str, embeddings) -> list[tuple[float, int]]:
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections.

    Return a list of (similarity, section index) pairs, sorted by relevance in descending order.
    """
    query_embedding = get_embedding(query)

    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in enumerate(embeddings)
    ], reverse=True, key=lambda x: x[0])

    return document_similarities

When a question is asked, take the nearest pieces as context and ask ChatGPT

# Assumed budget; note that len() below counts characters, not tokens.
CONTEXT_TOKEN_LIMIT = 1500

def ask(question: str, embeddings, sources):
    ordered_candidates = order_document_sections_by_query_similarity(question, embeddings)
    ctx = ""
    # Add the most relevant pieces first until the budget is used up.
    for _, doc_index in ordered_candidates:
        candidate = ctx + " " + sources[doc_index]
        if len(candidate) > CONTEXT_TOKEN_LIMIT:
            break
        ctx = candidate
    if len(ctx) == 0:
        # Not even the best piece fits in the budget.
        return ""

    prompt = "".join([
        "Answer the question based on the following context:\n\n",
        "context:" + ctx + "\n\n",
        "Q:" + question + "\n\n",
        "A:",
    ])

    completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}])
    return [prompt, completion.choices[0].message.content]
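
The context packing and prompt assembly can be tried without any API call. The following is a sketch of that idea, not the original demo's code: `build_context` and `build_prompt` are hypothetical helpers, the sample sentences are invented, and the budget is counted in characters, as in `ask` above.

```python
def build_context(ranked_sources: list[str], max_chars: int) -> str:
    """Concatenate sources in ranked order until the character budget is hit."""
    ctx = ""
    for source in ranked_sources:
        candidate = f"{ctx} {source}" if ctx else source
        if len(candidate) > max_chars:
            break
        ctx = candidate
    return ctx


def build_prompt(question: str, context: str) -> str:
    """Assemble a prompt in a layout similar to the one used in ask()."""
    return (
        "Answer the question based on the following context:\n\n"
        f"context:{context}\n\n"
        f"Q:{question}\n\n"
        "A:"
    )


# Invented pieces, already sorted from most to least relevant.
ranked = [
    "Boil eggs for 7 minutes for a soft yolk.",
    "Cool the eggs in ice water before peeling.",
    "The book was first printed in 1998.",
]
context = build_context(ranked, max_chars=90)
print(build_prompt("How long should I boil an egg?", context))
```

Keeping these two steps as plain functions makes it easy to inspect exactly what is sent to the model before spending any API calls.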

Of course, making it work still takes a lot more code. I got the code working and pushed it to a GitHub repo:

GitHub - postor/chatpdf-minimal-demo

Contribute to postor/chatpdf-minimal-demo development by creating an account on GitHub.

github.com

That link is broken because GitHub flagged my account. If somebody can help unflag my account (postor@gmail.com), it would be very much appreciated.

GitHub - joshlinhit/chatpdf-minimal-demo

Contribute to joshlinhit/chatpdf-minimal-demo development by creating an account on GitHub.

github.com

Hope you have fun playing with it!
