This is a simple example that shows how to perform Chain of Density summarization using GPT-3.5 and use the generated output to fine-tune a GPT-3.5 model for production usage. All of the data referenced in this file is located here on Hugging Face.

Check out our blog post here, where we explain the code in detail and provide a Colab notebook that walks through how we perform our calculations.

Instructions

First, install all of the required dependencies by running the command below. We recommend installing them in a virtual environment so that your system installation is not affected.

pip3 install -r requirements.txt

We use NLTK to ensure that our summaries are of a certain token length. To compute the token metrics, you'll need to download the punkt package, which you can do by running nltk.download('punkt') in Python.
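As a rough illustration of the token-length check that NLTK is used for, here is a minimal sketch. The helper names `count_tokens` and `within_token_range` and the bounds passed to them are hypothetical and not taken from this repository; the only NLTK calls used are `nltk.download` and `nltk.word_tokenize`. The tokenizer is injectable so the check can be tried without NLTK data installed.

```python
from typing import Callable, List, Optional


def count_tokens(
    text: str, tokenize: Optional[Callable[[str], List[str]]] = None
) -> int:
    """Count the tokens in `text`, using NLTK's word tokenizer by default."""
    if tokenize is None:
        # Imported lazily so the helper still works with a custom tokenizer
        # when NLTK or its tokenizer data is not installed.
        import nltk

        # Assumption: some newer NLTK releases may ask for "punkt_tab"
        # instead; adjust the name if NLTK reports a missing resource.
        nltk.download("punkt", quiet=True)
        tokenize = nltk.word_tokenize
    return len(tokenize(text))


def within_token_range(
    text: str,
    min_tokens: int,
    max_tokens: int,
    tokenize: Optional[Callable[[str], List[str]]] = None,
) -> bool:
    """Return True if the token count of `text` lies in [min_tokens, max_tokens]."""
    return min_tokens <= count_tokens(text, tokenize) <= max_tokens
```

Calling `within_token_range(summary, 20, 80)` would use NLTK, while passing `tokenize=str.split` falls back to simple whitespace splitting. The bounds here are placeholders; pick the ones that match the summary length you are targeting.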

Download the test.csv file and the summarization.jsonl file that you want to use for finetuning. We provide files with 20, 50 and 100 examples for testing.

Don't forget to set your OPENAI_API_KEY as an environment variable in your shell before running these commands. You can then run a simple finetuning job with the following command.

instructor jobs create-from-file summarization.jsonl
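Before submitting the job, it can help to confirm that the API key is set and that the training file parses cleanly. The sketch below is one way to do that using only the standard library. The function name `find_problems` is hypothetical, and the check for a `messages` list assumes the file uses the chat-style fine-tuning format, which this README does not state explicitly.

```python
import json
import os
from pathlib import Path
from typing import List


def find_problems(jsonl_path: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []

    # The command above reads the key from the environment.
    if not os.environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    path = Path(jsonl_path)
    if not path.is_file():
        problems.append(f"{jsonl_path} does not exist")
        return problems

    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append(f"line {line_number}: invalid JSON ({error.msg})")
                continue
            # Assumption: each record is a chat-style example with a
            # "messages" list; change this check if your file differs.
            if not isinstance(record, dict) or not isinstance(
                record.get("messages"), list
            ):
                problems.append(f"line {line_number}: missing a 'messages' list")

    return problems
```

For example, `find_problems("summarization.jsonl")` returning an empty list suggests the file is at least structurally ready to upload.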

Once the job is complete, you'll end up with a new GPT-3.5 model that's capable of producing high-quality summaries with a high entity density. You can use it by changing the instructions.distil decorator in our finetune.py file as follows:

@instructions.distil(model="<your finetuned model>", mode="dispatch")
def distil_summarization(text: str) -> GeneratedSummary:
    # rest of code goes here
    ...