Introduction
This is a simple example that shows how to perform Chain of Density summarization using GPT-3.5 and use the generated output to fine-tune a GPT-3.5 model for production usage. All of the data referenced in this file is hosted on Hugging Face.
Our blog post has a detailed explanation of the code, along with a Colab notebook that walks through how we perform our calculations.
Instructions
- First, install all of the required dependencies by running the command below. We recommend installing them in a virtual environment so that they do not affect your system installation.
We use NLTK to ensure that our summaries meet a certain token length. To compute the token metrics, you'll need to download the `punkt` package, which you can do by running `nltk.download('punkt')` in Python.
pip3 install -r requirements.txt
- Download the `test.csv` file and the `summarization.jsonl` file that you want to use for fine-tuning. We provide versions with 20, 50 and 100 examples to be used for testing. Then run a simple fine-tuning job with the following command.
Don't forget to set `OPENAI_API_KEY` as an environment variable in your shell before running these commands.
instructor jobs create-from-file summarization.jsonl
- Once the job is complete, you'll end up with a new GPT-3.5 model that's capable of producing high-quality summaries with a high entity density. To use it, change the `instructions.distil` decorator in our `finetune.py` file as follows:
@instructions.distil(model="<your fine-tuned model>", mode="dispatch")
def distil_summarization(text: str) -> GeneratedSummary:
    # rest of code goes here
    ...
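
The first step mentions that NLTK is used to keep summaries at a certain token length. As a rough illustration of that kind of check, here is a minimal sketch. The helper names `count_tokens` and `is_long_enough`, and the default threshold of 60 tokens, are assumptions made for this example rather than values taken from the repository. The tokenizer is passed in as a parameter so the helpers can be tried without NLTK installed; when none is given, the sketch falls back to `nltk.word_tokenize`, which needs the `punkt` download described above.

```python
from typing import Callable, List, Optional

# Type of any function that splits a string into a list of tokens.
Tokenizer = Callable[[str], List[str]]


def count_tokens(text: str, tokenize: Optional[Tokenizer] = None) -> int:
    """Return the number of tokens in `text`."""
    if tokenize is None:
        # Imported lazily so the helper also works with a custom tokenizer.
        # Requires `nltk.download('punkt')` to have been run beforehand.
        import nltk

        tokenize = nltk.word_tokenize
    return len(tokenize(text))


def is_long_enough(
    summary: str,
    min_tokens: int = 60,  # hypothetical threshold, chosen for illustration
    tokenize: Optional[Tokenizer] = None,
) -> bool:
    """Check that a summary has at least `min_tokens` tokens."""
    return count_tokens(summary, tokenize) >= min_tokens


if __name__ == "__main__":
    draft = "Chain of Density packs more entities into each rewrite."
    # Whitespace splitting stands in for NLTK here; pass no tokenizer
    # to use nltk.word_tokenize instead.
    print(count_tokens(draft, tokenize=str.split))
    print(is_long_enough(draft, min_tokens=60, tokenize=str.split))
```

A check like this could be run on each generated summary before it is written to the fine-tuning file, so that overly short summaries are rejected early.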