diff --git a/docs/blog/posts/distilation-part1.md b/docs/blog/posts/distilation-part1.md index e1fa041..04c8010 100644 --- a/docs/blog/posts/distilation-part1.md +++ b/docs/blog/posts/distilation-part1.md @@ -1,116 +1,26 @@ --- -draft: False -date: 2023-10-15 +draft: False +date: 2023-10-17 tags: - python - distillation - function calling - finetuning - - experimental --- -# Experimental: Finetuning with `Instructions` from `Instructor` +# Enhancing Python Functions with Instructor: A Guide to Fine-Tuning and Distillation -The core philosophy with the `instructor` library is to make language models backwards compatible with existing code. By adding Pydantic in the mix we're able to easily work with LLMs without much worry. +## Introduction -However, building efficient, reliable function is a key skill in software development. But why stop there? What if your functions could automatically become smarter and more efficient without any hand-holding? That's exactly what you gain by investing a few minutes into this read. Here, we delve into some new features `instructor`. +Get ready to dive deep into the world of fine-tuning task specific language models with Python functions. We'll explore how the `instructor.instructions` streamlines this process, making the task you want to distil more efficient and powerful while preserving its original functionality and backwards compatibility. -!!! note "Experimental" - This is an experimental feature. It's not yet ready for production use. This post is meant to give you a sneak peek into what's coming next, and get your feedback on what you'd like to see. +## Why You Need Instructor -By the end of this article, you'll understand how to easily integrate the end to end finetuning of small functions `instructor` library with your Python functions to improve them without breaking existing code. 
+Imagine you're developing a backend service that uses a mix of old- and new-school ML practices. It may involve pipelines with multiple function calls, validations, and data processing. Sounds cumbersome, right? That's where `Instructor` comes in. It simplifies complex procedures, making them more efficient and easier to manage by adding a decorator to your function that will automatically generate a dataset for fine-tuning and help you swap out the function implementation. -## Why You Should Care +## Quick Start: How to Use Instructor's Distillation Feature -Traditionally, implementing a complex prompt chain involved linking multiple chains together. Each llm call might need [data validation](https://jxnl.github.io/instructor/reask_validation/), external validations, follow up prompts and more. This can be a tedious process, especially if you're working with a large number of functions. Instead we might want to finetune a model that can handle the entire chain end to - -### Anatomy of a Complex Function - -To paint a clearer picture, let's consider a function that takes a video transcript and churns out an email. This function may include the following steps: - -1. Summarize the video transcript. -2. Fact-check the summary. -3. Create a sequence of increasingly dense email drafts. -4. Select the final draft. - -Here's how the function could look in code: - -```python -def complex_chain(video_transcript: str) -> Email: - """ - Generate a follow-up email based on a video transcript - """ - summary = extract_summary(video_transcript) - summary = check_for_hallucinations(video_transcript, summary) - emails: List[Email] = create_chain_of_density(summary) - return emails[-1] -``` - -Traditional approaches would require you to manually save logs, extract the logs into a training set, fine-tune the model, and then replace the function with the model. But with `instructor`, a single decorator does the trick.
- -```python -from instructor import Instructions -import logging - -logging.basicConfig(level=logging.INFO) - -instructions = Instructions( - name="sales_follow_up", - log_handlers=[FileHandler("complex_chain_finetune.jsonl")] -) - -@instructions.distil -def complex_chain(video_transcript: str) -> Email: - summary = extract_summary(video_transcript) - summary = check_for_hallucinations(video_transcript, summary) - emails: List[Email] = create_chain_of_density(summary) - return emails[-1] -``` - -This now results in a log file that can be used to finetune a model, you can use the `instructor` cli or upload the file directly to OpenAI. Note that its building using log handlers, so if you want to save to a DB or S3 you can do that by saving your logs elsewhere. - - -## I trained the model. Now what? - -Once a model is trained, you might imagine you want to delete the code body and replace it with a call to the model. However since we already decorate the function with `@instructions.distil`, we can simply call the function as usual. Here, `@distil` will automatically detect the model and use it instead of the function body. 
- -```python -from instructor import Instructions - -instructions = Instructions(name="sales_follow_up") - -@instructions.distil(model='gpt-3.5-turbo:finetuned') -def complex_chain(video_transcript: str) -> Email: - summary = extract_summary(video_transcript) - summary = check_for_hallucinations(video_transcript, summary) - emails: List[Email] = create_chain_of_density(summary) - return emails[-1] -``` - -Its a bit advanced but notice that `@distil` can detect the model and call openai rather than calling the base function: - -```python -def distil(model): - if model: - return openai.ChatCompletion.create( - model=model, - messages=[...], - response_model=fn.__annotations__["return"], - ) - # call the original function - # if the model is not set yet - return fn(*args, **kwargs) -``` - -You can imagine in the future we can have a range of different behavior - -1. Call a finetuned model, fall back to the original function -2. Call the finetuned model and the original function and compare the results as a validation -3. Route a percentage of calls to the finetuned model and the rest to the original function, as a way to test the model in production - -## A Simpler Example: Three-Digit Multiplication - -Even for trivial functions, defining data transformations and gathering the data can still be tedious. Here's how `instructor` automates this. +Before we dig into the nitty-gritty, let's look at how easy it is to use Instructor's distillation feature, which relies on function-calling fine-tuning, to export the data to a JSONL file.
```python import logging @@ -118,6 +28,7 @@ import random from pydantic import BaseModel from instructor import Instructions +# Logging setup logging.basicConfig(level=logging.INFO) instructions = Instructions( @@ -131,20 +42,38 @@ class Multiply(BaseModel): b: int result: int +# Define a function with distillation +# The decorator will automatically generate a dataset for fine-tuning +# They must return a pydantic model to leverage function calling @instructions.distil def fn(a: int, b: int) -> Multiply: resp = a * b return Multiply(a=a, b=b, result=resp) +# Generate some data for _ in range(10): a = random.randint(100, 999) b = random.randint(100, 999) print(fn(a, b)) ``` -Once your function is defined, you can run it and it will automatically log the output to the file specified in the `log_handlers` argument. +## The Intricacies of Fine-tuning Language Models -## Logging output +Fine-tuning isn't just about writing a function like `def f(a, b): return a * b`. It requires detailed data preparation and logging. However, Instructor provides a built-in logging feature and structured outputs to simplify this. + +## Why Instructor and Distillation are Game Changers + +The library offers two main benefits: + +1. **Efficiency**: Streamlines functions, distilling requirements into model weights and a few lines of code. +2. **Integration**: Eases combining classical machine learning and language models by providing a simple interface that wraps existing functions. + +## Role of Instructor in Simplifying Fine-Tuning + +The `from instructor import Instructions` feature is a time saver. It auto-generates a fine-tuning dataset, making it a breeze to imitate a function's behavior. + +## Logging Output and Running a Finetune +Here's how the logging output would look: ```python { @@ -165,10 +94,30 @@ Once your function is defined, you can run it and it will automatically log the } ``` -Now this file is ready to be used for finetuning. 
You can use the `instructor` CLI to finetune the model. Check out the [finetune docs](https://jxnl.github.io/instructor/cli/finetune/) for more information. +Run a finetune like this: -## Next step +```bash +instructor jobs create-from-file math_finetunes.jsonl +``` -The `instructor` library offers an effortless way to make your llm functions smarter and more efficient. The best part? It ensures backward compatibility, so you can implement these improvements without breaking your existing codebase. +## Next Steps and Future Plans +Here's a sneak peek of what I'm planning: -Now if you're thinking wow, I'd love a backend service to do this for continously, you're in luck! Check out the survey at [useinstructor.com](https://useinstructor.com) and let us know who you are. \ No newline at end of file +```python +from instructor import Instructions + +instructions = Instructions( + name="three_digit_multiply", +) + +@instructions.distil(model='gpt-3.5-turbo:finetuned-123', mode="dispatch") +def fn(a: int, b: int) -> Multiply: + resp = a * b + return Multiply(a=a, b=b, result=resp) +``` + +With this, you can swap the function implementation, making it backward compatible. You can even imagine using different models for different tasks, or validating and running evals by using the original function and comparing it to the distillation. + +## Conclusion + +We've seen how `Instructor` can make your life easier, from fine-tuning to distillation. Now if you're thinking wow, I'd love a backend service to do this continuously, you're in luck! Please check out the survey at [useinstructor.com](https://useinstructor.com) and let us know who you are.
diff --git a/examples/distilations/three_digit_mul.py b/examples/distilations/three_digit_mul.py index 74ccd69..9c521ee 100644 --- a/examples/distilations/three_digit_mul.py +++ b/examples/distilations/three_digit_mul.py @@ -22,49 +22,43 @@ class Multiply(BaseModel): @instructions.distil -def fn(a: int, b: int, c: str) -> Multiply: - """_summary_ - - Args: - a (int): _description_ - b (int): _description_ - c (str): _description_ - - Returns: - Response: _description_ - """ - resp = a + b +def fn(a: int, b: int) -> Multiply: + """Return the result of multiplying a and b together""" + resp = a * b return Multiply(a=a, b=b, result=resp) if __name__ == "__main__": import random - # A log will look like this: - log_line = { + log_lines = { "messages": [ { "role": "system", - "content": 'Predict the results of this function:\n\ndef fn(a: int, b: int, c: str) -> __main__.Response\n"""\n_summary_\n\nArgs:\n a (int): _description_\n b (int): _description_\n c (str): _description_\n\nReturns:\n Response: _description_\n"""', + "content": 'Predict the results of this function:\n\ndef fn(a: int, b: int) -> __main__.Multiply\n"""\nReturn the result of multiplying a and b together\n"""', }, - {"role": "user", "content": 'Return fn(133, b=539, c="hello")'}, + {"role": "user", "content": "Return `fn(169, b=166)`"}, { "role": "assistant", "function_call": { - "name": "Response", - "arguments": '{"a":133,"b":539,"result":672}', + "name": "Multiply", + "arguments": '{\n "a": 169,\n "b": 166,\n "result": 28054\n}', }, }, ], "functions": [ { - "name": "Response", - "description": "Correctly extracted `Response` with all the required parameters with correct types", + "name": "Multiply", + "description": "Correctly extracted `Multiply` with all the required parameters with correct types", "parameters": { "properties": { - "a": {"type": "integer"}, - "b": {"type": "integer"}, - "result": {"type": "integer"}, + "a": {"title": "A", "type": "integer"}, + "b": {"title": "B", "type": 
"integer"}, + "result": { + "description": "The result of the multiplication", + "title": "Result", + "type": "integer", + }, }, "required": ["a", "b", "result"], "type": "object", @@ -72,8 +66,7 @@ if __name__ == "__main__": } ], } - for _ in range(10): a = random.randint(100, 999) b = random.randint(100, 999) - print("returning", fn(a, b=b, c="hello")) + print("returning", fn(a, b=b)) diff --git a/examples/distilations/three_digit_mul_dispatch.py b/examples/distilations/three_digit_mul_dispatch.py new file mode 100644 index 0000000..1feb611 --- /dev/null +++ b/examples/distilations/three_digit_mul_dispatch.py @@ -0,0 +1,49 @@ +import logging + +from pydantic import BaseModel, Field +from instructor import Instructions +import instructor + +instructor.patch() + +logging.basicConfig(level=logging.INFO) + +# Usage +instructions = Instructions( + name="three_digit_multiply", + finetune_format="messages", + include_code_body=True, + log_handlers=[ + logging.FileHandler("math_finetunes.jsonl"), + ], +) + + +class Multiply(BaseModel): + a: int + b: int + result: int = Field(..., description="The result of the multiplication") + + +@instructions.distil(mode="dispatch", model="ft:gpt-3.5-turbo-0613:personal::8CazU0uq") +def fn(a: int, b: int) -> Multiply: + """Return the result of the multiplication as an integer""" + resp = a * b + return Multiply(a=a, b=b, result=resp) + + +if __name__ == "__main__": + import random + + for _ in range(5): + a = random.randint(100, 999) + b = random.randint(100, 999) + result = fn(a, b) + print(f"{a} * {b} = {result.result}, expected {a*b}") + """ + 972 * 508 = 493056, expected 493776 + 145 * 369 = 53505, expected 53505 + 940 * 440 = 413600, expected 413600 + 114 * 213 = 24282, expected 24282 + 259 * 650 = 168350, expected 168350 + """ diff --git a/instructor/distil.py b/instructor/distil.py index 179a9fb..e4801f5 100644 --- a/instructor/distil.py +++ b/instructor/distil.py @@ -4,10 +4,19 @@ import inspect import json import logging 
+<<<<<<< HEAD from typing import Any, Callable, List, Optional import uuid from pydantic import BaseModel, validate_call +======= +from typing import Any, Callable, List, Optional, Type +from pydantic import BaseModel, validate_call + +import uuid +import openai + +>>>>>>> distil from instructor import openai_schema @@ -83,12 +92,31 @@ class Instructions: log_handlers: List[logging.Handler] = None, finetune_format: FinetuneFormat = FinetuneFormat.MESSAGES, indent: int = 2, +<<<<<<< HEAD ): +======= + include_code_body: bool = False, + ): + """ + Instructions for distillation and dispatch. + + :param name: Name of the instructions. + :param id: ID of the instructions. + :param log_handlers: List of log handlers to use. + :param finetune_format: Format to use for finetuning. + :param indent: Indentation to use for finetuning. + :param include_code_body: Whether to include the code body in the finetuning. + """ +>>>>>>> distil self.name = name self.id = id or str(uuid.uuid4()) self.unique_id = str(uuid.uuid4()) self.finetune_format = finetune_format self.indent = indent +<<<<<<< HEAD +======= + self.include_code_body = include_code_body +>>>>>>> distil self.logger = logging.getLogger(self.name) for handler in log_handlers or []: @@ -99,6 +127,10 @@ class Instructions: *args, name: str = None, mode: str = "distil", +<<<<<<< HEAD +======= + model: str = "gpt-3.5-turbo", +>>>>>>> distil fine_tune_format: FinetuneFormat = None, ): """ @@ -122,7 +154,10 @@ class Instructions: """ allowed_modes = {"distil", "dispatch"} assert mode in allowed_modes, f"Must be in {allowed_modes}" +<<<<<<< HEAD assert mode == "distil", "Only distil mode is supported at the moment." 
+======= +>>>>>>> distil if fine_tune_format is None: fine_tune_format = self.finetune_format @@ -130,6 +165,23 @@ class Instructions: def _wrap_distil(fn): msg = f"Return type hint for {fn} must subclass `pydantic.BaseModel'" assert is_return_type_base_model_or_instance(fn), msg +<<<<<<< HEAD +======= + return_base_model = inspect.signature(fn).return_annotation + + @functools.wraps(fn) + def _dispatch(*args, **kwargs): + openai_kwargs = self.openai_kwargs( + name=name, + fn=fn, + args=args, + kwargs=kwargs, + base_model=return_base_model, + ) + return openai.ChatCompletion.create( + **openai_kwargs, model=model, response_model=return_base_model + ) +>>>>>>> distil @functools.wraps(fn) def _distil(*args, **kwargs): @@ -140,7 +192,15 @@ class Instructions: return resp +<<<<<<< HEAD return _distil +======= + if mode == "dispatch": + return _dispatch + + if mode == "distil": + return _distil +>>>>>>> distil if len(args) == 1 and callable(args[0]): return _wrap_distil(args[0]) @@ -172,35 +232,18 @@ class Instructions: if finetune_format == FinetuneFormat.MESSAGES: openai_function_call = openai_schema(base_model).openai_schema - func_def = get_signature_from_fn(fn).replace(fn.__name__, name) - - str_args = ", ".join(map(str, args)) - str_kwargs = ( - ", ".join(f"{k}={json.dumps(v)}" for k, v in kwargs.items()) or None + openai_kwargs = self.openai_kwargs(name, fn, args, kwargs, base_model) + openai_kwargs["messages"].append( + { + "role": "assistant", + "function_call": { + "name": base_model.__name__, + "arguments": resp.model_dump_json(indent=self.indent), + }, + } ) - call_args = ", ".join(filter(None, [str_args, str_kwargs])) - - function_body = { - "messages": [ - { - "role": "system", - "content": f"Predict the results of this function:\n\n{func_def}", - }, - { - "role": "user", - "content": f"Return `{name}({call_args})`", - }, - { - "role": "assistant", - "function_call": { - "name": openai_function_call["name"], - "arguments": 
resp.model_dump_json(indent=self.indent), - }, - }, - ], - "functions": [openai_function_call], - } - self.logger.info(json.dumps(function_body)) + openai_kwargs["functions"] = [openai_function_call] + self.logger.info(json.dumps(openai_kwargs)) if finetune_format == FinetuneFormat.RAW: function_body = dict( @@ -212,3 +255,29 @@ class Instructions: schema=base_model.model_json_schema(), ) self.logger.info(json.dumps(function_body)) + + def openai_kwargs(self, name, fn, args, kwargs, base_model): + if self.include_code_body: + func_def = format_function(fn) + else: + func_def = get_signature_from_fn(fn) + + str_args = ", ".join(map(str, args)) + str_kwargs = ( + ", ".join(f"{k}={json.dumps(v)}" for k, v in kwargs.items()) or None + ) + call_args = ", ".join(filter(None, [str_args, str_kwargs])) + + function_body = { + "messages": [ + { + "role": "system", + "content": f"Predict the results of this function:\n\n{func_def}", + }, + { + "role": "user", + "content": f"Return `{name}({call_args})`", + }, + ], + } + return function_body diff --git a/instructor/function_calls.py b/instructor/function_calls.py index 9daa86b..eb5c3a9 100644 --- a/instructor/function_calls.py +++ b/instructor/function_calls.py @@ -27,16 +27,6 @@ from typing import Any, Callable from pydantic import BaseModel, create_model, validate_arguments -def _remove_a_key(d, remove_key) -> None: - """Remove a key from a dictionary recursively""" - if isinstance(d, dict): - for key in list(d.keys()): - if key == remove_key and "type" in d.keys(): - del d[key] - else: - _remove_a_key(d[key], remove_key) - - class openai_function: """ Decorator to convert a function into an OpenAI function. 
@@ -82,8 +72,6 @@ class openai_function: parameters["required"] = sorted( k for k, v in parameters["properties"].items() if not "default" in v ) - _remove_a_key(parameters, "additionalProperties") - _remove_a_key(parameters, "title") self.openai_schema = { "name": self.func.__name__, "description": self.docstring.short_description, @@ -200,8 +188,6 @@ class OpenAISchema(BaseModel): f"the required parameters with correct types" ) - _remove_a_key(parameters, "title") - _remove_a_key(parameters, "additionalProperties") return { "name": schema["title"], "description": schema["description"], diff --git a/tests/test_multitask.py b/tests/test_multitask.py index 9e24949..04bd014 100644 --- a/tests/test_multitask.py +++ b/tests/test_multitask.py @@ -10,67 +10,8 @@ def test_multi_task(): query: str multitask = MultiTask(Search) - assert multitask.openai_schema == { - "description": "Correct segmentation of `Search` tasks", - "name": "MultiSearch", - "parameters": { - "$defs": { - "Search": { - "properties": { - "id": {"type": "integer"}, - "query": {"type": "string"}, - }, - "required": ["id", "query"], - "description": "This is the search docstring", - "type": "object", - } - }, - "properties": { - "tasks": { - "description": "Correctly segmented list of `Search` tasks", - "items": {"$ref": "#/$defs/Search"}, - "type": "array", - } - }, - "required": ["tasks"], - "type": "object", - }, - } - - -def test_multi_task_with_name_and_desc(): - class Search(OpenAISchema): - """This is the search docstring""" - - id: int - query: str - - multitask = MultiTask( - subtask_class=Search, name="MyCustomName", description="MyCustomDesc" + assert multitask.openai_schema["name"] == "MultiSearch" + assert ( + multitask.openai_schema["description"] + == "Correct segmentation of `Search` tasks" ) - assert multitask.openai_schema == { - "description": "MyCustomDesc", - "name": "MultiMyCustomName", - "parameters": { - "$defs": { - "Search": { - "properties": { - "id": {"type": "integer"}, - 
"query": {"type": "string"}, - }, - "required": ["id", "query"], - "description": "This is the search docstring", - "type": "object", - } - }, - "properties": { - "tasks": { - "description": "Correctly segmented list of `MyCustomName` tasks", - "items": {"$ref": "#/$defs/Search"}, - "type": "array", - } - }, - "required": ["tasks"], - "type": "object", - }, - }