mirror of
https://github.com/kennethreitz/instructor.git
synced 2026-06-05 22:50:18 +00:00
doc: synthetic data
This commit is contained in:
@@ -0,0 +1,193 @@
|
||||
---
|
||||
draft: False
|
||||
date: 2024-03-08
|
||||
authors:
|
||||
- jxnl
|
||||
---
|
||||
|
||||
# Simple Synthetic Data Generation
|
||||
|
||||
What that people have been using instructor for is to generate synthetic data rather than extracting data itself. We can even use the J-Schemo extra fields to give specific examples to control how we generate data.
|
||||
|
||||
Consider the example below. We'll likely generate very simple names.
|
||||
|
||||
```python
|
||||
from typing import Iterable
|
||||
from pydantic import BaseModel
|
||||
import instructor
|
||||
from openai import OpenAI
|
||||
|
||||
|
||||
# Define the UserDetail model
|
||||
class UserDetail(BaseModel):
|
||||
name: str
|
||||
age: int
|
||||
|
||||
|
||||
# Patch the OpenAI client to enable the response_model functionality
|
||||
client = instructor.patch(OpenAI())
|
||||
|
||||
|
||||
def generate_fake_users(count: int) -> Iterable[UserDetail]:
|
||||
return client.chat.completions.create(
|
||||
model="gpt-3.5-turbo",
|
||||
response_model=Iterable[UserDetail],
|
||||
messages=[
|
||||
{"role": "user", "content": f"Generate a {count} synthetic users"},
|
||||
],
|
||||
)
|
||||
|
||||
|
||||
for user in generate_fake_users(5):
|
||||
print(user)
|
||||
"""
|
||||
name='Alice' age=25
|
||||
name='Bob' age=30
|
||||
name='Charlie' age=35
|
||||
name='David' age=40
|
||||
name='Eve' age=45
|
||||
"""
|
||||
```
|
||||
|
||||
## Leveraging Simple Examples
|
||||
|
||||
We might want to set examples as part of the prompt by leveraging Pydantics configuration. We can set examples directly in the JSON scheme itself.
|
||||
|
||||
```python
|
||||
from typing import Iterable
|
||||
from pydantic import BaseModel
|
||||
import instructor
|
||||
from openai import OpenAI
|
||||
|
||||
|
||||
# Define the UserDetail model
|
||||
class UserDetail(BaseModel):
|
||||
name: str = Field(examples=["Timothee Chalamet", "Zendaya"])
|
||||
age: int
|
||||
|
||||
|
||||
# Patch the OpenAI client to enable the response_model functionality
|
||||
client = instructor.patch(OpenAI())
|
||||
|
||||
|
||||
def generate_fake_users(count: int) -> Iterable[UserDetail]:
|
||||
return client.chat.completions.create(
|
||||
model="gpt-3.5-turbo",
|
||||
response_model=Iterable[UserDetail],
|
||||
messages=[
|
||||
{"role": "user", "content": f"Generate a {count} synthetic users"},
|
||||
],
|
||||
)
|
||||
|
||||
|
||||
for user in generate_fake_users(5):
|
||||
print(user)
|
||||
"""
|
||||
name='Timothee Chalamet' age=25
|
||||
name='Zendaya' age=24
|
||||
name='Keanu Reeves' age=56
|
||||
name='Scarlett Johansson' age=36
|
||||
name='Chris Hemsworth' age=37
|
||||
"""
|
||||
```
|
||||
|
||||
By incorporating names of celebrities as examples, we have shifted towards generating synthetic data featuring well-known personalities, moving away from the simplistic, single-word names previously used.
|
||||
|
||||
## Leveraging Complex Example
|
||||
|
||||
To effectively generate synthetic examples with more nuance, lets upgrade to the "gpt-4-turbo-preview" model, use model level examples rather than attribute level examples:
|
||||
|
||||
```Python
|
||||
import instructor
|
||||
|
||||
from typing import Iterable
|
||||
from pydantic import BaseModel, Field, ConfigDict
|
||||
from openai import OpenAI
|
||||
|
||||
|
||||
# Define the UserDetail model
|
||||
class UserDetail(BaseModel):
|
||||
"""Old Wizards"""
|
||||
name: str
|
||||
age: int
|
||||
|
||||
model_config = ConfigDict(
|
||||
json_schema_extra={
|
||||
"examples": [
|
||||
{"name": "Gandalf the Grey", "age": 1000},
|
||||
{"name": "Albus Dumbledore", "age": 150},
|
||||
]
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
# Patch the OpenAI client to enable the response_model functionality
|
||||
client = instructor.patch(OpenAI())
|
||||
|
||||
|
||||
def generate_fake_users(count: int) -> Iterable[UserDetail]:
|
||||
return client.chat.completions.create(
|
||||
model="gpt-4-turbo-preview",
|
||||
response_model=Iterable[UserDetail],
|
||||
messages=[
|
||||
{"role": "user", "content": f"Generate `{count}` synthetic examples"},
|
||||
],
|
||||
)
|
||||
|
||||
|
||||
for user in generate_fake_users(5):
|
||||
print(user)
|
||||
"""
|
||||
name='Merlin' age=196
|
||||
name='Saruman the White' age=543
|
||||
name='Radagast the Brown' age=89
|
||||
name='Morgoth' age=901
|
||||
name='Filius Flitwick' age=105
|
||||
"""
|
||||
```
|
||||
|
||||
## Leveraging Descriptions
|
||||
|
||||
By adjusting the descriptions within our Pydantic models, we can subtly influence the nature of the synthetic data generated. This method allows for a more nuanced control over the output, ensuring that the generated data aligns more closely with our expectations or requirements.
|
||||
|
||||
For instance, specifying "Fancy French sounding names" as a description for the `name` field in our `UserDetail` model directs the generation process to produce names that fit this particular criterion, resulting in a dataset that is both diverse and tailored to specific linguistic characteristics.
|
||||
|
||||
|
||||
```python
|
||||
import instructor
|
||||
|
||||
from typing import Iterable
|
||||
from pydantic import BaseModel, Field
|
||||
from openai import OpenAI
|
||||
|
||||
|
||||
# Define the UserDetail model
|
||||
class UserDetail(BaseModel):
|
||||
name: str = Field(description="Fancy French sounding names")
|
||||
age: int
|
||||
|
||||
|
||||
# Patch the OpenAI client to enable the response_model functionality
|
||||
client = instructor.patch(OpenAI())
|
||||
|
||||
|
||||
def generate_fake_users(count: int) -> Iterable[UserDetail]:
|
||||
return client.chat.completions.create(
|
||||
model="gpt-3.5-turbo",
|
||||
response_model=Iterable[UserDetail],
|
||||
messages=[
|
||||
{"role": "user", "content": f"Generate `{count}` synthetic users"},
|
||||
],
|
||||
)
|
||||
|
||||
|
||||
for user in generate_fake_users(5):
|
||||
print(user)
|
||||
"""
|
||||
name='Jean' age=25
|
||||
name='Claire' age=30
|
||||
name='Pierre' age=22
|
||||
name='Marie' age=27
|
||||
name='Luc' age=35
|
||||
"""
|
||||
```
|
||||
Reference in New Issue
Block a user