remove bad examples

This commit is contained in:
Jason Liu
2024-01-05 15:13:49 -05:00
parent c16d622e88
commit 8893bdd83b
7 changed files with 169 additions and 326 deletions
-138
View File
@@ -1,138 +0,0 @@
# Example: Converting Text into Dataframes
In this example, we'll demonstrate how to convert a text into dataframes using OpenAI Function Call. We will define the necessary data structures using Pydantic and show how to convert the text into dataframes.
!!! note "Motivation"
Often times when we parse data we have an opportunity to extract structured data, what if we could extract an arbitrary number of tables with arbitrary schemas? By pulling out dataframes we could write tables or .csv files and attach them to our retrieved data.
## Defining the Data Structures
Let's start by defining the data structures required for this task: `RowData`, `Dataframe`, and `Database`.
```python
from pydantic import Field, BaseModel
from typing import List, Any
class RowData(BaseModel):
row: List[Any] = Field(..., description="The values for each row")
citation: str = Field(
..., description="The citation for this row from the original source data"
)
class Dataframe(BaseModel):
"""
Class representing a dataframe. This class is used to convert
data into a frame that can be used by pandas.
"""
name: str = Field(..., description="The name of the dataframe")
data: List[RowData] = Field(
...,
description="Correct rows of data aligned to column names, Nones are allowed",
)
columns: List[str] = Field(
...,
description="Column names relevant from source data, should be in snake_case",
)
def to_pandas(self):
import pandas as pd
columns = self.columns + ["citation"]
data = [row.row + [row.citation] for row in self.data]
return pd.DataFrame(data=data, columns=columns)
class Database(BaseModel):
"""
A set of correct named and defined tables as dataframes
"""
tables: List[Dataframe] = Field(
...,
description="List of tables in the database",
)
```
The `RowData` class represents a single row of data in the dataframe. It contains a `row` attribute for the values in each row and a `citation` attribute for the citation from the original source data.
The `Dataframe` class represents a dataframe and consists of a `name` attribute, a list of `RowData` objects in the `data` attribute, and a list of column names in the `columns` attribute. It also provides a `to_pandas` method to convert the dataframe into a Pandas DataFrame.
The `Database` class represents a set of tables in a database. It contains a list of `Dataframe` objects in the `tables` attribute.
## Using the Prompt Pipeline
To convert a text into dataframes, we'll use the Prompt Pipeline in OpenAI Function Call. We can define a function `dataframe` that takes a text as input and returns a `Database` object.
```python
import instructor
from openai import OpenAI
# Apply the patch to the OpenAI client
# enables response_model keyword
client = instructor.patch(OpenAI())
def dataframe(data: str) -> Database:
return client.chat.completions.create(
model="gpt-4-0613",
temperature=0.1,
response_model=Database,
messages=[
{
"role": "system",
"content": """Map this data into a dataframe a
nd correctly define the correct columns and rows""",
},
{
"role": "user",
"content": f"{data}",
},
],
max_tokens=1000,
)
```
The `dataframe` function takes a string `data` as input and creates a completion using the Prompt Pipeline. It prompts the model to map the data into a dataframe and define the correct columns and rows. The resulting completion is then converted into a `Database` object.
## Evaluating an Example
Let's evaluate the example by converting a text into dataframes using the `dataframe` function and print the resulting dataframes.
```python
dfs = dataframe("""My name is John and I am 25 years old. I live in
New York and I like to play basketball. His name is
Mike and he is 30 years old. He lives in San Francisco
and he likes to play baseball. Sarah is 20 years old
and she lives in Los Angeles. She likes to play tennis.
Her name is Mary and she is 35 years old.
She lives in Chicago.
On one team 'Tigers' the captain is John and there are 12 players.
On the other team 'Lions' the captain is Mike and there are 10 players.
""")
for df in dfs.tables:
print(df.name)
print(df.to_pandas())
```
The output will be:
```sh
People
Name Age City Favorite Sport
0 John 25 New York Basketball
1 Mike 30 San Francisco Baseball
2 Sarah 20 Los Angeles Tennis
3 Mary 35 Chicago None
Teams
Team Name Captain Number of Players
0 Tigers John 12
1 Lions Mike 10
```
+6 -7
View File
@@ -10,12 +10,11 @@
6. [How are complex queries decomposed into subqueries in a single request?](planning-tasks.md)
7. [How are entities extracted and resolved from documents?](entity_resolution.md)
8. [How are recursive schemas implemented and understood?](recursive.md)
9. [How are tables extracted automatically from textual data?](autodataframe.md)
10. [How is multi-file code generation accomplished?](gpt-engineer.md)
11. [How is Personally Identifiable Information sanitized from documents?](pii.md)
12. [How are action items and dependencies generated from transcripts?](action_items.md)
13. [How to enable OpenAI's moderation](moderation.md)
14. [How to extract tables from images](extracting_tables.md)
15. [How to generate advertising copy from image inputs](image_to_ad_copy.md)
9. [How is multi-file code generation accomplished?](gpt-engineer.md)
10. [How is Personally Identifiable Information sanitized from documents?](pii.md)
11. [How are action items and dependencies generated from transcripts?](action_items.md)
12. [How to enable OpenAI's moderation](moderation.md)
13. [How to extract tables using GPT-Vision?](extracting_tables.md)
14. [How to generate advertising copy from image inputs](image_to_ad_copy.md)
Explore more!
@@ -1,77 +0,0 @@
from instructor import OpenAISchema
from pydantic import Field
from typing import List, Any
from openai import OpenAI
client = OpenAI()
class RowData(OpenAISchema):
row: List[Any] = Field(..., description="The values for each row")
class Dataframe(OpenAISchema):
"""
Class representing a dataframe. This class is used to convert
data into a frame that can be used by pandas.
"""
data: List[RowData] = Field(
...,
description="Correct rows of data aligned to column names, Nones are allowed",
)
columns: List[str] = Field(
...,
description="Column names relevant from source data, should be in snake_case",
)
def to_pandas(self):
import pandas as pd
columns = self.columns
data = [row.row for row in self.data]
return pd.DataFrame(data=data, columns=columns)
def dataframe(data: str) -> Dataframe:
completion = client.chat.completions.create(
model="gpt-3.5-turbo-0613",
temperature=0.1,
functions=[Dataframe.openai_schema],
function_call={"name": Dataframe.openai_schema["name"]},
messages=[
{
"role": "system",
"content": """Map this data into a dataframe a
nd correctly define the correct columns and rows""",
},
{
"role": "user",
"content": f"{data}",
},
],
max_tokens=1000,
)
return Dataframe.from_response(completion)
if __name__ == "__main__":
df = dataframe(
"""My name is John and I am 25 years old. I live in
New York and I like to play basketball. His name is
Mike and he is 30 years old. He lives in San Francisco
and he likes to play baseball. Sarah is 20 years old
and she lives in Los Angeles. She likes to play tennis.
Her name is Mary and she is 35 years old.
She lives in Chicago."""
)
print(df.to_pandas())
"""
name age location hobby
0 John 25 New York basketball
1 Mike 30 San Francisco baseball
2 Sarah 20 Los Angeles tennis
3 Mary 35 Chicago None
"""
@@ -1,97 +0,0 @@
from instructor import OpenAISchema
from pydantic import Field
from typing import List, Any
from openai import OpenAI
client = OpenAI()
class RowData(OpenAISchema):
row: List[Any] = Field(..., description="Correct values for each row")
class Dataframe(OpenAISchema):
name: str = Field(..., description="The name of the dataframe")
data: List[RowData] = Field(
...,
description="Correct rows of data aligned to column names, Nones are allowed",
)
columns: List[str] = Field(
...,
description="Column names relevant from source data, should be in snake_case",
)
def to_pandas(self):
import pandas as pd
columns = self.columns
data = [row.row for row in self.data]
return pd.DataFrame(data=data, columns=columns)
class Database(OpenAISchema):
"""
A set of correct named and defined tables as dataframes
Each one should have the right number of columns and correct
values for each.
"""
tables: List[Dataframe] = Field(
...,
description="List of tables in the database",
)
def dataframe(data: str) -> Database:
completion = client.chat.completions.create(
model="gpt-4-0613",
temperature=0.0,
functions=[Database.openai_schema],
function_call={"name": Database.openai_schema["name"]},
messages=[
{
"role": "system",
"content": """Map this data into a dataframe a
nd correctly define the correct columns and rows""",
},
{
"role": "user",
"content": f"{data}",
},
],
max_tokens=1000,
)
return Database.from_response(completion)
if __name__ == "__main__":
dfs = dataframe(
"""My name is John and I am 25 years old. I live in
New York and I like to play basketball. His name is
Mike and he is 30 years old. He lives in San Francisco
and he likes to play baseball. Sarah is 20 years old
and she lives in Los Angeles. She likes to play tennis.
Her name is Mary and she is 35 years old.
She lives in Chicago.
On one team 'Tigers' the captan is John and there are 12 players.
On the other team 'Lions' the captan is Mike and there are 10 players.
"""
)
for df in dfs.tables:
print(df.name)
print(df.to_pandas())
"""
People
ID Name Age City Favorite Sport
0 1 John 25 New York Basketball
1 2 Mike 30 San Francisco Baseball
2 3 Sarah 20 Los Angeles Tennis
3 4 Mary 35 Chicago None
Teams
ID Team Name Captain Number of Players
0 1 Tigers John 12
1 2 Lions Mike 10
"""
+94
View File
@@ -0,0 +1,94 @@
from openai import OpenAI
from io import StringIO
from typing import Annotated, Any, Iterable
from openai import OpenAI
from pydantic import (
BaseModel,
BeforeValidator,
PlainSerializer,
InstanceOf,
WithJsonSchema,
)
import pandas as pd
from tomlkit import table
import instructor
client = instructor.patch(OpenAI(), mode=instructor.function_calls.Mode.MD_JSON)
def md_to_df(data: Any) -> Any:
if isinstance(data, str):
return (
pd.read_csv(
StringIO(data), # Get rid of whitespaces
sep="|",
index_col=1,
)
.dropna(axis=1, how="all")
.iloc[1:]
.map(lambda x: x.strip())
)
return data
MarkdownDataFrame = Annotated[
InstanceOf[pd.DataFrame],
BeforeValidator(md_to_df),
PlainSerializer(lambda x: x.to_markdown()),
WithJsonSchema(
{
"type": "string",
"description": """
The markdown representation of the table,
each one should be tidy, do not try to join tables
that should be seperate""",
}
),
]
class Table(BaseModel):
caption: str
dataframe: MarkdownDataFrame
client = instructor.patch(OpenAI())
tables = client.chat.completions.create(
model="gpt-3.5-turbo",
response_model=Iterable[Table],
messages=[
{
"role": "system",
"content": "Please extract the tables from the following text, merge as much as possible:",
},
{
"role": "user",
"content": """
My name is John and I am 25 years old. I live in
New York and I like to play basketball. His name is
Mike and he is 30 years old. He lives in San Francisco
and he likes to play baseball. Sarah is 20 years old
and she lives in Los Angeles. She likes to play tennis.
Her name is Mary and she is 35 years old.
She lives in Chicago.
""",
},
],
)
for table in tables:
print(table.caption)
print(table.dataframe)
print()
"""
People
Age City Hobby
Name
John 25 New York Basketball
Mike 30 San Francisco Baseball
Sarah 20 Los Angeles Tennis
Mary 35 Chicago N/A
"""
@@ -1,11 +1,61 @@
from openai import OpenAI
client = OpenAI()
from io import StringIO
from typing import Annotated, Any, Iterable
from openai import OpenAI
from pydantic import (
BaseModel,
BeforeValidator,
PlainSerializer,
InstanceOf,
WithJsonSchema,
)
import pandas as pd
import instructor
response = client.chat.completions.create(
client = instructor.patch(OpenAI(), mode=instructor.function_calls.Mode.MD_JSON)
def md_to_df(data: Any) -> Any:
if isinstance(data, str):
return (
pd.read_csv(
StringIO(data), # Get rid of whitespaces
sep="|",
index_col=1,
)
.dropna(axis=1, how="all")
.iloc[1:]
.map(lambda x: x.strip())
)
return data
MarkdownDataFrame = Annotated[
InstanceOf[pd.DataFrame],
BeforeValidator(md_to_df),
PlainSerializer(lambda x: x.to_markdown()),
WithJsonSchema(
{
"type": "string",
"description": """
The markdown representation of the table,
each one should be tidy, do not try to join tables
that should be seperate""",
}
),
]
class Table(BaseModel):
caption: str
dataframe: MarkdownDataFrame
tables = client.chat.completions.create(
model="gpt-4-vision-preview",
max_tokens=1000,
response_model=Iterable[Table],
messages=[
{
"role": "user",
@@ -38,4 +88,17 @@ response = client.chat.completions.create(
],
)
print(response.choices[0].message.content)
for table in tables:
print(table.caption)
print(table.dataframe)
print()
"""
D1 App Retention Rates July 2023 (Ireland & U.K.)
Ireland UK
Category
Education 14% 12%
Entertainment 13% 11%
Games 26% 25%
Social 27% 18%
Utilities 11% 9%
"""
+2 -3
View File
@@ -149,8 +149,7 @@ nav:
- Overview: 'examples/index.md'
- Text Classification: 'examples/classification.md'
- Self Critique: 'examples/self_critique.md'
- Image Extracting Tables: 'examples/extracting_tables.md'
- Image to Ad Copy: 'examples/image_to_ad_copy.md'
- Extracting Tables: 'examples/extracting_tables.md'
- Moderation: 'examples/moderation.md'
- Citations: 'examples/exact_citations.md'
- Knowledge Graph: 'examples/knowledge_graph.md'
@@ -158,11 +157,11 @@ nav:
- Search Queries: 'examples/search.md'
- Query Decomposition: 'examples/planning-tasks.md'
- Recursive Schemas: 'examples/recursive.md'
- Table Extraction: 'examples/autodataframe.md'
- Action Item and Dependency Mapping: 'examples/action_items.md'
- Multi-File Code Generation: 'examples/gpt-engineer.md'
- PII Data Sanitization: 'examples/pii.md'
- Open Source: 'examples/open_source.md'
- Image to Ad Copy: 'examples/image_to_ad_copy.md'
- CLI Reference:
- "Introduction": "cli/index.md"
- "Finetuning GPT-3.5": "cli/finetune.md"