diff --git a/tutorials/3.0.applications-rag.ipynb b/tutorials/3.0.applications-rag.ipynb
index 62ce23b..6d1c024 100644
--- a/tutorials/3.0.applications-rag.ipynb
+++ b/tutorials/3.0.applications-rag.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Applying Structured Output to RAG applications "
+ "# Applying Structured Output to RAG applications\n"
]
},
{
@@ -23,7 +23,7 @@
"\n",
"**Why is there a need for them?**\n",
"\n",
- "Pre-trained large language models do not learn over time. If you ask them a question they have not been trained on, they will often hallucinate. Therefore, we need to embed our own data to achieve a better output."
+ "Pre-trained large language models do not learn over time. If you ask them a question they have not been trained on, they will often hallucinate. Therefore, we need to embed our own data to achieve a better output.\n"
]
},
{
@@ -38,7 +38,7 @@
"\n",
"- **Query-Document Mismatch:** It assumes that the query and document embeddings will align in the vector space, which is often not the case.\n",
"- **Text Search Limitations:** The model is restricted to simple text queries without the nuances of advanced search features.\n",
- "- **Limited Planning Ability:** It fails to consider additional contextual information that could refine the search results."
+ "- **Limited Planning Ability:** It fails to consider additional contextual information that could refine the search results.\n"
]
},
{
@@ -62,7 +62,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Practical Examples"
+ "## Practical Examples\n"
]
},
{
@@ -78,7 +78,7 @@
"metadata": {},
"outputs": [],
"source": [
- "import instructor \n",
+ "import instructor\n",
"\n",
"from openai import OpenAI\n",
"from typing import List\n",
@@ -93,8 +93,8 @@
"source": [
"### Example 1) Improving Extractions\n",
"\n",
- "One of the big limitations is that often times the query we embed and the text \n",
- "A common method of using structured output is to extract information from a document and use it to answer a question. Directly, we can be creative in how we extract, summarize and generate potential questions in order for our embeddings to do better. \n",
+ "One of the big limitations is that the query we embed and the text we search over often do not align well in embedding space.\n",
+ "A common use of structured output is to extract information from a document and use it to answer a question. We can be creative in how we extract, summarize, and generate potential questions so that our embeddings perform better.\n",
"\n",
"For example, instead of using just a text chunk we could try to:\n",
"\n",
@@ -102,7 +102,7 @@
"2. extract hypothetical questions\n",
"3. generate a summary of the text\n",
"\n",
- "In the example below, we use the `instructor` library to extract the key words and themes from a text chunk and use them to answer a question."
+ "In the example below, we use the `instructor` library to extract the key words and themes from a text chunk and use them to answer a question.\n"
]
},
{
@@ -113,9 +113,14 @@
"source": [
"class Extraction(BaseModel):\n",
"    topic: str\n",
- "    summary: str \n",
- "    hypothetical_questions: List[str] = Field(default_factory=list, description=\"Hypothetical questions that this document could answer\")\n",
- "    keywords: List[str] = Field(default_factory=list, description=\"Keywords that this document is about\")"
+ "    summary: str\n",
+ "    hypothetical_questions: List[str] = Field(\n",
+ "        default_factory=list,\n",
+ "        description=\"Hypothetical questions that this document could answer\",\n",
+ "    )\n",
+ "    keywords: List[str] = Field(\n",
+ "        default_factory=list, description=\"Keywords that this document is about\"\n",
+ "    )"
]
},
{
@@ -224,13 +229,14 @@
" model=\"gpt-4-1106-preview\",\n",
" stream=True,\n",
" response_model=Iterable[Extraction],\n",
- " messages=[{\n",
- " \"role\": \"system\", \n",
- " \"content\": \"Your role is to extract chunks from the following and create a set of topics.\"\n",
- " }, {\n",
- " \"role\": \"user\", \n",
- " \"content\": text_chunk\n",
- " }])\n",
+ " messages=[\n",
+ " {\n",
+ " \"role\": \"system\",\n",
+ " \"content\": \"Your role is to extract chunks from the following and create a set of topics.\",\n",
+ " },\n",
+ " {\"role\": \"user\", \"content\": text_chunk},\n",
+ " ],\n",
+ ")\n",
"\n",
"\n",
"for extraction in extractions:\n",
@@ -241,7 +247,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Now you can imagine if you were to embed the summaries, hypothetical questions, and keywords in a vector database, you can then use a vector search to find the best matching document for a given query. What you'll find is that the results are much better than if you were to just embed the text chunk! "
+ "If you embed the summaries, hypothetical questions, and keywords in a vector database, you can then use vector search to find the best matching document for a given query. The results are often much better than if you embed only the raw text chunk.\n"
]
},
{
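To make the indexing idea above concrete, here is a minimal sketch of how each extraction could be turned into several embeddable texts that all point back to the same source chunk. It assumes a plain dataclass (`ExtractionRecord`) as a stand-in for the notebook's Pydantic `Extraction` model, and the helper name `to_index_entries` and the `chunk-0` identifier are hypothetical. The embedding and vector database steps are left out on purpose.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ExtractionRecord:
    # Hypothetical stand-in for the notebook's Pydantic `Extraction` model.
    topic: str
    summary: str
    hypothetical_questions: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)


def to_index_entries(chunk_id: str, ex: ExtractionRecord) -> List[Tuple[str, str]]:
    """Return (text_to_embed, chunk_id) pairs for one extraction.

    Every derived text (summary, each question, joined keywords) maps back
    to the same chunk, so a hit on any of them retrieves the original text.
    """
    texts = [ex.summary, *ex.hypothetical_questions]
    if ex.keywords:
        texts.append(", ".join(ex.keywords))
    return [(text, chunk_id) for text in texts]


record = ExtractionRecord(
    topic="Retrieval",
    summary="Structured extraction can improve retrieval quality.",
    hypothetical_questions=["How can structured output help RAG?"],
    keywords=["RAG", "embeddings"],
)
entries = to_index_entries("chunk-0", record)
for text, chunk_id in entries:
    print(chunk_id, "<-", text)
```

Each pair would then be embedded and stored with the chunk id as metadata; which vector store to use is outside the scope of this sketch.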
@@ -261,10 +267,12 @@
"source": [
"from datetime import date\n",
"\n",
+ "\n",
"class DateRange(BaseModel):\n",
"    start: date\n",
"    end: date\n",
"\n",
+ "\n",
"class Query(BaseModel):\n",
"    rewritten_query: str\n",
"    published_daterange: DateRange"
@@ -276,14 +284,14 @@
"source": [
"In this example, `DateRange` and `Query` are Pydantic models that structure the user's query with a rewritten query and a date range.\n",
"\n",
- "These models **restructure** the user's query by including a rewritten query, a range of published dates, and a list of domains to search in."
+ "These models **restructure** the user's query by including a rewritten query and a range of published dates.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Using the new restructured query, we can apply this pattern to our function calls to obtain results that are optimized for our backend."
+ "Using the new restructured query, we can apply this pattern to our function calls to obtain results that are optimized for our backend.\n"
]
},
{
@@ -309,16 +317,14 @@
" response_model=Query,\n",
" messages=[\n",
" {\n",
- " \"role\": \"system\", \n",
- " \"content\": f\"You're a query understanding system for the Metafor Systems search engine. Today is {date.today()}. Here are some tips: ...\"\n",
+ " \"role\": \"system\",\n",
+ " \"content\": f\"You're a query understanding system for the Metafor Systems search engine. Today is {date.today()}. Here are some tips: ...\",\n",
" },\n",
- " {\n",
- " \"role\": \"user\", \n",
- " \"content\": f\"query: {q}\"\n",
- " }\n",
+ " {\"role\": \"user\", \"content\": f\"query: {q}\"},\n",
" ],\n",
" )\n",
"\n",
+ "\n",
"expand_query(\"What are some recent developments in AI?\")"
]
},
@@ -326,7 +332,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "This isn't just about adding some date ranges. We can even use some chain of thought prompting to generate tailored searches that are deeply integrated with our backend. "
+ "This isn't just about adding some date ranges. We can even use some chain of thought prompting to generate tailored searches that are deeply integrated with our backend.\n"
]
},
{
@@ -353,6 +359,7 @@
" start: date\n",
" end: date\n",
"\n",
+ "\n",
"class Query(BaseModel):\n",
" rewritten_query: str = Field(\n",
" description=\"Rewrite the query to make it more specific\"\n",
@@ -368,22 +375,75 @@
" response_model=Query,\n",
" messages=[\n",
" {\n",
- " \"role\": \"system\", \n",
- " \"content\": f\"You're a query understanding system for the Metafor Systems search engine. Today is {date.today()}. Here are some tips: ...\"\n",
+ " \"role\": \"system\",\n",
+ " \"content\": f\"You're a query understanding system for the Metafor Systems search engine. Today is {date.today()}. Here are some tips: ...\",\n",
" },\n",
- " {\n",
- " \"role\": \"user\", \n",
- " \"content\": f\"query: {q}\"\n",
- " }\n",
+ " {\"role\": \"user\", \"content\": f\"query: {q}\"},\n",
" ],\n",
" )\n",
"\n",
+ "\n",
"expand_query(\"What are some recent developments in AI?\")"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Using Weights and Biases to track experiments\n",
+ "\n",
+ "While running a function like this in production is quite simple, a lot of time will be spent iterating on and improving the model. To do this, we can use Weights and Biases to track our experiments.\n",
+ "\n",
+ "To do so, we want to manage a few things:\n",
+ "\n",
+ "1. Save input and output pairs for later\n",
+ "2. Save the JSON schema for the response_model\n",
+ "3. Keep snapshots of the model and data so we can compare results over time and see how they change as we modify the model\n"
+ ]
+ },
{
"cell_type": "code",
- "execution_count": 18,
+ "execution_count": 43,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "\n",
+ "def flatten_dict(d, parent_key=\"\", sep=\"_\"):\n",
+ "    \"\"\"\n",
+ "    Flatten a nested dictionary.\n",
+ "\n",
+ "    :param d: The nested dictionary to flatten.\n",
+ "    :param parent_key: The base key to use for the flattened keys.\n",
+ "    :param sep: Separator to use between keys.\n",
+ "    :return: A flattened dictionary.\n",
+ "    \"\"\"\n",
+ "    items = []\n",
+ "    for k, v in d.items():\n",
+ "        new_key = f\"{parent_key}{sep}{k}\" if parent_key else k\n",
+ "        if isinstance(v, dict):\n",
+ "            items.extend(flatten_dict(v, new_key, sep=sep).items())\n",
+ "        else:\n",
+ "            items.append((new_key, v))\n",
+ "    return dict(items)\n",
+ "\n",
+ "\n",
+ "def dicts_to_df(list_of_dicts):\n",
+ "    \"\"\"\n",
+ "    Convert a list of dictionaries to a pandas DataFrame.\n",
+ "\n",
+ "    :param list_of_dicts: List of dictionaries, potentially nested.\n",
+ "    :return: A pandas DataFrame representing the flattened data.\n",
+ "    \"\"\"\n",
+ "    # Flatten each dictionary and create a DataFrame\n",
+ "    flattened_data = [flatten_dict(d) for d in list_of_dicts]\n",
+ "    return pd.DataFrame(flattened_data)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
"metadata": {
"scrolled": true
},
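As a quick check of what the `flatten_dict` helper added in this hunk produces, the sketch below runs it on a nested dictionary shaped like a dumped `Query`. The helper body is copied from the change; the sample values are made up for illustration and are not real model output.

```python
def flatten_dict(d, parent_key="", sep="_"):
    """Flatten a nested dictionary by joining nested keys with `sep`."""
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            # Recurse into nested dicts, carrying the prefix along.
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


# Hypothetical payload in the shape of `Query.model_dump()`.
nested = {
    "rewritten_query": "AI developments",
    "published_daterange": {
        "chain_of_thought": "Recent means the last few weeks.",
        "start": "2023-12-01",
        "end": "2023-12-21",
    },
}
flat = flatten_dict(nested)
print(list(flat.keys()))
```

Each nested field becomes its own column name (for example `published_daterange_start`), which is what lets `dicts_to_df` build a flat table suitable for logging.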
@@ -416,7 +476,7 @@
{
"data": {
"text/html": [
- "Run data is saved locally in /Users/jasonliu/dev/instructor/tutorials/wandb/run-20231221_145615-zv4el5or"
+ "Run data is saved locally in /Users/jasonliu/dev/instructor/tutorials/wandb/run-20231221_153734-idscpy5k"
],
"text/plain": [
""
@@ -428,7 +488,7 @@
{
"data": {
"text/html": [
- "Syncing run swept-sound-11 to Weights & Biases (docs)"
+ "Syncing run easy-feather-16 to Weights & Biases (docs)"
],
"text/plain": [
""
@@ -452,7 +512,7 @@
{
"data": {
"text/html": [
- " View run at https://wandb.ai/instructor/query-understanding/runs/zv4el5or"
+ " View run at https://wandb.ai/instructor/query-understanding/runs/idscpy5k"
],
"text/plain": [
""
@@ -471,12 +531,12 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
- "model_id": "eacf93c56db445ffafcfbf673ee44a91",
+ "model_id": "6440cf236ba24c3b839d1256cfada604",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
- "VBox(children=(Label(value='0.003 MB of 0.003 MB uploaded\\r'), FloatProgress(value=1.0, max=1.0)))"
+ "VBox(children=(Label(value='0.007 MB of 0.007 MB uploaded (0.001 MB deduped)\\r'), FloatProgress(value=1.0, max…"
]
},
"metadata": {},
@@ -485,7 +545,7 @@
{
"data": {
"text/html": [
- " View run swept-sound-11 at: https://wandb.ai/instructor/query-understanding/runs/zv4el5or Synced 4 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)"
+ "W&B sync reduced upload amount by 9.0% "
],
"text/plain": [
""
@@ -497,7 +557,19 @@
{
"data": {
"text/html": [
- "Find logs at: ./wandb/run-20231221_145615-zv4el5or/logs"
+ " View run easy-feather-16 at: https://wandb.ai/instructor/query-understanding/runs/idscpy5k Synced 4 W&B file(s), 1 media file(s), 4 artifact file(s) and 0 other file(s)"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ "Find logs at: ./wandb/run-20231221_153734-idscpy5k/logs"
],
"text/plain": [
""
@@ -511,30 +583,70 @@
"import json\n",
"import wandb\n",
"\n",
+ "\n",
+ "class DateRange(BaseModel):\n",
+ "    chain_of_thought: str = Field(\n",
+ "        description=\"Think step by step to plan what is the best time range to search in\"\n",
+ "    )\n",
+ "    start: date\n",
+ "    end: date\n",
+ "\n",
+ "\n",
+ "class Query(BaseModel):\n",
+ "    rewritten_query: str = Field(\n",
+ "        description=\"Rewrite the query to make it more specific\"\n",
+ "    )\n",
+ "    published_daterange: DateRange = Field(\n",
+ "        description=\"Effective date range to search in\"\n",
+ "    )\n",
+ "\n",
+ "\n",
+ "def expand_query(q) -> Query:\n",
+ "    return client.chat.completions.create(\n",
+ "        model=\"gpt-4-1106-preview\",\n",
+ "        response_model=Query,\n",
+ "        messages=[\n",
+ "            {\n",
+ "                \"role\": \"system\",\n",
+ "                \"content\": f\"You're a query understanding system for the Metafor Systems search engine. Today is {date.today()}. Here are some tips: ...\",\n",
+ "            },\n",
+ "            {\"role\": \"user\", \"content\": f\"query: {q}\"},\n",
+ "        ],\n",
+ "    )\n",
+ "\n",
+ "\n",
"run = wandb.init(\n",
" project=\"query-understanding\",\n",
- " config=Query.model_json_schema()\n",
")\n",
"\n",
- "files = wandb.Artifact('data', type='dataset'\n",
- ")\n",
+ "test_queries = [\n",
+ " \"latest developments in artificial intelligence last 3 weeks\",\n",
+ " \"renewable energy trends past month\",\n",
+ " \"quantum computing advancements last 2 months\",\n",
+ " \"biotechnology updates last 10 days\",\n",
+ "]\n",
+ "\n",
+ "queries = [expand_query(q) for q in test_queries]\n",
"\n",
"with open(\"schema.json\", \"w+\") as f:\n",
" schema = Query.model_json_schema()\n",
" json.dump(schema, f, indent=2)\n",
"\n",
- "with open(\"results.jsonl\", \"w+\") as f:\n",
- " for questions in [\n",
- " \"latest developments in artificial intelligence last 3 weeks\",\n",
- " \"renewable energy trends past month\",\n",
- " \"quantum computing advancements last 2 months\",\n",
- " \"biotechnology updates last 10 days\"\n",
- " ]:\n",
- " query = expand_query(questions)\n",
- " f.write(json.dumps(query.model_dump_json()) + \"\\n\")\n",
- " \n",
- "files.add_file('schema.json')\n",
- "files.add_file('results.jsonl')\n",
+ "with open(\"results.jsonlines\", \"w+\") as f:\n",
+ "    for query in queries:\n",
+ "        f.write(query.model_dump_json() + \"\\n\")\n",
+ "\n",
+ "df = dicts_to_df([q.model_dump() for q in queries])\n",
+ "df[\"input\"] = test_queries\n",
+ "df.to_csv(\"results.csv\")\n",
+ "\n",
+ "run.log({\"results\": wandb.Table(dataframe=df)})\n",
+ "\n",
+ "files = wandb.Artifact(\"data\", type=\"dataset\")\n",
+ "\n",
+ "files.add_file(\"schema.json\")\n",
+ "files.add_file(\"results.jsonlines\")\n",
+ "files.add_file(\"results.csv\")\n",
"\n",
"run.log_artifact(files)\n",
"run.finish()"
@@ -542,7 +654,7 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 42,
"metadata": {},
"outputs": [],
"source": []
@@ -568,13 +680,15 @@
"source": [
"from typing import Literal\n",
"\n",
+ "\n",
"class SearchClient(BaseModel):\n",
"    query: str\n",
"    keywords: List[str]\n",
"    email: str\n",
- "    source: Literal[\"gmail\", \"calendar\"] \n",
+ "    source: Literal[\"gmail\", \"calendar\"]\n",
"    date_range: DateRange\n",
"\n",
+ "\n",
"class Retrival(BaseModel):\n",
"    queries: List[SearchClient]"
]
@@ -626,8 +740,11 @@
" model=\"gpt-4-1106-preview\",\n",
" response_model=Retrival,\n",
" messages=[\n",
- " {\"role\": \"system\", \"content\":f\"You are Jason's personal assistant. Today is {date.today()}\"},\n",
- " {\"role\": \"user\", \"content\": \"What do I have today?\"}\n",
+ " {\n",
+ " \"role\": \"system\",\n",
+ " \"content\": f\"You are Jason's personal assistant. Today is {date.today()}\",\n",
+ " },\n",
+ " {\"role\": \"user\", \"content\": \"What do I have today?\"},\n",
" ],\n",
")\n",
"print(retrival.model_dump_json(indent=4))"
@@ -637,7 +754,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "To make it more challenging, we will assign it multiple tasks, followed by a list of queries that are routed to various search backends, such as email and calendar. Not only do we dispatch to different backends, over which we have no control, but we are also likely to render them to the user in different ways."
+ "To make it more challenging, we will assign it multiple tasks, followed by a list of queries that are routed to various search backends, such as email and calendar. Not only do we dispatch to different backends, over which we have no control, but we are also likely to render them to the user in different ways.\n"
]
},
{
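The routing described above, where each generated query is sent to a different search backend, might look roughly like the following. This is a sketch under stated assumptions: `SearchRequest` is a simplified dataclass stand-in for `SearchClient`, and `search_gmail`, `search_calendar`, `HANDLERS`, and `dispatch` are hypothetical names that return placeholder strings rather than calling any real service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SearchRequest:
    # Hypothetical, trimmed-down stand-in for `SearchClient`:
    # only the fields needed to demonstrate routing.
    query: str
    source: str  # expected to be "gmail" or "calendar"


def search_gmail(req: SearchRequest) -> str:
    # Placeholder backend; a real one would call an email search API.
    return f"[gmail] {req.query}"


def search_calendar(req: SearchRequest) -> str:
    # Placeholder backend; a real one would call a calendar API.
    return f"[calendar] {req.query}"


HANDLERS: Dict[str, Callable[[SearchRequest], str]] = {
    "gmail": search_gmail,
    "calendar": search_calendar,
}


def dispatch(requests: List[SearchRequest]) -> List[str]:
    """Route each request to the handler registered for its source."""
    results = []
    for req in requests:
        handler = HANDLERS.get(req.source)
        if handler is None:
            raise ValueError(f"no backend registered for source {req.source!r}")
        results.append(handler(req))
    return results


results = dispatch(
    [
        SearchRequest(query="meetings today", source="calendar"),
        SearchRequest(query="important emails", source="gmail"),
    ]
)
print(results)
```

Keeping the backends in a lookup table means adding a new source only requires a new handler and a new `Literal` value on the model, and each handler can render its results differently.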
@@ -691,8 +808,14 @@
" model=\"gpt-4-1106-preview\",\n",
" response_model=Retrival,\n",
" messages=[\n",
- " {\"role\": \"system\", \"content\": f\"You are Jason's personal assistant. Today is {date.today()}\"},\n",
- " {\"role\": \"user\", \"content\": \"What meetings do I have today and are there any important emails I should be aware of?\"}\n",
+ " {\n",
+ " \"role\": \"system\",\n",
+ " \"content\": f\"You are Jason's personal assistant. Today is {date.today()}\",\n",
+ " },\n",
+ " {\n",
+ " \"role\": \"user\",\n",
+ " \"content\": \"What meetings do I have today and are there any important emails I should be aware of?\",\n",
+ " },\n",
" ],\n",
")\n",
"print(retrival.model_dump_json(indent=4))"
@@ -702,9 +825,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Example 4) Decomposing questions \n",
+ "### Example 4) Decomposing questions\n",
"\n",
- "Lastly, a lightly more complex example of a problem that can be solved with structured output is decomposing questions. Where you ultimately want to decompose a question into a series of sub-questions that can be answered by a search backend. For example \n",
+ "Lastly, a slightly more complex example of a problem that can be solved with structured output is decomposing questions: breaking a question into a series of sub-questions that can each be answered by a search backend. For example:\n",
"\n",
"\"Whats the difference in populations of jason's home country and canada?\"\n",
"\n",
@@ -715,7 +838,7 @@
"3. The population of Canada\n",
"4. The difference between the two\n",
"\n",
- "This would not be done correctly as a single query, nor would it be done in parallel, however there are some opportunities try to be parallel since not all of the sub-questions are dependent on each other."
+ "This could not be answered correctly as a single query, and the sub-questions cannot all be run in parallel. However, there are some opportunities to parallelize, since not all of the sub-questions depend on each other.\n"
]
},
{
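One way to find which sub-questions can run in parallel is to group them into levels by dependency. The sketch below assumes a plain dataclass (`PlanStep`) standing in for a question with an `id`, a `query`, and a list of `subquestions` ids it depends on; the function name `execution_levels` and the four-step plan are hypothetical illustrations, not output from the model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanStep:
    # Hypothetical stand-in for a decomposed question; `subquestions`
    # lists the ids that must be answered before this step can run.
    id: int
    query: str
    subquestions: List[int] = field(default_factory=list)


def execution_levels(plan: List[PlanStep]) -> List[List[int]]:
    """Group step ids into levels; steps within a level are independent."""
    remaining = {step.id: set(step.subquestions) for step in plan}
    done = set()
    levels = []
    while remaining:
        # A step is ready once all of its dependencies are answered.
        ready = sorted(qid for qid, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cycle or unknown subquestion id in plan")
        levels.append(ready)
        done.update(ready)
        for qid in ready:
            del remaining[qid]
    return levels


plan = [
    PlanStep(1, "What is Jason's home country?"),
    PlanStep(2, "What is the population of Jason's home country?", [1]),
    PlanStep(3, "What is the population of Canada?"),
    PlanStep(4, "What is the difference between the two populations?", [2, 3]),
]
levels = execution_levels(plan)
print(levels)
```

Steps in the same level do not depend on each other, so they could be sent to the search backend concurrently, while later levels wait for earlier ones.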
@@ -764,20 +887,31 @@
"class Question(BaseModel):\n",
"    id: int = Field(..., description=\"A unique identifier for the question\")\n",
"    query: str = Field(..., description=\"The question decomposed as much as possible\")\n",
- "    subquestions: List[int] = Field(default_factory=list, description=\"The subquestions that this question is composed of\")\n",
+ "    subquestions: List[int] = Field(\n",
+ "        default_factory=list,\n",
+ "        description=\"The subquestions that this question is composed of\",\n",
+ "    )\n",
"\n",
"\n",
"class QueryPlan(BaseModel):\n",
"    root_question: str = Field(..., description=\"The root question that the user asked\")\n",
- "    plan: List[Question] = Field(..., description=\"The plan to answer the root question and its subquestions\")\n",
+ "    plan: List[Question] = Field(\n",
+ "        ..., description=\"The plan to answer the root question and its subquestions\"\n",
+ "    )\n",
"\n",
"\n",
"retrival = client.chat.completions.create(\n",
" model=\"gpt-4-1106-preview\",\n",
" response_model=QueryPlan,\n",
" messages=[\n",
- " {\"role\": \"system\", \"content\":\"You are a query understanding system capable of decomposing a question into subquestions.\"},\n",
- " {\"role\": \"user\", \"content\": \"What is the difference between the population of jason's home country and canada?\"}\n",
+ " {\n",
+ " \"role\": \"system\",\n",
+ " \"content\": \"You are a query understanding system capable of decomposing a question into subquestions.\",\n",
+ " },\n",
+ " {\n",
+ " \"role\": \"user\",\n",
+ " \"content\": \"What is the difference between the population of jason's home country and canada?\",\n",
+ " },\n",
" ],\n",
")\n",
"\n",
@@ -788,7 +922,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "I hope in this section I've exposed you to some ways we can be creative in modeling structured outputs to leverage LLMS in building some lightweight components for our systems."
+ "I hope this section has shown you some ways we can be creative in modeling structured outputs to leverage LLMs in building lightweight components for our systems.\n"
]
}
],