langchain/docs/extras/modules/chains/additional/extraction.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6605e7f7",
   "metadata": {},
   "source": [
    "# Extraction\n",
    "\n",
    "The extraction chain uses the OpenAI `functions` parameter to specify a schema to extract entities from a document. This helps us make sure that the model outputs exactly the schema of entities and properties that we want, with their appropriate types.\n",
    "\n",
    "The extraction chain is to be used when we want to extract several entities with their properties from the same passage (i.e. what people were mentioned in this passage?)"
   ]
  },
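  {
   "cell_type": "markdown",
   "id": "b7c1e0a4",
   "metadata": {},
   "source": [
    "To make this concrete, here is a rough sketch of how a schema could be wrapped into an OpenAI-style function definition. The helper `build_function_definition`, the function name `extract_entities`, its description, and the `info` wrapper are assumptions made for illustration; the payload that the extraction chain actually builds may differ.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "# Hypothetical helper: wraps an entity schema in an OpenAI-style function definition.\n",
    "def build_function_definition(entity_schema):\n",
    "    return {\n",
    "        \"name\": \"extract_entities\",  # assumed name, for illustration only\n",
    "        \"description\": \"Extract the entities mentioned in the passage.\",\n",
    "        \"parameters\": {\n",
    "            \"type\": \"object\",\n",
    "            \"properties\": {\n",
    "                # An array lets the model return several entities from one passage.\n",
    "                \"info\": {\"type\": \"array\", \"items\": {\"type\": \"object\", **entity_schema}},\n",
    "            },\n",
    "            \"required\": [\"info\"],\n",
    "        },\n",
    "    }\n",
    "\n",
    "example_schema = {\n",
    "    \"properties\": {\"person_name\": {\"type\": \"string\"}},\n",
    "    \"required\": [\"person_name\"],\n",
    "}\n",
    "function_definition = build_function_definition(example_schema)\n",
    "print(json.dumps(function_definition, indent=2))\n",
    "```\n",
    "\n",
    "Wrapping the schema in an array is what would allow several entities to come back from a single passage."
   ]
  },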
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "34f04daf",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/harrisonchase/.pyenv/versions/3.9.1/envs/langchain/lib/python3.9/site-packages/deeplake/util/check_latest_version.py:32: UserWarning: A newer version of deeplake (3.6.4) is available. It's recommended that you update to the latest version using `pip install -U deeplake`.\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.chains import create_extraction_chain, create_extraction_chain_pydantic\n",
    "from langchain.prompts import ChatPromptTemplate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a2648974",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo-0613\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ef034ce",
   "metadata": {},
   "source": [
    "## Extracting entities"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78ff9df9",
   "metadata": {},
   "source": [
    "To extract entities, we need to create a schema like the following, were we specify all the properties we want to find and the type we expect them to have. We can also specify which of these properties are required and which are optional."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4ac43eba",
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = {\n",
    "    \"properties\": {\n",
    "        \"person_name\": {\"type\": \"string\"},\n",
    "        \"person_height\": {\"type\": \"integer\"},\n",
    "        \"person_hair_color\": {\"type\": \"string\"},\n",
    "        \"dog_name\": {\"type\": \"string\"},\n",
    "        \"dog_breed\": {\"type\": \"string\"},\n",
    "    },\n",
    "    \"required\": [\"person_name\", \"person_height\"],\n",
    "}"
   ]
  },
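  {
   "cell_type": "markdown",
   "id": "c3d9f2b1",
   "metadata": {},
   "source": [
    "The `required` list can also be used after extraction to check the records that come back. The helper below is a minimal sketch in plain Python; `missing_required` and the sample records are hypothetical and are not part of LangChain.\n",
    "\n",
    "```python\n",
    "# Hypothetical helper, not part of LangChain: lists required properties absent from a record.\n",
    "def missing_required(record, schema):\n",
    "    return [name for name in schema.get(\"required\", []) if name not in record]\n",
    "\n",
    "example_schema = {\n",
    "    \"properties\": {\n",
    "        \"person_name\": {\"type\": \"string\"},\n",
    "        \"person_height\": {\"type\": \"integer\"},\n",
    "        \"dog_name\": {\"type\": \"string\"},\n",
    "    },\n",
    "    \"required\": [\"person_name\", \"person_height\"],\n",
    "}\n",
    "\n",
    "complete = {\"person_name\": \"Alex\", \"person_height\": 5}\n",
    "incomplete = {\"person_name\": \"Claudia\"}\n",
    "\n",
    "print(missing_required(complete, example_schema))    # []\n",
    "print(missing_required(incomplete, example_schema))  # ['person_height']\n",
    "```\n",
    "\n",
    "Properties that are not listed under `required`, such as `dog_name` here, may simply be absent from a record."
   ]
  },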
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "640bd005",
   "metadata": {},
   "outputs": [],
   "source": [
    "inp = \"\"\"\n",
    "Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\n",
    "Alex's dog Frosty is a labrador and likes to play hide and seek.\n",
    "        \"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "64313214",
   "metadata": {},
   "outputs": [],
   "source": [
    "chain = create_extraction_chain(schema, llm)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17c48adb",
   "metadata": {},
   "source": [
    "As we can see, we extracted the required entities and their properties in the required format:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "cc5436ed",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'person_name': 'Alex',\n",
       "  'person_height': 5,\n",
       "  'person_hair_color': 'blonde',\n",
       "  'dog_name': 'Frosty',\n",
       "  'dog_breed': 'labrador'},\n",
       " {'person_name': 'Claudia',\n",
       "  'person_height': 6,\n",
       "  'person_hair_color': 'brunette'}]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain.run(inp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "698b4c4d",
   "metadata": {},
   "source": [
    "## Pydantic example"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6504a6d9",
   "metadata": {},
   "source": [
    "We can also use a Pydantic schema to choose the required properties and types and we will set as 'Optional' those that are not strictly required.\n",
    "\n",
    "By using the `create_extraction_chain_pydantic` function, we can send a Pydantic schema as input and the output will be an instantiated object that respects our desired schema. \n",
    "\n",
    "In this way, we can specify our schema in the same manner that we would a new class or function in Python - with purely Pythonic types."
   ]
  },
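  {
   "cell_type": "markdown",
   "id": "d4e8a6f0",
   "metadata": {},
   "source": [
    "As a minimal sketch of what an instantiated object means here, the snippet below builds Pydantic objects directly from plain dictionaries, without calling the model. `PersonRecord` and the sample records are made up for illustration; the optional field is given an explicit `None` default so that the example behaves the same across Pydantic versions.\n",
    "\n",
    "```python\n",
    "from typing import Optional\n",
    "from pydantic import BaseModel\n",
    "\n",
    "# Hypothetical model for illustration; the notebook's own schema is defined separately.\n",
    "class PersonRecord(BaseModel):\n",
    "    person_name: str\n",
    "    person_height: int\n",
    "    dog_name: Optional[str] = None  # optional: defaults to None when absent\n",
    "\n",
    "records = [\n",
    "    {\"person_name\": \"Alex\", \"person_height\": 5, \"dog_name\": \"Frosty\"},\n",
    "    {\"person_name\": \"Claudia\", \"person_height\": 6},\n",
    "]\n",
    "people = [PersonRecord(**record) for record in records]\n",
    "\n",
    "for person in people:\n",
    "    print(person.person_name, person.person_height, person.dog_name)\n",
    "```\n",
    "\n",
    "Each object exposes its properties as typed attributes, and a missing optional property shows up as `None` instead of a missing key."
   ]
  },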
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "6792866b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Optional, List\n",
    "from pydantic import BaseModel, Field"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "36a63761",
   "metadata": {},
   "outputs": [],
   "source": [
    "class Properties(BaseModel):\n",
    "    person_name: str\n",
    "    person_height: int\n",
    "    person_hair_color: str\n",
    "    dog_breed: Optional[str]\n",
    "    dog_name: Optional[str]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "8ffd1e57",
   "metadata": {},
   "outputs": [],
   "source": [
    "chain = create_extraction_chain_pydantic(pydantic_schema=Properties, llm=llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "24baa954",
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "inp = \"\"\"\n",
    "Alex is 5 feet tall. Claudia is 1 feet taller Alex and jumps higher than him. Claudia is a brunette and Alex is blonde.\n",
    "Alex's dog Frosty is a labrador and likes to play hide and seek.\n",
    "        \"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84e0a241",
   "metadata": {},
   "source": [
    "As we can see, we extracted the required entities and their properties in the required format:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "f771df58",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Properties(person_name='Alex', person_height=5, person_hair_color='blonde', dog_breed='labrador', dog_name='Frosty'),\n",
       " Properties(person_name='Claudia', person_height=6, person_hair_color='brunette', dog_breed=None, dog_name=None)]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chain.run(inp)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}