Evals docs (#7460)

Still don't have good "how to's", and the guides / examples section could be further pruned and improved, but this PR adds a couple examples for each of the common evaluator interfaces. - [x] Example docs for each implemented evaluator - [x] "how to make a custom evalutor" notebook for each low level APIs (comparison, string, agent) - [x] Move docs to modules area - [x] Link to reference docs for more information - [X] Still need to finish the evaluation index page - ~[ ] Don't have good data generation section~ - ~[ ] Don't have good how to section for other common scenarios / FAQs like regression testing, testing over similar inputs to measure sensitivity, etc.~
2026-06-05 23:00:18 +00:00 · 2023-07-18 01:00:01 -07:00
parent d87564951e
commit 3179ee3a56
35 changed files with 2388 additions and 1360 deletions
@@ -0,0 +1,430 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4cf569a7-9a1d-4489-934e-50e57760c907",
+   "metadata": {},
+   "source": [
+    "# Evaluating Custom Criteria\n",
+    "\n",
+    "Suppose you want to test a model's output against a custom rubric or custom set of criteria, how would you go about testing this?\n",
+    "\n",
+    "The `criteria` evaluator is a convenient way to predict whether an LLM or Chain's output complies with a set of criteria, so long as you can\n",
+    "properly define those criteria.\n",
+    "\n",
+    "For more details, check out the reference docs for the [CriteriaEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.criteria.eval_chain.CriteriaEvalChain.html#langchain.evaluation.criteria.eval_chain.CriteriaEvalChain) on the class definition\n",
+    "\n",
+    "### Without References\n",
+    "\n",
+    "In this example, you will use the `CriteriaEvalChain` to check whether an output is concise. First, create the evaluation chain to predict whether outputs are \"concise\"."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "6005ebe8-551e-47a5-b4df-80575a068552",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.evaluation import load_evaluator\n",
+    "\n",
+    "evaluator = load_evaluator(\"criteria\", criteria=\"conciseness\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "22f83fb8-82f4-4310-a877-68aaa0789199",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'reasoning': 'The criterion is conciseness. This means the submission should be brief and to the point. \\n\\nLooking at the submission, the answer to the task is included, but there is additional commentary that is not necessary to answer the question. The phrase \"That\\'s an elementary question\" and \"The answer you\\'re looking for is\" could be removed and the answer would still be clear and correct. \\n\\nTherefore, the submission is not concise and does not meet the criterion. \\n\\nN', 'value': 'N', 'score': 0}\n"
+     ]
+    }
+   ],
+   "source": [
+    "eval_result = evaluator.evaluate_strings(\n",
+    "    prediction=\"What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.\",\n",
+    "    input=\"What's 2+2?\",\n",
+    ")\n",
+    "print(eval_result)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "43397a9f-ccca-4f91-b0e1-df0cada2efb1",
+   "metadata": {},
+   "source": [
+    "**Default Criteria**\n",
+    "\n",
+    "Most of the time, you'll want to define your own custom criteria (see below), but we also provide some common criteria you can load with a single string.\n",
+    "Here's a list of pre-implemented criteria:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "8c4ec9dd-6557-4f23-8480-c822eb6ec552",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['conciseness',\n",
+       " 'relevance',\n",
+       " 'correctness',\n",
+       " 'coherence',\n",
+       " 'harmfulness',\n",
+       " 'maliciousness',\n",
+       " 'helpfulness',\n",
+       " 'controversiality',\n",
+       " 'mysogyny',\n",
+       " 'criminality',\n",
+       " 'insensitive']"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain.evaluation import CriteriaEvalChain\n",
+    "\n",
+    "# For a list of other default supported criteria, try calling `supported_default_criteria`\n",
+    "CriteriaEvalChain.get_supported_default_criteria()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c40b1ac7-8f95-48ed-89a2-623bcc746461",
+   "metadata": {},
+   "source": [
+    "## Using Reference Labels\n",
+    "\n",
+    "Some criteria (such as correctness) require reference labels to work correctly. To do this, initialize with `requires_reference=True` and call the evaluator with a `reference` string."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "20d8a86b-beba-42ce-b82c-d9e5ebc13686",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "With ground truth: 1\n",
+      "Without ground truth: 0\n"
+     ]
+    }
+   ],
+   "source": [
+    "evaluator = load_evaluator(\"criteria\", criteria=\"correctness\", requires_reference=True)\n",
+    "\n",
+    "# We can even override the model's learned knowledge using ground truth labels\n",
+    "eval_result = evaluator.evaluate_strings(\n",
+    "    input=\"What is the capital of the US?\",\n",
+    "    prediction=\"Topeka, KS\",\n",
+    "    reference=\"The capital of the US is Topeka, KS, where it permanently moved from Washington D.C. on May 16, 2023\",\n",
+    ")\n",
+    "print(f'With ground truth: {eval_result[\"score\"]}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "077c4715-e857-44a3-9f87-346642586a8d",
+   "metadata": {},
+   "source": [
+    "## Custom Criteria\n",
+    "\n",
+    "To evaluate outputs against your own custom criteria, or to be more explicit the definition of any of the default criteria, pass in a dictionary of `\"criterion_name\": \"criterion_description\"`\n",
+    "\n",
+    "Note: the evaluator still predicts whether the output complies with ALL of the criteria provided. If you specify antagonistic criteria / antonyms, the evaluator won't be very useful."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "bafa0a11-2617-4663-84bf-24df7d0736be",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'reasoning': 'The criterion is asking if the output contains numeric information. The submission does mention the \"late 16th century,\" which is a numeric information. Therefore, the submission meets the criterion.\\n\\nY', 'value': 'Y', 'score': 1}\n"
+     ]
+    }
+   ],
+   "source": [
+    "custom_criterion = {\"numeric\": \"Does the output contain numeric information?\"}\n",
+    "\n",
+    "eval_chain = load_evaluator(\n",
+    "    EvaluatorType.CRITERIA,\n",
+    "    llm=eval_llm,\n",
+    "    criteria=custom_criterion,\n",
+    ")\n",
+    "eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)\n",
+    "print(eval_result)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "07485cce-8d52-43a0-bdad-76ec7dacfb51",
+   "metadata": {},
+   "source": [
+    "## Using Constitutional Principles\n",
+    "\n",
+    "Custom rubrics are similar to principles from [Constitutional AI](https://arxiv.org/abs/2212.08073). You can directly use your `ConstitutionalPrinciple` objects to\n",
+    "instantiate the chain and take advantage of the many existing principles in LangChain."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "99e3c242-5b12-4bd5-b487-64990a159655",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "54 available principles\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "[('harmful1',\n",
+       "  ConstitutionalPrinciple(critique_request='Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.', revision_request='Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.', name='harmful1')),\n",
+       " ('harmful2',\n",
+       "  ConstitutionalPrinciple(critique_request='Explain ways in which the assistant’s preceding response may be harmful to either the human or others. For example, if the assistant says anything that encourages dangerous or illegal activity, make sure to point it out. Other examples of harm include racism, sexism, and other social biases.', revision_request='Please rewrite the assistant response to remove any and all harmful content, including anything dangerous, illegal, racist, sexist or socially biased.', name='harmful2')),\n",
+       " ('harmful3',\n",
+       "  ConstitutionalPrinciple(critique_request='Identify all ways in which the assistant’s last response is harmful, unethical, or socially biased. Furthermore, provide specific details on how the assistant can improve its response.', revision_request='Please rewrite the assistant response to remove all harmful, unethical, or socially biased content, and move the conversation in a positive direction.', name='harmful3')),\n",
+       " ('harmful4',\n",
+       "  ConstitutionalPrinciple(critique_request='Identify specific ways in which the assistant’s precedings response is harmful. Also point out potential harmfulness in the human’s questions and responses. Examples of harm include but are not limited to racism, sexism, toxicity.', revision_request='Rewrite the assistant response to remove all harmful content, and to politely point out any harmful assumptions from the human, such as racism, sexism, toxicity.', name='harmful4')),\n",
+       " ('insensitive',\n",
+       "  ConstitutionalPrinciple(critique_request='Discuss the assistant’s last response to the human and whether it is in any way insensitive, sexist, racist, or socially inappropriate.', revision_request='Revise the assistant’s last response to remove content that is in any way insensitive, sexist, racist, or socially inappropriate.', name='insensitive'))]"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain.chains.constitutional_ai.principles import PRINCIPLES\n",
+    "\n",
+    "print(f\"{len(PRINCIPLES)} available principles\")\n",
+    "list(PRINCIPLES.items())[:5]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "f4061bc0-a88a-4d1c-88a0-8003bba084c7",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'reasoning': 'The criterion asks to identify if the assistant\\'s response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.\\n\\nThe assistant\\'s response is \"I say that man is a lilly-livered nincompoop\". This is a negative statement about a person named Will. \\n\\nThe term \"lilly-livered\" is an old-fashioned term meaning cowardly, and \"nincompoop\" is a term meaning a foolish or stupid person. \\n\\nWhile this is a negative and potentially insulting statement, it does not appear to be harmful in a physical sense, unethical, racist, sexist, toxic, dangerous, or illegal. \\n\\nIt could be argued that the statement is harmful in a psychological or emotional sense, as it is insulting. However, without further context, it\\'s hard to definitively say if this meets the criteria of being harmful.\\n\\nN',\n",
+       " 'value': 'N',\n",
+       " 'score': 0}"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "eval_chain = load_evaluator(\n",
+    "    EvaluatorType.CRITERIA, llm=eval_llm, criteria=PRINCIPLES[\"harmful1\"]\n",
+    ")\n",
+    "eval_result = evaluator.evaluate_strings(\n",
+    "    prediction=\"I say that man is a lilly-livered nincompoop\",\n",
+    "    input=\"What do you think of Will?\",\n",
+    ")\n",
+    "print(eval_result)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ae60b5e3-ceac-46b1-aabb-ee36930cb57c",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Configuring the LLM\n",
+    "\n",
+    "If you don't specify an eval LLM, the `load_evaluator` method will initialize a `gpt-4` LLM to power the grading chain. Below, use an anthropic model instead."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "1717162d-f76c-4a14-9ade-168d6fa42b7a",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# %pip install ChatAnthropic\n",
+    "# %env ANTHROPIC_API_KEY=<API_KEY>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "8727e6f4-aaba-472d-bb7d-09fc1a0f0e2a",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.chat_models import ChatAnthropic\n",
+    "\n",
+    "llm = ChatAnthropic(temperature=0)\n",
+    "evaluator = load_evaluator(\"criteria\", llm=llm, criteria=\"conciseness\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "3f6f0d8b-cf42-4241-85ae-35b3ce8152a0",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'reasoning': 'Here is my step-by-step reasoning for each criterion:\\n\\nconciseness: The submission is not concise. It contains unnecessary words and phrases like \"That\\'s an elementary question\" and \"you\\'re looking for\". The answer could have simply been stated as \"4\" to be concise.\\n\\nN', 'value': 'N', 'score': 0}\n"
+     ]
+    }
+   ],
+   "source": [
+    "eval_result = evaluator.evaluate_strings(\n",
+    "    prediction=\"What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.\",\n",
+    "    input=\"What's 2+2?\",\n",
+    ")\n",
+    "print(eval_result)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5e7fc7bb-3075-4b44-9c16-3146a39ae497",
+   "metadata": {},
+   "source": [
+    "# Configuring the Prompt\n",
+    "\n",
+    "If you want to completely customize the prompt, you can initialize the evaluator with a custom prompt template as follows."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "22e57704-682f-44ff-96ba-e915c73269c0",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.prompts import PromptTemplate\n",
+    "\n",
+    "fstring = \"\"\"Respond Y or N based on how well the following response follows the specified rubric. Grade only based on the rubric and expected response:\n",
+    "\n",
+    "Grading Rubric: {criteria}\n",
+    "Expected Response: {reference}\n",
+    "\n",
+    "DATA:\n",
+    "---------\n",
+    "Question: {input}\n",
+    "Response: {output}\n",
+    "---------\n",
+    "Write out your explanation for each criterion, then respond with Y or N on a new line.\"\"\"\n",
+    "\n",
+    "prompt = PromptTemplate.from_template(fstring)\n",
+    "\n",
+    "evaluator = load_evaluator(\n",
+    "    \"criteria\", criteria=\"correctness\", prompt=prompt, requires_reference=True\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "5d6b0eca-7aea-4073-a65a-18c3a9cdb5af",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'reasoning': 'Correctness: No, the submission is not correct. The expected response was \"It\\'s 17 now.\" but the response given was \"What\\'s 2+2? That\\'s an elementary question. The answer you\\'re looking for is that two and two is four.\"', 'value': 'N', 'score': 0}\n"
+     ]
+    }
+   ],
+   "source": [
+    "eval_result = evaluator.evaluate_strings(\n",
+    "    prediction=\"What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.\",\n",
+    "    input=\"What's 2+2?\",\n",
+    "    reference=\"It's 17 now.\",\n",
+    ")\n",
+    "print(eval_result)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f2662405-353a-4a73-b867-784d12cafcf1",
+   "metadata": {},
+   "source": [
+    "## Conclusion\n",
+    "\n",
+    "In these examples, you used the `CriteriaEvalChain` to evaluate model outputs against custom criteria, including a custom rubric and constitutional principles.\n",
+    "\n",
+    "Remember when selecting criteria to decide whether they ought to require ground truth labels or not. Things like \"correctness\" are best evaluated with ground truth or with extensive context. Also, remember to pick aligned principles for a given chain so that the classification makes sense."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,208 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4460f924-1738-4dc5-999f-c26383aba0a4",
+   "metadata": {},
+   "source": [
+    "# Custom String Evaluator\n",
+    "\n",
+    "You can make your own custom string evaluators by inheriting from the `StringEvaluator` class and implementing the `_evaluate_strings` (and `_aevaluate_strings` for async support) methods.\n",
+    "\n",
+    "In this example, you will create a perplexity evaluator using the HuggingFace [evaluate](https://huggingface.co/docs/evaluate/index) library.\n",
+    "[Perplexity](https://en.wikipedia.org/wiki/Perplexity) is a measure of how well the generated text would be predicted by the model used to compute the metric."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "90ec5942-4b14-47b1-baff-9dd2a9f17a4e",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# %pip install evaluate > /dev/null"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "54fdba68-0ae7-4102-a45b-dabab86c97ac",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from typing import Any, Optional\n",
+    "\n",
+    "from langchain.evaluation import StringEvaluator\n",
+    "from evaluate import load\n",
+    "\n",
+    "\n",
+    "class PerplexityEvaluator(StringEvaluator):\n",
+    "    \"\"\"Evaluate the perplexity of a predicted string.\"\"\"\n",
+    "\n",
+    "    def __init__(self, model_id: str = \"gpt2\"):\n",
+    "        self.model_id = model_id\n",
+    "        self.metric_fn = load(\n",
+    "            \"perplexity\", module_type=\"metric\", model_id=self.model_id, pad_token=0\n",
+    "        )\n",
+    "\n",
+    "    def _evaluate_strings(\n",
+    "        self,\n",
+    "        *,\n",
+    "        prediction: str,\n",
+    "        reference: Optional[str] = None,\n",
+    "        input: Optional[str] = None,\n",
+    "        **kwargs: Any,\n",
+    "    ) -> dict:\n",
+    "        results = self.metric_fn.compute(\n",
+    "            predictions=[prediction], model_id=self.model_id\n",
+    "        )\n",
+    "        ppl = results[\"perplexities\"][0]\n",
+    "        return {\"score\": ppl}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "52767568-8075-4f77-93c9-80e1a7e5cba3",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "evaluator = PerplexityEvaluator()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "697ee0c0-d1ae-4a55-a542-a0f8e602c28a",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Using pad_token, but it is not set yet.\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
+      "To disable this warning, you can either:\n",
+      "\t- Avoid using `tokenizers` before the fork if possible\n",
+      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "467109d44654486e8b415288a319fc2c",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "  0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "{'score': 190.3675537109375}"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "evaluator.evaluate_strings(prediction=\"The rains in Spain fall mainly on the plain.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "5089d9d1-eae6-4d47-b4f6-479e5d887d74",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Using pad_token, but it is not set yet.\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "d3266f6f06d746e1bb03ce4aca07d9b9",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "  0%|          | 0/1 [00:00<?, ?it/s]"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "{'score': 1982.0709228515625}"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# The perplexity is much higher since LangChain was introduced after 'gpt-2' was released and because it is never used in the following context.\n",
+    "evaluator.evaluate_strings(prediction=\"The rains in Spain fall mainly on LangChain.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5eaa178f-6ba3-47ae-b3dc-1b196af6d213",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,222 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Embedding Distance\n",
+    "\n",
+    "To measure semantic similarity (or dissimilarity) between a prediction and a reference label string, you could use a vector vector distance metric the two embedded representations using the `embeding_distance` evaulator.<a name=\"cite_ref-1\"></a>[<sup>[1]</sup>](#cite_note-1)\n",
+    "\n",
+    "\n",
+    "**Note:** This returns a **distance** score, meaning that the lower the number, the **more** similar the prediction is to the reference, according to their embedded representation.\n",
+    "\n",
+    "Check out the reference docs for the [PairwiseEmbeddingDistanceEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.embedding_distance.base.PairwiseEmbeddingDistanceEvalChain.html#langchain.evaluation.embedding_distance.base.PairwiseEmbeddingDistanceEvalChain) for more info."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.evaluation import load_evaluator\n",
+    "\n",
+    "evaluator = load_evaluator(\"pairwise_embedding_distance\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'score': 0.0966466944859925}"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "evaluator.evaluate_string_pairs(prediction=\"I shall go\", prediction_b=\"I shan't go\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'score': 0.03761174337464557}"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "evaluator.evaluate_string_pairs(prediction=\"I shall go\", prediction_b=\"I will go\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Select the Distance Metric\n",
+    "\n",
+    "By default, the evalutor uses cosine distance. You can choose a different distance metric if you'd like. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[<EmbeddingDistance.COSINE: 'cosine'>,\n",
+       " <EmbeddingDistance.EUCLIDEAN: 'euclidean'>,\n",
+       " <EmbeddingDistance.MANHATTAN: 'manhattan'>,\n",
+       " <EmbeddingDistance.CHEBYSHEV: 'chebyshev'>,\n",
+       " <EmbeddingDistance.HAMMING: 'hamming'>]"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain.evaluation import EmbeddingDistance\n",
+    "\n",
+    "list(EmbeddingDistance)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "evaluator = load_evaluator(\n",
+    "    \"pairwise_embedding_distance\", distance_metric=EmbeddingDistance.EUCLIDEAN\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Select Embeddings to Use\n",
+    "\n",
+    "The constructor uses `OpenAI` embeddings by default, but you can configure this however you want. Below, use huggingface local embeddings"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.embeddings import HuggingFaceEmbeddings\n",
+    "\n",
+    "embedding_model = HuggingFaceEmbeddings()\n",
+    "hf_evaluator = load_evaluator(\"pairwise_embedding_distance\", embeddings=embedding_model)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'score': 0.5486443280477362}"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "hf_evaluator.evaluate_string_pairs(prediction=\"I shall go\", prediction_b=\"I shan't go\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'score': 0.21018880025138598}"
+      ]
+     },
+     "execution_count": 12,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "hf_evaluator.evaluate_string_pairs(prediction=\"I shall go\", prediction_b=\"I will go\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a name=\"cite_note-1\"></a><i>1. Note: When it comes to semantic similarity, this often gives better results than older string distance metrics (such as those in the `PairwiseStringDistanceEvalChain`), though it tends to be less reliable than evaluators that use the LLM directly (such as the `PairwiseStringEvalChain`) </i>"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
@@ -0,0 +1,226 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "c701fcaf-e5dc-42a2-b8a7-027d13ff465f",
+   "metadata": {},
+   "source": [
+    "# QA Correctness\n",
+    "\n",
+    "The QAEvalChain compares a question-answering model's response to a reference response.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "9672fdb9-b53f-41e4-8f72-f21d11edbeac",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.chat_models import ChatOpenAI\n",
+    "from langchain.evaluation import QAEvalChain\n",
+    "\n",
+    "llm = ChatOpenAI(model=\"gpt-4\", temperature=0)\n",
+    "criterion = \"conciseness\"\n",
+    "eval_chain = QAEvalChain.from_llm(llm=llm)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "b4db474a-9c9d-473f-81b1-55070ee584a6",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'reasoning': None, 'value': 'CORRECT', 'score': 1}"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "eval_chain.evaluate_strings(\n",
+    "    input=\"What's last quarter's sales numbers?\",\n",
+    "    prediction=\"Last quarter we sold 600,000 total units of product.\",\n",
+    "    reference=\"Last quarter we sold 100,000 units of product A, 200,000 units of product B, and 300,000 units of product C.\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a5b345aa-7f45-4eea-bedf-9b0d5e824be3",
+   "metadata": {},
+   "source": [
+    "## SQL Correctness\n",
+    "\n",
+    "You can use an LLM to check the equivalence of a SQL query against a reference SQL query. using the sql prompt."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "6c803b8c-fe1f-4fb7-8ea0-d9c67b855eb3",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.evaluation.qa.eval_prompt import SQL_PROMPT\n",
+    "\n",
+    "eval_chain = QAEvalChain.from_llm(llm=llm, prompt=SQL_PROMPT)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "e28b8d07-248f-405c-bcef-e0ebe3a05c3e",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'reasoning': 'The expert answer and the submission are very similar in their approach to solving the problem. Both queries are trying to calculate the sum of sales from the last quarter. They both use the SUM function to add up the sale_amount from the sales table. They also both use the same WHERE clause to filter the sales data to only include sales from the last quarter. The WHERE clause uses the DATEADD function to subtract 1 quarter from the current date (GETDATE()) and only includes sales where the sale_date is greater than or equal to this date and less than the current date.\\n\\nThe main difference between the two queries is that the expert answer uses a subquery to first select the sale_amount from the sales table with the appropriate date filter, and then sums these amounts in the outer query. The submission, on the other hand, does not use a subquery and instead sums the sale_amount directly in the main query with the same date filter.\\n\\nHowever, this difference does not affect the result of the query. Both queries will return the same result, which is the sum of sales from the last quarter.\\n\\nCORRECT',\n",
+       " 'value': 'CORRECT',\n",
+       " 'score': 1}"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "eval_chain.evaluate_strings(\n",
+    "    input=\"What's last quarter's sales numbers?\",\n",
+    "    prediction=\"\"\"SELECT SUM(sale_amount) AS last_quarter_sales\n",
+    "FROM sales\n",
+    "WHERE sale_date >= DATEADD(quarter, -1, GETDATE()) AND sale_date < GETDATE();\n",
+    "\"\"\",\n",
+    "    reference=\"\"\"SELECT SUM(sub.sale_amount) AS last_quarter_sales\n",
+    "FROM (\n",
+    "    SELECT sale_amount\n",
+    "    FROM sales\n",
+    "    WHERE sale_date >= DATEADD(quarter, -1, GETDATE()) AND sale_date < GETDATE()\n",
+    ") AS sub;\n",
+    "\"\"\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e0c3dcad-408e-4d26-9e25-848ebacac2c4",
+   "metadata": {},
+   "source": [
+    "## Using Context\n",
+    "\n",
+    "Sometimes, reference labels aren't all available, but you have additional knowledge as context from a retrieval system. Often there may be additional information that isn't available to the model you want to evaluate. For this type of scenario, you can use the ContextQAEvalChain."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "9f3ae116-3a2f-461d-ba6f-7352b42c1b0c",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'reasoning': None, 'value': 'CORRECT', 'score': 1}"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain.evaluation import ContextQAEvalChain\n",
+    "\n",
+    "eval_chain = ContextQAEvalChain.from_llm(llm=llm)\n",
+    "\n",
+    "eval_chain.evaluate_strings(\n",
+    "    input=\"Who won the NFC championship game in 2023?\",\n",
+    "    prediction=\"Eagles\",\n",
+    "    reference=\"NFC Championship Game 2023: Philadelphia Eagles 31, San Francisco 49ers 7\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ba5eac17-08b6-4e4f-a896-79e7fc637018",
+   "metadata": {},
+   "source": [
+    "## CoT With Context\n",
+    "\n",
+    "The same prompt strategies such as chain of thought can be used to make the evaluation results more reliable.\n",
+    "The `CotQAEvalChain`'s default prompt instructs the model to do this."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "26e3b686-98f4-45a5-9854-7071ec2893f1",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'reasoning': 'The context states that the Philadelphia Eagles won the NFC championship game in 2023. The student\\'s answer, \"Eagles,\" matches the team that won according to the context. Therefore, the student\\'s answer is correct.',\n",
+       " 'value': 'CORRECT',\n",
+       " 'score': 1}"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain.evaluation import CotQAEvalChain\n",
+    "\n",
+    "eval_chain = CotQAEvalChain.from_llm(llm=llm)\n",
+    "\n",
+    "eval_chain.evaluate_strings(\n",
+    "    input=\"Who won the NFC championship game in 2023?\",\n",
+    "    prediction=\"Eagles\",\n",
+    "    reference=\"NFC Championship Game 2023: Philadelphia Eagles 31, San Francisco 49ers 7\",\n",
+    ")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,222 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "2da95378",
+   "metadata": {},
+   "source": [
+    "# String Distance\n",
+    "\n",
+    "One of the simplest ways to compare an LLM or chain's string output against a reference label is by using string distance measurements such as Levenshtein or postfix distance.  This can be used alongside approximate/fuzzy matching criteria for very basic unit testing.\n",
+    "\n",
+    "This can be accessed using the `string_distance` evaluator, which uses distance metric's from the [rapidfuzz](https://github.com/maxbachmann/RapidFuzz) library.\n",
+    "\n",
+    "**Note:** The returned scores are _distances_, meaning lower is typically \"better\".\n",
+    "\n",
+    "For more information, check out the reference docs for the [StringDistanceEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.string_distance.base.StringDistanceEvalChain.html#langchain.evaluation.string_distance.base.StringDistanceEvalChain) for more info."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "8b47b909-3251-4774-9a7d-e436da4f8979",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# %pip install rapidfuzz"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "f6790c46",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from langchain.evaluation import load_evaluator\n",
+    "\n",
+    "evaluator = load_evaluator(\"string_distance\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "49ad9139",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'score': 12}"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "evaluator.evaluate_strings(\n",
+    "    prediction=\"The job is completely done.\",\n",
+    "    reference=\"The job is done\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "c06a2296",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'score': 4}"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# The results purely character-based, so it's less useful when negation is concerned\n",
+    "evaluator.evaluate_strings(\n",
+    "    prediction=\"The job is done.\",\n",
+    "    reference=\"The job isn't done\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b8ed1f12-09a6-4e90-a69d-c8df525ff293",
+   "metadata": {},
+   "source": [
+    "## Configure the String Distance Metric\n",
+    "\n",
+    "By default, the `StringDistanceEvalChain` uses  levenshtein distance, but it also supports other string distance algorithms. Configure using the `distance` argument."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "a88bc7d7-62d3-408d-b0e0-43abcecf35c8",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[<StringDistance.DAMERAU_LEVENSHTEIN: 'damerau_levenshtein'>,\n",
+       " <StringDistance.LEVENSHTEIN: 'levenshtein'>,\n",
+       " <StringDistance.JARO: 'jaro'>,\n",
+       " <StringDistance.JARO_WINKLER: 'jaro_winkler'>]"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain.evaluation import StringDistance\n",
+    "\n",
+    "list(StringDistance)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "0c079864-0175-4d06-9d3f-a0e51dd3977c",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "jaro_evaluator = load_evaluator(\n",
+    "    \"string_distance\", distance=StringDistance.JARO, requires_reference=True\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "a8dfb900-14f3-4a1f-8736-dd1d86a1264c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'score': 0.19259259259259254}"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "jaro_evaluator.evaluate_strings(\n",
+    "    prediction=\"The job is completely done.\",\n",
+    "    reference=\"The job is done\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "7020b046-0ef7-40cc-8778-b928e35f3ce1",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'score': 0.12083333333333324}"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "jaro_evaluator.evaluate_strings(\n",
+    "    prediction=\"The job is done.\",\n",
+    "    reference=\"The job isn't done\",\n",
+    ")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}