ruff

2026-06-05 22:50:18 +00:00 · 2023-10-24 15:08:44 -04:00
parent 3ecc25495f
commit 840cd63953
26 changed files with 424 additions and 54 deletions
@@ -0,0 +1,382 @@
+---
+draft: False
+date: 2023-10-23
+tags:
+  - pydantic
+  - validation
+  - guardrails
+  - constitutional ai
+---
+
+# Good LLM Validation is Just Good Validation
+
+In the world of AI, validation plays a crucial role in ensuring the quality and reliability of generated outputs. Traditional approaches involve manual rule checking, but advancements in AI, such as Constitutional AI, offer a self-correcting system where AI models perform the validation. Pydantic and Instructor are powerful tools that enable validation without introducing new standards or terms. This post explores how to achieve effective validation using Pydantic and Instructor.
+
+## Software 1.0 Validation
+
+Pydantic provides various validation methods based on well-established patterns. Field validation in Pydantic can be done using the `field_validator` decorator or [PEP 593](https://www.python.org/dev/peps/pep-0593/) variable annotations. The official Pydantic documentation provides detailed information on these validation methods, including field validators and class validators.
+
+### Example: Validating that a name contains a space
+
+To illustrate field validation, let's validate that a name contains a space. Pydantic offers two approaches: the `field_validator` decorator, or an `Annotated` type with a validator attached.
+
+#### Using `field_validator` decorator
+
+Here's an example of using the `field_validator` decorator to define a validator for the `name` field:
+
+```python
+from pydantic import BaseModel, ValidationError, field_validator
+
+class UserModel(BaseModel):
+    id: int
+    name: str
+
+    @field_validator('name')
+    def name_must_contain_space(cls, v: str) -> str:
+        if ' ' not in v:
+            raise ValueError('must contain a space')
+        return v.title()
+
+try:
+    UserModel(id=1, name='jason')
+except ValidationError as e:
+    print(e)
+```
+
+The validator raises a `ValueError` if the name does not contain a space; otherwise it returns the name in title case. Here validation fails for the name `'jason'`, and the following error is printed:
+
+```
+1 validation error for UserModel
+name
+  Value error, must contain a space [type=value_error, input_value='jason', input_type=str]
+```
+
+#### Using `Annotated`
+
+Alternatively, you can attach the validator to the type itself with `Annotated` and `AfterValidator`. Here's an example:
+
+```python
+from pydantic import BaseModel, Field, ValidationError
+from pydantic.functional_validators import AfterValidator
+from typing import Annotated
+
+def name_must_contain_space(v: str) -> str:
+    if ' ' not in v:
+        raise ValueError('must contain a space')
+    return v
+
+class UserModel(BaseModel):
+    id: int = Field(..., gt=0, lt=100)
+    # Wrap the function in AfterValidator so Pydantic runs it after type validation
+    name: Annotated[str, AfterValidator(name_must_contain_space)]
+
+try:
+    UserModel(id=1, name='jason')
+except ValidationError as e:
+    print(e)
+```
+
+This code snippet achieves the same validation result. If the provided name does not contain a space, a `ValueError` is raised, and the corresponding error message is displayed:
+
+```
+1 validation error for UserModel
+name
+  Value error, must contain a space [type=value_error, input_value='jason', input_type=str]
+```
+
+Validation is a fundamental concept in software development, and it remains the same when applied to AI systems. Instead of introducing new terms and standards, existing programming concepts can be leveraged. For example, types can have additional constraints, ensuring they are not "an apology" or "a threat." The underlying principles of validation remain unchanged.
+
+In essence, validation involves checking if a value satisfies a condition. If it does, the value is returned. If it doesn't, an error is raised. This concept is similar to the examples mentioned above, with the addition of a possible mutation step:
+
+```python
+def validation_function(value):
+    if not condition(value):
+        raise ValueError("Value is not valid")
+    return mutation(value)
+```
+
+With Pydantic, we can define new types powered by probabilistic models and use them as validators.
+
+## Software 3.0: Validation for LLMs or powered by LLMs
+
+Building on simple field validators, let's look at probabilistic validation in Software 3.0. Here we introduce an LLM-powered validator called `llm_validator` that takes a statement and uses it to verify the value. The model evaluates the value against the statement. If the value is valid, the validator returns it; otherwise, it raises an error carrying a message written by the model.
+
+### Example: Don't Say Objectionable Things
+
+Suppose we want to validate that a user's beliefs do not contain objectionable content. We can use the `llm_validator` to achieve this. Here's an example:
+
+```python
+from instructor import llm_validator
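+from pydantic import AfterValidator, BaseModel, ValidationError
+from typing import Annotated
+
+class UserModel(BaseModel):
+    id: int
+    name: str
+    # llm_validator returns a plain function, so wrap it in AfterValidator
+    beliefs: Annotated[str, AfterValidator(llm_validator("don't say objectionable things"))]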
+```
+
+Now, if we create a `UserModel` instance with a belief that contains objectionable content, we will receive an error.
+
+```python
+try:
+    UserModel(id=1, name="Jason Liu", beliefs="We should steal from the poor")
+except ValidationError as e:
+    print(e)
+```
+
+The error message is generated by the language model (LLM) rather than the code itself, making it helpful for re-asking the model. Multiple validators can be stacked on top of each other.
+
+To better understand this approach, let's see how to build an `llm_validator` from scratch.
+
+### Creating Your Own Field Level `llm_validator`
+
+Building your own `llm_validator` can be a valuable exercise to get started with `instructor` and create custom validators.
+
+Before we continue, let's review the anatomy of a validator:
+
+```python
+def validation_function(value):
+    if not condition(value):
+        raise ValueError("Value is not valid")
+    return value
+```
+
+As we can see, a validator is simply a function that takes in a value and returns a value. If the value is not valid, it raises a `ValueError`. We can represent this using the following structure:
+
+```python
+from typing import Optional
+from pydantic import BaseModel, Field
+
+class Validation(BaseModel):
+    is_valid: bool = Field(..., description="Whether the value is valid based on the rules")
+    error_message: Optional[str] = Field(..., description="The error message if the value is not valid, to be used for re-asking the model")
+```
+
+Using this structure, we can implement the same logic as before and utilize `instructor` to generate the validation.
+
+```python
+import instructor
+import openai
+
+# Enables `response_model` and `max_retries` parameters
+instructor.patch()
+
+def validator(v):
+    statement = "don't say objectionable things"
+    resp = openai.ChatCompletion.create(
+        model="gpt-3.5-turbo",
+        messages=[
+            {
+                "role": "system",
+                "content": "You are a validator. Determine if the value is valid for the statement. If it is not, explain why.",
+            },
+            {
+                "role": "user",
+                "content": f"Does `{v}` follow the rules: {statement}",
+            },
+        ],
+        # this comes from instructor.patch()
+        response_model=Validation,
+    )  
+    if not resp.is_valid:
+        raise ValueError(resp.error_message)
+    return v
+```
+
+Now we can use this validator in the same way we used the `llm_validator` from `instructor`.
+
+```python
+from pydantic import BaseModel, ValidationError, field_validator, AfterValidator
+from typing import Annotated
+
+class UserModel(BaseModel):
+    id: int
+    name: str
+    beliefs: Annotated[str, AfterValidator(validator)]
+```
+
+## Writing validations that depend on multiple fields
+
+To validate multiple attributes simultaneously, you can extend the validation function and use a model validator instead of a field validator. Here's an example implementation in Python that checks if the `answer` follows the `chain_of_thought`:
+
+```python
+import instructor
+import openai
+
+# Enables `response_model` and `max_retries` parameters
+instructor.patch()
+
+# We assume a validator on a model takes in the dict
+# that comes in before other validation
+def validate_chain_of_thought(values):
+    chain_of_thought = values["chain_of_thought"]
+    answer = values["answer"]
+    resp = openai.ChatCompletion.create(
+        model="gpt-3.5-turbo",
+        messages=[
+            {
+                "role": "system",
+                "content": "You are a validator. Determine if the value is valid for the statement. If it is not, explain why.",
+            },
+            {
+                "role": "user",
+                "content": f"Verify that `{answer}` follows the chain of thought: {chain_of_thought}",
+            },
+        ],
+        # this comes from instructor.patch()
+        response_model=Validation,
+    )  
+    if not resp.is_valid:
+        raise ValueError(resp.error_message)
+    return values
+```
+
+To define a model validator, use the `@model_validator` decorator:
+
+```python
+from typing import Any
+from pydantic import BaseModel, ValidationError, model_validator
+
+class Response(BaseModel):
+    chain_of_thought: str
+    answer: str
+
+    @model_validator(mode='before')
+    @classmethod
+    def chain_of_thought_makes_sense(cls, data: Any) -> Any:
+        # here we assume data is the dict representation of the model
+        # since we use 'before' mode.
+        return validate_chain_of_thought(data)
+```
+
+Now, when you create a `Response` instance, the `chain_of_thought_makes_sense` validator will be invoked. Here's an example:
+
+```python
+try:
+    resp = Response(
+        chain_of_thought="1 + 1 = 2", answer="The meaning of life is 42"
+    )
+except ValidationError as e:
+    print(e)
+```
+
+If we create a `Response` instance with an answer that does not follow the chain of thought, we will get an error.
+
+```
+1 validation error for Response
+    Value error, The statement 'The meaning of life is 42' does not follow the chain of thought: 1 + 1 = 2. 
+    [type=value_error, input_value={'chain_of_thought': '1 +... meaning of life is 42'}, input_type=dict]
+```
+
+## Example: Citations, allowing Context to Influence Validation
+
+Contextual information can be passed to validation methods by using a context object, which can be accessed from the `info` argument in decorated validator functions. This technique allows the model to validate text in the context of other text chunks. Here's an example:
+
+```python
+from pydantic import BaseModel, ValidationError, ValidationInfo, field_validator
+
+class AnswerWithCitation(BaseModel):
+    answer: str
+    citation: str
+
+    @field_validator('citation')
+    @classmethod
+    def citation_exists(cls, v: str, info: ValidationInfo):
+        context = info.context
+        if context:
+            context = context.get('text_chunk')
+            if v not in context:
+                raise ValueError(f"Citation `{v}` not found in text chunks")
+        return v
+```
+
+Validating an answer whose citation does not appear in the supplied text chunk raises an error:
+
+```python
+try:
+    AnswerWithCitation.model_validate(
+        {"answer": "Jason is a cool guy", "citation": "Jason is cool"},
+        context={"text_chunk": "Jason is just a guy"},
+    )
+except ValidationError as e:
+    print(e)
+```
+
+```
+1 validation error for AnswerWithCitation
+citation
+  Value error, Citation `Jason is cool` not found in text chunks [type=value_error, input_value='Jason is cool', input_type=str]
+    For further information visit https://errors.pydantic.dev/2.4/v/value_error
+```
+
+To supply this context through the `openai.ChatCompletion.create` call, pass a `validation_context` argument. `instructor.patch()` forwards it to Pydantic, where it is available as `info.context` in the decorated validator functions.
+
+```python
+def answer_question(question:str, text_chunk: str) -> AnswerWithCitation:
+    return openai.ChatCompletion.create(
+        model="gpt-3.5-turbo",
+        messages=[
+            {
+                "role": "user",
+                "content": f"Answer the question: {question} with the text chunk: {text_chunk}",
+            },
+        ],
+        response_model=AnswerWithCitation,
+        max_retries=2,
+        validation_context={"text_chunk": text_chunk},
+    )
+```
+
+## Self Corrections Using Validation Errors
+
+When programming with LLMs, error messages are useful, but with intelligent systems the ability to correct the output matters just as much. Validators can enforce properties of the outputs, and their error messages can be used to re-ask the model. After calling `instructor.patch()`, the `openai` client accepts a `max_retries` parameter that sets how many times the model is asked to correct its output.
+
+This approach provides a layer of defense against two types of bad outputs:
+
+1. Pydantic Validation Errors (code or LLM-based)
+2. JSON Decoding Errors (when the model returns malformed JSON)
+
+### Define the Response Model with Validators
+
+In the following code snippet, the field validator ensures that the `name` field is in uppercase. If the name is not in uppercase, a `ValueError` is raised. Instead of using [PEP 593](https://www.python.org/dev/peps/pep-0593/) variable annotations, we can use the `field_validator` decorator to define a validator for a field. This approach allows the validator to be colocated with the object it's validating.
+
+```python
+from pydantic import BaseModel, field_validator
+
+class UserModel(BaseModel):
+    name: str
+    age: int
+
+    @field_validator("name")
+    @classmethod
+    def validate_name(cls, v):
+        if v.upper() != v:
+            raise ValueError("Name must be in uppercase.")
+        return v
+```
+
+### Using the Client with Retries
+
+In the code snippet below, the `UserModel` is specified as the `response_model`, and `max_retries` is set to 2.
+
+```python
+import openai
+import instructor
+
+# Enables `response_model` and `max_retries` parameters
+instructor.patch()
+
+model = openai.ChatCompletion.create(
+    model="gpt-3.5-turbo",
+    messages=[
+        {"role": "user", "content": "Extract jason is 25 years old"},
+    ],
+    # Powered by instructor.patch()
+    response_model=UserModel,
+    max_retries=2,
+)
+
+assert model.name == "JASON"
+```
+
+In this example, no code explicitly transforms the name to uppercase. The validation error is passed back to the model on retry, and the model corrects the output itself.
+
+## Conclusion
+
+In this post, we have explored how validation in AI systems can be simplified by leveraging existing programming concepts. We have demonstrated the use of Pydantic and Instructor to achieve effective validation without introducing new standards or terminology. By utilizing LLM-powered validators and error information, we can prompt adaptive responses and rectify outputs. We encourage you to experiment with validation in your own projects using these powerful tools.
+
+Validation and error handling are crucial for the quality and reliability of AI systems. With the concepts from this post, validators can both reject bad outputs and give the model the error message it needs to fix them.
@@ -3,8 +3,7 @@ import openai
 from pydantic import BaseModel, Field

 from pprint import pprint
-from pydantic import BaseModel, Field
-from typing import List, Dict
+from typing import List


 class Summary(BaseModel):
@@ -90,14 +90,14 @@ def ask_ai(question: str, context: str) -> QuestionAnswer:
        messages=[
            {
                "role": "system",
-                "content": f"You are a world class algorithm to answer questions with correct and exact citations. ",
+                "content": "You are a world class algorithm to answer questions with correct and exact citations. ",
            },
-            {"role": "user", "content": f"Answer question using the following context"},
+            {"role": "user", "content": "Answer question using the following context"},
            {"role": "user", "content": f"{context}"},
            {"role": "user", "content": f"Question: {question}"},
            {
                "role": "user",
-                "content": f"Tips: Make sure to cite your sources, and use the exact words from the context.",
+                "content": "Tips: Make sure to cite your sources, and use the exact words from the context.",
            },
        ],
    )
@@ -2,7 +2,7 @@ import json
 from typing import Iterable, List
 from fastapi import FastAPI, Request, HTTPException
 from fastapi.params import Depends
-from instructor import MultiTask, OpenAISchema
+from instructor import OpenAISchema
 from pydantic import BaseModel, Field
 from starlette.responses import StreamingResponse

@@ -88,14 +88,14 @@ def stream_extract(question: Question) -> Iterable[Fact]:
        messages=[
            {
                "role": "system",
-                "content": f"You are a world class algorithm to answer questions with correct and exact citations. ",
+                "content": "You are a world class algorithm to answer questions with correct and exact citations. ",
            },
-            {"role": "user", "content": f"Answer question using the following context"},
+            {"role": "user", "content": "Answer question using the following context"},
            {"role": "user", "content": f"{question.context}"},
            {"role": "user", "content": f"Question: {question.query}"},
            {
                "role": "user",
-                "content": f"Tips: Make sure to cite your sources, and use the exact words from the context.",
+                "content": "Tips: Make sure to cite your sources, and use the exact words from the context.",
            },
        ],
        max_tokens=2000,
@@ -64,7 +64,7 @@ def load_json_schema(json_schema_path: str) -> dict:

 def generate_pydantic_model(json_schema_path: str):
    input_path = Path(json_schema_path)
-    output_path = Path(f"./models.py")
+    output_path = Path("./models.py")
    generate(
        input_=input_path, input_file_type=InputFileType.JsonSchema, output=output_path
    )
@@ -11,9 +11,9 @@ from pydantic import BaseModel


 class Type(Enum):
-    home = 'home'
-    work = 'work'
-    mobile = 'mobile'
+    home = "home"
+    work = "work"
+    mobile = "mobile"


 class PhoneNumber(BaseModel):
@@ -10,14 +10,14 @@ from jinja2 import Template
 from models import ExtractPerson

 import openai
-import instructor 
+import instructor

 instructor.patch()

 app = FastAPI()

-class TemplateVariables(BaseModel):

+class TemplateVariables(BaseModel):
    biography: str


@@ -26,7 +26,11 @@ class RequestSchema(BaseModel):
    model: str
    temperature: int

-PROMPT_TEMPLATE = Template("""Extract the person from the following: {{biography}}""".strip())
+
+PROMPT_TEMPLATE = Template(
+    """Extract the person from the following: {{biography}}""".strip()
+)
+

@app.post("/api/v1/extract_person", response_model=ExtractPerson)
 async def extract_person(input: RequestSchema) -> ExtractPerson:
@@ -35,7 +39,5 @@ async def extract_person(input: RequestSchema) -> ExtractPerson:
        model=input.model,
        temperature=input.temperature,
        response_model=ExtractPerson,
-        messages=[
-            {"role": "user", "content": rendered_prompt}
-        ]
-    ) # type: ignore
+        messages=[{"role": "user", "content": rendered_prompt}],
+    )  # type: ignore
@@ -1,6 +1,6 @@
 from collections import Counter, defaultdict
 from enum import Enum
-from typing import Any, Dict, Union, List
+from typing import Any, Dict, Union
 import numpy as np
 import json
 from pydantic import ValidationError
@@ -1,6 +1,5 @@
 import streamlit as st
 from stats_dict import stats_dict
-import json

 # Sample data
 query_data = {i: line.strip() for i, line in enumerate(open("test.jsonl", "r"))}
@@ -1,7 +1,7 @@
 import instructor
 import openai
 from typing import List
-from pydantic import BaseModel, Field
+from pydantic import BaseModel

 instructor.patch()

@@ -1,5 +1,6 @@
 from instructor import OpenAISchema, dsl
 from pydantic import Field
+import json


 class SearchQuery(OpenAISchema):
@@ -31,9 +32,7 @@ task = (
    )
    | SearchResponse
 )
-import pprint

-import json

 print(json.dumps(task.kwargs, indent=1))
 """
@@ -65,7 +65,7 @@ def generate_graph(input: List[str]) -> KnowledgeGraph:
            messages=[
                {
                    "role": "system",
-                    "content": f"""You are an iterative knowledge graph builder.
+                    "content": """You are an iterative knowledge graph builder.
                    You are given the current state of the graph, and you must append the nodes and edges 
                    to it Do not procide any duplcates and try to reuse nodes as much as possible.""",
                },
@@ -57,7 +57,7 @@ class Query(OpenAISchema):
    )

    async def execute(self, dependency_func):
-        print("Executing", f"`self.question`")
+        print("Executing", "`self.question`")
        print("Executing with", len(self.dependancies), "dependancies")

        if self.node_type == QueryType.SINGLE_QUESTION:
@@ -1,7 +1,7 @@
 import instructor
 import openai
 from pydantic import BaseModel, Field
-from typing import Optional, Type
+from typing import Optional

 instructor.patch()

@@ -1,5 +1,5 @@
 from typing_extensions import Annotated
-from pydantic import BaseModel, ValidationError, validator
+from pydantic import BaseModel, ValidationError
 from pydantic.functional_validators import AfterValidator


@@ -1,5 +1,4 @@
 import instructor
-import openai
 from pydantic import BaseModel, ValidationError, field_validator

 instructor.patch()
@@ -1,8 +1,5 @@
 from typing import List
-from typing_extensions import Annotated
-from rich.live import Live
 from rich.table import Table
-from rich.spinner import Spinner
 from rich.console import Console

 from datetime import datetime
@@ -76,9 +73,7 @@ def download(
    file_id: str = typer.Argument(..., help="ID of the file to download"),
    output: str = typer.Argument(..., help="Output path for the downloaded file"),
 ):
-    with console.status(
-        f"[bold green]Downloading file {file_id}...", spinner="dots"
-    ) as status:
+    with console.status(f"[bold green]Downloading file {file_id}...", spinner="dots"):
        content = openai.File.download(file_id)
        with open(output, "wb") as file:
            file.write(content)
@@ -89,9 +84,7 @@ def download(
    help="Delete a file from OpenAI's servers",
 )
 def delete(file_id: str = typer.Argument(..., help="ID of the file to delete")):
-    with console.status(
-        f"[bold red]Deleting file {file_id}...", spinner="dots"
-    ) as status:
+    with console.status(f"[bold red]Deleting file {file_id}...", spinner="dots"):
        try:
            openai.File.delete(file_id)
            console.log(f"[bold red]File {file_id} deleted successfully!")
@@ -102,7 +102,7 @@ def create_from_id(
 ):
    with console.status(
        f"[bold green]Creating fine-tuning job from ID {id}...", spinner="dots"
-    ) as status:
+    ):
        job = openai.FineTuningJob.create(training_file=id, model=model)
        console.log(f"[bold green]Fine-tuning job created with ID: {job.id}")  # type: ignore
    watch(limit=5, poll=2, screen=False)
@@ -143,7 +143,7 @@ def create_from_file(
    help="Cancel a fine-tuning job.",
 )
 def cancel(id: str = typer.Argument(..., help="ID of the fine-tuning job to cancel")):
-    with console.status(f"[bold red]Cancelling job {id}...", spinner="dots") as status:
+    with console.status(f"[bold red]Cancelling job {id}...", spinner="dots"):
        try:
            openai.FineTuningJob.cancel(id)
            console.log(f"[bold red]Job {id} cancelled successfully!")
@@ -42,11 +42,6 @@ async def get_usage_for_past_n_days(n_days: int) -> List[dict]:
    return all_data


-from collections import defaultdict
-from datetime import datetime
-from typing import List
-from rich.table import Table
-
 # Define the cost per unit for each model
 MODEL_COSTS = {
    "gpt-3.5-turbo": {"prompt": 0.0015 / 1000, "completion": 0.002 / 1000},
@@ -1,5 +1,5 @@
 from .completion import ChatCompletion
-from .messages import *
+from .messages import messages
 from .multitask import MultiTask
 from .maybe import Maybe
 from .validators import llm_validator
@@ -62,7 +62,7 @@ class ChatCompletion(BaseModel):
    function: OpenAISchema = Field(default=None, repr=False)

    def __post_init__(self):
-        assert self.stream == False, "Stream is not supported yet"
+        assert not self.stream, "Stream is not supported yet"

    def __or__(self, other: Union[Message, OpenAISchema]) -> "ChatCompletion":
        """
@@ -1,5 +1,5 @@
 from pydantic import BaseModel, create_model, Field
-from typing import Optional, List, Type, Union
+from typing import Optional, List, Type
 from instructor import OpenAISchema


@@ -1,4 +1,4 @@
-from pydantic import BaseModel, Field
+from pydantic import Field
 from typing import Optional
 import instructor
 import openai
@@ -70,7 +70,7 @@ class openai_function:
            ):
                parameters["properties"][name]["description"] = description
        parameters["required"] = sorted(
-            k for k, v in parameters["properties"].items() if not "default" in v
+            k for k, v in parameters["properties"].items() if "default" not in v
        )
        self.openai_schema = {
            "name": self.func.__name__,
@@ -176,7 +176,7 @@ class OpenAISchema(BaseModel):
                    parameters["properties"][name]["description"] = description

        parameters["required"] = sorted(
-            k for k, v in parameters["properties"].items() if not "default" in v
+            k for k, v in parameters["properties"].items() if "default" not in v
        )

        if "description" not in schema:
@@ -46,7 +46,9 @@ def handle_response_model(response_model: Type[BaseModel], kwargs):
    return response_model, new_kwargs


-def process_response(response, response_model, validation_context: dict = None, strict=None):  # type: ignore
+def process_response(
+    response, response_model, validation_context: dict = None, strict=None
+):  # type: ignore
    """Processes a OpenAI response with the response model, if available
    It can use `validation_context` and `strict` to validate the response
    via the pydantic model
@@ -38,7 +38,7 @@ def test_create_tagged_message():
 def test_task_message():
    assert s.SystemTask(task="task").dict() == {
        "role": "system",
-        "content": f"You are a world class state of the art algorithm capable of correctly completing the following task: `task`.",
+        "content": "You are a world class state of the art algorithm capable of correctly completing the following task: `task`.",
    }