Validation with Pydantic

Here, we’ll see how to use pydantic to specify the schema and validate the results.

ATTENTION Validation does NOT imply that extraction was correct. Validation only implies that the data was returned in the correct shape and meets all validation criteria. It does not rule out that the LLM made up some of the information!
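
To make the warning above concrete, here is a minimal sketch that uses plain pydantic, with no Kor or LLM involved. The `Song` model and both payloads are hypothetical and exist only for this illustration. A fabricated value passes validation as long as it has the right shape, while a wrongly shaped payload is rejected.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class Song(BaseModel):
    # Hypothetical model used only for this illustration.
    title: str
    artists: Optional[List[str]] = None


# Pretend an LLM returned this for the text "play something relaxing".
# Nothing in that text mentions this title, but the shape is valid.
hallucinated = {"title": "a song nobody asked for", "artists": ["unknown band"]}
song = Song(**hallucinated)  # passes: validation checks shape, not truth

# A wrongly shaped payload, in contrast, is rejected.
try:
    Song(**{"artists": "not a list"})
    shape_error = None
except ValidationError as error:
    shape_error = error
```

Catching fabricated content needs a separate check, for example comparing the extracted values against the input text.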

import enum
from typing import List, Optional

from kor import create_extraction_chain, from_pydantic
from langchain.chat_models import ChatOpenAI
from pydantic import BaseModel, Field

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
)

Let’s return to our hypothetical music player API:

class Action(enum.Enum):
    play = "play"
    stop = "stop"
    previous = "previous"
    next_ = "next"


class MusicRequest(BaseModel):
    song: Optional[List[str]] = Field(
        default=None, description="The song(s) that the user would like to be played."
    )
    album: Optional[List[str]] = Field(
        default=None, description="The album(s) that the user would like to be played."
    )
    artist: Optional[List[str]] = Field(
        default=None,
        description="The artist(s) whose music the user would like to hear.",
        examples=[("Songs by paul simon", "paul simon")],
    )
    action: Optional[Action] = Field(
        default=None,
        description="The action that should be taken; one of `play`, `stop`, `next`, `previous`",
        examples=[
            ("Please stop the music", "stop"),
            ("play something", "play"),
            ("play a song", "play"),
            ("next song", "next"),
        ],
    )
schema, validator = from_pydantic(MusicRequest)

ATTENTION Use the JSON encoder here rather than the default CSV encoder, because the JSON encoder supports nested lists and the CSV encoder does not.

chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)
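
The reason a flat CSV row struggles with list-valued fields can be shown with the standard library alone. The sketch below is not how Kor's encoders are implemented; it only assumes a hypothetical record shaped like the schema above and compares a JSON round trip with a pipe-delimited CSV round trip.

```python
import csv
import io
import json

# Hypothetical record shaped like the MusicRequest schema.
record = {"artist": ["paul simon", "led zeppelin"], "action": "play"}

# JSON keeps the nested list as a list.
json_round_trip = json.loads(json.dumps(record))

# A flat CSV row has one cell per column, so the list is written as its
# string representation and comes back as a plain string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["artist", "action"], delimiter="|")
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
csv_round_trip = next(csv.DictReader(buffer, delimiter="|"))
```

After the CSV round trip, `csv_round_trip["artist"]` is a single string rather than a list of two names, which is why the JSON encoder is the safer choice for this schema.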

Let’s test it out:

print(chain.prompt.format_prompt(text="[user input]").to_string())
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

musicrequest: { // 
 song: Array<string> // The song(s) that the user would like to be played.
 album: Array<string> // The album(s) that the user would like to be played.
 artist: Array<string> // The artist(s) whose music the user would like to hear.
 action: "play" | "stop" | "previous" | "next" // The action that should be taken; one of `play`, `stop`, `next`, `previous`
}
```


Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.



Input: Songs by paul simon
Output: <json>{"musicrequest": {"artist": ["paul simon"]}}</json>
Input: Please stop the music
Output: <json>{"musicrequest": {"action": "stop"}}</json>
Input: play something
Output: <json>{"musicrequest": {"action": "play"}}</json>
Input: play a song
Output: <json>{"musicrequest": {"action": "play"}}</json>
Input: next song
Output: <json>{"musicrequest": {"action": "next"}}</json>
Input: [user input]
Output:
chain.run("stop the music now")["validated_data"]
MusicRequest(song=None, album=None, artist=None, action=<Action.stop: 'stop'>)
chain.run("i want to hear yellow submarine by the beatles")["validated_data"]
MusicRequest(song=['yellow submarine'], album=None, artist=['the beatles'], action=None)
chain.run("play goliath by smith&thell")["validated_data"]
MusicRequest(song=['goliath'], album=None, artist=['smith&thell'], action=None)
chain.run("can you play the lion king soundtrack")["validated_data"]
MusicRequest(song=None, album=['the lion king soundtrack'], artist=None, action=None)
chain.run("play songs by paul simon and led zeppelin and the doors")["validated_data"]
MusicRequest(song=None, album=None, artist=['paul simon', 'led zeppelin', 'the doors'], action=None)
chain.run("could you play the previous song again?")["validated_data"]
MusicRequest(song=None, album=None, artist=None, action=<Action.previous: 'previous'>)
chain.run("previous")["validated_data"]
MusicRequest(song=None, album=None, artist=None, action=<Action.previous: 'previous'>)
chain.run("play the song before")["validated_data"]
MusicRequest(song=None, album=None, artist=None, action=<Action.previous: 'previous'>)
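
The prompt asks the model to wrap its answer in <json> tags. As a rough illustration of how such a tagged response could be turned into a Python object, here is a minimal sketch using only the standard library. The function name `parse_json_tagged` is hypothetical, and this is an assumption about the general approach, not Kor's actual parsing code.

```python
import json
import re
from typing import Any, Optional

JSON_TAG_PATTERN = re.compile(r"<json>(.*?)</json>", re.DOTALL)


def parse_json_tagged(raw: str) -> Optional[Any]:
    """Return the decoded content of the first <json>...</json> block, or None."""
    match = JSON_TAG_PATTERN.search(raw)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # The tags were present but the content was not valid JSON.
        return None


raw = '<json>{"musicrequest": {"action": "stop"}}</json>'
parsed = parse_json_tagged(raw)
```

Returning None for both missing tags and malformed JSON keeps the caller simple; a stricter variant could raise an exception instead.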

Validation in Action

class Player(BaseModel):
    song: List[str] = Field(
        description="The song(s) that the user would like to be played."
    )  # <-- Note this is NOT Optional
    album: Optional[List[str]] = Field(
        default=None, description="The album(s) that the user would like to be played."
    )
    artist: Optional[List[str]] = Field(
        default=None,
        description="The artist(s) whose music the user would like to hear.",
        examples=[("Songs by paul simon", "paul simon")],
    )
    action: Optional[Action] = Field(
        default=None,
        description="The action that should be taken; one of `play`, `stop`, `next`, `previous`",
        examples=[
            ("Please stop the music", "stop"),
            ("play something", "play"),
            ("play a song", "play"),
            ("next song", "next"),
        ],
    )
schema, validator = from_pydantic(Player)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)

Now the schema expects a list of songs to be parsed out of the query.

No valid data!

We made song a required attribute in the pydantic schema above, so any extraction that does not include a song will now fail validation. Let’s see what happens:

chain.run("stop the music now")
{'data': {'player': {'action': 'stop'}},
 'raw': '<json>{"player": {"action": "stop"}}</json>',
 'errors': [1 validation error for Player
  song
    Field required [type=missing, input_value={'action': 'stop'}, input_type=dict]
      For further information visit https://errors.pydantic.dev/2.3/v/missing],
 'validated_data': None}
chain.run("i want to hear yellow submarine by the beatles")["validated_data"]
Player(song=['yellow submarine'], album=None, artist=['the beatles'], action=None)
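
The result above pairs a `validated_data` entry with a list of `errors`. The following sketch reproduces that pattern with plain pydantic so it can be run without an LLM. The helper `validate_payload` and the trimmed-down `Player` model are hypothetical stand-ins, not Kor's validator.

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel, ValidationError


class Player(BaseModel):
    song: List[str]  # required, as in the schema above
    album: Optional[List[str]] = None
    artist: Optional[List[str]] = None


def validate_payload(data: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: validate one parsed record and collect errors."""
    try:
        return {"validated_data": Player(**data), "errors": []}
    except ValidationError as error:
        return {"validated_data": None, "errors": [error]}


missing_song = validate_payload({"artist": ["the beatles"]})
with_song = validate_payload({"song": ["yellow submarine"]})
```

Each collected `ValidationError` exposes an `errors()` method whose entries carry a `loc` tuple naming the offending field, which is `("song",)` for the first payload.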

Validating Collections

Currently, there are a few gotchas when modeling collections, and they depend on the encoder.

CSV Encoder

The CSV encoder is expected to work best when encoding a flat list of records.

At the moment, the CSV encoder doesn’t handle embedded lists or objects.

(The example below uses only flat fields, so it works with either the JSON or the CSV encoder.)

class Person(BaseModel):
    name: str = Field(description="The person's name")
    age: int = Field(description="The age of the person")
schema, validator = from_pydantic(
    Person,
    description="Personal information",
    many=True,
    examples=[("Joe is 10 years old", {"name": "Joe", "age": "10"})],
)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="csv", validator=validator
)
print(chain.prompt.format_prompt(text="[user input]").to_string())
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

person: Array<{ // Personal information
 name: string // The person's name
 age: number // The age of the person
}>
```


Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. 
 Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.



Input: Joe is 10 years old
Output: name|age
Joe|10

Input: [user input]
Output:
chain.run("john is 13 years old. maria is 24 years old")["validated_data"]
[Person(name='john', age=13), Person(name='maria', age=24)]
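
To see how pipe-delimited text like the example output in the prompt can become validated records, here is a minimal sketch using the standard `csv` module and pydantic. The function `parse_people` is hypothetical and is not part of Kor; it assumes the first row is a header that matches the model's field names.

```python
import csv
import io
from typing import List

from pydantic import BaseModel


class Person(BaseModel):
    name: str
    age: int


def parse_people(raw: str) -> List[Person]:
    """Parse pipe-delimited text with a header row into Person records."""
    reader = csv.DictReader(io.StringIO(raw), delimiter="|")
    # CSV cells are strings; pydantic converts "13" to the int 13.
    return [Person(**row) for row in reader]


people = parse_people("name|age\njohn|13\nmaria|24\n")
```

Because every CSV cell arrives as a string, the type conversion done by pydantic is what turns the age column into integers.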

Complex Structure

To serialize more complex structures, use the JSON encoder.

So for the example above, the following alternative works:

class Person(BaseModel):
    name: str = Field(description="The person's name")
    age: int = Field(description="The age of the person")


class Root(BaseModel):
    people: List[Person] = Field(
        description="Personal information",
        examples=[("John was 23 years old", {"name": "John", "age": 23})],
    )

NOTE This example uses a Root container and many=False.

schema, validator = from_pydantic(Root, description="Personal information", many=False)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)
chain.run(
    "My name is tom and i am 23 years old. Her name is Jessica and she is 75 years old."
)
{'data': {'root': {'people': [{'name': 'tom', 'age': 23},
    {'name': 'Jessica', 'age': 75}]}},
 'raw': '<json>{"root": {"people": [{"name": "tom", "age": 23}, {"name": "Jessica", "age": 75}]}}</json>',
 'errors': [],
 'validated_data': Root(people=[Person(name='tom', age=23), Person(name='Jessica', age=75)])}
print(chain.prompt.format_prompt(text="[user input]").to_string())
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

root: { // Personal information
 people: Array<{ // Personal information
  name: string // The person's name
  age: number // The age of the person
 }>
}
```


Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.



Input: John was 23 years old
Output: <json>{"root": {"people": [{"name": "John", "age": 23}]}}</json>
Input: [user input]
Output:
class Pet(BaseModel):
    name: str = Field(description="the name of the pet")
    species: Optional[str] = Field(
        default=None, description="The species of the pet; e.g., dog or cat"
    )
    age: Optional[int] = Field(default=None, description="The number of the age; e.g.,")
    age_unit: Optional[str] = Field(
        default=None, description="The unit of the age; e.g., days or weeks"
    )


class Person(BaseModel):
    name: str = Field(description="The person's name")
    age: Optional[int] = Field(default=None, description="The age of the person")
    pets: List[Pet] = Field(
        description="The pets owned by the person",
        examples=[
            (
                "he had a dog by the name of charles that was 5 days old",
                {"name": "dog", "species": "dog", "age": "5", "age_unit": "days"},
            )
        ],
    )


class Root(BaseModel):
    people: List[Person] = Field(
        description="Personal information",
        examples=[("John was 23 years old", {"name": "John", "age": 23})],
    )
schema, validator = from_pydantic(
    Root, description="Personal information for multiple people", many=False
)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)
print(chain.prompt.format_prompt(text="[user input]").to_string())
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

root: { // Personal information for multiple people
 people: Array<{ // Personal information
  name: string // The person's name
  age: number // The age of the person
  pets: Array<{ // The pets owned by the person
   name: string // the name of the pet
   species: string // The species of the pet; e.g., dog or cat
   age: number // The number of the age; e.g.,
   age_unit: string // The unit of the age; e.g., days or weeks
  }>
 }>
}
```


Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.



Input: John was 23 years old
Output: <json>{"root": {"people": [{"name": "John", "age": 23}]}}</json>
Input: he had a dog by the name of charles that was 5 days old
Output: <json>{"root": {"people": [{"pets": [{"name": "dog", "species": "dog", "age": "5", "age_unit": "days"}]}]}}</json>
Input: [user input]
Output:
chain.run(
    text="Neo had a dog by the name of Tom and a cat by the name of Weeby. Weeby was 23 days old. Julia owned a horse. The horses name was Wind"
)["validated_data"]
Root(people=[Person(name='Neo', age=None, pets=[Pet(name='Tom', species='dog', age=None, age_unit=None), Pet(name='Weeby', species='cat', age=23, age_unit='days')]), Person(name='Julia', age=None, pets=[Pet(name='Wind', species='horse', age=None, age_unit=None)])])
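
When a deeply nested structure fails validation, it helps to know where the problem sits. The sketch below uses plain pydantic with simplified, hypothetical versions of the models above and a hand-written payload in place of LLM output. It collects the location of each error as a path into the payload.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class Pet(BaseModel):
    name: str
    age: Optional[int] = None


class Person(BaseModel):
    name: str
    pets: List[Pet]


class Root(BaseModel):
    people: List[Person]


# Hand-written payload: the second pet has an age that is not a number.
payload = {
    "people": [
        {"name": "Neo", "pets": [{"name": "Tom"}, {"name": "Weeby", "age": "old"}]}
    ]
}

try:
    Root(**payload)
    error_locations = []
except ValidationError as error:
    # Each location is a path of keys and list indexes into the payload.
    error_locations = [entry["loc"] for entry in error.errors()]
```

Here the single location is `("people", 0, "pets", 1, "age")`, which points at the age of the second pet of the first person.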