Validation with Pydantic#
Here, we’ll see how to use pydantic to specify the schema and validate the results.
ATTENTION Validation does NOT imply that the extraction was correct. It only implies that the data was returned in the correct shape and meets all validation criteria. It does not mean that the LLM didn't make up some of the information!
import enum
from kor import create_extraction_chain, Object, Text, Number
import pydantic
from typing import List
from kor import from_pydantic
from pydantic import BaseModel, Field
from typing import Optional
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model_name="gpt-4o",
    temperature=0,
)
Let’s return to our hypothetical music player API:
class Action(enum.Enum):
    play = "play"
    stop = "stop"
    previous = "previous"
    next_ = "next"


class MusicRequest(BaseModel):
    song: Optional[List[str]] = Field(
        default=None, description="The song(s) that the user would like to be played."
    )
    album: Optional[List[str]] = Field(
        default=None, description="The album(s) that the user would like to be played."
    )
    artist: Optional[List[str]] = Field(
        default=None,
        description="The artist(s) whose music the user would like to hear.",
        examples=[("Songs by paul simon", "paul simon")],
    )
    action: Optional[Action] = Field(
        default=None,
        description="The action that should be taken; one of `play`, `stop`, `next`, `previous`",
        examples=[
            ("Please stop the music", "stop"),
            ("play something", "play"),
            ("play a song", "play"),
            ("next song", "next"),
        ],
    )
schema, validator = from_pydantic(MusicRequest)
ATTENTION Use the JSON encoder here rather than the default CSV encoder, because the JSON encoder supports nested lists.
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)
Let’s test it out:
print(chain.get_prompts()[0].format_prompt(text="[user input]").to_string())
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.
```TypeScript
musicrequest: { //
song: Array<string> // The song(s) that the user would like to be played.
album: Array<string> // The album(s) that the user would like to be played.
artist: Array<string> // The artist(s) whose music the user would like to hear.
action: "play" | "stop" | "previous" | "next" // The action that should be taken; one of `play`, `stop`, `next`, `previous`
}
```
Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.
Input: Songs by paul simon
Output: <json>{"musicrequest": {"artist": ["paul simon"]}}</json>
Input: Please stop the music
Output: <json>{"musicrequest": {"action": "stop"}}</json>
Input: play something
Output: <json>{"musicrequest": {"action": "play"}}</json>
Input: play a song
Output: <json>{"musicrequest": {"action": "play"}}</json>
Input: next song
Output: <json>{"musicrequest": {"action": "next"}}</json>
Input: [user input]
Output:
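The prompt above asks the model to wrap its answer in `<json>` tags. As a rough illustration of what decoding such a reply involves, here is a minimal standard-library sketch. It is not Kor's actual decoder; `parse_json_tagged` is a hypothetical helper name, and the reply string is copied from the examples in the prompt.

```python
import json
import re


def parse_json_tagged(reply: str) -> dict:
    """Hypothetical helper: decode the first <json>...</json> payload in a reply."""
    match = re.search(r"<json>(.*?)</json>", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <json>...</json> block found in the reply")
    return json.loads(match.group(1))


reply = '<json>{"musicrequest": {"action": "stop"}}</json>'
parsed = parse_json_tagged(reply)
print(parsed)
```

The decoded dictionary is still unvalidated at this point; checking it against the pydantic model is a separate step.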
chain.invoke("stop the music now")["validated_data"]
MusicRequest(song=None, album=None, artist=None, action=<Action.stop: 'stop'>)
chain.invoke("i want to hear yellow submarine by the beatles")["validated_data"]
MusicRequest(song=['yellow submarine'], album=None, artist=['the beatles'], action=<Action.play: 'play'>)
chain.invoke("play goliath by smith&thell")["validated_data"]
MusicRequest(song=['goliath'], album=None, artist=['smith&thell'], action=<Action.play: 'play'>)
chain.invoke("can you play the lion king soundtrack")["validated_data"]
MusicRequest(song=None, album=['the lion king soundtrack'], artist=None, action=<Action.play: 'play'>)
chain.invoke("play songs by paul simon and led zeppelin and the doors")["validated_data"]
MusicRequest(song=None, album=None, artist=['paul simon', 'led zeppelin', 'the doors'], action=<Action.play: 'play'>)
chain.invoke("could you play the previous song again?")["validated_data"]
MusicRequest(song=None, album=None, artist=None, action=<Action.previous: 'previous'>)
chain.invoke("previous")["validated_data"]
MusicRequest(song=None, album=None, artist=None, action=<Action.previous: 'previous'>)
chain.invoke("play the song before")["validated_data"]
MusicRequest(song=None, album=None, artist=None, action=<Action.previous: 'previous'>)
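The `action` values in the results above are `Action` members rather than raw strings. The standalone sketch below shows how a Python `Enum` maps a string value to a member and rejects anything else; pydantic performs a comparable lookup by value when it validates an `Enum` field. The `Action` class is repeated so the snippet runs on its own, and `is_valid_action` is a hypothetical helper added for illustration.

```python
import enum


class Action(enum.Enum):
    play = "play"
    stop = "stop"
    previous = "previous"
    next_ = "next"


# Lookup is by value, so the string "next" resolves to the member named next_.
stop = Action("stop")
nxt = Action("next")


def is_valid_action(value: str) -> bool:
    """Hypothetical helper: True if the string is one of the Action values."""
    try:
        Action(value)
    except ValueError:
        return False
    return True


print(stop, nxt, is_valid_action("pause"))
```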
Validation in Action#
class Player(BaseModel):
    song: List[str] = Field(
        description="The song(s) that the user would like to be played."
    )  # <-- Note this is NOT Optional
    album: Optional[List[str]] = Field(
        default=None, description="The album(s) that the user would like to be played."
    )
    artist: Optional[List[str]] = Field(
        default=None,
        description="The artist(s) whose music the user would like to hear.",
        examples=[("Songs by paul simon", "paul simon")],
    )
    action: Optional[Action] = Field(
        default=None,
        description="The action that should be taken; one of `play`, `stop`, `next`, `previous`",
        examples=[
            ("Please stop the music", "stop"),
            ("play something", "play"),
            ("play a song", "play"),
            ("next song", "next"),
        ],
    )


schema, validator = from_pydantic(Player)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)
Now the schema expects a list of songs to be parsed out of the query.
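Independent of Kor and the LLM, the effect of a required field can be seen by validating a plain dictionary with pydantic directly. The sketch below assumes pydantic is installed; `PlayerSketch` is a hypothetical, trimmed-down stand-in for the model above, not part of the original example.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class PlayerSketch(BaseModel):
    # No default is given, so pydantic treats this field as required.
    song: List[str] = Field(description="The song(s) to play.")
    artist: Optional[List[str]] = Field(default=None, description="The artist(s).")


# A payload that contains the required field validates.
ok = PlayerSketch(song=["yellow submarine"], artist=["the beatles"])

# A payload without `song` is rejected; collect the names of the offending fields.
try:
    PlayerSketch(**{"artist": ["the beatles"]})
    missing_fields = []
except ValidationError as exc:
    missing_fields = [error["loc"][0] for error in exc.errors()]

print(missing_fields)
```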
No valid data!#
We made `song` a required attribute in the pydantic schema above. Let’s see what happens now!
chain.invoke("stop the music now")
{'data': {'player': {'action': 'stop'}},
'raw': '<json>{"player": {"action": "stop"}}</json>',
'errors': [1 validation error for Player
song
Field required [type=missing, input_value={'action': 'stop'}, input_type=dict]
For further information visit https://errors.pydantic.dev/2.8/v/missing],
'validated_data': None}
chain.invoke("i want to hear yellow submarine by the beatles")["validated_data"]
Player(song=['yellow submarine'], album=None, artist=['the beatles'], action=None)
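Because a failed validation leaves `validated_data` as `None` and puts the details in `errors`, calling code should check both before using the result. The sketch below works on plain dictionaries shaped like the outputs above; `require_validated` is a hypothetical helper, and the error entry is a plain string standing in for the real pydantic exception object.

```python
from typing import Any, Dict


def require_validated(result: Dict[str, Any]) -> Any:
    """Hypothetical helper: return validated_data or raise if validation failed."""
    errors = result.get("errors") or []
    if errors or result.get("validated_data") is None:
        raise ValueError(
            f"extraction failed validation: {errors!r}; raw={result.get('raw')!r}"
        )
    return result["validated_data"]


# Stand-ins for chain results; the keys mirror the output shown above.
failed = {
    "data": {"player": {"action": "stop"}},
    "raw": '<json>{"player": {"action": "stop"}}</json>',
    "errors": ["song: Field required"],  # placeholder for a ValidationError
    "validated_data": None,
}
passed = {
    "data": {"player": {"song": ["yellow submarine"]}},
    "raw": '<json>{"player": {"song": ["yellow submarine"]}}</json>',
    "errors": [],
    "validated_data": {"song": ["yellow submarine"]},  # placeholder for a Player
}

print(require_validated(passed))
try:
    require_validated(failed)
except ValueError as exc:
    print("rejected:", exc)
```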
Validating Collections#
Currently, there are a few gotchas when modeling collections, and they depend on the encoder.
CSV Encoder#
A CSV encoder is expected to work best when encoding a list of records.
At the moment, the CSV encoder doesn’t handle embedded lists or objects.
(The flat schema in the following example works with either the JSON or the CSV encoder.)
class Person(BaseModel):
    name: str = Field(description="The person's name")
    age: int = Field(description="The age of the person")


schema, validator = from_pydantic(
    Person,
    description="Personal information",
    many=True,
    examples=[("Joe is 10 years old", {"name": "Joe", "age": "10"})],
)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="csv", validator=validator
)
print(chain.get_prompts()[0].format_prompt(text="[user input]").to_string())
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.
```TypeScript
person: Array<{ // Personal information
name: string // The person's name
age: number // The age of the person
}>
```
Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter.
Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.
Input: Joe is 10 years old
Output: name|age
Joe|10
Input: [user input]
Output:
chain.invoke("john is 13 years old. maria is 24 years old")["validated_data"]
[Person(name='john', age=13), Person(name='maria', age=24)]
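To make the CSV format concrete, the sketch below reads a reply in the pipe-delimited layout requested by the prompt, using only the standard library. It is an illustration rather than Kor's own decoder; `parse_pipe_csv` is a hypothetical helper, and the reply text is an assumed model output matching the result above.

```python
import csv
import io
from typing import Dict, List


def parse_pipe_csv(text: str) -> List[Dict[str, str]]:
    """Hypothetical helper: read a |-delimited table with a header row into records."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return [dict(row) for row in reader]


reply = "name|age\njohn|13\nmaria|24"
rows = parse_pipe_csv(reply)
print(rows)
```

Every cell comes back as a string. Turning `"13"` into the integer `13` is left to validation, since pydantic coerces numeric strings to `int` in its default mode. Each row is flat, which is why nested lists or objects do not fit this encoding.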
Complex Structure#
To serialize more complex structures, use the JSON encoder.
So for the example above, the following alternative works:
class Person(BaseModel):
    name: str = Field(description="The person's name")
    age: int = Field(description="The age of the person")


class Root(BaseModel):
    people: List[Person] = Field(
        description="Personal information",
        examples=[("John was 23 years old", {"name": "John", "age": 23})],
    )
NOTE Using a Root container and many=False
schema, validator = from_pydantic(Root, description="Personal information", many=False)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)
chain.invoke(
    "My name is tom and i am 23 years old. Her name is Jessica and she is 75 years old."
)
{'data': {'root': {'people': [{'name': 'tom', 'age': 23},
{'name': 'Jessica', 'age': 75}]}},
'raw': '<json>{"root": {"people": [{"name": "tom", "age": 23}, {"name": "Jessica", "age": 75}]}}</json>',
'errors': [],
'validated_data': Root(people=[Person(name='tom', age=23), Person(name='Jessica', age=75)])}
print(chain.get_prompts()[0].format_prompt(text="[user input]").to_string())
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.
```TypeScript
root: { // Personal information
people: Array<{ // Personal information
name: string // The person's name
age: number // The age of the person
}>
}
```
Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.
Input: John was 23 years old
Output: <json>{"root": {"people": [{"name": "John", "age": 23}]}}</json>
Input: [user input]
Output:
class Pet(BaseModel):
    name: str = Field(description="the name of the pet")
    species: Optional[str] = Field(
        default=None, description="The species of the pet; e.g., dog or cat"
    )
    age: Optional[int] = Field(default=None, description="The number of the age; e.g.,")
    age_unit: Optional[str] = Field(
        default=None, description="The unit of the age; e.g., days or weeks"
    )


class Person(BaseModel):
    name: str = Field(description="The person's name")
    age: Optional[int] = Field(default=None, description="The age of the person")
    pets: List[Pet] = Field(
        description="The pets owned by the person",
        examples=[
            (
                "he had a dog by the name of charles that was 5 days old",
                {"name": "dog", "species": "dog", "age": "5", "age_unit": "days"},
            )
        ],
    )


class Root(BaseModel):
    people: List[Person] = Field(
        description="Personal information",
        examples=[("John was 23 years old", {"name": "John", "age": 23})],
    )


schema, validator = from_pydantic(
    Root, description="Personal information for multiple people", many=False
)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)
print(chain.get_prompts()[0].format_prompt(text="[user input]").to_string())
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.
```TypeScript
root: { // Personal information for multiple people
people: Array<{ // Personal information
name: string // The person's name
age: number // The age of the person
pets: Array<{ // The pets owned by the person
name: string // the name of the pet
species: string // The species of the pet; e.g., dog or cat
age: number // The number of the age; e.g.,
age_unit: string // The unit of the age; e.g., days or weeks
}>
}>
}
```
Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.
Input: John was 23 years old
Output: <json>{"root": {"people": [{"name": "John", "age": 23}]}}</json>
Input: he had a dog by the name of charles that was 5 days old
Output: <json>{"root": {"people": [{"pets": [{"name": "dog", "species": "dog", "age": "5", "age_unit": "days"}]}]}}</json>
Input: [user input]
Output:
chain.invoke(
"Neo had a dog by the name of Tom and a cat by the name of Weeby. Weeby was 23 days old. Julia owned a horse. The horses name was Wind"
)["validated_data"]
Root(people=[Person(name='Neo', age=None, pets=[Pet(name='Tom', species='dog', age=0, age_unit='days'), Pet(name='Weeby', species='cat', age=23, age_unit='days')]), Person(name='Julia', age=None, pets=[Pet(name='Wind', species='horse', age=0, age_unit='days')])])
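Note that the result above validates even though the text never states an age for Tom or Wind; the `age=0` values have the right shape but no support in the input. One possible sanity check, sketched here as a heuristic and not as anything Kor provides, is to flag extracted numbers that never appear in the source text. `ungrounded_numbers` is a hypothetical helper, and the ages are copied by hand from the output above. The check only looks for digits, so it would wrongly flag ages that the text spells out in words.

```python
import re
from typing import Dict, List


def ungrounded_numbers(text: str, numbers: List[int]) -> List[int]:
    """Hypothetical heuristic: return the numbers that never occur as digits in text."""
    seen = {int(token) for token in re.findall(r"\d+", text)}
    return [number for number in numbers if number not in seen]


text = (
    "Neo had a dog by the name of Tom and a cat by the name of Weeby. "
    "Weeby was 23 days old. Julia owned a horse. The horses name was Wind"
)
# Pet ages copied from the validated result above.
extracted_ages: Dict[str, int] = {"Tom": 0, "Weeby": 23, "Wind": 0}

suspicious = [
    name for name, age in extracted_ages.items() if ungrounded_numbers(text, [age])
]
print(suspicious)
```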