Working With Objects

Kor attempts to make it easy to extract objects from text.

from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number
from langchain.chat_models import ChatOpenAI

# temperature=0 keeps the extraction output as repeatable as possible.
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    max_tokens=2000,
)

Object Schema

Kor uses the familiar idea of an Object type to specify how to extract an object from text.

schema = Object(
    id="personal_info",
    description="Personal information about a given person.",
    attributes=[
        Text(
            id="first_name",
            description="The first name of the person",
            examples=[("John Smith went to the store", "John")],
        ),
        Text(
            id="last_name",
            description="The last name of the person",
            examples=[("John Smith went to the store", "Smith")],
        ),
        Number(
            id="age",
            description="The age of the person in years.",
            examples=[("23 years old", "23"), ("I turned three on sunday", "3")],
        ),
    ],
    examples=[
        (
            "John Smith was 23 years old. He was very tall. He knew Jane Doe. She was 5 years old.",
            [
                {"first_name": "John", "last_name": "Smith", "age": 23},
                {"first_name": "Jane", "last_name": "Doe", "age": 5},
            ],
        )
    ],
    many=True,
)


chain = create_extraction_chain(llm, schema)
print(chain.prompt.format_prompt(text="[user input]").to_string())
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

personal_info: Array<{ // Personal information about a given person.
 first_name: string // The first name of the person
 last_name: string // The last name of the person
 age: number // The age of the person in years.
}>
```


Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. 
 Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.



Input: John Smith was 23 years old. He was very tall. He knew Jane Doe. She was 5 years old.
Output: first_name|last_name|age
John|Smith|23
Jane|Doe|5

Input: John Smith went to the store
Output: first_name|last_name|age
John||

Input: John Smith went to the store
Output: first_name|last_name|age
|Smith|

Input: 23 years old
Output: first_name|last_name|age
||23

Input: I turned three on sunday
Output: first_name|last_name|age
||3

Input: [user input]
Output:

Note that in the prompt above, the examples were specified at the attribute level.

When this works, it makes attributes easier to compose. In general, however, extraction quality improves when examples are provided at the object level (as we’ll do below), because they help the model determine which attributes belong together.
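To make the attribute-level rows in the prompt easier to read, here is a minimal sketch of how a single attribute example could be rendered as a header row plus one data row, with every other column left blank. This is not Kor's implementation; the function name `render_attribute_example` and the `COLUMNS` constant are assumptions made for illustration, and only Python's standard `csv` module is used.

```python
import csv
import io

# Column order taken from the schema shown above (hypothetical constant).
COLUMNS = ["first_name", "last_name", "age"]


def render_attribute_example(attribute, value, columns=COLUMNS):
    """Render one attribute-level example as a header row plus a single data row."""
    # Only the column that matches the attribute gets a value; the rest stay empty.
    row = [str(value) if column == attribute else "" for column in columns]
    buffer = io.StringIO()
    writer = csv.writer(buffer, dialect="excel", delimiter="|", lineterminator="\n")
    writer.writerow(columns)
    writer.writerow(row)
    return buffer.getvalue().rstrip("\n")


print(render_attribute_example("first_name", "John"))
print(render_attribute_example("age", 23))
```

Each attribute example fills exactly one column, so an example on its own carries no information about which first name goes with which last name or age. That is the gap an object-level example closes.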

chain.run("Eugene was 18 years old a long time ago.")["data"]
{'personal_info': [{'first_name': 'Eugene', 'last_name': '', 'age': '18'}]}
chain = create_extraction_chain(llm, schema)
print(
    chain.run(
        "My name is Bob Alice and my phone number is (123)-444-9999. I found my true love one"
        " on a blue sunday. Her number was (333)1232832. Her name was Moana Sunrise and she was 10 years old."
    )["data"]
)
{'personal_info': [{'first_name': 'Bob', 'last_name': 'Alice', 'age': ''}]}

Nothing should be extracted from the text below, which contains phone numbers but no names or ages.

chain.run(
    "My phone number is (123)-444-9999. I found my true love one on a blue sunday."
    " Her number was (333)1232832"
)["data"]
{'personal_info': [{'first_name': '', 'last_name': '', 'age': ''}]}
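The result above still contains one record whose fields are all empty strings, and ages in the earlier results come back as strings such as '18'. A small post-processing step can tidy this up. The sketch below is an assumption about how one might do that in application code; `clean_records` is a hypothetical helper and not part of Kor.

```python
def clean_records(records):
    """Drop records whose values are all empty and convert numeric ages to int."""
    cleaned = []
    for record in records:
        # Skip records where every value is empty or whitespace.
        if not any(str(value).strip() for value in record.values()):
            continue
        normalized = dict(record)
        age = str(normalized.get("age", "")).strip()
        # Ages that are not plain digits (including missing ones) become None.
        normalized["age"] = int(age) if age.isdigit() else None
        cleaned.append(normalized)
    return cleaned


data = {"personal_info": [{"first_name": "", "last_name": "", "age": ""}]}
print(clean_records(data["personal_info"]))
```

For the all-empty record shown above, the helper returns an empty list, which matches the intent that nothing should be extracted.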

Handling Hallucinations

LLMs that follow instructions poorly will need more examples to perform well on extraction tasks.

Let’s comment out the attribute-level examples from the previous schema and see how the output changes.

schema = Object(
    id="personal_info",
    description="Personal information about a given person.",
    attributes=[
        Text(
            id="first_name",
            description="The first name of the person",
            # examples=[("John Smith went to the store", "John")]
        ),
        Text(
            id="last_name",
            description="The last name of the person",
            # examples=[("John Smith went to the store", "Smith")],
        ),
        Number(
            id="age",
            description="The age of the person in years.",
            # examples=[("23 years old", "23"), ("I turned three on sunday", "3")]
        ),
    ],
    examples=[
        (
            "John Smith was 23 years old. He was very tall. He knew Jane Doe. She was 5 years old.",
            [
                {"first_name": "John", "last_name": "Smith", "age": 23},
                {"first_name": "Jane", "last_name": "Doe", "age": 5},
            ],
        )
    ],
    many=True,
)
chain = create_extraction_chain(llm, schema)
chain.run(
    "My name is Bob Alice and my phone number is (123)-444-9999. I found my true love one"
    " on a blue sunday. Her number was (333)1232832. Her name was Moana Sunrise and she was 10 years old."
)["data"]
{'personal_info': [{'first_name': 'Bob', 'last_name': 'Alice', 'age': ''},
  {'first_name': 'Moana', 'last_name': 'Sunrise', 'age': '10'}]}

What’s the actual prompt?

print(chain.prompt.format_prompt(text="[user input]").to_string())
Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.

```TypeScript

personal_info: Array<{ // Personal information about a given person.
 first_name: string // The first name of the person
 last_name: string // The last name of the person
 age: number // The age of the person in years.
}>
```


Please output the extracted information in CSV format in Excel dialect. Please use a | as the delimiter. 
 Do NOT add any clarifying information. Output MUST follow the schema above. Do NOT add any additional columns that do not appear in the schema.



Input: John Smith was 23 years old. He was very tall. He knew Jane Doe. She was 5 years old.
Output: first_name|last_name|age
John|Smith|23
Jane|Doe|5

Input: [user input]
Output:
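The prompt asks the model to answer in pipe-delimited CSV using the Excel dialect. As a rough illustration of what that format implies, the sketch below parses such a response back into a list of records with Python's standard `csv` module. It is not Kor's own parser; `parse_pipe_csv` is a hypothetical name, and the sample string is copied from the example output in the prompt.

```python
import csv
import io


def parse_pipe_csv(raw_output):
    """Parse pipe-delimited CSV text (header row first) into a list of dicts."""
    reader = csv.DictReader(
        io.StringIO(raw_output.strip()),
        dialect="excel",
        delimiter="|",
    )
    return [dict(row) for row in reader]


raw = "first_name|last_name|age\nJohn|Smith|23\nJane|Doe|5"
for record in parse_pipe_csv(raw):
    print(record)
```

Every value parsed this way is a string, which is consistent with the ages appearing as strings in the results shown earlier.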