kor package#

Subpackages#

Submodules#

kor.adapters module#

Adapters to convert from validation frameworks to Kor internal representation.

kor.adapters.from_pydantic(model_class: Type[pydantic.main.BaseModel], *, description: str = '', examples: Sequence[Tuple[str, Dict[str, Any]]] = (), many: bool = False) Tuple[kor.nodes.Object, kor.validators.Validator][source]#

Convert a pydantic model to Kor internal representation.

Parameters
  • model_class – The pydantic model class to convert.

  • description – The description of the model.

  • examples – A sequence of examples to be used for the model.

  • many – Whether to expect the model to be a list of models.

Returns

A tuple of the Kor internal representation of the model and a validator.
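For illustration, a minimal sketch of calling from_pydantic with a hypothetical Person model (the model, its fields, and the example text are assumptions, not part of the library):

from typing import Optional

from pydantic import BaseModel

from kor import from_pydantic

class Person(BaseModel):
    name: str
    age: Optional[int] = None

# Returns a kor Object schema plus a validator built from the model.
schema, validator = from_pydantic(
    Person,
    description="Personal information",
    examples=[("Alice is 30 years old", {"name": "Alice", "age": 30})],
    many=True,
)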

kor.examples module#

Module for code that generates examples for a given input.

At the moment, this code only has a simple implementation that concatenates all the examples, but one may want to select or generate examples in a smarter way, or take into account the finite size of the context window and limit the number of examples.

The code uses a default encoding of XML. This encoding should match the parser.

class kor.examples.SimpleExampleAggregator[source]#

Bases: kor.nodes.AbstractVisitor[List[Tuple[str, str]]]

Visit a node and all of its descendants, aggregating all examples.

visit(node: kor.nodes.AbstractSchemaNode) List[Tuple[str, str]][source]#

Entry-point.

visit_default(node: kor.nodes.AbstractSchemaNode, **kwargs: Any) List[Tuple[str, str]][source]#

Default visitor implementation.

visit_object(node: kor.nodes.Object, **kwargs: Any) List[Tuple[str, str]][source]#

Implementation of an object visitor.

visit_option(node: kor.nodes.Option, **kwargs: Any) List[Tuple[str, str]][source]#

Should not visit Options directly.

visit_selection(node: kor.nodes.Selection, **kwargs: Any) List[Tuple[str, str]][source]#

Selection visitor.

kor.examples.generate_examples(node: kor.nodes.AbstractSchemaNode) List[Tuple[str, str]][source]#

Generate examples for a given element.

A rudimentary implementation that simply concatenates all available examples from the components across the entire element tree.

Does not provide a way to impose constraints (e.g., select a subset of examples to meet a constraint on the overall number of tokens).

Parameters

node – AbstractSchemaNode

Returns

list of 2-tuples containing input, output pairs
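A hedged sketch of calling generate_examples on a small schema; the schema, ids, and example strings below are made up for illustration:

from kor import Object, Text
from kor.examples import generate_examples

schema = Object(
    id="person",
    description="Personal information",
    attributes=[
        Text(id="name", examples=[("Alice went home", "Alice")]),
    ],
    examples=[("Bob met Alice", {"name": ["Bob", "Alice"]})],
)

# Aggregates all examples found across the element tree into (input, output) pairs.
pairs = generate_examples(schema)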

kor.exceptions module#

exception kor.exceptions.KorException[source]#

Bases: Exception

Base class for all Kor exceptions.

exception kor.exceptions.ParseError[source]#

Bases: kor.exceptions.KorException

Exception for parsing errors.

exception kor.exceptions.ValidationError[source]#

Bases: kor.exceptions.KorException

Exception for validators.
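A small, hedged sketch of working with the exception hierarchy (where exactly each exception is raised in the pipeline is not asserted here):

from kor.exceptions import KorException, ParseError, ValidationError

try:
    raise ParseError("could not parse the LLM output")  # illustrative only
except ValidationError:
    ...  # handle validation failures separately if desired
except KorException:
    ...  # ParseError is caught here, since both derive from KorException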

kor.nodes module#

Definitions of input elements.

class kor.nodes.AbstractSchemaNode(*, id: str, description: str = '', many: bool = False)[source]#

Bases: pydantic.main.BaseModel

Abstract schema node.

Each node is expected to have a unique ID that uses only alphanumeric characters.

The ID should be unique across all inputs that belong to a given form.

The description should describe what the node represents. It is used during prompt generation.

abstract accept(visitor: kor.nodes.AbstractVisitor[kor.nodes.T], **kwargs: Any) kor.nodes.T[source]#

Accept a visitor.

description: str#
id: str#
many: bool#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

replace(id: Optional[str] = None, description: Optional[str] = None) kor.nodes.AbstractSchemaNode[source]#

Wrapper around dataclasses-style replace().

class kor.nodes.AbstractVisitor[source]#

Bases: Generic[kor.nodes.T], abc.ABC

An abstract visitor.

visit_bool(node: kor.nodes.Bool, **kwargs: Any) kor.nodes.T[source]#

Visit bool node.

visit_default(node: kor.nodes.AbstractSchemaNode, **kwargs: Any) kor.nodes.T[source]#

Default node implementation.

visit_number(node: kor.nodes.Number, **kwargs: Any) kor.nodes.T[source]#

Visit number node.

visit_object(node: kor.nodes.Object, **kwargs: Any) kor.nodes.T[source]#

Visit object node.

visit_option(node: kor.nodes.Option, **kwargs: Any) kor.nodes.T[source]#

Visit option node.

visit_selection(node: kor.nodes.Selection, **kwargs: Any) kor.nodes.T[source]#

Visit selection node.

visit_text(node: kor.nodes.Text, **kwargs: Any) kor.nodes.T[source]#

Visit text node.
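As a hedged illustration of the visitor pattern, a hypothetical visitor that collects node ids; it assumes, as the built-in descriptors do, that node types without a dedicated handler fall through to visit_default:

from typing import Any, List

from kor.nodes import AbstractSchemaNode, AbstractVisitor, Object

class IdCollector(AbstractVisitor[List[str]]):
    """Collect the ids of every node in a schema tree."""

    def visit_default(self, node: AbstractSchemaNode, **kwargs: Any) -> List[str]:
        return [node.id]

    def visit_object(self, node: Object, **kwargs: Any) -> List[str]:
        ids = [node.id]
        for attribute in node.attributes:
            ids.extend(attribute.accept(self, **kwargs))
        return ids

# Usage: schema.accept(IdCollector()) returns the ids of the object and its attributes.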

class kor.nodes.Bool(*, id: str, description: str = '', many: bool = False, examples: Sequence[Tuple[str, Union[Sequence[bool], bool]]] = ())[source]#

Bases: kor.nodes.ExtractionSchemaNode

Built-in bool input.

accept(visitor: kor.nodes.AbstractVisitor[kor.nodes.T], **kwargs: Any) kor.nodes.T[source]#

Accept a visitor.

examples: Sequence[Tuple[str, Union[Sequence[bool], bool]]]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'examples': FieldInfo(annotation=Sequence[Tuple[str, Union[Sequence[bool], bool]]], required=False, default=()), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class kor.nodes.ExtractionSchemaNode(*, id: str, description: str = '', many: bool = False, examples: Sequence[Tuple[str, Union[bool, int, float, str, Sequence[Union[str, int, float, bool]]]]] = ())[source]#

Bases: kor.nodes.AbstractSchemaNode, abc.ABC

An abstract definition for inputs that involve extraction.

An extraction input can be associated with extraction examples.

An extraction example is a 2-tuple composed of a text segment and the expected extraction.

For example:

[
    ("I bought this cookie for $10", "$10"),
    ("Eggs cost twelve dollars", "twelve dollars"),
]
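For instance, the same examples could be attached to a Text node (this particular node and its id are illustrative only):

from kor.nodes import Text

price_mentions = Text(
    id="price",
    description="How much a purchased item cost",
    examples=[
        ("I bought this cookie for $10", "$10"),
        ("Eggs cost twelve dollars", "twelve dollars"),
    ],
)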
examples: Sequence[Tuple[str, Union[bool, int, float, str, Sequence[Union[str, int, float, bool]]]]]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'examples': FieldInfo(annotation=Sequence[Tuple[str, Union[bool, int, float, str, Sequence[Union[str, int, float, bool]]]]], required=False, default=()), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

classmethod parse_obj(data: dict) kor.nodes.ExtractionSchemaNode[source]#

Parse an object.

classmethod validate(v: Any) kor.nodes.ExtractionSchemaNode[source]#
class kor.nodes.Number(*, id: str, description: str = '', many: bool = False, examples: Sequence[Tuple[str, Union[int, float, Sequence[Union[float, int]]]]] = ())[source]#

Bases: kor.nodes.ExtractionSchemaNode

Built-in number input.

accept(visitor: kor.nodes.AbstractVisitor[kor.nodes.T], **kwargs: Any) kor.nodes.T[source]#

Accept a visitor.

examples: Sequence[Tuple[str, Union[int, float, Sequence[Union[float, int]]]]]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'examples': FieldInfo(annotation=Sequence[Tuple[str, Union[int, float, Sequence[Union[float, int]]]]], required=False, default=()), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class kor.nodes.Object(*, id: str, description: str = '', many: bool = False, attributes: Sequence[Union[kor.nodes.ExtractionSchemaNode, kor.nodes.Selection, kor.nodes.Object]], examples: Sequence[Tuple[str, Union[Sequence[Mapping[str, Any]], Mapping[str, Any]]]] = ())[source]#

Bases: kor.nodes.AbstractSchemaNode

Built-in representation for an object.

Use an object node to represent an entire object that should be extracted.

An extraction input can be associated with two different types of examples.

Example:

object = Object(
    id="cookie",
    description="Information about a cookie including price and name.",
    attributes=[
        Text(id="name", description="The name of the cookie"),
        Number(id="price", description="The price of the cookie"),
    ],
    examples=[
        ("I bought this Big Cookie for $10",
            {"name": "Big Cookie", "price": "$10"}),
        ("Eggs cost twelve dollars", {}), # Not a cookie
    ],
)
accept(visitor: kor.nodes.AbstractVisitor[kor.nodes.T], **kwargs: Any) kor.nodes.T[source]#

Accept a visitor.

attributes: Sequence[Union[ExtractionSchemaNode, Selection, Object]]#
examples: Sequence[Tuple[str, Union[Sequence[Mapping[str, Any]], Mapping[str, Any]]]]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'attributes': FieldInfo(annotation=Sequence[Union[ExtractionSchemaNode, Selection, Object]], required=True), 'description': FieldInfo(annotation=str, required=False, default=''), 'examples': FieldInfo(annotation=Sequence[Tuple[str, Union[Sequence[Mapping[str, Any]], Mapping[str, Any]]]], required=False, default=()), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

classmethod parse_obj(*args: Any, **kwargs: Any) kor.nodes.Object[source]#

Parse an object.

classmethod parse_raw(*args: Any, **kwargs: Any) kor.nodes.Object[source]#

Parse raw data.

class kor.nodes.Option(*, id: str, description: str = '', many: bool = False, examples: Sequence[str] = ())[source]#

Bases: kor.nodes.AbstractSchemaNode

Built-in option input. Must be part of a selection input.

accept(visitor: kor.nodes.AbstractVisitor[kor.nodes.T], **kwargs: Any) kor.nodes.T[source]#

Accept a visitor.

examples: Sequence[str]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'examples': FieldInfo(annotation=Sequence[str], required=False, default=()), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class kor.nodes.Selection(*, id: str, description: str = '', many: bool = False, options: Sequence[kor.nodes.Option], examples: Sequence[Tuple[str, Union[Sequence[str], str]]] = (), null_examples: Sequence[str] = ())[source]#

Bases: kor.nodes.AbstractSchemaNode

Built-in selection node (aka Enum).

A selection input is composed of one or more options.

A selection node supports both examples and null_examples.

Null examples are segments of text for which nothing should be extracted.

Examples:

selection = Selection(
    id="species",
    description="What is your favorite animal species?",
    options=[
        Option(id="dog", description="Dog"),
        Option(id="cat", description="Cat"),
        Option(id="bird", description="Bird"),
    ],
    examples=[
        ("I like dogs", "dog"),
        ("I like cats", "cat"),
        ("I like birds", "bird"),
    ],
    null_examples=[
        "I like flowers",
    ],
    many=False
)
accept(visitor: kor.nodes.AbstractVisitor[kor.nodes.T], **kwargs: Any) kor.nodes.T[source]#

Accept a visitor.

examples: Sequence[Tuple[str, Union[str, Sequence[str]]]]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'examples': FieldInfo(annotation=Sequence[Tuple[str, Union[Sequence[str], str]]], required=False, default=()), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False), 'null_examples': FieldInfo(annotation=Sequence[str], required=False, default=()), 'options': FieldInfo(annotation=Sequence[Option], required=True)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

null_examples: Sequence[str]#
options: Sequence[Option]#
class kor.nodes.Text(*, id: str, description: str = '', many: bool = False, examples: Sequence[Tuple[str, Union[Sequence[str], str]]] = ())[source]#

Bases: kor.nodes.ExtractionSchemaNode

Built-in text input.

accept(visitor: kor.nodes.AbstractVisitor[kor.nodes.T], **kwargs: Any) kor.nodes.T[source]#

Accept a visitor.

examples: Sequence[Tuple[str, Union[Sequence[str], str]]]#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'examples': FieldInfo(annotation=Sequence[Tuple[str, Union[Sequence[str], str]]], required=False, default=()), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

kor.prompts module#

Code to dynamically generate appropriate LLM prompts.

class kor.prompts.ExtractionPromptTemplate(*, name: Optional[str] = None, input_variables: List[str], input_types: Dict[str, Any] = None, output_parser: Optional[langchain_core.output_parsers.base.BaseOutputParser] = None, partial_variables: Mapping[str, Union[str, Callable[[], str]]] = None, encoder: kor.encoders.typedefs.Encoder, node: kor.nodes.Object, type_descriptor: kor.type_descriptors.TypeDescriptor, input_formatter: Union[Literal['text_prefix'], Literal['triple_quotes'], None, Callable[[str], str]] = None, instruction_template: langchain_core.prompts.prompt.PromptTemplate)[source]#

Bases: langchain_core.prompts.base.BasePromptTemplate

Extraction prompt template.

class Config[source]#

Bases: object

Configuration for this pydantic object.

arbitrary_types_allowed = True#
extra = 'forbid'#
encoder: Encoder#
format(**kwargs: Any) str[source]#

Implementation of deprecated format method.

format_instruction_segment(node: kor.nodes.Object) str[source]#

Generate the instruction segment of the extraction.

format_prompt(text: str) langchain_core.prompt_values.PromptValue[source]#

Format the prompt.

generate_encoded_examples(node: kor.nodes.Object) List[Tuple[str, str]][source]#

Generate encoded examples.

input_formatter: InputFormatter#
instruction_template: PromptTemplate#
node: Object#
to_messages(text: str) List[langchain_core.messages.base.BaseMessage][source]#

Format the template to chat messages.

to_string(text: str) str[source]#

Format the template to a string.

type_descriptor: TypeDescriptor#
class kor.prompts.ExtractionPromptValue(*, string: str, messages: List[langchain_core.messages.base.BaseMessage])[source]#

Bases: langchain_core.prompt_values.PromptValue

Integration with langchain prompt format.

class Config[source]#

Bases: object

Configuration for this pydantic object.

arbitrary_types_allowed = True#
extra = 'forbid'#
messages: List[BaseMessage]#
string: str#
to_messages() List[langchain_core.messages.base.BaseMessage][source]#

Get materialized messages.

to_string() str[source]#

Format the prompt to a string.

kor.prompts.create_langchain_prompt(schema: kor.nodes.Object, encoder: kor.encoders.typedefs.Encoder, type_descriptor: kor.type_descriptors.TypeDescriptor, *, validator: Optional[kor.validators.Validator] = None, input_formatter: Union[Literal['text_prefix'], Literal['triple_quotes'], None, Callable[[str], str]] = None, instruction_template: Optional[langchain_core.prompts.prompt.PromptTemplate] = None) kor.prompts.ExtractionPromptTemplate[source]#

Create a langchain style prompt with specified encoder.
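A rough sketch of building a prompt directly (the schema below is a made-up example; the encoder and descriptor classes are documented elsewhere on this page):

from kor import JSONEncoder, Object, Text, TypeScriptDescriptor
from kor.prompts import create_langchain_prompt

schema = Object(
    id="person",
    description="Personal information",
    attributes=[Text(id="name", description="The person's name")],
)

prompt = create_langchain_prompt(
    schema,
    JSONEncoder(),
    TypeScriptDescriptor(),
)

# Render the full prompt for a piece of input text.
print(prompt.to_string("Alice and Bob are friends"))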

kor.type_descriptors module#

Code that takes an Object schema and outputs a string that describes its schema.

Without fine-tuning the LLM, the quality of the response may end up depending on details such as the schema description in the prompt.

Users can implement their own type descriptors or customize an existing one via inheritance and method overriding, and provide the type descriptors to the create_extraction_chain function.

class kor.type_descriptors.BulletPointDescriptor[source]#

Bases: kor.type_descriptors.TypeDescriptor[Iterable[str]]

Generate a bullet point style schema description.

describe(node: kor.nodes.Object) str[source]#

Describe the type of the given node.

visit_default(node: kor.nodes.AbstractSchemaNode, **kwargs: Any) List[str][source]#

Default action for a node.

visit_object(node: kor.nodes.Object, **kwargs: Any) List[str][source]#

Visit an object node.

class kor.type_descriptors.TypeDescriptor[source]#

Bases: kor.nodes.AbstractVisitor[kor.type_descriptors.T], abc.ABC

Abstract interface for a type-descriptor.

A type-descriptor is responsible for taking in a schema and outputting its type as a string. The description is used to help the LLM generate structured output.

A type-descriptor is a visitor that can be used to traverse the schema recursively.

abstract describe(node: kor.nodes.Object) str[source]#

Take in node and describe its type as a string.

class kor.type_descriptors.TypeScriptDescriptor[source]#

Bases: kor.type_descriptors.TypeDescriptor[Iterable[str]]

Generate a TypeScript-style schema description.

describe(node: kor.nodes.Object) str[source]#

Describe the node type in TypeScript notation.

visit_default(node: kor.nodes.AbstractSchemaNode, **kwargs: Any) List[str][source]#

Default action for a node.

visit_object(node: kor.nodes.Object, **kwargs: Any) List[str][source]#

Visit an object node.

kor.type_descriptors.initialize_type_descriptors(type_descriptor: Union[kor.type_descriptors.TypeDescriptor, str]) kor.type_descriptors.TypeDescriptor[source]#

Initialize the type descriptors.
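A hedged sketch of describing a schema (the cookie schema is borrowed from the Object example above):

from kor import Number, Object, Text, TypeScriptDescriptor
from kor.type_descriptors import initialize_type_descriptors

schema = Object(
    id="cookie",
    description="Information about a cookie including price and name.",
    attributes=[
        Text(id="name", description="The name of the cookie"),
        Number(id="price", description="The price of the cookie"),
    ],
)

descriptor = TypeScriptDescriptor()
print(descriptor.describe(schema))

# String names are also accepted and resolved to a descriptor instance.
descriptor = initialize_type_descriptors("typescript")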

kor.validators module#

Define the validator interface and provide built-in validators for common use cases.

class kor.validators.PydanticValidator(model_class: Type[pydantic.main.BaseModel], many: bool)[source]#

Bases: kor.validators.Validator

Use a pydantic model for validation.

clean_data(data: Any) Tuple[Union[pydantic.main.BaseModel, None, List[pydantic.main.BaseModel]], List[Exception]][source]#

Clean the data using the pydantic model.

Parameters

data – the parsed data

Returns

cleaned data instantiated as the corresponding pydantic model
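A minimal sketch, assuming a hypothetical Person model; the exact shape of the cleaned value follows the return annotation above:

from typing import Optional

from pydantic import BaseModel

from kor.validators import PydanticValidator

class Person(BaseModel):
    name: str
    age: Optional[int] = None

validator = PydanticValidator(Person, many=False)

# On success the first element is a Person instance; on failure it is None
# and the second element contains the validation exceptions.
cleaned, exceptions = validator.clean_data({"name": "Alice", "age": 30})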

class kor.validators.Validator[source]#

Bases: abc.ABC

abstract clean_data(data: Union[List[Mapping[str, Any]], Mapping[str, Any]]) Tuple[Any, List[Exception]][source]#

Validate the data and return a cleaned version of it.

Parameters

data – the parsed data

Returns

a cleaned version of the data, the type depends on the validator

kor.version module#

Get the version of the package.

Module contents#

class kor.Bool(*, id: str, description: str = '', many: bool = False, examples: Sequence[Tuple[str, Union[Sequence[bool], bool]]] = ())[source]#

Bases: kor.nodes.ExtractionSchemaNode

Built-in bool input.

accept(visitor: kor.nodes.AbstractVisitor[kor.nodes.T], **kwargs: Any) kor.nodes.T[source]#

Accept a visitor.

description: str#
examples: Sequence[Tuple[str, Union[Sequence[bool], bool]]]#
id: str#
many: bool#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'examples': FieldInfo(annotation=Sequence[Tuple[str, Union[Sequence[bool], bool]]], required=False, default=()), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class kor.BulletPointDescriptor[source]#

Bases: kor.type_descriptors.TypeDescriptor[Iterable[str]]

Generate a bullet point style schema description.

describe(node: kor.nodes.Object) str[source]#

Describe the type of the given node.

visit_default(node: kor.nodes.AbstractSchemaNode, **kwargs: Any) List[str][source]#

Default action for a node.

visit_object(node: kor.nodes.Object, **kwargs: Any) List[str][source]#

Visit an object node.

class kor.CSVEncoder(node: kor.nodes.AbstractSchemaNode, use_tags: bool = False)[source]#

Bases: kor.encoders.typedefs.SchemaBasedEncoder

CSV encoder.

decode(text: str) Dict[str, List[Dict[str, Any]]][source]#

Decode the text.

encode(data: Any) str[source]#

Encode the data.

get_instruction_segment() str[source]#

Format instructions.

class kor.DocumentExtraction[source]#

Bases: dict

Type-definition for a document extraction result.

The original extraction result together with unique identifiers for the result itself as well as for the source document.

Identifiers are included to make it easier to link the extraction result to the source content.

data: Dict[str, Any]#
errors: List[Exception]#
raw: str#
source_uid: str#

The source uid of the document from which data was extracted.

uid: str#

The uid of the extraction result.

validated_data: Dict[str, Any]#
class kor.Extraction[source]#

Bases: TypedDict

Type-definition for an extraction result.

data: Dict[str, Any]#

The decoding of the raw output from the LLM without any further processing.

errors: List[Exception]#

Any errors encountered during decoding or validation.

raw: str#

The raw output from the LLM.

validated_data: Dict[str, Any]#

The validated data if a validator was provided.
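For illustration, a hedged sketch of consuming an Extraction result (the literal values are made up):

from kor import Extraction

result: Extraction = {
    "raw": '<json>{"person": [{"name": "Alice"}]}</json>',
    "data": {"person": [{"name": "Alice"}]},
    "errors": [],
    "validated_data": {},
}

if not result["errors"]:
    people = result["data"].get("person", [])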

class kor.JSONEncoder(use_tags: bool = True, ensure_ascii: bool = False)[source]#

Bases: kor.encoders.typedefs.Encoder

JSON encoder and decoder.

By default, the encoder wraps the JSON output in additional <json> tags.

The tags are added to help identify the JSON content within the LLM response and extract it.

The usage of <json> tags is similar to the usage of ```json and ``` marks in Markdown.

Examples

from kor import JSONEncoder

json_encoder = JSONEncoder(use_tags=True)
data = {"name": "Café"}
json_encoder.encode(data)
# '<json>{"name": "Café"}</json>'

json_encoder = JSONEncoder(use_tags=True, ensure_ascii=True)
data = {"name": "Café"}
json_encoder.encode(data)
# '<json>{"name": "Caf\u00e9"}</json>'
decode(text: str) Any[source]#

Decode the text as JSON.

If the encoder is using tags, the <json> content is identified within the text and then is decoded.

Parameters

text – the text to be decoded

Returns

The decoded JSON data.
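For example, decoding tagged content embedded in a larger response (the surrounding text is illustrative):

from kor import JSONEncoder

json_encoder = JSONEncoder(use_tags=True)
json_encoder.decode('Here you go: <json>{"name": "Café"}</json>')
# {'name': 'Café'}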

encode(data: Any) str[source]#

Encode the data as JSON.

Parameters

data – JSON serializable data.

Returns

The JSON encoded data as a string optionally wrapped in <json> tags.

get_instruction_segment() str[source]#

Get the format instructions for the given decoder.

This is a specification to the LLM that tells it how to shape its response so that the response can be structured properly using the given decoder.

class kor.Number(*, id: str, description: str = '', many: bool = False, examples: Sequence[Tuple[str, Union[int, float, Sequence[Union[float, int]]]]] = ())[source]#

Bases: kor.nodes.ExtractionSchemaNode

Built-in number input.

accept(visitor: kor.nodes.AbstractVisitor[kor.nodes.T], **kwargs: Any) kor.nodes.T[source]#

Accept a visitor.

description: str#
examples: Sequence[Tuple[str, Union[int, float, Sequence[Union[float, int]]]]]#
id: str#
many: bool#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'examples': FieldInfo(annotation=Sequence[Tuple[str, Union[int, float, Sequence[Union[float, int]]]]], required=False, default=()), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class kor.Object(*, id: str, description: str = '', many: bool = False, attributes: Sequence[Union[kor.nodes.ExtractionSchemaNode, kor.nodes.Selection, kor.nodes.Object]], examples: Sequence[Tuple[str, Union[Sequence[Mapping[str, Any]], Mapping[str, Any]]]] = ())[source]#

Bases: kor.nodes.AbstractSchemaNode

Built-in representation for an object.

Use an object node to represent an entire object that should be extracted.

An extraction input can be associated with two different types of examples.

Example:

object = Object(
    id="cookie",
    description="Information about a cookie including price and name.",
    attributes=[
        Text(id="name", description="The name of the cookie"),
        Number(id="price", description="The price of the cookie"),
    ],
    examples=[
        ("I bought this Big Cookie for $10",
            {"name": "Big Cookie", "price": "$10"}),
        ("Eggs cost twelve dollars", {}), # Not a cookie
    ],
)
accept(visitor: kor.nodes.AbstractVisitor[kor.nodes.T], **kwargs: Any) kor.nodes.T[source]#

Accept a visitor.

attributes: Sequence[Union[ExtractionSchemaNode, Selection, Object]]#
description: str#
examples: Sequence[Tuple[str, Union[Sequence[Mapping[str, Any]], Mapping[str, Any]]]]#
id: str#
many: bool#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'attributes': FieldInfo(annotation=Sequence[Union[ExtractionSchemaNode, Selection, Object]], required=True), 'description': FieldInfo(annotation=str, required=False, default=''), 'examples': FieldInfo(annotation=Sequence[Tuple[str, Union[Sequence[Mapping[str, Any]], Mapping[str, Any]]]], required=False, default=()), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

classmethod parse_obj(*args: Any, **kwargs: Any) kor.nodes.Object[source]#

Parse an object.

classmethod parse_raw(*args: Any, **kwargs: Any) kor.nodes.Object[source]#

Parse raw data.

class kor.Option(*, id: str, description: str = '', many: bool = False, examples: Sequence[str] = ())[source]#

Bases: kor.nodes.AbstractSchemaNode

Built-in option input. Must be part of a selection input.

accept(visitor: kor.nodes.AbstractVisitor[kor.nodes.T], **kwargs: Any) kor.nodes.T[source]#

Accept a visitor.

description: str#
examples: Sequence[str]#
id: str#
many: bool#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'examples': FieldInfo(annotation=Sequence[str], required=False, default=()), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class kor.Selection(*, id: str, description: str = '', many: bool = False, options: Sequence[kor.nodes.Option], examples: Sequence[Tuple[str, Union[Sequence[str], str]]] = (), null_examples: Sequence[str] = ())[source]#

Bases: kor.nodes.AbstractSchemaNode

Built-in selection node (aka Enum).

A selection input is composed of one or more options.

A selection node supports both examples and null_examples.

Null examples are segments of text for which nothing should be extracted.

Examples:

selection = Selection(
    id="species",
    description="What is your favorite animal species?",
    options=[
        Option(id="dog", description="Dog"),
        Option(id="cat", description="Cat"),
        Option(id="bird", description="Bird"),
    ],
    examples=[
        ("I like dogs", "dog"),
        ("I like cats", "cat"),
        ("I like birds", "bird"),
    ],
    null_examples=[
        "I like flowers",
    ],
    many=False
)
accept(visitor: kor.nodes.AbstractVisitor[kor.nodes.T], **kwargs: Any) kor.nodes.T[source]#

Accept a visitor.

description: str#
examples: Sequence[Tuple[str, Union[str, Sequence[str]]]]#
id: str#
many: bool#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'examples': FieldInfo(annotation=Sequence[Tuple[str, Union[Sequence[str], str]]], required=False, default=()), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False), 'null_examples': FieldInfo(annotation=Sequence[str], required=False, default=()), 'options': FieldInfo(annotation=Sequence[Option], required=True)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

null_examples: Sequence[str]#
options: Sequence[Option]#
class kor.Text(*, id: str, description: str = '', many: bool = False, examples: Sequence[Tuple[str, Union[Sequence[str], str]]] = ())[source]#

Bases: kor.nodes.ExtractionSchemaNode

Built-in text input.

accept(visitor: kor.nodes.AbstractVisitor[kor.nodes.T], **kwargs: Any) kor.nodes.T[source]#

Accept a visitor.

description: str#
examples: Sequence[Tuple[str, Union[Sequence[str], str]]]#
id: str#
many: bool#
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'description': FieldInfo(annotation=str, required=False, default=''), 'examples': FieldInfo(annotation=Sequence[Tuple[str, Union[Sequence[str], str]]], required=False, default=()), 'id': FieldInfo(annotation=str, required=True), 'many': FieldInfo(annotation=bool, required=False, default=False)}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class kor.TypeDescriptor[source]#

Bases: kor.nodes.AbstractVisitor[kor.type_descriptors.T], abc.ABC

Abstract interface for a type-descriptor.

A type-descriptor is responsible for taking in a schema and outputting its type as a string. The description is used to help the LLM generate structured output.

A type-descriptor is a visitor that can be used to traverse the schema recursively.

abstract describe(node: kor.nodes.Object) str[source]#

Take in node and describe its type as a string.

class kor.TypeScriptDescriptor[source]#

Bases: kor.type_descriptors.TypeDescriptor[Iterable[str]]

Generate a TypeScript-style schema description.

describe(node: kor.nodes.Object) str[source]#

Describe the node type in TypeScript notation.

visit_default(node: kor.nodes.AbstractSchemaNode, **kwargs: Any) List[str][source]#

Default action for a node.

visit_object(node: kor.nodes.Object, **kwargs: Any) List[str][source]#

Visit an object node.

class kor.XMLEncoder[source]#

Bases: kor.encoders.typedefs.Encoder

Experimental XML encoder to encode and decode data.

Warning

This encoder is not recommended for use, at least not without further benchmarking for your use case.

The decoder re-interprets all data types as lists, which makes validating and using parser results more involved. It is unclear whether this encoder offers any advantages over other encoders (e.g., JSON or CSV).

The encoder would encode the following dictionary

{
    "color": ["red", "blue"],
    "height": ["6.1"],
    "width": ["3"],
}

As:

<color>red</color><height>6.1</height><width>3</width><color>blue</color>

A tag may be repeated multiple times to represent multiple list elements.

decode(text: str) Dict[str, List[str]][source]#

Decode the XML as an object.

encode(obj: Mapping[str, Any]) str[source]#

Encode the object as XML.

get_instruction_segment() str[source]#

Format the instructions segment.

kor.create_extraction_chain(llm: langchain_core.language_models.base.BaseLanguageModel, node: kor.nodes.Object, *, encoder_or_encoder_class: Union[Type[kor.encoders.typedefs.Encoder], kor.encoders.typedefs.Encoder, str] = 'csv', type_descriptor: Union[kor.type_descriptors.TypeDescriptor, str] = 'typescript', validator: Optional[kor.validators.Validator] = None, input_formatter: Union[Literal['text_prefix'], Literal['triple_quotes'], None, Callable[[str], str]] = None, instruction_template: Optional[langchain_core.prompts.prompt.PromptTemplate] = None, verbose: Optional[bool] = None, **encoder_kwargs: Any) langchain.chains.llm.LLMChain[source]#

Create an extraction chain.

Parameters
  • llm – the language model used for extraction

  • node – the schematic description of what to extract from text

  • encoder_or_encoder_class – Either an encoder instance, an encoder class or a string representing the encoder class

  • type_descriptor – either a TypeDescriptor or a string representing the type descriptor name

  • validator – optional validator to use for validation

  • input_formatter – the formatter to use for encoding the input. Used for both input examples and the text to be analyzed.
    - None: use for single sentences or a single paragraph; no formatting
    - triple_quotes: for long text, surround the input with """
    - text_prefix: for long text, use triple quotes with a TEXT: prefix
    - Callable: a user-provided function

  • instruction_template – optional prompt template used to override the prompt that generates the instruction section. It accepts 2 optional input variables:
    - "type_description": type description of the node (from the TypeDescriptor)
    - "format_instructions": information on how to format the output (from the Encoder)

  • verbose – if provided, sets the verbosity on the chain; otherwise the chain's default verbosity is used

  • encoder_kwargs – Keyword arguments to pass to the encoder class

Returns

A langchain chain

Examples:

# For CSV encoding
chain = create_extraction_chain(llm, node, encoder_or_encoder_class="csv")

# For JSON encoding
chain = create_extraction_chain(llm, node, encoder_or_encoder_class="JSON",
                                input_formatter="triple_quotes")
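And a hedged sketch of overriding the instruction section with a custom template (the template wording is illustrative; the variable names follow the parameter description above):

from langchain_core.prompts import PromptTemplate

instruction_template = PromptTemplate(
    input_variables=["type_description", "format_instructions"],
    template=(
        "Extract data matching the following schema.\n\n"
        "{type_description}\n\n"
        "{format_instructions}\n"
    ),
)

chain = create_extraction_chain(
    llm, node, instruction_template=instruction_template
)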
async kor.extract_from_documents(chain: langchain.chains.llm.LLMChain, documents: Sequence[langchain_core.documents.base.Document], *, max_concurrency: int = 1, use_uid: bool = False, extraction_uid_function: Optional[Callable[[langchain_core.documents.base.Document], str]] = None, return_exceptions: bool = False) List[Union[kor.extraction.typedefs.DocumentExtraction, Exception]][source]#

Run extraction through all the given documents.

Attention: When using this function with a large number of documents, mind the bill, since this can use a lot of tokens!

Concurrency is currently limited using a semaphore. This is a temporary measure and can be changed to a queue implementation to support a non-materialized stream of documents.

Parameters
  • chain – the extraction chain to use for extraction

  • documents – the documents to run extraction on

  • max_concurrency – the maximum number of concurrent requests to make, uses a semaphore to limit concurrency

  • use_uid – If True, will use the uid attribute in the document metadata if it exists, and raise an error if the attribute does not exist. If False, will use the index of the document in the list as the uid.

  • extraction_uid_function – Optional function to use to generate the uid for a given DocumentExtraction. If not provided, will use the uid of the document.

  • return_exceptions – named argument passed to asyncio.gather

Returns

A list of extraction results; if return_exceptions=True, exceptions may be returned in the list as well.
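A rough usage sketch, reusing a chain built with create_extraction_chain above (the document contents and concurrency settings are illustrative):

import asyncio

from langchain_core.documents import Document

from kor import extract_from_documents

documents = [
    Document(page_content="Alice is 30 years old"),
    Document(page_content="Bob is a carpenter"),
]

results = asyncio.run(
    extract_from_documents(chain, documents, max_concurrency=5, use_uid=False)
)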

kor.from_pydantic(model_class: Type[pydantic.main.BaseModel], *, description: str = '', examples: Sequence[Tuple[str, Dict[str, Any]]] = (), many: bool = False) Tuple[kor.nodes.Object, kor.validators.Validator][source]#

Convert a pydantic model to Kor internal representation.

Parameters
  • model_class – The pydantic model class to convert.

  • description – The description of the model.

  • examples – A sequence of examples to be used for the model.

  • many – Whether to expect the model to be a list of models.

Returns

A tuple of the Kor internal representation of the model and a validator.