kor.encoders package#

Submodules#

kor.encoders.csv_data module#

Module that contains Kor flavored encoders/decoders for CSV data.

The code will need to eventually support handling some form of nested objects, via either JSON encoded column values or by breaking down nested attributes into additional columns (likely both methods).
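The two approaches mentioned above can be sketched with the standard library alone. This is an illustration of the idea, not Kor's implementation: the helper names `flatten` and `to_csv`, the dotted column names, and the comma delimiter are all assumptions made for the example.

```python
import csv
import io
import json

record = {"name": "Ada", "address": {"city": "London", "zip": "NW1"}}


def flatten(obj, prefix=""):
    """Break nested attributes into dotted column names (hypothetical helper)."""
    flat = {}
    for key, value in obj.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}."))
        else:
            flat[column] = value
    return flat


def to_csv(rows):
    """Write a list of flat dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Option 1: keep a single column and JSON-encode the nested value.
json_column = {
    key: json.dumps(value) if isinstance(value, dict) else value
    for key, value in record.items()
}
json_csv = to_csv([json_column])

# Option 2: emit one additional column per nested attribute.
flat_csv = to_csv([flatten(record)])
```

Option 1 keeps the column count stable but requires a second parsing step per cell; option 2 stays plain CSV but the header grows with the depth of the schema.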

class kor.encoders.csv_data.CSVEncoder(node: kor.nodes.AbstractSchemaNode, use_tags: bool = False)[source]#

Bases: kor.encoders.typedefs.SchemaBasedEncoder

CSV encoder.

decode(text: str) Dict[str, List[Dict[str, Any]]][source]#

Decode the text.

encode(data: Any) str[source]#

Encode the data.

get_instruction_segment() str[source]#

Format instructions.

kor.encoders.encode module#

kor.encoders.encode.encode_examples(examples: Sequence[Tuple[str, str]], encoder: kor.encoders.typedefs.Encoder, input_formatter: Union[Literal['text_prefix'], Literal['triple_quotes'], None, Callable[[str], str]] = None) List[Tuple[str, str]][source]#

Encode the output using the given encoder.

kor.encoders.encode.format_text(text: str, input_formatter: Union[Literal['text_prefix'], Literal['triple_quotes'], None, Callable[[str], str]] = None) str[source]#

An encoder for the input text.

Parameters
  • text – the text to encode

  • input_formatter – the formatter to use for the input:

      • None: no formatting; use for single sentences or single paragraphs
      • triple_quotes: surround the input with """; use for long text
      • text_prefix: same as triple_quotes but with a `TEXT: ` prefix
      • Callable: user-provided function

Returns

The formatted text, or the original text unchanged if no formatter was applied.
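The formatter options can be pictured with a small stand-in function. This is a sketch under assumptions, not the library's code: the name `format_text_sketch`, the exact placement of newlines around the quotes, and the error raised for an unknown option are all hypothetical.

```python
from typing import Callable, Union

InputFormatter = Union[str, Callable[[str], str], None]


def format_text_sketch(text: str, input_formatter: InputFormatter = None) -> str:
    """Apply one of the documented formatter options to the input text."""
    if input_formatter is None:
        # No formatting: suitable for a single sentence or paragraph.
        return text
    if input_formatter == "triple_quotes":
        # Assumed layout: the text sits on its own lines between the quotes.
        return f'"""\n{text}\n"""'
    if input_formatter == "text_prefix":
        # Same as triple_quotes, with a `TEXT: ` prefix in front.
        return f'TEXT: """\n{text}\n"""'
    if callable(input_formatter):
        # User-provided function.
        return input_formatter(text)
    raise ValueError(f"Unknown input formatter: {input_formatter!r}")
```

Passing a callable such as `str.strip` lets callers plug in their own delimiters without changing the encoder.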

kor.encoders.encode.initialize_encoder(encoder_or_encoder_class: Union[Type[kor.encoders.typedefs.Encoder], kor.encoders.typedefs.Encoder, str], schema: kor.nodes.AbstractSchemaNode, **kwargs: Any) kor.encoders.typedefs.Encoder[source]#

Flexible way to initialize an encoder, used only for top level API.

Parameters
  • encoder_or_encoder_class – Either an encoder instance, an encoder class or a string representing the encoder class.

  • schema – The schema to use for the encoder.

  • **kwargs – Keyword arguments to pass to the encoder class.

Returns

An encoder instance
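The three accepted argument forms (instance, class, or string) suggest a dispatch along the following lines. This is a minimal sketch, not the library's implementation: `UpperEncoder`, `REGISTRY`, and `initialize_sketch` are hypothetical names, and passing the schema to schema-based encoders is left out for brevity.

```python
from typing import Any, Dict, Type, Union


class UpperEncoder:
    """Hypothetical stand-in encoder used only for this illustration."""

    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix

    def encode(self, data: Any) -> str:
        return self.prefix + str(data).upper()


# Hypothetical mapping from string names to encoder classes.
REGISTRY: Dict[str, Type[UpperEncoder]] = {"upper": UpperEncoder}


def initialize_sketch(
    encoder_or_class: Union[str, Type[UpperEncoder], UpperEncoder],
    **kwargs: Any,
) -> UpperEncoder:
    """Return an encoder instance from a name, a class, or an instance."""
    if isinstance(encoder_or_class, str):
        try:
            encoder_class = REGISTRY[encoder_or_class.lower()]
        except KeyError:
            raise ValueError(f"Unknown encoder: {encoder_or_class!r}") from None
        return encoder_class(**kwargs)
    if isinstance(encoder_or_class, type):
        # A class: instantiate it with the supplied keyword arguments.
        return encoder_or_class(**kwargs)
    # Already an instance: return it unchanged.
    return encoder_or_class
```

Accepting all three forms keeps a top-level API convenient: quick experiments can pass a string, while advanced callers can pass a preconfigured instance.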

kor.encoders.json_data module#

JSON encoder and decoder.

class kor.encoders.json_data.JSONEncoder(use_tags: bool = True, ensure_ascii: bool = False)[source]#

Bases: kor.encoders.typedefs.Encoder

JSON encoder and decoder.

The encoder by default wraps the JSON output in <json> tags.

The tags help identify the JSON content within the LLM response and extract it.

The usage of <json> tags is similar to the usage of ```JSON and ``` marks.

Examples

from kor import JSONEncoder

json_encoder = JSONEncoder(use_tags=True)
data = {"name": "Café"}
json_encoder.encode(data)
# '<json>{"name": "Café"}</json>'

json_encoder = JSONEncoder(use_tags=True, ensure_ascii=True)
data = {"name": "Café"}
json_encoder.encode(data)
# '<json>{"name": "Caf\u00e9"}</json>'

decode(text: str) Any[source]#

Decode the text as JSON.

If the encoder is using tags, the <json> content is identified within the text and then is decoded.

Parameters

text – the text to be decoded

Returns

The decoded JSON data.

encode(data: Any) str[source]#

Encode the data as JSON.

Parameters

data – JSON serializable data.

Returns

The JSON encoded data as a string optionally wrapped in <json> tags.

get_instruction_segment() str[source]#

Get the format instructions for the given decoder.

This is a specification to the LLM that tells it how to shape its response so that the response can be structured properly using the given decoder.
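The tag-based round trip can be sketched with the standard library. This is an illustration under assumptions rather than the class's actual code: the function names are hypothetical, and raising `ValueError` when no tag is found is a choice made for the example (the library's behavior in that case is not stated here).

```python
import json
import re
from typing import Any


def encode_json_sketch(data: Any, use_tags: bool = True, ensure_ascii: bool = False) -> str:
    """Serialize data as JSON, optionally wrapped in <json> tags."""
    text = json.dumps(data, ensure_ascii=ensure_ascii)
    return f"<json>{text}</json>" if use_tags else text


def decode_json_sketch(text: str, use_tags: bool = True) -> Any:
    """Find the <json> content inside a longer response and parse it."""
    if use_tags:
        match = re.search(r"<json>(.*?)</json>", text, flags=re.DOTALL)
        if match is None:
            # Assumption for this sketch: a missing tag is treated as an error.
            raise ValueError("No <json> tag found in text")
        text = match.group(1)
    return json.loads(text)


# An LLM response usually contains extra prose around the payload.
response = 'Here is the result: <json>{"name": "Café"}</json> Let me know!'
decoded = decode_json_sketch(response)
```

The tags make extraction robust to surrounding chatter, which plain `json.loads` on the whole response would reject.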

kor.encoders.typedefs module#

Type-definitions for encoders.

This file only contains the interface for encoders.

  • Added a pre-built format instruction segment.

  • May remove it at some point later or modify it if we discover that there are many ways of phrasing the format instructions.

class kor.encoders.typedefs.Encoder[source]#

Bases: abc.ABC

Abstract interface for an encoder.

The encoder is responsible for encoding and decoding the Output portion of examples provided to the LLM.

It must implement a method called get_instruction_segment that contains instructions for the LLM on how to format its output.

abstract decode(text: str) Any[source]#

Decode the text.

abstract encode(data: Any) str[source]#

Encode the data.

abstract get_instruction_segment() str[source]#

Get the format instructions for the given decoder.

Used to guide the LLM on how to format its output.
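A custom encoder only needs the three abstract methods. The sketch below uses a local stand-in base class, `EncoderLike`, that mirrors the documented interface so the example runs without the library installed; `KeyValueEncoder` and its line format are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class EncoderLike(ABC):
    """Stand-in mirroring the three abstract methods documented above."""

    @abstractmethod
    def encode(self, data: Any) -> str:
        """Encode the data."""

    @abstractmethod
    def decode(self, text: str) -> Any:
        """Decode the text."""

    @abstractmethod
    def get_instruction_segment(self) -> str:
        """Return format instructions for the LLM."""


class KeyValueEncoder(EncoderLike):
    """Hypothetical encoder that writes one `key: value` pair per line."""

    def encode(self, data: Dict[str, str]) -> str:
        return "\n".join(f"{key}: {value}" for key, value in data.items())

    def decode(self, text: str) -> Dict[str, str]:
        result: Dict[str, str] = {}
        for line in text.splitlines():
            key, separator, value = line.partition(":")
            if separator:  # skip lines that carry no key/value pair
                result[key.strip()] = value.strip()
        return result

    def get_instruction_segment(self) -> str:
        return "Respond with one `key: value` pair per line and nothing else."
```

The instruction segment and the decoder should describe the same format; otherwise the LLM is asked for output that the decoder cannot parse.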

class kor.encoders.typedefs.SchemaBasedEncoder(node: kor.nodes.AbstractSchemaNode, **kwargs: Any)[source]#

Bases: kor.encoders.typedefs.Encoder, abc.ABC

Abstract interface for an encoder that has the data schema.

Inherit from this encoder if the encoder needs to know the schema of the data that’s being encoded.

kor.encoders.utils module#

kor.encoders.utils.unwrap_tag(tag_name: str, text: str) Optional[str][source]#

Extract content located inside a tag.

kor.encoders.utils.wrap_in_tag(tag_name: str, content: str) str[source]#

Wrap the content in an HTML style tag.

kor.encoders.xml module#

class kor.encoders.xml.TagParser[source]#

Bases: html.parser.HTMLParser

handle_data(data: str) None[source]#

Hook when handling data.

handle_endtag(tag: str) None[source]#

Hook when a tag is closed.

handle_starttag(tag: str, attrs: Any) None[source]#

Hook when a new tag is encountered.

class kor.encoders.xml.XMLEncoder[source]#

Bases: kor.encoders.typedefs.Encoder

Experimental XML encoder to encode and decode data.

Warning

This encoder is not recommended for usage, at least not without further benchmarking for your use-case.

The decoder re-interprets all data types as lists, which makes validating and using parser results more involved. It’s unclear whether the encoder offers any advantages over other encoders (e.g., JSON or CSV).

The encoder would encode the following dictionary:

{
    "color": ["red", "blue"],
    "height": ["6.1"],
    "width": ["3"],
}

As:

<color>red</color><height>6.1</height><width>3</width><color>blue</color>

A tag may be repeated multiple times to represent multiple list elements.
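Decoding such flat, repeated tags into a dictionary of lists can be sketched with `html.parser.HTMLParser`, the same base class that `TagParser` builds on. The class and function names below are hypothetical, and the sketch handles only flat (non-nested) tags.

```python
from collections import defaultdict
from html.parser import HTMLParser
from typing import Any, Dict, List, Optional


class FlatTagCollector(HTMLParser):
    """Collect the text of each flat tag into a list keyed by tag name."""

    def __init__(self) -> None:
        super().__init__()
        self.values: Dict[str, List[str]] = defaultdict(list)
        self._open_tag: Optional[str] = None

    def handle_starttag(self, tag: str, attrs: Any) -> None:
        # Remember which tag the upcoming data belongs to.
        self._open_tag = tag

    def handle_endtag(self, tag: str) -> None:
        self._open_tag = None

    def handle_data(self, data: str) -> None:
        # Text outside of any tag is ignored.
        if self._open_tag is not None:
            self.values[self._open_tag].append(data)


def decode_flat_tags(text: str) -> Dict[str, List[str]]:
    """Parse flat tagged text into a dictionary of string lists."""
    parser = FlatTagCollector()
    parser.feed(text)
    parser.close()
    return dict(parser.values)


decoded = decode_flat_tags(
    "<color>red</color><height>6.1</height><width>3</width><color>blue</color>"
)
```

Every value comes back as a list of strings, even for tags that appear once, which is why results from this style of decoder need extra validation and type conversion.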

decode(text: str) Dict[str, List[str]][source]#

Decode the XML as an object.

encode(obj: Mapping[str, Any]) str[source]#

Encode the object as XML.

get_instruction_segment() str[source]#

Format the instructions segment.

Module contents#

Declare public interface for encoders.

An encoder follows the Encoder interface.

It can encode, decode and contains instructions about the encoding format for an LLM.

class kor.encoders.CSVEncoder(node: kor.nodes.AbstractSchemaNode, use_tags: bool = False)[source]#

Bases: kor.encoders.typedefs.SchemaBasedEncoder

CSV encoder.

decode(text: str) Dict[str, List[Dict[str, Any]]][source]#

Decode the text.

encode(data: Any) str[source]#

Encode the data.

get_instruction_segment() str[source]#

Format instructions.

class kor.encoders.Encoder[source]#

Bases: abc.ABC

Abstract interface for an encoder.

The encoder is responsible for encoding and decoding the Output portion of examples provided to the LLM.

It must implement a method called get_instruction_segment that contains instructions for the LLM on how to format its output.

abstract decode(text: str) Any[source]#

Decode the text.

abstract encode(data: Any) str[source]#

Encode the data.

abstract get_instruction_segment() str[source]#

Get the format instructions for the given decoder.

Used to guide the LLM on how to format its output.

class kor.encoders.JSONEncoder(use_tags: bool = True, ensure_ascii: bool = False)[source]#

Bases: kor.encoders.typedefs.Encoder

JSON encoder and decoder.

The encoder by default wraps the JSON output in <json> tags.

The tags help identify the JSON content within the LLM response and extract it.

The usage of <json> tags is similar to the usage of ```JSON and ``` marks.

Examples

from kor import JSONEncoder

json_encoder = JSONEncoder(use_tags=True)
data = {"name": "Café"}
json_encoder.encode(data)
# '<json>{"name": "Café"}</json>'

json_encoder = JSONEncoder(use_tags=True, ensure_ascii=True)
data = {"name": "Café"}
json_encoder.encode(data)
# '<json>{"name": "Caf\u00e9"}</json>'

decode(text: str) Any[source]#

Decode the text as JSON.

If the encoder is using tags, the <json> content is identified within the text and then is decoded.

Parameters

text – the text to be decoded

Returns

The decoded JSON data.

encode(data: Any) str[source]#

Encode the data as JSON.

Parameters

data – JSON serializable data.

Returns

The JSON encoded data as a string optionally wrapped in <json> tags.

get_instruction_segment() str[source]#

Get the format instructions for the given decoder.

This is a specification to the LLM that tells it how to shape its response so that the response can be structured properly using the given decoder.

class kor.encoders.SchemaBasedEncoder(node: kor.nodes.AbstractSchemaNode, **kwargs: Any)[source]#

Bases: kor.encoders.typedefs.Encoder, abc.ABC

Abstract interface for an encoder that has the data schema.

Inherit from this encoder if the encoder needs to know the schema of the data that’s being encoded.

class kor.encoders.XMLEncoder[source]#

Bases: kor.encoders.typedefs.Encoder

Experimental XML encoder to encode and decode data.

Warning

This encoder is not recommended for usage, at least not without further benchmarking for your use-case.

The decoder re-interprets all data types as lists, which makes validating and using parser results more involved. It’s unclear whether the encoder offers any advantages over other encoders (e.g., JSON or CSV).

The encoder would encode the following dictionary:

{
    "color": ["red", "blue"],
    "height": ["6.1"],
    "width": ["3"],
}

As:

<color>red</color><height>6.1</height><width>3</width><color>blue</color>

A tag may be repeated multiple times to represent multiple list elements.

decode(text: str) Dict[str, List[str]][source]#

Decode the XML as an object.

encode(obj: Mapping[str, Any]) str[source]#

Encode the object as XML.

get_instruction_segment() str[source]#

Format the instructions segment.

kor.encoders.encode_examples(examples: Sequence[Tuple[str, str]], encoder: kor.encoders.typedefs.Encoder, input_formatter: Union[Literal['text_prefix'], Literal['triple_quotes'], None, Callable[[str], str]] = None) List[Tuple[str, str]][source]#

Encode the output using the given encoder.

kor.encoders.initialize_encoder(encoder_or_encoder_class: Union[Type[kor.encoders.typedefs.Encoder], kor.encoders.typedefs.Encoder, str], schema: kor.nodes.AbstractSchemaNode, **kwargs: Any) kor.encoders.typedefs.Encoder[source]#

Flexible way to initialize an encoder, used only for top level API.

Parameters
  • encoder_or_encoder_class – Either an encoder instance, an encoder class or a string representing the encoder class.

  • schema – The schema to use for the encoder.

  • **kwargs – Keyword arguments to pass to the encoder class.

Returns

An encoder instance