kor.extraction package
Submodules#
kor.extraction.api module#
Kor API for extraction related functionality.
- kor.extraction.api.create_extraction_chain(llm: langchain_core.language_models.base.BaseLanguageModel, node: kor.nodes.Object, *, encoder_or_encoder_class: Union[Type[kor.encoders.typedefs.Encoder], kor.encoders.typedefs.Encoder, str] = 'csv', type_descriptor: Union[kor.type_descriptors.TypeDescriptor, str] = 'typescript', validator: Optional[kor.validators.Validator] = None, input_formatter: Union[Literal['text_prefix'], Literal['triple_quotes'], None, Callable[[str], str]] = None, instruction_template: Optional[langchain_core.prompts.prompt.PromptTemplate] = None, verbose: Optional[bool] = None, **encoder_kwargs: Any) langchain_core.runnables.base.Runnable [source]#
Create an extraction chain.
- Parameters
llm – the language model used for extraction
node – the schematic description of what to extract from text
encoder_or_encoder_class – Either an encoder instance, an encoder class or a string representing the encoder class
type_descriptor – either a TypeDescriptor or a string representing the type descriptor name
validator – optional validator to use for validation
input_formatter – the formatter to use for encoding the input. Used for both input examples and the text to be analyzed.
* None: use for single sentences or a single paragraph, no formatting
* triple_quotes: for long text, surround input with """
* text_prefix: for long text, triple-quoted with a TEXT: prefix
* Callable: user-provided function
instruction_template – optional prompt template that overrides the prompt used for generating the instruction section of the prompt. It accepts 2 optional input variables:
* "type_description": type description of the node (from TypeDescriptor)
* "format_instructions": information on how to format the output (from Encoder)
verbose – Deprecated, use langchain_core.globals.set_verbose and langchain_core.globals.set_debug instead. Please reference this guide for more information: https://python.langchain.com/v0.2/docs/how_to/debugging
encoder_kwargs – Keyword arguments to pass to the encoder class
- Returns
A langchain chain
Examples:
# For CSV encoding
chain = create_extraction_chain(llm, node, encoder_or_encoder_class="csv")
# For JSON encoding
chain = create_extraction_chain(llm, node, encoder_or_encoder_class="JSON", input_formatter="triple_quotes")
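The input_formatter options described above can be illustrated with a small standalone sketch. The helper names format_input and xml_tag_formatter are hypothetical and not part of Kor, and the exact whitespace and prefix text are assumptions based on the parameter description rather than on Kor's implementation.

```python
from typing import Callable, Union

# Hypothetical type alias mirroring the documented input_formatter values.
InputFormatter = Union[str, Callable[[str], str], None]


def format_input(text: str, input_formatter: InputFormatter = None) -> str:
    """Approximate the documented formatting options (not Kor's own code)."""
    if input_formatter is None:
        # Short inputs: pass the text through unchanged.
        return text
    if input_formatter == "triple_quotes":
        # Long inputs: surround the text with triple quotes.
        return f'"""\n{text}\n"""'
    if input_formatter == "text_prefix":
        # Long inputs: triple quotes plus a TEXT: prefix (assumed layout).
        return f'TEXT: """\n{text}\n"""'
    if callable(input_formatter):
        # User-provided function: any str -> str callable.
        return input_formatter(text)
    raise ValueError(f"Unknown input formatter: {input_formatter!r}")


def xml_tag_formatter(text: str) -> str:
    """Example of a user-provided formatter that wraps the text in a tag."""
    return f"<input>\n{text}\n</input>"
```

A callable such as xml_tag_formatter would be passed as input_formatter=xml_tag_formatter. Because the formatter is applied to both the examples and the text to be analyzed, the prompt stays consistent.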
- async kor.extraction.api.extract_from_documents(chain: langchain_core.runnables.base.Runnable, documents: Sequence[langchain_core.documents.base.Document], *, max_concurrency: int = 1, use_uid: bool = False, extraction_uid_function: Optional[Callable[[langchain_core.documents.base.Document], str]] = None, return_exceptions: bool = False) List[Union[kor.extraction.typedefs.DocumentExtraction, Exception]] [source]#
Run extraction through all the given documents.
- Attention: When using this function with a large number of documents, mind the bill, since this can use a lot of tokens!
Concurrency is currently limited using a semaphore. This is temporary and may be replaced by a queue implementation to support a non-materialized stream of documents.
- Parameters
chain – the extraction chain to use for extraction
documents – the documents to run extraction on
max_concurrency – the maximum number of concurrent requests to make, uses a semaphore to limit concurrency
use_uid – If True, will use the uid attribute in the document metadata and will raise an error if the attribute does not exist. If False, will use the index of the document in the list as the uid.
extraction_uid_function – Optional function to use to generate the uid for a given DocumentExtraction. If not provided, will use the uid of the document.
return_exceptions – named argument passed to asyncio.gather
- Returns
A list of extraction results. If return_exceptions = True, exceptions may be returned in the list as well.
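The semaphore-based concurrency limit and the use_uid rule described above can be sketched with the standard library alone. This is an illustration of the pattern, not Kor's implementation: run_limited and resolve_source_uid are hypothetical names, and the ValueError raised for a missing uid is an assumption (the documentation only says an error is raised).

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Sequence, Union


def resolve_source_uid(metadata: Dict[str, Any], index: int, use_uid: bool) -> str:
    """Pick a source uid following the documented use_uid rule."""
    if use_uid:
        if "uid" not in metadata:
            # Assumed exception type; the docs only state that an error is raised.
            raise ValueError("use_uid=True requires a 'uid' entry in metadata")
        return str(metadata["uid"])
    # Otherwise fall back to the position of the document in the list.
    return str(index)


async def run_limited(
    worker: Callable[[Any], Awaitable[Any]],
    items: Sequence[Any],
    *,
    max_concurrency: int = 1,
    return_exceptions: bool = False,
) -> List[Union[Any, BaseException]]:
    """Run worker over items with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: Any) -> Any:
        # Each task waits here until a slot is free.
        async with semaphore:
            return await worker(item)

    # With return_exceptions=True, failures appear in the result list
    # in place of the corresponding result instead of being raised.
    return await asyncio.gather(
        *(guarded(item) for item in items),
        return_exceptions=return_exceptions,
    )
```

Results come back in input order, so a caller using return_exceptions=True should check each entry with isinstance(entry, Exception) before treating it as an extraction result.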
kor.extraction.parser module#
- class kor.extraction.parser.KorParser(*, name: Optional[str] = None, encoder: kor.encoders.typedefs.Encoder, schema_: kor.nodes.Object, validator: Optional[kor.validators.Validator] = None)[source]#
Bases: langchain_core.output_parsers.base.BaseOutputParser[kor.extraction.typedefs.Extraction]
A Kor langchain parser integration.
This parser can use any of Kor’s encoders to support encoding/decoding different data formats.
- class Config[source]#
Bases:
object
Configuration for this pydantic object.
- arbitrary_types_allowed = True#
- extra = 'forbid'#
- parse(text: str) kor.extraction.typedefs.Extraction [source]#
Parse the text.
kor.extraction.typedefs module#
Type definitions for the extraction package.
- class kor.extraction.typedefs.DocumentExtraction[source]#
Bases:
kor.extraction.typedefs.Extraction
Type-definition for a document extraction result.
The original extraction typedefs together with the unique identifiers for the result itself as well as the source document.
Identifiers are included to make it easier to link the extraction result to the source content.
- data: Dict[str, Any]#
- errors: List[Exception]#
- raw: str#
- source_uid: str#
The source uid of the document from which data was extracted.
- uid: str#
The uid of the extraction result.
- validated_data: Dict[str, Any]#
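Since DocumentExtraction is a TypedDict, results are plain dictionaries at runtime, and source_uid can be used to link them back to their documents. The sketch below groups results by source document; group_by_source is a hypothetical helper, and skipping exception entries assumes the list came from a call with return_exceptions=True.

```python
from collections import defaultdict
from typing import Any, Dict, List, Union


def group_by_source(
    results: List[Union[Dict[str, Any], Exception]],
) -> Dict[str, List[Dict[str, Any]]]:
    """Group extraction results by the uid of their source document."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for result in results:
        if isinstance(result, Exception):
            # Failed extractions carry no source_uid; skip them here.
            continue
        grouped[result["source_uid"]].append(result)
    return dict(grouped)
```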
- class kor.extraction.typedefs.Extraction[source]#
Bases:
typing_extensions.TypedDict
Type-definition for an extraction result.
- data: Dict[str, Any]#
The decoding of the raw output from the LLM without any further processing.
- errors: List[Exception]#
Any errors encountered during decoding or validation.
- raw: str#
The raw output from the LLM.
- validated_data: Dict[str, Any]#
The validated data if a validator was provided.
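A caller typically has to decide which of these fields to trust. The sketch below defines a local mirror of the documented fields (ExtractionLike, a hypothetical stand-in so the example runs without Kor installed) and a hypothetical select_data helper. Preferring validated_data only when a validator was configured is an assumption about sensible usage, not documented Kor behavior.

```python
from typing import Any, Dict, List, TypedDict


class ExtractionLike(TypedDict):
    """Local mirror of the documented Extraction fields (illustration only)."""

    raw: str
    data: Dict[str, Any]
    validated_data: Dict[str, Any]
    errors: List[Exception]


def select_data(result: ExtractionLike, *, validator_used: bool) -> Dict[str, Any]:
    """Return the most trustworthy payload from an extraction result."""
    if result["errors"]:
        # Surface decoding or validation problems instead of hiding them.
        messages = "; ".join(str(error) for error in result["errors"])
        raise RuntimeError(f"Extraction reported errors: {messages}")
    # validated_data is only meaningful when a validator was provided.
    return result["validated_data"] if validator_used else result["data"]
```

Raising on any error is the strictest policy. A more lenient caller could log the errors and still use data.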
Module contents#
- class kor.extraction.DocumentExtraction[source]#
Bases:
kor.extraction.typedefs.Extraction
Type-definition for a document extraction result.
The original extraction typedefs together with the unique identifiers for the result itself as well as the source document.
Identifiers are included to make it easier to link the extraction result to the source content.
- data: Dict[str, Any]#
- errors: List[Exception]#
- raw: str#
- source_uid: str#
The source uid of the document from which data was extracted.
- uid: str#
The uid of the extraction result.
- validated_data: Dict[str, Any]#
- class kor.extraction.Extraction[source]#
Bases:
typing_extensions.TypedDict
Type-definition for an extraction result.
- data: Dict[str, Any]#
The decoding of the raw output from the LLM without any further processing.
- errors: List[Exception]#
Any errors encountered during decoding or validation.
- raw: str#
The raw output from the LLM.
- validated_data: Dict[str, Any]#
The validated data if a validator was provided.
- class kor.extraction.KorParser(*, name: Optional[str] = None, encoder: kor.encoders.typedefs.Encoder, schema_: kor.nodes.Object, validator: Optional[kor.validators.Validator] = None)[source]#
Bases: langchain_core.output_parsers.base.BaseOutputParser[kor.extraction.typedefs.Extraction]
A Kor langchain parser integration.
This parser can use any of Kor’s encoders to support encoding/decoding different data formats.
- class Config[source]#
Bases:
object
Configuration for this pydantic object.
- arbitrary_types_allowed = True#
- extra = 'forbid'#
- parse(text: str) kor.extraction.typedefs.Extraction [source]#
Parse the text.
- kor.extraction.create_extraction_chain(llm: langchain_core.language_models.base.BaseLanguageModel, node: kor.nodes.Object, *, encoder_or_encoder_class: Union[Type[kor.encoders.typedefs.Encoder], kor.encoders.typedefs.Encoder, str] = 'csv', type_descriptor: Union[kor.type_descriptors.TypeDescriptor, str] = 'typescript', validator: Optional[kor.validators.Validator] = None, input_formatter: Union[Literal['text_prefix'], Literal['triple_quotes'], None, Callable[[str], str]] = None, instruction_template: Optional[langchain_core.prompts.prompt.PromptTemplate] = None, verbose: Optional[bool] = None, **encoder_kwargs: Any) langchain_core.runnables.base.Runnable [source]#
Create an extraction chain.
- Parameters
llm – the language model used for extraction
node – the schematic description of what to extract from text
encoder_or_encoder_class – Either an encoder instance, an encoder class or a string representing the encoder class
type_descriptor – either a TypeDescriptor or a string representing the type descriptor name
validator – optional validator to use for validation
input_formatter – the formatter to use for encoding the input. Used for both input examples and the text to be analyzed.
* None: use for single sentences or a single paragraph, no formatting
* triple_quotes: for long text, surround input with """
* text_prefix: for long text, triple-quoted with a TEXT: prefix
* Callable: user-provided function
instruction_template – optional prompt template that overrides the prompt used for generating the instruction section of the prompt. It accepts 2 optional input variables:
* "type_description": type description of the node (from TypeDescriptor)
* "format_instructions": information on how to format the output (from Encoder)
verbose – Deprecated, use langchain_core.globals.set_verbose and langchain_core.globals.set_debug instead. Please reference this guide for more information: https://python.langchain.com/v0.2/docs/how_to/debugging
encoder_kwargs – Keyword arguments to pass to the encoder class
- Returns
A langchain chain
Examples:
# For CSV encoding
chain = create_extraction_chain(llm, node, encoder_or_encoder_class="csv")
# For JSON encoding
chain = create_extraction_chain(llm, node, encoder_or_encoder_class="JSON", input_formatter="triple_quotes")
- async kor.extraction.extract_from_documents(chain: langchain_core.runnables.base.Runnable, documents: Sequence[langchain_core.documents.base.Document], *, max_concurrency: int = 1, use_uid: bool = False, extraction_uid_function: Optional[Callable[[langchain_core.documents.base.Document], str]] = None, return_exceptions: bool = False) List[Union[kor.extraction.typedefs.DocumentExtraction, Exception]] [source]#
Run extraction through all the given documents.
- Attention: When using this function with a large number of documents, mind the bill, since this can use a lot of tokens!
Concurrency is currently limited using a semaphore. This is temporary and may be replaced by a queue implementation to support a non-materialized stream of documents.
- Parameters
chain – the extraction chain to use for extraction
documents – the documents to run extraction on
max_concurrency – the maximum number of concurrent requests to make, uses a semaphore to limit concurrency
use_uid – If True, will use the uid attribute in the document metadata and will raise an error if the attribute does not exist. If False, will use the index of the document in the list as the uid.
extraction_uid_function – Optional function to use to generate the uid for a given DocumentExtraction. If not provided, will use the uid of the document.
return_exceptions – named argument passed to asyncio.gather
- Returns
A list of extraction results. If return_exceptions = True, exceptions may be returned in the list as well.