kor.documents package

Submodules

kor.documents.html module

Load and chunk HTML documents, with optional pre-processing to clean the HTML.

class kor.documents.html.MarkdownifyHTMLProcessor(tags_to_remove: Tuple[str, ...] = ('svg', 'img', 'script', 'style'))

Bases: kor.documents.typedefs.AbstractDocumentProcessor

A preprocessor to clean HTML and convert to markdown using markdownify.

process(document: langchain_core.documents.base.Document) → langchain_core.documents.base.Document

Clean up HTML and convert to markdown using markdownify.

Parameters

document – a document with HTML content

Returns

A document containing the cleaned HTML content, converted to markdown
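
To illustrate what removing tags such as the default tags_to_remove means in practice, here is a minimal sketch using only the standard library. It is not kor's implementation: TagStripper and strip_tags are hypothetical names introduced for this example, and the actual cleaning and markdownify conversion in the library may behave differently.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Tags without a closing tag; they never open a nested scope.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class TagStripper(HTMLParser):
    """Hypothetical helper: drop the given tags and everything inside them."""

    def __init__(self, tags_to_remove: Tuple[str, ...]) -> None:
        # Keep character references as-is so they can be re-emitted verbatim.
        super().__init__(convert_charrefs=False)
        self.tags_to_remove = set(tags_to_remove)
        self.skip_depth = 0  # > 0 while inside a removed element
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags_to_remove:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <path ... />: never changes the depth.
        if tag not in self.tags_to_remove and self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.tags_to_remove:
            if tag not in VOID_TAGS and self.skip_depth > 0:
                self.skip_depth -= 1
            return
        if self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")


def strip_tags(
    html: str,
    tags_to_remove: Tuple[str, ...] = ("svg", "img", "script", "style"),
) -> str:
    """Return html with the listed tags (and their contents) removed."""
    parser = TagStripper(tags_to_remove)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = (
    '<div><script>alert(1)</script><p>Hello <img src="a.png">world</p>'
    '<svg><path d="M0"/></svg></div>'
)
cleaned = strip_tags(sample)
```

In this sketch the script, img, and svg elements disappear while the surrounding div and p markup is kept, which is the kind of input a markdown converter could then handle without script or image noise.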

kor.documents.typedefs module

class kor.documents.typedefs.AbstractDocumentProcessor

Bases: abc.ABC

An interface for document transformers.

abstract process(document: langchain_core.documents.base.Document) → langchain_core.documents.base.Document

Process document.
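
A custom processor can be written by subclassing the interface and implementing process. The sketch below is illustrative only: so that it runs without kor or langchain installed, Document and AbstractDocumentProcessor are local stand-ins that mirror the signatures documented here (the real Document lives in langchain_core.documents.base), and WhitespaceNormalizer is a hypothetical processor, not part of kor.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Document:
    """Stand-in for langchain_core.documents.base.Document (assumption)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class AbstractDocumentProcessor(ABC):
    """Stand-in mirroring the interface documented above."""

    @abstractmethod
    def process(self, document: Document) -> Document:
        """Process document."""


class WhitespaceNormalizer(AbstractDocumentProcessor):
    """Hypothetical processor: collapse runs of whitespace to one space."""

    def process(self, document: Document) -> Document:
        cleaned = " ".join(document.page_content.split())
        # Return a new document and copy the metadata, so the input
        # document is left unmodified.
        return Document(page_content=cleaned, metadata=dict(document.metadata))


doc = Document(page_content="Hello   \n  world", metadata={"source": "example"})
result = WhitespaceNormalizer().process(doc)
```

With the real packages installed, the two stand-in classes would presumably be replaced by imports from langchain_core.documents and kor.documents.typedefs, leaving the subclass unchanged.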

Module contents