{ "cells": [ { "cell_type": "markdown", "id": "c8f6fd5d-980b-4a1f-97cf-e5eff784f8f2", "metadata": {}, "source": [ "# Guidelines\n", "\n", "`Kor` is a wrapper around LLMs to help with information extraction.\n", "\n", "The quality of the results depends on many factors.\n", "\n", "Here are a few things to experiment with to improve quality:\n", "\n", "* Add more examples. Diverse examples can help, including examples where nothing should be extracted.\n", "* Improve the descriptions of the attributes.\n", "* If working with multi-paragraph text, specify an `input_formatter` of `\"triple_quotes\"` when creating the chain.\n", "* Try a better model (e.g., text-davinci-003, gpt-4).\n", "* Break the schema into a few smaller schemas, run separate extractions, and merge the results.\n", "* If possible, flatten the schema and use a CSV encoding instead of a JSON encoding.\n", "* Add verification/correction steps (ask an LLM to correct or verify the results of the extraction).\n", "\n", "## Keep in mind! 😶‍🌫️\n", "\n", "* If you're extracting information from a **single** **structured** source (e.g., LinkedIn), using an LLM is not a good idea -- traditional web scraping will be much cheaper and more reliable.\n", "* If perfect quality is needed, then even with all the hacks above, you'll need to plan on having a human in the loop, as even the best LLMs will make mistakes with complex extraction tasks." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.1" } }, "nbformat": 4, "nbformat_minor": 5 }