{ "cells": [ { "cell_type": "markdown", "id": "4b3a0584-b52c-4873-abb8-8382e13ff5c0", "metadata": {}, "source": [ "# Document Extraction\n", "\n", "Here, we'll be extracting content from a longer document.\n", "\n", "\n", "The basic workflow is the following:\n", "\n", "1. Load the document\n", "2. Clean up the document (optional)\n", "3. Split the document into chunks\n", "4. Extract from *every* chunk of text\n", "\n", "-------------\n", "\n", "**ATTENTION** This is a *brute force* workflow -- there will be an LLM call for every piece of text that is being analyzed. \n", "This can be **expensive** 💰💰💰, so use at your own risk and monitor your costs!\n", "\n", "---------------\n", "\n", "Let's apply this workflow to an HTML file.\n", "\n", "We'll reduce HTML to markdown. This is a lossy step, which can sometimes improve extraction results, and sometimes make extraction worse.\n", "\n", "When scraping HTML, executing JavaScript may be necessary to get the page fully rendered. \n", "\n", "Here's a piece of code that renders a page, including its JavaScript, using Playwright: \n", "\n", "\n", "```python\n", "import asyncio\n", "\n", "from playwright.async_api import async_playwright\n", "\n", "\n", "async def a_download_html(url: str, extra_sleep: int) -> str:\n", "    \"\"\"Download the HTML of a page from a URL.\n", "\n", "    In some pathological cases, an extra sleep period may be needed.\n", "    \"\"\"\n", "    async with async_playwright() as p:\n", "        browser = await p.chromium.launch()\n", "        page = await browser.new_page()\n", "        await page.goto(url, wait_until=\"load\")\n", "        if extra_sleep:\n", "            # Give late-running scripts extra time to finish rendering\n", "            await asyncio.sleep(extra_sleep)\n", "        html_content = await page.content()\n", "        await browser.close()\n", "        return html_content\n", "```\n", "\n", "Another possibility is to use the Selenium URL loader: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/url.html#selenium-url-loader\n", "\n", "---------\n", " \n", "Again, this can be **expensive** 💰💰💰, so use at your own risk and monitor your costs!" 
] }, { "cell_type": "code", "execution_count": 1, "id": "f8536314-f0f3-4bb9-acd6-f2cec4046380", "metadata": { "nbsphinx": "hidden", "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "\n", "import sys\n", "\n", "sys.path.insert(0, \"../../\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "d93b3de7-9b81-456e-acff-4b0df9755c14", "metadata": { "nbsphinx": "hidden", "tags": [] }, "outputs": [], "source": [ "from typing import List, Optional\n", "import itertools\n", "import requests\n", "\n", "import pandas as pd\n", "from pydantic import BaseModel, Field, validator\n", "from kor import extract_from_documents, from_pydantic, create_extraction_chain\n", "from kor.documents.html import MarkdownifyHTMLProcessor\n", "from langchain.chat_models import ChatOpenAI\n", "from langchain.schema import Document\n", "from langchain.text_splitter import RecursiveCharacterTextSplitter" ] }, { "cell_type": "markdown", "id": "725a851e-ab91-4aa4-94b5-747bbf096460", "metadata": {}, "source": [ "## LLM\n", "\n", "Instantiate an LLM. \n", "\n", "Try experimenting with cheaper models, such as gpt-3.5-turbo, before moving to the more expensive text-davinci-003 or GPT-4.\n", "\n", "In some cases, providing a better prompt (with more examples) can help make up for using a smaller model." 
] }, { "cell_type": "markdown", "id": "b2aeb931-4725-4ab8-956e-18420d1c9a36", "metadata": { "tags": [] }, "source": [ "\n", "-------------------\n", "\n", "Quality can vary a **lot** depending on which LLM is used and how many examples are provided.\n", "\n", "-------------------" ] }, { "cell_type": "code", "execution_count": 3, "id": "fab20ada-6443-4799-b6e8-faf16a2fb585", "metadata": { "tags": [] }, "outputs": [], "source": [ "# ChatOpenAI defaults to gpt-3.5-turbo, which is cheap but gives lower-quality results than larger models\n", "llm = ChatOpenAI(temperature=0)" ] }, { "cell_type": "markdown", "id": "cd9eaffc-875d-4b2c-a189-6c7c181fe628", "metadata": {}, "source": [ "## Schema" ] }, { "cell_type": "code", "execution_count": 4, "id": "981938c5-f438-49a0-b511-329d31073a56", "metadata": { "tags": [] }, "outputs": [], "source": [ "class ShowOrMovie(BaseModel):\n", "    name: str = Field(\n", "        description=\"The name of the movie or tv show\",\n", "    )\n", "    season: Optional[str] = Field(\n", "        description=\"Season of TV show. Extract as a digit stripping Season prefix.\",\n", "    )\n", "    year: Optional[str] = Field(\n", "        description=\"Year when the movie / tv show was released\",\n", "    )\n", "    latest_episode: Optional[str] = Field(\n", "        description=\"Date when the latest episode was released\",\n", "    )\n", "    link: Optional[str] = Field(description=\"Link to the movie / tv show.\")\n", "\n", "    # rating -- not included because rating on rottentomatoes is in the html elements\n", "    # you could try extracting it by using the raw HTML (rather than markdown)\n", "    # or you could try doing something similar on imdb\n", "\n", "    @validator(\"name\")\n", "    def name_must_not_be_empty(cls, v):\n", "        if not v:\n", "            raise ValueError(\"Name must not be empty\")\n", "        return v\n", "\n", "\n", "schema, extraction_validator = from_pydantic(\n", "    ShowOrMovie,\n", "    description=\"Extract information about popular movies/tv shows including their name, season, year, latest episode and link.\",\n", "    examples=[\n", "        (\n", "            \"[Rain Dogs Latest Episode: Apr 03](/tv/rain_dogs)\",\n", "            {\"name\": \"Rain Dogs\", \"latest_episode\": \"Apr 03\", \"link\": \"/tv/rain_dogs\"},\n", "        )\n", "    ],\n", "    many=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 5, "id": "ff57507d-d789-4ae3-8763-0465e8b27686", "metadata": { "tags": [] }, "outputs": [], "source": [ "chain = create_extraction_chain(\n", "    llm,\n", "    schema,\n", "    encoder_or_encoder_class=\"csv\",\n", "    validator=extraction_validator,\n", "    input_formatter=\"triple_quotes\",\n", ")" ] }, { "cell_type": "markdown", "id": "65a2ff9e-951a-46f8-8cb5-d3eab0c28f59", "metadata": {}, "source": [ "## Download\n", "\n", "Let's download a page listing popular TV shows from my favorite movie review site." 
] }, { "cell_type": "code", "execution_count": 6, "id": "b5bd49b1-0b51-40ff-b34b-3ad7f90423b6", "metadata": { "tags": [] }, "outputs": [], "source": [ "url = \"https://www.rottentomatoes.com/browse/tv_series_browse/sort:popular\"\n", "# See the note at the top about using Selenium or Playwright when JavaScript must be executed\n", "response = requests.get(url)" ] }, { "cell_type": "markdown", "id": "1226d70c-68ad-49b5-8327-3132a556bbf3", "metadata": {}, "source": [ "Remember that in some cases you will need to execute JavaScript! Here's a snippet:\n", "\n", "```python\n", "from langchain.document_loaders import SeleniumURLLoader\n", "\n", "# SeleniumURLLoader expects a list of URLs and returns a list of documents\n", "documents = SeleniumURLLoader(urls=[url]).load()\n", "```" ] }, { "cell_type": "markdown", "id": "1963c139-169a-4d07-a2bd-cb2e7dfa171b", "metadata": {}, "source": [ "## Extract\n", "\n", "Use langchain building blocks to assemble whatever pipeline you need for your own purposes." ] }, { "cell_type": "markdown", "id": "8d4d40c9-6c9b-431d-9bd4-66a00f3de36f", "metadata": {}, "source": [ "Create a langchain document with the HTML content." ] }, { "cell_type": "code", "execution_count": 7, "id": "396a16b7-bc45-4035-9848-f995cddbba2f", "metadata": { "tags": [] }, "outputs": [], "source": [ "doc = Document(page_content=response.text)" ] }, { "cell_type": "markdown", "id": "524e1105-f053-4954-9e11-e693820b4823", "metadata": {}, "source": [ "Convert to markdown\n", "\n", "**ATTENTION** This step is lossy and may end up removing information that's relevant for extraction. You can always try pushing the raw HTML through if you're not worried about cost." 
] }, { "cell_type": "code", "execution_count": 8, "id": "22d3229a-aa6f-4586-a527-6f7835800740", "metadata": { "tags": [] }, "outputs": [], "source": [ "md = MarkdownifyHTMLProcessor().process(doc)" ] }, { "cell_type": "markdown", "id": "d98a7554-a3c7-42ea-99a4-d1e44aa9e22c", "metadata": {}, "source": [ "Split the document into chunks so that each chunk fits in the LLM's context window" ] }, { "cell_type": "code", "execution_count": 9, "id": "814e4151-f7d7-412d-beab-58ed190b7dd3", "metadata": { "tags": [] }, "outputs": [], "source": [ "split_docs = RecursiveCharacterTextSplitter().split_documents([md])" ] }, { "cell_type": "code", "execution_count": 10, "id": "e663f031-7a61-4a27-b925-1c9a6f365bcc", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Watch the trailer for You\n", "\n", "[You\n", "\n", " Latest Episode: Mar 09](/tv/you)\n", "\n", "Watch the trailer for She-Hulk: Attorney at Law\n", "\n", "[She-Hulk: Attorney at Law](/tv/she_hulk_attorney_at_law)\n", "\n", "[Breaking Bad](/tv/breaking_bad)\n", "\n", "Watch the trailer for The Lord of the Rings: The Rings of Power\n", "\n", "[The Lord of the Rings: The Rings of Power](/tv/the_lord_of_the_rings_the_rings_of_power)\n", "\n", "No results\n", "\n", " Reset Filters\n", "\n", " Load more\n", "\n", "Close video\n", "\n", "See Details\n", "\n", "See Details\n", "\n", "* [Help](/help_desk)\n", "* [About Rotten Tomatoes](/about)\n", "* [What's the Tomatometer®?](/about#whatisthetomatometer)\n", "* \n", "\n", "* [Critic Submission](/critics/criteria)\n", "* [Licensing](/help_desk/licensing)\n", "* [Advertise With Us](https://together.nbcuni.com/advertise/?utm_source=rotten_tomatoes&utm_medium=referral&utm_campaign=property_ad_pages&utm_content=footer)\n", "* [Careers](//www.fandango.com/careers)\n", "\n", "Join The Newsletter\n", "\n", "Get the freshest reviews, news, and more delivered right to your inbox!\n", "\n", "Join The Newsletter\n", "[Join The 
Newsletter](https://optout.services.fandango.com/rottentomatoes)\n", "\n", "Follow Us\n", "\n", "* \n", "* \n", "* \n", "* \n", "* \n", "\n", "Copyright © Fandango. All rights reserved.\n", "\n", "Join Newsletter\n", "[Join Newsletter](https://optout.services.fandango.com/rottentomatoes)\n", "* [Privacy Policy](//www.fandango.com/policies/privacy-policy)\n", "* [Terms and Policies](//www.fandango.com/policies/terms-and-policies)\n", "* [Cookie Settings](javascript:void(0))\n", "* [California Notice](//www.fandango.com/californianotice)\n", "* [Ad Choices](//www.fandango.com/policies/cookies-and-tracking#cookie_management)\n", "* \n", "* [Accessibility](/faq#accessibility)\n", "\n", "* V3.1\n", "* [Privacy Policy](//www.fandango.com/policies/privacy-policy)\n", "* [Terms and Policies](//www.fandango.com/policies/terms-and-policies)\n", "* [Cookie Settings](javascript:void(0))\n", "* [California Notice](//www.fandango.com/californianotice)\n", "* [Ad Choices](//www.fandango.com/policies/cookies-and-tracking#cookie_management)\n", "* [Accessibility](/faq#accessibility)\n", "\n", "Copyright © Fandango. 
All rights reserved.\n" ] } ], "source": [ "print(split_docs[-1].page_content)" ] }, { "cell_type": "code", "execution_count": 11, "id": "ec72cb5b-1fcb-4381-85e8-f060ea1f0077", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(split_docs)" ] }, { "cell_type": "markdown", "id": "de953c9a-cb03-4ac8-8ccd-a32bde1f5f61", "metadata": {}, "source": [ "Run extraction" ] }, { "cell_type": "code", "execution_count": 12, "id": "1d17ebaa-8222-48df-82d9-42173a60f5f7", "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain.callbacks import get_openai_callback" ] }, { "cell_type": "code", "execution_count": 13, "id": "bdb85b12-25b9-477c-bf7e-5407812f4807", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total Tokens: 5854\n", "Prompt Tokens: 5128\n", "Completion Tokens: 726\n", "Successful Requests: 4\n", "Total Cost (USD): $0.011708000000000001\n" ] } ], "source": [ "with get_openai_callback() as cb:\n", " document_extraction_results = await extract_from_documents(\n", " chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True\n", " )\n", " print(f\"Total Tokens: {cb.total_tokens}\")\n", " print(f\"Prompt Tokens: {cb.prompt_tokens}\")\n", " print(f\"Completion Tokens: {cb.completion_tokens}\")\n", " print(f\"Successful Requests: {cb.successful_requests}\")\n", " print(f\"Total Cost (USD): ${cb.total_cost}\")" ] }, { "cell_type": "code", "execution_count": 14, "id": "d00f86ed-48c3-46c3-8d37-49de50817d93", "metadata": { "tags": [] }, "outputs": [], "source": [ "validated_data = list(\n", " itertools.chain.from_iterable(\n", " extraction[\"validated_data\"] for extraction in document_extraction_results\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 15, "id": "89e35b77-aac4-4a32-9515-9ebe9bfef046", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ 
"40" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(validated_data)" ] }, { "cell_type": "markdown", "id": "4b5f6bb8-656f-4d92-b9d7-3dfce538705b", "metadata": {}, "source": [ "Extraction is not perfect, but you can use a better LLM or provide more examples!" ] }, { "cell_type": "code", "execution_count": 16, "id": "75f3e35e-d6a9-4c0d-a5c3-a509eaf7d6ec", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name season year latest_episode \\\n", "0 Beef 1 \n", "1 Dave 3 \n", "2 Schmigadoon! 2 \n", "3 Pretty Baby: Brooke Shields 1 \n", "4 Tiny Beautiful Things 1 \n", "5 Grease: Rise of the Pink Ladies 1 \n", "6 Jury Duty 1 \n", "7 The Crossover 1 \n", "8 Transatlantic 1 \n", "9 Race to Survive: Alaska 1 \n", "10 Beef Apr 06 \n", "11 The Night Agent Mar 23 \n", "12 Unstable Mar 30 \n", "13 The Mandalorian Apr 05 \n", "14 The Big Door Prize Apr 05 \n", "15 Class of '07 Mar 17 \n", "16 Rabbit Hole Apr 02 \n", "17 The Power Apr 07 \n", "18 The Last of Us Mar 12 \n", "19 Yellowjackets Mar 31 \n", "20 Succession Apr 02 \n", "21 Lucky Hank Apr 02 \n", "22 Sex/Life Mar 02 \n", "23 Ted Lasso Apr 05 \n", "24 Wellmania Mar 29 \n", "25 Daisy Jones & the Six Mar 24 \n", "26 Shadow and Bone Mar 16 \n", "27 The Order \n", "28 Shrinking Mar 24 \n", "29 Swarm Mar 17 \n", "30 The Last Kingdom \n", "31 Rain Dogs Apr 03 \n", "32 Extrapolations Apr 07 \n", "33 War Sailor Apr 02 \n", "34 You Mar 09 \n", "35 She-Hulk: Attorney at Law \n", "36 You Mar 09 \n", "37 She-Hulk: Attorney at Law None \n", "38 Breaking Bad None \n", "39 The Lord of the Rings: The Rings of Power None \n", "\n", " link \n", "0 /tv/beef/s01 \n", "1 /tv/dave/s03 \n", "2 /tv/schmigadoon/s02 \n", "3 /tv/pretty_baby_brooke_shields/s01 \n", "4 /tv/tiny_beautiful_things/s01 \n", "5 /tv/grease_rise_of_the_pink_ladies/s01 \n", "6 /tv/jury_duty/s01 \n", "7 /tv/the_crossover/s01 \n", "8 /tv/transatlantic/s01 \n", "9 /tv/race_to_survive_alaska/s01 \n", "10 /tv/beef \n", "11 /tv/the_night_agent \n", "12 /tv/unstable \n", "13 /tv/the_mandalorian \n", "14 /tv/the_big_door_prize \n", "15 /tv/class_of_07 \n", "16 /tv/rabbit_hole \n", "17 /tv/the_power \n", "18 /tv/the_last_of_us \n", "19 /tv/yellowjackets \n", "20 /tv/succession \n", "21 /tv/lucky_hank \n", "22 /tv/sex_life \n", "23 /tv/ted_lasso \n", "24 /tv/wellmania \n", "25 /tv/daisy_jones_and_the_six \n", "26 /tv/shadow_and_bone \n", "27 /tv/the_order \n", 
"28 /tv/shrinking \n", "29 /tv/swarm \n", "30 /tv/the_last_kingdom \n", "31 /tv/rain_dogs \n", "32 /tv/extrapolations \n", "33 /tv/war_sailor \n", "34 /tv/you \n", "35 /tv/she_hulk_attorney_at_law \n", "36 /tv/you \n", "37 /tv/she_hulk_attorney_at_law \n", "38 /tv/breaking_bad \n", "39 /tv/the_lord_of_the_rings_the_rings_of_power " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(record.dict() for record in validated_data)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" } }, "nbformat": 4, "nbformat_minor": 5 }