{ "cells": [ { "cell_type": "markdown", "id": "4b3a0584-b52c-4873-abb8-8382e13ff5c0", "metadata": {}, "source": [ "# Document Extraction\n", "\n", "Here, we'll extract content from a longer document.\n", "\n", "\n", "The basic workflow is the following:\n", "\n", "1. Load the document\n", "2. Clean up the document (optional)\n", "3. Split the document into chunks\n", "4. Extract from *every* chunk of text\n", "\n", "-------------\n", "\n", "**ATTENTION** This is a *brute force* workflow -- there will be an LLM call for every piece of text that is being analyzed. \n", "This can be **expensive** 💰💰💰, so use at your own risk and monitor your costs!\n", "\n", "---------------\n", "\n", "Let's apply this workflow to an HTML file.\n", "\n", "We'll reduce HTML to markdown. This is a lossy step, which can sometimes improve extraction results, and sometimes make extraction worse.\n", "\n", "When scraping HTML, executing javascript may be necessary to get all HTML fully rendered. \n", "\n", "Here's a piece of code that can execute javascript using playwright: \n", "\n", "\n", "```python\n", "import asyncio\n", "\n", "from playwright.async_api import async_playwright\n", "\n", "\n", "async def a_download_html(url: str, extra_sleep: int) -> str:\n", "    \"\"\"Download the HTML of a page from a URL.\n", "\n", "    In some pathological cases, an extra sleep period may be needed.\n", "    \"\"\"\n", "\n", "    async with async_playwright() as p:\n", "        browser = await p.chromium.launch()\n", "        page = await browser.new_page()\n", "        await page.goto(url, wait_until=\"load\")\n", "        if extra_sleep:\n", "            await asyncio.sleep(extra_sleep)\n", "        html_content = await page.content()\n", "        await browser.close()\n", "    return html_content\n", "```\n", "\n", "Another possibility is to use the Selenium URL loader: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/url.html#selenium-url-loader\n", "\n", "---------\n", " \n", "Again this can be **expensive** 💰💰💰, so use at your own risk and monitor your costs!" 
] }, { "cell_type": "code", "execution_count": 1, "id": "f8536314-f0f3-4bb9-acd6-f2cec4046380", "metadata": { "nbsphinx": "hidden", "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "\n", "import sys\n", "\n", "sys.path.insert(0, \"../../\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "d93b3de7-9b81-456e-acff-4b0df9755c14", "metadata": { "nbsphinx": "hidden", "tags": [] }, "outputs": [], "source": [ "from typing import List, Optional\n", "import itertools\n", "import requests\n", "\n", "import pandas as pd\n", "from pydantic import BaseModel, Field, field_validator\n", "from kor import extract_from_documents, from_pydantic, create_extraction_chain\n", "from kor.documents.html import MarkdownifyHTMLProcessor\n", "from langchain_core.documents import Document\n", "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", "from langchain_openai import ChatOpenAI" ] }, { "cell_type": "markdown", "id": "725a851e-ab91-4aa4-94b5-747bbf096460", "metadata": {}, "source": [ "## LLM\n", "\n", "Instantiate an LLM. \n", "\n", "Start with a cheaper model (one of the cheaper davinci models or gpt-4o) before moving to a more expensive one such as davinci-003 or gpt-4.\n", "\n", "In some cases, providing a better prompt (with more examples) can help make up for using a smaller model." 
] }, { "cell_type": "markdown", "id": "b2aeb931-4725-4ab8-956e-18420d1c9a36", "metadata": { "tags": [] }, "source": [ "\n", "-------------------\n", "\n", "Quality can vary a **lot** depending on which LLM is used and how many examples are provided.\n", "\n", "-------------------" ] }, { "cell_type": "code", "execution_count": 8, "id": "fab20ada-6443-4799-b6e8-faf16a2fb585", "metadata": { "tags": [] }, "outputs": [], "source": [ "llm = ChatOpenAI(temperature=0, model='gpt-4o')" ] }, { "cell_type": "markdown", "id": "cd9eaffc-875d-4b2c-a189-6c7c181fe628", "metadata": {}, "source": [ "## Schema" ] }, { "cell_type": "code", "execution_count": 9, "id": "981938c5-f438-49a0-b511-329d31073a56", "metadata": { "tags": [] }, "outputs": [], "source": [ "\n", "\n", "class ShowOrMovie(BaseModel):\n", "    name: str = Field(\n", "        description=\"The name of the movie or tv show\",\n", "    )\n", "    season: Optional[str] = Field(\n", "        description=\"Season of TV show. Extract as a digit stripping Season prefix.\",\n", "    )\n", "    year: Optional[str] = Field(\n", "        description=\"Year when the movie / tv show was released\",\n", "    )\n", "    latest_episode: Optional[str] = Field(\n", "        description=\"Date when the latest episode was released\",\n", "    )\n", "    link: Optional[str] = Field(description=\"Link to the movie / tv show.\")\n", "\n", "    # rating -- not included because rating on rottentomatoes is in the html elements\n", "    # you could try extracting it by using the raw HTML (rather than markdown)\n", "    # or you could try doing something similar on imdb\n", "\n", "    @field_validator(\"name\")\n", "    def name_must_not_be_empty(cls, v):\n", "        if not v:\n", "            raise ValueError(\"Name must not be empty\")\n", "        return v\n", "\n", "\n", "schema, extraction_validator = from_pydantic(\n", "    ShowOrMovie,\n", "    description=\"Extract information about popular movies/tv shows including their name, season, year, latest episode and link.\",\n", "    examples=[\n", "        (\n", "            \"[Rain Dogs Latest Episode: Apr 
03](/tv/rain_dogs)\",\n", "            {\"name\": \"Rain Dogs\", \"latest_episode\": \"Apr 03\", \"link\": \"/tv/rain_dogs\"},\n", "        )\n", "    ],\n", "    many=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "id": "ff57507d-d789-4ae3-8763-0465e8b27686", "metadata": { "tags": [] }, "outputs": [], "source": [ "chain = create_extraction_chain(\n", "    llm,\n", "    schema,\n", "    encoder_or_encoder_class=\"csv\",\n", "    validator=extraction_validator,\n", "    input_formatter=\"triple_quotes\",\n", ")" ] }, { "cell_type": "markdown", "id": "65a2ff9e-951a-46f8-8cb5-d3eab0c28f59", "metadata": {}, "source": [ "## Download\n", "\n", "Let's download a page containing movies from my favorite movie review site." ] }, { "cell_type": "code", "execution_count": 11, "id": "b5bd49b1-0b51-40ff-b34b-3ad7f90423b6", "metadata": { "tags": [] }, "outputs": [], "source": [ "url = \"https://www.rottentomatoes.com/browse/tv_series_browse/sort:popular\"\n", "response = requests.get(url)  # See the comment at the top about using Selenium or playwright" ] }, { "cell_type": "markdown", "id": "1226d70c-68ad-49b5-8327-3132a556bbf3", "metadata": {}, "source": [ "Remember that in some cases you will need to execute javascript! Here's a snippet:\n", "\n", "```python\n", "from langchain_community.document_loaders import SeleniumURLLoader\n", "\n", "# The loader takes a list of URLs and returns a list of documents.\n", "documents = SeleniumURLLoader(urls=[url]).load()\n", "```" ] }, { "cell_type": "markdown", "id": "1963c139-169a-4d07-a2bd-cb2e7dfa171b", "metadata": {}, "source": [ "## Extract\n", "\n", "Use langchain building blocks to assemble whatever pipeline you need for your own purposes." ] }, { "cell_type": "markdown", "id": "8d4d40c9-6c9b-431d-9bd4-66a00f3de36f", "metadata": {}, "source": [ "Create a langchain document with the HTML content." 
] }, { "cell_type": "code", "execution_count": 12, "id": "396a16b7-bc45-4035-9848-f995cddbba2f", "metadata": { "tags": [] }, "outputs": [], "source": [ "doc = Document(page_content=response.text)" ] }, { "cell_type": "markdown", "id": "524e1105-f053-4954-9e11-e693820b4823", "metadata": {}, "source": [ "Convert to markdown\n", "\n", "**ATTENTION** This step is lossy and may end up removing information that's relevant for extraction. You can always try pushing the raw HTML through if you're not worried about cost." ] }, { "cell_type": "code", "execution_count": 13, "id": "22d3229a-aa6f-4586-a527-6f7835800740", "metadata": { "tags": [] }, "outputs": [], "source": [ "md = MarkdownifyHTMLProcessor().process(doc)" ] }, { "cell_type": "markdown", "id": "d98a7554-a3c7-42ea-99a4-d1e44aa9e22c", "metadata": {}, "source": [ "Break the document into chunks so that each chunk fits in the context window" ] }, { "cell_type": "code", "execution_count": 14, "id": "814e4151-f7d7-412d-beab-58ed190b7dd3", "metadata": { "tags": [] }, "outputs": [], "source": [ "split_docs = RecursiveCharacterTextSplitter().split_documents([md])" ] }, { "cell_type": "code", "execution_count": 15, "id": "e663f031-7a61-4a27-b925-1c9a6f365bcc", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Latest Episode: Jul 17](/tv/presumed_innocent)\n", "\n", "Watch the trailer for Sausage Party: Foodtopia\n", "\n", "[52%\n", "\n", " 52%\n", "\n", " Sausage Party: Foodtopia\n", "\n", " Latest Episode: Jul 11](/tv/sausage_party_foodtopia)\n", "\n", "Watch the trailer for Exploding Kittens\n", "\n", "[69%\n", "\n", " 80%\n", "\n", " Exploding Kittens\n", "\n", " Latest Episode: Jul 12](/tv/exploding_kittens)\n", "\n", "Watch the trailer for Kite Man: Hell Yeah!\n", "\n", "[86%\n", "\n", " 100%\n", "\n", " Kite Man: Hell Yeah!\n", "\n", " Latest Episode: Jul 18](/tv/kite_man_hell_yeah)\n", "\n", "Watch the trailer for Vikings: Valhalla\n", "\n", "[96%\n", "\n", " 60%\n", "\n", " Vikings: 
Valhalla\n", "\n", " Latest Episode: Jul 11](/tv/vikings_valhalla)\n", "\n", "Watch the trailer for Marvel's Hit-Monkey\n", "\n", "[82%\n", "\n", " 93%\n", "\n", " Marvel's Hit-Monkey\n", "\n", " Latest Episode: Jul 15](/tv/marvels_hit_monkey)\n", "\n", "Watch the trailer for Snowpiercer\n", "\n", "[75%\n", "\n", " 69%\n", "\n", " Snowpiercer](/tv/snowpiercer)\n", "\n", "Watch the trailer for Shōgun\n", "\n", "[99%\n", "\n", " 92%\n", "\n", " Shōgun\n", "\n", " Latest Episode: Apr 23](/tv/shogun_2024)\n", "\n", "Watch the trailer for True Detective\n", "\n", "[79%\n", "\n", " 57%\n", "\n", " True Detective](/tv/true_detective)\n", "\n", "Watch the trailer for Land of Women\n", "\n", "[89%\n", "\n", " 36%\n", "\n", " Land of Women\n", "\n", " Latest Episode: Jul 17](/tv/land_of_women)\n", "\n", "Watch the trailer for Emperor of Ocean Park\n", "\n", "[50%\n", "\n", " 75%\n", "\n", " Emperor of Ocean Park\n", "\n", " Latest Episode: Jul 14](/tv/emperor_of_ocean_park)\n", "\n", "[86%\n", "\n", " 71%\n", "\n", " A Good Girl's Guide to Murder\n", "\n", " Latest Episode: Jul 01](/tv/a_good_girls_guide_to_murder)\n", "\n", "[75%\n", "\n", " Desperate Lies\n", "\n", " Latest Episode: Jul 05](/tv/desperate_lies)\n", "\n", "Watch the trailer for Simone Biles: Rising\n", "\n", "[100%\n", "\n", " Simone Biles: Rising\n", "\n", " Latest Episode: Jul 17](/tv/simone_biles_rising)\n", "\n", "Watch the trailer for Fool Me Once\n", "\n", "[69%\n", "\n", " 45%\n", "\n", " Fool Me Once](/tv/fool_me_once)\n", "\n", "Watch the trailer for Dear Child\n", "\n", "[100%\n", "\n", " 84%\n", "\n", " Dear Child](/tv/dear_child)\n", "\n", "Watch the trailer for Dark Matter\n", "\n", "[82%\n", "\n", " 82%\n", "\n", " Dark Matter\n", "\n", " Latest Episode: Jun 26](/tv/dark_matter_2024)\n", "\n", "Watch the trailer for The Serpent Queen\n", "\n", "[100%\n", "\n", " 92%\n", "\n", " The Serpent Queen\n", "\n", " Latest Episode: Jul 19](/tv/the_serpent_queen)\n", "\n", " Load more\n", "\n", "Close 
video\n", "\n", "See Details\n", "\n", "See Details\n", "\n", "* [Help](/help_desk)\n", "* [About Rotten Tomatoes](/about)\n", "* [What's the Tomatometer®?](/about#whatisthetomatometer)\n", "* \n", "\n", "* [Critic Submission](/critics/criteria)\n", "* [Licensing](/help_desk/licensing)\n", "* [Advertise With Us](https://together.nbcuni.com/advertise/?utm_source=rotten_tomatoes&utm_medium=referral&utm_campaign=property_ad_pages&utm_content=footer)\n", "* [Careers](//www.fandango.com/careers)\n", "\n", " Join the Newsletter\n", "\n", "Get the freshest reviews, news, and more delivered right to your inbox!\n", "\n", " Join The Newsletter\n", "\n", "Join The Newsletter\n", "\n", "Follow Us\n", "\n", "Copyright © Fandango. All rights reserved.\n", "\n", "Join The Newsletter\n", "Join The Newsletter\n", "* [Privacy Policy](https://www.nbcuniversal.com/fandango-privacy-policy)\n", "* [Terms and Policies](/policies/terms-and-policies)\n", "* \n", "* [California Notice](https://www.nbcuniversal.com/privacy/california-consumer-privacy-act)\n", "* [Ad Choices](https://www.nbcuniversal.com/privacy/cookies#accordionheader2)\n", "* \n", "* [Accessibility](/faq#accessibility)\n", "\n", "* V3.1\n", "* [Privacy Policy](https://www.nbcuniversal.com/fandango-privacy-policy)\n", "* [Terms and Policies](/policies/terms-and-policies)\n", "* \n", "* [California Notice](https://www.nbcuniversal.com/privacy/california-consumer-privacy-act)\n", "* [Ad Choices](https://www.nbcuniversal.com/privacy/cookies#accordionheader2)\n", "* [Accessibility](/faq#accessibility)\n", "\n", "Copyright © Fandango. 
All rights reserved.\n" ] } ], "source": [ "print(split_docs[-1].page_content)" ] }, { "cell_type": "code", "execution_count": 16, "id": "ec72cb5b-1fcb-4381-85e8-f060ea1f0077", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(split_docs)" ] }, { "cell_type": "markdown", "id": "de953c9a-cb03-4ac8-8ccd-a32bde1f5f61", "metadata": {}, "source": [ "Run extraction" ] }, { "cell_type": "code", "execution_count": 17, "id": "1d17ebaa-8222-48df-82d9-42173a60f5f7", "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain_community.callbacks import get_openai_callback" ] }, { "cell_type": "code", "execution_count": 18, "id": "bdb85b12-25b9-477c-bf7e-5407812f4807", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total Tokens: 6344\n", "Prompt Tokens: 5448\n", "Completion Tokens: 896\n", "Successful Requests: 4\n", "Total Cost (USD): $0.0\n" ] } ], "source": [ "with get_openai_callback() as cb:\n", " document_extraction_results = await extract_from_documents(\n", " chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True\n", " )\n", " print(f\"Total Tokens: {cb.total_tokens}\")\n", " print(f\"Prompt Tokens: {cb.prompt_tokens}\")\n", " print(f\"Completion Tokens: {cb.completion_tokens}\")\n", " print(f\"Successful Requests: {cb.successful_requests}\")\n", " print(f\"Total Cost (USD): ${cb.total_cost}\")" ] }, { "cell_type": "code", "execution_count": 19, "id": "d00f86ed-48c3-46c3-8d37-49de50817d93", "metadata": { "tags": [] }, "outputs": [], "source": [ "validated_data = list(\n", " itertools.chain.from_iterable(\n", " extraction[\"validated_data\"] for extraction in document_extraction_results\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 20, "id": "89e35b77-aac4-4a32-9515-9ebe9bfef046", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "47" ] 
}, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(validated_data)" ] }, { "cell_type": "markdown", "id": "4b5f6bb8-656f-4d92-b9d7-3dfce538705b", "metadata": {}, "source": [ "Extraction is not perfect, but you can use a better LLM or provide more examples!" ] }, { "cell_type": "code", "execution_count": 21, "id": "75f3e35e-d6a9-4c0d-a5c3-a509eaf7d6ec", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", " | name | \n", "season | \n", "year | \n", "latest_episode | \n", "link | \n", "
---|---|---|---|---|---|
0 | \n", "Twisters | \n", "\n", " | \n", " | July 2024 | \n", "/m/twisters | \n", "
1 | \n", "Longlegs | \n", "\n", " | \n", " | July 2024 | \n", "/m/longlegs | \n", "
2 | \n", "National Anthem | \n", "\n", " | \n", " | July 2024 | \n", "/m/national_anthem | \n", "
3 | \n", "Cobra Kai | \n", "6 | \n", "\n", " | July 2024 | \n", "/tv/cobra_kai/s06 | \n", "
4 | \n", "Cobra Kai | \n", "6 | \n", "\n", " | \n", " | /tv/cobra_kai/s06 | \n", "
5 | \n", "Kite Man: Hell Yeah! | \n", "1 | \n", "\n", " | \n", " | /tv/kite_man_hell_yeah/s01 | \n", "
6 | \n", "Simone Biles: Rising | \n", "1 | \n", "\n", " | \n", " | /tv/simone_biles_rising/s01 | \n", "
7 | \n", "Lady in the Lake | \n", "1 | \n", "\n", " | \n", " | /tv/lady_in_the_lake/s01 | \n", "
8 | \n", "Marvel's Hit-Monkey | \n", "2 | \n", "\n", " | \n", " | /tv/marvels_hit_monkey/s02 | \n", "
9 | \n", "Those About to Die | \n", "1 | \n", "\n", " | \n", " | /tv/those_about_to_die/s01 | \n", "
10 | \n", "Emperor of Ocean Park | \n", "1 | \n", "\n", " | \n", " | /tv/emperor_of_ocean_park/s01 | \n", "
11 | \n", "Mafia Spies | \n", "1 | \n", "\n", " | \n", " | /tv/mafia_spies/s01 | \n", "
12 | \n", "The Ark | \n", "2 | \n", "\n", " | \n", " | /tv/the_ark/s02 | \n", "
13 | \n", "Unprisoned | \n", "2 | \n", "\n", " | \n", " | /tv/unprisoned/s02 | \n", "
14 | \n", "Star Wars: The Acolyte | \n", "1 | \n", "\n", " | \n", " | /tv/star_wars_the_acolyte/s01 | \n", "
15 | \n", "The Boys | \n", "4 | \n", "\n", " | \n", " | /tv/the_boys_2019/s04 | \n", "
16 | \n", "Supacell | \n", "1 | \n", "\n", " | \n", " | /tv/supacell/s01 | \n", "
17 | \n", "The Bear | \n", "3 | \n", "\n", " | \n", " | /tv/the_bear/s03 | \n", "
18 | \n", "Presumed Innocent | \n", "1 | \n", "\n", " | \n", " | /tv/presumed_innocent/s01 | \n", "
19 | \n", "Sunny | \n", "1 | \n", "\n", " | \n", " | /tv/sunny/s01 | \n", "
20 | \n", "Cobra Kai | \n", "6 | \n", "\n", " | Jul 18 | \n", "https://editorial.rottentomatoes.com/article/c... | \n", "
21 | \n", "Star Wars: The Acolyte | \n", "\n", " | \n", " | Jul 16 | \n", "/tv/star_wars_the_acolyte | \n", "
22 | \n", "The Boys | \n", "\n", " | \n", " | Jul 18 | \n", "/tv/the_boys_2019 | \n", "
23 | \n", "Supacell | \n", "\n", " | \n", " | Jun 27 | \n", "/tv/supacell | \n", "
24 | \n", "Sunny | \n", "\n", " | \n", " | Jul 17 | \n", "/tv/sunny | \n", "
25 | \n", "Those About to Die | \n", "\n", " | \n", " | Jul 19 | \n", "/tv/those_about_to_die | \n", "
26 | \n", "Cobra Kai | \n", "\n", " | \n", " | Jul 18 | \n", "/tv/cobra_kai | \n", "
27 | \n", "The Bear | \n", "\n", " | \n", " | Jun 26 | \n", "/tv/the_bear | \n", "
28 | \n", "House of the Dragon | \n", "\n", " | \n", " | Jul 14 | \n", "/tv/house_of_the_dragon | \n", "
29 | \n", "Lady in the Lake | \n", "\n", " | \n", " | Jul 19 | \n", "/tv/lady_in_the_lake | \n", "
30 | \n", "My Lady Jane | \n", "\n", " | \n", " | Jun 27 | \n", "/tv/my_lady_jane | \n", "
31 | \n", "Presumed Innocent | \n", "\n", " | \n", " | Jul 17 | \n", "/tv/presumed_innocent | \n", "
32 | \n", "Sausage Party: Foodtopia | \n", "\n", " | \n", " | Jul 11 | \n", "/tv/sausage_party_foodtopia | \n", "
33 | \n", "Presumed Innocent | \n", "\n", " | \n", " | Jul 17 | \n", "/tv/presumed_innocent | \n", "
34 | \n", "Sausage Party: Foodtopia | \n", "\n", " | \n", " | Jul 11 | \n", "/tv/sausage_party_foodtopia | \n", "
35 | \n", "Exploding Kittens | \n", "\n", " | \n", " | Jul 12 | \n", "/tv/exploding_kittens | \n", "
36 | \n", "Kite Man: Hell Yeah! | \n", "\n", " | \n", " | Jul 18 | \n", "/tv/kite_man_hell_yeah | \n", "
37 | \n", "Vikings: Valhalla | \n", "\n", " | \n", " | Jul 11 | \n", "/tv/vikings_valhalla | \n", "
38 | \n", "Marvel's Hit-Monkey | \n", "\n", " | \n", " | Jul 15 | \n", "/tv/marvels_hit_monkey | \n", "
39 | \n", "Shōgun | \n", "\n", " | \n", " | Apr 23 | \n", "/tv/shogun_2024 | \n", "
40 | \n", "Land of Women | \n", "\n", " | \n", " | Jul 17 | \n", "/tv/land_of_women | \n", "
41 | \n", "Emperor of Ocean Park | \n", "\n", " | \n", " | Jul 14 | \n", "/tv/emperor_of_ocean_park | \n", "
42 | \n", "A Good Girl's Guide to Murder | \n", "\n", " | \n", " | Jul 01 | \n", "/tv/a_good_girls_guide_to_murder | \n", "
43 | \n", "Desperate Lies | \n", "\n", " | \n", " | Jul 05 | \n", "/tv/desperate_lies | \n", "
44 | \n", "Simone Biles: Rising | \n", "\n", " | \n", " | Jul 17 | \n", "/tv/simone_biles_rising | \n", "
45 | \n", "Dark Matter | \n", "\n", " | \n", " | Jun 26 | \n", "/tv/dark_matter_2024 | \n", "
46 | \n", "The Serpent Queen | \n", "\n", " | \n", " | Jul 19 | \n", "/tv/the_serpent_queen | \n", "