{ "cells": [ { "cell_type": "markdown", "id": "4b3a0584-b52c-4873-abb8-8382e13ff5c0", "metadata": {}, "source": [ "# Document Extraction\n", "\n", "Here, we'll be extracting content from a longer document.\n", "\n", "\n", "The basic workflow is the following:\n", "\n", "1. Load the document\n", "2. Clean up the document (optional)\n", "3. Split the document into chunks\n", "4. Extract from *every* chunk of text\n", "\n", "-------------\n", "\n", "**ATTENTION** This is a *brute force* workflow -- there will be an LLM call for every piece of text that is being analyzed. \n", "This can be **expensive** 💰💰💰, so use at your own risk and monitor your costs!\n", "\n", "---------------\n", "\n", "Let's apply this workflow to an HTML file.\n", "\n", "We'll reduce HTML to markdown. This is a lossy step, which can sometimes improve extraction results, and sometimes make extraction worse.\n", "\n", "When scraping HTML, executing JavaScript may be necessary to get the page fully rendered. \n", "\n", "Here's a piece of code that can execute JavaScript using playwright: \n", "\n", "\n", "```python\n", "import asyncio\n", "\n", "from playwright.async_api import async_playwright\n", "\n", "\n", "async def a_download_html(url: str, extra_sleep: int) -> str:\n", "    \"\"\"Download the HTML of a page from a URL.\n", "\n", "    In some pathological cases, an extra sleep period may be needed.\n", "    \"\"\"\n", "    async with async_playwright() as p:\n", "        browser = await p.chromium.launch()\n", "        page = await browser.new_page()\n", "        await page.goto(url, wait_until=\"load\")\n", "        if extra_sleep:\n", "            # Give late-running scripts time to finish rendering\n", "            await asyncio.sleep(extra_sleep)\n", "        html_content = await page.content()\n", "        await browser.close()\n", "    return html_content\n", "```\n", "\n", "Another possibility is to use: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/url.html#selenium-url-loader\n", "\n", "---------\n", " \n", "Again this can be **expensive** 💰💰💰, so use at your own risk and monitor your costs!" 
] }, { "cell_type": "code", "execution_count": 1, "id": "f8536314-f0f3-4bb9-acd6-f2cec4046380", "metadata": { "nbsphinx": "hidden", "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "\n", "import sys\n", "\n", "sys.path.insert(0, \"../../\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "d93b3de7-9b81-456e-acff-4b0df9755c14", "metadata": { "nbsphinx": "hidden", "tags": [] }, "outputs": [], "source": [ "from typing import List, Optional\n", "import itertools\n", "import requests\n", "\n", "import pandas as pd\n", "from pydantic import BaseModel, Field, validator\n", "from kor import extract_from_documents, from_pydantic, create_extraction_chain\n", "from kor.documents.html import MarkdownifyHTMLProcessor\n", "from langchain.chat_models import ChatOpenAI\n", "from langchain.schema import Document\n", "from langchain.text_splitter import RecursiveCharacterTextSplitter" ] }, { "cell_type": "markdown", "id": "725a851e-ab91-4aa4-94b5-747bbf096460", "metadata": {}, "source": [ "## LLM\n", "\n", "Instantiate an LLM. \n", "\n", "Try experimenting with the cheaper davinci models or with gpt-3.5-turbo before trying the more expensive davinci-003 or gpt-4.\n", "\n", "In some cases, providing a better prompt (with more examples) can help make up for using a smaller model." 
] }, { "cell_type": "markdown", "id": "b2aeb931-4725-4ab8-956e-18420d1c9a36", "metadata": { "tags": [] }, "source": [ "\n", "-------------------\n", "\n", "Quality can vary a **lot** depending on which LLM is used and how many examples are provided.\n", "\n", "-------------------" ] }, { "cell_type": "code", "execution_count": 3, "id": "fab20ada-6443-4799-b6e8-faf16a2fb585", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Using gpt-3.5-turbo which is pretty cheap, but has worse quality\n", "llm = ChatOpenAI(temperature=0)" ] }, { "cell_type": "markdown", "id": "cd9eaffc-875d-4b2c-a189-6c7c181fe628", "metadata": {}, "source": [ "## Schema" ] }, { "cell_type": "code", "execution_count": 4, "id": "981938c5-f438-49a0-b511-329d31073a56", "metadata": { "tags": [] }, "outputs": [], "source": [ "class ShowOrMovie(BaseModel):\n", "    name: str = Field(\n", "        description=\"The name of the movie or tv show\",\n", "    )\n", "    season: Optional[str] = Field(\n", "        description=\"Season of TV show. Extract as a digit stripping Season prefix.\",\n", "    )\n", "    year: Optional[str] = Field(\n", "        description=\"Year when the movie / tv show was released\",\n", "    )\n", "    latest_episode: Optional[str] = Field(\n", "        description=\"Date when the latest episode was released\",\n", "    )\n", "    link: Optional[str] = Field(description=\"Link to the movie / tv show.\")\n", "\n", "    # rating -- not included because rating on rottentomatoes is in the html elements\n", "    # you could try extracting it by using the raw HTML (rather than markdown)\n", "    # or you could try doing something similar on imdb\n", "\n", "    @validator(\"name\")\n", "    def name_must_not_be_empty(cls, v):\n", "        if not v:\n", "            raise ValueError(\"Name must not be empty\")\n", "        return v\n", "\n", "\n", "schema, extraction_validator = from_pydantic(\n", "    ShowOrMovie,\n", "    description=\"Extract information about popular movies/tv shows including their name, year, link and rating.\",\n", "    examples=[\n", "        (\n", "            \"[Rain Dogs Latest Episode: Apr 03](/tv/rain_dogs)\",\n", "            {\"name\": \"Rain Dogs\", \"latest_episode\": \"Apr 03\", \"link\": \"/tv/rain_dogs\"},\n", "        )\n", "    ],\n", "    many=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 5, "id": "ff57507d-d789-4ae3-8763-0465e8b27686", "metadata": { "tags": [] }, "outputs": [], "source": [ "chain = create_extraction_chain(\n", "    llm,\n", "    schema,\n", "    encoder_or_encoder_class=\"csv\",\n", "    validator=extraction_validator,\n", "    input_formatter=\"triple_quotes\",\n", ")" ] }, { "cell_type": "markdown", "id": "65a2ff9e-951a-46f8-8cb5-d3eab0c28f59", "metadata": {}, "source": [ "## Download\n", "\n", "Let's download a page containing movies from my favorite movie review site." 
] }, { "cell_type": "code", "execution_count": 6, "id": "b5bd49b1-0b51-40ff-b34b-3ad7f90423b6", "metadata": { "tags": [] }, "outputs": [], "source": [ "url = \"https://www.rottentomatoes.com/browse/tv_series_browse/sort:popular\"\n", "response = requests.get(url)  # See the note at the top about using Selenium or playwright" ] }, { "cell_type": "markdown", "id": "1226d70c-68ad-49b5-8327-3132a556bbf3", "metadata": {}, "source": [ "Remember that in some cases you will need to execute JavaScript! Here's a snippet:\n", "\n", "```python\n", "from langchain.document_loaders import SeleniumURLLoader\n", "\n", "# SeleniumURLLoader expects a list of URLs and returns a list of documents\n", "documents = SeleniumURLLoader(urls=[url]).load()\n", "```" ] }, { "cell_type": "markdown", "id": "1963c139-169a-4d07-a2bd-cb2e7dfa171b", "metadata": {}, "source": [ "## Extract\n", "\n", "Use langchain building blocks to assemble whatever pipeline you need for your own purposes." ] }, { "cell_type": "markdown", "id": "8d4d40c9-6c9b-431d-9bd4-66a00f3de36f", "metadata": {}, "source": [ "Create a langchain document with the HTML content." ] }, { "cell_type": "code", "execution_count": 7, "id": "396a16b7-bc45-4035-9848-f995cddbba2f", "metadata": { "tags": [] }, "outputs": [], "source": [ "doc = Document(page_content=response.text)" ] }, { "cell_type": "markdown", "id": "524e1105-f053-4954-9e11-e693820b4823", "metadata": {}, "source": [ "Convert to markdown\n", "\n", "**ATTENTION** This step is lossy and may end up removing information that's relevant for extraction. You can always try pushing the raw HTML through if you're not worried about cost." 
] }, { "cell_type": "code", "execution_count": 8, "id": "22d3229a-aa6f-4586-a527-6f7835800740", "metadata": { "tags": [] }, "outputs": [], "source": [ "md = MarkdownifyHTMLProcessor().process(doc)" ] }, { "cell_type": "markdown", "id": "d98a7554-a3c7-42ea-99a4-d1e44aa9e22c", "metadata": {}, "source": [ "Break the document into chunks so that each chunk fits in the context window" ] }, { "cell_type": "code", "execution_count": 9, "id": "814e4151-f7d7-412d-beab-58ed190b7dd3", "metadata": { "tags": [] }, "outputs": [], "source": [ "split_docs = RecursiveCharacterTextSplitter().split_documents([md])" ] }, { "cell_type": "code", "execution_count": 10, "id": "e663f031-7a61-4a27-b925-1c9a6f365bcc", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Watch the trailer for You\n", "\n", "[You\n", "\n", " Latest Episode: Mar 09](/tv/you)\n", "\n", "Watch the trailer for She-Hulk: Attorney at Law\n", "\n", "[She-Hulk: Attorney at Law](/tv/she_hulk_attorney_at_law)\n", "\n", "[Breaking Bad](/tv/breaking_bad)\n", "\n", "Watch the trailer for The Lord of the Rings: The Rings of Power\n", "\n", "[The Lord of the Rings: The Rings of Power](/tv/the_lord_of_the_rings_the_rings_of_power)\n", "\n", "No results\n", "\n", " Reset Filters\n", "\n", " Load more\n", "\n", "Close video\n", "\n", "See Details\n", "\n", "See Details\n", "\n", "* [Help](/help_desk)\n", "* [About Rotten Tomatoes](/about)\n", "* [What's the Tomatometer®?](/about#whatisthetomatometer)\n", "* \n", "\n", "* [Critic Submission](/critics/criteria)\n", "* [Licensing](/help_desk/licensing)\n", "* [Advertise With Us](https://together.nbcuni.com/advertise/?utm_source=rotten_tomatoes&utm_medium=referral&utm_campaign=property_ad_pages&utm_content=footer)\n", "* [Careers](//www.fandango.com/careers)\n", "\n", "Join The Newsletter\n", "\n", "Get the freshest reviews, news, and more delivered right to your inbox!\n", "\n", "Join The Newsletter\n", "[Join The 
Newsletter](https://optout.services.fandango.com/rottentomatoes)\n", "\n", "Follow Us\n", "\n", "* \n", "* \n", "* \n", "* \n", "* \n", "\n", "Copyright © Fandango. All rights reserved.\n", "\n", "Join Newsletter\n", "[Join Newsletter](https://optout.services.fandango.com/rottentomatoes)\n", "* [Privacy Policy](//www.fandango.com/policies/privacy-policy)\n", "* [Terms and Policies](//www.fandango.com/policies/terms-and-policies)\n", "* [Cookie Settings](javascript:void(0))\n", "* [California Notice](//www.fandango.com/californianotice)\n", "* [Ad Choices](//www.fandango.com/policies/cookies-and-tracking#cookie_management)\n", "* \n", "* [Accessibility](/faq#accessibility)\n", "\n", "* V3.1\n", "* [Privacy Policy](//www.fandango.com/policies/privacy-policy)\n", "* [Terms and Policies](//www.fandango.com/policies/terms-and-policies)\n", "* [Cookie Settings](javascript:void(0))\n", "* [California Notice](//www.fandango.com/californianotice)\n", "* [Ad Choices](//www.fandango.com/policies/cookies-and-tracking#cookie_management)\n", "* [Accessibility](/faq#accessibility)\n", "\n", "Copyright © Fandango. 
All rights reserved.\n" ] } ], "source": [ "print(split_docs[-1].page_content)" ] }, { "cell_type": "code", "execution_count": 11, "id": "ec72cb5b-1fcb-4381-85e8-f060ea1f0077", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(split_docs)" ] }, { "cell_type": "markdown", "id": "de953c9a-cb03-4ac8-8ccd-a32bde1f5f61", "metadata": {}, "source": [ "Run extraction" ] }, { "cell_type": "code", "execution_count": 12, "id": "1d17ebaa-8222-48df-82d9-42173a60f5f7", "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain.callbacks import get_openai_callback" ] }, { "cell_type": "code", "execution_count": 13, "id": "bdb85b12-25b9-477c-bf7e-5407812f4807", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total Tokens: 5854\n", "Prompt Tokens: 5128\n", "Completion Tokens: 726\n", "Successful Requests: 4\n", "Total Cost (USD): $0.011708000000000001\n" ] } ], "source": [ "with get_openai_callback() as cb:\n", " document_extraction_results = await extract_from_documents(\n", " chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True\n", " )\n", " print(f\"Total Tokens: {cb.total_tokens}\")\n", " print(f\"Prompt Tokens: {cb.prompt_tokens}\")\n", " print(f\"Completion Tokens: {cb.completion_tokens}\")\n", " print(f\"Successful Requests: {cb.successful_requests}\")\n", " print(f\"Total Cost (USD): ${cb.total_cost}\")" ] }, { "cell_type": "code", "execution_count": 14, "id": "d00f86ed-48c3-46c3-8d37-49de50817d93", "metadata": { "tags": [] }, "outputs": [], "source": [ "validated_data = list(\n", " itertools.chain.from_iterable(\n", " extraction[\"validated_data\"] for extraction in document_extraction_results\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 15, "id": "89e35b77-aac4-4a32-9515-9ebe9bfef046", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ 
"40" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(validated_data)" ] }, { "cell_type": "markdown", "id": "4b5f6bb8-656f-4d92-b9d7-3dfce538705b", "metadata": {}, "source": [ "Extraction is not perfect, but you can use a better LLM or provide more examples!" ] }, { "cell_type": "code", "execution_count": 16, "id": "75f3e35e-d6a9-4c0d-a5c3-a509eaf7d6ec", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr style=\"text-align: right;\"><th></th><th>name</th><th>season</th><th>year</th><th>latest_episode</th><th>link</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Beef</td><td>1</td><td></td><td></td><td>/tv/beef/s01</td></tr>\n",
"<tr><th>1</th><td>Dave</td><td>3</td><td></td><td></td><td>/tv/dave/s03</td></tr>\n",
"<tr><th>2</th><td>Schmigadoon!</td><td>2</td><td></td><td></td><td>/tv/schmigadoon/s02</td></tr>\n",
"<tr><th>3</th><td>Pretty Baby: Brooke Shields</td><td>1</td><td></td><td></td><td>/tv/pretty_baby_brooke_shields/s01</td></tr>\n",
"<tr><th>4</th><td>Tiny Beautiful Things</td><td>1</td><td></td><td></td><td>/tv/tiny_beautiful_things/s01</td></tr>\n",
"<tr><th>5</th><td>Grease: Rise of the Pink Ladies</td><td>1</td><td></td><td></td><td>/tv/grease_rise_of_the_pink_ladies/s01</td></tr>\n",
"<tr><th>6</th><td>Jury Duty</td><td>1</td><td></td><td></td><td>/tv/jury_duty/s01</td></tr>\n",
"<tr><th>7</th><td>The Crossover</td><td>1</td><td></td><td></td><td>/tv/the_crossover/s01</td></tr>\n",
"<tr><th>8</th><td>Transatlantic</td><td>1</td><td></td><td></td><td>/tv/transatlantic/s01</td></tr>\n",
"<tr><th>9</th><td>Race to Survive: Alaska</td><td>1</td><td></td><td></td><td>/tv/race_to_survive_alaska/s01</td></tr>\n",
"<tr><th>10</th><td>Beef</td><td></td><td></td><td>Apr 06</td><td>/tv/beef</td></tr>\n",
"<tr><th>11</th><td>The Night Agent</td><td></td><td></td><td>Mar 23</td><td>/tv/the_night_agent</td></tr>\n",
"<tr><th>12</th><td>Unstable</td><td></td><td></td><td>Mar 30</td><td>/tv/unstable</td></tr>\n",
"<tr><th>13</th><td>The Mandalorian</td><td></td><td></td><td>Apr 05</td><td>/tv/the_mandalorian</td></tr>\n",
"<tr><th>14</th><td>The Big Door Prize</td><td></td><td></td><td>Apr 05</td><td>/tv/the_big_door_prize</td></tr>\n",
"<tr><th>15</th><td>Class of '07</td><td></td><td></td><td>Mar 17</td><td>/tv/class_of_07</td></tr>\n",
"<tr><th>16</th><td>Rabbit Hole</td><td></td><td></td><td>Apr 02</td><td>/tv/rabbit_hole</td></tr>\n",
"<tr><th>17</th><td>The Power</td><td></td><td></td><td>Apr 07</td><td>/tv/the_power</td></tr>\n",
"<tr><th>18</th><td>The Last of Us</td><td></td><td></td><td>Mar 12</td><td>/tv/the_last_of_us</td></tr>\n",
"<tr><th>19</th><td>Yellowjackets</td><td></td><td></td><td>Mar 31</td><td>/tv/yellowjackets</td></tr>\n",
"<tr><th>20</th><td>Succession</td><td></td><td></td><td>Apr 02</td><td>/tv/succession</td></tr>\n",
"<tr><th>21</th><td>Lucky Hank</td><td></td><td></td><td>Apr 02</td><td>/tv/lucky_hank</td></tr>\n",
"<tr><th>22</th><td>Sex/Life</td><td></td><td></td><td>Mar 02</td><td>/tv/sex_life</td></tr>\n",
"<tr><th>23</th><td>Ted Lasso</td><td></td><td></td><td>Apr 05</td><td>/tv/ted_lasso</td></tr>\n",
"<tr><th>24</th><td>Wellmania</td><td></td><td></td><td>Mar 29</td><td>/tv/wellmania</td></tr>\n",
"<tr><th>25</th><td>Daisy Jones &amp; the Six</td><td></td><td></td><td>Mar 24</td><td>/tv/daisy_jones_and_the_six</td></tr>\n",
"<tr><th>26</th><td>Shadow and Bone</td><td></td><td></td><td>Mar 16</td><td>/tv/shadow_and_bone</td></tr>\n",
"<tr><th>27</th><td>The Order</td><td></td><td></td><td></td><td>/tv/the_order</td></tr>\n",
"<tr><th>28</th><td>Shrinking</td><td></td><td></td><td>Mar 24</td><td>/tv/shrinking</td></tr>\n",
"<tr><th>29</th><td>Swarm</td><td></td><td></td><td>Mar 17</td><td>/tv/swarm</td></tr>\n",
"<tr><th>30</th><td>The Last Kingdom</td><td></td><td></td><td></td><td>/tv/the_last_kingdom</td></tr>\n",
"<tr><th>31</th><td>Rain Dogs</td><td></td><td></td><td>Apr 03</td><td>/tv/rain_dogs</td></tr>\n",
"<tr><th>32</th><td>Extrapolations</td><td></td><td></td><td>Apr 07</td><td>/tv/extrapolations</td></tr>\n",
"<tr><th>33</th><td>War Sailor</td><td></td><td></td><td>Apr 02</td><td>/tv/war_sailor</td></tr>\n",
"<tr><th>34</th><td>You</td><td></td><td></td><td>Mar 09</td><td>/tv/you</td></tr>\n",
"<tr><th>35</th><td>She-Hulk: Attorney at Law</td><td></td><td></td><td></td><td>/tv/she_hulk_attorney_at_law</td></tr>\n",
"<tr><th>36</th><td>You</td><td></td><td></td><td>Mar 09</td><td>/tv/you</td></tr>\n",
"<tr><th>37</th><td>She-Hulk: Attorney at Law</td><td></td><td></td><td>None</td><td>/tv/she_hulk_attorney_at_law</td></tr>\n",
"<tr><th>38</th><td>Breaking Bad</td><td></td><td></td><td>None</td><td>/tv/breaking_bad</td></tr>\n",
"<tr><th>39</th><td>The Lord of the Rings: The Rings of Power</td><td></td><td></td><td>None</td><td>/tv/the_lord_of_the_rings_the_rings_of_power</td></tr>\n",
"</tbody>\n",
"</table>\n", "