Document Extraction#

Here, we’ll be extracting content from a longer document.

The basic workflow is the following:

  1. Load the document

  2. Clean up the document (optional)

  3. Split the document into chunks

  4. Extract from every chunk of text


ATTENTION This is a brute force workflow – there will be an LLM call for every piece of text that is being analyzed. This can be expensive 💰💰💰, so use at your own risk and monitor your costs!


Let’s apply this workflow to an HTML file.

We’ll reduce the HTML to markdown. This is a lossy step that can sometimes improve extraction results and sometimes make them worse.

When scraping HTML, you may need to execute javascript to get the page fully rendered.

Here’s a piece of code that can execute javascript using playwright:

import asyncio

from playwright.async_api import async_playwright


async def a_download_html(url: str, extra_sleep: int) -> str:
    """Download HTML from a URL, rendering it with a headless browser.

    In some pathological cases, an extra sleep period may be needed.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="load")
        if extra_sleep:
            await asyncio.sleep(extra_sleep)
        html_content = await page.content()
        await browser.close()
    return html_content
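The function above is a coroutine: in a notebook you can simply await it, while in a regular script you would drive it with asyncio.run. A minimal sketch (the extra_sleep value here is just an illustrative choice):

html = asyncio.run(
    a_download_html(
        "https://www.rottentomatoes.com/browse/tv_series_browse/sort:popular",
        extra_sleep=2,
    )
)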

Another possibility is to use LangChain’s SeleniumURLLoader: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/url.html#selenium-url-loader


Again this can be expensive 💰💰💰, so use at your own risk and monitor your costs!

from typing import List, Optional
import itertools
import requests

import pandas as pd
from pydantic import BaseModel, Field, validator
from kor import extract_from_documents, from_pydantic, create_extraction_chain
from kor.documents.html import MarkdownifyHTMLProcessor
from langchain.chat_models import ChatOpenAI
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

LLM#

Instantiate an LLM.

Try experimenting with cheaper models (e.g., the smaller davinci variants or gpt-3.5-turbo) before moving to the more expensive text-davinci-003 or gpt-4.

In some cases, providing a better prompt (with more examples) can help make up for using a smaller model.


Quality can vary a lot depending on which LLM is used and how many examples are provided.


# Using gpt-3.5-turbo which is pretty cheap, but has worse quality
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
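If quality is an issue, one option is to swap in a larger model, assuming you have access to it; for example:

# A larger, more expensive model -- usually better extraction quality
llm = ChatOpenAI(model_name="gpt-4", temperature=0)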

Schema#

class ShowOrMovie(BaseModel):
    name: str = Field(
        description="The name of the movie or tv show",
    )
    season: Optional[str] = Field(
        description="Season of TV show. Extract as a digit stripping Season prefix.",
    )
    year: Optional[str] = Field(
        description="Year when the movie / tv show was released",
    )
    latest_episode: Optional[str] = Field(
        description="Date when the latest episode was released",
    )
    link: Optional[str] = Field(description="Link to the movie / tv show.")

    # rating -- not included because on rottentomatoes the rating lives in the
    # HTML elements rather than in the visible text. You could try extracting it
    # from the raw HTML (rather than markdown), or try something similar on imdb.

    @validator("name")
    def name_must_not_be_empty(cls, v):
        if not v:
            raise ValueError("Name must not be empty")
        return v


schema, extraction_validator = from_pydantic(
    ShowOrMovie,
    description="Extract information about popular movies/tv shows including their name, year, link and rating.",
    examples=[
        (
            "[Rain Dogs Latest Episode: Apr 03](/tv/rain_dogs)",
            {"name": "Rain Dogs", "latest_episode": "Apr 03", "link": "/tv/rain_dogs"},
        )
    ],
    many=True,
)
chain = create_extraction_chain(
    llm,
    schema,
    encoder_or_encoder_class="csv",
    validator=extraction_validator,
    input_formatter="triple_quotes",
)
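Before running over the whole document, it can help to sanity-check the chain on a single snippet. A small sketch, assuming the chain exposes predict_and_parse as in other Kor examples (the input text is made up to resemble the site’s markdown):

chain.predict_and_parse(
    text="[The Last of Us Latest Episode: Mar 12](/tv/the_last_of_us)"
)["validated_data"]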

Download#

Let’s download a page containing movies from my favorite movie review site.

url = "https://www.rottentomatoes.com/browse/tv_series_browse/sort:popular"
response = requests.get(url)  # Please see the note above about executing javascript (Playwright / Selenium)!

Remember that in some cases you will need to execute javascript! Here’s a snippet:

from langchain.document_loaders import SeleniumURLLoader
documents = SeleniumURLLoader(urls=[url]).load()

Extract#

Use langchain building blocks to assemble whatever pipeline you need for your own purposes.

Create a langchain document with the HTML content.

doc = Document(page_content=response.text)

Convert to markdown

ATTENTION This step is lossy and may end up removing information that’s relevant for extraction. You can always try pushing the raw HTML through if you’re not worried about cost.

md = MarkdownifyHTMLProcessor().process(doc)
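To get a rough sense of how much the conversion shrank the document, you can compare the character counts (illustrative only; the numbers will vary):

len(doc.page_content), len(md.page_content)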

Break the document into chunks so that each chunk fits in the LLM’s context window.

split_docs = RecursiveCharacterTextSplitter().split_documents([md])
print(split_docs[-1].page_content)
Watch the trailer for You

[You

 Latest Episode: Mar 09](/tv/you)

Watch the trailer for She-Hulk: Attorney at Law

[She-Hulk: Attorney at Law](/tv/she_hulk_attorney_at_law)

[Breaking Bad](/tv/breaking_bad)

Watch the trailer for The Lord of the Rings: The Rings of Power

[The Lord of the Rings: The Rings of Power](/tv/the_lord_of_the_rings_the_rings_of_power)

No results

 Reset Filters

 Load more

Close video

See Details

See Details

* [Help](/help_desk)
* [About Rotten Tomatoes](/about)
* [What's the Tomatometer®?](/about#whatisthetomatometer)
* 

* [Critic Submission](/critics/criteria)
* [Licensing](/help_desk/licensing)
* [Advertise With Us](https://together.nbcuni.com/advertise/?utm_source=rotten_tomatoes&utm_medium=referral&utm_campaign=property_ad_pages&utm_content=footer)
* [Careers](//www.fandango.com/careers)

Join The Newsletter

Get the freshest reviews, news, and more delivered right to your inbox!

Join The Newsletter
[Join The Newsletter](https://optout.services.fandango.com/rottentomatoes)

Follow Us

* 
* 
* 
* 
* 

Copyright © Fandango. All rights reserved.

Join Newsletter
[Join Newsletter](https://optout.services.fandango.com/rottentomatoes)
* [Privacy Policy](//www.fandango.com/policies/privacy-policy)
* [Terms and Policies](//www.fandango.com/policies/terms-and-policies)
* [Cookie Settings](javascript:void(0))
* [California Notice](//www.fandango.com/californianotice)
* [Ad Choices](//www.fandango.com/policies/cookies-and-tracking#cookie_management)
* 
* [Accessibility](/faq#accessibility)

* V3.1
* [Privacy Policy](//www.fandango.com/policies/privacy-policy)
* [Terms and Policies](//www.fandango.com/policies/terms-and-policies)
* [Cookie Settings](javascript:void(0))
* [California Notice](//www.fandango.com/californianotice)
* [Ad Choices](//www.fandango.com/policies/cookies-and-tracking#cookie_management)
* [Accessibility](/faq#accessibility)

Copyright © Fandango. All rights reserved.
len(split_docs)
4
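The splitter’s defaults may not suit your model’s context window. RecursiveCharacterTextSplitter accepts chunk_size and chunk_overlap parameters (measured in characters); the values below are just illustrative:

split_docs = RecursiveCharacterTextSplitter(
    chunk_size=2000, chunk_overlap=100
).split_documents([md])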

Run extraction

from langchain.callbacks import get_openai_callback
with get_openai_callback() as cb:
    document_extraction_results = await extract_from_documents(
        chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True
    )
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")
Total Tokens: 5854
Prompt Tokens: 5128
Completion Tokens: 726
Successful Requests: 4
Total Cost (USD): $0.011708000000000001
validated_data = list(
    itertools.chain.from_iterable(
        extraction["validated_data"] for extraction in document_extraction_results
    )
)
len(validated_data)
40
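Because return_exceptions=True was passed, a chunk that failed comes back as an exception object rather than an extraction dict, so you may want to filter those out first. A minimal sketch:

good_results = [
    result
    for result in document_extraction_results
    if not isinstance(result, Exception)
]
validated_data = list(
    itertools.chain.from_iterable(
        extraction["validated_data"] for extraction in good_results
    )
)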

Extraction is not perfect, but you can use a better LLM or provide more examples!

pd.DataFrame(record.dict() for record in validated_data)
|    | name | season | year | latest_episode | link |
|----|------|--------|------|----------------|------|
| 0  | Beef | 1 | | | /tv/beef/s01 |
| 1  | Dave | 3 | | | /tv/dave/s03 |
| 2  | Schmigadoon! | 2 | | | /tv/schmigadoon/s02 |
| 3  | Pretty Baby: Brooke Shields | 1 | | | /tv/pretty_baby_brooke_shields/s01 |
| 4  | Tiny Beautiful Things | 1 | | | /tv/tiny_beautiful_things/s01 |
| 5  | Grease: Rise of the Pink Ladies | 1 | | | /tv/grease_rise_of_the_pink_ladies/s01 |
| 6  | Jury Duty | 1 | | | /tv/jury_duty/s01 |
| 7  | The Crossover | 1 | | | /tv/the_crossover/s01 |
| 8  | Transatlantic | 1 | | | /tv/transatlantic/s01 |
| 9  | Race to Survive: Alaska | 1 | | | /tv/race_to_survive_alaska/s01 |
| 10 | Beef | | | Apr 06 | /tv/beef |
| 11 | The Night Agent | | | Mar 23 | /tv/the_night_agent |
| 12 | Unstable | | | Mar 30 | /tv/unstable |
| 13 | The Mandalorian | | | Apr 05 | /tv/the_mandalorian |
| 14 | The Big Door Prize | | | Apr 05 | /tv/the_big_door_prize |
| 15 | Class of '07 | | | Mar 17 | /tv/class_of_07 |
| 16 | Rabbit Hole | | | Apr 02 | /tv/rabbit_hole |
| 17 | The Power | | | Apr 07 | /tv/the_power |
| 18 | The Last of Us | | | Mar 12 | /tv/the_last_of_us |
| 19 | Yellowjackets | | | Mar 31 | /tv/yellowjackets |
| 20 | Succession | | | Apr 02 | /tv/succession |
| 21 | Lucky Hank | | | Apr 02 | /tv/lucky_hank |
| 22 | Sex/Life | | | Mar 02 | /tv/sex_life |
| 23 | Ted Lasso | | | Apr 05 | /tv/ted_lasso |
| 24 | Wellmania | | | Mar 29 | /tv/wellmania |
| 25 | Daisy Jones & the Six | | | Mar 24 | /tv/daisy_jones_and_the_six |
| 26 | Shadow and Bone | | | Mar 16 | /tv/shadow_and_bone |
| 27 | The Order | | | | /tv/the_order |
| 28 | Shrinking | | | Mar 24 | /tv/shrinking |
| 29 | Swarm | | | Mar 17 | /tv/swarm |
| 30 | The Last Kingdom | | | | /tv/the_last_kingdom |
| 31 | Rain Dogs | | | Apr 03 | /tv/rain_dogs |
| 32 | Extrapolations | | | Apr 07 | /tv/extrapolations |
| 33 | War Sailor | | | Apr 02 | /tv/war_sailor |
| 34 | You | | | Mar 09 | /tv/you |
| 35 | She-Hulk: Attorney at Law | | | | /tv/she_hulk_attorney_at_law |
| 36 | You | | | Mar 09 | /tv/you |
| 37 | She-Hulk: Attorney at Law | | | None | /tv/she_hulk_attorney_at_law |
| 38 | Breaking Bad | | | None | /tv/breaking_bad |
| 39 | The Lord of the Rings: The Rings of Power | | | None | /tv/the_lord_of_the_rings_the_rings_of_power |
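Note that the same show can be extracted more than once (e.g., You and She-Hulk: Attorney at Law both appear twice above), likely because a title shows up in more than one chunk or more than once on the page. If that matters for your use case, a simple dedup could look like this (an illustrative sketch, keyed on name and link):

df = pd.DataFrame(record.dict() for record in validated_data)
df = df.drop_duplicates(subset=["name", "link"])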