Document Extraction#

Here, we’ll be extracting content from a longer document.

The basic workflow is the following:

  1. Load the document

  2. Clean up the document (optional)

  3. Split the document into chunks

  4. Extract from every chunk of text


ATTENTION This is a brute force workflow – there will be an LLM call for every piece of text that is being analyzed. This can be expensive 💰💰💰, so use at your own risk and monitor your costs!
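To get a rough sense of the bill before running anything, the cost can be estimated from the document length and the chunk size. The sketch below is only a back-of-the-envelope calculation. The function name, the characters-per-token ratio, the per-chunk prompt overhead and the prices are all placeholder assumptions, not values taken from this tutorial or from any provider's price list.

```python
import math
from typing import Tuple


def estimate_cost_usd(
    num_chars: int,
    chunk_size: int,
    prompt_overhead_tokens: int,
    completion_tokens_per_chunk: int,
    usd_per_1k_prompt_tokens: float,
    usd_per_1k_completion_tokens: float,
    chars_per_token: float = 4.0,  # assumption: rough average, varies by model and text
) -> Tuple[int, float]:
    """Return (number of LLM calls, estimated cost in USD)."""
    # One LLM call per chunk (chunk overlap is ignored in this estimate).
    num_chunks = math.ceil(num_chars / chunk_size)
    # Every call pays for its share of the document plus the fixed prompt
    # (instructions, schema description, examples).
    prompt_tokens = num_chars / chars_per_token + num_chunks * prompt_overhead_tokens
    completion_tokens = num_chunks * completion_tokens_per_chunk
    cost = (
        prompt_tokens / 1000 * usd_per_1k_prompt_tokens
        + completion_tokens / 1000 * usd_per_1k_completion_tokens
    )
    return num_chunks, cost


# All numbers below are made up for illustration.
calls, cost = estimate_cost_usd(
    num_chars=16_000,
    chunk_size=4_000,
    prompt_overhead_tokens=500,
    completion_tokens_per_chunk=250,
    usd_per_1k_prompt_tokens=0.01,
    usd_per_1k_completion_tokens=0.03,
)
print(f"{calls} calls, about ${cost:.2f}")
```

Because the prompt overhead is paid once per chunk, smaller chunks mean more calls and a larger share of tokens spent on repeated instructions and examples.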


Let’s apply this workflow to an HTML file.

We’ll reduce HTML to markdown. This is a lossy step, which can sometimes improve extraction results, and sometimes make extraction worse.

When scraping HTML, executing JavaScript may be necessary to get the page fully rendered.

Here’s a piece of code that can execute JavaScript using Playwright:

import asyncio

from playwright.async_api import async_playwright


async def a_download_html(url: str, extra_sleep: int) -> str:
    """Download the HTML of a page from a URL.

    In some pathological cases, an extra sleep period (in seconds) may be needed.
    """

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="load")
        if extra_sleep:
            await asyncio.sleep(extra_sleep)
        html_content = await page.content()
        await browser.close()
    return html_content

Another possibility is to use: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/url.html#selenium-url-loader


Again, this can be expensive 💰💰💰, so use at your own risk and monitor your costs!

from typing import List, Optional
import itertools
import requests

import pandas as pd
from pydantic import BaseModel, Field, field_validator
from kor import extract_from_documents, from_pydantic, create_extraction_chain
from kor.documents.html import MarkdownifyHTMLProcessor
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI

LLM#

Instantiate an LLM.

Try experimenting with the cheaper davinci models or with gpt-4o before trying the more expensive davinci-003 or gpt-4.

In some cases, providing a better prompt (with more examples) can help make up for using a smaller model.


Quality can vary a lot depending on which LLM is used and how many examples are provided.


llm = ChatOpenAI(temperature=0, model='gpt-4o')

Schema#

class ShowOrMovie(BaseModel):
    name: str = Field(
        description="The name of the movie or tv show",
    )
    season: Optional[str] = Field(
        description="Season of TV show. Extract as a digit stripping Season prefix.",
    )
    year: Optional[str] = Field(
        description="Year when the movie / tv show was released",
    )
    latest_episode: Optional[str] = Field(
        description="Date when the latest episode was released",
    )
    link: Optional[str] = Field(description="Link to the movie / tv show.")

    # rating -- not included because rating on rottentomatoes is in the html elements
    # you could try extracting it by using the raw HTML (rather than markdown)
    # or you could try doing something similar on imdb

    @field_validator("name")
    def name_must_not_be_empty(cls, v):
        if not v:
            raise ValueError("Name must not be empty")
        return v


schema, extraction_validator = from_pydantic(
    ShowOrMovie,
    description="Extract information about popular movies/tv shows including their name, year, link and rating.",
    examples=[
        (
            "[Rain Dogs Latest Episode: Apr 03](/tv/rain_dogs)",
            {"name": "Rain Dogs", "latest_episode": "Apr 03", "link": "/tv/rain_dogs"},
        )
    ],
    many=True,
)
chain = create_extraction_chain(
    llm,
    schema,
    encoder_or_encoder_class="csv",
    validator=extraction_validator,
    input_formatter="triple_quotes",
)

Download#

Let’s download a page containing movies from my favorite movie review site.

url = "https://www.rottentomatoes.com/browse/tv_series_browse/sort:popular"
response = requests.get(url)  # See the note at the top about using Selenium or Playwright

Remember that in some cases you will need to execute JavaScript! Here’s a snippet:

from langchain_community.document_loaders import SeleniumURLLoader

# SeleniumURLLoader expects a list of URLs and returns a list of documents
documents = SeleniumURLLoader(urls=[url]).load()

Extract#

Use LangChain building blocks to assemble whatever pipeline you need for your own purposes.

Create a LangChain document with the HTML content.

doc = Document(page_content=response.text)

Convert to markdown

ATTENTION This step is lossy and may end up removing information that’s relevant for extraction. You can always try pushing the raw HTML through if you’re not worried about cost.

md = MarkdownifyHTMLProcessor().process(doc)
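To make the lossiness concrete, here is a small standard-library sketch of what an HTML-to-markdown reduction can throw away. It is not how MarkdownifyHTMLProcessor works internally. The `LinkAndTextExtractor` class, the HTML fragment and the `data-score` attribute are all invented for illustration. The point is that a value stored only in an attribute (such as the rating mentioned in the schema comment) disappears once tags are reduced to text and links.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Keep visible text and links; drop every other tag and attribute."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        # Only <a> tags survive, rendered as markdown links.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


# Hypothetical fragment: the score lives in an attribute, not in visible text.
html = '<a href="/tv/rain_dogs"><span data-score="91"></span><span>Rain Dogs</span></a>'

parser = LinkAndTextExtractor()
parser.feed(html)
markdown = "".join(parser.parts)
print(markdown)  # [Rain Dogs](/tv/rain_dogs)
```

The name and the link come through, but the score does not, so an LLM working from the reduced text has no way to extract it.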

Break the document into chunks so that each one fits in the context window

split_docs = RecursiveCharacterTextSplitter().split_documents([md])
print(split_docs[-1].page_content)
Latest Episode: Jul 17](/tv/presumed_innocent)

Watch the trailer for Sausage Party: Foodtopia

[52%

 52%

 Sausage Party: Foodtopia

 Latest Episode: Jul 11](/tv/sausage_party_foodtopia)

Watch the trailer for Exploding Kittens

[69%

 80%

 Exploding Kittens

 Latest Episode: Jul 12](/tv/exploding_kittens)

Watch the trailer for Kite Man: Hell Yeah!

[86%

 100%

 Kite Man: Hell Yeah!

 Latest Episode: Jul 18](/tv/kite_man_hell_yeah)

Watch the trailer for Vikings: Valhalla

[96%

 60%

 Vikings: Valhalla

 Latest Episode: Jul 11](/tv/vikings_valhalla)

Watch the trailer for Marvel's Hit-Monkey

[82%

 93%

 Marvel's Hit-Monkey

 Latest Episode: Jul 15](/tv/marvels_hit_monkey)

Watch the trailer for Snowpiercer

[75%

 69%

 Snowpiercer](/tv/snowpiercer)

Watch the trailer for Shōgun

[99%

 92%

 Shōgun

 Latest Episode: Apr 23](/tv/shogun_2024)

Watch the trailer for True Detective

[79%

 57%

 True Detective](/tv/true_detective)

Watch the trailer for Land of Women

[89%

 36%

 Land of Women

 Latest Episode: Jul 17](/tv/land_of_women)

Watch the trailer for Emperor of Ocean Park

[50%

 75%

 Emperor of Ocean Park

 Latest Episode: Jul 14](/tv/emperor_of_ocean_park)

[86%

 71%

 A Good Girl's Guide to Murder

 Latest Episode: Jul 01](/tv/a_good_girls_guide_to_murder)

[75%

 Desperate Lies

 Latest Episode: Jul 05](/tv/desperate_lies)

Watch the trailer for Simone Biles: Rising

[100%

 Simone Biles: Rising

 Latest Episode: Jul 17](/tv/simone_biles_rising)

Watch the trailer for Fool Me Once

[69%

 45%

 Fool Me Once](/tv/fool_me_once)

Watch the trailer for Dear Child

[100%

 84%

 Dear Child](/tv/dear_child)

Watch the trailer for Dark Matter

[82%

 82%

 Dark Matter

 Latest Episode: Jun 26](/tv/dark_matter_2024)

Watch the trailer for The Serpent Queen

[100%

 92%

 The Serpent Queen

 Latest Episode: Jul 19](/tv/the_serpent_queen)

 Load more

Close video

See Details

See Details

* [Help](/help_desk)
* [About Rotten Tomatoes](/about)
* [What's the Tomatometer®?](/about#whatisthetomatometer)
* 

* [Critic Submission](/critics/criteria)
* [Licensing](/help_desk/licensing)
* [Advertise With Us](https://together.nbcuni.com/advertise/?utm_source=rotten_tomatoes&utm_medium=referral&utm_campaign=property_ad_pages&utm_content=footer)
* [Careers](//www.fandango.com/careers)

 Join the Newsletter

Get the freshest reviews, news, and more delivered right to your inbox!

 Join The Newsletter

Join The Newsletter

Follow Us

Copyright © Fandango. All rights reserved.

Join The Newsletter
Join The Newsletter
* [Privacy Policy](https://www.nbcuniversal.com/fandango-privacy-policy)
* [Terms and Policies](/policies/terms-and-policies)
* 
* [California Notice](https://www.nbcuniversal.com/privacy/california-consumer-privacy-act)
* [Ad Choices](https://www.nbcuniversal.com/privacy/cookies#accordionheader2)
* 
* [Accessibility](/faq#accessibility)

* V3.1
* [Privacy Policy](https://www.nbcuniversal.com/fandango-privacy-policy)
* [Terms and Policies](/policies/terms-and-policies)
* 
* [California Notice](https://www.nbcuniversal.com/privacy/california-consumer-privacy-act)
* [Ad Choices](https://www.nbcuniversal.com/privacy/cookies#accordionheader2)
* [Accessibility](/faq#accessibility)

Copyright © Fandango. All rights reserved.
len(split_docs)
4
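The splitter above was used with its default settings. As a rough mental model of chunking with overlap, consider the simplified sketch below. It is not the algorithm used by RecursiveCharacterTextSplitter, which prefers to break on separators such as paragraph and line breaks. The helper `split_with_overlap` is hypothetical and cuts at fixed character offsets.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Text near a chunk boundary appears in two neighbouring chunks, so the same item may be extracted twice. That is worth keeping in mind when reading the extraction results.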

Run extraction

from langchain_community.callbacks import get_openai_callback
with get_openai_callback() as cb:
    document_extraction_results = await extract_from_documents(
        chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True
    )
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")
Total Tokens: 6344
Prompt Tokens: 5448
Completion Tokens: 896
Successful Requests: 4
Total Cost (USD): $0.0
validated_data = list(
    itertools.chain.from_iterable(
        extraction["validated_data"] for extraction in document_extraction_results
    )
)
len(validated_data)
47

Extraction is not perfect, but you can use a better LLM or provide more examples!

pd.DataFrame(record.model_dump() for record in validated_data)
name season year latest_episode link
0 Twisters July 2024 /m/twisters
1 Longlegs July 2024 /m/longlegs
2 National Anthem July 2024 /m/national_anthem
3 Cobra Kai 6 July 2024 /tv/cobra_kai/s06
4 Cobra Kai 6 /tv/cobra_kai/s06
5 Kite Man: Hell Yeah! 1 /tv/kite_man_hell_yeah/s01
6 Simone Biles: Rising 1 /tv/simone_biles_rising/s01
7 Lady in the Lake 1 /tv/lady_in_the_lake/s01
8 Marvel's Hit-Monkey 2 /tv/marvels_hit_monkey/s02
9 Those About to Die 1 /tv/those_about_to_die/s01
10 Emperor of Ocean Park 1 /tv/emperor_of_ocean_park/s01
11 Mafia Spies 1 /tv/mafia_spies/s01
12 The Ark 2 /tv/the_ark/s02
13 Unprisoned 2 /tv/unprisoned/s02
14 Star Wars: The Acolyte 1 /tv/star_wars_the_acolyte/s01
15 The Boys 4 /tv/the_boys_2019/s04
16 Supacell 1 /tv/supacell/s01
17 The Bear 3 /tv/the_bear/s03
18 Presumed Innocent 1 /tv/presumed_innocent/s01
19 Sunny 1 /tv/sunny/s01
20 Cobra Kai 6 Jul 18 https://editorial.rottentomatoes.com/article/c...
21 Star Wars: The Acolyte Jul 16 /tv/star_wars_the_acolyte
22 The Boys Jul 18 /tv/the_boys_2019
23 Supacell Jun 27 /tv/supacell
24 Sunny Jul 17 /tv/sunny
25 Those About to Die Jul 19 /tv/those_about_to_die
26 Cobra Kai Jul 18 /tv/cobra_kai
27 The Bear Jun 26 /tv/the_bear
28 House of the Dragon Jul 14 /tv/house_of_the_dragon
29 Lady in the Lake Jul 19 /tv/lady_in_the_lake
30 My Lady Jane Jun 27 /tv/my_lady_jane
31 Presumed Innocent Jul 17 /tv/presumed_innocent
32 Sausage Party: Foodtopia Jul 11 /tv/sausage_party_foodtopia
33 Presumed Innocent Jul 17 /tv/presumed_innocent
34 Sausage Party: Foodtopia Jul 11 /tv/sausage_party_foodtopia
35 Exploding Kittens Jul 12 /tv/exploding_kittens
36 Kite Man: Hell Yeah! Jul 18 /tv/kite_man_hell_yeah
37 Vikings: Valhalla Jul 11 /tv/vikings_valhalla
38 Marvel's Hit-Monkey Jul 15 /tv/marvels_hit_monkey
39 Shōgun Apr 23 /tv/shogun_2024
40 Land of Women Jul 17 /tv/land_of_women
41 Emperor of Ocean Park Jul 14 /tv/emperor_of_ocean_park
42 A Good Girl's Guide to Murder Jul 01 /tv/a_good_girls_guide_to_murder
43 Desperate Lies Jul 05 /tv/desperate_lies
44 Simone Biles: Rising Jul 17 /tv/simone_biles_rising
45 Dark Matter Jun 26 /tv/dark_matter_2024
46 The Serpent Queen Jul 19 /tv/the_serpent_queen
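
Some rows in the table are exact repeats (for example rows 31 and 33, and rows 32 and 34), possibly because the same text was seen in more than one chunk. A minimal sketch of one way to drop such repeats is shown below. The `deduplicate` helper and the choice of `name` and `link` as the key are assumptions made for illustration, and the sample rows are copied from the table above.

```python
from typing import Dict, Iterable, List, Tuple


def deduplicate(
    records: Iterable[Dict[str, str]], key_fields: Tuple[str, ...]
) -> List[Dict[str, str]]:
    """Keep the first record for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


rows = [
    {"name": "Presumed Innocent", "latest_episode": "Jul 17", "link": "/tv/presumed_innocent"},
    {"name": "Sausage Party: Foodtopia", "latest_episode": "Jul 11", "link": "/tv/sausage_party_foodtopia"},
    {"name": "Presumed Innocent", "latest_episode": "Jul 17", "link": "/tv/presumed_innocent"},
]
unique_rows = deduplicate(rows, key_fields=("name", "link"))
print(len(unique_rows))  # 2
```

With the extraction results, the input would be the validated records converted to dictionaries. On a pandas DataFrame, `drop_duplicates` with a `subset` of columns achieves the same thing.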