Document Extraction#

Here, we’ll be extracting content from a longer document.

The basic workflow is the following:

  1. Load the document

  2. Clean up the document (optional)

  3. Split the document into chunks

  4. Extract from every chunk of text


ATTENTION This is a brute force workflow – there will be an LLM call for every piece of text that is being analyzed. This can be expensive 💰💰💰, so use at your own risk and monitor your costs!


Let’s apply this workflow to an HTML file.

We’ll reduce the HTML to markdown. This is a lossy step that can sometimes improve extraction results and sometimes make them worse.

When scraping HTML, you may need to execute javascript to get the page fully rendered.

Here’s a piece of code that can execute javascript using playwright:

import asyncio

from playwright.async_api import async_playwright


async def a_download_html(url: str, extra_sleep: int) -> str:
    """Download HTML from a URL, rendering it with a headless browser.

    In some pathological cases, an extra sleep period may be needed.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="load")
        if extra_sleep:
            await asyncio.sleep(extra_sleep)
        html_content = await page.content()
        await browser.close()
    return html_content
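The function above is a coroutine: in a notebook you can simply await it, while in a regular script you would drive it with asyncio.run. A minimal sketch (the extra_sleep value here is just an illustrative choice):

html = asyncio.run(
    a_download_html(
        "https://www.rottentomatoes.com/browse/tv_series_browse/sort:popular",
        extra_sleep=2,
    )
)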

Another possibility is to use LangChain’s SeleniumURLLoader: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/url.html#selenium-url-loader


Again this can be expensive 💰💰💰, so use at your own risk and monitor your costs!

from typing import List, Optional
import itertools
import requests

import pandas as pd
from pydantic import BaseModel, Field, validator
from kor import extract_from_documents, from_pydantic, create_extraction_chain
from kor.documents.html import MarkdownifyHTMLProcessor
from langchain.chat_models import ChatOpenAI
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

LLM#

Instantiate an LLM.

Try experimenting with cheaper models (e.g., the smaller davinci variants or gpt-3.5-turbo) before moving to the more expensive text-davinci-003 or gpt-4.

In some cases, providing a better prompt (with more examples) can help make up for using a smaller model.


Quality can vary a lot depending on which LLM is used and how many examples are provided.


# Using gpt-3.5-turbo which is pretty cheap, but has worse quality
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
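If quality is an issue, one option is to swap in a larger model, assuming you have access to it; for example:

# A larger, more expensive model -- usually better extraction quality
llm = ChatOpenAI(model_name="gpt-4", temperature=0)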

Schema#

class ShowOrMovie(BaseModel):
    name: str = Field(
        description="The name of the movie or tv show",
    )
    season: Optional[str] = Field(
        description="Season of TV show. Extract as a digit stripping Season prefix.",
    )
    year: Optional[str] = Field(
        description="Year when the movie / tv show was released",
    )
    latest_episode: Optional[str] = Field(
        description="Date when the latest episode was released",
    )
    link: Optional[str] = Field(description="Link to the movie / tv show.")

    # rating -- not included because on rottentomatoes the rating lives in the
    # HTML elements rather than in the visible text. You could try extracting it
    # from the raw HTML (rather than markdown), or try something similar on imdb.

    @validator("name")
    def name_must_not_be_empty(cls, v):
        if not v:
            raise ValueError("Name must not be empty")
        return v


schema, extraction_validator = from_pydantic(
    ShowOrMovie,
    description="Extract information about popular movies/tv shows including their name, year, link and rating.",
    examples=[
        (
            "[Rain Dogs Latest Episode: Apr 03](/tv/rain_dogs)",
            {"name": "Rain Dogs", "latest_episode": "Apr 03", "link": "/tv/rain_dogs"},
        )
    ],
    many=True,
)
chain = create_extraction_chain(
    llm,
    schema,
    encoder_or_encoder_class="csv",
    validator=extraction_validator,
    input_formatter="triple_quotes",
)
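Before running over the whole document, it can help to sanity-check the chain on a single snippet. A small sketch, assuming the chain exposes predict_and_parse as in other Kor examples (the input text is made up to resemble the site’s markdown):

chain.predict_and_parse(
    text="[The Last of Us Latest Episode: Mar 12](/tv/the_last_of_us)"
)["validated_data"]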

Download#

Let’s download a page containing movies from my favorite movie review site.

url = "https://www.rottentomatoes.com/browse/tv_series_browse/sort:popular"
response = requests.get(url)  # Please see the note above about executing javascript (Playwright / Selenium)!

Remember that in some cases you will need to execute javascript! Here’s a snippet:

from langchain.document_loaders import SeleniumURLLoader
documents = SeleniumURLLoader(urls=[url]).load()

Extract#

Use langchain building blocks to assemble whatever pipeline you need for your own purposes.

Create a langchain document with the HTML content.

doc = Document(page_content=response.text)

Convert to markdown

ATTENTION This step is lossy and may end up removing information that’s relevant for extraction. You can always try pushing the raw HTML through if you’re not worried about cost.

md = MarkdownifyHTMLProcessor().process(doc)
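To get a rough sense of how much the conversion shrank the document, you can compare the character counts (illustrative only; the numbers will vary):

len(doc.page_content), len(md.page_content)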

Break the document into chunks so that each chunk fits in the LLM’s context window.

split_docs = RecursiveCharacterTextSplitter().split_documents([md])
print(split_docs[-1].page_content)
Watch the trailer for You

[You

 Latest Episode: Mar 09](/tv/you)

Watch the trailer for She-Hulk: Attorney at Law

[She-Hulk: Attorney at Law](/tv/she_hulk_attorney_at_law)

[Breaking Bad](/tv/breaking_bad)

Watch the trailer for The Lord of the Rings: The Rings of Power

[The Lord of the Rings: The Rings of Power](/tv/the_lord_of_the_rings_the_rings_of_power)

No results

 Reset Filters

 Load more

Close video

See Details

See Details

* [Help](/help_desk)
* [About Rotten Tomatoes](/about)
* [What's the Tomatometer®?](/about#whatisthetomatometer)
* 

* [Critic Submission](/critics/criteria)
* [Licensing](/help_desk/licensing)
* [Advertise With Us](https://together.nbcuni.com/advertise/?utm_source=rotten_tomatoes&utm_medium=referral&utm_campaign=property_ad_pages&utm_content=footer)
* [Careers](//www.fandango.com/careers)

Join The Newsletter

Get the freshest reviews, news, and more delivered right to your inbox!

Join The Newsletter
[Join The Newsletter](https://optout.services.fandango.com/rottentomatoes)

Follow Us

* 
* 
* 
* 
* 

Copyright © Fandango. All rights reserved.

Join Newsletter
[Join Newsletter](https://optout.services.fandango.com/rottentomatoes)
* [Privacy Policy](//www.fandango.com/policies/privacy-policy)
* [Terms and Policies](//www.fandango.com/policies/terms-and-policies)
* [Cookie Settings](javascript:void(0))
* [California Notice](//www.fandango.com/californianotice)
* [Ad Choices](//www.fandango.com/policies/cookies-and-tracking#cookie_management)
* 
* [Accessibility](/faq#accessibility)

* V3.1
* [Privacy Policy](//www.fandango.com/policies/privacy-policy)
* [Terms and Policies](//www.fandango.com/policies/terms-and-policies)
* [Cookie Settings](javascript:void(0))
* [California Notice](//www.fandango.com/californianotice)
* [Ad Choices](//www.fandango.com/policies/cookies-and-tracking#cookie_management)
* [Accessibility](/faq#accessibility)

Copyright © Fandango. All rights reserved.
len(split_docs)
4
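The splitter’s defaults may not suit your model’s context window. RecursiveCharacterTextSplitter accepts chunk_size and chunk_overlap parameters (measured in characters); the values below are just illustrative:

split_docs = RecursiveCharacterTextSplitter(
    chunk_size=2000, chunk_overlap=100
).split_documents([md])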

Run extraction

from langchain.callbacks import get_openai_callback
with get_openai_callback() as cb:
    document_extraction_results = await extract_from_documents(
        chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True
    )
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")
Total Tokens: 5854
Prompt Tokens: 5128
Completion Tokens: 726
Successful Requests: 4
Total Cost (USD): $0.011708000000000001
validated_data = list(
    itertools.chain.from_iterable(
        extraction["validated_data"] for extraction in document_extraction_results
    )
)
len(validated_data)
40
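Because return_exceptions=True was passed, a chunk that failed comes back as an exception object rather than an extraction dict, so you may want to filter those out first. A minimal sketch:

good_results = [
    result
    for result in document_extraction_results
    if not isinstance(result, Exception)
]
validated_data = list(
    itertools.chain.from_iterable(
        extraction["validated_data"] for extraction in good_results
    )
)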

Extraction is not perfect, but you can use a better LLM or provide more examples!

pd.DataFrame(record.dict() for record in validated_data)
|    | name | season | year | latest_episode | link |
|----|------|--------|------|----------------|------|
| 0  | Beef | 1 | | | /tv/beef/s01 |
| 1  | Dave | 3 | | | /tv/dave/s03 |
| 2  | Schmigadoon! | 2 | | | /tv/schmigadoon/s02 |
| 3  | Pretty Baby: Brooke Shields | 1 | | | /tv/pretty_baby_brooke_shields/s01 |
| 4  | Tiny Beautiful Things | 1 | | | /tv/tiny_beautiful_things/s01 |
| 5  | Grease: Rise of the Pink Ladies | 1 | | | /tv/grease_rise_of_the_pink_ladies/s01 |
| 6  | Jury Duty | 1 | | | /tv/jury_duty/s01 |
| 7  | The Crossover | 1 | | | /tv/the_crossover/s01 |
| 8  | Transatlantic | 1 | | | /tv/transatlantic/s01 |
| 9  | Race to Survive: Alaska | 1 | | | /tv/race_to_survive_alaska/s01 |
| 10 | Beef | | | Apr 06 | /tv/beef |
| 11 | The Night Agent | | | Mar 23 | /tv/the_night_agent |
| 12 | Unstable | | | Mar 30 | /tv/unstable |
| 13 | The Mandalorian | | | Apr 05 | /tv/the_mandalorian |
| 14 | The Big Door Prize | | | Apr 05 | /tv/the_big_door_prize |
| 15 | Class of '07 | | | Mar 17 | /tv/class_of_07 |
| 16 | Rabbit Hole | | | Apr 02 | /tv/rabbit_hole |
| 17 | The Power | | | Apr 07 | /tv/the_power |
| 18 | The Last of Us | | | Mar 12 | /tv/the_last_of_us |
| 19 | Yellowjackets | | | Mar 31 | /tv/yellowjackets |
| 20 | Succession | | | Apr 02 | /tv/succession |
| 21 | Lucky Hank | | | Apr 02 | /tv/lucky_hank |
| 22 | Sex/Life | | | Mar 02 | /tv/sex_life |
| 23 | Ted Lasso | | | Apr 05 | /tv/ted_lasso |
| 24 | Wellmania | | | Mar 29 | /tv/wellmania |
| 25 | Daisy Jones & the Six | | | Mar 24 | /tv/daisy_jones_and_the_six |
| 26 | Shadow and Bone | | | Mar 16 | /tv/shadow_and_bone |
| 27 | The Order | | | | /tv/the_order |
| 28 | Shrinking | | | Mar 24 | /tv/shrinking |
| 29 | Swarm | | | Mar 17 | /tv/swarm |
| 30 | The Last Kingdom | | | | /tv/the_last_kingdom |
| 31 | Rain Dogs | | | Apr 03 | /tv/rain_dogs |
| 32 | Extrapolations | | | Apr 07 | /tv/extrapolations |
| 33 | War Sailor | | | Apr 02 | /tv/war_sailor |
| 34 | You | | | Mar 09 | /tv/you |
| 35 | She-Hulk: Attorney at Law | | | | /tv/she_hulk_attorney_at_law |
| 36 | You | | | Mar 09 | /tv/you |
| 37 | She-Hulk: Attorney at Law | | | None | /tv/she_hulk_attorney_at_law |
| 38 | Breaking Bad | | | None | /tv/breaking_bad |
| 39 | The Lord of the Rings: The Rings of Power | | | None | /tv/the_lord_of_the_rings_the_rings_of_power |
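Note that the same show can be extracted more than once (e.g., You and She-Hulk: Attorney at Law both appear twice above), likely because a title shows up in more than one chunk or more than once on the page. If that matters for your use case, a simple dedup could look like this (an illustrative sketch, keyed on name and link):

df = pd.DataFrame(record.dict() for record in validated_data)
df = df.drop_duplicates(subset=["name", "link"])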