Document Extraction
Here, we’ll be extracting content from a longer document.
The basic workflow is the following:
1. Load the document
2. Clean up the document (optional)
3. Split the document into chunks
4. Extract from every chunk of text
ATTENTION This is a brute force workflow – there will be an LLM call for every piece of text that is being analyzed. This can be expensive 💰💰💰, so use at your own risk and monitor your costs!
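To get a feel for the cost before running anything, here is a minimal sketch that estimates how many LLM calls a document will trigger. It assumes fixed-size chunks with a `chunk_size` of 4000 characters and a `chunk_overlap` of 200, and uses the rough heuristic of about 4 characters per token for English text. Both helper functions are hypothetical, and the token estimate ignores the instructions and examples that are sent along with every call.

```python
import math


def estimate_llm_calls(num_chars: int, chunk_size: int = 4000, chunk_overlap: int = 200) -> int:
    """Roughly estimate how many chunks (one LLM call each) a text of num_chars yields."""
    if num_chars <= 0:
        return 0
    if num_chars <= chunk_size:
        return 1
    # Every chunk after the first advances by chunk_size - chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((num_chars - chunk_size) / step)


def estimate_document_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Very rough token count for the document text alone (heuristic, not a tokenizer)."""
    return math.ceil(num_chars / chars_per_token)


calls = estimate_llm_calls(15_000)
tokens = estimate_document_tokens(15_000)
print(f"~{calls} LLM calls, ~{tokens} document tokens")
```

Treat the numbers as an order-of-magnitude check only; a real splitter cuts on separators, so actual chunk counts will differ.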
Let’s apply this workflow to an HTML file.
We’ll reduce HTML to markdown. This is a lossy step, which can sometimes improve extraction results, and sometimes make extraction worse.
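To illustrate what "lossy" means here, the toy reducer below keeps only text and link targets, so anything stored in HTML attributes disappears. It is a sketch built on the standard library `html.parser`, not how `MarkdownifyHTMLProcessor` works internally, and the `data-score` attribute in the sample HTML is made up for the example.

```python
from html.parser import HTMLParser
from typing import List, Optional


class LinkOnlyMarkdown(HTMLParser):
    """Toy HTML-to-markdown reducer: keeps text and link targets, drops everything else."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append(f"]({self._href})")
            self._href = None

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def to_markdown(html: str) -> str:
    parser = LinkOnlyMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# The score lives in an attribute (hypothetical markup), so it does not survive.
sample = '<a href="/tv/rain_dogs"><span data-score="92"></span>Rain Dogs</a>'
print(to_markdown(sample))  # [Rain Dogs](/tv/rain_dogs)
```

The link and title survive, but the score is gone, which is the kind of loss that can make extraction worse for fields that only exist in the markup.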
When scraping HTML, executing JavaScript may be necessary to get the page fully rendered.
Here’s a piece of code that can execute JavaScript using Playwright:
import asyncio

from playwright.async_api import async_playwright


async def a_download_html(url: str, extra_sleep: int) -> str:
    """Download the HTML from a URL.

    In some pathological cases, an extra sleep period may be needed.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="load")
        if extra_sleep:
            # Give late-running scripts time to finish rendering.
            await asyncio.sleep(extra_sleep)
        html_content = await page.content()
        await browser.close()
    return html_content
Another possibility is to use the Selenium URL loader: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/url.html#selenium-url-loader
Again this can be expensive 💰💰💰, so use at your own risk and monitor your costs!
from typing import List, Optional
import itertools
import requests
import pandas as pd
from pydantic import BaseModel, Field, field_validator
from kor import extract_from_documents, from_pydantic, create_extraction_chain
from kor.documents.html import MarkdownifyHTMLProcessor
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
LLM#
Instantiate an LLM.
Start with a cheaper model such as gpt-4o (used below) before trying a more expensive one such as gpt-4.
In some cases, providing a better prompt (with more examples) can help make up for using a smaller model.
Quality can vary a lot depending on which LLM is used and how many examples are provided.
llm = ChatOpenAI(temperature=0, model='gpt-4o')
Schema#
class ShowOrMovie(BaseModel):
    name: str = Field(
        description="The name of the movie or tv show",
    )
    season: Optional[str] = Field(
        description="Season of TV show. Extract as a digit stripping Season prefix.",
    )
    year: Optional[str] = Field(
        description="Year when the movie / tv show was released",
    )
    latest_episode: Optional[str] = Field(
        description="Date when the latest episode was released",
    )
    link: Optional[str] = Field(description="Link to the movie / tv show.")

    # rating -- not included because rating on rottentomatoes is in the html elements
    # you could try extracting it by using the raw HTML (rather than markdown)
    # or you could try doing something similar on imdb

    @field_validator("name")
    @classmethod
    def name_must_not_be_empty(cls, v):
        if not v:
            raise ValueError("Name must not be empty")
        return v
schema, extraction_validator = from_pydantic(
    ShowOrMovie,
    description="Extract information about popular movies/tv shows including their name, year, link and rating.",
    examples=[
        (
            "[Rain Dogs Latest Episode: Apr 03](/tv/rain_dogs)",
            {"name": "Rain Dogs", "latest_episode": "Apr 03", "link": "/tv/rain_dogs"},
        )
    ],
    many=True,
)
chain = create_extraction_chain(
    llm,
    schema,
    encoder_or_encoder_class="csv",
    validator=extraction_validator,
    input_formatter="triple_quotes",
)
Download#
Let’s download a page containing movies from my favorite movie review site.
url = "https://www.rottentomatoes.com/browse/tv_series_browse/sort:popular"
response = requests.get(url)  # Please see comment at top about using Selenium or Playwright
Remember that in some cases you will need to execute JavaScript! Here’s a snippet that uses Selenium:
from langchain_community.document_loaders import SeleniumURLLoader
documents = SeleniumURLLoader(urls=[url]).load()  # takes a list of URLs, returns a list of Documents
Extract#
Use langchain building blocks to assemble whatever pipeline you need for your own purposes.
Create a langchain document with the HTML content.
doc = Document(page_content=response.text)
Convert to markdown
ATTENTION This step is lossy and may end up removing information that’s relevant for extraction. You can always try pushing the raw HTML through if you’re not worried about cost.
md = MarkdownifyHTMLProcessor().process(doc)
Break the document into chunks so that each one fits in the context window
split_docs = RecursiveCharacterTextSplitter().split_documents([md])
print(split_docs[-1].page_content)
Latest Episode: Jul 17](/tv/presumed_innocent)
Watch the trailer for Sausage Party: Foodtopia
[52%
52%
Sausage Party: Foodtopia
Latest Episode: Jul 11](/tv/sausage_party_foodtopia)
Watch the trailer for Exploding Kittens
[69%
80%
Exploding Kittens
Latest Episode: Jul 12](/tv/exploding_kittens)
Watch the trailer for Kite Man: Hell Yeah!
[86%
100%
Kite Man: Hell Yeah!
Latest Episode: Jul 18](/tv/kite_man_hell_yeah)
Watch the trailer for Vikings: Valhalla
[96%
60%
Vikings: Valhalla
Latest Episode: Jul 11](/tv/vikings_valhalla)
Watch the trailer for Marvel's Hit-Monkey
[82%
93%
Marvel's Hit-Monkey
Latest Episode: Jul 15](/tv/marvels_hit_monkey)
Watch the trailer for Snowpiercer
[75%
69%
Snowpiercer](/tv/snowpiercer)
Watch the trailer for Shōgun
[99%
92%
Shōgun
Latest Episode: Apr 23](/tv/shogun_2024)
Watch the trailer for True Detective
[79%
57%
True Detective](/tv/true_detective)
Watch the trailer for Land of Women
[89%
36%
Land of Women
Latest Episode: Jul 17](/tv/land_of_women)
Watch the trailer for Emperor of Ocean Park
[50%
75%
Emperor of Ocean Park
Latest Episode: Jul 14](/tv/emperor_of_ocean_park)
[86%
71%
A Good Girl's Guide to Murder
Latest Episode: Jul 01](/tv/a_good_girls_guide_to_murder)
[75%
Desperate Lies
Latest Episode: Jul 05](/tv/desperate_lies)
Watch the trailer for Simone Biles: Rising
[100%
Simone Biles: Rising
Latest Episode: Jul 17](/tv/simone_biles_rising)
Watch the trailer for Fool Me Once
[69%
45%
Fool Me Once](/tv/fool_me_once)
Watch the trailer for Dear Child
[100%
84%
Dear Child](/tv/dear_child)
Watch the trailer for Dark Matter
[82%
82%
Dark Matter
Latest Episode: Jun 26](/tv/dark_matter_2024)
Watch the trailer for The Serpent Queen
[100%
92%
The Serpent Queen
Latest Episode: Jul 19](/tv/the_serpent_queen)
Load more
Close video
See Details
See Details
* [Help](/help_desk)
* [About Rotten Tomatoes](/about)
* [What's the Tomatometer®?](/about#whatisthetomatometer)
*
* [Critic Submission](/critics/criteria)
* [Licensing](/help_desk/licensing)
* [Advertise With Us](https://together.nbcuni.com/advertise/?utm_source=rotten_tomatoes&utm_medium=referral&utm_campaign=property_ad_pages&utm_content=footer)
* [Careers](//www.fandango.com/careers)
Join the Newsletter
Get the freshest reviews, news, and more delivered right to your inbox!
Join The Newsletter
Join The Newsletter
Follow Us
Copyright © Fandango. All rights reserved.
Join The Newsletter
Join The Newsletter
* [Privacy Policy](https://www.nbcuniversal.com/fandango-privacy-policy)
* [Terms and Policies](/policies/terms-and-policies)
*
* [California Notice](https://www.nbcuniversal.com/privacy/california-consumer-privacy-act)
* [Ad Choices](https://www.nbcuniversal.com/privacy/cookies#accordionheader2)
*
* [Accessibility](/faq#accessibility)
* V3.1
* [Privacy Policy](https://www.nbcuniversal.com/fandango-privacy-policy)
* [Terms and Policies](/policies/terms-and-policies)
*
* [California Notice](https://www.nbcuniversal.com/privacy/california-consumer-privacy-act)
* [Ad Choices](https://www.nbcuniversal.com/privacy/cookies#accordionheader2)
* [Accessibility](/faq#accessibility)
Copyright © Fandango. All rights reserved.
len(split_docs)
4
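To make the chunking step more concrete, here is a simplified sketch of splitting with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to cut on separators such as paragraph breaks); `split_with_overlap` is a hypothetical fixed-width helper that only shows how neighbouring chunks share text at their boundaries.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-width chunks where each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Because of the shared boundary text, an entry that sits near a chunk boundary can be seen by the LLM twice, once in each neighbouring chunk.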
Run extraction
from langchain_community.callbacks import get_openai_callback

with get_openai_callback() as cb:
    document_extraction_results = await extract_from_documents(
        chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True
    )
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")
Total Tokens: 6344
Prompt Tokens: 5448
Completion Tokens: 896
Successful Requests: 4
Total Cost (USD): $0.0
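A reported cost of $0.0 for 6344 tokens is worth double-checking. One possible explanation, which is an assumption and not something confirmed here, is that the callback had no pricing entry for the model. As a fallback, the cost can be estimated by hand from the token counts above. The per-1K-token prices in this sketch are placeholders, not actual OpenAI prices; substitute the current rates for your model.

```python
def cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    prompt_price_per_1k: float,
    completion_price_per_1k: float,
) -> float:
    """Compute cost from token counts and per-1K-token prices."""
    return (
        prompt_tokens / 1000 * prompt_price_per_1k
        + completion_tokens / 1000 * completion_price_per_1k
    )


# Placeholder prices (hypothetical): look up the real rates for your model.
PROMPT_PRICE_PER_1K = 0.005
COMPLETION_PRICE_PER_1K = 0.015

estimated = cost_usd(5448, 896, PROMPT_PRICE_PER_1K, COMPLETION_PRICE_PER_1K)
print(f"Estimated cost (USD): ${estimated:.4f}")
```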
validated_data = list(
    itertools.chain.from_iterable(
        extraction["validated_data"] for extraction in document_extraction_results
    )
)
len(validated_data)
47
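The extraction call above passes `max_concurrency=5` and `return_exceptions=True`. The sketch below shows the general pattern those names suggest, using only `asyncio`: a semaphore caps the number of calls in flight, and `asyncio.gather(..., return_exceptions=True)` returns exception objects in place of results for failed items. This is an illustration of the pattern, not kor's implementation; `run_with_limit` and `fake_extract` are hypothetical names.

```python
import asyncio


async def run_with_limit(items, worker, max_concurrency: int):
    """Run worker(item) for every item, with at most max_concurrency running at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # With return_exceptions=True, failures come back as exception objects, in input order.
    return await asyncio.gather(*(guarded(i) for i in items), return_exceptions=True)


async def fake_extract(chunk: str) -> dict:
    """Hypothetical stand-in for one LLM extraction call."""
    if not chunk:
        raise ValueError("empty chunk")
    return {"validated_data": [chunk.upper()]}


results = asyncio.run(run_with_limit(["a", "", "b"], fake_extract, max_concurrency=2))
succeeded = [r for r in results if not isinstance(r, Exception)]
failed = [r for r in results if isinstance(r, Exception)]
print(len(succeeded), len(failed))  # 2 1
```

If any chunk had failed in the real run, the flattening step above would need the same `isinstance` filter before indexing into `"validated_data"`. In a notebook, use `await run_with_limit(...)` instead of `asyncio.run`.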
Extraction is not perfect, but you can use a better LLM or provide more examples!
pd.DataFrame(record.model_dump() for record in validated_data)
name | season | year | latest_episode | link | |
---|---|---|---|---|---|
0 | Twisters | July 2024 | /m/twisters | ||
1 | Longlegs | July 2024 | /m/longlegs | ||
2 | National Anthem | July 2024 | /m/national_anthem | ||
3 | Cobra Kai | 6 | July 2024 | /tv/cobra_kai/s06 | |
4 | Cobra Kai | 6 | /tv/cobra_kai/s06 | ||
5 | Kite Man: Hell Yeah! | 1 | /tv/kite_man_hell_yeah/s01 | ||
6 | Simone Biles: Rising | 1 | /tv/simone_biles_rising/s01 | ||
7 | Lady in the Lake | 1 | /tv/lady_in_the_lake/s01 | ||
8 | Marvel's Hit-Monkey | 2 | /tv/marvels_hit_monkey/s02 | ||
9 | Those About to Die | 1 | /tv/those_about_to_die/s01 | ||
10 | Emperor of Ocean Park | 1 | /tv/emperor_of_ocean_park/s01 | ||
11 | Mafia Spies | 1 | /tv/mafia_spies/s01 | ||
12 | The Ark | 2 | /tv/the_ark/s02 | ||
13 | Unprisoned | 2 | /tv/unprisoned/s02 | ||
14 | Star Wars: The Acolyte | 1 | /tv/star_wars_the_acolyte/s01 | ||
15 | The Boys | 4 | /tv/the_boys_2019/s04 | ||
16 | Supacell | 1 | /tv/supacell/s01 | ||
17 | The Bear | 3 | /tv/the_bear/s03 | ||
18 | Presumed Innocent | 1 | /tv/presumed_innocent/s01 | ||
19 | Sunny | 1 | /tv/sunny/s01 | ||
20 | Cobra Kai | 6 | Jul 18 | https://editorial.rottentomatoes.com/article/c... | |
21 | Star Wars: The Acolyte | Jul 16 | /tv/star_wars_the_acolyte | ||
22 | The Boys | Jul 18 | /tv/the_boys_2019 | ||
23 | Supacell | Jun 27 | /tv/supacell | ||
24 | Sunny | Jul 17 | /tv/sunny | ||
25 | Those About to Die | Jul 19 | /tv/those_about_to_die | ||
26 | Cobra Kai | Jul 18 | /tv/cobra_kai | ||
27 | The Bear | Jun 26 | /tv/the_bear | ||
28 | House of the Dragon | Jul 14 | /tv/house_of_the_dragon | ||
29 | Lady in the Lake | Jul 19 | /tv/lady_in_the_lake | ||
30 | My Lady Jane | Jun 27 | /tv/my_lady_jane | ||
31 | Presumed Innocent | Jul 17 | /tv/presumed_innocent | ||
32 | Sausage Party: Foodtopia | Jul 11 | /tv/sausage_party_foodtopia | ||
33 | Presumed Innocent | Jul 17 | /tv/presumed_innocent | ||
34 | Sausage Party: Foodtopia | Jul 11 | /tv/sausage_party_foodtopia | ||
35 | Exploding Kittens | Jul 12 | /tv/exploding_kittens | ||
36 | Kite Man: Hell Yeah! | Jul 18 | /tv/kite_man_hell_yeah | ||
37 | Vikings: Valhalla | Jul 11 | /tv/vikings_valhalla | ||
38 | Marvel's Hit-Monkey | Jul 15 | /tv/marvels_hit_monkey | ||
39 | Shōgun | Apr 23 | /tv/shogun_2024 | ||
40 | Land of Women | Jul 17 | /tv/land_of_women | ||
41 | Emperor of Ocean Park | Jul 14 | /tv/emperor_of_ocean_park | ||
42 | A Good Girl's Guide to Murder | Jul 01 | /tv/a_good_girls_guide_to_murder | ||
43 | Desperate Lies | Jul 05 | /tv/desperate_lies | ||
44 | Simone Biles: Rising | Jul 17 | /tv/simone_biles_rising | ||
45 | Dark Matter | Jun 26 | /tv/dark_matter_2024 | ||
46 | The Serpent Queen | Jul 19 | /tv/the_serpent_queen |
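Some entries in the table above appear more than once (for example rows 31 and 33, and rows 32 and 34), possibly because neighbouring chunks overlap. A minimal sketch for dropping such repeats is shown below; `dedupe_records` is a hypothetical helper that works on plain dicts, and treating `name` plus `link` as the identity of a record is an assumption you may want to change.

```python
from typing import Dict, Iterable, List, Tuple


def dedupe_records(records: Iterable[Dict], keys: Tuple[str, ...] = ("name", "link")) -> List[Dict]:
    """Keep the first record for each distinct combination of the given keys."""
    seen = set()
    unique: List[Dict] = []
    for record in records:
        fingerprint = tuple(record.get(k) for k in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


rows = [
    {"name": "Presumed Innocent", "latest_episode": "Jul 17", "link": "/tv/presumed_innocent"},
    {"name": "Sausage Party: Foodtopia", "latest_episode": "Jul 11", "link": "/tv/sausage_party_foodtopia"},
    {"name": "Presumed Innocent", "latest_episode": "Jul 17", "link": "/tv/presumed_innocent"},
]
unique_rows = dedupe_records(rows)
print(len(unique_rows))  # 2
```

Rows that describe the same show with different links (a season page versus the show page) are kept as separate records by this key choice.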