Document Extraction#
Here, we’ll be extracting content from a longer document.
The basic workflow is the following:

1. Load the document
2. Clean up the document (optional)
3. Split the document into chunks
4. Extract from every chunk of text
ATTENTION This is a brute force workflow – there will be an LLM call for every piece of text that is being analyzed. This can be expensive 💰💰💰, so use at your own risk and monitor your costs!
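Before bringing in langchain and Kor, the split-then-extract loop can be sketched in plain Python. This is illustrative only: `split_into_chunks` is a naive fixed-size splitter (the notebook uses `RecursiveCharacterTextSplitter`), and `extract_one` is a placeholder standing in for the per-chunk LLM call.

```python
def split_into_chunks(text: str, chunk_size: int = 20) -> list[str]:
    # Naive fixed-size splitter; RecursiveCharacterTextSplitter is
    # smarter about splitting on natural boundaries.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def extract_one(chunk: str) -> list[dict]:
    # Placeholder for the per-chunk LLM call -- remember, one call per chunk!
    return [{"chunk_preview": chunk[:10]}]


document = "a" * 45  # pretend this is the cleaned-up document
chunks = split_into_chunks(document)
results = [record for chunk in chunks for record in extract_one(chunk)]
print(len(chunks), len(results))  # 3 3
```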
Let’s apply this workflow to an HTML file.
We’ll reduce the HTML to markdown. This is a lossy step; it sometimes improves extraction results and sometimes makes them worse.
When scraping HTML, executing JavaScript may be necessary to get the page fully rendered.
Here’s a helper that renders a page (including its JavaScript) using playwright:
import asyncio

from playwright.async_api import async_playwright


async def a_download_html(url: str, extra_sleep: int) -> str:
    """Download the HTML from a URL, rendering it in a headless browser.

    In some pathological cases, an extra sleep period may be needed.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="load")
        if extra_sleep:
            await asyncio.sleep(extra_sleep)
        html_content = await page.content()
        await browser.close()
        return html_content
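Because `a_download_html` is a coroutine, it has to be driven by an event loop (e.g., `asyncio.run`). Here is the calling pattern with a stub standing in for the playwright version, so it is runnable without launching a browser:

```python
import asyncio


async def a_download_html(url: str, extra_sleep: int) -> str:
    # Stub standing in for the playwright implementation above,
    # just to show the asyncio.run calling pattern.
    if extra_sleep:
        await asyncio.sleep(extra_sleep)
    return f"<html>rendered {url}</html>"


html = asyncio.run(a_download_html("https://example.com", extra_sleep=0))
print(html)  # <html>rendered https://example.com</html>
```

Inside a Jupyter notebook an event loop is already running, so you would `await a_download_html(...)` directly instead of calling `asyncio.run`.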
Another possibility is to use langchain's SeleniumURLLoader: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/url.html#selenium-url-loader
Again this can be expensive 💰💰💰, so use at your own risk and monitor your costs!
from typing import Optional
import itertools
import requests
import pandas as pd
from pydantic import BaseModel, Field, validator
from kor import extract_from_documents, from_pydantic, create_extraction_chain
from kor.documents.html import MarkdownifyHTMLProcessor
from langchain.chat_models import ChatOpenAI
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
LLM#
Instantiate an LLM.
Try experimenting with cheaper models such as gpt-3.5-turbo before reaching for the more expensive text-davinci-003 or gpt-4.
In some cases, providing a better prompt (with more examples) can help make up for using a smaller model.
Quality can vary a lot depending on which LLM is used and how many examples are provided.
# Using gpt-3.5-turbo, which is pretty cheap but has worse quality
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
Schema#
class ShowOrMovie(BaseModel):
    name: str = Field(
        description="The name of the movie or tv show",
    )
    season: Optional[str] = Field(
        description="Season of TV show. Extract as a digit stripping Season prefix.",
    )
    year: Optional[str] = Field(
        description="Year when the movie / tv show was released",
    )
    latest_episode: Optional[str] = Field(
        description="Date when the latest episode was released",
    )
    link: Optional[str] = Field(description="Link to the movie / tv show.")
    # rating -- not included because the rating on rottentomatoes is in the html elements.
    # You could try extracting it by using the raw HTML (rather than markdown),
    # or you could try doing something similar on imdb.

    @validator("name")
    def name_must_not_be_empty(cls, v):
        if not v:
            raise ValueError("Name must not be empty")
        return v
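The `name_must_not_be_empty` validator rejects any extracted record whose name is blank. A quick standalone check of that behavior, using a cut-down copy of the model (pydantic v1-style `@validator`, as in the class above):

```python
from typing import Optional

from pydantic import BaseModel, Field, validator


class Show(BaseModel):
    # Cut-down copy of ShowOrMovie, just enough to exercise the validator.
    name: str = Field(description="The name of the movie or tv show")
    season: Optional[str] = None

    @validator("name")
    def name_must_not_be_empty(cls, v):
        if not v:
            raise ValueError("Name must not be empty")
        return v


print(Show(name="Beef", season="1").name)  # Beef
try:
    Show(name="", season="1")
    rejected = False
except Exception:
    rejected = True
print(rejected)  # True
```

During extraction, records that fail validation are dropped from `validated_data` rather than crashing the run.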
schema, extraction_validator = from_pydantic(
    ShowOrMovie,
    description="Extract information about popular movies/tv shows including their name, year, link and rating.",
    examples=[
        (
            "[Rain Dogs Latest Episode: Apr 03](/tv/rain_dogs)",
            {"name": "Rain Dogs", "latest_episode": "Apr 03", "link": "/tv/rain_dogs"},
        )
    ],
    many=True,
)
chain = create_extraction_chain(
    llm,
    schema,
    encoder_or_encoder_class="csv",
    validator=extraction_validator,
    input_formatter="triple_quotes",
)
Download#
Let’s download a page containing movies from my favorite movie review site.
url = "https://www.rottentomatoes.com/browse/tv_series_browse/sort:popular"
response = requests.get(url)  # Please see comment at top about using Selenium or playwright
Remember that in some cases you will need to execute javascript! Here’s a snippet using langchain's Selenium loader:
from langchain.document_loaders import SeleniumURLLoader

documents = SeleniumURLLoader(urls=[url]).load()
Extract#
Use langchain building blocks to assemble whatever pipeline you need for your own purposes.
Create a langchain document with the HTML content.
doc = Document(page_content=response.text)
Convert to markdown
ATTENTION This step is lossy and may end up removing information that’s relevant for extraction. You can always try pushing the raw HTML through if you’re not worried about cost.
md = MarkdownifyHTMLProcessor().process(doc)
Break the document into chunks so that each chunk fits in the LLM's context window.
split_docs = RecursiveCharacterTextSplitter().split_documents([md])
print(split_docs[-1].page_content)
Watch the trailer for You
[You
Latest Episode: Mar 09](/tv/you)
Watch the trailer for She-Hulk: Attorney at Law
[She-Hulk: Attorney at Law](/tv/she_hulk_attorney_at_law)
[Breaking Bad](/tv/breaking_bad)
Watch the trailer for The Lord of the Rings: The Rings of Power
[The Lord of the Rings: The Rings of Power](/tv/the_lord_of_the_rings_the_rings_of_power)
No results
Reset Filters
Load more
Close video
See Details
See Details
* [Help](/help_desk)
* [About Rotten Tomatoes](/about)
* [What's the Tomatometer®?](/about#whatisthetomatometer)
*
* [Critic Submission](/critics/criteria)
* [Licensing](/help_desk/licensing)
* [Advertise With Us](https://together.nbcuni.com/advertise/?utm_source=rotten_tomatoes&utm_medium=referral&utm_campaign=property_ad_pages&utm_content=footer)
* [Careers](//www.fandango.com/careers)
Join The Newsletter
Get the freshest reviews, news, and more delivered right to your inbox!
Join The Newsletter
[Join The Newsletter](https://optout.services.fandango.com/rottentomatoes)
Follow Us
*
*
*
*
*
Copyright © Fandango. All rights reserved.
Join Newsletter
[Join Newsletter](https://optout.services.fandango.com/rottentomatoes)
* [Privacy Policy](//www.fandango.com/policies/privacy-policy)
* [Terms and Policies](//www.fandango.com/policies/terms-and-policies)
* [Cookie Settings](javascript:void(0))
* [California Notice](//www.fandango.com/californianotice)
* [Ad Choices](//www.fandango.com/policies/cookies-and-tracking#cookie_management)
*
* [Accessibility](/faq#accessibility)
* V3.1
* [Privacy Policy](//www.fandango.com/policies/privacy-policy)
* [Terms and Policies](//www.fandango.com/policies/terms-and-policies)
* [Cookie Settings](javascript:void(0))
* [California Notice](//www.fandango.com/californianotice)
* [Ad Choices](//www.fandango.com/policies/cookies-and-tracking#cookie_management)
* [Accessibility](/faq#accessibility)
Copyright © Fandango. All rights reserved.
len(split_docs)
4
Run extraction
from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    document_extraction_results = await extract_from_documents(
        chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True
    )
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")
Total Tokens: 5854
Prompt Tokens: 5128
Completion Tokens: 726
Successful Requests: 4
Total Cost (USD): $0.011708000000000001
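As a sanity check, the reported cost is consistent with flat per-token pricing (assuming gpt-3.5-turbo's rate of $0.002 per 1K tokens at the time of this run):

```python
total_tokens = 5854
price_per_1k_tokens = 0.002  # USD; assumed gpt-3.5-turbo rate at the time of this run
cost = total_tokens / 1000 * price_per_1k_tokens
print(round(cost, 6))  # 0.011708
```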
validated_data = list(
    itertools.chain.from_iterable(
        extraction["validated_data"] for extraction in document_extraction_results
    )
)
len(validated_data)
40
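`extract_from_documents` returns one result per chunk, so `itertools.chain.from_iterable` is what flattens the per-chunk lists of validated records into a single list. A toy illustration of that flattening step:

```python
import itertools

# One list of validated records per chunk (toy stand-in for the real results).
per_chunk_results = [
    [{"name": "Beef"}],
    [{"name": "You"}, {"name": "Swarm"}],
    [],  # a chunk can also yield nothing
]
flat = list(itertools.chain.from_iterable(per_chunk_results))
print(len(flat))  # 3
```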
Extraction is not perfect, but you can use a better LLM or provide more examples!
pd.DataFrame(record.dict() for record in validated_data)
| | name | season | year | latest_episode | link |
|---|---|---|---|---|---|
| 0 | Beef | 1 | | | /tv/beef/s01 |
| 1 | Dave | 3 | | | /tv/dave/s03 |
| 2 | Schmigadoon! | 2 | | | /tv/schmigadoon/s02 |
| 3 | Pretty Baby: Brooke Shields | 1 | | | /tv/pretty_baby_brooke_shields/s01 |
| 4 | Tiny Beautiful Things | 1 | | | /tv/tiny_beautiful_things/s01 |
| 5 | Grease: Rise of the Pink Ladies | 1 | | | /tv/grease_rise_of_the_pink_ladies/s01 |
| 6 | Jury Duty | 1 | | | /tv/jury_duty/s01 |
| 7 | The Crossover | 1 | | | /tv/the_crossover/s01 |
| 8 | Transatlantic | 1 | | | /tv/transatlantic/s01 |
| 9 | Race to Survive: Alaska | 1 | | | /tv/race_to_survive_alaska/s01 |
| 10 | Beef | | | Apr 06 | /tv/beef |
| 11 | The Night Agent | | | Mar 23 | /tv/the_night_agent |
| 12 | Unstable | | | Mar 30 | /tv/unstable |
| 13 | The Mandalorian | | | Apr 05 | /tv/the_mandalorian |
| 14 | The Big Door Prize | | | Apr 05 | /tv/the_big_door_prize |
| 15 | Class of '07 | | | Mar 17 | /tv/class_of_07 |
| 16 | Rabbit Hole | | | Apr 02 | /tv/rabbit_hole |
| 17 | The Power | | | Apr 07 | /tv/the_power |
| 18 | The Last of Us | | | Mar 12 | /tv/the_last_of_us |
| 19 | Yellowjackets | | | Mar 31 | /tv/yellowjackets |
| 20 | Succession | | | Apr 02 | /tv/succession |
| 21 | Lucky Hank | | | Apr 02 | /tv/lucky_hank |
| 22 | Sex/Life | | | Mar 02 | /tv/sex_life |
| 23 | Ted Lasso | | | Apr 05 | /tv/ted_lasso |
| 24 | Wellmania | | | Mar 29 | /tv/wellmania |
| 25 | Daisy Jones & the Six | | | Mar 24 | /tv/daisy_jones_and_the_six |
| 26 | Shadow and Bone | | | Mar 16 | /tv/shadow_and_bone |
| 27 | The Order | | | | /tv/the_order |
| 28 | Shrinking | | | Mar 24 | /tv/shrinking |
| 29 | Swarm | | | Mar 17 | /tv/swarm |
| 30 | The Last Kingdom | | | | /tv/the_last_kingdom |
| 31 | Rain Dogs | | | Apr 03 | /tv/rain_dogs |
| 32 | Extrapolations | | | Apr 07 | /tv/extrapolations |
| 33 | War Sailor | | | Apr 02 | /tv/war_sailor |
| 34 | You | | | Mar 09 | /tv/you |
| 35 | She-Hulk: Attorney at Law | | | | /tv/she_hulk_attorney_at_law |
| 36 | You | | | Mar 09 | /tv/you |
| 37 | She-Hulk: Attorney at Law | | | None | /tv/she_hulk_attorney_at_law |
| 38 | Breaking Bad | | | None | /tv/breaking_bad |
| 39 | The Lord of the Rings: The Rings of Power | | | None | /tv/the_lord_of_the_rings_the_rings_of_power |
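Chunk boundaries overlap, so some records appear twice in the results (e.g. You and She-Hulk: Attorney at Law). A `drop_duplicates` pass in pandas is one cheap cleanup; here is a sketch with toy rows rather than the full result set:

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"name": "You", "latest_episode": "Mar 09", "link": "/tv/you"},
        {"name": "Swarm", "latest_episode": "Mar 17", "link": "/tv/swarm"},
        {"name": "You", "latest_episode": "Mar 09", "link": "/tv/you"},
    ]
)
deduped = df.drop_duplicates()
print(len(deduped))  # 2
```

Note this only removes exact duplicates; near-duplicates (same show extracted with slightly different fields, like rows 35 and 37 above) would need fuzzier matching, e.g. deduplicating on the link column alone.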