{ "cells": [ { "cell_type": "markdown", "id": "4b3a0584-b52c-4873-abb8-8382e13ff5c0", "metadata": {}, "source": [ "# Document Extraction\n", "\n", "Here, we'll be extracting content from a longer document.\n", "\n", "\n", "The basic workflow is the following:\n", "\n", "1. Load the document\n", "2. Clean up the document (optional)\n", "3. Split the document into chunks\n", "4. Extract from *every* chunk of text\n", "\n", "-------------\n", "\n", "**ATTENTION** This is a *brute force* workflow -- there will be an LLM call for every piece of text that is being analyzed. \n", "This can be **expensive** 💰💰💰, so use at your own risk and monitor your costs!\n", "\n", "---------------\n", "\n", "Let's apply this workflow to an HTML file.\n", "\n", "We'll reduce the HTML to markdown. This is a lossy step, which can sometimes improve extraction results and sometimes make them worse.\n", "\n", "When scraping HTML, executing JavaScript may be necessary to get the page fully rendered. \n", "\n", "Here's a piece of code that can execute JavaScript using Playwright: \n", "\n", "\n", "```python\n", "import asyncio\n", "\n", "from playwright.async_api import async_playwright\n", "\n", "\n", "async def a_download_html(url: str, extra_sleep: int) -> str:\n", "    \"\"\"Download the HTML of a page from a URL.\n", "\n", "    In some pathological cases, an extra sleep period may be needed.\n", "    \"\"\"\n", "\n", "    async with async_playwright() as p:\n", "        browser = await p.chromium.launch()\n", "        page = await browser.new_page()\n", "        await page.goto(url, wait_until=\"load\")\n", "        if extra_sleep:\n", "            await asyncio.sleep(extra_sleep)\n", "        html_content = await page.content()\n", "        await browser.close()\n", "    return html_content\n", "```\n", "\n", "Another possibility is to use the Selenium URL loader: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/url.html#selenium-url-loader\n", "\n", "---------\n", " \n", "Again, this can be **expensive** 💰💰💰, so use at your own risk and monitor your costs!" 
] }, { "cell_type": "code", "execution_count": 1, "id": "f8536314-f0f3-4bb9-acd6-f2cec4046380", "metadata": { "nbsphinx": "hidden", "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "\n", "import sys\n", "\n", "sys.path.insert(0, \"../../\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "d93b3de7-9b81-456e-acff-4b0df9755c14", "metadata": { "nbsphinx": "hidden", "tags": [] }, "outputs": [], "source": [ "from typing import List, Optional\n", "import itertools\n", "import requests\n", "\n", "import pandas as pd\n", "from pydantic import BaseModel, Field, field_validator\n", "from kor import extract_from_documents, from_pydantic, create_extraction_chain\n", "from kor.documents.html import MarkdownifyHTMLProcessor\n", "from langchain_core.documents import Document\n", "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", "from langchain_openai import ChatOpenAI" ] }, { "cell_type": "markdown", "id": "725a851e-ab91-4aa4-94b5-747bbf096460", "metadata": {}, "source": [ "## LLM\n", "\n", "Instantiate an LLM. \n", "\n", "Try experimenting with a cheaper model such as gpt-4o before moving to a more expensive one such as gpt-4.\n", "\n", "In some cases, providing a better prompt (with more examples) can help make up for using a smaller model." 
] }, { "cell_type": "markdown", "id": "b2aeb931-4725-4ab8-956e-18420d1c9a36", "metadata": { "tags": [] }, "source": [ "\n", "-------------------\n", "\n", "Quality can vary a **lot** depending on which LLM is used and how many examples are provided.\n", "\n", "-------------------" ] }, { "cell_type": "code", "execution_count": 8, "id": "fab20ada-6443-4799-b6e8-faf16a2fb585", "metadata": { "tags": [] }, "outputs": [], "source": [ "llm = ChatOpenAI(temperature=0, model='gpt-4o')" ] }, { "cell_type": "markdown", "id": "cd9eaffc-875d-4b2c-a189-6c7c181fe628", "metadata": {}, "source": [ "## Schema" ] }, { "cell_type": "code", "execution_count": 9, "id": "981938c5-f438-49a0-b511-329d31073a56", "metadata": { "tags": [] }, "outputs": [], "source": [ "\n", "\n", "class ShowOrMovie(BaseModel):\n", "    name: str = Field(\n", "        description=\"The name of the movie or tv show\",\n", "    )\n", "    season: Optional[str] = Field(\n", "        description=\"Season of TV show. Extract as a digit stripping Season prefix.\",\n", "    )\n", "    year: Optional[str] = Field(\n", "        description=\"Year when the movie / tv show was released\",\n", "    )\n", "    latest_episode: Optional[str] = Field(\n", "        description=\"Date when the latest episode was released\",\n", "    )\n", "    link: Optional[str] = Field(description=\"Link to the movie / tv show.\")\n", "\n", "    # rating -- not included because rating on rottentomatoes is in the html elements\n", "    # you could try extracting it by using the raw HTML (rather than markdown)\n", "    # or you could try doing something similar on imdb\n", "\n", "    @field_validator(\"name\")\n", "    def name_must_not_be_empty(cls, v):\n", "        if not v:\n", "            raise ValueError(\"Name must not be empty\")\n", "        return v\n", "\n", "\n", "schema, extraction_validator = from_pydantic(\n", "    ShowOrMovie,\n", "    description=\"Extract information about popular movies/tv shows including their name, year, link and rating.\",\n", "    examples=[\n", "        (\n", "            \"[Rain Dogs Latest Episode: Apr 
03](/tv/rain_dogs)\",\n", "            {\"name\": \"Rain Dogs\", \"latest_episode\": \"Apr 03\", \"link\": \"/tv/rain_dogs\"},\n", "        )\n", "    ],\n", "    many=True,\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "id": "ff57507d-d789-4ae3-8763-0465e8b27686", "metadata": { "tags": [] }, "outputs": [], "source": [ "chain = create_extraction_chain(\n", "    llm,\n", "    schema,\n", "    encoder_or_encoder_class=\"csv\",\n", "    validator=extraction_validator,\n", "    input_formatter=\"triple_quotes\",\n", ")" ] }, { "cell_type": "markdown", "id": "65a2ff9e-951a-46f8-8cb5-d3eab0c28f59", "metadata": {}, "source": [ "## Download\n", "\n", "Let's download a page containing movies from my favorite movie review site." ] }, { "cell_type": "code", "execution_count": 11, "id": "b5bd49b1-0b51-40ff-b34b-3ad7f90423b6", "metadata": { "tags": [] }, "outputs": [], "source": [ "url = \"https://www.rottentomatoes.com/browse/tv_series_browse/sort:popular\"\n", "response = requests.get(url) # Please see the comment at the top about using Selenium or Playwright" ] }, { "cell_type": "markdown", "id": "1226d70c-68ad-49b5-8327-3132a556bbf3", "metadata": {}, "source": [ "Remember that in some cases you will need to execute JavaScript! Here's a snippet:\n", "\n", "```python\n", "from langchain_community.document_loaders import SeleniumURLLoader\n", "\n", "# The loader takes a list of URLs and returns a list of documents\n", "documents = SeleniumURLLoader(urls=[url]).load()\n", "```" ] }, { "cell_type": "markdown", "id": "1963c139-169a-4d07-a2bd-cb2e7dfa171b", "metadata": {}, "source": [ "## Extract\n", "\n", "Use LangChain building blocks to assemble whatever pipeline you need for your own purposes." ] }, { "cell_type": "markdown", "id": "8d4d40c9-6c9b-431d-9bd4-66a00f3de36f", "metadata": {}, "source": [ "Create a LangChain document with the HTML content." 
] }, { "cell_type": "code", "execution_count": 12, "id": "396a16b7-bc45-4035-9848-f995cddbba2f", "metadata": { "tags": [] }, "outputs": [], "source": [ "doc = Document(page_content=response.text)" ] }, { "cell_type": "markdown", "id": "524e1105-f053-4954-9e11-e693820b4823", "metadata": {}, "source": [ "Convert to markdown.\n", "\n", "**ATTENTION** This step is lossy and may end up removing information that's relevant for extraction. You can always try pushing the raw HTML through if you're not worried about cost." ] }, { "cell_type": "code", "execution_count": 13, "id": "22d3229a-aa6f-4586-a527-6f7835800740", "metadata": { "tags": [] }, "outputs": [], "source": [ "md = MarkdownifyHTMLProcessor().process(doc)" ] }, { "cell_type": "markdown", "id": "d98a7554-a3c7-42ea-99a4-d1e44aa9e22c", "metadata": {}, "source": [ "Split the document into chunks so that each chunk fits in the context window." ] }, { "cell_type": "code", "execution_count": 14, "id": "814e4151-f7d7-412d-beab-58ed190b7dd3", "metadata": { "tags": [] }, "outputs": [], "source": [ "split_docs = RecursiveCharacterTextSplitter().split_documents([md])" ] }, { "cell_type": "code", "execution_count": 15, "id": "e663f031-7a61-4a27-b925-1c9a6f365bcc", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Latest Episode: Jul 17](/tv/presumed_innocent)\n", "\n", "Watch the trailer for Sausage Party: Foodtopia\n", "\n", "[52%\n", "\n", " 52%\n", "\n", " Sausage Party: Foodtopia\n", "\n", " Latest Episode: Jul 11](/tv/sausage_party_foodtopia)\n", "\n", "Watch the trailer for Exploding Kittens\n", "\n", "[69%\n", "\n", " 80%\n", "\n", " Exploding Kittens\n", "\n", " Latest Episode: Jul 12](/tv/exploding_kittens)\n", "\n", "Watch the trailer for Kite Man: Hell Yeah!\n", "\n", "[86%\n", "\n", " 100%\n", "\n", " Kite Man: Hell Yeah!\n", "\n", " Latest Episode: Jul 18](/tv/kite_man_hell_yeah)\n", "\n", "Watch the trailer for Vikings: Valhalla\n", "\n", "[96%\n", "\n", " 60%\n", "\n", " Vikings: 
Valhalla\n", "\n", " Latest Episode: Jul 11](/tv/vikings_valhalla)\n", "\n", "Watch the trailer for Marvel's Hit-Monkey\n", "\n", "[82%\n", "\n", " 93%\n", "\n", " Marvel's Hit-Monkey\n", "\n", " Latest Episode: Jul 15](/tv/marvels_hit_monkey)\n", "\n", "Watch the trailer for Snowpiercer\n", "\n", "[75%\n", "\n", " 69%\n", "\n", " Snowpiercer](/tv/snowpiercer)\n", "\n", "Watch the trailer for Shōgun\n", "\n", "[99%\n", "\n", " 92%\n", "\n", " Shōgun\n", "\n", " Latest Episode: Apr 23](/tv/shogun_2024)\n", "\n", "Watch the trailer for True Detective\n", "\n", "[79%\n", "\n", " 57%\n", "\n", " True Detective](/tv/true_detective)\n", "\n", "Watch the trailer for Land of Women\n", "\n", "[89%\n", "\n", " 36%\n", "\n", " Land of Women\n", "\n", " Latest Episode: Jul 17](/tv/land_of_women)\n", "\n", "Watch the trailer for Emperor of Ocean Park\n", "\n", "[50%\n", "\n", " 75%\n", "\n", " Emperor of Ocean Park\n", "\n", " Latest Episode: Jul 14](/tv/emperor_of_ocean_park)\n", "\n", "[86%\n", "\n", " 71%\n", "\n", " A Good Girl's Guide to Murder\n", "\n", " Latest Episode: Jul 01](/tv/a_good_girls_guide_to_murder)\n", "\n", "[75%\n", "\n", " Desperate Lies\n", "\n", " Latest Episode: Jul 05](/tv/desperate_lies)\n", "\n", "Watch the trailer for Simone Biles: Rising\n", "\n", "[100%\n", "\n", " Simone Biles: Rising\n", "\n", " Latest Episode: Jul 17](/tv/simone_biles_rising)\n", "\n", "Watch the trailer for Fool Me Once\n", "\n", "[69%\n", "\n", " 45%\n", "\n", " Fool Me Once](/tv/fool_me_once)\n", "\n", "Watch the trailer for Dear Child\n", "\n", "[100%\n", "\n", " 84%\n", "\n", " Dear Child](/tv/dear_child)\n", "\n", "Watch the trailer for Dark Matter\n", "\n", "[82%\n", "\n", " 82%\n", "\n", " Dark Matter\n", "\n", " Latest Episode: Jun 26](/tv/dark_matter_2024)\n", "\n", "Watch the trailer for The Serpent Queen\n", "\n", "[100%\n", "\n", " 92%\n", "\n", " The Serpent Queen\n", "\n", " Latest Episode: Jul 19](/tv/the_serpent_queen)\n", "\n", " Load more\n", "\n", "Close 
video\n", "\n", "See Details\n", "\n", "See Details\n", "\n", "* [Help](/help_desk)\n", "* [About Rotten Tomatoes](/about)\n", "* [What's the Tomatometer®?](/about#whatisthetomatometer)\n", "* \n", "\n", "* [Critic Submission](/critics/criteria)\n", "* [Licensing](/help_desk/licensing)\n", "* [Advertise With Us](https://together.nbcuni.com/advertise/?utm_source=rotten_tomatoes&utm_medium=referral&utm_campaign=property_ad_pages&utm_content=footer)\n", "* [Careers](//www.fandango.com/careers)\n", "\n", " Join the Newsletter\n", "\n", "Get the freshest reviews, news, and more delivered right to your inbox!\n", "\n", " Join The Newsletter\n", "\n", "Join The Newsletter\n", "\n", "Follow Us\n", "\n", "Copyright © Fandango. All rights reserved.\n", "\n", "Join The Newsletter\n", "Join The Newsletter\n", "* [Privacy Policy](https://www.nbcuniversal.com/fandango-privacy-policy)\n", "* [Terms and Policies](/policies/terms-and-policies)\n", "* \n", "* [California Notice](https://www.nbcuniversal.com/privacy/california-consumer-privacy-act)\n", "* [Ad Choices](https://www.nbcuniversal.com/privacy/cookies#accordionheader2)\n", "* \n", "* [Accessibility](/faq#accessibility)\n", "\n", "* V3.1\n", "* [Privacy Policy](https://www.nbcuniversal.com/fandango-privacy-policy)\n", "* [Terms and Policies](/policies/terms-and-policies)\n", "* \n", "* [California Notice](https://www.nbcuniversal.com/privacy/california-consumer-privacy-act)\n", "* [Ad Choices](https://www.nbcuniversal.com/privacy/cookies#accordionheader2)\n", "* [Accessibility](/faq#accessibility)\n", "\n", "Copyright © Fandango. 
All rights reserved.\n" ] } ], "source": [ "print(split_docs[-1].page_content)" ] }, { "cell_type": "code", "execution_count": 16, "id": "ec72cb5b-1fcb-4381-85e8-f060ea1f0077", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(split_docs)" ] }, { "cell_type": "markdown", "id": "de953c9a-cb03-4ac8-8ccd-a32bde1f5f61", "metadata": {}, "source": [ "Run extraction" ] }, { "cell_type": "code", "execution_count": 17, "id": "1d17ebaa-8222-48df-82d9-42173a60f5f7", "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain_community.callbacks import get_openai_callback" ] }, { "cell_type": "code", "execution_count": 18, "id": "bdb85b12-25b9-477c-bf7e-5407812f4807", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total Tokens: 6344\n", "Prompt Tokens: 5448\n", "Completion Tokens: 896\n", "Successful Requests: 4\n", "Total Cost (USD): $0.0\n" ] } ], "source": [ "with get_openai_callback() as cb:\n", "    document_extraction_results = await extract_from_documents(\n", "        chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True\n", "    )\n", "    print(f\"Total Tokens: {cb.total_tokens}\")\n", "    print(f\"Prompt Tokens: {cb.prompt_tokens}\")\n", "    print(f\"Completion Tokens: {cb.completion_tokens}\")\n", "    print(f\"Successful Requests: {cb.successful_requests}\")\n", "    print(f\"Total Cost (USD): ${cb.total_cost}\")" ] }, { "cell_type": "code", "execution_count": 19, "id": "d00f86ed-48c3-46c3-8d37-49de50817d93", "metadata": { "tags": [] }, "outputs": [], "source": [ "validated_data = list(\n", "    itertools.chain.from_iterable(\n", "        extraction[\"validated_data\"] for extraction in document_extraction_results\n", "    )\n", ")" ] }, { "cell_type": "code", "execution_count": 20, "id": "89e35b77-aac4-4a32-9515-9ebe9bfef046", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "47" ] 
}, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(validated_data)" ] }, { "cell_type": "markdown", "id": "4b5f6bb8-656f-4d92-b9d7-3dfce538705b", "metadata": {}, "source": [ "Extraction is not perfect, but you can use a better LLM or provide more examples!" ] }, { "cell_type": "code", "execution_count": 21, "id": "75f3e35e-d6a9-4c0d-a5c3-a509eaf7d6ec", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name season year latest_episode \\\n", "0 Twisters July 2024 \n", "1 Longlegs July 2024 \n", "2 National Anthem July 2024 \n", "3 Cobra Kai 6 July 2024 \n", "4 Cobra Kai 6 \n", "5 Kite Man: Hell Yeah! 1 \n", "6 Simone Biles: Rising 1 \n", "7 Lady in the Lake 1 \n", "8 Marvel's Hit-Monkey 2 \n", "9 Those About to Die 1 \n", "10 Emperor of Ocean Park 1 \n", "11 Mafia Spies 1 \n", "12 The Ark 2 \n", "13 Unprisoned 2 \n", "14 Star Wars: The Acolyte 1 \n", "15 The Boys 4 \n", "16 Supacell 1 \n", "17 The Bear 3 \n", "18 Presumed Innocent 1 \n", "19 Sunny 1 \n", "20 Cobra Kai 6 Jul 18 \n", "21 Star Wars: The Acolyte Jul 16 \n", "22 The Boys Jul 18 \n", "23 Supacell Jun 27 \n", "24 Sunny Jul 17 \n", "25 Those About to Die Jul 19 \n", "26 Cobra Kai Jul 18 \n", "27 The Bear Jun 26 \n", "28 House of the Dragon Jul 14 \n", "29 Lady in the Lake Jul 19 \n", "30 My Lady Jane Jun 27 \n", "31 Presumed Innocent Jul 17 \n", "32 Sausage Party: Foodtopia Jul 11 \n", "33 Presumed Innocent Jul 17 \n", "34 Sausage Party: Foodtopia Jul 11 \n", "35 Exploding Kittens Jul 12 \n", "36 Kite Man: Hell Yeah! 
Jul 18 \n", "37 Vikings: Valhalla Jul 11 \n", "38 Marvel's Hit-Monkey Jul 15 \n", "39 Shōgun Apr 23 \n", "40 Land of Women Jul 17 \n", "41 Emperor of Ocean Park Jul 14 \n", "42 A Good Girl's Guide to Murder Jul 01 \n", "43 Desperate Lies Jul 05 \n", "44 Simone Biles: Rising Jul 17 \n", "45 Dark Matter Jun 26 \n", "46 The Serpent Queen Jul 19 \n", "\n", " link \n", "0 /m/twisters \n", "1 /m/longlegs \n", "2 /m/national_anthem \n", "3 /tv/cobra_kai/s06 \n", "4 /tv/cobra_kai/s06 \n", "5 /tv/kite_man_hell_yeah/s01 \n", "6 /tv/simone_biles_rising/s01 \n", "7 /tv/lady_in_the_lake/s01 \n", "8 /tv/marvels_hit_monkey/s02 \n", "9 /tv/those_about_to_die/s01 \n", "10 /tv/emperor_of_ocean_park/s01 \n", "11 /tv/mafia_spies/s01 \n", "12 /tv/the_ark/s02 \n", "13 /tv/unprisoned/s02 \n", "14 /tv/star_wars_the_acolyte/s01 \n", "15 /tv/the_boys_2019/s04 \n", "16 /tv/supacell/s01 \n", "17 /tv/the_bear/s03 \n", "18 /tv/presumed_innocent/s01 \n", "19 /tv/sunny/s01 \n", "20 https://editorial.rottentomatoes.com/article/c... 
\n", "21 /tv/star_wars_the_acolyte \n", "22 /tv/the_boys_2019 \n", "23 /tv/supacell \n", "24 /tv/sunny \n", "25 /tv/those_about_to_die \n", "26 /tv/cobra_kai \n", "27 /tv/the_bear \n", "28 /tv/house_of_the_dragon \n", "29 /tv/lady_in_the_lake \n", "30 /tv/my_lady_jane \n", "31 /tv/presumed_innocent \n", "32 /tv/sausage_party_foodtopia \n", "33 /tv/presumed_innocent \n", "34 /tv/sausage_party_foodtopia \n", "35 /tv/exploding_kittens \n", "36 /tv/kite_man_hell_yeah \n", "37 /tv/vikings_valhalla \n", "38 /tv/marvels_hit_monkey \n", "39 /tv/shogun_2024 \n", "40 /tv/land_of_women \n", "41 /tv/emperor_of_ocean_park \n", "42 /tv/a_good_girls_guide_to_murder \n", "43 /tv/desperate_lies \n", "44 /tv/simone_biles_rising \n", "45 /tv/dark_matter_2024 \n", "46 /tv/the_serpent_queen " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(record.dict() for record in validated_data)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }