ScrapeGraphAI - LLM and Graph Powered Web Scraping Python Library 📚

ScrapeGraphAI is a web scraping Python library that uses Large Language Models (LLMs) and direct graph logic to create scraping pipelines for websites, documents, and XML files.

Unlike rigid methods that rely on predefined patterns or manual adjustments, ScrapeGraphAI dynamically adapts to variations in website structures.

-------

⚙️ Features:

โŠ Direct Graph Logic:

This feature leverages a graph-based approach to dynamically create scraping pipelines, ensuring efficient data retrieval based on user-defined prompts.

โŠ LLM Integration:

By integrating Large Language Models (LLMs), ScrapeGraphAI interprets user inputs and automates data extraction, removing the need for manual coding.

โŠ Multiple AI Platform Support:

Whether you prefer models from OpenAI, Azure, or Groq, ScrapeGraphAI supports integration with specific API keys and configurations, offering flexibility and choice.

โŠ SpeechGraph

ScrapeGraphAI can scrape information and convert it into voice audio. This unique feature allows providing an accessible and convenient way to interact with the extracted data.

โŠ OmniScraperGraph

An evolution of SmartScraperGraph equipped with image description capabilities. This enhancement enables users to extract images from single web pages and obtain accurate descriptions, enriching the dataset with valuable visual information. (GPT-4o only)
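To make the graph idea under Direct Graph Logic concrete, here is a minimal, self-contained sketch in plain Python. It does not use ScrapeGraphAI and is not how the library is implemented. The node names (`fetch_node`, `parse_node`, `answer_node`), the `Graph` class, the canned HTML, and the `fake_llm` helper are all hypothetical stand-ins. It only shows the pattern: small nodes share a state dictionary, and edges decide which node runs next.

```python
import re
from typing import Callable, Dict, Optional

State = Dict[str, object]
NodeFn = Callable[[State], State]

# Canned page so the sketch runs offline (hypothetical content).
SAMPLE_HTML = "<html><body><h1>Widget</h1><p>Price: 19 USD</p></body></html>"


def fake_llm(prompt: str, text: str) -> Dict[str, str]:
    """Hypothetical stand-in for an LLM: answers price questions with a regex."""
    if "price" in prompt.lower():
        match = re.search(r"Price:\s*(\d+)\s*(\w+)", text)
        if match:
            return {"price": f"{match.group(1)} {match.group(2)}"}
    return {}


def fetch_node(state: State) -> State:
    # A real pipeline would download state["source"]; here we return canned HTML.
    return {**state, "html": SAMPLE_HTML}


def parse_node(state: State) -> State:
    # Replace tags with spaces, then collapse whitespace into plain text.
    text = re.sub(r"<[^>]+>", " ", str(state["html"]))
    return {**state, "text": " ".join(text.split())}


def answer_node(state: State) -> State:
    # Hand the user prompt and the page text to the (fake) model.
    answer = fake_llm(str(state["prompt"]), str(state["text"]))
    return {**state, "answer": answer}


class Graph:
    """Runs nodes in the order given by the edges, starting at the entry node."""

    def __init__(self, nodes: Dict[str, NodeFn], edges: Dict[str, str], entry: str):
        self.nodes = nodes
        self.edges = edges
        self.entry = entry

    def run(self, state: State) -> State:
        current: Optional[str] = self.entry
        while current is not None:
            state = self.nodes[current](state)
            current = self.edges.get(current)  # None ends the pipeline
        return state


pipeline = Graph(
    nodes={"fetch": fetch_node, "parse": parse_node, "answer": answer_node},
    edges={"fetch": "parse", "parse": "answer"},
    entry="fetch",
)

result = pipeline.run({"source": "https://example.com", "prompt": "What is the price?"})
print(result["answer"])
```

Swapping `fake_llm` for a real model call and `fetch_node` for an HTTP request would turn the same graph into a working scraper, and the graph structure itself would stay unchanged.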

-------

Simple Setup and Configuration

Setting up ScrapeGraphAI is straightforward: there is an app built with Streamlit.
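For the configuration side, the sketch below builds a provider-specific settings dictionary for the three platforms named above. The helper `build_graph_config`, the key names (`llm`, `model`, `api_key`, `temperature`), the `provider/model` naming, and the `SCRAPER_API_KEY` environment variable are assumptions for illustration, so check the ScrapeGraphAI documentation for the exact layout your version expects.

```python
import os
from typing import Dict

# Providers named in the feature list above.
SUPPORTED_PROVIDERS = ("openai", "azure", "groq")


def build_graph_config(
    provider: str, model: str, api_key: str, temperature: float = 0.0
) -> Dict[str, Dict[str, object]]:
    """Return a settings dict for one LLM provider (hypothetical layout)."""
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    if not api_key:
        raise ValueError("an API key is required")
    return {
        "llm": {
            "model": f"{provider}/{model}",  # assumed naming scheme
            "api_key": api_key,
            "temperature": temperature,
        }
    }


if __name__ == "__main__":
    # Read the key from the environment instead of hard-coding it.
    key = os.environ.get("SCRAPER_API_KEY", "")
    if key:
        config = build_graph_config("openai", "gpt-4o", key)
        print(config["llm"]["model"])
    else:
        print("Set SCRAPER_API_KEY to build a config.")
```

Keeping the provider choice in one small function means switching from OpenAI to Groq or Azure is a one-line change at the call site.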

Original Article : https://medium.com/@amanatulla1606/llm-web-scraping-with-scrapegraphai-a-breakthrough-in-data-extraction-d6596b282b4d

This post is licensed under CC BY 4.0 by the author.