
ScrapeGraphAI - LLM and Graph Powered Web Scraping Python Library 📚

ScrapeGraphAI: LLM and Graph-Powered Web Scraping

Curiosity: How can we make web scraping more intelligent and adaptive? What happens when we combine LLMs with graph-based logic for data extraction?

ScrapeGraphAI is a Python library that uses Large Language Models (LLMs) and direct graph logic to build scraping pipelines for websites, documents, and XML files. Unlike rigid methods that rely on predefined patterns or manual adjustments, it adapts dynamically to changes in website structure.

Framework Overview

graph TB
    A[ScrapeGraphAI] --> B[LLM Integration]
    A --> C[Graph Logic]
    A --> D[Multiple Platforms]
    
    B --> B1[Intelligent Extraction]
    C --> C1[Dynamic Pipelines]
    D --> D1[OpenAI/Azure/Groq]
    
    E[Websites] --> A
    F[Documents] --> A
    G[XML Files] --> A
    A --> H[Extracted Data]
    
    style A fill:#e1f5ff
    style B fill:#fff3cd
    style C fill:#d4edda
    style H fill:#f8d7da

Key Features

| Feature | Description | Benefit |
| --- | --- | --- |
| Direct Graph Logic | Graph-based pipeline creation | ⬆️ Flexibility |
| LLM Integration | Intelligent data extraction | ⬆️ Accuracy |
| Multi-Platform Support | OpenAI, Azure, Groq | ⬆️ Choice |
| SpeechGraph | Voice audio conversion | ⬆️ Accessibility |
| OmniScraperGraph | Image description (GPT-4o) | ⬆️ Rich data |

1. Direct Graph Logic

Retrieve: A graph-based approach builds the scraping pipeline dynamically from a user-defined prompt.

How It Works:

  • User defines extraction goals
  • Graph logic creates pipeline
  • Adapts to website structure
  • Efficient data retrieval

Architecture:

graph LR
    A[User Prompt] --> B[Graph Builder]
    B --> C[Pipeline Nodes]
    C --> D[Extraction Logic]
    D --> E[Data Output]
    
    F[Website Structure] --> B
    G[LLM Analysis] --> C
    
    style A fill:#e1f5ff
    style B fill:#fff3cd
    style E fill:#d4edda
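
The flow in the diagram can be sketched in plain Python as a small graph of nodes that pass a shared state along its edges. This is an illustration of the idea only, not ScrapeGraphAI's internal code: `Node`, `Graph`, and the three node functions are hypothetical names, the fetch step returns canned HTML instead of making a request, and the final node stands in for the LLM call.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, object]


@dataclass
class Node:
    """One pipeline step: takes the shared state and returns an updated copy."""
    name: str
    run: Callable[[State], State]


@dataclass
class Graph:
    """A linear graph: each node name maps to the name of the next node."""
    nodes: List[Node]
    edges: Dict[str, str]
    entry: str

    def execute(self, state: State) -> State:
        by_name = {node.name: node for node in self.nodes}
        current: Optional[str] = self.entry
        while current is not None:
            state = by_name[current].run(state)
            current = self.edges.get(current)  # None ends the walk
        return state


def fetch(state: State) -> State:
    # Stand-in for an HTTP fetch: returns canned HTML (assumption for the demo).
    return {**state, "html": "<h1>Widget</h1><p>Price: 9.99</p>"}


def parse(state: State) -> State:
    # Strip tags and collapse whitespace to get plain text for the model.
    text = re.sub(r"<[^>]+>", " ", str(state["html"]))
    return {**state, "text": " ".join(text.split())}


def answer(state: State) -> State:
    # Stand-in for the LLM node: it only pairs the prompt with the page text.
    return {**state, "answer": {"prompt": state["prompt"], "context": state["text"]}}


graph = Graph(
    nodes=[Node("fetch", fetch), Node("parse", parse), Node("answer", answer)],
    edges={"fetch": "parse", "parse": "answer"},
    entry="fetch",
)
result = graph.execute({"prompt": "Extract the product name and price"})
print(result["answer"])
```

Because each step only reads and extends the state, a node can be swapped or inserted by changing `nodes` and `edges` without touching the others.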

2. LLM Integration

Innovate: LLMs interpret user inputs and automate data extraction, eliminating manual coding.

Capabilities:

  • Natural language prompts
  • Automatic structure understanding
  • Intelligent field extraction
  • Context-aware parsing

Example:

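from scrapegraphai.graphs import SmartScraperGraph

# LLM settings; the "provider/model" string format depends on the installed version
graph_config = {
    "llm": {
        "model": "openai/gpt-4",
        "api_key": "your-api-key",
    },
}

# Natural language prompt plus the page to scrape
scraper = SmartScraperGraph(
    prompt="Extract all product names and prices",
    source="https://example.com",
    config=graph_config,
)

result = scraper.run()
print(result)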
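
To make the "natural language prompt" step more concrete, here is a minimal sketch of how a user prompt and page text might be combined into a single model request and how the reply could be parsed. It is an assumption about the general technique, not the library's actual prompt or parser: `build_extraction_prompt`, `extract`, and `fake_llm` are hypothetical, and `fake_llm` returns a fixed string so the example runs without an API key.

```python
import json
from typing import Callable, Dict


def build_extraction_prompt(user_prompt: str, page_text: str) -> str:
    """Combine the user's goal and the page content into one instruction."""
    return (
        "You are a data extraction assistant.\n"
        f"Task: {user_prompt}\n"
        "Answer with a JSON object only.\n"
        f"Page content:\n{page_text}"
    )


def extract(user_prompt: str, page_text: str, llm: Callable[[str], str]) -> Dict[str, object]:
    """Send the combined prompt to the model and parse its JSON reply."""
    raw = llm(build_extraction_prompt(user_prompt, page_text))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Models do not always return valid JSON, so keep the raw text for debugging.
        return {"error": "model did not return valid JSON", "raw": raw}


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; always returns the same JSON.
    return '{"products": [{"name": "Widget", "price": "9.99"}]}'


data = extract("Extract all product names and prices", "Widget - $9.99", fake_llm)
print(data)
```

Passing the model in as a callable keeps the extraction logic independent of any one provider.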

3. Multiple AI Platform Support

Retrieve: Flexible integration with various LLM providers.

| Platform | Support | Features |
| --- | --- | --- |
| OpenAI | ✅ Full | GPT-4, GPT-3.5 |
| Azure | ✅ Full | Azure OpenAI |
| Groq | ✅ Full | Fast inference |

Configuration:

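# The provider is selected through the "llm" section of the graph config.
# Model strings and extra fields vary by library version; treat these as templates.

# OpenAI
openai_config = {
    "llm": {"model": "openai/gpt-4", "api_key": "openai-key"},
}

# Azure OpenAI: the endpoint key name is an assumption; some versions read the
# endpoint, deployment, and API version from AZURE_OPENAI_* environment variables
azure_config = {
    "llm": {
        "model": "azure_openai/gpt-4",
        "api_key": "azure-key",
        "azure_endpoint": "azure-endpoint",
    },
}

# Groq (Llama 3 served by Groq; substitute a model ID your account offers)
groq_config = {
    "llm": {"model": "groq/llama3-70b-8192", "api_key": "groq-key"},
}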
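
When switching between providers often, it can help to build the `llm` settings from one place and read keys from the environment instead of hard-coding them. The helper below is a sketch under stated assumptions: `build_llm_config` and the `PROVIDERS` table are hypothetical, and the model strings and environment variable names are placeholders to check against the installed library version and your own deployment.

```python
import os
from typing import Dict, Mapping, Optional, Tuple

# provider name -> (model string, environment variable holding the API key)
# All values here are illustrative placeholders, not verified defaults.
PROVIDERS: Dict[str, Tuple[str, str]] = {
    "openai": ("openai/gpt-4", "OPENAI_API_KEY"),
    "azure": ("azure_openai/gpt-4", "AZURE_OPENAI_API_KEY"),
    "groq": ("groq/llama3-70b-8192", "GROQ_API_KEY"),
}


def build_llm_config(
    provider: str, env: Optional[Mapping[str, str]] = None
) -> Dict[str, Dict[str, str]]:
    """Return a config dict with an "llm" section for the chosen provider."""
    env = os.environ if env is None else env
    if provider not in PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    model, key_var = PROVIDERS[provider]
    api_key = env.get(key_var)
    if not api_key:
        raise KeyError(f"missing environment variable {key_var}")
    return {"llm": {"model": model, "api_key": api_key}}


# Pass an explicit mapping for the demo; omit `env` to read os.environ instead.
config = build_llm_config("groq", env={"GROQ_API_KEY": "demo-key"})
print(config["llm"]["model"])
```

Failing early on a missing key gives a clearer error than a rejected request halfway through a scrape.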

4. SpeechGraph

Innovate: Convert scraped information into voice audio for accessible interaction.

Features:

  • Text-to-speech conversion
  • Audio output
  • Accessible data interaction
  • Convenient consumption

Use Cases:

  • Accessibility applications
  • Audio content creation
  • Hands-free data access
  • Multimodal interfaces
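
A SpeechGraph run needs settings for both the language model and the text-to-speech model, plus a path for the audio file. The sketch below reflects my understanding of the library's published SpeechGraph example; the key names (`tts_model`, `output_path`), the `tts-1` model, and the `alloy` voice should be verified against the version you install. `build_speech_config` and `run_speech_graph` are hypothetical helper names, and the library import is deferred so the config helper can be inspected without the package or an API key.

```python
from typing import Dict


def build_speech_config(api_key: str, output_path: str = "summary.mp3") -> Dict[str, object]:
    """Assemble LLM and text-to-speech settings (key names are assumptions)."""
    return {
        "llm": {"api_key": api_key, "model": "openai/gpt-4o"},
        "tts_model": {"api_key": api_key, "model": "tts-1", "voice": "alloy"},
        "output_path": output_path,  # where the generated audio is written
    }


def run_speech_graph(url: str, prompt: str, api_key: str) -> dict:
    """Scrape `url`, summarize it per `prompt`, and write the audio file."""
    # Imported here so the module loads even when scrapegraphai is not installed.
    from scrapegraphai.graphs import SpeechGraph

    graph = SpeechGraph(prompt=prompt, source=url, config=build_speech_config(api_key))
    return graph.run()


config = build_speech_config("your-api-key")
print(sorted(config))
```

Calling `run_speech_graph("https://example.com", "Make an audio summary of this page.", api_key)` would perform the actual scrape and audio generation, which requires network access and a valid key.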

5. OmniScraperGraph

Retrieve: Enhanced scraping with image description capabilities (GPT-4o only).

Capabilities:

  • Extract images from web pages
  • Generate accurate descriptions
  • Enrich datasets with visual information
  • Multimodal data extraction

Example:

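from scrapegraphai.graphs import OmniScraperGraph

# OmniScraperGraph needs a vision-capable model (GPT-4o)
graph_config = {
    "llm": {
        "model": "openai/gpt-4o",
        "api_key": "your-api-key",
    },
    "max_images": 5,  # upper bound on images sent for description
}

# Ask for images and their descriptions in the prompt
omni_scraper = OmniScraperGraph(
    prompt="List the products with their images and a description of each image",
    source="https://example.com",
    config=graph_config,
)

# The answer combines text content with image links and image descriptions
result = omni_scraper.run()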

Setup and Configuration

Retrieve: Simple setup with Streamlit app for easy configuration.

Quick Start:

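# Install (shell)
pip install scrapegraphai
playwright install  # browser binaries used to fetch pages

# Basic usage (Python)
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "openai/gpt-4",
        "api_key": "your-key",
    },
}

# Scrape
scraper = SmartScraperGraph(
    prompt="Extract all article titles",
    source="https://example.com",
    config=graph_config,
)
data = scraper.run()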

Streamlit App:

  • Visual interface
  • Easy configuration
  • Interactive scraping
  • Real-time results

Comparison: Traditional vs. ScrapeGraphAI

| Aspect | Traditional Scraping | ScrapeGraphAI |
| --- | --- | --- |
| Adaptability | ⚠️ Rigid patterns | ✅ Dynamic |
| Setup | ⚠️ Manual coding | ✅ LLM-powered |
| Maintenance | ⚠️ High | ✅ Low |
| Intelligence | ❌ Pattern-based | ✅ LLM-based |
| Multimodal | ❌ Text only | ✅ Text + Images |
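
The "rigid patterns" row is easy to reproduce. The sketch below shows a traditional pattern-based extractor that silently returns nothing once a CSS class is renamed; the HTML snippets and the `extract_prices` helper are made up for illustration.

```python
import re
from typing import List

# A traditional extractor tied to one exact piece of markup.
PRICE_PATTERN = re.compile(r'<span class="price">([^<]+)</span>')


def extract_prices(html: str) -> List[str]:
    """Return every price found by the hard-coded pattern."""
    return PRICE_PATTERN.findall(html)


old_layout = '<div><span class="price">$9.99</span></div>'
new_layout = '<div><span class="product-price">$9.99</span></div>'

print(extract_prices(old_layout))  # ['$9.99']
print(extract_prices(new_layout))  # [] because the class name changed
```

The page still shows the same price, but the extractor has to be rewritten for the new markup. A prompt such as "Extract all product names and prices" names the goal rather than the markup, which is the maintenance difference the table describes.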

Use Cases

Innovate: Apply ScrapeGraphAI to various data extraction scenarios.

Common Use Cases:

  • E-commerce product extraction
  • News article scraping
  • Research data collection
  • Content aggregation
  • Competitive analysis

Key Takeaways

Retrieve: ScrapeGraphAI combines LLMs with graph logic to create intelligent, adaptive web scraping pipelines that dynamically adjust to website structures.

Innovate: By leveraging LLM intelligence and graph-based pipelines, ScrapeGraphAI eliminates manual coding and pattern maintenance, making web scraping more accessible and robust.

Curiosity → Retrieve → Innovation: Start with curiosity about intelligent scraping, retrieve insights from ScrapeGraphAI's approach, and innovate by applying it to your data extraction needs.

Original Article: https://medium.com/@amanatulla1606/llm-web-scraping-with-scrapegraphai-a-breakthrough-in-data-extraction-d6596b282b4d

Next Steps:

  • Explore ScrapeGraphAI documentation
  • Try the Streamlit app
  • Experiment with different LLM providers
  • Build your scraping pipelines


Summary (translated from Korean)

ScrapeGraphAI is a powerful web scraping Python library that uses LLMs (Large Language Models) and direct graph logic to create scraping pipelines for websites, documents, and XML files.

Unlike rigid methods that rely on predefined patterns or manual adjustments, ScrapeGraphAI adapts dynamically to changes in website structure.

-------

⚙️ Features:

  • Direct graph logic:

This feature uses a graph-based approach to create scraping pipelines dynamically, ensuring efficient data retrieval based on user-defined prompts.

  • LLM integration:

ScrapeGraphAI integrates large language models (LLMs) to interpret user input and automate data extraction, removing the need for manual coding.

  • Multiple AI platform support:

Whether you prefer models from OpenAI, Azure, or Groq, ScrapeGraphAI supports integration with the relevant API keys and configuration, giving you flexibility and choice.

  • SpeechGraph

ScrapeGraphAI can scrape information and convert it into voice audio. This feature offers an accessible and convenient way to interact with the extracted data.

  • OmniScraperGraph

An evolution of SmartScraperGraph with image description capability. With this enhancement, users can extract images from a single web page and get accurate descriptions, enriching datasets with valuable visual information. (GPT-4o only)

-------

Simple setup and configuration

Setting up ScrapeGraphAI is simple: there is an app built with Streamlit.


This post is licensed under CC BY 4.0 by the author.