Build an On-Device Scraper: Running Generative AI Pipelines on a Raspberry Pi 5 with the AI HAT+ 2

webscraper
2026-01-21 12:00:00
11 min read

Step-by-step guide to run a privacy-preserving scraper + summarizer on Raspberry Pi 5 with AI HAT+ 2. Includes Python code, quantization, and deployment tips.

Keep scraping private: Run a full scraper + summarizer on Raspberry Pi 5 with the AI HAT+ 2

Privacy, cost and operational simplicity are top concerns for engineers who need recurring data from sensitive sources. What if you could run a production-grade scraper and generative summarization pipeline entirely on-device — no cloud, no third‑party LLM APIs, and predictable costs? In 2026, the Raspberry Pi 5 paired with the $130 AI HAT+ 2 makes that realistic for many projects.

Why this matters in 2026 (short answer)

Edge AI hardware and toolchains matured through late 2025 and early 2026. Vendors shipped ARM-friendly runtimes, quantized model support, and SDKs that expose NPUs from Python. That means you can run on-device inference for 7B-class instruct models with acceptable latency on consumer hardware. For privacy-preserving scraping — regulatory or contractual constraints often require data to never leave your premises — this combination is a game changer. See our companion playbooks on behind-the-edge workflows for operational tactics and device-level CI.

What you'll build

  • A lightweight async Python scraper that respects robots.txt and uses adaptive throttling.
  • A text-cleaning and normalization step that produces high-quality prompts.
  • An on-device summarization pipeline that runs on AI HAT+ 2 using a quantized local model.
  • Simple persistence (SQLite / Parquet) and a systemd service for continuous operation.

Requirements & assumptions

  • Raspberry Pi 5 (8GB recommended for 7B-class models) running Raspberry Pi OS 64-bit (Bookworm or later, updated in 2025+).
  • AI HAT+ 2 attached and its vendor drivers/SDK installed (released late 2025, updated in early 2026 to support Python bindings).
  • Basic Linux and Python 3.11+ knowledge.
  • Familiarity with model licensing — use weights you are allowed to run locally.

High-level architecture

Pipeline flow (single-device):

  1. Scheduler / poller triggers scraping jobs.
  2. Async HTTP scraper fetches pages, storing raw HTML.
  3. Cleaner extracts article text, normalizes unicode, removes boilerplate.
  4. Summarizer sends cleaned text to local LLM on AI HAT+ 2 for a compact summary.
  5. Persist results to SQLite/Parquet and emit metrics/logs.

Step 1 — Prepare Raspberry Pi 5

Update and install base packages. Run these on the Pi 5 console.

sudo apt update && sudo apt upgrade -y
sudo apt install -y git build-essential python3-pip python3-venv sqlite3 libxml2-dev libxslt-dev libjpeg-dev chromium-browser

Create a Python virtualenv:

python3 -m venv ~/pi_scraper/venv
source ~/pi_scraper/venv/bin/activate
pip install --upgrade pip

Step 2 — Install AI HAT+ 2 SDK and runtime

Vendors shipping AI HAT+ 2 provide a Linux SDK with drivers and a Python package (names used here are illustrative; replace with your vendor package names). Follow the vendor install notes. Typical pattern:

# Download vendor SDK (example)
wget https://vendor.example/ai-hat-plus-2/sdk-linux-aarch64.tar.gz
sudo tar -xzf sdk-linux-aarch64.tar.gz -C /opt/ai-hat
cd /opt/ai-hat && sudo ./install.sh

# Python bindings
pip install ai_hat_sdk  # hypothetical package name

Verify the device is visible to the OS (example):

ai_hat_toolkit status
# or using Python
python -c "import ai_hat_sdk; print(ai_hat_sdk.info())"

If your vendor offers a llama.cpp-compatible endpoint or a plugin for llama-cpp-python, install that as well. We'll show two inference options below: (A) vendor SDK and (B) llama-cpp-python with a quantized model.

Step 3 — Choose and quantize a model

Pick a model that fits your latency and accuracy needs. In 2026 the sweet spot is often 7B-class instruct models quantized to 4-bit (q4_K) or 6-bit for slightly better accuracy. You must follow the model license (e.g., Llama 2, Mistral, Falcon derivatives where allowed).

Option A — Vendor-provided optimized models

Some vendors ship optimized, pre-quantized models designed for their NPU. If available, prefer those — they simplify setup and often expose direct Python bindings for low-latency inference on AI HAT+ 2.

Option B — Convert an open model to GGUF (llama.cpp)

Common flow (assumes llama.cpp and the conversion tools are compiled on the Pi or cross-compiled):

# Clone and build llama.cpp (ARM-friendly build; recent releases use CMake)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build && cmake --build build -j4

# Convert the original weights to GGUF, then quantize to q4_K_M
# (script and binary names vary slightly between llama.cpp releases)
python3 convert_hf_to_gguf.py /path/to/original/weights --outfile /home/pi/models/7b-f16.gguf
./build/bin/llama-quantize /home/pi/models/7b-f16.gguf /home/pi/models/7b-q4_k_m.gguf q4_K_M

Note: conversion may require a more powerful machine. In that case, convert and quantize on an x86 server and copy the resulting GGUF file to the Pi. For production flows, follow the recommendations in our deployment and CI checklists so model artifacts are reproducible and auditable.
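
A lightweight way to keep those artifacts auditable is to record a checksum next to the quantized file on the build machine and verify it on the Pi after copying. A minimal sketch (hostnames and paths are illustrative):

# On the x86 build machine: record a checksum next to the artifact
sha256sum 7b-q4_k_m.gguf > 7b-q4_k_m.gguf.sha256

# Copy model and checksum to the Pi (hostname and destination path are examples)
rsync -avP 7b-q4_k_m.gguf 7b-q4_k_m.gguf.sha256 pi@raspberrypi.local:/home/pi/models/

# On the Pi: verify before pointing the summarizer at the file
cd /home/pi/models && sha256sum -c 7b-q4_k_m.gguf.sha256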

Step 4 — Build the scraper (async, polite, production-ready)

We favor an async architecture for throughput, with rate limiting and backoff to avoid IP bans. The snippet below demonstrates a minimal production-minded scraper; it uses aiolimiter for rate limiting and the standard-library robotparser for robots.txt checks.

pip install aiohttp beautifulsoup4 lxml aiolimiter

# scraper.py
import asyncio
import sqlite3
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import aiohttp
from aiolimiter import AsyncLimiter
from bs4 import BeautifulSoup

DB = 'scraper.db'
USER_AGENT = 'MyBot/1.0'
robots_cache = {}  # one RobotFileParser (or None) per host

async def fetch(session, url):
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
        resp.raise_for_status()
        return await resp.text()

async def allowed_by_robots(session, url):
    """Check robots.txt, caching one parser per host; fail open if robots.txt is unreachable."""
    parsed = urlparse(url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    if base not in robots_cache:
        rp = RobotFileParser()
        try:
            rp.parse((await fetch(session, f"{base}/robots.txt")).splitlines())
        except Exception:
            rp = None  # no robots.txt available: treat as allowed
        robots_cache[base] = rp
    rp = robots_cache[base]
    return rp is None or rp.can_fetch(USER_AGENT, url)

def extract_text(html):
    soup = BeautifulSoup(html, 'lxml')
    # simple boilerplate removal: prefer <article>/<main>, drop page chrome
    article = soup.find('article') or soup.find('main') or soup
    for s in article(['script', 'style', 'nav', 'footer', 'aside']):
        s.decompose()
    return ' '.join(p.get_text(strip=True) for p in article.find_all('p'))

async def worker(name, queue, session, rate_limiter):
    while True:
        url = await queue.get()
        try:
            if not await allowed_by_robots(session, url):
                print(f"Skipping per robots: {url}")
                continue
            async with rate_limiter:
                html = await fetch(session, url)
            text = extract_text(html)
            # persist
            conn = sqlite3.connect(DB)
            conn.execute('INSERT OR IGNORE INTO pages(url, text) VALUES (?, ?)', (url, text))
            conn.commit(); conn.close()
            print(f"Saved: {url}")
        except Exception as e:
            print('Error', url, e)
        finally:
            queue.task_done()

async def main(urls):
    # create DB
    conn = sqlite3.connect(DB)
    conn.execute('CREATE TABLE IF NOT EXISTS pages(id INTEGER PRIMARY KEY, url TEXT UNIQUE, text TEXT, created TIMESTAMP DEFAULT CURRENT_TIMESTAMP)')
    conn.commit(); conn.close()

    queue = asyncio.Queue()
    for u in urls:
        queue.put_nowait(u)

    rate_limiter = AsyncLimiter(1, 1)  # 1 request/second shared across workers

    async with aiohttp.ClientSession(headers={'User-Agent': USER_AGENT}) as session:
        tasks = [asyncio.create_task(worker(f'w{i}', queue, session, rate_limiter)) for i in range(4)]
        await queue.join()
        for t in tasks: t.cancel()

if __name__ == '__main__':
    import sys
    asyncio.run(main(sys.argv[1:]))

Key production-minded features here:

  • robots.txt checking via the standard-library robotparser, cached per host
  • Rate limiting via aiolimiter's AsyncLimiter (tune the rate per target)
  • Persistent storage (SQLite); swap to Parquet/ClickHouse at larger scale (a Parquet export sketch follows this list)
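
If you do outgrow SQLite, a periodic export to Parquet keeps downstream analytics cheap without changing the scraper itself. A minimal sketch, assuming pandas and pyarrow are installed (pip install pandas pyarrow):

# export_parquet.py — dump the pages table to a date-stamped Parquet file
import sqlite3
from datetime import date

import pandas as pd

conn = sqlite3.connect('scraper.db')
df = pd.read_sql_query('SELECT url, text, created FROM pages', conn)
conn.close()

out = f"pages-{date.today().isoformat()}.parquet"
df.to_parquet(out, index=False)  # uses the pyarrow engine when installed
print(f"Wrote {len(df)} rows to {out}")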

Step 5 — On-device summarization with the AI HAT+ 2

Two viable options depending on your vendor setup:

Option A — Use the vendor SDK

The vendor SDK often exposes an easy function to run prompts on an optimized model. Example usage pattern (hypothetical ai_hat_sdk):

pip install ai_hat_sdk

# summarizer_aihat.py
from ai_hat_sdk import Model

model = Model('/opt/ai-hat/models/7b-instruct-quantized')

def summarize(text, max_tokens=150):
    prompt = f"Summarize the following article in 5 bullet points:\n\n{text}\n\nBulleted summary:" 
    out = model.generate(prompt, max_tokens=max_tokens, temperature=0.1)
    return out['text']

if __name__ == '__main__':
    import sys
    print(summarize(' '.join(sys.argv[1:])))

This path usually yields the best performance and power efficiency because the SDK uses the NPU drivers directly — which is the sort of vendor-backed optimization covered in edge AI platform discussions like Edge AI at the Platform Level.

Option B — Use llama-cpp-python binding (open toolchain)

Install llama-cpp-python (it binds to your built llama.cpp). The Pi 5 + AI HAT+ 2 combination sometimes exposes a custom backend for offloading; if not, a quantized GGUF file still runs on the CPU with NEON acceleration.

pip install llama-cpp-python

# summarizer_llama.py
from llama_cpp import Llama

llm = Llama(model_path='/home/pi/models/7b-q4_k_m.gguf', n_ctx=2048)

def summarize(text):
    prompt = f"TL;DR (5 bullets):\n\n{text[:6000]}"
    resp = llm.create_completion(prompt=prompt, max_tokens=150, temperature=0.1)
    return resp['choices'][0]['text']

if __name__ == '__main__':
    import sys
    print(summarize(' '.join(sys.argv[1:])))

Performance note: a well-quantized 7B model on Pi 5 with AI HAT+ 2 typically gives reasonable latency for many use cases (few-second to <30s generation depending on prompt length and model). In early 2026 community benchmarks show 7B q4 models on similar edge NPUs producing 10–30 tokens/sec for standard generations; expect variance. For guidance on balancing latency, cost and hosting strategy, see hybrid edge–regional hosting strategies.

Step 6 — Wire scraper output into the summarizer

Extend the earlier scraper worker to call the summarizer after saving raw text. Keep the summarizer local and synchronous, or decouple it behind a small async inference queue so generation never blocks scraping (a sketch follows the snippet below).

# inside worker, after text extraction
from summarizer_llama import summarize  # or summarizer_aihat

summary = summarize(text[:6000])  # trim to fit the model's context window
conn = sqlite3.connect(DB)
try:
    conn.execute('ALTER TABLE pages ADD COLUMN summary TEXT')
except sqlite3.OperationalError:
    pass  # column already exists
conn.execute('UPDATE pages SET summary=? WHERE url=?', (summary, url))
conn.commit(); conn.close()
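
Here is a minimal sketch of that async inference queue: scraper workers enqueue (url, text) pairs and a single consumer runs the blocking model call off the event loop. It assumes the summary column added above and the summarize function from either summarizer module:

# inference_queue.py — decouple scraping from summarization
import asyncio
import sqlite3

from summarizer_llama import summarize  # or summarizer_aihat

DB = 'scraper.db'

async def summarize_worker(queue: asyncio.Queue):
    """Consume (url, text) pairs and write summaries without blocking scrapers."""
    while True:
        url, text = await queue.get()
        try:
            # run the blocking model call in a thread so the event loop keeps scraping
            summary = await asyncio.to_thread(summarize, text[:6000])
            conn = sqlite3.connect(DB)
            conn.execute('UPDATE pages SET summary=? WHERE url=?', (summary, url))
            conn.commit(); conn.close()
        except Exception as e:
            print('Summarize error', url, e)
        finally:
            queue.task_done()

# In main(): create summary_queue = asyncio.Queue(), start one summarize_worker task,
# and have each scraper worker call `await summary_queue.put((url, text))` instead of
# summarizing inline.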

Keep prompts deterministic and low-variance for consistent summaries. Consider prompting templates and few-shot examples stored in a file for maintainability. For ops and creator workflows that keep inference local and auditable, check recommendations in the Behind the Edge playbook.

Operational considerations

Monitoring & logs

  • Use systemd to run the pipeline and restart on failure (an example unit file follows this list).
  • Ship lightweight metrics via Prometheus node_exporter or push metrics to a private Grafana instance.
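
A minimal unit file along these lines works (user, paths, and the entry point are illustrative; scraper.py as written takes URLs on the command line, so a production unit would typically wrap it in a scheduler loop):

# /etc/systemd/system/pi-scraper.service
[Unit]
Description=On-device scraper + summarizer
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=scraper
WorkingDirectory=/home/scraper/pi_scraper
ExecStart=/home/scraper/pi_scraper/venv/bin/python run_pipeline.py
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target

Enable it with sudo systemctl enable --now pi-scraper.service and tail logs with journalctl -u pi-scraper.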

Storage & retention

  • Rotate or compress raw HTML to save space.
  • Store only hashes of scraped pages if you need to minimize your PII footprint.

Security

  • Run the service as a non-root user.
  • Use AppArmor/SELinux and filesystem quotas.
  • Encrypt backups of summaries if they contain sensitive data.

Scaling & cost tradeoffs

For higher throughput or model experimentation, consider:

  • Running multiple Pi 5 units with HATs behind a small local load balancer.
  • Offloading heavy JS rendering to a central headless Chromium server if many targets require it.
  • Using federated model updates — update quantized weights centrally and push to devices during maintenance windows.

Anti-blocking best practices (ethical + practical)

Local scraping doesn't mean you can ignore target rules. Follow these:

  • Respect robots.txt and site terms.
  • Use realistic but honest user agents and identify your crawler.
  • Implement exponential backoff on 429/503 responses and IP-friendly rate limits (a retry sketch follows this list).
  • Use headless browser only when necessary; it increases profile and cost.
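
A minimal retry wrapper around the fetch coroutine from the scraper illustrates the backoff pattern (the cap and jitter values are arbitrary starting points):

import asyncio
import random

import aiohttp

async def fetch_with_backoff(session, url, retries=5):
    """Retry on 429/503 with exponential backoff plus jitter; raise on other errors."""
    for attempt in range(retries):
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            if resp.status in (429, 503) and attempt < retries - 1:
                delay = min(60, 2 ** attempt) + random.uniform(0, 1)
                await asyncio.sleep(delay)
                continue
            resp.raise_for_status()
            return await resp.text()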

Model quantization tips (2026 updates)

Late 2025/early 2026 toolchains added quantization schemes that trade a small amount of accuracy for significant speedups on NPUs. Practical tips:

  • Prefer q4_K for speed when you need many inferences per day.
  • Test a small validation set to measure semantic retention after quantization.
  • Use mixed precision for sections of your model if supported by the SDK (some NPUs can mix 4-bit storage and 16-bit compute).

Troubleshooting & performance tuning

  • If inference is slow, verify the SDK is using the NPU (check vendor tools for offload metrics).
  • Reduce context size (n_ctx) to fit in memory for faster startup.
  • Trim long pages before sending to the model; summarize in chunks and merge the results if needed (a chunking sketch follows this list).
  • Profile CPU, memory and NPU throughput during a test run and tune worker concurrency accordingly.
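
For the chunk-and-merge approach, a simple map-reduce pattern is usually enough: summarize fixed-size chunks, then summarize the concatenated partial summaries. A sketch using the summarize function from either backend:

from summarizer_llama import summarize  # or summarizer_aihat

def summarize_long(text, chunk_chars=6000):
    """Map-reduce summarization for articles longer than the context window."""
    if len(text) <= chunk_chars:
        return summarize(text)
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partial = [summarize(c) for c in chunks]
    # second pass: merge the partial summaries into a single summary
    return summarize('\n'.join(partial))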

Before running production scraping + summarization:

  • Confirm the target site's terms-of-service allow scraping for your use case.
  • Review copyright implications for downstream summaries and storage.
  • Implement data minimization and deletion policies to meet privacy requirements.

Example end-to-end run (quick commands)

# 1. Start virtualenv
source ~/pi_scraper/venv/bin/activate
# 2. Run scraper for a set of URLs
python scraper.py https://example.com/article1 https://example.com/article2
# 3. Summaries are generated automatically and stored in scraper.db
sqlite3 scraper.db "SELECT url, summary FROM pages;"

Through late 2025 and early 2026, three trends matter for on-device scraping and summarization:

  1. Edge-optimized NPUs and SDK maturity: hardware vendors improved Python bindings and model-serving runtimes to support enterprise workflows.
  2. Quantization and compiler toolchains: quantization standards stabilized so cross-vendor model portability improved.
  3. Privacy-first tooling: more libraries surfaced policies and compliance helpers (robots.txt scanners, automated PII redaction); a minimal redaction sketch follows this list.
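
If you need to strip obvious PII before anything hits disk, a lightweight regex pass over the cleaned text is a reasonable first line of defence (the patterns below are illustrative, not exhaustive; dedicated redaction libraries do better):

import re

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
PHONE_RE = re.compile(r'\+?\d[\d\s().-]{7,}\d')

def redact_pii(text: str) -> str:
    """Replace obvious emails and phone-like numbers before storage."""
    text = EMAIL_RE.sub('[EMAIL]', text)
    return PHONE_RE.sub('[PHONE]', text)

# Example: redact before the INSERT in the scraper worker
# conn.execute('INSERT OR IGNORE INTO pages(url, text) VALUES (?, ?)', (url, redact_pii(text)))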

To future-proof your pipeline:

  • Keep model conversion and deployment automated (CI pipeline builds quantized artifacts) — follow deployment checklists like our cloud migration and deployment guide for reproducible artifacts.
  • Abstract the inference layer so you can swap between vendor SDKs and open runtimes (a small interface sketch follows this list); see Behind the Edge for abstraction examples.
  • Log and version prompts and model hashes for reproducibility and audits.
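
One way to keep that swap cheap is a small interface both backends satisfy, so the pipeline only ever imports the interface. A sketch (the ai_hat_sdk class is the same hypothetical vendor binding used earlier):

# inference.py — hide the summarization backend behind one interface
from typing import Protocol

class Summarizer(Protocol):
    def summarize(self, text: str, max_tokens: int = 150) -> str: ...

class LlamaCppSummarizer:
    def __init__(self, model_path: str):
        from llama_cpp import Llama
        self.llm = Llama(model_path=model_path, n_ctx=2048)

    def summarize(self, text: str, max_tokens: int = 150) -> str:
        prompt = f"TL;DR (5 bullets):\n\n{text[:6000]}"
        resp = self.llm.create_completion(prompt, max_tokens=max_tokens, temperature=0.1)
        return resp['choices'][0]['text']

class AiHatSummarizer:
    def __init__(self, model_path: str):
        from ai_hat_sdk import Model  # hypothetical vendor binding, as above
        self.model = Model(model_path)

    def summarize(self, text: str, max_tokens: int = 150) -> str:
        prompt = f"TL;DR (5 bullets):\n\n{text[:6000]}"
        return self.model.generate(prompt, max_tokens=max_tokens, temperature=0.1)['text']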

Real-world example & metrics (mini-case study)

A marketing intelligence team deployed a 2-node Pi 5 + AI HAT+ 2 cluster in 2025 for competitor monitoring. Their configuration:

  • Two Pi 5 nodes, each running a 7B q4 model.
  • Scraped 100 sites once every 12 hours with politeness limits.
  • Average summarization latency: 8–18s per article; daily inference volume ~6,000 summaries.

Benefits reported:

  • Zero cloud inference spend; monthly operating cost ~ $20 (power + occasional hardware).
  • Data never left their site, simplifying legal review.
  • Ability to iterate quickly on prompt templates and store prompt+model version hashes for audits.

Key takeaways

  • Raspberry Pi 5 + AI HAT+ 2 is a practical, cost-effective platform in 2026 for privacy-preserving scraping and summarization.
  • Use quantized 7B-class models for the best balance of latency and quality on-device.
  • Design with politeness and compliance in mind: robots.txt, rate limits, and storage minimization.
  • Abstract inference to swap runtimes as SDKs and model formats evolve — see platform-level guidance in Edge AI at the Platform Level and operational guidance in Behind the Edge.

Pro tip: Automate model conversion and device deployment in CI. Keep a manifest with model hash, quantization scheme, prompt template, and inference parameters for every deployment.
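
The manifest can be as small as a JSON file written by CI and shipped next to the weights. A sketch (field names are illustrative):

# write_manifest.py — record exactly what is deployed, for audits
import hashlib
import json
from datetime import datetime, timezone

MODEL = '/home/pi/models/7b-q4_k_m.gguf'

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    return h.hexdigest()

manifest = {
    'model_path': MODEL,
    'model_sha256': sha256_of(MODEL),
    'quantization': 'q4_K_M',
    'prompt_template': 'TL;DR (5 bullets):\n\n{text}',
    'inference_params': {'max_tokens': 150, 'temperature': 0.1, 'n_ctx': 2048},
    'deployed_at': datetime.now(timezone.utc).isoformat(),
}

with open('deployment_manifest.json', 'w') as f:
    json.dump(manifest, f, indent=2)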

Next steps — get this running quickly

  1. Order an AI HAT+ 2 and a Pi 5 if you haven’t already (vendors stocked units through Q4 2025).
  2. Pick a permissively licensed model and convert it to a quantized backend you trust.
  3. Prototype the scraper on your laptop and then move to the Pi for NPU tuning.
  4. Automate deployment and implement minimal monitoring — for hosting tradeoffs see hybrid edge strategies.

Call to action

If you want a starting point, clone the sample repo we use for this tutorial (contains the scraper, summarizer glue code, and a systemd unit file) and adapt it to your targets. Test with one or two sites, validate summaries, and then scale carefully. Need help customizing prompts, quantizing a specific model, or tuning inference on AI HAT+ 2? Reach out to our engineering team or leave a message in the comments — we can help you migrate a cloud workflow to a secure, on-prem Pi cluster.


Related Topics

#edge-ai #tutorial #raspberry-pi

webscraper

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
