Build an On-Device Scraper: Running Generative AI Pipelines on a Raspberry Pi 5 with the AI HAT+ 2

webscraper
2026-01-21 12:00:00
11 min read

Step-by-step guide to run a privacy-preserving scraper + summarizer on Raspberry Pi 5 with AI HAT+ 2. Includes Python code, quantization, and deployment tips.

Keep scraping private: Run a full scraper + summarizer on Raspberry Pi 5 with the AI HAT+ 2

Privacy, cost and operational simplicity are top concerns for engineers who need recurring data from sensitive sources. What if you could run a production-grade scraper and generative summarization pipeline entirely on-device — no cloud, no third‑party LLM APIs, and predictable costs? In 2026, the Raspberry Pi 5 paired with the $130 AI HAT+ 2 makes that realistic for many projects.

Why this matters in 2026 (short answer)

Edge AI hardware and toolchains matured through late 2025 and early 2026. Vendors shipped ARM-friendly runtimes, quantized model support, and SDKs that expose NPUs from Python. That means you can run on-device inference for 7B-class instruct models with acceptable latency on consumer hardware. For privacy-preserving scraping — regulatory or contractual constraints often require data to never leave your premises — this combination is a game changer. See our companion playbooks on behind-the-edge workflows for operational tactics and device-level CI.

What you'll build

  • A lightweight async Python scraper that respects robots.txt and uses adaptive throttling.
  • A text-cleaning and normalization step that produces high-quality prompts.
  • An on-device summarization pipeline that runs on AI HAT+ 2 using a quantized local model.
  • Simple persistence (SQLite / Parquet) and a systemd service for continuous operation.

Requirements & assumptions

  • Raspberry Pi 5 (8GB recommended for 7B-class models) running Raspberry Pi OS 64-bit (Bookworm or later, updated in 2025+).
  • AI HAT+ 2 attached and its vendor drivers/SDK installed (released late 2025, updated in early 2026 to support Python bindings).
  • Basic Linux and Python 3.11+ knowledge.
  • Familiarity with model licensing — use weights you are allowed to run locally.

High-level architecture

Pipeline flow (single-device):

  1. Scheduler / poller triggers scraping jobs.
  2. Async HTTP scraper fetches pages, storing raw HTML.
  3. Cleaner extracts article text, normalizes unicode, removes boilerplate.
  4. Summarizer sends cleaned text to local LLM on AI HAT+ 2 for a compact summary.
  5. Persist results to SQLite/Parquet and emit metrics/logs.

Step 1 — Prepare Raspberry Pi 5

Update and install base packages. Run these on the Pi 5 console.

sudo apt update && sudo apt upgrade -y
sudo apt install -y git build-essential python3-pip python3-venv sqlite3 libxml2-dev libxslt-dev libjpeg-dev chromium-browser

Create a Python virtualenv:

python3 -m venv ~/pi_scraper/venv
source ~/pi_scraper/venv/bin/activate
pip install --upgrade pip

Step 2 — Install AI HAT+ 2 SDK and runtime

Vendors shipping AI HAT+ 2 provide a Linux SDK with drivers and a Python package (names used here are illustrative; replace with your vendor package names). Follow the vendor install notes. Typical pattern:

# Download vendor SDK (example)
wget https://vendor.example/ai-hat-plus-2/sdk-linux-aarch64.tar.gz
sudo tar -xzf sdk-linux-aarch64.tar.gz -C /opt/ai-hat
cd /opt/ai-hat && sudo ./install.sh

# Python bindings
pip install ai_hat_sdk  # hypothetical package name

Verify the device is visible to the OS (example):

ai_hat_toolkit status
# or using Python
python -c "import ai_hat_sdk; print(ai_hat_sdk.info())"

If your vendor offers a llama.cpp-compatible endpoint or a plugin for llama-cpp-python, install that as well. We'll show two inference options below: (A) vendor SDK and (B) llama-cpp-python with a quantized model.

Step 3 — Choose and quantize a model

Pick a model that fits your latency and accuracy needs. In 2026 the sweet spot is often 7B-class instruct models quantized to 4-bit (q4_K) or 6-bit for slightly better accuracy. You must follow the model license (e.g., Llama 2, Mistral, Falcon derivatives where allowed).

Option A — Vendor-provided optimized models

Some vendors ship optimized, pre-quantized models designed for their NPU. If available, prefer those — they simplify setup and often expose direct Python bindings for low-latency inference on AI HAT+ 2.

Option B — Convert an open model to GGUF (llama.cpp)

Common flow (assumes llama.cpp and the conversion tools are compiled on the Pi or cross-compiled):

# Clone and build llama.cpp (ARM-friendly build; recent releases use CMake)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build && cmake --build build -j4

# Convert the original weights to GGUF, then quantize to q4_K_M
# (script and binary names vary slightly between llama.cpp releases)
python3 convert_hf_to_gguf.py /path/to/original/weights --outfile /home/pi/models/7b-f16.gguf
./build/bin/llama-quantize /home/pi/models/7b-f16.gguf /home/pi/models/7b-q4_k_m.gguf q4_K_M

Note: conversion may require a more powerful machine. In that case, convert and quantize on an x86 server and copy the resulting GGUF file to the Pi. For production flows, follow the recommendations in our deployment and CI checklists so model artifacts are reproducible and auditable.
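
A lightweight way to keep those artifacts auditable is to record a checksum next to the quantized file on the build machine and verify it on the Pi after copying. A minimal sketch (hostnames and paths are illustrative):

# On the x86 build machine: record a checksum next to the artifact
sha256sum 7b-q4_k_m.gguf > 7b-q4_k_m.gguf.sha256

# Copy model and checksum to the Pi (hostname and destination path are examples)
rsync -avP 7b-q4_k_m.gguf 7b-q4_k_m.gguf.sha256 pi@raspberrypi.local:/home/pi/models/

# On the Pi: verify before pointing the summarizer at the file
cd /home/pi/models && sha256sum -c 7b-q4_k_m.gguf.sha256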

Step 4 — Build the scraper (async, polite, production-ready)

We favor an async architecture for throughput, with rate limiting and backoff to avoid IP bans. The snippet below demonstrates a minimal production-minded scraper; it uses aiolimiter for rate limiting and the standard-library robotparser for robots.txt checks.

pip install aiohttp beautifulsoup4 lxml aiolimiter

# scraper.py
import asyncio
import sqlite3
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import aiohttp
from aiolimiter import AsyncLimiter
from bs4 import BeautifulSoup

DB = 'scraper.db'
USER_AGENT = 'MyBot/1.0'
robots_cache = {}  # one RobotFileParser (or None) per host

async def fetch(session, url):
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
        resp.raise_for_status()
        return await resp.text()

async def allowed_by_robots(session, url):
    """Check robots.txt, caching one parser per host; fail open if robots.txt is unreachable."""
    parsed = urlparse(url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    if base not in robots_cache:
        rp = RobotFileParser()
        try:
            rp.parse((await fetch(session, f"{base}/robots.txt")).splitlines())
        except Exception:
            rp = None  # no robots.txt available: treat as allowed
        robots_cache[base] = rp
    rp = robots_cache[base]
    return rp is None or rp.can_fetch(USER_AGENT, url)

def extract_text(html):
    soup = BeautifulSoup(html, 'lxml')
    # simple boilerplate removal: prefer <article>/<main>, drop page chrome
    article = soup.find('article') or soup.find('main') or soup
    for s in article(['script', 'style', 'nav', 'footer', 'aside']):
        s.decompose()
    return ' '.join(p.get_text(strip=True) for p in article.find_all('p'))

async def worker(name, queue, session, rate_limiter):
    while True:
        url = await queue.get()
        try:
            if not await allowed_by_robots(session, url):
                print(f"Skipping per robots: {url}")
                continue
            async with rate_limiter:
                html = await fetch(session, url)
            text = extract_text(html)
            # persist
            conn = sqlite3.connect(DB)
            conn.execute('INSERT OR IGNORE INTO pages(url, text) VALUES (?, ?)', (url, text))
            conn.commit(); conn.close()
            print(f"Saved: {url}")
        except Exception as e:
            print('Error', url, e)
        finally:
            queue.task_done()

async def main(urls):
    # create DB
    conn = sqlite3.connect(DB)
    conn.execute('CREATE TABLE IF NOT EXISTS pages(id INTEGER PRIMARY KEY, url TEXT UNIQUE, text TEXT, created TIMESTAMP DEFAULT CURRENT_TIMESTAMP)')
    conn.commit(); conn.close()

    queue = asyncio.Queue()
    for u in urls:
        queue.put_nowait(u)

    rate_limiter = AsyncLimiter(1, 1)  # 1 request/second shared across workers

    async with aiohttp.ClientSession(headers={'User-Agent': USER_AGENT}) as session:
        tasks = [asyncio.create_task(worker(f'w{i}', queue, session, rate_limiter)) for i in range(4)]
        await queue.join()
        for t in tasks: t.cancel()

if __name__ == '__main__':
    import sys
    asyncio.run(main(sys.argv[1:]))

Key production-minded features here:

  • robots.txt checking via the standard-library robotparser, cached per host
  • Rate limiting via aiolimiter's AsyncLimiter (tune the rate per target)
  • Persistent storage (SQLite); swap to Parquet/ClickHouse at larger scale (a Parquet export sketch follows this list)
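
If you do outgrow SQLite, a periodic export to Parquet keeps downstream analytics cheap without changing the scraper itself. A minimal sketch, assuming pandas and pyarrow are installed (pip install pandas pyarrow):

# export_parquet.py — dump the pages table to a date-stamped Parquet file
import sqlite3
from datetime import date

import pandas as pd

conn = sqlite3.connect('scraper.db')
df = pd.read_sql_query('SELECT url, text, created FROM pages', conn)
conn.close()

out = f"pages-{date.today().isoformat()}.parquet"
df.to_parquet(out, index=False)  # uses the pyarrow engine when installed
print(f"Wrote {len(df)} rows to {out}")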

Step 5 — On-device summarization with the AI HAT+ 2

Two viable options depending on your vendor setup:

Option A — Use the vendor SDK

The vendor SDK often exposes an easy function to run prompts on an optimized model. Example usage pattern (hypothetical ai_hat_sdk):

pip install ai_hat_sdk

# summarizer_aihat.py
from ai_hat_sdk import Model

model = Model('/opt/ai-hat/models/7b-instruct-quantized')

def summarize(text, max_tokens=150):
    prompt = f"Summarize the following article in 5 bullet points:\n\n{text}\n\nBulleted summary:" 
    out = model.generate(prompt, max_tokens=max_tokens, temperature=0.1)
    return out['text']

if __name__ == '__main__':
    import sys
    print(summarize(' '.join(sys.argv[1:])))

This path usually yields the best performance and power efficiency because the SDK uses the NPU drivers directly — which is the sort of vendor-backed optimization covered in edge AI platform discussions like Edge AI at the Platform Level.

Option B — Use llama-cpp-python binding (open toolchain)

Install llama-cpp-python (it binds to your built llama.cpp). The Pi 5 + AI HAT+ 2 combination sometimes exposes a custom backend for offloading; if not, a quantized GGUF file still runs on the CPU with NEON acceleration.

pip install llama-cpp-python

# summarizer_llama.py
from llama_cpp import Llama

llm = Llama(model_path='/home/pi/models/7b-q4_k_m.gguf', n_ctx=2048)

def summarize(text):
    prompt = f"TL;DR (5 bullets):\n\n{text[:6000]}"
    resp = llm.create_completion(prompt=prompt, max_tokens=150, temperature=0.1)
    return resp['choices'][0]['text']

if __name__ == '__main__':
    import sys
    print(summarize(' '.join(sys.argv[1:])))

Performance note: a well-quantized 7B model on Pi 5 with AI HAT+ 2 typically gives reasonable latency for many use cases (few-second to <30s generation depending on prompt length and model). In early 2026 community benchmarks show 7B q4 models on similar edge NPUs producing 10–30 tokens/sec for standard generations; expect variance. For guidance on balancing latency, cost and hosting strategy, see hybrid edge–regional hosting strategies.

Step 6 — Wire scraper output into the summarizer

Extend the earlier scraper worker to call the summarizer after saving raw text. Keep the summarizer local and synchronous, or decouple it behind a small async inference queue so generation never blocks scraping (a sketch follows the snippet below).

# inside worker, after text extraction
from summarizer_llama import summarize  # or summarizer_aihat

summary = summarize(text[:6000])  # trim to fit the model's context window
conn = sqlite3.connect(DB)
try:
    conn.execute('ALTER TABLE pages ADD COLUMN summary TEXT')
except sqlite3.OperationalError:
    pass  # column already exists
conn.execute('UPDATE pages SET summary=? WHERE url=?', (summary, url))
conn.commit(); conn.close()
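
Here is a minimal sketch of that async inference queue: scraper workers enqueue (url, text) pairs and a single consumer runs the blocking model call off the event loop. It assumes the summary column added above and the summarize function from either summarizer module:

# inference_queue.py — decouple scraping from summarization
import asyncio
import sqlite3

from summarizer_llama import summarize  # or summarizer_aihat

DB = 'scraper.db'

async def summarize_worker(queue: asyncio.Queue):
    """Consume (url, text) pairs and write summaries without blocking scrapers."""
    while True:
        url, text = await queue.get()
        try:
            # run the blocking model call in a thread so the event loop keeps scraping
            summary = await asyncio.to_thread(summarize, text[:6000])
            conn = sqlite3.connect(DB)
            conn.execute('UPDATE pages SET summary=? WHERE url=?', (summary, url))
            conn.commit(); conn.close()
        except Exception as e:
            print('Summarize error', url, e)
        finally:
            queue.task_done()

# In main(): create summary_queue = asyncio.Queue(), start one summarize_worker task,
# and have each scraper worker call `await summary_queue.put((url, text))` instead of
# summarizing inline.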

Keep prompts deterministic and low-variance for consistent summaries. Consider prompting templates and few-shot examples stored in a file for maintainability. For ops and creator workflows that keep inference local and auditable, check recommendations in the Behind the Edge playbook.

Operational considerations

Monitoring & logs

  • Use systemd to run the pipeline and restart on failure (an example unit file follows this list).
  • Ship lightweight metrics via Prometheus node_exporter or push metrics to a private Grafana instance.
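
A minimal unit file along these lines works (user, paths, and the entry point are illustrative; scraper.py as written takes URLs on the command line, so a production unit would typically wrap it in a scheduler loop):

# /etc/systemd/system/pi-scraper.service
[Unit]
Description=On-device scraper + summarizer
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=scraper
WorkingDirectory=/home/scraper/pi_scraper
ExecStart=/home/scraper/pi_scraper/venv/bin/python run_pipeline.py
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target

Enable it with sudo systemctl enable --now pi-scraper.service and tail logs with journalctl -u pi-scraper.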

Storage & retention

  • Rotate or compress raw HTML to save space.
  • Store only hashes of scraped pages if you need to minimize your PII footprint.

Security

  • Run the service as a non-root user.
  • Use AppArmor/SELinux and filesystem quotas.
  • Encrypt backups of summaries if they contain sensitive data.

Scaling & cost tradeoffs

For higher throughput or model experimentation, consider:

  • Running multiple Pi 5 units with HATs behind a small local load balancer.
  • Offloading heavy JS rendering to a central headless Chromium server if many targets require it.
  • Using federated model updates — update quantized weights centrally and push to devices during maintenance windows.

Anti-blocking best practices (ethical + practical)

Local scraping doesn't mean you can ignore target rules. Follow these:

  • Respect robots.txt and site terms.
  • Use realistic but honest user agents and identify your crawler.
  • Implement exponential backoff on 429/503 responses and IP-friendly rate limits (a retry sketch follows this list).
  • Use headless browser only when necessary; it increases profile and cost.
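
A minimal retry wrapper around the fetch coroutine from the scraper illustrates the backoff pattern (the cap and jitter values are arbitrary starting points):

import asyncio
import random

import aiohttp

async def fetch_with_backoff(session, url, retries=5):
    """Retry on 429/503 with exponential backoff plus jitter; raise on other errors."""
    for attempt in range(retries):
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            if resp.status in (429, 503) and attempt < retries - 1:
                delay = min(60, 2 ** attempt) + random.uniform(0, 1)
                await asyncio.sleep(delay)
                continue
            resp.raise_for_status()
            return await resp.text()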

Model quantization tips (2026 updates)

Late 2025/early 2026 toolchains added quantization schemes that trade a small amount of accuracy for significant speedups on NPUs. Practical tips:

  • Prefer q4_K for speed when you need many inferences per day.
  • Test a small validation set to measure semantic retention after quantization.
  • Use mixed precision for sections of your model if supported by the SDK (some NPUs can mix 4-bit storage and 16-bit compute).

Troubleshooting & performance tuning

  • If inference is slow, verify the SDK is using the NPU (check vendor tools for offload metrics).
  • Reduce context size (n_ctx) to fit in memory for faster startup.
  • Trim long pages before sending to the model; summarize in chunks and merge the results if needed (a chunking sketch follows this list).
  • Profile CPU, memory and NPU throughput during a test run and tune worker concurrency accordingly.
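
For the chunk-and-merge approach, a simple map-reduce pattern is usually enough: summarize fixed-size chunks, then summarize the concatenated partial summaries. A sketch using the summarize function from either backend:

from summarizer_llama import summarize  # or summarizer_aihat

def summarize_long(text, chunk_chars=6000):
    """Map-reduce summarization for articles longer than the context window."""
    if len(text) <= chunk_chars:
        return summarize(text)
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partial = [summarize(c) for c in chunks]
    # second pass: merge the partial summaries into a single summary
    return summarize('\n'.join(partial))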

Before running production scraping + summarization:

  • Confirm the target site's terms-of-service allow scraping for your use case.
  • Review copyright implications for downstream summaries and storage.
  • Implement data minimization and deletion policies to meet privacy requirements.

Example end-to-end run (quick commands)

# 1. Start virtualenv
source ~/pi_scraper/venv/bin/activate
# 2. Run scraper for a set of URLs
python scraper.py https://example.com/article1 https://example.com/article2
# 3. Summaries are generated automatically and stored in scraper.db
sqlite3 scraper.db "SELECT url, summary FROM pages;"

Through late 2025 and early 2026, three trends matter for on-device scraping and summarization:

  1. Edge-optimized NPUs and SDK maturity: hardware vendors improved Python bindings and model-serving runtimes to support enterprise workflows.
  2. Quantization and compiler toolchains: quantization standards stabilized so cross-vendor model portability improved.
  3. Privacy-first tooling: more libraries surfaced policies and compliance helpers (robots.txt scanners, automated PII redaction); a minimal redaction sketch follows this list.
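
If you need to strip obvious PII before anything hits disk, a lightweight regex pass over the cleaned text is a reasonable first line of defence (the patterns below are illustrative, not exhaustive; dedicated redaction libraries do better):

import re

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
PHONE_RE = re.compile(r'\+?\d[\d\s().-]{7,}\d')

def redact_pii(text: str) -> str:
    """Replace obvious emails and phone-like numbers before storage."""
    text = EMAIL_RE.sub('[EMAIL]', text)
    return PHONE_RE.sub('[PHONE]', text)

# Example: redact before the INSERT in the scraper worker
# conn.execute('INSERT OR IGNORE INTO pages(url, text) VALUES (?, ?)', (url, redact_pii(text)))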

To future-proof your pipeline:

  • Keep model conversion and deployment automated (CI pipeline builds quantized artifacts) — follow deployment checklists like our cloud migration and deployment guide for reproducible artifacts.
  • Abstract the inference layer so you can swap between vendor SDKs and open runtimes (a small interface sketch follows this list); see Behind the Edge for abstraction examples.
  • Log and version prompts and model hashes for reproducibility and audits.
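
One way to keep that swap cheap is a small interface both backends satisfy, so the pipeline only ever imports the interface. A sketch (the ai_hat_sdk class is the same hypothetical vendor binding used earlier):

# inference.py — hide the summarization backend behind one interface
from typing import Protocol

class Summarizer(Protocol):
    def summarize(self, text: str, max_tokens: int = 150) -> str: ...

class LlamaCppSummarizer:
    def __init__(self, model_path: str):
        from llama_cpp import Llama
        self.llm = Llama(model_path=model_path, n_ctx=2048)

    def summarize(self, text: str, max_tokens: int = 150) -> str:
        prompt = f"TL;DR (5 bullets):\n\n{text[:6000]}"
        resp = self.llm.create_completion(prompt, max_tokens=max_tokens, temperature=0.1)
        return resp['choices'][0]['text']

class AiHatSummarizer:
    def __init__(self, model_path: str):
        from ai_hat_sdk import Model  # hypothetical vendor binding, as above
        self.model = Model(model_path)

    def summarize(self, text: str, max_tokens: int = 150) -> str:
        prompt = f"TL;DR (5 bullets):\n\n{text[:6000]}"
        return self.model.generate(prompt, max_tokens=max_tokens, temperature=0.1)['text']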

Real-world example & metrics (mini-case study)

A marketing intelligence team deployed a 2-node Pi 5 + AI HAT+ 2 cluster in 2025 for competitor monitoring. Their configuration:

  • Two Pi 5 nodes, each running a 7B q4 model.
  • Scraped 100 sites once every 12 hours with politeness limits.
  • Average summarization latency: 8–18s per article; daily inference volume ~6,000 summaries.

Benefits reported:

  • Zero cloud inference spend; monthly operating cost ~ $20 (power + occasional hardware).
  • Data never left their site, simplifying legal review.
  • Ability to iterate quickly on prompt templates and store prompt+model version hashes for audits.

Key takeaways

  • Raspberry Pi 5 + AI HAT+ 2 is a practical, cost-effective platform in 2026 for privacy-preserving scraping and summarization.
  • Use quantized 7B-class models for the best balance of latency and quality on-device.
  • Design with politeness and compliance in mind: robots.txt, rate limits, and storage minimization.
  • Abstract inference to swap runtimes as SDKs and model formats evolve — see platform-level guidance in Edge AI at the Platform Level and operational guidance in Behind the Edge.

Pro tip: Automate model conversion and device deployment in CI. Keep a manifest with model hash, quantization scheme, prompt template, and inference parameters for every deployment.
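
The manifest can be as small as a JSON file written by CI and shipped next to the weights. A sketch (field names are illustrative):

# write_manifest.py — record exactly what is deployed, for audits
import hashlib
import json
from datetime import datetime, timezone

MODEL = '/home/pi/models/7b-q4_k_m.gguf'

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    return h.hexdigest()

manifest = {
    'model_path': MODEL,
    'model_sha256': sha256_of(MODEL),
    'quantization': 'q4_K_M',
    'prompt_template': 'TL;DR (5 bullets):\n\n{text}',
    'inference_params': {'max_tokens': 150, 'temperature': 0.1, 'n_ctx': 2048},
    'deployed_at': datetime.now(timezone.utc).isoformat(),
}

with open('deployment_manifest.json', 'w') as f:
    json.dump(manifest, f, indent=2)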

Next steps — get this running quickly

  1. Order an AI HAT+ 2 and a Pi 5 if you haven’t already (vendors stocked units through Q4 2025).
  2. Pick a permissively licensed model and convert it to a quantized backend you trust.
  3. Prototype the scraper on your laptop and then move to the Pi for NPU tuning.
  4. Automate deployment and implement minimal monitoring — for hosting tradeoffs see hybrid edge strategies.

Call to action

If you want a starting point, clone the sample repo we use for this tutorial (contains the scraper, summarizer glue code, and a systemd unit file) and adapt it to your targets. Test with one or two sites, validate summaries, and then scale carefully. Need help customizing prompts, quantizing a specific model, or tuning inference on AI HAT+ 2? Reach out to our engineering team or leave a message in the comments — we can help you migrate a cloud workflow to a secure, on-prem Pi cluster.


Related Topics

#edge-ai #tutorial #raspberry-pi

webscraper

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
