Keep scraping private: Run a full scraper + summarizer on Raspberry Pi 5 with the AI HAT+ 2
Privacy, cost and operational simplicity are top concerns for engineers who need recurring data from sensitive sources. What if you could run a production-grade scraper and generative summarization pipeline entirely on-device — no cloud, no third‑party LLM APIs, and predictable costs? In 2026, the Raspberry Pi 5 paired with the $130 AI HAT+ 2 makes that realistic for many projects.
Why this matters in 2026 (short answer)
Edge AI hardware and toolchains matured through late 2025 and early 2026. Vendors shipped ARM-friendly runtimes, quantized model support, and SDKs that expose NPUs from Python. That means you can run on-device inference for 7B-class instruct models with acceptable latency on consumer hardware. For privacy-preserving scraping — regulatory or contractual constraints often require data to never leave your premises — this combination is a game changer. See our companion playbooks on behind-the-edge workflows for operational tactics and device-level CI.
What you'll build
- A lightweight async Python scraper that respects robots.txt and uses adaptive throttling.
- A text-cleaning and normalization step that produces high-quality prompts.
- An on-device summarization pipeline that runs on AI HAT+ 2 using a quantized local model.
- Simple persistence (SQLite / Parquet) and a systemd service for continuous operation.
Requirements & assumptions
- Raspberry Pi 5 (4GB or 8GB recommended) running Raspberry Pi OS 64-bit (Bookworm or later; the Pi 5 is not supported on Bullseye).
- AI HAT+ 2 attached and its vendor drivers/SDK installed (released late 2025, updated in early 2026 to support Python bindings).
- Basic Linux and Python 3.11+ knowledge.
- Familiarity with model licensing — use weights you are allowed to run locally.
High-level architecture
Pipeline flow (single-device):
- Scheduler / poller triggers scraping jobs.
- Async HTTP scraper fetches pages, storing raw HTML.
- Cleaner extracts article text, normalizes unicode, removes boilerplate.
- Summarizer sends cleaned text to local LLM on AI HAT+ 2 for a compact summary.
- Persist results to SQLite/Parquet and emit metrics/logs.
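The cleaner step in the flow above mentions Unicode normalization, which can be sketched with the standard library alone. This is a minimal illustration, not the tutorial's actual cleaner: the function name normalize_text is hypothetical, and the choice of NFKC plus control-character stripping is an assumption about what "normalizes unicode" should mean here.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """NFKC-normalize, drop control/format characters, collapse whitespace."""
    # NFKC folds compatibility forms (ligatures, non-breaking spaces) and
    # composes base letters with combining accents.
    text = unicodedata.normalize("NFKC", text)
    # Drop characters in the Unicode "C" categories (control, format, ...),
    # keeping newlines and tabs so they can be collapsed as whitespace below.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()
```

Running this on extracted text before building the prompt keeps invisible characters and stray whitespace from consuming context-window tokens.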
Step 1 — Prepare Raspberry Pi 5
Update and install base packages. Run these on the Pi 5 console.
sudo apt update && sudo apt upgrade -y
sudo apt install -y git build-essential python3-pip python3-venv sqlite3 libxml2-dev libxslt-dev libjpeg-dev chromium-browser
Create a Python virtualenv:
python3 -m venv ~/pi_scraper/venv
source ~/pi_scraper/venv/bin/activate
pip install --upgrade pip
Step 2 — Install AI HAT+ 2 SDK and runtime
Vendors shipping AI HAT+ 2 provide a Linux SDK with drivers and a Python package (names used here are illustrative; replace with your vendor package names). Follow the vendor install notes. Typical pattern:
# Download vendor SDK (example)
wget https://vendor.example/ai-hat-plus-2/sdk-linux-aarch64.tar.gz
sudo mkdir -p /opt/ai-hat # tar -C requires the target directory to exist
sudo tar -xzf sdk-linux-aarch64.tar.gz -C /opt/ai-hat
cd /opt/ai-hat && sudo ./install.sh
# Python bindings
pip install ai_hat_sdk # hypothetical package name
Verify the device is visible to the OS (example):
ai_hat_toolkit status
# or using Python
python -c "import ai_hat_sdk; print(ai_hat_sdk.info())"
If your vendor offers a llama.cpp-compatible endpoint or a plugin for llama-cpp-python, install that as well. We'll show two inference options below: (A) vendor SDK and (B) llama-cpp-python with a quantized model.
Step 3 — Choose and quantize a model
Pick a model that fits your latency and accuracy needs. In 2026 the sweet spot is often 7B-class instruct models quantized to 4-bit (q4_K) or 6-bit for slightly better accuracy. You must follow the model license (e.g., Llama 2, Mistral, Falcon derivatives where allowed).
Option A — Vendor-provided optimized models
Some vendors ship optimized, pre-quantized models designed for their NPU. If available, prefer those — they simplify setup and often expose direct Python bindings for low-latency inference on AI HAT+ 2.
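Option B — Convert an open model to GGUF (llama.cpp)
Common flow (assumes llama.cpp and its conversion tools are built on the Pi or cross-compiled). Current llama.cpp builds with CMake and uses the GGUF file format, which replaced the older ggml .bin format:
# Clone and build llama.cpp (ARM-friendly build)
sudo apt install -y cmake
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j4
# Convert Hugging Face weights to a 16-bit GGUF file, then quantize it to Q4_K_M
pip install -r requirements.txt
python3 convert_hf_to_gguf.py /path/to/original/weights --outfile /home/pi/models/7b-f16.gguf --outtype f16
./build/bin/llama-quantize /home/pi/models/7b-f16.gguf /home/pi/models/7b-ggml-q4.bin Q4_K_M
The quantized output keeps the file name used by the summarizer later in this tutorial; llama.cpp recognizes GGUF files by their content, not their extension.
Note: conversion may require a more powerful machine. In that case, convert on an x86 server and copy the quantized file to the Pi. For production flows, follow the recommendations in our deployment and CI checklists so model artifacts are reproducible and auditable.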
Step 4 — Build the scraper (async, polite, production-ready)
We favor an async architecture for throughput, with rate limiting to stay polite and avoid IP bans. The snippet below is a minimal production-minded scraper. It uses the aiolimiter package for rate limiting and the standard library's urllib.robotparser for robots.txt checks.
pip install aiohttp beautifulsoup4 lxml aiolimiter
# scraper.py
import asyncio
import sqlite3
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import aiohttp
from aiolimiter import AsyncLimiter
from bs4 import BeautifulSoup

DB = 'scraper.db'
USER_AGENT = 'MyBot/1.0'
_robots = {}  # per-origin cache of parsed robots.txt files

async def fetch(session, url):
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
        resp.raise_for_status()
        return await resp.text()

async def allowed(session, url):
    # Fetch and cache robots.txt once per origin, then check the URL against it
    parts = urlparse(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    if origin not in _robots:
        rp = RobotFileParser()
        try:
            rp.parse((await fetch(session, origin + '/robots.txt')).splitlines())
        except (aiohttp.ClientError, asyncio.TimeoutError):
            rp.parse([])  # robots.txt unavailable: treated as allow-all here
        _robots[origin] = rp
    return _robots[origin].can_fetch(USER_AGENT, url)

def extract_text(html):
    soup = BeautifulSoup(html, 'lxml')
    # simple boilerplate removal
    article = soup.find('article') or soup.find('main') or soup
    for s in article(['script', 'style', 'nav', 'footer', 'aside']):
        s.decompose()
    return ' '.join(p.get_text(strip=True) for p in article.find_all('p'))

def save_page(url, text):
    # Upsert so that re-scraping a URL updates the row instead of violating UNIQUE
    conn = sqlite3.connect(DB)
    conn.execute('INSERT INTO pages(url, text) VALUES (?, ?) '
                 'ON CONFLICT(url) DO UPDATE SET text = excluded.text', (url, text))
    conn.commit()
    conn.close()

async def worker(name, queue, session, rate_limiter):
    while True:
        url = await queue.get()
        try:
            if not await allowed(session, url):
                print(f"Skipping per robots: {url}")
                continue
            async with rate_limiter:
                html = await fetch(session, url)
            text = extract_text(html)
            save_page(url, text)
            print(f"Saved: {url}")
        except Exception as e:
            print('Error', url, e)
        finally:
            queue.task_done()

async def main(urls):
    # create DB
    conn = sqlite3.connect(DB)
    conn.execute('CREATE TABLE IF NOT EXISTS pages(id INTEGER PRIMARY KEY, url TEXT UNIQUE, text TEXT, created TIMESTAMP DEFAULT CURRENT_TIMESTAMP)')
    conn.commit()
    conn.close()
    queue = asyncio.Queue()
    for u in urls:
        queue.put_nowait(u)
    rate_limiter = AsyncLimiter(1, 1)  # 1 request per second by default
    async with aiohttp.ClientSession(headers={'User-Agent': USER_AGENT}) as session:
        tasks = [asyncio.create_task(worker(f'w{i}', queue, session, rate_limiter)) for i in range(4)]
        await queue.join()
        for t in tasks:
            t.cancel()
        # Wait for the cancelled workers to finish before closing the session
        await asyncio.gather(*tasks, return_exceptions=True)

if __name__ == '__main__':
    import sys
    asyncio.run(main(sys.argv[1:]))
Key production-minded features here:
- Robots.txt checking
- Rate limiting via AsyncLimiter (a fixed rate; tune it per target)
- Persistent storage (SQLite); swap to Parquet/ClickHouse for large-scale
Step 5 — On-device summarization with the AI HAT+ 2
Two viable options depending on your vendor setup:
Option A — Use the AI HAT+ 2 vendor SDK (recommended if available)
Vendor SDKs often expose a simple function for running prompts on an optimized model. Example usage pattern (hypothetical ai_hat_sdk; the Model class, the generate() arguments, and the return shape below are illustrative, not a documented API):
pip install ai_hat_sdk # hypothetical package name
# summarizer_aihat.py
from ai_hat_sdk import Model

model = Model('/opt/ai-hat/models/7b-instruct-quantized')

def summarize(text, max_tokens=150):
    prompt = f"Summarize the following article in 5 bullet points:\n\n{text}\n\nBulleted summary:"
    out = model.generate(prompt, max_tokens=max_tokens, temperature=0.1)
    return out['text']

if __name__ == '__main__':
    import sys
    print(summarize(' '.join(sys.argv[1:])))
This path usually yields the best performance and power efficiency because the SDK uses the NPU drivers directly, which is the sort of vendor-backed optimization covered in edge AI platform discussions like Edge AI at the Platform Level.
Option B — Use llama-cpp-python binding (open toolchain)
Install llama-cpp-python, which builds its own bundled copy of llama.cpp at install time. The Pi 5 + AI HAT+ 2 sometimes exposes a custom backend for offloading; if not, a quantized model file still runs on the CPU with NEON acceleration.
pip install llama-cpp-python
# summarizer_llama.py
from llama_cpp import Llama

llm = Llama(model_path='/home/pi/models/7b-ggml-q4.bin', n_ctx=2048)

def summarize(text):
    # Put the instruction after the article so the model continues with the summary
    prompt = f"{text[:6000]}\n\nTL;DR (5 bullets):"
    # create_completion returns an OpenAI-style dict with a 'choices' list
    resp = llm.create_completion(prompt=prompt, max_tokens=150, temperature=0.1)
    return resp['choices'][0]['text']

if __name__ == '__main__':
    import sys
    print(summarize(' '.join(sys.argv[1:])))
Performance note: a well-quantized 7B model on Pi 5 with AI HAT+ 2 typically gives workable latency for many use cases (from a few seconds to under 30 s per generation, depending on prompt length and model). Early-2026 community benchmarks show 7B q4 models on similar edge NPUs producing 10–30 tokens/sec for standard generations; expect variance. For guidance on balancing latency, cost and hosting strategy, see hybrid edge–regional hosting strategies.
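As a rough sanity check on those figures, the generation time for a 150-token summary (the max_tokens value used in the snippets) follows directly from the quoted throughput range. The helper name below is hypothetical, and the estimate deliberately ignores prompt processing time, which grows with article length.

```python
def generation_seconds(max_tokens: int, tokens_per_sec: float) -> float:
    """Time to emit max_tokens at a steady decode rate (prompt processing excluded)."""
    return max_tokens / tokens_per_sec


# Throughput range quoted in the performance note: 10 to 30 tokens/sec.
fastest = generation_seconds(150, 30)
slowest = generation_seconds(150, 10)
print(f"{fastest:.0f}-{slowest:.0f} s per 150-token summary")  # prints "5-15 s per 150-token summary"
```

The remaining gap up to the 30 s upper bound would have to come from prompt processing and model variance.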
Step 6 — Wire scraper output into the summarizer
Extend the earlier scraper worker to call the summarizer after saving raw text. Keep the summarizer local and synchronous or provide a small async inference queue to avoid blocking scraping.
# inside worker, after text extraction
from summarizer_llama import summarize  # or summarizer_aihat; in practice, import at the top of scraper.py
summary = summarize(text[:6000])  # trim to token/window limits; this call blocks the event loop while it runs
conn = sqlite3.connect(DB)
# SQLite has no "ADD COLUMN IF NOT EXISTS", so check the schema first
columns = [row[1] for row in conn.execute('PRAGMA table_info(pages)')]
if 'summary' not in columns:
    conn.execute('ALTER TABLE pages ADD COLUMN summary TEXT')
conn.execute('UPDATE pages SET summary=? WHERE url=?', (summary, url))
conn.commit()
conn.close()
Keep prompts deterministic and low-variance for consistent summaries. Consider storing prompt templates and few-shot examples in a file for maintainability. For ops and creator workflows that keep inference local and auditable, check recommendations in the Behind the Edge playbook.
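The "small async inference queue" mentioned at the start of this step could look like the sketch below. It assumes a blocking summarize(text) function like the ones in Step 5; the names summarizer_loop and summarize_all and the queue size of 100 are hypothetical. A single consumer runs one inference at a time in a worker thread, so scraping coroutines keep running while the model generates.

```python
import asyncio


async def summarizer_loop(jobs, summarize, results):
    """Single consumer: run one blocking summarize() call at a time in a thread."""
    while True:
        url, text = await jobs.get()
        try:
            # to_thread keeps the event loop free while the model generates
            results[url] = await asyncio.to_thread(summarize, text)
        except Exception as exc:
            results[url] = None
            print("summarize failed", url, exc)
        finally:
            jobs.task_done()


async def summarize_all(pages, summarize):
    """Queue (url, text) pairs and wait until every one has been summarized."""
    jobs = asyncio.Queue(maxsize=100)  # bounded so scraping cannot outrun inference
    results = {}
    consumer = asyncio.create_task(summarizer_loop(jobs, summarize, results))
    for url, text in pages:
        await jobs.put((url, text[:6000]))  # same trim as the inline version
    await jobs.join()
    consumer.cancel()
    await asyncio.gather(consumer, return_exceptions=True)
    return results
```

In the scraper, workers would call `await jobs.put((url, text))` instead of calling summarize directly, and the consumer would write each summary to SQLite rather than to a dict.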
Operational considerations
Monitoring & logs
- Use systemd to run the pipeline and restart on failure.
- Ship lightweight metrics via Prometheus node_exporter or push metrics to a private Grafana instance.
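A systemd unit for the scraper can be generated with a few lines of Python, which keeps paths and URLs in one place. This is a sketch under assumptions: the user name, directory layout and unit description are illustrative, and render_unit is a hypothetical helper. The unit directives themselves (User, WorkingDirectory, ExecStart, Restart, RestartSec, WantedBy) are standard systemd options.

```python
UNIT_TEMPLATE = """\
[Unit]
Description=Pi scraper and summarizer
After=network-online.target
Wants=network-online.target

[Service]
User={user}
WorkingDirectory={workdir}
ExecStart={python} {workdir}/scraper.py {urls}
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target
"""


def render_unit(user, workdir, python, urls):
    """Fill in the unit template; '%' is doubled because systemd expands % specifiers."""
    escaped = " ".join(u.replace("%", "%%") for u in urls)
    return UNIT_TEMPLATE.format(user=user, workdir=workdir, python=python, urls=escaped)


if __name__ == "__main__":
    # Assumed layout: the virtualenv from Step 1 under /home/pi/pi_scraper
    print(render_unit(
        user="pi",
        workdir="/home/pi/pi_scraper",
        python="/home/pi/pi_scraper/venv/bin/python",
        urls=["https://example.com/article1"],
    ))
```

Write the output to a file under /etc/systemd/system/ and enable it with systemctl. Because scraper.py exits once its queue is empty, recurring runs would also need a systemd timer or an outer scheduling loop; Restart=on-failure only covers crashes.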
Storage & retention
- Rotate or compress raw HTML to save space.
- If you need to minimize stored PII, keep only content hashes of scraped pages instead of the raw HTML.
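Both retention options above fit in a few standard-library calls. The sketch below is illustrative: the function names are hypothetical, and SHA-256 and gzip are assumed choices rather than anything the pipeline requires.

```python
import gzip
import hashlib


def content_hash(html: str) -> str:
    """SHA-256 of the page, usable for change detection without keeping the HTML."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def compress_html(html: str) -> bytes:
    """Gzip the raw HTML before writing it to disk or a BLOB column."""
    return gzip.compress(html.encode("utf-8"))


def decompress_html(blob: bytes) -> str:
    """Inverse of compress_html."""
    return gzip.decompress(blob).decode("utf-8")
```

Comparing content_hash values between runs also lets the scraper skip re-summarizing pages that have not changed.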
Security
- Run the service as a non-root user.
- Use AppArmor/SELinux and filesystem quotas.
- Encrypt backups of summaries if they contain sensitive data.
Scaling & cost tradeoffs
For higher throughput or model experimentation, consider:
- Running multiple Pi 5 units with HATs behind a small local load balancer.
- Offloading heavy JS rendering to a central headless Chromium server if many targets require it.
- Distributing model updates centrally: build quantized weights once and push them to devices during maintenance windows.
Anti-blocking best practices (ethical + practical)
Local scraping doesn't mean you can ignore target rules. Follow these:
- Respect robots.txt and site terms.
- Use realistic but honest user agents and identify your crawler.
- Implement exponential backoff on 429/503 responses and IP-friendly rate limits.
- Use a headless browser only when necessary; it makes your crawler more conspicuous and costs more CPU and memory.
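The backoff rule in the list above can be expressed as a small pure function. This is a sketch, not the tutorial's implementation: the names backoff_delay and should_retry, the 1-second base, the 300-second cap and the jitter range are all assumptions to tune per target.

```python
import random

RETRYABLE = {429, 503}


def backoff_delay(attempt, retry_after=None, base=1.0, cap=300.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Honor a numeric Retry-After header when the server sends one
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff
    ceiling = min(cap, base * 2 ** attempt)
    # Jitter spreads retries between half the ceiling and the full ceiling
    return ceiling * (0.5 + 0.5 * rng())


def should_retry(status, attempt, max_attempts=5):
    """Retry only on throttling/unavailable responses, and only a few times."""
    return status in RETRYABLE and attempt < max_attempts
```

In the scraper's fetch path, a 429 or 503 response would trigger `await asyncio.sleep(backoff_delay(attempt, resp.headers.get("Retry-After")))` before the next attempt.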
Model quantization tips (2026 updates)
Toolchains released in late 2025 and early 2026 added quantization schemes that give up less accuracy for a given speedup on NPUs. Practical tips:
- Prefer q4_K for speed when you need many inferences per day.
- Test a small validation set to measure semantic retention after quantization.
- Use mixed precision for sections of your model if supported by the SDK (some NPUs can mix 4-bit storage and 16-bit compute).
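One cheap way to run the validation check from the list above is to compare summaries from the quantized model against reference summaries (for example, from the unquantized model) with a unigram-overlap F1 score. This is a crude lexical proxy rather than a true measure of semantic retention, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(reference, candidate):
    """F1 over the word multisets of a reference and a candidate summary."""
    ref, cand = Counter(tokens(reference)), Counter(tokens(candidate))
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mean_retention(pairs):
    """Average unigram F1 over (reference, candidate) summary pairs."""
    scores = [unigram_f1(ref, cand) for ref, cand in pairs]
    return sum(scores) / len(scores)
```

Tracking mean_retention over a fixed validation set makes it easy to spot a quantization scheme that degrades summaries noticeably more than another.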
Troubleshooting & performance tuning
- If inference is slow, verify the SDK is using the NPU (check vendor tools for offload metrics).
- Reduce context size (n_ctx) to shrink the KV cache, which lowers memory use and helps the model fit in RAM.
- Trim long pages before sending to the model; summarize in chunks and merge results if needed.
- Profile CPU, memory and NPU throughput during a test run and tune worker concurrency accordingly.
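The "summarize in chunks and merge" tip above can be sketched as a two-pass routine that works with any summarize(text) callable. The function names and the 6,000-character default (matching the trim used earlier) are assumptions; the sketch splits on whitespace because the scraper joins paragraphs with spaces.

```python
def split_chunks(text, max_chars=6000):
    """Pack whitespace-separated words into chunks of at most max_chars."""
    chunks, current = [], ""
    for word in text.split():
        # Hard-split any single token that is longer than the limit
        for i in range(0, len(word), max_chars):
            part = word[i:i + max_chars]
            if current and len(current) + 1 + len(part) > max_chars:
                chunks.append(current)
                current = part
            else:
                current = f"{current} {part}" if current else part
    if current:
        chunks.append(current)
    return chunks


def summarize_long(text, summarize, max_chars=6000):
    """Summarize each chunk, then summarize the concatenated partial summaries."""
    chunks = split_chunks(text, max_chars)
    if not chunks:
        return ""
    if len(chunks) == 1:
        return summarize(chunks[0])
    partials = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partials))
```

The merge pass assumes the partial summaries together fit in one context window; very long pages may need a second round of chunking.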
Legal & compliance checklist
Before running production scraping + summarization:
- Confirm the target site's terms-of-service allow scraping for your use case.
- Review copyright implications for downstream summaries and storage.
- Implement data minimization and deletion policies to meet privacy requirements.
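A deletion policy from the checklist above can start as a single SQL statement run on a schedule. The sketch assumes the pages table and its created column from the scraper schema; purge_old_pages and the 30-day default are hypothetical.

```python
import sqlite3


def purge_old_pages(conn, max_age_days=30):
    """Delete rows older than max_age_days; returns the number of rows removed."""
    cur = conn.execute(
        "DELETE FROM pages WHERE created < datetime('now', ?)",
        (f"-{int(max_age_days)} days",),
    )
    conn.commit()
    return cur.rowcount
```

Both CURRENT_TIMESTAMP and datetime('now') use UTC in the same text format, so the comparison is consistent regardless of the Pi's local time zone.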
Example end-to-end run (quick commands)
# 1. Start virtualenv
source ~/pi_scraper/venv/bin/activate
# 2. Run scraper for a set of URLs
python scraper.py https://example.com/article1 https://example.com/article2
# 3. Summaries are generated automatically and stored in scraper.db
sqlite3 scraper.db "SELECT url, summary FROM pages;"
2026 trends & future-proofing
Through late 2025 and early 2026, three trends matter for on-device scraping and summarization:
- Edge-optimized NPUs and SDK maturity: hardware vendors improved Python bindings and model-serving runtimes to support enterprise workflows.
- Quantization and compiler toolchains: quantization standards stabilized so cross-vendor model portability improved.
- Privacy-first tooling: more libraries surfaced policies and compliance helpers (robot scanners, automated PII redaction).
To future-proof your pipeline:
- Keep model conversion and deployment automated (CI pipeline builds quantized artifacts) — follow deployment checklists like our cloud migration and deployment guide for reproducible artifacts.
- Abstract the inference layer so you can swap between vendor SDKs and open runtimes; see Behind the Edge for abstraction examples.
- Log and version prompts and model hashes for reproducibility and audits.
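One way to abstract the inference layer, as suggested above, is a small interface plus an adapter around whichever summarize function is installed. This is a sketch: Summarizer, CallableBackend and load_backend are hypothetical names, and the backend keys assume the summarizer_llama and summarizer_aihat modules from Step 5.

```python
from typing import Callable, Protocol


class Summarizer(Protocol):
    """Anything with a summarize(text) -> str method can serve as a backend."""

    def summarize(self, text: str) -> str: ...


class CallableBackend:
    """Adapter that wraps a plain summarize(text) function."""

    def __init__(self, name: str, fn: Callable[[str], str]):
        self.name = name
        self._fn = fn

    def summarize(self, text: str) -> str:
        return self._fn(text)


def load_backend(name: str) -> Summarizer:
    """Pick a runtime by name; imports are deferred so only one must be installed."""
    if name == "llama":
        from summarizer_llama import summarize
    elif name == "aihat":
        from summarizer_aihat import summarize
    else:
        raise ValueError(f"unknown backend: {name}")
    return CallableBackend(name, summarize)
```

The scraper then depends only on `backend.summarize(text)`, so switching runtimes becomes a configuration change instead of a code change.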
Real-world example & metrics (mini-case study)
A marketing intelligence team deployed a 2-node Pi 5 + AI HAT+ 2 cluster in 2025 for competitor monitoring. Their configuration:
- Two Pi 5 nodes, each running a 7B q4 model.
- Scraped 100 sites once every 12 hours with politeness limits.
- Average summarization latency: 8–18s per article; daily inference volume ~6,000 summaries.
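The reported figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above and assumes each node runs one inference at a time with work split evenly between nodes; the helper name is hypothetical.

```python
SUMMARIES_PER_DAY = 6000
NODES = 2
LATENCY_RANGE_S = (8, 18)  # reported per-article latency range, in seconds


def busy_hours_per_node(summaries, nodes, seconds_each):
    """Hours per day each node spends on inference, with an even split."""
    return summaries * seconds_each / nodes / 3600


low, high = (busy_hours_per_node(SUMMARIES_PER_DAY, NODES, s) for s in LATENCY_RANGE_S)
print(f"{low:.1f}-{high:.1f} busy hours per node per day")  # prints "6.7-15.0 busy hours per node per day"
```

Under those assumptions the workload fits within a day on two nodes, with headroom shrinking as latency approaches the upper end of the range.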
Benefits reported:
- Zero cloud inference spend; monthly operating cost ~ $20 (power + occasional hardware).
- Data never left their site, simplifying legal review.
- Ability to iterate quickly on prompt templates and store prompt+model version hashes for audits.
Key takeaways
- Raspberry Pi 5 + AI HAT+ 2 is a practical, cost-effective platform in 2026 for privacy-preserving scraping and summarization.
- Use quantized 7B-class models for the best balance of latency and quality on-device.
- Design with politeness and compliance in mind: robots.txt, rate limits, and storage minimization.
- Abstract inference to swap runtimes as SDKs and model formats evolve — see platform-level guidance in Edge AI at the Platform Level and operational guidance in Behind the Edge.
Pro tip: Automate model conversion and device deployment in CI. Keep a manifest with model hash, quantization scheme, prompt template, and inference parameters for every deployment.
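A deployment manifest like the one described in the tip can be produced with the standard library. This is a minimal sketch: the field names and the build_manifest helper are assumptions, not a fixed schema.

```python
import hashlib
import json


def file_sha256(path, chunk_size=1 << 20):
    """Hash a (potentially multi-gigabyte) model file in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(model_path, quantization, prompt_template, params):
    """Collect everything needed to reproduce or audit a deployment."""
    return {
        "model_path": str(model_path),
        "model_sha256": file_sha256(model_path),
        "quantization": quantization,
        "prompt_template": prompt_template,
        "prompt_sha256": hashlib.sha256(prompt_template.encode("utf-8")).hexdigest(),
        "inference_params": params,
    }


def write_manifest(manifest, path):
    """Write the manifest as stable, diff-friendly JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)
```

Storing the manifest next to each summary batch, or its hash in the database row, ties every output back to the exact model and prompt that produced it.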
Next steps — get this running quickly
- Order an AI HAT+ 2 and a Pi 5 if you haven’t already (vendors stocked units through Q4 2025).
- Pick a permissively licensed model and convert it to a quantized backend you trust.
- Prototype the scraper on your laptop and then move to the Pi for NPU tuning.
- Automate deployment and implement minimal monitoring — for hosting tradeoffs see hybrid edge strategies.
Call to action
If you want a starting point, clone the sample repo we use for this tutorial (contains the scraper, summarizer glue code, and a systemd unit file) and adapt it to your targets. Test with one or two sites, validate summaries, and then scale carefully. Need help customizing prompts, quantizing a specific model, or tuning inference on AI HAT+ 2? Reach out to our engineering team or leave a message in the comments — we can help you migrate a cloud workflow to a secure, on-prem Pi cluster.
Related Reading
- Edge AI at the Platform Level: On‑Device Models, Cold Starts and Developer Workflows (2026)
- Behind the Edge: A 2026 Playbook for Creator‑Led, Cost‑Aware Cloud Experiences
- Hybrid Edge–Regional Hosting Strategies for 2026: Balancing Latency, Cost, and Sustainability
- Cloud Migration Checklist: 15 Steps for a Safer Lift‑and‑Shift (2026 Update)