Creating a Local AI‑Powered Browser Extension for Private Research
Build a private browser extension that runs on‑device LLMs for offline summarization and notes — architecture, code, and packaging tips for 2026.
Build a private, on‑device AI browser extension for research — inspired by Puma
You want fast, private summarization and note‑taking inside your browser, but you don't want your research or URLs uploaded to a cloud API. In 2026, on‑device LLMs are powerful enough to run locally on laptops, desktops, and even some phones. This guide shows how to build a browser extension that runs models on the user's device (inspired by Puma's local‑AI approach), with architecture choices, concrete code examples, and packaging tips for Chrome alternatives and mobile.
Why on‑device LLMs matter in 2026
By late 2025 and into 2026 the ecosystem matured: efficient quantized formats (gguf), WebGPU adoption across major browsers, and small‑footprint models (2B–7B) tuned for local execution. Hardware also improved — the Raspberry Pi 5 plus AI HAT upgrades and modern laptops with Apple silicon or AVX512 on Intel make local inference feasible for private workflows.
What you get with on‑device extensions:
- Data never leaves the device — stronger privacy and compliance.
- Lower latency for interactive summarization and note taking.
- Offline capability for travel or sensitive research, including power- and battery-constrained field work.
High‑level architecture — two recommended approaches
Pick one of two architecture patterns depending on target platforms, model size, and performance goals.
1) Pure WebAssembly/WebGPU in‑extension (no native host)
Run a quantized model via WASM + WebGPU or WebNN inside a WebWorker. Best when you target desktop browsers with WebGPU support and smaller models (<=7B quantized).
- Pros: Single package, no native installer, easier cross‑browser distribution.
- Cons: Browser memory limits, slower than native GPU, model size must fit device memory.
2) Native companion + WebExtension (native messaging)
Use a native helper service (local server or native messaging host) to run the heavy model using CUDA/Metal/NNAPI. The extension communicates over native messaging or localhost websocket — a pattern common where teams adopt shared hardware backends or micro‑edge hosts.
- Pros: Full hardware acceleration, supports large models, better throughput.
- Cons: Requires packaging and signing a native app per OS, which adds install friction (plan for device identity and installer approval workflows).
Which to choose?
If you want the lowest friction and target research workflows on modern laptops and Chrome alternatives (Brave, Vivaldi, Edge), start with WebAssembly/WebGPU. If you need best throughput or want to target desktops with discrete GPUs or Raspberry Pi + AI HAT, use a native companion or a micro‑edge host.
End‑to‑end flow (short)
- User triggers summarization via popup or keyboard shortcut.
- Content script collects page text or selection, sends text to background worker.
- Worker sends tokenized text to LLM runtime (WASM or native host).
- Model returns summary; extension displays result and optionally saves encrypted note locally.
Concrete implementation: WebExtension + WASM model (7B quantized example)
This section shows a minimal, practical path to get a working proof‑of‑concept that runs fully local summarization inside a browser extension.
1) Scaffolding
Create a WebExtension (MV3) scaffold that supports Chrome alternatives and Firefox. Key files:
- manifest.json (MV3)
- popup.html + popup.js
- content_script.js
- background service worker (background.js)
- model_worker.js (WebWorker to host WASM runtime)
// manifest.json (MV3 minimal)
{
"manifest_version": 3,
"name": "Private Research Summarizer",
"version": "0.1.0",
"permissions": ["scripting", "storage", "activeTab"],
"action": { "default_popup": "popup.html" },
"background": { "service_worker": "background.js" },
"content_scripts": [{
"matches": [""],
"js": ["content_script.js"],
"run_at": "document_idle"
}]
}
2) Content capture and UI
Use a content script to get page text or the selected text, then send it to the background worker. Keep permissions minimal and explicit.
// content_script.js
// keyboard shortcuts are handled via the commands API in the background (see the sketch after this block)
function getSelectionText() {
const s = window.getSelection();
return s ? s.toString() : '';
}
chrome.runtime.onMessage.addListener((msg, sender, reply) => {
if (msg.action === 'capture-page') {
const text = document.body.innerText || '';
reply({ text });
} else if (msg.action === 'capture-selection') {
reply({ text: getSelectionText() });
}
});
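The content script above deliberately leaves keyboard handling out: in a WebExtension, shortcuts are normally declared through the commands API and handled in the background script. Below is a minimal sketch; the command name and key bindings are illustrative, and ensureWorker() refers to the lazy worker helper shown in the background section below.
// manifest.json fragment: declare a keyboard shortcut (key bindings are illustrative)
"commands": {
  "summarize-selection": {
    "suggested_key": { "default": "Ctrl+Shift+S", "mac": "Command+Shift+S" },
    "description": "Summarize the current selection"
  }
}
// background.js fragment: handle the shortcut
chrome.commands.onCommand.addListener(async (command) => {
  if (command !== 'summarize-selection') return;
  const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
  // ask the content script for the current selection, then hand it to the model worker
  chrome.tabs.sendMessage(tab.id, { action: 'capture-selection' }, ({ text }) => {
    ensureWorker().postMessage({ cmd: 'summarize', payload: { text } });
  });
});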
3) Model worker (WASM + WebGPU)
Use an optimized browser runtime such as a WASM port of llama.cpp or MLC‑LLM web builds. In 2026, several runtimes offer gguf model loading with WebGPU acceleration; you should pick one tested for your target browsers.
Load the model as a separate asset (do not bundle multi‑GB models inside the extension). Store models in IndexedDB or use the File System Access API to store large files and avoid repeated downloads.
// model_worker.js (simplified)
self.onmessage = async (evt) => {
const { cmd, payload } = evt.data;
if (cmd === 'init') {
// fetch runtime & model manifest, then instantiate WASM
importScripts('llama_wasm_runtime.js'); // placeholder runtime
await LlamaWasm.init({ backend: 'webgpu' });
// load quantized gguf model from user folder or CDN
await LlamaWasm.loadModel('/models/7b-q4.gguf');
postMessage({ status: 'ready' });
}
if (cmd === 'summarize') {
const text = payload.text;
const prompt = `Summarize the following for research notes:\n\n${text}`;
const summary = await LlamaWasm.generate(prompt, { max_tokens: 200 });
postMessage({ status: 'result', summary });
}
};
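To avoid re-downloading a multi‑gigabyte file on every init, the worker can check a local cache before fetching. Here is a minimal sketch using the Cache API; the cache name and model URL are illustrative, and it assumes your runtime can accept raw bytes rather than only a path.
// model_cache.js (sketch): fetch the gguf file once, then serve it from the Cache API
async function getModelBytes(url) {
  const cache = await caches.open('llm-models');
  let response = await cache.match(url);
  if (!response) {
    response = await fetch(url);
    // keep a copy so later sessions skip the download
    await cache.put(url, response.clone());
  }
  return response.arrayBuffer();
}
For very large models, the File System Access API or OPFS path mentioned above gives you more headroom than Cache API quotas.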
4) Background service worker orchestration
The background service worker receives UI requests and routes them to model_worker. Use Comlink or structured clone to pass messages.
// background.js (simplified)
// MV3 service workers are ephemeral, so create the model worker lazily rather than
// only in onInstalled. If your target browser cannot spawn dedicated workers from
// the extension service worker, host model_worker.js in an offscreen document or an
// extension page instead; the message flow stays the same.
let modelWorker = null;
function ensureWorker() {
  if (!modelWorker) {
    modelWorker = new Worker('model_worker.js');
    modelWorker.postMessage({ cmd: 'init' });
    modelWorker.onmessage = (e) => {
      // forward ready/result messages to the popup
      chrome.runtime.sendMessage(e.data);
    };
  }
  return modelWorker;
}
chrome.runtime.onMessage.addListener((msg, sender, reply) => {
  if (msg.action === 'summarize') {
    ensureWorker().postMessage({ cmd: 'summarize', payload: { text: msg.text } });
    reply({ status: 'queued' });
  }
});
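The scaffold lists popup.html and popup.js, but their wiring is not shown above. A minimal popup.js sketch might look like the following; the element IDs are illustrative.
// popup.js (sketch): capture the page, request a summary, render the result
document.getElementById('summarize').addEventListener('click', async () => {
  const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
  // ask the content script for the page text
  chrome.tabs.sendMessage(tab.id, { action: 'capture-page' }, ({ text }) => {
    // hand the text to the background worker for local inference
    chrome.runtime.sendMessage({ action: 'summarize', text });
  });
});
// render results forwarded by the background worker
chrome.runtime.onMessage.addListener((msg) => {
  if (msg.status === 'result') {
    document.getElementById('output').textContent = msg.summary;
  }
});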
Native companion pattern — when you need it
For heavy models, use a native host. This is the recommended path if you want 13B+ quality or GPU performance on desktop.
Key implementation notes:
- Use native messaging hosts (Chrome, Edge, Brave) or a localhost socket bound to the loopback interface with strict origin checks.
- Provide installers with code signing for Windows (.msi/.exe), macOS (.dmg/.pkg), and Linux packages (.deb/.rpm). Include an auto‑update channel or delta update mechanism for model binaries.
- Implement an IPC protocol with JSON over stdin/stdout or a tiny TLS‑secured localhost port if you need websockets.
// Example: background.js sends a request to the native host
// (requires the "nativeMessaging" permission in manifest.json; "text" is the captured page text)
chrome.runtime.sendNativeMessage('com.example.llm_host', { cmd: 'summarize', text })
  .then(response => console.log('summary', response.summary))
  .catch(err => console.error(err));
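For the native messaging route, the browser also needs a host manifest that points at your helper binary. A sketch follows; the path and extension ID are placeholders, and the install location of this file varies by OS and browser.
// com.example.llm_host.json (native messaging host manifest)
{
  "name": "com.example.llm_host",
  "description": "Local LLM inference host",
  "path": "/usr/local/bin/llm_host",
  "type": "stdio",
  "allowed_origins": ["chrome-extension://<your-extension-id>/"]
}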
Privacy & security best practices (critical)
Privacy is the main reason to choose an on‑device approach. Follow these rules to keep it private and auditable:
- No default telemetry: Disable all telemetry and require explicit opt‑in. Document exactly what is collected if any.
- Local‑only by design: Make the extension fail closed if it cannot guarantee local inference (don't fall back to the cloud by default).
- Encrypt local notes: Use AES‑GCM or the platform keystore to encrypt saved summaries and notes in IndexedDB or the filesystem (a WebCrypto sketch appears below), and keep an incident response playbook for audits.
- Least permissions: Request only the permissions you need: activeTab, storage, scripting. Avoid broad host permissions unless necessary.
- Model provenance & licensing: Ship instructions for using properly licensed models (Llama 2 family, Mistral, etc.). Do not bundle models with restrictive licensing unless compliant.
A private summarization tool must be auditable: make model loading and inference transparent and provide a clear privacy policy.
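For the note-encryption rule above, the browser's built-in WebCrypto API is enough. Here is a minimal sketch that derives an AES-GCM key from a user passphrase with PBKDF2; the iteration count and storage layout are illustrative.
// note_crypto.js (sketch): encrypt a note with AES-GCM using a passphrase-derived key
async function deriveKey(passphrase, salt) {
  const material = await crypto.subtle.importKey(
    'raw', new TextEncoder().encode(passphrase), 'PBKDF2', false, ['deriveKey']);
  return crypto.subtle.deriveKey(
    { name: 'PBKDF2', salt, iterations: 310000, hash: 'SHA-256' },
    material, { name: 'AES-GCM', length: 256 }, false, ['encrypt', 'decrypt']);
}
async function encryptNote(passphrase, plaintext) {
  const salt = crypto.getRandomValues(new Uint8Array(16));
  const iv = crypto.getRandomValues(new Uint8Array(12)); // fresh IV per note
  const key = await deriveKey(passphrase, salt);
  const ciphertext = await crypto.subtle.encrypt(
    { name: 'AES-GCM', iv }, key, new TextEncoder().encode(plaintext));
  // persist salt + iv + ciphertext together (e.g. in IndexedDB)
  return { salt, iv, ciphertext: new Uint8Array(ciphertext) };
}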
Handling model files: storage and updates
Model files are large. Don't package them in the extension. Instead:
- Provide a settings UI to download models into a user data directory (use File System Access API on supported browsers or native installer path for native host).
- Use delta updates or chunked downloads to recover from interruptions.
- Store models in gguf or other efficient quantized formats supported by your runtime; prefer 4‑bit quantized variants (Q4_K_M) for memory efficiency.
Example: on first run, prompt the user to download a recommended model (2B for low‑end machines, 7B for mainstream laptops, native host for 13B+).
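One way to implement the chunked, resumable download mentioned above is to write into the origin private file system and resume from whatever is already on disk. The sketch below assumes a Chromium-based browser with OPFS support and a server that honors Range requests; the chunk size is illustrative.
// model_downloader.js (sketch): resumable model download into the origin private file system
async function downloadModelResumable(url, fileName, chunkSize = 8 * 1024 * 1024) {
  const root = await navigator.storage.getDirectory();
  const handle = await root.getFileHandle(fileName, { create: true });
  const offsetStart = (await handle.getFile()).size; // resume point
  const total = Number((await fetch(url, { method: 'HEAD' })).headers.get('Content-Length'));
  const writable = await handle.createWritable({ keepExistingData: true });
  await writable.seek(offsetStart);
  for (let offset = offsetStart; offset < total; ) {
    const end = Math.min(offset + chunkSize, total) - 1;
    const res = await fetch(url, { headers: { Range: `bytes=${offset}-${end}` } });
    const buf = await res.arrayBuffer();
    await writable.write(buf);
    offset += buf.byteLength;
  }
  await writable.close();
}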
Performance tuning tips
- Quantization tradeoffs: 4‑bit quantized models dramatically reduce RAM but can affect quality—test with your domain text.
- Context window: For summarization, chunk long pages and summarize each chunk, then summarize the summaries (hierarchical summarization).
- Prompt engineering: Use concise instruction prompts and limit max tokens to manage performance.
- WebGPU pipelines: Pin to a preferred backend and fall back to WASM CPU if GPU fails (a detection sketch follows this list). Consider micro‑edge or hosted options if local GPU is insufficient.
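Backend detection can be as simple as probing for a WebGPU adapter before initializing the runtime. A sketch follows; the backend labels are illustrative and should match whatever your runtime's init option expects.
// backend_detect.js (sketch): prefer WebGPU, fall back to a WASM CPU build
async function pickBackend() {
  if ('gpu' in navigator) {
    try {
      const adapter = await navigator.gpu.requestAdapter();
      if (adapter) return 'webgpu';
    } catch (err) {
      console.warn('WebGPU probe failed, falling back to CPU', err);
    }
  }
  return 'wasm-cpu'; // SIMD/threaded WASM build
}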
Packaging for Chrome alternatives and mobile
Targeting Chrome alternatives (Brave, Vivaldi, Edge) is mostly identical to Chrome — they support MV3 WebExtensions. Firefox supports MV3 with some differences.
- Chrome/Brave/Vivaldi/Edge: Publish CRX or use the browser stores; test MV3 service worker behavior across them.
- Firefox: Use .xpi packages; some WebGPU features may differ; test runtime compatibility.
- Safari/iOS: Mobile browsers have stricter extension support. iOS requires Safari App Extensions within a native app; fully on‑device LLMs on iOS are typically delivered via native apps (Puma shows the UX model).
- Android: Some mobile browsers allow extensions (Firefox, Kiwi). For broader reach, provide a companion Android app that exposes a WebView or shares a local service — treat this like any other field kit or local server deployment.
Mobile note
Puma (a modern mobile browser that champions on‑device AI) demonstrates the user experience: native integration of LLMs on mobile is smoother than web extension approaches. If you want true mobile on‑device performance and privacy, consider shipping a native mobile app rather than a pure extension. For multi‑user lab or kiosk setups, look into community cloud co‑op patterns and local governance.
Developer checklist before public release
- Audit model licenses and include explicit user consent for model downloads.
- Verify that no data is sent externally by default; run a network and incident audit during tests.
- Provide clear model management in settings (download, delete, update).
- Implement encryption and optional passphrase for saved notes.
- Set up an opt‑in telemetry and bug report flow that optionally includes model traces but never raw user text without consent.
- Prepare installers and sign them per platform to avoid user friction; combine with device identity and installer approval flows.
Example: hierarchical summarization algorithm
Hierarchical summarization reduces token load and improves speed for long pages. A runnable sketch (assuming an asynchronous model.generate API):
async function hierarchicalSummarize(model, text, chunkSize = 3000, overlap = 200) {
  // split into overlapping character chunks to stay inside the context window
  const chunks = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  // first pass: summarize each chunk independently
  const summaries = [];
  for (const chunk of chunks) {
    summaries.push(await model.generate(`Summarize briefly:\n\n${chunk}`));
  }
  // second pass: combine the chunk summaries into a final research note
  return model.generate(
    `Combine these into a concise research note:\n\n${summaries.join('\n\n')}`
  );
}
2026 trends and future predictions (practical implications)
- WebGPU and WebNN will continue to close the gap with native GPU performance — expect in‑browser inference of 2–7B models to be common.
- Model formats like gguf and robust quantization toolchains will standardize model exchange for on‑device use.
- Edge devices (RPi + AI HAT, NPU phones) will make offline group deployments viable for teams wanting private networks for research.
Real‑world example: Raspberry Pi research kiosk
Use the native companion approach to create a local research kiosk: a Raspberry Pi 5 + AI HAT 2 hosts a 7B quantized model, a small web server serves the extension UI over the local network, and researchers connect with a browser (Brave/Vivaldi) extension that talks to the local Pi. This pattern provides a private, shared hardware backend while keeping data inside the organization. For on‑site setups, borrow best practices from edge field kits.
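In the kiosk pattern, the extension's background logic simply swaps the local worker for a LAN endpoint. A sketch, assuming the Pi exposes a small HTTP API; the hostname, port, and route are illustrative, and the kiosk origin must be listed under host_permissions in the manifest.
// background.js variant (sketch): send text to a shared Pi host on the local network
async function summarizeOnKiosk(text) {
  const res = await fetch('http://raspberrypi.local:8080/summarize', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text, max_tokens: 200 })
  });
  const { summary } = await res.json();
  return summary;
}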
Common pitfalls and how to avoid them
- Bundling models: Don't include multi‑GB models inside the extension bundle — use user download or native installation.
- Background service worker timeouts: MV3 service workers can be ephemeral; spawn a dedicated worker for model orchestration (or rely on a native host).
- Cross‑browser GPU inconsistencies: Provide CPU fallback and detect backend capabilities at runtime.
- Security: Validate any model files (checksum) before loading to prevent tampering, and tie this into your incident response plan (see the sketch after this list).
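For the checksum point above, WebCrypto's digest call is sufficient; the expected hash would ship with your extension or a signed model manifest. A sketch:
// verify_model.js (sketch): compare a SHA-256 digest against a known-good value before loading
async function verifyModel(buffer, expectedHexDigest) {
  const digest = await crypto.subtle.digest('SHA-256', buffer);
  const hex = Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
  if (hex !== expectedHexDigest) {
    throw new Error('Model checksum mismatch: refusing to load');
  }
}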
Actionable takeaways (quick checklist)
- Decide architecture: WASM/WebGPU for simplicity, native host for performance.
- Use gguf quantized models; offer 2B/7B defaults and native path for 13B+.
- Store models outside the extension; use File System Access API or native paths.
- Encrypt saved notes; default to no telemetry and explicit opt‑in.
- Test on Chrome alternatives (Brave, Vivaldi, Edge) and mobile strategies (native app for iOS/Android).
Closing thoughts
On‑device LLMs in browser extensions unlock private, low‑latency research workflows that weren't practical a few years ago. Inspired by Puma's local‑AI UX and enabled by 2025–2026 advances (WebGPU, gguf, Pi AI HATs and better quantization), you can build a robust summarization and note‑taking extension that keeps data on the device and scales from laptop to Raspberry Pi kiosks.
Next step: pick a path — WebAssembly proof‑of‑concept or native companion — and build a minimal extension with one model (2B or 7B). Start with a small UI that captures selection, summarizes, and encrypts notes. Iterate on prompts and model sizes to find your sweet spot for quality vs. performance.
Call to action
If you want, I can generate a starter repo (manifest + worker + model loader) tailored to your target platform (desktop WebGPU or native host for macOS/Windows/Linux). Reply with the platforms and model sizes you want to support and I'll scaffold code and packaging instructions you can run locally.