April 6, 2026
The Problem
My downstairs neighbor regularly plays an acoustic instrument. Not a problem in itself — but I wanted to know: When exactly and for how long? So I built a small AI project that logs this automatically.
It’s also a learning project: How do you deploy an open-source AI model on a Raspberry Pi? How do you run real-time audio classification on an edge device?
The Idea
A Raspberry Pi 4 with an I2S microphone listens continuously. As soon as an acoustic instrument is detected, the system logs:
- When it starts and stops
- How long the session lasts
- Which instrument was detected
- How loud it was
- How confident the AI is (confidence score)
The data flows via MQTT to Home Assistant — where I can build automations, dashboards and notifications.
Hardware
| Component | Cost |
|---|---|
| Raspberry Pi 4 Model B (8 GB)* | already owned |
| INMP441 I2S MEMS Microphone* | ~5 EUR |
| microSD card 32 GB (SanDisk Ultra)* | already owned |
Total cost: ~5 EUR (if you already have the Pi)
The INMP441 is a digital MEMS microphone that connects directly to the Pi’s GPIO pins via the I2S interface. No USB, no driver issues — clean digital signal.
Wiring
```
INMP441      Raspberry Pi 4
─────────    ──────────────
VDD  →  3.3V    (Pin 1)
GND  →  GND     (Pin 6)
SCK  →  GPIO 18 (Pin 12)
WS   →  GPIO 19 (Pin 35)
SD   →  GPIO 20 (Pin 38)
L/R  →  GND     (mono)
```
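On the software side, the I2S bus has to be enabled before the microphone shows up as a capture device. A common approach for the INMP441 (an assumption on my part, not necessarily this project's exact setup) is to enable I2S and load the `googlevoicehat-soundcard` overlay in `/boot/config.txt`:

```
# /boot/config.txt
# Enable the I2S bus
dtparam=i2s=on
# Overlay commonly used for I2S MEMS mics like the INMP441 (assumption)
dtoverlay=googlevoicehat-soundcard
```

After a reboot, `arecord -l` should list the new capture device.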
Architecture: Two-Stage Pipeline
The system uses a two-stage detection approach to save CPU and power:
```
┌──────────┐    ┌───────────┐    ┌──────────────┐    ┌─────────────┐
│ INMP441  │───▶│ RMS       │───▶│ YAMNet       │───▶│ Event       │
│ Mic      │    │ Monitor   │    │ Classifier   │    │ Manager     │
│ (I2S)    │    │ (Stage 1) │    │ (Stage 2)    │    └──┬──┬──┬───┘
└──────────┘    └───────────┘    └──────────────┘       │  │  │
                                                    ┌───▼┐┌▼──▼──┐
                                                    │MQTT││SQLite│
                                                    └──┬─┘└──────┘
                                                 ┌──▼──────────┐
                                                 │Home Assistant│
                                                 └─────────────┘
```
Stage 1: Volume Monitor (always active)
A simple RMS (Root Mean Square) monitor calculates the audio volume every 100ms. As long as it’s quiet, nothing happens — CPU load stays under 1%.
Only when the level exceeds a configurable threshold does Stage 2 kick in. Hysteresis prevents the system from flickering between “on” and “off” at borderline volumes.
```python
import numpy as np

class AudioMonitor:
    def __init__(self, rms_threshold: float, hysteresis: float):
        self.rms_threshold = rms_threshold  # level that activates Stage 2
        self.hysteresis = hysteresis        # lower level that deactivates it
        self._is_active = False

    def process_chunk(self, audio: np.ndarray) -> tuple[bool, float]:
        rms = float(np.sqrt(np.mean(audio ** 2)))
        if self._is_active:
            # Deactivate only once the level drops below the hysteresis threshold
            if rms < self.hysteresis:
                self._is_active = False
        else:
            # Activate once the level reaches the main threshold
            if rms >= self.rms_threshold:
                self._is_active = True
        return self._is_active, rms
```
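To make the hysteresis behavior concrete, here is a small standalone sketch with made-up threshold values (not the project's actual config): a chunk at 0.06 RMS activates the monitor, a borderline chunk at 0.03 keeps it active because only a drop below the lower hysteresis threshold deactivates it, and a quiet chunk at 0.01 switches it off.

```python
import numpy as np

RMS_THRESHOLD = 0.05  # activate at or above this level (illustrative value)
HYSTERESIS = 0.02     # deactivate only below this lower level

def rms(audio: np.ndarray) -> float:
    return float(np.sqrt(np.mean(audio ** 2)))

# Constant-amplitude chunks: the RMS of a constant signal equals its amplitude
loud = np.full(1600, 0.06)        # above the activation threshold
borderline = np.full(1600, 0.03)  # between the two thresholds
quiet = np.full(1600, 0.01)       # below the hysteresis threshold

active = False
states = []
for chunk in (loud, borderline, quiet):
    level = rms(chunk)
    if active:
        if level < HYSTERESIS:
            active = False
    else:
        if level >= RMS_THRESHOLD:
            active = True
    states.append(active)

print(states)  # [True, True, False]: the borderline chunk does not flip the state back off
```

Without the second, lower threshold, the borderline chunk would toggle the detector off and on again, and Stage 2 would be started and stopped constantly.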
Stage 2: YAMNet AI Classification (on-demand)
YAMNet is a pre-trained audio classification model from Google. It recognizes 521 different sound classes — including many musical instruments like guitar, violin, piano, flute etc.
Why YAMNet and not an LLM like Gemma 4?
| Criterion | YAMNet | Gemma 4 E2B |
|---|---|---|
| Purpose | Audio classification | General-purpose multimodal LLM |
| Model size | ~4 MB | ~1.5 GB |
| Inference on Pi 4 | ~100 ms | Not optimized |
| Instrument classes | 521 incl. instruments | Not designed for this |
YAMNet runs as a TensorFlow Lite model at just ~4 MB — perfect for the Pi. One inference takes about 100ms. The model receives a 0.975-second audio chunk and returns the top-5 detected classes with confidence scores.
```python
result = classifier.classify(audio_chunk)
# result.is_instrument = True
# result.instrument_name = "Acoustic guitar"
# result.confidence = 0.82
```
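The top-5 lookup behind a result like this is essentially an argsort over YAMNet's 521 class scores. A minimal sketch with synthetic scores (the real model ships a CSV class map with all 521 names; the subset and values here are illustrative):

```python
import numpy as np

# Illustrative subset of YAMNet's class map (the real one has 521 entries)
class_names = ["Silence", "Speech", "Acoustic guitar", "Violin, fiddle", "Piano"]

# Synthetic per-class scores, as if averaged over the frames of one 0.975 s chunk
scores = np.array([0.01, 0.05, 0.82, 0.07, 0.03])

top = np.argsort(scores)[::-1][:5]  # indices of the highest scores, descending
for i in top:
    print(f"{class_names[i]}: {scores[i]:.2f}")
```

Checking whether any of the top classes belongs to a whitelist of instrument labels is then a simple set lookup.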
Event Manager: State Machine
A simple state machine tracks the “sessions”:
```
┌───────────┐  Instrument detected  ┌──────────┐
│   IDLE    │──────────────────────▶│  ACTIVE  │
│           │◀──────────────────────│          │
└───────────┘  Timeout (10s without └──────────┘
               detection)
```
- IDLE → ACTIVE: Instrument detected with confidence above threshold → session starts
- ACTIVE: Counts duration, tracks volume/confidence, sends MQTT updates every 5 seconds
- ACTIVE → IDLE: 10 seconds without detection → session is saved to database
The 10-second debouncing ensures that brief pauses (turning pages, short silence) don’t interrupt the session.
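A minimal sketch of that state machine with the 10-second debounce (the real Event Manager also tracks volume and confidence and publishes MQTT updates; this version takes timestamps as parameters so the timeout logic is easy to follow and test):

```python
IDLE, ACTIVE = "idle", "active"
TIMEOUT = 10.0  # seconds without a detection before the session ends

class EventManager:
    def __init__(self):
        self.state = IDLE
        self.session_start = None
        self._last_detection = None

    def update(self, detected: bool, now: float) -> str:
        if detected:
            self._last_detection = now
            if self.state == IDLE:
                self.state = ACTIVE          # IDLE → ACTIVE: session starts
                self.session_start = now
        elif self.state == ACTIVE and now - self._last_detection >= TIMEOUT:
            self.state = IDLE                # ACTIVE → IDLE: session would be saved here
        return self.state

em = EventManager()
em.update(True, 0.0)             # detection: session starts
em.update(False, 5.0)            # brief pause, well within the 10 s debounce
em.update(True, 8.0)             # detection resumes, the timeout clock resets
print(em.update(False, 17.0))    # 9 s of silence: still "active"
print(em.update(False, 18.5))    # 10.5 s of silence: "idle", session over
```

Because every detection resets the timeout clock, a page-turn pause at second 5 and a 9-second lull both stay inside the same session.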
MQTT & Home Assistant
With MQTT auto-discovery, sensors appear automatically in Home Assistant:
| Sensor | Description |
|---|---|
| `sensor.instrument_detector_state` | `playing` / `idle` |
| `sensor.instrument_detector_duration` | Session duration in seconds |
| `sensor.instrument_detector_volume` | Volume in dB |
| `sensor.instrument_detector_confidence` | Detection confidence (0.0–1.0) |
| `sensor.instrument_detector_instrument` | Detected instrument |
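Auto-discovery works by publishing one retained JSON config message per sensor under the `homeassistant/` discovery prefix. A sketch of the payload for the state sensor (the topic names and `unique_id` are my illustrative choices, not necessarily what the project uses):

```python
import json

config_topic = "homeassistant/sensor/instrument_detector_state/config"
payload = {
    "name": "Instrument Detector State",
    "unique_id": "instrument_detector_state",    # illustrative ID
    "state_topic": "instrument_detector/state",  # illustrative topic
    "device": {
        "identifiers": ["instrument_detector"],
        "name": "Instrument Detector",
    },
}

# With paho-mqtt this config would be published retained, e.g.:
# client.publish(config_topic, json.dumps(payload), retain=True)
print(json.dumps(payload, indent=2))
```

Home Assistant subscribes to the discovery prefix, sees the retained config message, and creates the entity automatically.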
This enables any Home Assistant automation — for example, a phone notification when the neighbor starts playing, or a long-term dashboard with play times per week.
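For example, a phone notification could look like this in Home Assistant (a hedged sketch; the notify service name depends on your setup, and I'm assuming the entity IDs from the table above):

```yaml
automation:
  - alias: "Neighbor started playing"
    trigger:
      - platform: state
        entity_id: sensor.instrument_detector_state
        to: "playing"
    action:
      - service: notify.mobile_app_my_phone  # replace with your notify service
        data:
          message: >
            {{ states('sensor.instrument_detector_instrument') }} detected downstairs
```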
REST API & Remote Control
A FastAPI web server runs on port 8000:
```
GET  /api/status         → Current state
GET  /api/sessions       → Session history
GET  /api/stats          → Daily/weekly statistics
POST /api/config         → Adjust thresholds at runtime
POST /api/control/start  → Start detection
POST /api/control/stop   → Stop detection
GET  /docs               → Swagger UI
```
The Swagger UI at http://raspi:8000/docs makes testing and configuration particularly easy.
Installation: One Command
On a fresh Raspberry Pi OS, a single command does everything:
```shell
curl -sSL https://raw.githubusercontent.com/ckoehler99/ai_surveillance/main/setup/setup-pi.sh | \
  sudo MQTT_BROKER=192.168.1.100 bash
```
The script handles everything: system packages, I2S driver, Python environment, YAMNet model, configuration and systemd service. Alternatively, the SD card can be fully pre-configured with cloud-init — the Pi sets itself up on first boot.
Tech Stack
| Component | Technology |
|---|---|
| Audio capture | sounddevice (I2S/ALSA) |
| AI model | YAMNet via tflite-runtime |
| MQTT | paho-mqtt with HA auto-discovery |
| Web server | FastAPI + uvicorn |
| Database | SQLite3 |
| Configuration | PyYAML |
What I Learned
- TFLite on the Pi is surprisingly fast. 100ms inference time for an audio classification model — more than enough for real-time.
- The two-stage pipeline saves enormous CPU. Without the RMS pre-check, the Pi would run YAMNet inference constantly — with the pipeline it stays under 1% CPU when idle.
- I2S is far superior to USB audio. No driver issues, no USB overhead, a clean digital signal, and only a handful of jumper wires to connect.
- YAMNet beats an LLM for this purpose. A specialized, lightweight model outperforms a general-purpose LLM on a focused classification task — both in speed and accuracy.
Next Steps
- Transfer Learning: Fine-tune YAMNet on my neighbor’s specific instrument — should significantly improve detection accuracy
- Piezo Contact Microphone: A second microphone attached directly to the ceiling for structure-borne sound detection. Possibly fusing both signals for higher confidence.
- Long-term Dashboard: Grafana integration for weekly and monthly statistics
Links
- GitHub: github.com/ckoehler99/ai_surveillance
- YAMNet: TensorFlow Models — YAMNet
- INMP441 Microphone (amazon.de): INMP441 I2S MEMS Module*
Links marked with * are affiliate links. If you purchase through these links, I receive a small commission — at no extra cost to you.