April 6, 2026
The Problem
My downstairs neighbor regularly plays an acoustic instrument. Not a problem in itself — but I wanted to know: When exactly and for how long? So I built a small AI project that logs this automatically.
It’s also a learning project: How do you deploy an open-source AI model on a Raspberry Pi? How do you run real-time audio classification on an edge device?
The Idea
A Raspberry Pi 4 with an I2S microphone listens continuously. As soon as an acoustic instrument is detected, the system logs:
- When it starts and stops
- How long the session lasts
- Which instrument was detected
- How loud it was
- How confident the AI is (confidence score)
The data flows via MQTT to Home Assistant — where I can build automations, dashboards and notifications.
Hardware
| Component | Cost |
|---|---|
| Raspberry Pi 4 Model B (8 GB)* | already owned |
| INMP441 I2S MEMS Microphone* | ~5 EUR |
| microSD card 32 GB (SanDisk Ultra)* | already owned |
Total cost: ~5 EUR (if you already have the Pi)
The INMP441 is a digital MEMS microphone that connects directly to the Pi’s GPIO pins via the I2S interface. No USB, no driver issues — clean digital signal.
Wiring
```
INMP441      Raspberry Pi 4
─────────    ──────────────
VDD  →  3.3V    (Pin 1)
GND  →  GND     (Pin 6)
SCK  →  GPIO 18 (Pin 12)
WS   →  GPIO 19 (Pin 35)
SD   →  GPIO 20 (Pin 38)
L/R  →  GND     (mono)
```
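On the software side, the I2S bus has to be enabled before the microphone shows up as a capture device. A common approach for the INMP441 (an assumption on my part, not necessarily this project's exact setup) is to enable I2S and load the `googlevoicehat-soundcard` overlay in `/boot/config.txt`:

```
# /boot/config.txt
# Enable the I2S bus
dtparam=i2s=on
# Overlay commonly used for I2S MEMS mics like the INMP441 (assumption)
dtoverlay=googlevoicehat-soundcard
```

After a reboot, `arecord -l` should list the new capture device.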
Architecture: Two-Stage Pipeline
The system uses a two-stage detection approach to save CPU and power:
```
┌──────────┐    ┌───────────┐    ┌──────────────┐    ┌─────────────┐
│ INMP441  │───▶│ RMS       │───▶│ YAMNet       │───▶│ Event       │
│ Mic      │    │ Monitor   │    │ Classifier   │    │ Manager     │
│ (I2S)    │    │ (Stage 1) │    │ (Stage 2)    │    └──┬──┬──┬───┘
└──────────┘    └───────────┘    └──────────────┘       │  │  │
                                                    ┌───▼┐┌▼──▼──┐
                                                    │MQTT││SQLite│
                                                    └──┬─┘└──────┘
                                                 ┌──▼──────────┐
                                                 │Home Assistant│
                                                 └─────────────┘
```
Stage 1: Volume Monitor (always active)
A simple RMS (Root Mean Square) monitor calculates the audio volume every 100ms. As long as it’s quiet, nothing happens — CPU load stays under 1%.
Only when the level exceeds a configurable threshold does Stage 2 kick in. Hysteresis prevents the system from flickering between “on” and “off” at borderline volumes.
```python
import numpy as np

class AudioMonitor:
    def __init__(self, rms_threshold: float, hysteresis: float):
        self.rms_threshold = rms_threshold  # level that activates Stage 2
        self.hysteresis = hysteresis        # lower level that deactivates it
        self._is_active = False

    def process_chunk(self, audio: np.ndarray) -> tuple[bool, float]:
        rms = float(np.sqrt(np.mean(audio ** 2)))
        if self._is_active:
            # Deactivate only once the level drops below the hysteresis threshold
            if rms < self.hysteresis:
                self._is_active = False
        else:
            # Activate once the level reaches the main threshold
            if rms >= self.rms_threshold:
                self._is_active = True
        return self._is_active, rms
```
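To make the hysteresis behavior concrete, here is a small standalone sketch with made-up threshold values (not the project's actual config): a chunk at 0.06 RMS activates the monitor, a borderline chunk at 0.03 keeps it active because only a drop below the lower hysteresis threshold deactivates it, and a quiet chunk at 0.01 switches it off.

```python
import numpy as np

RMS_THRESHOLD = 0.05  # activate at or above this level (illustrative value)
HYSTERESIS = 0.02     # deactivate only below this lower level

def rms(audio: np.ndarray) -> float:
    return float(np.sqrt(np.mean(audio ** 2)))

# Constant-amplitude chunks: the RMS of a constant signal equals its amplitude
loud = np.full(1600, 0.06)        # above the activation threshold
borderline = np.full(1600, 0.03)  # between the two thresholds
quiet = np.full(1600, 0.01)       # below the hysteresis threshold

active = False
states = []
for chunk in (loud, borderline, quiet):
    level = rms(chunk)
    if active:
        if level < HYSTERESIS:
            active = False
    else:
        if level >= RMS_THRESHOLD:
            active = True
    states.append(active)

print(states)  # [True, True, False]: the borderline chunk does not flip the state back off
```

Without the second, lower threshold, the borderline chunk would toggle the detector off and on again, and Stage 2 would be started and stopped constantly.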
Stage 2: YAMNet AI Classification (on-demand)
YAMNet is a pre-trained audio classification model from Google. It recognizes 521 different sound classes — including many musical instruments like guitar, violin, piano, flute etc.
Why YAMNet and not an LLM like Gemma 4?
| Criterion | YAMNet | Gemma 4 E2B |
|---|---|---|
| Purpose | Audio classification | General-purpose multimodal LLM |
| Model size | ~4 MB | ~1.5 GB |
| Inference on Pi 4 | ~100 ms | Not optimized |
| Instrument classes | 521 incl. instruments | Not designed for this |
YAMNet runs as a TensorFlow Lite model at just ~4 MB — perfect for the Pi. One inference takes about 100ms. The model receives a 0.975-second audio chunk and returns the top-5 detected classes with confidence scores.
```python
result = classifier.classify(audio_chunk)
# result.is_instrument = True
# result.instrument_name = "Acoustic guitar"
# result.confidence = 0.82
```
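The top-5 lookup behind a result like this is essentially an argsort over YAMNet's 521 class scores. A minimal sketch with synthetic scores (the real model ships a CSV class map with all 521 names; the subset and values here are illustrative):

```python
import numpy as np

# Illustrative subset of YAMNet's class map (the real one has 521 entries)
class_names = ["Silence", "Speech", "Acoustic guitar", "Violin, fiddle", "Piano"]

# Synthetic per-class scores, as if averaged over the frames of one 0.975 s chunk
scores = np.array([0.01, 0.05, 0.82, 0.07, 0.03])

top = np.argsort(scores)[::-1][:5]  # indices of the highest scores, descending
for i in top:
    print(f"{class_names[i]}: {scores[i]:.2f}")
```

Checking whether any of the top classes belongs to a whitelist of instrument labels is then a simple set lookup.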
Event Manager: State Machine
A simple state machine tracks the “sessions”:
```
┌───────────┐  Instrument detected  ┌──────────┐
│   IDLE    │──────────────────────▶│  ACTIVE  │
│           │◀──────────────────────│          │
└───────────┘  Timeout (10s without └──────────┘
               detection)
```
- IDLE → ACTIVE: Instrument detected with confidence above threshold → session starts
- ACTIVE: Counts duration, tracks volume/confidence, sends MQTT updates every 5 seconds
- ACTIVE → IDLE: 10 seconds without detection → session is saved to database
The 10-second debouncing ensures that brief pauses (turning pages, short silence) don’t interrupt the session.
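A minimal sketch of that state machine with the 10-second debounce (the real Event Manager also tracks volume and confidence and publishes MQTT updates; this version takes timestamps as parameters so the timeout logic is easy to follow and test):

```python
IDLE, ACTIVE = "idle", "active"
TIMEOUT = 10.0  # seconds without a detection before the session ends

class EventManager:
    def __init__(self):
        self.state = IDLE
        self.session_start = None
        self._last_detection = None

    def update(self, detected: bool, now: float) -> str:
        if detected:
            self._last_detection = now
            if self.state == IDLE:
                self.state = ACTIVE          # IDLE → ACTIVE: session starts
                self.session_start = now
        elif self.state == ACTIVE and now - self._last_detection >= TIMEOUT:
            self.state = IDLE                # ACTIVE → IDLE: session would be saved here
        return self.state

em = EventManager()
em.update(True, 0.0)             # detection: session starts
em.update(False, 5.0)            # brief pause, well within the 10 s debounce
em.update(True, 8.0)             # detection resumes, the timeout clock resets
print(em.update(False, 17.0))    # 9 s of silence: still "active"
print(em.update(False, 18.5))    # 10.5 s of silence: "idle", session over
```

Because every detection resets the timeout clock, a page-turn pause at second 5 and a 9-second lull both stay inside the same session.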
MQTT & Home Assistant
With MQTT auto-discovery, sensors appear automatically in Home Assistant:
| Sensor | Description |
|---|---|
| `sensor.instrument_detector_state` | `playing` / `idle` |
| `sensor.instrument_detector_duration` | Session duration in seconds |
| `sensor.instrument_detector_volume` | Volume in dB |
| `sensor.instrument_detector_confidence` | Detection confidence (0.0–1.0) |
| `sensor.instrument_detector_instrument` | Detected instrument |
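Auto-discovery works by publishing one retained JSON config message per sensor under the `homeassistant/` discovery prefix. A sketch of the payload for the state sensor (the topic names and `unique_id` are my illustrative choices, not necessarily what the project uses):

```python
import json

config_topic = "homeassistant/sensor/instrument_detector_state/config"
payload = {
    "name": "Instrument Detector State",
    "unique_id": "instrument_detector_state",    # illustrative ID
    "state_topic": "instrument_detector/state",  # illustrative topic
    "device": {
        "identifiers": ["instrument_detector"],
        "name": "Instrument Detector",
    },
}

# With paho-mqtt this config would be published retained, e.g.:
# client.publish(config_topic, json.dumps(payload), retain=True)
print(json.dumps(payload, indent=2))
```

Home Assistant subscribes to the discovery prefix, sees the retained config message, and creates the entity automatically.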
This enables any Home Assistant automation — for example, a phone notification when the neighbor starts playing, or a long-term dashboard with play times per week.
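For example, a phone notification could look like this in Home Assistant (a hedged sketch; the notify service name depends on your setup, and I'm assuming the entity IDs from the table above):

```yaml
automation:
  - alias: "Neighbor started playing"
    trigger:
      - platform: state
        entity_id: sensor.instrument_detector_state
        to: "playing"
    action:
      - service: notify.mobile_app_my_phone  # replace with your notify service
        data:
          message: >
            {{ states('sensor.instrument_detector_instrument') }} detected downstairs
```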
REST API & Remote Control
A FastAPI web server runs on port 8000:
```
GET  /api/status         → Current state
GET  /api/sessions       → Session history
GET  /api/stats          → Daily/weekly statistics
POST /api/config         → Adjust thresholds at runtime
POST /api/control/start  → Start detection
POST /api/control/stop   → Stop detection
GET  /docs               → Swagger UI
```
The Swagger UI at http://raspi:8000/docs makes testing and configuration particularly easy.
Installation: One Command
On a fresh Raspberry Pi OS, a single command does everything:
```shell
curl -sSL https://raw.githubusercontent.com/ckoehler99/ai_surveillance/main/setup/setup-pi.sh | \
  sudo MQTT_BROKER=192.168.1.100 bash
```
The script handles everything: system packages, I2S driver, Python environment, YAMNet model, configuration and systemd service. Alternatively, the SD card can be fully pre-configured with cloud-init — the Pi sets itself up on first boot.
Tech Stack
| Component | Technology |
|---|---|
| Audio capture | sounddevice (I2S/ALSA) |
| AI model | YAMNet via tflite-runtime |
| MQTT | paho-mqtt with HA auto-discovery |
| Web server | FastAPI + uvicorn |
| Database | SQLite3 |
| Configuration | PyYAML |
What I Learned
- TFLite on the Pi is surprisingly fast. 100ms inference time for an audio classification model — more than enough for real-time.
- The two-stage pipeline saves enormous CPU. Without the RMS pre-check, the Pi would run YAMNet inference constantly — with the pipeline it stays under 1% CPU when idle.
- I2S is far superior to USB audio. No driver issues, no USB overhead, a clean digital signal, and only a handful of jumper wires to connect.
- YAMNet beats an LLM for this purpose. A specialized, lightweight model outperforms a general-purpose LLM on a focused classification task — both in speed and accuracy.
Next Steps
- Transfer Learning: Fine-tune YAMNet on my neighbor’s specific instrument — should significantly improve detection accuracy
- Piezo Contact Microphone: A second microphone attached directly to the ceiling for structure-borne sound detection. Possibly fusing both signals for higher confidence.
- Long-term Dashboard: Grafana integration for weekly and monthly statistics
Links
- GitHub: github.com/ckoehler99/ai_surveillance
- YAMNet: TensorFlow Models — YAMNet
- INMP441 Microphone (amazon.de): INMP441 I2S MEMS Module*
Links marked with * are affiliate links. If you purchase through these links, I receive a small commission — at no extra cost to you.