Acoustic Instrument Detection with Raspberry Pi and YAMNet

April 6, 2026

The Problem

My downstairs neighbor regularly plays an acoustic instrument. Not a problem in itself — but I wanted to know: When exactly and for how long? So I built a small AI project that logs this automatically.

It’s also a learning project: How do you deploy an open-source AI model on a Raspberry Pi? How do you run real-time audio classification on an edge device?

The Idea

A Raspberry Pi 4 with an I2S microphone listens continuously. As soon as an acoustic instrument is detected, the system logs:

  • When it starts and stops
  • How long the session lasts
  • Which instrument was detected
  • How loud it was
  • How confident the AI is (confidence score)

The data flows via MQTT to Home Assistant — where I can build automations, dashboards and notifications.

Hardware

Component                               Cost
Raspberry Pi 4 Model B (8 GB)*          already owned
INMP441 I2S MEMS Microphone*            ~5 EUR
microSD card 32 GB (SanDisk Ultra)*     already owned

Total cost: ~5 EUR (if you already have the Pi)

The INMP441 is a digital MEMS microphone that connects directly to the Pi’s GPIO pins via the I2S interface. No USB, no driver issues — clean digital signal.

Wiring

INMP441         Raspberry Pi 4
─────────       ──────────────
VDD        →    3.3V      (Pin 1)
GND        →    GND       (Pin 6)
SCK        →    GPIO 18   (Pin 12)
WS         →    GPIO 19   (Pin 35)
SD         →    GPIO 20   (Pin 38)
L/R        →    GND       (Mono)
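Before the microphone shows up as an ALSA capture device, the Pi's I2S interface has to be enabled. A common approach for the INMP441 is to enable I2S and load a compatible soundcard overlay in /boot/config.txt — the exact overlay choice here is an assumption about this setup, and the setup script handles it automatically:

```ini
# /boot/config.txt — enable I2S and load a soundcard overlay
# (googlevoicehat-soundcard is a common choice for the INMP441;
#  treat it as an assumption, not the only option)
dtparam=i2s=on
dtoverlay=googlevoicehat-soundcard
```

After a reboot, the microphone should appear in the output of `arecord -l`.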

Architecture: Two-Stage Pipeline

The system uses a two-stage detection approach to save CPU and power:

┌──────────┐    ┌───────────┐    ┌──────────────┐    ┌─────────────┐
│ INMP441  │───▶│  RMS      │───▶│  YAMNet      │───▶│  Event      │
│    Mic   │    │  Monitor  │    │  Classifier  │    │  Manager    │
│  (I2S)   │    │ (Stage 1) │    │  (Stage 2)   │    └──┬──┬──┬───┘
└──────────┘    └───────────┘    └──────────────┘       │  │  │
                                                    ┌───▼┐┌▼──▼──┐
                                                    │MQTT││SQLite│
                                                    └──┬─┘└──────┘
                                                    ┌──▼──────────┐
                                                    │Home Assistant│
                                                    └─────────────┘

Stage 1: Volume Monitor (always active)

A simple RMS (Root Mean Square) monitor calculates the audio volume every 100ms. As long as it’s quiet, nothing happens — CPU load stays under 1%.

Only when the level exceeds a configurable threshold does Stage 2 kick in. Hysteresis prevents the system from flickering between “on” and “off” at borderline volumes.

import numpy as np

class AudioMonitor:
    def __init__(self, rms_threshold: float = 0.01, hysteresis: float = 0.005):
        # Illustrative defaults — both thresholds are configurable at runtime
        self.rms_threshold = rms_threshold  # level that activates Stage 2
        self.hysteresis = hysteresis        # lower level that deactivates it
        self._is_active = False

    def process_chunk(self, audio: np.ndarray) -> tuple[bool, float]:
        rms = float(np.sqrt(np.mean(audio ** 2)))

        if self._is_active:
            # Deactivate only below the lower hysteresis threshold
            if rms < self.hysteresis:
                self._is_active = False
        else:
            # Activate above the threshold
            if rms >= self.rms_threshold:
                self._is_active = True

        return self._is_active, rms

Stage 2: YAMNet AI Classification (on-demand)

YAMNet is a pre-trained audio classification model from Google. It recognizes 521 different sound classes — including many musical instruments such as guitar, violin, piano and flute.

Why YAMNet and not an LLM like Gemma 4?

Criterion            YAMNet                   Gemma 4 E2B
Purpose              Audio classification     Speech recognition
Model size           ~4 MB                    ~1.5 GB
Inference on Pi 4    ~100 ms                  Not optimized
Instrument classes   521 incl. instruments    Not designed for this

YAMNet runs as a TensorFlow Lite model at just ~4 MB — perfect for the Pi. One inference takes about 100ms. The model receives a 0.975-second audio chunk and returns the top-5 detected classes with confidence scores.

result = classifier.classify(audio_chunk)
# result.is_instrument = True
# result.instrument_name = "Acoustic guitar"
# result.confidence = 0.82
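A minimal sketch of how the top-5 classes and the instrument decision can be derived from YAMNet's 521-dimensional score vector. The function names, the instrument subset, and the confidence threshold are illustrative assumptions, not the project's actual API:

```python
import numpy as np

# Subset of YAMNet's 521 class names that count as instruments
# (illustrative — the real class map comes from YAMNet's CSV file)
INSTRUMENT_CLASSES = {"Acoustic guitar", "Violin, fiddle", "Piano", "Flute"}

def top_k(scores: np.ndarray, class_names: list[str], k: int = 5):
    """Return the k highest-scoring (class name, confidence) pairs."""
    idx = np.argsort(scores)[::-1][:k]
    return [(class_names[i], float(scores[i])) for i in idx]

def detect_instrument(scores, class_names, threshold: float = 0.5):
    """Check whether any top-5 class is an instrument above the threshold."""
    for name, conf in top_k(scores, class_names):
        if name in INSTRUMENT_CLASSES and conf >= threshold:
            return True, name, conf
    return False, None, 0.0
```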

Event Manager: State Machine

A simple state machine tracks the “sessions”:

    ┌───────────┐  Instrument detected  ┌──────────┐
    │   IDLE    │──────────────────────▶│  ACTIVE  │
    │           │◀──────────────────────│          │
    └───────────┘  Timeout (10s without └──────────┘
                   detection)
  • IDLE → ACTIVE: Instrument detected with confidence above threshold → session starts
  • ACTIVE: Counts duration, tracks volume/confidence, sends MQTT updates every 5 seconds
  • ACTIVE → IDLE: 10 seconds without detection → session is saved to database

The 10-second debouncing ensures that brief pauses (turning pages, short silence) don’t interrupt the session.
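The state machine above can be sketched in a few lines. The class and method names are assumptions; the injectable clock makes the timeout behavior easy to test:

```python
import time
from enum import Enum

class State(Enum):
    IDLE = "idle"
    ACTIVE = "active"

class EventManager:
    """Minimal sketch of the session state machine (names are assumptions)."""

    def __init__(self, timeout: float = 10.0, clock=time.monotonic):
        self.timeout = timeout        # seconds of silence that end a session
        self.clock = clock
        self.state = State.IDLE
        self.session_start = None
        self.last_detection = None

    def on_detection(self):
        """Called whenever Stage 2 reports an instrument above threshold."""
        now = self.clock()
        if self.state is State.IDLE:
            self.state = State.ACTIVE
            self.session_start = now
        self.last_detection = now

    def tick(self):
        """Call periodically; returns the session duration once it ends."""
        if (self.state is State.ACTIVE
                and self.clock() - self.last_detection >= self.timeout):
            duration = self.last_detection - self.session_start
            self.state = State.IDLE
            return duration  # the caller would save this to SQLite
        return None
```

Because brief pauses only delay `tick()`'s timeout rather than resetting the session, page-turning and short silences are absorbed, exactly as described above.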

MQTT & Home Assistant

With MQTT auto-discovery, sensors appear automatically in Home Assistant:

Sensor                                    Description
sensor.instrument_detector_state          playing / idle
sensor.instrument_detector_duration       Duration in seconds
sensor.instrument_detector_volume         Volume in dB
sensor.instrument_detector_confidence     Detection confidence 0.0–1.0
sensor.instrument_detector_instrument     Detected instrument

This enables any Home Assistant automation — for example, a phone notification when the neighbor starts playing, or a long-term dashboard with play times per week.

REST API & Remote Control

A FastAPI web server runs on port 8000:

GET  /api/status        → Current state
GET  /api/sessions      → Session history
GET  /api/stats         → Daily/weekly statistics
POST /api/config        → Adjust thresholds at runtime
POST /api/control/start → Start detection
POST /api/control/stop  → Stop detection
GET  /docs              → Swagger UI

The Swagger UI at http://raspi:8000/docs makes testing and configuration particularly easy.

Installation: One Command

On a fresh Raspberry Pi OS, a single command does everything:

curl -sSL https://raw.githubusercontent.com/ckoehler99/ai_surveillance/main/setup/setup-pi.sh | \
  sudo MQTT_BROKER=192.168.1.100 bash

The script handles everything: system packages, I2S driver, Python environment, YAMNet model, configuration and systemd service. Alternatively, the SD card can be fully pre-configured with cloud-init — the Pi sets itself up on first boot.
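For reference, the systemd unit such a script installs might look roughly like this — the service name, user, and paths are assumptions, not the script's actual output:

```ini
# /etc/systemd/system/instrument-detector.service (sketch)
[Unit]
Description=Acoustic instrument detector
After=network-online.target

[Service]
User=pi
WorkingDirectory=/opt/ai_surveillance
ExecStart=/opt/ai_surveillance/venv/bin/python -m detector
Restart=on-failure

[Install]
WantedBy=multi-user.target
```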

Tech Stack

Component        Technology
Audio capture    sounddevice (I2S/ALSA)
AI model         YAMNet via tflite-runtime
MQTT             paho-mqtt with HA auto-discovery
Web server       FastAPI + uvicorn
Database         SQLite3
Configuration    PyYAML

What I Learned

  • TFLite on the Pi is surprisingly fast. 100ms inference time for an audio classification model — more than enough for real-time.
  • The two-stage pipeline saves enormous CPU. Without the RMS pre-check, the Pi would run YAMNet inference constantly — with the pipeline it stays under 1% CPU when idle.
  • I2S is far superior to USB audio. No driver issues, no USB overhead, clean digital signal. Wiring is just 5 wires.
  • YAMNet beats an LLM for this purpose. A specialized, lightweight model outperforms a general-purpose LLM on a focused classification task — both in speed and accuracy.

Next Steps

  • Transfer Learning: Fine-tune YAMNet on my neighbor’s specific instrument — should significantly improve detection accuracy
  • Piezo Contact Microphone: A second microphone attached directly to the ceiling for structure-borne sound detection. Possibly fusing both signals for higher confidence.
  • Long-term Dashboard: Grafana integration for weekly and monthly statistics

Links marked with * are affiliate links. If you purchase through these links, I receive a small commission — at no extra cost to you.