How My Transcriber Works Under the Hood

2026-04-19

A technical deep dive for developers and power users. How My Transcriber turns your Apple Voice Memos into searchable Markdown -- locally, privately, and surprisingly fast.

Architecture Overview

My Transcriber is a Tauri desktop application. That means a Rust backend for all the heavy lifting and a SvelteKit frontend for the UI, bundled into a native macOS app.

Tauri was the right choice for this project. It gives us native system access -- file system watching, process management, Metal GPU acceleration -- while keeping the UI layer simple and fast. The app bundle is small because we're using the system WebView rather than shipping a full Chromium browser.

The backend handles transcription, file watching, audio conversion, and all the state management. The frontend is purely for display and user interaction. They communicate through Tauri's IPC command system.

Where Voice Memos Live

Apple Voice Memos stores recordings in a specific location on your Mac:

~/Library/Group Containers/group.com.apple.VoiceMemos.shared/Recordings/

This folder is managed by the Voice Memos app and synced via iCloud. When you record a voice memo on your iPhone or Apple Watch, the M4A file eventually appears here on your Mac.

Accessing this folder requires Full Disk Access permission. That's a macOS security requirement -- the system blocks apps from reading another app's protected data, including the Voice Memos container, without it. When you first set up My Transcriber, it guides you through granting this permission.

The app watches this folder for new files. When a new M4A appears, it gets queued for transcription.
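To make the discovery step concrete, here is a minimal sketch of how new recordings could be picked out of a folder listing. The function names (recordings_dir, is_voice_memo, new_recordings) are hypothetical and not taken from the app's source. The actual file watcher is left out, so this only shows the filtering logic.

```rust
use std::path::{Path, PathBuf};

/// Build the Voice Memos recordings folder for a given home directory.
fn recordings_dir(home: &Path) -> PathBuf {
    home.join("Library/Group Containers/group.com.apple.VoiceMemos.shared/Recordings")
}

/// True if the path looks like a Voice Memos recording (an .m4a file).
fn is_voice_memo(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.eq_ignore_ascii_case("m4a"))
        .unwrap_or(false)
}

/// Keep only recordings that are not queued yet, in sorted order.
/// (Hypothetical helper: the real app may track queued files differently.)
fn new_recordings(found: &[PathBuf], already_queued: &[PathBuf]) -> Vec<PathBuf> {
    let mut fresh: Vec<PathBuf> = found
        .iter()
        .filter(|p| is_voice_memo(p.as_path()) && !already_queued.contains(*p))
        .cloned()
        .collect();
    fresh.sort();
    fresh
}
```

A watcher callback would pass the current directory listing as found and enqueue whatever comes back.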

Live Watching: Real-Time Transcription

Here's where it gets interesting. My Transcriber doesn't just wait for recordings to finish. It can detect when Voice Memos is actively recording and start transcribing in real time.

The live watcher monitors the Voice Memos Capture folder. When it detects an active recording, it starts processing the audio in chunks. As more audio becomes available, it processes the new chunks and emits partial transcription results.

This means you can see the transcription building up while you're still talking. The UI shows partial text as it comes in, along with progress indicators for how many seconds have been transcribed versus the current recording duration.

The chunked approach works because Whisper can process audio segments independently. Each chunk gets transcribed, and the results are accumulated into the full transcription. When the recording finishes, a final pass ensures everything is captured cleanly.
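The chunk bookkeeping can be sketched roughly as follows. This is an illustration under assumptions, not the app's actual code: the LiveSession type, the fixed chunk length, and the plain string accumulation are all hypothetical, and the Whisper call itself is omitted.

```rust
/// A half-open range of audio, in seconds, that is ready to transcribe.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Chunk {
    start_secs: u64,
    end_secs: u64,
}

/// Tracks how much of a live recording has been transcribed so far.
struct LiveSession {
    chunk_secs: u64,
    transcribed_secs: u64,
    partial_text: String,
}

impl LiveSession {
    fn new(chunk_secs: u64) -> Self {
        LiveSession { chunk_secs, transcribed_secs: 0, partial_text: String::new() }
    }

    /// Return every full chunk that is available beyond the progress marker.
    fn pending_chunks(&self, available_secs: u64) -> Vec<Chunk> {
        let mut chunks = Vec::new();
        if self.chunk_secs == 0 {
            return chunks; // guard against an endless loop
        }
        let mut start = self.transcribed_secs;
        while start + self.chunk_secs <= available_secs {
            chunks.push(Chunk { start_secs: start, end_secs: start + self.chunk_secs });
            start += self.chunk_secs;
        }
        chunks
    }

    /// Append the text for one chunk and advance the progress marker.
    fn accept(&mut self, chunk: Chunk, text: &str) {
        if !self.partial_text.is_empty() {
            self.partial_text.push(' ');
        }
        self.partial_text.push_str(text.trim());
        self.transcribed_secs = chunk.end_secs;
    }
}
```

The UI progress indicator described above would map to transcribed_secs versus the current recording duration. Audio that does not yet fill a whole chunk simply waits for the next poll or the final pass.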

The M4A Problem

Voice Memos records in M4A format (AAC audio in an MP4 container). This is a good format for recording -- small files, good quality -- but it creates a problem for transcription.

The Rust audio ecosystem doesn't have great M4A support. The rodio library, which is the standard Rust audio playback library, has compatibility issues with Apple's M4A files. Rather than fighting with codec issues, we take the practical approach.

My Transcriber bundles a copy of ffmpeg. When a new M4A file appears, ffmpeg converts it to WAV (uncompressed PCM audio) before passing it to the Whisper model. This adds a small amount of processing time but eliminates all codec compatibility issues.
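As a rough sketch, the conversion could be driven from Rust like this. The ffmpeg flags are standard ones, and 16 kHz mono 16-bit PCM is the input format whisper.cpp expects. Whether the app passes exactly these flags is an assumption, and the function names are hypothetical.

```rust
use std::path::Path;
use std::process::Command;

/// Arguments for converting a recording to 16 kHz mono 16-bit PCM WAV.
fn ffmpeg_args(input: &Path, output: &Path) -> Vec<String> {
    vec![
        "-y".to_string(), // overwrite the output file if it exists
        "-i".to_string(),
        input.to_string_lossy().into_owned(),
        "-ar".to_string(), // sample rate
        "16000".to_string(),
        "-ac".to_string(), // channel count
        "1".to_string(),
        "-c:a".to_string(), // audio codec
        "pcm_s16le".to_string(),
        output.to_string_lossy().into_owned(),
    ]
}

/// Run the bundled ffmpeg binary; returns Ok(true) when it exits successfully.
fn convert_to_wav(ffmpeg: &Path, input: &Path, output: &Path) -> std::io::Result<bool> {
    let status = Command::new(ffmpeg).args(ffmpeg_args(input, output)).status()?;
    Ok(status.success())
}
```

Keeping the argument list in its own function makes it easy to check without spawning a process.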

The ffmpeg binary is downloaded automatically during the build process -- architecture-specific for Apple Silicon or Intel. It's about 77MB per architecture and gets bundled into the app.

Whisper: The Transcription Engine

The actual transcription is done by OpenAI's Whisper model, running locally via whisper-rs (Rust bindings for whisper.cpp). No audio ever leaves your machine.

On first launch, My Transcriber downloads the Whisper model from Hugging Face. The default is the large-v3-turbo model, about 1.5GB, which supports 99+ languages with automatic language detection.

You can configure the model size in settings:

Tiny -- Fastest, least accurate. Good for quick notes in English.
Base -- Good balance of speed and accuracy.
Small -- Better accuracy, slower. Worth it for important recordings.
Medium -- Near-professional quality. Noticeably slower.
Large-v3-turbo -- Best accuracy available. Requires more RAM and GPU.

Each model is downloaded once and cached locally. Switching models is just a settings change -- the new model downloads on the next transcription if not already present.
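The download-once-and-cache behavior could look something like the sketch below. The file names follow the ggml naming used by the whisper.cpp model repository on Hugging Face. That the app pulls from that exact repository is an assumption, and the type and method names are hypothetical.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WhisperModel {
    Tiny,
    Base,
    Small,
    Medium,
    LargeV3Turbo,
}

impl WhisperModel {
    /// ggml file name for this model size.
    fn file_name(self) -> &'static str {
        match self {
            WhisperModel::Tiny => "ggml-tiny.bin",
            WhisperModel::Base => "ggml-base.bin",
            WhisperModel::Small => "ggml-small.bin",
            WhisperModel::Medium => "ggml-medium.bin",
            WhisperModel::LargeV3Turbo => "ggml-large-v3-turbo.bin",
        }
    }

    /// Assumed download location (the public whisper.cpp model repository).
    fn download_url(self) -> String {
        format!(
            "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/{}",
            self.file_name()
        )
    }

    /// Where the model would live once cached locally.
    fn cached_path(self, cache_dir: &Path) -> PathBuf {
        cache_dir.join(self.file_name())
    }
}

/// A model only needs downloading if its file is not already in the cache.
fn needs_download(model: WhisperModel, cache_dir: &Path) -> bool {
    !model.cached_path(cache_dir).exists()
}
```

With this shape, switching models in settings only changes which WhisperModel value is checked before the next transcription.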

Metal GPU Acceleration

This is the single biggest performance factor in the entire app.

In debug builds (during development), Whisper runs on the CPU. Transcribing a 5-minute voice memo might take a couple of minutes.

In release builds, Metal GPU acceleration kicks in. The same 5-minute voice memo transcribes in seconds. We're talking 100x+ faster. It's the difference between "this is unusable" and "this is instant."

Metal acceleration is enabled automatically in release builds. No configuration needed. If you're on Apple Silicon (M1 or later), you get GPU acceleration. This is one of the reasons My Transcriber is Mac-only -- the Metal GPU integration is what makes local transcription practical.

Background Catch-Up Processing

Macs sleep. They close. They lose network. Voice memos sync from iCloud at unpredictable times. The app needs to handle all of this gracefully.

My Transcriber uses a queue-based processing system. When a new voice memo is detected, it goes into a processing queue. If the Mac was asleep and several memos accumulated, they all get queued and processed in order when the app resumes.

The queue handles retries for failed transcriptions, tracks progress per file, and avoids re-processing files that were already transcribed. If you open your Mac on Monday morning after a weekend of recording voice memos on your phone, all of them get processed automatically.

There's also a catch-up scan on startup. The app compares what's in the Voice Memos folder with what's already been transcribed. Anything missing gets queued.
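The catch-up scan boils down to a set difference. Here is a minimal sketch, assuming recordings and transcripts can be matched by a shared timestamp-based name (our assumption for illustration; the app may key them differently). The function name is hypothetical.

```rust
use std::collections::{BTreeSet, VecDeque};

/// Compare recordings on disk with finished transcripts and queue what is
/// missing, oldest first. Both inputs are names without file extensions.
fn catch_up_queue(recordings: &[&str], transcribed: &[&str]) -> VecDeque<String> {
    let done: BTreeSet<&str> = transcribed.iter().copied().collect();
    let mut missing: Vec<&str> = recordings
        .iter()
        .copied()
        .filter(|name| !done.contains(name))
        .collect();
    // Timestamp-based names sort chronologically as plain strings.
    missing.sort();
    missing.dedup();
    missing.into_iter().map(String::from).collect()
}
```

Running this once at startup and again after wake covers the "Monday morning" case: everything recorded over the weekend lands in the queue in recording order.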

Output Format: Markdown with Frontmatter

Every transcription is written as a Markdown file with YAML frontmatter. Here's what one looks like:

---
captured_at: "2026-03-15T09:23:45+01:00"
duration: 183
language: en
model: base
source: voice_memo
transcript_only: false
---

Today I met with the client about the new
project timeline. They want to move the
launch date up by two weeks...

The frontmatter is machine-parseable. Any tool that reads YAML can extract the metadata. The body is plain text -- the raw transcription output from Whisper.

captured_at: When the voice memo was recorded (from the file metadata).

duration: Length in seconds.

language: Detected language code. Whisper detects this automatically.

model: Which Whisper model was used. Useful if you re-transcribe with a different model later.
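For readers who want to produce or consume these files, here is a small sketch of writing the frontmatter and reading one field back. The Transcript struct and both function names are hypothetical. The reader is a line-based shortcut for flat key/value frontmatter like the example above, not a full YAML parser.

```rust
struct Transcript {
    captured_at: String,
    duration_secs: u64,
    language: String,
    model: String,
    source: String,
    transcript_only: bool,
    body: String,
}

/// Render a transcript as Markdown with YAML frontmatter.
fn render_markdown(t: &Transcript) -> String {
    format!(
        "---\n\
         captured_at: \"{}\"\n\
         duration: {}\n\
         language: {}\n\
         model: {}\n\
         source: {}\n\
         transcript_only: {}\n\
         ---\n\n{}\n",
        t.captured_at, t.duration_secs, t.language, t.model, t.source, t.transcript_only, t.body
    )
}

/// Look up one flat frontmatter value by key (quotes stripped).
fn frontmatter_value<'a>(doc: &'a str, key: &str) -> Option<&'a str> {
    let mut lines = doc.lines();
    if lines.next()? != "---" {
        return None; // no frontmatter block
    }
    for line in lines {
        if line == "---" {
            break; // end of frontmatter, do not scan the body
        }
        if let Some((k, v)) = line.split_once(':') {
            if k.trim() == key {
                return Some(v.trim().trim_matches('"'));
            }
        }
    }
    None
}
```

Splitting on the first colon only is what keeps timestamp values such as captured_at intact.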

Files are organized by year: Registry/{account_id}/voice/2026/2026-03-15-092345.md. The filename is derived from the recording timestamp, so files sort chronologically by default.
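The path layout above can be reproduced with a few lines of Rust. This is a sketch: the RecordedAt struct and transcript_path function are hypothetical names, and the timestamp is assumed to be already split into local calendar fields.

```rust
use std::path::{Path, PathBuf};

/// Local recording time, already split into calendar fields.
struct RecordedAt {
    year: u16,
    month: u8,
    day: u8,
    hour: u8,
    minute: u8,
    second: u8,
}

/// Build Registry/{account_id}/voice/{year}/{YYYY-MM-DD-HHMMSS}.md
fn transcript_path(registry: &Path, account_id: &str, at: &RecordedAt) -> PathBuf {
    let file_name = format!(
        "{:04}-{:02}-{:02}-{:02}{:02}{:02}.md",
        at.year, at.month, at.day, at.hour, at.minute, at.second
    );
    registry
        .join(account_id)
        .join("voice")
        .join(format!("{:04}", at.year))
        .join(file_name)
}
```

Zero-padding every field is what makes plain alphabetical sorting match chronological order.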

Multi-Account Support

If you have multiple Apple accounts -- say, a personal iCloud and a work iCloud -- each gets its own subfolder in the Registry directory.

The subfolder name is the first 8 characters of a hash of your Apple account identifier. It's deterministic (the same account always gets the same folder) but doesn't reveal your Apple ID.
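To illustrate the idea of a deterministic, non-revealing folder name, here is a dependency-free sketch. It uses FNV-1a purely as a stand-in so the example needs no external crates. We are not claiming this is the hash the app uses, and a real implementation would more plausibly use a cryptographic hash such as SHA-256. The function names are hypothetical.

```rust
/// 64-bit FNV-1a hash (stand-in only, not a cryptographic hash).
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// First 8 hex characters of the hash of the account identifier.
fn account_folder(account_identifier: &str) -> String {
    let hex = format!("{:016x}", fnv1a_64(account_identifier.as_bytes()));
    hex[..8].to_string()
}
```

The same identifier always maps to the same 8-character folder name, and the folder name alone does not spell out the Apple ID.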

The app only processes voice memos from the currently active account. This prevents accidentally transcribing a family member's voice memos if they share the same Mac. The Timeline UI shows which account each transcription belongs to and dims items from other accounts.

Privacy Architecture

Let's be explicit about what happens to your data.

Audio files: Read from the local Voice Memos folder. Converted to WAV locally. Processed by Whisper locally. Never uploaded anywhere. Optionally deleted after transcription.

Transcription text: Written to a local Markdown file. Never sent to any server. You control where the file lives and who can access it.

Network calls: The only network activity is downloading the Whisper model from Hugging Face on first launch (or when switching models). After that, the app works completely offline.

No telemetry. No analytics. No crash reporting. No "anonymous usage data." The app doesn't phone home. Period.

This isn't a marketing decision. It's an architectural one. The app doesn't have a server component. There's nowhere to send data even if we wanted to.

The Audio Deletion Workflow

My Transcriber supports a deliberate audio lifecycle: record, transcribe, review, accept, delete audio.

When you accept a transcription, the source audio file can be deleted automatically. The frontmatter gets updated with transcription_accepted: true and transcript_only: true.
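The frontmatter update on acceptance could be sketched like this. The function name mark_accepted is hypothetical, the approach is a simple line-based rewrite rather than a YAML round trip, and deleting the audio file (for example with std::fs::remove_file) is left to the caller.

```rust
/// Set transcript_only to true and ensure transcription_accepted: true is
/// present inside the frontmatter. The body is left untouched.
fn mark_accepted(doc: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut fences_seen = 0; // 0 = before frontmatter, 1 = inside, 2 = after
    let mut accepted_written = false;
    for line in doc.lines() {
        if fences_seen == 1 {
            if line == "---" {
                // Closing fence: add the flag if the file did not have it yet.
                if !accepted_written {
                    out.push("transcription_accepted: true".to_string());
                }
                fences_seen = 2;
            } else if line.starts_with("transcript_only:") {
                out.push("transcript_only: true".to_string());
                continue;
            } else if line.starts_with("transcription_accepted:") {
                out.push("transcription_accepted: true".to_string());
                accepted_written = true;
                continue;
            }
        } else if fences_seen == 0 && line == "---" {
            fences_seen = 1;
        }
        out.push(line.to_string());
    }
    let mut result = out.join("\n");
    result.push('\n');
    result
}
```

The rewrite is idempotent, so accepting an already accepted transcription leaves the file unchanged.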

This is opt-in. You can keep your audio files forever if you prefer. But for many people -- especially professionals handling sensitive conversations -- the ability to automatically remove audio recordings after transcription is essential.

There's also an auto-delete option: configure the app to delete source audio immediately after successful transcription, without requiring manual acceptance. For people who never want audio recordings lingering on disk, this is the way to go.

The Build System

For developers interested in the build process: the app is built with Tauri's standard toolchain plus some custom prebuild scripts.

pnpm run build triggers the full pipeline:

Download architecture-specific ffmpeg binary (~77MB)
Build the SvelteKit frontend
Compile the Rust backend with Metal acceleration enabled
Bundle everything into a native macOS .app

The Rust backend compiles with release optimizations that enable Metal GPU support. This is why the debug build is so much slower -- it's literally running on CPU instead of GPU.

Signed builds (pnpm run build:sign) are needed for testing features that require macOS permissions (Full Disk Access, Photos access). Unsigned builds lose these permissions every time you rebuild, which makes development painful.

Why Rust?

The natural question. Why not Electron with a Node.js backend? Or a native Swift app?

Rust gives us three things that matter here:

Performance. Audio processing and file watching need to be efficient. Rust's zero-cost abstractions mean the backend uses minimal resources while watching for changes and managing the transcription queue.

whisper-rs. The best Whisper implementation for local inference is whisper.cpp, and it has excellent Rust bindings. The Rust ecosystem gave us the shortest path to fast, local transcription with Metal support.

Tauri. The Tauri framework is built in Rust and produces small, fast native apps. The alternative (Electron) would ship a 200MB+ Chromium browser. Tauri uses the system WebView, so the app is a fraction of the size.

Error Handling and Recovery

Real-world audio processing is messy. Files get corrupted. iCloud sync delivers partial files. The Mac runs out of disk space mid-conversion.

The queue system handles this with retry logic. If a transcription fails, it gets retried with exponential backoff. If it fails repeatedly, it's marked as failed and the user can retry manually from the UI.
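Here is a minimal sketch of the retry decision. The base delay, the attempt limit, and the names (RetryDecision, next_retry) are illustrative assumptions, not the app's actual settings.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
enum RetryDecision {
    /// Wait this long, then try the transcription again.
    RetryAfter(Duration),
    /// Give up and surface the item in the UI for a manual retry.
    MarkFailed,
}

/// Exponential backoff: the delay doubles after every failed attempt.
/// `failed_attempts` counts failures so far (1 after the first failure).
fn next_retry(failed_attempts: u32, base: Duration, max_attempts: u32) -> RetryDecision {
    if failed_attempts >= max_attempts {
        return RetryDecision::MarkFailed;
    }
    let factor = 2u32.saturating_pow(failed_attempts.saturating_sub(1));
    RetryDecision::RetryAfter(base.saturating_mul(factor))
}
```

With a 2-second base and a limit of 5 attempts, the waits would be 2, 4, 8, and 16 seconds before the item is marked as failed.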

The ffmpeg conversion step validates that the input file is complete before attempting conversion. This catches the common case of iCloud delivering a partial download.
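One common way to guard against partial downloads is a size-stability check: only treat a file as complete once it is non-empty and its size has stopped changing between two polls. The sketch below shows that heuristic. It is an assumption for illustration and may not be how the app actually validates input, and the function names are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Current size of a file in bytes.
fn file_size(path: &Path) -> io::Result<u64> {
    Ok(fs::metadata(path)?.len())
}

/// Heuristic: a file looks complete when it is non-empty and its size did
/// not change since the previous check. `previous_size` is None on the
/// first observation, so a file is never accepted on first sight.
fn looks_complete(previous_size: Option<u64>, current_size: u64) -> bool {
    current_size > 0 && previous_size == Some(current_size)
}
```

A caller would poll file_size on an interval and hand the file to ffmpeg only after looks_complete returns true.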

Application logs go to ~/Library/Logs/my.transcriber/ for production builds. If something goes wrong, there's a clear trail to follow.

What's Not Here (Yet)

Some things that would make sense but aren't implemented yet:

Speaker diarization. Identifying who's speaking in a multi-person recording. Whisper doesn't do this natively, and the third-party solutions aren't great yet for local inference.

Windows and Linux. The Voice Memos integration is deeply tied to macOS. A cross-platform version would need a different audio input strategy.

Streaming transcription to external tools. Right now the output is files. A webhook or API layer for real-time output to other tools is on the roadmap.

These are engineering challenges, not fundamental limitations. The architecture supports extending in these directions.

Explore It Yourself

If you're the kind of person who reads technical deep dives before downloading software, you're exactly who this app was built for.

Everything described here runs locally on your Mac. No account required. No server involved. Download it, grant Full Disk Access, record a voice memo, and watch it appear as Markdown.

Download for Mac (Free)

macOS 15+ required. Apple Silicon recommended for Metal GPU acceleration.
