I think out loud. That has always worked better for me than sitting in front of a blank page, which means I end up with a lot of Just Press Record voice memos sitting on an iPhone - a few seconds at a time, sometimes a two-hour walking monologue, rarely anything in between. They sync to iCloud, and then nothing happens to them. I have yet to voluntarily scrub through a ninety-minute audio file to find the one sentence I cared about at minute 43.

What I actually want is: press record on the phone, and twenty minutes later a proper Markdown transcript shows up in my Obsidian vault, neatly filed under Transcripts/2026/04 April/2026-04-21 07-45-47.md, speaker-diarised, in the original language, with word-level timestamps in a JSON sidecar for the day I decide to build something more interesting on top. Searchable, quotable, linkable from my other notes.

This post is the write-up of that pipeline, now running end-to-end on my home server.

The shape of it

iPhone (JPR)
   │  iCloud sync
   ▼
iclouddrive ── rclone move ──► /home/.../icloud/Just Press Record/
                                         │
                                         │  transcribe_bot.py (hourly)
                                         ▼
                                    ffmpeg normalize + chunk
                                         │
                                         │  Modal RPC per 12-min chunk
                                         ▼
                                 WhisperX + align + pyannote (L4 GPU)
                                         │
                                         ▼
                           vault/Transcripts/YYYY/MM Month/YYYY-MM-DD HH-MM-SS.md
                                         │
                                         └──► WidgitaBot → Matrix Status room

Nothing about this stack is novel on its own. The interesting bit is how the pieces are stitched together so the whole thing is boring to operate - idempotent, crash-safe, restartable, and silent when nothing’s happening.

Pulling the audio: rclone with a stability check

The phone records into iCloud Drive, and rclone has a solid iclouddrive backend for pulling it down. The subtle problem is that Just Press Record writes the file to iCloud while still recording, so a naive rclone move will happily grab a half-written file and delete it from the phone.

The fix is embarrassingly simple but worth doing properly: icloudsync.py does a two-pass size probe. List everything with rclone lsjson, wait 90 seconds, list again, and only move files whose size (and mtime) didn’t change between the two snapshots. A still-recording file fails that check and stays put for the next cycle. A boring systemd timer (OnUnitInactiveSec=10min, Type=oneshot) runs it forever.
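The two-pass probe is simple enough to show. A minimal sketch, assuming `rclone lsjson` output and splitting the comparison into a pure function (the function names here are illustrative, not the actual `icloudsync.py`):

```python
import json
import subprocess
import time

def snapshot(remote: str) -> dict:
    """One `rclone lsjson` probe: path -> (size, mtime) signature."""
    out = subprocess.run(
        ["rclone", "lsjson", "--recursive", remote],
        check=True, capture_output=True, text=True,
    ).stdout
    return {e["Path"]: (e["Size"], e["ModTime"]) for e in json.loads(out)}

def stable_paths(first: dict, second: dict) -> list:
    """Paths whose signature is identical in both probes.

    A file that grew, was touched, or appeared between the two snapshots
    fails the comparison and is left for the next timer tick.
    """
    return [p for p, sig in sorted(second.items()) if first.get(p) == sig]

def find_stable(remote: str, settle_seconds: int = 90) -> list:
    first = snapshot(remote)
    time.sleep(settle_seconds)
    return stable_paths(first, snapshot(remote))
```

Only the paths returned by `find_stable` are handed to `rclone move`, so a still-recording file is never deleted from the phone.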

The transcription loop: one Python script, one Modal function

The local orchestrator is transcribe_bot.py. It’s a single-file, no-daemon, run-once-then-exit script that a separate systemd timer pokes hourly. Each run does the same thing for every new audio file:

  1. sha256 the source bytes. This is the idempotency key. The manifest at .transcribe-state/jobs.json maps hashes to {status, note, sidecar, ...}, so re-running the bot a hundred times on the same iCloud folder does nothing a hundred times.
  2. Move the file out of iCloud into audio-ingest/processing/<hash>/<orig>.m4a. From this point the source lives in “work” space, not “incoming” space, and won’t be double-picked by a concurrent scanner.
  3. Normalize with ffmpeg to 16 kHz mono 16-bit PCM WAV, which is what WhisperX wants anyway. Locking the format on the client side means the GPU worker never has to care about exotic iPhone codecs.
  4. Chunk into 12-minute pieces with 2 s overlap (only if the recording is over 30 minutes - shorter stuff goes in one shot). The overlap is there so words straddling a chunk boundary get real context from the next chunk, and the merger later just drops segments whose start falls inside the overlap tail of a non-final chunk.
  5. Call Modal, once per chunk, base64-encoding the WAV bytes into the payload. The worker returns absolute-time segments, word-level timestamps, and per-chunk speaker labels.
  6. Merge, render, atomic-write the Markdown note and the JSON sidecar to the Obsidian vault (tmp + os.replace, so Obsidian Sync never sees a half-written file).
  7. Delete the working copy, update the manifest, drop a notification in the outbox spool for WidgitaBot to pick up.
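The overlap handling in step 4 is the only subtle part of the merge. A minimal sketch, assuming 12-minute chunks with a 2-second tail and worker segments already converted to absolute time (names and data shapes are illustrative, not the actual merger):

```python
OVERLAP_SECONDS = 2.0  # context shared with the next chunk

def merge_chunks(chunks: list) -> list:
    """Merge per-chunk segment lists into one absolute-time timeline.

    `chunks` is a list of (chunk_start, chunk_end, segments) triples,
    where each segment carries absolute "start"/"end" times. For every
    non-final chunk, segments that begin inside its overlap tail are
    dropped: the next chunk re-transcribes them with real context.
    """
    merged = []
    for i, (start, end, segments) in enumerate(chunks):
        final = i == len(chunks) - 1
        cutoff = end - OVERLAP_SECONDS
        for seg in segments:
            if final or seg["start"] < cutoff:
                merged.append(seg)
    return merged
```

Because each chunk keeps everything up to its own cutoff and the next chunk owns the tail, no word is emitted twice and no word is lost at a boundary.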

If any step fails, the file moves to audio-ingest/failed/<hash>/ together with an error.txt, and the manifest marks it as status: failed. No retry loop - failed means “the human should look at this,” not “let’s burn GPU hours hammering it.”
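The tmp + `os.replace` dance from step 6 is small enough to spell out; a minimal sketch (the helper name is hypothetical):

```python
import os
from pathlib import Path

def atomic_write(path: Path, text: str) -> None:
    """Write via a temp file in the same directory, then os.replace().

    os.replace() is atomic on POSIX within one filesystem, so any
    concurrent reader (Obsidian Sync included) sees either the old
    complete file or the new complete file, never a partial write.
    """
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(text, encoding="utf-8")
    os.replace(tmp, path)
```

The same trick covers the outbox spool files later: writing the temp file and renaming it into place means a consumer can never pick up a half-written JSON document.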

The heavy lifting runs on Modal. There is a single Transcriber class in modal_app/whisperx_worker.py:

  • The image pins torch==2.3.1, numpy<2, and critically transformers<5 (v5 dropped an internal torch.no_grad reference that WhisperX 3.3 still reaches into - the first deploy failed with a NameError: name 'torch' is not defined from inside the container, which took a second to parse). whisperx==3.3.1 pulls the rest of the ML stack itself.
  • A persistent Modal Volume at /models caches WhisperX large-v3, the per-language alignment model, and the pyannote diarization pipeline. First cold start downloads ~4 GB into the volume; every start after that opens them from disk in seconds.
  • @modal.enter() loads the ASR model once per container and keeps it in memory; align and diarize models are loaded lazily, keyed by language code, and cached for the container’s lifetime. scaledown_window=300 keeps the container warm for five minutes after the last call, which is usually enough to serve all chunks of a multi-chunk recording on one container.
  • The Hugging Face token for pyannote comes in via a Modal Secret (huggingface-secret). The source file has no credentials in it.
  • Audio is decoded to /tmp/whisperx_<job>_<chunk>_<rand>/in.wav and shutil.rmtree’d in a finally block. Nothing written to the Volume ever contains user audio - only models.

The function itself is ~60 lines of straightforward WhisperX code: transcribe → align → optionally diarize → return a dict. The whole file, including the image spec and a warmup entrypoint, is under 250 lines.
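The overall shape of that worker looks roughly like this. A sketch, not the actual file: the app, volume, and secret names are assumptions, the pins match the ones above, and the method body is elided.

```python
import modal

image = (
    modal.Image.debian_slim()
    .pip_install("torch==2.3.1", "numpy<2", "transformers<5", "whisperx==3.3.1")
)
app = modal.App("whisperx-worker", image=image)
models = modal.Volume.from_name("whisperx-models", create_if_missing=True)

@app.cls(
    gpu="L4",
    volumes={"/models": models},
    secrets=[modal.Secret.from_name("huggingface-secret")],
    scaledown_window=300,  # stay warm 5 min between chunk calls
)
class Transcriber:
    @modal.enter()
    def load(self):
        import whisperx
        # ASR model loaded once per container; align/diarize models are
        # loaded lazily per language code and cached on the instance.
        self.model = whisperx.load_model("large-v3", "cuda", download_root="/models")
        self._align_models = {}

    @modal.method()
    def transcribe(self, wav_b64: str, diarize: bool = True) -> dict:
        ...  # decode to /tmp, transcribe -> align -> diarize, rmtree in finally
```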

Matrix, so I know about it

The other half of the boring-to-operate story is knowing when the pipeline actually did something. I already have a self-hosted Synapse instance (matrix.widgita.xyz), so I wrote WidgitaBot - a matrix-nio-based Python bot that lives in an end-to-end-encrypted #status room and posts a short message whenever something interesting happens.

The coupling between the pipeline and the bot is deliberately loose: both sides agree on a directory, bot-outbox/, and a JSON schema. transcribe_bot.py writes a file atomically (tmp + rename) with a content field that is literally an m.room.message dict:

{
  "source": "transcribe_bot",
  "timestamp": "2026-04-21T17:56:06+00:00",
  "content": {
    "msgtype": "m.notice",
    "body": "Transcript ready: 2026-04-21 07-45-47 ..."
  }
}

WidgitaBot polls the directory, sends each message, and deletes the file. Malformed files move to .dead-letter/. If the bot is down when a transcript finishes, the file sits in the spool until the bot comes back. If the bot is down for a week, a week’s worth of messages arrive at once the moment it reconnects. You can’t lose a notification by restarting either process, which is the property I wanted.

This pattern - file-based spool as the contract between two processes - is one of those designs I keep reaching for in personal infrastructure. It needs no IPC library, no message broker, no schema registry. ls is your queue monitor, cat is your debugger, rm is your retry button.
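The consuming side of that contract fits in one function. A minimal sketch, assuming spool files are named `*.json` and `send` is whatever actually posts to Matrix (in the real bot, a `matrix-nio` `room_send` call):

```python
import json
from pathlib import Path

def drain_outbox(outbox: Path, send) -> int:
    """Send every valid spool file via send(content_dict), then delete it.

    Malformed files are moved to .dead-letter/ rather than retried, so
    one bad file can't wedge the queue. Files are only deleted after a
    successful send, which is what makes restarts lossless.
    """
    dead = outbox / ".dead-letter"
    sent = 0
    for path in sorted(outbox.glob("*.json")):
        try:
            msg = json.loads(path.read_text())
            content = msg["content"]  # literally an m.room.message dict
        except (json.JSONDecodeError, KeyError):
            dead.mkdir(exist_ok=True)
            path.rename(dead / path.name)
            continue
        send(content)
        path.unlink()
        sent += 1
    return sent
```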

Crash safety: orphan recovery

The bit I’m proudest of is also the most mundane. The backlog included a 5.6-hour recording, which at twelve minutes per chunk is twenty-eight Modal calls. If the service gets killed mid-run - process OOM, kernel upgrade, whatever - the source file is already out of iCloud and sitting in audio-ingest/processing/&lt;hash&gt;/.

Instead of teaching the bot about partial-completion recovery, the scanner simply runs iter_orphan_sources() before iter_source_files() on each invocation. An orphan is any audio file in processing/<hash>/ that isn’t our normalized WAV or a chunk file. Those get processed first on the next tick, and because the sha256 key is derived from the bytes (not the path), the manifest check still works. A killed run becomes a “try again next hour” for free.
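The orphan scan is a few lines of filesystem walking. A sketch under assumed intermediate names (`normalized.wav`, `chunk_` - the real script uses its own conventions):

```python
from pathlib import Path

AUDIO_EXTS = {".m4a", ".wav", ".mp3"}  # assumption about accepted inputs

def iter_orphan_sources(processing: Path):
    """Yield source files stranded in processing/<hash>/ by a killed run.

    The normalized WAV and chunk files are the pipeline's own
    intermediates; anything else with an audio extension must be an
    abandoned original, and gets re-queued ahead of new files.
    """
    for job_dir in sorted(p for p in processing.iterdir() if p.is_dir()):
        for f in sorted(job_dir.iterdir()):
            if f.suffix.lower() not in AUDIO_EXTS:
                continue
            if f.name == "normalized.wav" or f.name.startswith("chunk_"):
                continue  # hypothetical intermediate names
            yield f
```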

Filename wart, deliberate

One small thing that made me pick the timestamp separator I did: my Obsidian vault uses YYYY-MM-DD DayName as its existing note convention, which works because there’s only one of those per day. Audio logs can happen every few minutes, so I need a time in the filename. I started with 2026-04-21 07:45:47.md - colons are perfectly legal on ext4 and render fine in Obsidian desktop - but the vault is synced with Obsidian’s official sync daemon, which has a history of trouble with colons on iOS/APFS. The fix was a one-character change: 2026-04-21 07-45-47.md. Not interesting, but the kind of thing that would have been a headache in a month if I’d left it.
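The whole naming convention reduces to one strftime-shaped helper. A sketch (the function name is hypothetical; the layout matches the Transcripts/YYYY/MM Month/ structure above):

```python
from datetime import datetime
from pathlib import Path

def note_path(vault: Path, recorded: datetime) -> Path:
    """Vault-relative path for a transcript note.

    Hyphens in the time part instead of colons keep Obsidian Sync
    happy on iOS/APFS while staying sortable and human-readable.
    """
    folder = vault / "Transcripts" / f"{recorded:%Y}" / f"{recorded:%m %B}"
    return folder / f"{recorded:%Y-%m-%d %H-%M-%S}.md"
```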

What it doesn’t do (yet)

Two things I’m aware of and haven’t fixed:

  • Cross-chunk speaker identity. pyannote assigns SPEAKER_00, SPEAKER_01, etc. per chunk, and those labels don’t carry across chunk boundaries. On a two-hour conversation that’s split into ten chunks, the same person might appear as c0_SPEAKER_00 early in the note and c4_SPEAKER_01 further down. The scoped c<n>_ prefix flags this explicitly in both the Markdown and the JSON so no downstream tool mistakenly treats these as the same identity. A proper fix would be a second pass that clusters speaker embeddings across chunks; not today.
  • No batching across files. Each run processes one file at a time, sequentially. For the backlog that was fine - the pipeline cleared twelve hours of audio overnight - but for sustained heavy use, parallel max_containers on the Modal side plus a small producer/consumer loop locally would be the obvious upgrade. I’ll do it if I ever have the problem, not before.

Why it works well

It’s boring to operate. Every piece does one thing. The contracts between pieces are small (a directory, a JSON shape, a filename convention), which means swapping any one of them out later is a local change. No file is ever deleted before its result has been verified on disk. The manifest is a single JSON file I can read with cat and fix with vim if I have to. The GPU container carries no state between runs that I’d cry about losing; the only thing that lives on Modal’s side is a cache of model weights, and rebuilding that cache is a modal run ...::warmup away.

And the feedback loop is fast: I press record on the phone, walk for a bit, stop, and an hour or two later a new note appears in Obsidian with a short ping in Matrix. Which is, as it turns out, what I actually wanted - not the pipeline, but the sense that anything I think out loud into a microphone is now text by the time I next think to look for it.