December 2024

Voice Note Browser

I like thinking through ideas on a walk and sometimes record voice notes to myself as I go. The result is a collection of related notes, and I wanted a way to see their transcripts together on one page. But to my surprise, I couldn’t find any apps that show more than a single transcript at a time.

So I built a little web app to do this using the newly added transcripts in Apple’s Voice Memos, showing the transcripts from the past couple of days’ worth of notes in a simple list.

I used an Apple Shortcut to copy out the most recent week of voice memos.
- I originally wanted to use rsync, but ran into permission issues and did not feel sufficiently bold to give my terminal app “Full Disk Access”.
I used SvelteKit to make the app.
- The transcript is stored as JSON alongside the audio in a metadata section. Here’s how I parsed it, treating the audio file contents as a UTF-8 string:
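```
// `text` is the audio file's contents decoded as a UTF-8 string.
let transcript = null;
const jsonStart = text.indexOf(`{"attributedString"`);
if (jsonStart > -1) {
  // Read up to the next NUL character, or to the end of the string if there is none.
  const jsonEnd = text.indexOf("\x00", jsonStart);
  const slice = text.slice(jsonStart, jsonEnd > -1 ? jsonEnd : undefined);
  transcript = JSON.parse(slice);
}
```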
- I learned that SvelteKit can stream promises, making it easy to load individual voice notes asynchronously and incrementally. The index page returns an array of Promises representing individual notes, and the frontend shows data as it arrives. This was very cool.
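
To make the streaming idea concrete, here is a minimal sketch of what such a server `load` function might look like. It is not the app's actual code: `loadNote` is a hypothetical stand-in for reading one memo and parsing its transcript, the ids are made up, and the file name is only where I would expect this to live in a SvelteKit project.

```javascript
// Sketch of src/routes/+page.server.js (assumed location).
// In a real SvelteKit file, `load` would be exported: `export function load() { ... }`.

// Hypothetical helper: stands in for reading one voice memo and parsing its transcript.
async function loadNote(id) {
  // Simulate uneven I/O so that notes resolve at different times.
  await new Promise((resolve) => setTimeout(resolve, 10 * id));
  return { id, transcript: `transcript for note ${id}` };
}

// The promises inside `notes` are returned without being awaited,
// so the page does not have to wait for every note before rendering.
function load() {
  const ids = [1, 2, 3]; // made-up ids for illustration
  return { notes: ids.map((id) => loadNote(id)) };
}
```

On the page side, each promise in `data.notes` can be wrapped in its own `{#await}` block, so every note shows a placeholder until its transcript arrives.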

This came together quickly and I’m really happy with it – it’s nice when something so simple can be so useful.

A few ideas for future work:

Provenance: Make it easy to listen to the underlying audio for the cases when the transcription doesn’t get things right.
- For example, making a highlight on the page should pop up a little tooltip that lets you listen to the audio corresponding to that part of the transcript.
Structured extraction: Use LLMs to extract to-dos so that the ideas are easier for me to act upon.
- It would be nice if there were a sidebar with a bunch of to-do items next to each note.
- There might be other kinds of structured extraction that would also be useful. I could imagine a set of local language models making parallel passes over each note with each focused on extracting a particular kind of structure.