Voice Note Browser
I like thinking through ideas on a walk and sometimes record voice notes to myself as I go. The result is a collection of related notes, and I wanted a way to see their transcripts together on one page. But to my surprise, I couldn’t find any apps that show more than a single transcript at a time.
So I built a little web app to do this using the newly added transcripts in Apple’s Voice Memos, showing a couple of days’ worth of transcripts in a simple list.
- I used an Apple Shortcut to copy out the most recent week of voice memos.
- I originally wanted to use `rsync`, but I ran into permission issues and did not feel sufficiently bold to give my terminal app “Full Disk Access”.
- I used SvelteKit to make the app.
- The transcript is stored as JSON alongside the audio in a metadata section. I parsed it by treating the audio file’s contents as a utf-8 string and scanning for where the embedded JSON begins.
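A minimal sketch of that parsing, assuming the transcript can be located by scanning for a marker string (the `{"transcription"` key here is an illustrative guess, not Apple’s documented format):

```typescript
// Treat the .m4a file's raw bytes as a utf-8 string and scan for the
// embedded transcript JSON. The '{"transcription"' marker is an
// illustrative assumption about what the metadata blob starts with.
function extractTranscript(contents: string): any | null {
  const jsonStart = contents.indexOf('{"transcription"');
  if (jsonStart > -1) {
    // Walk forward, balancing braces to find where the object ends.
    // (Naive: a brace inside a transcript string would throw this off.)
    let depth = 0;
    for (let i = jsonStart; i < contents.length; i++) {
      if (contents[i] === '{') depth++;
      else if (contents[i] === '}' && --depth === 0) {
        return JSON.parse(contents.slice(jsonStart, i + 1));
      }
    }
  }
  return null;
}
```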
- I learned that SvelteKit can stream promises, making it easy to load individual voice notes asynchronously and incrementally. The index page returns an array of `Promise`s representing individual notes, and the frontend shows data as it arrives. This was very cool.
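The streamed-promises pattern looks roughly like this in a `load` function (`parseNote` and the file names are hypothetical stand-ins for the real per-note work):

```typescript
// A sketch of SvelteKit's streamed-promise loading. parseNote is a
// hypothetical stand-in for reading and parsing one Voice Memo.
type Note = { file: string; transcript: string };

async function parseNote(file: string): Promise<Note> {
  // Placeholder for the slow per-note work (reading + parsing the memo).
  return { file, transcript: `transcript of ${file}` };
}

// In +page.server.ts: return the promises without awaiting them.
// SvelteKit streams each result to the client as it settles, so
// notes can render one by one.
export function load() {
  const files = ['walk-monday.m4a', 'walk-tuesday.m4a']; // hypothetical
  return {
    notes: files.map((file) => parseNote(file))
  };
}
```

On the page, each entry can then be rendered with an `{#await}` block so a placeholder shows until that note’s transcript arrives.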
This came together quickly and I’m really happy with it – it’s nice when something so simple can be so useful.
A few ideas for future work:
- Provenance: Make it easy to listen to the underlying audio for the cases when the transcription doesn’t get things right.
- For example, highlighting a passage on the page could pop up a little tooltip that lets you listen to the audio corresponding to that part of the transcript.
- Structured extraction: Use LLMs to extract to-dos so that the ideas are easier for me to act upon.
- It would be nice if there were a sidebar of extracted to-do items next to each note.
- There might be other kinds of structured extraction that would also be useful. I could imagine a set of local language models making parallel passes over each note with each focused on extracting a particular kind of structure.