This project has moved to Codeberg: https://codeberg.org/richwaters/StoryAlign
─────────────────────────────────────────────────────
storyalign is a macOS command-line tool that combines an ebook with an audiobook to produce an enriched ebook containing audio narration. These enriched ebooks are sometimes called read-aloud books, synchronized audio-ebooks, enriched epubs, and more.
These enhanced books are ideal for people who like to switch between reading and listening, as a single app is used to both read and listen. This keeps your place within the book, regardless of the current reading/listening mode.
storyalign is based on the storyteller-platform project available here: https://gitlab.com/storyteller-platform/storyteller. It extracts the core alignment functionality from that project into a standalone tool with minimal dependencies.
- macOS on ARM (Apple Silicon)
- DRM-free EPUB 3
- M4B audiobook format
Download the zip containing the binary with this command:
curl -O -L https://github.com/richwaters/StoryAlign/releases/latest/download/storyalign-macos-arm64.zip
- Unzip the downloaded file.
- Copy the storyalign binary to a directory in your PATH, or run it in place with:
storyalign <epub file> <audiobook file>
On first run, it will prompt for confirmation to download necessary model files, and then create a new epub with "_narrated" appended to the basename of the input epub file. Subsequent runs will bypass the downloads.
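When scripting around storyalign, it can help to predict where the output will land. The following is a minimal sketch, assuming the "_narrated" suffix is inserted before the .epub extension and the output stays next to the input; narrated_name is a hypothetical helper, not part of storyalign.

```shell
# Hypothetical helper: predict storyalign's default output path.
# Assumption: "_narrated" is inserted before the ".epub" extension and
# the output stays in the same directory as the input.
narrated_name() {
    dir=$(dirname "$1")
    base=$(basename "$1" .epub)
    printf '%s/%s_narrated.epub\n' "$dir" "$base"
}

narrated_name "/books/Moby Dick.epub"
```

If the output should go elsewhere, the --outfile option described below overrides this default.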
- git clone https://github.com/richwaters/StoryAlign.git
- cd StoryAlign
- make install
That places the binary into the bin subdirectory. From there, you can cp ./bin/storyalign into a location in your PATH or run it in place.
storyalign [--outfile=<file>] [--granularity=(sentence|phrase|segment|group|word)] [--whisper-model=<file>] [--audio-loader=(avfoundation|ffmpeg)] [--log-level=(debug|info|timestamp|warn|error)] [--no-progress] [--throttle] [--start-chapter=<chapter name>] [--end-chapter=<chapter name>] [--report=(none|score|stats|full|json)] [--whisper-beam-size=<number>] [--whisper-dtw] [--session-dir=<directory>] [--stage=(epub|audio|transcribe|align|xml|export|report|all)] [--help] [--version] [--help-md] <ebook> <audiobook>
<ebook> The input ebook file (in .epub format)
<audiobook> The input audiobook file (in .m4b format).
--outfile=<file> Set the file in which to save the aligned book. Defaults to the name and path of the input file with '_narrated' appended to the basename of that file.
--granularity=(sentence|phrase|segment|group|word) Sets the unit for the synchronized highlighting during narration. The default is 'sentence', which creates the most accurate alignment and fewest highlight updates. The 'phrase' option breaks the sentence into smaller chunks for more frequent updates, so the highlight is less likely to be left on the previous page while audio continues. The 'segment' option relies on the transcription engine to break up the text within sentences. This ends up working like the 'phrase' option, but can be more attuned to audio timing than the semantics used by the 'phrase' option. The 'group' option moves the highlight with each word or small group of words based on timing. This reduces the page-stuck time while keeping things relatively smooth & accurate. The 'word' option moves the highlight with each individual word, which can feel a little choppy.
--whisper-model=<file> The whisper model file. This is a 'ggml' file compatible with the whisper.cpp library. The 'ggml-tiny.en.bin' model is appropriate and best for most cases. If this option is not specified, storyalign will download and install the model after prompting for confirmation. If you do specify a model file, make sure the companion .mlmodelc files are installed in the same location as the specified .bin file.
--audio-loader=(avfoundation|ffmpeg) Selects the audio-loading engine. The default is 'avfoundation', which uses Apple's builtin frameworks to load and decode audio. In most cases this should work fine. The 'ffmpeg' option uses the FFmpeg command-line utility to load and decode audio. This might be helpful if you encounter issues with the default. To make use of it, you must have ffmpeg installed on your system and in your path.
--log-level=(debug|info|timestamp|warn|error) Set the level of logging output. Defaults to 'warn'. Set to 'error' to only report errors. If set to a level more verbose than 'warn', either redirect stderr (where these messages are sent) or use the --no-progress flag to prevent conflicts with the progress display.
--no-progress Suppress progress updates.
--throttle By default, storyalign will use all of the resources the operating system allows. That can end up working the device pretty hard. Use this option to pare back on that. Aligning the book will take longer, but it'll keep the fans off.
--start-chapter=<chapter name> Specify the first chapter to align. This helps storyalign by allowing it to skip over chapters like the table of contents, forewords, etc. that are not in the audiobook. To some extent, the epub itself provides this information in the form of a 'bodymatter' tag, but that is not always the case, and it often doesn't align with the true start of the audiobook.
--end-chapter=<chapter name> Specify the end chapter of the book, where 'end' means the chapter after the last chapter to align. This helps storyalign avoid attempting the alignment of chapters like afterwords, acknowledgements, next reads, etc. Some books provide a 'backmatter' tag that provides this type of information, but others do not.
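The options above combine naturally in a batch script. The following is a sketch only: the rule that an epub and its audiobook share a basename, the align_all function, and the DRY_RUN variable are assumptions made for this example. Only the storyalign flags themselves come from the usage text.

```shell
# Sketch: align every EPUB in a directory that has an M4B with the same
# basename. The pairing rule and the DRY_RUN variable are assumptions made
# for this example; only the storyalign flags come from the usage above.
align_all() {
    dir=$1
    for epub in "$dir"/*.epub; do
        [ -e "$epub" ] || continue              # directory has no .epub files
        case $epub in
            *_narrated.epub) continue ;;        # skip output of earlier runs
        esac
        audio="${epub%.epub}.m4b"
        if [ ! -f "$audio" ]; then
            echo "skip: no audiobook for $epub" >&2
            continue
        fi
        if [ -n "${DRY_RUN:-}" ]; then
            # Print the command instead of running it.
            echo "storyalign --granularity=phrase --throttle $epub $audio"
        else
            storyalign --granularity=phrase --throttle "$epub" "$audio"
        fi
    done
}
```

For example, `DRY_RUN=1 align_all ~/Books` lists the commands that would be run without starting any alignment.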
─────
These options are useful for debugging and testing, but they usually aren't used in normal operation.
--report=(none|score|stats|full|json) Show a report describing the results of the alignment when it has completed. The 'score' choice emits a score that predicts the percentage of sentences that have been aligned correctly. Other options show more detailed information about what was aligned. The default is 'none'.
--whisper-beam-size=<number (1-8)> Set the number of paths explored by whisper.cpp when looking for the best transcription. Higher values will consider more options. That doesn't necessarily mean more accuracy. In fact, it's a bit arbitrary. (Look up 'beam search curse' to learn more.) storyalign defaults to 2 for large & medium models, 7 for tiny models and 5 for all other models.
--whisper-dtw Enable the dynamic time warping experimental feature for whisper.cpp and the experimental handling of that information in storyalign. This might improve accuracy of the timing of the transcription.
--session-dir=<directory> Set the directory used for session data. It is required when --stage is specified, and it tells storyalign where to store both temporary and persisted data.
--stage=(epub|audio|transcribe|align|xml|export|report|all) The processing stage to be run. When set, storyalign expects to find intermediate files stored in the directory pointed to by the session-dir argument. It will re-generate missing information required to run the specified stage.
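A single stage can be re-run against an existing session directory. The sketch below only builds and prints the command line; stage_cmd is a hypothetical helper, and the file and directory names are placeholders. The stage names and the --session-dir/--stage flags are taken from the option list above.

```shell
# Sketch: build the command line for one processing stage.
# stage_cmd is a hypothetical helper; the stage names and the
# --session-dir/--stage flags are taken from the option list above.
stage_cmd() {
    stage=$1 session=$2 epub=$3 audio=$4
    case $stage in
        epub|audio|transcribe|align|xml|export|report|all) ;;
        *) echo "unknown stage: $stage" >&2; return 1 ;;
    esac
    echo "storyalign --session-dir=$session --stage=$stage $epub $audio"
}

# Print the commands for re-running transcription and then alignment.
stage_cmd transcribe ./session book.epub book.m4b
stage_cmd align ./session book.epub book.m4b
```

Validating the stage name up front gives an immediate error for a typo, before any long-running work starts.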
─────
-h, --help Show this help information.
--version Show version information
--help-md Show the help text in markdown format. This can then be pasted into the README.md.
storyalign uses 'whisper.cpp' for transcription of the audiobook. That project can be found at: https://github.com/ggml-org/whisper.cpp. By default, storyalign uses the tiny.en model, which it downloads and installs under a .storyalign directory in the user's home folder. Other models can be downloaded from https://huggingface.co/ggerganov/whisper.cpp/tree/main. For best results, and to avoid a bunch of warnings, the companion .mlmodelc.zip file should be downloaded and installed in the same directory as the .bin model.
The large-v3-turbo model seems to work the best in most cases, but in some cases the larger models can actually work worse. They can get stuck in a punctuation-less mode, and they can also suffer from the 'beam search curse'. To be honest, the whole thing seems a bit of a crapshoot. In the case of storyalign, you can spend a lot of time trying to get things perfect, but ultimately it's only the difference of a fraction of a percent of sentences being misaligned, and that doesn't have much of an effect on the reading/listening experience.
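Before pointing --whisper-model at a manually downloaded model, it may be worth checking that the unzipped companion is in place. This is a minimal sketch, assuming the companion follows the whisper.cpp naming convention of "<model>-encoder.mlmodelc" next to the .bin file; check_model is a hypothetical helper.

```shell
# Sketch: check that a whisper.cpp model has its Core ML companion beside it.
# Assumption: the companion follows the whisper.cpp naming convention
# "<model>-encoder.mlmodelc" (e.g. ggml-tiny.en-encoder.mlmodelc).
check_model() {
    model=$1
    companion="${model%.bin}-encoder.mlmodelc"
    [ -f "$model" ] || { echo "missing model: $model" >&2; return 1; }
    # A compiled Core ML model is a directory bundle, not a single file.
    [ -d "$companion" ] || { echo "missing companion: $companion" >&2; return 1; }
    echo "ok: $model"
}
```

If the check passes, the same .bin path can be given to --whisper-model.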
That said, the quality of the narrated epub is mostly dependent on the quality of the transcription so it is important for that part to work.
The --report option can be used to tell storyalign to produce a report about how well it thinks the alignment worked. This includes a score that is based on the percentage of sentences that it thinks were aligned correctly. This should usually be over 98 or 99%, but it can be less, especially for shorter books. This is due to the fact that some portions of the book like acknowledgements, about the author, etc. might not appear in the audio at all. For smaller books those sections are a larger percentage of the total book, which causes a lower overall score. Proper epubs will have 'bodymatter' and 'backmatter' attributes that point to the actual content of the book, but use of 'backmatter' is still spotty.
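The effect of unnarrated sections on shorter books can be seen with a little arithmetic. The sketch below uses a deliberately simplified score (the percentage of sentences that have matching audio); it is an illustration, not storyalign's actual formula, and the sentence counts are made up.

```shell
# Illustration only: a simplified score, NOT storyalign's actual formula.
# score TOTAL UNMATCHED -> percentage of sentences assumed aligned.
score() {
    awk -v total="$1" -v missed="$2" \
        'BEGIN { printf "%.2f\n", 100 * (total - missed) / total }'
}

score 10000 60   # long book, 60 unnarrated sentences: prints 99.40
score 1500 60    # short book, same 60 sentences:      prints 96.00
```

The same 60 unnarrated sentences cost the long book well under one point but cost the short book four.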
The storyalign reporting uses various mechanisms to determine if a sentence might be misaligned, but the main surefire indicator is if a sentence is too fast. That said, the current version of the reports still produces a lot of false positives.
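The "too fast" idea can be sketched as a character-rate check. This is not storyalign's implementation: the tab-separated input format, the too_fast function, and the default limit of 30 characters per second are all assumptions chosen for illustration.

```shell
# Sketch of a pacing check, not storyalign's implementation.
# Reads "start<TAB>end<TAB>sentence" lines (times in seconds) and prints
# sentences whose character rate exceeds a limit.
# The input format and the 30 chars/second default are assumptions.
too_fast() {
    awk -F '\t' -v limit="${1:-30}" '
        {
            duration = $2 - $1
            # Zero or negative durations are always suspicious.
            if (duration <= 0 || length($3) / duration > limit)
                print NR ": " $3
        }'
}

printf '0.0\t2.5\tCall me Ishmael.\n2.5\t2.6\tSome years ago, never mind how long precisely.\n' | too_fast
```

Here the second sentence is given a tenth of a second of audio, so it is the only one reported. A fixed limit like this is also an easy way to produce false positives, since narrators speed up and slow down.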
Two iOS epub readers that support the narrated epubs created by storyalign are Storyteller Reader from https://apps.apple.com/us/app/storyteller-reader/id6474467720 and BookFusion from https://apps.apple.com/us/app/bookfusion/id1141834096. As storyalign is derived from storyteller-platform, you are highly encouraged to download that app and support that project as much as possible.
On macOS, there is an app called 'Thorium Reader' at: https://www.edrlab.org/software/thorium-reader/. I don't do much ebook reading on my Mac, but this app's search functionality has been incredibly useful in investigating misalignments reported in storyalign's reports.
DRM (Digital rights management) is a set of technical controls (mostly encryption) added to books to supposedly prevent unauthorized use or distribution. My sense is that authors themselves don't care much about it, and it is used to lock you into a single book-reading platform more than anything else. For obvious reasons, storyalign only works on DRM-free books. Many books can be purchased DRM-free from various platforms like ebookshop.org, libro.fm, and kobo.
smilcheck is a tool that can be useful for checking the epub-3 media overlays used by these narrated books. It's a work-in-progress, as I decided to focus more on the reporting within storyalign instead of continuing to improve smilcheck. Still, smilcheck is a useful external tool, as it can be run on any read-aloud epub, not just those created by storyalign. It works in a similar fashion to storyalign's reporting in that it examines the pacing of sentences to find misalignments. It differs in that it only uses the information in the final enhanced epub to make its determinations.
Usage is simply: smilcheck <epub file>
smilcheck should generally be run after confirming the book passes the checks in the epubcheck tool available here: https://www.w3.org/publishing/epubcheck/ (or with brew install epubcheck), as that tool performs important checks on the structure of the book, while smilcheck mostly focuses on sentence pacing.
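The two checks can be chained so that smilcheck only runs on a structurally valid book. This sketch assumes both tools signal failure through their exit status; check_book and the EPUBCHECK and SMILCHECK override variables are hypothetical names added for this example.

```shell
# Sketch: run the structural check first, then the pacing check.
# EPUBCHECK and SMILCHECK are hypothetical override variables added for
# this example; by default the tools are looked up in PATH.
check_book() {
    book=$1
    "${EPUBCHECK:-epubcheck}" "$book" || {
        echo "epubcheck failed; skipping smilcheck" >&2
        return 1
    }
    "${SMILCHECK:-smilcheck}" "$book"
}
```

Usage would be along the lines of: check_book mybook_narrated.epub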
epubdiff.sh is a tool that performs a diff on two epubs. It doesn't do the diff itself; it relies on an external tool (set by DIFFTOOL at the top of the script) for that. It just unzips the epubs into temporary directories and calls the difftool to perform the diff.
Usage is: epubdiff.sh <epub file 1> <epub file 2>
epubstrip.sh slims down a narrated epub by removing the audio and some of the metadata (such as dates and times). It outputs a checksum of the content of the epub when it completes. This checksum is then used by the full book tests to ensure that code modifications don't cause unintended changes to the produced book. The --sum-only option can be used to output just the checksum without producing the stripped file.
Usage is: epubstrip.sh [--sum-only] <epub file> [<output file>]
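The checksum also makes it easy to compare two builds of the same book by hand. This is a sketch that assumes --sum-only prints nothing but the checksum on stdout; same_content and the EPUBSTRIP override variable are hypothetical names for this example.

```shell
# Sketch: compare two narrated epubs by their stripped-content checksum.
# Assumption: "epubstrip.sh --sum-only" prints only the checksum on stdout.
# EPUBSTRIP is a hypothetical override for the tool's location.
same_content() {
    strip_tool=${EPUBSTRIP:-epubstrip.sh}
    sum_a=$("$strip_tool" --sum-only "$1") || return 2
    sum_b=$("$strip_tool" --sum-only "$2") || return 2
    # Exit status 0 when the checksums match, 1 when they differ.
    [ "$sum_a" = "$sum_b" ]
}
```

For example: same_content before_narrated.epub after_narrated.epub && echo "no content change"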
mkExpected.sh is used to make the expected result files for the full book tests. When run, it outputs the checksum of the stripped content, which can then be manually entered into the testinfo.json file.
Usage is: mkExpected.sh <book name (no extension)>
It's helpful to debug the tool by running the different stages. To accomplish that, an Xcode scheme is used for each separate run stage. The generate_schemes_for_book.sh tool is used to generate the schemes from a template. This is a lot easier than using the Xcode scheme editor to add arguments, environment, etc. for each scheme. Basically, you can set all of the arguments for all of the schemes with a simple command. The basename of the epub file and the audio file must match for the script to work.
Usage is: generate_schemes_for_book.sh <options> <epub file>
Contributions, comments, and bug reports are welcome via GitHub issues, discussions, and pull requests.
This project is released under the MIT License. See LICENSE for details.