Open book repository Project Gutenberg has turned thousands of its titles into audiobooks practically overnight using synthetic speech, available now for download or streaming on multiple services. The selection is a bit idiosyncratic (as indeed the archive's is generally) but it is nevertheless a powerful demonstration of accessibility in literature.
Making an audiobook via traditional narration naturally takes quite a long time even in the best case, and of course the reader must be paid for their time and there is the matter of editing and publishing. For many titles it doesn't make sense financially to produce an audiobook, meaning many older and more obscure titles remain difficult for people who prefer that format to consume.
Project Gutenberg is, of course, dedicated to promulgating public domain literature in as many formats as possible, and filling this gap has likely been on their to-do list for years. But it was only when they teamed up with MIT and Microsoft that they were able to perform the kind of code magic necessary to use AI-generated speech to bring these books to life.
The problem with PG's archive, as valuable as it is, is that the files are not uniformly formatted. They come from various sources, often error-ridden optical character recognition processes, and often are imperfectly edited and corrected by volunteers. Even if they were flawless, it does not follow that the format would be easily read by a machine: you would end up narration of page numbers, footnotes, and other ephemera.
"Each one of the e-books in Project Gutenberg is in its own idiosyncratic html format with lots of text you wouldn’t want to hear read aloud like tables, contents, indices, page numbers etc. The hardest part of the project was extracting the good text to read aloud." explained project co-lead Mark Hamilton, affiliated with Microsoft and MIT.
To solve this, they designed a system that worked through the archive and identified book files that were formatted similarly, then figured out which of those clusters were the best suited to being automatically read out.
This first batch, being somewhat constrained in its selection, is a little idiosyncratic: for instance, there is only one Dickens book (the unfinished "Edwin Drood" at that) but a dozen volumes along the lines of "Notes and Queries, Number 176, March 12, 1853 A Medium of Inter-communication for Literary Men, Artists, Antiquaries, Genealogists, etc."
"We picked the books for the first batch based on what we felt the automated parser could do reasonably well," Hamilton continued. "Nevertheless, some key good ones fell through the cracks. Now that we have the first batch out, we’re working to generalize the system to get closer to the full 60k books in a future release."
As for the narration itself, the team has put together multiple machine learning and synthetic speech tools that have improved and become more accessible over the last few years. A few years ago it was obvious that automated audiobook production would soon arrive, and that is has — and at scale.
Here's how the paper on the project describes their approach to making a generated audiobook engaging:
To create an emotive reading of the text, we use an automatic speaker and emotion inference system to dynamically change the reading voice and tone based on context. This makes passages with multiple characters and emotional dialogue more life-like and engaging. To this end, we first segment the text into narration and dialogue and identify the speaker for each dialogue section. We then predict the emotion of each dialogue using in a self-supervised manner. Finally, we assign separate voices and emotions to the narrator and the character dialogues using the multi-style and contextual-based neural text- to-speech model proposed in.