
John Henry was an Audiobook-Readin' Man

You might remember the story of old John Henry. He built rail lines, and could work harder and faster than any man alive. When the company brought in a steam-driven rail driving machine, though, they announced that they were going to fire all of the human rail workers. John Henry stepped up and challenged that machine.

Challenged it, and beat it.

And then dropped over dead.

Keep that in mind as you read this.

Roy Blount, Jr., the president of the Authors Guild, wrote an editorial in the New York Times on February 25th, arguing that the text-to-speech feature of Amazon's new Kindle 2 electronic book reading device violates the intellectual property rights of the authors he represents, because it provides the functional equivalent of an audiobook without paying for audiobook rights.

The crux of Blount's argument is that it's critical to set a precedent now, because the text-to-speech is an audio performance of the book, and even if the digital vocalization is now lousy, it won't always be.

Not surprisingly, authors who have more willingly entered the 21st century, such as Cory Doctorow, John Scalzi, Neil Gaiman, and Wil Wheaton, have attacked Blount's argument with gusto. Wil even provides an amusing side-by-side audio comparison (MP3) of himself and the Mac's "Alex" voice reading a section of his new book Sunken Treasure.

For Scalzi, Gaiman, and Wheaton, the crux of the argument is that Blount's concerns are worse than silly, because nobody would mistake the text-to-speech for real voice acting. (Doctorow, as is his practice, focuses on the legal aspect of Blount's argument, finding it more than wanting.)

My take on this? They're all wrong (well, probably not Cory)... and they're all right, too. That is, Blount is right about the technology, but wrong in his conclusions, while Scalzi, Gaiman, Wheaton, et al. are wrong about the problem, but right about the proper response. The reason that Blount's wrong is that he's just trying to hold back the tide, fighting a battle that was lost long ago. The reason that the 21st century digital writers are wrong is that they've forgotten the Space Invaders rule: Aim at where your target will be, not at where it is.

Text-to-speech is laughably bad now for reading books aloud.

Text-to-speech could very well be the primary way people consume audiobooks within a decade.

At present, text-to-speech systems that go from ASCII to audio follow a few pronunciation conventions, but otherwise have no way of interpreting what is read for proper emphasis. For the kinds of uses that current text-to-speech systems typically see, that's good enough. For reading books, especially fiction, that's not.

But it's not hard to imagine what would be needed to make text-to-speech good enough for books, too. In order to give the right vocalization to the words it's reading, an "AutoAudio Book" would have to have one of three characteristics:

  • It could have been told in detail how to emphasize certain words and phrases, probably through some kind of XML-based markup standard. Call it DRML, or Dramatic Reading Markup Language. Given the existence of other kinds of voice control systems (such as speech synthesis markup language and pronunciation lexicon specification), such a standard isn't hard to imagine. It would take some pre-processing of the text files, though, to really make it work.

  • At the other end of the spectrum, it could actually understand what it's reading, and be able to provide emphasis based on what is going on in the story (basically, what you or I would do).

  • Somewhere in the middle would be a system that had a number of standard emphasis heuristics, and is able to take a raw text file and, after a little just-in-time processing, offer an audio version that would by no means be as good as a real voice actor, but would, for most people, be good enough.

The DRML version is possible now -- hell, I had DOS apps back in the 1990s that would let me add markers to a text file to tell primitive text-to-speech software how to read it. The "understand what it's reading" version, conversely, remains some time off; frankly, that's pretty close to a real AI, and if those are available for something as prosaic as an ebook reader, we have bigger disruptions to worry about.

But the "emphasis heuristics" scenario strikes me as just on the edge of possible. There would have to be some level of demand -- such as would arguably be demonstrated by the success of the Kindle 2 and its offspring. More importantly, it would require a dedicated effort to create the necessary heuristics; amusingly, Blount's editorial has probably done more than anything else to make irritated geeks want to figure out how to do just that. It would probably also need a more powerful processor in the ebook reader; that's the kind of incentive that might make Intel want to underwrite the aforementioned irritated geeks.
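To make the "emphasis heuristics" idea a little more concrete, here's a minimal sketch in Python. It is emphatically not a real AutoAudio Book engine: the two rules (treat ALL-CAPS words as stressed, read exclamations faster and higher) are toy assumptions of mine, and the function names are made up for illustration. The output uses element names from the existing Speech Synthesis Markup Language (speak, s, emphasis, prosody) as a stand-in for the hypothetical DRML.

```python
import re
from xml.sax.saxutils import escape


def mark_up_sentence(sentence: str) -> str:
    """Wrap one sentence in SSML-style tags using two crude, assumed heuristics."""
    text = escape(sentence.strip())
    # Heuristic 1 (assumption): a word typed in ALL CAPS is meant to be stressed.
    # It is lowercased so a synthesizer is less likely to spell it out letter by letter.
    text = re.sub(
        r"\b([A-Z]{2,})\b",
        lambda m: f'<emphasis level="strong">{m.group(1).lower()}</emphasis>',
        text,
    )
    # Heuristic 2 (assumption): exclamations are read faster and a bit higher,
    # questions get a slightly raised pitch.
    if text.endswith("!"):
        text = f'<prosody rate="fast" pitch="+10%">{text}</prosody>'
    elif text.endswith("?"):
        text = f'<prosody pitch="+5%">{text}</prosody>'
    return f"<s>{text}</s>"


def to_ssml(passage: str) -> str:
    """Split raw text into sentences and mark each one up."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]
    return "<speak>" + "".join(mark_up_sentence(s) for s in sentences) + "</speak>"


print(to_ssml("He challenged it. And WON!"))
```

A real system would need far more rules than this (dialogue, acronyms, italics, sarcasm), which is exactly where the dedicated effort would go.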

One can easily imagine a scenario in which we see a kind of "wiki-emphasis" editing, allowing tech-attuned readers, upon encountering a poorly-read section of an AutoAudio Book, to update it and upload the bugfix, thereby improving the heuristics. (Of course, that would undoubtedly result in orthographic edit-wars and dialect forking. But I digress.)

Ultimately, Blount's fears that a super text-to-speech system could undermine the market for professional audiobooks really have more to do with economic choices than technical ones. The requisite technologies are either here but expensive or just on the horizon, and the combination of technological pathways and legal precedent (as Doctorow describes) makes the scenario of good-enough book reading systems all but certain. But that doesn't guarantee that the market for audiobooks goes away. The history of online music is illustrative here, I think: when the music companies were ignorant or stubborn, music sharing proliferated; when music companies finally figured out that it was smart to sell the music online at a low price, music sharing dropped off considerably.

The more that the book industry tries to fight book-reading systems, the more likely it is that these systems (whether for Kindles, or iPhones, or Googlephones, or whatever) will start to crowd out commercial audiobooks. The more that the book industry sees this as an opportunity -- keeping audiobook prices low, for example, or maybe providing ebooks with DRML "hinting" for a dollar more than the plain ebook -- the more likely it is that book reading systems will be seen as a curiosity, not a competitor.

None of these scenarios may be very heartening for authors, unfortunately. Sorry about that.

At least you're not likely to keel over and die competing with an automated audiobook.


The audiobook industry has never been able to keep up with production of the books being written. Dramatic production or not, many sight-impaired people are now thrilled that they FINALLY have access to technology to help them listen to ANY written document. If Amazon loses, so do all those who are sight impaired, and if I were them, I'd start collecting my pennies to hire counsel to sue for rights to access that technology under the ADA.

Once again, I don't see why the Author's Guild has a problem here. Or at least, I don't see why the authors of the Guild would.

Doesn't an author get paid any time a consumer pays for a novel, whether it be a traditional print book, an e-book, or an audio book? Granted, the amount of that payment may differ for each format depending on contracts, but if the balance of sales shifts I am sure the percentages of commissions would shift as well due to natural market forces. And with e-books having less overhead cost, there is greater potential for the profits of the authors to increase, not only through higher commissions, but also increased sales (lower priced books means more consumers will purchase more books).

I would imagine there are a few people that would purchase two different formats of a work, such as both an audio book and a print version, but the majority of us will purchase either one or the other. So the impact on the author's income through the loss of traditional audio book sales, which is almost always a small percentage of total sales, would be minimal.

The only definite loss of income I can see now would be loss of income to the publishers, who charge far more for audio books than print versions. The same publishers who lose money due to print sales being lost to devices like Kindle.

Hmmm...the president of the Author's Guild taking a public stand that will discourage authors from allowing their books to be published via Kindle...which takes away some of the appeal of the fledgling product, which could potentially kill it. Sounds like a win-win for the publishing industry, rather than the authors (and sounds reminiscent of the tactics of the RIAA, suing file-sharers for the "sake of the artists" who never see a cent of the lawsuits).

So maybe the question should be about whose interests Roy Blount, Jr. represents.

If this is a problem, it can be solved fairly easily by including payment for audio rights in books licensed to ebooks with reader functions.

Something similar to your hypothetical "Dramatic Reading Markup Language" already exists, in the form of the proposed CSS3 speech module, including its proposed "say-instead". (The latter, for example, allows the author to specify the respective correct pronunciations for "John Paul II" vs. "Rocky II".)

Charles Chen, now at Google, built this support into his FireVox extension to make Firefox "self-voicing"; google "FireVox say-instead" for more information.
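As a rough illustration of the "say-instead" idea described above (not of the CSS3 speech module's actual syntax, and not of how FireVox implements it), here is a small Python sketch. The lookup table and the function name are hypothetical. The output borrows the sub element with its alias attribute from the Speech Synthesis Markup Language, which serves a similar purpose.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical author-supplied table: written phrase -> how it should be spoken.
SAY_INSTEAD = {
    "John Paul II": "John Paul the Second",
    "Rocky II": "Rocky Two",
}


def apply_say_instead(text: str, table: dict = SAY_INSTEAD) -> str:
    """Wrap each known phrase in an SSML-style <sub alias="..."> element."""
    # Try longer phrases first so a short entry cannot shadow a longer one.
    phrases = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in phrases))
    pieces, last = [], 0
    for m in pattern.finditer(text):
        pieces.append(escape(text[last:m.start()]))
        alias = quoteattr(table[m.group(0)])
        pieces.append(f"<sub alias={alias}>{escape(m.group(0))}</sub>")
        last = m.end()
    pieces.append(escape(text[last:]))
    return "".join(pieces)


print(apply_say_instead("Rocky II opened after John Paul II was elected."))
```

The point of the table is that the same written token ("II") gets a different spoken form depending on the phrase it appears in, which a plain pronunciation dictionary keyed on single words cannot express.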

The technology may be close. However, even if they manage to get it to produce the proper intonations of the voice, it still won't have the personality. There isn't any comparison to having it voiced by the author in the way it was intended, or by someone who can put emotion, humor, or wit in the proper places.

What amuses me about this debate is that the president of the Authors Guild presents the Kindle as a competitor to audiobooks. Prior to this, mainstream consumers wouldn't have considered this. Now, he has set out the idea that audiobooks are just barely better than Kindle-read books. This dramatically lowers the perceived value of audiobooks. Most people will read his speech about rights and royalties, and all they'll take away is that the Kindle is an almost viable alternative to audiobooks. I respect what he wanted, but this should have been left to the courts; he should not have made this a public debate. No matter which side wins, audiobooks will lose.

Forget DRML, you don't need it: think Google Translate. There exists a corpus of audiobooks, with intonation (thanks to professional readers). There exists speech-to-text software. It should be possible to build a corpus of spoken phrases, correlated with the reader's intonation, and then match text from a new work against the corpus to take a "best guess" approach to how to emphasize it.

(Google is apparently doing something similar, using the UN's gigantic corpus of multilingual translations of documents to build a database of phrase-level translations into multiple languages, which can then be applied to web pages. It's effectively a non-realtime Mechanical Turk, leveraging the historic work of an army of translators.)
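The "best guess" corpus matching suggested above could look something like the following toy Python sketch. Everything in it is an assumption for illustration: the three-entry corpus is invented (a real one would be mined from audiobooks aligned with their text), intonation is reduced to "which word did the reader stress", and fuzzy string matching from the standard difflib module stands in for whatever phrase matching a real system would use.

```python
from difflib import get_close_matches

# Hypothetical toy corpus: phrase -> the word a human reader stressed.
CORPUS = {
    "i never said that": "never",
    "what do you want": "you",
    "get out of here": "out",
}


def guess_emphasis(phrase: str, corpus: dict = CORPUS, cutoff: float = 0.6):
    """Return the word to stress, borrowed from the closest corpus phrase, or None."""
    key = phrase.lower().strip(" .!?")
    match = get_close_matches(key, list(corpus), n=1, cutoff=cutoff)
    if not match:
        return None  # nothing similar enough; fall back to a flat reading
    stressed = corpus[match[0]]
    # Only reuse the stress pattern if that word actually occurs in the new phrase.
    return stressed if stressed in key.split() else None


print(guess_emphasis("I never said that!"))
print(guess_emphasis("What do you need?"))
```

The appeal of this approach is that nobody writes the rules by hand; the quality of the reading depends on the size and variety of the corpus rather than on markup.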

(Biker Scott asks why the Authors' Guild has a problem with audiobooks. The answer is quite simple: it boils down to somewhere in the region of $1000-5000 per book, which is what the authors get paid by the likes of Audible.com for the right to produce audiobook editions of their work.)

*Could* a technology solution allow text to be read with some sort of dramatic flair?


But why *would* it?

For a few years, it was possible to create some amazing musical soundalikes by playing MIDI-esque files through the Commodore 64's SID chip. The files were compact enough to be transmitted quickly on a 1200-bps modem. They were a big hit, and the copyright accusations began flying.

Now we have the bandwidth to send actual songs to each other. Synthesis? It went right back to being a music production tool.

Why would books be any different? If the market really wants audiobooks to be free, they won't struggle with clever technological approximations that fling us into the uncanny valley. That happens only when there are true technological limits.

They'll just rip the DRM out of the audiobooks.

I can't believe that technology pundits have missed this simple point: adding mark-up for better reading creates a derivative work, which is prohibited by the existing licenses. Whether the Kindle takes text and renders it as letters on a screen or as spoken words is simply a function of the Kindle. That is significantly different from adding mark-up to the text of the book. In fact, it's akin to editing the book and releasing it under the same name -- would it be okay to release works on Kindle that have been edited (either adding emphasis, for instance, or actually altering the text)? As far as I know, no author licensing under mid-20th century copyright methods permits the distribution of a derivative work, especially if it is to be published under the same title.

First of all, do you have any idea how much effort it would take to do all the markup needed to make it a true audiobook competitor? If you are so darned worried about WHERE IT WILL BE, then you should be looking to lock in the "rights" to specially marked-up text that would allow TTS to function as a competitor to audiobooks. And you should NOT be preventing those who've purchased the material from having it read out loud! It doesn't matter if it's a wife, husband, parent, or Kindle. What do YOU care who reads it? But if you must protect something, then protect the placing of specialized pronunciation marks in the text. But leave the function there for the disabled and for those who are driving from A to B and want to keep going in the book. Geez, do you even realize how much money you COULD be making if you made it easier for people to get through the books they are reading? On average I finish a book 50 to 70% faster BECAUSE OF TTS. And then you know what I do? Buy another. Except I am boycotting those who withhold TTS from me. Further, I am encouraging my hacker friends to look into hacking DRM to free up TTS for those who are disabled; according to the Copyright Office, THAT is legal!

You don't even need to invent a new XML vocabulary for this: the TEI has all that is needed, and has been used for adding interpretative markup for nearly two decades. The point is that doing so (sufficient to enable a realistic vocalization) arguably creates a new edition, with its own copyright. It's neither the misinterpretation of technology nor the fear of holding back the tide which is at fault: it's the concept of copyright which needs changing. It evolved in the age of metal type, and has not transferred successfully to the digital age.

///Peter Flynn
Editor, XML FAQ

This weblog is licensed under a Creative Commons License.