Listen Up: Your Guide to Turning Any Text Into Audio

Have you ever felt like there’s simply too much to read and not enough time? Between articles, emails, reports, and even books, our eyes and brains are constantly bombarded with text. For years, I consumed content the traditional way – with my eyes fixed on a screen or page. But recently, something shifted dramatically. I discovered the power of listening to text, and honestly, it’s rapidly become my preferred method of “reading.” This isn’t about ditching traditional reading entirely, but about opening up a whole new, incredibly flexible way to consume information and entertainment.

The Multitasking Magic of Listening

The magic of listening lies in its flexibility. Unlike reading, which typically demands your undivided visual attention, listening frees up your eyes and hands. That means you can effectively “read” while doing many other things! I now happily work through long articles, dig into complex reports, or even enjoy entire books while playing a relaxed video game, tackling a pile of dishes in the sink, or taking a walk outside. It turns otherwise passive or routine tasks into productive and entertaining ones. We already embrace this idea with podcasts and long-form video essays on platforms like YouTube, so why should we be limited to content that was created as audio in the first place?

The Text Content Gap

This brings me to a common frustration: the vast ocean of valuable text-based content that doesn’t have an easily accessible audio version. While many websites, news platforms, and publishers are starting to offer audio narration for their articles and books, it often comes with a catch. You frequently have to pay a premium or subscribe to a specific service to unlock the audio version, even if you already have free or paid access to the text itself. Think of services like Apple News+, Audible audiobooks (where you buy the audio separately), or the “listen” feature on some platforms like Medium, often locked behind their paywall. If I already have access to the text, I just want a simple, affordable (or free!) way to listen to it.

Fortunately, recent advancements in technology offer powerful ways to overcome this barrier: modern text-to-speech (TTS).

Enter Modern Text-to-Speech (TTS)

Now, when I say text-to-speech, I’m not talking about the choppy, robotic voices of yesteryear. TTS has been around for a long time; who can forget the slightly unsettling, yet groundbreaking, “Hello” from the first Macintosh when Steve Jobs unveiled it in 1984? Compared to today’s voices, it sounds incredibly primitive. We’ve entered a new generation of TTS that sounds remarkably natural, often almost indistinguishable from a human voice.

Companies like ElevenLabs are on the cutting edge of this technology. Their work focuses on generating speech that captures natural human intonation, rhythm, and even emotion. This is a fundamentally different approach from older TTS engines, which often mispronounced less common words and proper nouns and read everything in a flat monotone. For instance, a traditional TTS engine might read “OpenAI” as one awkwardly pronounced word (“oh-pen-nai”), whereas advanced models recognize it as two parts (“Open AI”) and pronounce it accordingly. It’s this leap in naturalness that makes listening to long-form text genuinely pleasant and easy to follow.
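
Where an engine still stumbles on a name, one common workaround is SSML (Speech Synthesis Markup Language), a W3C standard that many TTS engines accept at least in part. Its `<sub alias="...">` element tells the engine to speak a substitute for the written text. Here is a minimal Python sketch, assuming a small hand-maintained pronunciation table; `LEXICON` and `to_ssml` are hypothetical names, and which SSML elements a given engine actually honors varies from engine to engine:

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical pronunciation table: written form -> how it should be spoken.
LEXICON = {
    "OpenAI": "Open A I",
    "ElevenLabs": "Eleven Labs",
}


def to_ssml(text: str, lexicon: dict = LEXICON) -> str:
    """Wrap plain text in <speak>, marking lexicon terms with <sub alias="...">."""
    if not lexicon:
        return f"<speak>{escape(text)}</speak>"
    # Try longer terms first so a short entry can't shadow a longer one.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b")

    parts, last = [], 0
    for match in pattern.finditer(text):
        term = match.group(1)
        parts.append(escape(text[last:match.start()]))  # plain text before the term
        parts.append(f"<sub alias={quoteattr(lexicon[term])}>{escape(term)}</sub>")
        last = match.end()
    parts.append(escape(text[last:]))  # plain text after the last term
    return "<speak>" + "".join(parts) + "</speak>"
```

For example, `to_ssml("OpenAI & ElevenLabs")` returns `<speak><sub alias="Open A I">OpenAI</sub> &amp; <sub alias="Eleven Labs">ElevenLabs</sub></speak>`. Escaping matters here: a stray `&` or `<` in the article would otherwise make the SSML invalid.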

Practical Solutions for Listening to Text

You don’t need to pay for a publisher’s audio edition to start listening, though; there are fantastic practical options available right now. For iOS users, there’s an app called ElevenReader from ElevenLabs. What makes it stand out is that it’s 100% free and can turn virtually any text content, including plain text, PDF documents, and EPUB ebooks, into high-quality audio using the company’s TTS models. It requires an internet connection to process text, but the convenience is hard to beat for turning your existing library into a listening library.

Beyond dedicated apps, your mobile and desktop operating systems also have powerful built-in text-to-speech capabilities. Both macOS and iOS have system-level voices that can read text aloud from almost any application. Now, it’s true that many of the default or older voices sound quite mechanical. However, both Apple platforms also offer “Premium” voices. These voices are built using an entirely new, more sophisticated approach to speech synthesis than their predecessors, resulting in significantly better and more natural-sounding audio. Crucially, these premium voices are available at no extra cost and often approach the quality of modern, cutting-edge TTS solutions. An advantage of using these built-in features is their system-wide integration – they can read text in almost any app, and once premium voices are downloaded, they can even work offline.

The main challenge with the built-in OS text-to-speech is that the features are often hidden away in the Accessibility settings, making them difficult for the average user to find and enable. But once you know where to look, you can unlock a powerful tool for turning almost any text on your screen into audio.

How to Enable Built-in TTS Features

Here’s a quick guide on how to enable these features:

On iOS (iPhone/iPad):

Open the Settings app.
Scroll down and tap on Accessibility.
Under the “VISION” section, tap on Spoken Content.
Toggle on Speak Selection. This allows you to highlight text in most apps and tap the “Speak” option that appears in the context menu.
Toggle on Speak Screen. Once enabled, you can swipe down from the top of the screen with two fingers to have the entire visible content of the screen read aloud. A small controller will appear, allowing you to pause, play, adjust speed, and skip forward or backward.
Tap on Voices to explore and download different language and premium voices. I highly recommend downloading some of the more natural-sounding “Premium” options for your language for the best experience.

On macOS:

Open System Settings (or System Preferences on older macOS versions).
Scroll down and click on Accessibility.
In the sidebar, click on Spoken Content.
Check the box next to Speak selection.
Set a keyboard shortcut for speaking the selected text (macOS uses Option-Esc by default). Choose one that is easy for you to remember and use. Once it’s set, simply highlight the text you want to hear and press the shortcut.
Click on System Voice to choose your preferred voice. Again, look for the more natural-sounding “Premium” options available for download.
You can also enable “Speak announcements” or “Speak items under the pointer” based on your needs.

By enabling these features, you empower your device to read almost anything aloud, from webpages and emails to documents and notes.
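
On macOS, the same system voices can also be driven from Terminal with the built-in `say` command, which is handy for turning a saved article into an audio file for a walk. Here is a small Python wrapper as a sketch. The `-f` (read from file), `-v` (voice), `-r` (words per minute) and `-o` (output file) flags are standard `say` options, while `build_say_command` and the file names are hypothetical. Running `say -v '?'` lists the voice names installed on your Mac; premium voices downloaded through Spoken Content should show up there too, though that is worth checking on your own system.

```python
import shutil
import subprocess
import sys
from typing import List, Optional


def build_say_command(text_file: str,
                      voice: Optional[str] = None,
                      rate_wpm: Optional[int] = None,
                      output_file: Optional[str] = None) -> List[str]:
    """Build the argument list for macOS's built-in `say` command."""
    cmd = ["say", "-f", text_file]        # -f: read the text to speak from a file
    if voice:
        cmd += ["-v", voice]              # -v: voice name, as listed by `say -v '?'`
    if rate_wpm:
        cmd += ["-r", str(rate_wpm)]      # -r: speaking rate in words per minute
    if output_file:
        cmd += ["-o", output_file]        # -o: save the audio to a file instead of playing it
    return cmd


if __name__ == "__main__":
    # Usage: python speak.py article.txt   (only does anything on a Mac)
    if len(sys.argv) > 1 and sys.platform == "darwin" and shutil.which("say"):
        subprocess.run(build_say_command(sys.argv[1], rate_wpm=200), check=True)
    else:
        print("Usage on macOS: python speak.py <text-file>")
```

Building the argument list separately from running it keeps the script testable on any machine and avoids shell quoting problems with odd file names.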

The Power of Text-Audio Synchronization

Simply reading text aloud is powerful, but some cutting-edge solutions add a killer feature: text-audio synchronization. Examples include Amazon’s Whispersync for Voice on Kindle, the ElevenReader app, and reading platforms like Readwise Reader. These apps don’t just read the text; they simultaneously highlight the word or line being spoken and automatically scroll the text as the audio progresses.

This synchronization is incredibly helpful. It tightly links what you hear with what you see, which can significantly boost comprehension. When the audio hits a confusing phrase, you can glance at the highlighted text to clear things up. Having the text highlighted also makes it much easier to follow along, quickly find specific sections, mark key points, and take notes, marrying the convenience of listening with the interactivity of reading.
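
I don’t know exactly how these apps implement it, but the core of word highlighting can be sketched simply: if the speech engine reports when each word starts and ends in the audio (some TTS APIs can return such timings), the reader only has to find which word contains the current playback position and highlight it. A minimal Python sketch, where `WordTiming`, `Highlighter` and the sample timings are hypothetical:

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordTiming:
    word: str
    start: float  # seconds into the audio where the word begins
    end: float    # seconds into the audio where the word ends


class Highlighter:
    """Maps a playback position to the word that should be highlighted."""

    def __init__(self, timings: List[WordTiming]):
        self.timings = sorted(timings, key=lambda t: t.start)
        self.starts = [t.start for t in self.timings]  # precomputed for binary search

    def word_at(self, position: float) -> Optional[int]:
        """Index of the word being spoken at `position` seconds, or None in a pause."""
        i = bisect_right(self.starts, position) - 1  # last word starting at or before position
        if i >= 0 and position < self.timings[i].end:
            return i
        return None


highlighter = Highlighter([WordTiming("Listen", 0.0, 0.4), WordTiming("up", 0.45, 0.7)])
print(highlighter.word_at(0.5))  # 1, i.e. "up"
```

Binary search keeps each lookup cheap even for a book-length list of timings, which matters because a player asks this question many times per second as it scrolls and highlights.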

Room for Improvement

While TTS has improved dramatically in just a few short years, these advancements also highlight areas where we still need progress. The technology isn’t perfect, and the structure of text content itself presents challenges for purely auditory consumption.

For example, I frequently read technical blogs with code examples. Even the most advanced TTS models sound incredibly strange when attempting to read code aloud. This is a hard problem because, frankly, it’s not natural to read code aloud in the first place! Code is designed to be visually parsed and understood, not spoken.
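
One pragmatic workaround, short of teaching TTS to read code well, is to not read it at all: swap each code block for a short spoken note before the text reaches the engine. A minimal sketch, assuming the article is available as Markdown with triple-backtick fences; `describe_code_blocks` and the wording of the note are hypothetical choices of my own:

```python
import re

# A Markdown fenced block: an opening line of three backticks (optionally with a
# language tag), the code itself, and a closing line of three backticks.
FENCE = re.compile(r"^`{3}[^\n]*\n(.*?)^`{3}[ \t]*$", re.DOTALL | re.MULTILINE)


def describe_code_blocks(text: str) -> str:
    """Replace each fenced code block with a one-sentence spoken placeholder."""
    def placeholder(match):
        n = match.group(1).count("\n")  # each line of code ends with a newline
        return f"A {n}-line code example is skipped here."
    return FENCE.sub(placeholder, text)
```

The listener still learns that there was code at that point, and can jump to the text if they want to read it.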

Text also includes many features that simply don’t have a natural equivalent in spoken audio. Consider footnotes. In text, you can interrupt your reading to check a footnote immediately, ignore it, or save it for later; it’s interactive. I’ve listened to many audiobooks, and I’ve never found one that handles footnotes in a truly satisfactory way. Audio is linear: you can’t easily pause the main narrative to explore a side note and then seamlessly return. A real solution might require TTS readers with built-in primitives for interactive structures like these.
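
A first step toward such a primitive could be to keep footnotes out of the main narration but attached to the paragraph that cites them, so a player could offer them at the end of that paragraph, or on request, instead of reading them mid-sentence. A minimal sketch, assuming Markdown-style `[^1]` footnotes; `Segment` and `split_footnotes` are hypothetical names, not the API of any existing reader app:

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

REF = re.compile(r"\[\^([^\]]+)\](?!:)")                        # inline marker, e.g. [^1]
DEF = re.compile(r"^\[\^([^\]]+)\]:[ \t]*(.+)$", re.MULTILINE)  # definition line, e.g. "[^1]: ..."


@dataclass
class Segment:
    text: str                                            # what gets narrated
    footnotes: List[str] = field(default_factory=list)  # offered after this segment, not read inline


def split_footnotes(markdown: str) -> List[Segment]:
    """Split text into paragraphs, pulling footnotes out of the narration."""
    notes: Dict[str, str] = {key: body.strip() for key, body in DEF.findall(markdown)}
    body = DEF.sub("", markdown).strip()
    segments = []
    for paragraph in re.split(r"\n[ \t]*\n", body):
        keys = REF.findall(paragraph)
        spoken = REF.sub("", paragraph).strip()
        if spoken:
            segments.append(Segment(spoken, [notes[k] for k in keys if k in notes]))
    return segments
```

A player built on this could narrate `segment.text`, then pause and ask whether to read the attached footnotes before moving on, which is about as close to “glancing down at the footnote” as linear audio gets.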

Hyperlinks are another challenge. Sometimes the TTS parser just grabs the visible text and drops the link entirely. Other times, if the text includes Markdown or the full URL, the TTS engine reads out the raw link, which is often long, confusing, and anything but natural or helpful.
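
A pre-processing pass before the text reaches the TTS engine can address both failure modes: keep the link text, and replace the raw URL with a short mention of the site it points to. A minimal sketch, assuming Markdown input; `speakable_links` is a hypothetical helper, and the phrasing it produces is just one possible choice:

```python
import re
from urllib.parse import urlparse

MD_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")  # [label](url)
BARE_URL = re.compile(r"https?://[^\s)]+")                    # a raw URL in running text


def _host(url: str) -> str:
    """Short, speakable site name: 'https://www.example.com/x' -> 'example.com'."""
    host = urlparse(url).hostname or "a website"
    return host[4:] if host.startswith("www.") else host


def speakable_links(text: str) -> str:
    # Keep the label and mention the site, instead of dropping the link or reading the URL.
    text = MD_LINK.sub(lambda m: f"{m.group(1)} (link to {_host(m.group(2))})", text)

    def bare(m) -> str:
        url = m.group(0).rstrip(".,;:!?")  # don't swallow sentence punctuation
        return f"a link to {_host(url)}{m.group(0)[len(url):]}"

    return BARE_URL.sub(bare, text)
```

For example, `speakable_links("Read [the docs](https://www.example.com/guide).")` returns `"Read the docs (link to example.com)."`, so the listener knows a link exists without hearing the whole address.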

Impact on the Industry

It’s important to recognize that these technological advancements, while beneficial for consumers, have a significant impact on certain industries and jobs. Audio transcriptionists, whose work involved manually converting audio to text, have seen their roles fundamentally changed (and in many cases, made obsolete) by highly accurate AI transcription services. Similarly, the jobs of human narrators for audiobooks and even some podcast voices could be on life support. Why would consumers pay a premium for a human-narrated audiobook if they can simply listen to their ebook using a nearly human-quality AI voice for free or minimal cost?

This is a brutal reality of technological disruption. Change is coming whether we like it or not, and it will require shifts in skills and industries.

However, the bright side is that content can become vastly more accessible to a broader audience than ever before.

Conclusion

The journey of text-to-speech has brought us from robotic voices to impressively natural-sounding narration. Combined with powerful built-in operating system features, innovative apps offering text-audio synchronization, and the potential of AI integration, listening to virtually any text content is becoming a realistic and incredibly beneficial alternative to traditional reading.

I have a dream that one day, any text can be instantly and seamlessly turned into natural-sounding audio via advanced TTS, and conversely, any audio can be accurately transcribed back into text. I also dream that all this text and audio can be synchronized into a rich, accessible reading experience that empowers everyone, including people who are blind, deaf, or neurodivergent, or who have ADHD, to consume information in the way that best suits their needs.

And let’s not forget the impact on global communication. With Large Language Models (LLMs), any text can now be translated into almost any language for minimal cost. High-quality TTS means that translated text can then be turned into high-quality audio, making information and stories accessible in spoken form across language barriers like never before. This convergence of AI technologies for translation and text-to-speech holds immense promise for humanity, breaking down barriers to knowledge and connection.

So, give listening to text a try – explore the built-in features, experiment with apps, and see how it can transform your content consumption and make every moment a potential “reading” moment. Your ears (and your busy schedule) might just thank you!

This was written by Daniel Lyons.