Text-to-Speech Chrome Extension with Synced Highlighting

I kept losing the plot.

Not in life — in articles. I'd open a 3,000-word essay, read two paragraphs, and somewhere around paragraph three my eyes were still moving but nothing was landing. Tab closed. Onto the next thing. You know the feeling.

Audiobooks didn't fix it. When I just listen, my eyes wander off to a notification, a thumbnail, the void. I need both channels busy — ears and eyes, locked onto the same words at the same time. Like back in school, when a teacher ran a finger under the line they were reading aloud.

So I built that. A Chrome extension that reads any page out loud, highlights each word as it's spoken, and scrolls the page to keep up. Later it grew a second head: instead of just reading the page, it could explain it. This post is about how the thing actually works — and which parts were genuinely hard.

Spoiler: the speech was the easy part.

The real problem is synchronization, not audio

Text-to-speech is a solved problem. You POST some text, you get back an MP3. Done in an afternoon.

What's not solved — what makes the difference between "a robot reading at me" and "a teacher pointing at the page" — is keeping three things in lockstep:

the audio that's playing,
the word that's highlighted,
the scroll position of the page.

Break any one of them and the whole thing collapses into a glorified audiobook. The highlight is the point. It's the finger under the line.

To get word-level highlighting, you need word-level timestamps — the start and end time of every word in the generated audio. The TTS endpoint I use returns them alongside the MP3. Then it's a requestAnimationFrame loop: every frame, read audio.currentTime, binary-search the timestamp array for the word that's playing right now, and paint it.

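// Assumed shape: one entry per spoken word, times in seconds.
interface WordStamp { start: number; end: number }

// Index of the word playing at time t. In a gap between two words this
// returns the word that just ended, so the highlight holds until the next starts.
function currentWordIndex(t: number, stamps: WordStamp[]): number {
  let lo = 0, hi = stamps.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (t < stamps[mid].start) hi = mid - 1;
    else if (t > stamps[mid].end) lo = mid + 1;
    else return mid;
  }
  // Before the first word hi is -1, so clamp to the first entry.
  return Math.max(0, hi);
}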
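
For illustration, here is one way the per-frame loop around that search could be wired. This is a sketch under assumptions, not the extension's actual code: `makeSyncTick`, `getTime`, `lookup`, and `onWord` are hypothetical names, and the search function is passed in so the snippet stands alone. The one real design point is that it repaints only when the word index changes, not on every frame.

```typescript
// Assumed shape: one entry per spoken word, times in seconds.
interface WordStamp { start: number; end: number }

// Hypothetical helper: returns a function meant to be called once per animation frame.
// `lookup` is the timestamp search (for example the binary search above).
function makeSyncTick(
  getTime: () => number,
  stamps: WordStamp[],
  lookup: (t: number, stamps: WordStamp[]) => number,
  onWord: (index: number) => void,
): () => void {
  let last = -1; // index painted on the previous frame
  return () => {
    if (stamps.length === 0) return;
    const i = lookup(getTime(), stamps);
    if (i !== last) {
      last = i;
      onWord(i); // repaint only when the spoken word actually changes
    }
  };
}

// Browser wiring, assuming an HTMLAudioElement `audio` and a `paint` callback:
//   const tick = makeSyncTick(() => audio.currentTime, stamps, currentWordIndex, paint);
//   const loop = () => { tick(); if (!audio.paused) requestAnimationFrame(loop); };
//   audio.addEventListener("play", () => requestAnimationFrame(loop));
```

Keeping the tick a plain function also makes it easy to drive from a fake clock when testing.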

Highlighting itself uses the CSS Custom Highlight API (::highlight()), which lets you paint a range without wrapping every word in a <span> and nuking the page's layout. Underrated API. More browsers should brag about it.
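
A rough sketch of what painting a single word with that API can look like. The highlight name `tts-word` and both function names are made up for this example, and the DOM globals are reached through `globalThis` only so the snippet does not depend on DOM typings. The matching style rule would be something like `::highlight(tts-word) { background-color: yellow; }`.

```typescript
// Pure helper: [start, end) character offsets of each whitespace-separated word.
function wordOffsets(text: string): Array<[number, number]> {
  const out: Array<[number, number]> = [];
  const re = /\S+/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    out.push([m.index, m.index + m[0].length]);
  }
  return out;
}

// Paints one word of a text node by registering a Range under a named highlight.
// Returns false when the CSS Custom Highlight API is not available.
function paintWord(node: unknown, start: number, end: number): boolean {
  const g = globalThis as any; // sketch only: sidesteps DOM type definitions
  if (!g.CSS || !g.CSS.highlights || typeof g.Highlight !== "function") return false;
  const range = g.document.createRange();
  range.setStart(node, start);
  range.setEnd(node, end);
  // Setting the same name again replaces the previous word's highlight.
  g.CSS.highlights.set("tts-word", new g.Highlight(range));
  return true;
}
```

Because the highlight lives in a registry and not in the markup, the page's own DOM and layout stay untouched.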

Auto-scroll is its own little trap. Scroll on every word and the page seizes like a slot machine. So you only scroll when the active paragraph drifts outside a comfortable band — say, the middle 50% of the viewport. Inside the band, hands off.
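
The band check is small enough to sketch as a pure function. Assumptions here: the name `scrollDelta` is hypothetical, the paragraph is judged by its vertical midpoint, and a scroll re-centres it. The post only commits to the idea of a comfortable band, so treat the rest as one possible reading.

```typescript
// Returns how many pixels to scroll to bring a paragraph back to the centre of
// the viewport, or 0 when its midpoint is already inside the comfortable band.
// top/bottom are viewport-relative, as returned by getBoundingClientRect().
// band is the fraction of the viewport treated as comfortable (0.5 = middle 50%).
function scrollDelta(top: number, bottom: number, viewportHeight: number, band = 0.5): number {
  const mid = (top + bottom) / 2;
  const lo = (viewportHeight * (1 - band)) / 2; // upper edge of the band
  const hi = viewportHeight - lo;               // lower edge of the band
  if (mid >= lo && mid <= hi) return 0;         // inside the band: hands off
  return mid - viewportHeight / 2;              // positive means scroll down
}

// Browser wiring, assuming `paragraph` is the active element:
//   const r = paragraph.getBoundingClientRect();
//   const dy = scrollDelta(r.top, r.bottom, window.innerHeight);
//   if (dy !== 0) window.scrollBy({ top: dy, behavior: "smooth" });
```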

I don't scrape pages. I read what's already rendered.

Here's a decision that saved me from an entire category of pain.

A lot of "read this page" tools think like crawlers: fetch the URL, fight Cloudflare, parse the HTML, dodge the 403. That's a war you don't need to fight, because by the time the user clicks your button, the content is already sitting in the DOM of their browser, fully rendered, past every login wall and paywall they legitimately have access to.

So the extraction model is the same one Google Translate's extension uses: the page loads, the user clicks, and a content script reads the live DOM. No network requests to fetch the page. No anti-bot cat-and-mouse. The text is right there.

The only job left is finding the article inside the soup of nav bars, sidebars, cookie banners, and "you might also like" rails.

Finding the article: space beats semantics

My first instinct was semantic — look for <article>, score by class names, the usual readability heuristics. It worked until it didn't. Modern sites are a <div> soup with class names like css-1q7v9x. Semantics lie.

What doesn't lie is geometry. The main content is the big visually-coherent column in the middle. So the extractor I landed on — I call it Visual Zone — works in space, not meaning:

Walk the DOM, collect every visible text block with its bounding box.
Detect columns by histogramming the left edges of those blocks. A real article has one dominant column; a portal homepage has five.
For each column, find the tightest common ancestor that covers it, scoring each candidate by coverage × tightness × depth.
Score zones, boost the article-shaped ones, drop the sidebar-shaped ones, and select.
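
To make the column step concrete, here is a minimal sketch of histogramming left edges. Everything specific in it is an assumption for illustration: the `Block` shape, the 16px bin width, and weighting each bin by text area are not taken from Visual Zone itself.

```typescript
// Assumed input: one entry per visible text block, from getBoundingClientRect().
interface Block { left: number; width: number; height: number }
interface Column { left: number; area: number; count: number }

// Buckets blocks by left edge and returns the buckets sorted by total area,
// largest first. One bucket far ahead of the rest suggests a single article
// column; several buckets of similar weight suggest a portal-style page.
function detectColumns(blocks: Block[], binWidth = 16): Column[] {
  const bins = new Map<number, Column>();
  for (const b of blocks) {
    const key = Math.floor(b.left / binWidth) * binWidth;
    const bin = bins.get(key) ?? { left: key, area: 0, count: 0 };
    bin.area += b.width * b.height;
    bin.count += 1;
    bins.set(key, bin);
  }
  return Array.from(bins.values()).sort((a, b) => b.area - a.area);
}
```

Fixed bins can split a column whose edges straddle a bin boundary, so a real version would probably merge neighbouring bins before scoring.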

The funny part: it deliberately does not skip <nav>, <footer>, <aside> up front. Filtering early throws away signal. Let the spatial math exclude them naturally and you get fewer false negatives. Counterintuitive, but the data agreed.

The rule I refuse to break

Early on I had a tempting fallback: when extraction can't map text to DOM elements (some canvas-rendered reader, say), just dump the text to the system TTS and read it. It "works."

I killed it. Here's the rule, written on the wall:

Never read without sync. If I can't highlight the word being spoken, I don't speak.

Because the moment you drop the highlight, you've quietly become an audiobook — and audiobooks already exist, made by people with better microphones than me. The entire reason this thing is worth building is the dual channel. A degraded "at least it reads" experience isn't a smaller version of the product. It's a different, worse product wearing its clothes.

So when a page can't be synced, the extension says "I can't read this one" instead of pretending. Saying no is a feature.

The second head: explaining, not just reading

Reading a page aloud is great when the page is readable. But some pages are dense — a paper, a legal clause, a tutorial in a language you half-speak. Reading those word-for-word doesn't help. You need a teacher, not a narrator.

So the extension got a second mode: it sends the page to an LLM and gets back an explanation — in your language — plus a set of marks to draw on the page. Circles. Underlines. A scribbled note in the margin. All in a handwritten style, on purpose: the handwriting visually separates "the AI's commentary" from "the original printed text" at a glance. You always know what's the page and what's the annotation.

Two engineering constraints made this interesting.

Marks anchor to verbatim text. The LLM proposes a mark like "circle the phrase 'eventual consistency'." The renderer then has to find that exact phrase in the live DOM and draw an ellipse around it — at the right pixel, even after lazy images shift the layout. So the contract is strict: every mark's text must be copied verbatim from the page, or it doesn't anchor. No paraphrasing allowed in the marking layer.
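
A sketch of the verbatim contract, assuming the page's text nodes have already been collected in document order. `anchorPhrase` and the `Anchor` shape are hypothetical; the point is only that an exact match maps back to node offsets, and anything else returns null so nothing gets drawn.

```typescript
interface Anchor { startNode: number; startOffset: number; endNode: number; endOffset: number }

// `texts` stands in for the textContent of each text node, in document order.
// Returns where the phrase starts and ends, or null if it is not on the page verbatim.
function anchorPhrase(texts: string[], phrase: string): Anchor | null {
  if (phrase.length === 0) return null;
  const at = texts.join("").indexOf(phrase);
  if (at === -1) return null; // paraphrased or hallucinated: refuse to anchor
  const end = at + phrase.length;
  let pos = 0; // offset of the current node within the joined text
  let startNode = -1;
  let startOffset = 0;
  for (let i = 0; i < texts.length; i++) {
    const next = pos + texts[i].length;
    if (startNode === -1 && at < next) {
      startNode = i;
      startOffset = at - pos;
    }
    if (end <= next) {
      return { startNode, startOffset, endNode: i, endOffset: end - pos };
    }
    pos = next;
  }
  return null;
}
```

Storing node offsets instead of pixels means the ellipse can be re-measured from a fresh Range whenever the layout shifts.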

No floating lines. I tried connector lines and arrows pointing from a margin note to the text. Looked slick in a mockup. In a wall of real text, they're unreadable — you can't tell what an arrow is pointing at. So they're banned. Marks touch the words directly or they don't exist. A small rule that quietly improved everything.

Hard mode: Kindle and other canvas readers

Then there are the pages that render text as pixels.

Kindle Cloud Reader paints each page to a <canvas> as a blob image. There is no text in the DOM to read. Same with some Chinese reading platforms that draw glyphs onto canvas to stop you copying them.

For those, the DOM-reading model breaks, so they earn an exception to "read the rendered DOM", but only if the sync can still be kept. The pipeline:

Intercept the reader's own data layer to grab the page image (and, where available, token layout).
Run OCR in an offscreen document with tesseract-wasm, which gives you not just text but per-word bounding boxes.
Lay invisible, positioned text spans over the image at those boxes.
Highlight on those. The finger still moves under the line — it's just moving over a picture of the line.
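
The overlay step mostly comes down to coordinate scaling: OCR boxes are in image pixels, while the spans have to sit in the CSS pixels of the canvas as displayed. A minimal sketch, where the `OcrWord` shape is an assumption about what the OCR stage hands over and `toOverlay` is a hypothetical name:

```typescript
// Assumed OCR output: a word and its bounding box in image pixel coordinates.
interface OcrWord {
  text: string;
  rect: { left: number; top: number; right: number; bottom: number };
}

// Position and size for one invisible, absolutely positioned span.
interface SpanBox { text: string; left: number; top: number; width: number; height: number }

// Scales OCR boxes from the source image size to the size the page is displayed at.
function toOverlay(
  words: OcrWord[],
  imageW: number,
  imageH: number,
  shownW: number,
  shownH: number,
): SpanBox[] {
  const sx = shownW / imageW;
  const sy = shownH / imageH;
  return words.map((w) => ({
    text: w.text,
    left: w.rect.left * sx,
    top: w.rect.top * sy,
    width: (w.rect.right - w.rect.left) * sx,
    height: (w.rect.bottom - w.rect.top) * sy,
  }));
}
```

Each `SpanBox` would then become a span with `position: absolute` and transparent text inside a container laid over the canvas, which gives the highlighter real text nodes to point at.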

It's a lot of machinery to honor one rule. But it's the rule that makes the product the product, so it's worth it.

(Funny epilogue: my analytics once showed Kindle "reading sessions" lasting 21 minutes and I nearly "fixed a performance bug." Turned out a single click auto-turns pages and keeps reading — those 21 minutes were someone happily listening to a whole chapter. The metric was lying, not the product. Always check whether the bug is in the code or the ruler.)

What I'd tell past me

The MP3 is 5% of the work. Sync is the product.
Read the rendered DOM. Don't be a crawler in a browser's clothing.
Geometry is more honest than class names.
Pick the one rule you won't break, write it on the wall, and let it kill your clever fallbacks.
Your metrics can be wrong in the same direction your assumptions are. Verify before you "fix."

The extension is called CastReader, if you want to see the dual-channel thing in action — it's on the Chrome Web Store, free to start. But honestly, whether or not you try it, the lesson that travels is the boring one: figure out the single thing that makes your project itself, and defend it harder than seems reasonable.

Everything else is just an MP3.
