[SOLVED] How to highlight text on a website in real time as the audio narrates it

I am trying to figure out which technology to use to highlight text in sync with audio, much like what https://speechify.com/ does.

Assume I can already run a TTS algorithm and convert the text to speech. I have looked at multiple sources, but I cannot pinpoint the technology or method used to highlight the text as the audio speaks it.

Any help would be much appreciated. I have already wasted 2 days on the internet to figure this out but no luck :(

Solution

A simple approach is to listen for the 'boundary' event that a SpeechSynthesisUtterance emits and highlight words with vanilla JS. The event carries character indices into the spoken text, so no need to go crazy with regexes or super AI stuff :)

Before anything else, make sure the API is available:

const synth = window.speechSynthesis
if (!synth) {
  // A bare `return` is only valid inside a function, so throw at the top level instead
  throw new Error('no tts for you!')
}

The TTS utterance emits a 'boundary' event each time it reaches a new word, and we can use it to highlight text (highlight is defined in the full example further down):

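// Element whose text will be read aloud
const text = document.getElementById('text')
// Keep the plain text so every highlight starts from the unmodified string
const originalText = text.innerText
const utterance = new SpeechSynthesisUtterance(originalText)
utterance.addEventListener('boundary', event => {
  // charIndex is where the current word starts; charLength is its length
  const { charIndex, charLength } = event
  text.innerHTML = highlight(originalText, charIndex, charIndex + charLength)
})
synth.speak(utterance)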
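
One thing to watch for: some browser and voice combinations may not populate charLength on the boundary event, in which case charIndex + charLength becomes NaN and the highlight breaks. Here is a minimal sketch of a fallback, assuming a word ends at the next whitespace character. wordRange is a hypothetical helper, not part of the Web Speech API:

```javascript
// Hypothetical helper (not part of the Web Speech API): returns the
// { from, to } range of the word that starts at charIndex.
function wordRange(text, charIndex, charLength) {
  // Clamp the start so it always falls inside the string
  const from = Math.max(0, Math.min(charIndex, text.length))
  // Use the reported length when the browser provides a positive one
  if (Number.isFinite(charLength) && charLength > 0) {
    return { from, to: Math.min(from + charLength, text.length) }
  }
  // Otherwise assume the word runs up to the next whitespace character
  const offset = text.slice(from).search(/\s/)
  return { from, to: offset === -1 ? text.length : from + offset }
}
```

In the listener this could replace the manual sum, e.g. const { from, to } = wordRange(originalText, event.charIndex, event.charLength).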

Full example:

const btn = document.getElementById("btn")

// Wrap the characters from `from` up to (not including) `to` in a highlighted span
const highlight = (text, from, to) => {
  const replacement = highlightBackground(text.slice(from, to))
  return text.slice(0, from) + replacement + text.slice(to)
}
// Markup used for the highlighted word
const highlightBackground = sample => `<span style="background-color:yellow;">${sample}</span>`

btn && btn.addEventListener('click', () => {
  const synth = window.speechSynthesis
  if (!synth) {
    console.error('no tts')
    return
  }
  let text = document.getElementById('text')
  let originalText = text.innerText
  let utterance = new SpeechSynthesisUtterance(originalText)
  utterance.addEventListener('boundary', event => {
    const { charIndex, charLength } = event
    text.innerHTML = highlight(originalText, charIndex, charIndex + charLength)
  })
  synth.speak(utterance)
})
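
Because the highlighted string is assigned to innerHTML, any <, > or & in the original text would be parsed as markup. Below is a small sketch that escapes each piece separately, so the character offsets from the event still line up with the unescaped text. escapeHtml and highlightEscaped are hypothetical names, and the entity list covers only the common cases:

```javascript
// Replace the characters that HTML treats as markup with entities
const escapeHtml = str =>
  str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')

// Hypothetical variant of highlight(): slices first, then escapes each
// piece, so from/to still refer to positions in the unescaped text
const highlightEscaped = (text, from, to) =>
  escapeHtml(text.slice(0, from)) +
  `<span style="background-color:yellow;">${escapeHtml(text.slice(from, to))}</span>` +
  escapeHtml(text.slice(to))
```

It is meant as a drop-in for highlight in the listener above, with the same arguments.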

This is pretty basic, and you can (and should) improve it.
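
As one possible improvement, the listener could ignore sentence boundaries (the boundary event's name property is 'word' or 'sentence') and clear the highlight once speech ends or errors out. This is a rough sketch only: trackWords, onWord and onDone are hypothetical names, and the utterance can be anything that supports addEventListener:

```javascript
// Hypothetical helper: forwards word boundaries to onWord and calls
// onDone once when the utterance ends or fails.
function trackWords(utterance, onWord, onDone) {
  const handleBoundary = event => {
    // Skip sentence boundaries; a missing name is treated as a word
    if (event.name && event.name !== 'word') return
    onWord(event.charIndex, event.charLength)
  }
  const finish = () => {
    // Detach everything so late events cannot re-apply a highlight
    utterance.removeEventListener('boundary', handleBoundary)
    utterance.removeEventListener('end', finish)
    utterance.removeEventListener('error', finish)
    onDone()
  }
  utterance.addEventListener('boundary', handleBoundary)
  utterance.addEventListener('end', finish)
  utterance.addEventListener('error', finish)
}
```

With the vanilla example, onWord would set text.innerHTML to the highlighted string and onDone would restore text.textContent to originalText before or after calling synth.speak(utterance).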

Edit

Oops, I forgot that this was tagged as ReactJS. Here's the same example with React (CodeSandbox link is in the comments):

import React from "react";

const ORIGINAL_TEXT =
  "Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.";

const splitText = (text, from, to) => [
  text.slice(0, from),
  text.slice(from, to),
  text.slice(to)
];

const HighlightedText = ({ text, from, to }) => {
  const [start, highlight, finish] = splitText(text, from, to);
  return (
    <p>
      {start}
      <span style={{ backgroundColor: "yellow" }}>{highlight}</span>
      {finish}
    </p>
  );
};

export default function App() {
  const [highlightSection, setHighlightSection] = React.useState({
    from: 0,
    to: 0
  });
  const handleClick = () => {
    const synth = window.speechSynthesis;
    if (!synth) {
      console.error("no tts");
      return;
    }

    let utterance = new SpeechSynthesisUtterance(ORIGINAL_TEXT);
    utterance.addEventListener("boundary", (event) => {
      const { charIndex, charLength } = event;
      setHighlightSection({ from: charIndex, to: charIndex + charLength });
    });
    synth.speak(utterance);
  };

  return (
    <div className="App">
      <HighlightedText text={ORIGINAL_TEXT} {...highlightSection} />
      <button onClick={handleClick}>klik me</button>
    </div>
  );
}