javascriptregexcouchdb

Convert HTML to plain text in JS without browser environment


I have a CouchDB view map function that generates an abstract of a stored HTML document (first x characters of text). Unfortunately I have no browser environment to convert HTML to plain text.

Currently I use this multi-stage regexp

html.replace(/<style([\s\S]*?)<\/style>/gi, ' ')
    .replace(/<script([\s\S]*?)<\/script>/gi, ' ')
    .replace(/(<(?:.|\n)*?>)/gm, ' ')
    .replace(/\s+/gm, ' ');

while it's a very good filter, it's obviously not a perfect one and some leftovers slip through sometimes. Is there a better way to convert to plain text without a browser environment?


Solution

  • Converter HTML to plain text like Gmail:

    html = html.replace(/<style([\s\S]*?)<\/style>/gi, '');
    html = html.replace(/<script([\s\S]*?)<\/script>/gi, '');
    html = html.replace(/<\/div>/ig, '\n');
    html = html.replace(/<\/li>/ig, '\n');
    html = html.replace(/<li>/ig, '  *  ');
    html = html.replace(/<\/ul>/ig, '\n');
    html = html.replace(/<\/p>/ig, '\n');
    html = html.replace(/<br\s*[\/]?>/gi, "\n");
    html = html.replace(/<[^>]+>/ig, '');
    

    If you can use jQuery :

    var html = jQuery('<div>').html(html).text();