I have a CouchDB view map function that generates an abstract of a stored HTML document (the first x
characters of its text). Unfortunately, there is no browser environment available to convert the HTML to plain text.
Currently I use this multi-stage regexp:
html.replace(/<style([\s\S]*?)<\/style>/gi, ' ')
.replace(/<script([\s\S]*?)<\/script>/gi, ' ')
.replace(/(<(?:.|\n)*?>)/gm, ' ')
.replace(/\s+/gm, ' ');
While it's a very good filter, it's obviously not a perfect one, and some leftovers occasionally slip through. Is there a better way to convert HTML to plain text without a browser environment?
Convert HTML to plain text the way Gmail does:
html = html.replace(/<style([\s\S]*?)<\/style>/gi, '');
html = html.replace(/<script([\s\S]*?)<\/script>/gi, '');
html = html.replace(/<\/div>/ig, '\n');
html = html.replace(/<\/li>/ig, '\n');
html = html.replace(/<li\b[^>]*>/ig, ' * '); // also matches <li> tags that carry attributes
html = html.replace(/<\/ul>/ig, '\n');
html = html.replace(/<\/p>/ig, '\n');
html = html.replace(/<br\s*[\/]?>/gi, "\n");
html = html.replace(/<[^>]+>/ig, '');
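The steps above can be bundled into a single function that also decodes entities and tidies whitespace, which covers some of the usual leftovers (`&amp;`, `&nbsp;`, comments). This is only a rough sketch and not a real HTML parser: the name `htmlToText` and the small entity table are my own assumptions, and it sticks to ES5 syntax on the assumption that the view server runs an older JavaScript engine.

```javascript
// Hypothetical helper (the name and the entity table are illustrative, not from any library).
function htmlToText(html) {
  // A deliberately small table; extend it with whatever entities your documents use.
  var entities = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: ' ' };

  return String(html)
    // Drop comments and the full contents of <style> and <script> blocks.
    .replace(/<!--[\s\S]*?-->/g, '')
    .replace(/<(style|script)\b[\s\S]*?<\/\1\s*>/gi, '')
    // Turn line breaks and block-level closing tags into newlines.
    .replace(/<br\s*\/?>/gi, '\n')
    .replace(/<\/(div|p|li|ul|ol|h[1-6]|tr)\s*>/gi, '\n')
    // Mark list items, including ones that carry attributes.
    .replace(/<li\b[^>]*>/gi, ' * ')
    // Strip whatever tags remain.
    .replace(/<[^>]+>/g, '')
    // Decode numeric entities and the named ones in the table; leave the rest untouched.
    .replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, function (match, name) {
      if (name.charAt(0) === '#') {
        var code = name.charAt(1).toLowerCase() === 'x'
          ? parseInt(name.slice(2), 16)
          : parseInt(name.slice(1), 10);
        // Code points above U+FFFF are left as-is to stay ES5-compatible.
        return code <= 0xFFFF ? String.fromCharCode(code) : match;
      }
      var key = name.toLowerCase();
      return entities.hasOwnProperty(key) ? entities[key] : match;
    })
    // Collapse runs of spaces/tabs, squeeze blank lines, and trim both ends.
    .replace(/[ \t]+/g, ' ')
    .replace(/ *\n\s*/g, '\n')
    .replace(/^\s+|\s+$/g, '');
}
```

For an abstract, something like `htmlToText(html).replace(/\s+/g, ' ').substring(0, x)` should give the first x characters on a single line. Malformed markup (for example a `>` inside an attribute value) can still trip it up, as with any regexp-based approach.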
If you can use jQuery:
var html = jQuery('<div>').html(html).text();