I need to convert an HTML string with nested tags like this one:
const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"
into an array of objects with the following structure:
const result = [{
    text: "Hello World",
    format: null
}, {
    text: "I am a text with",
    format: null
}, {
    text: "bold",
    format: ["strong"]
}, {
    text: " word",
    format: null
}, {
    text: "I am bold text with nested",
    format: ["strong"]
}, {
    text: "italic",
    format: ["strong", "em"]
}, {
    text: "Word.",
    format: ["strong"]
}];
I managed the conversion with DOMParser as long as there are no nested tags. With nested tags, as in the last paragraph, it does not work: the whole paragraph comes out as bold, but the word "italic" should be both bold and italic. I could not get a recursive solution working.
Any help would be appreciated.
This is the code I have written so far:
export interface Phrase {
    text: string;
    format: string | string[];
}
export class HTMLParser {
    public parse(text: string): void {
        const parser = new DOMParser();
        const sourceDocument = parser.parseFromString(text, "text/html");
        this.parseChildren(sourceDocument.body.childNodes);
        // HERE SHOULD BE the result
        console.log("RESULT of CONVERSION", this.phrasesProcessed);
    }
    public phrasesProcessed: Phrase[] = [];
    private parseChildren(toParse: NodeListOf<ChildNode>) {
        this.phrasesProcessed = [];
        try {
            Array.from(toParse)
                .map(item => {
                    if (item.nodeType === Node.ELEMENT_NODE && item instanceof HTMLElement) {
                        return Array.from(item.childNodes).map(child => ({ text: child.textContent, format: (child.nodeType === Node.ELEMENT_NODE && child instanceof HTMLElement) ? child.tagName : null }));
                    } else {
                        return Array.from(item.childNodes).map(child => ({ text: child.textContent, format: null }));
                    }
                })
                .filter(line => line.length) // only non-empty arrays
                .map(element => ([...element, { text: "\n", format: null }])) // add linebreak after each P
                .reduce((acc: (Phrase)[], val) => acc.concat(val), []) // flatten
                .forEach(
                    element => {
                        // console.log("ELEMENT", element);
                        this.phrasesProcessed.push(element);
                    }
                );
        } catch (e) {
            console.warn(e);
        }
    }
}
You can use recursion, and this seems a good case for a generator function. As it was not clear which tags should be retained in `format` (apparently, not `p`), I left this as a configuration to provide:
const formatTags = new Set(["b", "big", "code", "del", "em", "i", "pre", "s", "small", "strike", "strong", "sub", "sup", "u"]);
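function* iterLeafNodes(nodes, format=[]) {
    for (let node of nodes) {
        if (node.nodeType === Node.TEXT_NODE) {
            // Leaf: emit the text with a copy of the formats collected so far
            yield ({text: node.nodeValue, format: format.length ? [...format] : null});
        } else if (node.nodeType === Node.ELEMENT_NODE) { // skip comments and other node types
            const tag = node.tagName.toLowerCase();
            // Recurse, extending the format list only for formatting tags
            yield* iterLeafNodes(node.childNodes,
                                 formatTags.has(tag) ? format.concat(tag) : format);
        }
    }
}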
// Example input
const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"
const nodes = new DOMParser().parseFromString(strHTML, 'text/html').body.childNodes;
let result = [...iterLeafNodes(nodes)];
console.log(result);
Note that this will still split the text when it is spread over multiple tags that are not considered formatting tags, like `span`.
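If that split is unwanted, adjacent entries with identical formats could be merged afterwards. The following is only a sketch, assuming the `{text, format}` shape produced above; `sameFormat` and `mergeAdjacent` are hypothetical helper names, not part of the code above:

```javascript
// Hypothetical helper: compare two format values (null or an array of tag names).
function sameFormat(a, b) {
    const x = a ?? [];
    const y = b ?? [];
    return x.length === y.length && x.every((tag, i) => tag === y[i]);
}

// Hypothetical helper: merge neighbouring phrases whose formats are equal.
function mergeAdjacent(phrases) {
    const merged = [];
    for (const phrase of phrases) {
        const last = merged[merged.length - 1];
        if (last && sameFormat(last.format, phrase.format)) {
            last.text += phrase.text;
        } else {
            merged.push({ ...phrase }); // copy, so the input objects are not mutated
        }
    }
    return merged;
}

// "<p>Hello <span>big</span> World <strong>!</strong></p>" yields four leaf phrases:
const split = [
    { text: "Hello ", format: null },
    { text: "big", format: null },
    { text: " World ", format: null },
    { text: "!", format: ["strong"] },
];
console.log(mergeAdjacent(split));
// [{ text: "Hello big World ", format: null }, { text: "!", format: ["strong"] }]
```

Be aware that this sketch also merges across paragraph boundaries, so it would need an extra check if paragraphs must stay separate.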
Secondly, I'm not convinced that having `null` as a possible value for `format` is more useful than just an empty array `[]`, but the above produces `null` in that case anyway.
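To illustrate the difference for code that consumes the result, here is a small sketch. `wrapInTags` is a hypothetical helper (not part of the code above) that turns one phrase back into markup; it does no HTML escaping. Because `format` may be `null`, it needs a fallback that an always-present array would make unnecessary:

```javascript
// Hypothetical consumer: wrap the text in its format tags, outermost tag first.
function wrapInTags({ text, format }) {
    // The `?? []` fallback is only needed because format may be null.
    return (format ?? []).reduceRight(
        (inner, tag) => `<${tag}>${inner}</${tag}>`,
        text
    );
}

console.log(wrapInTags({ text: "italic", format: ["strong", "em"] }));
// <strong><em>italic</em></strong>
console.log(wrapInTags({ text: "Hello World", format: null }));
// Hello World
```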
In comments you ask for the insertion of a line break (`\n`) after each `p` element. The code below will generate that extra element. Here I also used `[]` instead of `null` for `format`:
const formatTags = new Set(["b", "big", "code", "del", "em", "i", "pre", "s", "small", "strike", "strong", "sub", "sup", "u"]);
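function* iterLeafNodes(nodes, format=[]) {
    for (let node of nodes) {
        if (node.nodeType === Node.TEXT_NODE) {
            // Leaf: emit the text with a copy of the formats collected so far
            yield ({text: node.nodeValue, format: [...format]});
        } else if (node.nodeType === Node.ELEMENT_NODE) { // skip comments and other node types
            const tag = node.tagName.toLowerCase();
            yield* iterLeafNodes(node.childNodes,
                                 formatTags.has(tag) ? format.concat(tag) : format);
            // Emit a line break entry once a paragraph has been fully processed
            if (tag === "p") yield ({text: "\n", format: [...format]});
        }
    }
}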
// Example input
const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"
const nodes = new DOMParser().parseFromString(strHTML, 'text/html').body.childNodes;
let result = [...iterLeafNodes(nodes)];
console.log(result);
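As a usage example, the `"\n"` entries can serve as separators to regroup the flat list into one array per paragraph. This is a sketch only: `groupIntoLines` is a hypothetical helper, the `phrases` array is what the second version yields for the example input, and the sketch assumes no regular text node consists of exactly `"\n"`:

```javascript
// Output of the second version for the example input (format is [] instead of null).
const phrases = [
    { text: "Hello World", format: [] },
    { text: "\n", format: [] },
    { text: "I am a text with ", format: [] },
    { text: "bold", format: ["strong"] },
    { text: " word", format: [] },
    { text: "\n", format: [] },
    { text: "I am bold text with nested ", format: ["strong"] },
    { text: "italic", format: ["strong", "em"] },
    { text: " Word.", format: ["strong"] },
    { text: "\n", format: [] },
];

// Hypothetical helper: close the current line at every "\n" marker entry.
function groupIntoLines(items) {
    const lines = [];
    let current = [];
    for (const item of items) {
        if (item.text === "\n") {
            lines.push(current);
            current = [];
        } else {
            current.push(item);
        }
    }
    if (current.length) lines.push(current); // trailing entries without a final marker
    return lines;
}

const lines = groupIntoLines(phrases);
console.log(lines.map(line => line.map(p => p.text).join("")));
// ["Hello World", "I am a text with bold word", "I am bold text with nested italic Word."]
```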