I want to trim the space between the tag name and the attribute using StringUtils.strip(), because I have some spaces that cannot be removed by the following Jericho methods:
The first method removes the normal space but not the other-language space. This is the error I am getting, for example:
html = "<a href=\"test.html\"><font></font></a>";
StartTag a at (r1,c1,p0) rejected because the name contains an invalid character at position (r1,c3,p2)
Encountered possible StartTag at (r1,c1,p0) whose content does not match a registered StartTagType
There is also a generateHTML method in Jericho, but it requires us to provide all the attribute values, etc.:
public static java.lang.String generateHTML(java.util.Map<java.lang.String,java.lang.String> attributesMap)
In a full sequential parse, Jericho does not recognise the other-language space.
How can I remove the other-language space ONLY between the tag name and the attribute? (An other-language space inside an attribute value is OK, which is why I cannot simply call String.replaceAll() on every such space.)
You can use String.replaceAll().
String html = "<a\u3000href=\"test.html\"> <font></font></a>";
System.out.println(html.replaceAll("(?<=<\\w{1,100})[\\s\\u3000]+", " "));
// -> <a href="test.html"> <font></font></a>
This code replaces every run of whitespace, including \u3000 (ideographic space), with a single space. The whitespace must be preceded by <ELEMENT_NAME, but that preceding part is not itself replaced (see "zero-width positive lookbehind" in the Pattern class documentation). The length of ELEMENT_NAME is limited to between 1 and 100 characters in this code, since Java's lookbehind needs a bounded maximum length.
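If the "other language space" is not always \u3000, the same lookbehind idea can be widened a little. The following is only a sketch under assumptions: the class name TagSpaceNormalizer and the method normalize are made up for illustration, and using \p{Z} (Unicode separators, which cover the ideographic space and the no-break space) is my own extension of the pattern above, not something Jericho provides.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class name and the \p{Z} extension are assumptions.
final class TagSpaceNormalizer {

    // Lookbehind: "<" followed by 1..100 word characters (the tag name); not consumed.
    // Match: a run of ASCII whitespace or Unicode separators (\p{Z} covers U+3000 and U+00A0).
    private static final Pattern SPACE_AFTER_TAG_NAME =
            Pattern.compile("(?<=<\\w{1,100})[\\s\\p{Z}]+");

    // Collapses the whitespace run right after a tag name into one ordinary space.
    static String normalize(String html) {
        return SPACE_AFTER_TAG_NAME.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Ideographic space after the tag name and inside the attribute value.
        String html = "<a\u3000href=\"te\u3000st.html\"> <font></font></a>";
        // Only the space after "<a" is replaced; the one inside the value is kept.
        System.out.println(normalize(html));
    }
}
```

Because the lookbehind uses \w, tag names containing a hyphen or a colon would not be covered; the character class would need adjusting for those. The normalized string can then be handed to the parser as usual.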