[SOLVED] Which encoding by DOMParsing (Antisamy)

I'm using AntiSamy in a new project to prevent XSS vulnerabilities. In the application, a user can upload content via a simple (ANSI-encoded) Excel file. It should be possible to insert some HTML, but not JavaScript and the like.

When I scan my input with AntiSamy, I get this error: The a tag contained an attribute that we could not process. The href attribute had a value of "https& # 5 8 ;& # 4 7 ;& # 4 7 ;bla& # 4 6 ;bla& # 4 6 ;com& # 4 7 ;bla& # 4 7 ;...". This value could not be accepted for security reasons. We have chosen to filter the a tag in order to continue processing the input.

I added spaces inside the entities so that they are displayed here as written.

(The value should be https://bla.bla.com/bla/...)

When I debug through the code, the tainted HTML input and its href attribute seem to be correct (so there is no problem with the encoding of the Excel file).

The antisamy-policy file looks like this:

...
<regexp name="onsiteURL" value="([\w\\/\.\?=&amp;;#-~]+|#(\w)+)"/>
<regexp name="offsiteURL" value="(\s)*((ht|f)tp(s?)://|mailto:)[A-Za-z0-9]+[~a-zA-Z0-9-_\.@#$%&amp;;:,\?=/\+!]*(\s)*"/>
...
<attribute name="href">
  <regexp-list>
    <regexp name="onsiteURL"/>
    <regexp name="offsiteURL"/>
  </regexp-list>
  <literal-list>
    <literal value="javascript:void(0)"/>
  </literal-list>
</attribute>
...

I also tested the regex patterns, and as I expected the plain link is valid. Of course it is not valid when it is encoded with HTML entities.

So what's the problem?

Thanks a lot in advance

I debugged through the AntiSamy code a bit and now I see the problem, but I still can't fix it. The HTML entities are added by AntiSamy AFTER validation (for the case where the value is printed on an HTML page). But my input is parsed by org.cyberneko.html.parsers.DOMFragmentParser in the AntiSamy library with this statement:

parser.parse(new InputSource(new StringReader(html)), dom);

After that, the href attribute of my a tag contains something like

https://bla.bla.com/bla?frame=Frameset[undefinable character]lang=en

instead of

https://bla.bla.com/bla?frame=Frameset&lang=en

So it seems to be an encoding problem: the ampersand does not stay an ampersand. How can I find out which encoding I should use?

Edit: The character is E2 8C A9 -> ⟨
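
To make the observation above concrete, here is a minimal sketch in plain Java (no AntiSamy or NekoHTML involved). It decodes the bytes E2 8C A9 as UTF-8, which gives U+2329, the code point that HTML 4 assigns to the named entity "&lang;". That suggests, though the post does not confirm it, that the parser read "&lang" in "&lang=en" as an entity reference rather than as a query parameter. The sketch then checks the intended and the parsed href against the two regular expressions from the policy. The class and method names are made up for illustration, and the assumption is that each pattern must match the whole value.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class AmpLangCheck {
    // Patterns copied from the policy file, with the XML escape &amp; written as a plain '&'.
    static final Pattern ONSITE_URL =
        Pattern.compile("([\\w\\\\/\\.\\?=&;#-~]+|#(\\w)+)");
    static final Pattern OFFSITE_URL = Pattern.compile(
        "(\\s)*((ht|f)tp(s?)://|mailto:)[A-Za-z0-9]+[~a-zA-Z0-9-_\\.@#$%&;:,\\?=/\\+!]*(\\s)*");

    // Decodes the given byte values as UTF-8.
    static String decodeUtf8(int... bytes) {
        byte[] raw = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            raw[i] = (byte) bytes[i];
        }
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Assumption: an href is accepted if either pattern matches the whole value.
    static boolean hrefAccepted(String href) {
        return ONSITE_URL.matcher(href).matches() || OFFSITE_URL.matcher(href).matches();
    }

    public static void main(String[] args) {
        String odd = decodeUtf8(0xE2, 0x8C, 0xA9);
        System.out.printf("U+%04X%n", odd.codePointAt(0)); // U+2329

        String intended = "https://bla.bla.com/bla?frame=Frameset&lang=en";
        String parsed = "https://bla.bla.com/bla?frame=Frameset" + odd + "=en";
        System.out.println(hrefAccepted(intended)); // true
        System.out.println(hrefAccepted(parsed));   // false
    }
}
```

If this reading is right, it is not a character-encoding problem of the Excel file at all: the unescaped ampersand followed by "lang" is what gets replaced.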

Solution

I've worked around it by replacing each "&" with "&amp ;" (without the space) before scanning. I don't know why, but it works. This is the only character that doesn't work properly.
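
A sketch of how such a replacement could look in Java. This is a variation on the workaround, not the exact code used: instead of replacing every "&", it escapes only ampersands that do not already start a complete, semicolon-terminated entity reference, so input that already contains entities is not double-escaped. The class name, method name and regular expression are assumptions for illustration. The result would then be handed to AntiSamy in place of the raw input.

```java
import java.util.regex.Pattern;

final class BareAmpersandEscaper {
    // Matches '&' that is NOT followed by a complete entity reference such as
    // "amp;", "#58;" or "#x3A;" (hypothetical pattern, chosen for this sketch).
    private static final Pattern BARE_AMP = Pattern.compile(
        "&(?!(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)");

    // Escapes bare ampersands so that an HTML parser cannot read them as entity starts.
    static String escapeBareAmpersands(String html) {
        return BARE_AMP.matcher(html).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String tainted = "<a href=\"https://bla.bla.com/bla?frame=Frameset&lang=en\">link</a>";
        // The "&lang=en" part becomes "&amp;lang=en"; existing entities stay untouched.
        System.out.println(escapeBareAmpersands(tainted));
        System.out.println(escapeBareAmpersands("a &amp; b &#58; c"));
    }
}
```

Note that a reference written with its semicolon, such as a literal "&lang;", is left alone on purpose, because in that form it is a legitimate entity.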