amazon-web-services amazon-cloudsearch

Removing invalid characters from Amazon CloudSearch SDF


While trying to post data extracted from a PDF file to an Amazon CloudSearch domain for indexing, the indexing failed because of invalid characters in the data.

How can I remove these invalid characters before posting to the search endpoint?

I tried escaping and replacing the characters, but that didn't work.


Solution

  • I fixed the problem using a solution I found elsewhere (Python 2), which replaces characters that are illegal in XML:

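    import re

    # Python 2 code (uses unichr). The pattern matches control characters that
    # are illegal in XML, U+FFFE/U+FFFF, and sequences with an unpaired surrogate.
    RE_XML_ILLEGAL = u'([\u0000-\u0008\u000b-\u000c\u000e-\u001f\ufffe-\uffff])' + \
                     u'|' + \
                     u'([%s-%s][^%s-%s])|([^%s-%s][%s-%s])|([%s-%s]$)|(^[%s-%s])' % \
                      (unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff),
                       unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff),
                       unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff))
    x = u"<foo>text\u001a</foo>"
    # Replace every illegal match with "?" before posting the document.
    x = re.sub(RE_XML_ILLEGAL, "?", x)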
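
The snippet above targets Python 2. For Python 3, where `unichr` no longer exists and strings are sequences of code points, a minimal sketch of the same idea is to keep only the characters allowed by the XML 1.0 `Char` production and drop (or replace) everything else. As far as I understand the CloudSearch documentation, that is the set of characters a document batch may contain, but treat that as an assumption and check it against your own failing data. The helper name `strip_xml_illegal` and its `replacement` parameter are hypothetical, not part of any AWS SDK.

```python
import re

# Matches any code point outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# This covers control characters, lone surrogates, U+FFFE and U+FFFF.
_XML_ILLEGAL = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_illegal(text, replacement=""):
    """Hypothetical helper: replace characters that are not valid in XML 1.0."""
    return _XML_ILLEGAL.sub(replacement, text)


cleaned = strip_xml_illegal("<foo>text\u001a</foo>")
print(cleaned)  # <foo>text</foo>
```

Run this on every text field extracted from the PDF before building the batch. Passing `replacement="?"` mirrors the behaviour of the Python 2 version, while the default simply removes the offending characters.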