i am trying to parse rss/atom-feeds in my rails app, but i encountered some serious problems with non-ASCII characters, eg. the german umlauts ÄÖÜ or ß. Some feeds in the wild use proper UTF-8, but some others make me cry. The general Problem is:
I must be able to parse any Feeds, whatever encoding they might have. The "loss" of characters is not an option (though its my current status), because i do some text and language analysis with the feed-items.
What i use so far:
"Ä"
which means "Ä"The problem is, that i literally have no idea what i'm doing wrong.
def self.sanitize_encoding_and_htmlentities str
cd = CharDet.detect str
s = str.encode(:invalid => :replace, :undef => :replace, :replace => '')
coder = HTMLEntities.new
coder.decode(s)
end
This is my current sanitize method. As sample-feed i use
http://www.N24.de/2/index.rss
So far, the "special" characters got replaced completely. This is the only variant i found which just works without raising an error due to invalid byte stuff. I changed the encode method slightly, because i read in the ruby doc that without any encoding given, the encode method should "translate" to the given default_internal Encoding of the app, which is utf-8 in my case. CharDet stands there just for possible changes to anything related, might be useful.
I used the magic_encoding gem, so every file in my project should have the comment on the first line. My database is sqlite3 with utf-8.
As of 2012, is there anything i should look at? Did i make anything really wrong?
Thanks for help!
EDIT: The feeds may be rss of any kind, atom, and/or just invalid XML. The Encoding may be UTF-8, something different, or just says "utf-8" while its some windows-XXX stuff, and so on. I really need a solution for this alltogether.
Also the fetching/parsing must be as fast as possible, that's why i picked feedzirra.
My current Idea is to get the feedcontent, replace every char in the "title" and "description" nodes with htmlentities if possible, use the encode! method to switch to utf-8, and then unescape the htmlentities. After this, special characters should be keeped i think, but i can't get something like this working at the moment. Might this be a good approach?
Finally i found the main Problem:
Feedzirra already returns UTF-8 when accessing entries and their attributes. But i used the sanitize method to access attributes, which returns ASCII-8BIT and weird characters escaped as html-entities.
However, i kicked all the sanitizing and encoding stuff out of my code, and now it just works. Seems that FeedZirra has something built in to transcode the feeds if neccessary.