unit-testingcharacter-encodingxunit

Is there a set of "Lorem ipsums" files for testing character encoding issues?


For layouting we have our famous "Lorem ipsum" text to test how it looks like.

What I am looking for is a set of files containing Text encoded with several different encodings that I can use in my JUnit tests to test some methods that are dealing with character encoding when reading text files.

Example

Having a ISO 8859-1 encoded test-file and a Windows-1252 encoded test-file. The Windows-1252 have to trigger the differences in region 8016 – 9F16. In other words it must contain at least one character of this region to distinguish it from ISO 8859-1.

Maybe the best set of test-files is that where the test-file for each encoding contains all its characters once. But maybe I am not aware of sth - we all like this encoding stuff, right? :-)

Is there such a set of test-files for character-encoding issues out there?


Solution

  • How about trying to use the ICU test suite files? I don't know if they are what you need for your test, but they seem to have pretty complete from/to UTF mapping files at least: Link to the repo for ICU test files