I am about to use an algorithm to encode a variable-length but very long String field retrieved from an XML file; the encoded data should then be persisted in the database.
Later, when I receive a second file, I need to fetch the previously stored encoded data from the database, decode it, and compare it with the new data to detect duplicates.
I tried the org.apache.commons.codec.binary.Base64 class.
It has two methods:
encodeBase64(byte[] barray)
decodeBase64(String str)
which work perfectly fine and solve my problem. But it converts a 55-char string to just a 6-char String.
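For reference, here is a minimal round-trip sketch of what I am doing, using the JDK's built-in java.util.Base64 (standard since Java 8, equivalent in behavior to the commons-codec methods); the sample string is just illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        String original = "a variable length String from an XML file";

        // Encode: String -> UTF-8 bytes -> Base64 text (this is what gets persisted)
        String encoded = Base64.getEncoder()
                .encodeToString(original.getBytes(StandardCharsets.UTF_8));

        // Decode: Base64 text -> bytes -> String (recovers the exact original)
        String decoded = new String(Base64.getDecoder().decode(encoded),
                StandardCharsets.UTF_8);

        System.out.println(original.equals(decoded)); // round trip is lossless
        // Note: Base64 output is ~4/3 the input size, so it is longer, not shorter
        System.out.println(encoded.length() > original.length());
    }
}
```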
So I wonder whether there is any case where this algorithm encodes two very large Strings that differ by only one character (for example) into the same encoded byte array.
I do not know much about the Base64 class, but if anyone can help me out it would be really helpful.
If you can suggest any other algorithm that shortens a large String to a fixed length and solves my purpose, I will be happy to use it.
Thanks in advance.
Not very efficient. Also, using sun.misc classes gives a non-portable application.
Check out the performance comparisons published by the MiGBase64 project:
So I wonder whether there is any case where this algorithm encodes two very large Strings that differ by only one character (for example) into the same encoded byte array.
Base64 isn't a hashing algorithm; it's an encoding, and must therefore be bi-directional. By necessity, collisions cannot occur, otherwise decoding would be non-deterministic. Base64 is designed to represent arbitrary binary data in an ASCII string. Encoding a Unicode string as Base64 will often increase its length, since the Unicode character set can require multiple bytes per character. The Base64 representation of a Unicode string will also vary depending on the character encoding (UTF-8, UTF-16) used. For example:
Base64( UTF8( "test" ) ) => "dGVzdA=="
Base64( UTF16( "test" ) ) => "/v8AdABlAHMAdA=="
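You can reproduce both values with the JDK's java.util.Base64 (Java's "UTF-16" charset emits a big-endian byte-order mark, which is where the /v8 prefix comes from):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class CharsetMatters {
    public static void main(String[] args) {
        Base64.Encoder enc = Base64.getEncoder();
        // Same string, two byte representations, two different Base64 values
        System.out.println(enc.encodeToString("test".getBytes(StandardCharsets.UTF_8)));
        // -> dGVzdA==
        System.out.println(enc.encodeToString("test".getBytes(StandardCharsets.UTF_16)));
        // -> /v8AdABlAHMAdA==  (BOM FE FF followed by UTF-16BE code units)
    }
}
```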
Solution 1
Use lossless compression
GZip( UTF8( "test" ) )
Here you are converting the string to a byte array and using lossless compression to reduce the number of bytes you have to store. You can vary the character encoding and compression algorithm to reduce the number of bytes, depending on the Strings you will be storing (i.e., if they are mostly ASCII then UTF-8 will probably be best).
Pros: no collisions, and the original string can be recovered.
Cons: the bytes required to store the value are variable in number, and larger than a fixed-size hash.
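A minimal sketch of this approach with the JDK's built-in GZIP streams (class and method names here are illustrative; requires Java 11+ for `repeat` and `readAllBytes`):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipStrings {
    // Compress: String -> UTF-8 bytes -> gzip bytes (store these in the DB)
    static byte[] compress(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Decompress: gzip bytes -> original String (lossless, no collisions possible)
    static String decompress(byte[] data) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String original = "a long repetitive string ".repeat(40);
        byte[] stored = compress(original);
        System.out.println(stored.length < original.length()); // smaller for repetitive input
        System.out.println(decompress(stored).equals(original)); // round trip is exact
    }
}
```

Note that the compressed size depends heavily on the input: highly repetitive text shrinks a lot, while short or random text may not shrink at all.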
Solution 2
Use a hashing algorithm
SHA256( UTF8( "test" ) )
Here you are converting the string to a fixed-length set of bytes with a hashing function. Hashing is uni-directional, and by its nature collisions are possible. However, based on the profile and number of Strings that you expect to process, you can select a hash function to minimise the likelihood of collisions.
Pros: the bytes required to store the value are fixed in number, and small.
Cons: collisions are possible, and the original string cannot be recovered.
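A minimal sketch of this approach using the JDK's MessageDigest (the class name and sample inputs here are illustrative). For your duplicate check, you would store the 32-byte digest and compare it against the digest of the newly received data:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256Fingerprint {
    // Hash: String -> fixed 32-byte digest (store this for duplicate detection)
    static byte[] sha256(String s) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("SHA-256")
                .digest(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] a = sha256("a very long string pulled from an XML file");
        byte[] b = sha256("a very long string pulled from an XML filE"); // one-char difference

        System.out.println(a.length); // always 32 bytes, regardless of input length
        // Constant-time comparison; different inputs yield different digests here
        System.out.println(MessageDigest.isEqual(a, b));
    }
}
```

With SHA-256, a one-character difference produces a completely different digest, so the scenario in the question (two near-identical large Strings mapping to the same stored value) is astronomically unlikely, though not strictly impossible.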