I have some legacy code I am dealing with (so no I can't just use a URL with an encoded filename component) that allows a user to download a file from our website. Since our filenames are often in many different languages they are all stored as UTF-8. I wrote some code to handle the RFC5987 conversion to a proper filename* parameter. This works great until I have a filename with non-ascii characters and spaces. Per RFC, the space character is not part of attr_char so it gets encoded as %20. I have new versions of Chrome as well as Firefox and they are all converting to %20 to + on download. I have tried not encoding the space and putting the encoded filename in quotes and get the same result. I have sniffed the response coming from the server to verify that the servlet container wasn't mucking with my headers and they look correct to me. The RFC even has examples that contain %20. Am I missing something, or do all of these browsers have a bug related to this?
Many thanks in advance. The code I use to encode the filename is below.
Peter
public static boolean bcsrch(final char[] chars, final char c) {
final int len = chars.length;
int base = 0;
int last = len - 1; /* Last element in table */
int p;
while (last >= base) {
p = base + ((last - base) >> 1);
if (c == chars[p])
return true; /* Key found */
else if (c < chars[p])
last = p - 1;
else
base = p + 1;
}
return false; /* Key not found */
}
public static String rfc5987_encode(final String s) {
final int len = s.length();
final StringBuilder sb = new StringBuilder(len << 1);
final char[] digits = {'0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'};
final char[] attr_char = {'!','#','$','&','\'','+','-','.','0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','^','_','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','|', '~'};
for (int i = 0; i < len; ++i) {
final char c = s.charAt(i);
if (bcsrch(attr_char, c))
sb.append(c);
else {
final char[] encoded = {'%', 0, 0};
encoded[1] = digits[0x0f & (c >>> 4)];
encoded[2] = digits[c & 0x0f];
sb.append(encoded);
}
}
return sb.toString();
}
Update
Here is a screen shot of the download dialog I get for a file with Chinese characters with spaces as mentioned in my comment.
So as Julian pointed out in the comments, I made a rookie Java error and forgot to do my character to byte conversion (thus I encoded the character's codepoint instead of the character's byte representation), hence the encoding was completely incorrect. This is clearly mentioned as a requirement in RFC 5987. I will be posting corrected code for doing the conversion. Once the encoding is correct, the filename* parameter is recognized properly by the browser and the filename used for the download is correct.
Below is the corrected escaping code which operates on the UTF-8 bytes of the string. The filename that was giving me trouble, now properly encoded looks like this:
Content-Disposition:attachment; filename*=UTF-8''Museum%20%E5%8D%9A%E7%89%A9%E9%A6%86.jpg
public static String rfc5987_encode(final String s) throws UnsupportedEncodingException {
final byte[] s_bytes = s.getBytes("UTF-8");
final int len = s_bytes.length;
final StringBuilder sb = new StringBuilder(len << 1);
final char[] digits = {'0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'};
final byte[] attr_char = {'!','#','$','&','+','-','.','0','1','2','3','4','5','6','7','8','9', 'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','^','_','`', 'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','|', '~'};
for (int i = 0; i < len; ++i) {
final byte b = s_bytes[i];
if (Arrays.binarySearch(attr_char, b) >= 0)
sb.append((char) b);
else {
sb.append('%');
sb.append(digits[0x0f & (b >>> 4)]);
sb.append(digits[b & 0x0f]);
}
}
return sb.toString();
}