curl browser utf-8 jax-rs byte-order-mark

Preserve UTF-8 BOM in Browser Downloads

I have a JAX-RS REST service that produces a CSV file and streams it back to the browser. Everything is set to UTF-8, so the file I download via the browser is a valid UTF-8 file (without a BOM) that shows valid, readable umlauts and other non-ASCII characters in Notepad++, Sublime, and similar editors.

Opening such a file in Excel though leads to unreadable umlauts, etc. since Excel apparently tries to open it with another charset (CP-1252, I guess, but that doesn't really matter).

Saving the file with a BOM via Notepad++ and re-opening it in Excel works nicely. Seems like the detection of a BOM is the only way that Excel uses to detect UTF-8. Anyways - I thought that adding a BOM could help...

Did that. Same result. After a while, I figured out that the BOM gets removed under some circumstances: If I added any character right before the BOM, I could see the BOM in my Hex-Editor. After removal of that character, the BOM wouldn't be there anymore.

When I went on and downloaded the file via cURL, I was really surprised. The BOM was there! Up until then I thought it might have to do with my application, Content-Types, encodings, HTTP headers, etc. - but all of them seem to be fine.

Now, after hours of trying out different things, any ideas on how I can tell the browser to keep the BOM? Is there any HTTP Header I could set? Since Chrome, Internet Explorer, Edge, Firefox all remove the BOM, this sounds a little bit like a browser convention to me...

Many thanks for your highly appreciated help!

EDIT: Thanks to sideshowbarker's answer, I found a workaround: prepending two BOMs to the content, so that one BOM remains after the browser removes the first.

Solution

Workaround (from comments): Since only the first three bytes are checked for a BOM, you can prepend two BOMs to the source. The browser strips the first one, and the downloaded file is valid UTF-8 with a single BOM.
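
A minimal sketch of that workaround in Java, assuming the CSV is available as a `String` before it is written to the response. The class name `DoubleBomCsv` and the method `withDoubleBom` are hypothetical, not part of JAX-RS or of the original code; the idea is only to show which bytes end up at the start of the payload.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class DoubleBomCsv {
    // UTF-8 encoding of U+FEFF, the byte order mark.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Encodes the CSV text as UTF-8 and prepends the BOM twice,
    // so that one BOM is left if the first one is stripped.
    static byte[] withDoubleBom(String csv) {
        byte[] body = csv.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(body.length + 6);
        out.write(UTF8_BOM, 0, UTF8_BOM.length);
        out.write(UTF8_BOM, 0, UTF8_BOM.length);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Sample data is made up; \u00f6 is the umlaut o.
        byte[] payload = withDoubleBom("name;city\nJ\u00f6rg;K\u00f6ln\n");
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 6; i++) {
            hex.append(String.format("%02X ", payload[i] & 0xFF));
        }
        System.out.println(hex.toString().trim()); // prints "EF BB BF EF BB BF"
    }
}
```

In a JAX-RS resource, the resulting byte array would be returned (or written to the output stream) as the response entity in place of the plain UTF-8 bytes. Clients that do not strip anything, such as cURL in the question, will then see both BOMs.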

As for Excel specifically: per the answer at https://stackoverflow.com/a/16766198/1143392, newer versions of Excel (from Office 365) now support UTF-8.

As for the cause of the behavior described in the question: the relevant specs require the BOM to be stripped out, and that’s what browsers do. That is, browsers conform to the requirements of the UTF-8 decode algorithm in the Encoding spec, which is this:

To UTF-8 decode a byte stream stream, run these steps:

1. Let buffer be an empty byte sequence.
2. Read three bytes from stream into buffer.
3. If buffer does not match 0xEF 0xBB 0xBF, prepend buffer to stream.
4. Let output be a code point stream.
5. Run UTF-8’s decoder with stream and output.
6. Return output.

Step 3 is what causes the BOM to be stripped.
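
The BOM check from the first three steps can be sketched in Java roughly as follows. This is only an illustration of the spec text on a byte array, not how any browser actually implements it; the class `BomSniff` and the method `stripLeadingBom` are made-up names. It also shows why a leading character or a second BOM leaves a BOM in the output, which matches what the question observed.

```java
import java.util.Arrays;

// Hypothetical illustration of the BOM check in the UTF-8 decode algorithm.
final class BomSniff {
    // Looks at the first three bytes only. If they are EF BB BF they are
    // dropped; otherwise the input is returned unchanged (the "prepend
    // buffer to stream" case).
    static byte[] stripLeadingBom(byte[] stream) {
        boolean hasBom = stream.length >= 3
                && (stream[0] & 0xFF) == 0xEF
                && (stream[1] & 0xFF) == 0xBB
                && (stream[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(stream, 3, stream.length) : stream;
    }

    public static void main(String[] args) {
        byte b0 = (byte) 0xEF, b1 = (byte) 0xBB, b2 = (byte) 0xBF;

        byte[] singleBom = {b0, b1, b2, 'a'};              // BOM + "a"
        byte[] doubleBom = {b0, b1, b2, b0, b1, b2, 'a'};  // BOM + BOM + "a"
        byte[] charFirst = {'x', b0, b1, b2, 'a'};         // "x" + BOM + "a"

        System.out.println(stripLeadingBom(singleBom).length); // prints 1 (BOM gone)
        System.out.println(stripLeadingBom(doubleBom).length); // prints 4 (one BOM left)
        System.out.println(stripLeadingBom(charFirst).length); // prints 5 (untouched)
    }
}
```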

Given the Encoding spec requires that, I think there’s no way to tell browsers to keep the BOM.