I'm parsing HTTP headers. I want to split the header values into arrays where it makes sense.
For example, Cache-Control: no-cache, no-store should return ['no-cache','no-store'].
HTTP RFC2616 says:
Multiple message-header fields with the same field-name MAY be present in a message if and only if the entire field-value for that header field is defined as a comma-separated list [i.e., #(values)]. It MUST be possible to combine the multiple header fields into one "field-name: field-value" pair, without changing the semantics of the message, by appending each subsequent field-value to the first, each separated by a comma. The order in which header fields with the same field-name are received is therefore significant to the interpretation of the combined field value, and thus a proxy MUST NOT change the order of these field values when a message is forwarded.
But I'm not sure if the reverse is true -- is it safe to split on comma?
I've already found one case where this causes problems. My User-Agent string is
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36
i.e., it contains a comma after "KHTML". Obviously I don't have more than one user agent, so it doesn't make sense to split this header.
Is the User-Agent string the only exception, or are there more?
No, it is not safe to split headers on commas. As an example, Accept: foo/bar;p="A,B,C", bob/dole;x="apples,oranges" is a valid header, but if you split it on commas expecting a list of MIME types, you get invalid results, because the commas inside the quoted parameter values are not list separators.
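To make the failure concrete, here is a minimal Python sketch comparing a naive split with one that skips commas inside quoted strings. The function name split_list_header is made up for illustration. It only understands quoted strings and backslash escapes; it is not a full ABNF parser and would still mis-split values with parenthesized comments (such as User-Agent) or dates.

```python
def split_list_header(value: str) -> list[str]:
    """Split on commas that are outside double-quoted strings.

    Hypothetical helper for illustration: handles quoted strings and
    backslash escapes only, not comments, dates, or per-header grammar.
    """
    items = []
    current = []
    in_quotes = False
    escaped = False
    for ch in value:
        if escaped:
            # Character following a backslash inside quotes is kept verbatim.
            current.append(ch)
            escaped = False
        elif in_quotes and ch == "\\":
            current.append(ch)
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == "," and not in_quotes:
            # A comma outside quotes ends the current list element.
            items.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    items.append("".join(current).strip())
    # Empty elements (e.g. from ", ,") are dropped.
    return [item for item in items if item]


header = 'foo/bar;p="A,B,C", bob/dole;x="apples,oranges"'
naive = header.split(",")          # five fragments, none of them a valid media range
aware = split_list_header(header)  # two elements, quoted commas preserved
```

Even the quote-aware result is only the first step: each element still has to be parsed into a media type and its parameters according to the Accept grammar.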
The correct answer is that each header is specified using ABNF, mostly in the relevant RFCs; for example, Accept is defined in RFC 7231, Section 5.3.2.
I had this specific problem and wrote a parser and tested it on edge cases. Not only is parsing the header non-trivial, interpreting it and giving the correct result is also non-trivial.
Some headers are more complex than others, but essentially each header has its own grammar, which should be respected for correct (and secure) processing.
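One way to respect per-header grammar is to decide by header name whether a value is a list at all. The following Python sketch is an assumption-laden illustration, not a complete implementation: LIST_HEADERS is a small hand-picked allowlist (not an exhaustive or authoritative set), parse_header is a made-up name, and the splitting only accounts for quoted strings.

```python
import re

# Hypothetical allowlist of headers whose value is a comma-separated list.
# Illustrative only; check each header's ABNF before adding it.
LIST_HEADERS = {
    "accept",
    "accept-encoding",
    "accept-language",
    "cache-control",
    "connection",
}

# One list element: any run of characters that are neither a comma nor a
# quote, interleaved with complete double-quoted strings (with \ escapes).
_ELEMENT = re.compile(r'(?:[^",]|"(?:[^"\\]|\\.)*")+')


def parse_header(name: str, value: str):
    """Return a list for known list-valued headers, else the raw string."""
    if name.lower() in LIST_HEADERS:
        elements = (m.group(0).strip() for m in _ELEMENT.finditer(value))
        return [e for e in elements if e]
    # Unknown or non-list headers (e.g. User-Agent) are left untouched.
    return value


ua = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36")
print(parse_header("Cache-Control", "no-cache, no-store"))
print(parse_header("User-Agent", ua) == ua)
```

Defaulting to "do not split" means an unrecognized header is returned unchanged rather than being broken apart incorrectly.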