
How to split header values?


I'm parsing HTTP headers. I want to split the header values into arrays where it makes sense.

For example, Cache-Control: no-cache, no-store should return ['no-cache','no-store'].

RFC 2616 (Section 4.2) says:

Multiple message-header fields with the same field-name MAY be present in a message if and only if the entire field-value for that header field is defined as a comma-separated list [i.e., #(values)]. It MUST be possible to combine the multiple header fields into one "field-name: field-value" pair, without changing the semantics of the message, by appending each subsequent field-value to the first, each separated by a comma. The order in which header fields with the same field-name are received is therefore significant to the interpretation of the combined field value, and thus a proxy MUST NOT change the order of these field values when a message is forwarded.

But I'm not sure whether the reverse is true: is it safe to split any header value on commas?

I've already found one example where this causes problems. My User-Agent string is

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36

It contains a comma after "KHTML". I obviously don't have more than one user agent, so it makes no sense to split this header.

Is the User-Agent string the only exception, or are there more?


Solution

  • No, it is not safe to split headers on commas. For example, Accept: foo/bar;p="A,B,C", bob/dole;x="apples,oranges" is a valid header, but if you split it on every comma expecting a list of MIME types, you get five fragments instead of two media ranges, because the commas inside the quoted parameter values are treated as separators too.

    The correct approach is to parse each header according to its own grammar. Each header is specified in ABNF, mostly in RFCs; Accept, for example, is defined in RFC 7231, Section 5.3.2.

    I ran into this exact problem, wrote a parser, and tested it on edge cases. Parsing the header is non-trivial, and so is interpreting the parsed result correctly.

    Some headers are more complex than others, but essentially each header has its own grammar, which should be respected for correct (and secure) processing.
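
To illustrate the point about quoted commas, here is a minimal Python sketch of a splitter that ignores commas inside quoted strings and is applied only to headers known to be comma-separated lists. The names `split_list_header`, `parse_header`, and `LIST_HEADERS` are hypothetical, and the allow-list is deliberately tiny. This is not a full ABNF parser for any header: it does not handle parenthesized comments (as found in User-Agent), which is one reason it must not be applied to arbitrary headers.

```python
# Hypothetical allow-list: headers whose grammar is a comma-separated list.
# A real implementation would take this from the relevant RFCs.
LIST_HEADERS = {"accept", "cache-control"}


def split_list_header(value: str) -> list[str]:
    """Split on commas that are outside double-quoted strings."""
    items = []
    current = []
    in_quotes = False
    escaped = False
    for ch in value:
        if escaped:
            # Character following a backslash inside a quoted string.
            current.append(ch)
            escaped = False
        elif in_quotes and ch == "\\":
            current.append(ch)
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == "," and not in_quotes:
            items.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    items.append("".join(current).strip())
    # Drop empty elements such as those produced by "a,,b".
    return [item for item in items if item]


def parse_header(name: str, value: str) -> list[str]:
    """Split only headers on the allow-list; return others as one value."""
    if name.lower() in LIST_HEADERS:
        return split_list_header(value)
    return [value]
```

With these definitions, `parse_header("Cache-Control", "no-cache, no-store")` returns `['no-cache', 'no-store']`, the Accept example above comes back as two media ranges with their quoted parameters intact, and a User-Agent value is returned as a single-element list because it is not on the allow-list. Each element would still need to be parsed according to the header's own grammar.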