With FastAPI, How to add charset to content-type (media-type) on request header on OpenAPI (Swagger) doc?

@app.post("/")
def post_hello(username: str = Form(...)):
   return {"Hello": username}

OpenAPI (http:///docs) shows "application/x-www-form-urlencoded".

I tried to change like:

def post_hello(username: str = Form(..., media_type="application/x-www-form-urlencoded; charset=cp932")):
   return {"Hello": "World!", "userName": username}

but not be add charset=cp932

I want to set "application/x-www-form-urlencoded; charset=cp932" to Content-Type on Request. And I want to get username decoded by the charset.

Solution

Short answer

It doesn't appear to be a good idea in the general case; I don't think there's an easy~~, built-in~~ way to do it; and it may not be necessary.

Not a good idea (a.k.a. not standards-compliant)

This GitHub issue discusses why appending ;charset=UTF-8 to application/json isn't a good idea, and the same points brought up there apply in this case.

The HTTP/1.1 specification says that the Content-Type header lists the Media Type.

note: HTTP/2 shares these components with HTTP/1.1

The IANA manages the registry of all the commonly used Media Types (MIME)..

The entry for application/x-www-form-urlencoded says:

Media type name: application
Media subtype name: x-www-form-urlencoded

Required parameters: No parameters

Optional parameters:
No parameters

Encoding considerations: 7bit

Compare this to the entry for text/html:

MIME media type name : Text

MIME subtype name : Standards Tree - html

Required parameters : No required parameters

Optional parameters :
charset
The charset parameter may be provided to definitively specify the document's character encoding, overriding any character encoding declarations in the document. The parameter's value must be one of the labels of the character encoding used to serialize the file.

Encoding considerations : 8bit

The entry for application/x-www-form-urlencoded does not allow for charset to be added. So how should it be decoded from bytes? The URL spec states:

Let nameString and valueString be the result of running UTF-8 decode without BOM on the percent-decoding of name and value, respectively.

It sounds like, whatever the encoding is, UTF-8 should always be used in decoding.

The current HTML/URL specification also has this note about application/x-www-form-urlencoded:

The application/x-www-form-urlencoded format is in many ways an aberrant monstrosity, the result of many years of implementation accidents and compromises leading to a set of requirements necessary for interoperability, but in no way representing good design practices. In particular, readers are cautioned to pay close attention to the twisted details involving repeated (and in some cases nested) conversions between character encodings and byte sequences. Unfortunately the format is in widespread use due to the prevalence of HTML forms.

So it sounds like it wouldn't be a good idea to do something differently.

No built-in way

Note: the built-in way to do what these solutions do is to use a custom Request class.

When building the /openapi.json object, the current version of FastAPI checks to see if a dependency is an instance of Form, and then uses an empty Form instance to build the schema, even if the actual dependency is a subclass of Form.

The default value of the media_type parameter of Form.__init__ is application/x-www-form-urlencoded, so every endpoint/path function with a Form() dependency will show the same Media Type in the docs, even though the class __init__( has a media_type parameter.

There are a few ways of changing what's listed in /openapi.json, which is what's used to generate the documentation, and the FastAPI docs list one official way.

For the example in the question, this would work:

from fastapi import FastAPI, Form
from fastapi.openapi.utils import get_openapi

app = FastAPI()


@app.post("/")
def post_hello(username: str = Form(...)):
    return {"Hello": username}


def custom_openapi():
    if app.openapi_schema:
        return app.openapi_schema

    app.openapi_schema = get_openapi(
        title=app.title,
        version=app.version,
        openapi_version=app.openapi_version,
        description=app.description,
        terms_of_service=app.terms_of_service,
        contact=app.contact,
        license_info=app.license_info,
        routes=app.routes,
        tags=app.openapi_tags,
        servers=app.servers,
    )

    requestBody = app.openapi_schema["paths"]["/"]["post"]["requestBody"]
    content = requestBody["content"]
    new_content = {
        "application/x-www-form-urlencoded;charset=cp932": content[
            "application/x-www-form-urlencoded"
        ]
    }
    requestBody["content"] = new_content

    return app.openapi_schema


app.openapi = custom_openapi

It's worth noting that, with this change, the docs user interface changes how it presents the experimentation section:

Compared to how just application/x-www-form-urlencoded with no charset specified is shown:

The above changes would only change th Media Type listed in the docs. Any form data sent to an endpoint/path function would still be:

parsed by python-multipart (roughly following the same steps described in the specification)
decoded by starlette with Latin-1

So even if starlette was changed to use a different encoding scheme for decoding form data, python-multipart still follows the steps outlined in the specification to use hardcoded byte values for & and ;, for example.

Fortunately, most* of the first 128 characters/codepoints are mapped to the same byte sequences between cp932 and UTF-8, so &, ;, and = all come out the same.

*except for 0x5C, which is sometimes ¥

One way to change starlette to use cp932 encoding would be using a middleware:

import typing
from unittest.mock import patch
from urllib.parse import unquote_plus

import multipart
from fastapi import FastAPI, Form, Request, Response
from fastapi.openapi.utils import get_openapi
from multipart.multipart import parse_options_header
from starlette.datastructures import FormData, UploadFile
from starlette.formparsers import FormMessage, FormParser

app = FastAPI()

form_path = "/"


@app.post(form_path)
async def post_hello(username: str = Form(...)):
    return {"Hello": username}


def custom_openapi():
    if app.openapi_schema:
        return app.openapi_schema

    app.openapi_schema = get_openapi(
        title=app.title,
        version=app.version,
        openapi_version=app.openapi_version,
        description=app.description,
        terms_of_service=app.terms_of_service,
        contact=app.contact,
        license_info=app.license_info,
        routes=app.routes,
        tags=app.openapi_tags,
        servers=app.servers,
    )

    requestBody = app.openapi_schema["paths"]["/"]["post"]["requestBody"]
    content = requestBody["content"]
    new_content = {
        "application/x-www-form-urlencoded;charset=cp932": content[
            "application/x-www-form-urlencoded"
        ]
    }
    requestBody["content"] = new_content

    return app.openapi_schema


app.openapi = custom_openapi


class CP932FormParser(FormParser):
    async def parse(self) -> FormData:
        """
        copied from:
        https://github.com/encode/starlette/blob/0.17.1/starlette/formparsers.py#L72-L110
        """
        # Callbacks dictionary.
        callbacks = {
            "on_field_start": self.on_field_start,
            "on_field_name": self.on_field_name,
            "on_field_data": self.on_field_data,
            "on_field_end": self.on_field_end,
            "on_end": self.on_end,
        }

        # Create the parser.
        parser = multipart.QuerystringParser(callbacks)
        field_name = b""
        field_value = b""

        items: typing.List[typing.Tuple[str, typing.Union[str, UploadFile]]] = []

        # Feed the parser with data from the request.
        async for chunk in self.stream:
            if chunk:
                parser.write(chunk)
            else:
                parser.finalize()
            messages = list(self.messages)
            self.messages.clear()
            for message_type, message_bytes in messages:
                if message_type == FormMessage.FIELD_START:
                    field_name = b""
                    field_value = b""
                elif message_type == FormMessage.FIELD_NAME:
                    field_name += message_bytes
                elif message_type == FormMessage.FIELD_DATA:
                    field_value += message_bytes
                elif message_type == FormMessage.FIELD_END:
                    name = unquote_plus(field_name.decode("cp932"))  # changed line
                    value = unquote_plus(field_value.decode("cp932"))  # changed line
                    items.append((name, value))

        return FormData(items)


class CustomRequest(Request):
    async def form(self) -> FormData:
        """
        copied from
        https://github.com/encode/starlette/blob/0.17.1/starlette/requests.py#L238-L253
        """
        if not hasattr(self, "_form"):
            assert (
                parse_options_header is not None
            ), "The `python-multipart` library must be installed to use form parsing."
            content_type_header = self.headers.get("Content-Type")
            content_type, options = parse_options_header(content_type_header)
            if content_type == b"multipart/form-data":
                multipart_parser = MultiPartParser(self.headers, self.stream())
                self._form = await multipart_parser.parse()
            elif content_type == b"application/x-www-form-urlencoded":
                form_parser = CP932FormParser(
                    self.headers, self.stream()
                )  # use the custom parser above
                self._form = await form_parser.parse()
            else:
                self._form = FormData()
        return self._form


@app.middleware("http")
async def custom_form_parser(request: Request, call_next) -> Response:
    if request.scope["path"] == form_path:
        # starlette creates a new Request object for each middleware/app
        # invocation:
        # https://github.com/encode/starlette/blob/0.17.1/starlette/routing.py#L59
        # this temporarily patches the Request object starlette
        # uses with our modified version
        with patch("starlette.routing.Request", new=CustomRequest):
            return await call_next(request)

Then, the data has to be encoded manually:

>>> import sys
>>> from urllib.parse import quote_plus
>>> name = quote_plus("username").encode("cp932")
>>> value = quote_plus("cp932文字コード").encode("cp932")
>>> with open("temp.txt", "wb") as file:
...     file.write(name + b"=" + value)
...
59

And sent as binary data:

$ curl -X 'POST' \
  'http://localhost:8000/' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/x-www-form-urlencoded;charset=cp932' \
  --data-binary "@temp.txt" \
  --silent \
| jq -C .

{
  "Hello": "cp932文字コード"
}

Might not be necessary

In the manual encoding step, the output will look like:

username=cp932%E6%96%87%E5%AD%97%E3%82%B3%E3%83%BC%E3%83%89

Part of the percent-encoding step replaces any bytes representing characters higher than 0x7E (~ in ASCII) with bytes in a reduced ASCII range. Since cp932 and UTF-8 both map these bytes to the same codepoints (except for 0x5C which may be \ or ¥), the byte sequence will decode to the same string:

$ curl -X 'POST' \
  'http://localhost:8000/' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/x-www-form-urlencoded;charset=cp932' \
  --data-urlencode "username=cp932文字コード" \
  --silent \
| jq -C .

{
  "Hello": "cp932文字コード"
}

This is only true for percent-encoded data.

Any data sent without percent-encoding will be processed and interpreted differently from how the sender intended it to be interpreted. For example, in the OpenAPI (Swagger) docs, the "Try it out" experimentation section gives an example with curl -d (same as --data), which doesn't urlencode the data:

$ curl -X 'POST' \
  'http://localhost:8000/' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  --data "username=cp932文字コード" \
  --silent \
| jq -C .
{
  "Hello": "cp932æ–‡å—ã‚³ãƒ¼ãƒ‰"
}

It might still be a good idea to only use cp932 to process requests from senders that are configured in a way similar to the server.

One way to do this would be to modify the middleware function to only process data like this if the sender specifies that the data has been encoded with cp932:

import typing
from unittest.mock import patch
from urllib.parse import unquote_plus

import multipart
from fastapi import FastAPI, Form, Request, Response
from fastapi.openapi.utils import get_openapi
from multipart.multipart import parse_options_header
from starlette.datastructures import FormData, UploadFile
from starlette.formparsers import FormMessage, FormParser

app = FastAPI()

form_path = "/"


@app.post(form_path)
async def post_hello(username: str = Form(...)):
    return {"Hello": username}


def custom_openapi():
    if app.openapi_schema:
        return app.openapi_schema

    app.openapi_schema = get_openapi(
        title=app.title,
        version=app.version,
        openapi_version=app.openapi_version,
        description=app.description,
        terms_of_service=app.terms_of_service,
        contact=app.contact,
        license_info=app.license_info,
        routes=app.routes,
        tags=app.openapi_tags,
        servers=app.servers,
    )

    requestBody = app.openapi_schema["paths"]["/"]["post"]["requestBody"]
    content = requestBody["content"]
    new_content = {
        "application/x-www-form-urlencoded;charset=cp932": content[
            "application/x-www-form-urlencoded"
        ]
    }
    requestBody["content"] = new_content

    return app.openapi_schema


app.openapi = custom_openapi


class CP932FormParser(FormParser):
    async def parse(self) -> FormData:
        """
        copied from:
        https://github.com/encode/starlette/blob/0.17.1/starlette/formparsers.py#L72-L110
        """
        # Callbacks dictionary.
        callbacks = {
            "on_field_start": self.on_field_start,
            "on_field_name": self.on_field_name,
            "on_field_data": self.on_field_data,
            "on_field_end": self.on_field_end,
            "on_end": self.on_end,
        }

        # Create the parser.
        parser = multipart.QuerystringParser(callbacks)
        field_name = b""
        field_value = b""

        items: typing.List[typing.Tuple[str, typing.Union[str, UploadFile]]] = []

        # Feed the parser with data from the request.
        async for chunk in self.stream:
            if chunk:
                parser.write(chunk)
            else:
                parser.finalize()
            messages = list(self.messages)
            self.messages.clear()
            for message_type, message_bytes in messages:
                if message_type == FormMessage.FIELD_START:
                    field_name = b""
                    field_value = b""
                elif message_type == FormMessage.FIELD_NAME:
                    field_name += message_bytes
                elif message_type == FormMessage.FIELD_DATA:
                    field_value += message_bytes
                elif message_type == FormMessage.FIELD_END:
                    name = unquote_plus(field_name.decode("cp932"))  # changed line
                    value = unquote_plus(field_value.decode("cp932"))  # changed line
                    items.append((name, value))

        return FormData(items)


class CustomRequest(Request):
    async def form(self) -> FormData:
        """
        copied from
        https://github.com/encode/starlette/blob/0.17.1/starlette/requests.py#L238-L253
        """
        if not hasattr(self, "_form"):
            assert (
                parse_options_header is not None
            ), "The `python-multipart` library must be installed to use form parsing."
            content_type_header = self.headers.get("Content-Type")
            content_type, options = parse_options_header(content_type_header)
            if content_type == b"multipart/form-data":
                multipart_parser = MultiPartParser(self.headers, self.stream())
                self._form = await multipart_parser.parse()
            elif content_type == b"application/x-www-form-urlencoded":
                form_parser = CP932FormParser(
                    self.headers, self.stream()
                )  # use the custom parser above
                self._form = await form_parser.parse()
            else:
                self._form = FormData()
        return self._form


@app.middleware("http")
async def custom_form_parser(request: Request, call_next) -> Response:
    if request.scope["path"] != form_path:
        return await call_next(request)

    content_type_header = request.headers.get("content-type", None)
    if not content_type_header:
        return await call_next(request)

    media_type, options = parse_options_header(content_type_header)
    if b"charset" not in options or options[b"charset"] != b"cp932":
        return await call_next(request)

    # starlette creates a new Request object for each middleware/app
    # invocation:
    # https://github.com/encode/starlette/blob/0.17.1/starlette/routing.py#L59
    # this temporarily patches the Request object starlette
    # uses with our modified version
    with patch("starlette.routing.Request", new=CustomRequest):
        return await call_next(request)

Security

Even with this modification, I think the note in the spec about parsing content with percent-decode should also be highlighted:

⚠ Warning! Using anything but UTF-8 decode without BOM when input contains bytes that are not ASCII bytes might be insecure and is not recommended.

So I would be cautious about implementing any of these solutions at all.