The last few months some of our applications crashed periodically. Thanks to WebError ErrorMiddleware, we receive an email each time an internal server error occurs.
For example someone tried to retrieve all of our french territories data with the API.
End-user tools like web browsers generate valid UTF-8 requests with no effort, but non UTF-8 requests can be generated by some odd software or by hand from a ipython shell.
Let’s dive into the problem in ipython :
In : url = u'http://www.easter-eggs.com/é' In : url Out: u'http://www.easter-eggs.com/\xe9' In : url.encode('utf-8') Out: 'http://www.easter-eggs.com/\xc3\xa9' In : latin1_url = url.encode('latin1') Out: 'http://www.easter-eggs.com/\xe9' In : latin1_url.decode('utf-8') [... skipped ...] UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 27: unexpected end of data
This shows that U+00E9 is the Unicode codepoint for the ‘é’ character ( see Wikipedia), that its UTF-8 encoding are the 2 bytes ‘\xc3\xa9’, and that decoding in UTF-8 a latin1 byte throws an error.
The stack trace attached to the error e-mails helped us to find that the UnicodeDecodeError exception occurs when calling one of these Request methods: path_info, script_name and params.
So we wrote a new WSGI middleware to reject mis-encoded requests, returning a bad request HTTP error code to the client.
from webob.dec import wsgify import webob.exc @wsgify.middleware def reject_misencoded_requests(req, app, exception_class=None): """WSGI middleware that returns an HTTP error (bad request by default) if the request attributes are not encoded in UTF-8. """ if exception_class is None: exception_class = webob.exc.HTTPBadRequest try: req.path_info req.script_name req.params except UnicodeDecodeError: return exception_class(u'The request URL and its parameters must be encoded in UTF-8.') return req.get_response(app)
The source code of this middleware is published on Gitorious: reject-misencoded-requests
We could have guessed the encoding, and set the Request.encoding attribute, but it would have fixed only the read of PATH_INFO and SCRIPT_NAME, and not the POST and GET parameters which are expected to be encoded only in UTF-8.
That’s why we simply return a 400 bad request HTTP code to our users. This is simpler and does the work.