Secure web uploads

I’ve written hundreds of mailing list posts over the years, in my role first as a volunteer software developer and system administrator for Wikipedia, and later as an employee in the same role. But I’ve never had my own domain name, and I’ve never had a blog.

But I do have things to say, and I’ve often thought about setting up a soap box such as this, with the aim of reaching a wider audience than the mailing lists I usually post to. An important issue has finally come up, and I feel compelled to tell you about it. So I have created this blog.

The issue is a basic feature, which is present in many web applications: file uploads. Due to design choices by the browsers, particularly Internet Explorer, it turns out to be extremely difficult to allow users to upload arbitrary files, without endangering the security of the application.

We spent a lot of time working on secure uploads for MediaWiki, and we thought we had it more or less right. But it turns out that our handling of Internet Explorer wasn’t nearly rigorous enough, and there were still a number of ways to use file uploads to steal the authentication cookies of Internet Explorer users. In MediaWiki 1.13.3, I have, hopefully, closed these gaps. I did this by reverse-engineering three versions of Internet Explorer.

In the rest of this post, I’ll give a tutorial on building a file upload application, working through the security pitfalls from the most naive to the most subtle. I’ll use PHP in my examples, but none of the issues here are PHP-specific.

Upload feature in 10 lines of code: what could possibly go wrong?

Let’s suppose an unlucky newbie developer decided to build their upload feature by following the example in the PHP manual. What could possibly go wrong?

$uploaddir = '/var/www/uploads/';
$uploadfile = $uploaddir . basename($_FILES['userfile']['name']);
 
if (move_uploaded_file($_FILES['userfile']['tmp_name'], $uploadfile)) {
    echo "File is valid, and was successfully uploaded.\n";
} else {
    echo "Possible file upload attack!\n";
}

It opens up an arbitrary script execution vulnerability. An attacker can just upload a file ending with .php, navigate to it in their browser, and the server will execute it. Many web applications are (or have been) vulnerable to this most basic and severe vulnerability.

It is particularly severe because there is a profit motive to exploit it. Spammers have written scripts to search for these kinds of vulnerabilities. They automatically upload a script which runs perpetually, in a virtual() loop to avoid max_execution_time, which relays spam from another host out to the Internet. They can also use vulnerabilities such as this to set up a spamvertised website on the server.

If your code is going to be distributed, it’s not enough to ask the user to disable PHP execution in the /var/www/uploads directory. Nobody reads the manual. Instead we have to work out the circumstances under which a typically configured web server will execute a file as a script, and make sure those circumstances never arise for uploaded files. In practice that means checking the file extension.

There is another pitfall, however, which is that some web servers (notably Apache with mod_mime) consider files to have multiple extensions. For example, index.php.fr is considered to be the equivalent of index.php, but in French. So to be secure, we must compile a blacklist of script extensions, and check each part of the filename against it.

Don’t forget that file extensions are case-insensitive: .PHP is just as dangerous as .php.

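// Lower-case extensions that must never be accepted (the server-side list below)
$extensionBlacklist = array(
    'php', 'phtml', 'php3', 'php4', 'php5', 'phps',
    'shtml', 'jhtml', 'pl', 'py', 'cgi',
);

$name = basename( $_FILES['userfile']['name'] );
if ( isBadExtension( $name ) ) {
    echo "Bad extension!";
    return;
}

// Returns true if any dot-separated part of the name, other than the part
// before the first dot, is on the blacklist.
function isBadExtension( $name ) {
    global $extensionBlacklist;
    $extensions = explode( '.', $name );
    unset( $extensions[0] ); // the base name is not an extension
    $extensions = array_map( 'strtolower', $extensions );
    foreach ( $extensions as $extension ) {
        if ( in_array( $extension, $extensionBlacklist ) ) {
            return true;
        }
    }
    return false;
}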

Bad server-side extensions:

  • php, phtml, php3, php4, php5, phps
  • shtml
  • jhtml
  • pl
  • py
  • cgi

Client-side scripts

We know that certain file types will be executed by web servers as scripts, and these should not be allowed to be uploaded. Similarly, certain file types may contain scripts that will be executed by the client web browser. These types must be detected, and then either validated or disallowed. Although client scripts can’t take over the client’s computer and use it for sending spam, they can steal the user’s login credentials for your web site, and use them to do anything the user can do. JavaScript running from the same origin as your application will have full access to the application’s cookies. If the victim is an administrator for a web app such as Drupal or WordPress, the attacker may be able to insert arbitrary PHP code into the site’s skin, and thus take over the server.

To the list of client-side hazards, we will also add file types which can easily be downloaded and executed on the client computer, giving the uploader/attacker full control, without the user being properly warned of the risks of doing so.

Bad client-side extensions:

  • html, htm, mhtml, mht
  • svg
  • exe, scr, msi, com, pif, cmd, cpl
  • js, jsb, vbs, bat

For a long time, MediaWiki omitted SVG from the list, despite the fact that it has been as dangerous as HTML since Firefox 1.5.

Rather than maintain an extensive blacklist of file types in your application, it’s probably easier to just have a whitelist, say, just allowing the common image formats. But if you let the user configure the whitelist, you need to have a mechanism for warning them when they try to allow one of these dangerous types. And as we will see shortly, controlling the extension alone is not sufficient to provide security.
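
A whitelist check with a warning for risky entries might look something like the sketch below. The function names and the example whitelist are hypothetical, and the list of dangerous extensions is simply the two lists from this post merged together. It assumes the whitelist is stored in lower case.

```php
<?php
// The server-side and client-side lists from this post, merged.
$dangerousExtensions = array(
    'php', 'phtml', 'php3', 'php4', 'php5', 'phps',
    'shtml', 'jhtml', 'pl', 'py', 'cgi',
    'html', 'htm', 'mhtml', 'mht', 'svg',
    'exe', 'scr', 'msi', 'com', 'pif', 'cmd', 'cpl',
    'js', 'jsb', 'vbs', 'bat',
);

// Hypothetical helper: return the entries of a user-configured whitelist
// that the administrator should be warned about.
function findDangerousWhitelistEntries( array $whitelist, array $dangerous ) {
    $whitelist = array_map( 'strtolower', $whitelist );
    return array_values( array_intersect( $whitelist, $dangerous ) );
}

// Hypothetical helper: accept a name only if its final extension is
// whitelisted and no earlier dot-separated part is a dangerous extension.
function isAllowedName( $name, array $whitelist, array $dangerous ) {
    $parts = explode( '.', strtolower( basename( $name ) ) );
    if ( count( $parts ) < 2 ) {
        return false; // no extension at all
    }
    $final = array_pop( $parts );
    array_shift( $parts ); // drop the part before the first dot
    if ( !in_array( $final, $whitelist, true ) ) {
        return false;
    }
    return count( array_intersect( $parts, $dangerous ) ) === 0;
}
```

With a whitelist of png, gif, jpg and jpeg, this accepts cat.PNG but rejects shell.php.png, because Apache may still treat the inner .php as meaningful.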

Content type detection

Back in around 1997, Microsoft decided that web application developers were having it too easy. They could blacklist a few bad file extensions and create a reasonably secure file upload application. So with the release of Internet Explorer 4.0, they launched a crackdown on this secure practice. The result was FindMimeFromData().

Apparently some users were uploading files with the wrong extension on them, or something. So the IE team decided that they weren’t going to trust the content type specified by the server (generally derived from the extension), and instead, they were going to try to detect the file type by looking at the data. They assigned an inexperienced developer to the case, and never bothered to properly test the resulting code.

The community determined that now, if a file had some HTML tags in the first 256 bytes, under certain obscure circumstances, Internet Explorer would decide that the file was in fact HTML, and go on to execute malicious scripts contained within the file. But Microsoft never documented which HTML tags would cause this, or what the circumstances were in which the type could be reassigned. They did release a vague and incomplete document called MIME Type Detection in Internet Explorer, but that wasn’t much help.

The general approach by a security-conscious web application is to check the first part of the file for HTML tags, to determine if IE will detect the file as HTML. Such files can then be rejected. The problem is, the algorithm in the web application must precisely match the secret algorithm used by IE. If the web application is less strict than IE in any minor detail during the process, that detail can be exploited by an attacker to create a file which will be accepted by the web application, but detected as HTML by IE.

I didn’t think this was an acceptable situation. So with the help of IDA Pro Freeware, I disassembled the relevant code in IE 5.0, 6.0 and 7.0. I then ported the whole thing from assembly language to PHP, with version differences incorporated into the code as conditional blocks. A web application can use this code to determine, with some confidence, what type Internet Explorer will assign to a file, and thus whether it should be allowed or not.

The algorithm is large and complex, so rather than detail it here in full, I would prefer to encourage reuse, redistribution and porting of my PHP code, which can be found on MediaWiki’s Subversion server:

If you want to blacklist dangerous file types, or you have a user-configurable whitelist, you are probably best off using the full code. For developers who just want to whitelist a few image types, it’s probably overkill, so I will give a synopsis here of the relevant parts.

FindMimeFromData synopsis

The algorithm proceeds as follows:

  1. First, look at the server’s proposed content type. If it is not in a list of “known” types, then immediately accept that type.
  2. If the proposed content type is text/html, image/gif, image/jpeg or, as of IE 7, image/png, then do a special case check to see if the data matches that declared type. If it does match, the type is returned.
  3. Do a heuristic test on the data to see if it looks like CDF, RSS, Atom, other XML, HTML, XBitMap, BinHex or “scriptlet”.  If the heuristic test matches, return the corresponding type.
  4. Look for magic numbers in the first few bytes of the file, for a sizeable list of candidate file types. If a match is found, return that type.
  5. Do a heuristic test to determine if the data is “text” or “binary”. This test is buggy and will detect non-ASCII text as binary.
  6. If the server’s proposed content type is known to be a binary type, and the heuristic suggests that the file is binary, return the proposed type.
  7. If the server’s proposed content type is known to be a text type, and the heuristic suggests that the file is text, return the proposed type.
  8. If the server’s proposed content type is on a list usually containing only text/html, return the proposed type.
  9. Search HKEY_CLASSES_ROOT to see if the file extension has a corresponding MIME type which might be returned. If it does, return it.
  10. Search HKEY_CLASSES_ROOT to see if the file extension has an application registered to it. If it does, return application/octet-stream.
  11. Return text/plain or application/octet-stream according to the result of step 5.

The trick for whitelisters is that you might be able to get off the ride at step 1 or 2, and so avoid the complexity of steps 3 to 11. For step 1, the known types are:

  • In IE 5: text/richtext, image/x-bitmap, application/postscript, application/base64, application/macbinhex40, application/x-cdf, text/scriptlet, application/pdf, audio/x-aiff, audio/basic, audio/wav, image/gif, image/pjpeg, image/jpeg, image/tiff, image/x-png, image/png, image/bmp, image/x-jg, image/x-art, image/x-emf, image/x-wmf, video/avi, video/x-msvideo, video/mpeg, application/x-compressed, application/x-zip-compressed, application/x-gzip-compressed, application/java, application/x-msdownload, text/html
  • In IE 7, text/xml and application/xml were added.

So if your type is not on this list, and you can be sure the webserver will always return it, then you can allow that upload. This is useful if your web app needs to stream arbitrary user data for some other reason: you can send a header “Content-Type: application/x-my-secret-type”, and IE won’t interpret it as HTML.
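
A rough sketch of the step 1 check follows. The function names are hypothetical, and the list is copied from this post rather than from the full code, so check it against the real thing before relying on it. Stripping parameters and lower-casing before comparing are assumptions of this sketch; they err on the side of treating more types as "known".

```php
<?php
// "Known" types for step 1, copied from the list in this post.
function getIeKnownTypes( $includeIe7Additions = true ) {
    $types = array(
        'text/richtext', 'image/x-bitmap', 'application/postscript',
        'application/base64', 'application/macbinhex40', 'application/x-cdf',
        'text/scriptlet', 'application/pdf', 'audio/x-aiff', 'audio/basic',
        'audio/wav', 'image/gif', 'image/pjpeg', 'image/jpeg', 'image/tiff',
        'image/x-png', 'image/png', 'image/bmp', 'image/x-jg', 'image/x-art',
        'image/x-emf', 'image/x-wmf', 'video/avi', 'video/x-msvideo',
        'video/mpeg', 'application/x-compressed',
        'application/x-zip-compressed', 'application/x-gzip-compressed',
        'application/java', 'application/x-msdownload', 'text/html',
    );
    if ( $includeIe7Additions ) {
        $types[] = 'text/xml';
        $types[] = 'application/xml';
    }
    return $types;
}

// Hypothetical helper: true if the declared type is not on the "known"
// list, so that step 1 would accept it without looking at the data.
function isUnknownToIe( $contentType, $includeIe7Additions = true ) {
    // Assumption: ignore parameters such as "; charset=utf-8", and case.
    $parts = explode( ';', $contentType );
    $type = strtolower( trim( $parts[0] ) );
    if ( $type === '' ) {
        return false; // no usable type, so do not rely on step 1
    }
    return !in_array( $type, getIeKnownTypes( $includeIe7Additions ), true );
}
```

For example, isUnknownToIe( 'application/x-my-secret-type' ) is true, so a download script could send header( 'Content-Type: application/x-my-secret-type' ) before streaming the data.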

For step 2, the magic numbers are as follows:

  • HTML: complex, but ASCII “<html” in the first 255 bytes is sufficient
  • GIF: first 5 bytes must be GIF87 or GIF89. Note that PHP’s getimagesize() only checks that the first 3 bytes are “GIF”, so if you use getimagesize(), you will be insecure. Don’t rely on any libraries; check the magic numbers yourself.
  • JPEG: first 2 bytes must be hexadecimal FF D8.
  • PNG (only for IE 7+): first 8 bytes must be hexadecimal 89 50 4E 47 0D 0A 1A 0A

Safari

Safari is known to use a similar content detection process, based on the Internet Explorer one. But they didn’t have access to the Internet Explorer code when they wrote it, nor to my disassembly results. So it differs in many minor details. I can’t tell you what those details are in any rigorous way; I just know a few by word of mouth. Like Internet Explorer, it is closed-source and undocumented. So all I can say is that if you use Safari, don’t advertise the fact to any potentially malicious people. You’re probably insecure in most web apps.

Browser plugins

There are only two browser plugins which are used by large numbers of people across multiple browsers and platforms: Flash and Java. Both of them create severe security problems in a file upload application, and need to be dealt with specially.

Flash

A reasonable description of the cross-domain-policy problem in web applications was written by Stefan Esser, titled Poking new holes with Flash Crossdomain Policy Files. In a nutshell: you need to scan uploaded files in their entirety for the text “<cross-domain-policy>”, with optional whitespace between the tag name and the angle brackets, and reject any matching files. If you don’t, your server will be exposed to CSRF vulnerabilities. External Flash applets see this text as a license to breach the same-origin policy, allowing them to utilise the victim’s cookies for arbitrary purposes.
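
A minimal sketch of such a scan is shown below. The function name is hypothetical. The pattern is deliberately a little broader than the text above: it ignores case and does not insist on the closing angle bracket, so it rejects slightly more than strictly described. It also only looks for the tag as plain ASCII bytes.

```php
<?php
// Hypothetical helper: true if the data contains something that looks like
// the opening of a Flash cross-domain policy element.
function containsCrossDomainPolicy( $data ) {
    // Optional whitespace after "<"; case-insensitivity is an assumption
    return preg_match( '/<\s*cross-domain-policy/i', $data ) === 1;
}

// Possible usage: scan the whole upload, not just the first few bytes, e.g.
// $data = file_get_contents( $_FILES['userfile']['tmp_name'] );
// if ( containsCrossDomainPolicy( $data ) ) { /* reject the upload */ }
```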

Java

The so-called GIFAR vulnerability is particularly irritating, and MediaWiki only deals with it in a heavy-handed manner, by blacklisting uploads of zip and zip-like file formats (such as OpenOffice formats).

The problem is that an external web page can embed a Java applet using a JAR file hosted on your site. Such a Java applet can perform requests with the cookies of the site that hosts it, thus opening up a client script injection vulnerability.

The Java plugin doesn’t check if the file has a .jar extension. It doesn’t even check if the file has the proper magic number at the start; it just executes the file regardless. It does check that the zip directory at the end of the file has an appropriate magic number, so you can use that to blacklist potential JAR files, and all other zip files along with them. To do this, search the last 65558 bytes of the file for the hexadecimal bytes 50 4B 05 06. Reject any files that match.
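
The tail search could be sketched like this. The function name is hypothetical, and treating an unreadable file as a match (so that it is rejected) is a choice made for this sketch.

```php
<?php
// Hypothetical helper: true if the upload should be rejected because the
// bytes 50 4B 05 06 appear in the last 65558 bytes of the file.
function hasZipDirectorySignature( $path ) {
    $tailLength = 65558; // number of trailing bytes to search, as given above
    $signature = "\x50\x4B\x05\x06";

    $handle = fopen( $path, 'rb' );
    if ( $handle === false ) {
        return true; // unreadable file: fail closed and reject
    }
    $stat = fstat( $handle );
    // Start at the beginning if the file is shorter than the tail length
    fseek( $handle, max( 0, $stat['size'] - $tailLength ) );
    $tail = stream_get_contents( $handle );
    fclose( $handle );

    return $tail === false || strpos( $tail, $signature ) !== false;
}
```

Since every zip-based format ends with this directory, the check rejects ordinary zip archives too, which is the heavy-handed trade-off described above.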

An alternative would be to parse the entire zip directory and to reject any archives that contain a file with a .class extension. I can’t vouch for this method. If you did this, the zip library you used would have to be exactly as tolerant of zip format errors as the one used by Java. It would probably be best to actually shell out to Java to do the test.

Conclusion

The web sucks. Writing secure web applications is bizarrely complicated and even the most diligent developers get it wrong. Me included. Use this post at your own risk.

Here’s a better idea: restrict web uploads to people who provide a credit card number, and pre-authorise a $50 fine for malicious uploads.

19 Comments

  4. It’s also worth considering moving all uploads to a directory not publicly accessible, then using a download script to get the files; this makes it easier to implement stats, and more over no matter how malicious any file is, you never need worry as the file will never be executed, simply sent right back to the client.

    on the mime side; it’s worth noting that what you’d class as the same file with mime-type can often have multiple mime types, so check for them all (see mp3’s and jpegs for a good example)

    nice post though!

    • “It’s also worth considering moving all uploads to a directory not publicly accessible, then using a download script to get the files; this makes it easier to implement stats, and more over no matter how malicious any file is, you never need worry as the file will never be executed, simply sent right back to the client.”

      This avoids the server-side scripting issues described in the first section, but none of the client-side Cross-Site Scripting issues which make up the majority of the post, unless you set the Content-Disposition and Content-Type headers appropriately.

  6. Wow, that must have been a lot of work. Wouldn’t it have been easier to just do an automated test on the files? If it’s marked as htm, send it to lynx to verify that it’s somewhat valid HTML. If it’s a picture, send it through ImageMagick to make sure. Etc., etc.

    • It’s automated already, you don’t see me checking each upload to wikipedia.org by hand, do you?

      Just because Lynx says something is valid HTML doesn’t mean Java won’t allow it as a JAR file. As Nathan puts it, a file can have multiple types. Each client makes up its own rules for what it’s going to accept and what it won’t, and it’s possible to exploit the differences in those rules to produce a file that looks like one type to one browser, and another type to another browser.

  7. I’m concerned about the copyright status of the code you wrote. Under US law, at least, code you create derived from the disassembly of someone else’s code counts as a derivative work, and so if this code becomes part of MediaWiki, MediaWiki will potentially include code to which Microsoft, at a minimum, has third-party rights. I’m not sure what Australian law on this is, nor am I clear on whether Australian law being different would be of any relief to those who are subject to US law. Still a touchy situation.

    • I’m concerned about it too, but I wasn’t prepared to compromise on security for the purposes of IP rights. There’s a number of defences you could use if the issue came up in court.

      One is that it’s not a derivative work, it just happens to do the same thing. Since the creative elements of the original work, such as variable names, were not preserved in my work, copyright does not carry on. Also the control structures and order of operations were liberally rearranged in order to produce code that does the same thing, but is simpler than Microsoft’s. This is the line I took in the file header.

      Another is that it’s fair use. My work is transformative in purpose, I’m not using it to build a browser, I’m using it to keep web uploads secure. It’s a small excerpt of the original work. There is law that suggests that no excerpt from a computer program is small enough to count as fair use, but there’s an argument that an exception can be made in this case on the basis of public good.
