<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Tim Starling's blog &#187; Web development</title>
	<atom:link href="http://tstarling.com/blog/category/web-development/feed/" rel="self" type="application/rss+xml" />
	<link>http://tstarling.com/blog</link>
	<description>Web software development and Wikimedia</description>
	<lastBuildDate>Wed, 23 Jun 2010 03:52:26 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Measuring memory usage with strace</title>
		<link>http://tstarling.com/blog/2010/06/measuring-memory-usage-with-strace/</link>
		<comments>http://tstarling.com/blog/2010/06/measuring-memory-usage-with-strace/#comments</comments>
		<pubDate>Wed, 23 Jun 2010 03:52:26 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[Web development]]></category>

		<guid isPermaLink="false">http://tstarling.com/blog/?p=42</guid>
		<description><![CDATA[In the tradition of abusing high-level Linux tools to produce useful low-level data, I present a method for estimating peak memory usage in Linux by text-processing the output from strace:

measure-memory

This Perl script invokes an arbitrary command via strace. It adds up memory allocated by mmap2() with no location hint and the file handle set to [...]]]></description>
			<content:encoded><![CDATA[<p>In the <a href="http://poormansprofiler.org/">tradition</a> of abusing high-level Linux tools to produce useful low-level data, I present a method for estimating peak memory usage in Linux by text-processing the output from strace:</p>
<ul>
<li><a href="/stuff/measure-memory">measure-memory</a></li>
</ul>
<p>This Perl script invokes an arbitrary command via strace. It adds up memory allocated by mmap2() with no location hint and the file descriptor set to -1, which is the way that malloc() typically allocates large amounts of memory. It also counts calls to brk(), and subtracts the sizes of munmap() calls for maps that were previously counted. It outputs the current memory usage rounded off to the nearest megabyte, whenever that number changes.</p>
<p><a href="http://stackoverflow.com/questions/1080461/peak-memory-measurement-of-long-running-process-in-linux">Other methods</a> for measuring peak memory usage typically revolve around polling /proc for resident set size (RSS), which potentially misses short-lived allocations. The GNU time command (/usr/bin/time, not the one built into bash) can show peak RSS, but in some applications this can be a vast overestimate of physical memory usage, due to the way Linux counts RSS.</p>
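<p>As a rough sketch of that bookkeeping (not the measure-memory script itself, which is Perl), the same idea can be written in PHP. The function name trackStraceLine() and the regular expressions are assumptions about what typical strace output looks like, and partial unmaps are ignored:</p>
<pre>// Hypothetical helper: fold one line of strace output into the running state.
// $state is array( bytes counted so far, map of address to size, last break or null ).
function trackStraceLine( $line, $state ) {
    list( $total, $maps, $brk ) = $state;
    $m = array();
    if ( preg_match( '/^mmap2?\(NULL, (\d+), .*, -1, 0\) += 0x([0-9a-f]+)/', $line, $m ) ) {
        // Anonymous map with no location hint: count it and remember its size
        $maps[$m[2]] = (int)$m[1];
        $total += (int)$m[1];
    } elseif ( preg_match( '/^munmap\(0x([0-9a-f]+), /', $line, $m ) ) {
        // Subtract only maps that were previously counted
        if ( isset( $maps[$m[1]] ) ) {
            $total -= $maps[$m[1]];
            unset( $maps[$m[1]] );
        }
    } elseif ( preg_match( '/^brk\(.*\) += 0x([0-9a-f]+)/', $line, $m ) ) {
        // brk() returns the current break; its movement is heap growth
        $new = hexdec( $m[1] );
        if ( $brk !== null ) {
            $total += $new - $brk;
        }
        $brk = $new;
    }
    return array( $total, $maps, $brk );
}</pre>
<p>Starting from array( 0, array(), null ) and feeding in each line of strace output in turn, element 0 of the state is the current estimate, and the peak is simply the largest value seen along the way.</p>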
<p>My method provides a reasonable estimate of the amount of memory allocated with malloc(). That can be a useful thing to know.</p>
]]></content:encoded>
			<wfw:commentRss>http://tstarling.com/blog/2010/06/measuring-memory-usage-with-strace/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>PHP memory optimisation ideas</title>
		<link>http://tstarling.com/blog/2010/01/php-memory-optimisation-ideas/</link>
		<comments>http://tstarling.com/blog/2010/01/php-memory-optimisation-ideas/#comments</comments>
		<pubDate>Thu, 14 Jan 2010 04:16:41 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[Web development]]></category>

		<guid isPermaLink="false">http://tstarling.com/blog/?p=33</guid>
		<description><![CDATA[My vague rant about PHP 5.3&#8217;s memory usage on php.internals turned into something potentially more useful when Stanislav Malyshev (a.k.a. Stas) started responding to it in an intelligent way, forcing me to come up with some more concrete ideas and to justify them. Some of the resulting text is quoted below, edited so that it [...]]]></description>
			<content:encoded><![CDATA[<p>My <a href="http://news.php.net/php.internals/46696">vague rant</a> about PHP 5.3&#8217;s memory usage on php.internals turned into something potentially more useful when <a href="http://php100.wordpress.com/">Stanislav Malyshev</a> (a.k.a. Stas) started responding to it in an intelligent way, forcing me to come up with some more concrete ideas and to justify them. Some of the resulting text is quoted below, edited so that it makes sense in this format.</p>
<pre>&lt;?php
$m = memory_get_usage();
$a = explode(',', str_repeat(',', 100000));
print (memory_get_usage() - $m)/100000;
?&gt;</pre>
<p>This is said to use 170 to 260 bytes per element on a 64-bit architecture. I think this is excessive.</p>
<p><strong>Stas</strong>: I do not see what could be removed from Bucket or zval without hurting the functionality.</p>
<p><strong>Tim</strong>: Right, and that&#8217;s why PHP is so bad compared to other languages. Its one-size-fits-all data structure has to store a lot of data per element to support every possible use case. However, there is room for optimisation. For instance, an array could start off as being like a C++ std::vector. Then when someone inserts an item into it with a non-integer key, it could be converted to a hashtable. This could potentially give you a time saving as well, because conversion to a hashtable could resize the destination hashtable in one step instead of growing it O(log N) times.</p>
<p>Some other operations, like deleting items from the middle of the array or adding items past the end (leaving gaps) would also have to trigger conversion. The point would be to optimise the most common use cases for integer-indexed arrays.</p>
<p>What about objects that can optionally pack themselves into a class-dependent structure and unpack on demand?</p>
<p><strong>Stas</strong>: Objects can do pretty much anything in Zend Engine now, provided you do some C :) For the engine, object is basically a pointer and an integer, the rest is changeable. Of course, on PHP level we need to have more, but that&#8217;s because certain things just not doable on PHP level. Do you have some specific use case that would allow to reduce memory usage?</p>
<p><strong>Tim</strong>: Basically I&#8217;m thinking along the same lines as the array optimisation I suggested above. For my sample class, the zend_class_entry would have a hashtable like:</p>
<p>v1 =&gt; 0, v2 =&gt; 1, v3 =&gt; 2, v4 =&gt; 3, v5 =&gt; 4, v6 =&gt; 5, v7 =&gt; 6, v8 =&gt; 7, v9 =&gt; 8, v10 =&gt; 9</p>
<p>The class is:</p>
<pre>class C { var $v1, $v2, $v3, $v4, $v5, $v6, $v7, $v8, $v9, $v10; }</pre>
<p>Then the object could be stored as a zval[10]. Object member access would be implemented by looking up the member name in the class entry hashtable and then using the resulting index into the zval[10]. When the object is unpacked (say if the user creates or deletes object members at runtime), then the object value becomes a hashtable.</p>
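<p>The lookup can be modelled in userland PHP, purely as an illustration: the real thing would live in the engine, in C. The names $classSlots, readPacked() and unpackObject() are hypothetical, and a three-member class is assumed to keep it short:</p>
<pre>// Per-class table, shared by all instances: member name to slot index
$classSlots = array_flip( array( 'v1', 'v2', 'v3' ) );

// Read a member of a packed object, which is just a list of values
function readPacked( $classSlots, $object, $name ) {
    return $object[ $classSlots[$name] ];
}

// Unpack to a name-keyed table, as needed once members are
// created or deleted at runtime
function unpackObject( $classSlots, $object ) {
    return array_combine( array_keys( $classSlots ), $object );
}</pre>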
<p><strong>Stas</strong>: That would mean having 2 object types &#8211; &#8220;packed&#8221; and &#8220;unpacked&#8221; with all (most of) operations basically duplicated. However, for objects it&#8217;s easier than for arrays since objects API is more abstract. I&#8217;m not sure that would improve situation though &#8211; a lot of objects are dynamic and for those it would mean a penalty when the object is unpacked.</p>
<p>But this can be tested on the current engine (maybe even without breaking BC!) and if it gives good results it may be an option.</p>
<p><strong>Tim</strong>: What about an oparray format with fewer 64-bit pointers and more smallish integers?</p>
<p><strong>Stas</strong>: I&#8217;m not sure how the data op array needs can be stored without using pointers.</p>
<p><strong>Tim</strong>: Making oplines use a variable amount of memory (like they do in machine code) would be a great help.</p>
<p>For declarations, you could pack structures like zend_class_entry and zend_function_entry on to the end of the opline, and access them by casting the opline to the appropriate opcode-specific type. That would save pointers and also allocator overhead.</p>
<p>At the more extreme end of the spectrum, the compiler could produce a pointerless oparray, like JVM bytecode. Then when a function is executed for the first time, the oparray could be expanded, with pointers added, and the result cached. This would reduce memory usage for code which is never executed. And it would have the added advantage of making APC easier to implement, since it could just copy the whole unexpanded oparray with memcpy().</p>
<p><strong>Stas</strong>: opcodes can be cached (bytecode caches do it) but op_array can&#8217;t really be cached between requests because it contains dynamic structures. Unlike Java, PHP does full cleanup after each request, which means no preserving dynamic data.</p>
<p><strong>Tim</strong>: APC deep-copies the whole zend_op_array, see apc_copy_op_array() in apc_compile.c. It does it using an impressive pile of hacks which break with every major release and in some minor releases too. Every time the compiler allocates memory, there has to be a matching shared memory allocation in APC.</p>
<p>But maybe you missed my point. I&#8217;m talking about a cache which is cheap to construct and cleared at the end of each request. It would optimise tight loops of calls to user-defined functions. The dynamic data, like static variable hashtables, would be in it. The compact pointerless structure could be stored between requests, and would not contain dynamic data.</p>
<p>Basically a structure like the current zend_op_array would be created on demand by the executor instead of in advance by the compiler.</p>
<p><strong>Stas</strong>: I&#8217;m not sure how using pointers in op_array in such manner would help though &#8211; you&#8217;d still need to store things like function names, for example, and since you need to store it somewhere, you&#8217;d also have some pointer to this place. Same goes for a bunch of other op_array&#8217;s properties &#8211; you&#8217;d need to store them somewhere and be able to find them, so I don&#8217;t see how you&#8217;d do it without a pointer of some kind involved.</p>
<p><strong>Tim</strong>: You can do it with a length field and a char[1] at the end of the structure. When you allocate memory for the structure, you add some on for the string. Then you copy the string into the char[1], overflowing it.</p>
<p>If you need several strings, then you can have several byte offsets, which are added to the start of the char[1] to find the location of the string in question. You can make the offset fields small, say 16 bits.</p>
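<p>To make the layout concrete, here is a model of it using PHP&#8217;s pack() and unpack(). It is only an analogy for the C structure: the function names are hypothetical, and a 16-bit count field at the front is an assumption added so that the start of the string area can be found.</p>
<pre>// Hypothetical model: several strings in one buffer, located by
// 16-bit offsets rather than pointers
function packStrings( $strings ) {
    $header = '';
    $body = '';
    foreach ( $strings as $s ) {
        // 'v' is an unsigned 16-bit little-endian integer
        $header .= pack( 'v', strlen( $body ) );
        $body .= $s . "\0";
    }
    return pack( 'v', count( $strings ) ) . $header . $body;
}

// Fetch string number $i by adding its offset to the start of the string area
function unpackString( $buf, $i ) {
    $fields = unpack( 'vcount', $buf );
    $count = $fields['count'];
    $fields = unpack( 'voffset', substr( $buf, 2 + 2 * $i, 2 ) );
    $start = 2 + 2 * $count + $fields['offset'];
    return substr( $buf, $start, strpos( $buf, "\0", $start ) - $start );
}</pre>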
<p>But it&#8217;s mostly zend_op I&#8217;m interested in rather than zend_op_array. Currently if a zend_op has a string literal argument, you&#8217;d make a zval for it and copy it into op1.u.constant. But the zval allocation could be avoided. The handler could cast the zend_op to a zend_op_with_a_string, which would have a length field and an overflowed char[1] at the end for the string argument.</p>
<p>A variable op size would make iterating through zend_op_array.opcodes slightly more awkward, something like:</p>
<pre>for (; op &lt; oparray_end; op = (zend_op*)((char*)op + op-&gt;size)) {
   ...</pre>
<p>But obviously you could clean that up with a macro.</p>
<p>For the skeptical Mr. &#8220;everyone has 8GB of memory and tiny little data sets&#8221; <a href="http://lerdorf.com/">Lerdorf</a>, I could point out that reducing the average zend_op size and placing strings close to other op data will also make execution faster, due to the improved CPU cache hit rate.</p>
]]></content:encoded>
			<wfw:commentRss>http://tstarling.com/blog/2010/01/php-memory-optimisation-ideas/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Secure web uploads</title>
		<link>http://tstarling.com/blog/2008/12/secure-web-uploads/</link>
		<comments>http://tstarling.com/blog/2008/12/secure-web-uploads/#comments</comments>
		<pubDate>Tue, 16 Dec 2008 10:13:47 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[Security]]></category>
		<category><![CDATA[Web development]]></category>
		<category><![CDATA[Wikimedia]]></category>

		<guid isPermaLink="false">http://tstarling.com/blog/?p=4</guid>
		<description><![CDATA[I&#8217;ve written hundreds of mailing list posts over the years, in my role first as a volunteer software developer and system administrator for Wikipedia, and later as an employee in the same role. But I&#8217;ve never had my own domain name, and I&#8217;ve never had a blog.
But I do have things to say, and I&#8217;ve [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve written hundreds of mailing list posts over the years, in my role first as a volunteer software developer and system administrator for <a href="http://en.wikipedia.org/">Wikipedia</a>, and later as an employee in the same role. But I&#8217;ve never had my own domain name, and I&#8217;ve never had a blog.</p>
<p>But I do have things to say, and I&#8217;ve often thought about setting up a soap box such as this, with the aim of reaching a wider audience than the mailing lists I usually post to. An important issue has finally come up, and I feel compelled to tell you about it. So I have created this blog.</p>
<p>The issue is a basic feature, which is present in many web applications: <strong>file uploads</strong>. Due to design choices by the browsers, particularly Internet Explorer, it turns out to be extremely difficult to allow users to upload arbitrary files, without endangering the security of the application.</p>
<p>We spent a lot of time working on secure uploads for <a href="http://www.mediawiki.org/">MediaWiki</a>, and we thought we had it more or less right. But it turns out that our handling of Internet Explorer wasn&#8217;t nearly rigorous enough, and there were still a number of ways to use file uploads to steal the authentication cookies of Internet Explorer users. In <a href="http://lists.wikimedia.org/pipermail/mediawiki-announce/2008-December/000080.html">MediaWiki 1.13.3</a>, I have, hopefully, closed these gaps. I did this by reverse-engineering three versions of Internet Explorer.</p>
<p>In the rest of this post, I&#8217;ll give a tutorial to building a file upload application, working through the security pitfalls from the most naive to the most subtle. I&#8217;ll use PHP in my examples, but none of the issues here are PHP-specific.</p>
<p><span id="more-4"></span></p>
<h2>Upload feature in 10 lines of code: what could possibly go wrong?</h2>
<p>Let&#8217;s suppose an unlucky newbie developer decided to build their upload feature by following the <a href="http://www.php.net/manual/en/features.file-upload.php">example in the PHP manual</a>. What could possibly go wrong?</p>

<div class="wp_syntax"><div class="code"><pre class="php php" style="font-family:monospace;"><span style="color: #000088;">$uploaddir</span> <span style="color: #339933;">=</span> <span style="">'/var/www/uploads/'</span>;
<span style="color: #000088;">$uploadfile</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$uploaddir</span> <span style="color: #339933;">.</span> <span style="color: #990000;">basename</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$_FILES</span><span style="color: #009900;">&#91;</span><span style="">'userfile'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="">'name'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>;
&nbsp;
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #990000;">move_uploaded_file</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$_FILES</span><span style="color: #009900;">&#91;</span><span style="">'userfile'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="">'tmp_name'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> <span style="color: #000088;">$uploadfile</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #990000;">echo</span> <span style="color: #0000ff;">&quot;File is valid, and was successfully uploaded.<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>;
<span style="color: #009900;">&#125;</span> <span style="color: #b1b100;">else</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #990000;">echo</span> <span style="color: #0000ff;">&quot;Possible file upload attack!<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>;
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>It opens up an arbitrary script execution vulnerability. An attacker can just upload a file ending with .php, navigate to it in their browser, and the server will execute it. Many web applications are (or have been) open to this most basic and severe attack.</p>
<p>It is particularly severe because there is a profit motive to exploit it. Spammers have written scripts to search for these kinds of vulnerabilities. They automatically upload a script which runs perpetually, in a virtual() loop to avoid max_execution_time, which relays spam from another host out to the Internet. They can also use vulnerabilities such as this to set up a spamvertised website on the server.</p>
<p>If your code is going to be distributed, it&#8217;s not enough to ask the user to disable PHP execution in the /var/www/uploads directory. Nobody reads the manual. Instead we have to work out the circumstances under which a typically configured web server will execute a file as a script, and make sure those circumstances never arise for uploaded files. In practice that means checking the file extension.</p>
<p>There is another pitfall, however, which is that some web servers (notably Apache with <a href="http://httpd.apache.org/docs/2.2/mod/mod_mime.html">mod_mime</a>) consider files to have <strong>multiple extensions</strong>. For example, index.php.fr is considered to be the equivalent of index.php, but in French. So to be secure, we must compile a blacklist of script extensions, and check each part of the filename against it.</p>
<p>Don&#8217;t forget that file extensions are case-insensitive, so convert them to lower case before comparing.</p>

<div class="wp_syntax"><div class="code"><pre class="php php" style="font-family:monospace;"><span style="color: #000088;">$name</span> <span style="color: #339933;">=</span> <span style="color: #990000;">basename</span><span style="color: #009900;">&#40;</span> <span style="color: #000088;">$_FILES</span><span style="color: #009900;">&#91;</span><span style="">'userfile'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="">'name'</span><span style="color: #009900;">&#93;</span> <span style="color: #009900;">&#41;</span>;
<span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span> isBadExtension<span style="color: #009900;">&#40;</span> <span style="color: #000088;">$name</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #990000;">echo</span> <span style="color: #0000ff;">&quot;Bad extension!&quot;</span>;
    <span style="color: #b1b100;">return</span>;
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">function</span> isBadExtension<span style="color: #009900;">&#40;</span> <span style="color: #000088;">$name</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #990000;">global</span> <span style="color: #000088;">$extensionBlacklist</span>;
    <span style="color: #000088;">$extensions</span> <span style="color: #339933;">=</span> <span style="color: #990000;">explode</span><span style="color: #009900;">&#40;</span> <span style="">'.'</span><span style="color: #339933;">,</span> <span style="color: #000088;">$name</span> <span style="color: #009900;">&#41;</span>;
    <span style="color: #990000;">unset</span><span style="color: #009900;">&#40;</span> <span style="color: #000088;">$extensions</span><span style="color: #009900;">&#91;</span><span style="color:#800080;">0</span><span style="color: #009900;">&#93;</span> <span style="color: #009900;">&#41;</span>;
    <span style="color: #000088;">$extensions</span> <span style="color: #339933;">=</span> <span style="color: #990000;">array_map</span><span style="color: #009900;">&#40;</span> <span style="">'strtolower'</span><span style="color: #339933;">,</span> <span style="color: #000088;">$extensions</span> <span style="color: #009900;">&#41;</span>;
    <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span> <span style="color: #000088;">$extensions</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$extension</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span> <span style="color: #990000;">in_array</span><span style="color: #009900;">&#40;</span> <span style="color: #000088;">$extension</span><span style="color: #339933;">,</span> <span style="color: #000088;">$extensionBlacklist</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
            <span style="color: #b1b100;">return</span> <span style="color: #000000; font-weight: bold;">true</span>;
        <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span>
    <span style="color: #b1b100;">return</span> <span style="color: #000000; font-weight: bold;">false</span>;
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>Bad server-side extensions:</p>
<ul>
<li>php, phtml, php3, php4, php5, phps</li>
<li>shtml</li>
<li>jhtml</li>
<li>pl</li>
<li>py</li>
<li>cgi</li>
</ul>
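<p>The isBadExtension() listing above reads a global $extensionBlacklist that it never defines. One way to fill it in from this list is sketched here, along with a variant that takes the list as a parameter; the names $serverSideBlacklist and hasBadExtension() are made up for the example:</p>

<div class="wp_syntax"><div class="code"><pre class="php php" style="font-family:monospace;">// Server-side script extensions from the list above, all lower case
$serverSideBlacklist = array(
    'php', 'phtml', 'php3', 'php4', 'php5', 'phps',
    'shtml', 'jhtml', 'pl', 'py', 'cgi',
);

// Variant of isBadExtension() taking the list as a parameter instead of a global
function hasBadExtension( $name, $blacklist ) {
    $parts = array_map( 'strtolower', explode( '.', $name ) );
    // Drop the part before the first dot; every later part counts as an extension
    array_shift( $parts );
    return count( array_intersect( $parts, $blacklist ) ) !== 0;
}</pre></div></div>

<p>With this, index.PHP.fr is rejected just as index.php would be, while a file that is merely named &#8220;php&#8221;, with no dot, is not.</p>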
<h2>Client-side scripts</h2>
<p>We know that certain file types will be executed by web servers as scripts, and these should not be allowed to be uploaded. Similarly, certain file types may contain scripts that will be executed by the client web browser. These types must be detected, and then either validated or disallowed. Although client scripts can&#8217;t take over the client&#8217;s computer and use it for sending spam, they can steal the user&#8217;s login credentials for your web site, and use them to do anything the user can do. JavaScript running from the same origin as your application will have full access to the application&#8217;s cookies. If the victim is an administrator for a web app such as Drupal or WordPress, the attacker may be able to insert arbitrary PHP code into the site&#8217;s skin, and thus take over the server.</p>
<p>To the list of client-side hazards, we will also add file types which can easily be downloaded and executed on the client computer, giving the uploader/attacker full control, without the user being properly warned of the risks of doing so.</p>
<p>Bad client-side extensions:</p>
<ul>
<li>html, htm, mhtml, mht</li>
<li>svg</li>
<li>exe, scr, msi, com, pif, cmd, cpl</li>
<li>js, jsb, vbs, bat</li>
</ul>
<p>For a long time, MediaWiki omitted <a href="http://www.w3.org/Graphics/SVG/">SVG</a> from the list, despite the fact that it has been as dangerous as HTML since Firefox 1.5.</p>
<p>Rather than maintain an extensive blacklist of file types in your application, it&#8217;s probably easier to just have a whitelist, say, just allowing the common image formats. But if you let the user configure the whitelist, you need to have a mechanism for warning them when they try to allow one of these dangerous types. And as we will see shortly, controlling the extension alone is not sufficient to provide security.</p>
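<p>The warning mechanism could be as simple as intersecting the configured whitelist with the dangerous types. A minimal sketch, in which riskyWhitelistEntries() is a hypothetical name and the list is the client-side one given above:</p>

<div class="wp_syntax"><div class="code"><pre class="php php" style="font-family:monospace;">// Client-side extensions from the list above
$clientSideBlacklist = array(
    'html', 'htm', 'mhtml', 'mht', 'svg',
    'exe', 'scr', 'msi', 'com', 'pif', 'cmd', 'cpl',
    'js', 'jsb', 'vbs', 'bat',
);

// Hypothetical config check: which whitelist entries deserve a warning?
function riskyWhitelistEntries( $whitelist, $dangerous ) {
    $whitelist = array_map( 'strtolower', $whitelist );
    return array_values( array_intersect( $whitelist, $dangerous ) );
}</pre></div></div>

<p>A real application would presumably merge in the server-side list as well before showing the warning.</p>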
<h2>Content type detection</h2>
<p>Back in around 1997, Microsoft decided that web application developers were having it too easy. They could blacklist a few bad file extensions and create a reasonably secure file upload application. So with the release of Internet Explorer 4.0, they launched a crackdown on this secure practice. The result was FindMimeFromData().</p>
<p>Apparently some users were uploading files with the wrong extension on them, or something. So the IE team decided that they weren&#8217;t going to trust the content type specified by the server (generally derived from the extension), and instead, they were going to try to detect the file type by looking at the data. They assigned an inexperienced developer to the case, and never bothered to properly test the resulting code.</p>
<p>The community determined that now, if a file had some HTML tags in the first 256 bytes, under certain obscure circumstances, Internet Explorer would decide that the file was in fact HTML, and go on to execute malicious scripts contained within the file. But Microsoft never documented which HTML tags would cause this, or what the circumstances were in which the type could be reassigned. They did release a vague and incomplete document called <a href="http://msdn.microsoft.com/en-us/library/ms775147(VS.85).aspx">MIME Type Detection in Internet Explorer</a>, but that wasn&#8217;t much help.</p>
<p>The general approach by a security-conscious web application is to check the first part of the file for HTML tags, to determine if IE will detect the file as HTML. Such files can then be rejected. The problem is, the algorithm in the web application must precisely match the secret algorithm used by IE. If the web application is less strict than IE in any minor detail during the process, that detail can be exploited by an attacker to create a file which will be accepted by the web application, but detected as HTML by IE.</p>
<p>I didn&#8217;t think this was an acceptable situation. So with the help of <a href="http://www.hex-rays.com/idapro/">IDA Pro Freeware</a>, I disassembled the relevant code in IE 5.0, 6.0 and 7.0. I then ported the whole thing from assembly language to PHP, with version differences incorporated into the code as conditional blocks. A web application can use this code to determine, with some confidence, what type Internet Explorer will assign to a file, and thus whether it should be allowed or not.</p>
<p>The algorithm is large and complex, so rather than detail it here in full, I would prefer to encourage reuse, redistribution and porting of my PHP code, which can be found on MediaWiki&#8217;s Subversion server:</p>
<ul>
<li>Download <a href="http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/IEContentAnalyzer.php?view=markup">IEContentAnalyzer.php</a></li>
</ul>
<p>If you want to blacklist dangerous file types, or you have a user-configurable whitelist, you are probably best off using the full code. For developers who just want to whitelist a few image types, it&#8217;s probably overkill, so I will give a synopsis here of the relevant parts.</p>
<h3>FindMimeFromData synopsis</h3>
<p>The algorithm proceeds as follows:</p>
<ol>
<li>First, look at the server&#8217;s proposed content type. If it is <strong>not in a list</strong> of &#8220;known&#8221; types, then immediately accept that type.</li>
<li>If the proposed content type is text/html, image/gif, image/jpeg or, as of IE 7, image/png, then do a special case check to see if the data matches that declared type. If it does match, the type is returned.</li>
<li>Do a heuristic test on the data to see if it looks like CDF, RSS, Atom, other XML, HTML, XBitMap, BinHex or &#8220;scriptlet&#8221;.  If the heuristic test matches, return the corresponding type.</li>
<li>Look for magic numbers in the first few bytes of the file, for a sizeable list of candidate file types. If a match is found, return that type.</li>
<li>Do a heuristic test to determine if the data is &#8220;text&#8221; or &#8220;binary&#8221;. This test is buggy and will detect non-ASCII text as binary.</li>
<li>If the server&#8217;s proposed content type is known to be a binary type, and the heuristic suggests that the file is binary, return the proposed type.</li>
<li>If the server&#8217;s proposed content type is known to be a text type, and the heuristic suggests that the file is text, return the proposed type.</li>
<li>If the server&#8217;s proposed content type is on a list usually containing only text/html, return the proposed type.</li>
<li>Search HKEY_CLASSES_ROOT to see if the file extension has a corresponding MIME type which might be returned. If it does, return it.</li>
<li>Search HKEY_CLASSES_ROOT to see if the file extension has an application registered to it. If it does, return application/octet-stream.</li>
<li>Return text/plain or application/octet-stream according to the result of step 5.</li>
</ol>
<p>The trick for whitelisters is that you might be able to get off the ride at step 1 or 2, and so avoid the complexity of steps 3 to 11. For step 1, the known types are:</p>
<ul>
<li>In IE 5: text/richtext, image/x-bitmap, application/postscript, application/base64, application/macbinhex40, application/x-cdf, text/scriptlet, application/pdf, audio/x-aiff, audio/basic, audio/wav, image/gif, image/pjpeg, image/jpeg, image/tiff, image/x-png, image/png, image/bmp, image/x-jg, image/x-art, image/x-emf, image/x-wmf, video/avi, video/x-msvideo, video/mpeg, application/x-compressed, application/x-zip-compressed, application/x-gzip-compressed, application/java, application/x-msdownload, text/html</li>
<li>In IE 7, text/xml and application/xml were added.</li>
</ul>
<p>So if your type is not on this list, and you can be sure the webserver will always return it, then you can allow that upload. This is useful if your web app needs to stream arbitrary user data for some other reason: you can send a header &#8220;Content-Type: application/x-my-secret-type&#8221;, and IE won&#8217;t interpret it as HTML.</p>
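<p>A sketch of that check, assuming the list of known types above is complete. The names $ieKnownTypes, ieKnowsType() and sendWithOpaqueType() are hypothetical. Lower-casing the type before comparing is a deliberately cautious assumption, since it can only make more types count as known:</p>

<div class="wp_syntax"><div class="code"><pre class="php php" style="font-family:monospace;">// The known types listed above for IE 5, plus the two added in IE 7
$ieKnownTypes = array(
    'text/richtext', 'image/x-bitmap', 'application/postscript', 'application/base64',
    'application/macbinhex40', 'application/x-cdf', 'text/scriptlet', 'application/pdf',
    'audio/x-aiff', 'audio/basic', 'audio/wav', 'image/gif', 'image/pjpeg', 'image/jpeg',
    'image/tiff', 'image/x-png', 'image/png', 'image/bmp', 'image/x-jg', 'image/x-art',
    'image/x-emf', 'image/x-wmf', 'video/avi', 'video/x-msvideo', 'video/mpeg',
    'application/x-compressed', 'application/x-zip-compressed',
    'application/x-gzip-compressed', 'application/java', 'application/x-msdownload',
    'text/html',
    // Added in IE 7
    'text/xml', 'application/xml',
);

// True if step 1 would NOT accept this type as-is
function ieKnowsType( $type, $knownTypes ) {
    return in_array( strtolower( trim( $type ) ), $knownTypes, true );
}

// Send the header only if IE will take the type at face value
function sendWithOpaqueType( $type, $knownTypes ) {
    if ( ieKnowsType( $type, $knownTypes ) ) {
        return false;
    }
    header( 'Content-Type: ' . $type );
    return true;
}</pre></div></div>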
<p>For step 2, the magic numbers are as follows:</p>
<ul>
<li>HTML: complex, but ASCII &#8220;&lt;html&#8221; in the first 255 bytes is sufficient</li>
<li>GIF: first 5 bytes must be GIF87 or GIF89. Note that PHP&#8217;s getimagesize() only checks that the first 3 bytes are &#8220;GIF&#8221;, so if you use getimagesize(), you will be insecure. Don&#8217;t rely on any libraries; check the magic numbers yourself.</li>
<li>JPEG: first 2 bytes must be hexadecimal FF D8.</li>
<li>PNG (only for IE 7+): first 8 bytes must be hexadecimal 89 50 4E 47 0D 0A 1A 0A</li>
</ul>
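<p>For the three image types, those checks come down to a few string comparisons. This is a minimal sketch rather than the code in IEContentAnalyzer.php; checkImageMagic() is a hypothetical name, and the HTML case is left out because it is the complex one:</p>

<div class="wp_syntax"><div class="code"><pre class="php php" style="font-family:monospace;">// $head is the first few bytes of the upload (8 or more is enough).
// Returns true if the data carries the magic number for the declared image type.
function checkImageMagic( $declaredType, $head ) {
    switch ( $declaredType ) {
        case 'image/gif':
            $start = substr( $head, 0, 5 );
            return $start === 'GIF87' || $start === 'GIF89';
        case 'image/jpeg':
            return substr( $head, 0, 2 ) === "\xFF\xD8";
        case 'image/png':
            // Only IE 7 and later special-case PNG
            return substr( $head, 0, 8 ) === "\x89PNG\r\n\x1a\n";
    }
    // Not one of the whitelisted image types
    return false;
}</pre></div></div>

<p>The bytes can be read with something like file_get_contents( $path, false, null, 0, 8 ), and any upload for which the function returns false should be rejected.</p>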
<h3>Safari</h3>
<p>Safari is known to use a similar content detection process, based on the Internet Explorer one. But they didn&#8217;t have access to the Internet Explorer code when they wrote it, nor to my disassembly results. So it differs in many minor details. I can&#8217;t tell you what those details are in any rigorous way; I just know a few by word of mouth. Like Internet Explorer, it is closed-source and undocumented. So all I can say is that if you use Safari, don&#8217;t advertise the fact to any potentially malicious people. You&#8217;re probably insecure in most web apps.</p>
<h2>Browser plugins</h2>
<p>There are only two browser plugins which are used by large numbers of people across multiple browsers and platforms: Flash and Java. Both of them create severe security problems in a file upload application, and need to be dealt with specially.</p>
<h3>Flash</h3>
<p>A reasonable description of the cross-domain-policy problem in web applications was written by Stefan Esser, titled <a href="http://www.hardened-php.net/library/poking_new_holes_with_flash_crossdomain_policy_files.html">Poking new holes with Flash Crossdomain Policy Files</a>. In a nutshell: you need to scan uploaded files in their entirety for the text &#8220;&lt;cross-domain-policy&gt;&#8221;, with optional whitespace between the tag name and the angle brackets, and reject any matching files. If you don&#8217;t, your server will be exposed to CSRF vulnerabilities. External flash applets see this text as a license to breach the same-origin policy, allowing them to utilise the victim&#8217;s cookies for arbitrary purposes.</p>
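<p>The scan itself can be a single regular expression. In this sketch, containsCrossDomainPolicy() is a hypothetical name and matching case-insensitively is an extra precaution rather than something established above. The angle brackets are written as \x3c and \x3e escapes only so that the listing survives being embedded in HTML:</p>

<div class="wp_syntax"><div class="code"><pre class="php php" style="font-family:monospace;">// $data must be the whole file, not just the first part of it
function containsCrossDomainPolicy( $data ) {
    // \x3c and \x3e are the opening and closing angle brackets
    return preg_match( '/\x3c\s*cross-domain-policy\s*\x3e/i', $data ) === 1;
}</pre></div></div>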
<h3>Java</h3>
<p>The so-called <a href="http://hackaday.com/2008/08/04/the-gifar-image-vulnerability/">GIFAR vulnerability</a> is particularly irritating, and MediaWiki only deals with it in a heavy-handed manner, by blacklisting uploads of zip and zip-like file formats (such as OpenOffice formats).</p>
<p>The problem is that an external web page can embed a Java applet using a JAR file hosted on your site. Such a Java applet can perform requests with the cookies of the site that hosts it, thus opening up a client script injection vulnerability.</p>
<p>The Java plugin doesn&#8217;t check if the file has a .jar extension. It doesn&#8217;t even check if the file has the proper magic number at the start; it just executes the file regardless. It does check to see if the directory at the end has an appropriate magic number, so you can use that to blacklist potential JAR files, and all other zip files along with it. To do this, search the last 65558 bytes of the file for the hexadecimal bytes 50 4B 05 06. Reject any files that match.</p>
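<p>That search might look like the following. It is a sketch only: hasZipDirectory() is a hypothetical name, and treating an unreadable file as a match is a design choice made here so that failures lead to rejection.</p>

<div class="wp_syntax"><div class="code"><pre class="php php" style="font-family:monospace;">// True if the tail of the file contains the zip directory signature,
// in which case the upload should be rejected
function hasZipDirectory( $path ) {
    $size = filesize( $path );
    if ( $size === false ) {
        // Unreadable: err on the side of rejecting
        return true;
    }
    // Read at most the last 65558 bytes
    $length = min( $size, 65558 );
    $tail = file_get_contents( $path, false, null, $size - $length, $length );
    // 50 4B 05 06 is "PK" followed by the bytes 05 and 06
    return $tail === false || strpos( $tail, "PK\x05\x06" ) !== false;
}</pre></div></div>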
<p>An alternative would be to parse the entire zip directory and to reject any archives that contain a file with a .class extension. I can&#8217;t vouch for this method. If you did this, the zip library you used would have to be exactly as tolerant of zip format errors as the one used by Java. It would probably be best to actually shell out to Java to do the test.</p>
<h2>Conclusion</h2>
<p>The web sucks. Writing secure web applications is bizarrely complicated and even the most diligent developers get it wrong. Me included. Use this post at your own risk.</p>
<p>Here&#8217;s a better idea: restrict web uploads to people who provide a credit card number, and pre-authorise a $50 fine for malicious uploads.</p>
]]></content:encoded>
			<wfw:commentRss>http://tstarling.com/blog/2008/12/secure-web-uploads/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
	</channel>
</rss>
