anti-XSS

Source: Internet · Editor: 程序博客网 · Date: 2024/05/29 12:47

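For material on browser security, see: http://code.google.com/p/browsersec

This article is reposted from: http://code.google.com/p/browsersec/wiki/Part1#HTML_entity_encoding

For more detail, see the section of the HTML specification on character parsing: http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html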

HTML entity encoding

HTML features a special encoding scheme called HTML entities. The purpose of this scheme is to make it possible to safely render certain reserved HTML characters (e.g., < > &) within documents, as well as to carry high-bit characters safely over 7-bit media. The scheme nominally permits three types of notation:

  • One of the predefined, named entities, in the format of &<name>;, for example &lt; for <, &gt; for >, &rarr; for →, etc.
  • Decimal entities, &#<nn>;, with a number corresponding to the desired Unicode character value, for example &#60; for <, &#8594; for →.
  • Hexadecimal entities, &#x<nn>;, likewise, for example &#x3c; for <, &#x2192; for →.
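The three notations above can be checked with any HTML-aware decoder. The sketch below uses Python's standard html.unescape as a stand-in; the samples and decoded names are illustrative only and do not come from the original text.

```python
import html

# The same three characters (<, >, and the right arrow) in each notation.
samples = {
    "named": "&lt; &gt; &rarr;",
    "decimal": "&#60; &#62; &#8594;",
    "hexadecimal": "&#x3c; &#x3e; &#x2192;",
}

# Decode every sample with the standard-library decoder.
decoded = {kind: html.unescape(text) for kind, text in samples.items()}

for kind, text in decoded.items():
    print(f"{kind:12} -> {text}")
```

All three samples decode to the same string, which is the point of the scheme: the notation is a transport detail, not part of the value.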

In every browser, HTML entities are decoded only in parameter values and in stray text between tags. Entities have no effect on how the general structure of a document is understood, and have no special meaning in sections such as <SCRIPT>. The ability to understand and parse the syntax is still critical to properly understanding the value of a particular HTML parameter, however. For example, as hinted in one of the earlier sections, <A HREF="javascript&#09;:alert(1)"> may need to be parsed as an absolute reference to javascript<TAB>:alert(1), as opposed to a link to something called javascript& with a local URL hash string part of #09;:alert(1).
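A rough way to see where decoding applies is to run a small document through a parser and compare what comes out of an attribute, ordinary text, and a script block. The sketch below uses Python's html.parser, which is only an approximation of browser behavior; Collector and parse are hypothetical names introduced for this illustration.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Records href values, plain text, and script bodies separately."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.hrefs = []
        self.text_chunks = []
        self.script_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        for name, value in attrs:
            if name == "href":
                # Attribute values arrive with entities already decoded.
                self.hrefs.append(value)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so collect and join later.
        target = self.script_chunks if self._in_script else self.text_chunks
        target.append(data)


def parse(markup):
    collector = Collector()
    collector.feed(markup)
    collector.close()
    return (
        collector.hrefs,
        "".join(collector.text_chunks),
        "".join(collector.script_chunks),
    )


hrefs, text, script = parse(
    '<a href="javascript&#09;:alert(1)">x &lt; y</a><script>a &lt; b</script>'
)
print(repr(hrefs[0]))  # the &#09; has become a literal tab
print(repr(text))      # entity decoded in text between tags
print(repr(script))    # entity left untouched inside the script block
```

A filter that inspects the raw attribute string would see javascript&#09; and might miss the scheme; one that inspects the decoded value sees the tab-separated javascript: URL that the parser actually produced.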

Unfortunately, browsers apply different parsing rules to these HTML entity notations. All rendering engines recognize entities with no proper ; terminator, and all permit entities with excessively long, zero-padded notation, but with various thresholds:

Test description                                              MSIE6   MSIE7   MSIE8   FF2     FF3     Safari  Opera   Chrome  Android
Maximum length of a correctly terminated decimal entity       7       7       7       ∞       ∞       8*      ∞       8*      8*
Maximum length of an incorrectly terminated decimal entity    7       7       7       ∞       ∞       8*      ∞       8*      8*
Maximum length of a correctly terminated hex entity           6       6       6       ∞       ∞       8*      ∞       8*      8*
Maximum length of an incorrectly terminated hex entity        0       0       0       ∞       ∞       8*      ∞       8*      8*
Characters permitted in entity names (excluding A-Z a-z 0-9)  none    none    none    - .     - .     none    none    none    none

* Entries one character longer than this limit still get parsed, but incorrectly; for example, &#000000065; becomes a sequence of three characters: \x06, 5, and ;. Entries two or more characters longer are not parsed at all; &#0000000065; is displayed literally.
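The same lenient forms (missing terminator, heavy zero padding) can be fed to whatever decoder a filter relies on, to see whether it agrees with the target browser. The sketch below probes Python's html.unescape, which follows HTML5-style rules and therefore shows one decoder's answers, not those of the legacy browsers listed above; the probes list is made up for this illustration.

```python
import html

probes = [
    "&lt",            # named entity with no ; terminator
    "&#60",           # decimal entity with no ; terminator
    "&#x3c",          # hex entity with no ; terminator
    "&#0000000065;",  # the ten-digit zero-padded decimal from the footnote
]

# Map each probe to what this particular decoder makes of it.
results = {probe: html.unescape(probe) for probe in probes}

for probe, value in results.items():
    print(f"{probe!r:18} -> {value!r}")
```

This decoder accepts all four forms, including the ten-digit entity that the footnote says some engines display literally. A server-side check built on one set of rules and a browser following another can therefore disagree about what a given string means.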

An interesting piece of trivia is that, as per HTML entity encoding requirements, links such as:

http://example.com/?p1=v1&p2=v2

should technically always be encoded in HTML parameters (but not in JavaScript code) as:

<a href="http://example.com/?p1=v1&amp;p2=v2">Click here</a>

In practice, however, the convention is almost never followed by web developers, and browsers compensate for it by treating invalid HTML entities as literal &-containing strings.
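A minimal sketch of both sides of this convention, again using Python's html module as a stand-in encoder and decoder. The second URL, with a parameter named copy, is a hypothetical example chosen to show what can go wrong when the ampersand is left bare.

```python
import html

url = "http://example.com/?p1=v1&p2=v2"

# Escaping for an HTML attribute turns the bare ampersand into &amp;.
attr = html.escape(url, quote=True)
link = f'<a href="{attr}">Click here</a>'
print(link)

# Decoding the escaped form gives back the original URL.
round_trip = html.unescape(attr)

# "&p2" is not an entity name, so the unescaped URL happens to survive.
lenient = html.unescape(url)

# A parameter that looks like an entity name does not survive:
# this decoder reads "&copy" as the copyright sign.
risky = html.unescape("http://example.com/?a=1&copy=2")
print(risky)
```

The unescaped link works only because p2 does not collide with an entity name. Escaping every ampersand removes the dependence on how a particular decoder treats near-miss entities.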