Non-Standard HTML Fuels XSS Attacks

by Mike Willbanks on May 6th, 2007

To start off, I feel I should define XSS:
XSS (cross-site scripting) is a security vulnerability that allows code injection into web pages viewed by other users.

While doing some coding research the other day I found an interesting XSS attack that I had not thought of previously, since I work primarily with standards-based XHTML. Non-standards-based HTML or XHTML makes XSS attacks easier to mount; even when using htmlspecialchars or htmlentities you can run into problems. Since I like to show by example, see the following segment.

Say you are taking user input, putting it into a link, and also displaying it on the page. The input has no strict validation, so the choice was made to use htmlentities or htmlspecialchars to escape it:

if (isset($_GET['user_input'])) {
    $user_input = htmlentities($_GET['user_input'], ENT_QUOTES);
    // Note: the href attribute value is left unquoted, as found in the wild.
    echo "<a href=next_page.php?input={$user_input}>{$user_input}</a>";
} else {
    echo '<form action="" method="get"><input type="text" name="user_input"/></form>';
}

At first you may see nothing wrong with the above, since htmlentities will stop most XSS attacks in this case. However, look at the href attribute and notice that its value is not quoted. htmlentities still allows the user to enter spaces, because a space has no equivalent entity to be translated into.

Say I enter " onclick=alert(null); as the input: I can now execute JavaScript when a user clicks on the link. Sure, this only pops up an alert box, but much more could easily be crafted into this area. One thing that could be done is to use the PHP function urlencode, which encodes the input for a URL. That helps in the mid-term, but a better approach is to standardize the HTML to quote attribute values, in addition to using functions such as urlencode. Better still, filter the user's input to allow only the data you are actually expecting.
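As a sketch of that advice (my own guess at how the fix would look, not the code from the application in question): quote the attribute value and encode for each context separately, urlencode() for the query string and htmlspecialchars() for the HTML text.

```php
<?php
// Hardened version of the vulnerable snippet (sketch).
// A payload like: " onclick=alert(null);
// can no longer break out of the attribute, because the attribute
// value is quoted and spaces are percent-encoded.
if (isset($_GET['user_input'])) {
    $input = $_GET['user_input'];

    // URL context: percent-encode for the query string.
    $url_safe  = urlencode($input);
    // HTML context: escape <, >, &, and both quote styles.
    $html_safe = htmlspecialchars($input, ENT_QUOTES);

    echo '<a href="next_page.php?input=' . $url_safe . '">' . $html_safe . '</a>';
}
```

The key point is that one escaping function is not enough: the same value sits in two different contexts (URL and HTML), and each needs its own encoding.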

I am positive that there are several sites with this type of problem. Many sites already contain XSS vulnerabilities, and many people fail to see how large a problem this can create. If you do your research, though, it is easy to find information showing just how harmful these attacks can be.


  1. Dan

    I came across a PHP filter function to help me prevent XSS.

    What do you think about this function?

    It would be great if a community maintained a GPL function like this, always up to date, so whenever a new XSS vector is found the function could be updated.

    Is there something like this on the web?

  2. Dan, there was one that I found a while back. I will check into this and see if I can find it again. What the package did was scan through tags and attributes against a whitelist and remove the items that failed the check. A blacklist will never fully work, because we will always tend to miss a certain area or a tag. Remember, it's not always JavaScript; it can come through CSS using an "expression".

    Typically, the best way to guard against XSS is to validate the input by checking for the right type and filtering it accordingly. If you are accepting HTML, you can parse it and remove any tags and attributes that are not on the whitelist. For example: use an XML parser to walk each tag and attribute and strip anything not on the whitelist.
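    A minimal sketch of that whitelist approach using PHP's DOMDocument (the function name whitelist_filter and the tag/attribute lists are hypothetical; a production filter does considerably more, e.g. validating attribute contents):

```php
<?php
// Parse an HTML fragment and drop any tag or attribute that is not
// explicitly allowed. Disallowed elements are replaced by their text.
function whitelist_filter($html, array $allowed_tags, array $allowed_attrs)
{
    $doc = new DOMDocument();
    // Wrap in a known root so the fragment parses predictably;
    // @ suppresses warnings about non-standard markup.
    @$doc->loadHTML('<div id="root">' . $html . '</div>');

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//div[@id="root"]//*') as $node) {
        if ($node->parentNode === null) {
            continue; // already detached by an earlier replacement
        }
        if (!in_array(strtolower($node->nodeName), $allowed_tags, true)) {
            // Disallowed element: keep only its text content.
            $node->parentNode->replaceChild(
                $doc->createTextNode($node->textContent), $node);
        } else {
            // Allowed element: strip attributes not on the whitelist.
            for ($i = $node->attributes->length - 1; $i >= 0; $i--) {
                $attr = $node->attributes->item($i);
                if (!in_array(strtolower($attr->name), $allowed_attrs, true)) {
                    $node->removeAttribute($attr->name);
                }
            }
        }
    }

    // Serialize only the children of the wrapper div.
    $out = '';
    foreach ($xpath->query('//div[@id="root"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

    With `$allowed_tags = array('b')` and no allowed attributes, `<b onclick="alert(1)">hi</b>` comes back as plain `<b>hi</b>`.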

    The feature I would really like to see added to strip_tags is an attribute whitelist for parsing through HTML.

  3. Hopefully, the whitelist library you’re talking about is HTML Purifier (can’t pass up a chance to plug a library of my own) 😉 In general, however, just whitelisting tags and attributes isn’t sufficient: you usually also have to validate the attribute contents.

    I must say, I heartily agree with your assertion that non-standard HTML fuels XSS attacks. Things like leaving attribute values unquoted, allowing non-SGML characters, and leaving literal quotes unescaped can all introduce possible XSS vectors due to the quirky nature of browser processing. With standards-compliant HTML, you can be reasonably certain that the result is unambiguous.

    @Dan: That XSS filter you’ve pointed to has been shown to have some vulnerabilities by the good folks over at (this thread comes to mind:,5002)

  4. @Edward:

    I don’t believe that was the one that I had used. The one I used was a single class that let you whitelist either attributes per tag or global attributes and tags. However, from looking at your code, it seems highly flexible and robust; the only drawback seems to be how many objects might need to be instantiated into the variable scope.

    Your comment:

    “Things like not using quotes in attributes, letting non-SGML characters, leaving literal quotes unescaped, can all help introduce possible XSS vectors due to the quirky nature of browser processing.”

    is exactly correct. I find it interesting that so many sites fall down in this area.

    Here are a few statistics showing just how common this vulnerability is:
    In Mitre's CVE data for 2006, 21.5% of reported vulnerabilities were XSS, putting it at the top of the list, and that covers only what was actually reported. This is a scary statistic. The power of an XSS attack is much greater than people seem to think. With XSS, if I know a system, I can forge a request to change the user's password or even retrieve their session cookie, depending on the system; if that is not frightening, I don't know what is. Further, RSnake estimates that approximately 80% of sites are vulnerable.

  5. Ciprian

    You should always use urlencode to encode URL parameters. A space will be translated into %20, making it harmless.

  6. @Ciprian: That is correct; in this case I was just showing an example of what was found in a real-world application. urlencode is the proper way to encode values for the href attribute. However, it is not always used. Essentially, proper filtering plus urlencode or htmlentities should be sufficient, depending on the part of the application. For instance, if you are taking in an integer, you may just append it to the string once it has been verified to be that type of data.

  7. Santosh Patnaik

    Have a look at htmLawed, a small and highly customizable HTML filter/purifier PHP script with anti-XSS capabilities.
