Create a security-based and machine-learning front end

Program your machine-learning front end to identify security inconsistencies in your Node.js applications

In this article, you learn how to create a security front end that automatically learns the proper format for application inputs. With this information, the front end identifies abnormal input and then either blocks it or raises an alert. While this approach is not a perfect solution, it can greatly reduce the risk that applications face.

The sample application that illustrates this technique is written in Node.js, running on the Bluemix® platform. However, the same principles apply to any web application platform, running in any environment.

Don't want to create the app yourself?

For a fast start, you can now deploy the simple, pre-built app directly into Bluemix. From there, you can edit and redeploy the code as many times as you like.

The problem with abnormal input

Many programmers naively expect that the input they get from the Internet will be valid, at least when it appears to come from the client side of their own application. Therefore, they neglect to add sanity checks to verify that the input they receive is legitimate. However, the client is usually a browser under the control of the user. It is not difficult for attackers to send whatever input they want to the server side and potentially break the application.

The ideal solution is for programmers to write better code. However, since it is difficult to change human behavior, it is much easier to create a front end that learns, on your behalf, what the application normally gets. The front end also identifies when it gets anything else and reacts accordingly, for example by sending an alert or blocking the input. It is not a perfect solution, because it can produce both false positives and false negatives, but it does improve the application's security.

Capturing Node.js input before the application

In this article, we modify a Node.js application by adding a module that receives input before the main application and verifies it. This modification is relatively easy because Express, the Node.js web framework that the application uses, supports what it calls middleware. Express middleware is code that receives the request, modifies it in some way, and then forwards it for further processing. Middleware is also how Express applications handle input parsing.

There are four main methods that are used to provide input to a web application:

  1. In the URL's path.
  2. In the URL's query (forms with the GET method).
  3. In the request body as URL-encoded values (forms with the POST method).
  4. In a JSON request that is sent from the client (usually as REST requests with a method of either POST or PUT).
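
As a rough illustration of where each of these four inputs lives, the following sketch pulls them apart by using only what is built into Node.js. The host name, paths, and field names are made up for the example, and a real application would let Express and its parsers do this work.

```javascript
// The host name is a placeholder; only the path and query matter here.
var url = new URL("/rest/int/5?verbose=yes", "http://localhost");

// 1. Path: components after the fixed part can carry input
var pathInput = url.pathname.split("/")[3];

// 2. Query: values always arrive as strings
var queryInput = url.searchParams.get("verbose");

// 3. URL-encoded body: same encoding as the query, so also strings
var formBody = new URLSearchParams("name=Alice&age=30");
var formAge = formBody.get("age");

// 4. JSON body: keeps value types and can be nested
var jsonBody = JSON.parse('{"name": "Alice", "age": 30, "tags": ["a", "b"]}');
var jsonAge = jsonBody.age;

console.log(pathInput, queryInput, typeof formAge, typeof jsonAge);
```

Note that the same logical value (an age of 30) is a string in a form but a number in JSON, which matters later when deciding whether a field is numeric.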

If the application uses POST forms or JSON requests, it has code that parses them. The following snippet shows code that deals with all of these options:

// Parse post and put requests as JSON or a POST Form as appropriate.
app.post("*", bodyParser.json({type: "application/json"}));
app.put("*", bodyParser.json({type: "application/json"}));
app.post("*", bodyParser.urlencoded({type: "application/x-www-form-urlencoded", extended: true}));

For our purpose, we want to catch requests after those parsers. After the lines that call them, insert the following code:

var reqLog = [];

// Catch all HTTP requests
app.all("*", function(req, res, next) {
	// Record the request. req.query always exists, but req.body might be
	// undefined if no body parser handled this request, so default it to {}
	reqLog.push({
		path: req.originalUrl,
		method: req.method,
		query: req.query,
		body: typeof req.body === "undefined" ? {} : req.body
	});

	// Pass the request on to the rest of the application
	next();
});


// Show the request log
app.get("/reqLog", function(req, res) {
	res.send(JSON.stringify(reqLog));
	
	reqLog = [];
});

Use the sample application, run a few requests, and then navigate to https://machine-learning-front-end.mybluemix.net/reqLog to see the results. If the results differ from what you expect, remember that other readers might be using the application at the same time. Also note that reqLog is emptied every time that it is accessed. Finally, you might want to copy the result and paste it into a JSON formatter to see it more clearly.

How does it work?

The app.all("*", function(req, res, next) {…}) call receives requests that match the path (which, in this case, is everything). Because we use .all, it covers all of the methods.

The function parameter has three parameters of its own. The usual two, req and res, have their normal meaning. But there is an additional parameter, next, which isn't used as often. It is a function that you call to pass the request on for further processing. Our normal app.get and app.post handlers can also use this parameter; it is just not needed when the function provides the actual response to send back to the browser.
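
To make the role of next more concrete, here is a minimal sketch of a middleware chain that does not use Express at all. The runChain() helper and the three handlers are hypothetical and exist only for illustration; the real Express dispatcher does a lot more (path matching, error handling, and so on).

```javascript
// Hypothetical, simplified dispatcher: runs the handlers in order and
// only moves to the following one when the current one calls next().
var runChain = function(handlers, req, res) {
	var index = 0;
	var next = function() {
		var handler = handlers[index++];
		if (typeof handler !== "undefined")
			handler(req, res, next);
	};
	next();
};

var trace = [];

// Records the request and passes it on
var logger = function(req, res, next) {
	trace.push("logger saw " + req.path);
	next();
};

// Stops the request by responding without calling next()
var blocker = function(req, res, next) {
	if (req.path === "/forbidden") {
		res.body = "blocked";
		return;
	}
	next();
};

// Provides the actual response, so it does not need next()
var finalHandler = function(req, res) {
	res.body = "hello";
};

var okRes = {};
runChain([logger, blocker, finalHandler], {path: "/"}, okRes);

var blockedRes = {};
runChain([logger, blocker, finalHandler], {path: "/forbidden"}, blockedRes);

console.log(okRes.body, blockedRes.body, trace.length);
```

The second request never reaches finalHandler, which is the same mechanism the front end relies on when it decides not to forward an invalid request.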

The path and method are obligatory parts of the HTTP request, so they always exist. A req.query also always exists because Express parses it automatically, although it might be empty. The final value, req.body, is more complicated. It is created by bodyParser.json() or bodyParser.urlencoded(), but those parsers run only where they are specified. In the sample application, they run only for some methods and only if the content type is either application/json or application/x-www-form-urlencoded.

// Parse post and put requests as JSON or a POST Form as appropriate.
app.post("*", bodyParser.json({type: "application/json"}));
app.put("*", bodyParser.json({type: "application/json"}));
app.post("*", bodyParser.urlencoded({type: "application/x-www-form-urlencoded", extended: true}));

It is easier to handle the body parameters if they are always in a hash table. Therefore, the code checks whether the value is undefined (typeof req.body === "undefined") and, if it is, uses an empty hash table ({}) for the body.

Learning what to expect

Now that we capture the different parameters, the next step is to use that request information to find out which inputs the application expects. In general, we expect inputs to fall into three categories:

  • Multiple choice
  • Numeric, with a maximum and a minimum
  • Free form text

There are other possibilities, such as dates and phone numbers, but they are a lot less common. In this article, I ignore them for the sake of simplicity, but feel free to explore these possibilities.

There are many possible machine learning algorithms, but for this purpose we can use a very simple one:

  1. Start with assuming that every input field is multiple choice and track the values that were received.
  2. If the number of distinct values is above a certain threshold, assume that this field is either numeric or free form text (based on the values received so far).

In the sample application, you can go to https://machine-learning-front-end.mybluemix.net/manual/<field>/<value> to add a value (and potentially a new field) manually; a browser sends this request with the GET method. The URL responds with the JSON for the inputs table, inputValuesTable. If you see fields and values that you did not create, it is possible that another reader is using the sample application at the same time as you.

You can also retrieve the table from https://machine-learning-front-end.mybluemix.net/inputValuesTable. If you want to delete the table, use https://machine-learning-front-end.mybluemix.net/reset.

The function that adds values to inputValuesTable is called add2Input(). It is long, but conceptually simple. You can read it in the source code.
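
The article's add2Input() is not reproduced here. The following is only a minimal sketch of the two-step algorithm described above, under several assumptions: the learnValue() and isNumeric() names, the maxChoices threshold, the "text" type name, and the exact shape of a table entry are all hypothetical (only the num and mchoice type names and the count property appear elsewhere in this article).

```javascript
// Hypothetical threshold: with more distinct values than this, the field
// stops being treated as multiple choice.
var maxChoices = 3;

// True if the value looks like a finite number (inputs often arrive as strings)
var isNumeric = function(value) {
	var text = String(value).trim();
	return text !== "" && isFinite(Number(text));
};

// Record one observed value for the field identified by key
var learnValue = function(table, key, value) {
	var entry = table[key];
	if (typeof entry === "undefined")
		entry = table[key] = {type: "mchoice", count: 0, values: []};
	entry.count++;

	if (entry.type === "mchoice") {
		if (entry.values.indexOf(value) === -1)
			entry.values.push(value);
		if (entry.values.length > maxChoices) {
			// Too many distinct values: reclassify based on what was seen
			if (entry.values.every(isNumeric)) {
				var numbers = entry.values.map(Number);
				entry.type = "num";
				entry.min = Math.min.apply(null, numbers);
				entry.max = Math.max.apply(null, numbers);
			} else {
				entry.type = "text";
			}
			delete entry.values;
		}
	} else if (entry.type === "num") {
		if (isNumeric(value)) {
			entry.min = Math.min(entry.min, Number(value));
			entry.max = Math.max(entry.max, Number(value));
		} else {
			// A non-number arrived, so fall back to free form text
			entry.type = "text";
			delete entry.min;
			delete entry.max;
		}
	}
	return entry;
};

var table = {};
["red", "green", "red"].forEach(function(v) {
	learnValue(table, "GET:/paint:color", v);
});
["1", "7", "3", "12"].forEach(function(v) {
	learnValue(table, "GET:/paint:coats", v);
});
console.log(JSON.stringify(table));
```

In this run the color field stays multiple choice with two known values, while the coats field exceeds the threshold and becomes numeric with a range of 1 to 12.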

Path components

Any path component can be used as an input field, but usually there is at least one fixed string followed by inputs. There may be more, but they can be treated as a multiple choice input, possibly with a single value.

Note that this approach could result in false negatives. For example, if the two paths that act as services are /rest/int/:integer and /rest/str/:string, the algorithm learns that the second path component is multiple choice (either int or str), and that the third component is free form text, even in the cases where it is a number. However, treating the second component as a fixed part of the URL instead is not safe either: for a path such as /rest/:str/:int, where the second component is itself an input, the result would be a huge number of URLs. I explain a more sophisticated algorithm later on, in the section From prototype to production.

For path components, the first path component is treated as the URL, and the other path components get field names based on their position. To implement this, use the following code in the app.all("*", function …) call:

	// Treat path components as input fields
	var pathComponents = req.path.split("/");	
	for(var i=2; i<pathComponents.length; i++)
		processInput(req.method, "/" + pathComponents[1], i, pathComponents[i]);

Notice that the URL's path always starts with a slash, so the first value in the array that req.path.split("/") returns is always empty. The second value is the real first component, and the rest are treated as fields.
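
The indexing is easier to see with a concrete path. The sketch below applies the same split to a sample path outside of Express; the pathFields() helper is hypothetical and only collects what the loop above would pass to processInput().

```javascript
// Hypothetical helper: returns the URL (first component) and the
// remaining components keyed by their position in the split array.
var pathFields = function(path) {
	var pathComponents = path.split("/");
	var result = {url: "/" + pathComponents[1], fields: {}};
	for (var i = 2; i < pathComponents.length; i++)
		result.fields[i] = pathComponents[i];
	return result;
};

console.log(JSON.stringify(pathFields("/rest/int/5")));
// prints {"url":"/rest","fields":{"2":"int","3":"5"}}
```

One consequence of this approach is that a trailing slash produces an extra field with an empty value, because "/rest/int/".split("/") ends with an empty string.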

Form inputs

Form inputs are already a list of fields and values. Adding them is very simple – the only issue is to make sure that req.body (if it exists) is the form input, not the result of a JSON REST call. You can tell them apart by the MIME type:

	// If relevant, treat query (GET) and body (POST) fields as fields
	for (var field in req.query)
		processInput(req.method, req.path, field, req.query[field]);
	// Use req.is() rather than comparing the header directly, so a header with
	// parameters (such as "; charset=UTF-8") is still recognized
	if (req.is("application/x-www-form-urlencoded"))
		for (var bodyField in req.body)
			processInput(req.method, req.path, bodyField, req.body[bodyField]);

JSON REST calls

JSON is a harder problem because fields themselves can contain fields. To turn a JSON structure into a flat hash, we use the recursive flatten() function:

// Flatten a structure into fields and values
var flatten = function(name, data) {
	var retVal = {};
	
	if (typeof data === "object") {
		// Hash or list
		if (Array.isArray(data)) {
			// List
			for(var i=0; i<data.length; i++)
				retVal = Object.assign(retVal, flatten(name + "-" + i, data[i]));
		} else {
			// Hash
			for (var field in data)
				retVal = Object.assign(retVal, flatten(name + "-" + field, data[field]));			
		}
	} else {
		// This is a scalar value
		retVal[name] = data;
	}
	
	return retVal;
};
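
To see what the flattened field names look like, the following example runs flatten() on a small nested structure. The function is repeated here (in condensed form) only so that the snippet runs on its own; the sample data is made up.

```javascript
// Same logic as flatten() above, repeated so this example is self-contained
var flatten = function(name, data) {
	var retVal = {};
	if (typeof data === "object") {
		if (Array.isArray(data)) {
			for (var i = 0; i < data.length; i++)
				Object.assign(retVal, flatten(name + "-" + i, data[i]));
		} else {
			for (var field in data)
				Object.assign(retVal, flatten(name + "-" + field, data[field]));
		}
	} else {
		retVal[name] = data;
	}
	return retVal;
};

var flat = flatten("body", {
	user: {name: "a", tags: ["x", "y"]},
	age: 3
});
console.log(JSON.stringify(flat));

// Two different structures can produce the same flat name
var collision = flatten("body", {"a-b": 1, a: {b: 2}});
console.log(JSON.stringify(collision));
```

The first call yields the fields body-user-name, body-user-tags-0, body-user-tags-1, and body-age. The second call shows a limitation of using a hyphen as the separator: both inputs map to body-a-b, so one value overwrites the other. A production version might want a separator that cannot appear in field names.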

This code snippet in app.all("*", function …) calls flatten() and then processes all the fields:

	// Flatten and process JSON if it is received. req.is() also matches a
	// header with parameters, such as "application/json; charset=utf-8"
	if (req.is("application/json")) {
		var fields = flatten("body", req.body);
		for (var jsonField in fields)
			processInput(req.method, req.path, jsonField, fields[jsonField]);
	}

Dealing with invalid inputs

The processInput() function, so far, only calls add2Input(). However, learning what input to expect, by itself, does not secure anything. To get actual security, it is necessary to look at a field and check its value against the known options.

This is done by the function checkField(). This function is also pretty simple: it checks for four possible error conditions and throws an exception for each one:

  1. The type is num, and the value is not a number.
  2. The type is num, and the value is too small (below the minimum).
  3. The type is num, and the value is too large (above the maximum).
  4. The type is mchoice, and the value is not one of the known values.
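
The four conditions above could be sketched as follows. This is not the article's checkField(); the checkValue() name, the explicit table parameter, the error messages, and the entry shape ({type, min, max} or {type, values}) are assumptions made for this example.

```javascript
// Hypothetical check: throws an Error if the value does not match what
// was learned for the field, and returns normally otherwise.
var checkValue = function(table, key, value) {
	var entry = table[key];

	if (entry.type === "num") {
		var text = String(value).trim();
		var number = Number(text);
		if (text === "" || isNaN(number))
			throw new Error(key + ": not a number");
		if (number < entry.min)
			throw new Error(key + ": below the minimum");
		if (number > entry.max)
			throw new Error(key + ": above the maximum");
	} else if (entry.type === "mchoice") {
		if (entry.values.indexOf(value) === -1)
			throw new Error(key + ": not one of the known values");
	}
	// Any other type (for example free form text) is accepted as is
};

var learned = {
	"GET:/paint:coats": {type: "num", min: 1, max: 12},
	"GET:/paint:color": {type: "mchoice", values: ["red", "green"]}
};

checkValue(learned, "GET:/paint:coats", "5");      // passes silently
checkValue(learned, "GET:/paint:color", "red");    // passes silently

try {
	checkValue(learned, "GET:/paint:color", "blue");
} catch (e) {
	console.log("" + e);
}
```

Free form text fields are not checked at all in this sketch, which is one source of the false negatives mentioned earlier.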

The processInput() function needs to decide whether to add to the input table (add2Input()) or check the field's value (checkField()). In the sample application, it checks if the field is known, and if it is, whether it has been seen five times or more (you can modify this value near the top of the source code file app.js):

// Process input. Either add it to the table, or verify it is legitimate
var processInput = function(method, url, field, value) {
	var key = method + ":" + url + ":" + field;
	var tableEntry = inputValuesTable[key];
	
	// We haven't seen this field yet, add it
	if (typeof tableEntry === "undefined")
		add2Input(key, value);	
	else {
		// If we haven't seen it enough times to think we know this field,
		// add this value
		if (tableEntry.count < count4Known)
			add2Input(key, value);	
		else
		// If we think we know the legitimate values, check it
			checkField(key, value);
	}  // If the type isn't undefined, meaning we've seen the field at least once
};

Finally, in the app.all("*", …) call, deal with the exceptions that are thrown by checkField(). Right now the application returns the error to the user, which is good for demonstration purposes. In a real implementation, you would probably want to provide a lot less information to a potential attacker, and log any infractions.

// Catch all HTTP requests
app.all("*", /* @callback */ function(req, res, next) {
	.
	.
	.	
	// Try processing the input fields. If any of them are invalid, catch the
	// error and send it to the user
	try {	
		// Treat path components as input fields (except the first one)
		var pathComponents = req.path.split("/");	
		for(var i=2; i<pathComponents.length; i++)
			processInput(req.method, "/" + pathComponents[1], i, pathComponents[i]);
		
		// If relevant, treat query (GET) and body (POST) fields as fields
		for (var field in req.query)
			processInput(req.method, req.path, field, req.query[field]);
		// req.is() also matches a Content-Type header that carries
		// parameters, such as "; charset=UTF-8"
		if (req.is("application/x-www-form-urlencoded"))
			for (var bodyField in req.body)
				processInput(req.method, req.path, bodyField, req.body[bodyField]);

		// Flatten and process JSON if it is received
		if (req.is("application/json")) {
			var fields = flatten("body", req.body);
			for (var jsonField in fields)
				processInput(req.method, req.path, jsonField, fields[jsonField]);
		}
	} catch (e) {
		res.send("" + e);
		return ;  // Make sure we return without calling next(), so the invalid request
		          // will not be processed
	}

		
	next();
});

From prototype to production

The sample application that is provided within this article is written as a prototype to illustrate the technique, not for production purposes. (I do not recommend that readers use this application in a production environment without making the suggested changes first.) There are a number of fairly simple changes that would improve it significantly.

Path components

A more sophisticated algorithm treats every path component as a series of fixed values until there are enough values, and then turns it into a text field. For example, if it sees the following paths, the algorithm will assume that each of these values is a fixed URL:

  • /rest/strCompare/a/b
  • /rest/strCompare/a/c
  • /rest/strCompare/b/c

When the algorithm sees that there are a lot of different URLs of the form /rest/strCompare/<string>/<string>, it will decide that only /rest/strCompare is actually a URL, and the other two strings are input fields.
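
One possible way to implement this idea is a tree of path components in which a position turns into an input field once it has collected too many distinct values. The sketch below is only an illustration: the function names, the maxFixedChildren threshold, and the "*" placeholder are all assumptions, and it simply discards what it learned below a position when that position is converted.

```javascript
// Hypothetical threshold: more distinct values than this at one position
// and the position is treated as an input field instead of a fixed string.
var maxFixedChildren = 2;

var newNode = function() {
	return {children: Object.create(null), wildcard: null};
};

// Record one observed path in the tree
var learnPath = function(root, path) {
	var node = root;
	path.split("/").slice(1).forEach(function(component) {
		if (node.wildcard === null) {
			if (!(component in node.children))
				node.children[component] = newNode();
			if (Object.keys(node.children).length > maxFixedChildren) {
				// Too many fixed values: this position is really an input
				node.wildcard = newNode();
				node.children = Object.create(null);
			}
		}
		node = node.wildcard !== null ? node.wildcard : node.children[component];
	});
};

// Split a path into its learned pattern and its input values. Components
// beyond what the tree knows are ignored in this sketch.
var describePath = function(root, path) {
	var node = root;
	var result = {pattern: "", fields: []};
	var components = path.split("/").slice(1);
	for (var i = 0; i < components.length && node; i++) {
		if (node.wildcard !== null) {
			result.pattern += "/*";
			result.fields.push(components[i]);
			node = node.wildcard;
		} else {
			result.pattern += "/" + components[i];
			node = node.children[components[i]];
		}
	}
	return result;
};

var root = newNode();
["/rest/strCompare/a/b", "/rest/strCompare/a/c", "/rest/strCompare/b/c",
 "/rest/strCompare/c/d", "/rest/strCompare/e/f", "/rest/strCompare/g/h"
].forEach(function(p) { learnPath(root, p); });

console.log(JSON.stringify(describePath(root, "/rest/strCompare/x/y")));
```

With the low threshold used here, six sample paths are enough for both trailing positions to be recognized as input fields, while /rest and /rest/strCompare stay fixed. A real deployment would need a much higher threshold to avoid misclassifying a service that legitimately has several fixed sub-paths.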

Expand the numeric range

Depending on the application, it may be a good idea to allow numeric values that are slightly lower than the observed minimum or higher than the observed maximum. The problem with this approach is that an attacker could gradually expand the range until it includes problematic values.
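
One way to limit the gradual-expansion problem is to compute the tolerance from the learned range but never update that range with values that were accepted only because of the tolerance. The following sketch assumes a hypothetical marginFraction setting and an entry with min and max properties.

```javascript
// Hypothetical setting: accept values up to 10% of the learned range
// outside of the observed minimum and maximum.
var marginFraction = 0.1;

// Returns true if the number is inside the learned range plus the margin.
// The entry itself is never modified here, so accepted outliers cannot be
// used to widen the range step by step.
var inExpandedRange = function(entry, number) {
	var margin = (entry.max - entry.min) * marginFraction;
	return number >= entry.min - margin && number <= entry.max + margin;
};

var coats = {type: "num", min: 10, max: 20};

console.log(inExpandedRange(coats, 20.5));  // slightly above the maximum
console.log(inExpandedRange(coats, 21.5));  // beyond the margin
console.log(coats.max);                     // still the learned maximum
```

In this example 20.5 is accepted and 21.5 is rejected, and the learned maximum stays at 20 no matter how many borderline values arrive.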

Storage: Memory or database?

Normally, data in production is stored in a database so that it won't be erased when the application is restarted. The only reason I use hash tables in memory for these developerWorks articles is simplicity. However, in this case there is a valid argument to use memory rather than a database. The memory gets erased when the application is restarted, for example, because the code changed. In that situation it makes sense to relearn the input fields, whose names and meanings might change.

On the other hand, doing this means that for the first few uses after the application restarts it is less protected, because the system is still relearning the input fields and accepts everything as legitimate. It is a trade-off between security and the convenience of not having to manually specify that an application changed.

Seeing what isn’t there

The current algorithm only checks if existing fields have legitimate values. But what if the client does not send a field that the application expects? It is possible to modify the app.all("*", function …) call to track all the fields that are provided, and then raise an alarm when any of those fields is unexpectedly missing.
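
A minimal sketch of such tracking might count, for each endpoint, how many requests were seen and how often each field appeared, and then treat a field as expected only if it appeared in every learned request. The names below (endpoints, recordFields(), findMissing()) are hypothetical, and the sketch assumes that the field names passed in for one request are unique.

```javascript
// endpoint key -> {requests: number, seen: {fieldName: timesSeen}}
var endpoints = Object.create(null);

// Learning phase: remember which fields came with this request
var recordFields = function(endpoint, fieldNames) {
	var stats = endpoints[endpoint];
	if (typeof stats === "undefined")
		stats = endpoints[endpoint] = {requests: 0, seen: Object.create(null)};
	stats.requests++;
	fieldNames.forEach(function(name) {
		stats.seen[name] = (stats.seen[name] || 0) + 1;
	});
};

// Checking phase: list the fields that appeared in every learned request
// but are absent from this one
var findMissing = function(endpoint, fieldNames) {
	var stats = endpoints[endpoint];
	if (typeof stats === "undefined")
		return [];
	return Object.keys(stats.seen).filter(function(name) {
		return stats.seen[name] === stats.requests &&
			fieldNames.indexOf(name) === -1;
	});
};

recordFields("POST:/pay", ["user", "amount"]);
recordFields("POST:/pay", ["user", "amount", "note"]);

console.log(findMissing("POST:/pay", ["user"]));
```

Here amount is reported as missing, while note is not, because it was present in only one of the two learned requests and is therefore treated as optional.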

Allow? Log? Block?

Currently, the front end allows all access while it is learning the input fields, and then blocks any non-allowed value. This is sufficient for a demonstration, but in a real-life application it would lead to too many false positives (cases where an attack is detected even though none is in progress), rendering the application unusable.

It would be better to track the total number of times information about an input field has been validated, and in most cases just log it (so somebody can see if the learned values need to be adjusted) – and only block access in cases where it is clear that the value is invalid, for example because thousands of legitimate values have passed for that field.
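
As a sketch of that policy, the decision could depend on how many values have already been validated for the field. The decide() function, the validated counter, and the blockAfter threshold are assumptions made for this example; the right threshold depends on the application.

```javascript
// Hypothetical threshold: only block once this many legitimate values
// have passed validation for the field.
var blockAfter = 1000;

// Returns "allow" for a valid value, "log" for an invalid value on a field
// we are not yet sure about (the request still goes through, but somebody
// can review it), and "block" for an invalid value on a well-known field.
var decide = function(entry, valueIsValid) {
	if (valueIsValid) {
		entry.validated = (entry.validated || 0) + 1;
		return "allow";
	}
	return (entry.validated || 0) >= blockAfter ? "block" : "log";
};

var newField = {validated: 12};
var matureField = {validated: 5000};

console.log(decide(newField, false));     // not enough history yet
console.log(decide(matureField, false));  // plenty of history
console.log(decide(newField, true));      // valid values always pass
```

The three calls return log, block, and allow respectively, and the last call also increases the counter for newField to 13.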

Conclusion

The technique in this article is an attempt to patch a human problem (programmers who forget to address the issues) by using technology. Like most such attempts, it is imperfect and prone to errors. However, it can still improve your security posture, especially if tuned properly to your usage scenario.

