From the course: PHP: Creating Secure Websites

Sanitizing data - PHP Tutorial

Sanitizing data

- [Presenter] In addition to writing validations, we also want to sanitize the data that comes into our PHP code. Many common types of hacks pass in data that's carefully constructed to cause harm. Special strings can be used to affect many parts of the code, from the database to JavaScript to HTML and more. Sanitizing the input will convert harmful data into harmless data. In general, we don't want to try to detect every possible type of harmful data. It's simply too easy to miss something. Instead, we will run a sanitization process on the data to neutralize or remove powerful characters.

How we sanitize the data depends on how we plan to use it. If we're putting the data into the database, then we need to sanitize it for the database. If we're going to output a string to the browser page, then we need to sanitize it for HTML. If it's going to be used in JavaScript, it must first be made safe for JavaScript. Each one of these types has different characters that have special meaning.

You can perform sanitization in two ways. You can either encode characters or escape them. Encoding characters means replacing those powerful characters with harmless equivalent characters. Escaping characters means looking for the powerful characters in the string and adding an escape character before them that renders them harmless.

I don't recommend that you write your own custom sanitization functions. Instead, it's much better to use the well-tested PHP-specific functions. It's very hard to get sanitization right and account for all of the possible cases. And do not try to remove or correct invalid data in a string. It becomes a game of cat and mouse. You pull out data that looks incorrect, but when the string is joined back together without that data, the resulting string can itself be harmful. Fortunately, you don't have to write your own custom function because PHP gives you some good sanitization functions and filters.
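To make the difference between encoding and escaping concrete, here is a minimal sketch. The input strings are made up for illustration and are not from the course files. The last part shows why pulling out "bad" data by hand turns into cat and mouse.

```php
<?php
// Hypothetical user input containing characters that are powerful in HTML and SQL.
$input = "<b>O'Reilly</b>";

// Encoding: replace powerful characters with harmless equivalents (HTML entities).
$encoded = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Escaping: keep the characters, but put an escape character in front of them.
$escaped = addslashes($input);

echo $encoded, PHP_EOL; // &lt;b&gt;O&#039;Reilly&lt;/b&gt;
echo $escaped, PHP_EOL; // <b>O\'Reilly</b>

// Cat and mouse: removing the "bad" part by hand rejoins the pieces
// around it into exactly the string we were trying to get rid of.
$stripped = str_replace('<script>', '', '<scr<script>ipt>');
echo $stripped, PHP_EOL; // <script>
```

The flags ENT_QUOTES and 'UTF-8' are passed explicitly so the result does not depend on the defaults of the PHP version in use.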
The first of these, htmlspecialchars, looks for characters that have special meaning to HTML and renders them harmless by encoding them into HTML entities. The htmlentities function is similar but goes a little further. Instead of just encoding characters that have special power, it encodes anything that has an equivalent HTML entity. For example, the copyright symbol doesn't have any special meaning to HTML, so htmlspecialchars would leave it alone, but htmlentities would replace it with an entity. The function strip_tags removes anything that's an HTML tag or PHP tag. I realize that I said not to remove content from strings, but this is an exception because strip_tags does it effectively.

The PHP function urlencode will encode a string so that it can be used in a link or a URL. json_encode will encode a string so that it's safe for use in JavaScript or JSON.

If you're using a database, you'll need to escape strings before you use them with the database, both for inserting data into a table and for querying a table. Most databases offer some kind of function that will escape data specifically for that database type. If you're using MySQL, then you can use mysqli_real_escape_string. If you don't have a database-specific function available, PHP also offers addslashes, which is a generic function that escapes key characters that are typically associated with databases, primarily quotation marks. But the database-specific function is still going to be better.

In this chart, I have a third column called filter. We'll come back to those. First, let's see the PHP functions listed in the first column in action. In the file Sanitizing HTML.php, I have a variable called sanitize which allows me to easily turn the sanitization features on and off. After that, there's an HTML string which contains HTML tags. You should think of it as a placeholder for any data that allows a user to change the look of the HTML page.
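As a rough side-by-side of the HTML and JavaScript functions just listed, here is a sketch. The two sample strings and the variable names are assumptions standing in for whatever the course file actually contains.

```php
<?php
// Hypothetical strings, not the ones from the course file.
$html_string = "<strong>Styled</strong> \u{00A9}"; // \u{00A9} is the copyright symbol
$javascript_string = "<script>alert('Gotcha!');</script>";

// Encodes only the characters that are special in HTML; the copyright symbol is left alone.
$special = htmlspecialchars($html_string, ENT_QUOTES, 'UTF-8');

// Encodes everything that has an HTML entity, so the copyright symbol becomes &copy;.
$entities = htmlentities($html_string, ENT_QUOTES, 'UTF-8');

// Removes the tags themselves; the text between them is kept.
$stripped = strip_tags($javascript_string);

// Encodes the string as a JSON/JavaScript string literal. The JSON_HEX_* flags
// additionally turn <, >, &, ' and " into \uXXXX escapes.
$for_js = json_encode(
    $javascript_string,
    JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT
);

echo $special, PHP_EOL;  // &lt;strong&gt;Styled&lt;/strong&gt; followed by the raw symbol
echo $entities, PHP_EOL; // &lt;strong&gt;Styled&lt;/strong&gt; &copy;
echo $stripped, PHP_EOL; // alert('Gotcha!');
echo $for_js, PHP_EOL;
```

Note that strip_tags leaves the text that was inside the script tags in place; it only removes the tags around it.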
Here the user is able to style the text even if we don't want it to be styled. There's a JavaScript string which includes script tags and a JavaScript alert. You should think of that as a placeholder for any malicious JavaScript code. If I'm not sanitizing, it will output those to the browser. Let's take a look. Click over here and let's just load this page, and you can see it pops up with a gotcha alert and then we have this styled HTML text that outputs.

Let's turn on sanitization by setting this to true. Come back over and let's reload the page. We didn't get a JavaScript pop-up this time and we just got text output on the page. The powerful characters have been rendered harmless. If you take a look at the page source, you'll see why. You can see that the less-than sign was replaced by an HTML character entity, &lt;, and the quote characters and the greater-than characters have also been replaced. These HTML entities don't have the same meaning for HTML.

I'll go back to the PHP code page and you'll see that I've also got some other examples here with htmlentities and strip_tags. Let's try strip_tags. I'll just uncomment that one and comment this one out. Let me take away that and remove that one. Now let's come back. I'll close the page source and let's reload the page. Now you can see that it completely removed the tags, and if you were to view source, you'll see that they're completely gone. It's a different approach. It's stripped out the tags instead.

Next, let's look at the Sanitizing URLs page. It gets a parameter called title from the URL. If it's not set, it will be the string nothing yet. The variable URL string is being set to something that has an ampersand in it. The ampersand has special meaning in a URL. When sanitizing is enabled, it will call urlencode on the string and then output the result. Right now sanitization is set to false. Let's go to Firefox and try it out. So right now it says title is nothing yet.
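A page along the lines of the one just described might look like the sketch below. The file name in the link, the variable names, and the exact wording of the string are assumptions; only the behavior (a title parameter, a default of nothing yet, a string with an ampersand, and a call to urlencode) comes from the description.

```php
<?php
$sanitize = true; // flip to false to see the unencoded link

// Read the title parameter from the URL, falling back to a default.
$title = isset($_GET['title']) ? $_GET['title'] : 'nothing yet';

// Hypothetical string containing characters that are special in a URL.
$url_string = 'URL encoding working? Try it & see';

// Encode the value before placing it in the query string.
$param = $sanitize ? urlencode($url_string) : $url_string;
$link = 'sanitizing_urls.php?title=' . $param; // file name is an assumption

// The title came from the user, so encode it for HTML before printing it.
echo 'Title is: ', htmlspecialchars($title, ENT_QUOTES, 'UTF-8'), PHP_EOL;
echo $link, PHP_EOL;
// with $sanitize = true:
// sanitizing_urls.php?title=URL+encoding+working%3F+Try+it+%26+see

// Round trip: parsing the encoded query string gives back the whole value.
parse_str('title=' . urlencode($url_string), $parsed);
```

After the round trip, $parsed['title'] holds the complete original string, ampersand included.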
If I click the add title link, you'll see that now it says title is URL encoding working? That's the string that's right here. But notice that the string ends with "try it & see," while here we've got "try it" and then nothing after it. That's because the ampersand is used to separate different values. So this, as far as the URL is concerned, is the beginning of a new parameter. If we instead turn on sanitization, true, come back. Now let's clear title and add title again. Now you see we get the whole thing because it encoded it for the URL. The spaces have become plus signs. You can see that some of the other things, like the question mark and the ampersand, have also been encoded into the percent-encoded forms that URLs use.

Now let's look at the file Sanitizing_sql to get an idea of how we would sanitize something for a database. You see that it takes a string from the URL, here it is name, and if sanitization is turned on, it will call an escape function on it. It can use addslashes, but if we had a database, it would be better to use the database-specific function like mysqli_real_escape_string. Then I'm going to use that string to construct some SQL. So you can see I'm dropping it in right here. This is SQL that I would normally send to a database.

Let's look at Firefox to see some examples. Here I've got name equals Kevin, and so this is the SQL it would send to the database. That's perfectly appropriate. It would search for all customers where the name is Kevin. But imagine if the user constructed something malicious. Let's say, instead, they changed this to be name equals %27+OR+1=1;--. Now look at the SQL that it generated. This is in fact something that's malicious, because where name equals now has two single quotes after it, since %27 is an encoded single quote. So it's quote, quote, which ends the name: where name is equal to nothing, or where one equals one. Well, one always equals one, so that returns true. So this will always return true.
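The string building walked through so far can be reproduced without a database. This is only a sketch: the SELECT statement and the function name build_query are assumptions for illustration, and addslashes stands in for the database-specific function, which would need a live connection.

```php
<?php
// Hypothetical helper: builds the SQL string the way the demo page is described
// as doing, optionally escaping the value first.
function build_query(string $name, bool $sanitize): string
{
    if ($sanitize) {
        // With a real MySQL connection, mysqli_real_escape_string would be preferable.
        $name = addslashes($name);
    }
    return "SELECT * FROM customers WHERE name='" . $name . "'";
}

// What PHP sees in $_GET['name'] once the URL value %27+OR+1=1;-- is decoded.
$malicious = urldecode('%27+OR+1=1;--'); // ' OR 1=1;--

echo build_query('Kevin', false), PHP_EOL;
// SELECT * FROM customers WHERE name='Kevin'

echo build_query($malicious, false), PHP_EOL;
// SELECT * FROM customers WHERE name='' OR 1=1;--'

echo build_query($malicious, true), PHP_EOL;
// SELECT * FROM customers WHERE name='\' OR 1=1;--'
```

In the unsanitized case the injected quote closes the name value early; in the sanitized case the backslash makes it a literal quote inside the value.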
The dash dash is an SQL comment that says ignore everything that comes after it. So the rest of it is irrelevant. This would match every single customer in the database. If we did this to a login script, we might be able to trick it into giving us access. Or a user might try replacing this part here with a semicolon and then +DROP+TABLE+customers. And now I have two SQL statements: select from customers where name equals this, and then right after that, drop table customers. This is a new SQL command which would destroy the customers table completely.

Let's sanitize the data instead. Let's come back over here and set this to true. Let's reload that same page, and you can see now it put a backslash in front of that first single quote mark. So now it doesn't have the same meaning. So now this is the string. It's going to look for customers where name is equal to a literal single quote and then a semicolon and so on. This single quote doesn't have the same effect as this one does.

These PHP functions have been around for a long time, but starting in PHP 5.2 some new filter functions were added, and the names of these filters are in the third column. We use them with the PHP function filter_var. The arguments are the string you want to filter and the filter you want to apply. A full list of all those filter types is on the php.net website. Let me give you an example. This string contains a JavaScript alert. We can call filter_var on the string with FILTER_SANITIZE_STRING. It strips out the tags just like strip_tags did. I'm not crazy about these filters. They have long names that are hard to remember and I don't see much gain over the other functions. You can pick the style that you prefer, but both sets are available to sanitize your data.
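Here is a minimal sketch of the filter_var call pattern. One swap to be aware of: FILTER_SANITIZE_STRING is deprecated as of PHP 8.1, so this sketch uses FILTER_SANITIZE_FULL_SPECIAL_CHARS instead, which encodes the powerful characters (much like htmlspecialchars with ENT_QUOTES) rather than stripping the tags. The sample string is an assumption.

```php
<?php
// Hypothetical string containing a JavaScript alert.
$javascript_string = "<script>alert('Gotcha!');</script>";

// First argument: the string to filter. Second argument: the filter to apply.
// Swapped-in filter: encodes HTML special characters instead of stripping tags.
$clean = filter_var($javascript_string, FILTER_SANITIZE_FULL_SPECIAL_CHARS);

echo $clean, PHP_EOL;
```

The result contains no raw angle brackets or quotes, so the browser displays it as text instead of running it.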
