After reading the “Securing PHP” article written James Cunningham I thought I might gather a few points about using PHP from the developer side. Keep in mind that I’m not a security expert. However, this article contains a few starting points on preventing exploits, making PHP apps perform better, and miscellaneous stuff that I consider to be best practices. Your mileage may (and probably will) vary, so as always: take everything with a grain of salt no matter if you read it here or elsewhere. This is not so much a check list of concrete actions, it’s more a collection of points worth keeping in mind as you code.
PHP? WTF possessed you?!?
The first thing you come across when starting out with PHP is probably the fact that it has an extremely bad reputation. You will hear lots of things, including “it doesn’t scale”, “it’s not a real language”, “it doesn’t have X so it sucks”, “it’s not safe”, or “it’s Blub and you’re too stupid to realize it’s Blub From the perspective of the language runtime itself, this is all a lot of crap. Still, the trolls are often correct – though it’s generally not PHP’s fault per se. The blame lies solely with the developers using it. As a PHP adept, this should comfort you because it’s something that can be fixed on your end. You can also derive consolation from the fact that other web languages and frameworks suffer from the same problems, it’s just not generally advertised. The bad news is that PHP application failures are huge and numerous, because the language is both popular and powerful enough to enable truly epic bugs.
The Basics: How PHP Works
Because PHP is so accessible and ubiquitous, there are a lot of people copying and pasting scripts together – people who in a more perfect world would be forbidden by law from ever touching source code. Even when real developers are doing it, hacking an easy language does not absolve anybody from the responsibility of knowing what goes on inside a system behind the scenes.
At the most fundamental level, the webserver hands a request from a browser over to the PHP runtime. This sounds like a really simple concept and for the most part, it is. Nowadays, most serious web servers are configured to shove requests directly into the gaping maw of a long-running PHP process dispatcher. After PHP is done with its thing, it passes the result data back to the webserver, which in turn hands it over to the client browser. Historically, this was not always the case: in the past, web servers often started up a complete PHP instance just for one request. If this sounds inefficient to you, you are absolutely right. After people realized how wasteful this method was, they adopted the current model of re-using PHP instances after they completed their jobs. However, some mass-market ISP hosting plans still use the older model, all the more reason to keep an eye on your code performance at all times.
A clean slate
Every developer should internalize this: fundamentally, PHP is a per-request environment. Whatever you did during the last request, the next one will start with a completely blank slate. This stateless paradigm is not very common as web languages go. Many others operate as a persistent environment. PHP’s way of doing things like this is both awesome and problematic, depending on your use case. On the plus side, this allows developers to look at each request as an isolated problem. Also, it’s much more difficult to make a mistake that takes down the entire server. There are less memory leaks and other weird effects that come from having a stateful runtime. But on the negative side it also means developers must understand how rebuilding the entire environment for every request comes at a computational cost. Many PHP apps are slow because developers did not consider this cost. It’s our responsibility to keep that startup cost low, so doing as little initialization as possible upfront is always a sensible concept.
Things to avoid:
Anything that smells of gratuitous initialization procedures. Don’t load, check, connect, or compute anything that isn’t needed. PHP is not designed to fulfill your dream of becoming an OOP purist. Avoid huge class hierarchies: keep in mind that all this structure has to be parsed and then instantiated at every request. It’s often better to have a very flat class system. Don’t store big amounts of data in the $_SESSION variable because it, too, has to be reloaded on every request. Unless you’re sure your web server does opcode caching (with APC for example), don’t use huge files full of unnecessary source code.
Things to Do:
Procedural programming is not necessarily evil. Do it whenever it has speed and/or simplicity advantages. Only require()/include() files that are definitely needed. Consider using a class loader (carefully) to load functionality modules on demand instead of monolithically including all your stuff upfront.
Things to know:
- HTTP request headers: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields
- CGI interface, variables: http://en.wikipedia.org/wiki/Common_Gateway_Interface
Profiling, profiling, profiling…
You don’t even need serverside debuggers or fancy instrumentation to achieve this. The function microtime() returns a timestamp in microseconds, and memory_get_usage() gives you a basic idea about your script’s memory behavior. This makes it easy for you to check the two most fundamental resources at key points during your application’s execution path.
Personally, I like to use an extremely simple profiling function based on the microtime() function. Using a function like this to profile your code will allow you to measure how horrible certain operations really are. For example, connecting to a database. Running regular expressions. It all has a cost and you need to know what it is. It’s always good to know what’s going on behind the scenes – so avoid libraries that obfuscate their behavior for the sake of fake simplicity.
For more serious profiling and debugging, you should check out: http://xdebug.org/ but microtime() still provides you with valuable and quick information in any PHP environment.
Sane(r) Input and Output
One of the most common publicly visible mistakes is failure to sanitize the input of a web application. It’s important to remember that every single piece of data coming into your app is potentially hazardous. The web really is out to get you! In PHP, the $_REQUEST array contains all the parameters pertaining to the current request and it’s essential not to trust them. Sadly, there is no single way to make this data safe. It depends on what you do with it. On a more positive note, the handling of user input generally falls into one or more of the following categories and there are standard practices you can employ to avoid the worst:
>>> Rule 1 of input data hygiene: Nuke it from orbit, it’s the only way to make sure! <<<
Text in Databases
SQL database queries come with a little bit of baggage. Often, you will need to take user input and run queries with it. For example, you might want to store data in the DB or you might want to retrieve it. Most database abstraction libraries will let you use a syntax like this:
SELECT * FROM articles WHERE id = ?
The nice part of having support for placeholders like “?” is that you don’t need to worry about making the content of your variable safe. You can just pass it as another parameter to your query function. Depending on the database and the library used, this might also have the further advantage of enabling the database to precompile the query and execute subsequent queries a bit faster.
There are, however, situations where you might not want to use or be prevented from using an abstraction layer or library. Using the built-in MySQL functions can work, too, you just need to be more careful. You do have to take care of properly escaping the variables yourself. There is just one single function that you can use to make data safe for MySQL consumption and its name is mysql_real_escape_string() If you’re using any other function, stop it immediately. mysql_real_escape_string() is your one and only true lord and savior: worship the blessed bytes it churns out.
Dynamic Code Paths
This is a touchy subject. PHP allows you to do a lot more things with your code based on variables than most other languages. If you’re going to use those features, make sure to have a very good reason for it. If done right, those features can substantially reduce the complexity of your code – but you have to be extremely careful.
Dynamic include()s: PHP allows you to include a file specified in a variable. You can do stuff like this:
include('inc/'.$moduleName.'.php');. Contrary to many other people, I think it’s fine to use this feature in principle because it allows you to introduce very simple extension mechanisms into your app and it can help keep your codebase clean. But as always, with this kind of power, comes a huge responsibility: you have to make sure
$moduleName is legit and can’t be used to call arbitrary code on your server. A good way to ensure basic sanity inside this variable is to use at least
basename($moduleName) on it, but a much better solution would be to strip out any non-alphabetic characters. Nukes. Orbits. See above!
Dynamic variables and functions: In PHP, you can set the content of a variable
$v by specifying its name through another variable. For example, if you set
$nnn = 'vvv';, you can do
$$nnn to access
$vvv. But wait, there is more. Suppose you have a function
vvv();, you can call this function by writing it as
$$nnn();. Obviously, this is very powerful stuff, so you have to make sure the “id” variable (in this example $nnn) is sane and can’t be used from the outside to call arbitrary stuff in your app. Contrary to the previous methods, there is no single way to make sure this code can’t be abused: you’ll have to make sure of it in a manner that is appropriate to your code specifically.
Eval is evil: for some unknowable reason, many newbie developers seem to be enthralled by eval(), probably because they failed their saving throw against its evil whisperings of doom at some point. In any case, eval() is probably the one function responsible for the most colorful WTFs in coding history.
eval(); allows you to execute arbitrary code by invoking an interpreter inside the interpreter. Developers often use this to implement dynamic features, such as event handlers that can be specified by the users of an application. The dangers here are obvious: there is no way to make this safe. If you allow your users to specify custom code, you damn well better make sure they’re trustworthy, because they will have access to anything and everything on your server. I believe in 99% of all cases, the desire to use
eval(); is not even remotely legitimate. However, there might be scenarios where
eval() is justified, for example in CMS applications or meta-programming projects. For PHP beginners and intermediates there is just one rule regarding
eval(): if you’re using it, you’re doing it wrong. It’s that simple. Don’t listen to the voices!
Calling the shell
Sometimes, when a specific and arcane piece of data processing is required, it can be a good idea to just call a Unix text command from inside your PHP application. If you do that, you have to be aware that your code has a high likelihood being specific to your server configuration. Chances are, your call won’t work on another configuration. Whenever some tiny aspect of your shell call is influenced by user data, you have to use escapeshellarg() and escapeshellcmd() to sanitize those values.
Regular expressions are very practical, small and can be highly performant if done right. They do, however, require a specialty tech priest to come in and bless the code. You can’t just copy an arcane regex ritual from the web and expect it to work on your project. Regexes often sit there and look like they’re working, but in reality they’re just lying dormant until they finally betray you at a more opportune time. It’s ridiculously easy to get them wrong, and they don’t lend themselves to bug-spotting at one glance. You probably should not rely on regular expressions to sanitize your data, unless you know exactly what your expression does. Because even if you think you know what it does, chances are high it will do weird things with weird inputs. Even the Supreme Grandmaster Regex Scholars of Perldom get this stuff wrong with a scary high probability. Personally, I’d advise you to steer clear of security-relevant regexes if the same can be done with a few simple and more maintainable lines of real code instead. Orbits. Nukes. Swords. Gunfights.
What to know:
- Regular Expressions: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
We already talked about database security. As always, there is more. In most cases your database will be MySQL. Like PHP, MySQL has a very bad reputation – so it’s only natural that those two should pair up to be the web server standard configuration of the civilized world. MySQL will be fine for most of your standard storage and query needs. A lot of people tend to prematurely optimize their web projects and choose a NoSQL solution because they believe it’s going to be faster, or even simply because “the cool kids are doing it”. Needless to say, this is not actually warranted in most cases. PHP allows you to go with the database you like the most – if that’s MySQL, you’ll save a lot of configuration trouble. If it’s not MySQL, that’s fine, too. But chances are very good you don’t need a special DB solution for your web project so unless you’re experimenting you’ll probably get the most mileage out of something you already know very well.
What to avoid:
Avoid making a database connection when your request doesn’t need it. And when your request does need it, you should prefer persistent database connections over ad-hoc ones (e.g., use
mysql_pconnect()) for the simple reason that it potentially shaves some execution time off the connection part. Avoid executing many queries in favor of consolidating them into just a few. Every time you fire off a query to the DB server, your program has to wait for data to come back. It’s also a good idea to avoid hugely complex queries, especially if you need to plan for scalability. SQL gives you a lot of rope to potentially hang yourself, keep performance in mind and measure it!
A word about legacy libraries
You may be wondering why I’m referencing an obsolete MySQL API above. Even though the very first example is all about DB abstraction libraries, people still bring it up. To make it clear: I’m not condoning its usage, but chances are you’ll come across similar problems at some point, for example if you’re debugging legacy code or if you work with other libraries where proper escaping becomes an issue. It’s worth to at least be aware of unsanitized input at all times. I’m not advocating the use of failure-prone, low-level, obsolete libraries – I’m trying to talk about being more conscious of dirty data. Again: use PDO, or whatever suits you, avoid naked mysql_ calls.
What to know:
- MySQL, obviously: http://dev.mysql.com/doc/refman/5.0/en/tutorial.html
- Indexes are essential
File uploading is the act of accepting a file from the browser into your web application. Like all user input, you must be prepared for the worst. People will try to crash your uploading code or they will try to upload executable files onto your server. When accepting an uploaded file, it’s important to check its content first before storing it on the server. For example, if an attacker manages to upload a PHP code file into one of your directories, it’s game over. That’s what happened to the Trojans. One day they allowed a highly executable horse to upload into their root directory. It was not a good day.
To avoid this, you have to make sure that, say, an image file uploaded to your app is actually an image file. The $_FILES variable contains information about a file’s MIME type. Sadly, you have to disregard this information, because it’s been supplied by the user’s browser and is thus utterly evil. Instead, you have to get the actual MIME type of the file directly. In the good old times, you could use mime_content_type() however this is now a deprecated function. Rescue comes from an unexpected place: the GD library has a function that, among other things, returns the actual MIME types of image files: getimagesize() Use this to check what your uploaded file actually contains and simply reject everything that does not correspond to one of the MIME types explicitly allowed by you.
What to know:
- MIME types: http://en.wikipedia.org/wiki/MIME
- HTTP methods: http://en.wikipedia.org/wiki/HTTP#Request_methods
Extra Credit: Caching
On many servers, you will have access to a service called memcached. It’s essentially a mini server process that allows you to store and retrieve arbitrary data very fast. To save or retrieve a data package from memcache, your application needs to connect to the memcached server and give it the key of the object you’re interested in. This key can be any string. Keep in mind though, that other applications are using the same key/value storage so choose your keys in a manner that does not lead to conflicts. Remember to store user-specific information with a user-specific key.
Like with any server, connecting to memcache and retrieving a cached object costs time. It’s only worth it if that interval is actually shorter than the time it takes to recompute a given data structure.
In the absence of a fancy memcached service, you can simply fake it, too. For example, if your application needs to generate a report or a big chunk of HTML that seldom changes, you can store that data in the file system as well. Simply give your application a tmp/ folder and dump your cached objects there. But while memcached objects expire automatically, you’ll have to take care of the lifecycle of cache files yourself.
More Performance Tuning
Keeping your response times low inside the app is just the beginning. There is much more you can do to speed up page loading. Going into all the details here would be drastically out of scope. There are several books dedicated to these problems. Just a few performance hints here that are not generally advertised:
Consolidating static files
For example, say your web app needs JQuery, 4 JQuery plugins, 2 custom JS code files and 4 CSS files. A browser must make 11 requests to get those files. While the files themselves may be small, the latencies of requesting them can add up to a considerable delay. What’s worse, you’ll have at least 11 include tags inside your HTML for every request! You could now combine those files manually inside a text editor, but that makes them harder to maintain. It’s easier to just have PHP compile two large files: one JS and CSS!
Keep an eye on your output
"grep" is your friend. With grep, you can search your entire app for all occurrences of a string and it can really help you find out whether a given piece of code is actually used.
You Do Have Comments
That’s it from my end. There is lots more to be said on all of these subjects, but it should be enough to point interested people in the right direction. Chances are, battle-hardened PHP veterans disagree with any or all of these suggestions. Some may even get violently ill upon reading this. Pay attention to them, they might have a point. Or not. Decide what makes sense to you.