PHPitfalls

After reading the “Securing PHP” article written by James Cunningham, I thought I might gather a few points about using PHP from the developer side. Keep in mind that I’m not a security expert. However, this article contains a few starting points on preventing exploits, making PHP apps perform better, and miscellaneous stuff that I consider to be best practices. Your mileage may (and probably will) vary, so as always: take everything with a grain of salt no matter if you read it here or elsewhere. This is not so much a checklist of concrete actions as a collection of points worth keeping in mind as you code.

PHP? WTF possessed you?!?

The first thing you come across when starting out with PHP is probably the fact that it has an extremely bad reputation. You will hear lots of things, including “it doesn’t scale”, “it’s not a real language”, “it doesn’t have X so it sucks”, “it’s not safe”, or “it’s Blub and you’re too stupid to realize it’s Blub”. From the perspective of the language runtime itself, this is all a lot of crap. Still, the trolls are often correct – though it’s generally not PHP’s fault per se. The blame lies solely with the developers using it. As a PHP adept, this should comfort you because it’s something that can be fixed on your end. You can also derive consolation from the fact that other web languages and frameworks suffer from the same problems; it’s just not generally advertised. The bad news is that PHP application failures are huge and numerous, because the language is both popular and powerful enough to enable truly epic bugs.

The Basics: How PHP Works

Because PHP is so accessible and ubiquitous, there are a lot of people copying and pasting scripts together – people who in a more perfect world would be forbidden by law from ever touching source code. Even when real developers are doing it, hacking an easy language does not absolve anybody from the responsibility of knowing what goes on inside a system behind the scenes.

At the most fundamental level, the webserver hands a request from a browser over to the PHP runtime. This sounds like a really simple concept and for the most part, it is. Nowadays, most serious web servers are configured to shove requests directly into the gaping maw of a long-running PHP process dispatcher. After PHP is done with its thing, it passes the result data back to the webserver, which in turn hands it over to the client browser. Historically, this was not always the case: in the past, web servers often started up a complete PHP instance just for one request. If this sounds inefficient to you, you are absolutely right. After people realized how wasteful this method was, they adopted the current model of re-using PHP instances after they completed their jobs. However, some mass-market ISP hosting plans still use the older model, which is all the more reason to keep an eye on your code performance at all times.

A clean slate
Every developer should internalize this: fundamentally, PHP is a per-request environment. Whatever you did during the last request, the next one will start with a completely blank slate. This stateless paradigm is not very common as web languages go. Many others operate as a persistent environment. PHP’s way of doing things like this is both awesome and problematic, depending on your use case. On the plus side, this allows developers to look at each request as an isolated problem. Also, it’s much more difficult to make a mistake that takes down the entire server. There are fewer memory leaks and other weird effects that come from having a stateful runtime. But on the negative side it also means developers must understand that rebuilding the entire environment for every request comes at a computational cost. Many PHP apps are slow because developers did not consider this cost. It’s our responsibility to keep that startup cost low, so doing as little initialization as possible upfront is always a sensible concept.

Things to avoid:
Anything that smells of gratuitous initialization procedures. Don’t load, check, connect, or compute anything that isn’t needed. PHP is not designed to fulfill your dream of becoming an OOP purist. Avoid huge class hierarchies: keep in mind that all this structure has to be parsed and then instantiated at every request. It’s often better to have a very flat class system. Don’t store large amounts of data in the $_SESSION variable because it, too, has to be reloaded on every request. Unless you’re sure your web server does opcode caching (with APC or OPcache, for example), don’t use huge files full of unnecessary source code.

Things to Do:
Procedural programming is not necessarily evil. Do it whenever it has speed and/or simplicity advantages. Only require()/include() files that are definitely needed. Consider using a class loader (carefully) to load functionality modules on demand instead of monolithically including all your stuff upfront.
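
Here is a rough sketch of what such an on-demand loader could look like. spl_autoload_register() is PHP's built-in hook for this; the lib/ directory, the one-class-per-file layout, and the function name registerLazyLoader() are assumptions made up for this illustration.

```php
<?php
// Hypothetical layout: each class lives in <baseDir>/<ClassName>.php.
// Nothing is parsed until a class is actually used.
function registerLazyLoader(string $baseDir): void
{
    spl_autoload_register(function (string $class) use ($baseDir): void {
        // Accept plain alphanumeric class names only, so the name
        // can never walk out of $baseDir.
        if (!ctype_alnum($class)) {
            return;
        }
        $file = $baseDir . '/' . $class . '.php';
        if (is_file($file)) {
            require $file;
        }
    });
}

registerLazyLoader(__DIR__ . '/lib');
// $report = new ReportGenerator(); // lib/ReportGenerator.php would be loaded only now
```

The point of the design is that a request which never touches a class never pays for parsing its file.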

Things to know:
– HTTP request headers: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields
– CGI interface, variables: http://en.wikipedia.org/wiki/Common_Gateway_Interface

Profiling, profiling, profiling…

PHP makes it easy for you to track the amount of processing time and memory your app is using. It is absolutely essential to track this. Intuitively, I’d say that the execution speed of PHP falls somewhere between Ruby and JavaScript(V8) and it’s easy to make mistakes that end up using a lot of memory and/or valuable CPU time.

You don’t even need serverside debuggers or fancy instrumentation to achieve this. Calling microtime(true) returns the current Unix timestamp as a float with microsecond resolution, and memory_get_usage() tells you how many bytes are currently allocated to your script. This makes it easy for you to check the two most fundamental resources at key points during your application’s execution path.

Personally, I like to use an extremely simple profiling function based on the microtime() function. Using a function like this to profile your code will allow you to measure how horrible certain operations really are. For example, connecting to a database. Running regular expressions. It all has a cost and you need to know what it is. It’s always good to know what’s going on behind the scenes – so avoid libraries that obfuscate their behavior for the sake of fake simplicity.
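
To give an idea of what I mean, here is a minimal sketch of such a helper. The function name profile() and the output format are my own invention for this example; only microtime() and memory_get_usage() are actual PHP functions.

```php
<?php
// Hypothetical helper: collects named checkpoints with elapsed time and memory.
function profile(string $label): array
{
    static $start = null;
    static $marks = [];

    $now = microtime(true);          // float seconds, microsecond resolution
    if ($start === null) {
        $start = $now;
    }
    $marks[] = [
        'label'  => $label,
        'ms'     => round(($now - $start) * 1000, 3), // milliseconds since first call
        'memory' => memory_get_usage(),               // bytes currently allocated
    ];
    return $marks;
}

profile('start');
usleep(2000);                        // stand-in for a DB connect or a regex run
$marks = profile('after slow step');

foreach ($marks as $m) {
    printf("%-16s %8.3f ms %10d bytes\n", $m['label'], $m['ms'], $m['memory']);
}
```

Sprinkle calls like these around the operations you suspect, and dump the list at the end of the request.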

For more serious profiling and debugging, you should check out: http://xdebug.org/ but microtime() still provides you with valuable and quick information in any PHP environment.

Sane(r) Input and Output

One of the most common publicly visible mistakes is failure to sanitize the input of a web application. It’s important to remember that every single piece of data coming into your app is potentially hazardous. The web really is out to get you! In PHP, the $_REQUEST array merges the GET, POST and (depending on configuration) cookie parameters of the current request, and it’s essential not to trust any of them. Sadly, there is no single way to make this data safe. It depends on what you do with it. On a more positive note, the handling of user input generally falls into one or more of the following categories and there are standard practices you can employ to avoid the worst:

>>> Rule 1 of input data hygiene: Nuke it from orbit, it’s the only way to make sure! <<<

Displaying Data
In the most common scenario, a user submits some kind of text to your application and the app in turn displays that text on the site. Naively displaying whatever the user put in opens a huge opportunity for attacks on your site with an XSS exploit. Thankfully, it’s easy to sanitize this kind of data. In most cases, htmlspecialchars() will take a text and render it harmless by escaping angle brackets, ampersands and quotes. But in some cases you might want to allow the user to enter markup instead of just plain text. In theory, PHP lets you specify a set of “good” tags and filter all the other ones out with strip_tags(), but this function is horribly unsafe because it leaves the attributes of allowed tags untouched, which lets malicious users sneak in JavaScript event attributes. That means you have to use something to strip those attributes out as well (there are some examples in the PHP documentation); however, this is not trivial. In fact, I believe it’s the single biggest reason why bulletin boards started their own markup language known as BBCode.
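
For the plain-text case, a small wrapper is usually all it takes. This is just a sketch; the helper name e() is an arbitrary choice of mine, and I'm assuming your pages are served as UTF-8.

```php
<?php
// Escape user-supplied text for output inside HTML.
function e(string $text): string
{
    // ENT_QUOTES also escapes single quotes, which matters inside attributes.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$comment = '<script>alert("xss")</script>';
echo '<p>' . e($comment) . '</p>';
// The browser now shows the script tag as text instead of running it.
```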

Text in Databases
SQL database queries come with a little bit of baggage. Often, you will need to take user input and run queries with it. For example, you might want to store data in the DB or you might want to retrieve it. Most database abstraction libraries will let you use a syntax like this:

SELECT * FROM articles WHERE id = ?

The nice part of having support for placeholders like “?” is that you don’t need to worry about making the content of your variable safe. You can just pass it as another parameter to your query function. Depending on the database and the library used, this might also have the further advantage of enabling the database to precompile the query and execute subsequent queries a bit faster.
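
As a sketch of how this looks with PDO: the example below assumes the pdo_sqlite driver is available and uses an in-memory database with a made-up articles table, purely so that it runs on its own. The function name findArticle() is hypothetical as well.

```php
<?php
// Assumption: the pdo_sqlite driver is available; an in-memory database
// keeps the example self-contained. Swap the DSN for your real database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)');
$db->exec("INSERT INTO articles (id, title) VALUES (1, 'Hello'), (2, 'World')");

function findArticle(PDO $db, $id)
{
    // The value travels separately from the SQL text, so it cannot alter the query.
    $stmt = $db->prepare('SELECT * FROM articles WHERE id = ?');
    $stmt->execute([$id]);
    return $stmt->fetch(PDO::FETCH_ASSOC);   // row as array, or false if none
}

$article = findArticle($db, '1');           // e.g. a raw value taken from the request
$nothing = findArticle($db, '1 OR 1=1');    // treated as a plain value, matches no row
```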

There are, however, situations where you might not want to use or be prevented from using an abstraction layer or library. Using the built-in MySQL functions can work, too; you just need to be more careful. You do have to take care of properly escaping the variables yourself. There is just one single function that you can use to make data safe for MySQL consumption and its name is mysql_real_escape_string() (or mysqli_real_escape_string() in the newer mysqli extension). If you’re using any other function, stop it immediately. mysql_real_escape_string() is your one and only true lord and savior: worship the blessed bytes it churns out.

Dynamic Code Paths

This is a touchy subject. PHP allows you to do a lot more things with your code based on variables than most other languages. If you’re going to use those features, make sure to have a very good reason for it. If done right, those features can substantially reduce the complexity of your code – but you have to be extremely careful.

Dynamic include()s: PHP allows you to include a file specified in a variable. You can do stuff like this: include('inc/'.$moduleName.'.php');. Contrary to many other people, I think it’s fine to use this feature in principle because it allows you to introduce very simple extension mechanisms into your app and it can help keep your codebase clean. But as always, with this kind of power, comes a huge responsibility: you have to make sure $moduleName is legit and can’t be used to call arbitrary code on your server. A good way to ensure basic sanity inside this variable is to use at least basename($moduleName) on it, but a much better solution would be to strip out any non-alphabetic characters. Nukes. Orbits. See above!
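
A minimal sketch of the letters-only approach follows. Note that it rejects suspicious names outright instead of stripping characters from them, which is a slightly stricter variation on the idea above. The function name resolveModule() and the "module" request parameter are hypothetical.

```php
<?php
// Returns the path of a module file, or null if the name is not acceptable.
function resolveModule(string $moduleName, string $dir = 'inc'): ?string
{
    // Letters only: no dots, slashes, or null bytes survive this check.
    if (!ctype_alpha($moduleName)) {
        return null;
    }
    $path = $dir . '/' . $moduleName . '.php';
    return is_file($path) ? $path : null;
}

// Hypothetical request parameter; anything that is not a string is ignored.
$moduleName = (isset($_GET['module']) && is_string($_GET['module']))
    ? $_GET['module']
    : 'home';

$path = resolveModule($moduleName);
if ($path !== null) {
    include $path;
}
```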

Dynamic variables and functions: In PHP, you can set the content of a variable $v by specifying its name through another variable. For example, if you set $nnn = 'vvv';, you can do $$nnn to access $vvv. But wait, there is more. Suppose you have a function vvv(): you can call this function by writing $nnn();. Obviously, this is very powerful stuff, so you have to make sure the “id” variable (in this example $nnn) is sane and can’t be used from the outside to call arbitrary stuff in your app. Contrary to the previous methods, there is no single way to make sure this code can’t be abused: you’ll have to make sure of it in a manner that is appropriate to your code specifically.
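
One way that often fits is an explicit allowlist: the outside world only ever picks a key, never a function name. The handler names below are invented for this sketch, and your app may well need something different.

```php
<?php
function showArticle(): string { return 'article'; }
function showIndex(): string   { return 'index'; }

// Only names listed here can ever be called, whatever the request says.
function dispatch(string $action): string
{
    $allowed = ['article' => 'showArticle', 'index' => 'showIndex'];
    $handler = $allowed[$action] ?? 'showIndex';
    return $handler();   // variable function call
}

echo dispatch('article');   // calls showArticle()
echo dispatch('phpinfo');   // unknown key, falls back to showIndex()
```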

Eval is evil: for some unknowable reason, many newbie developers seem to be enthralled by eval(), probably because they failed their saving throw against its evil whisperings of doom at some point. In any case, eval() is probably the one function responsible for the most colorful WTFs in coding history.

eval(); allows you to execute arbitrary code by invoking an interpreter inside the interpreter. Developers often use this to implement dynamic features, such as event handlers that can be specified by the users of an application. The dangers here are obvious: there is no way to make this safe. If you allow your users to specify custom code, you damn well better make sure they’re trustworthy, because they will have access to anything and everything on your server. I believe in 99% of all cases, the desire to use eval(); is not even remotely legitimate. However, there might be scenarios where eval() is justified, for example in CMS applications or meta-programming projects. For PHP beginners and intermediates there is just one rule regarding eval(): if you’re using it, you’re doing it wrong. It’s that simple. Don’t listen to the voices!

Calling the shell
Sometimes, when a specific and arcane piece of data processing is required, it can be a good idea to just call a Unix text command from inside your PHP application. If you do that, you have to be aware that your code has a high likelihood of being specific to your server configuration. Chances are, your call won’t work on another configuration. Whenever some tiny aspect of your shell call is influenced by user data, you have to sanitize those values: use escapeshellarg() on every individual argument, or escapeshellcmd() on a complete command string.
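
As a small illustration, here is how a single user-supplied argument might be wrapped. The wc command and the function name are arbitrary picks for this sketch, and the quoting shown assumes a Unix shell.

```php
<?php
// Builds a command line that counts the words in a user-named file.
function buildWordCountCommand(string $filename): string
{
    // escapeshellarg() quotes the value so the shell sees exactly one argument;
    // the "--" tells wc that no options follow, even if the name starts with "-".
    return 'wc -w -- ' . escapeshellarg($filename);
}

$cmd = buildWordCountCommand('notes.txt; rm -rf /');
// $output = shell_exec($cmd);   // would run wc on one oddly named file, nothing else
```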

Regular expressions
Regular expressions are very practical, small and can be highly performant if done right. They do, however, require a specialty tech priest to come in and bless the code. You can’t just copy an arcane regex ritual from the web and expect it to work on your project. Regexes often sit there and look like they’re working, but in reality they’re just lying dormant until they finally betray you at a more opportune time. It’s ridiculously easy to get them wrong, and they don’t lend themselves to bug-spotting at one glance. You probably should not rely on regular expressions to sanitize your data, unless you know exactly what your expression does. Because even if you think you know what it does, chances are high it will do weird things with weird inputs. Even the Supreme Grandmaster Regex Scholars of Perldom get this stuff wrong with a scary high probability. Personally, I’d advise you to steer clear of security-relevant regexes if the same can be done with a few simple and more maintainable lines of real code instead. Orbits. Nukes. Swords. Gunfights.

What to know:
– Regular Expressions: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm

Databases

We already talked about database security. As always, there is more. In most cases your database will be MySQL. Like PHP, MySQL has a very bad reputation – so it’s only natural that those two should pair up to be the web server standard configuration of the civilized world. MySQL will be fine for most of your standard storage and query needs. A lot of people tend to prematurely optimize their web projects and choose a NoSQL solution because they believe it’s going to be faster, or even simply because “the cool kids are doing it”. Needless to say, this is not actually warranted in most cases. PHP allows you to go with the database you like the most – if that’s MySQL, you’ll save a lot of configuration trouble. If it’s not MySQL, that’s fine, too. But chances are very good you don’t need a special DB solution for your web project so unless you’re experimenting you’ll probably get the most mileage out of something you already know very well.

What to avoid:
Avoid making a database connection when your request doesn’t need it. And when your request does need it, you should prefer persistent database connections over ad-hoc ones (e.g., mysql_pconnect(), or the PDO::ATTR_PERSISTENT option with PDO) for the simple reason that it potentially shaves some execution time off the connection part. Avoid executing many queries in favor of consolidating them into just a few. Every time you fire off a query to the DB server, your program has to wait for data to come back. It’s also a good idea to avoid hugely complex queries, especially if you need to plan for scalability. SQL gives you a lot of rope to potentially hang yourself, so keep performance in mind and measure it!
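
To illustrate the consolidation point, here is a sketch that fetches several rows in one round trip instead of running one query per id. As before, it assumes pdo_sqlite and a made-up articles table so that it runs standalone; findTitles() is a hypothetical name.

```php
<?php
// Fetch several rows with one query instead of one query per id.
function findTitles(PDO $db, array $ids): array
{
    if ($ids === []) {
        return [];   // "IN ()" would be a syntax error
    }
    // One "?" per id, so every value is still passed as a bound parameter.
    $placeholders = implode(',', array_fill(0, count($ids), '?'));
    $stmt = $db->prepare(
        "SELECT id, title FROM articles WHERE id IN ($placeholders) ORDER BY id"
    );
    $stmt->execute(array_values($ids));
    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR);   // [id => title]
}

// Assumption: pdo_sqlite is available; the table and its rows are made up.
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)');
$db->exec("INSERT INTO articles VALUES (1, 'One'), (2, 'Two'), (3, 'Three')");
$titles = findTitles($db, [1, 3]);
```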

A word about legacy libraries
You may be wondering why I’m referencing an obsolete MySQL API above. Even though the very first example is all about DB abstraction libraries, people still bring it up. To make it clear: I’m not condoning its usage, but chances are you’ll come across similar problems at some point, for example if you’re debugging legacy code or if you work with other libraries where proper escaping becomes an issue. It’s worth at least being aware of unsanitized input at all times. I’m not advocating the use of failure-prone, low-level, obsolete libraries – I’m trying to talk about being more conscious of dirty data. Again: use PDO, or whatever suits you, and avoid naked mysql_ calls.

What to know:
– MySQL, obviously: http://dev.mysql.com/doc/refman/5.0/en/tutorial.html
– Indexes are essential

File Uploads

File uploading is the act of accepting a file from the browser into your web application. Like all user input, you must be prepared for the worst. People will try to crash your uploading code or they will try to upload executable files onto your server. When accepting an uploaded file, it’s important to check its content first before storing it on the server. For example, if an attacker manages to upload a PHP code file into one of your directories, it’s game over. That’s what happened to the Trojans. One day they allowed a highly executable horse to upload into their root directory. It was not a good day.

To avoid this, you have to make sure that, say, an image file uploaded to your app is actually an image file. The $_FILES variable contains information about a file’s MIME type. Sadly, you have to disregard this information, because it’s been supplied by the user’s browser and is thus utterly evil. Instead, you have to get the actual MIME type of the file directly. In the good old times, you could use mime_content_type(), but the PHP manual marked it as deprecated in favor of the Fileinfo extension. Rescue comes from an unexpected place: among PHP’s image functions there is one that, among other things, returns the actual MIME type of an image file: getimagesize(). It works even when the GD library is not installed. Use this to check what your uploaded file actually contains and simply reject everything that does not correspond to one of the MIME types explicitly allowed by you.
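
A minimal sketch of that check might look like this. The function name detectImageType(), the list of allowed types, and the "upload" form field are assumptions for the example. Keep in mind that this only inspects the file header, so treat it as one layer of defense and not the whole wall.

```php
<?php
// Returns the detected MIME type if the file is an allowed image, null otherwise.
function detectImageType(
    string $path,
    array $allowed = ['image/jpeg', 'image/png', 'image/gif']
): ?string {
    // getimagesize() inspects the file contents and ignores browser-supplied metadata.
    $info = getimagesize($path);
    if ($info === false) {
        return null;   // not something PHP recognizes as an image
    }
    return in_array($info['mime'], $allowed, true) ? $info['mime'] : null;
}

// Hypothetical usage inside an upload handler:
// $type = detectImageType($_FILES['upload']['tmp_name']);
// if ($type === null) { /* reject the upload */ }
```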

What to know:
– MIME types: http://en.wikipedia.org/wiki/MIME
– HTTP methods: http://en.wikipedia.org/wiki/HTTP#Request_methods

Extra Credit: Caching

On many servers, you will have access to a service called memcached. It’s essentially a mini server process that allows you to store and retrieve arbitrary data very fast. To save or retrieve a data package from memcached, your application needs to connect to the memcached server and give it the key of the object you’re interested in. This key can be any string. Keep in mind, though, that other applications may be using the same key/value storage, so choose your keys in a manner that does not lead to conflicts. Remember to store user-specific information with a user-specific key.

As with any server, connecting to memcached and retrieving a cached object costs time. It’s only worth it if that interval is actually shorter than the time it takes to recompute a given data structure.

In the absence of a fancy memcached service, you can simply fake it, too. For example, if your application needs to generate a report or a big chunk of HTML that seldom changes, you can store that data in the file system as well. Simply give your application a tmp/ folder and dump your cached objects there. But while memcached objects expire automatically, you’ll have to take care of the lifecycle of cache files yourself.
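
Here is a rough sketch of such a poor man's cache with a simple time-based expiry. The function name cached(), the tmp/ default, and the cache_ file prefix are my own choices for this example; it also assumes the cache folder exists and is writable.

```php
<?php
// Returns cached content if it is younger than $ttl seconds, otherwise
// calls $produce(), stores the result, and returns it.
function cached(string $key, int $ttl, callable $produce, string $dir = 'tmp'): string
{
    // Hash the key so arbitrary strings become safe file names.
    $file = $dir . '/cache_' . md5($key);
    if (is_file($file) && filemtime($file) > time() - $ttl) {
        return file_get_contents($file);
    }
    $data = $produce();
    // LOCK_EX keeps concurrent requests from interleaving their writes.
    file_put_contents($file, $data, LOCK_EX);
    return $data;
}

// Hypothetical usage: rebuild an expensive report at most every five minutes.
// $html = cached('report:user42', 300, function () { return buildReport(); });
```

Stale files are simply overwritten on the next miss, but files for keys that are never requested again stay around until you clean them up.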

More Performance Tuning

Keeping your response times low inside the app is just the beginning. There is much more you can do to speed up page loading. Going into all the details here would be drastically out of scope. There are several books dedicated to these problems. Just a few performance hints here that are not generally advertised:

Consolidating static files
PHP makes it ridiculously easy to stream arbitrary content to your users, including JavaScript and CSS files. The nice aspect of serving JavaScript and CSS from PHP is twofold: first, you have full control over the HTTP headers needed to tweak the browser’s caching behavior – this is an option you don’t have when serving static files from your webserver directly in most cases. Second, you can actually concatenate multiple JS or CSS files into one big chunk and give this chunk out in one go.

For example, say your web app needs jQuery, 4 jQuery plugins, 2 custom JS code files and 4 CSS files. A browser must make 11 requests to get those files. While the files themselves may be small, the latencies of requesting them can add up to a considerable delay. What’s worse, you’ll have at least 11 include tags inside your HTML for every request! You could now combine those files manually inside a text editor, but that makes them harder to maintain. It’s easier to just have PHP compile two large files: one JS and one CSS!
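
The core of such a bundler is only a few lines. In this sketch the function name bundle(), the file names, and the one-day cache lifetime are placeholders I made up; pick values that suit your app.

```php
<?php
// Concatenates the given files into one response body.
// The file list must be hard-coded; never build it from request data.
function bundle(array $files): string
{
    $out = '';
    foreach ($files as $file) {
        // The newline keeps the last line of one file from running into the next.
        $out .= file_get_contents($file) . "\n";
    }
    return $out;
}

// Hypothetical file names; adjust to your project.
$js = ['js/jquery.js', 'js/plugin-a.js', 'js/app.js'];

// In the script that serves the bundle:
// header('Content-Type: application/javascript');
// header('Cache-Control: public, max-age=86400');   // let browsers keep it for a day
// echo bundle($js);
```

If the bundle is expensive to build, it is a natural candidate for caching on the server side, too.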

Keep an eye on your output
From time to time, have a look at the HTML code your app is producing. Sometimes it’s amazing how much junk can creep in there. Have a look at every element and ask yourself if this really needs to be in there. The same goes for CSS and JavaScript files: over time they tend to accumulate dead-end sections that are not used anymore. The Unix command "grep" is your friend. With grep, you can search your entire app for all occurrences of a string and it can really help you find out whether a given piece of code is actually used.

You Do Have Comments

That’s it from my end. There is lots more to be said on all of these subjects, but it should be enough to point interested people in the right direction. Chances are, battle-hardened PHP veterans disagree with any or all of these suggestions. Some may even get violently ill upon reading this. Pay attention to them, they might have a point. Or not. Decide what makes sense to you.

Happy coding!

25 thoughts on “PHPitfalls”

  2. Justinas

    > There is just one single function that you can use to make data safe for MySQL consumption and its name is mysql_real_escape_string()

    Or you could step into the 21st century and use PDO or MySQLi. ext/mysql is deprecated as of PHP 5.5.

  3. Jordan

    I work almost exclusively with ASP.Net these days, but I used to be half decent at writing PHP (and wrote a small proof of concept in PHP a few months ago).

    PHP just feels like you have to place so many restrictions on yourself and honestly, I will never recommend anyone use it for any project, ever. There is almost always a better choice between ASP.Net, Rails, Node.js, etc. PHP has become the Cobol of our time. The whole idea of writing a programming language that non-programmers use is the primary broken thing about PHP, and it’s too late for it to be fixed. Now. On to your post and why I still think PHP is very bad:

    1. A clean slate: Basically means you have to lazily load everything. Something very different from nearly every other application framework
    2. Have a flat class system: Yea, it’s not like OOP ever made code easier to write or anything. PHP is inherently more difficult to write properly
    3. Profiling with microtime(): Seriously? Have you ever used an actual profiler? Timing something with microtime is nice for benchmarks, but to know what is actually wrong with the performance of your app, a real profiler is required
    4. Sane inputs: good points there.
    5. Everything dynamic: Eh, some of it’s good and some of it’s bad. Good points here, though things like dynamic includes are not nearly so needed when you can have extra code for a single page request which doesn’t affect performance at all (ie, any other framework)
    6. Caching without memcached or some such: Yea, it’s not like a persistent application would be nice to have or anything
    7. Consolidating static files: Please for the love of god do not do this! Any web server worth its weight in salt can serve these faster than PHP can dream AND provide the proper HTTP cache headers so it doesn’t request it more than once per session.

    Good blog post about writing “good” PHP, but most people who care about writing good code has already moved on to other frameworks/languages.

  4. udo Post author

    That’s a good point. I did cover XDebug though. I’m not suggesting the microtime() hack is a replacement for anyone who wants to do serious profiling.

  5. Andeew Pennebaker

    I was ready to write off PHP as a ruddy awful Perl descendant, but your article points out how PHP actually does some things better than other languages, such as separating each http request and encouraging lightweight OO hierarchies. So PHP isn’t as fucked as I expected.

  6. Mike

    “but most people who care about writing good code has already moved on to other frameworks/languages.”

    This is funny. I do mostly C# code but the statement above is just flamebait. :D PHP is fun and has its place for the job at hand.

    Good post. I also agree he never said microtime was a profiler replacement.

  8. Stan Vass

    “Functional programming” is a very different thing. You mean procedural programming. Sorry to nitpick :)

  10. udo Post author

    Hi Stan, not in my opinion. I think the preferred style here is indeed functional, but not necessarily procedural. A lot of people still use OO to organize code and to do light-weight inheritance, but the coding philosophy is still functional. Of course, you’re right when you’re saying that PHP historically emphasizes a procedural style.

  11. udo Post author

    Michael, I thought the extra paragraph on legacy libraries and several sentences throughout the article covered this, but apparently not ;-) I took your suggestion to heart and mentioned PDO by name now.

  12. Dionysis Zindros

    Hi udo,

    Stan is actually right. You don’t mean “functional programming”, you do mean “procedural programming” or perhaps “using functions instead of objects”. Functions in PHP are not first-class citizens, and hence functional programming is impossible, unlike Haskell, LISP, Scheme, or Javascript to name a few.

    In particular, the only functional feature of PHP is the “create_function” function, which is unfortunately currently incorrectly implemented in PHP from a computer-science linguistic point of view.

  13. udo Post author

    Dionysis, this is absolutely not true. I use anonymous functions in PHP all the time, it would really suck if they didn’t exist. It works like this:
    $f = function() { return('...I was made by a function object'); };

  14. zhangtaihao

    Any small-scale project can be built by anyone easily, but not necessarily securely. Any seasoned PHP developer recognizes the importance of structured code, and may have gained that knowledge by working with or designing frameworks. Frameworks built for large-scale systems tend to specifically influence the way developers work, which is incidentally also why inexperienced developers tend to hate complex frameworks (because they have not gained a holistic view of well-developed PHP applications).

    Other established web languages have some sort of framework built into an industry standard. You see Java, ASP.NET, and Ruby developers working with a reasonably well defined methodology, sometimes because there is no other way to do it, but mostly because the language environment affords a standardized (often optimized) structure to the produced code. Because the scale of the open PHP community is absolutely staggering, PHP as a whole has not quite evolved to a state of consistency that some of those other established web languages have ended up (e.g. a lot of PHP programmers have never heard of the PSR-0 standard).

    To illustrate, it’s easy to build a Symfony application that can perform at levels on par with pre-compiled systems. You do find a lot of Open Source middleware for PHP, and you can use advanced techniques like dependency injection and model-driven development in frameworks like Symfony. You just don’t hear it as often as you do in, for example, the Java or ASP.NET space, simply because PHP has no “standard” framework (no, I don’t consider Zend a PHP standard, but merely a choice).

  15. Batman

    @Jordan

    PHP is not a programming language. Neither is ASP. They are scripting languages.

    ASP.NET is definitely no better than PHP. Anytime a developer complains about a language it’s 100% of the time because they lack the logic and skillsets to create secure, fast, responsive code.

    If PHP sucked so bad, why would Google (Google primarily uses Python, but uses PHP for several of its applications), Yahoo, Wikipedia, WordPress and Facebook be using it? Your code is only as good as your developer can make it.

  16. Anthony

    Good stuff. Best line: “Regexes often sit there and look like they’re working, but in reality they’re just lying dormant until they finally betray you at a more opportune time.” Few statements are more true than that!

  17. Martin

    Loving that line… “That’s what happened to the Trojans. One day they allowed a highly executable horse to upload into their root directory. It was not a good day.”
