The performance reducing database cache
Here's another story from the same company which brought us the self deleting user table.
We wanted to add a small script to one of the sites they built. This site is visited by a reasonable amount of users and generally works quite well. Sometimes we get load spikes which are very annoying. Not so much that it crashes the server, more like pages taking a around 3 seconds instead of 1 second to load. We never really investigated those performance issues. Anyway, the script we wanted to develop doesn't integrate directly with the CMS they deployed, but might make use some information from it. My first thought was: "Maybe we can gather some useful information from the session?". So I wrote a tiny PHP file and uploaded it to the server:
<?php session_start(); header("Content-Type: text/plain"); print_r($_SESSION);
I opened the script in my browser and it just started circling the little spinner in the tab bar for a while. "Odd ...", I thought - it shouldn't take this long. Then the result finally appeared. It looked like I would have needed a one kilometer long screen to fit all the data. Checking on the server, I figured out that the session was 1.7MB in size. Typically a session should be few kilobytes max - so what the hell is going on?
Grepping through the source code I quickly found the culprit. Unfortunately without any comments, so I can only try to reconstruct the process that must have been going on in the developers mind. I'll do that using an inner monologue:
Great. I've built this nice class which maps perfectly to the records of the data table. That means instead of messing with SQL I can just run a "SELECT * FROM x" on page load and store all the data in an array by creating an instance of my marvellous class for each record. Hmm ... This looks like it's going to be time consuming to do this on every page load ... I know what! I'll put it in a cache - that will speed things up! ... But which cache? In PHP nothing is preserved across page loads ... Except ... Yes! The session! That's what's kept, I'll put it there! Oh and I use the PHP session functions instead of the CMS's database based session system to make it faster.
There are few things wrong with this however:
- You do not run a "SELECT * FROM x" to select thousands of records and then loop though an array to find them. If you only need a few records, you form a proper SQL query which only returns the records you need. If you need many (or all), then you may run this style of query - but you store the records in a hashtable or some other construct where your records can be found easily. Never linearily iterate over an array to find items!
- As the developer (probably) correctly assumes, nothing in PHP is preserved between requests. Every requests starts the PHP process from scratch. No variables make it through, Caching object instances like the developer attempted is fine if you are using an application server written in Java, node.js, Python, C, etc.. It is fine in some environment which preserves state across requests. PHP doesn't. Even though in PHP it might seem that the Session is preserved, it is actually stored on disk between requests. So when something changes, all the objects are serialized and written to a file on disk. During the next request the file is loaded, deserialized and all the instances rebuilt - exactly what this programmer tried to avoid.
- Using the PHP session at all is a bad idea when the CMS comes with its own session management system. First of all there is no synchronization between the sessions, one might be still valid even if the other one runs out. Also many PHP programmers don't know that PHPs session management system is blocking. Only one process may use the session at one time. Meaning, if you for example run two AJAX calls at the same time and they both use session_start() on the same session file, one of them has to wait until the execution of the other script finishes before it can run. The session system provided by the CMS might be non-blocking. If it is blocking however, mixing the session types is even worse because it could cause deadlocks!
Just to put icing on the cake, there is a bug which reloads the data from the database on each pageload anyway, making the "caching" totally pointless. It was an attempt to periodically invalidate the cache to account for changes in the database. So here's the process the PHP interpreter has to walk through on each page load:
- Find the session file on disk and load it into memory
- Deserialize thousands of objects and load them into an array in the session
- Throw everything which was just loaded away
- Query the database and recreate the thousands of objects, store them in the session
- Work with the data in the session, maybe access a few objects, maybe not
- Serialize thousands of objects and store them in the session file on disk
ARGH!! With just a few lines of code I disabled this "caching" in our testing instance, reducing the steps to:
- Query the database and create the thousands of objects, store them in a global veriable
- Work with the data in the variable, maybe access a few objects, maybe not
The proper fix would be to actually use SQL the way it was meant. But this small fix is giving us a huge performance improvement already and it wasn't causing major issues before. So for now we'll just leave it like it is, with the "caching" disabled.
Like they say: "A little knowledge is a dangerous thing". A junior programmer might naively assume that session variables just seems to magically keep existing between requests, not knowing they actually make a round-trip through the disk everytime. If this code would have been properly reviewed by a senior developer this would have never made it into production.
... oh and just to finish the story - there was nothing of value stored in the PHP session which we could have used for the script we were developing. But my colleague found another way to implement it.
Neuen Kommentar schreiben