Helping to Pay for Drizzle.org
October 27, 2008
Well, it has certainly been some time since I last blogged; however, that does not mean that I have not been keeping note of what is going on in the industry. If you have not heard about the Drizzle Project, then you must be living in a cave!
The Drizzle project is building a database optimized for cloud computing and web applications. It is being designed for massive concurrency on modern multi-CPU/core architectures. The code is originally derived from MySQL.
Mike Shadle recently made a contribution by negotiating and purchasing Drizzle.org for $1K. If you haven’t contributed back to open source, here is a chance to show your token of appreciation. Help to support this project and help pay for Drizzle.org!
Update To My Readers…
September 15, 2008
So there has been a very long period where I have not written. This is largely due to several things that have happened over the last couple of months. I was working on purchasing a home, which fell through due to some unforeseen market conditions and waiting times caused by the mortgage crisis in the US, so I decided to wait another 6 months. Then my laptop was stolen, which caused the loss of a significant amount of documentation and information that I had previously put together.
I am hoping to get things back to normal in the near future when my travel schedule slows down and I get situated in a new place over the next few weeks. I am planning on doing a few more performance-related posts as well as posting on some additional topics that I have not fully covered yet.
PHP Performance Series: Maximizing Your MySQL Database
June 18, 2008
In the first article of the PHP Performance Series, I focused on PHP Caching Techniques. This time I want to talk about maximizing your database. This article will deal mostly with MySQL; however, many of the points should still apply even if you do not directly utilize MySQL.
Application SQL Performance
Application-level SQL performance is different from the performance of the SQL query itself; it is about how the queries have been designed to work within the application. Many of the items I will be addressing in this area are about designing your application to make fewer queries, thus improving scalability and likely performance. However, performance does not always equal scalability, just as scalability does not always equal performance.
If you have read my blog before you may notice that I have used some of this content before, but I am including it in this section for the sake of completeness.
Lazy Connections
Utilizing lazy connections for your database is a great step in applications that do not need the database for the full request, or that may not need it at all. The concept here is to not open the connection to the database unless it is absolutely essential, which keeps your connection pool free of massive amounts of sleeping connections.
Simple Lazy Connection Example
While not a full example, I believe this shows you a simple technique for handling lazy connections.

    class My_Db
    {
        private $_connected = false;
        private $_connection;
        private $_params = array();

        // Only stores the connection settings; no connection is opened yet.
        public function connect($host, $user, $pass, $db)
        {
            $this->_params = array($host, $user, $pass, $db);
        }

        // Opens the real connection the first time a query needs it.
        private function _connect()
        {
            list($host, $user, $pass, $db) = $this->_params;
            if ($this->_connection = mysql_connect($host, $user, $pass)) {
                mysql_select_db($db, $this->_connection);
                $this->_connected = true;
            }
        }

        public function query($query)
        {
            if (!$this->_connected) {
                $this->_connect();
            }
            return mysql_query($query, $this->_connection);
        }
    }
Iterating Queries
This is one of the most common items I see when looking over another developer’s code or even several of the open source projects out there. I am defining an iterating query as a query that executes inside a loop. These can be very expensive and are often not needed.
An Iterating Query Example

    if (isset($_GET['ids'])) {
        // one query per id: N ids means N round trips to the database
        foreach ($_GET['ids'] as $id) {
            $rs = mysql_query('SELECT * FROM my_table WHERE my_id = ' . (int) $id);
            $row = mysql_fetch_assoc($rs);
            print_r($row);
        }
    }
Fixing The Iterating Query Example

    if (isset($_GET['ids']) && is_array($_GET['ids'])) {
        $ids = array_map('intval', $_GET['ids']);
        // an empty list would produce "IN ()", which is invalid SQL
        if (count($ids) > 0) {
            $ids = implode(',', $ids);
            // a single query fetches every requested row
            $rs = mysql_query('SELECT * FROM my_table WHERE my_id IN (' . $ids . ')');
            while ($row = mysql_fetch_assoc($rs)) {
                print_r($row);
            }
        }
    }
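The same single-query idea can also be expressed with bound parameters. The following is only a sketch under a few assumptions: it uses PDO rather than the mysql_* functions shown in this post, `build_in_query` is a hypothetical helper name, and it selects only `my_id` because the other columns of `my_table` are not known here.

```php
<?php
/**
 * Build one SELECT with a "?" placeholder per id.
 * Returns null when there is nothing to look up, since "IN ()" is invalid SQL.
 * (build_in_query is a hypothetical helper, not part of any library.)
 */
function build_in_query(array $rawIds)
{
    // Cast to int and drop duplicates so the IN list stays as small as possible.
    $ids = array_values(array_unique(array_map('intval', $rawIds)));
    if (count($ids) === 0) {
        return null;
    }
    $placeholders = implode(',', array_fill(0, count($ids), '?'));
    $sql = 'SELECT my_id FROM my_table WHERE my_id IN (' . $placeholders . ')';
    return array($sql, $ids);
}

// Usage with an existing PDO connection ($pdo is assumed to be set up elsewhere):
// list($sql, $params) = build_in_query($_GET['ids']);
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
```

The helper only builds the statement text and the parameter list, so it can be checked without a database connection.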
Need Based Selects
There is no need to select information that you do not need. For starters, this increases the memory usage of the request as well as the I/O time spent fetching the data from columns that are not being utilized and transferring it through PHP. The more data you select, the slower the query and the larger the memory footprint. This is especially true with the TEXT and BLOB column types.
SELECT * IS BAD!
This is actually bad in 2 different areas. First, and of higher concern: what is actually being utilized from the query if you ever need to change something in the application? Say you have a large application with thousands of files that utilize a select query; you will have no idea where the columns are used without researching each and every area of the application that has the wonderful SELECT *. Basically, your maintainability slowly dies. Secondly, as I stated before, there is also a performance and memory hit here. Simply stated, do not use SELECT *.
One question that is commonly asked after this comment is what if I am using all of the columns? Still, are you going to be using all of the columns in 3 months, 6 months, 1 year, 5 years?
Use the Correct Data Type (Don’t Quote Everything)
Yes, utilizing the correct type of data in your query does matter. With the wrong type you can cause the database to miss indexes or come back with invalid results. Besides that, it is slower since the database has to convert the data into the correct type.
Example of Incorrect Data Types

    if (isset($_GET['id'])) {
        $id = mysql_real_escape_string($_GET['id']);
        // the numeric id is quoted, so it is sent to the database as a string
        $rs = mysql_query("SELECT * FROM my_table WHERE my_id = '{$id}'");
    }
Example of Correct Data Types

    if (isset($_GET['id'])) {
        $id = (int) $_GET['id'];
        // the id is sent as a number, matching the column type
        $rs = mysql_query('SELECT * FROM my_table WHERE my_id = ' . $id);
    }
By the way, the second one is also quicker because we only had to type cast the numeric value instead of running it through mysql_real_escape_string.
Hierarchical Data
When utilizing navigation trees, you should ensure that you are utilizing a proper technique and not pushing your database too hard for it. Further, cache the tree instead of hitting the database on every request! There is already a great article over at MySQL on Managing Hierarchical Data so I am not going to go deep into this. But anytime you are managing a tree of data, please spare other developers and do it correctly.
Database Design
Database design is typically the first area that you find problems start. A bad data model can plague your application with both performance and maintainability concerns. However, also note, the more performance driven that you make your database, the less maintainable that it can become (debatable - can really depend on size and scale).
Please note that this is not an all-encompassing list, as I wanted to give more of a developer’s point of view on the optimization techniques in database design rather than provide a full guide. If you are looking for a deeper level of information, I invite you to go to the MySQL Manual on Database Design. Feel free to add recommendations in the comments.
Normalization
Normalization is a technique utilized for minimizing the duplication of information. This is typically the best thing to start out with; if you are currently having problems and your database is not well normalized, focus on that first. You likely have tables, columns or data that should not be structured the way they currently are.
Columns instead of Table Example
A common problem in many applications is that they store repeated data in extra columns instead of making a one-to-many table. Take for instance the following table:
| Column Name | Column Type |
|---|---|
| user_id | integer |
| user_email | varchar |
| user_website | varchar |
| user_website_2 | varchar |
| user_website_3 | varchar |
What you are likely seeing here is that a user requested to have up to 3 websites and the developer working on it figured there would never need to be any more. This is a good place to gain further modularity as well as normalization.
Table instead of Columns Example
Using a table instead of columns would better support this feature as well as allow for further growth in the future.
The app_user table:
| Column Name | Column Type |
|---|---|
| user_id | integer |
| user_email | varchar |

The app_user_website table:
| Column Name | Column Type |
|---|---|
| website_id | integer |
| user_id | integer |
| website_url | varchar |
You might be thinking, what does this have to do with performance? It comes down to maintainability, better handling of your indexes, and a smaller initial table. The size of your tables definitely matters once you need to start adding indexes to columns that would otherwise not need them. Take both examples and attempt to find users who have not entered a website. With the column layout, this becomes very slow unless you add an index on each of the user_website columns. With the separate table it is a simple select au.user_id from app_user au where au.user_id NOT IN (select distinct user_id from app_user_website), which can use the foreign key that you would likely have defined on user_id in the app_user_website table instead of requiring an additional index.
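To make the "users without a website" lookup concrete, here is a small sketch. It assumes PDO with an in-memory SQLite database purely so that it runs on its own; the table and column names follow the app_user and app_user_website tables, while the index name and the sample rows are made up for illustration.

```php
<?php
// In-memory SQLite is only a stand-in so the sketch is self-contained;
// the query itself is the NOT IN lookup against app_user_website.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec('CREATE TABLE app_user (user_id INTEGER PRIMARY KEY, user_email VARCHAR(255))');
$pdo->exec('CREATE TABLE app_user_website (
    website_id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES app_user (user_id),
    website_url VARCHAR(255)
)');
// Index on the foreign key column (idx_website_user is a hypothetical name).
$pdo->exec('CREATE INDEX idx_website_user ON app_user_website (user_id)');

// Made-up sample data: users 1 and 3 have websites, user 2 has none.
$pdo->exec("INSERT INTO app_user (user_id, user_email) VALUES (1, 'a@example.com')");
$pdo->exec("INSERT INTO app_user (user_id, user_email) VALUES (2, 'b@example.com')");
$pdo->exec("INSERT INTO app_user (user_id, user_email) VALUES (3, 'c@example.com')");
$pdo->exec("INSERT INTO app_user_website (user_id, website_url) VALUES (1, 'http://one.example.com')");
$pdo->exec("INSERT INTO app_user_website (user_id, website_url) VALUES (3, 'http://three.example.com')");

$sql = 'SELECT au.user_id
          FROM app_user au
         WHERE au.user_id NOT IN (SELECT DISTINCT user_id FROM app_user_website)
         ORDER BY au.user_id';
$usersWithoutWebsite = $pdo->query($sql)->fetchAll(PDO::FETCH_COLUMN);
print_r($usersWithoutWebsite);
```

With the three-column layout, the equivalent lookup would have to test user_website, user_website_2 and user_website_3 for emptiness in every row.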
Denormalization
Denormalization is the concept of copying data into other tables in order to reduce the number of joins that you need. These copies should typically be maintained with triggers; if you are unable to use triggers, please ensure that anything that creates, modifies or deletes the data goes through a layer of encapsulation. Otherwise you will end up with stale records.
You should only need to do this when you have exhausted all of the other routes: checking and creating indexes, confirming that the data access is really needed, and establishing that there is no other method for what you are trying to achieve.
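As a rough illustration of keeping a denormalized copy in sync with triggers, the sketch below maintains a per-user website count. Everything specific in it is an assumption: the `website_count` column and the trigger names are hypothetical, and it uses PDO with in-memory SQLite so that it can run on its own (MySQL trigger syntax differs slightly, for example it requires FOR EACH ROW).

```php
<?php
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// website_count is the denormalized copy of COUNT(*) on app_user_website.
$pdo->exec('CREATE TABLE app_user (user_id INTEGER PRIMARY KEY, website_count INTEGER NOT NULL DEFAULT 0)');
$pdo->exec('CREATE TABLE app_user_website (website_id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, website_url VARCHAR(255))');

// The triggers keep the copy current, so application code cannot forget to.
$pdo->exec('CREATE TRIGGER trg_website_insert AFTER INSERT ON app_user_website
    BEGIN
        UPDATE app_user SET website_count = website_count + 1 WHERE user_id = NEW.user_id;
    END');
$pdo->exec('CREATE TRIGGER trg_website_delete AFTER DELETE ON app_user_website
    BEGIN
        UPDATE app_user SET website_count = website_count - 1 WHERE user_id = OLD.user_id;
    END');

$pdo->exec('INSERT INTO app_user (user_id) VALUES (1)');
$pdo->exec("INSERT INTO app_user_website (user_id, website_url) VALUES (1, 'http://one.example.com')");
$pdo->exec("INSERT INTO app_user_website (user_id, website_url) VALUES (1, 'http://two.example.com')");
$pdo->exec('DELETE FROM app_user_website WHERE website_id = 1');

// Reading the count no longer needs a join or an aggregate.
$count = (int) $pdo->query('SELECT website_count FROM app_user WHERE user_id = 1')->fetchColumn();
echo $count, "\n";
```

This sketch does not cover a row being moved to a different user_id; a real setup would also need an update trigger for that case.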
Table Types
Use the correct table type for what you are doing or attempting to do. I suggest setting up a matrix of what you need your table to do as well as your database as a whole. This first step is often neglected by many developers.
| Feature | Option |
|---|---|
| Read vs. Write > 15% | Yes/No |
| Transactions | Yes/No |
| Foreign Key Support | Yes/No |
| Full-Text Indexes | Yes/No |
The list above is by no means a full list, but you should document what you need and what you are using the database for. For example, if you are reading and writing on the same database table, MyISAM is likely a bad idea if either reads or writes make up more than 15% of the workload. MyISAM tables lock easily: a long read followed by an insert results in a table lock, which does not allow any further reads until that insert has completed. So make a list and figure out what needs to be there. This will certainly help you in planning your database.
SQL Query Optimizations
Optimizing your SQL to perform is not rocket science; however, this seems to be the one area where applications start crumbling once they become popular. As features are developed and requirements change over time, the effect on the database is rarely taken into account.
A simple rule to follow is to design your queries around your existing database architecture and rules; when that cannot be achieved, refactor and adjust the database so it can handle the new situations.
The Simple Rules
- Use your explain/execution plan
- The fewer joins the better
- Ensure you are utilizing your indexes (see first bullet)
- Temporary tables can be good when doing operations on complex data sets
- Stay away from derived tables and non-materialized views (see above bullet)
- Roll up data that can be aggregated
- Select the columns you need, not SELECT *
If you are looking for ways to better optimize your queries, again, please go to the MySQL Manual on Query Optimization.
Exit(0);
You may have noticed that this blog post has taken me quite a while to push out. Between being in the middle of purchasing a home, some resource constraints at work, a couple of side projects and maintaining my life, I just didn’t have much time to finish writing a more complete post.
To go a little further, I’ve cut down the contents of this post, as you may already be sick of reading and the amount of information that could have been written here could easily fill a book if you wanted to get into each and every aspect. I figured for my sanity as well as yours I should cut it shorter. If you have any information to add, please submit comments.
PHP Performance Series: Caching Techniques
February 27, 2008
Welcome to the first edition of the PHP performance series, a new series in which I will be explaining ways to gain efficiencies and squeeze more performance out of your applications. This first edition, caching techniques, focuses on ways to cache data to optimize your current sites. Some of the concepts here are fairly easy to implement while others may take strategic design in the architecture of your application. Whether you are working on a high profile web application or simply a web development farm these concepts apply to the masses.
Opcode Caching
Opcode caching is likely one of the simplest and most effective ways of increasing performance in PHP. Without one, PHP compiles each script into opcodes on every request before executing it. An opcode cache eliminates this repeated work by storing the compiled opcodes in memory so that files do not have to be compiled each time.
There are many opcode caches available for consumption. You have APC, XCache, eAccelerator and Zend Platform. Choose the one you like best; they all have advantages and disadvantages, which are out of the scope of this article.
File Priming
This is typically more relevant to larger scale companies that have release processes. When you are pushing out a new release, you typically do not want your caching system to wait until each page is hit before it is placed in the opcode cache. Instead, you can run a utility script after the release is pushed out that runs each file through the opcode caching extension’s compile function. There is an example of this on my performance overview post, which has a section about file priming for APC. Each of the different opcode caches typically has a way to prime the files, so just look into the API documents.
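A priming script along these lines could look like the following sketch. `prime_files` is a hypothetical helper and the release path is a placeholder; the compile function is passed in as a callback so that the same walker could be used with `apc_compile_file` or whichever compile function your opcode cache provides.

```php
<?php
/**
 * Walk a directory tree and hand every .php file to $compile.
 * Returns the sorted list of files that were passed to the callback.
 * (prime_files is a hypothetical helper, not part of APC.)
 */
function prime_files($directory, $compile)
{
    $primed = array();
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($directory, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        // Only regular files ending in .php are handed to the compiler.
        if ($file->isFile() && strtolower($file->getExtension()) === 'php') {
            call_user_func($compile, $file->getPathname());
            $primed[] = $file->getPathname();
        }
    }
    sort($primed);
    return $primed;
}

// After a release, with APC loaded (the path is a placeholder):
// prime_files('/path/to/release', 'apc_compile_file');
```

One thing to keep in mind is that the script most likely needs to run inside the web server rather than from the command line, since a command line process typically has its own separate cache.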
Caching Variables
Many opcode caches also allow for you to place variable data, also known as user land data, into the cache (typically in memory). This is useful for storing your configuration values or data that is expensive to get and will likely not change.
Example: APC Variables

    if (($config = apc_fetch('config')) === false) {
        // config.php is expected to define $config
        require('/path/to/includes/config.php');
        apc_store('config', $config);
    }
A practical example of this was using the Zend Framework and simply running an ab bench after storing the results of the XML configuration file in the cache. This saved parsing time as well as extremely quick access to the configuration file.
Figure: APC Variables in Use
The Code

    if (($conf = apc_fetch('pbs_config')) === false) {
        $conf = new Zend_Config_Xml(PB_PATH_CONF . '/base.xml', 'production');
        apc_store('pbs_config', $conf);
    }
The Benchmark Command
ab -t30 -c5 http://www.example.com/
Results Without The APC Variable
Concurrency Level: 5
Time taken for tests: 30.33144 seconds
Complete requests: 684
Failed requests: 0
Write errors: 0
Results With The APC Variable
Concurrency Level: 5
Time taken for tests: 30.12173 seconds
Complete requests: 709
Failed requests: 0
Write errors: 0
As you can see, we had approximately a 3-4% gain in performance (709 completed requests versus 684) by simply caching our configuration file. There are many other items that could be placed into these areas of memory, thus increasing your overall performance. Find a few of these and you will certainly see increases in the number of requests handled. Note that the server being tested on is an older box and the application includes a large number of files from the Zend Framework.
Make sure to check the documentation of each opcode cache to confirm what you can store in the variable scope (some will not support automatically serializing objects, so be careful). Further, ensure you have enough memory allocated to do this. Lastly, I did not include examples for the other opcode caches here; I simply wanted to show what common usage would be like.
File Caching
Many times there are areas where the server is processing the same page of content that has not changed. There are always opportunities to cache this type of content, whether in part or in full. I’ll attempt to address both areas here from a simplistic point of view, rather than discussing techniques of generating static content that could be utilized by running a static web server.
For the sake of time and being practical with pre-existing tools, I will be showing the examples in the Pear::Cache_Lite package.
Full File Caching
Full file caching is rather hard to achieve on many different sites when we are pulling data for different reasons and sometimes from different sources. However, while that may be true, there are certainly cases where you do not need to have the “most” up to date data available at that very second. Even a 5-10 minute delay on extremely high traffic sites will award you a performance increase. It is always good to ensure that you are checking your site for these types of areas and creating an easy way to allow for future modification.
While you always have to come at caching from different angles, this is quite possibly the quickest way to add it in, and it is certainly not flawless. The following example simply takes a snapshot of the page and stores it for use again. This is not a complete approach but may be good enough for certain users.
I do not recommend this as a long term solution, but if you need something short term and this meets your needs, implement it if you like. Sooner or later you will see the drawbacks of this method, such as no content ever being dynamic, or certain pieces of content needing to be updated sooner than others.
The Bootstrap Cache Example:

    require('/path/to/pear/Cache/Lite/Output.php');
    $options = array(
        'cacheDir' => '/tmp/',
        'lifeTime' => 10
    );
    $cache = new Cache_Lite_Output($options);
    // start() prints the cached page and returns true on a hit
    if (!($cache->start($_SERVER['REQUEST_URI']))) {
        require('/path/to/bootstrap.php');
        $cache->end();
    }
The .htaccess Cache Example:
.htaccess
php_value auto_prepend_file /path/to/cache_start.php
php_value auto_append_file /path/to/cache_end.php
cache_start.php

    require('Cache/Lite/Output.php');
    $options = array(
        'cacheDir' => '/tmp/',
        'lifeTime' => 10
    );
    $cache = new Cache_Lite_Output($options);
    // on a cache hit the page has already been printed, so stop here
    if (($cache->start($_SERVER['REQUEST_URI']))) {
        exit;
    }
cache_end.php
$cache->end();
Cache Lite does a lot of the heavy work for you such as file locking, deciding on how to save the content through the parameter given (here we are just using the REQUEST URI). You may need to take in consideration the $_POST variables, $_COOKIE variables or even the $_SESSION variables depending on what you are attempting to achieve.
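If the page output depends on more than the URI, the extra inputs need to become part of the cache id. The following is a minimal sketch of one way to do that; `build_cache_id` and the `lang` cookie are hypothetical names used only for illustration.

```php
<?php
/**
 * Build a cache id from the request URI plus whichever cookie values
 * change the rendered page. $varyCookies is a hypothetical whitelist,
 * so unrelated cookies do not fragment the cache.
 */
function build_cache_id($requestUri, array $cookies, array $varyCookies)
{
    $parts = array($requestUri);
    foreach ($varyCookies as $name) {
        // Missing cookies are recorded as an empty value so the id stays stable.
        $parts[] = $name . '=' . (isset($cookies[$name]) ? $cookies[$name] : '');
    }
    return md5(implode('|', $parts));
}

// In cache_start.php this could replace the bare REQUEST_URI:
// $id = build_cache_id($_SERVER['REQUEST_URI'], $_COOKIE, array('lang'));
// if ($cache->start($id)) { exit; }
```

Using a whitelist rather than the whole $_COOKIE array is a design choice: it keeps things like tracking cookies from giving every visitor a separate cache entry. The same approach could be extended to selected $_POST or $_SESSION values.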
Partial File Caching
Partial file caching is typically the route that you will likely see the most benefits overall. You likely have quite a bit of content that does not need to be real-time, however, you would like it to be updated once in a while. Or secondly, you have specific portions of the site that simply do not need to be updated at all. This is where the partial caching comes in and really allows you to see quite a bit of performance gains across the board.
Caching Contents Of A String

    require('Cache/Lite.php');
    $options = array(
        'cacheDir' => '/tmp/',
        'lifeTime' => 3600 // 1 hour
    );
    $cache = new Cache_Lite($options);
    if (($categories = $cache->get('categories')) === false) {
        $rs = mysql_query('SELECT category_id, category_name FROM category');
        $categories = '<ul class="category">';
        while ($row = mysql_fetch_assoc($rs)) {
            // escape the name so stored text cannot break the markup
            $categories .= '<li><a href="category.php?id=' . (int) $row['category_id'] . '">'
                . htmlspecialchars($row['category_name']) . '</a></li>';
        }
        $categories .= '</ul>';
        $cache->save($categories, 'categories');
    }
    echo $categories;
While this is a highly simplistic example, it shows the flexibility to store contents. You could even store an array instead in order to cycle through it at a later time.
Caching An Array Of Results

    require('Cache/Lite.php');
    $options = array(
        'cacheDir' => '/tmp/',
        'lifeTime' => 3600, // 1 hour
        'automaticSerialization' => true
    );
    $cache = new Cache_Lite($options);
    if (($categories = $cache->get('categories')) === false) {
        $rs = mysql_query('SELECT category_id, category_name FROM category');
        $categories = array();
        while ($row = mysql_fetch_assoc($rs)) {
            $categories[] = $row;
        }
        $cache->save($categories, 'categories');
    }
    var_dump($categories);
As you can see, you can store different types of data through the cache. However, with file caching I would be reluctant to store database data as there are better solutions for that type of role which I will be talking about shortly.
Memory Caching
There are a few different ways to produce caches in memory, including memcached, database memory tables, a RAM disk, and the opcode cache’s variable caching from the beginning of this article. It is best to keep in memory the things that are utilized most often and have a small footprint.
Memcached
From the memcached website:
memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
Essentially what this is saying is that the cache can live on a central server with many servers accessing it. It is not tied to your web server the way an opcode cache is, as it runs its own daemon, and it is typically utilized for caching database results. That doesn’t mean there aren’t additional things it is good for, such as session handling: this is already integrated if you set session.save_handler to “memcache” and change your session.save_path to point at the server running memcached.
Memcache Example

    $post_id = (int) $_GET['post_id'];
    $memcached = new Memcache;
    $memcached->connect('hostname', 11211);
    if (($row = $memcached->get('post_id_' . $post_id)) === false) {
        // yes this is safe, we type casted it already ;)
        $rs = mysql_query('SELECT * FROM post WHERE post_id = ' . $post_id);
        if ($rs && mysql_num_rows($rs) > 0) {
            $row = mysql_fetch_assoc($rs);
            // cache compressed for 1 hour
            $memcached->set('post_id_' . $post_id, $row, MEMCACHE_COMPRESSED, time() + 3600);
        }
    }
    var_dump($row);
This is a fairly typical example of memcached. We stored a single item in memory for future usage that might be accessed quite a bit. I recommend using this for the records that are accessed the most; that’s what a cache is all about.
Memcache Session Example

    session.save_handler = memcache
    session.save_path = "tcp://hostname:11211"
As you can see session handling is quite easy. For multiple memcached servers comma separate the save_path value with each server.
Database Memory Tables
Database memory tables, while I am not going to give an example, can be useful for session data. You can easily create a table with the storage engine of memory using MySQL. Create your own session handler and provide the data that way. This is a quick way to boost performance on sessions as well as keeping them distributed between multiple web servers. Personally, if you can, I would go the memcached route to keep the load off of the database server and let it work on serving other requests.
RAM Disk
While utilizing your RAM as a disk is not distributed it can easily be a quick adjustment to make your site perform faster. However, you might want to note the amount of memory you are going to be utilizing and ensure that on reboot that this directory is put back on the RAM Disk. Remember that information placed in RAM is lost on reboot or power failure.
Bind RAM to a Directory

    mount -t tmpfs tmpfs /path/to/site/tmp
I attempt to avoid this route as I believe that the risk outweighs the gains, unless you are dealing with massive servers. But there are better tools such as memcached that I would trust more.
exit(0);
I hope that this was informative to some of you regarding caching techniques in PHP. I didn’t fully cover all of the potential caching techniques such as database caching that the RDBMS’s do and some of the other items such as Squid. I may cover more of these at a later time, if I attempt to get into it all now this post will never see the light of day. If you have anything to add send in a comment. Please note, I do not deploy these tactics on everything and anything but decide on certain logistics when and where these need to be implemented. Take into consideration the scale of the project, current overall impact and if you are just optimizing it just for the sake of doing it.
Over Engineering Software
February 17, 2008
Many times as developers, we tend to take our projects and over engineer them since we foresee most of the features that we may want in the future, even though there is no purpose for them quite yet. This is quite a hindrance to ever getting the software we develop out the door, as we can continue refactoring, extending and making our software more complex for what the end-users of a system might need in the future, or even for edge cases that will never affect 90% of the population.
I have come to the distinct conclusion that when we start to over engineer a project, the chances of it ever getting out the door get slimmer by the second. I am not saying that it is bad to look at what is going to come in the future; however, we need to first develop for the need at hand and extend later, without coding ourselves into a box. This means ensuring that your data model, code structure and business logic fall in line in a way where, if things change, you aren’t constantly having to modify mass amounts of code. Remember, OOP (Object Oriented Programming) is your friend.
Let’s take for instance authentication, authorization and privileges. Presently, your software needs a single login; no groups or specific privileges are needed. You know that you will need these in the future, but why develop them now? Simply create it in a way that is expandable in the future and move on. Keep it on a future to-do list and cross that bridge when it is actually essential. This might not be the best example, as many large applications that we develop may need this out of the box. But if we look specifically at the development time, it would be simple to create the architecture behind it now and implement it later.
A practice that I will sometimes use in my code, if I know I am going to be building a feature out in the future, is to build the class definitions and have their methods return true. Say we are going to implement groups in the future, but at the current moment they are not needed. Going through each section of code in the future to add the checks would take more time than putting the calls in at the beginning. With the class definition in place and the method returning true, you can simply call a method to check if the group or user has access to the page (which at this point will always return true), and when you implement the business logic later there is no need for a large scale change. Yet again, as I stated prior, authorization and authentication may not be an especially great example here.
You will have likely taken slightly more time but in the end saved a mass amount of time by doing this. You certainly will likely miss some areas with that base check but everything should be golden at this stage and you saved yourself hours of writing the business logic. This certainly doesn’t apply to every feature you think of because that would be madness and that would completely negate the purpose of this post.
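The stub approach described above might look something like this minimal sketch. The class name `My_Acl`, the method `isAllowed` and the resource string are all hypothetical; the point is only that callers already go through the check before any real rules exist.

```php
<?php
/**
 * Placeholder access check. Class and method names are hypothetical.
 */
class My_Acl
{
    /**
     * Today every user may see every page, so this always returns true.
     * Group and privilege rules can be added here later without touching callers.
     */
    public function isAllowed($userId, $resource)
    {
        return true;
    }
}

// Every page already asks the question, even though the answer is fixed for now.
$acl = new My_Acl();
if ($acl->isAllowed(42, 'reports/index')) {
    echo "access granted\n";
}
```

When groups are eventually needed, only the body of `isAllowed` has to change; the call sites stay as they are.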
A simple way to gauge if you should develop now or later: do all of these apply?
- Is this critical to my users’ success?
- Is the application crippled (unusable) without this feature?
- Is the cost to develop it now 50-75% less than it would be in the future?
- Is there a business need to support this feature?
I believe these are all critical questions to ask yourself when you are developing, otherwise, you may end up with 20 unfinished projects that will never see the light of the day because the enthusiasm of starting the project has diminished and since there is nothing out the door since it is unfinished there is no community to help build your enthusiasm about the project. At this state you are burnt out, bored and that project may never see the light of day again regardless of how great of an idea it was in the first place or what solutions it may have solved.



