Speed up your data migration with Spork

One of the blogs I like to keep an eye on is Kris Wallsmith’s personal blog. He is a Symfony2 contributor and the author of Assetic and Buzz. Last year he wrote about a new experimental project called Spork: a wrapper around pcntl_fork that abstracts away the complexities of spawning child processes with PHP. The article was very interesting, although I didn’t have a valid use case to try the library out. That was, until today.

It so happens we were preparing a rather large data migration for an application with approximately 17,000 users. The legacy application stored the passwords in an unsafe way – plaintext – so we had to encrypt ’em all during the migration. Our weapon of choice was bcrypt, and the BlowfishPasswordEncoderBundle made implementing it easy. Using bcrypt did introduce a new problem: encoding all these records would take a lot of time! That’s where Spork comes in!
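(For reference, encoding a single password is plain Symfony2 security component API once the bundle has configured the bcrypt encoder. A rough sketch, where $container, $user and $plainPassword are placeholders rather than my actual migration code:)

```php
// Rough sketch: encode one legacy plaintext password with the encoder that
// BlowfishPasswordEncoderBundle configured for the User class.
$encoder = $container->get('security.encoder_factory')->getEncoder($user);
$user->setPassword($encoder->encodePassword($plainPassword, $user->getSalt()));
```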

Setting up the Symfony2 migration Command

If possible I wanted to fork between 8 and 15 processes to gain maximum speed. We’ll run the command on a VPS with 8 virtual cores, so I want to stress the machine as much as possible ;). Unfortunately the example on GitHub, as well as the one on his blog, didn’t function any more, so I had to dig in a little bit. I came up with this to get the forking working:
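(What follows is a minimal sketch, not my exact code: the bundle namespace, entity and command names are assumptions, and depending on which Spork version you have installed you may need to pass an event dispatcher to the ProcessManager constructor.)

```php
<?php

namespace Acme\MigrationBundle\Command;

use Spork\ProcessManager;
use Spork\Batch\Strategy\ChunkStrategy;
use Symfony\Bundle\FrameworkBundle\Command\ContainerAwareCommand;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

class EncryptPasswordsCommand extends ContainerAwareCommand
{
    protected function configure()
    {
        $this->setName('acme:migration:encrypt-passwords');
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $container      = $this->getContainer();
        $em             = $container->get('doctrine')->getManager();
        $encoderFactory = $container->get('security.encoder_factory');

        // All legacy users whose password is still stored in plaintext
        $users = $em->getRepository('AcmeMigrationBundle:User')->findAll();

        $manager = new ProcessManager();

        // The ChunkStrategy splits the list into 8 chunks and forks one child
        // per chunk; the callback then runs for every user inside that child.
        $manager->process($users, function ($user) use ($em, $encoderFactory, $output) {
            $encoder = $encoderFactory->getEncoder($user);
            $user->setPassword($encoder->encodePassword($user->getPassword(), $user->getSalt()));

            $em->persist($user);
            $em->flush();

            $output->writeln(sprintf('[%d] encoded password for %s', getmypid(), $user->getUsername()));
        }, new ChunkStrategy(8));

        // Block until all children have finished
        $manager->wait();
    }
}
```

Note that flushing inside the fork like this is exactly what triggers the MySQL gotcha described further down, so read on before copying it.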

The command generates the following output:
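(With the writeln call from the sketch above it would look roughly like this – PIDs and usernames are made up – with the eight forks all writing to the same terminal:)

```
[4201] encoded password for jdewit
[4202] encoded password for pjansen
[4203] encoded password for mvries
[4201] encoded password for tbakker
...
```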

Make it a little bit more dynamic

To be really useful I’ve added some parameters so we can control the behavior a little more. As I mentioned before, I wanted to control the number of forks, so I added an option for this. This value needs to be passed on to the constructor of the ChunkStrategy:
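(A sketch of what that could look like; the option name "forks" and its default of 8 are my own choices, and $callback stands for the closure from the first sketch:)

```php
use Symfony\Component\Console\Input\InputOption;

// In configure(): expose the number of child processes as an option
$this->addOption('forks', null, InputOption::VALUE_REQUIRED, 'Number of child processes to fork', 8);

// In execute(): pass the value straight into the ChunkStrategy
$forks = (int) $input->getOption('forks');

$manager->process($users, $callback, new ChunkStrategy($forks));
```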

I also added a max parameter so we can run some tests on a small set of users instead of the whole database. When set, I pass it on to the setMaxResults method of the $query object.
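Sketched out, under the same assumptions as above (the "max" option name and the query builder alias are mine):

```php
// In configure():
$this->addOption('max', null, InputOption::VALUE_REQUIRED, 'Only migrate this many users');

// In execute(): build a query instead of calling findAll()
$query = $em->getRepository('AcmeMigrationBundle:User')
    ->createQueryBuilder('u')
    ->getQuery();

if (null !== $max = $input->getOption('max')) {
    $query->setMaxResults((int) $max);
}

$users = $query->getResult();
```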

Storing the results in MySQL: Beware!

In Symfony2 projects, storing and reading data from the database is pretty straightforward using Doctrine2. However, when you start forking your PHP process, keep the following in mind:

  1. all the forks share the same database connection;
  2. when the first fork exits, it will also close the database connection;
  3. database operations in the forks that are still running will fail with: General error: 2006 MySQL server has gone away

This is a known problem. To fix it, I create and close a new connection in each fork:
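(A sketch of one way to do this inside the fork callback; Connection::close() and Connection::connect() are real Doctrine DBAL API, the rest is illustrative:)

```php
$manager->process($users, function ($user) use ($em, $encoderFactory) {
    $connection = $em->getConnection();

    // Drop the connection inherited from the parent process and open a
    // fresh one that belongs to this fork only.
    $connection->close();
    $connection->connect();

    $encoder = $encoderFactory->getEncoder($user);
    $user->setPassword($encoder->encodePassword($user->getPassword(), $user->getSalt()));

    $em->persist($user);
    $em->flush();

    // Close our own connection again before handing control back.
    $connection->close();
}, new ChunkStrategy($forks));
```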

That’s basically it. Running this command on a VPS comparable to a c1.xlarge Amazon EC2 instance sped things up a lot. So if you’re also working on an import job like this, one that can be split up into separate tasks, you know what to do… Give Spork a try! It’s really easy, I promise.

UPDATE 2013-03-19
As Kris pointed out in the comments, you should close the connection just before forking. Here’s an example of how to do this:
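(A minimal sketch of that approach, using the same variable names as the earlier sketches: close the shared connection in the parent right before the fork, so each child lazily opens its own connection on first use.)

```php
// Close the parent's connection before forking; every child will then open
// its own connection the first time it talks to the database.
$em->getConnection()->close();

$manager->process($users, $callback, new ChunkStrategy($forks));
$manager->wait();
```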