Advanced Laravel migrations using temporary tables

If you've ever run a big migration in Laravel, you might know there are a lot of things that can go wrong. Processes run out of memory, migrations crash halfway through leaving some data unmigrated, rollbacks go wrong, etc… Let's see how we can avoid all of the above when creating a somewhat more advanced migration.

Today I came across an interesting problem where I needed to save the hashed version of an order's ID to the orders table. This new hash_id is generated based on the order's ID and a fixed secret stored in the application's config. In short, this means the new hash_id value can't be calculated by an SQL query. Instead, we'd have to calculate each hash locally and then send it to the database.
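
The actual implementation of the hash doesn't matter much for this post, but as an illustration, such a helper might look roughly like this, with the secret stored under a hypothetical hashid.secret config key:

class HashId
{
    public static function generate(int $id): string
    {
        // Hash the ID together with a fixed application secret, so the
        // value can only be computed in PHP, never inside an SQL query.
        return hash_hmac('sha256', (string) $id, config('hashid.secret'));
    }
}

The naive solution would be the following: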

public function up()
{
    Schema::table('orders', function (Blueprint $table) {
        $table->string('hash_id')->index();
    });

    Order::query()
        ->get()
        ->each(function (Order $order) {
            $order->update(['hash_id' => HashId::generate($order->id)]);
        });
}

This would probably work for a couple of orders, maybe even thousands. But I wouldn't want to run this migration on a production database with many thousands of orders, for all the aforementioned reasons.
There are some alright solutions, like using lazy collections (cursor()), retrieving the orders in chunks (chunk()), or a combination of both. Those methods might help you execute medium to large migrations (maybe 10-20K rows), but in our case previous deploys taught us that these often don't scale as well as we'd hope.
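
For example, a chunked version of the naive migration could look roughly like this. It keeps memory usage in check, but still fires one UPDATE query per order:

Order::query()
    ->select(['id'])
    ->chunkById(1000, function ($orders) {
        foreach ($orders as $order) {
            // Memory stays flat, but this is still one UPDATE per row.
            $order->update(['hash_id' => HashId::generate($order->id)]);
        }
    });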

To cut down on migration duration, memory usage and risk of failing with a half-migrated database, I decided to keep looking for a better solution.

For a start, it's always a good idea to perform the bulk of the migration in database land. This saves us from having to fetch the data, migrate it locally using PHP, and push the updates back to the database. Additionally, keeping the number of queries to a minimum is crucial for both time and memory usage.

Keeping that in mind, the ideal solution would be some kind of single SQL statement that could update each row based on an id => hash_id mapping that we provide to it:

UPDATE orders 
SET hash_id = (
  SELECT hash_id 
  FROM ([id: 1, hash_id: 'abc123'], [id: 2, hash_id: 'xyz789']) 
  WHERE id = orders.id
)

Unfortunately, I made that syntax up. We can't just create a temporary array of data in an SQL query to select from… or can we?

Enter temporary tables

As the name suggests, temporary tables are just like regular tables, except they're automatically removed once the database connection closes. In our case, they'll be removed after running all migrations. Using these temporary tables, we can locally prepare our data mapping from id to hash_id, then insert that mapping in as few insert queries as possible, and finally use that temporary table to update the orders in a single query. In SQL this would look something like this:

CREATE TEMPORARY TABLE temp_mappings(
    id INT PRIMARY KEY,
    hash_id VARCHAR(255)
);

INSERT INTO temp_mappings (id, hash_id) VALUES (1, 'abc123'), (2, 'xyz789'), (3, 'foobar');

UPDATE orders 
    INNER JOIN temp_mappings ON temp_mappings.id = orders.id 
    SET orders.hash_id = temp_mappings.hash_id;

Notice how working with temporary tables is basically identical to working with normal tables in the database.

For Laravel migrations, temporary table support comes out of the box using Schema::create() and $table->temporary(). The above SQL example then becomes the following migration:

public function up()
{
    Schema::table('orders', function (Blueprint $table) {
        $table->string('hash_id')->index();
    });

    Schema::create('temp_mappings', function (Blueprint $table) {
        $table->temporary(); // thanks, Laravel
        $table->integer('id')->primary();
        $table->string('hash_id');
    });

    Order::query()->select('id') // we only need IDs to calculate the hash
        ->cursor() // use lazy collections
        ->map(function (Order $order) {
            $hashId = HashId::generate($order->id);
          
            /* Prepare the value string for the SQL statement */
            return "({$order->id}, '{$hashId}')";
        })
        ->chunk(10000) // insert mappings in 10K chunks
        ->each(function (LazyCollection $chunk) {
            $values = $chunk->implode(',');

            DB::insert("INSERT INTO temp_mappings (id, hash_id) VALUES {$values}");
        });
  
    Order::query()
        ->join('temp_mappings', 'temp_mappings.id', 'orders.id')
        ->update(['hash_id' => DB::raw('temp_mappings.hash_id')]);
}
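
For completeness, a matching down() only needs to drop the new column; the temporary table disappears on its own once the connection is closed. A minimal sketch:

public function down()
{
    Schema::table('orders', function (Blueprint $table) {
        // Dropping the column also removes its single-column index on MySQL;
        // temp_mappings needs no cleanup, it's gone when the connection closes.
        $table->dropColumn('hash_id');
    });
}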

As you can see, up() still uses chunks and lazy collections. However, we're only selecting the id for each order, keeping memory usage small. To further improve this, we could use the DB facade to fetch the order IDs without hydrating full model instances (see the sketch below).
Notice that we're also not executing an update statement per order, but instead aggregating the update data in as few insert statements as possible.
Finally, the actual migration statement becomes a cross-table update, joining the data from the mappings table onto the orders. This single statement will either run successfully and migrate all the data, or fail and migrate none, allowing you to easily recover from any issues.
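
The query builder variant mentioned above could look roughly like this, skipping Eloquent model hydration for the ID fetching while leaving the rest of the migration unchanged:

DB::table('orders')
    ->select('id')
    ->cursor() // plain stdClass rows instead of Order models
    ->map(fn ($order) => sprintf("(%d, '%s')", $order->id, HashId::generate($order->id)))
    ->chunk(10000)
    ->each(function (LazyCollection $chunk) {
        DB::insert('INSERT INTO temp_mappings (id, hash_id) VALUES '.$chunk->implode(','));
    });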

In hindsight

After writing this post I thought of an easier way to handle the relatively basic example above: the INSERT ... ON DUPLICATE KEY UPDATE syntax. Using this syntax we can insert multiple rows that will be either created or updated based on the existence of a (primary) key. This works a lot like Laravel's updateOrCreate().

In our example we could insert the id to hash_id mappings directly into the orders table using ON DUPLICATE KEY UPDATE. All the inserted id keys already exist, so all rows would be updated instead of inserted, even though it's an INSERT statement:

INSERT INTO orders (id, hash_id)
    VALUES 
        (1, 'abc123'),
        (2, 'xyz789'),
        (3, 'foobar')
    ON DUPLICATE KEY UPDATE hash_id = VALUES(hash_id);
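
In Laravel, this maps onto the query builder's upsert() method (available in Laravel 8 and later). A sketch of the same idea, still feeding the mappings in chunks:

Order::query()->select('id')
    ->cursor()
    ->chunk(10000)
    ->each(function (LazyCollection $orders) {
        $mappings = $orders
            ->map(fn (Order $order) => ['id' => $order->id, 'hash_id' => HashId::generate($order->id)])
            ->all();

        // Generates INSERT ... ON DUPLICATE KEY UPDATE on MySQL;
        // existing ids are updated, nothing new gets inserted.
        DB::table('orders')->upsert($mappings, ['id'], ['hash_id']);
    });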

However, this method is still susceptible to failing halfway through the migration when using chunks at scale. It's also less flexible, as it relies on matching keys in the database.

Key takeaways

  • Temporary tables exist and are useful for migrations (or filtering down a large table);
  • use chunks and lazy collections wisely: they're powerful but not a one-line replacement for looping over a million models;
  • keeping the number of queries low in migrations (or anywhere else) is always a good idea;
  • make your migrations fail with either all data migrated or no data migrated. Failing halfway through is a pain to deal with;
  • make the database server do the bulk of the migrations' work instead of PHP;
  • INSERT ... ON DUPLICATE KEY UPDATE exists and can be used to batch update rows based on simple constraints;
  • and finally: think about your database model before you start coding. If you're going to need a calculated hash of your IDs, take care of it before you have 50K+ rows.