Proposal: Modernise the Drupal update framework

The current Drupal update framework (roughly: hook_update_N) is quite old and still firmly rooted in the procedural origins of Drupal. It also has several shortcomings. Some of those were addressed by the new(ish) hook_post_update mechanism, but that does not replace the old system; it only augments it for update code that needs the full Drupal framework to be available.

This blogpost is a repost of an issue I created in the Drupal Ideas issue queue. The proposal is inspired by Laravel migrations. These work by defining a class in a predetermined location, and the classes are then simply run “in order”. What “in order” means is something we’ll look into later.

Problems with the current system

Sequence numbering

The current system, based around hook_update_N, checks the “schema version” of a particular module and then checks whether there are any update hooks with a sequence number higher than that version. It then executes those, in numerical order, for that particular module. The order in relation to other modules (an update for one module might depend on an update from another module having run) can be influenced with hook_update_dependencies().
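To make the selection step concrete, here is a minimal, self-contained sketch of the idea. It is a simplified illustration and not the actual core implementation; the function example_pending_updates() and the module name mymodule are made up for this example.

```php
<?php

/**
 * Returns the pending update numbers for a module, in execution order.
 *
 * Simplified illustration only; this is not how core is actually written.
 */
function example_pending_updates(string $module, int $schema_version, array $function_names): array {
  $pending = [];
  foreach ($function_names as $name) {
    // Match e.g. "mymodule_update_8003" and capture the sequence number.
    if (preg_match('/^' . preg_quote($module, '/') . '_update_(\d+)$/', $name, $matches)) {
      $number = (int) $matches[1];
      // Only numbers above the recorded schema version are considered.
      if ($number > $schema_version) {
        $pending[] = $number;
      }
    }
  }
  sort($pending, SORT_NUMERIC);
  return $pending;
}

$functions = [
  'mymodule_update_8003',
  'mymodule_update_8001',
  'mymodule_update_8002',
  'other_update_8001',
  'mymodule_update_dependencies',
];

// With schema version 8001 recorded, 8002 and 8003 are pending.
print implode(', ', example_pending_updates('mymodule', 8001, $functions)) . PHP_EOL;
// prints: 8002, 8003
```

Note what happens when the recorded schema version is already 8003: nothing is pending, even if one of those functions came from a patch and never ran on this site. That is the "chasing the schema version" problem described in the next paragraph.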

It is a clever mechanism, but it does mean that if the solution to an issue needs to add an update hook, that patch/PR ends up chasing the current schema version. Sometimes, a patch is fine, except for the sequence number of its update hook. Running a project with a patch containing an update hook means you need to be constantly on guard for whether an update to the module introduces a new update hook of its own.

Also, the actual numbering of update hooks is often inconsistent. There are guidelines as to what exactly this number should be. However, only the first digit has any technical bearing, so the guidelines for the rest of the number are regularly overlooked, even by popular contrib modules. Moreover, the technical meaning of that first digit is that it corresponds to the major core version the module is compatible with, which has itself become a troublesome concept since major core versions became backwards compatible.

Inconsistent descriptions

A problem similar to the one causing inconsistent sequence numbers applies to the docblock of an update hook: it should contain only a description of the update, which diverges from the regular coding guidelines for function docblocks. The documentation for hook_update_N not being very explicit about that probably doesn't help: "The documentation block preceding this function is stripped of newlines and used as the description for the update on the pending updates task list, which users will see when they run the update.php script." This, too, is often missed, causing e.g. "Implements ..." phrasing to creep into the description of update hooks. Other maintainers prefer to add the update number to the description. That is actually not a bad idea, but it is not part of the convention, so it makes for an inconsistent overview of the update hooks to run.

Procedural code in an OO world

With Drupal moving to object-oriented code more and more, it is time for the update framework to do the same. OO has many benefits, the discussion of which really is not in the scope of this proposal. It would allow update code to move away from getting essential parts of the Drupal API by calling the global \Drupal class and instead use proper dependency injection.

The proposal

As said, this proposal is inspired by Laravel Migrations. They serve a very similar purpose to Drupal update hooks, that is, applying changes to the database to adapt it to updated code. One area where we might need to step up is automatically generating code; Laravel migrations get a timestamp automatically when generating them using Laravel’s standard artisan command line tool. Migrations are executed, by default, in the order of their timestamp. For Drupal, this could be a plugin-type named Update (Migration would be too confusing, since we already have a Migration framework that deals with content migrations).

Of course, in Laravel there is only the one application, with a single set of Migrations. (The Migrations also define the database schema, which means that the schema is not defined in a fixed location but is rather the result of the accumulated Migrations. That is not a design choice I would make, and it should probably be considered out of scope here.) In the general Drupal case, modules will need to be able to define their own Updates, and we will therefore also need a system like hook_update_dependencies to declare dependencies on the updates of other modules. Although it might seem attractive to open up this dependency system to updates from the module itself, this could easily lead to a confusing update order and nightmarish dependency resolution. If a dependency arises on another Update from the same module, the Update should probably just be re-ordered to come after whatever Update it depends on, to keep the order transparent.

"How is this different from the current system with a sequential number?", you might ask. The key is that the number is not recorded anywhere; a unique identifier of the update is recorded instead (as with hook_post_update). This means that an update does not need to change unless there is an actual, technical reason to do so. Incidentally, using the timestamps to come up with a global default order could also help with implicit dependencies, where a dependency exists but was not made explicit in code. Not a huge advantage, but nice nonetheless.
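As a rough sketch of what "record the identifier, not the number" could mean in practice, the snippet below decides what is pending by comparing identifiers against a list of already executed updates, and only uses the timestamp in the filename for ordering. Everything here is an assumption made for illustration: the function name example_pending_by_id(), the "module:id" key format and the mymodule entries are not part of any existing API.

```php
<?php

/**
 * Returns identifiers of updates that have not run yet, oldest file first.
 *
 * @param string[] $files
 *   Hypothetical update filenames, keyed by a "module:id" identifier.
 * @param string[] $executed
 *   Identifiers of the updates that were already executed.
 */
function example_pending_by_id(array $files, array $executed): array {
  // An update is pending when its identifier was never recorded.
  $pending = array_diff_key($files, array_flip($executed));
  // Order by filename (timestamp prefix); the identifier breaks ties so the
  // result is the same on every run.
  uksort($pending, function (string $a, string $b) use ($pending): int {
    return strcmp($pending[$a], $pending[$b]) ?: strcmp($a, $b);
  });
  return array_keys($pending);
}

$files = [
  'system:convert_path_aliases' => '20200824_convert_path_aliases.php',
  'system:enable_path_alias' => '20200823_enable_path_alias.php',
  'mymodule:add_index' => '20200901_add_index.php',
];
$executed = ['system:enable_path_alias'];

$pending = example_pending_by_id($files, $executed);
// $pending is ['system:convert_path_aliases', 'mymodule:add_index'].

// An update added later by a patch, with an older timestamp, is still picked
// up, because nothing compares it against a "current" sequence number.
$files['mymodule:patched'] = '20200701_patched.php';
$pending_with_patch = example_pending_by_id($files, $executed);
```

The second call shows the point of the proposal: the patched update does not need renumbering to be noticed.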

Requirements (MoSCoW)

Let’s define some requirements we want a new update framework to adhere to.

  1. M: Determine whether an update ran based on an Update machine name (as with hook_post_update) (i.e. not a single sequence number).
  2. M: Updates (hook_update_N) and post-updates (hook_post_update) use the same mechanism. They should live in the same namespace (Drupal\module_name\Updates?) and use e.g. an annotation to distinguish the type (i.e. ‘database’ for the old hook_update_N and ‘content’ for hook_post_update; a clarification in terminology seems in order here).
  3. M: Sequence of execution given any one set of Updates - both within a single module, but also across modules - needs to be deterministic, even if no explicit dependencies were defined.
  4. M: It is possible to define dependencies on Updates from other modules using an annotation.
  5. M: Dependency injection is supported. Maybe the possible dependencies are restricted based on update type (see 2).
  6. C: A way to indicate an Update may run while the system is online (i.e. not in maintenance mode). Content updates can take a very long time, while it often isn’t really essential for the site to be offline while they run. This could also be something that could be manipulated using an alter mechanism, where site developers can opt in to online processing if they deem it responsible for their particular project. This is probably only safe or even possible for post_update-type (‘content’) updates. This would probably require the creation of a new addition to the system that uses a cron queue to process update jobs.
  7. C: Take into account blue/green deployments. I am not sure how realistic this is, or even how much connection there really is to this issue. It might need a separate ideas issue. The idea behind blue/green deployments is that database and code from one deployment to the next are backwards and forwards compatible, allowing multiple server nodes to be updated out of sync, allowing the site to remain accessible at all times. This seems like a tough challenge with Drupal, where any number of modules might make arbitrary changes to the database.
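Requirements 3 and 4 together boil down to a dependency-aware sort with a deterministic fallback. The following is one possible sketch, assuming every Update exposes a timestamp and a list of "module:id" dependencies. The array structure, the function name example_order_updates() and the mymodule/othermodule entries are hypothetical, and the sketch assumes all dependencies are part of the set being ordered.

```php
<?php

/**
 * Orders updates so dependencies run first, falling back to timestamp order.
 *
 * @param array $updates
 *   Keyed by "module:id"; each value has a 'timestamp' string and a
 *   'depends' list of "module:id" keys. Hypothetical structure.
 */
function example_order_updates(array $updates): array {
  $ordered = [];
  $remaining = $updates;
  while ($remaining) {
    // Collect updates whose dependencies have all been scheduled already.
    $ready = [];
    foreach ($remaining as $id => $update) {
      if (!array_diff($update['depends'], $ordered)) {
        $ready[$id] = $update['timestamp'];
      }
    }
    if (!$ready) {
      throw new RuntimeException('Circular or missing dependency among: ' . implode(', ', array_keys($remaining)));
    }
    // Pick the oldest ready update; the identifier breaks timestamp ties.
    $ids = array_keys($ready);
    usort($ids, fn (string $a, string $b): int => strcmp($ready[$a], $ready[$b]) ?: strcmp($a, $b));
    $ordered[] = $ids[0];
    unset($remaining[$ids[0]]);
  }
  return $ordered;
}

$updates = [
  'mymodule:use_aliases' => ['timestamp' => '20200820', 'depends' => ['system:convert_path_aliases']],
  'system:convert_path_aliases' => ['timestamp' => '20200824', 'depends' => ['system:enable_path_alias']],
  'system:enable_path_alias' => ['timestamp' => '20200823', 'depends' => []],
  'othermodule:cleanup' => ['timestamp' => '20200821', 'depends' => []],
];

$order = example_order_updates($updates);
// othermodule:cleanup, system:enable_path_alias,
// system:convert_path_aliases, mymodule:use_aliases
```

Although mymodule:use_aliases has the oldest timestamp, it runs last because of its declared dependency; updates without dependencies simply fall back to timestamp order.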

Examples

Since speaking is silver and code is gold, let's see how this could look.

Simple example

First, a simple one, system_update_8803(). This update is used to enable the new-in-8.8 path_alias module, which turned path aliases into proper entities. This demonstrates the naming of the file, the naming of the class (no real restrictions, except that it’s probably a good idea to have the class name end in ‘Update’, by convention) and the basic annotation.

20200823_enable_path_alias.php:

<?php
use Drupal\Core\Updates\DatabaseUpdateBase;
use Symfony\Component\DependencyInjection\ContainerInterface;

/**
 * Class EnablePathAliasUpdate.
 *
 * Example update replicating system_update_8803().
 *
 * @Update(
 *   id = "enable_path_alias",
 *   label = @Translation("Install the 'path_alias' entity type."),
 *   type = "database",
 * )
 */
class EnablePathAliasUpdate extends DatabaseUpdateBase {

  /**
   * {@inheritdoc}
   */
  public function update() {
    // Enable the Path Alias module if needed.
    if (!$this->moduleHandler()->moduleExists('path_alias')) {
      $this->moduleInstaller()->install(['path_alias'], FALSE);

      return $this->t('The "path_alias" entity type has been installed.');
    }
  }

}

The class extends DatabaseUpdateBase. The idea is that every update would implement an UpdateInterface, which has several methods to support the update process, the most notable of which would be update(). DatabaseUpdateBase would be a base class that offers several conveniences, like some useful dependencies it gets injected by default. In the example we see moduleHandler and moduleInstaller used. There could well be a ContentUpdateBase too, possibly with different injected dependencies, to provide some guidance on what sort of update would be acceptable for each type.
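To illustrate how such an interface and base class might fit together, here is a minimal sketch that runs without Drupal. All names prefixed with Example are stand-ins invented for this illustration; in particular, ExampleModuleHandlerInterface is only a placeholder for the real module handler service, and the actual method set of UpdateInterface is still open for discussion.

```php
<?php

/** The contract every update would implement (sketch). */
interface ExampleUpdateInterface {

  /** Runs the update and optionally returns a message for the user. */
  public function update();

}

/** Stand-in for Drupal's module handler service, for this sketch only. */
interface ExampleModuleHandlerInterface {

  public function moduleExists(string $module): bool;

}

/** Base class that receives its dependencies through the constructor. */
abstract class ExampleDatabaseUpdateBase implements ExampleUpdateInterface {

  private ExampleModuleHandlerInterface $moduleHandler;

  public function __construct(ExampleModuleHandlerInterface $module_handler) {
    $this->moduleHandler = $module_handler;
  }

  protected function moduleHandler(): ExampleModuleHandlerInterface {
    return $this->moduleHandler;
  }

}

/** A concrete update that only uses what the base class provides. */
class ExampleEnablePathAliasUpdate extends ExampleDatabaseUpdateBase {

  public function update() {
    if (!$this->moduleHandler()->moduleExists('path_alias')) {
      return 'path_alias would be installed here.';
    }
    return NULL;
  }

}

// A fake module handler that only knows about the "system" module.
$handler = new class implements ExampleModuleHandlerInterface {

  public function moduleExists(string $module): bool {
    return $module === 'system';
  }

};

$update = new ExampleEnablePathAliasUpdate($handler);
$message = $update->update();
```

Because the dependency arrives through the constructor, the update can be exercised with a fake module handler, which is hard to do with code that calls the global \Drupal class directly.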

Batched example

Next, a more complicated example. An important aspect of many update hooks is batching. Batching is implemented in a slightly weird way, though. Most notably, it requires the use of a #finished key in the otherwise completely free-to-use “sandbox”. The example shows how these two concerns could be separated. It takes system_update_8804() as its subject, which follows the previous update and actually converts existing path aliases to the new entity type.

20200824_convert_path_aliases.php:

<?php

use Drupal\Core\Site\Settings;
use Drupal\Core\Updates\BatchedUpdateInterface;
use Drupal\Core\Updates\DatabaseUpdateBase;
use Drupal\path_alias\PathAliasStorage;
use Symfony\Component\DependencyInjection\ContainerInterface;

/**
 * Class ConvertPathAliasesUpdate.
 *
 * Example update replicating system_update_8804().
 *
 * @Update(
 *   id = "convert_path_aliases",
 *   label = @Translation("Convert path aliases to entities."),
 *   type = "database",
 *   depends = {
 *     "system:enable_path_alias",
 *   },
 * )
 */
class ConvertPathAliasesUpdate extends DatabaseUpdateBase implements BatchedUpdateInterface {

  /**
   * {@inheritdoc}
   */
  public function update() {
    // Bail out early if the entity type is not using the default storage class.
    $storage = $this->entityTypeManager()->getStorage('path_alias');
    if (!$storage instanceof PathAliasStorage) {
      $this->setCompletion(1);
      return;
    }

    $step_size = 200;
    $url_aliases = $this->database->select('url_alias', 't')
      ->condition('t.pid', $this->sandbox['current_id'], '>')
      ->fields('t')
      ->orderBy('pid', 'ASC')
      ->range(0, $step_size)
      ->execute()
      ->fetchAll();

    if ($url_aliases) {
      /** @var \Drupal\Component\Uuid\UuidInterface $uuid */
      $uuid = $this->uuid();

      $base_table_insert = $this->database->insert('path_alias');
      $base_table_insert->fields(['id', 'revision_id', 'uuid', 'path', 'alias', 'langcode', 'status']);
      $revision_table_insert = $this->database->insert('path_alias_revision');
      $revision_table_insert->fields(['id', 'revision_id', 'path', 'alias', 'langcode', 'status', 'revision_default']);
      foreach ($url_aliases as $url_alias) {
        $values = [
          'id' => $url_alias->pid,
          'revision_id' => $url_alias->pid,
          'uuid' => $uuid->generate(),
          'path' => $url_alias->source,
          'alias' => $url_alias->alias,
          'langcode' => $url_alias->langcode,
          'status' => 1,
        ];
        $base_table_insert->values($values);

        unset($values['uuid']);
        $values['revision_default'] = 1;
        $revision_table_insert->values($values);
      }
      $base_table_insert->execute();
      $revision_table_insert->execute();

      $this->sandbox['progress'] += count($url_aliases);
      $last_url_alias = end($url_aliases);
      $this->sandbox['current_id'] = $last_url_alias->pid;

      // If we're not in maintenance mode, the number of path aliases could change
      // at any time so make sure that we always use the latest record count.
      $missing = $this->database->select('url_alias', 't')
        ->condition('t.pid', $this->sandbox['current_id'], '>')
        ->orderBy('pid', 'ASC')
        ->countQuery()
        ->execute()
        ->fetchField();
      $this->setCompletion($missing ? $this->sandbox['progress'] / ($this->sandbox['progress'] + (int) $missing) : 1);
    }
    else {
      $this->setCompletion(1);
    }

    if ($this->isFinished()) {
      // Keep a backup of the old 'url_alias' table if requested.
      if (Settings::get('entity_update_backup', TRUE)) {
        $old_table_name = 'old_' . substr(uniqid(), 0, 6) . '_url_alias';
        if (!$this->database->schema()->tableExists($old_table_name)) {
          $this->database->schema()->renameTable('url_alias', $old_table_name);
        }
      }
      else {
        $this->database->schema()->dropTable('url_alias');
      }

      return $this->t('Path aliases have been converted to entities.');
    }
  }

  /**
   * {@inheritdoc}
   */
  public function initializeBatch() {
    $this->sandbox['progress'] = 0;
    $this->sandbox['current_id'] = 0;
  }

}

First, update hooks that use batching usually start out with some sort of check whether the hook is being run for the first time, based on whether values are set in the sandbox. This shows that with a plugin approach this could just be separated out into a separate method, initializeBatch(). The framework could ensure this is always called first (the base class could include a dummy implementation that does nothing).

Second, this example shows a few other default dependencies from the base class: entityTypeManager and database. It also shows that the base class has a sandbox member variable, which functions much like the sandbox parameter passed to update hooks. Again, the update function is not much bothered with it; instead, the UpdateInterface would have a getSandbox() method (implemented by the base class) that gets called by the framework. The returned sandbox would then be handled like it is in the procedural update hooks: it gets serialized and stored in the batch table, and on the next run it is retrieved, unserialized and passed into the plugin using a setSandbox() method.

Similar to startup clauses, finishing the batch is also handled in methods, allowing it to be largely standardized in the base class. The update() method may call setCompletion to let the outside world know how it is progressing (the value works the same as it would when passing it in the #finished key on an old-style sandbox; a float between 0 and 1, or 1 to indicate the process has finished).

Instead of having to check the value directly, the UpdateInterface also offers a method isFinished(). Nothing fancy, but it allows for some nice self-documenting code (the “real” update hook checked for $sandbox['#finished'] >= 1 before removing the old table and optionally creating a backup of it). The framework would call isFinished() as well, in order to determine whether the update has finished, and if not, call a method getCompletion() to get the actual completion number.
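The interplay between initializeBatch(), the sandbox and isFinished() can be sketched with a small runner. This is only an illustration of the flow described above, under the assumption that each pass may happen in a separate request; the class and function names starting with Example or example_ are made up, and the real framework would store the serialized sandbox in the batch table rather than in a local variable.

```php
<?php

/** Sketch of a base class handling sandbox and completion bookkeeping. */
abstract class ExampleBatchedUpdateBase {

  protected array $sandbox = [];

  private float $completion = 0.0;

  /** Default implementation does nothing; updates may override it. */
  public function initializeBatch(): void {}

  abstract public function update();

  protected function setCompletion(float $completion): void {
    // Keep the value between 0 and 1, like the old #finished key.
    $this->completion = max(0.0, min(1.0, $completion));
  }

  public function getCompletion(): float {
    return $this->completion;
  }

  public function isFinished(): bool {
    return $this->completion >= 1.0;
  }

  public function getSandbox(): array {
    return $this->sandbox;
  }

  public function setSandbox(array $sandbox): void {
    $this->sandbox = $sandbox;
  }

}

/** Processes five imaginary items, two per pass. */
class ExampleCountingUpdate extends ExampleBatchedUpdateBase {

  public function initializeBatch(): void {
    $this->sandbox['progress'] = 0;
  }

  public function update() {
    $this->sandbox['progress'] = min(5, $this->sandbox['progress'] + 2);
    $this->setCompletion($this->sandbox['progress'] / 5);
    if ($this->isFinished()) {
      return 'Processed 5 items.';
    }
  }

}

/** Runs a batched update until it reports that it has finished. */
function example_run_batched(string $class): array {
  $plugin = new $class();
  $plugin->initializeBatch();
  $stored = serialize($plugin->getSandbox());
  $passes = 0;
  do {
    // A new instance per pass mimics each pass being a separate request.
    $plugin = new $class();
    $plugin->setSandbox(unserialize($stored));
    $message = $plugin->update();
    $stored = serialize($plugin->getSandbox());
    $passes++;
  } while (!$plugin->isFinished());
  return ['passes' => $passes, 'message' => $message];
}

$result = example_run_batched(ExampleCountingUpdate::class);
// Three passes are needed: progress goes 2, 4, 5.
```

The update itself never touches a #finished key or checks whether it is being run for the first time; both concerns live in the base class and the runner.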

How to proceed

Please feel free to discuss this proposal, or completely ignore it, although I am very interested in hearing what people think. This is my first ideas proposal, so please be gentle. Let's see if we can modernise yet another part of Drupal!

Comments on this blogpost have been disabled. Please discuss this proposal in the issue in the Drupal Ideas issue queue.