Drupal7: Migrating and Tidying invalid HTML

David G - Drupal

As I mentioned previously, I’m migrating a legacy database of presidential documents for all the U.S. presidents (and some other people, events and times in history). In total there are ~110,000 pieces of content that I’m migrating into Drupal nodes. These are literally documents released by the White House and other government entities, generally as plain text.

The legacy web developer added simple HTML markup to the released documents to format them for display on a web page. An example legacy document is:

<p><i> Gentlemen of the Senate and of the House of
Representatives:</i> <p>In pursuance of the authority vested in
the President of the United States by an act of Congress passed
the 3d of March last, to reduce the weights of the copper coin of
the United States whenever he should think it for the benefit of
the United States, provided that the reduction should not
exceed 2 pennyweights in each cent, and in the like proportion
in a half cent, I have caused the same to be reduced since the
27th of last December, to wit, 1 pennyweight and 16 grains in each
cent, and in the like proportion in a half cent; and I have given
notice thereof by proclamation.<p>By the letter of the judges of the
circuit court of the United States, held at Boston in June last, and
the inclosed application of the under-keeper of the jail at that place,
of which copies are herewith transmitted, Congress will perceive
the necessity of making a suitable provision for the maintenance
of prisoners committed to the jails of the several States under the
authority of the United States.<p>GO. WASHINGTON.

As you can see this is … relatively horrible. Historically (say 10 years back) browsers didn’t care whether your HTML was valid or malformed. They closed missing tags for you and made assumptions about what you meant. We’ve since learned that those assumptions shouldn’t be made, and now we have HTML5. YAY! But I have ~110,000 documents with really bad HTML, so how can I fix it while migrating my content?

Well, one of the first steps of any migration process is acquiring the source data. Once we have a variable holding our source data, it is nice to be able to validate, alter or tweak it before we process it into our destination system (Drupal).

The Migrate module provides a commonly implemented API method called prepareRow() that lets me interact with the source data and modify it before Drupal ingests it into a field on a Drupal content type.

So for my Document content type in Drupal I implement this method and call a function to clean up my malformed HTML:

// In file document.inc a drupal Migrate module "migration class file".

  public function prepareRow($row) {
    // Always include this fragment at the beginning of every prepareRow()
    // implementation, so parent classes can ignore rows.
    if (parent::prepareRow($row) === FALSE) {
      return FALSE;
    }

    // Do this work once for full body and summary. Rather than multiple
    // times using callbacks().
    $row->content = PrezMigration::html5_tidy($row->content);

    $row->derived_ux_categories = $this->findNewUXTaxonomyCategories($row->doctype);

    return TRUE;
  }

In the above code $row->content is the PHP value of the legacy database column (“content”, a MySQL text column) that Drupal will use for the Body field of a node. I clean up the malformed HTML by passing it through PrezMigration::html5_tidy($row->content). YAY! Simple!

You’ll note a small comment about callbacks(). Instead of using prepareRow() to prep data, the Migrate module also supports inline callback functions when defining source-to-destination field mappings. But in the final Drupal configuration both the Body and Trimmed Body fields will make use of this valid HTML. By using prepareRow() and altering $row->content in this fashion, I do the work once per document and not twice via callbacks (once for the body instance and once for the trimmed body instance). This saves time in my migration by avoiding duplicate work.
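
To make the saving concrete, here is a minimal self-contained sketch. It is not Migrate module code: expensive_clean() is a hypothetical stand-in for html5_tidy() that counts its own calls, and the returned arrays stand in for the Body and Trimmed Body destinations.

```php
<?php
// Hypothetical stand-in for PrezMigration::html5_tidy(); it only trims the
// input, and increments $calls so we can see how often it runs.
function expensive_clean($html, &$calls) {
  $calls++;
  return trim($html);
}

// Per-field approach: each destination field cleans the source value itself.
function map_with_callbacks($content, &$calls) {
  return array(
    'body' => expensive_clean($content, $calls),
    'summary' => expensive_clean($content, $calls),
  );
}

// prepareRow()-style approach: clean the source value once, then reuse it.
function map_with_prepare_row($content, &$calls) {
  $clean = expensive_clean($content, $calls);
  return array('body' => $clean, 'summary' => $clean);
}

$calls_callbacks = 0;
$calls_prepare = 0;
$a = map_with_callbacks(" <p>GO. WASHINGTON. ", $calls_callbacks);
$b = map_with_prepare_row(" <p>GO. WASHINGTON. ", $calls_prepare);
```

Both approaches produce the same field values; the difference is only in how many times the cleaning function is invoked per row, which adds up when each call launches an external process.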

So what is the function PrezMigration::html5_tidy()? Well, just to be fancy (and organized!) my migration class is derived from a base migration class for my project. This base class has some general utility methods, tooling and instrumentation to make MY job easier. Here is my base migration class in almost its entirety:

<?php


/**
 * @file
 *
 * A custom migration base class to store functions useable across migrations.
 */
class PrezMigration extends Migration {

  public function handleException($exception, $save = TRUE) {
    parent::handleException($exception, $save);
    $crashdump = array(
      'exception' => $exception,
      'record' => $this->sourceValues,
    );

    $time = time();
    $record = array(
      'crashdump' => serialize($crashdump),
      'tstamp' => $time,
    );
    drupal_write_record('migrate_prez_crashdump', $record);
  }


//
// Custom utility functions beyond this point.
//

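  /**
   * Pipes HTML through the tidy-html5 command line program.
   *
   * @param string $content Legacy (possibly malformed) HTML.
   * @return string Cleaned HTML, or an empty string if Tidy could not be run.
   */
  public static function html5_tidy($content) {
    $root = realpath(dirname(__FILE__) . str_repeat('/..', 8));
    $config_file = $root . '/src/tidy5.config';
    $cmd = $root . '/tidy-html5/build/cmake/tidy5';
    $error_log = $root . '/src/logs/tidy.errors.log';
    // Escape the paths so spaces or shell metacharacters cannot break the command.
    $exec = escapeshellarg($cmd) . ' -config ' . escapeshellarg($config_file);
    $clean_html = '';

    // stdin and stdout are pipes; stderr is appended to a log file.
    $descriptorspec = array(
      0 => array('pipe', 'r'),
      1 => array('pipe', 'w'),
      2 => array('file', $error_log, 'a'),
    );

    //   echo "calling: $exec";
    //   echo "content is: " . strlen($content) . " characters.";
    $process = proc_open($exec, $descriptorspec, $pipes, NULL, array());

    if (is_resource($process)) {
      fwrite($pipes[0], $content . PHP_EOL);
      // Close stdin so Tidy sees end-of-input and starts writing its output.
      fclose($pipes[0]);
      $clean_html = stream_get_contents($pipes[1]);
      //     echo "Tidy returned: \n" . $clean_html . "\n\n";
      // Close the remaining pipe before proc_close() to avoid a deadlock.
      fclose($pipes[1]);
      proc_close($process);
    }
    return $clean_html;
  }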




  /**
   * Truncates text starting from the end.
   *
   * Cuts a string to the length of $length and replaces the first characters
   * with the ellipsis if the text is longer than length.
   *
   * ### Options:
   *
   * - `ellipsis` Will be used as Beginning and prepended to the trimmed string
   * - `exact` If false, $text will not be cut mid-word
   *
   * @param string $text String to truncate.
   * @param int $length Length of returned string, including ellipsis.
   * @param array $options An array of options.
   * @return string Trimmed string.
   */
  public static function tail($text, $length = 100, array $options = []) {
    // omitted for brevity. This is tail() from the CakePHP framework.
  }

  /**
   * Truncates text.
   *
   * Cuts a string to the length of $length and replaces the last characters
   * with the ellipsis if the text is longer than length.
   *
   * ### Options:
   *
   * - `ellipsis` Will be used as ending and appended to the trimmed string
   * - `exact` If false, $text will not be cut mid-word
   * - `html` If true, HTML tags would be handled correctly
   *
   * @param string $text String to truncate.
   * @param int $length Length of returned string, including ellipsis.
   * @param array $options An array of HTML attributes and options.
   * @return string Trimmed string.
   * @link http://book.cakephp.org/3.0/en/core-libraries/string.html#truncating-text
   */
  public static function truncate($text, $length = 100, array $options = []) {
    // ... omitted for brevity. See the CakePHP framework for source.
  }
}

So my method html5_tidy() uses proc_open() to pipe the legacy HTML content to a command line program called Tidy released by the W3C. I specifically found and used tidy-html5, a version of Tidy that supports HTML5. It cleans up the malformed HTML (unclosed paragraph tags) and, with logical-emphasis enabled, replaces the legacy <i> tags with <em> tags. It even reformats the HTML to include visible indentation levels!
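
The piping pattern itself can be tried without Tidy installed. The following is a minimal sketch, assuming a POSIX system where the cat command is available; pipe_through() is a hypothetical helper name used only for illustration, not part of the Migrate module or of the class above. Unlike html5_tidy(), it also returns the exit code so the caller can decide what to do when the command fails.

```php
<?php
/**
 * Sends $input to a command's stdin and returns its stdout and exit code.
 *
 * Hypothetical helper for illustration only.
 */
function pipe_through($command, $input) {
  $descriptorspec = array(
    0 => array('pipe', 'r'),  // The child process reads its stdin from us.
    1 => array('pipe', 'w'),  // We read the child's stdout.
    // Descriptor 2 is left out, so stderr is inherited from the parent.
  );
  $process = proc_open($command, $descriptorspec, $pipes);
  if (!is_resource($process)) {
    return array('output' => '', 'exit' => -1);
  }
  fwrite($pipes[0], $input);
  // Closing stdin signals end-of-input to the child.
  fclose($pipes[0]);
  $output = stream_get_contents($pipes[1]);
  fclose($pipes[1]);
  // proc_close() waits for the child and returns its exit code.
  $exit = proc_close($process);
  return array('output' => $output, 'exit' => $exit);
}

// "cat" simply echoes its input, standing in for the real tidy5 binary.
$result = pipe_through('cat', "<p>GO. WASHINGTON.");
```

Writing all the input before reading any output works here because the child consumes its whole stdin first; a command that streams large output while still reading input would need non-blocking reads instead.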

If you’re curious, here is my tidy5.config file:

join-classes: no
logical-emphasis: yes
drop-empty-elements: no
anchor-as-name: no
doctype: auto
drop-empty-paras: no
fix-uri: no
literal-attributes: yes
merge-divs: no
merge-spans: no
numeric-entities: no
preserve-entities: yes
quote-ampersand: no
quote-marks: no
show-body-only: yes
indent: auto
indent-spaces: 2
tab-size: 2
wrap: 0
wrap-asp: no
wrap-jste: no
wrap-php: no
wrap-sections: no
tidy-mark: no
new-blocklevel-tags: article,aside,command,canvas,dialog,details,figcaption,figure,footer,header,hgroup,menu,nav,section,summary,meter
new-inline-tags: video,audio,canvas,ruby,rt,rp,time,meter,progress,datalist,keygen,mark,output,source,wbr

// Change these only if you need to debug a problem with Tidy
force-output: yes
quiet: yes
show-warnings: no

The end result of all this processing is the following HTML fragment, which is valid HTML5:

<p><em>Gentlemen of the Senate and of the House of Representatives:</em></p>
<p>In pursuance of the authority vested in the President of the United States by an act of Congress passed the 3d of March last, to reduce the weights of the copper coin of the United States whenever he should think it for the benefit of the United States, provided that the reduction should not exceed 2 pennyweights in each cent, and in the like proportion in a half cent, I have caused the same to be reduced since the 27th of last December, to wit, 1 pennyweight and 16 grains in each cent, and in the like proportion in a half cent; and I have given notice thereof by proclamation.</p>
<p>By the letter of the judges of the circuit court of the United States, held at Boston in June last, and the inclosed application of the under-keeper of the jail at that place, of which copies are herewith transmitted, Congress will perceive the necessity of making a suitable provision for the maintenance of prisoners committed to the jails of the several States under the authority of the United States.</p>
<p>GO. WASHINGTON.</p>

Notes: By default tidy-html5 outputs a full, valid HTML document. With show-body-only set to yes, tidy-html5 returns only the content found inside the generated <body> tag of the cleaned-up source. This is what we want, as Drupal will be providing the rest of the HTML page itself!
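
With ~110,000 documents it can help to spot-check the results automatically. Here is a rough sketch; count_unclosed() is a hypothetical helper, and counting tags with a regular expression is only a heuristic, not real HTML validation. The sample strings are shortened versions of the document shown above.

```php
<?php
/**
 * Returns how many more opening than closing tags of a given name appear.
 *
 * Hypothetical helper for illustration; a result of 0 suggests the tags
 * are paired, a positive number suggests unclosed tags.
 */
function count_unclosed($html, $tag) {
  $quoted = preg_quote($tag, '/');
  // Opening tags such as "<p>" or "<p class='x'>". The \b keeps "<p" from
  // matching longer names like "<pre>".
  $open = preg_match_all('/<' . $quoted . '\b[^>]*>/i', $html, $m);
  // Closing tags such as "</p>".
  $close = preg_match_all('/<\/' . $quoted . '\s*>/i', $html, $m);
  return $open - $close;
}

$legacy = '<p><i> Gentlemen of the Senate:</i> <p>In pursuance of the authority<p>GO. WASHINGTON.';
$tidied = '<p><em>Gentlemen of the Senate:</em></p><p>GO. WASHINGTON.</p>';

$legacy_unclosed = count_unclosed($legacy, 'p');
$tidied_unclosed = count_unclosed($tidied, 'p');
```

A check like this could run over a sample of migrated nodes and flag any document where the count is not zero for closer inspection.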


Author Spotlight

David Gurba

I am a web programmer currently employed at UCSB. I have been developing web applications professionally for 8+ years now. For the last 5 years I’ve been actively developing websites primarily in PHP using Drupal. I have experience using LAMP and developing data driven websites for clients in aviation, higher education and e-commerce. If you’d like to contact me I can be reached at david.gurba@arvixe.com
