
Drupal: Merging Taxonomy Terms with Massive Duplicates


I have a database that has been used for research purposes. Unfortunately, during this research, an algorithm was left running for too long and inadvertently created duplicate taxonomy terms instead of reusing the TID of the first instance of each term.

To correct this, I tried the "term_merge" and "taxonomy_manager" modules. "term_merge" offers an interface for removing duplicates and can limit how many terms it loads at a time, in order to avoid exhausting the server's memory limit. With my use case, however, I cannot even load the configuration screen at /admin/structure/taxonomy/[My-Vocabulary]/merge, much less the duplicates interface at /admin/structure/taxonomy/[My-Vocabulary]/merge/duplicates. Both pages exhaust the memory limit, even with that limit set to 1024M.

To get around this, I've written a custom module which calls the term_merge function found in the term_merge module. As only one node bundle in this project uses the taxonomy vocabulary in question, I could safely write my own logic to merge duplicate terms without the functions provided by the term_merge module. I would prefer to use them, however, since they are designed for this purpose and should, in theory, make the process safer.

My module provides a page callback as well as logic to procure a list of TIDs which refer to a duplicated taxonomy term. Here is the code which contains the call to the term_merge function:

//Use first element, with lowest TID value, as the 'trunk'
// which all other terms will be merged into

$trunk = $tids[0];

//Remove first element from branch array, to ensure the trunk 
//is not being merged into itself

array_shift($tids);

//Set the merge settings array, similarly to the default values 
//which are given in _term_merge_batch_process of term_merge.batch.inc

$merge_settings = array(
  'term_branch_keep' => FALSE,
  'merge_fields' => array(),
  'keep_only_unique' => TRUE,
  'redirect' => -1,
  'synonyms' => array(),
);

term_merge($tids, $trunk, $merge_settings);

This call does not merge any terms, and it produces no errors or notices in Watchdog or the web server logs.
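
For context, the $tids array used above could be assembled roughly as follows. This is a minimal sketch under stated assumptions, not the original module's code: the function names example_load_term_rows and example_group_duplicates are hypothetical, duplicates are assumed to be terms in the same vocabulary with exactly the same name, and only tid and name are read from the taxonomy_term_data table so that no full term entities have to be loaded.

```php
<?php

/**
 * Groups term rows by name and splits each duplicate group into a trunk
 * (lowest tid) and its branches (all remaining tids).
 *
 * Hypothetical helper. $rows is a list of
 * array('tid' => int, 'name' => string).
 */
function example_group_duplicates(array $rows) {
  $by_name = array();
  foreach ($rows as $row) {
    $by_name[$row['name']][] = (int) $row['tid'];
  }

  $groups = array();
  foreach ($by_name as $name => $tids) {
    // A name that occurs only once has nothing to merge.
    if (count($tids) < 2) {
      continue;
    }
    sort($tids, SORT_NUMERIC);
    $trunk = array_shift($tids);
    $groups[$name] = array('trunk' => $trunk, 'branches' => $tids);
  }
  return $groups;
}

/**
 * Loads tid/name pairs for one vocabulary with the Drupal 7 database API.
 *
 * Hypothetical helper; it only works inside a bootstrapped Drupal 7 site.
 */
function example_load_term_rows($vid) {
  $rows = array();
  $result = db_select('taxonomy_term_data', 't')
    ->fields('t', array('tid', 'name'))
    ->condition('t.vid', $vid)
    ->orderBy('t.tid')
    ->execute();
  foreach ($result as $record) {
    $rows[] = array('tid' => $record->tid, 'name' => $record->name);
  }
  return $rows;
}
```

Each group's 'trunk' and 'branches' values then correspond to $trunk and $tids in the snippet above. Matching on the exact name is a simplification; terms that differ only in case or whitespace would not be grouped together.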

I have also tried calling term_merge for each individual duplicate TID to be merged, rather than using an array of TIDs as a whole.

I would appreciate any input on how best to use the term_merge functions programmatically, or an alternative that will allow me to remove many duplicate terms from a large database where some terms have thousands of duplicates.

For reference, here are the comments which provide information about the parameters taken in term_merge, found in term_merge.module of the contributed term_merge module:

/**
 * Merge terms one into another using batch API.
 *
 * @param array $term_branch
 *   A single term tid or an array of term tids to be merged, aka term branches
 * @param int $term_trunk
 *   The tid of the term to merge term branches into, aka term trunk
 * @param array $merge_settings
 *   Array of settings that control how merging should happen.     Currently
 *   supported settings are:
 *     - term_branch_keep: (bool) Whether the term branches should not be
 *       deleted, also known as "merge only occurrences" option
 *     - merge_fields: (array) Array of field names whose values should be
 *       merged into the values of corresponding fields of term trunk (until
 *       each field's cardinality limit is reached)
 *     - keep_only_unique: (bool) Whether after merging within one field only
 *       unique taxonomy term references should be kept in other entities. If
 *       before merging your entity had 2 values in its taxonomy term reference
 *       field and one was pointing to term branch while another was pointing to
 *       term trunk, after merging you will end up having your entity
 *       referencing to the same term trunk twice. If you pass TRUE in this
 *       parameter, only a single reference will be stored in your entity after
 *       merging
 *     - redirect: (int) HTTP code for redirect from $term_branch to
 *       $term_trunk, 0 stands for the default redirect defined in Redirect
 *       module. Use constant TERM_MERGE_NO_REDIRECT to denote not creating any
 *       HTTP redirect. Note: this parameter requires Redirect module enabled,
 *       otherwise it will be disregarded
 *     - synonyms: (array) Array of field names of trunk term into which branch
 *       terms should be added as synonyms (until each field's cardinality limit
 *       is reached). Note: this parameter requires Synonyms module enabled,
 *       otherwise it will be disregarded
 *     - step: (int) How many term branches to merge per script run in batch. If
 *       you are hitting time or memory limits, decrease this parameter
 */

Solution

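  • It would seem that term_merge was developed with the intent that it would be called from a form submission handler, where the Form API calls batch_process() automatically after the handler returns. My custom module calls it from a page callback instead, so batch_process() is never called and the batch that term_merge sets up never runs.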

    Explicitly calling the following after term_merge() solves this:

    batch_process();

    No arguments need to be passed to the function.
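
Putting the pieces together, the fix might look like the sketch below. It assumes that term_merge() only queues its work through the Batch API and leaves it to the caller to start the batch. The names example_merge_settings and example_merge_duplicates are hypothetical, and the step value of 20 is an arbitrary illustration, not a recommendation from the module.

```php
<?php

/**
 * Builds the settings array passed to term_merge().
 *
 * Hypothetical helper. The keys mirror the ones used in the question, plus
 * the documented 'step' key to limit how many branches are merged per run.
 */
function example_merge_settings($step = 20) {
  return array(
    'term_branch_keep' => FALSE,
    'merge_fields' => array(),
    'keep_only_unique' => TRUE,
    // Use the constant named in the term_merge docblock when the module is
    // loaded; otherwise fall back to the -1 used in the question.
    'redirect' => defined('TERM_MERGE_NO_REDIRECT') ? TERM_MERGE_NO_REDIRECT : -1,
    'synonyms' => array(),
    'step' => $step,
  );
}

/**
 * Merges a list of duplicate tids into the lowest one.
 *
 * Hypothetical helper, meant to be called from a page callback inside a
 * bootstrapped Drupal 7 site with term_merge enabled.
 */
function example_merge_duplicates(array $tids) {
  sort($tids, SORT_NUMERIC);
  // The lowest tid becomes the trunk; the rest are branches.
  $trunk = array_shift($tids);
  if (empty($tids)) {
    return;
  }

  // Assumed to queue the merge as a batch without running it.
  term_merge($tids, $trunk, example_merge_settings());

  // Outside a form submit handler the batch has to be started by hand.
  // Called from a page request, this redirects the browser to the batch
  // progress page, so code placed after it is not reached.
  batch_process();
}
```

batch_process() also accepts an optional redirect path as its first argument, which can be used to send the user somewhere specific once the batch has finished.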