algorithm nlp lemmatization

How to turn plural words singular?


I'm preparing some table names for an ORM, and I want to turn plural table names into singular entity names. My only problem is finding an algorithm that does it reliably. Here's what I'm doing right now:

  1. If a word ends with -ies, I replace the ending with -y.
  2. If a word ends with -es, I remove this ending. This doesn't always work, however: for example, it turns Types into Typ.
  3. Otherwise, I just remove the trailing -s.

Does anyone know of a better algorithm?


Solution

  • Those are all general rules (and good ones) but English is not a language for the faint of heart :-).

    My own preference would be to have a transformation engine along with a set of transformations (surprisingly enough) for doing the actual work. You would run through the transformations (from specific to general) and, when a match was found, apply the transformation to the word and stop.

    Regular expressions would be an ideal approach to this due to their expressiveness. An example rule set:

     1. If the word is "fish", return "fish".
     2. If the word is "sheep", return "sheep".
     3. If the word is "radii", return "radius".
     4. If the word ends in "ii", replace that "ii" with "us" (octopii, virii).
     5. If the word ends with -ies, replace the ending with -y.
     6. If the word ends with -es, remove it.
     7. Otherwise, just remove any trailing -s.
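
As a rough illustration, the rule set above could be sketched in Python as an ordered list of (pattern, replacement) pairs, tried from specific to general. The names RULES and singularize are made up for this sketch, and it assumes the words are already lower-case:

```python
import re

# Ordered from most specific to most general; the first matching rule wins.
# Assumption: input words are lower-case (e.g. normalized table names).
RULES = [
    (re.compile(r"^fish$"), "fish"),     # rule 1: irregular, unchanged
    (re.compile(r"^sheep$"), "sheep"),   # rule 2: irregular, unchanged
    (re.compile(r"^radii$"), "radius"),  # rule 3: specific exception
    (re.compile(r"ii$"), "us"),          # rule 4: trailing "ii" -> "us"
    (re.compile(r"ies$"), "y"),          # rule 5: trailing "ies" -> "y"
    (re.compile(r"es$"), ""),            # rule 6: drop trailing "es"
    (re.compile(r"s$"), ""),             # rule 7: drop trailing "s"
]


def singularize(word, rules=RULES):
    """Apply the first rule whose pattern matches, then stop."""
    for pattern, replacement in rules:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    # No rule matched: return the word unchanged.
    return word


print(singularize("radii"))      # radius
print(singularize("companies"))  # company
print(singularize("types"))      # typ  (rule 6 is too greedy here)
```

Because the loop returns on the first match, the order of the list is the whole design: a specific exception only works if it sits above the general rule that would otherwise swallow it.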
    

    Note the requirement to keep this transformation set up to date. For example, let's say someone adds the table name types. This would currently be captured by rule #6 and you would get the singular value typ, which is obviously wrong.

    The solution is to insert a new rule somewhere before #6, something like:

     3.5: If the word is "types", return "type".
    

    for a very specific transformation, or perhaps somewhere later if it can be made more general.
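
A minimal sketch of that fix, assuming the rules live in an ordered Python list (the names rules and singularize are hypothetical, and only the three general rules are shown for brevity):

```python
import re

# General rules only, in order: -ies -> -y, drop -es, drop -s.
rules = [
    (r"ies$", "y"),
    (r"es$", ""),
    (r"s$", ""),
]


def singularize(word, rules):
    """Apply the first matching rule and stop."""
    for pattern, replacement in rules:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word


before = singularize("types", rules)  # caught by the general -es rule

# Insert the specific exception ahead of the general -es rule (index 1).
rules.insert(1, (r"^types$", "type"))

after = singularize("types", rules)

print(before)  # typ
print(after)   # type
```

The same insert placed after the -es rule would never fire, since the general rule would already have matched and returned.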

    In other words, you'll basically need to keep this transformation table updated as you find all those wondrous exceptions that English has spawned over the centuries.


    The other possibility is to not waste your time with general rules at all.

    Since the only current use case is singularizing table names, and that set of table names will be relatively tiny (at least compared to the set of plural English words), just create another table (or some sort of data structure) called singulars that maps each current plural table name (employees, customers) to its singular object name (employee, customer).

    Then every time a table is added to your schema, ensure you add an entry to the singulars "table" so you can singularize it.
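
One way this could look, sketched as a plain Python dictionary. SINGULARS and singular_for are illustrative names, and the two entries are just the examples mentioned above:

```python
# Explicit plural -> singular mapping; one entry per table in the schema.
SINGULARS = {
    "employees": "employee",
    "customers": "customer",
}


def singular_for(table_name):
    """Look up the singular name, failing loudly for unregistered tables."""
    try:
        return SINGULARS[table_name]
    except KeyError:
        raise KeyError(
            f"no singular registered for table {table_name!r}; "
            "add it to SINGULARS"
        ) from None


print(singular_for("employees"))  # employee
```

Raising on an unknown table, rather than guessing, means a newly added table without an entry is caught the first time it is looked up.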