php javascript

Count words like Microsoft Word does


I need to count words in a string using PHP or JavaScript (preferably PHP). The count needs to match the one Microsoft Word gives, because that is where people write their original texts, so it is their frame of reference. PHP has a word-counting function (http://php.net/manual/en/function.str-word-count.php), but as far as I know it does not give exactly the same results.

Any pointers?


Solution

  • The real problem here is that you're trying to develop a solution without really understanding the exact requirements. This isn't a coding problem so much as a problem with the specs.

    The crux of the issue is that your word-counting algorithm is different to Word's word-counting algorithm - potentially for good reason, since there are various edge-cases to consider with no obvious answers. Thus your question should really be "What algorithm does Word use to calculate word count?" And if you think about this for a bit, you already know the answer - it's closed-source, proprietary software so no-one can know for sure. And even if you do work it out, this isn't a public interface so it can easily be changed in the next version.

    Basically, I think it's fundamentally a bad idea to design your software so that it functions identically to something that you cannot fully understand. Personally, I would concentrate on just developing a sane word-count of your own, documenting the algorithm behind it and justifying why it's a reasonable method of counting words (pointing out that there is no One True Way).

    If you must conform to Word's attempt for some short-sighted business reason, then the number one task is to work out what methodology they use to the point where you can write down an algorithm on paper. But this won't be easy, will be very hard to verify completely and is liable to change without notice... :-/
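To make the "document your own algorithm" suggestion concrete, here is a minimal sketch in JavaScript. The rule it uses (a word is any maximal run of non-whitespace characters) is an assumption chosen for illustration, not a claim about what Word does internally. Both `countWords` and `findMismatches` are hypothetical names. The second function is a small helper for the verification problem described above: you record counts from Word by hand for a set of sample texts and list the ones where your rule disagrees.

```javascript
// Documented rule (an illustrative assumption, NOT Word's actual algorithm):
// a "word" is any maximal run of non-whitespace characters.
function countWords(text) {
  // \S+ matches each run of non-whitespace characters.
  // String.prototype.match returns null when nothing matches.
  const matches = String(text).match(/\S+/g);
  return matches === null ? 0 : matches.length;
}

// Hypothetical verification helper. `samples` is an array of
// { text, expected } objects, where `expected` is a count you read
// off Word's own counter by hand. Returns only the samples where
// the counter disagrees, with the count it actually produced.
function findMismatches(samples, counter = countWords) {
  return samples
    .map(({ text, expected }) => ({ text, expected, actual: counter(text) }))
    .filter(({ expected, actual }) => expected !== actual);
}
```

Under this rule a hyphenated term such as "well-known" is one word, and a standalone dash surrounded by spaces also counts as a word. Those are exactly the kind of edge-case decisions worth writing down. An empty result from `findMismatches` only shows agreement on the samples you tried, not equivalence with Word.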