regex perl n-gram

Perl paragraph n-gram


Let's say I have a sentence of text:

$body = 'the quick brown fox jumps over the lazy dog';

I want to turn that sentence into a hash of 'keywords', and I want to allow multi-word keywords. I have the following to get single-word keywords:

$words{$_}++ for $body =~ m/(\w+)/g;

After this is complete, I have a hash that looks like the following:

'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1

The next step, so that I can get 2-word keywords, is the following:

$words{$_}++ for $body =~ m/(\w+ \w+)/g;

But that only gets every other pair, because each match consumes both of its words and the next match starts after them:

'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1

I also need the pairs that start one word later:

'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1
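
The skipped pairs follow from how /g resumes after the end of the previous match, so matches never overlap. One way to collect every adjacent pair is a zero-width lookahead that captures without consuming any text. The following is only a sketch under that assumption; %pairs is an illustrative name, not part of the code above.

```perl
use strict;
use warnings;

my $body = 'the quick brown fox jumps over the lazy dog';
my %pairs;    # illustrative name, not from the original post

# \b anchors each attempt at a word boundary. The lookahead captures two
# words without consuming them, so the next attempt can start at the
# second word of the previous pair.
$pairs{$_}++ for $body =~ /\b(?=(\w+ \w+)\b)/g;

print "$_ => $pairs{$_}\n" for sort keys %pairs;
```

Like the original pattern, this assumes words are separated by exactly one space.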

Is there an easier way to do this than the following?

my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;

Solution

  • While the described task might be interesting to code by hand, wouldn't it be better to use an existing CPAN module that handles n-grams? Text::Ngrams (as opposed to Text::Ngram) appears to handle word-based n-gram analysis.
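
If adding a CPAN dependency is not an option, the hand-coded version can be shortened by splitting the text into words once and sliding a window over the list. This is a minimal sketch rather than a replacement for the module; ngram_counts and its $max_n parameter are hypothetical names introduced here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count every run of 1..$max_n consecutive words.
sub ngram_counts {
    my ($text, $max_n) = @_;
    my @w = $text =~ /(\w+)/g;    # same tokenization as the question
    my %count;
    for my $n (1 .. $max_n) {
        # Slide a window of $n words across the list, one word at a time.
        # The range is empty when $n exceeds the number of words.
        for my $i (0 .. @w - $n) {
            $count{ join ' ', @w[$i .. $i + $n - 1] }++;
        }
    }
    return \%count;
}

my $words = ngram_counts('the quick brown fox jumps over the lazy dog', 3);
print "$_ => $words->{$_}\n" for sort keys %$words;
```

One difference from the regex in the question: because the words are rejoined with a single space, this version also pairs words that were separated by punctuation or multiple spaces, which may or may not be wanted.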