I'm using Moses to build a language model.
I followed the instructions from this link: Baseline System: Moses
I have a Google 1-gram file that looks like this:
</S> 95119665584
<S> 95119665584
, 30578667846
. 22077031422
<UNK> 21594821357
the 19401194714
- 16337125274
of 12765289150
and 12522922536
That means that the word "of" appeared 12,765,289,150 times.
Now I want to build a language model from this file (the "Build Language Model" step),
but I don't know whether this file format will work with Moses.
The tutorial works with "europarl-v6.en", but I can't find that file on the web to check its format.
I need to represent each letter as a word, so "hello" becomes "h e l l o".
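To make the transformation concrete, here is a minimal Python sketch of the splitting step. The function name `split_entry` is made up for illustration. Leaving markers such as `<S>`, `</S>` and `<UNK>` unsplit is an assumption on my part, since they are not real words.

```python
import re

# Assumption: tokens wrapped in angle brackets (<S>, </S>, <UNK>) are
# markers rather than words, so they are kept whole instead of split.
SPECIAL = re.compile(r"^<[^<>\s]+>$")


def split_entry(line):
    """Turn a 'word count' line into (space-separated letters, count)."""
    # Split on the last run of whitespace so the count is always the final field.
    word, count = line.rsplit(None, 1)
    chars = word if SPECIAL.match(word) else " ".join(word)
    return chars, int(count)


if __name__ == "__main__":
    for raw in ["of 12765289150", "and 12522922536", "<UNK> 21594821357"]:
        chars, count = split_entry(raw)
        # Keeps the count next to the split word, tab-separated.
        print(f"{chars}\t{count}")
```

This only covers the splitting itself. Whether the count should stay on the line or the line should be repeated is exactly what I'm unsure about.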
After splitting each word this way, which format should I use?
Should it be:
o f
o f
o f
a n d
a n d
Or like the original format:
o f 12765289150
a n d 12522922536
Or maybe some other format?
Does it still count as a Google n-gram?
I followed the link "How can I use the Google Web N-gram corpus to build an LM", as @MukundKRoy suggested, but I don't know how to apply it in my case (1-gram, 2-gram... the n-gram order in my new file isn't constant).
I'd be glad if someone could tell me, as simply as possible, which format this file should be in to use it with SRILM. Thanks.
SRILM takes care of the 1-, 2-, 3-grams and so on by itself, so don't worry about that.
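Just to illustrate what that n-gram counting means for letter-split words (SRILM's ngram-count does this for you when it is given text), here is a rough Python sketch. The function name `char_ngram_counts` and the `max_order` parameter are made up for the example. It adds no word-boundary markers and weights every letter n-gram by the word's frequency, which is a simplification rather than what SRILM does internally.

```python
from collections import Counter


def char_ngram_counts(word_counts, max_order=3):
    """Count letter n-grams (orders 1..max_order), weighted by word frequency."""
    counts = Counter()
    for word, freq in word_counts.items():
        chars = list(word)
        for n in range(1, max_order + 1):
            # Slide a window of n letters across the word.
            for i in range(len(chars) - n + 1):
                counts[tuple(chars[i:i + n])] += freq
    return counts


if __name__ == "__main__":
    # Hypothetical input: two entries taken from the 1-gram file in the question.
    counts = char_ngram_counts({"of": 12765289150, "and": 12522922536}, max_order=2)
    for ngram, total in sorted(counts.items()):
        # One n-gram per line followed by its count, whitespace-separated.
        print(" ".join(ngram) + "\t" + str(total))
```

As far as I know, ngram-count can also read a counts file of this general shape (an n-gram followed by its count) through its -read option, but check the man page before relying on that.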
I've done something similar; take a look here:
Moses Installation and Training Run-Through
In "PART II - Build a Model", section "Build Language Model", it works perfectly with Google n-grams.
Let me know if that works for you.