I'm trying to use the PET Parser, but the given documentation for usage is insufficient. Can anyone point me to a good article or tutorial on using PET? Does it support UTF-8?
To use the PET parser, first you have to load a grammar for the language of interest. The grammar must be authored in the TDL language, as used in the DELPH-IN consortium (see the DELPH-IN wiki). Large, compatible grammars are available for several languages, including English, Japanese, and German. There are also smaller grammars available, and you can write your own.
For this--and for working with these grammars--your best bet is Ann Copestake's book, "Implementing Typed Feature Structure Grammars" (CSLI 2002). The book provides a thorough introduction to TDL and grammars such as these which function via the unification of typed feature structures. The grammars support bidirectional mapping between syntax (surface strings) and semantics ("meaning," represented according to Copestake's MRS--Minimal Recursion Semantics). Note that these are precision grammars, which means that they are generally less tolerant of ungrammatical inputs than statistical systems.
The English Resource Grammar (ERG) is a large grammar of English which has broad, general-domain coverage. It's open source and can be downloaded from the ERG website. An online demo, powered by the PET parser, is also available.
The PET parser runs in two steps. The first, called flop, produces a "compiled" version of the grammar. The second step is the actual parsing, which uses the cheap program. You will need to obtain these two PET binaries for your Linux machine, or build them yourself. This step may not be easy if you're not familiar with building software on Linux. PET does not run on Windows (or Mac, to my knowledge).
Running flop is easy. Just go to your /erg directory, and type:
$ flop english.tdl
This will produce the english.grm file. Now you can parse sentences by running cheap:
$ echo the child has the flu. | cheap --mrs english.grm
This example produces a single semantic representation of the sentence in MRS (Minimal Recursion Semantics) format:
[ LTOP: h1
INDEX: e2 [ e SF: PROP TENSE: PRES MOOD: INDICATIVE PROG: - PERF: - ]
RELS: <
[ _the_q_rel<-1:-1>
LBL: h3
ARG0: x6 [ x PERS: 3 NUM: SG IND: + ]
RSTR: h5
BODY: h4 ]
[ "_child_n_1_rel"<-1:-1>
LBL: h7
ARG0: x6 ]
[ "_have_v_1_rel"<-1:-1>
LBL: h8
ARG0: e2
ARG1: x6
ARG2: x9 [ x PERS: 3 NUM: SG ] ]
[ _the_q_rel<-1:-1>
LBL: h10
ARG0: x9
RSTR: h12
BODY: h11 ]
[ "_flu_n_1_rel"<-1:-1>
LBL: h13
ARG0: x9 ] >
HCONS: < h5 qeq h7 h12 qeq h13 > ]
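If you want to drive these two steps from a script and pick pieces out of the MRS text, something along the following lines may work. This is only a rough sketch, not part of PET: the function names (compile_grammar, parse, predicates, qeq_pairs) are made up for illustration, it assumes flop and cheap are on your PATH, and it assumes cheap writes the MRS to standard output in the textual format shown above. The regular expressions only target that format and are not a real MRS reader.

```python
import re
import shutil
import subprocess

# Matches the predicate name that opens each elementary predication,
# e.g.  [ _the_q_rel<-1:-1>   or   [ "_child_n_1_rel"<-1:-1>
PRED_RE = re.compile(r'\[\s*"?([^\s"<\[\]]+_rel)"?<')

# Matches one handle constraint inside HCONS, e.g.  h5 qeq h7
QEQ_RE = re.compile(r'(h\d+)\s+qeq\s+(h\d+)')


def predicates(mrs_text):
    """Return the predicate names in the order they appear in RELS."""
    return PRED_RE.findall(mrs_text)


def qeq_pairs(mrs_text):
    """Return the (hole, label) handle pairs listed in HCONS."""
    return QEQ_RE.findall(mrs_text)


def compile_grammar(tdl_file="english.tdl", grammar_dir="."):
    """Step 1: run flop inside the grammar directory (hypothetical wrapper)."""
    subprocess.run(["flop", tdl_file], cwd=grammar_dir, check=True)


def parse(sentence, grammar="english.grm", grammar_dir="."):
    """Step 2: pipe one sentence to cheap and return whatever it prints.

    Assumption: the MRS arrives on standard output.
    """
    result = subprocess.run(
        ["cheap", "--mrs", grammar],
        input=sentence + "\n",
        capture_output=True,
        text=True,
        cwd=grammar_dir,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only attempt a real run when both PET binaries can be found.
    if shutil.which("flop") and shutil.which("cheap"):
        compile_grammar()
        mrs = parse("the child has the flu.")
        print(predicates(mrs))
        print(qeq_pairs(mrs))
```

Run from the grammar directory, this mirrors the two shell commands above; the predicate list and the qeq pairs are usually the first things you would want to inspect in the output.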
Copestake's book explains the specific syntax and linguistic formalism used in grammars that are compatible with PET. It also serves as a user's manual for the open-source LKB system, which is a more interactive system that can also parse with these grammars. In addition to parsing, the LKB can do the reverse: generate sentences from MRS semantic representations. The LKB is currently only supported on Linux/Unix. There are actually a total of four DELPH-IN compliant grammar processing engines, including LKB and PET.
For Windows, there is agree, a multi-threaded parser/generator that I've developed for .NET; it also supports both generation and parsing. If you need to work with the grammars interactively, you might want to consider using the LKB or agree in addition to--or instead of--PET. The interactive client front-ends for agree are mostly WPF-based, but the engine and a simple console client can run on any Mono platform.
ACE is another open-source DELPH-IN compatible parsing and generation system which is designed for high performance, and it is available for Linux and MacOS.
The LKB is written in Lisp, whereas PET and ACE are C/C++, so the latter are the faster parsers for production use. agree is also much faster than the LKB, but only becomes faster than PET when parsing complex sentences, where overheads from agree's lock-free concurrency become amortized.
[11/25/2011 edit: agree now supports generation as well as parsing]