I'm currently running a rather complicated data pre-processing operation, namely:
cat large_file.txt \
    | ./reverb -q \
    | cut --fields=16,17,18 \
    | awk -F\\t -vq="'" 'function quote(token) { gsub(q, "\\"q, token); return q token q } { print quote($2) "(" quote($3) "," quote($1) ")." }' \
    >> output.txt
As you can see, this is rather convoluted: first cat, then ./reverb, then cut, and finally awk.
Next I want to pass the output to a Java program, i.e.:
public static void main(String[] args) throws IOException
{
    Ontology ontology = new Ontology();
    BufferedReader br = new BufferedReader(new FileReader(
            "/home/matthias/Workbench/SUTD/2_January/Prolog/horn_data_test.pl"));

    // each input line has the form 'verb'('object','subject').
    Pattern p = Pattern.compile("'(.*?)'\\('(.*?)','(.*?)'\\)\\.");
    String line;
    while ((line = br.readLine()) != null)
    {
        Matcher m = p.matcher(line);
        if( m.matches() )
        {
            String verb = m.group(1);
            String object = m.group(2);
            String subject = m.group(3);
            ontology.addSentence( new Sentence( verb, object, subject ) );
        }
    }

    // whenever a term is the subject of one sentence and the object of
    // another, derive a new sentence: keep the first sentence's verb and
    // object and take the second sentence's subject
    for( String joint: ontology.getJoints() )
    {
        for( Integer subind: ontology.getSubjectIndices( joint ) )
        {
            Sentence xaS = ontology.getSentence( subind );
            for( Integer obind: ontology.getObjectIndices( joint ) )
            {
                Sentence yOb = ontology.getSentence( obind );
                Sentence s = new Sentence( xaS.getVerb(),
                                           xaS.getObject(),
                                           yOb.getSubject() );
                System.out.println( s );
            }
        }
    }
}
What would be the best way to synthesize this process into one coherent operation? Ideally I'd like to just specify the input file and the output file and run it once. As it stands, the entire process is quite discombobulated.
Maybe I can just put all these calls into a bash script? Is that feasible?
The input initially contains English-language sentences, one per line, e.g.:
Oranges are delicious and contain vitamin c.
Brilliant scientists learned that we can prevent scurvy by imbibing vitamin c.
Colorless green ideas sleep furiously.
...
The pre-processing makes it look like this:
'contain'('vitamin c','oranges').
'prevent'('scurvy','vitamin c').
'sleep'('furiously','ideas').
...
The Java program is for learning "rules" by inference, so if the processed data yields
'contain'('vitamin c','oranges').
and
'prevent'('scurvy','vitamin c').
then the Java code will emit
'prevent'('scurvy','oranges').
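To tie this together, one option I've been considering instead of a plain bash wrapper is to have the Java program launch the pipeline itself via ProcessBuilder and read the facts straight from its stdout, so no intermediate output.txt is needed. This is only a minimal sketch: it assumes the pipeline above has been saved verbatim in a small wrapper script, here called preprocess.sh (a made-up name), which reads the file named by its first argument and writes the facts to stdout; the Ontology/Sentence handling would stay the same as in the main method above.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipelineDriver
{
    public static void main(String[] args) throws IOException, InterruptedException
    {
        // args[0] is the file of English sentences; ./preprocess.sh is a
        // hypothetical wrapper around the reverb | cut | awk pipeline above
        ProcessBuilder pb = new ProcessBuilder("./preprocess.sh", args[0]);
        pb.redirectErrorStream(true);
        Process proc = pb.start();

        Pattern p = Pattern.compile("'(.*?)'\\('(.*?)','(.*?)'\\)\\.");
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(proc.getInputStream())))
        {
            String line;
            while ((line = br.readLine()) != null)
            {
                Matcher m = p.matcher(line);
                if (m.matches())
                {
                    // same handling as above, e.g.
                    // ontology.addSentence(new Sentence(m.group(1), m.group(2), m.group(3)));
                    System.out.println(m.group(1) + " / " + m.group(2) + " / " + m.group(3));
                }
            }
        }
        proc.waitFor();
    }
}

But I haven't decided whether that is actually any nicer than a two-line bash script that runs the pipeline and then the Java program, which is why I'm asking.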
I looked at the source code for reverb and I think it's very easy to adapt it to produce the output you want. If you look at the reverb class CommandLineReverb.java, it has the following two methods:
private void extractFromSentReader(ChunkedSentenceReader reader)
        throws ExtractorException {
    long start;
    ChunkedSentenceIterator sentenceIt = reader.iterator();
    while (sentenceIt.hasNext()) {
        // get the next chunked sentence
        ChunkedSentence sent = sentenceIt.next();
        chunkTime += sentenceIt.getLastComputeTime();
        numSents++;
        // make the extractions
        start = System.nanoTime();
        Iterable<ChunkedBinaryExtraction> extractions = extractor.extract(sent);
        extractTime += System.nanoTime() - start;
        for (ChunkedBinaryExtraction extr : extractions) {
            numExtrs++;
            // run the confidence function
            start = System.nanoTime();
            double conf = getConf(extr);
            confTime += System.nanoTime() - start;
            NormalizedBinaryExtraction extrNorm = normalizer.normalize(extr);
            printExtr(extrNorm, conf);
        }
        if (numSents % messageEvery == 0)
            summary();
    }
}
private void printExtr(NormalizedBinaryExtraction extr, double conf) {
    String arg1 = extr.getArgument1().toString();
    String rel = extr.getRelation().toString();
    String arg2 = extr.getArgument2().toString();

    ChunkedSentence sent = extr.getSentence();
    String toks = sent.getTokensAsString();
    String pos = sent.getPosTagsAsString();
    String chunks = sent.getChunkTagsAsString();
    String arg1Norm = extr.getArgument1Norm().toString();
    String relNorm = extr.getRelationNorm().toString();
    String arg2Norm = extr.getArgument2Norm().toString();

    Range arg1Range = extr.getArgument1().getRange();
    Range relRange = extr.getRelation().getRange();
    Range arg2Range = extr.getArgument2().getRange();
    String a1s = String.valueOf(arg1Range.getStart());
    String a1e = String.valueOf(arg1Range.getEnd());
    String rs = String.valueOf(relRange.getStart());
    String re = String.valueOf(relRange.getEnd());
    String a2s = String.valueOf(arg2Range.getStart());
    String a2e = String.valueOf(arg2Range.getEnd());

    String row = Joiner.on("\t").join(
            new String[] { currentFile, String.valueOf(numSents), arg1,
                    rel, arg2, a1s, a1e, rs, re, a2s, a2e,
                    String.valueOf(conf), toks, pos, chunks, arg1Norm,
                    relNorm, arg2Norm });
    System.out.println(row);
}
The first method is called once per sentence and does the extraction; it then calls the second method to print the tab-separated values to the output stream. I guess all you have to do is implement your own version of the second method, printExtr().
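Note that in the row built with Joiner above, arg1Norm, relNorm and arg2Norm are columns 16, 17 and 18, i.e. exactly the fields your cut --fields=16,17,18 keeps. So if printExtr() emits the Prolog-style fact directly, the cut and awk stages (and the quoting gymnastics) disappear. Here is a rough, untested sketch of what such a replacement might look like, dropped into CommandLineReverb.java in place of the method above and using only the getters already visible there:

private void printExtr(NormalizedBinaryExtraction extr, double conf) {
    // conf is unused here, but could be used to skip
    // low-confidence extractions before emitting a fact
    String arg1Norm = extr.getArgument1Norm().toString();
    String relNorm = extr.getRelationNorm().toString();
    String arg2Norm = extr.getArgument2Norm().toString();

    // prints 'relNorm'('arg2Norm','arg1Norm'). -- the same shape the awk stage produces
    System.out.println(quote(relNorm) + "(" + quote(arg2Norm) + ","
            + quote(arg1Norm) + ").");
}

// mirrors the awk quote() function: escape embedded single quotes,
// then wrap the token in single quotes
private static String quote(String token) {
    return "'" + token.replace("'", "\\'") + "'";
}

With that change the whole run collapses to something like cat large_file.txt | ./reverb -q >> output.txt (or a one-line wrapper script), and your Java program can consume the output unchanged.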