I would like to combine, with a Linux command if possible, all the words that start with a capital letter, excluding the one at the beginning of the line. The goal is to create edges between these words. For example:
My friend John met Beatrice and Lucio.
The result I would like to have should be:
I managed to get all the words that start with a capital letter, excluding the word at the beginning of the line through a regex. The regex is:
    cat gov.json | grep -oP "\b([A-Z][a-z']*)(\s[A-Z][a-z']*)*\b | ^(\s*.*?\s).*" > nodes.csv
With this I managed to write the nodes one per line in a single column, i.e.:
The goal now is to create the possible combinations between names that start with a capital letter and put them into a file. Any suggestions?
Here is another awk script doing the task, building the output while reading input.
script.awk, allowing duplicate names:
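    # Requires GNU awk (gawk) for FPAT.
    # A field is a space followed by a capitalized word, so the first word of a line never matches.
    BEGIN { FPAT = " [[:upper:]][[:alpha:]]+" }
    {
        for (i = 1; i <= NF; i++) {
            name = substr($i, 2)    # drop the leading space matched by FPAT
            # pair the new name with every name seen so far, in input order
            for (j = 0; j < namesCount; j++)
                namePairsArr[pairsCount++] = namesArr[j] " " name
            namesArr[namesCount++] = name
        }
    }
    END { for (i = 0; i < pairsCount; i++) print namePairsArr[i] }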
If duplicate names are not allowed, script.awk is:
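    # Requires GNU awk (gawk) for FPAT.
    BEGIN { FPAT = " [[:upper:]][[:alpha:]]+" }
    {
        for (i = 1; i <= NF; i++) {
            name = substr($i, 2)    # drop the leading space matched by FPAT
            if (nameSeenArr[name]++) continue    # skip names already recorded
            # pair the new name with every distinct name seen so far, in input order
            for (j = 0; j < namesCount; j++)
                namePairsArr[pairsCount++] = namesArr[j] " " name
            namesArr[namesCount++] = name
        }
    }
    END { for (i = 0; i < pairsCount; i++) print namePairsArr[i] }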
Run:
    awk -f script.awk gov.json > nodes.csv
sample input file:
My friend John met Beatrice and Lucio
My friend Johna met Beatricea and Lucioa
sample output:
John Beatrice
John Lucio
Beatrice Lucio
John Johna
Beatrice Johna
Lucio Johna
John Beatricea
Beatrice Beatricea
Lucio Beatricea
Johna Beatricea
John Lucioa
Beatrice Lucioa
Lucio Lucioa
Johna Lucioa
Beatricea Lucioa
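
Note that the output above also pairs names across different lines (for example John Johna). If edges should only connect names that appear on the same line, a per-line variant might look like the sketch below. It is only an illustration under a few assumptions: the function name pairs_per_line is made up for this example, the comma separator is a guess at what the .csv file should contain, and it strips every non-letter from a word (so "Lucio." becomes "Lucio"). It avoids FPAT, so it should not need GNU awk.

```shell
#!/bin/sh
# Hypothetical helper: print every pair of capitalized words that share a line,
# skipping the first word of each line. Uses only POSIX awk features.
pairs_per_line() {
    awk '
    {
        n = 0                                # names found on the current line
        for (i = 2; i <= NF; i++) {          # i = 2 skips the first word of the line
            word = $i
            gsub(/[^A-Za-z]/, "", word)      # remove all non-letters, e.g. a trailing "."
            if (word ~ /^[A-Z][A-Za-z]+$/)   # capitalized, at least two letters
                names[++n] = word
        }
        # emit each unordered pair once, in order of appearance
        for (a = 1; a < n; a++)
            for (b = a + 1; b <= n; b++)
                print names[a] "," names[b]  # comma separator is an assumption
    }' "$@"
}

printf '%s\n' 'My friend John met Beatrice and Lucio.' | pairs_per_line
```

Called with a file name instead of a pipe (pairs_per_line gov.json > edges.csv, where edges.csv is just an example name), it processes the file line by line in the same way.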