bash shell curl lynx w3m

Bulk website query using text-based browsers


I want a text browser like lynx, w3m, or links to perform a bulk query from a list of links. The results will be filtered for a keyword and should be appended to the original list. For example, let the list be in list.txt:

"http://dict.cc//?s=Chemical"
"http://dict.cc//?s=Fenster"

I can extract the result if I only submit one link at a time, e.g.

head -n 1 list.txt | xargs links -dump | sed -n '/NOUN/p'
tail -n 1 list.txt | xargs links -dump | sed -n '/NOUN/p'

Both work as expected, but this does not:

cat list.txt | xargs links -dump | sed -n '/NOUN/p'

or

for line in `cat list.txt`; do links -dump $line ; done

What am I doing wrong? As a next step, the output should be appended to the list on the matching line, so that list.txt looks like this after the operation:

"http://dict.cc//?s=Chemical" edit  NOUN   a chemical | chemicals       -
"http://dict.cc//?s=Fenster" NOUN   das Fenster | die Fenster    edit

This should be possible by combining the above with other tools like paste. The following does not work either; what would be a better solution?

for line in `cat list.txt`; do echo -n $line && links -dump $line; done

The example is just for demonstration; I will use sites other than dict.cc. Unfortunately, no API/REST interface is available.


Solution

  • I twiddled with the commands until I found the bug. The problem lies in the double quotes around the URLs in list.txt. After removing them, this works fine:

    for line in `cat list.txt`; do 
      echo -n $line && links -dump $line | sed -n '/NOUN/p' 
    done
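
    Alternatively, the double quotes can be stripped on the fly so that list.txt itself stays untouched. A minimal sketch, assuming one quoted URL per line: tr -d '"' removes the quotes, and a while read loop avoids the word-splitting pitfalls of for line in `cat ...`:

    while IFS= read -r line; do
      # strip the surrounding double quotes before handing the URL to links
      url=$(printf '%s' "$line" | tr -d '"')
      links -dump "$url" | sed -n '/NOUN/p'
    done < list.txt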
    

    If one has to keep the double quotes, passing each entry from the file to links via xargs works (while the plain loop just above does not):

    for line in `cat list.txt`; do 
      echo -n $line && echo $line | xargs links -dump | sed -n '/NOUN/p'
    done
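
    For the second part of the question, i.e. writing the filtered result back next to each URL, one can redirect the loop's output to a temporary file and replace list.txt afterwards. A sketch under the assumption that each dump yields exactly one matching NOUN line (the temporary file name list.txt.new is arbitrary):

    while IFS= read -r line; do
      # fetch the page via xargs (which handles the quoted entry) and keep only the NOUN line
      result=$(echo "$line" | xargs links -dump | sed -n '/NOUN/p')
      printf '%s %s\n' "$line" "$result"
    done < list.txt > list.txt.new && mv list.txt.new list.txt

    Writing to a separate file first is deliberate: redirecting straight into list.txt would truncate it while the loop is still reading from it. The same pairing could also be produced with paste, by dumping all results into a second file and joining it to list.txt afterwards.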