linux bash console lynx

How to download text from many webpages to a file?


I'm trying to download a Polish dictionary. Unfortunately, the existing files contain all inflections (not sure what the proper English word is). I found out that the command

lynx --dump 'https://sjp.pl/slownik/lp.phtml?f_vl=2&page=1' > file.txt

can download a single dictionary webpage. I would then have to somehow extract only the dictionary entries from the block of text, but at least it's a start.

Unfortunately, I'm a Linux noob and don't know how to iterate through all 3067 pages.


Solution

  • Untested, but you should be able to do it quite quickly and easily with GNU Parallel (the -k flag keeps the pages in order in the output):

    parallel -qk lynx --dump 'https://sjp.pl/slownik/lp.phtml?f_vl=2&page={}' ::: {1..3067} > file.txt
    

    If it doesn't work, try dropping the quotes and putting a backslash before the & instead. Sorry, I don't have any way to test at the moment.

    The slow way is:

    for ((i=1;i<=3067;i++)) ; do
       lynx --dump "https://sjp.pl/slownik/lp.phtml?f_vl=2&page=$i"
    done > file.txt
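
    Once you have file.txt, a rough way to pull out just the dictionary entries might look like the sketch below. It assumes (I haven't checked the page layout) that each entry shows up in the lynx dump as a bracketed link marker such as [123]słowo, possibly several per line; words.txt is just a hypothetical output name. Adjust the pattern to whatever the dump actually contains.

    # Hypothetical post-processing of the concatenated dump.
    # Pull out every "[number]word" link marker, then strip the bracketed
    # number so only the word itself is left, one per line.
    grep -oE '\[[0-9]+\][^][[:space:]]+' file.txt \
      | sed -E 's/^\[[0-9]+\]//' \
      > words.txt

    Navigation links will match the same pattern, so you may still need to filter those out or tighten the expression.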