python bash sed awk

How to write a find-all function (with regex) in awk or sed


I have a bash function that runs Python and prints every match of a regex found on stdin:

function find-all() {
    python -c "import re
import sys
print('\n'.join(re.findall('$1', sys.stdin.read())))"
}

When I run find-all 'href="([^"]*)"' < index.html, it should return the first group of the regex for each match (the value of every href attribute in index.html).

How can I write this in sed or awk?


Solution

  • I suggest you use grep -o.

    -o, --only-matching
           Show only the part of a matching line that matches PATTERN.
    

    E.g.:

    $ cat > foo
    test test test
    test
    bar
    baz test
    $ grep -o test foo
    test
    test
    test
    test
    test
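
Since the question asks for awk, the same only-matching behaviour can be sketched with awk's match() function. This is a minimal sketch, not part of the original answer: the function name only_matching and the environment variable FIND_RE are made up for illustration. The regex is passed through the environment rather than with -v so that awk does not process backslash escapes in it.

```shell
# Hypothetical helper: print every match of the regex $1 on its own line,
# roughly like `grep -o -E`.
only_matching() {
    FIND_RE="$1" awk '
    BEGIN { re = ENVIRON["FIND_RE"] }
    {
        line = $0
        # Repeatedly find the leftmost match in what is left of the line.
        while (length(line) > 0 && match(line, re)) {
            if (RLENGTH == 0) {
                # Empty match: skip one character to avoid looping forever.
                line = substr(line, 2)
                continue
            }
            print substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
        }
    }'
}
```

With the file foo from above, only_matching test < foo prints test five times. Unlike grep -o, anchors such as ^ are re-evaluated against the remainder of the line after each match, so anchored patterns can behave differently.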
    

    Update

    If you are extracting href attributes from HTML files, a command like this prints each attribute together with its value:

    $ grep -o -E 'href="([^"]*)"' /usr/share/vlc/http/index.html
    href="style.css"
    href="iehacks.css"
    href="old/"
    

    You could extract the values by using cut and sed like this:

    $ grep -o -E 'href="([^"]*)"' /usr/share/vlc/http/index.html | cut -f2 -d'=' | sed -e 's/"//g'
    style.css
    iehacks.css
    old/
    

    But you'd be better off using an HTML/XML parser for reliability.
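
To answer the sed part of the question, the pipeline above can be folded back into a function that prints the first capture group of every match. This is a sketch under a few assumptions: grep supports -o (GNU and BSD grep do, but it is not in POSIX), sed supports -E, and the pattern is an extended regular expression with at least one group. The function is named find_all rather than find-all because a hyphen is not a valid function name in every shell. Using sed to substitute the group also avoids a weakness of cut -f2 -d'=', which truncates any value that itself contains = (for example page?id=1).

```shell
# Hypothetical replacement for the Python-based find-all from the question.
find_all() {
    # Use a control character as the sed delimiter so the pattern may
    # contain /, | or # without escaping.
    d=$(printf '\001')
    # grep -o puts each match on its own line; sed then replaces the
    # whole match with its first capture group.
    grep -o -E -e "$1" | sed -E -e "s${d}$1${d}\\1${d}"
}
```

Usage mirrors the original: find_all 'href="([^"]*)"' < index.html.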