
Split on regex (more than a character, maybe variable width) and keep the separator like GNU awk


In GNU awk, there is a four-argument version of split that can optionally keep all the separators from the split in a second array (the fourth argument). This is useful if you want to reconstruct a select subset of columns from a file where the delimiter may be more complicated than a single character.

Suppose I have the following file:

# sed makes the invisibles visible...
# ∙ is a space; \t is a literal tab; $ is line end
$ sed -E 's/\t/\\t/g; s/ /∙/g; s/$/\$/' f.txt
a\t∙∙b∙c\td∙_∙e$
a∙∙∙b∙c\td∙_∙e$
∙∙∙a∙∙∙b∙c\td∙_∙e$
a∙∙∙b_c\td∙_∙e\t$
abcd$

Here I have fields composed of anything other than the delimiter characters, and a delimiter of one or more characters from the set [\s_].

With gawk, you can do:

gawk '{
    printf "["
    n=split($0, flds, /[[:space:]_]+/, seps)
    for(i=1; i<=n; i++) 
           printf "[\"%s\", \"%s\"]%s", flds[i], seps[i], i<n ? ", " : "]" ORS
    }
' f.txt

Prints (where the first element is the field, the second is the match to the delimiter regexp):

[["a", "      "], ["b", " "], ["c", "   "], ["d", " _ "], ["e", ""]]
[["a", "   "], ["b", " "], ["c", "  "], ["d", " _ "], ["e", ""]]
[["", "   "], ["a", "   "], ["b", " "], ["c", " "], ["d", " _ "], ["e", ""]]
[["a", "   "], ["b", "_"], ["c", "  "], ["d", " _ "], ["e", "   "], ["", ""]]
[["abcd", ""]]

Ruby's str.split, unfortunately, does not have the same functionality. (Neither does Python's or Perl's.)

What you can do is capture the match string from the delimiter regexp:

irb(main):053> s="a   b c    d _ e"
=> "a   b c    d _ e"
irb(main):054> s.split(/([\s_]+)/)
=> ["a", "   ", "b", " ", "c", "    ", "d", " _ ", "e"]

Then use that result with .each_slice(2) and replace the nils with '':

irb(main):055> s.split(/([\s_]+)/).each_slice(2).map{|a,b| [a,b]}
=> [["a", "   "], ["b", " "], ["c", "    "], ["d", " _ "], ["e", nil]]
irb(main):056> s.split(/([\s_]+)/).each_slice(2).map{|a,b| [a,b]}.map{|sa| sa.map{|e| e.nil? ? "" : e} }
=> [["a", "   "], ["b", " "], ["c", "    "], ["d", " _ "], ["e", ""]]

This allows gawk's version of split to be replicated:

ruby -ne 'p $_.gsub(/\r?\n$/,"").split(/([\s_]+)/).each_slice(2).
                map{|a,b| [a,b]}.map{|sa| sa.map{|e| e.nil? ? "" : e} }' f.txt

Prints:

[["a", "\t  "], ["b", " "], ["c", "\t"], ["d", " _ "], ["e", ""]]
[["a", "   "], ["b", " "], ["c", "\t"], ["d", " _ "], ["e", ""]]
[["", "   "], ["a", "   "], ["b", " "], ["c", "\t"], ["d", " _ "], ["e", ""]]
[["a", "   "], ["b", "_"], ["c", "\t"], ["d", " _ "], ["e", "\t"]]
[["abcd", ""]]

So the output is the same, except for the line with the trailing \t, where gawk produces an extra empty field/delimiter pair that Ruby's split drops.

In Python, roughly the same method also works:

python3 -c '
import sys, re 
from itertools import zip_longest
with open(sys.argv[1]) as f:
    for line in f:
        lp=re.split(r"([\s_]+)", line.rstrip("\r\n"))
        print(list(zip_longest(*[iter(lp)]*2, fillvalue="")) )
' f.txt   

I am looking for a general algorithm to replicate the functionality of gawk's four-argument split in Ruby/Python/Perl/etc. The Ruby and Python I have here work.

Most solutions (other than for gawk) to "I want to split on this delimiter and keep the delimiter" involve a regex more complex than one that simply matches the delimiter: most either scan for a field/delimiter combination or use lookarounds. I am specifically trying to use a simple regexp that matches only the delimiter, without lookarounds, with roughly the same regexp I would have used with GNU awk.
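
For comparison, a small Python illustration (a sketch relying on re.split's documented behaviour of returning the text of capturing groups) of why the plain delimiter regexp is enough once it is wrapped in a capturing group:

import re

s = "a   b c\td _ e"

# With a capture group, re.split also returns the text matched by the group,
# so the separators end up interleaved with the fields.
print(re.split(r"([\s_]+)", s))
# ['a', '   ', 'b', ' ', 'c', '\t', 'd', ' _ ', 'e']

# Without the capture group the separators are discarded.
print(re.split(r"[\s_]+", s))
# ['a', 'b', 'c', 'd', 'e']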

So stated generally:

  1. Take a regexp matching the delimiters (without having to think much about the data fields) and put it inside a capturing group;
  2. Take the resulting array of [field1, delimiter1, field2, delimiter2, ...] and create an array of [[field1, delimiter1], [field2, delimiter2], ...] (a minimal sketch follows below).
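
A minimal Python sketch of those two steps (the helper name split_keep_seps is just illustrative, not from any library):

import re

def split_keep_seps(delim_re, s):
    # Step 1: wrap the delimiter regexp in a capturing group so that
    # re.split returns the separators interleaved with the fields.
    # (Assumes delim_re itself contains no capturing groups.)
    parts = re.split("(" + delim_re + ")", s)
    # Step 2: pad with one empty separator so the last field pairs up,
    # then group the flat list into [field, delimiter] pairs.
    parts.append("")
    return [list(pair) for pair in zip(parts[0::2], parts[1::2])]

print(split_keep_seps(r"[\s_]+", "a   b c\td _ e"))
# [['a', '   '], ['b', ' '], ['c', '\t'], ['d', ' _ '], ['e', '']]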

That method is easily used in Ruby (see above), Python (see above), and Perl (I was too lazy to write that one...).

Is this the best way to do this?


Solution

  • With splitting you always have one more field than delimiters, which is why you have to fill in an empty string as the delimiter for the last field. A simpler way to achieve the filling is to always append an empty string to the list returned by the split, so that you can use the itertools.batched function (available since Python 3.12, or as a recipe beforehand; a sketch of such a stand-in appears at the end of this answer) to produce easy pairings:

    import re
    from io import StringIO
    from itertools import batched
    
    file = StringIO('''a\t  b c\td _ e
    a   b c\td _ e
       a   b c\td _ e
    a   b_c\td _ e\t
    abcd''')
    
    for line in file:
        print(list(batched(re.split(r"([\s_]+)", line.rstrip('\r\n')) + [''], 2)))
    

    This outputs:

    [('a', '\t  '), ('b', ' '), ('c', '\t'), ('d', ' _ '), ('e', '')]
    [('a', '   '), ('b', ' '), ('c', '\t'), ('d', ' _ '), ('e', '')]
    [('', '   '), ('a', '   '), ('b', ' '), ('c', '\t'), ('d', ' _ '), ('e', '')]
    [('a', '   '), ('b', '_'), ('c', '\t'), ('d', ' _ '), ('e', '\t'), ('', '')]
    [('abcd', '')]
    

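    For Python versions before 3.12 that lack itertools.batched, a small stand-in (a sketch along the lines of the itertools recipe; needs Python 3.8+ for the walrus operator) can be used in its place:

    from itertools import islice

    def batched(iterable, n):
        # Yield successive n-sized tuples from iterable; the last may be shorter.
        it = iter(iterable)
        while batch := tuple(islice(it, n)):
            yield batch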