
How to capture more than 10 things using Swift 5.7's RegexBuilder?


Let's say I have a file that stores information about people, and one of the lines looks like this:

Sweeper 30 1992-09-22 China/Beijing - 0 2020-07-07 Mary/Linda - Pizza/Lemon

From left to right, it's name, age, date of birth, country of birth, city of birth, number of children, date of marriage (optional), wife's name (optional), ex-wife's name (optional), favourite food, least favourite food.

I want to get all the information from the line using the Swift 5.7 RegexBuilder module. I tried:

let regex = Regex {
    /([a-zA-Z ]+)/ // Name
    " "
    TryCapture { OneOrMore(.digit) } transform: { Int($0) } // Age
    " "
    Capture(.iso8601Date(timeZone: .gmt)) // Date of Birth
    " "
    /([a-zA-Z ]+)/ // Country of Birth
    "/"
    /([a-zA-Z ]+)/ // City of Birth
    " - "
    TryCapture { OneOrMore(.digit) } transform: { Int($0) } // Children Count
    Optionally {
        " "
        Capture(.iso8601Date(timeZone: .gmt)) // Date of Marriage
        Optionally {
            " "
            /([a-zA-Z ]+)/ // Wife
            Optionally {
                "/"
                /([a-zA-Z ]+)/ // Ex-wife
            }
        }
    }
    " - "
    /([a-zA-Z ]+)/ // Favourite food
    "/"
    /([a-zA-Z ]+)/ // Least Favourite Food
}

However, Swift says that it is unable to type check this in reasonable time.

I know this happens because RegexComponentBuilder (the result builder for regex components) only has overloads for up to 10 captures (the "C" type parameters), or something like that (I'm not too sure on the details):

static func buildPartialBlock<W0, W1, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, R0, R1>(
    accumulated: R0,
    next: R1
) -> Regex<(Substring, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10)>
where R0 : RegexComponent, R1 : RegexComponent,
      R0.RegexOutput == (W0, C1, C2, C3),
      R1.RegexOutput == (W1, C4, C5, C6, C7, C8, C9, C10)

If I make all the Optionally parts required, the error message becomes a bit clearer:

Ambiguous use of 'buildPartialBlock(accumulated:next:)'

SwiftUI has a similar problem, where the number of views in a view builder cannot exceed 10, in which case you just use a Group to make some of the views a single view. Can you do something similar in RegexBuilder? Make some of the captures a single capture? It seems to have something to do with AnyRegexOutput, but I'm not sure how to use it.

How do I resolve this compiler error?


To avoid an XY problem:

I have a data file where the data is formatted very haphazardly, i.e. it is not machine-readable the way CSV or JSON is. Lines are written in all sorts of formats. Random delimiters are used in random places.

Then another line in the file would have the same information, but formatted in a different way.

What I want to do is to convert this weirdly formatted file into an easy-to-work-with format, like CSV. I've decided to do this with the Swift 5.7 RegexBuilder API. I would find a line in the file, write a regex that matches that line, convert all the lines of the file that match that regex to CSV, then rinse and repeat.
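A minimal sketch of that workflow, assuming a hypothetical, much simpler line format of just "name age" (the names `lineRegex`, `lines` and `csvRows` are made up for illustration):

```swift
import RegexBuilder

// Hypothetical simplified format: "<name> <age>".
let lineRegex = Regex {
    Capture { OneOrMore(.word) }                            // Name
    " "
    TryCapture { OneOrMore(.digit) } transform: { Int($0) } // Age
}

// Stand-in for the lines of the data file.
let lines = ["Sweeper 30", "not a match", "Mary 28"]

// Keep only the lines that match this regex and turn each one into a CSV row.
let csvRows = lines.compactMap { line -> String? in
    guard let match = line.wholeMatch(of: lineRegex) else { return nil }
    let (_, name, age) = match.output
    return "\(name),\(age)"
}

print(csvRows)
```

Lines that do not match are skipped, so they can be handled by the next regex in the "rinse and repeat" loop.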

Therefore, I would like to avoid using multiple regexes to parse a single line, as this would mean that I would be writing a lot more regexes.

I'm not sure if a parser like ANTLR4 would solve my problem. Given how randomly the file is formatted, I would need to be changing the parser a lot, causing the files to be generated again and again. I don't think that will be as convenient as using RegexBuilder.


Solution

  • As a hack, you can create a generalised CustomConsumingRegexComponent implementation that takes in some regex and outputs any type T we want, essentially "grouping" the captures.

    It's also possible to just not do the transformation, and you'd end up with nested tuples, but I don't like that.

    struct Group<RegexOutput, Component: RegexComponent>: CustomConsumingRegexComponent {
    
        let component: () -> Component
        
        let transform: (Component.RegexOutput) -> RegexOutput
        
        init(@RegexComponentBuilder _ regexBuilder: @escaping () -> Component, transform: @escaping (Component.RegexOutput) -> RegexOutput) {
            component = regexBuilder
            self.transform = transform
        }
        
        func consuming(_ input: String, startingAt index: String.Index, in bounds: Range<String.Index>) throws -> (upperBound: String.Index, output: RegexOutput)? {
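            // Build the inner regex and match it starting at `index`, without reading past `bounds`.
            let innerRegex = Regex(component)
            guard let match = input[index..<bounds.upperBound].prefixMatch(of: innerRegex) else { return nil }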
            let upperBound = match.range.upperBound
            let output = match.output
            let transformedOutput = transform(output)
            return (upperBound, transformedOutput)
        }
    }
    

    This is only a hack because the regex inside the Group doesn't actually know about the stuff outside the Group, so quantifiers inside the Group won't backtrack to let the stuff outside the Group match.

    For example, to fix the code in the question, I can put all the marriage-related info into a Group, but I have to add a lookahead inside the Group:

    struct Marriage {
        let marriageDate: Date
        let wife: Substring?
        let exWife: Substring?
    }
    
    let r = Regex {
        /([a-zA-Z ]+)/ // Name
        " "
        TryCapture { OneOrMore(.digit) } transform: { Int($0) } // Age
        " "
        Capture(.iso8601Date(timeZone: .gmt)) // Date of Birth
        " "
        /([a-zA-Z ]+)/ // Country of Birth
        "/"
        /([a-zA-Z ]+)/ // City of Birth
        " - "
        TryCapture { OneOrMore(.digit) } transform: { Int($0) } // Children Count
    
        Optionally {
            " "
            Capture(Group {
                Capture(.iso8601Date(timeZone: .gmt)) // Date of Marriage
                Optionally {
                    " "
                    /([a-zA-Z ]+)/ // Wife
                    Optionally {
                        "/"
                        /([a-zA-Z ]+)/ // Ex-wife
                    }
                }
                Lookahead(" - ")
            } transform: { (_, date, wife, exWife) in
                Marriage(marriageDate: date, wife: wife, exWife: exWife as? Substring) // unwrap the double optional
            })
        }
        " - "
        /([a-zA-Z ]+)/ // Favourite food
        "/"
        /([a-zA-Z ]+)/ // Least Favourite Food
    }
    

    Without the lookahead, this is what happens:

    The innermost [a-zA-Z ]+ would match Linda, and also the space after it, causing " - " to not match. Normally, this would cause backtracking, but since things inside the Group don't know about things outside the Group, backtracking does not occur here, and the whole match fails.
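The same effect can be reproduced without the Group type. The sketch below is only an illustration, on a shortened hypothetical input "Linda - Pizza": it compares one combined regex with two regexes run one after the other, which is roughly what happens at the Group boundary.

```swift
import RegexBuilder

let line = "Linda - Pizza"

// One regex: after [a-zA-Z ]+ greedily consumes "Linda ", the literal " - " fails,
// so the engine backtracks by one character and the match succeeds.
let combined = line.wholeMatch(of: #/([a-zA-Z ]+) - ([a-zA-Z]+)/#)

// Two independent steps: the first match is final once it returns,
// so nothing can make it give back the trailing space.
let name = #/[a-zA-Z ]+/#
let separatorAndFood = Regex {
    " - "
    OneOrMore(.word)
}

var consumedByFirstStep: Substring = ""
var secondStepMatched = false
if let first = line.prefixMatch(of: name) {
    consumedByFirstStep = first.output           // includes the trailing space
    let rest = line[first.range.upperBound...]   // starts at "-", not at " "
    secondStepMatched = rest.prefixMatch(of: separatorAndFood) != nil
}

print(combined != nil, secondStepMatched)
```

The Lookahead(" - ") inside the Group works around this by making the inner regex itself fail, and therefore backtrack, whenever it has consumed too much.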