regex vim substitution

What would be the best approach to this substitution in Vim?


I have a document made up of header/title lines, each followed by about 10 listings. I need to copy the header/title info into each of its listings so that they can be properly uploaded to a website (using comma and pipe delimiters). It looks like this:

SectionName1 and TitleName1
     1111 - The SubSectionName A

     222 - The SubSectionName B

     3333 - The SubSectionName C

SectionName2 and TitleName2
     444 - The SubSectionName D

     55555 - The SubSectionName E

     66 - The SubSectionName F

This pattern repeats several hundred times. What I need is to produce something like:

SectionName1,TitleName1,1111,SubSectionNameA
SectionName1,TitleName1,222,SubSectionNameB
SectionName1,TitleName1,3333,SubSectionNameC
SectionName2,TitleName2,444,SubSectionNameD
SectionName2,TitleName2,55555,SubSectionNameE
SectionName2,TitleName2,66,SubSectionNameF

I realize there can be multiple approaches to this problem, but I'm having a difficult time pulling the trigger on any one method. I understand submatches, joins and getline, but I am not good at putting them to practical use in this scenario.

Any help to get me mentally started would be greatly appreciated.


Solution

  • Let me propose the following fairly general Ex command that solves the issue.[1]

    :g/^\s*\h/d|let@"=substitute(@"[:-2],'\s\+and\s\+',',','')|ki|/\n\s*\h\|\%$/kj|
    \   'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=@".','.submatch(1).','/|'i,'js/\s\+//g
    

    At the top level, this is the :global command that enumerates the lines starting with zero or more whitespace characters followed by a Latin letter or an underscore (see :help /\h). The lines matching this pattern are assumed to be the header lines containing the section and title names. The rest of the command, after the pattern describing the header lines, is the sequence of instructions to be executed for each of those lines.
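
    Before running the full command, it may help to check which lines the header pattern picks up. The following is a minimal sketch, assuming the sample lines from the question; the :g command here only prints the matching lines with their numbers and leaves the buffer untouched.

    ```vim
    " Dry run: list every line that :global would treat as a header.
    g/^\s*\h/#

    " The same pattern tested against two sample lines from the question.
    " A header line matches, so this echoes 1.
    echo 'SectionName1 and TitleName1' =~# '^\s*\h'
    " A listing line starts with digits after the indent, so this echoes 0.
    echo '     1111 - The SubSectionName A' =~# '^\s*\h'
    ```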

    The actions to be performed on the headers can be divided into three steps.

    1. Delete the current header line, at the same time extracting section and title names from it.

      :d|let@"=substitute(@"[:-2],'\s\+and\s\+',',','')
      

      First, remove the current line, saving it into the unnamed register, using the :delete command. Then, update the contents of that register (referred to as @"; see :help @r and :help "") to be the result of a substitution that replaces the word and, together with the whitespace surrounding it, with a single comma. The actual replacement is carried out by the substitute() function.

      However, the input is not the exact string containing the whole header line, but its prefix leaving out the last character, which is a newline symbol. The [:-2] notation is a short form of the [0:-2] subscript expression that designates the substring from the very first byte to the second-to-last one (see :help expr-[:]). This way, the unnamed register holds the section and the title names separated by a comma.
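
      The effect of this step can be tried in isolation. This is a small sketch that uses ordinary variables (deleted and header are illustrative names, not part of the command) in place of the unnamed register, so nothing in the buffer or the registers is changed.

      ```vim
      " What :delete leaves in the unnamed register: the line plus a newline.
      let deleted = "SectionName1 and TitleName1\n"

      " [:-2] drops the final byte (the newline); substitute() then replaces
      " the word 'and' and the whitespace around it with a single comma.
      let header = substitute(deleted[:-2], '\s\+and\s\+', ',', '')

      " Echoes SectionName1,TitleName1
      echo header
      ```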

    2. Determine the range of dependent subsection lines.

      :ki|/\n\s*\h\|\%$/kj
      

      After the first step, the subsection records belonging to the just-parsed header line run from the current line (the one that followed the header) to the line before the next header or, if there is no such header below, to the end of the buffer. The numbers of these two lines are stored in the marks i and j, respectively. (See :helpg ^A mark is for a description of marks.)

      The marks are placed using the :k command, which sets a specified mark at the last line of a given range; by default, that range is the current line. So, unlike the first line of the considered block, the last one requires an explicit line range to point out its location. A particular form of range, denoting the next line where a given pattern matches, is used in this case (see :help :range). The pattern is composed in such a way that it matches either a line immediately preceding a header (a line starting with optional whitespace followed by an alphabetical character or an underscore) or the very last line. (See :help pattern for details about the syntax of Vim regular expressions.)
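
      To see where the two marks land, the step can be replayed on a throwaway buffer. The sketch below assumes a few sample lines taken from the question; the variable block is only there to hold the result for inspection.

      ```vim
      " Scratch buffer holding one block of listings and the next header.
      new
      call setline(1, [
            \ '     1111 - The SubSectionName A',
            \ '',
            \ '     222 - The SubSectionName B',
            \ 'SectionName2 and TitleName2',
            \ '     444 - The SubSectionName D'])

      " Go to the first listing, as if its header had just been deleted.
      1
      " Mark the first line of the block ...
      ki
      " ... and the line just before the next header (or the last line).
      /\n\s*\h\|\%$/kj

      " Echoes [1, 3]: the block spans lines 1 to 3.
      let block = [line("'i"), line("'j")]
      echo block

      " Discard the scratch buffer.
      bwipeout!
      ```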

    3. Transform the delineated subsection lines according to desired format, prepending section and title names found in the corresponding header line.

      :'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=@".','.submatch(1).','/|'i,'js/\s\+//g
      

      This step consists of two :substitute commands that are run over the range of lines delimited by the marks i and j (see :help [range]).

      The first substitution command matches the beginning of a subsection line (an identifier followed by a hyphen and the word The, with optional leading whitespace and whitespace around the hyphen) and replaces it with the contents of the unnamed register, which holds the section and title names joined with a comma, followed by a comma, the matched identifier, and another comma. The second substitution finalizes the transformation by removing all whitespace on the line, which joins the subsection name and the letter that follows it.

      To construct the replacement string in the first :substitute command, the substitute-with-an-expression feature is used (see :help sub-replace-\=). The substitution part of the command should start with \= for Vim to interpret the remaining text not in a regular way, but as an expression (see :help expression). The result of that expression's evaluation becomes the substitution string. Note the use of the submatch() function in the substitute expression to retrieve the text of a submatch by its number.
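
      The same two replacements can be tried on a single string with the substitute() function, which also accepts a replacement starting with \=. This is only a sketch: g:header, g:listing, and g:result are illustrative names, and a global variable stands in for the unnamed register so that the register is left alone.

      ```vim
      " Stand-in for the register contents prepared in step 1.
      let g:header = 'SectionName1,TitleName1'
      let g:listing = '     1111 - The SubSectionName A'

      " First replacement: the expression after \= is evaluated for the match,
      " and submatch(1) returns the text captured by \(\d\+\).
      let g:result = substitute(g:listing, '^\s*\(\d\+\)\s\+-\s\+The',
            \ '\=g:header . "," . submatch(1) . ","', '')

      " Second replacement: remove all remaining whitespace.
      let g:result = substitute(g:result, '\s\+', '', 'g')

      " Echoes SectionName1,TitleName1,1111,SubSectionNameA
      echo g:result
      ```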


    [1] The command is wrapped for better readability; its one-line version is listed below for ease of copy-pasting into the Vim command line. Note that the wrapped command can be used in a Vim script without any change.

    :g/^\s*\h/d|let@"=substitute(@"[:-2],'\s\+and\s\+',',','')|ki|/\n\s*\h\|\%$/kj|'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=@".','.submatch(1).','/|'i,'js/\s\+//g
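
    For comparison, the same transformation can be spelled out as a loop over the lines, which may be easier to adapt than the one-liner. This is a rough sketch, not the command above: FlattenListings is a hypothetical helper name, it reuses the patterns from the command, and unlike the command it also drops blank lines and any other line that is neither a header nor a listing.

    ```vim
    " Hypothetical helper: takes the raw lines as a list and returns the
    " comma-separated records as a new list.
    function! FlattenListings(lines) abort
      let header = ''
      let records = []
      for text in a:lines
        if text =~# '^\s*\h'
          " Header line: 'Section and Title' becomes 'Section,Title'.
          let header = substitute(text, '\s\+and\s\+', ',', '')
        elseif text =~# '^\s*\d\+\s\+-\s\+The'
          " Listing line: keep the number, drop '- The', strip whitespace.
          let item = substitute(text, '^\s*\(\d\+\)\s\+-\s\+The', '\1,', '')
          call add(records, header . ',' . substitute(item, '\s\+', '', 'g'))
        endif
        " Any other line (for example a blank one) is skipped.
      endfor
      return records
    endfunction
    ```

    To apply it to the current buffer, one could run :let records = FlattenListings(getline(1, '$')), then :%delete _, and finally :call setline(1, records).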