google-sheets google-sheets-formula array-formulas

How to extract multiple blocks of rows between two keywords in a Google Sheets column into adjacent columns (formula-only)

I have a single column that contains mixed data (strings, numbers, sometimes blanks) and from time to time, there are marker rows that delimit blocks of data. What I would like to do is:

Start collecting when the row is exactly "Specific competencies."
Stop collecting when the row matches "Basic knowledge of (could be anything) course (or courses, ending with s).". (example matches: "Basic knowledge of Mathematics course", "Basic knowledge of Science courses", etc.).

Example (simplified):

Column
data1
data2
Specific competencies. <-- start collecting (from next row)
data3
data4
Basic knowledge of Mathematics course. <-- stop collecting (not included)
data5
Specific competencies. <-- start collecting (from next row)
data6
data7
data8
data9
data10
Basic knowledge of Science courses. <-- stop collecting (not included)
data11
...

Desired output (each block in its own column):

Block 1	Block 2	...
data3	data6	...
data4	data7
	data8
	data9
	data10

For a single block, I can do:

=LET(range,B:B, FILTER(range,ISBETWEEN( ROW(range), MATCH("Specific competencies.",range,0), MATCH(TRUE, REGEXMATCH(range, "^Basic knowledge of.* courses?\.$"), 0))))

But that only extracts the first block, not multiple ones.

I have also tried using REDUCE + LAMBDA + HSTACK to iterate row by row and build the results, but I run into errors.

Any help is appreciated. Thank you!

Solution

Assuming that you're not wanting to output the header row, you should be able to use the following formula.

=index(map(split(join("♠", if(regexmatch(B:B, "^Specific competencies|Basic knowledge of"), "♦", B:B)), "♦"), lambda(c, tocol(split(c, "♠")))))

You can use any two characters for the separators, but try to use uncommon characters. The formula first replaces "Specific competencies..." and "Basic knowledge of..." with the spade character "♠" and then joins the entire column together using the diamond character "♦". Next, it splits on the spade, so that each group is in its own column. Finally, it MAPs over those values and uses TOCOL/SPLIT to flatten the group values into an array down each column.

Edit: The formula returns an error if there are numerical values due to the REGEXMATCH. This can be easily overcome by attaching an empty string onto all the values in column B with REGEXMATCH(""&B:B; however, the next roadblock occurs because there are around 22,000 rows and joining together that many rows exceeds the 50,000 character limit. In addition, the formula wasn't returning just the values between the start and ending cells.

Therefore, I'm changing the formula to the following:

=bycol(wrapcols(filter(sequence(rows(A:A)),regexmatch(""&A:A,"^Specific competencies|Basic knowledge of")),2),lambda(c,chooserows(A:A,sequence(index(c,2)-index(c,1)-1,1,index(c,1)+1))))

This formula first FILTERs the SEQUENCE(ROWS(A:A)) for all the starting and ending cells. It then uses WRAPCOLS to make vertical pairs of start/end indices. Then, if uses BYCOL to iterate across the pairs, using them to generate the SEQUENCE of row numbers to pull from column A using CHOOSEROWS.