sqlamazon-redshiftdbt

Explode values in dbt


I'm trying to replicate the repeat_by and explode functions from Polars using dbt with a Redshift database but am having difficulty finding an equivalent solution.

Here's the sample Polars DataFrame code for context:

NOTE: I don't need replicate column, if that's an issue, then don't worry about it

import polars as pl

# Sample DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["New York", "Los Angeles", "Chicago"],
    "replicate": [2, 3, 1]
})

Polars DataFrame output:

shape: (3, 4)
┌─────────┬─────┬─────────────┬───────────┐
│ name    ┆ age ┆ city        ┆ replicate │
│ ---     ┆ --- ┆ ---         ┆ ---       │
│ str     ┆ i64 ┆ str         ┆ i64       │
╞═════════╪═════╪═════════════╪═══════════╡
│ Alice   ┆ 25  ┆ New York    ┆ 2         │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 3         │
│ Charlie ┆ 35  ┆ Chicago     ┆ 1         │
└─────────┴─────┴─────────────┴───────────┘

Using the repeat_by and explode functions:

df.select(pl.all().repeat_by('replicate').explode())

Result:

shape: (6, 4)
┌─────────┬─────┬─────────────┬───────────┐
│ name    ┆ age ┆ city        ┆ replicate │
│ ---     ┆ --- ┆ ---         ┆ ---       │
│ str     ┆ i64 ┆ str         ┆ i64       │
╞═════════╪═════╪═════════════╪═══════════╡
│ Alice   ┆ 25  ┆ New York    ┆ 2         │
│ Alice   ┆ 25  ┆ New York    ┆ 2         │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 3         │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 3         │
│ Bob     ┆ 30  ┆ Los Angeles ┆ 3         │
│ Charlie ┆ 35  ┆ Chicago     ┆ 1         │
└─────────┴─────┴─────────────┴───────────┘

I need help finding a way to achieve the same result in dbt with Redshift.


Solution

  • I used JOIN and a CROSS JOIN since you mentioned generate_series is not supported.

    I tested this in Snowflake , since I dont have redshift. I guess the syntax might not be the exact same, but still should work.

    let me know if it works for you.

    SET tbl_count = (SELECT max(replicate) FROM my_table);
    
    WITH numbers AS (
        SELECT 
            row_number() OVER ( order by 1) AS num
        FROM 
            (SELECT 1 FROM my_table LIMIT $tbl_count ) )
            ,
    repeated_data AS (
        SELECT 
            t.name,
            t.age,
            t.city,
            t.replicate
        FROM 
            my_table t
        CROSS JOIN numbers
        WHERE numbers.num <= t.replicate
    )
    SELECT 
        name, 
        age, 
        city,
        replicate
    
    FROM 
        repeated_data
    ORDER BY name, age;
    

    Trying with Recursion since using Limit variable is tricky in Redshift

    WITH RECURSIVE numbers AS (
        -- Base case: start from 1
        SELECT 1 AS num
        UNION ALL
        -- Recursive case: increment the number by 1 each time
        SELECT num + 1
        FROM numbers
        WHERE num < (SELECT max(replicate) FROM my_table) -- Stop when num reaches max(replicate)
    ),
    repeated_data AS (
        SELECT 
            t.name,
            t.age,
            t.city,
            t.replicate
        FROM 
            my_table t
        CROSS JOIN numbers
        WHERE numbers.num <= t.replicate
    )
    SELECT 
        name, 
        age, 
        city,
        replicate
    FROM 
        repeated_data
    ORDER BY 
        name, age;
    

    Output :

    enter image description here