R/SQL: Finding Out If Table Contains A Certain Value

I am working Netezza SQL.

Suppose I have 3 tables located on a server (e.g. in real life there could be many such tables and each table with many rows and many columns):

table1: col1,col2,col3
table2: col4,col5,col6
table3: col7,col8, col9
etc.

My Question: I am trying to find out which columns from which tables contain values %LIKE% "ABC" (i.e. if the column has a row that contains a value %LIKE% "ABC")

I thought that maybe using R/SQL together might be a better way to solve this problem because I don't think there is a way to do this exclusively using Netezza SQL (perhaps this is possible with a dynamic query, but this looks very difficult.

Step 1: I have this information in a table already (created using the built-in information_schema.column table within Netezza):

# I re-created this for the stackoverflow example

table_name = c("table1","table1", "table1", "table2","table2", "table2", "table3", "table3", "table3")
col_name = c("col1","col2", "col3", "col4", "col5", "col6","col7", "col8", "col9")

summary = data.frame(table_name, col_name)

  table_name col_name
1     table1     col1
2     table1     col2
3     table1     col3
4     table2     col4
5     table2     col5
6     table2     col6
7     table3     col7
8     table3     col8
9     table3     col9

Step 2: I created a column of SQL queries that I will be sending:

# generate a column of SQL queries to be sent
summary$queries = SQL(paste0("select * from ", summary$table_name, " where ", "summary$col_name " ," LIKE %abc%"))

table_name col_name                                                 queries
1     table1     col1 select * from table1 where summary$col_name  LIKE %abc%
2     table1     col2 select * from table1 where summary$col_name  LIKE %abc%
3     table1     col3 select * from table1 where summary$col_name  LIKE %abc%
4     table2     col4 select * from table2 where summary$col_name  LIKE %abc%
5     table2     col5 select * from table2 where summary$col_name  LIKE %abc%
6     table2     col6 select * from table2 where summary$col_name  LIKE %abc%
7     table3     col7 select * from table3 where summary$col_name  LIKE %abc%
8     table3     col8 select * from table3 where summary$col_name  LIKE %abc%
9     table3     col9 select * from table3 where summary$col_name  LIKE %abc%

Step 3: Next, I wrote the following R code:

#import libraries
library(odbc)
library(DBI)

# connect
my_con <- dbConnect(odbc:: odbc()...)

# create for loop

my_list = list()

for (i in 1:nrow(summary)) {
  tryCatch({
    is_null_i = dbExecute(my_con, summary$queries[i])
    ifelse(is.null(is_null_i) == "FALSE", my_list[[i]] = is_null_i, NA)
  }, error = function(e) {
    # ignore error
  })
}

My Problem: Although each individual dbExecute(my_con, summary$queries[i]) seems to run, is_null_i doe not see to be created.

In the end, I am looking for an output like this:

# hypothetical output tables and columns that contain the word %ABC%:

  table_name col_name
1     table1     col1
2     table1     col3
3     table3     col7
4     table3     col8

Can someone please show me a better way to do this? Perhaps it might be possible to do this exclusively in Netezza?

Thanks!

Solution

library(tidyverse)
library(dbplyr)

We can use duckdb to create a small in-memory test setup

con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")

Let’s say we have the following three tables:

table1 <- tibble(
  col1 = c("A", "B", "aBc"),
  col2 = c("A", "B", "abc"),
  col3 = c("A", "B", "ABC")
)

table2 <- tibble(
  col4 = c("A", "B", "aBc"),
  col5 = c("A", "B", "ABC"),
  col6 = c("A", "B", "abc"),
  col7 = c("A", "B", "ABC")
)

table3 <- tibble(
  col8 = c("A", "B", "ABC"),
  col9 = c("A", "B", "abc"),
)

copy_to(con, table1)
copy_to(con, table2)
copy_to(con, table3)

And the expected output is therefore

table      column
table1     col3
table2     col5
table2     col7
table3     col8

As you know, we can list all the tables on the database “server”

DBI::dbListTables(con)
#> [1] "table1" "table2" "table3"

We could in theory use a for loop, but for these kind of operations, I prefer to keep things in a dataframe and rely on the different dplyr tools. To start we can move the list of table names to a tibble with enframe and setup the connections to the different remote tables. It can be done like this:

(remote_data <- DBI::dbListTables(con) |> 
  enframe(value = "table_name") |> 
  rowwise() |> 
  mutate(
    data = list(tbl(con, table_name))
  )
)
#> # A tibble: 3 × 3
#> # Rowwise: 
#>    name table_name data          
#>   <int> <chr>      <list>        
#> 1     1 table1     <tbl_dck_[,3]>
#> 2     2 table2     <tbl_dck_[,4]>
#> 3     3 table3     <tbl_dck_[,2]>

So now we have a column with a list of connections to the duckdb tables. Instead of manually constructing the queries, we can rely on the dbplyr package to do the translation for us. Let’s start with just one of the tables

remote_data[["data"]][[1]] |> 
  summarise(
    across(everything(), \(column) any(str_detect(column, "ABC")))
  ) |> 
  pivot_longer(everything())
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # Source:   SQL [3 x 2]
#> # Database: DuckDB 0.5.1 [unknown@Linux 5.15.90.1-microsoft-standard-WSL2:R 4.2.2/:memory:]
#>   name  value
#>   <chr> <lgl>
#> 1 col1  FALSE
#> 2 col2  FALSE
#> 3 col3  TRUE

And now, we can repeat it for all the remote tables like this

(results <- remote_data |> 
  reframe(
    table_name, 
    data |> 
      summarise(
        across(everything(), \(column) any(str_detect(column, "ABC")))
      ) |> 
      pivot_longer(
        everything(), 
        names_to = "column_name", 
        values_to = "contains_value"
      ) |> 
      collect()
  ) |> 
  ungroup()
)
#> # A tibble: 9 × 3
#>   table_name column_name contains_value
#>   <chr>      <chr>       <lgl>         
#> 1 table1     col1        FALSE         
#> 2 table1     col2        FALSE         
#> 3 table1     col3        TRUE          
#> 4 table2     col4        FALSE         
#> 5 table2     col5        TRUE          
#> 6 table2     col6        FALSE         
#> 7 table2     col7        TRUE          
#> 8 table3     col8        TRUE          
#> 9 table3     col9        FALSE

and to get the expected output, we just need to filter and select

results |> 
  filter(contains_value) |> 
  select(-contains_value)
#> # A tibble: 4 × 2
#>   table_name column_name
#>   <chr>      <chr>      
#> 1 table1     col3       
#> 2 table2     col5       
#> 3 table2     col7       
#> 4 table3     col8

^{Created on 2023-06-12 with reprex v2.0.2}