I am working Netezza SQL.
Suppose I have 3 tables located on a server (e.g. in real life there could be many such tables and each table with many rows and many columns):
My Question: I am trying to find out which columns from which tables contain values %LIKE% "ABC" (i.e. if the column has a row that contains a value %LIKE% "ABC")
I thought that maybe using R/SQL together might be a better way to solve this problem because I don't think there is a way to do this exclusively using Netezza SQL (perhaps this is possible with a dynamic query, but this looks very difficult.
Step 1: I have this information in a table already (created using the built-in information_schema.column
table within Netezza):
# I re-created this for the stackoverflow example
table_name = c("table1","table1", "table1", "table2","table2", "table2", "table3", "table3", "table3")
col_name = c("col1","col2", "col3", "col4", "col5", "col6","col7", "col8", "col9")
summary = data.frame(table_name, col_name)
table_name col_name
1 table1 col1
2 table1 col2
3 table1 col3
4 table2 col4
5 table2 col5
6 table2 col6
7 table3 col7
8 table3 col8
9 table3 col9
Step 2: I created a column of SQL queries that I will be sending:
# generate a column of SQL queries to be sent
summary$queries = SQL(paste0("select * from ", summary$table_name, " where ", "summary$col_name " ," LIKE %abc%"))
table_name col_name queries
1 table1 col1 select * from table1 where summary$col_name LIKE %abc%
2 table1 col2 select * from table1 where summary$col_name LIKE %abc%
3 table1 col3 select * from table1 where summary$col_name LIKE %abc%
4 table2 col4 select * from table2 where summary$col_name LIKE %abc%
5 table2 col5 select * from table2 where summary$col_name LIKE %abc%
6 table2 col6 select * from table2 where summary$col_name LIKE %abc%
7 table3 col7 select * from table3 where summary$col_name LIKE %abc%
8 table3 col8 select * from table3 where summary$col_name LIKE %abc%
9 table3 col9 select * from table3 where summary$col_name LIKE %abc%
Step 3: Next, I wrote the following R code:
#import libraries
library(odbc)
library(DBI)
# connect
my_con <- dbConnect(odbc:: odbc()...)
# create for loop
my_list = list()
for (i in 1:nrow(summary)) {
tryCatch({
is_null_i = dbExecute(my_con, summary$queries[i])
ifelse(is.null(is_null_i) == "FALSE", my_list[[i]] = is_null_i, NA)
}, error = function(e) {
# ignore error
})
}
My Problem: Although each individual dbExecute(my_con, summary$queries[i])
seems to run, is_null_i doe not see to be created.
In the end, I am looking for an output like this:
# hypothetical output tables and columns that contain the word %ABC%:
table_name col_name
1 table1 col1
2 table1 col3
3 table3 col7
4 table3 col8
Can someone please show me a better way to do this? Perhaps it might be possible to do this exclusively in Netezza?
Thanks!
library(tidyverse)
library(dbplyr)
We can use duckdb to create a small in-memory test setup
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
Let’s say we have the following three tables:
table1 <- tibble(
col1 = c("A", "B", "aBc"),
col2 = c("A", "B", "abc"),
col3 = c("A", "B", "ABC")
)
table2 <- tibble(
col4 = c("A", "B", "aBc"),
col5 = c("A", "B", "ABC"),
col6 = c("A", "B", "abc"),
col7 = c("A", "B", "ABC")
)
table3 <- tibble(
col8 = c("A", "B", "ABC"),
col9 = c("A", "B", "abc"),
)
copy_to(con, table1)
copy_to(con, table2)
copy_to(con, table3)
And the expected output is therefore
table column
table1 col3
table2 col5
table2 col7
table3 col8
As you know, we can list all the tables on the database “server”
DBI::dbListTables(con)
#> [1] "table1" "table2" "table3"
We could in theory use a for loop, but for these kind of operations, I prefer
to keep things in a dataframe and rely on the different dplyr tools. To start
we can move the list of table names to a tibble with enframe
and setup the
connections to the different remote tables. It can be done like this:
(remote_data <- DBI::dbListTables(con) |>
enframe(value = "table_name") |>
rowwise() |>
mutate(
data = list(tbl(con, table_name))
)
)
#> # A tibble: 3 × 3
#> # Rowwise:
#> name table_name data
#> <int> <chr> <list>
#> 1 1 table1 <tbl_dck_[,3]>
#> 2 2 table2 <tbl_dck_[,4]>
#> 3 3 table3 <tbl_dck_[,2]>
So now we have a column with a list of connections to the duckdb tables.
Instead of manually constructing the queries, we can rely on the dbplyr
package to do the translation for us. Let’s start with just one of the tables
remote_data[["data"]][[1]] |>
summarise(
across(everything(), \(column) any(str_detect(column, "ABC")))
) |>
pivot_longer(everything())
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # Source: SQL [3 x 2]
#> # Database: DuckDB 0.5.1 [unknown@Linux 5.15.90.1-microsoft-standard-WSL2:R 4.2.2/:memory:]
#> name value
#> <chr> <lgl>
#> 1 col1 FALSE
#> 2 col2 FALSE
#> 3 col3 TRUE
And now, we can repeat it for all the remote tables like this
(results <- remote_data |>
reframe(
table_name,
data |>
summarise(
across(everything(), \(column) any(str_detect(column, "ABC")))
) |>
pivot_longer(
everything(),
names_to = "column_name",
values_to = "contains_value"
) |>
collect()
) |>
ungroup()
)
#> # A tibble: 9 × 3
#> table_name column_name contains_value
#> <chr> <chr> <lgl>
#> 1 table1 col1 FALSE
#> 2 table1 col2 FALSE
#> 3 table1 col3 TRUE
#> 4 table2 col4 FALSE
#> 5 table2 col5 TRUE
#> 6 table2 col6 FALSE
#> 7 table2 col7 TRUE
#> 8 table3 col8 TRUE
#> 9 table3 col9 FALSE
and to get the expected output, we just need to filter and select
results |>
filter(contains_value) |>
select(-contains_value)
#> # A tibble: 4 × 2
#> table_name column_name
#> <chr> <chr>
#> 1 table1 col3
#> 2 table2 col5
#> 3 table2 col7
#> 4 table3 col8
Created on 2023-06-12 with reprex v2.0.2