ramazon-web-servicesamazon-redshift

How do I connect R to a Redshift database on Apple M3?


Intro and System

I am trying to access a Redshift database from R. I am on an Apple M3 Pro machine on Sonoma 14.5. I have R version 4.4.1. I am expecting to give the database name (e.g., "dev"), the host URL, the port number, my username, and my password; I am expecting to receive a stable connection where I can query the database.

I have tried using a few packages, described below.

Attempt 1: RPostgres

{RPostgres} is the most straightforward. I used this post and the official package documentation to help set up.

First, I installed the necessary system software by running brew install libpq. I also elsewhere saw that brew install postgresql may help, too, so I have those installed.

I installed the package using install.packages("RPostgres") as well as installing it from source using remotes::install_github("r-dbi/RPostgres") or install.packages("RPostgres", type = "source"). I then ran:

con <- dbConnect(
  RPostgres::Redshift(),
  dbname = "dev", 
  host = Sys.getenv("RS_URLS"),
  port = 5439,
  user = Sys.getenv("RS_USER"),
  password = Sys.getenv("RS_PASS"),
  sslmode = "require"
)

Where the environmental variables are the Redshift URL, username, and password for my database. No matter what I tried, I keep getting the error:

Error: connection to server at "<host>" (<ip>), port 5439 failed: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

This appears to be a known and open issue. It doesn't matter if I installed from CRAN or from source—I get the same error.

Attempt 2: RPostgreSQL

{RPostgreSQL} says it will also use libpq in the Description to the package on CRAN. Using the syntax from the example in the documentation:

drv <- dbDriver("PostgreSQL")
con <- dbConnect(
  drv,
  dbname = "dev",
  host = Sys.getenv("RS_URLS"),
  port = 5439,
  user = Sys.getenv("RS_USER"),
  password = Sys.getenv("RS_PASS")
)

Throws the error:

Error in postgresqlNewConnection(drv, ...) : 
  RPosgreSQL error: could not connect <user>@<host>:5439 on dbname "dev": FATAL:  no pg_hba.conf entry for host "???", user "<user>", database "dev", SSL off

Note on SSL that I have successfully hooked Redshift up to a database visualizer on my computer, and it required SSL to be on. Candidy, I have no clue what SSL is or means, but there isn't an option for it here (while there is in {RPostgres} but it still failed with that flagged).

I made some progress here by discovering this post. The OP was adding SSL information to their database name. I can do the same:

library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(
  drv,
  dbname = "'dbname=dev sll=require'",
  host = Sys.getenv("RS_URLS"),
  port = 5439,
  user = Sys.getenv("RS_USER"),
  password = Sys.getenv("RS_PASS")
)

Interestingly, this gets me to no error but an aborted R session (much in common with Attempt 4 using RJDBC below).

Attempt 3: odbc

Posit (fka RStudio) officially recommends using ODBC via the {odbc} package. There is comprehensive documentation:

Starting with the vignette, I ran brew install unixodbc as well as brew install psqlodbc. However, I don't think that's the driver to use since I want to use Redshift. So I followed the instructions on that Amazon guide for downloading, installing, and formatting the .ini files for the driver.

Per Amazon's recommendation, I took the odbcinst.ini file from the Setup subdirectory from the install and copied it to my home directory, prepended with . to keep it hidden. When I cat that file, it reads:

[ODBC Drivers]
Amazon Redshift=Installed

[Amazon Redshift]
Description=Amazon Redshift ODBC Driver
Driver=/opt/amazon/redshift/lib/libamazonredshiftodbc.dylib

The issue is when I run odbcinst -j in the terminal—per the setup vignette—I do not see the correct paths:

unixODBC 2.3.12
DRIVERS............: /usr/local/etc/odbcinst.ini
SYSTEM DATA SOURCES: /usr/local/etc/odbc.ini
FILE DATA SOURCES..: /usr/local/etc/ODBCDataSources
USER DATA SOURCES..: /Users/<username>/.odbc.ini
SQLULEN Size.......: 8
SQLLEN Size........: 8
SQLSETPOSIROW Size.: 8

And if I cat the first two .ini files, they're empty. The user data source .ini simply does not exist.

After reading this GitHub issue, I realized I had to change my .Renviron file to have it look for the right directory to get the correct .ini file. I updated that to: ODBCSYSINI="/Users/<username>" since Amazon suggested I move the .ini to my home directory (since it will re-write the file if one gets a new download or reinstalls the driver).

So now, into R.

If I have the {odbc} package list the drivers, it reads the correct path where I know the driver exists:

> odbc::odbcListDrivers()
             name       attribute                                                value
1    ODBC Drivers Amazon Redshift                                            Installed
2 Amazon Redshift     Description                          Amazon Redshift ODBC Driver
3 Amazon Redshift          Driver /opt/amazon/redshift/lib/libamazonredshiftodbc.dylib

I've navigated there repeatedly and know that driver is there. So when I use the suggested Redshift format, I enter:

con <- dbConnect(
  odbc(),
  Driver       = "Amazon Redshift",
  servername   = Sys.getenv("RS_URLS"),
  database     = "dev",
  UID          = Sys.getenv("RS_USER"),
  PWD          = Sys.getenv("RS_PASS"),
  Port         = 5439
)

And I receive the error:

Error: nanodbc/nanodbc.cpp:1138: 00000%0D%0A[Amazon][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function: [Amazon][Support] (50483) Could not load shared library, all attempted paths ("") failed

Googling around, I actually ended up at another Posit document—albeit for a different database—that was aimed at troubleshooting for Mac users. It suggests finding where libodbcinst.dylib is and adding it to the configuration file for the driver. For me, that is at /opt/amazon/redshift/lib/ and it's called amazon.redshiftodbc.ini. Per the link above, I added the location to libodbcinst.dylib and another line at the bottom of it:

ODBCInstLib=/usr/local/Cellar/unixodbc/2.3.12/lib/libodbcinst.dylib
DriverManagerEncoding=UTF-16

There's also another location for this file elsewhere in /usr/local/lib/ but it points to that Cellar path:

libodbcinst.dylib -> ../Cellar/unixodbc/2.3.12/lib/libodbcinst.dylib

Regardless of what path I put in the configuration file, I get the same error, but now it adds that path to the error message:

> con <- dbConnect(
+   odbc(),
+   Driver       = "Amazon Redshift",
+   servername   = Sys.getenv("RS_URLS"),
+   database     = "dev",
+   UID          = Sys.getenv("RS_USER"),
+   PWD          = Sys.getenv("RS_PASS"),
+   Port         = 5439
+ )
Error: nanodbc/nanodbc.cpp:1138: 00000
[Amazon][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function: [Amazon][Support] (50483) Could not load shared library, all attempted paths ("", "/usr/local/Cellar/unixodbc/2.3.12/lib/libodbcinst.dylib") failed 
> 

I spent a lot of time fiddling around with .ini files and where they point, etc., and can't get anything beyond that error.

My lingering worry here is that the Amazon download page says:

On Linux and macOS X operating systems, you use an ODBC driver manager to configure the ODBC connection settings. ODBC driver managers use configuration files to define and configure ODBC data sources and drivers. The ODBC driver manager that you use depends on the operating system that you use:

unixODBC driver manager (for Linux operating systems)

iODBC driver manager (for macOS X operating system)

However, the setup vignette says:

For Unix and MacOS, ODBC drivers should be compiled against unixODBC. Drivers compiled against iODBC may also work, but are not fully supported.

I'm not totally sure—it very well could be another problem with misspecified paths, despite my best efforts.

Attempt 4: RJDBC

{RJDBC} is last, because I dislike depending on Java. However, the only "official" guide from Amazon on hooking up to R uses this package. Unfortunately, it was published in 2015 and has me download what is probably an out-of-date driver.

I run this code, from the example:

install.packages("RJDBC")
library(RJDBC)

# download Amazon Redshift JDBC driver
download.file(
  'http://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC41-1.1.9.1009.jar',
  'RedshiftJDBC41-1.1.9.1009.jar'
)

# connect to Amazon Redshift
driver <- JDBC(
  "com.amazon.redshift.jdbc41.Driver",
  "RedshiftJDBC41-1.1.9.1009.jar",
  identifier.quote = "`"
)

And R immediately aborts the session. I found another guide for this package from 2017, and I run the suggested code therein... and R once again aborts the session when I get to the JDBC() call reading in the driver.

What if I don't use an older version of the driver? I download it directly from this Amazon link. I then set up the code like:

library(RJDBC)

driver <- JDBC(
  "com.amazon.redshift.jdbc42.Driver",
  "redshift-jdbc42-2.1.0.29.jar"
)

url <- sprintf(
  "jdbc:redshift://%s:%s/%s?tcpKeepAlive=true&ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory",
  Sys.getenv("RS_URLS"),
  5439,
  "dev"
)

jconn <- dbConnect(driver, url, Sys.getenv("RS_USER"), Sys.getenv("RS_PASS"))

Where I ticked the driver class up to 42 from 41. This time, I don't get RStudio aborting the R session and exploding until I run dbConnect(). I also tried getting a new name for the driver class using findDrivers() as is suggested in the documentation—but encountered the same aborted session upon running dbConnect(). I also tried including the identifier.quote parameter here—same outcome.

Summary

How do I connect R to Redshift?

(And why the hell is it this hard, when packages like {bigrquery} make it trivially simple for other databases?)


Solution

  • It was RStudio. For anyone reading this in the future, here's what you need to do to connect and have it work in RStudio. I'm assuming you have R and RStudio installed, but nothing else. I'm also assuming you have your Redshift username, password, host URL, and database name (usually something like "dev" or "prod" or "teamname").

    Downloads

    1. Download Java JDK. If you Google those words, it should take you to their download site. As of right now, the most recent is JDK 22. You click the macOS tab, and you click to download the ARM64 DMG Installer. ARM means it is not an Intel chip, but M1, M2, M3, etc. When that downloads, double-click it, open up the installer, and go through the windows it prompts you with to install it.

    2. Download Amazon JDBC Redshift driver. If you Google this, it should take you right to the download page. Click the highlighted text: JDBC 4.2–compatible driver version 2.1 and AWS SDK driver–dependent libraries to download. Unzip this download, and you should find a file that looks something like redshift-jdbc42-2.1.0.29.jar. That is the most up-to-date driver they have right now. It might be different for you in version name when you download, but it should be clear that it is a Redshift JDBC driver and end with .jar.

    File Paths

    1. Move your driver file to your home directory (or really, any other directory you like; I'm doing home for ease). To do this, open up the terminal. Click cd to make sure you're at your user's home directory. Move it using the code below. Your specific file name may be different depending on if you downloaded it to the Downloads folder, what version you're on looks like, etc. But mv will move it, the file path says what to move, and the . means to drop it where you're at, which is at your home directory since you typed cd.
    mv Downloads/redshift-jdbc42-2.1.0.29/redshift-jdbc42-2.1.0.29.jar .
    
    1. Figure out where your Java JDK lives. In the terminal, type /usr/libexec/java_home -V. It will return a list of matching Java Virtual Machines and where there paths are. The one you want should match the JDK version you just installed. For me, it is /Library/Java/JavaVirtualMachines/jdk-22.jdk/Contents/Home. Copy this path for the next step.

    Setting Up R Environment

    1. If you don't have an .Renviron file, create one in the terminal using touch .Renviron.

    2. Type open .Renviron to open it in your default text editor. (You could also use nano or whatever else file editor you want, but I find this to be easiest.)

    3. In there, paste JAVA_HOME= and then the path you copied from Step 4. For me, this looks like: JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-22.jdk/Contents/Home. This is so RStudio knows where to find Java. Without it, I was getting crashes upon trying to connect to Redshift.

    4. Also set up your username, password, and host URL in here. Inside of the <...> you should type your own credentials:

    RS_USER="<username>"
    RS_PASS="<password>"
    RS_URL="<host_url>"
    
    1. Save and close out of your .Renviron file.

    Setting Up in R

    1. Open up RStudio (if you already had it opened, exit and re-open to re-load the .Renviron file).

    2. Install the RJDBC package with install.packages("RJDBC"). This should install the dependencies of rJava, but if it doesn't, install that, too.

    3. Load RJDBC using library(RJDBC).

    4. Define the driver by specifying it's class and the path you stored it in; for me, that is:

    drv <- JDBC("com.amazon.redshift.Driver", "~/redshift-jdbc42-2.1.0.29.jar")
    

    (See the documentation for JDBC() and findDrivers() for determining the class of driver you should use as the first argument, but you can also try what I wrote if you're installing anytime near when I am.)

    1. Define your host URL using the code below. The only thing you should need to change is the database name. You can see ?sprintf if you're curious how it works. Note that this is for an SSL connection, hence the extra UTM parameters. Sys.getenv() and 5439 should be the same for you if you added your to your .Renviron file like I suggested and you are hooking up to Redshift, but you may need a different port number depending on system. The database name is what it's called in your system (e.g., prod, dev, teamname, companyname_dev, etc.)
    url <- sprintf(
      "jdbc:redshift://%s:%s/%s?tcpKeepAlive=true&ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory",
      Sys.getenv("RS_URL"),
      5439,
      "<database_name>"
    )
    
    1. Finally, connect:
    conn <- dbConnect(drv, url, Sys.getenv("RS_USER"), Sys.getenv("RS_PASS"))
    
    1. Hopefully, RStudio did not abort the session. You can now test to see if you have access by doing something like head(dbListTables(conn)), which will list the first few table names.

    Notes

    I found this out by a colleague suggesting I save some of the code above in a script and running it in the terminal using Rscript script_name.R. I put cat("If you're seeing this, it didn't explode\n") at the end to see if it made it past the dbConnect() part. Once I found out it did, I realized RStudio was the issue.

    That's where this post helped. Apparently, you just need to tell RStudio where to find the JDK via your .Renviron file.

    Unfortunately, I do not know how to get it to work with any of the other three packages, but I wanted to post the solution here.