I am trying to build a Spark job that connects to an SFTP server and drops a CSV file there. The option I found was to use the SpringML package. However, I keep getting a NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;. This was once described in an issue related to the package.
It is not clear to me how that issue was resolved, or whether I am simply making a user error. I use the newest version of the SpringML package (1.0.3) on Databricks and installed it from Maven through the "Create Library" dialog.
The code I use looks as follows:
// Read sample dataframe from table
val df = sqlContext.sql("SELECT * FROM default.some_test_data")
// Write sample data to an SFTP server
df.write.format("com.springml.spark.sftp")
  .option("host", "SFTP_HOST")
  .option("username", "SFTP_USER")
  .option("password", "password")
  .option("fileType", "csv")
  .save("/some_test_data.csv")
I am happy with any working example, even if it uses another open source package. Also, feel free to point out what I am missing.
Note: My Spark Version is 3.1.2, Scala 2.12
A little late, but here is a script showing how to do it without SpringML.
First, create an init script that sets up and configures redsocks to redirect network traffic through a proxy. Here is the step-by-step breakdown:
Shebang and Error Handling:
#!/bin/bash
set -euo pipefail
Install redsocks:
# Install redsocks
apt-get update && apt-get install -y redsocks
Configure iptables Chains:
# Chain names must match the ones used in the REDIRECT and OUTPUT rules further down
IPTABLES_CHAINS=('your-company-socks' 'your-company-http')
for chain in "${IPTABLES_CHAINS[@]}"
do
    # Create the chain, then let reserved/private destinations bypass it
    iptables -t nat -N "$chain"
    iptables -t nat -A "$chain" -d 0.0.0.0/8 -j RETURN
    iptables -t nat -A "$chain" -d 10.0.0.0/8 -j RETURN
    iptables -t nat -A "$chain" -d 100.64.0.0/10 -j RETURN
    iptables -t nat -A "$chain" -d 127.0.0.0/8 -j RETURN
    iptables -t nat -A "$chain" -d 169.254.0.0/16 -j RETURN
    iptables -t nat -A "$chain" -d 172.16.0.0/12 -j RETURN
    iptables -t nat -A "$chain" -d 192.168.0.0/16 -j RETURN
    iptables -t nat -A "$chain" -d 198.18.0.0/15 -j RETURN
    iptables -t nat -A "$chain" -d 224.0.0.0/4 -j RETURN
    iptables -t nat -A "$chain" -d 240.0.0.0/4 -j RETURN
done
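These RETURN rules keep traffic to reserved and private IPv4 ranges out of the proxy. The list follows the Wikipedia reference on reserved IP addresses.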
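If you want to sanity-check which destinations those RETURN rules would exempt before touching iptables, something like the following pure-bash sketch might help. The function names (ip_to_int, in_cidr, bypasses_proxy) are made up for illustration, and it only handles plain dotted-quad IPv4 addresses.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the init script itself.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if the address ($1) lies inside the CIDR block ($2).
in_cidr() {
    local ip net bits mask
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    (( (ip & mask) == (net & mask) ))
}

# The same ranges that the RETURN rules exclude from redirection.
EXCLUDED_RANGES=(0.0.0.0/8 10.0.0.0/8 100.64.0.0/10 127.0.0.0/8
                 169.254.0.0/16 172.16.0.0/12 192.168.0.0/16
                 198.18.0.0/15 224.0.0.0/4 240.0.0.0/4)

# Return 0 if traffic to the address would skip the redirect.
bypasses_proxy() {
    local range
    for range in "${EXCLUDED_RANGES[@]}"; do
        if in_cidr "$1" "$range"; then
            return 0
        fi
    done
    return 1
}

# Example: report how two sample addresses would be treated.
for ip in 10.1.2.3 8.8.8.8; do
    if bypasses_proxy "$ip"; then
        echo "$ip: excluded from redirection"
    else
        echo "$ip: not excluded"
    fi
done
```

If your SFTP server sits inside one of the excluded ranges, the redirect will never apply to it, so this is worth checking first.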
Redirect Traffic to redsocks:
iptables -t nat -A your-company-socks -p tcp -j REDIRECT --to-ports port-number-here
iptables -t nat -A your-company-http -p tcp -j REDIRECT --to-ports port-number-here
Configure redsocks:
cat <<EOT >> /etc/redsocks.conf
redsocks {
local_ip = your-company-socks-ip;
local_port = your-company-socks-port;
ip = socks-proxy.example.com;
port = your-company-remote-socks-port;
type = socks5;
}
redsocks {
local_ip = your-company-http-ip;
local_port = your-company-http-port;
ip = http-proxy.example.com;
port = your-company-remote-proxy-port;
type = http-connect;
}
EOT
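The two stanzas differ only in their values, so one way to avoid copy-paste mistakes is to render them from a small function. This is just a sketch: render_redsocks_block is a hypothetical helper, and the addresses and ports in the example call are placeholders rather than real endpoints. Keep in mind that local_port has to be the same port that the matching REDIRECT --to-ports rule points at, since that is where redsocks listens.

```shell
#!/bin/bash
# Hypothetical helper: print one redsocks stanza.
# Arguments: local_ip local_port proxy_host proxy_port proxy_type
render_redsocks_block() {
    cat <<EOT
redsocks {
    local_ip = $1;
    local_port = $2;
    ip = $3;
    port = $4;
    type = $5;
}
EOT
}

# Example with placeholder values; prints the stanza to stdout.
render_redsocks_block 127.0.0.1 12345 socks-proxy.example.com 1080 socks5
```

Appending the output with >> /etc/redsocks.conf would mirror what the heredoc in the init script does.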
Start redsocks:
systemctl start redsocks
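Then, to send traffic for the SFTP host through the SOCKS chain, add a rule to the OUTPUT chain (replace the #{SftpHost}# placeholder with your SFTP server):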
iptables -t nat -A OUTPUT -p tcp -d #{SftpHost}# -j your-company-socks
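Since -A appends a new rule every time it runs, re-running the script would pile up duplicates. A possible guard is to test for the rule with iptables -C first. The sketch below assumes that approach; ensure_output_redirect and the IPTABLES override variable are names I made up for illustration, not something from redsocks or Databricks.

```shell
#!/bin/bash
# Hypothetical helper: add the OUTPUT rule only if it is not already present.
# IPTABLES can be overridden (for example with a stub) for dry runs.
IPTABLES="${IPTABLES:-iptables}"

ensure_output_redirect() {
    local host="$1" chain="$2"
    # -C checks whether an identical rule exists; -A appends it otherwise.
    if ! "$IPTABLES" -t nat -C OUTPUT -p tcp -d "$host" -j "$chain" 2>/dev/null; then
        "$IPTABLES" -t nat -A OUTPUT -p tcp -d "$host" -j "$chain"
    fi
}

# Usage (needs root and an existing chain), with a placeholder address:
# ensure_output_redirect 203.0.113.10 your-company-socks
```

Note that if you pass a hostname to -d, iptables resolves it once when the rule is added, so a server whose IP changes later will need the rule to be recreated.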