java scala apache-spark sftp databricks

How to make this SpringML (or other) Spark SFTP Server connector work?


I am trying to build a Spark job that connects to an SFTP server and drops a CSV file there. The option I found was to use the SpringML package. However, I keep getting a NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;. This error was once described in an issue related to the package.

It is not exactly clear to me how the issue was resolved or whether I am simply making a user error. I use the newest version of the SpringML package (1.0.3) on Databricks and installed it from Maven through "Create Library".

The code I use looks as follows:

// Read a sample DataFrame from a table
val df = sqlContext.sql("SELECT * FROM default.some_test_data")

// Write the sample data to an SFTP server
df.write.format("com.springml.spark.sftp")
  .option("host", "SFTP_HOST")
  .option("username", "SFTP_USER")
  .option("password", "password")
  .option("fileType", "fileType") // placeholder, e.g. "csv"
  .save("/some_test_data.csv")

I am happy with any working example, even if it uses another open source package. Also, feel free to point out what I am missing.

Note: My Spark Version is 3.1.2, Scala 2.12


Solution

  • A little late, but here is a script that does it without SpringML.

    First, create an init script that installs and configures redsocks to redirect network traffic through a proxy. Here is the step-by-step breakdown:

    Shebang and Error Handling:

    #!/bin/bash
    set -euo pipefail
    

    Install redsocks:

    # Install redsocks
    apt-get update && apt-get install -y redsocks
    

    Configure iptables Chains:

    # One NAT chain per proxy type; the names must match the chains
    # referenced by the REDIRECT and OUTPUT rules further down.
    IPTABLES_CHAINS=('your-company-socks' 'your-company-http')
    for chain in "${IPTABLES_CHAINS[@]}"
    do
        iptables -t nat -N "$chain"
        # Leave traffic to reserved and private ranges untouched
        iptables -t nat -A "$chain" -d 0.0.0.0/8 -j RETURN
        iptables -t nat -A "$chain" -d 10.0.0.0/8 -j RETURN
        iptables -t nat -A "$chain" -d 100.64.0.0/10 -j RETURN
        iptables -t nat -A "$chain" -d 127.0.0.0/8 -j RETURN
        iptables -t nat -A "$chain" -d 169.254.0.0/16 -j RETURN
        iptables -t nat -A "$chain" -d 172.16.0.0/12 -j RETURN
        iptables -t nat -A "$chain" -d 192.168.0.0/16 -j RETURN
        iptables -t nat -A "$chain" -d 198.18.0.0/15 -j RETURN
        iptables -t nat -A "$chain" -d 224.0.0.0/4 -j RETURN
        iptables -t nat -A "$chain" -d 240.0.0.0/4 -j RETURN
    done
    

    The ranges are the reserved IP address blocks listed on Wikipedia. Traffic to any of them returns from the chain and is not redirected to the proxy.
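
    To see what the RETURN rules mean in practice, it can help to check whether a destination falls inside one of the ranges above, because traffic to such an address leaves the chain before the REDIRECT rule and is never proxied. The following is a minimal sketch in plain shell arithmetic and is not part of the original init script. The function names ip_to_int, in_cidr and bypasses_proxy are made up for illustration, and the sketch only handles IPv4 dotted-quad addresses without leading zeros.

    ```shell
    # Convert a dotted-quad IPv4 address to a 32-bit integer.
    ip_to_int() {
        old_ifs=$IFS
        IFS=.
        set -- $1          # split the address on dots into $1..$4
        IFS=$old_ifs
        echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
    }

    # Succeed (exit status 0) if address $1 lies inside CIDR block $2.
    in_cidr() {
        addr=$(ip_to_int "$1")
        net=$(ip_to_int "${2%/*}")     # part before the slash
        bits=${2#*/}                   # prefix length after the slash
        mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
        [ $(( $addr & $mask )) -eq $(( $net & $mask )) ]
    }

    # Succeed if address $1 matches one of the RETURN ranges from the chain setup.
    bypasses_proxy() {
        for range in 0.0.0.0/8 10.0.0.0/8 100.64.0.0/10 127.0.0.0/8 \
                     169.254.0.0/16 172.16.0.0/12 192.168.0.0/16 \
                     198.18.0.0/15 224.0.0.0/4 240.0.0.0/4
        do
            if in_cidr "$1" "$range"; then
                return 0
            fi
        done
        return 1
    }
    ```

    For example, `bypasses_proxy 10.1.2.3 && echo direct` prints `direct`, so an SFTP server that resolves to a private address would not be sent through redsocks by these chains.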

    Redirect Traffic to redsocks:

    # The ports must match the local_port values in /etc/redsocks.conf
    iptables -t nat -A your-company-socks -p tcp -j REDIRECT --to-ports your-company-socks-port
    iptables -t nat -A your-company-http -p tcp -j REDIRECT --to-ports your-company-http-port
    

    Configure redsocks:

    cat <<EOT >> /etc/redsocks.conf
    redsocks {
        local_ip = your-company-socks-ip;
        local_port = your-company-socks-port;
        
        ip = socks-proxy.example.com;
        port = your-company-remote-socks-port;
        type = socks5;
    }
    redsocks {
        local_ip = your-company-http-ip;
        local_port = your-company-http-port;
        
        ip = http-proxy.example.com;
        port = your-company-remote-proxy-port;
        type = http-connect;
    }
    EOT
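
    Because the heredoc appends with >>, running the script a second time on the same machine would add both redsocks sections again. One possible guard, sketched here under the assumption that a `//` comment line is acceptable in redsocks.conf, is a small helper that appends only when a marker line is missing. The helper name append_once and the marker text are hypothetical and not part of redsocks or the original script.

    ```shell
    # Append the text read from stdin to file $1, unless marker line $2
    # is already present in that file.
    append_once() {
        target=$1
        marker=$2
        if [ -f "$target" ] && grep -qF "$marker" "$target"; then
            cat > /dev/null    # drain stdin so the caller's heredoc is consumed
            return 0
        fi
        # Write the marker first so a later run can detect this block.
        { echo "$marker"; cat; } >> "$target"
    }
    ```

    It could be called as `append_once /etc/redsocks.conf '// added-by-init-script' <<EOT`, followed by the same two redsocks sections and the closing EOT, in place of the `cat <<EOT >>` line.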
    

    Start redsocks:

    systemctl start redsocks
    

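    Finally, send traffic destined for the SFTP host through the SOCKS chain (replace #{SftpHost}# with the address of your server):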

    iptables -t nat -A OUTPUT -p tcp -d #{SftpHost}# -j your-company-socks
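
    With the redirect in place, the upload itself still has to be done by some SFTP client. As a rough sketch, assuming the OpenSSH sftp client is available on the driver and the server accepts key-based login (batch mode does not prompt for a password), the file could be pushed like this. The function names and all paths are hypothetical.

    ```shell
    # Build an sftp batch script that uploads one file and then quits.
    # $1 = local file, $2 = remote destination path.
    build_sftp_batch() {
        printf 'put "%s" "%s"\n' "$1" "$2"
        printf 'bye\n'
    }

    # Upload one file using key-based authentication.
    # $1 = user@host, $2 = private key file, $3 = local file, $4 = remote path.
    upload_via_sftp() {
        # "-b -" makes sftp read its batch commands from standard input.
        build_sftp_batch "$3" "$4" | sftp -i "$2" -b - "$1"
    }
    ```

    A call might look like `upload_via_sftp SFTP_USER@SFTP_HOST /path/to/private_key /path/to/some_test_data.csv /some_test_data.csv`. Note that Spark writes a directory of part files rather than a single CSV, so the local file here would be one of those part files (for instance after a `coalesce(1)`).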