apache-spark amazon-s3 hdfs java-io edgedb

Read edge DB files from HDFS or S3 in Spark


I have a list of DB files stored in a local folder. When I run the Spark job in local mode I can provide the local path to read those files, but in client or cluster mode the path is not accessible; it seems the files need to be kept on HDFS or accessed directly from S3. I am currently doing the following:

java.io.File directory = new File(dbPath);

All of the DB files are present at dbPath. Is there a simple way to access that folder of files from HDFS or from S3, given that I am running this Spark job on AWS?


Solution

  • To my knowledge, there isn't a standard way to do this currently. But it seems you could reverse-engineer a dump-reading protocol through a close examination of how the dump is generated.

    According to edgedb-cli/dump.rs, it looks like you can open the file with a binary stream reader and skip the fixed preamble at the start of a given dump file, which appears to be a 17-byte magic string followed by an 8-byte format version (25 bytes in total), written as:

        output.write_all(
            b"\xFF\xD8\x00\x00\xD8EDGEDB\x00DUMP\x00\
              \x00\x00\x00\x00\x00\x00\x00\x01"
            ).await?;
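
    Since in your case the files live on S3 or HDFS rather than the local filesystem, you could do that skip through Hadoop's FileSystem API instead of java.io.File. The sketch below is only an illustration (DumpHeaderSkip, openAfterPreamble, and the 25-byte preamble length are assumptions derived from the snippet above, not an official reader); the same code handles both s3a:// and hdfs:// paths:

        import java.io.IOException;
        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Hypothetical helper: open a dump file from S3/HDFS and position the
        // stream just past the fixed preamble (17-byte magic + 8-byte version).
        public class DumpHeaderSkip {
            private static final long PREAMBLE_LENGTH = 17 + 8;

            public static FSDataInputStream openAfterPreamble(String dumpPath, Configuration conf)
                    throws IOException {
                // FileSystem.get resolves the scheme, so s3a://bucket/... and
                // hdfs://namenode/... are both handled the same way.
                FileSystem fs = FileSystem.get(URI.create(dumpPath), conf);
                FSDataInputStream in = fs.open(new Path(dumpPath));
                in.seek(PREAMBLE_LENGTH); // skip the magic string and format version
                return in;
            }
        }

    The same FileSystem handle can also enumerate the folder (for example via fs.listStatus(new Path(dbPath))) in place of the java.io.File listing from the question.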
    

    The rest of the dump then appears to be written to the async output stream as framed blocks; the header block, for example, is emitted via:

                header_buf.truncate(0);
                header_buf.push(b'H');
                header_buf.extend(
                    &sha1::Sha1::from(&packet.data).digest().bytes()[..]);
                header_buf.extend(
                    &(packet.data.len() as u32).to_be_bytes()[..]);
                output.write_all(&header_buf).await?;
                output.write_all(&packet.data).await?;
    

    with a SHA-1 digest of the payload. In other words, each block is framed as a one-byte type marker (b'H' here), the 20-byte SHA-1 of the payload, the payload length as a big-endian u32, and then the payload itself. Unfortunately, we're still somewhat in the dark at this point, because we don't know what the byte sequences inside the payloads actually say. You'll need to compare the undigested contents against the protocols used by asyncpg and Postgres to verify what your dump resembles.
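
    If you do go down that route, the framing itself is easy to read back from Java once the preamble has been skipped. The following is only a rough sketch based on the snippet above (DumpBlock and readFrom are made-up names, and only the b'H' block type is confirmed by the code shown here; other block types presumably use the same layout):

        import java.io.DataInputStream;
        import java.io.IOException;

        // Hypothetical reader for one framed block: a one-byte type tag,
        // a 20-byte SHA-1 of the payload, a big-endian u32 length, then the payload.
        class DumpBlock {
            final byte type;      // e.g. 'H' for the header block written above
            final byte[] sha1;    // digest of the payload, usable as an integrity check
            final byte[] payload; // undigested contents still to be deciphered

            DumpBlock(byte type, byte[] sha1, byte[] payload) {
                this.type = type;
                this.sha1 = sha1;
                this.payload = payload;
            }

            static DumpBlock readFrom(DataInputStream in) throws IOException {
                int type = in.read();
                if (type < 0) {
                    return null; // clean end of stream
                }
                byte[] sha1 = new byte[20];
                in.readFully(sha1);
                int length = in.readInt(); // DataInputStream is big-endian, matching to_be_bytes()
                byte[] payload = new byte[length];
                in.readFully(payload);
                return new DumpBlock((byte) type, sha1, payload);
            }
        }

    A java.io.DataInputStream can simply wrap the FSDataInputStream returned by the earlier sketch, so the same approach works whether the dump sits on S3 or HDFS.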

    Alternatively, you could prepare a shim around restore.rs using some pre-existing data loaders.