I'm new to Spark, and as far as I understand, Spark executes code written with the Structured APIs (DataFrames, Datasets, and SQL) in parallel on the worker nodes.
Does the same apply to code written outside the Structured APIs? For example, I have this code, which renames files in an S3 bucket:
public static boolean renameAwsFolder(String sourceBucketName, String destinationBucketName) {
    System.out.println("Invoked");
    boolean result = false;
    try {
        AmazonS3 s3client = getAmazonS3ClientObject();
        List<S3ObjectSummary> fileList = s3client.listObjects(sourceBucketName).getObjectSummaries();
        // some metadata to create empty folders start
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(0);
        for (S3ObjectSummary file : fileList) {
            String key = file.getKey();
            if (key.contains("sitemap/part-")) {
                System.out.println("Key : " + key);
                String oldFileName = key.substring(key.indexOf("-") + 1); // part-xxxx
                String newFileName = "sitemap_" + oldFileName.replaceFirst("^0+(?!$)", "") + ".xml";
                String destinationKeyName = "output/" + newFileName;
                CopyObjectRequest copyObjRequest =
                        new CopyObjectRequest(sourceBucketName, file.getKey(), destinationBucketName,
                                destinationKeyName);
                s3client.copyObject(copyObjRequest);
            }
        }
        result = true; // only report success if every copy went through
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}
What I don't understand is: will this function be executed in parallel on multiple workers?
Thanks in advance.
The code you provided doesn't use the Spark API at all (neither structured nor unstructured): it makes direct AWS S3 calls via the AWS SDK. It will therefore be executed as ordinary Java code, on a single thread in the driver JVM, not in parallel on multiple workers. Spark only distributes work that goes through its APIs, i.e. transformations and actions on RDDs, DataFrames, or Datasets; plain Java statements in your program run sequentially on the driver.
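If you actually want the copies to run in parallel on the workers, you have to route the per-key work through a Spark API. Below is a minimal sketch using the low-level RDD API. The bucket names are placeholders, and the SparkConf setup and the AmazonS3ClientBuilder.defaultClient() calls are illustrative assumptions, not something from your code:

import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CopyObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class DistributedRename {
    public static void main(String[] args) {
        String sourceBucketName = "source-bucket";            // placeholder
        String destinationBucketName = "destination-bucket";  // placeholder

        SparkConf conf = new SparkConf().setAppName("s3-rename");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // The listing still runs on the driver, single-threaded, like your code.
            AmazonS3 driverClient = AmazonS3ClientBuilder.defaultClient();
            List<String> keys = driverClient.listObjects(sourceBucketName)
                    .getObjectSummaries().stream()
                    .map(S3ObjectSummary::getKey)
                    .filter(key -> key.contains("sitemap/part-"))
                    .collect(Collectors.toList());

            // parallelize() turns the key list into an RDD; foreachPartition runs
            // the copy loop on the executors, one S3 client per partition.
            sc.parallelize(keys).foreachPartition(partition -> {
                AmazonS3 s3client = AmazonS3ClientBuilder.defaultClient();
                while (partition.hasNext()) {
                    String key = partition.next();
                    String oldFileName = key.substring(key.indexOf("-") + 1);
                    String newFileName =
                            "sitemap_" + oldFileName.replaceFirst("^0+(?!$)", "") + ".xml";
                    s3client.copyObject(new CopyObjectRequest(
                            sourceBucketName, key, destinationBucketName,
                            "output/" + newFileName));
                }
            });
        }
    }
}

Two caveats: listObjects only returns the first page of results (up to 1000 keys), so a large bucket would need pagination via listNextBatchOfObjects; and the S3 client is built inside foreachPartition because SDK clients are not serializable and cannot be captured from the driver.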