Tags: java, scala, apache-spark, scala-java-interop

Interoperability: sharing Datasets of objects or Rows between Java and Scala, in both directions. I put a Scala Dataset operation in the middle of Java ones


Currently, my main application is built with Java Spring Boot, and this won't change because it's convenient.
@Autowired service beans implement, for example:

Many use-case functions are calls of this kind:

What are associations(year=2020)?

My application forwards this to datasetAssociation(2020), which operates on the enterprises and establishments datasets, and on the cities and local authorities ones, to provide a useful result.

Many people have recommended that I take advantage of Scala's capabilities.

For this, I'm considering an operation that involves several other operations between datasets.

Here is the operation I have to do, in terms of the datasets reached/involved:
associations.enterprises.establishments.cities.localautorities

Will I be able to write the middle part (enterprises.establishments) in Scala? This means that:

  1. A Dataset<Row> built with Java code is sent to a Scala function to be completed.

  2. Scala creates a new dataset with Enterprise and Establishment objects.
    a) If the source of an object is written in Scala, I don't have to recreate a new source for it in Java.
    b) Conversely, if the source of an object is written in Java, I don't have to recreate a new source for it in Scala.
    c) I can use a Scala object returned by this dataset directly on the Java side.

  3. Scala will have to call functions that remain implemented in Java and send them the underlying dataset it is creating (for example, to complete it with city information).

Java calls Scala methods at any time,
and Scala calls Java methods at any time too:

an operation could follow a
Java -> Scala -> Scala -> Java -> Scala -> Java -> Java
path if wished, in terms of the native language of each method called,
because I don't know in advance which parts I will find useful to port to Scala.

If these three points are met, I will consider that Java and Scala are interoperable in both directions and can benefit from one another.

But can I achieve this goal (in Spark 2.4.x, or more probably in Spark 3.0.0)?

In summary, are Java and Scala interoperable in both directions, in such a way that:


Solution

  • As Jasper-M wrote, Scala and Java code are perfectly interoperable:

    Now, as many have recommended, since Spark is a Scala library first and the Scala language is more powerful than Java (*), writing Spark code in Scala will be much easier. You will also find many more code examples in Scala; it is often difficult to find Java examples for complex Dataset manipulation.

    So, I think the two main issues you should take care of are:

    1. (Not Spark related, but necessary.) Have a project that compiles both languages and allows two-way interoperability. I think sbt provides this out of the box; with Maven you need to use the Scala plugin and (in my experience) put both Java and Scala files in the java folder. Otherwise one language can call the other but not the reverse (Scala can call Java but Java cannot call Scala, or the other way around).
    2. Be careful about which encoder is used each time you create a typed Dataset (i.e. Dataset[YourClass] and not Dataset<Row>). In Java, for Java model classes, you need to pass Encoders.bean(YourClass.class) explicitly. In Scala, Spark finds the encoder implicitly by default, and those implicit encoders are built for Scala case classes ("Product types") and Scala standard collections. So just be mindful of which encoders are used. For example, if you create a Dataset of YourJavaClass in Scala, I think you will probably have to pass Encoders.bean(classOf[YourJavaClass]) explicitly for it to work without serialization issues.
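
To make the encoder point concrete, here is a minimal sketch in Scala, not a definitive implementation. All names in it (Enterprise, Establishment, EnterpriseOps, and the columns siren, name, siret, cityCode) are hypothetical and not taken from the question. Establishment is written as a bean-style Scala class only to keep the block self-contained; it stands in for a model class whose source is Java.

```scala
import java.util.function.{Function => JFunction}
import scala.beans.BeanProperty
import org.apache.spark.sql.{Dataset, Encoder, Encoders, Row}

// Hypothetical stand-in for a Java model class: public no-arg constructor
// plus getters and setters, which is the shape Encoders.bean expects.
class Establishment extends Serializable {
  @BeanProperty var siret: String = _
  @BeanProperty var cityCode: String = _
}

// Hypothetical Scala model: for a case class, Spark can derive the encoder
// implicitly once the session implicits are imported.
case class Enterprise(siren: String, name: String)

object EnterpriseOps {

  // Receives a Dataset[Row] built elsewhere (for example in Java) and
  // returns a typed Dataset using the implicit case-class encoder.
  // Assumes the input has string columns "siren" and "name".
  def enterprises(rows: Dataset[Row]): Dataset[Enterprise] = {
    import rows.sparkSession.implicits._
    rows.select("siren", "name").as[Enterprise]
  }

  // Same idea for a bean-style class: no implicit encoder is available,
  // so the bean encoder is passed explicitly.
  // Assumes the input has string columns "siret" and "cityCode".
  def establishments(rows: Dataset[Row]): Dataset[Establishment] = {
    val encoder: Encoder[Establishment] = Encoders.bean(classOf[Establishment])
    rows.select("siret", "cityCode").as[Establishment](encoder)
  }

  // Scala handing the dataset back to a function that stays implemented
  // in Java, received here as a plain java.util.function.Function.
  def withCities(
      rows: Dataset[Row],
      addCities: JFunction[Dataset[Row], Dataset[Row]]): Dataset[Row] =
    addCities.apply(rows)
}
```

From Java, the methods of a top-level Scala object like this one should be reachable as static calls, for example EnterpriseOps.establishments(rows), and the argument to withCities can be a lambda wrapping any Java method that takes and returns a Dataset<Row>. Fields of the case class are read from Java through accessor methods such as siren().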

    One last note: you wrote that you use Java Spring Boot. So


    (*) About "Scala being more powerful than Java": I don't mean that Scala is better than Java (well, I do think so, but it is a matter of taste :). What I mean is that the Scala language is much more expressive than Java. Basically, it does more with less code. The main differences are: