Currently, my main application is built with Java Spring-boot and this won't change because it's convenient.
@Autowired service beans implement, for example:

- enterprises and establishments: a Dataset<Enterprise> (each enterprise holding a Map of its establishments) and a Dataset<Establishment>,
- associations: a Dataset<Row>,
- cities: a Dataset<Commune> or a Dataset<Row>,
- local authorities: a Dataset<Row>.
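For illustration, here is a minimal sketch of what one of these dataset-providing beans might look like, assuming the SparkSession is itself available as a Spring bean and an Enterprise Java bean exists (the class names and the data source below are assumptions, not my actual code):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class EnterpriseService {

    @Autowired
    private SparkSession spark; // assumed to be declared as a @Bean elsewhere

    /** The dataset of enterprises, mapped onto the Enterprise Java bean. */
    public Dataset<Enterprise> enterprises() {
        return spark.read()
                .option("header", "true")
                .csv("/data/enterprises.csv")          // illustrative source
                .as(Encoders.bean(Enterprise.class));  // explicit Java bean encoder
    }
}
```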
Many use case functions are calls of this kind: "what are the associations (year=2020)?". My application forwards the call to datasetAssociation(2020), which operates on the enterprise and establishment datasets, along with the city and local authority ones, to provide a useful result.
For this, I'm considering an operation that involves several datasets. In terms of the datasets reached/involved, it looks like:

associations.enterprises.establishments.cities.localautorities
A Dataset<Row> built with Java code is sent to a Scala function to be completed. Scala creates a new dataset with Enterprise and Establishment objects.
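A minimal sketch of that hand-off, seen from the Java side. Everything here is an assumption for illustration: AssociationCompleter is a hypothetical top-level Scala object compiled in the same project; since the Scala compiler generates static forwarders for object methods, Java can call it like a plain static method.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AssociationFlow {

    /*
     * Assumed Scala counterpart, living in the same build:
     *
     *   object AssociationCompleter {
     *     // completes the Java-built rows with Enterprise / Establishment data
     *     def complete(associations: Dataset[Row]): Dataset[Row] = ???
     *   }
     */
    public static Dataset<Row> datasetAssociation(SparkSession spark, int year) {
        // Dataset<Row> built on the Java side (the source path is illustrative).
        Dataset<Row> associations = spark.read()
                .option("header", "true")
                .csv("/data/associations_" + year + ".csv");

        // Handed over to the Scala function, which returns the completed dataset.
        return AssociationCompleter.complete(associations);
    }
}
```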
a) If the source of an object is written in Scala, I don't have to recreate a new source for it in Java.
b) Conversely, if the source of an object is written in Java, I don't have to recreate a new source in Scala.
c) I can use a Scala object returned by such a dataset directly on the Java side.
Scala will have to call functions that remain implemented in Java, sending them the underlying dataset it is creating (for example, to complete it with city information).
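A minimal sketch of such a callback, assuming the city-related logic stays implemented in Java (the class, method and column names are invented for the example):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class CityEnrichment {

    private CityEnrichment() {}

    /**
     * Kept implemented in Java; completes the given rows with city information.
     * From Scala this is just an ordinary call:
     *   val enriched = CityEnrichment.withCities(building, cities)
     */
    public static Dataset<Row> withCities(Dataset<Row> building, Dataset<Row> cities) {
        // Illustrative join: assumes both datasets share a "cityCode" column.
        return building.join(cities, "cityCode");
    }
}
```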
Java calls Scala methods at any time, and Scala calls Java methods at any time too: an operation could follow a

Java -> Scala -> Scala -> Java -> Scala -> Java -> Java

path if wished, in terms of the native language of each method called, because I don't know in advance which parts I will find useful to port to Scala.
If these three points are covered, I will consider that Java and Scala are truly interoperable both ways, each benefiting from the other.
But can I achieve this goal (in Spark 2.4.x, or more probably in Spark 3.0.0)?
As Jasper-M wrote, Scala and Java code are perfectly interoperable: they both compile to .class files that are executed by the same JVM.
Now, as many have recommended, Spark being a Scala library first, and the Scala language being more powerful than Java (*), using Scala to write Spark code will be much easier. You will also find many more code examples in Scala; it is often difficult to find Java code examples for complex Dataset manipulations.
So, I think the two main issues you should be taking care of are:

- Use typed Datasets, i.e. Dataset[YourClass] and not Dataset<Row>.
- Be mindful of which encoders are used. In Java, and for Java model classes, you need to use Encoders.bean(YourClass.class) explicitly. In Scala, by default, Spark finds the encoder implicitly, and encoders are built in for Scala case classes ("product types") and the Scala standard collections. For example, if you create a Dataset of YourJavaClass in Scala, I think you will probably have to pass Encoders.bean(YourJavaClass.class) explicitly for it to work and avoid serialization issues (see the sketch after this list).
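A minimal, self-contained Java illustration of that point (the Enterprise bean and the local master are only there to make the sketch runnable):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class EncoderExample {

    // A plain Java bean ("YourClass"): no-arg constructor plus getters/setters,
    // which is what Encoders.bean relies on.
    public static class Enterprise {
        private String siren;
        public Enterprise() {}
        public Enterprise(String siren) { this.siren = siren; }
        public String getSiren() { return siren; }
        public void setSiren(String siren) { this.siren = siren; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("encoders").getOrCreate();

        List<Enterprise> data = Arrays.asList(new Enterprise("123456789"));

        // In Java the encoder must be given explicitly:
        Dataset<Enterprise> ds = spark.createDataset(data, Encoders.bean(Enterprise.class));
        ds.show();

        // In Scala, `import spark.implicits._` provides encoders implicitly, but only
        // for case classes, product types and standard collections; for a Java bean
        // like Enterprise you would still pass Encoders.bean(classOf[Enterprise]).
        spark.stop();
    }
}
```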
One last note: you wrote that you use Java Spring Boot. Be careful not to reference Spring beans (or anything that pulls in the Spring context) from inside the closures you pass to Spark operations such as rdd.map: this will attempt to create the Spring context in each worker, which is very slow and can easily fail.
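As an illustration of that pitfall (CommuneService, LabelingService and their methods are invented for this sketch; the same applies whether the calling code is Java or Scala):

```java
import java.util.Map;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.springframework.stereotype.Service;

interface CommuneService {                        // stand-in for a real Spring bean
    String label(String communeCode);
    Map<String, String> labelsByCode();
}

@Service
class LabelingService {

    private final CommuneService communeService;  // injected Spring bean

    LabelingService(CommuneService communeService) {
        this.communeService = communeService;
    }

    Dataset<String> wrong(Dataset<Row> rows) {
        // BAD: the lambda reads a field, so it captures `this` and Spark tries to
        // serialize the whole bean (and the Spring wiring behind it) to every worker.
        return rows.map(
                (MapFunction<Row, String>) r -> communeService.label(r.getString(0)),
                Encoders.STRING());
    }

    Dataset<String> better(Dataset<Row> rows) {
        // BETTER: extract plain, serializable data on the driver first,
        // so only this small Map is shipped inside the closure.
        Map<String, String> labels = communeService.labelsByCode();
        return rows.map(
                (MapFunction<Row, String>) r -> labels.get(r.getString(0)),
                Encoders.STRING());
    }
}
```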
(*) About "Scala being more powerful than Java": I don't mean that Scala is better than Java (well, I do think so, but it is a matter of taste :). What I mean is that the Scala language provides much more expressiveness than Java; basically, it does more with less code. The main differences are: