I have been using AWS Glue Python shell jobs to build simple data ETL jobs; Spark jobs I have only used once or twice, for converting data to ORC format or running Spark SQL against JDBC data. So I am wondering: what are the best/typical use cases for each of them? Some documents say a Python shell job is suitable for simple jobs whereas Spark is for more complicated jobs; is that correct? Could you please share more of your experience on this?
Many thanks
What are the best/typical use cases for each of them? Some documents say a Python shell job is suitable for simple jobs whereas Spark is for more complicated jobs, is that correct?
AWS Glue is a managed service from AWS for developing ETL jobs quickly. IMHO, development is very fast if you already know what needs to happen in your ETL pipeline.
Glue has components for Discover, Develop, and Deploy. In the Discover part, automatic crawling (you can run or schedule a crawler repeatedly) is the feature that, in my experience, differentiates it from other tools.
Glue also has built-in integration with the rest of the AWS ecosystem (with plain Spark you have to wire that up yourself).
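For example, you can kick off and monitor a crawler from a few lines of boto3 (a minimal sketch; the crawler name "orders-raw-crawler" is just a placeholder):

```python
import boto3

glue = boto3.client("glue")

# Start the crawler that scans an S3 prefix and writes/updates
# table definitions in the Glue Data Catalog.
glue.start_crawler(Name="orders-raw-crawler")

# Poll its state (READY / RUNNING / STOPPING).
state = glue.get_crawler(Name="orders-raw-crawler")["Crawler"]["State"]
print(state)
```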
Typical use cases of AWS Glue are:
1) Loading data from data warehouses.
2) Building a data lake on Amazon S3.
See this AWS presentation for more insight.
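As an illustration of case 2 (a rough sketch only; the database, table, and bucket names such as sales_db, raw_orders, and s3://my-data-lake/ are assumptions), a Glue Spark job can read a table the crawler discovered and land it on S3 in a columnar format:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler registered in the Data Catalog (names assumed).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Write it to the data lake location on S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/orders/"},
    format="parquet",
)

job.commit()
```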
A custom Spark job can do the same things, but it has to be developed from scratch, and it has no built-in automatic crawling.
But if you develop a Spark job for ETL, you get fine-grained control for implementing complicated jobs.
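For example, the JDBC/ORC case from the question could look roughly like this in a plain PySpark job (a sketch; the connection URL, table, credentials, and output path are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-to-orc").getOrCreate()

# Pull a table over JDBC (URL, table, and credentials are placeholders).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "secret")
    .load()
)

# Apply whatever custom transformation logic you need, e.g. via Spark SQL.
orders.createOrReplaceTempView("orders")
cleaned = spark.sql("SELECT * FROM orders WHERE order_status IS NOT NULL")

# Write the result out as ORC.
cleaned.write.mode("overwrite").orc("s3://my-data-lake/orders_orc/")
```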
Both Glue and Spark have the same goal for ETL. AFAIK, Glue (and especially the Python shell job type) suits simple jobs such as loading from a source to a destination, whereas a Spark job can do a wide variety of transformations in a controlled way.
Conclusion: for simple ETL use cases (which can be done without much development experience), go with Glue; for customized ETL with many dependencies/transformations, go with a Spark job.
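For completeness, a "simple" Python shell job often needs nothing more than boto3 and pandas, along these lines (a sketch; the bucket and key names are made up):

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Read a small CSV from S3 (bucket/key are placeholders).
obj = s3.get_object(Bucket="my-raw-bucket", Key="input/customers.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# A simple, lightweight transformation.
df["email"] = df["email"].str.lower()

# Write the cleaned file back to S3.
out = io.StringIO()
df.to_csv(out, index=False)
s3.put_object(Bucket="my-clean-bucket", Key="output/customers.csv", Body=out.getvalue())
```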