Tags: airflow, workflow, directed-acyclic-graphs, flyte

How is Flyte tailored to "Data and Machine Learning"?


https://flyte.org/ says that it is

The Workflow Automation Platform for Complex, Mission-Critical Data and Machine Learning Processes at Scale

I went through quite a bit of the documentation and I fail to see why it is specifically for "Data and Machine Learning". It seems to me that it is a workflow manager on top of a container orchestrator (here Kubernetes), where "workflow manager" means that I can define a directed acyclic graph (DAG), the DAG nodes are deployed as containers, and the DAG is run.

Of course this is useful and important for "Data and Machine Learning", but I might as well use it for any other microservice DAG. Features and details aside, how is this different from https://airflow.apache.org or other workflow managers (of which there are many)? There are even more specialized workflow managers for "Data and Machine Learning", e.g., https://spark.apache.org.

What should I keep in mind as a Software Architect?


Solution

  • That is a great question. You are right about one thing: at its core, Flyte is a serverless workflow orchestrator (serverless because it brings up the infrastructure to run your code itself). And yes, it can be used in many other situations, though it may not be the best tool for some of them, such as microservice orchestration.

    But what really makes it good for ML & data orchestration is a combination of:

    1. Features (listed below)
    2. Integrations (listed below)
    3. The community of folks using it
    4. The roadmap

    Features

    1. Long-running tasks: Flyte is designed for extremely long-running tasks. Tasks can run for days or weeks; even if the control plane goes down, you will not lose work, and you can keep deploying without impacting running executions.
    2. Versioning: allows multiple users to work independently on the same workflow, using different libraries, models, inputs, etc.
    3. Memoization: take a pipeline with 10 steps. You can memoize the first 9, and if the 10th fails, or you modify the 10th, a re-run reuses the cached results of the previous 9. This leads to drastically faster iteration.
    4. Strong typing with ML-specific type support: Flyte understands dataframes and can translate between spark.DataFrame, pandas.DataFrame, Modin, Polars, etc. without the user having to think about how to do it efficiently. It also supports tensors (correctly serialized), NumPy arrays, and more. Models can be saved and retrieved from past executions, so Flyte is in fact the store of truth for models.
    5. Native support for intra-task checkpointing: this helps in recovering model training from node failures, and even across executions. Support for checkpointing callbacks is being added.
    6. Flyte Decks: a way to visualize metrics such as ROC curves, or to automatically visualize the distribution of the data input to a task.
    7. An extensible programming interface that can orchestrate distributed jobs or run them locally, e.g., Spark, MPI, SageMaker.
    8. Reference tasks for library isolation.
    9. A scheduler independent of user code.
    10. Understanding of resources such as GPUs: Flyte can automatically schedule work on GPUs and/or spot machines, with smart handling of spot instances (the first n-1 retries run on spot; the last retry is automatically moved to an on-demand machine for better guarantees).
    11. Map tasks and dynamic tasks: map over a list (e.g., of regions); dynamic tasks create new static graphs based on inputs, dynamically.
    12. Multiple launch plans: schedule two runs of a workflow with slightly different hyperparameters or model values, etc.
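    In flytekit, the memoization in point 3 is expressed by decorating a task with caching enabled and a cache version; the stdlib sketch below only illustrates the underlying idea (all names here are hypothetical, not Flyte API): results are keyed on the task name, a version string, and the inputs, so changing only the last step of a pipeline reuses everything before it.

```python
import hashlib
import json

# Illustrative stand-in for Flyte's memoization: cached results are keyed on
# (task name, cache_version, inputs), so bumping cache_version or changing
# the inputs invalidates the cache for that step only.
_cache: dict = {}

def run_cached(name: str, cache_version: str, fn, **inputs):
    key = hashlib.sha256(
        json.dumps([name, cache_version, inputs], sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = fn(**inputs)
    return _cache[key]

def prepare(x: int) -> int:
    # stand-in for the expensive, already-validated steps 1..9
    return x * 2

def train(x: int) -> int:
    # the step being iterated on; left uncached so edits take effect
    return x + 1

# First run executes both steps; after editing only train() and re-running,
# prepare()'s cached result is reused.
result = train(run_cached("prepare", "1.0", prepare, x=3))
```

    The same keying is why versioning (point 2) composes with caching: two users on different cache versions never clobber each other's results.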
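    Points 11 and 12 correspond to flytekit's map-task and dynamic-workflow constructs; as a plain-Python illustration of the fan-out idea (hypothetical helper names, not the Flyte API):

```python
# Toy fan-out: "map" runs one task per element of a list; "dynamic" builds
# the task list from an input at run time, which is what lets the shape of
# the graph depend on the data.
def area_stats(region: str) -> int:
    return len(region)  # stand-in for real per-region work

def map_over(task, items):
    return [task(item) for item in items]

def dynamic_plan(regions):
    # the set of tasks is only known once `regions` arrives
    return map_over(area_stats, [r.upper() for r in regions])
```

    In real Flyte each mapped element runs as its own container, so the fan-out scales out instead of looping in one process.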

    For Admins

    1. For really long-running tasks, admins can deploy the management layer without killing running tasks
    2. Support for spot/ARM/GPU machines (with different versions, etc.)
    3. Quotas and throttles per project/domain
    4. Upgrade the infrastructure without upgrading user libraries

    Integrations

    1. Native pandas dataframe support
    2. Spark
    3. MPI jobs (gang-scheduled)
    4. Pandera / Great Expectations for data quality
    5. SageMaker
    6. Easy deployment of models for serving
    7. Polars / Modin / Spark dataframes
    8. Tensors, checkpointing, etc., with many more on the roadmap
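    The dataframe integrations above (and point 4 in Features) rest on Flyte's type engine, which registers converters between formats so user code never translates dataframes by hand. The stdlib sketch below is a toy illustration of that registry pattern, not flytekit's actual type-transformer API:

```python
# Conceptual sketch: converters are registered per (source, target) format
# pair, and translate() looks up the right one. Flyte does this for real
# dataframe flavors (pandas, Polars, Spark, ...); here "rows" and "columns"
# are toy formats standing in for them.
_converters = {}

def register(src: str, dst: str):
    def wrap(fn):
        _converters[(src, dst)] = fn
        return fn
    return wrap

@register("rows", "columns")
def rows_to_columns(rows):
    # list-of-dicts -> dict-of-lists, a stand-in for e.g.
    # pandas.DataFrame -> polars.DataFrame translation
    cols = {}
    for row in rows:
        for key, value in row.items():
            cols.setdefault(key, []).append(value)
    return cols

def translate(value, src: str, dst: str):
    if src == dst:
        return value
    return _converters[(src, dst)](value)
```

    Because the registry is keyed by type pair, adding support for a new dataframe library is just registering another converter; that is the extension point the integrations list keeps growing through.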

    Community

    A community focused on ML-specific features

    Roadmap

    1. CD4ML, with human-in-the-loop and external-signal-based workflows. This will allow users to automate deployment of models or perform human-in-the-loop labeling, etc.
    2. Support for Ray/Spark/Dask cluster re-use across tasks
    3. Integration with whylogs and other monitoring tools
    4. Integration with MLflow, etc.
    5. More native Flyte Decks renderers

    Hopefully this answers your questions. Please also join the Slack community, help spread this information, and ask more questions!