Tags: airflow, workflow, directed-acyclic-graphs, flyte

How is Flyte tailored to "Data and Machine Learning"?


https://flyte.org/ says that it is

The Workflow Automation Platform for Complex, Mission-Critical Data and Machine Learning Processes at Scale

I went through quite a bit of the documentation and I fail to see why it is specifically for "Data and Machine Learning". It seems to me that it is a workflow manager on top of a container orchestrator (here Kubernetes), where "workflow manager" means that I can define a directed acyclic graph (DAG), the DAG nodes are deployed as containers, and the DAG is run.

Of course this is useful and important for "Data and Machine Learning", but I might as well use it for any other microservice DAG. Features and details aside, how is this different from https://airflow.apache.org or other workflow managers (of which there are many)? There are even more specialized workflow managers for "Data and Machine Learning", e.g., https://spark.apache.org.

What should I keep in mind as a Software Architect?


Solution

  • That is a great question. You are right about one thing: at its core, Flyte is a serverless workflow orchestrator (serverless because it brings up the infrastructure to run your code itself). And yes, it can be used in many other situations, though it may not be the best tool for some of them, such as microservice orchestration.

    But what really makes it good for ML & data orchestration is a combination of:

    1. Features (listed below)
    2. Integrations (listed below)
    3. The community of folks using it
    4. The roadmap

    Features

    1. Long-running tasks: Flyte is designed for extremely long-running tasks. Tasks can run for days or weeks; even if the control plane goes down, you will not lose work, and you can keep deploying without impacting running executions.
    2. Versioning: allows multiple users to work independently on the same workflow, using different libraries, models, inputs, etc.
    3. Memoization: take a pipeline with 10 steps. You can memoize the first 9, and if the 10th fails, or you modify the 10th, a re-run reuses the cached results of the previous 9. This leads to drastically faster iteration.
    4. Strong typing with ML-specific type support: Flyte understands dataframes and can translate between spark.DataFrame, pandas.DataFrame, Modin, Polars, etc. without the user having to think about how to do it efficiently. It also supports tensors (correctly serialized), NumPy arrays, and more. Models can be saved and retrieved from past executions, so Flyte is in fact the store of truth for models.
    5. Native support for intra-task checkpointing: this helps in recovering model training from node failures, and even across executions. Support for checkpointing callbacks is being added.
    6. Flyte Decks: a way to visualize metrics such as ROC curves, or to automatically visualize the distribution of the data input to a task.
    7. An extensible programming interface that can orchestrate distributed jobs or run them locally, e.g., Spark, MPI, SageMaker.
    8. Reference tasks for library isolation.
    9. A scheduler independent of user code.
    10. Understanding of resources such as GPUs: Flyte can automatically schedule work on GPUs and/or spot machines, with smart handling of spot instances (the first n-1 retries run on spot; the last retry is automatically moved to an on-demand machine for better guarantees).
    11. Map tasks and dynamic tasks: map over a list (e.g., of regions); dynamic tasks create new static graphs based on inputs, dynamically.
    12. Multiple launch plans: schedule two runs of a workflow with slightly different hyperparameters or model values, etc.
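    In flytekit, the memoization in point 3 is expressed by decorating a task with caching enabled and a cache version; the stdlib sketch below only illustrates the underlying idea (all names here are hypothetical, not Flyte API): results are keyed on the task name, a version string, and the inputs, so changing only the last step of a pipeline reuses everything before it.

```python
import hashlib
import json

# Illustrative stand-in for Flyte's memoization: cached results are keyed on
# (task name, cache_version, inputs), so bumping cache_version or changing
# the inputs invalidates the cache for that step only.
_cache: dict = {}

def run_cached(name: str, cache_version: str, fn, **inputs):
    key = hashlib.sha256(
        json.dumps([name, cache_version, inputs], sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = fn(**inputs)
    return _cache[key]

def prepare(x: int) -> int:
    # stand-in for the expensive, already-validated steps 1..9
    return x * 2

def train(x: int) -> int:
    # the step being iterated on; left uncached so edits take effect
    return x + 1

# First run executes both steps; after editing only train() and re-running,
# prepare()'s cached result is reused.
result = train(run_cached("prepare", "1.0", prepare, x=3))
```

    The same keying is why versioning (point 2) composes with caching: two users on different cache versions never clobber each other's results.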
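    Points 11 and 12 correspond to flytekit's map-task and dynamic-workflow constructs; as a plain-Python illustration of the fan-out idea (hypothetical helper names, not the Flyte API):

```python
# Toy fan-out: "map" runs one task per element of a list; "dynamic" builds
# the task list from an input at run time, which is what lets the shape of
# the graph depend on the data.
def area_stats(region: str) -> int:
    return len(region)  # stand-in for real per-region work

def map_over(task, items):
    return [task(item) for item in items]

def dynamic_plan(regions):
    # the set of tasks is only known once `regions` arrives
    return map_over(area_stats, [r.upper() for r in regions])
```

    In real Flyte each mapped element runs as its own container, so the fan-out scales out instead of looping in one process.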

    For Admins

    1. For really long-running tasks, admins can deploy the management layer without killing running tasks
    2. Support for spot/ARM/GPU machines (with different versions, etc.)
    3. Quotas and throttles per project/domain
    4. Upgrade the infrastructure without upgrading user libraries

    Integrations

    1. Native pandas dataframe support
    2. Spark
    3. MPI jobs (gang-scheduled)
    4. Pandera / Great Expectations for data quality
    5. SageMaker
    6. Easy deployment of models for serving
    7. Polars / Modin / Spark dataframes
    8. Tensors, checkpointing, etc., with many more on the roadmap
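    The dataframe integrations above (and point 4 in Features) rest on Flyte's type engine, which registers converters between formats so user code never translates dataframes by hand. The stdlib sketch below is a toy illustration of that registry pattern, not flytekit's actual type-transformer API:

```python
# Conceptual sketch: converters are registered per (source, target) format
# pair, and translate() looks up the right one. Flyte does this for real
# dataframe flavors (pandas, Polars, Spark, ...); here "rows" and "columns"
# are toy formats standing in for them.
_converters = {}

def register(src: str, dst: str):
    def wrap(fn):
        _converters[(src, dst)] = fn
        return fn
    return wrap

@register("rows", "columns")
def rows_to_columns(rows):
    # list-of-dicts -> dict-of-lists, a stand-in for e.g.
    # pandas.DataFrame -> polars.DataFrame translation
    cols = {}
    for row in rows:
        for key, value in row.items():
            cols.setdefault(key, []).append(value)
    return cols

def translate(value, src: str, dst: str):
    if src == dst:
        return value
    return _converters[(src, dst)](value)
```

    Because the registry is keyed by type pair, adding support for a new dataframe library is just registering another converter; that is the extension point the integrations list keeps growing through.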

    Community

    A community focused on ML-specific features

    Roadmap

    1. CD4ML, with human-in-the-loop and external-signal-based workflows. This will allow users to automate deployment of models or perform human-in-the-loop labeling, etc.
    2. Support for Ray/Spark/Dask cluster re-use across tasks
    3. Integration with whylogs and other monitoring tools
    4. Integration with MLflow, etc.
    5. More native Flyte Decks renderers

    Hopefully this answers your questions. Please also join the Slack community, help spread this information, and ask more questions!