Tags: machine-learning, google-cloud-platform, olap, oltp

Google Cloud Architecture: Can a data lake be used for OLTP?


I want to design a large-scale web application on Google Cloud, and I need an OLAP system that creates ML models. I plan to build it by sending all data through Pub/Sub into a BigTable data lake. The models are created by Dataproc jobs.

The models are deployed to microservices that execute them on data from user sessions. My question is: Where do I store the "normal business data" for these microservices? Do I have to keep the data for the microservices that serve the web application separate from the data in the data lake, e.g. by using MariaDB instances (one database per microservice)? Or can I connect them with BigTable?

Regarding the data lake: Are there alternatives to BigTable? Another developer told me that an option is to store data on Google Cloud Storage (Buckets) and access this data with Dataproc, to avoid the cross-region costs of BigTable.


Solution

  • Wow, that's a lot of questions, a lot of hypotheses, and a lot of possibilities. The best answer is "it all depends on your needs"!

    Where do I store the "normal business data" for these microservices?

    What do you want to do in these microservices?

    Or can I connect them with BigTable?

    Yes, you can, but do you need to? If you need the raw data before processing, then yes, connect to BigTable and query it!

    If not, it's better to have a batch process that pre-processes the raw data and stores only the summary in a relational or document database (better latency for the user, but less detail).
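    As a rough illustration of that batch step, here is a minimal sketch. Everything in it is an assumption for the example: the event fields, the `user_summary` table, and the use of an in-memory SQLite database as a stand-in for whatever relational store the microservice would really read (MariaDB, for instance). A real job would read the raw events from the data lake rather than from a Python list.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw events, standing in for rows read from the data lake.
RAW_EVENTS = [
    {"user_id": "u1", "event": "view", "amount": 0.0},
    {"user_id": "u1", "event": "purchase", "amount": 19.5},
    {"user_id": "u2", "event": "view", "amount": 0.0},
    {"user_id": "u1", "event": "purchase", "amount": 5.5},
]


def summarize(events):
    """Collapse raw events into one summary row per user."""
    summary = defaultdict(lambda: {"events": 0, "revenue": 0.0})
    for e in events:
        row = summary[e["user_id"]]
        row["events"] += 1
        row["revenue"] += e["amount"]
    return dict(summary)


def store_summary(conn, summary):
    """Write the summaries into the table the microservice would query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_summary ("
        "user_id TEXT PRIMARY KEY, events INTEGER, revenue REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO user_summary VALUES (?, ?, ?)",
        [(uid, s["events"], s["revenue"]) for uid, s in summary.items()],
    )
    conn.commit()


def run_batch(events):
    """One batch run: summarize the raw events, then store the result."""
    conn = sqlite3.connect(":memory:")  # stand-in for the real relational DB
    store_summary(conn, summarize(events))
    return conn


conn = run_batch(RAW_EVENTS)
rows = conn.execute(
    "SELECT user_id, events, revenue FROM user_summary ORDER BY user_id"
).fetchall()
print(rows)
```

    The microservice then reads only the small summary table, which is where the latency gain comes from; the per-event detail stays in the data lake.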

    Are there alternatives to BigTable?

    It depends on your needs. BigTable is great for high throughput. If you have fewer than 1 million streaming writes per second, you can consider BigQuery. You can also query a BigTable table with the BigQuery engine, thanks to federated tables.

    BigTable, BigQuery and Cloud Storage are all reachable from Dataproc, so use whichever fits your needs!

    Another developer told me that an option is to store data on Google Cloud Storage (Buckets)

    Yes, you can stream to Cloud Storage, but be careful: you don't have checksum validation, and thus you can't be sure that your data hasn't been corrupted.
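    One way to compensate, sketched below under assumptions, is to have the producer attach its own checksum to each payload and let the consumer verify it when reading. The record layout and function names are hypothetical, and this shows only the hashing logic, not the Cloud Storage upload itself.

```python
import hashlib


def wrap_with_checksum(payload: bytes) -> dict:
    """Producer side: store a SHA-256 digest next to the payload."""
    return {"sha256": hashlib.sha256(payload).hexdigest(), "payload": payload}


def is_intact(record: dict) -> bool:
    """Consumer side: recompute the digest and compare it to the stored one."""
    return hashlib.sha256(record["payload"]).hexdigest() == record["sha256"]


record = wrap_with_checksum(b'{"user_id": "u1", "event": "view"}')

# Simulate corruption in transit by appending one byte to the payload.
corrupted = dict(record, payload=record["payload"] + b"x")

print(is_intact(record), is_intact(corrupted))
```

    The Dataproc job that later reads the bucket can then skip or report any record for which `is_intact` returns false.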


    Note

    You can also think about your application in another way. If you publish events into Pub/Sub, one common pattern is to process them with Dataflow, at least for the pre-processing. Your Dataproc job for training the model will be easier that way!
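    To make the pre-processing idea concrete, here is a minimal sketch of the kind of per-message cleaning such a step could apply. It is plain Python and does not use the Dataflow (Apache Beam) API; the JSON message format and the `user_id` and `event` fields are assumptions for the example.

```python
import json
from typing import Optional

REQUIRED_FIELDS = ("user_id", "event")  # hypothetical event schema


def preprocess(message: bytes) -> Optional[dict]:
    """Parse one raw message and normalize it; return None if unusable."""
    try:
        event = json.loads(message.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None  # not valid UTF-8 JSON
    if not isinstance(event, dict):
        return None
    if any(field not in event for field in REQUIRED_FIELDS):
        return None  # missing a required field
    return {
        "user_id": str(event["user_id"]).strip(),
        "event": str(event["event"]).strip().lower(),
    }


def preprocess_batch(messages):
    """Clean a batch of messages, dropping the ones that cannot be used."""
    cleaned = (preprocess(m) for m in messages)
    return [e for e in cleaned if e is not None]


messages = [
    b'{"user_id": " u1 ", "event": "VIEW"}',
    b"not json",
    b'{"event": "view"}',
]
print(preprocess_batch(messages))
```

    With malformed and incomplete events filtered out at this stage, the training job only has to deal with records of a known shape.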

    If you train a TensorFlow model, you can also consider BigQuery ML, not for the training (unless a standard model fits your needs, which I doubt), but for the serving part.

    1. Load your TensorFlow model into BigQuery ML
    2. Simply query your data with BigQuery as the input of your model, submit it to the model, and get the prediction immediately. You can store the result directly in BigQuery with an INSERT SELECT query. The processing for the prediction is free; you pay only for the data scanned in BigQuery!
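    The two steps above could look roughly like the statements built by this sketch. The dataset, table, and bucket names are placeholders, and the `INSERT ... SELECT *` form assumes the target table's schema matches the output of `ML.PREDICT`, which depends on your model. The code only builds the SQL strings; submitting them (for example with the `google-cloud-bigquery` client's `Client.query`) is left out.

```python
def import_model_sql(model: str, gcs_path: str) -> str:
    """Step 1: SQL that imports a TensorFlow SavedModel into BigQuery ML."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (MODEL_TYPE = 'TENSORFLOW', MODEL_PATH = '{gcs_path}')"
    )


def predict_into_sql(model: str, source_table: str, target_table: str) -> str:
    """Step 2: SQL that runs the model on a table and stores the predictions."""
    return (
        f"INSERT INTO `{target_table}`\n"
        f"SELECT * FROM ML.PREDICT(MODEL `{model}`, "
        f"(SELECT * FROM `{source_table}`))"
    )


# Placeholder names, for illustration only.
load_stmt = import_model_sql("my_dataset.my_tf_model", "gs://my-bucket/model/*")
predict_stmt = predict_into_sql(
    "my_dataset.my_tf_model",
    "my_dataset.session_features",
    "my_dataset.predictions",
)
print(load_stmt)
print(predict_stmt)
```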

    As I said, there are a lot of possibilities. Narrow your question to get a sharper answer! Anyway, I hope this helps.