google-analytics, amazon-redshift, etl, google-analytics-firebase, data-ingestion

Ingesting Google Analytics data into S3 or Redshift


I am looking for options to ingest Google Analytics data (including historical data) into Redshift. Any suggestions regarding tools or APIs are welcome. I searched online and found Stitch as one of the ETL tools; help me understand this option better, and let me know about other options if you have any.


Solution

  • Google Analytics has an API (the Core Reporting API). This is good for getting the occasional KPIs, but due to API limits it's not great for exporting large amounts of historical data.
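
    If you only need a handful of KPIs, a minimal sketch of a Reporting API v4 call might look like the following (this assumes the google-api-python-client and google-auth libraries; the key file name and view ID are placeholders):

        from google.oauth2 import service_account
        from googleapiclient.discovery import build

        # Service account credentials with read access to the GA view
        # (key file name and view id below are placeholders).
        credentials = service_account.Credentials.from_service_account_file(
            "ga-service-account.json",
            scopes=["https://www.googleapis.com/auth/analytics.readonly"],
        )
        analytics = build("analyticsreporting", "v4", credentials=credentials)

        # One small report: sessions per day for the last 30 days.
        response = analytics.reports().batchGet(body={
            "reportRequests": [{
                "viewId": "123456789",
                "dateRanges": [{"startDate": "30daysAgo", "endDate": "yesterday"}],
                "metrics": [{"expression": "ga:sessions"}],
                "dimensions": [{"name": "ga:date"}],
            }]
        }).execute()

        for row in response["reports"][0]["data"].get("rows", []):
            print(row["dimensions"][0], row["metrics"][0]["values"][0])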

    For big data dumps it's better to use the Link to BigQuery ("Link" because I want to avoid the word "integration" which implies a larger level of control than you actually have).

    Setting up the link to BigQuery is fairly easy:

    1. Create a project in the Google Cloud Console and enable billing (BigQuery comes with a fee; it is not part of the GA360 contract).
    2. Add your email address as BigQuery Owner in the "IAM & Admin" section.
    3. Go to your GA account and enter the BigQuery project ID in the GA Admin section, under "Property Settings / Product Linking / All Products / BigQuery Link".

    The process is described here: https://support.google.com/analytics/answer/3416092

    You can choose between standard updates and streaming updates - the latter comes with an extra fee, but gives you near real-time data. The former updates the data in BigQuery three times a day, i.e. every eight hours.

    The exported data is not raw data; it is already sessionized (i.e. while you will get one row per hit, things like the traffic attribution for that hit will be session based).
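
    To illustrate that scoping, a query like the one below (run here with the google-cloud-bigquery Python client; project ID, dataset name and date are placeholders) returns one row per hit after UNNESTing, while the trafficSource fields repeat the same session-level attribution on every hit of that session:

        from google.cloud import bigquery

        client = bigquery.Client(project="my-project")  # placeholder project id

        # One output row per hit; trafficSource lives on the session, so every
        # hit of a session carries the same session-scoped attribution.
        sql = """
            SELECT
              fullVisitorId,
              visitId,
              trafficSource.source AS source,   -- session-scoped
              trafficSource.medium AS medium,   -- session-scoped
              h.hitNumber,
              h.page.pagePath                   -- hit-scoped
            FROM `my-project.123456789.ga_sessions_20190101`,
              UNNEST(hits) AS h
            LIMIT 100
        """
        for row in client.query(sql).result():
            print(row.fullVisitorId, row.hitNumber, row.source, row.pagePath)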

    You will pay three different kinds of fees - one for the export to BigQuery, one for storage, and one for the actual querying. Pricing is documented here: https://cloud.google.com/bigquery/pricing.

    Pricing depends on region, among other things. The region where the data is stored can also be important when it comes to legal matters - e.g. if you have to comply with the GDPR, your data should be stored in the EU. Make sure you get the region right, because moving data between regions is cumbersome (you need to export the tables to Google Cloud Storage and re-import them in the proper region) and kind of expensive.
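
    A rough sketch of such a region move with the google-cloud-bigquery Python client could look like this (the bucket, dataset and table names are made up; Avro is used here because it keeps the nested fields of the export):

        from google.cloud import bigquery

        client = bigquery.Client(project="my-project")  # placeholder project id

        # 1. Export the table to a Cloud Storage bucket
        #    (the bucket location must be compatible with the regions involved).
        extract_config = bigquery.ExtractJobConfig()
        extract_config.destination_format = bigquery.DestinationFormat.AVRO
        client.extract_table(
            "my-project.123456789.ga_sessions_20190101",
            "gs://my-transfer-bucket/ga_sessions_20190101-*.avro",
            job_config=extract_config,
        ).result()

        # 2. Re-import into a dataset that was created in the target region.
        load_config = bigquery.LoadJobConfig()
        load_config.source_format = bigquery.SourceFormat.AVRO
        client.load_table_from_uri(
            "gs://my-transfer-bucket/ga_sessions_20190101-*.avro",
            "my-project.analytics_eu.ga_sessions_20190101",
            job_config=load_config,
        ).result()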

    You cannot just delete data and do a new export - on your first export BigQuery will backfill the data for the last 13 months, but it will do this only once per view. So if you need historical data, better get this right, because if you delete data in BQ you won't get it back.

    I don't actually know much about Redshift, but as per your comment you want to display data in Tableau, and Tableau directly connects to BigQuery.

    We use custom SQL queries to get the data into Tableau (Google Analytics data is stored in daily tables, and custom SQL seems the easiest way to query data over many tables). BigQuery has a user-based cache that lasts 24 hours as long as the query does not change, so you won't pay for the query every time the report is opened.

    It is still a good idea to keep an eye on the cost - cost is not based on the result size, but on the amount of data that has to be scanned to produce the desired result, so if you query over a long timeframe and maybe do a few joins, a single query can run into the dozens of euros (multiplied by the number of users who run the query).
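
    As a sketch of both points (again with the google-cloud-bigquery Python client; project and dataset IDs are placeholders), you can query over many daily tables via the _TABLE_SUFFIX pseudo column and do a dry run first to see how many bytes would be scanned before you pay for the query:

        from google.cloud import bigquery

        client = bigquery.Client(project="my-project")  # placeholder project id

        # Sessions per day over a range of daily tables
        # (dataset 123456789 stands in for your view id).
        sql = """
            SELECT date, SUM(totals.visits) AS sessions
            FROM `my-project.123456789.ga_sessions_*`
            WHERE _TABLE_SUFFIX BETWEEN '20190101' AND '20190331'
            GROUP BY date
            ORDER BY date
        """

        # Dry run: reports how many bytes would be scanned, which is what you pay for.
        dry_run = client.query(
            sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
        )
        print("Would scan about %.2f GB" % (dry_run.total_bytes_processed / 1e9))

        # Run it for real only if the estimate looks reasonable.
        for row in client.query(sql).result():
            print(row.date, row.sessions)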