google-bigquery

Finding the most frequent value of string using BigQuery


I am aiming of finding the most frequent value and group them by user id using BigQuery. It should be able to count the number of language used for each user id and result should return the language that is highest. However, I found error saying that

No matching signature for aggregate function AVG for argument types: STRING. Supported signatures: AVG(INT64); AVG(FLOAT64); AVG(NUMERIC) at [3:5]

This is my code:

SELECT * FROM( 
  SELECT COUNT(*) AS cnt,
    AVG(Language) AS mean,
    APPROX_TOP_COUNT(Language, 1)[OFFSET(0)].value AS most_frequent_value
  FROM `language`
  WHERE Language IS NOT NULL
  GROUP BY User_ID);

What should I change so the result will return the value of language preferred by the each user ID?

enter image description here

Stored producer:

 CASE 
    WHEN Preferred_Language in ('EN', 'English') THEN 'EN' 
    ELSE Preferred_Language 
END AS Preferred_Language,

Solution

  • Below is for BigQuery Standard SQL

    #standardSQL
    SELECT
      User_ID,
      ARRAY_AGG(Language ORDER BY cnt DESC LIMIT 1)[OFFSET(0)] most_frequent_language
    FROM (
      SELECT 
        User_ID,
        Language,
        COUNT(*) AS cnt
      FROM `project.dataset.language`
      WHERE Language IS NOT NULL
      GROUP BY User_ID, Language
    )
    GROUP BY User_ID