sqlsql-serversql-server-2017dataexplorer

How to get ALL users' posts' tags (include answer's tags) in the Stack Exchange Data Explorer?


I use a query in the Stack Exchange Data Explorer (SEDE).

This is my query:

SELECT A.Id
       , A.PostTypeId
       , A.Title
       , A.Body
       , A.ParentId
       , A.Tags
       , A.CreationDate
FROM   posts A
       LEFT JOIN users U
              ON A.OwnerUserId = U.id
WHERE  U.Id = ##UserId##
       AND A.PostTypeId = 1
UNION
SELECT A.Id
       , A.PostTypeId
       , A.Title
       , A.Body
       , A.ParentId
       , B.Tags
       , A.CreationDate
FROM   posts A
       LEFT JOIN users U
              ON A.OwnerUserId = U.id
       RIGHT JOIN posts B
               ON A.ParentId = B.Id
WHERE  U.Id = ##UserId##
       AND A.PostTypeId = 2

In the code above, posts in Stack Overflow have 2 types: question and answer. Questions(PostTypeId is 1 in database schema) have the tags, but the answers(PostTypeId is 2 in database schema) do not have the tags. Answers belong to questions through the ParentId.

But the efficiency of the my query above is too low, I only can get some (using user id) posts' tags.

How can I get all users' posts' tags within the SEDE time out?


Solution

  • Several things:

    1. You cannot get all users' posts' tags. Currently, there are over 41 million posts and SEDE limits results to 50,000 rows.
    2. So, you need to: (A) restrict the results somehow or (B) page through 41 M posts, 50K at a time (NOT RECOMMENDED), or (C) use the Data Dump or Google's BigQuery($) instead.
    3. If you are not going to pull anything from the Users table but ID, then don't include that table. It chews up cycles and Posts.OwnerUserId is the same thing.
    4. Avoid UNION statements if possible (it is in this case).
    5. If using UNION statements, use UNION ALL if possible (it is in this case). This saves the engine from having to do duplicate checks.

    So, here is the execution plan for the original query:

    before

    Here is a streamlined plan:

    enter image description here

    And the query that corresponds to it:

    SELECT      TOP 50000
                p.OwnerUserId                                       AS [User]
                , p.Id                                              AS [Post Id]
                , CASE WHEN p.PostTypeId = 1 THEN 'Q' ELSE 'A' END  AS [Type]
                , COALESCE (p.Title, ownr.Title)                    AS [Title]
                , p.Body
                , COALESCE (p.Tags, ownr.Tags)                      AS [Tags]
                , p.CreationDate
    FROM        Posts p
    LEFT JOIN   Posts ownr      ON ownr.ID = p.ParentId
    WHERE       p.OwnerUserId   = ##UserId##
    AND         p.PostTypeId    IN (1, 2)  -- Question, answer
    ORDER BY    p.OwnerUserId DESC,  p.CreationDate
    

    -- which also gives more readable results -- especially when the WHERE clause is removed.

    But, if you can limit by, say, user before hand; you get an even more efficient query:

    WITH usersOfInterest AS (
        SELECT      TOP 10
                    u.Id                                                AS [UserId]
                    , u.DisplayName
        FROM        Users u
        ORDER BY    u.Reputation DESC
    )
    SELECT      
                [User] = 'site://u/' + CAST(u.UserId AS NVARCHAR) + '|' + CAST(u.UserId AS NVARCHAR)
                , p.Id                                              AS [Post Id]
                , CASE WHEN p.PostTypeId = 1 THEN 'Q' ELSE 'A' END  AS [Type]
                , COALESCE (p.Title, ownr.Title)                    AS [Title]
                , p.Body
                , COALESCE (p.Tags, ownr.Tags)                      AS [Tags]
                , p.CreationDate
    FROM        usersOfInterest u
    INNER JOIN  Posts p         ON u.UserId = p.OwnerUserId
    LEFT JOIN   Posts ownr      ON ownr.Id  = p.ParentId
    WHERE       p.PostTypeId    IN (1, 2)   -- Question, answer
    ORDER BY    u.UserId DESC,  p.CreationDate
    

    (This query adds a convenient hyperlink to the user id.)

    Note that just the top 10 users have more than 50K posts.