djangoaggregationdjango-filter

Using StringAgg after filter & distinct


I'm using StringAgg and order as follows:

    # Get order column & annotate with list of credits
    if request.POST.get('order[0][name]'):
        order = request.POST['order[0][name]']
        if order == 'credits_primary':
            releases = releases.annotate(credits_primary=StringAgg(
                'credits__entity__name',
                delimiter=', ',
                filter=Q(credits__type='primary'),
                ordering='credits__id'
            ))
        elif order == 'credits_secondary':
            releases = releases.annotate(credits_secondary=StringAgg(
                'credits__entity__name',
                delimiter=', ',
                filter=Q(credits__type='secondary'),
                ordering='credits__id'
            ))
    else:
        order = 'title'
    
    # Order releases
    if request.POST.get('order[0][dir]') == 'desc':
        releases = releases.order_by(F(order).desc(nulls_last=True), 'title')
    else:
        releases = releases.order_by(F(order).asc(nulls_last=True), 'title')
    
    for release in releases:
        try: print(release.credits_primary)
        except: pass
        try: print(release.credits_secondary)
        except: pass

This in itself works exactly as expected: the ordering is what I expect, and print returns the values I expect.

However, when I apply a filter before this, where the result includes multiple instances of the same release, the credits__entity__names will be repeated or omitted for that release, depending on how many results there are for the field(s) being filtered by, despite using distinct.

Below is the filter I'm applying that affects the aggregation (applied before StringAgg):

    # Other filters...

    # Filter by Search
    if request.POST.get('search[value]'):
        query = Q()
        search = request.POST['search[value]']
        query.add(Q(title__icontains=search), Q.OR)
        query.add(Q(tags__tag__name__icontains=search), Q.OR)
        query.add(Q(credits__entity__name__icontains=search), Q.OR)
        releases = releases.filter(query)
        
    # Make sure items aren't repeated
    releases = releases.distinct()

Solution

  • Below is the filter I'm applying that affects the aggregation (applied before StringAgg):

    The filter looks at (other) one-to-many or many-to-many relations. This means that this will result in LEFT OUTER JOIN that will start to act as a multiplier. The same works with other aggregates such as SUM and COUNT.

    Indeed, if you just aggregate, the query looks like:

    SELECT string_agg(CASE
                          WHEN c.type = 'primary' THEN e.name
                          ELSE NULL
                      END
                      ORDER BY e.id)
    FROM releases AS r
    LEFT OUTER JOIN credits AS c ON c.release_id = r.id
    LEFT OUTER JOIN entity AS e ON e.credit_id = e.id

    and that is fine, because we want to aggregate over the entities, so per release we get all entities.

    But now if we for example filter on the tags, it looks like:

    SELECT string_agg(CASE
                          WHEN c.type = 'primary' THEN e.name
                          ELSE NULL
                      END
                      ORDER BY e.id)
    FROM releases AS r
    LEFT OUTER JOIN credits AS c ON c.release_id = r.id
    LEFT OUTER JOIN entity AS e ON e.credit_id = e.id
    LEFT OUTER JOIN release_tag rt1 ON rt1.release_id = r.id
    LEFT OUTER JOIN tag t1 ON rt1.tag_id = t1.id
    LEFT OUTER JOIN release_tag rt2 ON rt2.release_id = r.id
    LEFT OUTER JOIN tag t1 ON rt2.tag_id = t2.id
    WHERE t1.name = 'tag1'
      OR t2.name = 'tag2'

    This means that the entity names for a release where one of the two tags match will be repeated once, but where both match, it will be included twice.

    The Q(credits__entity__name__icontains) makes it even worse: since this will join on the table c, and thus entities of credits that have two or more matches will be included that many times, whereas the entities of a release that does not match with a given credit, will no longer be included.

    The least common divider of all these phenomena is probably that you should be very careful to make aggregates when you make JOINs especially if multiple tables on which the JOINs are caried our are involved.

    What you probably can do is make a Subquery expression [Django-doc] which is thus not impacted by the "outer" tables and the corresponding JOIN.