mysqlelasticsearchgroup-concatself-referencing-table

Creating Concatinated parent column with MySQL on self referencing query


I am trying to populate ElasticSearch with a collection of documents along with a field representing the path to the document based on its parents.

Here is my table layout:

+----+--------+-------+----------+
| Id | Parent | Alias | Contents |
+----+--------+-------+----------+
| 1  | null   | Doc1  | Admin    |
| 2  | 1      | Doc2  | Use      |
| 3  | 2      | Doc3  | Test     |
| 4  | 3      | Doc4  | Ask      |
| 5  | null   | PDF1  | Intro    |
| 6  | 5      | PDF2  | Managers |
+----+--------+-------+----------+

Here is the desired output

+----+--------+-------+----------+---------------------+
| Id | Parent | Alias | Contents | Path                |
+----+--------+-------+----------+---------------------+
| 1  | null   | Doc1  | Admin    | Doc1                |
| 2  | 1      | Doc2  | Use      | Doc1\Doc2           | 
| 3  | 2      | Doc3  | Test     | Doc1\Doc2\Doc3      |
| 4  | 3      | Doc4  | Ask      | Doc1\Doc2\Doc3\Doc4 |
| 5  | null   | PDF1  | Intro    | PDF1                |
| 6  | 5      | PDF2  | Managers | PDF1\PDF2           |
+----+--------+-------+----------+---------------------+

I have this query that gets the Path of one document specified by the parameter @child; (aka SET @child = 5; )

SELECT 
    T2.*
FROM
    (SELECT 
        @r AS _id,
            (SELECT 
                    @r:=Parent
                FROM
                    documents
                WHERE
                    id = _id) AS ParentId,
            @l:=@l + 1 AS lvl
    FROM
        (SELECT @r:=@child, @l:=@parent) vars, documents
    WHERE
        @r <> 0) T1
        JOIN
    documents T2 ON T1._id = T2.Id
ORDER BY T2.Parent

The problem being is how do I set @child if I put this into a subquery? I have tried GROUP_CONCAT() but it always ends up being the same path for every line. I have tried putting the Id of the current row in subquery but it throws an error: ErrorCode: 1109. Unknown table 'doc' in field list in the following query

SELECT doc.*, (
    SELECT GROUP_CONCAT(a.Alias) FROM (SELECT 
        T2.*
    FROM
        (SELECT 
            @r AS _id,
                (SELECT 
                        @r:=Parent
                    FROM
                        documents
                    WHERE
                        id = _id) AS ParentId,
                @l:=@l + 1 AS lvl
        FROM
            (SELECT @r:= doc.Id, @l:=@parent) vars, documents
        WHERE
            @r <> 0) T1
            JOIN
        documents T2 ON T1._id = T2.Id
    ORDER BY T1.lvl DESC) a
) as Path FROM documents doc

What am I doing wrong? Is there a better way to do this that I'm not seeing?

Though it is not entirely relevant, I will point out, I'm using a logstash script to load the documents into ElasticSearch from my database on a schedule. Also for multiplicities sake I have taken out the majority of the columns as well as the contents and replaced with faux contents.


Solution

  • You get your error because you cannot use an outer variable in a derived table. A derived table is basically every "subquery" for which you have to use an alias, like vars in your case. Try removing that alias, and MySQL will tell you that every derived table has to have an alias.

    One way to solve this is to move your whole query into a function, e.g. getpath(child_id int), where you can then freely use this variable whereever you want (assuming you have a working query that can get the path for one specific child, "something with GROUP_CONCAT()").

    But in your case, it is actually possible to reorganize your code so you do not need a derived table:

    select d.*, t3.path 
    from (
      SELECT t1.id, 
        group_concat(t2.alias order by t1.rownum desc separator '\\' ) as path
      from (
        SELECT 
          current_child.id, 
          lvls.rownum, 
          @r := if(lvls.rownum = 1, current_child.id, @r) AS _id,
          (SELECT @r:=Parent
           FROM documents
           WHERE id = _id) AS ParentId
        FROM (select @rownum:= @rownum+1 as rownum 
           from documents, -- maybe add limit 5
           (select @rownum := 0) vars
          ) as lvls 
          -- or use: 
          -- (select 1 as rownum union select 2 union select 3 
          -- union select 4 union select 5) as lvls
          straight_join documents as current_child 
      ) as t1
      join documents t2
      on t2.id = t1._id
      group by t1.id
    ) t3
    join documents d
    on d.id = t3.id;
    

    I used your inner documents the same way as you did, which is actually quite inefficient and is only used to support an unlimited tree depth. If you know your max dependency level, you could use the alternative code for lvls I added as a comment (which is just a list of numbers) or the limit.

    Make sure to set the group_concat_max_len-setting to an appropriate value (with e.g. set session group_concat_max_len = 20000;). By default, it supports a length of 1024, which will usually be enough, but for long aliases or really deep trees you might reach it - and since it will give you neither an error nor a warning, it is sometimes hard to diagnose, so be aware of it.

    There is a more straight forward way to solve your problem. It requires you to know the maximum depth of your tree though, but if you do, you can simply join your parents to every child.

    select child.*, 
      concat_ws('\\',p4.Alias,p3.Alias,p2.Alias,p1.Alias,child.Alias) as path
    from documents child
    left join documents p1 on child.parent = p1.id
    left join documents p2 on p1.parent = p2.id
    left join documents p3 on p2.parent = p3.id
    left join documents p4 on p3.parent = p4.id;
    

    Generally speaking, the tree you used for your hierarchy does not work very well in sql because of the recursive nature of the model (even if other databases actually support recursive queries in a very similar way you simulated with the variables).

    For other ways to model your hierarchy, see e.g. Bill Karwins presentation Models for hierarchical data. They make it a lot easier to query a path without recursion.