solrsolr6

Indexing joined records in Solr


I am new to Solr and stuck at something basic (I think), which is probably a lack of understanding/comprehension on my behalf. I've read the documentation on DIH and spent a lot of time searching this issue, without finding my solution.

My use case is a messaging/email system, where users can message each other and start a thread, to which they can reply (so it's more like email than direct messages on a user base).

The question is simple; I have one table, threads, that is the base for this and contains searchable data like user info and subject. Then joined from that is the emails table, with the html column searchable.

When I run below collection in Solr and do a search, it will only pick up a single email for a thread and search that, as opposed to what I'm hoping for; get all emails belonging to that thread. So say I have 10 threads, but 100 messages, it says Fetched: 100, but Processed: 10.

How do I get Solr to index all of this content properly and allow for a search on it? In this particular use case, I have also created a reversed example, getting messages first, then the threads it belongs to and then de-dupe the results (which works to some extent), but the next step is that there is also a left join for email attachments. So looking for a solution with this setup.

Using Solr 6.6

<dataConfig>
    <dataSource name="ds-db" type="JdbcDataSource"
                driver="com.mysql.jdbc.Driver"
                url="${dataimporter.request.url}"
                user="${dataimporter.request.user}"
                password="${dataimporter.request.password}"/>
    <document name="threads">
        <entity name="thread" dataSource="ds-db"
                query="
            SELECT threads.id
                 , threads.user_id
                 , threads.subject
                 , users.first_name
                 , users.last_name
                 , users.email
              FROM threads
         LEFT JOIN users ON users.user_id=threads.user_id
            ">
            <field column="id" name="thread_id"/>
            <field column="user_id" name="user_id"/>
            <field column="subject" name="subject"/>
            <field column="first_name" name="first_name"/>
            <field column="last_name" name="last_name"/>
            <field column="email" name="email"/>

            <entity name="message" dataSource="ds-db" transformer="HTMLStripTransformer"
                    query="
                SELECT id
                     , html
                  FROM emails
                 WHERE thread_id = ${thread.id}
                    ">
                <field column="id" name="id"/>
                <field column="html" name="html" stripHTML="true"/>


            </entity>

        </entity>
    </document>
</dataConfig>

managed-schema

<schema name="example-data-driven-schema" version="1.6">
    ...
    <field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
    <field name="thread_id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
    <field name="first_name" type="string_lowercase" indexed="true" stored="true"/>
    <field name="last_name" type="string_lowercase" indexed="true" stored="true"/>
    <field name="email" type="string_lowercase" indexed="true" stored="true"/>
    <field name="subject" type="string_lowercase" indexed="true" stored="true"/>
    <field name="html" type="string_lowercase" indexed="true" stored="true"/>
    ...
    <copyField source="first_name" dest="_text_"/>
    <copyField source="last_name" dest="_text_"/>
    <copyField source="email" dest="_text_"/>
    <copyField source="subject" dest="_text_"/>
    <copyField source="html" dest="_text_"/>
    ...
</schema>

Solution

  • If you want all the emails in a single field, that field has to be set as multiValued="true" - otherwise you'll only get one of the dependent entities indexed.