I am new to Solr and stuck at something basic (I think), which is probably a lack of understanding/comprehension on my behalf. I've read the documentation on DIH and spent a lot of time searching this issue, without finding my solution.
My use case is a messaging/email system, where users can message each other and start a thread, to which they can reply (so it's more like email than direct messages on a user base).
The question is simple; I have one table, threads
, that is the base for this and contains searchable data like user info and subject. Then joined from that is the emails
table, with the html
column searchable.
When I run below collection in Solr and do a search, it will only pick up a single email
for a thread and search that, as opposed to what I'm hoping for; get all emails belonging to that thread. So say I have 10 threads, but 100 messages, it says Fetched: 100
, but Processed: 10
.
How do I get Solr to index all of this content properly and allow for a search on it? In this particular use case, I have also created a reversed example, getting messages first, then the threads it belongs to and then de-dupe the results (which works to some extent), but the next step is that there is also a left join
for email attachments. So looking for a solution with this setup.
Using Solr 6.6
<dataConfig>
<dataSource name="ds-db" type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="${dataimporter.request.url}"
user="${dataimporter.request.user}"
password="${dataimporter.request.password}"/>
<document name="threads">
<entity name="thread" dataSource="ds-db"
query="
SELECT threads.id
, threads.user_id
, threads.subject
, users.first_name
, users.last_name
, users.email
FROM threads
LEFT JOIN users ON users.user_id=threads.user_id
">
<field column="id" name="thread_id"/>
<field column="user_id" name="user_id"/>
<field column="subject" name="subject"/>
<field column="first_name" name="first_name"/>
<field column="last_name" name="last_name"/>
<field column="email" name="email"/>
<entity name="message" dataSource="ds-db" transformer="HTMLStripTransformer"
query="
SELECT id
, html
FROM emails
WHERE thread_id = ${thread.id}
">
<field column="id" name="id"/>
<field column="html" name="html" stripHTML="true"/>
</entity>
</entity>
</document>
</dataConfig>
managed-schema
<schema name="example-data-driven-schema" version="1.6">
...
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="thread_id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="first_name" type="string_lowercase" indexed="true" stored="true"/>
<field name="last_name" type="string_lowercase" indexed="true" stored="true"/>
<field name="email" type="string_lowercase" indexed="true" stored="true"/>
<field name="subject" type="string_lowercase" indexed="true" stored="true"/>
<field name="html" type="string_lowercase" indexed="true" stored="true"/>
...
<copyField source="first_name" dest="_text_"/>
<copyField source="last_name" dest="_text_"/>
<copyField source="email" dest="_text_"/>
<copyField source="subject" dest="_text_"/>
<copyField source="html" dest="_text_"/>
...
</schema>
If you want all the emails in a single field, that field has to be set as multiValued="true"
- otherwise you'll only get one of the dependent entities indexed.