java solr lucene solrcloud

Full Outer Join in Solr


I am trying to do a full outer join on two collections. Given collection1 with a document that looks like this:

{
id: 234982032,
name: example,
listId: 123
}

and collection2 with a document that looks like this:

{
id: 123,
description: desc1
}

I expect a result like this:

{
id: 234982032,
name: example,
description: desc1
}

I have tried using this command:

fq={!join from=listId to=id fromIndex=collection2}description:desc1

but this results only in an inner join. Is there a way I can outer join two collections using a filter query? If this is not possible, is there a plugin that can do this?


Solution

  • A join using the join query parser (i.e. {!join}) in Solr can't retrieve content from both sides of the join. These are purely inner joins, where one field is used for filtering content from the collection being queried.

    This is different from the concept of a join in a relational database because no information is truly being joined. The closest SQL analogy would be a subquery (an "inner query") used in a WHERE clause.

    If you're on a recent-ish version of Solr you do have another option, which is using a Streaming Expression.

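    This will allow you to set up two stream sources, then apply either leftOuterJoin or outerHashJoin to get documents with combined fields from both sides. Both are left outer joins: every tuple from the first (left) stream is returned, with fields from the second stream added where a match exists.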

    From the example in the reference guide:

    leftOuterJoin(
      search(people, q="*:*", qt="/export", fl="personId,name", sort="personId asc"),
      search(pets, q="type:cat", qt="/export", fl="personId,petName", sort="personId asc"),
      on="personId"
    )
    
    outerHashJoin(
      search(people, q="*:*", qt="/export", fl="personId,name", sort="personId asc"),
      hashed=search(pets, q="type:cat", qt="/export", fl="personId,petName", sort="personId asc"),
      on="personId"
    )
    

    Replace people and pets with collection1 and collection2 from your question, and adjust the field lists and join key to match. Be aware that leftOuterJoin requires both streams to be sorted on the join key. That sorted merge is usually the more efficient choice for large result sets, since outerHashJoin has to hold the entire hashed stream in memory.
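
    As a rough sketch of how that could look for the documents in the question (untested, and assuming the join key is listId in collection1 and id in collection2, which is why the on parameter uses the left=right form):

    leftOuterJoin(
      search(collection1, q="*:*", qt="/export", fl="id,name,listId", sort="listId asc"),
      search(collection2, q="*:*", qt="/export", fl="id,description", sort="id asc"),
      on="listId=id"
    )

    The /export handler needs docValues on the fields used in fl and sort. Both collections use id as a field name, so check which id ends up in the joined tuple on your Solr version; wrapping one of the streams in select() lets you rename or drop a field if they collide.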