I need a tag-based search system on top of a JCR implementation such as ModeShape. I want to search for nodes by tag. What is the best way to implement this?
There are several ways to implement tags in JCR. Which option you pick will depend upon the needs of your own application(s). Here are four options I know of.
Option 1: Use Mixins
Define for each tag a mixin node type definition that is a marker mixin (it has no property definitions or child node definitions), registering them dynamically using the NodeTypeManager. Then when you want to "tag" a node, simply add to that node the mixin that represents the tag. Any node could have multiple tags, and you could query for all the nodes that have a particular tag.
(In the rest of this response, "acme" is used as a generic namespace. You should replace this with a namespace suitable for your own application and organization.)
For example, given a tag "acme:tag1", you could find all nodes that have this tag with the simple query:
SELECT * FROM [acme:tag1]
The disadvantage of this approach is that maintaining tags is cumbersome. Creating new tags requires registering new node types. You cannot easily rename tags, but instead would have to create the mixin for the tag with the new name; find all nodes that have the mixin representing the old tag, remove the old mixin, and add the new one; and finally remove the node type definition for the old tag (after it is no longer used anywhere). Removing old tags is done in a similar manner. Another disadvantage is that it is not easy to associate additional metadata (e.g., display name) with a tag, since extra properties aren't allowed on node type definitions.
This approach should perform quite well.
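As a rough illustration of what a marker mixin looks like, the sketch below builds the CND (Compact Node Type Definition) text for one tag, plus the matching query. The class name MarkerMixins, its method names, and the namespace URI used in the usage note are made up for this example; how you register the resulting CND depends on your JCR implementation.

```java
/** Hypothetical helper: builds CND text and a query for a tag modeled as a marker mixin. */
final class MarkerMixins {

    private MarkerMixins() {}

    /**
     * Returns CND text that declares the namespace and a mixin with no
     * property or child node definitions, e.g. "[acme:tag1] mixin".
     */
    static String cndFor(String prefix, String namespaceUri, String tagName) {
        return "<" + prefix + " = '" + namespaceUri + "'>\n"
             + "[" + prefix + ":" + tagName + "] mixin\n";
    }

    /** Returns the JCR-SQL2 query that selects every node carrying the tag's mixin. */
    static String queryFor(String prefix, String tagName) {
        return "SELECT * FROM [" + prefix + ":" + tagName + "]";
    }
}
```

For example, MarkerMixins.cndFor("acme", "http://www.acme.com/ns/1.0", "tag1") yields the definition for "acme:tag1". Once that type is registered, tagging a node is a matter of calling node.addMixin("acme:tag1") and saving the session.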
Option 2: Use a taxonomy and strong references
In this approach, you would create a simple node structure in an area of the repository into which you can create a node for each tag (e.g., a taxonomy). On this node you could set properties that describe the tag (e.g., display name); these properties can be changed at any time (e.g., to rename the tag).
Then to "apply" the tag to a node, you simply have to create some sort of relationship to the tag. One way is to define a mixin node type that contains an "acme:tags" multi-valued property of type REFERENCE. When you want to apply one or more tags to a node, simply add the mixin to the node and set the "acme:tags" property to the tag node(s).
To find all nodes of a particular tag, you can call "getReferences()" on a tag node to find all of the nodes that contain a reference to the tag node.
This approach has the benefit that all tags have to be controlled/managed within one or more taxonomies (including perhaps user-specific taxonomies). However, there are some disadvantages, too. First and foremost, the performance of REFERENCE properties might not be great. Some JCR implementations discourage the use of REFERENCE properties altogether. ModeShape does not, but REFERENCE performance in ModeShape might start to degrade when there are lots of nodes that contain references to the same node (e.g., lots of nodes with a single tag).
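The mixin described above might be declared with CND along these lines. This is only a sketch: TaggableMixin and RefKind are hypothetical names, and the "acme" prefix is assumed to be registered already. Only the property type differs between a strong and a weak reference, so the helper takes it as a parameter.

```java
/** Hypothetical helper: CND for a mixin holding a multi-valued reference property. */
final class TaggableMixin {

    /** The two JCR reference property types a tag property could use. */
    enum RefKind { REFERENCE, WEAKREFERENCE }

    private TaggableMixin() {}

    /** Returns CND for "acme:taggable" with a multi-valued "acme:tags" property. */
    static String cnd(RefKind kind) {
        return "[acme:taggable] mixin\n"
             + "  - acme:tags (" + kind.name() + ") multiple\n";
    }
}
```

With the REFERENCE variant, calling getReferences() on a tag node returns the referring properties, and getParent() on each of those gives the tagged node. Note that in standard JCR the tag nodes need to be mix:referenceable to be valid reference targets.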
Option 3: Use taxonomy and weak references
This option is similar to Option 2 above, except that the "acme:tags" properties would be WEAKREFERENCE instead of REFERENCE. You would still define and manage one or more taxonomies. To find nodes with a particular tag, you can't use the "getReferences()" method on the tag node (since it does not cover WEAKREFERENCE properties), but you can easily do this with a query:
SELECT * FROM [acme:taggable] AS taggable
JOIN [acme:tag] AS tag ON taggable.[acme:tags] = tag.[jcr:uuid]
WHERE LOCALNAME(tag) = 'tag1'
This approach enforces the use of one or more taxonomies, which makes it a bit easier to control the tags, since they must exist in a taxonomy before they can be used. Renaming and removing tags is also easier. Performance-wise, this is better than the REFERENCE approach, since WEAKREFERENCE properties will perform better with large numbers of references, regardless of whether they all point to one node or many.
The disadvantage is that you can remove a tag even if it is still used, in which case the nodes that contain a WEAKREFERENCE to the removed tag are left with references that no longer resolve. This can be remedied with some conventions in your application, or by simply using metadata on the taxonomy to say that a particular tag is "deprecated" and shouldn't be used. (IMO, the latter is actually a benefit of this approach.)
This option will generally perform and scale much better than Option 2.
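If tag names come from user input, the literal in the join query needs escaping. The sketch below assembles that query for a given tag name. WeakRefTagQuery and its methods are hypothetical, and the quote-doubling rule reflects my reading of JCR-SQL2 string literals, so check it against your implementation (or use bind variables where they are supported). The name test sits in a WHERE clause, which is where standard JCR-SQL2 expects constraints.

```java
/** Hypothetical helper: builds the weak-reference join query for one tag name. */
final class WeakRefTagQuery {

    private WeakRefTagQuery() {}

    /** Wraps a value in single quotes, doubling any embedded single quote. */
    static String literal(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Returns a JCR-SQL2 query for nodes whose acme:tags points at the named tag node. */
    static String forTag(String tagName) {
        return "SELECT taggable.* FROM [acme:taggable] AS taggable "
             + "JOIN [acme:tag] AS tag ON taggable.[acme:tags] = tag.[jcr:uuid] "
             + "WHERE LOCALNAME(tag) = " + literal(tagName);
    }
}
```

The resulting string can be handed to the QueryManager obtained from session.getWorkspace().getQueryManager(), using createQuery(sql, Query.JCR_SQL2).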
Option 4: Use string properties
Yet another approach is to simply use a STRING property to tag each node with the name of the tag(s) that are to be applied. For example, you could define a mixin (e.g., "acme:taggable") that defines a multi-valued STRING property named "acme:tags", and when you want to tag a node simply add the mixin (if not already present) and add the name of the tag as a value on the "acme:tags" property (again, if it's not already present as a value).
The primary advantage of this approach is that it is very simple: you're simply using string values on the node that is to be tagged. To find all nodes that are tagged with a particular tag (e.g., "tag1"), simply issue a query:
SELECT *
FROM [acme:taggable] AS taggable
WHERE taggable.[acme:tags] = 'tag1'
Management of the tags is easy: there is no management. If a tag is to be renamed, then you could rename the tag values. If a tag is to be deleted (and removed from the nodes that are tagged with it), then that can be done by removing the values from the "acme:tags" properties (perhaps in a background job).
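Adding or renaming a tag this way comes down to rewriting the value list of the "acme:tags" property. The sketch below shows just that list manipulation on plain string arrays and leaves the JCR read and write calls out. TagValues and its methods are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: pure functions over the values of a multi-valued tag property. */
final class TagValues {

    private TagValues() {}

    /** Returns the values with the tag appended, unless it is already present. */
    static String[] add(String[] current, String tag) {
        Set<String> tags = new LinkedHashSet<>();
        for (String value : current) {
            tags.add(value);
        }
        tags.add(tag);
        return tags.toArray(new String[0]);
    }

    /** Returns the values with oldTag replaced by newTag, keeping order and dropping duplicates. */
    static String[] rename(String[] current, String oldTag, String newTag) {
        Set<String> tags = new LinkedHashSet<>();
        for (String value : current) {
            tags.add(value.equals(oldTag) ? newTag : value);
        }
        return tags.toArray(new String[0]);
    }
}
```

In a real job you would read the current values from the property with getValues(), pass their strings through one of these functions, and write the result back with node.setProperty("acme:tags", newValues) before saving the session.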
Note that this allows any tag name to be used, and thus works best for cases where the tag names are not controlled at all. If you want to control the list of strings used as tag values, simply create a taxonomy in the repository (as described in Options 2 and 3 above) and have your application limit the values to those in the taxonomy. You can even have multiple taxonomies, some of which are perhaps user-specific. But this approach doesn't have quite the same control as Options 2 or 3.
This option will perform a bit better than Option 3 (since the queries are simpler) and will scale just as well.