I have a use case in which chat text needs to be classified. I want to use DocumentCategorizer in Apache OpenNLP to categorize the chats, but for that I need training data consisting of chats that are already classified. Do I have to manually categorize hundreds of chats to prepare the training and test data? What else can I do? I intend the chat categories to be service-related problems, so the list of categories would be domain specific. Should the provider of this data give me the categorized chat data? Thanks in advance.
By definition, you cannot have a classification problem without labelled data. Either someone labels (at least part of) the data, or you should try to address the problem in a different way.
-- Edited to add some examples of how to address the problem without classifying:
In general, depending on the specific task, you can try to solve a "classification" problem via clustering and/or document or term matching. Clustering groups together documents related to the same topic, while term matching finds documents that mention specific terms. If no training data is available but you have some knowledge about the problem, either method, or a combination of the two, might be enough for your information need.
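To make the term matching idea concrete, here is a minimal sketch in plain Python (not OpenNLP). The category names and keyword lists are made up for illustration; in practice they would come from your own knowledge of the domain. Each chat is assigned to the category whose keyword list it overlaps most, or to no category if nothing matches.

```python
import re

# Hypothetical categories and keywords, invented for this example.
CATEGORY_TERMS = {
    "billing": {"invoice", "charge", "refund", "payment"},
    "connectivity": {"offline", "disconnect", "signal", "outage"},
    "login": {"password", "login", "locked", "account"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def match_category(chat):
    """Return the category sharing the most keywords with the chat, or None."""
    tokens = tokenize(chat)
    scores = {cat: len(tokens & terms) for cat, terms in CATEGORY_TERMS.items()}
    best = max(scores, key=scores.get)  # ties go to the first category listed
    return best if scores[best] > 0 else None
```

This only matches exact words ("charged" does not match "charge"), so a real setup would probably want stemming or a richer term list.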
For your specific problem, I would start by trying to cluster the chats.
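As a rough idea of what clustering chats could look like, below is a small sketch in plain Python. It is a simple single-pass ("leader") clustering based on Jaccard word overlap, not a tuned algorithm and not something OpenNLP provides. The stopword list and the 0.2 threshold are assumptions chosen for the example.

```python
import re

# Small, assumed stopword list; a real one would be much longer.
STOPWORDS = {"i", "my", "the", "a", "is", "to", "and", "it", "was", "on", "in", "of"}


def tokens(text):
    """Return the set of lowercase words in the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two word sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_chats(chats, threshold=0.2):
    """Assign each chat to the most similar existing cluster, or start a new one.

    Similarity is measured against the first chat (the "leader") of each cluster.
    """
    clusters = []  # each cluster is a list of (chat, token_set); item 0 is the leader
    for chat in chats:
        toks = tokens(chat)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(toks, cluster[0][1])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best.append((chat, toks))
        else:
            clusters.append([(chat, toks)])
    return [[chat for chat, _ in cluster] for cluster in clusters]


chats = [
    "My internet connection keeps dropping",
    "I was charged twice on my bill",
    "Internet connection is down again",
    "Why was my bill charged twice",
]
groups = cluster_chats(chats)
```

Once the chats are grouped, you can read a few examples from each cluster and name it. That is far less work than labelling every chat, and the named clusters can later serve as a first version of training data for a classifier.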