As I see fewer than 500 questions about deeplearning4j here, most of them years old, a different question first: is DL4J dead? Do I really have to deal with horrible, horrible Python just to build my AI? I don't want to!
Now the real question. I feel a bit stupid, but the documentation and search results are lacking (see the question above). I have spent the past few days reading up on building a simple document classifier with DL4J, which seems straightforward enough, although the follow-up material is again frighteningly sparse.
I build a ParagraphVectors, add some labels, pass in the training data and train. I also figured out that the data is passed in as a LabelAwareIterator. For a file structure I even found this documentation by DL4J on how to lay out the data. But what if I want to read the data from, say, an API rather than from a file structure? I am guessing I need a LabelAwareDocumentIterator, but how is the data supposed to be structured, and how do I feed it in? I read about structuring it as a table with text and label as columns, but that description seems rather sketchy and very imprecise.
Help would be much appreciated, as are better resources than what I have found so far. Thanks!
--UPDATE
From reading the source code (usually a good idea to just check the implementation), it looks like what I really want is the SimpleLabelAwareIterator. That code is nicely readable. I don't really understand what the LabelAwareDocumentIterator is for yet. Anyway, the simple one just needs a List of LabelledDocuments. A LabelledDocument just has a string content and a list of labels. So far so good; I will try an implementation this evening. If it works out, I will post this as an answer.
The approach in the update worked out. I am now using a SimpleLabelAwareIterator that I fill with a list of LabelledDocuments. Short code sample:
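import java.util.ArrayList;
import java.util.Arrays;
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.documentiterator.LabelledDocument;
import org.deeplearning4j.text.documentiterator.SimpleLabelAwareIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.UimaTokenizerFactory;

// Wrap each input (my own Document type with content and label fields) in a LabelledDocument.
ArrayList<LabelledDocument> labelledDocumentList = new ArrayList<>();
for (Document input : documents) {
    LabelledDocument doc = new LabelledDocument();
    doc.setContent(input.content);
    doc.addLabel(input.label);
    labelledDocumentList.add(doc);
}

// The iterator only needs the list of labelled documents.
SimpleLabelAwareIterator iter = new SimpleLabelAwareIterator(labelledDocumentList);
TokenizerFactory t = new UimaTokenizerFactory();
ParagraphVectors vec = new ParagraphVectors.Builder()
        .minWordFrequency(1)
        .labels(Arrays.asList("A", "B"))
        .layerSize(100)
        .stopWords(new ArrayList<String>())
        .windowSize(5)
        .iterate(iter)
        .tokenizerFactory(t)
        .build();
vec.fit(); // train the model

// tools.saveObject is a helper that is not shown here.
tools.saveObject(vec, "models/modelName");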
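The sample above does not show where `documents` comes from. As a rough sketch of the "table of text and label as columns" idea from the question, the following plain Java (no DL4J on the classpath) turns rows into objects with `content` and `label` fields. Everything here is an assumption made for illustration: the `label<TAB>text` row format, the `ApiRows` class, the `toDocuments` method, and the `Document` stand-in, which only mirrors the two fields the sample reads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the Document type used in the sample above;
// the post only shows that it exposes `content` and `label` fields.
final class Document {
    final String content;
    final String label;

    Document(String content, String label) {
        this.content = content;
        this.label = label;
    }
}

final class ApiRows {
    // Parses rows of the assumed form "label<TAB>text" into Document objects.
    // Rows without a tab, or with an empty label or empty text, are skipped.
    static List<Document> toDocuments(List<String> rows) {
        List<Document> documents = new ArrayList<>();
        for (String row : rows) {
            int tab = row.indexOf('\t');
            if (tab <= 0) {
                continue; // no separator, or nothing before it
            }
            String label = row.substring(0, tab).trim();
            String content = row.substring(tab + 1).trim();
            if (label.isEmpty() || content.isEmpty()) {
                continue;
            }
            documents.add(new Document(content, label));
        }
        return documents;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "A\tfirst example text",
                "B\tsecond example text",
                "malformed row without a tab");
        for (Document d : toDocuments(rows)) {
            System.out.println(d.label + " -> " + d.content);
        }
    }
}
```

The resulting list can then be looped over exactly as in the sample, creating one LabelledDocument per entry. Skipping malformed rows silently is just one choice; logging or failing fast may suit a real API better.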