java deep-learning deeplearning4j dl4j

DeepLearning4J Doc2Vec input structure


As I see fewer than 500 questions about Deeplearning4j here, most of them years old, a different question first: is DL4J dead? Do I really have to deal with horrible, horrible Python just to build my AI? I don't want to!

Now the real question. I feel a bit stupid, but documentation and search results are rather lacking (see the question above): I have spent the past few days reading up on building a simple document classifier with DL4J, which seems straightforward enough, although the follow-up material is again frighteningly sparse.

I build a ParagraphVectors model, add some labels, pass in the training data and train. I also figured out that the data is passed in as a LabelAwareIterator. For a file-based layout I even found this documentation by DL4J on how to structure the data. But what if I want to read the data from, say, an API rather than from a directory structure? I am guessing I need a LabelAwareDocumentIterator, but how is the data supposed to be structured, and how do I feed it in? I read about structuring it as a table with text and label as columns, but that description is rather sketchy and very imprecise.

Help would be much appreciated, as would better resources than what I have found so far. Thanks!

--UPDATE

From reading the source code (usually a good idea to just check the implementation), it looks like what I really want is the SimpleLabelAwareIterator. That code is nicely readable. I don't really understand what the LabelAwareDocumentIterator is for yet. Anyway, the simple one just needs a List of LabelledDocuments, and each LabelledDocument just has a string content and a list of labels. So far so good; I will try an implementation this evening. If it works out, I will post it as an answer.
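
To make the "string content plus list of labels" shape concrete, here is a minimal sketch of turning API-style rows into such documents. It deliberately avoids the DL4J dependency: `LabelledText` is a hypothetical stand-in that only mirrors the content/labels shape described above, and the `{text, label}` row format and the `ApiRowsToDocuments` helper are assumptions made for illustration, not anything DL4J prescribes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a labelled document: one text, any number of labels.
final class LabelledText {
    private final String content;
    private final List<String> labels = new ArrayList<>();

    LabelledText(String content) {
        this.content = content;
    }

    // Adds a label once; repeated labels are ignored.
    void addLabel(String label) {
        if (!labels.contains(label)) {
            labels.add(label);
        }
    }

    String getContent() {
        return content;
    }

    List<String> getLabels() {
        return Collections.unmodifiableList(labels);
    }
}

final class ApiRowsToDocuments {
    // Each row is assumed to be {text, label}, as a hypothetical API might return it.
    // Rows with identical text are merged into one document carrying all their labels.
    static List<LabelledText> fromRows(List<String[]> rows) {
        Map<String, LabelledText> byContent = new LinkedHashMap<>();
        for (String[] row : rows) {
            String text = row[0];
            String label = row[1];
            byContent.computeIfAbsent(text, LabelledText::new).addLabel(label);
        }
        // LinkedHashMap keeps the order in which texts were first seen.
        return new ArrayList<>(byContent.values());
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"first sample text", "A"});
        rows.add(new String[] {"second sample text", "B"});
        rows.add(new String[] {"first sample text", "B"});
        for (LabelledText doc : fromRows(rows)) {
            System.out.println(doc.getLabels() + " " + doc.getContent());
        }
    }
}
```

With the real library, the same loop would presumably create LabelledDocument instances instead of the stand-in and hand the resulting list to the iterator.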


Solution

  • The approach in the update worked out. I am now using a SimpleLabelAwareIterator that I fill with a list of LabelledDocuments. Short code sample:

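        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;
        import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
        import org.deeplearning4j.text.documentiterator.LabelledDocument;
        import org.deeplearning4j.text.documentiterator.SimpleLabelAwareIterator;
        import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
        import org.deeplearning4j.text.tokenization.tokenizerfactory.UimaTokenizerFactory;

        // Wrap each input record in a LabelledDocument (text content plus its label).
        List<LabelledDocument> labelledDocumentList = new ArrayList<>();

        for (Document input : documents) {
          LabelledDocument doc = new LabelledDocument();
          doc.setContent(input.content);
          doc.addLabel(input.label);
          labelledDocumentList.add(doc);
        }

        // The iterator only needs the in-memory list; no file structure is involved.
        SimpleLabelAwareIterator iter = new SimpleLabelAwareIterator(labelledDocumentList);

        TokenizerFactory t = new UimaTokenizerFactory();
        ParagraphVectors vec = new ParagraphVectors.Builder()
                            .minWordFrequency(1)
                            .labels(Arrays.asList("A", "B"))
                            .layerSize(100)
                            .stopWords(new ArrayList<String>())
                            .windowSize(5)
                            .iterate(iter)
                            .tokenizerFactory(t)
                            .build();

        // Train, then persist the model (tools.saveObject is a helper from the surrounding project, not part of DL4J).
        vec.fit();
        tools.saveObject(vec, "models/modelName");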