[SOLVED] DeepLearning4J Doc2Vec input structure

Since I see fewer than 500 questions about DeepLearning4J here, most of them years old, a different question first: is DL4J dead? Do I really have to deal with horrible, horrible Python just to build my AI? I don't want to!

Now the real question. I feel a bit stupid, but the documentation and the search results are really thin (see the question above). I have spent the past few days reading up on building a simple document classifier with DL4J, which seems straightforward enough, although the follow-up material is again frighteningly sparse.

I build a ParagraphVectors model, add some labels, pass in the training data and train. I also figured out that the data is passed in as a LabelAwareIterator. For data stored in a file structure, I even found DL4J documentation on how to lay it out. But what if I want to read the data from, say, an API rather than from files? I am guessing I need a LabelAwareDocumentIterator, but how is the data supposed to be structured, and how do I feed it in? I read about structuring it as a table with text and label as columns, but that description seemed rather sketchy and very imprecise.

Help would be much appreciated, as would better resources than the ones I have found so far. Thanks!

--UPDATE

From reading the source code (usually a good idea to just check the implementation), it looks like what I really want is the SimpleLabelAwareIterator. That code is nicely readable. I don't really understand yet what the LabelAwareDocumentIterator is for. Anyway, the simple one just needs a List of LabelledDocuments, and a LabelledDocument just has a String content and a list of labels. So far so good; I will try an implementation this evening. If it works out, I will post this as an answer.
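
For data that arrives from an API rather than from files, the shape described above can be sketched in plain Java. This is only an illustration under stated assumptions: `ApiRecord` and `Doc` are hypothetical stand-ins invented for this sketch (`Doc` mirrors the content-plus-labels shape of a LabelledDocument and is not the DL4J class), so the snippet runs without DL4J on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class LabelledDocumentSketch {

    // Hypothetical record as it might come back from an API; not a DL4J type.
    static final class ApiRecord {
        final String text;
        final String category;

        ApiRecord(String text, String category) {
            this.text = text;
            this.category = category;
        }
    }

    // Stand-in for the shape described above: one content string plus a list
    // of labels. In real code DL4J's LabelledDocument would take its place.
    static final class Doc {
        private String content;
        private final List<String> labels = new ArrayList<>();

        void setContent(String content) { this.content = content; }
        void addLabel(String label) { labels.add(label); }
        String getContent() { return content; }
        List<String> getLabels() { return labels; }
    }

    // Convert fetched records into the flat list the iterator would consume.
    static List<Doc> toDocs(List<ApiRecord> records) {
        List<Doc> docs = new ArrayList<>();
        for (ApiRecord r : records) {
            Doc d = new Doc();
            d.setContent(r.text);
            d.addLabel(r.category);
            docs.add(d);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<ApiRecord> fetched = new ArrayList<>();
        fetched.add(new ApiRecord("the invoice is overdue", "A"));
        fetched.add(new ApiRecord("meeting moved to friday", "B"));

        for (Doc d : toDocs(fetched)) {
            System.out.println(d.getLabels() + " -> " + d.getContent());
        }
    }
}
```

The point of the sketch is that no table or file layout is involved: each document is one object carrying its text and its label(s), and the whole training set is a plain list of those objects.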

Solution

The approach in the update worked out. I am now using a SimpleLabelAwareIterator that I fill with a list of LabelledDocuments. Short code sample:

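    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
    import org.deeplearning4j.text.documentiterator.LabelledDocument;
    import org.deeplearning4j.text.documentiterator.SimpleLabelAwareIterator;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.UimaTokenizerFactory;

    // Wrap each input in a LabelledDocument: one content string plus its label.
    List<LabelledDocument> labelledDocumentList = new ArrayList<>();
    for (Document input : documents) {
        LabelledDocument doc = new LabelledDocument();
        doc.setContent(input.content);
        doc.addLabel(input.label);
        labelledDocumentList.add(doc);
    }

    // The iterator only needs the list of labelled documents.
    SimpleLabelAwareIterator iter = new SimpleLabelAwareIterator(labelledDocumentList);

    TokenizerFactory t = new UimaTokenizerFactory();
    ParagraphVectors vec = new ParagraphVectors.Builder()
            .minWordFrequency(1)
            .labels(Arrays.asList("A", "B"))
            .layerSize(100)
            .stopWords(new ArrayList<String>())
            .windowSize(5)
            .iterate(iter)
            .tokenizerFactory(t)
            .build();

    // Train, then persist the model with the project's own helper.
    vec.fit();
    tools.saveObject(vec, "models/modelName");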