topic-modeling mallet

How to import an Excel file into MALLET


I have an Excel file that contains the titles of Stack Overflow posts. The sheet has more than 10,000 rows, so creating a separate .txt file for each row is not practical. If I copy the Excel data into a single .txt file, does each line need a label or an instance name? I could not find any documentation on this.


Solution

  • The MALLET website describes topic modelling from a single file with one document per line at https://mimno.github.io/Mallet/topics-devel (emphasis mine):

    In this example, I import data from a file, train a topic model, and analyze the topic assignments of the first instance.
    [...]
    The input file contains one document per line. Each line has three fields, separated by commas. This is a standard Mallet format. For more information, see the importing data guide. The first field is a name for the document. The second field could contain a document label, as in a classification task, but for this example we won’t use that field. It is therefore set to a meaningless placeholder value. The third field contains the full text of the document, with no newline characters.

    Surprisingly, the quote above names commas as the field separator, while everywhere else (for example, the linked importing data guide) the file is described as tab-separated. An example of the tab-separated format is available in the MALLET GitHub repository (https://github.com/mimno/Mallet/blob/master/sample-data/stackexchange/tsv/testing.tsv).
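
    One possible explanation for the discrepancy is that import-file splits each line with a regular expression rather than a fixed delimiter. The sketch below assumes that the default --line-regex is ^(\S*)[\s,]*(\S*)[\s,]*(.*)$ (an assumption worth checking against your MALLET version) and shows how a tab-separated line falls into the name, label and data fields. The class name LineRegexDemo is made up for illustration.

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    final class LineRegexDemo {
        // ASSUMPTION: believed to be the default --line-regex of import-file; verify for your version.
        static final Pattern LINE = Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");

        // Returns {name, label, data} for one input line, or null if the line does not match.
        static String[] split(String line) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                return null;
            }
            return new String[] { m.group(1), m.group(2), m.group(3) };
        }

        public static void main(String[] args) {
            String[] f = split("1\tplaceholder\tfirst title");
            System.out.println("name=" + f[0] + " label=" + f[1] + " data=" + f[2]);
        }
    }
    ```

    Under that assumption, the first two whitespace-delimited tokens become the name and the label, and the rest of the line is treated as the document text.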

    You can create a similar file, with sequential indexes in the first column and a placeholder value in the label column. You can do this in Excel and then export the sheet as a tab-delimited text file (depending on your version, a Fill Series function is available that creates a sequential column once you enter the desired number of rows in an input field). Alternatively, you could save the column with the data as a text file and add the other two columns programmatically with, e.g., Java, which I assume is available since you are running MALLET:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;
    import java.io.FileWriter;
    import java.io.IOException;
    
    public class AddColumnsTM {
    
        public static void main(String[] args) {
            // try-with-resources closes both files even if an exception is thrown
            try (BufferedReader reader = new BufferedReader(new FileReader("titles.txt"));
                 PrintWriter writer = new PrintWriter(new FileWriter("titles_columns.tsv"))) {
    
                String line;
                int idx = 1;
                // Write one row per title: sequential name, placeholder label, title text
                while ((line = reader.readLine()) != null) {
                    writer.printf("%d\tplaceholder\t%s\n", idx, line);
                    idx++;
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    
    }
    

    Example input file:

    first title
    second title
    third title
    fourth title
    fifth title
    

    Output file produced by the code above (the three columns are separated by tab characters):

    1   placeholder first title
    2   placeholder second title
    3   placeholder third title
    4   placeholder fourth title
    5   placeholder fifth title
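
    Data copied out of Excel may contain blank rows or stray tabs and line breaks inside a cell. As a minimal sketch, assuming you want to drop blank titles and collapse internal whitespace before numbering, the row-building step could look like this (the class and method names TitleRows and toRows are hypothetical):

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    final class TitleRows {
        // Builds "index<TAB>placeholder<TAB>title" rows, skipping blank titles.
        static List<String> toRows(List<String> titles) {
            List<String> rows = new ArrayList<>();
            int idx = 1;
            for (String raw : titles) {
                // Collapse tabs, line breaks and repeated spaces into single spaces
                String title = raw.replaceAll("\\s+", " ").trim();
                if (title.isEmpty()) {
                    continue; // a blank row would otherwise become an empty document
                }
                rows.add(idx + "\tplaceholder\t" + title);
                idx++;
            }
            return rows;
        }

        public static void main(String[] args) {
            for (String row : toRows(Arrays.asList("first title", "", "second\ttitle"))) {
                System.out.println(row);
            }
        }
    }
    ```

    Only the rows that are kept receive an index, so the numbering stays sequential after blank titles are removed.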
    

    Then you can transform this file into MALLET format (with --keep-sequence, because train-topics only accepts feature sequences) using

    bin/mallet import-file --input titles_columns.tsv --output topic-input.mallet --keep-sequence

    as described in the importing data guide and run the topic modelling afterwards using

    bin/mallet train-topics --input topic-input.mallet

    as described in the topic modelling guide.
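
    If you would rather drive this step from Java as well, the following is a rough sketch that assembles the train-topics command and launches it with ProcessBuilder. It assumes the program is started from the MALLET installation directory so that bin/mallet resolves. The topic count of 20 and the output file names are arbitrary illustrative choices, and RunTrainTopics is a hypothetical class name. --num-topics, --output-topic-keys and --output-doc-topics are options of train-topics.

    ```java
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;

    final class RunTrainTopics {
        // Assembles the train-topics command line; the output file names are illustrative.
        static List<String> command(String input, int numTopics) {
            return Arrays.asList(
                "bin/mallet", "train-topics",
                "--input", input,
                "--num-topics", String.valueOf(numTopics),
                "--output-topic-keys", "topic-keys.txt",
                "--output-doc-topics", "doc-topics.txt");
        }

        public static void main(String[] args) throws IOException, InterruptedException {
            // Forward MALLET's console output and exit with its status code
            Process p = new ProcessBuilder(command("topic-input.mallet", 20)).inheritIO().start();
            System.exit(p.waitFor());
        }
    }
    ```

    The topic keys file lists the top words for each topic, and the doc-topics file gives the topic proportions for each title.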