Tags: java, arraylist, heap-memory, n-gram

Java heap space error when trying to create an NGram model


As part of a larger project I need to create an NGram model in Java (which is neither optimal nor optional). I am using JDK 20 and VS Code to run the code. When I run it I get this error:

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.HashMap.resize(HashMap.java:710)
        at java.base/java.util.HashMap.putVal(HashMap.java:635)
        at java.base/java.util.HashMap.put(HashMap.java:618)
        at Ngram.NGramNode.addNGram(NGramNode.java:277)
        at Ngram.NGramNode.addNGram(NGramNode.java:280)
        at Ngram.NGram.addNGramSentence(NGram.java:157)
        at com.glmadu.editdistance.TRspellChecker.getCorpus(TRspellChecker.java:68)
        at com.glmadu.editdistance.TRspellChecker.checkFileSpell(TRspellChecker.java:22)
        at com.glmadu.App.main(App.java:21)

I did increase the heap space to 8 GB via launch.json, and the corpus file is only around 750 MB. The relevant code is here:

    private static void getCorpus(String output) {
        ArrayList<ArrayList<String>> corpus = new ArrayList<>();

        try (BufferedReader br = new BufferedReader(new FileReader("path/to/corpus"))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] tokens = line.split(" "); //line 63
                ArrayList<String> sentence = new ArrayList<>();
                for (String token : tokens) {
                    sentence.add(token); // line 68
                }
                corpus.add(sentence);
            }
            NGram<String> nGram = new NGram<>(corpus, 2);
            nGram.saveAsText(output);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
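
For context, the heap was raised with a vmArgs entry in launch.json ("vmArgs": "-Xmx8g", the equivalent of passing -Xmx8g on the command line). A quick sanity check, sketched below, prints the heap ceiling the JVM actually received; if the launch configuration were not being picked up, it would report the default rather than 8 GB:

    // Minimal sketch: verify which -Xmx value the JVM is really running with.
    public class HeapCheck {
        public static void main(String[] args) {
            long maxBytes = Runtime.getRuntime().maxMemory(); // upper bound the heap may grow to
            System.out.printf("Max heap: %.2f GB%n", maxBytes / (1024.0 * 1024 * 1024));
        }
    }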

I do not understand how I can still get a heap space error after pushing the limit to 8 GB. I also tried 10 GB and 12 GB, but then I get a different error:

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
        at java.base/java.lang.String.split(String.java:3138)
        at java.base/java.lang.String.split(String.java:3212)
        at com.glmadu.editdistance.TRspellChecker.getCorpus(TRspellChecker.java:63)
        at com.glmadu.editdistance.TRspellChecker.checkFileSpell(TRspellChecker.java:22)
        at com.glmadu.App.main(App.java:21)

I am using VS Code to run this. Thanks in advance.

I tried increasing the heap size and reading fewer lines, but I still got the error even when reading only the first 1000 lines. I also tried not saving the NGram model, and from that I conclude the problem is not the NGram modeling itself but more that the arrays take too much space in memory. Also, when I checked memory usage in Task Manager, it sat at 4-5 GB and never got close to the 8 GB I allocated.
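
That last observation fits a rough back-of-the-envelope estimate (approximate figures, assuming a 64-bit JVM with compressed references) of what a 750 MB file costs once held as ArrayList<ArrayList<String>>:

    a ~6-character token on disk:       ~7 bytes (characters + separator)
    the same token as java.lang.String: ~48 bytes (String header/fields + backing byte[])
    per-token ArrayList slot:           ~4 bytes

    750 MB x (48 + 4) / 7  ≈  5.5 GB for the corpus lists alone,
    before the NGram's HashMap nodes (~32 bytes each) are added

HashMap.resize, the top frame of the first trace, also allocates a new table twice the size of the old one while the old table is still live, so the error tends to strike during such an allocation spike rather than at the moment the heap is exactly full.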


Solution

  • Alright, here is how I "solved" the problem. I set an initial array size according to @Sascha's response, but it still ran into problems, so I split the corpus into chunks, saved an NGram per chunk, and merged the parts afterwards (the corpus field and ARRAY_SIZE limit the code relies on are sketched after it).

    public static void getCorpus(String output) {
        int count = 0;      // lines in the current chunk
        int countMain = 0;  // number of part files written so far
        try (BufferedReader br = new BufferedReader(
                new FileReader("path\\to\\corpus"))) { // backslashes must be escaped in a string literal
            String line;
            while ((line = br.readLine()) != null) {
                count++;
                if (count > ARRAY_SIZE) {
                    // chunk is full: build its NGram, save it as a part file, and free the list
                    NGram<String> nGram = new NGram<>(corpus, 2);
                    saveNgram(output, nGram, countMain);
                    countMain++;
                    System.out.println("Clearing ArrayList " + countMain);
                    corpus.clear();
                    count = 0;
                }
                String[] tokens = line.split(" ");
                ArrayList<String> sentence = new ArrayList<>(tokens.length); // presized, per @Sascha's advice
                for (String token : tokens) {
                    sentence.add(token);
                }
                corpus.add(sentence);
            }
            // save any leftover lines as the last part
            NGram<String> nGram = new NGram<>(corpus, 2);
            saveNgram(output, nGram, countMain);
            corpus.clear();

            String outFile = output + "final" + ".txt"; // the merged model
            FileWriter fWrite = new FileWriter(outFile, StandardCharsets.UTF_8);
            BufferedWriter bfWrite = new BufferedWriter(fWrite);

            for (int i = 0; i <= countMain; i++) { // parts 0..countMain, including the leftover part
                String path = output + "_part" + i + ".txt";

                mergeNGram(path, bfWrite);
            }
            bfWrite.close(); // also flushes and closes the underlying FileWriter

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
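
    The corpus list and the ARRAY_SIZE chunk limit used above are class fields that are not shown in the post; a hypothetical declaration (the chunk size is illustrative and should be tuned to the available heap) would be:

        // Hypothetical fields assumed by getCorpus; 100_000 lines per chunk is an illustrative value.
        private static final int ARRAY_SIZE = 100_000;
        private static final ArrayList<ArrayList<String>> corpus = new ArrayList<>(ARRAY_SIZE);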
    

    getCorpus takes a String path for the output file. To save each chunk it calls saveNgram, which is quite basic: it appends "_partx" to the output path and saves the NGram there.

    private static void saveNgram(String outputPath, NGram<String> nGram, int countMain) {
        String finalPath = outputPath + "_part" + countMain + ".txt";
        File nFile = new File(finalPath);
        if (!nFile.exists()) { // do not overwrite a part that was already written
            nGram.saveAsText(finalPath);
        }
    }
    

    At the end, any leftover lines are saved as one more part, and mergeNGram, which is just a BufferedReader/BufferedWriter copy loop, appends each part to the final file:

    private static void mergeNGram(String path, BufferedWriter bfWrite) {
        File inFile = new File(path);
        try (BufferedReader bfRead = new BufferedReader(
                new FileReader(inFile, StandardCharsets.UTF_8))) {
            String line;
            while ((line = bfRead.readLine()) != null) {
                bfWrite.write(line);
                bfWrite.write(System.lineSeparator());
            }
            bfWrite.flush(); // keep bfWrite open; the caller closes it after the last part
        } catch (IOException e) {
            e.printStackTrace(); // a missing part file is just logged and skipped
        }
    }
    

    It is nowhere near perfect, but it solves my current problem, and that is all I can do at the moment. Special thanks to @Sascha for the help. I am leaving this here so anyone with a similar problem can find and adapt it.
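
    For anyone adopting this, a hypothetical invocation would look like the following (the "models/ngram" prefix is illustrative); it produces models/ngram_part0.txt, models/ngram_part1.txt, ... and merges them into models/ngramfinal.txt:

        public static void main(String[] args) {
            getCorpus("models/ngram"); // hypothetical prefix; parts and the merged file land next to it
        }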