java, som, large-data

How to process large data files iteratively?


I have a space-separated data file with 4.5 million entries in the following format:

CO_1 A 0 0 0 0 1

CO_2 A 0 0 0 0 1

This data file is used as input to the Self-Organizing Map (SOM) algorithm, which iterates through the file 100 times (in my case).

I use the following readFile function to copy the file completely into the temp string and pass the string on to the SOM algorithm.

public String readFile()
{
    String temp = "";

    try
    {
        FileReader file = new FileReader(FILE_LOCATION);
        BR = new BufferedReader(file);
        String strLine = null;

        while((strLine = BR.readLine()) != null)
        {
            temp += strLine + "\n";
        }
    }
    catch(Exception e)
    {
        
    }
    
    return temp;
}

However, I feel the above method puts a heavy burden on memory and slows down the iterations, which could result in memory overruns. Currently I'm running this code on a cluster with a 30 GB memory allocation, and the execution has not completed even a single iteration after about 36 hours.

I cannot read the file partially (as in blocks of lines), since the SOM would have to poll for more data once the initial block is done, which could result in even further complications.

Any ideas on how I could improve this so that I can successfully iterate over 4.5 million entries 100 times?

EDIT

The whole file is read into the string using the above method only once. The string variable is then used throughout the 100 iterations. However, in every iteration a string tokenizer is used to process each line of the file, so the total number of tokenizer passes is the number of lines times the number of iterations.


Solution

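  • Never build a large string by repeated concatenation in a loop. Each += creates a new String and copies everything read so far, so the total cost grows quadratically with the size of the file.
    Use the StringBuffer class instead (or StringBuilder, which offers the same methods without synchronization and is usually preferred when only one thread is involved).
    Consider the following example: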

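    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public StringBuffer readFile()
    {
        StringBuffer tempSB = new StringBuffer();

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new FileReader(FILE_LOCATION)))
        {
            String strLine;

            while ((strLine = reader.readLine()) != null)
            {
                // append() extends the existing buffer instead of copying the whole string
                tempSB.append(strLine);
                tempSB.append("\n");
            }
        }
        catch (IOException e)
        {
            // report the failure instead of silently swallowing it
            e.printStackTrace();
        }

        return tempSB;
    }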
    

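    This avoids allocating a new, ever-larger String for every line read, which reduces both the pressure on the heap and the time spent copying characters.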
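
Since the question mentions that every line is tokenized again in each of the 100 iterations, it may also help to parse the file only once and keep the numeric values in memory. The following is a minimal sketch of that idea, not the asker's actual code: the class SomDataLoader, the Entry type, and the load method are hypothetical names, and it assumes that the first two columns (such as CO_1 and A) are labels and that all remaining columns are numeric.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the original code): parses the file once
// so that every SOM iteration can reuse the numeric vectors directly.
class SomDataLoader {

    // One parsed row. Assumption: two leading label columns, then numbers.
    static final class Entry {
        final String id;         // e.g. "CO_1"
        final String label;      // e.g. "A"
        final double[] features; // e.g. {0, 0, 0, 0, 1}

        Entry(String id, String label, double[] features) {
            this.id = id;
            this.label = label;
            this.features = features;
        }
    }

    // Reads the whole file line by line and returns the parsed rows.
    static List<Entry> load(String path) throws IOException {
        List<Entry> entries = new ArrayList<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines between entries
                }
                String[] parts = line.split("\\s+");
                if (parts.length < 3) {
                    throw new IOException("Malformed line: " + line);
                }
                // Convert every column after the two labels to a double.
                double[] features = new double[parts.length - 2];
                for (int i = 2; i < parts.length; i++) {
                    features[i - 2] = Double.parseDouble(parts[i]);
                }
                entries.add(new Entry(parts[0], parts[1], features));
            }
        }
        return entries;
    }
}
```

With this approach, the training loop would call SomDataLoader.load(FILE_LOCATION) once and then walk the returned list in each of the 100 iterations, reading entry.features directly instead of tokenizing text again. How the vectors are passed to the SOM depends on the implementation in use, which the question does not show.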