I have written a program similar to external sort. I got a good idea from this blog. Here, they are trying to do external sort of only numbers. My requirement is little different. My input file can have over million records and it is difficult to sort them in-memory, so I have to make use of my disk. I divide my input into various slices, sort it and then store it in temporary files. And then merge sorted output into a single file. Below I am able to split it into temporary files and then merge only the keys.
I have an input file as below:
key1 abc
key2 world
key1 hello
key3 tom
key7 yankie
key3 apple
key5 action
key7 jack
key4 apple
key2 xon
key1 lemon
Let's say the size of the file on disk is 10 and maximum items memory buffer can hold is 4, so what I have done is taken 4 records at a time and stored it in HashMap, sorted my values along with the updated count. This input will be split into 3 sorted files as shown below. You can see, for each key I have a count and also the lexicographically highest value.
temp-file-0.txt
key1: 2, hello
key2: 1, world
key3: 1, tom
temp-file-1.txt
key5: 1, action
key3: 1, apple
key7: 2, yankie
temp-file-2.txt
key1: 1, lemon
key2: 1, xon
key4: 1, apple
then after merging all these 3 files, output should look like this:
key1: 3 lemon
key2: 2 xon
key3: 2 world
key5: 1 action
key7: 2 yankie
I am not sure about the logic of merging the entire line along with count and the lexicographically highest value for that key, my below code is able to give me all keys, something like this:
key1
key1
key2
key2
key3
key4
key5
key3
key7
In below code, I am opening each file and merging them, then writing back to disk into a new single file called external-sorted.txt
static int N = 10; // size of the file in disk
static int M = 4; // max items the memory buffer can hold
int slices = (int) Math.ceil((double) N/M);
String tfile = "temp-file-";
//Reading all the 3 temp files
BufferedReader[] brs = new BufferedReader[slices];
String[] topNums = new String[slices];
for(i = 0; i<slices; i++){
brs[i] = new BufferedReader(new FileReader(tfile + Integer.toString(i) + ".txt"));
String t = brs[i].readLine();
String[] kv = t.split(":");
if(t!=null){
topNums[i] = kv[0];
}
//topNums [key1, key5, key1]
}
FileWriter fw = new FileWriter("external-sorted.txt");
PrintWriter pw = new PrintWriter(fw);
for(i=0; i<N; i++){
String min = topNums[0];
System.out.println("min:"+min);
int minFile = 0;
for(j=0; j<slices; j++){
if(min.compareTo(topNums[j])>0)
{
min = topNums[j];
minFile = j;
}
}
pw.println(min);
String t = brs[minFile].readLine();
String[] kv = new String[2];
if (t != null)
kv = t.split(":");
topNums[minFile] = kv[0];
}
for (i = 0; i < slices; i++)
brs[i].close();
pw.close();
fw.close();
}
Any ideas are appreciated. Please ask, if you have any questions. TIA.
Well, something like this works, I'm sure there are better ways, but at the moment I'm not up to thinking really:
// Declare Scanner Object to read our file
Scanner in = new Scanner(new File(stringRepresentingLocationOfYourFileHere));
// create Map that will contain keys in sorted order (TreeMap)
// along with last value assigned to the key
Map<String, String> mapa = new TreeMap<>();
// another map to hold keys from first map and number of
// occurrences of those keys (repetitions), this could have been
// done using single Map as well, but whatever
Map<String, Integer> mapaDva = new HashMap<>();
// String array that will hold words of each line of our .txt file
String[] line;
// we loop until we reach end of our .txt file
while(in.hasNextLine()){
// check if map already contains given key, if it does
// increment value by 1 otherwise initialize the value with 1
if (mapa.put((line = in.nextLine().split(" "))[0], line[1]) != null)
mapaDva.put(line[0], mapaDva.get(line[0])+1);
else
mapaDva.put(line[0], 1);
}
// loop through our maps and print out keys, number of
//repetitions, last assigned value
for (Map.Entry<String, String> m : mapa.entrySet()){
System.out.println(m.getKey() + " " + mapaDva.get(m.getKey()) + " " + m.getValue());
}
If there is something specific that is unclear about this code, please ask.
Sample input file:
key1 abcd
key2 zzz
key1 tommy
key3 world
Output when done:
key1 2 tommy
key2 1 zzz
key3 1 world
EDIT 2 (solution for when dealing with multiple files):
// array of File objects that hold path to all your files to iterate through
File[] files = {new File("file1.txt"),
new File("file2.txt"),
new File("file3.txt")};
Scanner in;
Map<String, String> mapa = new TreeMap<>();
Map<String, Integer> mapaDva = new HashMap<>();
String[] line;
for (int i = 0; i < files.length; i++) {
// assign new File to Scanner on each iteration (go through our File array)
in = new Scanner(files[i]);
while(in.hasNextLine()){
if (mapa.put((line = in.nextLine().split(" "))[0], line[1]) != null)
mapaDva.put(line[0], mapaDva.get(line[0])+1);
else
mapaDva.put(line[0], 1);
}
}
for (Map.Entry<String, String> m : mapa.entrySet()){
System.out.println(m.getKey() + " " + mapaDva.get(m.getKey()) + " " + m.getValue());
}
So we store all the File objects in our File array and we go through each and every one of them, combine all the content and print out final result:
3 sample input files:
file1.txt
key1 abcd
key2 zzz
key1 tommy
key3 world
file2.txt
key1 abc
key3 xxx
key1 tommy
key6 denver
file3.txt
key5 lol
key8 head
key6 tommy
key6 denver
OUTPUT:
key1 4 tommy
key2 1 zzz
key3 2 xxx
key5 1 lol
key6 3 denver
key8 1 head