I am a high school student conducting primary research on "To what extent do variances in the performance of CPU, RAM, and storage affect the performance of the merge sort algorithm when it is run on large data sets?"
Methodology
The research question will be investigated by running the merge sort algorithm while varying individual hardware components (CPU, RAM, and primary drive) and evaluating which component is the most effective at increasing the efficiency of the algorithm. The performance of the hardware will be varied by underclocking and overclocking the CPU and RAM, and by storing the program that runs the algorithm on an SSD versus an HDD, while recording the time the algorithm takes to sort an array of 500 randomly generated integers.

A small data set was used, in comparison to the Big Data handled by major companies, to avoid taxing the limited available resources by constantly pushing the hardware to its limits through bottlenecks or by burning out components. To truly understand the efficacy of the merge sort algorithm on a big data set, the data set should ideally contain around a billion elements, but that would require superior CPU, RAM, storage, and cooling solutions to protect the hardware and keep up with processing that much data. To reiterate, the merge sort algorithm is popular in large corporations for easily locating items in large lists, so in practice it has likely been modified to be more efficient and to better handle billions of data elements.

Before conducting this experiment, it is important to control several extraneous variables that could skew the accuracy of the results presented in this essay. Firstly, the operating system on which the algorithm is run must be the same for all trials, so that the way the OS allocates and prioritizes memory does not change between tests. Additionally, all the hardware components, such as the CPU, RAM, graphics card, motherboard, and cooling solution, must be the same for all tests to avoid manufacturing differences in protocols or specifications such as available cache, latency, number of cores, or multithreading performance; any of these differences could directly improve or degrade the performance of the merge sort algorithm and thus distort the results. Lastly, no other programs should be open during testing, so that they cannot consume memory or processing power intended for sorting.
Here is the algorithm I am going to be running:
import java.util.Random;

public class Merge_Sort {

    private int[] arr;
    private final Random rand = new Random();

    public static void main(String[] args) {
        Merge_Sort sorter = new Merge_Sort();
        long startTime = System.nanoTime();
        sorter.createArray();
        sorter.mergeSort(0, sorter.arr.length - 1);
        long endTime = System.nanoTime();
        long totalTime = endTime - startTime;
        System.out.println("Total time to run the algorithm: "
                + totalTime / 1_000_000_000.0 + " seconds");
    }

    public void createArray() {
        arr = new int[500];
        for (int i = 0; i < arr.length; i++) {
            arr[i] = randomFill();
        }
    }

    public int randomFill() {
        return rand.nextInt();
    }

    public void merge(int l, int m, int r) {
        int n1 = m - l + 1;
        int n2 = r - m;

        /* Create temp arrays */
        int[] L = new int[n1];
        int[] R = new int[n2];

        /* Copy data to temp arrays L[] and R[] */
        for (int i = 0; i < n1; i++) {
            L[i] = arr[l + i];
        }
        for (int j = 0; j < n2; j++) {
            R[j] = arr[m + 1 + j];
        }

        /* Merge the temp arrays back into arr[l..r] */
        int i = 0; // initial index of first subarray
        int j = 0; // initial index of second subarray
        int k = l; // initial index of merged subarray
        while (i < n1 && j < n2) {
            if (L[i] <= R[j]) {
                arr[k] = L[i];
                i++;
            } else {
                arr[k] = R[j];
                j++;
            }
            k++;
        }

        /* Copy the remaining elements of L[], if there are any */
        while (i < n1) {
            arr[k] = L[i];
            i++;
            k++;
        }

        /* Copy the remaining elements of R[], if there are any */
        while (j < n2) {
            arr[k] = R[j];
            j++;
            k++;
        }
    }

    /* l is the left index and r is the right index of the
       sub-array of arr to be sorted */
    public void mergeSort(int l, int r) {
        if (l < r) {
            // Same as (l + r) / 2, but avoids overflow for large l and r
            int m = l + (r - l) / 2;
            // Sort first and second halves, then merge them
            mergeSort(l, m);
            mergeSort(m + 1, r);
            merge(l, m, r);
        }
    }

    public void printArray(int[] A) {
        for (int i = 0; i < A.length; i++) {
            System.out.println(A[i]);
        }
        System.out.println();
    }
}
Any comments on the methodology would be greatly appreciated, specifically whether my reasoning for using a small data set (500) is accurate or not.
Thanks a lot!
To benchmark Java code, you should use a proper benchmarking framework like JMH. JMH makes sure to warm up the JVM and to run your code enough times to get consistent results. Without it, you might simply be measuring the performance of JVM startup and JIT compilation, not the sorting. That means you'll be measuring something completely different from what you meant to measure: the results will be just noise, no signal.
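As a rough idea of what that looks like, here is a minimal JMH sketch. It assumes the sort has been pulled out into a helper that sorts an int[] handed to it; the MergeSorter.sort name below is made up for illustration, while the annotations and Runner come from JMH itself:

// Minimal JMH sketch (not the asker's exact class). Assumes a hypothetical
// MergeSorter.sort(int[]) helper that sorts the array it is given in place.
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(1)
@State(Scope.Thread)
public class MergeSortBenchmark {

    private int[] master;

    @Setup(Level.Trial)
    public void setUp() {
        Random rand = new Random(42); // fixed seed so every trial sorts the same data
        master = new int[500];
        for (int i = 0; i < master.length; i++) {
            master[i] = rand.nextInt();
        }
    }

    @Benchmark
    public int[] mergeSort500() {
        int[] copy = Arrays.copyOf(master, master.length); // sort a fresh copy each call
        MergeSorter.sort(copy);                             // hypothetical refactored sort
        return copy;                                        // returning the result helps prevent dead-code elimination
    }

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(MergeSortBenchmark.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}

JMH then handles the warm-up iterations, forking, and statistics for you, which is exactly the plumbing that is easy to get wrong by hand.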
500 integers is a ridiculously low number. If every integer is 4 bytes long, that's only 2000 bytes of data. This means a few things:
The entire "dataset" will be sorted in a very short time - we're talking microseconds here. This will be very difficult to measure accurately. On the whole, general-purpose operating systems aren't great at timing accuracy below 10-20 ms, which is probably 100 to 1000 times longer than it'll take to sort 500 ints. So you'll need to sort those numbers a whole bunch of times (say 1000), see how long that takes, and then divide by 1000 to see how long a single run took (there's a rough sketch of that approach after the analogy below). This brings us to the next problem:
The entire "dataset" will probably fit into a single memory page. Moreover, it'll fit into the CPU cache in its entirety. Heck, it can all fit into L1, the smallest (and fastest) of the CPU caches. This means that during sorting, everything will be done within the CPU, so no memory accesses, no disk access. The size and clock speed of RAM will therefore impact only the initial loading of the 500 integers, and even then, the impact will be negligible. And since you'll need to run the test thousands of times for a single benchmark, you won't even see any loading times in your result, since the data will only be loaded once for all those runs.
In other words, using 500 ints is like comparing different types of engine oil by measuring the speed of a car over a distance of one meter (3.3 feet, if you're from that side of the ocean).
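If you do want a quick-and-dirty estimate before setting up JMH, the repeat-and-divide idea from above looks roughly like this. Again a sketch, reusing the same hypothetical MergeSorter.sort helper, with a throwaway pass first so the JIT has a chance to compile the sort:

// Rough sketch of the repeat-and-divide timing described above.
// MergeSorter.sort(int[]) is a hypothetical helper; JMH is still the better tool.
import java.util.Arrays;
import java.util.Random;

public class CrudeTiming {
    public static void main(String[] args) {
        final int SIZE = 500;
        final int RUNS = 1000;

        Random rand = new Random(42);
        int[] master = new int[SIZE];
        for (int i = 0; i < SIZE; i++) {
            master[i] = rand.nextInt();
        }

        // Throwaway rounds: crude warm-up so the JIT compiles the sort first.
        for (int i = 0; i < RUNS; i++) {
            MergeSorter.sort(Arrays.copyOf(master, SIZE));
        }

        long checksum = 0; // use the results so the JIT can't optimize the work away
        long start = System.nanoTime();
        for (int i = 0; i < RUNS; i++) {
            int[] copy = Arrays.copyOf(master, SIZE);
            MergeSorter.sort(copy);
            checksum += copy[0];
        }
        long elapsed = System.nanoTime() - start;

        System.out.printf("Average time per sort: %.3f microseconds (checksum %d)%n",
                elapsed / (double) RUNS / 1_000.0, checksum);
    }
}

Note that the per-run array copy is included in the measured loop, which is exactly the kind of distortion JMH is better at isolating.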
For any meaningful result, you need to vastly increase the amount of data you're sorting, and of course use JMH. I'd also use different data sizes and throw in some additional sorting algorithms to compare against. It would be interesting to show how the size of the input and the different hardware choices affect the results.
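For example, the JMH sketch above could be extended with a size parameter and a baseline sort, so one harness covers several input sizes and a reference algorithm. The sizes below are just examples, and MergeSorter.sort is still the hypothetical helper from before:

// Sketch: parameterize the input size and add a java.util.Arrays.sort baseline.
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(1)
@State(Scope.Thread)
public class SortComparisonBenchmark {

    @Param({"500", "50000", "5000000"}) // example sizes; pick what your hardware can handle
    public int size;

    private int[] master;

    @Setup(Level.Trial)
    public void setUp() {
        Random rand = new Random(42); // fixed seed: every configuration sorts the same data
        master = new int[size];
        for (int i = 0; i < size; i++) {
            master[i] = rand.nextInt();
        }
    }

    @Benchmark
    public int[] mergeSort() {
        int[] copy = Arrays.copyOf(master, master.length);
        MergeSorter.sort(copy); // hypothetical refactored merge sort
        return copy;
    }

    @Benchmark
    public int[] jdkSortBaseline() {
        int[] copy = Arrays.copyOf(master, master.length);
        Arrays.sort(copy); // the JDK's sort for int[], as a reference point
        return copy;
    }
}

Plotting the results per size and per hardware configuration would give you exactly the kind of comparison your research question is after.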