Tags: java, algorithm, sorting, merge sort, methodology

Comment on the Methodology of Primary Research investigating "The Effect of Hardware Variances on the Performance of the Merge Sort Algorithm"


I am a high school student conducting primary research on "To what extent do variances in the performance of CPU, RAM, and storage affect the performance of the merge sort algorithm when it is run on large data sets?"

Methodology

The research question will be investigated by running the merge sort algorithm on various hardware configurations (CPU, RAM, and primary drive) and evaluating which hardware component is the most effective at increasing the efficiency of the algorithm. The performance of the hardware will be varied by underclocking and overclocking the CPU and RAM, and by storing the program that runs the algorithm on an SSD versus an HDD, while recording the time it takes for the algorithm to sort an array of 500 randomly generated integers.

A small data set was used, compared to the big data sets processed by major companies, to avoid taxing the limited available resources by constantly pushing the hardware to its limits, which could cause bottlenecks or overheating. To truly understand the efficacy of the merge sort algorithm on a big data set, the data set should ideally contain around a billion elements, but that would require superior CPU, RAM, storage, and cooling solutions to protect the hardware and keep up with processing such large amounts of data. To reiterate, the merge sort algorithm is widely used by large corporations to quickly locate items in large lists, so in practice it has likely been modified to be more efficient and to better handle billions of data elements.

Before conducting this experiment, it is important to control various extraneous variables which could skew the accuracy of the results presented in this essay. Firstly, the operating system on which the algorithm is run must be the same during all trials so that the way the OS allocates and prioritizes memory does not change between tests. Additionally, all hardware components, such as the CPU, RAM, graphics card, motherboard, and cooling solution, must be the same for all tests to avoid manufacturing differences in protocols or specifications such as available cache, latency, number of cores, or multithreading performance. All of these differences can directly improve or degrade the performance of the merge sort algorithm and thus distort the results. Lastly, no other programs should be open during testing, so that they cannot consume memory or processing power intended for the sorting task.

Here is the algorithm I am going to be running:

import java.util.Random;
public class Merge_Sort {

    private int[] arr;

    public static void main(String[] args) {
        // main must be static for the JVM to call it, so create an instance to work on
        Merge_Sort sorter = new Merge_Sort();
        sorter.createArray();
        // Start timing after the array is built so only the sort itself is measured
        long startTime = System.nanoTime();
        sorter.mergeSort(0, sorter.arr.length - 1);
        long endTime = System.nanoTime();
        long totalTime = endTime - startTime;
        System.out.println("Total Time to run the algo: " +
                totalTime / 1_000_000_000.0 + " s");
    }

    public void createArray() {
        arr = new int[500];
        for (int i = 0; i < arr.length; i++) {
            arr[i] = randomFill();
        }
    }

    // Reuse a single Random instance rather than constructing a new one per element
    private final Random rand = new Random();

    public int randomFill() {
        return rand.nextInt();
    }

    public void merge(int l, int m, int r) {
        int i, j, k;
        int n1 = m - l + 1;
        int n2 = r - m;

        /* create temp arrays */
        int[] L = new int[n1];
        int[] R = new int[n2];

        /* Copy data to temp arrays L[] and R[] */
        for (i = 0; i < n1; i++) {
            L[i] = arr[l + i];
        }
        for (j = 0; j < n2; j++) {
            R[j] = arr[m + 1 + j];
        }
        /* Merge the temp arrays back into arr[l..r]*/
        i = 0; // Initial index of first subarray 
        j = 0; // Initial index of second subarray 
        k = l; // Initial index of merged subarray 
        while (i < n1 && j < n2) {
            if (L[i] <= R[j]) {
                arr[k] = L[i];
                i++;
            } else {
                arr[k] = R[j];
                j++;
            }
            k++;
        }

      /* Copy the remaining elements of L[], if there 
      are any */
        while (i < n1) {
            arr[k] = L[i];
            i++;
            k++;
        }

      /* Copy the remaining elements of R[], if there 
      are any */
        while (j < n2) {
            arr[k] = R[j];
            j++;
            k++;
        }
    }

    /* l is for left index and r is right index of the 
    sub-array of arr to be sorted */
    public void mergeSort(int l, int r) {
        if (l < r) {
            // Same as (l + r) / 2, but avoids integer overflow for large l and r
            int m = l + (r - l) / 2;
            // Sort first and second halves 
            mergeSort(l, m);
            mergeSort(m + 1, r);

            merge(l, m, r);
        }
    }

    public void printArray(int[] A) {
        int i;
        for (i = 0; i < A.length; i++) {
            System.out.println(A[i]);
        }
        System.out.println("\n");
    }
}

Any comments on the methodology would be greatly appreciated, specifically whether my reasoning for using a small data set (500 elements) is sound.

Thanks a lot!


Solution

  • For any meaningful result, you need to vastly increase the amount of data you're sorting, and of course use JMH. I'd also use different data sizes and throw in some additional sorting algorithms to compare against. It would be interesting to show how the size of the input and the different hardware choices affect the results. A rough sketch of such a JMH benchmark is shown below.
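As a rough illustration of the JMH-based setup suggested above, here is a minimal benchmark sketch. It assumes the JMH dependencies (org.openjdk.jmh:jmh-core plus the annotation processor) are on the classpath; the class name SortBenchmark, the sizes in @Param, and the fixed seed are arbitrary choices. It uses java.util.Arrays.sort as a stand-in for the sort being measured; your own merge sort could be substituted once it is adapted to accept the array to sort as a parameter.

import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(1)
public class SortBenchmark {

    // Benchmark several input sizes so the effect of data size and hardware can be compared
    @Param({"500", "50000", "5000000"})
    int size;

    int[] data;

    @Setup
    public void setUp() {
        // Fixed seed so every trial and every hardware configuration sorts identical data
        Random rand = new Random(42);
        data = new int[size];
        for (int i = 0; i < size; i++) {
            data[i] = rand.nextInt();
        }
    }

    @Benchmark
    public int[] sort() {
        // Copy first so each invocation sorts unsorted data;
        // replace Arrays.sort with your merge sort once it takes the array as a parameter
        int[] copy = Arrays.copyOf(data, data.length);
        Arrays.sort(copy);
        return copy;
    }
}

Running this (for example through the JMH Maven archetype or the org.openjdk.jmh.Main runner) reports an average time per input size after JVM warmup, which you can then compare across the underclocked, stock, and overclocked configurations instead of relying on a single System.nanoTime measurement.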