How to improving array benchmark performance in PolyML?

I have the following benchmark which iterates over an array, setting the next entry to one plus the previous entry. If the number gets bigger than a certain cap, I set the entry to zero, and proceed. Then at the end I sum the entries in the array.

Question : how can I improve the benchmark results for PolyML?

The times are as following on Ubuntu x86-64 :

polyml (using CFLAGS=O3) = 
1250034994

real    0m54.207s
user    0m52.604s
sys 0m0.792s

g++ (O3) = 
1250034994

real    0m4.628s
user    0m4.578s
sys 0m0.028s

I can get mlton to run almost as fast as the c code (5.2s), but I am particularly interested in PolyML because it builds seamlessly in Windows 7 with the latest version of gcc. (For build instructions for polyML on Windows 7 with MSYS / MSYS2 and mingw gcc compiler see http://lists.inf.ed.ac.uk/pipermail/polyml/2015-August/001593.html)

On windows 7 I have had problems building the latest version of mlton with the latest version of gcc (similar issue to that in https://github.com/MLton/mlton/issues/61#issuecomment-50982499 )

The SML code is :

val size:int = 50000;
val loops:int = 30000;
val cap:int = 50000;

val data:int array = Array.array(size,0);


fun loop () = 
  let 
    fun loopI i = 
      if i = size then
        let val _ = () in
          Array.update(data,0,Array.sub(data,size-1));
          ()
        end
      else 
        let val previous = Array.sub(data,i-1) 
            val use = if previous > cap then 0 else previous in
          Array.update(data,i,use+1);
          loopI (i+1)
      end
  in loopI 1 end

fun benchmarkRun () = 
  let
    fun bench i = 
      if i = loops then ()
      else let val _ = () in 
             loop ();
             bench (i+1)
           end
  in bench 1 end

fun sum (i,value) =
  if i = size then value 
  else sum(i+1,value+Array.sub(data,i))

fun main () = let val _ = () in 
  benchmarkRun();
  print (Int.toString (sum (0,0)));
  print "\n"
  end

(*val _ = main ()*)

and the c++ code is :

#include <iostream>
#include <vector>
using namespace std;

int size = 50000;
int loops = 30000;
int cap = 50000;

vector<int> data(size);

void loop(){
  int previous, use;
  for(int i=1; i<size; i++){
    previous = data[i-1];
    if(previous > cap){
      use = 0;
    }else{
      use = previous;
    }
    data[i] = use + 1;
  }
  data[0] = data[size-1];
}

void benchmarkRun(){
  for(int i=1; i<loops; i++){
    loop();
  }
}

int sum(){
  int res = 0;
  for(int i=0; i<size; i++){
    res += data[i];
  }
  return res;
}

int main(){
  benchmarkRun();
  cout<<sum()<<endl;
}

Solution

I don't think there is anything wrong with your program. In my experience, mlton is the best performing SML compiler by a wide margin, especially for "C-like" code.

Here are some ways you could write it differently that might help the compiler do a better job:

It's possible that Poly/ML is boxing each element of the array. Boxing means allocating an object that contains the integer value, rather than just storing a flat array of integers. This is very expensive: You have many more allocations, indirections, worse cache locality, and more expensive GC. This is sort of fundamental to the compiler, but you might get better performance if you use a monomorphic array like IntArray.array or Word32Array.array. These are optional parts of the Basis: http://sml-family.org/Basis/mono-array.html

It may be slow because of bounds checks. Every iteration through the loop you do a "sub" and "update" call, each of which will (naively) check that the argument against the size of the array and then branch to throw an exception if it's outside bounds. You may be able to reduce the penalty from bounds checking by:

Using a function like Array.modifyi, which can know that the input and output indices are in bounds (you'd still pay for the "sub")
Using a function like ArraySlice.foldli, where you could also be passing the value from the previous cell to the next iteration
Using unsafe array accesses (if Poly/ML supports it; look for an 'Unsafe' structure).

It may be slow because of integer overflow checks. Here after each addition it checks to see if the result can't be represented, and branches to throw an exception. Using something like Word32.word instead of int may improve performance. There are also sometimes compiler flags for turning this off, though it's a pretty dangerous thing to do since other people's code may depend on this part of the language.

Most of these transformations are going to make the code look weirder. I do think it would improve both your program and its performance to pass the previous element's value to your loopI function instead of reading it out with Array.sub. You usually just had that value.

If you are concerned about performance, though, mlton is the way to go. I use x86_64 binaries with mingw64 and they work for me, including linking against C code.