python tensorflow keras neural-network chess

Is there any way to speed up the prediction of a model?


We are currently building a neural network with Keras and TensorFlow for evaluating chess positions, and we have run into a problem with the speed of prediction on a single sample, which is how the network is used in our search tree. The search generates the legal moves in a given position, evaluates each resulting position, and picks the move with the best evaluation.

At higher depths the prediction speed is what makes the search slow. It is worth mentioning that our neural network is fairly shallow - 3 CNN layers and 2 dense layers. We trained the model on the CPU and we also run prediction on the CPU; we assumed that for such a shallow network this would not hurt performance, and since there is no parallelism in our use case, we saw no need for GPU computing.

Versions:

Python 3.12.4

Tensorflow 2.16.1

Keras 3.3.3

We evaluate a single sample at a time using either model.predict() or model(x). We've discovered that predict_on_batch() on a single sample is roughly as fast as on a larger batch. Our goal is to get the prediction as fast as possible while still evaluating a single sample at a time.

We tried converting the model to TFLite, which was suggested as a way to get slightly better performance, but we couldn't convert it because of an incompatibility with the newest versions, and downgrading didn't work out either.

We measured the speed of each prediction method on different batch sizes.

import time

import keras

# bitboard is a single encoded position; bitboards is a batch of 100 such positions
model = keras.models.load_model('firstModel.keras')

print("Durations using model __call__() on small batch")
for i in range(5):
    start = time.time()
    prediction = model(bitboard)
    end = time.time()
    print(end - start)

print("Durations using model.predict_on_batch() on small batch")
for i in range(5):
    start = time.time()
    prediction = model.predict_on_batch(bitboard)
    end = time.time()
    print(end - start)

print("Durations using model.predict() on small batch")
for i in range(5):
    start = time.time()
    prediction = model.predict(bitboard, batch_size=1, verbose=0)
    end = time.time()
    print(end - start)

print("Durations using model.__call__() on larger batch (100 samples)")
for i in range(5):
    start = time.time()
    prediction = model(bitboards)
    end = time.time()
    print(end - start)

print("Durations using model.predict_on_batch() on larger batch (100 samples)")
for i in range(5):
    start = time.time()
    prediction = model.predict_on_batch(bitboards)
    end = time.time()
    print(end - start)

print("Durations using model.predict() on larger batch (100 samples)")
for i in range(5):
    start = time.time()
    prediction = model.predict(bitboards, batch_size=1, verbose=0)
    end = time.time()
    print(end - start)

And the speeds were as follows:

Durations using model __call__() on small batch
0.055520057678222656
0.007033586502075195
0.006206035614013672
0.007121562957763672
0.005555391311645508
Durations using model.predict_on_batch() on small batch
0.06325101852416992
0.0020132064819335938
0.0010013580322265625
0.0009975433349609375
0.0025305747985839844
Durations using model.predict() on small batch
0.1571955680847168
0.05691671371459961
0.05576348304748535
0.05414080619812012
0.05917525291442871
Durations using model.__call__() on larger batch (100 samples)
0.01164698600769043
0.00638890266418457
0.007528543472290039
0.006807804107666016
0.00751185417175293
Durations using model.predict_on_batch() on larger batch (100 samples)
0.04664158821105957
0.0025255680084228516
0.0010013580322265625
0.0020008087158203125
0.0025064945220947266
Durations using model.predict() on larger batch (100 samples)
0.05106091499328613
0.04923701286315918
0.06421136856079102
0.0651085376739502
0.055069923400878906

What troubles us, and what we still don't understand, is how it is possible to get a prediction on a larger batch in less time than a prediction on a single sample. We assumed it could be due to wrong keras/tensorflow usage.

Main questions:

Any suggestions on how to speed up the prediction?

Is there any chance that running the code on a GPU would improve the performance?

Would you recommend a different approach, or a different setup, that would better fit our problem?

EDIT:


We tried converting the model to the .onnx format. That really improved the time performance of the evaluation function, which is great, but we encountered another performance problem: our tree search (based on alpha-beta pruning) is also very slow. We are now hardly reaching depth 6 without interference of the model (searching about 2,500,000 nodes in one minute, and even fewer with interference), which is terrible.


So far we haven't added anything like a transposition table or move ordering; it is just a plain tree search. We are now considering translating it to C/C++. We have very little experience with performance tuning, so our question is: how much will it help, and how many nodes per second can we expect?


We are also wondering if it is a good idea to use our models trained with Python in our future C programs.


Solution

  • I will try to answer your questions in the broader context of chess programming.

    First of all, even though I might be stating the obvious, python isn't the best choice of programming language for writing a chess engine. If you wish to improve your chess engine's performance, translating it to c++ (or any other fast language) is definitely worth it. If you don't want to spend a lot of time doing this, it is also perfectly understandable.

    Concerning your question

    we don't understand how it is possible to get a prediction on a larger batch in less time than a prediction on a single sample.

    It is not because of "wrong keras/tensorflow usage". When you run inference in batches, you avoid various per-call overheads and the operations are vectorised.

    However, assuming you are using minimax as your search algorithm, batching will not help you: alpha-beta pruning is very hard to parallelise, and its speedup is so important that you can't do without it. The same goes for GPU inference (and the data transfer latency kills your performance if you try single-sample inference on the GPU).

    This is why most chess engines run on the cpu and do inference on single positions. If you are curious about GPU engines, you can take a look at how leela chess zero works.


    Now that this is out of the way, here is how you can speed up your neural network inference: convert the model to ONNX and run it with onnxruntime.

    import tensorflow as tf
    import tf2onnx

    # model is the Keras model to convert,
    # e.g. model = keras.models.load_model('firstModel.keras')

    # specify the input specs. For example, for a flat 768-feature input:
    input_spec = (tf.TensorSpec((None, 768), tf.float32, name="input"),)

    output = "onnx_model.onnx"

    model_proto, external_tensor_storage = tf2onnx.convert.from_keras(
        model, input_signature=input_spec, opset=15, output_path=output
    )

    and to run the network:

    from onnxruntime import InferenceSession

    data = ...  # your input data (float32, matching the input_spec above)

    sess = InferenceSession("onnx_model.onnx")
    input_name = sess.get_inputs()[0].name
    output_name = sess.get_outputs()[0].name
    nn_eval = sess.run([output_name], {input_name: data})
    
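
    For use inside the search you probably want to create the session once and reuse it for every position, since building an InferenceSession is far more expensive than a single run() call. A minimal sketch, assuming the (None, 768) float32 input from the example above (the Evaluator class and its method names are just illustrative):

    import numpy as np
    from onnxruntime import InferenceSession

    class Evaluator:
        def __init__(self, path="onnx_model.onnx"):
            # build the session once, up front
            self.sess = InferenceSession(path)
            self.input_name = self.sess.get_inputs()[0].name
            self.output_name = self.sess.get_outputs()[0].name

        def evaluate(self, bitboard):
            # bitboard: one encoded position; reshape it into a batch of one
            x = np.asarray(bitboard, dtype=np.float32).reshape(1, -1)
            return self.sess.run([self.output_name], {self.input_name: x})[0]

    # usage inside the tree search:
    # evaluator = Evaluator()
    # score = evaluator.evaluate(bitboard)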

    I talk in greater detail about these considerations in the readme of my chess engine if you want to have a look.

    I hope this helped!


    Edit concerning the new questions:

    I am happy to hear that your network's performance has improved. (When you say "interference" I guess you mean inference?)

    Also, I wouldn't say depth 6 is "terrible", it's a good starting point.

    how much will it help and how many nodes per second can we get?

    How much exactly I can't tell, but it will definitely improve the speed by quite a lot (but don't forget minimax has exponential growth, a 2x speedup doesn't at all mean 2x depth).
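
    As a rough illustration: with alpha/beta and decent move ordering the effective branching factor of chess drops from about 35 to roughly 6, so a depth-d search visits on the order of 6^d nodes, and a 2x speedup buys only about log(2)/log(6) ≈ 0.4 extra ply, nowhere near double the depth.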

    You talk about performance tuning. But in a chess engine, speed is not as important as you seem to think. The speed of the network can't really be improved indefinitely. At some point, you reach a hardware limit on how fast you can evaluate a position. Further improvements do not really come from speed, but rather from the search algorithm. Quoting the usual search progression:

    A reasonable search feature progression assuming you have the fundamentals i.e. negamax and alpha/beta pruning (ideally in a fail-soft framework):

    • Quiescent Search
    • Transposition Table
    • Basic Move Ordering (sorting TT move first, captures by MVV-LVA)
    • Iterative Deepening
    • Principal Variation Search
    • Reverse Futility Pruning
    • Null Move Pruning
    • [...]
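
    To make the first few of those items concrete, here is a minimal sketch of negamax with alpha/beta, a transposition table and MVV-LVA move ordering. It uses the python-chess library for move generation and a plain material count as a stand-in for your network evaluation, so treat it as an illustration to adapt rather than a drop-in for your engine:

    import chess
    import chess.polyglot

    PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                    chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

    EXACT, LOWER, UPPER = 0, 1, 2
    tt = {}  # zobrist hash -> (depth, score, bound flag)

    def evaluate(board):
        # stand-in evaluation (material count); plug your network call in here
        score = sum(v * (len(board.pieces(p, chess.WHITE)) - len(board.pieces(p, chess.BLACK)))
                    for p, v in PIECE_VALUES.items())
        return score if board.turn == chess.WHITE else -score

    def move_order_key(board, move):
        # MVV-LVA: try captures of valuable pieces by cheap pieces first
        if not board.is_capture(move):
            return 0
        victim = board.piece_type_at(move.to_square) or chess.PAWN  # en passant square is empty
        attacker = board.piece_type_at(move.from_square)
        return 10 * PIECE_VALUES[victim] - PIECE_VALUES[attacker]

    def negamax(board, depth, alpha, beta):
        key = chess.polyglot.zobrist_hash(board)
        entry = tt.get(key)
        if entry is not None and entry[0] >= depth:
            _, score, flag = entry
            if (flag == EXACT or (flag == LOWER and score >= beta)
                    or (flag == UPPER and score <= alpha)):
                return score
        if depth == 0 or board.is_game_over():
            if board.is_checkmate():
                return -10_000 - depth  # side to move is mated; prefer shorter mates
            return evaluate(board)
        alpha_orig, best = alpha, -float("inf")
        ordered = sorted(board.legal_moves, key=lambda m: move_order_key(board, m), reverse=True)
        for move in ordered:
            board.push(move)
            score = -negamax(board, depth - 1, -beta, -alpha)
            board.pop()
            best = max(best, score)
            alpha = max(alpha, score)
            if alpha >= beta:
                break  # beta cutoff
        # store the result with a bound flag so it can be reused safely
        flag = UPPER if best <= alpha_orig else LOWER if best >= beta else EXACT
        tt[key] = (depth, best, flag)
        return best

    print(negamax(chess.Board(), 4, -float("inf"), float("inf")))

    The bound flags (EXACT/LOWER/UPPER) matter: storing every score as exact would let results obtained under a narrow alpha/beta window leak back into the search as if they were full evaluations.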

    There is also a lot to learn from reading the code of other engines, you can find some very interesting ideas there.

    To test if your engine has improved, rather than benchmarking speed, you make the newer version play a lot of games against the older version, and estimate the elo gain. See SPRT testing, which is supported by fast chess or cute chess.

    is a good idea to use our models trained with Python in our future C programs?

    Yes, training models in python to use them in C/C++ is very common; most if not all chess engines do it. Python is much more convenient for training networks.

    Rather than editing your question, I would advise you to join the Stockfish discord server. A lot of helpful people can answer your questions there.