python numpy pygame-surface psychopy psychtoolbox

Fast direct pixel access in python, go or julia


I wrote a small program that generates random noise and displays it full screen (5K resolution) using pygame. However, the refresh rate is horribly slow: both surfarray.blit_array and the random number generation take a lot of time. Is there any way to speed this up? I am also open to using Julia or Go instead, or PsychoPy or Octave with Psychtoolbox (although those do not seem to work under Linux/Wayland).

Here is what I wrote:


import pygame
import numpy as N
import pygame.surfarray as surfarray
from numpy import int32, uint8, uint

 
def main():
     
    pygame.init()
     
    #flags = pygame.OPENGL | pygame.FULLSCREEN   # OpenGL does not want to work with surfarray
    flags = pygame.FULLSCREEN
    screen = pygame.display.set_mode((0,0), flags=flags, vsync=1)
    w, h = screen.get_width(), screen.get_height()

    clock = pygame.time.Clock()
    font = pygame.font.SysFont("Arial" , 18 , bold = True)
     
    # define a variable to control the main loop
    running = True

    def fps_counter():
        fps = str(int(clock.get_fps()))
        fps_t = font.render(fps , 1, pygame.Color("RED"))
        screen.blit(fps_t,(0,0))

                    
     
    # main loop
    while running:
        # event handling, gets all event from the event queue
        for event in pygame.event.get():
            # only do something if the event is of type QUIT
            if event.type == pygame.QUIT:
                # change the value to False, to exit the main loop
                running = False
            elif event.type == pygame.KEYDOWN:
                if event.key == pygame.K_ESCAPE:
                    pygame.quit()
                    return
        array_img = N.random.randint(0, high=100, size=(w,h,3), dtype=uint)
        surfarray.blit_array(screen, array_img)
        fps_counter()
        pygame.display.flip()
        clock.tick()
        #print(clock.get_fps())
     
# run the main function only if this module is executed as the main script
# (if you import this as a module then nothing is executed)
if __name__=="__main__":
    # call the main function
    main()

I need a refresh rate of at least 30 FPS for it to be useful.


Solution

  • Faster random number generation

    Generating random numbers is expensive. This is especially true when the random number generator (RNG) needs to be statistically accurate (i.e. the random numbers need to look very random even after some transformation), and when the numbers are generated sequentially.

    Indeed, for cryptographic usage or some mathematical (Monte Carlo) simulations, the target RNG needs to be advanced enough that there is no statistical correlation between successive generated random numbers. In practice, software methods to do that are so expensive that modern mainstream processors provide specific instructions for it. Not all processors support them though, and AFAIK Numpy does not use them (certainly for the sake of portability, since a random sequence with the same seed is expected to give the same results on multiple machines).

    Fortunately, RNGs do not need to be so accurate in most other use cases. They just need to look quite random. There are many methods to do that (e.g. Mersenne Twister, Xoshiro, Xorshift, PCG/LCG). There is often a trade-off between performance, accuracy and the specialization of the RNG. Since Numpy needs to provide a generic RNG that is relatively accurate (though AFAIK not meant to be used for cryptographic use cases), it is not surprising that its performance is sub-optimal.

    An interesting review of many different methods is available here (though the results should be taken with a grain of salt, especially regarding performance, since details like being SIMD-friendly are critical for performance in many use cases).

    Implementing a very fast random number generator in pure Python (using CPython) is not possible, but one can use Numba (or Cython) to do that. There may be existing fast modules written in native languages for this, though. On top of that, we can use multiple threads to speed up the operation. I chose to implement a Xorshift64 RNG for the sake of simplicity (and also because it is relatively fast).

    import numba as nb
    import numpy as np
    
    @nb.njit('uint64(uint64,)')
    def xorshift64_step(seed):
        seed ^= seed << np.uint64(13)
        seed ^= seed >> np.uint64(7)
        seed ^= seed << np.uint64(17)
        return seed
    
    @nb.njit('uint64()')
    def init_xorshift64():
        seed = np.uint64(np.random.randint(0x10000000, 0x7FFFFFFF)) # Bootstrap
        return xorshift64_step(seed)
    
    @nb.njit('(uint64, int_)')
    def random_pixel(seed, high):
        # Must be a constant for the sake of performance and in the range [0;256]
        max_range = np.uint64(high)
        # Generate 3 groups of 16 bits from the RNG
        bits1 = seed & np.uint64(0xFFFF)
        bits2 = (seed >> np.uint64(16)) & np.uint64(0xFFFF)
        bits3 = seed >> np.uint64(48)
        # Scale the numbers using a trick to avoid a modulo
        # (since a modulo is inefficient and statistically incorrect here)
        r = np.uint8(np.uint64(bits1 * max_range) >> np.uint64(16))
        g = np.uint8(np.uint64(bits2 * max_range) >> np.uint64(16))
        b = np.uint8(np.uint64(bits3 * max_range) >> np.uint64(16))
        new_seed = xorshift64_step(seed)
        return (r, g, b, new_seed)
    
    @nb.njit('(int_, int_, int_)', parallel=True)
    def pseudo_random_image(w, h, high):
        res = np.empty((w, h, 3), dtype=np.uint8)
        for i in nb.prange(w):
            # RNG seed initialization
            seed = init_xorshift64()
            for j in range(h):
                r, g, b, seed = random_pixel(seed, high)
                res[i, j, 0] = r
                res[i, j, 1] = g
                res[i, j, 2] = b
        return res
    

    The code is quite long, but it is about 22 times faster than Numpy on my i5-9600KF CPU with 6 cores. Note that similar code can be written in Julia to get a fast implementation (since Julia uses an LLVM-based JIT, similar to Numba).

    On my machine this is sufficient to reach 75 FPS (maximum) while the initial code reached 16 FPS.
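
    If Numba is not an option, a smaller step that stays in plain Numpy may still help. The sketch below is an assumption on my side and has not been benchmarked on the setup above: it uses the newer Generator API with the SFC64 bit generator (my choice here, not something the original code used) and asks for uint8 directly, so that 8 times less data is produced than with the 64-bit uint dtype of the question. The helper name numpy_noise is hypothetical.

```python
import numpy as np

# Assumption: SFC64 is picked as a fast built-in bit generator;
# PCG64 (the default of np.random.default_rng) would work the same way.
rng = np.random.Generator(np.random.SFC64(seed=42))

def numpy_noise(w, h, high=100):
    """Return a (w, h, 3) uint8 array with values in [0, high)."""
    # Requesting uint8 avoids generating and storing 64-bit integers
    # for values that fit in a single byte.
    return rng.integers(0, high, size=(w, h, 3), dtype=np.uint8)

# Small demo size; for the full screen, pass the display's width and height.
img = numpy_noise(320, 200)
```

    The resulting array has the same (w, h, 3) layout as in the question, so it can be passed to surfarray.blit_array in the same way.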


  • Faster operations & rendering

    Generating a new random array for each frame is limited by the speed of page faults on most platforms. This can significantly slow down the computation. The only way to mitigate this in Python is to create the frame buffer once and perform in-place operations on it. Moreover, PyGame certainly performs copies internally (and also probably many draw calls), so using a lower-level API can be significantly faster. Still, the operation will likely be memory-bound at that point and there is no way to avoid that. It might be fast enough for you by then, though.
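
    As an illustration of the "allocate once, update in place" idea, here is a minimal sketch in plain Numpy. It is not the Numba code above: it keeps one xorshift64 state per 8 output bytes and exposes the state bytes directly as the pixel array, so no array is allocated per frame. The helper names make_noise_buffers and step_noise are hypothetical, and the pixels cover the full 0..255 range rather than the [0, 100) range of the question.

```python
import numpy as np

def make_noise_buffers(w, h, seed=12345):
    """Allocate the RNG states, a scratch array and a pixel view once."""
    n_bytes = w * h * 3
    n_words = (n_bytes + 7) // 8  # one 64-bit xorshift state per 8 bytes
    rng = np.random.default_rng(seed)
    # Xorshift states must be non-zero, so draw from [1, 2**63).
    state = rng.integers(1, 2**63, size=n_words, dtype=np.uint64)
    scratch = np.empty_like(state)
    # The frame is a view on the state bytes: no copy is made.
    frame = state.view(np.uint8)[:n_bytes].reshape(w, h, 3)
    return state, scratch, frame

def step_noise(state, scratch):
    """Advance every xorshift64 state in place (shifts 13, 7, 17)."""
    steps = ((np.left_shift, 13), (np.right_shift, 7), (np.left_shift, 17))
    for shift_op, amount in steps:
        shift_op(state, np.uint64(amount), out=scratch)
        np.bitwise_xor(state, scratch, out=state)

# Buffers are created once; each call to step_noise refreshes `frame`.
state, scratch, frame = make_noise_buffers(640, 480)
step_noise(state, scratch)
```

    In the main loop, step_noise would be called once per frame and the same frame array passed to surfarray.blit_array each time. Since frame shares memory with state, it should be treated as read-only: scaling it in place would corrupt the generator.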

    On top of that, the frames are rendered on the GPU, so the CPU needs to send/copy the buffer to the GPU, typically through the PCIe interconnect for a discrete GPU. This operation is not very fast for high-resolution screens.

    Actually, you can generate the random image directly on the GPU using shaders (or tools like OpenCL/CUDA). This avoids the above overhead, and GPUs can do that faster than CPUs.