pythonlibmemcachedmemcachedb

Correct way to load bulk data


I am trying to load a huge amount of data into memcachedb. I am running some queries on MySQL database, and I want to store the results of these queries in memcachedb for later easy access.

Currently, I was just using simple set commands to store the results in memcachedb but since there are billions of these results, storing them one by one in a loop is very inefficient and time-consuming. So, I was wondering if there is a better way to load data into memcachedb? Like a data import wizard in traditional RDMS

I am using pylibmc to connect to memcachedb.


Solution

  • The pylibmc library has the set_multi function, which sends a bunch of commands in one go:

    mc.set_multi({
        'key': 'Hello',
        'another': True,
        #[..]
    })
    

    This should probably work well enough. If you have billions of keys, you probably want to split it into chunks of a few thousand.

    You can probably squeeze a bit more performance if you just send the commands over a socket. the memcache protocol is pretty simple. This has the advantage that you can add the noreply flag, so the server won't bother sending a reply. Of course, this means you can't do any error checking and that losing a few keys for whatever reason is fine.

    Here's a simple proof of concept:

    #!/usr/bin/env python
    
    import socket
    
    data = 'set key_1 0 86400 5\r\nabcde\r\n'
    data += 'set key_2 0 86400 5\r\nzxcvb\r\n'
    
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(('localhost', 11211))
    s.sendall(data)
    print(s.recv(8192))
    s.close()
    
    # Verify if it worked!
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(('localhost', 11211))
    s.sendall('get key_1\r\n')
    s.sendall('get key_2\r\n')
    print(s.recv(8192))
    s.close()
    

    Which should output:

    STORED
    STORED
    
    VALUE key_1 0 5
    abcde
    END
    VALUE key_2 0 5
    zxcvb
    END
    

    The format of the set command is:

    set <key> <flags> <exptime> <data_size> [noreply]\r\n
    <data>\r\n
    

    Of course, this is just a proof-of-concept; a slightly more advanced example might be something like:

    #!/usr/bin/env python
    
    import socket
    
    def make_set(n, data):
        return 'set key_{} 0 86400 {}\r\n{}\r\n'.format(n, len(data), data)
    
    data = open('/etc/aliases').readlines()
    commands = [ make_set(n, d.strip()) for n, d in enumerate(data) if d.strip() != '' ]
    
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(('localhost', 11211))
    s.sendall(''.join(commands))
    print(s.recv(65000))
    
    # Verify if it worked!
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(('localhost', 11211))
    for n in range(0, len(commands)):
        s.sendall('get key_{}\r\n'.format(n))
    print(s.recv(65000))
    s.close()
    

    If you're getting data from MySQL, then consider making a set command with an SQL query! For example:

    select
        concat('set key_', page_id, ' 0 86400 ', length(page_title), '\r\n', page_title, '\r\n')
        as cmd
    from page limit 2;
    

    Not sure this is actually faster, but I suspect it is.