python, hadoop, hbase, happybase

Broken pipe error when uploading data to Apache HBase


I'm currently trying to load a large CSV into Apache HBase. The CSV is 50,000 columns wide and 15,000 rows long; the values are just integers.

The HBase cluster is running on AWS EMR with plenty of memory (244 GB) and compute (4 nodes with 32 cores each).

I'm trying to load the data into the database with this Python script:

import happybase
import pandas as pd

connection = happybase.Connection('localhost')

# one column family, 's', kept in the block cache
families = {
    's': dict(in_memory=True)
}

#connection.delete_table('exon', disable=True)
connection.create_table('exon', families)

table = connection.table('exon')
df = pd.read_csv('exon.csv', nrows=1000)

# every column except the first, which holds the row key
col = list(df)[1:]

for index, row in df.iterrows():
    to_put = {}
    for col_name in col:
        to_put[('s:' + col_name).encode('utf-8')] = str(row[col_name]).encode('utf-8')
    print('putting: ' + str(row[0]))
    table.put(row[0].encode('utf-8'), to_put)

When the script reads only the first few rows, there is no issue:

df = pd.read_csv('exon.csv', nrows=20)

However, reading more rows causes an error:

df = pd.read_csv('exon.csv', nrows=1000)
putting: F1S4_160106_001_B01
Traceback (most recent call last):
  File "load.py", line 25, in <module>
    table.put(row[0].encode('utf-8'), to_put)
  File "/usr/local/lib/python3.6/site-packages/happybase/table.py", line 464, in put
    batch.put(row, data)
  File "/usr/local/lib/python3.6/site-packages/happybase/batch.py", line 137, in __exit__
    self.send()
  File "/usr/local/lib/python3.6/site-packages/happybase/batch.py", line 60, in send
    self._table.connection.client.mutateRows(self._table.name, bms, {})
  File "/usr/local/lib64/python3.6/site-packages/thriftpy2/thrift.py", line 200, in _req
    self._send(_api, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/thriftpy2/thrift.py", line 210, in _send
    args.write(self._oprot)
  File "/usr/local/lib64/python3.6/site-packages/thriftpy2/thrift.py", line 153, in write
    oprot.write_struct(self)
  File "thriftpy2/protocol/cybin/cybin.pyx", line 477, in cybin.TCyBinaryProtocol.write_struct
  File "thriftpy2/protocol/cybin/cybin.pyx", line 474, in cybin.TCyBinaryProtocol.write_struct
  File "thriftpy2/protocol/cybin/cybin.pyx", line 212, in cybin.write_struct
  File "thriftpy2/protocol/cybin/cybin.pyx", line 356, in cybin.c_write_val
  File "thriftpy2/protocol/cybin/cybin.pyx", line 115, in cybin.write_list
  File "thriftpy2/protocol/cybin/cybin.pyx", line 362, in cybin.c_write_val
  File "thriftpy2/protocol/cybin/cybin.pyx", line 212, in cybin.write_struct
  File "thriftpy2/protocol/cybin/cybin.pyx", line 356, in cybin.c_write_val
  File "thriftpy2/protocol/cybin/cybin.pyx", line 115, in cybin.write_list
  File "thriftpy2/protocol/cybin/cybin.pyx", line 362, in cybin.c_write_val
  File "thriftpy2/protocol/cybin/cybin.pyx", line 209, in cybin.write_struct
  File "thriftpy2/protocol/cybin/cybin.pyx", line 71, in cybin.write_i08
  File "thriftpy2/transport/buffered/cybuffered.pyx", line 55, in thriftpy2.transport.buffered.cybuffered.TCyBufferedTransport.c_write
  File "thriftpy2/transport/buffered/cybuffered.pyx", line 80, in thriftpy2.transport.buffered.cybuffered.TCyBufferedTransport.c_flush
  File "/usr/local/lib64/python3.6/site-packages/thriftpy2/transport/socket.py", line 136, in write
    self.sock.sendall(buff)
BrokenPipeError: [Errno 32] Broken pipe

Is it just too much data inserted at once? I've tried batched puts as well (sketch below), and the same issue comes up.
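For reference, the batched attempt looked roughly like this, using happybase's table.batch context manager (the batch_size of 100 is an arbitrary choice, not a tuned value):

import happybase
import pandas as pd

connection = happybase.Connection('localhost')
table = connection.table('exon')

df = pd.read_csv('exon.csv', nrows=1000)
col = list(df)[1:]

# buffer puts and send one mutateRows call per 100 rows
# instead of one Thrift round trip per row
with table.batch(batch_size=100) as batch:
    for index, row in df.iterrows():
        to_put = {}
        for col_name in col:
            to_put[('s:' + col_name).encode('utf-8')] = str(row[col_name]).encode('utf-8')
        batch.put(row[0].encode('utf-8'), to_put)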


Solution

  • Found my error: because I call pandas.read_csv after opening the HappyBase connection, the connection sits idle for the entire (slow) parse of the 50,000-column CSV and times out. Calling read_csv before opening the connection remedied the problem; a minimal sketch of the fixed ordering is below.
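
A minimal sketch of the fixed ordering, assuming the same 'exon' table and 's' column family created above:

import happybase
import pandas as pd

# parse the CSV first; with 50,000 columns this takes long enough
# that an already-open Thrift connection would time out meanwhile
df = pd.read_csv('exon.csv', nrows=1000)
col = list(df)[1:]

# open the connection only once the data is ready to send
connection = happybase.Connection('localhost')
table = connection.table('exon')

for index, row in df.iterrows():
    to_put = {}
    for col_name in col:
        to_put[('s:' + col_name).encode('utf-8')] = str(row[col_name]).encode('utf-8')
    table.put(row[0].encode('utf-8'), to_put)

(If reordering weren't practical, happybase.Connection also accepts a timeout argument in milliseconds to raise the socket timeout; I didn't end up needing it.)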