python pandas memory jupyter-notebook swapfile

How to get around Memory Error when using Pandas?


I know that MemoryError is a common problem when using various functions of the Pandas library. I would like help in several areas. My questions are formulated below, after a description of the problem.

My OS is Ubuntu 18, my workspace is a Jupyter notebook running under Anaconda, and I have 8 GB of RAM.

The task I am solving:

I have over 100,000 dictionaries containing data on site visits by users, like this:

{'meduza.io': 2, 'google.com': 4, 'oracle.com': 2, 'mail.google.com': 1, 'yandex.ru': 1, 'user_id': 3}

I need to form a DataFrame from this data. At first I used the append method to add the dictionaries to the DataFrame row by row:

real_data = pd.DataFrame()
for i in tqdm_notebook(data):
    real_data = real_data.append([i], ignore_index=True)

But even a toy dataset showed that this approach takes a long time to complete. Then I tried to create the DataFrame directly, passing the whole list of dictionaries like this:

real_data = pd.DataFrame(data=data, dtype='int')

Converting a small amount of data this way is fast enough. But when I pass the complete data set to the function, a MemoryError appears. I tracked RAM consumption: the function does not even start executing and does not consume memory. I tried expanding the swap file, but that did not help - the function does not use it.

I understand that for my particular problem I can break the data into parts and then combine them, but I'm not sure this is the most effective way to solve it.
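For clarity, this is roughly what I mean by splitting into parts - a minimal sketch, where the chunk size of 10,000 is an arbitrary guess, not a tested value:

import pandas as pd

chunk_size = 10_000  # arbitrary; small enough to fit comfortably in RAM
chunks = []
for start in range(0, len(data), chunk_size):
    # build a small DataFrame from a slice of the list of dicts
    chunks.append(pd.DataFrame(data[start:start + chunk_size]))

real_data = pd.concat(chunks, ignore_index=True)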

  1. I want to understand how Pandas calculates the amount of memory required for an operation. Judging by the number of questions on this topic, a memory error can occur when reading, merging, and so on. Is it possible to make Pandas use a swap file to solve this problem?

  2. How can I convert the dictionaries into a DataFrame more efficiently? append is not efficient, and creating the DataFrame from the complete dataset at once is faster but leads to the error. I do not understand how these operations are implemented internally, but I want to figure out the most efficient way to convert data like mine.


Solution

  • I'd suggest specifying the dtypes of the columns; otherwise pandas may be storing them as object types. For example, if using DataFrame.from_dict you can pass a single dtype argument, or cast per column afterwards with astype({'a': np.float64, 'b': np.int32, 'c': 'Int64'}). Creating the dataframe from the dictionary objects, as you're doing, is the right approach - never use dataframe.append, because it's really inefficient. (A sketch of the dtype and memory checks is at the end of this answer.)

    See if any other programs are taking up memory on your system as well, and kill those before trying to do the load.

    You could also try to see at what point the memory error occurs - 50k, 70k, 100k dictionaries?

    Debug the dataframe and see what dtypes are being loaded, and make sure those types are the smallest appropriate ones (e.g. bool rather than object).

    EDIT: What could be making your dataframe very large is having lots of sparse entries, especially if there are lots of different domains as column headers - you might end up with 100k columns! It might be better to change your layout to a 'key:value' approach, e.g. {'user_id': 3, 'domain': 'google.ru', 'count': 10}. A sketch of this reshaping is shown below.
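    As a rough illustration of the dtype and memory checks above (run it on a subset of the data that fits in memory; the int32 downcast assumes your visit counts are small and that a missing site simply means zero visits):

        import numpy as np
        import pandas as pd

        df = pd.DataFrame(data)          # 'data' is your list of dicts
        print(df.dtypes)                 # look for unwanted 'object' columns
        print(df.memory_usage(deep=True).sum() / 1024 ** 2, 'MB')

        # Missing domains become NaN, which forces float64; if a missing
        # site means zero visits, fill with 0 and downcast
        count_cols = [c for c in df.columns if c != 'user_id']
        df[count_cols] = df[count_cols].fillna(0).astype(np.int32)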
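    And a minimal sketch of the 'key:value' (long) layout, assuming every input dict looks like the example in the question - domain counts plus a 'user_id' key:

        import pandas as pd

        records = []
        for d in data:                   # 'data' is the list of 100k+ dicts
            user_id = d['user_id']
            for domain, count in d.items():
                if domain == 'user_id':
                    continue
                records.append({'user_id': user_id, 'domain': domain, 'count': count})

        long_df = pd.DataFrame.from_records(records)
        long_df['domain'] = long_df['domain'].astype('category')  # repeated strings stored once
        long_df['count'] = long_df['count'].astype('int32')

        # three dense columns instead of one (mostly empty) column per domain
        print(long_df.dtypes)
        print(long_df.memory_usage(deep=True).sum() / 1024 ** 2, 'MB')

    If you ever need the wide table back, long_df.pivot(index='user_id', columns='domain', values='count') would reconstruct it.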