pythondictionaryindexer

Tool doesn't work on big data set - Single positional indexer is out-of-bounds


I am breaking my head for a day already about the following:

I have built a tool to iterate over two df to find duplicated values and sum points if they are duplicated:

It looks like this:

df1 = pd.DataFrame(dict1.items())
df2 = pd.DataFrame(dict2.items())



a = 0
while a != len(df2):

    value_to_compare = df2.iloc[a, 0]

    b = 0

    for row in range(len(df1)):
        if value_to_compare == df1.iloc[b, 0]:
            df1.iloc[b, 1] = df1.iloc[b, 1] + df2.iloc[b, 1]
            b = b + 1

        else:
            b = b + 1

    if b == len(df1):
        df1 = df1.append(df2.iloc[a, :], ignore_index=True)
        a = a + 1



df1 = df1.drop_duplicates(subset=[0], keep='first', ignore_index=True)
print('\n\n',df1)

It works perfect on a data set from 2 dicts like these:

dict1 = {'A': 1, 'B': 1, 'C': 1, 'D': 1}

dict2 = {'a': 1, 'b': 1, 'c': 1, 'd': 1}

But as soon as I apply it to the main program where there is 2 df with a couple of hundred rows (here an example):

              word  occurance
0            labor          4
1      predictions          2
2              nfl          2
3             kids          2
4           africa          2
5         pandemic          2
6             kara          2
7             days          2
8          swisher          2
9            event          2
10             day          2
11        football          2
12          office          2
13              us          2
14        politics          2

and make dicts of them using:

keys1 = words_total['word'].tolist()
values1 = words_total['occurance'].tolist()
dict1 = dict(zip(keys1, values1))
keys2 = words_date['word'].tolist()
values2 = words_date['occurance'].tolist()
dict2 = dict(zip(keys2, values2))

I get the following error:

Traceback (most recent call last):
  File "/Users/Programowanie/PycharmProjects/pythonProject3/main.py", line 120, in <module>
    df1.iloc[b, 1] = df1.iloc[b, 1] + df2.iloc[b, 1]
  File "/Users/Programowanie/PycharmProjects/pythonProject3/venv/lib/python3.9/site-packages/pandas/core/indexing.py", line 925, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/Programowanie/PycharmProjects/pythonProject3/venv/lib/python3.9/site-packages/pandas/core/indexing.py", line 1506, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/Users/Programowanie/PycharmProjects/pythonProject3/venv/lib/python3.9/site-packages/pandas/core/indexing.py", line 754, in _has_valid_tuple
    self._validate_key(k, i)
  File "/Users/Programowanie/PycharmProjects/pythonProject3/venv/lib/python3.9/site-packages/pandas/core/indexing.py", line 1409, in _validate_key
    self._validate_integer(key, axis)
  File "/Users/Programowanie/PycharmProjects/pythonProject3/venv/lib/python3.9/site-packages/pandas/core/indexing.py", line 1500, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

Do you have an idea why this is happening? Thank you in advance :)


Solution

  • Without seeing the 2 dataframes, my assumption is that the 2 dataframes don't contain the same number of rows, which means when you call and index/row with .iloc it's going to be a value that is out of bounds. For example, if I have a dataframe of 10 rows, I can't go and call the row at index value 15.

    Wouldn't it be easier to combine the 2 dataframes, then groupby the 'word' column and sum the occurance value for those?

    import pandas as pd
    
    data1 = {'word':['labor','predictions','nfl','kids','africa','pandemic','kara','days',
             'swisher','event','day','football','office','us','politics'],
     'occurance':[4,2,2,2,2,2,2,2,2,2,2,2,2,2,2]}
    
    data2 = {'word':['labor','predictions','nfl','kids','africa','pandemic','kara','days',
             'swisher','event','day','us','politics'],
     'occurance':[1,2,8,2,2,2,1,2,2,7,2,4,5]}
    
        
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    
    # Combine the 2 dataframes
    combined_df = pd.concat([df1, df2])
    
    
    # Groupby the word column and sum the occurance column
    occurances = combined_df.groupby('word').agg({"occurance": "sum"}).reset_index()
    

    Output:

    print(occurances)
               word  occurance
    0        africa          4
    1           day          4
    2          days          4
    3         event          9
    4      football          2
    5          kara          3
    6          kids          4
    7         labor          5
    8           nfl         10
    9        office          2
    10     pandemic          4
    11     politics          7
    12  predictions          4
    13      swisher          4
    14           us          6