I am breaking my head for a day already about the following:
I have built a tool to iterate over two df to find duplicated values and sum points if they are duplicated:
It looks like this:
df1 = pd.DataFrame(dict1.items())
df2 = pd.DataFrame(dict2.items())
a = 0
while a != len(df2):
value_to_compare = df2.iloc[a, 0]
b = 0
for row in range(len(df1)):
if value_to_compare == df1.iloc[b, 0]:
df1.iloc[b, 1] = df1.iloc[b, 1] + df2.iloc[b, 1]
b = b + 1
else:
b = b + 1
if b == len(df1):
df1 = df1.append(df2.iloc[a, :], ignore_index=True)
a = a + 1
df1 = df1.drop_duplicates(subset=[0], keep='first', ignore_index=True)
print('\n\n',df1)
It works perfect on a data set from 2 dicts like these:
dict1 = {'A': 1, 'B': 1, 'C': 1, 'D': 1}
dict2 = {'a': 1, 'b': 1, 'c': 1, 'd': 1}
But as soon as I apply it to the main program where there is 2 df with a couple of hundred rows (here an example):
word occurance
0 labor 4
1 predictions 2
2 nfl 2
3 kids 2
4 africa 2
5 pandemic 2
6 kara 2
7 days 2
8 swisher 2
9 event 2
10 day 2
11 football 2
12 office 2
13 us 2
14 politics 2
and make dicts of them using:
keys1 = words_total['word'].tolist()
values1 = words_total['occurance'].tolist()
dict1 = dict(zip(keys1, values1))
keys2 = words_date['word'].tolist()
values2 = words_date['occurance'].tolist()
dict2 = dict(zip(keys2, values2))
I get the following error:
Traceback (most recent call last):
File "/Users/Programowanie/PycharmProjects/pythonProject3/main.py", line 120, in <module>
df1.iloc[b, 1] = df1.iloc[b, 1] + df2.iloc[b, 1]
File "/Users/Programowanie/PycharmProjects/pythonProject3/venv/lib/python3.9/site-packages/pandas/core/indexing.py", line 925, in __getitem__
return self._getitem_tuple(key)
File "/Users/Programowanie/PycharmProjects/pythonProject3/venv/lib/python3.9/site-packages/pandas/core/indexing.py", line 1506, in _getitem_tuple
self._has_valid_tuple(tup)
File "/Users/Programowanie/PycharmProjects/pythonProject3/venv/lib/python3.9/site-packages/pandas/core/indexing.py", line 754, in _has_valid_tuple
self._validate_key(k, i)
File "/Users/Programowanie/PycharmProjects/pythonProject3/venv/lib/python3.9/site-packages/pandas/core/indexing.py", line 1409, in _validate_key
self._validate_integer(key, axis)
File "/Users/Programowanie/PycharmProjects/pythonProject3/venv/lib/python3.9/site-packages/pandas/core/indexing.py", line 1500, in _validate_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
Do you have an idea why this is happening? Thank you in advance :)
Without seeing the 2 dataframes, my assumption is that the 2 dataframes don't contain the same number of rows, which means when you call and index/row with .iloc
it's going to be a value that is out of bounds
. For example, if I have a dataframe of 10 rows, I can't go and call the row at index value 15
.
Wouldn't it be easier to combine the 2 dataframes, then groupby the 'word'
column and sum the occurance
value for those?
import pandas as pd
data1 = {'word':['labor','predictions','nfl','kids','africa','pandemic','kara','days',
'swisher','event','day','football','office','us','politics'],
'occurance':[4,2,2,2,2,2,2,2,2,2,2,2,2,2,2]}
data2 = {'word':['labor','predictions','nfl','kids','africa','pandemic','kara','days',
'swisher','event','day','us','politics'],
'occurance':[1,2,8,2,2,2,1,2,2,7,2,4,5]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Combine the 2 dataframes
combined_df = pd.concat([df1, df2])
# Groupby the word column and sum the occurance column
occurances = combined_df.groupby('word').agg({"occurance": "sum"}).reset_index()
Output:
print(occurances)
word occurance
0 africa 4
1 day 4
2 days 4
3 event 9
4 football 2
5 kara 3
6 kids 4
7 labor 5
8 nfl 10
9 office 2
10 pandemic 4
11 politics 7
12 predictions 4
13 swisher 4
14 us 6