I'm creating a lookup table / dictionary from a static spreadsheet for upcoming labelling on input data, adding a new field for flavor:
lookup_table_data = pd.read_csv(r'C:\Location\format1.csv', sep=',')
lookup_table_data['label'] = 'apache'
I create the dictionary
my_format = lookup_table_data.set_index('server').T.to_dict('list')
print(my_format)
{'ABC123': ['IBM', 1000, 'East Coast', 'apache'],
'ABC456': ['Dell', 800, 'West Coast', 'apache'],
'XYZ123': ['HP', 900, 'West Coast', 'apache']}
I read in the input data
my_data = pd.read_csv(r'C:\Location\my_data.csv')
print(my_data)
server busy datetime
0 ABC123 24% 6/1/2024 0:02
1 ABC456 45% 6/1/2024 4:01
2 GHI100 95% 6/1/2024 9:10
When I then try to assign some labels via slicing method I end up with 'KeyError' as a server from the input data is not present in the dictionary
my_data['type'] = my_data['server'].map(lambda x: my_format[x][0])
my_data['cost'] = my_data['server'].map(lambda x: my_format[x][1])
my_data['location'] = my_data['server'].map(lambda x: my_format[x][2])
KeyError: 'GHI100'
If I remove GHI100 data the code works fine as intended.
server %busy datetime type cost location
0 ABC123 24% 6/1/2024 0:02 IBM 1000 East Coast
1 ABC456 45% 6/1/2024 4:01 Dell 800 West Coast
Any advice on how I can handle missing keys? Several articles mention using .get with a default value. But doing that I get "Index out of list range" for the slices above 0 (i.e. map(lambda x: my_format.get([x][1], None)). Plugging in the [x][0] as a line by itself
my_data['type'] = my_data['server'].map(lambda x: my_format.get([x][0], None))
Gets the whole dictionary put into the type column though it does prevent a keyerror.
server %busy datetime type
0 ABC123 24% 6/1/2024 0:02 [IBM, 1000, East Coast]
1 ABC456 45% 6/1/2024 4:01 [Dell, 800, West Coast]
2 GHI100 95% 6/1/2024 9:10 None
My only other solution has been to do an inner merge on the format file and the data file before the lookup as that removes data with servers not in the dictionary.
In order for the .get()
method to work, you need to use it on the specific index operation that is causing the error. In your case, that is the myformat[x]
, not the following index of [0]
, [1]
, or [2]
. That means your replacement is myformat.get(x, DEFAULTVALUE)[INDEX]
, where INDEX
is 0
, 1
, or 2
as appropriate, and DEFAULTVALUE
is something that will work with the following index. Since your indexes are 0, 1, or 2, that's as easy as [None]*3
. My suggestion is actually to use ""
or "unknown"
instead of None
. Given that, any of these should work:
my_data['type'] = my_data['server'].map(lambda x: my_format.get(x, [""]*3)[0])
my_data['cost'] = my_data['server'].map(lambda x: my_format.get(x, [""]*3)[1])
my_data['location'] = my_data['server'].map(lambda x: my_format.get(x, [""]*3)[2])
for (col_index,col_name) in enumerate(['type', 'cost', 'location']):
my_data[col_name] = my_data['server'].map(
lambda x: (
my_format
.get(x, [""]*3)
[col_index]
)
)
for (col_index,col_name) in enumerate(['type', 'cost', 'location']):
my_data[col_name] = my_data['server'].map(
lambda x: (
my_format[x][col_index]
if x in my_format
else ""
)
)
my_data.update(
(col_name, my_data['server'].map(
lambda x: (
my_format[x][col_index]
if x in my_format
else ""
)
))
for (col_index, col_name) in enumerate(['type', 'cost', 'location'])
)
The reason your posted attempt failed (my_format.get([x][0], None)
) is because the expression [x][0]
means "make a list containing x
as its only element, then fetch the element at index 0", which just evaluates to x
, which was the original server name; the .get()
then fetches the my_format
entry for the server name if it exists, which is the entire list of server data.