pythondictionarykeyerror

Advice on how to handle KeyError occurring when using map Lambda and slicing for dictionary lookup


I'm creating a lookup table / dictionary from a static spreadsheet for upcoming labelling on input data, adding a new field for flavor:

lookup_table_data = pd.read_csv(r'C:\Location\format1.csv', sep=',')
lookup_table_data['label'] = 'apache'

I create the dictionary

my_format = lookup_table_data.set_index('server').T.to_dict('list')
print(my_format)

 {'ABC123': ['IBM', 1000, 'East Coast', 'apache'],
  'ABC456': ['Dell', 800, 'West Coast', 'apache'],
  'XYZ123': ['HP', 900, 'West Coast', 'apache']}

I read in the input data

my_data = pd.read_csv(r'C:\Location\my_data.csv')
print(my_data)

   server  busy       datetime
0  ABC123   24%  6/1/2024 0:02
1  ABC456   45%  6/1/2024 4:01
2  GHI100   95%  6/1/2024 9:10

When I then try to assign some labels via slicing method I end up with 'KeyError' as a server from the input data is not present in the dictionary

my_data['type'] = my_data['server'].map(lambda x: my_format[x][0])
my_data['cost'] = my_data['server'].map(lambda x: my_format[x][1])
my_data['location'] = my_data['server'].map(lambda x: my_format[x][2])

KeyError: 'GHI100'

If I remove GHI100 data the code works fine as intended.

      server %busy       datetime  type  cost    location
   0  ABC123   24%  6/1/2024 0:02   IBM  1000  East Coast
   1  ABC456   45%  6/1/2024 4:01  Dell   800  West Coast

Any advice on how I can handle missing keys? Several articles mention using .get with a default value. But doing that I get "Index out of list range" for the slices above 0 (i.e. map(lambda x: my_format.get([x][1], None)). Plugging in the [x][0] as a line by itself

 my_data['type'] = my_data['server'].map(lambda x: my_format.get([x][0], None))

Gets the whole dictionary put into the type column though it does prevent a keyerror.

   server %busy       datetime                    type
0  ABC123   24%  6/1/2024 0:02   [IBM, 1000, East Coast]
1  ABC456   45%  6/1/2024 4:01   [Dell, 800, West Coast]
2  GHI100   95%  6/1/2024 9:10                      None

My only other solution has been to do an inner merge on the format file and the data file before the lookup as that removes data with servers not in the dictionary.


Solution

  • In order for the .get() method to work, you need to use it on the specific index operation that is causing the error. In your case, that is the myformat[x], not the following index of [0], [1], or [2]. That means your replacement is myformat.get(x, DEFAULTVALUE)[INDEX], where INDEX is 0, 1, or 2 as appropriate, and DEFAULTVALUE is something that will work with the following index. Since your indexes are 0, 1, or 2, that's as easy as [None]*3. My suggestion is actually to use "" or "unknown" instead of None. Given that, any of these should work:

    my_data['type'] = my_data['server'].map(lambda x: my_format.get(x, [""]*3)[0])
    my_data['cost'] = my_data['server'].map(lambda x: my_format.get(x, [""]*3)[1])
    my_data['location'] = my_data['server'].map(lambda x: my_format.get(x, [""]*3)[2])
    
    for (col_index,col_name) in enumerate(['type', 'cost', 'location']):
        my_data[col_name] = my_data['server'].map(
            lambda x: (
                my_format
                .get(x, [""]*3)
                [col_index]
            )
        )
    
    for (col_index,col_name) in enumerate(['type', 'cost', 'location']):
        my_data[col_name] = my_data['server'].map(
            lambda x: (
                my_format[x][col_index]
                if x in my_format
                else ""
            )
        )
    
    my_data.update(
        (col_name, my_data['server'].map(
            lambda x: (
                my_format[x][col_index]
                if x in my_format
                else ""
            )
        ))
        for (col_index, col_name) in enumerate(['type', 'cost', 'location'])
    )
    

    The reason your posted attempt failed (my_format.get([x][0], None)) is because the expression [x][0] means "make a list containing x as its only element, then fetch the element at index 0", which just evaluates to x, which was the original server name; the .get() then fetches the my_format entry for the server name if it exists, which is the entire list of server data.