python mongodb dask dask-distributed dask-delayed

How to access nested data in a Dask Bag while using dask-mongo


Below is the sample data:

({'age': 61,
  'name': ['Emiko', 'Oliver'],
  'occupation': 'Medical Student',
  'telephone': '166.814.5565',
  'address': {'address': '645 Drumm Line', 'city': 'Kennewick'},
  'credit-card': {'number': '3792 459318 98518', 'expiration-date': '12/23'}},
 {'age': 54,
  'name': ['Wendolyn', 'Ortega'],
  'occupation': 'Tractor Driver',
  'telephone': '1-975-090-1672',
  'address': {'address': '1274 Harbor Court', 'city': 'Mustang'},
  'credit-card': {'number': '4600 5899 6829 6887',
   'expiration-date': '11/25'}})

We can filter on the root elements of the Dask bag as below:

    b.filter(lambda record: record['age'] > 30).take(2)  # Select only people over 30

However, I need to access a nested element, i.e. credit-card.expiration-date. Any help will be appreciated.


Solution

  • You can simply do this:

    import dask.bag as db
    
    data = ({'age': 61,
             'name': ['Emiko', 'Oliver'],
             'occupation': 'Medical Student',
             'telephone': '166.814.5565',
             'address': {'address': '645 Drumm Line', 'city': 'Kennewick'},
             'credit-card': {'number': '3792 459318 98518', 'expiration-date': '12/23'}},
            {'age': 54,
             'name': ['Wendolyn', 'Ortega'],
             'occupation': 'Tractor Driver',
             'telephone': '1-975-090-1672',
             'address': {'address': '1274 Harbor Court', 'city': 'Mustang'},
             'credit-card': {'number': '4600 5899 6829 6887',
                             'expiration-date': '11/25'}})
    
    bag = db.from_sequence(data)
    
    result = bag.map(lambda record: record['credit-card']['expiration-date']).compute()
    
    print(result)
    

    which returns

    ['12/23', '11/25']
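
The same nested indexing also works inside `filter`, which is closer to the `age` example in the question. The following is a minimal sketch, not a definitive implementation: the helper `expires_in` is a hypothetical name introduced here, the records are trimmed copies of the sample data, and it assumes every record stores a single card as a dict with an 'MM/YY' expiration date. `scheduler='synchronous'` is passed only to keep this tiny example in one process.

```python
import dask.bag as db

# Trimmed copies of the sample records (single card per person).
records = [
    {'name': ['Emiko', 'Oliver'], 'age': 61,
     'credit-card': {'number': '3792 459318 98518', 'expiration-date': '12/23'}},
    {'name': ['Wendolyn', 'Ortega'], 'age': 54,
     'credit-card': {'number': '4600 5899 6829 6887', 'expiration-date': '11/25'}},
]

bag = db.from_sequence(records)


def expires_in(record, year):
    """Hypothetical helper: True if the 'MM/YY' expiration date ends in the given two-digit year."""
    return record['credit-card']['expiration-date'].endswith('/' + year)


# Keep only people whose card expires in 2025, then pull out their names.
expiring_2025 = bag.filter(lambda record: expires_in(record, '25'))
names = expiring_2025.pluck('name').compute(scheduler='synchronous')

print(names)  # [['Wendolyn', 'Ortega']]
```
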
    

    If some individuals have several cards (a list of dicts instead of a single dict), handle both cases:

    import dask.bag as db
    
    data = ({
                'age': 61,
                'name': ['Emiko', 'Oliver'],
                'occupation': 'Medical Student',
                'telephone': '166.814.5565',
                'address': {'address': '645 Drumm Line', 'city': 'Kennewick'},
                'credit-card': {'number': '3792 459318 98518', 'expiration-date': '12/23'}
            },
            {
                'age': 54,
                'name': ['Wendolyn', 'Ortega'],
                'occupation': 'Tractor Driver',
                'telephone': '1-975-090-1672',
                'address': {'address': '1274 Harbor Court', 'city': 'Mustang'},
                'credit-card': [
                    {'number': '4600 5899 6829 6887', 'expiration-date': '11/25'},
                    {'number': '4610 5899 6829 6887', 'expiration-date': '11/26'},
                ]
            })
    
    bag = db.from_sequence(data)
    
    result = bag.map(lambda record: record['credit-card']['expiration-date'] 
                      if isinstance(record['credit-card'], dict) 
                      else [card['expiration-date'] for card in record['credit-card']]).compute()
    
    print(result)
    

    which will return

    ['12/23', ['11/25', '11/26']]
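
If a flat list of dates is more convenient than the mixed result above, one option is to always return a list from the mapped function and then call `flatten()` on the bag. This is a sketch under the same assumptions as the answer's data; `expiration_dates` is a hypothetical helper name, and `scheduler='synchronous'` is used only to keep the small example in one process.

```python
import dask.bag as db

# Trimmed copies of the sample records: one dict card, one list of cards.
records = [
    {'name': ['Emiko', 'Oliver'],
     'credit-card': {'number': '3792 459318 98518', 'expiration-date': '12/23'}},
    {'name': ['Wendolyn', 'Ortega'],
     'credit-card': [
         {'number': '4600 5899 6829 6887', 'expiration-date': '11/25'},
         {'number': '4610 5899 6829 6887', 'expiration-date': '11/26'},
     ]},
]


def expiration_dates(record):
    """Hypothetical helper: always return a list of expiration dates for one record."""
    cards = record['credit-card']
    if isinstance(cards, dict):
        cards = [cards]  # wrap a single card so both shapes are handled the same way
    return [card['expiration-date'] for card in cards]


bag = db.from_sequence(records)

# map yields one list per record; flatten concatenates those lists into one bag of dates.
flat_dates = bag.map(expiration_dates).flatten().compute(scheduler='synchronous')

print(flat_dates)  # ['12/23', '11/25', '11/26']
```

Returning a list in every case matters here: flattening the mixed result directly would also iterate over the bare string '12/23' character by character.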