I am using an osm.pbf OpenStreetMap dataset to count the number of individual shops per shop_type in a certain country. My problem is that identifying the shops via nodes creates duplicates in my dataset, because some shops (e.g. big department stores) are represented by multiple nodes with different lon and lat coordinates. How can I avoid the duplicates?
This is my current code:
import osmium
import pandas as pd

class ShopHandler(osmium.SimpleHandler):
    def __init__(self):
        super().__init__()
        self.nodes = {}  # shop name -> collected attributes

    def node(self, n):
        if 'shop' in n.tags:
            node_id = n.id
            shop_name = n.tags.get('name', 'Unnamed Shop')
            if shop_name not in self.nodes:
                self.nodes[shop_name] = {
                    'node_ids': [node_id],
                    'lat': n.location.lat,
                    'lon': n.location.lon,
                    'shop_type': n.tags.get('shop', 'Unknown')
                }
            else:
                self.nodes[shop_name]['node_ids'].append(node_id)

handler = ShopHandler()
handler.apply_file(file_path)  # file_path: path to the .osm.pbf extract
# orient='index' gives one row per shop name instead of one column per shop name
shops_df = pd.DataFrame.from_dict(handler.nodes, orient='index')
I have tried rounding the coordinates. However, stores located too close to each other were then no longer identified as separate stores.
To avoid duplicates, only consider elements that carry a shop=* tag themselves. If a shop is mapped as a building, it will consist of multiple nodes. None of those nodes have a shop=* tag, however; only the way that represents the building has one.
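As a rough illustration of that idea, here is a minimal sketch that counts one shop per tagged element, whether it is a node or a way. The names `tally_shop` and `count_shops` are made up for this example, and counting relations as well is my own assumption (it would cover shops mapped as multipolygons). pyosmium is imported inside the function only so that the counting helper can be tried without it installed.

```python
from collections import Counter


def tally_shop(counts, tags):
    """Add one shop to counts if the element itself carries a shop=* tag."""
    shop_type = tags.get('shop', None)
    if shop_type is not None:
        counts[shop_type] += 1


def count_shops(file_path):
    """Count shops per shop type, one count per tagged OSM element."""
    import osmium  # pyosmium

    class ShopCounter(osmium.SimpleHandler):
        def __init__(self):
            super().__init__()
            self.counts = Counter()

        def node(self, n):
            # Shops mapped as a single point.
            tally_shop(self.counts, n.tags)

        def way(self, w):
            # Shops mapped as a building outline; the tag sits on the way,
            # not on the nodes that make up the outline.
            tally_shop(self.counts, w.tags)

        def relation(self, r):
            # Assumption: shops mapped as multipolygons carry the tag
            # on the relation.
            tally_shop(self.counts, r.tags)

    handler = ShopCounter()
    handler.apply_file(file_path)
    return handler.counts
```

Because every element is counted exactly once, no grouping by name or coordinates is needed. The result behaves like a dict, so something like `pd.Series(count_shops(file_path))` should give the counts per shop type as a pandas object.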