python random dataset data-generation

Generation of random coordinates in the USA (with the use of the biggest cities dataset)


Good morning,

For my project, I need to generate random coordinates as latitude/longitude pairs. The coordinates must lie inside the borders of the USA, and ideally a coordinate should be more likely to fall near the biggest US cities. I have a CSV dataset with the coordinates and population of those cities.

So to summarize, I want coordinates to be generated with a higher probability near big cities, but sometimes also in rural areas.

I'm using Python, and any Python library is fine. One option would be to generate random coordinates in a rectangle that surrounds the USA, but I would still need to take the cities dataset into account somehow.


Solution

  • Given: a CSV dataset containing the coordinates of US cities and their population.

    Find: random latitude/longitude pairs that lie inside the borders of the USA and, ideally, are more likely to fall near the biggest US cities.

    Here is an approach.

    1. Read the CSV file of city data into a pandas dataframe.
    2. Sample the dataframe to get a random subset of the cities.
    3. Randomly select a list of cities from this sample and use their coordinates as the urban (lat, lng) points.
    4. Randomly select a second list of cities, take them 2 at a time, and use the midpoint of each pair as a rural point. The midpoint is only a rough stand-in for a rural location: it is not guaranteed to fall inside the US borders or outside a built-up area.

    The following code illustrates this approach:

    import pandas as pd
    import random as rnd
    from collections import namedtuple

    cities = r'DataFiles/uscities.csv'
    Point = namedtuple('Point', 'lat, lng')

    def pickRandomPoints(city_file: str, city_count: int, rural_count: int) -> list:
        """Return a list of (lat, lng) tuples that is city_count + rural_count long"""
        qty = 5*(city_count + rural_count)
        rslt = []
        # Read the file that was passed in and keep a random subset of qty rows
        city_df = pd.read_csv(city_file)[['city', 'lat', 'lng', 'population']].sample(qty)
        # One Point per sampled row, so no lookup by (possibly duplicated) city name is needed
        pts = [Point(lat, lng) for lat, lng in zip(city_df['lat'], city_df['lng'])]
        # Add city lat & lngs (drawn with replacement)
        for pt in rnd.choices(pts, k=city_count):
            rslt.append((pt.lat, pt.lng))
        # Draw 2 cities per rural point and use the midpoint of each pair
        rural_list = rnd.choices(pts, k=2*rural_count)
        for rp in range(0, len(rural_list), 2):
            pta, ptb = rural_list[rp], rural_list[rp+1]
            newpt = Point(round(pta.lat + (ptb.lat - pta.lat)/2, 4),
                          round(pta.lng + (ptb.lng - pta.lng)/2, 4))
            rslt.append((newpt.lat, newpt.lng))
        return rslt
    

    For example, running this code as follows:

    pickRandomPoints(cities, 10, 5)
    

    Yields a list of (lat, lng) tuples, with the city points first and the rural midpoints after them. The values differ on every run; the ten city points from one run were:

    [(40.4968, -77.7283),
     (35.1639, -78.7371),
     (37.7941, -95.15),
     (33.1481, -88.1757),
     (41.9555, -88.5292),
     (48.5707, -110.0862),
     (37.7668, -108.9071),
     (37.3042, -113.6658),
     (40.8358, -81.0649),
     (35.4671, -78.1613)]
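
The approach above picks cities uniformly at random, so the population column does not influence the result. As a rough sketch of one way to bring population in (not part of the original answer), the snippet below picks a city with probability proportional to its population, scatters the point around it with a Gaussian, and occasionally draws a uniform "rural" point instead. Everything named here is an assumption made for illustration: the function `weighted_random_points`, the parameters `rural_share` and `spread_deg`, the three sample cities with rounded coordinates and populations, and the bounding box, which only approximates the contiguous USA. A rectangle is not the border, so points can still land in the ocean, Canada, or Mexico; a strict check would need a point-in-polygon test against real border data.

```python
import random

# Approximate bounding box of the contiguous USA (assumed, rounded values).
LAT_MIN, LAT_MAX = 24.5, 49.4
LNG_MIN, LNG_MAX = -124.8, -66.9


def weighted_random_points(cities, n, rural_share=0.2, spread_deg=0.25, rng=None):
    """Return n (lat, lng) tuples inside the bounding box.

    cities      -- list of (lat, lng, population) tuples
    rural_share -- probability that a point is drawn uniformly from the box
    spread_deg  -- standard deviation (in degrees) of the scatter around a city
    rng         -- optional random.Random instance, e.g. for reproducible runs
    """
    rng = rng or random.Random()
    weights = [population for _, _, population in cities]
    points = []
    while len(points) < n:
        if rng.random() < rural_share:
            # "Rural" point: uniform over the bounding box.
            lat = rng.uniform(LAT_MIN, LAT_MAX)
            lng = rng.uniform(LNG_MIN, LNG_MAX)
        else:
            # "Urban" point: choose a city in proportion to its population,
            # then scatter around its coordinates.
            city_lat, city_lng, _ = rng.choices(cities, weights=weights, k=1)[0]
            lat = rng.gauss(city_lat, spread_deg)
            lng = rng.gauss(city_lng, spread_deg)
        # Rejection step: keep only points that fall inside the box.
        if LAT_MIN <= lat <= LAT_MAX and LNG_MIN <= lng <= LNG_MAX:
            points.append((round(lat, 4), round(lng, 4)))
    return points


# Illustrative input only; in practice the rows would come from the CSV file.
sample_cities = [
    (40.7128, -74.0060, 8_300_000),   # New York
    (34.0522, -118.2437, 3_900_000),  # Los Angeles
    (41.8781, -87.6298, 2_700_000),   # Chicago
]
points = weighted_random_points(sample_cities, 5)
```

Raising `rural_share` gives more uniformly scattered points, and a larger `spread_deg` spreads the urban points further from the city centres. With `rural_share=0` and only cities outside the box, the rejection loop would never finish, so the input should contain at least one city inside it.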