python-3.xpandasimportexportgeonames

Import text file from geonames using pandas python


I downloaded one of the country datasets from geonames and I used this line to parse the dataset into columns:

data = pd.read_csv("C:/Users/Documents/TR.txt", sep="\t", header = None)

But for some reason this doesn't parse all of the rows correctly. Most of the rows are correctly parsed and about 2K are not. I used this line to be able to see that it's not parsed correctly:

data.to_csv("C:/Users/Documents/output.csv")

I then opened the output.csv in excel and saw that some of the rows are not parsed. But when I open the original TR.txt dataset on excel and use tab delimiter all the rows are correctly shown as parsed. So I am doing something wrong in my python code but I can't figure out what. Am I outputting the dataset wrong?? Thank you


Solution

  • Always read the readme.txt file.

    In this particular case, there are two noteworthy issues going on.

    1. The elevation (column 15) is expected as an int, but contains blanks. If you specify an int as the datatype, this will generate an error because there is no NaN values for ints. The work around is to cast it as a float. If you really want an int, then create a sentinal value for missing fields (e.g. -99999), fillna() with this value, and then cast as an int.

    2. Some column contain comma separated lists (e.g. column 3, alternate names). When you used data.to_csv("C:/Users/Documents/output.csv") you destroyed the tab delimited parsing. You need to specify sep='\t'.

      dtypes_dict = {
          0: int, # geonameid
          1: unicode,  # name
          2: str,  # asciiname
          3: str,  # alternatenames
          4: float, # latitude
          5: float, # longitude
          6: str, # feature class
          7: str, # feature code
          8: str, # country code
          9: str, # cc2
          10: str, # admin1 code
          11: str, # admin2 code
          12: str, # admin3 code
          13: str, # admin4 code
          14: int, # population
          15: int, # elevation
          16: int, # dem (digital elevation model)
          17: str, # timezone
          18: str # modification date yyyy-MM-dd
      }
      
      data = pd.read_csv("TR.txt", sep="\t", header = None, dtype=dtypes_dict)
      data.to_csv('output.txt', sep='\t')
      

    I didn't parse the final date column because it is probably not relevant.