I am trying to add a new column to my data with a FIPS code in it (5 digit number). Basically when County from maindata.csv matches County from fipsdata.tsv, I want the FIPS code (fipsCountyFIPS) to land in a new column i.e. data[fips] (so if County in maindata matches County in fipsdata THEN write the corresponding fips code to a new column in dataframe).
data = pd.read_csv ("maindata.csv")
fips = pd.read_csv ("fips2county.tsv",sep='\t')
data[fips] = np.where(data.County == fips.CountyName, fipsCountyFIPS)
I also experimented with the following which sounds like it should be easier in theory, though I couldn't work it out :( https://github.com/fitnr/addfips -- I would prefer to just do it with the above if possible.
If anyone could share how to do this that would be amazing!
Update:
Is this what you're looking for?
# 1. Check if all the counties in the data are in the FIPS data.
# If so, add a new column to `data` with each county corresponding FIPS code.
if data.County.isin(fips.CountyName.unique()).all():
data = (
# 2. Merge the data with the FIPS data
data.merge(
fips[["FIPS", "CountyName"]],
left_on="County",
right_on="CountyName",
)
# 3. Rename the FIPS column
# This step is only needed if the column name you want to give
# to the FIPS codes is different from the original column name
# from `fips` dataframe.
.rename(columns={"FIPS": "fips"})
# 4. Drop the `"CountyName"` column from merged dataframe.
.drop(columns="CountyName", errors="ignore")
)
Here's an example of the above code in action:
Note: fips
data downloaded from https://public.opendatasoft.com/explore/dataset/georef-united-states-of-america-county
import pandas as pd
import numpy as np
# == Data to run the example ===================================================
# 1. Read in the FIPS data
fips = pd.read_csv(
"https://public.opendatasoft.com/explore/dataset/georef-united-states-of-america-county/download/?format=csv&timezone=America/Argentina/Buenos_Aires&lang=en&use_labels_for_header=true&csv_separator=%3B",
sep=";",
)
# 2. Rename the columns to match the data
fips = fips.rename(
columns={'Official Name County': 'CountyName', 'County FIPS Code': 'FIPS'}
)
# 3. Make sure the FIPS column is a string and has 5 digits
fips['FIPS'] = fips['FIPS'].astype(str).str.zfill(5)
# 4. Create a list of counties to sample from
counties = [
"DeKalb", "Johnson", "Linn", "Macon", "Chase", "Hall", "Hitchcock",
"Pierce", "Rock", "Wheeler", "St. Lawrence", "Wayne", "Buncombe",
"Martin", "Perquimans", "Scotland", "Vance", "Fairfield", "Lake"
]
# 5. Create a dataframe with a random sample of counties
data = pd.DataFrame(
{"County": np.random.choice(counties, size=20, replace=True)}
)
# == Actual Solution ===========================================================
# 6. Check if all the counties in the data are in the FIPS data.
# If so, add a new column to `data` with each county corresponding FIPS code.
if data.County.isin(fips.CountyName.unique()).all():
data = (
# 7. Merge the data with the FIPS data
data.merge(
fips[["FIPS", "CountyName"]],
left_on="County",
right_on="CountyName",
)
# 8. Rename the FIPS column
# This step is only needed if the column name you want to give
# to the FIPS codes is different from the original column name
# from `fips` dataframe.
.rename(columns={"FIPS": "fips"})
# 9. Drop the CountyName column
.drop(columns="CountyName", errors="ignore")
)
print(data)
# Prints:
#
# County fips
# 0 Pierce 00139
# 1 Pierce 00069
# 2 Pierce 00229
# 3 Pierce 00053
# 4 Pierce 00093
# .. ... ...
# 93 DeKalb 00049
# 94 DeKalb 00089
# 95 DeKalb 00041
# 96 DeKalb 00037
# 97 St. Lawrence 00089
Output screenshot:
If one or more "Counties"
not found inside fips
: