pythonpandasfips

Python add column with new value when 2 conditions match


I am trying to add a new column to my data with a FIPS code in it (5 digit number). Basically when County from maindata.csv matches County from fipsdata.tsv, I want the FIPS code (fipsCountyFIPS) to land in a new column i.e. data[fips] (so if County in maindata matches County in fipsdata THEN write the corresponding fips code to a new column in dataframe).

data = pd.read_csv ("maindata.csv")
fips = pd.read_csv ("fips2county.tsv",sep='\t')

data[fips] = np.where(data.County == fips.CountyName, fipsCountyFIPS)

I also experimented with the following which sounds like it should be easier in theory, though I couldn't work it out :( https://github.com/fitnr/addfips -- I would prefer to just do it with the above if possible.

If anyone could share how to do this that would be amazing!

Update:

enter image description here


Solution

  • Is this what you're looking for?

    # 1. Check if all the counties in the data are in the FIPS data.
    #    If so, add a new column to `data` with each county corresponding FIPS code.
    if data.County.isin(fips.CountyName.unique()).all():
        data = (
            # 2. Merge the data with the FIPS data
            data.merge(
                fips[["FIPS", "CountyName"]],
                left_on="County",
                right_on="CountyName",
            )
            # 3. Rename the FIPS column
            #    This step is only needed if the column name you want to give
            #    to the FIPS codes is different from the original column name
            #    from `fips` dataframe.
            .rename(columns={"FIPS": "fips"})
            # 4. Drop the `"CountyName"` column from merged dataframe.
            .drop(columns="CountyName", errors="ignore")
        )
    
    

    Full Example Code

    Here's an example of the above code in action:

    Note: fips data downloaded from https://public.opendatasoft.com/explore/dataset/georef-united-states-of-america-county

    
    import pandas as pd
    import numpy as np
    
    
    # == Data to run the example ===================================================
    # 1. Read in the FIPS data
    fips = pd.read_csv(
        "https://public.opendatasoft.com/explore/dataset/georef-united-states-of-america-county/download/?format=csv&timezone=America/Argentina/Buenos_Aires&lang=en&use_labels_for_header=true&csv_separator=%3B",
        sep=";",
    )
    
    # 2. Rename the columns to match the data
    fips = fips.rename(
        columns={'Official Name County': 'CountyName', 'County FIPS Code': 'FIPS'}
    )
    
    # 3. Make sure the FIPS column is a string and has 5 digits
    fips['FIPS'] = fips['FIPS'].astype(str).str.zfill(5)
    
    # 4. Create a list of counties to sample from
    counties = [
        "DeKalb", "Johnson", "Linn", "Macon", "Chase", "Hall", "Hitchcock",
        "Pierce", "Rock", "Wheeler", "St. Lawrence", "Wayne", "Buncombe",
        "Martin", "Perquimans", "Scotland", "Vance", "Fairfield", "Lake"
    ]
    
    # 5. Create a dataframe with a random sample of counties
    data = pd.DataFrame(
        {"County": np.random.choice(counties, size=20, replace=True)}
    )
    
    # == Actual Solution ===========================================================
    # 6. Check if all the counties in the data are in the FIPS data.
    #    If so, add a new column to `data` with each county corresponding FIPS code.
    if data.County.isin(fips.CountyName.unique()).all():
        data = (
            # 7. Merge the data with the FIPS data
            data.merge(
                fips[["FIPS", "CountyName"]],
                left_on="County",
                right_on="CountyName",
            )
            # 8. Rename the FIPS column
            #    This step is only needed if the column name you want to give
            #    to the FIPS codes is different from the original column name
            #    from `fips` dataframe.
            .rename(columns={"FIPS": "fips"})
            # 9. Drop the CountyName column
            .drop(columns="CountyName", errors="ignore")
        )
    print(data)
    # Prints:
    #
    #           County   fips
    # 0         Pierce  00139
    # 1         Pierce  00069
    # 2         Pierce  00229
    # 3         Pierce  00053
    # 4         Pierce  00093
    # ..           ...    ...
    # 93        DeKalb  00049
    # 94        DeKalb  00089
    # 95        DeKalb  00041
    # 96        DeKalb  00037
    # 97  St. Lawrence  00089
    
    

    Output screenshot:

    enter image description here

    If one or more "Counties" not found inside fips:

    enter image description here