pandas data-cleaning

Data cleaning for duplications within cells of a dataframe


I've just scraped a dataset of names from a website but the names are coming into the dataframe duplicated. Example:

    [MarkMark, SarahSarah, BenBen]

The website I'm scraping from has images in the table and it seems like when I've pulled the table into a dataframe format it duplicates the name. How would I go about cleaning this data so I've only got one name?


Solution

  • Try splitting the name string in the middle

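    # Use integer division (//): in Python 3, len(name)/2 is a float and cannot be used as a slice index
    df["name"] = df["name"].apply(lambda name: name[:len(name) // 2])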
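
    If some cells might not be duplicated (or might be missing), a slightly safer variant is to halve the string only when both halves match. This is a minimal sketch, not the only way to do it; the helper name `dedupe_name` and the sample dataframe are made up for illustration, and the column name `name` is taken from the snippet above.

```python
import pandas as pd

def dedupe_name(name):
    """Return the first half of `name` if it is the same string repeated twice."""
    if not isinstance(name, str):
        return name  # leave NaN / non-string values untouched
    half = len(name) // 2
    # Only halve when the length is even and both halves are identical
    if len(name) % 2 == 0 and name[:half] == name[half:]:
        return name[:half]
    return name

# Hypothetical sample data: three duplicated names and one clean one
df = pd.DataFrame({"name": ["MarkMark", "SarahSarah", "BenBen", "Alice"]})
df["name"] = df["name"].apply(dedupe_name)
print(df["name"].tolist())
```

    One caveat: a genuine name that happens to be a repeated string (for example "JoJo") would also be halved, so it may be worth checking the result against the source page.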