python pandas numpy

How to find the index of a character and delete the characters after it


I'm trying to clean the CSV data for my project, which contains news mixed with unnecessary content (such as JavaScript code). It's the dataset for our project, and my job is to filter it and delete the unnecessary characters.

What I want to do is find the index of a given character sequence inside each row of a column and, if it is there, delete everything after it (including the sequence itself).

I have written code that finds the index and can replace the exact character sequence, but the problem is that I want to delete all the characters after it as well.

I tried using the pandas library to load the data and replace the text in the row. But, as the code shows, it just replaces the exact string with an empty string. I want to find the index of the string (let's say "window") and delete the characters that come after "window" inside the row.

import pandas as pd
import numpy as np
import csv


pathtofile = "t1.csv"
data = pd.read_csv(pathtofile, encoding='utf-8', index_col=0)

print(type(data))  # shows that data is a DataFrame
print(data.head())  # prints the columns [id, contetn, date]

sub = 'window._ttzi'  # the substring I'm searching for with find()
data["Indexes"] = data["contetn"].str.find(sub)  # -1 where sub is not found
print(data)  # prints the CSV data with the additional Indexes column

data = data.replace('window._ttzi', '')  # my attempt at removing the string

#data.to_csv("t1edited.csv", encoding = 'utf-8')
print(data)   

Solution

  • I searched some more on the internet and found the answer myself.

    The pandas str.rstrip() method did what I needed.

    First, we open the file with pathtofile = "t1.csv" and data = pd.read_csv(pathtofile, encoding='utf-8', index_col=0). Then we take the column we want to clean and rstrip it with the specific string, such as sub = 'window._ttzi'. So the code looks like data['contetn'] = data['contetn'].str.rstrip(sub).

    I will still search for other ways of deleting the unnecessary data. Have a nice day.
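
One caveat worth checking on your own data: str.rstrip(sub) treats sub as a set of characters to strip from the right-hand end of each string, not as a substring to cut at, so it may leave rows untouched when other text follows sub. A minimal sketch of an alternative, assuming the goal is to keep only the text before the first occurrence of sub, uses str.partition. The sample rows and the "cleaned" column name below are made up for illustration; only "contetn" and sub come from the question.

```python
import pandas as pd

# Hypothetical rows standing in for t1.csv (column name "contetn" as in the question).
data = pd.DataFrame({"contetn": [
    "Some news text window._ttzi = {junk: 1};",
    "Clean row with no script",
    None,  # missing values stay missing
]})

sub = "window._ttzi"

# rstrip(sub) strips trailing characters drawn from the set of characters in sub.
# The first row ends in ";", which is not in that set, so it is left unchanged.
stripped = data["contetn"].str.rstrip(sub)

# partition(sub) splits each string at the first literal occurrence of sub into
# three columns (before, separator, after); column 0 is the text before sub.
# Rows that do not contain sub come back whole in column 0.
data["cleaned"] = data["contetn"].str.partition(sub)[0]

print(data["cleaned"].tolist())
```

Because partition matches sub literally, the "." in 'window._ttzi' is not treated as a regular-expression wildcard. Any whitespace left just before the cut can be removed afterwards with a plain .str.rstrip() if that matters for the dataset.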