I have a DataFrame with around 50K to 100K rows and about 4 columns.
I need to filter the DataFrame to discard any rows containing special characters.
In my data, only the last row contains special characters; all other rows should remain in the DataFrame.
How do I filter the DataFrame so that it excludes only this last row?
You can define a function has_special_chars that uses a regular expression to check for non-ASCII characters, as in the code below:
import re

def has_special_chars(s):
    return bool(re.search(r'[^\x00-\x7F]', s))
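To see what the function matches, here is a quick standalone check (the sample strings are my own, not taken from your data) showing that it flags any string containing a character outside the ASCII range:

```python
import re

def has_special_chars(s):
    # True if the string contains any character outside the ASCII range (0x00-0x7F)
    return bool(re.search(r'[^\x00-\x7F]', s))

print(has_special_chars('199 Central Avenue'))    # → False (all ASCII)
print(has_special_chars('40\u00bd3\'13" North'))  # → True (contains the non-ASCII '½')
```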
Then apply the function to the DataFrame as shown below:
filtered_df = df[~df.apply(lambda row: any(has_special_chars(str(cell)) for cell in row), axis=1)]
It filters out rows where any cell contains non-ASCII characters. Here is a sample DataFrame with the defined function applied:
import pandas as pd
import re
data = {
    'column1': [
        '199 Central Avenue',
        '1664 O\'block Road',
        '1630 Hahn\'s Dairy Road',
        'N 40 Degrees 36\' 15" W -75 Degrees -27\' -52"',
        '4061605 North Lat-7538\'39* West Long.',
        '40�3\'13" North',
        '40@$%½3\'13" North'
    ],
    'column2': [
        'Some other data',
        'More data',
        'Even more data',
        'Data',
        'Random text',
        'Special char text',
        'cvs'
    ],
    'column3': [
        '123',
        '456',
        '789',
        '012',
        '345',
        '678',
        '243'
    ],
    'column4': [
        'Another column',
        'With some text',
        'Without special char',
        'Text',
        'More text',
        'More special char text',
        'd1'
    ]
}
df = pd.DataFrame(data)
# Define a function to check for special characters in a string
def has_special_chars(s):
    return bool(re.search(r'[^\x00-\x7F]', s))
# Apply the function to filter the DataFrame
filtered_df = df[~df.apply(lambda row: any(has_special_chars(str(cell)) for cell in row), axis=1)]
print(filtered_df)
You will get the output as shown below (the last two rows, which contain non-ASCII characters, are dropped):

| column1 | column2 | column3 | column4 |
|---|---|---|---|
| 199 Central Avenue | Some other data | 123 | Another column |
| 1664 O'block Road | More data | 456 | With some text |
| 1630 Hahn's Dairy Road | Even more data | 789 | Without special char |
| N 40 Degrees 36' 15" W -75 Degrees -27' -52" | Data | 012 | Text |
| 4061605 North Lat-7538'39* West Long. | Random text | 345 | More text |
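One performance note: since you have 50K to 100K rows, be aware that df.apply with axis=1 runs a Python-level loop over every row. A vectorized alternative (a sketch using pandas' built-in .str.contains instead of the helper function; the two-row sample data here is my own) builds the same boolean mask column by column:

```python
import pandas as pd

df = pd.DataFrame({
    'column1': ['199 Central Avenue', '40\u00bd3\'13" North'],
    'column2': ['Some other data', 'Special char text'],
})

# Vectorized per-column check: True wherever a cell contains a non-ASCII character,
# then collapse to one flag per row with any(axis=1)
mask = df.astype(str).apply(lambda col: col.str.contains(r'[^\x00-\x7F]', regex=True)).any(axis=1)
filtered_df = df[~mask]
print(filtered_df)  # keeps only the all-ASCII first row
```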
If you are working with a Spark DataFrame, you can use the code below:
import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
# Initialize Spark session
spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()
# Given DataFrame
data = [
('199 Central Avenue', 'Some other data', '123', 'Another column'),
('1664 O\'block Road', 'More data', '456', 'With some text'),
('1630 Hahn\'s Dairy Road', 'Even more data', '789', 'Without special char'),
('N 40 Degrees 36\' 15" W -75 Degrees -27\' -52"', 'Data', '012', 'Text'),
('4061605 North Lat-7538\'39* West Long.', 'Random text', '345', 'More text'),
('40�3\'13" North', 'Special char text', '678', 'More special char text')
]
schema = ["column1", "column2", "column3", "column4"]
# Create DataFrame
df = spark.createDataFrame(data, schema)
# Define the function to check for special characters
def has_special_chars(s):
    return bool(re.search(r'[^\x00-\x7F]', s))
# Register the UDF
has_special_chars_udf = udf(has_special_chars, BooleanType())
# Apply the UDF to filter the DataFrame
filtered_df = df.filter(~has_special_chars_udf(df.column1) &
~has_special_chars_udf(df.column2) &
~has_special_chars_udf(df.column3) &
~has_special_chars_udf(df.column4))
# Show the filtered DataFrame
filtered_df.show()