pythonpandasjupyter-notebook

I converted my .ipynb file into a .py script but when I run it, it does not work. (VS Code)


What i'm trying to do in this notebook is reading the file "updated_coffee_sales_data.xls" with Pandas, cleaning the file, and then exporting it into another excel file. This work with Notebooks but I want to slowly learn how to do this in Python scripts.

I tried just normal export and removed all the #%% markdowns and all, when I try to run, it does now work

# # **DATA CLEANING, WRANGLING, AND MANIPULATION**
# 
# **(Before Data Visualization)**

# ## Importing Pandas and creating DataFrames
# 
# First I import necessary packages and create a DataFrame for each sheet.

import pandas as pd
import os

pwd = os.getcwd()

df1 = pd.read_excel(pwd + r'\updated_coffee_sales_data.xlsx', sheet_name='transactions')
df2 = pd.read_excel(pwd + r'\updated_coffee_sales_data.xlsx', sheet_name='employees')
df3 = pd.read_excel(pwd + r'\updated_coffee_sales_data.xlsx', sheet_name='coffee')

# Then I check to see if there are any columns with mismatched Dtypes

df1.info()

df2.info()

df3.info()

# ## Merging tables
# 
# Everything looks fine so I can proceed.
# 
# In this case, I will join all of the tables together to create one DataFrame.
# 
# The first step to achieving this is to rename the columns and make the match across all DataFrames.

df1 = df1.rename(columns=
          {'Transaction ID' : 'trn_id', 'Emp_id' : 'emp_id',
          'Employee Name' : 'emp_name', 'Date and Time' : 'date_time',
          'Purchased Coffee Type' : 'coffee_type', 'Purchase Quantity' : 'purchase_qty',
          'Payment Method' : 'pay_method', 'Customer Name' : 'customer_name'}
     ) 
df2 = df2.rename(columns=
          {'Emp_id' : 'emp_id', 'Emp_Name' : 'emp_name',
          'Job Title' : 'job_title', 'Hourly Salary' : 'hr_salary'}
     )
df3 = df3.rename(columns=
          {'Coffee type' : 'coffee_type'}
     )

# Now that everything is named properly, I can merge the three DataFrames with no problems.

df_merged = df1.merge(df2, on='emp_id', how='inner').merge(df3, on='coffee_type', how='inner')
df_merged.head()

# ## Data cleaning.
# 
# Now it is finally time to clean the data.
# 
# From what I can see, there are a few changes that I can make:
# 1. Shorten the transaction IDs to the first 8 string.
# 2. Drop 1 employee name column 'emp_name_y" and renaming "emp_name_x".
# 3. Separate the date and time then drop the column 'date_time'
# 4. Rearrange the columns.
# 5. Finishing touches.

df_cleaning = df_merged.copy() # Always best practice to create new copies of your dataframes.

df_cleaning['trn_id'] = df_cleaning['trn_id'].str[:8] # Shorten the transaction IDs to the first 8 string.
df_cleaning = df_cleaning.rename(columns={'emp_name_x' : 'emp_name'}) # Drop 1 employee name column 'emp_name_y" and renaming "emp_name_x".
df_cleaning = df_cleaning.drop(columns=['emp_name_y'])
df_cleaning['date'] = df_cleaning['date_time'].dt.date # Separate the date and time then drop the column 'date_time'
df_cleaning['time'] = df_cleaning['date_time'].dt.time
df_cleaning = df_cleaning.drop(columns=['date_time'])
df_cleaning.head()

# Rearrange the columns.
reorder = [
     'trn_id', 'emp_id', 'hr_salary', 'job_title', 'emp_name',
     'date', 'time', 'coffee_type', 'purchase_qty', 'price',
     'total_price', 'pay_method', 'customer_name'
]

df_reorder = df_cleaning[reorder]
df_reorder.head()

df_reorder.info()

# Upon inspection, I noticed that date is Dtype Objects which will be a problem if not fixed.
# 
# Let me fix that.

df_dt = df_reorder.copy() # Always best practice to create new copies of your dataframes.


df_dt['date'] = pd.to_datetime(df_dt['date'], format='%Y-%m-%d')
df_dt.info()

df_dt.head()

# ## Exporting to CSV
# 
# Now that the data is clean and ready for visualization.
# I am going to export it into a csv file.

df_final = df_dt.copy()
df_final.to_csv('updated_coffee_sales_cleanedPY.csv', index=False)

I try running the import and the pwd = os.getcwd() in an interactive notebook first then try to run it again still did not work.

I can't copy the error in the terminal but what it's saying is:

[Errno 2]No such file or directory: 'D:\\Admin Files\\Desktop\DataS Proj\\Pproj_Coffee'

Solution

  • I found the issue... Python had a problem with locating file because of who I told it to locate the path:

    import pandas as pd
    import os
    
    pwd = os.getcwd()
    
    df1 = pd.read_excel(pwd + r'\updated_coffee_sales_data.xlsx', sheet_name='transactions')
    df2 = pd.read_excel(pwd + r'\updated_coffee_sales_data.xlsx', sheet_name='employees')
    df3 = pd.read_excel(pwd + r'\updated_coffee_sales_data.xlsx', sheet_name='coffee')
    

    What I should've done is to use "os.path.join" and the best solution is to also make that entire file path into a variable.

    import os
    import pandas as pd
    
    # Get the current working directory
    pwd = os.getcwd()
    
    # Construct file paths
    file_path = os.path.join(pwd, 'updated_coffee_sales_data.xlsx')
    
    # Read Excel files
    df1 = pd.read_excel(file_path, sheet_name='transactions')
    df2 = pd.read_excel(file_path, sheet_name='employees')
    df3 = pd.read_excel(file_path, sheet_name='coffee')
    

    This solution works for me. (I found another problem when I exported the file but fixed it using the same concept of actually defining the file path)