Pandas: Cleaning Empty Cells
- Remove Cells
If there are some empty rosws, you would want to remove them.
Example
Return a new Data Frame with no empty cells.
import pandas as pd
df = pd.read_csv('hello.csv')
new_df = df.dropna()
print(new_df.to_string())
** dropna() method returns a new Dataframe, not changing the original one. **
When you want to keep the orignial dataframe, use the inplace = True argument.
Example
Remove all rows with NULL values.
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
print(df.to_string())
** The dropna(inplace = True) will not return a new dataframe, but it will remove all rows containing NULL values from the original dataframe.
- Replace Empty Values
Another way to remove empty cells is inserting new value instead.
The fillna() method allows us to replace empty cells with a value.
Example
Replace NULL values with the number 20.
import pandas as pd
df = pd.read_csv('may.csv')
df.fillna(20, inplace = True)
- Replace Only For Specified Columns
If you want to replace empty values for one column, specify the column name for the dataframe.
Example
Replace NULL values in the "Address" columns with the number 20.
import pandas as pd
df = pd.read_csv('may.csv')
df ["Address"].fillna(20, inplace = True)
- Replace Using Mean, Median, or Mode
There is another way that you can remove empty cells using mean(), median(), mode().
** mean is the average value (the sum of all values divided by number of values), median is the value in the middle, after you have sorted all values ascending, mode is the value that appears most frequently. **
Example
Calculate the MEAN and replace any empty values with it.
import pandas as pd
df = pd.read_csv("berry.csv')
a = df["Address"].mean()
df ["Address"].fillna(a, inplace = True)
Calculate the MODE and replace any empty values with it.
import pandas as pd
df = pd.read_csv('hello.csv')
x = df["Address"].mode()[0]
df["Address"].fillna(x, inplace = True)
Exercise
1. You are working with a dataset that contains information about the sales of products. The dataset includes a column called "ProductCategory." Your task is to calculate the mode (most frequent category) in the "ProductCategory" column and replace any empty (missing) values with this mode.
import pandas as pd
df = pd.read_csv('productcategory.csv')
a = df["Productcategory"].mode()[0]
df["Productcategory"].fillna(a, inplace = True)
2. You are working with a dataset that contains information about students' exam scores. The dataset has columns for "StudentID," "Name," "MathScore," and "ScienceScore." Some of the students have missing scores. Your task is to clean the data by removing rows with missing scores.
import pandas as pd
df = pd.read_csv('studentid.csv')
studentid_df = df.dropna()
print(studentid_df)