Pandas: Cleaning Empty Cells

Python

Pandas: Cleaning Empty Cells

grace21110 2023. 10. 29. 10:03

Remove Cells

If there are some empty rosws, you would want to remove them.

Example

Return a new Data Frame with no empty cells.

import pandas as pd

df = pd.read_csv('hello.csv')

new_df = df.dropna()

print(new_df.to_string())

** dropna() method returns a new Dataframe, not changing the original one. **

When you want to keep the orignial dataframe, use the inplace = True argument.

Example

Remove all rows with NULL values.

import pandas as pd

df = pd.read_csv('data.csv')

df.dropna(inplace = True)

print(df.to_string())

** The dropna(inplace = True) will not return a new dataframe, but it will remove all rows containing NULL values from the original dataframe.

Replace Empty Values

Another way to remove empty cells is inserting new value instead.

The fillna() method allows us to replace empty cells with a value.

Example

Replace NULL values with the number 20.

import pandas as pd 
df = pd.read_csv('may.csv')

df.fillna(20, inplace = True)

Replace Only For Specified Columns

If you want to replace empty values for one column, specify the column name for the dataframe.

Example

Replace NULL values in the "Address" columns with the number 20.

import pandas as pd 
df = pd.read_csv('may.csv')

df ["Address"].fillna(20, inplace = True)

Replace Using Mean, Median, or Mode

There is another way that you can remove empty cells using mean(), median(), mode().

** mean is the average value (the sum of all values divided by number of values), median is the value in the middle, after you have sorted all values ascending, mode is the value that appears most frequently. **

Example

Calculate the MEAN and replace any empty values with it.

import pandas as pd 
df = pd.read_csv("berry.csv')

a = df["Address"].mean()

df ["Address"].fillna(a, inplace = True)

Calculate the MODE and replace any empty values with it.

import pandas as pd

df = pd.read_csv('hello.csv')

x = df["Address"].mode()[0]

df["Address"].fillna(x, inplace = True)

Exercise

1. You are working with a dataset that contains information about the sales of products. The dataset includes a column called "ProductCategory." Your task is to calculate the mode (most frequent category) in the "ProductCategory" column and replace any empty (missing) values with this mode.

import pandas as pd 

df = pd.read_csv('productcategory.csv')

a = df["Productcategory"].mode()[0]

df["Productcategory"].fillna(a, inplace = True)

2. You are working with a dataset that contains information about students' exam scores. The dataset has columns for "StudentID," "Name," "MathScore," and "ScienceScore." Some of the students have missing scores. Your task is to clean the data by removing rows with missing scores.

import pandas as pd

df = pd.read_csv('studentid.csv')

studentid_df = df.dropna()

print(studentid_df)