Data Science Explorer

How to duplicate each row by Python 본문

Projects

How to duplicate each row by Python

grace21110 2023. 11. 16. 21:03
반응형

I was working on data cleaning and faced a struggle today. I had to duplicate each row and put it together and did not know how to deal with this. It took me a few hours to think and look up the ways to figure it out. 

 

Luckily I found the way out so I am going to show you how I did it!

 

Step #1: You have to import the data. 

import pandas as pd 

ss = pd.read_csv('/content/총물량데이터.csv')
print (ss)

 

Step #2: Check what's inside of the file.

ss.info()

 

Step #3: Since the data that I imported included all the regions, I had to sort it out with only data of Seoul.

seoul_df = ss[(ss['SEND_CITY'] == '서울특별시') & (ss['REC_CITY'] == '서울특별시')]
print(seoul_df)

 

Step #4: Set the ID for each column 

seoul_df = ss[(ss['SEND_CITY'] == '서울특별시') & (ss['REC_CITY'] == '서울특별시')].reset_index(drop=True)

# Add a new 'ID' column
seoul_df['ID'] = range(1, len(seoul_df) + 1)

# Display the resulting DataFrame
print(seoul_df.head(2))

 

Step #5: Place the location of the ID column 

seoul_df = seoul_df[['ID'] + [col for col in ss.columns if col != 'ID']]
seoul_df.info()

 

Step #6: Make the same ID column double 

# Make the same ID column double
seoul_df = pd.concat([seoul_df, seoul_df], ignore_index=True)

# Set ID column as a standard 
seoul_df.sort_values(by=['ID'], inplace=True)

# Reset the index 
seoul_df.reset_index(drop=True, inplace=True)

seoul_df.head(10)

 

Step #7: Insert the new column name and fill it with 0

seoul_df_copy = seoul_df.copy()
seoul_df_copy['일하는_구'] = 0
seoul_df_copy.head()

Step #8: Use for loops to sort the Sending region for even numbers, and odd numbers for Receiving region. 

for idx, row in seoul_df_copy.iterrows():
  if idx % 2 == 0:
    seoul_df_copy.at[idx, '일하는_구'] = seoul_df_copy.at[idx, 'SEND_GU']
  else:
    seoul_df_copy.at[idx, '일하는_구'] = seoul_df_copy.at[idx, 'REC_GU']
seoul_df_copy = seoul_df_copy.drop(columns=['SEND_GU', 'REC_GU'])
seoul_df_copy.head()

Step #9: Rename 일하는_구 into English 

seoul_df_copy.rename(columns={'일하는_구': 'WORK_GU'}, inplace=True)
print(seoul_df_copy)

Step #10: Get the sum in total sorted by the categories

items = ['FURNITURE/INTERIOR', 'ETC', 'BOOK/ALBUM',
       'DIGITAL/HOME APP', 'LIFE/HEALTH', 'SPORTS/LEISURE', 'FOOD',
       'CHILDBIRTH/PARENTING', 'FASHION/CLOTHES', 'FASHION/ACC',
       'COSMETICS/BEAUTY']
      
result = seoul_df_copy.groupby(['DATE', 'YEAR', 'MONTH', 'DAY', 'DAY_OF', 'holiday', 'SEND_CITY', 'REC_CITY','WORK_GU'])[items].sum().reset_index()
result.loc[result['DAY_OF'] == 'DAY', 'holiday'] = 1
result

 

Step #11: Check if I have all columns I need 

seoul_df_copy.columns

Step #12: Export it into excel file 

from google.colab import files
result.to_csv("총물량데이터.csv", index=False)
files.download("총물량데이터.csv")

 

All set!