People usually use Excel or R to clean and modify data, and only import it into Python once it is clean. But let's do the cleaning and modifying in Python alone. I used a credit card dataset from DataHub to see who is a good risk and who is a bad risk based on credit usage. You can find the example file on my GitHub.
Import data
import pandas as pd
import numpy as np

df = pd.read_csv('credit.csv')
View DataFrame
df.head(5)
Check Nulls (If False: the column has no nulls. If True: it has at least one)
df.isnull().any()
To see how many Nulls each column has
df.isnull().sum()
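A quick sketch combining these checks (nothing dataset-specific here):
df.isnull().any()        # per column: True if the column has any nulls
df.isnull().sum()        # per column: how many nulls
df.isnull().sum().sum()  # total number of nulls in the whole DataFrame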
To see unique values in a column
df['column'].unique()
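For example, on this dataset (assuming the class column from the DataHub credit file, which holds the good/bad risk label):
df['class'].unique()  # e.g. array(['good', 'bad'], dtype=object)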
Create a new column based on another column using a case statement
newcolumn = []
for row in df.basecolumn:
    if row in ['value from basecolumn']:
        newcolumn.append('newvalue')
    elif row in ['value from basecolumn']:
        newcolumn.append('newvalue')
    elif row in ['value from basecolumn']:
        newcolumn.append('newvalue')
    else:
        newcolumn.append('newvalue')
df['newcolumn'] = newcolumn
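As a concrete sketch on this dataset (the class column and its good/bad values come from the file; the risk_flag name and labels are my own):
risk_flag = []
for row in df['class']:
    if row in ['good']:
        risk_flag.append('low risk')
    elif row in ['bad']:
        risk_flag.append('high risk')
    else:
        risk_flag.append('unknown')
df['risk_flag'] = risk_flag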
To create bins and new column in pandas
df['xxxxbins'] = pd.cut(df['existingcolumn'], n)  # to create bins; n = the number of bins you want
datause['age_bins'] = pd.cut(datause['age'], 6)  # example
df['xxxxbins'].unique()  # to check the bins
datause['age_bins'].unique()  # example, output below:
[(65.667, 75.0], (18.944, 28.333], (47.0, 56.333], (37.667, 47.0], (28.333, 37.667], (56.333, 65.667]]
Categories (6, interval[float64]): [(18.944, 28.333] < (28.333, 37.667] < (37.667, 47.0] < (47.0, 56.333] < (56.333, 65.667] < (65.667, 75.0]]
df['xxxxbins'] = pd.cut(x=df['existingcolumn'], bins=[bin, bin, bin, bin, bin, bin, bin], labels=['label', 'label', 'label', 'label', 'label', 'label'])  # 7 bin edges make 6 bins, so you need 6 labels
datause['agerange'] = pd.cut(x=datause['age'], bins=[20, 30, 40, 50, 60, 70, 80], labels=['20', '30', '40', '50', '60', '70'])  # example
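To sanity-check the labeled bins, value_counts shows how many rows landed in each one (using the agerange column from the example above):
datause['agerange'].value_counts()  # row count per labeled bin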
Create bins by hand
df.describe()  # let the mean, min, 25%, 50%, 75% and max be your guide
df['xxxxbins'] = pd.cut(x=df['existingcolumn'], bins=[bin, bin, bin, bin, bin])
datanew['credit_amount_bins'] = pd.cut(x=datanew['credit_amount'], bins=[250, 3972, 8000, 10424, 18424])  # example
df['xxxxbins'].unique()  # to check the bins
datanew['credit_amount_bins'].unique()  # example, output below:
[(250.0, 3972.0], (3972.0, 8000.0], (8000.0, 10424.0], (10424.0, 18424.0], NaN]
Categories (4, interval[int64]): [(250, 3972] < (3972, 8000] < (8000, 10424] < (10424, 18424]]
# See the NaN: pd.cut leaves the left edge open, so the row where credit_amount equals 250 falls outside every bin.
# You will need to fix this later by checking isnull().sum(). In this dataset there was only one,
# so I used df = df.dropna(axis=0) to just delete it, but this drops every row with a NaN, so use with caution.
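A minimal sketch of picking the edges programmatically from describe(), assuming the credit_amount column from the example; include_lowest=True avoids the NaN at the minimum value:
q = datanew['credit_amount'].describe()
edges = [q['min'], q['25%'], q['50%'], q['75%'], q['max']]  # quartile-based bin edges
datanew['credit_amount_bins'] = pd.cut(datanew['credit_amount'], bins=edges, include_lowest=True)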
To fix Nulls by using values from another column
# fix the nulls by copying values over from another column
def fx(x):
    if pd.isnull(x['xxxxbins']):
        return x['existingcolumn']
    else:
        return x['xxxxbins']

print(df)
df['xxxxbins'] = df.apply(lambda x: fx(x), axis=1)
print(df)
If you are working with numbers only, then use:
if np.isnan
If you are working with non-numbers (or a mix), then use:
if pd.isnull
pd.isnull handles both numeric and non-numeric values, so it is the safer default.
Example
def fx(x):
    if pd.isnull(x['agerange']):
        return x['age']
    else:
        return x['agerange']

print(datanew)
datanew['agerange'] = datanew.apply(lambda x: fx(x), axis=1)
print(datanew)
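The same fix can be written more compactly with fillna, which replaces nulls in one column with values from another; a sketch:
# note: if agerange is still categorical from pd.cut, cast it first
datanew['agerange'] = datanew['agerange'].astype(object).fillna(datanew['age'])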
Convert from float to int64
df['xxxxbins'] = np.int64(df['xxxxbins']) #change float to int64
Example
datanew['agerange'] = np.int64(datanew['agerange']) #change float to int64
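The same conversion is more commonly written with astype; a sketch:
datanew['agerange'] = datanew['agerange'].astype(np.int64)  # cast the column to int64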
Add a counter
df['count'] = 1 #add a counter to the dataframe for future groupings
Example
datanew['count'] = 1 #add a counter to the dataframe for future groupings
Why use a counter?
When you use groupby, you will be able to get a count of your findings.
df.groupby('class').sum()
Example
datanew.groupby('class').sum() # Using this dataset, you will get bad = 300, good = 699
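As a closing sketch, the counter makes the group sizes explicit; groupby's size() gives the same numbers without one:
print(datanew.groupby('class')['count'].sum())  # e.g. bad 300, good 699
print(datanew.groupby('class').size())          # same counts, no counter column needed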