People usually use Excel or R to clean and modify data, and only import it into Python once it is clean. But let's do the cleaning and modifying in Python alone. I used a credit card dataset from DataHub to see who is a good risk and who is a bad risk based on credit usage. You can find the example file on my GitHub.
Import data
import pandas as pd
import numpy as np

df = pd.read_csv('credit.csv')
View DataFrame
df.head(5)
Check Nulls (If False: the column has no nulls. If True: it has at least one)
df.isnull().any()
To see how many Nulls each column has
df.isnull().sum()
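A quick sketch combining these checks (nothing dataset-specific here):
df.isnull().any()        # per column: True if the column has any nulls
df.isnull().sum()        # per column: how many nulls
df.isnull().sum().sum()  # total number of nulls in the whole DataFrame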
To see unique values in a column
df['column'].unique()
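For example, on this dataset (assuming the class column from the DataHub credit file, which holds the good/bad risk label):
df['class'].unique()  # e.g. array(['good', 'bad'], dtype=object)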
Create a new column based on another column using a case statement
newcolumn = []
for row in df.basecolumn:
    if row in ['value from basecolumn']:
        newcolumn.append('newvalue')
    elif row in ['value from basecolumn']:
        newcolumn.append('newvalue')
    elif row in ['value from basecolumn']:
        newcolumn.append('newvalue')
    else:
        newcolumn.append('newvalue')
df['newcolumn'] = newcolumn
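As a concrete sketch on this dataset (the class column and its good/bad values come from the file; the risk_flag name and labels are my own):
risk_flag = []
for row in df['class']:
    if row in ['good']:
        risk_flag.append('low risk')
    elif row in ['bad']:
        risk_flag.append('high risk')
    else:
        risk_flag.append('unknown')
df['risk_flag'] = risk_flag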
To create bins and new column in pandas
df['xxxxbins'] = pd.cut(df['existingcolumn'], n)  # to create bins; n = the number of bins you want
datause['age_bins'] = pd.cut(datause['age'], 6)  # example
df['xxxxbins'].unique()  # to check the bins
datause['age_bins'].unique()  # example, output below:
[(65.667, 75.0], (18.944, 28.333], (47.0, 56.333], (37.667, 47.0], (28.333, 37.667], (56.333, 65.667]]
Categories (6, interval[float64]): [(18.944, 28.333] < (28.333, 37.667] < (37.667, 47.0] < (47.0, 56.333] < (56.333, 65.667] < (65.667, 75.0]]
df['xxxxbins'] = pd.cut(x=df['existingcolumn'], bins=[bin, bin, bin, bin, bin, bin, bin], labels=['label', 'label', 'label', 'label', 'label', 'label'])  # 7 bin edges make 6 bins, so you need 6 labels
datause['agerange'] = pd.cut(x=datause['age'], bins=[20, 30, 40, 50, 60, 70, 80], labels=['20', '30', '40', '50', '60', '70'])  # example
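To sanity-check the labeled bins, value_counts shows how many rows landed in each one (using the agerange column from the example above):
datause['agerange'].value_counts()  # row count per labeled bin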
Create bins by hand
df.describe()  # let the mean, min, 25%, 50%, 75% and max be your guide
df['xxxxbins'] = pd.cut(x=df['existingcolumn'], bins=[bin, bin, bin, bin, bin])
datanew['credit_amount_bins'] = pd.cut(x=datanew['credit_amount'], bins=[250, 3972, 8000, 10424, 18424])  # example
df['xxxxbins'].unique()  # to check the bins
datanew['credit_amount_bins'].unique()  # example, output below:
[(250.0, 3972.0], (3972.0, 8000.0], (8000.0, 10424.0], (10424.0, 18424.0], NaN]
Categories (4, interval[int64]): [(250, 3972] < (3972, 8000] < (8000, 10424] < (10424, 18424]]
# See the NaN: pd.cut leaves the left edge open, so the row where credit_amount equals 250 falls outside every bin.
# You will need to fix this later by checking isnull().sum(). In this dataset there was only one,
# so I used df = df.dropna(axis=0) to just delete it, but this drops every row with a NaN, so use with caution.
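A minimal sketch of picking the edges programmatically from describe(), assuming the credit_amount column from the example; include_lowest=True avoids the NaN at the minimum value:
q = datanew['credit_amount'].describe()
edges = [q['min'], q['25%'], q['50%'], q['75%'], q['max']]  # quartile-based bin edges
datanew['credit_amount_bins'] = pd.cut(datanew['credit_amount'], bins=edges, include_lowest=True)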
To fix Nulls by using values from another column
# fix the nulls by copying values over from another column
def fx(x):
    if pd.isnull(x['xxxxbins']):
        return x['existingcolumn']
    else:
        return x['xxxxbins']

print(df)
df['xxxxbins'] = df.apply(lambda x: fx(x), axis=1)
print(df)
If you are working with numbers only, then use:
if np.isnan
If you are working with non-numbers (or a mix), then use:
if pd.isnull
pd.isnull handles both numeric and non-numeric values, so it is the safer default.
Example
def fx(x):
    if pd.isnull(x['agerange']):
        return x['age']
    else:
        return x['agerange']

print(datanew)
datanew['agerange'] = datanew.apply(lambda x: fx(x), axis=1)
print(datanew)
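The same fix can be written more compactly with fillna, which replaces nulls in one column with values from another; a sketch:
# note: if agerange is still categorical from pd.cut, cast it first
datanew['agerange'] = datanew['agerange'].astype(object).fillna(datanew['age'])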
Convert from float to int64
df['xxxxbins'] = np.int64(df['xxxxbins']) #change float to int64
Example
datanew['agerange'] = np.int64(datanew['agerange']) #change float to int64
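The same conversion is more commonly written with astype; a sketch:
datanew['agerange'] = datanew['agerange'].astype(np.int64)  # cast the column to int64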
Add a counter
df['count'] = 1 #add a counter to the dataframe for future groupings
Example
datanew['count'] = 1 #add a counter to the dataframe for future groupings
Why use a counter?
When you use groupby, you will be able to get a count of your findings.
df.groupby('class').sum()
Example
datanew.groupby('class').sum() # Using this dataset, you will get bad = 300, good = 699
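As a closing sketch, the counter makes the group sizes explicit; groupby's size() gives the same numbers without one:
print(datanew.groupby('class')['count'].sum())  # e.g. bad 300, good 699
print(datanew.groupby('class').size())          # same counts, no counter column needed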