Pandas is a Python package for data processing, and one of the most common foundational libraries when using Python for machine learning. This article is the second tutorial in our pandas series. We recommend reading the first introductory pandas tutorial before exploring this one.
File Operations
The pandas library provides a series of `read_*` functions for reading files in various formats. They are as follows:
- read_csv
- read_table
- read_fwf
- read_clipboard
- read_excel
- read_hdf
- read_html
- read_json
- read_msgpack
- read_pickle
- read_sas
- read_sql
- read_stata
- read_feather
Reading Excel files
Note: to read Excel files, you also need to install another library: xlrd.
You can do this via pip:
```
$ sudo pip3 install xlrd
$ pip3 show xlrd
Name: xlrd
Version: 1.1.0
Summary: Library for developers to extract data from Microsoft Excel (tm) spreadsheet files
Home-page: http://www.python-excel.org/
Author: John Machin
Author-email: sjmachin@lexicon.net
License: BSD
Location: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages
Requires:
```
Next, we look at a simple example of reading Excel:
```python
# file_operation.py
import pandas as pd
import numpy as np

df1 = pd.read_excel("data/test.xlsx")
print("df1:\n{}\n".format(df1))
```

The content of this Excel file is as follows:

```
df1:
   C  Mon
0  D  Tue
1  E  Wed
2  F  Thu
3  G  Fri
4  A  Sat
5  B  Sun
```
Note: The code and data files for this article are available through the Github repository mentioned at the beginning of the article.
Read CSV file
Next, let’s look at an example of reading a CSV file.
The contents of the first CSV file are as follows:
```
$ cat test1.csv
C,Mon
D,Tue
E,Wed
F,Thu
G,Fri
A,Sat
```
The way to read is also very simple:
```python
# file_operation.py
df2 = pd.read_csv("data/test1.csv")
print("df2:\n{}\n".format(df2))
```
Let’s look at the second example. The contents of this file are as follows:
```
$ cat test2.csv
C|Mon
D|Tue
E|Wed
F|Thu
G|Fri
A|Sat
```
Strictly speaking, this is not a CSV file because its data is not separated by commas. In this case, we can read this file by specifying a delimiter, like so:
```python
# file_operation.py
df3 = pd.read_csv("data/test2.csv", sep="|")
print("df3:\n{}\n".format(df3))
```
In fact, `read_csv` supports a very large number of parameters to adjust how the file is read, as shown in the following table:
| Parameter | Description |
| --- | --- |
| `path` | File path |
| `sep` or `delimiter` | Field separator |
| `header` | Row number to use for the column names; the default is 0 (first row) |
| `index_col` | Column number or name to use as the row index of the result |
| `names` | List of column names for the result |
| `skiprows` | Number of rows to skip from the start of the file |
| `na_values` | Sequence of values to replace with NA |
| `comment` | Character used to split comments off the end of lines |
| `parse_dates` | Try to parse the data to datetime; the default is False |
| `keep_date_col` | If columns are joined to parse dates, keep the joined columns; the default is False |
| `converters` | Column converters |
| `dayfirst` | When parsing dates that could be ambiguous, treat the first number as the day; the default is False |
| `date_parser` | Function used to parse dates |
| `nrows` | Number of rows to read from the file |
| `iterator` | Return a TextParser object for reading the content piece by piece |
| `chunksize` | Size of the chunks to read |
| `skip_footer` | Number of rows to ignore at the end of the file |
| `verbose` | Print various parsing information |
| `encoding` | File encoding |
| `squeeze` | If the parsed data contains only one column, return a Series |
| `thousands` | Thousands separator |
See the detailed read_csv function description here: pandas.read_csv
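To tie a few of these parameters together, here is a small sketch that uses `comment`, `names`, `na_values`, and `nrows` while reading CSV text from an in-memory buffer. The data here is made up purely for illustration and is not one of the article's data files:

```python
import io

import pandas as pd

# A small, made-up CSV: a comment line, no header row, and -1 used
# as a sentinel for missing data.
raw = io.StringIO(
    "# exported from somewhere\n"
    "C,Mon,10\n"
    "D,Tue,-1\n"
    "E,Wed,30\n"
)

df = pd.read_csv(
    raw,
    comment="#",                       # drop everything after '#' on a line
    names=["letter", "day", "value"],  # supply the column names ourselves
    na_values=[-1],                    # treat -1 as NA
    nrows=3,                           # read at most 3 data rows
)
print(df)
```

The comment line is skipped, the columns get our names, and the `-1` in the `value` column shows up as NaN.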
Handling Invalid Values
The real world is not perfect. The data we read often has some invalid values. If you do not deal with these invalid values, it will cause great disruption to the program.
There are two main ways to treat invalid values: ignore them outright, or replace them with valid values.
Below I first create a data structure that contains invalid values, then use the `pandas.isna` function to confirm which values are invalid:
```python
# process_na.py
import pandas as pd
import numpy as np

df = pd.DataFrame([[1.0, np.nan, 3.0, 4.0],
                   [5.0, np.nan, np.nan, 8.0],
                   [9.0, np.nan, np.nan, 12.0],
                   [13.0, np.nan, 15.0, 16.0]])
print("df:\n{}\n".format(df))
print("pd.isna(df):\n{}\n".format(pd.isna(df)))
```

The output is as follows:

```
df:
      0   1     2     3
0   1.0 NaN   3.0   4.0
1   5.0 NaN   NaN   8.0
2   9.0 NaN   NaN  12.0
3  13.0 NaN  15.0  16.0

pd.isna(df):
       0     1      2      3
0  False  True  False  False
1  False  True   True  False
2  False  True   True  False
3  False  True  False  False
```
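Since `pd.isna` returns a boolean DataFrame, a handy follow-up idiom (not in the original article, but a common pattern) is to sum it to count the invalid values in each column:

```python
import numpy as np
import pandas as pd

# The same data as above: column 1 is entirely NaN, column 2 partially.
df = pd.DataFrame([[1.0, np.nan, 3.0, 4.0],
                   [5.0, np.nan, np.nan, 8.0],
                   [9.0, np.nan, np.nan, 12.0],
                   [13.0, np.nan, 15.0, 16.0]])

# True counts as 1 when summed, so this yields the NaN count per column:
# column 1 has 4, column 2 has 2, the other columns have 0.
print(df.isna().sum())
```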
Ignoring Invalid Values
We can discard invalid values with the `pandas.DataFrame.dropna` function:
```python
# process_na.py
print("df.dropna():\n{}\n".format(df.dropna()))
```
Note: by default, `dropna` does not change the original data structure; it returns a new one instead. If you want to change the data directly, pass the argument `inplace=True` when calling the function.
For our original structure, once every row containing an invalid value is discarded, nothing remains, so this line of code outputs the following:

```
df.dropna():
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
```
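The frame came out empty because the default `how='any'` drops a row as soon as it contains a single invalid value, and every row above had one. A minimal sketch with different, made-up data shows that rows without NaN survive, and what `inplace=True` does:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 2.0],
                   [3.0, np.nan],
                   [5.0, 6.0]])

# Only row 1 contains a NaN, so only row 1 is dropped.
print(df.dropna())

# With inplace=True, df itself is modified rather than a copy returned.
df.dropna(inplace=True)
print(len(df))  # 2 rows remain
```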
We can also choose to discard only the columns in which every value is invalid:
```python
# process_na.py
print("df.dropna(axis=1, how='all'):\n{}\n".format(df.dropna(axis=1, how='all')))
```

The output is as follows:

```
df.dropna(axis=1, how='all'):
      0     2     3
0   1.0   3.0   4.0
1   5.0   NaN   8.0
2   9.0   NaN  12.0
3  13.0  15.0  16.0
```
Note: `axis=1` indicates the column axis. `how` can take 'any' or 'all'; the default is 'any'.
Replacing Invalid Values
We can also fill in and replace invalid values with valid ones using the `fillna` function, like this:
```python
# process_na.py
print("df.fillna(1):\n{}\n".format(df.fillna(1)))
```

The output is as follows:

```
df.fillna(1):
      0    1     2     3
0   1.0  1.0   3.0   4.0
1   5.0  1.0   1.0   8.0
2   9.0  1.0   1.0  12.0
3  13.0  1.0  15.0  16.0
```
It may not make sense to replace all invalid values with the same value, so we can specify different fill values for different columns. To make this easier to follow, we first rename the rows and columns with the `rename` method before filling:
```python
# process_na.py
df.rename(index={0: 'index1', 1: 'index2', 2: 'index3', 3: 'index4'},
          columns={0: 'col1', 1: 'col2', 2: 'col3', 3: 'col4'},
          inplace=True)
df.fillna(value={'col2': 2}, inplace=True)
df.fillna(value={'col3': 7}, inplace=True)
print("df:\n{}\n".format(df))
```

The output is as follows:

```
df:
        col1  col2  col3  col4
index1   1.0   2.0   3.0   4.0
index2   5.0   2.0   7.0   8.0
index3   9.0   2.0   7.0  12.0
index4  13.0   2.0  15.0  16.0
```
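Beyond fixed values, `fillna` also accepts a Series, which it matches to columns by label. Here is a sketch, with made-up data, that fills each column's gaps with that column's own mean:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1.0, np.nan, 3.0],
                   'col2': [np.nan, 5.0, 6.0]})

# df.mean() skips NaN, giving col1 -> 2.0 and col2 -> 5.5;
# fillna matches these values to the columns by label.
filled = df.fillna(df.mean())
print(filled)
```

This is a common way to impute missing numeric data without hand-picking a fill value per column.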
Processing Strings
Data processing often involves strings, so let's look at string manipulation in pandas.
The `str` attribute of a Series provides a set of functions for processing strings. And these functions handle invalid values automatically.
Here are some examples. In the first set of data, we deliberately set some strings containing spaces:
```python
# process_string.py
import pandas as pd

s1 = pd.Series([' 1', '2 ', ' 3 ', '4', '5'])
print("s1.str.lstrip():\n{}\n".format(s1.str.lstrip()))
print("s1.str.strip():\n{}\n".format(s1.str.strip()))
print("s1.str.isdigit():\n{}\n".format(s1.str.isdigit()))
```

The output is as follows:

```
s1.str.lstrip():
0    1
1    2 
2    3 
3    4
4    5
dtype: object

s1.str.strip():
0    1
1    2
2    3
3    4
4    5
dtype: object

s1.str.isdigit():
0    False
1    False
2    False
3     True
4     True
dtype: bool
```
Here are some other examples showing the handling of uppercase, lowercase, and string lengths:
```python
# process_string.py
s2 = pd.Series(['Stairway to Heaven', 'Eruption', 'Freebird',
                'Comfortably Numb', 'All Along the Watchtower'])
print("s2.str.lower():\n{}\n".format(s2.str.lower()))
print("s2.str.upper():\n{}\n".format(s2.str.upper()))
print("s2.str.len():\n{}\n".format(s2.str.len()))
```

The output is as follows:

```
s2.str.lower():
0          stairway to heaven
1                    eruption
2                    freebird
3            comfortably numb
4    all along the watchtower
dtype: object

s2.str.upper():
0          STAIRWAY TO HEAVEN
1                    ERUPTION
2                    FREEBIRD
3            COMFORTABLY NUMB
4    ALL ALONG THE WATCHTOWER
dtype: object

s2.str.len():
0    18
1     8
2     8
3    16
4    24
dtype: int64
```
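A few more `str` functions worth knowing are `contains`, `replace`, and `split`. As noted above, invalid values pass through untouched instead of raising an error; a small sketch with made-up data:

```python
import pandas as pd

s = pd.Series(['Stairway to Heaven', 'Eruption', None, 'Freebird'])

# The None element stays NaN in every result rather than causing an error.
print(s.str.contains('to'))     # substring test -> boolean Series
print(s.str.replace(' ', '_'))  # literal replacement
print(s.str.split())            # list of words per element
```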
Conclusion
In this article, we covered the most basic data processing operations in pandas. We hope you found the tutorial helpful, and if you have any queries, please drop a comment in the comment box below. We will get back to you as soon as possible.
Note: I learned this pandas material from these great resources: Pandas – Powerful Python Data Analysis Toolkit and Python Data Analysis by J. Metz. I highly recommend reading them if you are more interested in pandas and Python data analysis!