Pandas is a Python package for data processing. It is one of the most commonly used libraries when doing machine learning in Python, and this article is an introductory tutorial to it. Pandas provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data simple and intuitive. It is intended to be a high-level building block for practical data analysis in Python.
Pandas Introduction
Pandas is suitable for many different types of data, including:
- Tabular data with heterogeneously typed columns, such as SQL tables or Excel data.
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data with row and column labels (homogeneously or heterogeneously typed).
- Any other form of observation/statistical data set.
Since pandas is a Python package, you need a working Python environment on your machine. Setting one up is outside the scope of this article; instructions are easy to find online.
For instructions on how to get pandas, please refer to the official website: pandas Installation.
In general, we can perform the installation with pip:
sudo pip3 install pandas
Or install pandas through conda:
conda install pandas
At the time of writing, the latest version of pandas is v0.22.0.
You can find the source code and test data for this article on GitHub at: pandas tutorial.
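Since the latest release changes over time, it can help to check which version actually got installed. A minimal sketch, assuming pandas was installed into the same Python environment that runs the script:

```python
import pandas as pd

# pd.__version__ is a plain string; its exact value depends on what
# pip or conda installed on your machine.
installed_version = pd.__version__
print("pandas version: {}".format(installed_version))
```

If the import fails, the package was most likely installed for a different Python interpreter than the one running the script.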
In addition, pandas and NumPy are often used together, and the source code in this article uses NumPy. It is recommended that readers have some familiarity with NumPy before learning pandas.
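As a rough illustration of how the two libraries fit together, the sketch below applies a NumPy function to a pandas Series. The sample values and labels are hypothetical and are not part of the article's source file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with string labels.
s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# NumPy universal functions accept a Series and return a Series,
# keeping the original labels.
roots = np.sqrt(s)
print(roots)

# The underlying data is available as a NumPy array.
raw = s.values
print(type(raw))
```

The labels travel with the data through the NumPy call, which is the main thing pandas adds on top of a bare array.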
Core data structure
The core of pandas consists of two data structures: Series and DataFrame.
The comparison of these two types of data structures is as follows:
Name | Dimensions | Description |
Series | 1-dimensional | A labeled array of homogeneous type |
DataFrame | 2-dimensional | A labeled, size-mutable table whose columns can hold heterogeneous data |
A DataFrame can be thought of as a container of Series, i.e. a DataFrame can contain several Series.
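The container relationship can be seen by pulling a single column out of a DataFrame. This is a minimal sketch with hypothetical sample data:

```python
import pandas as pd

# Hypothetical two-column table: one text column, one integer column.
df = pd.DataFrame({"note": ["C", "D", "E"], "number": [1, 2, 3]})

# Selecting one column returns a Series that shares the row index.
note_column = df["note"]
print(type(note_column))
print(note_column.index.equals(df.index))

# Each column carries its own dtype.
print(df.dtypes)
```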
Note: Before the 0.20.0 release, pandas also had a three-dimensional data structure named Panel. The library name itself derives from “panel data”: pan(el)-da(ta)-s. Panel was deprecated in 0.20.0 because it was rarely used.
Series
Since a Series is a one-dimensional data structure, we can create one directly from an array, like this:

# data_structure.py

import pandas as pd
import numpy as np

series1 = pd.Series([1, 2, 3, 4])
print("series1:\n{}\n".format(series1))

The output of this code is as follows:

series1:
0    1
1    2
2    3
3    4
dtype: int64
This output is described as follows:
- The last line of output is the type of the data in the Series; here every value is of type int64.
- The data is shown in the second column. The first column is the index of the data, which pandas calls the Index.
We can print the data and indexes in the Series separately:
# data_structure.py

print("series1.values: {}\n".format(series1.values))
print("series1.index: {}\n".format(series1.index))

The output of these two lines is as follows:

series1.values: [1 2 3 4]

series1.index: RangeIndex(start=0, stop=4, step=1)
If you do not specify an index (as above), the index takes the form [0, N-1]. However, we can also specify the index when creating the Series. The index does not need to be an integer and can be any type of data, such as a string. For example, we can label seven values with the letter names of the seven musical notes. The purpose of the index is to look up the corresponding data, as in the following:
# data_structure.py

series2 = pd.Series([1, 2, 3, 4, 5, 6, 7],
    index=["C", "D", "E", "F", "G", "A", "B"])
print("series2:\n{}\n".format(series2))
print("E is {}\n".format(series2["E"]))

The output of this code is as follows:

series2:
C    1
D    2
E    3
F    4
G    5
A    6
B    7
dtype: int64

E is 3
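Label lookup is not limited to a single key. The sketch below rebuilds the same note Series under the hypothetical name notes and shows two related operations: selecting several labels at once and testing whether a label exists.

```python
import pandas as pd

notes = pd.Series([1, 2, 3, 4, 5, 6, 7],
                  index=["C", "D", "E", "F", "G", "A", "B"])

# A list of labels returns a new Series in the requested order.
chord = notes[["C", "E", "G"]]
print(chord)

# The `in` operator tests the index labels, not the stored values.
has_label_e = "E" in notes
has_label_3 = 3 in notes
print(has_label_e, has_label_3)
```

Note that 3 is a value in the Series but not a label, so the membership test reports it as absent.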
DataFrame
Let’s take a look at creating a DataFrame. We can build a 4×4 matrix through the NumPy interface and create a DataFrame from it, like this:

# data_structure.py

df1 = pd.DataFrame(np.arange(16).reshape(4,4))
print("df1:\n{}\n".format(df1))

The output of this code is as follows:

df1:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
From this output, we can see that the default index and column names are of the form [0, N-1].
We can specify the column name and index when creating the DataFrame, like this:
# data_structure.py

df2 = pd.DataFrame(np.arange(16).reshape(4,4),
    columns=["column1", "column2", "column3", "column4"],
    index=["a", "b", "c", "d"])
print("df2:\n{}\n".format(df2))

The output of this code is as follows:

df2:
   column1  column2  column3  column4
a        0        1        2        3
b        4        5        6        7
c        8        9       10       11
d       12       13       14       15
We can also specify column data directly to create a DataFrame:
# data_structure.py

df3 = pd.DataFrame({"note" : ["C", "D", "E", "F", "G", "A", "B"],
    "weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]})
print("df3:\n{}\n".format(df3))

The output of this code is as follows:

df3:
  note weekday
0    C     Mon
1    D     Tue
2    E     Wed
3    F     Thu
4    G     Fri
5    A     Sat
6    B     Sun
Please note:
- Different columns of a DataFrame can have different data types
- If you create a DataFrame from a list of Series, each Series becomes a row, not a column
For example:
# data_structure.py

noteSeries = pd.Series(["C", "D", "E", "F", "G", "A", "B"],
    index=[1, 2, 3, 4, 5, 6, 7])
weekdaySeries = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    index=[1, 2, 3, 4, 5, 6, 7])
df4 = pd.DataFrame([noteSeries, weekdaySeries])
print("df4:\n{}\n".format(df4))

The output of df4 is as follows:

df4:
     1    2    3    4    5    6    7
0    C    D    E    F    G    A    B
1  Mon  Tue  Wed  Thu  Fri  Sat  Sun
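If the goal is for each Series to become a column instead, one option is to pass a dict of Series, or to concatenate along the column axis. This is a minimal sketch with shortened, hypothetical data; the names note_series, weekday_series, by_dict, and by_concat are illustrative only.

```python
import pandas as pd

note_series = pd.Series(["C", "D", "E"], index=[1, 2, 3])
weekday_series = pd.Series(["Mon", "Tue", "Wed"], index=[1, 2, 3])

# Dict of Series: the keys become column names, each Series a column.
by_dict = pd.DataFrame({"note": note_series, "weekday": weekday_series})
print(by_dict)

# Alternative: concatenate side by side (axis=1) and name the columns.
by_concat = pd.concat([note_series, weekday_series], axis=1,
                      keys=["note", "weekday"])
print(by_concat)
```

Both forms align the Series on their shared index labels, so the rows of the result are labeled 1, 2, 3.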
We can add columns to a DataFrame or remove them as follows:

# data_structure.py

df3["No."] = pd.Series([1, 2, 3, 4, 5, 6, 7])
print("df3:\n{}\n".format(df3))

del df3["weekday"]
print("df3:\n{}\n".format(df3))

The output of this code is as follows:

df3:
  note weekday  No.
0    C     Mon    1
1    D     Tue    2
2    E     Wed    3
3    F     Thu    4
4    G     Fri    5
5    A     Sat    6
6    B     Sun    7

df3:
  note  No.
0    C    1
1    D    2
2    E    3
3    F    4
4    G    5
5    A    6
6    B    7
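One detail worth knowing when adding a column from a Series is that pandas matches values by index label, not by position. The sketch below uses hypothetical data to show this, along with drop() as a non-destructive alternative to del.

```python
import pandas as pd

df = pd.DataFrame({"note": ["C", "D", "E"]})  # default index 0, 1, 2

# The new column is aligned by label: row 2 has no matching label and
# becomes NaN, while the value labeled 5 is discarded.
df["No."] = pd.Series([10, 20, 30], index=[0, 1, 5])
print(df)

# drop() returns a new DataFrame and leaves df unchanged, whereas
# `del df["No."]` would modify df in place.
without_no = df.drop(columns=["No."])
print(list(without_no.columns))
print(list(df.columns))
```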
Index object and data access
The pandas Index object contains metadata describing an axis. When a Series or DataFrame is created, the array or sequence of labels is converted to an Index. You can get the Index objects for the columns and rows of a DataFrame in the following way:

# data_structure.py

print("df3.columns\n{}\n".format(df3.columns))
print("df3.index\n{}\n".format(df3.index))

The output of these two lines is as follows:

df3.columns
Index(['note', 'No.'], dtype='object')

df3.index
RangeIndex(start=0, stop=7, step=1)
Please note:
- An Index is not a set, so it can contain duplicate values
- The values of an Index object cannot be modified in place, so an Index can be shared safely between data structures
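Both properties can be checked directly. The following is a minimal sketch with hypothetical data; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample data with the label "a" used twice.
s = pd.Series([1, 2, 3], index=["a", "a", "b"])

# Duplicate labels are allowed; a lookup returns every matching row.
has_unique_labels = s.index.is_unique
rows_for_a = s["a"]
print(has_unique_labels)
print(rows_for_a)

# Assigning to a single position of an Index raises TypeError.
try:
    s.index[0] = "z"
    in_place_change_allowed = True
except TypeError as err:
    in_place_change_allowed = False
    print("TypeError: {}".format(err))

# Replacing the whole index with a new one is still possible.
s.index = ["x", "y", "z"]
print(s)
```

Immutability applies to the Index object itself; the Series or DataFrame can always be given a different Index.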
DataFrame provides the following two indexers to access the data:
- loc: accesses data by row and column labels
- iloc: accesses data by integer row and column positions
For example:
# data_structure.py

print("Note C, D is:\n{}\n".format(df3.loc[[0, 1], "note"]))
print("Note C, D is:\n{}\n".format(df3.iloc[[0, 1], 0]))

The output of this code is as follows:

Note C, D is:
0    C
1    D
Name: note, dtype: object

Note C, D is:
0    C
1    D
Name: note, dtype: object

The first line of code accesses the elements whose row labels are 0 and 1 and whose column label is “note”. The second line accesses the elements at row positions 0 and 1 and column position 0. For df3 the row labels and row positions happen to be identical, so both calls use 0 and 1, but the two have different meanings.
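The difference between labels and positions becomes visible as soon as the index is not the default 0..N-1 range. This is a minimal sketch using a hypothetical DataFrame whose rows are labeled 10, 20, and 30:

```python
import pandas as pd

df = pd.DataFrame({"note": ["C", "D", "E"]}, index=[10, 20, 30])

by_label = df.loc[10, "note"]    # row labeled 10
by_position = df.iloc[0, 0]      # first row, first column
print(by_label, by_position)

# There is no row labeled 0, so a label lookup for 0 fails.
try:
    df.loc[0, "note"]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False
print(label_zero_exists)

# loc slices include both endpoints; iloc slices exclude the stop.
label_slice = df.loc[10:20, "note"]
position_slice = df.iloc[0:1, 0]
print(len(label_slice), len(position_slice))
```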
Conclusion
This article is a first introductory tutorial on pandas, so it covers only the most basic operations. You can read the next article in the series here. We hope you found the tutorial clear; if you have any questions, please leave a comment in the comment box below, and we will get back to you as soon as possible.
Note: I learned the material in this tutorial from two great resources: Pandas – Powerful Python Data Analysis Toolkit and Python Data Analysis by J. Metz. Both are highly recommended if you want to go deeper into pandas and Python data analysis.