Using pandas for Data Analysis
Introduction to Pandas
Pandas is a powerful data analysis tool built on top of the Python programming language. It provides data scientists and analysts with high-performance, easy-to-use data structures and data analysis tools. The name 'Pandas' is derived from 'panel data', an econometrics term for datasets that include observations over multiple time periods for the same individuals.
What is Pandas?
Pandas is a software library for Python. With Pandas, you can load tables and time series into labeled data structures and then filter, transform, and summarize them.
Installing Pandas
To install pandas, you can use pip, the package manager used to install and manage software packages written in Python.
pip install pandas
Importing Pandas
Once you have installed pandas, you need to import the library. This is done with the import statement.
import pandas as pd
We import pandas as pd. Here pd is an alias, not a keyword: it lets us write pd instead of pandas whenever we use functions from the pandas library.
Pandas Data Structures
Pandas primarily uses two types of data structures:
- Series: It is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
- DataFrame: It is a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])
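To make the relationship between the two structures concrete, here is a small sketch that reuses the values above. It looks up a Series value by its label and shows that a single DataFrame column comes back as a Series. The variable name capitals is only illustrative.

```python
import pandas as pd

# A Series pairs each value with a label from its index.
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
print(s['b'])              # look up a value by its label
print(s.index.tolist())    # the labels themselves

# A DataFrame behaves like a dictionary of Series that share one row index.
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])

capitals = df['Capital']          # one column is returned as a Series
print(type(capitals).__name__)
print(df.shape)                   # (number of rows, number of columns)
print(df.dtypes)                  # each column keeps its own data type
```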
Reading Data
Pandas can read data from various sources, such as CSV files, Excel files, and SQL databases. Here is an example of reading a CSV file.
df = pd.read_csv('filename.csv')
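Since 'filename.csv' is only a placeholder, the following sketch reads CSV text from an in-memory buffer so it can run without any file on disk. The CSV content is made up for illustration. The usecols and index_col parameters are two commonly used read_csv options and are shown here only as examples.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would live in a file on disk.
csv_text = """Country,Capital,Population
Belgium,Brussels,11190846
India,New Delhi,1303171035
Brazil,Brasília,207847528
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)

# usecols limits which columns are loaded; index_col chooses the row labels.
subset = pd.read_csv(io.StringIO(csv_text),
                     usecols=['Country', 'Population'],
                     index_col='Country')
print(subset.loc['India', 'Population'])
```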
Writing Data
Pandas can also write data to various destinations, such as CSV files, Excel files, and SQL databases. Here is an example of writing a DataFrame to a CSV file.
df.to_csv('filename.csv')
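One detail worth knowing is that to_csv writes the row index as an extra first column unless index=False is passed. The sketch below, which uses a small made-up DataFrame, calls to_csv without a path so that it returns the CSV text and the difference is easy to see.

```python
import io
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India'],
                   'Population': [11190846, 1303171035]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])      # header begins with an empty index column
print(without_index.splitlines()[0])   # header contains only the real columns

# Reading the index-free text back gives the same columns and shape.
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip.shape)
```

If the index carries no meaning (for example, the default 0, 1, 2, ...), passing index=False usually keeps the file cleaner.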
Data Exploration
Pandas provides various methods to get a quick overview of the data.
- To check the first few records of the DataFrame, use the head() method (five rows by default).
- To check the last few records of the DataFrame, use the tail() method (five rows by default).
- To get summary statistics for the numerical columns in a DataFrame, use the describe() method.
df.head()
df.tail()
df.describe()
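As a runnable illustration of these three calls, the sketch below builds a small made-up DataFrame with ten rows. The column names name and score are hypothetical.

```python
import pandas as pd

# Hypothetical data: ten rows, one text column and one numeric column.
df = pd.DataFrame({'name': list('abcdefghij'),
                   'score': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

first = df.head()        # first 5 rows by default
last3 = df.tail(3)       # pass a number to change how many rows are returned
summary = df.describe()  # count, mean, std, min, quartiles, max

print(first)
print(last3)
print(summary)           # by default only the numeric column is summarized
```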
Data Cleaning
Pandas provides several methods to clean the data.
- To check for null values in the DataFrame, use the isnull() method, which returns a DataFrame of True/False flags.
- To fill the null values in the DataFrame, use the fillna() method.
- To drop the rows where at least one element is missing, use the dropna() method.
df.isnull()
df.fillna(value)
df.dropna()
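The sketch below applies these three methods to a small made-up DataFrame with one missing value per column. The column names and fill values are hypothetical. Note that fillna() and dropna() return new objects by default and leave the original DataFrame unchanged.

```python
import pandas as pd

# Hypothetical data: None marks the missing entries.
df = pd.DataFrame({'a': [1.0, None, 3.0],
                   'b': ['x', 'y', None]})

# isnull() gives True/False flags; summing them counts missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# fillna() accepts a single value or a dict mapping column names to fill values.
filled = df.fillna({'a': 0.0, 'b': 'unknown'})
print(filled)

# dropna() removes every row that contains at least one missing value.
dropped = df.dropna()
print(dropped)
```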
Data Manipulation
Pandas provides various methods to manipulate the data.
- To select a single column, put the column name inside the brackets. The result is a Series.
- To select multiple columns, put a list of column names inside the brackets. The result is a DataFrame.
- To filter records, put a boolean condition inside the brackets.
- To apply a function to every value in a column, use the apply() method.
df['column_name']
df[['column1', 'column2']]
df[df['column_name'] > 50]
df['column_name'].apply(lambda x: x*2)
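The following sketch runs the same four operations on a small made-up DataFrame. The column names name, score, and age are hypothetical. It also shows combining two conditions with &, which requires parentheses around each condition.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                   'score': [40, 55, 70, 50],
                   'age': [25, 35, 28, 22]})

one = df['score']              # single column -> Series
two = df[['name', 'score']]    # list of columns -> DataFrame

# The condition produces a True/False Series; only the True rows are kept.
high = df[df['score'] > 50]

# Combine conditions with & (and) or | (or), wrapping each in parentheses.
both = df[(df['score'] > 50) & (df['age'] < 30)]

# apply() calls the function once per value in the column.
doubled = df['score'].apply(lambda x: x * 2)

# For simple arithmetic, operating on the whole column gives the same result.
same = df['score'] * 2

print(high)
print(both)
print(doubled.tolist())
```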
This is a basic introduction to data analysis with pandas. Pandas offers much more, such as merging and joining DataFrames, reshaping data, and building pivot tables, which you can explore as you progress further. Happy learning!