
Introduction to Pandas


Pandas is a popular open-source library for data manipulation and analysis in Python. It provides the data structures and functions needed to work with structured, labeled data. Pandas is built on top of NumPy and is well suited to tasks such as data cleaning, data transformation, and data analysis.

 

Key Features of Pandas

  1. Data Structures: Pandas introduces two primary data structures - Series and DataFrame. These structures handle one-dimensional and two-dimensional labeled data, respectively.
  2. Data Alignment: Automatic and explicit data alignment is built-in. Objects can be explicitly aligned to a set of labels or can be automatically aligned during operations.
  3. Handling Missing Data: Missing values (typically represented as NaN) can be detected with isnull(), dropped with dropna(), or filled with fillna().
  4. Flexible Indexing: Allows for advanced indexing and subsetting of data.
  5. Data Wrangling: Provides functions for selecting, filtering, cleaning, merging, and reshaping data.
  6. Input/Output: Supports reading from and writing to various file formats such as CSV, Excel, SQL databases, HDF5, and more.
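
The data alignment feature listed above can be hard to picture without a concrete case. The following is a minimal sketch using two made-up Series whose labels only partly overlap; the values and labels are assumptions chosen for illustration.

```python
import pandas as pd

# Two Series with partly overlapping labels (hypothetical data)
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Automatic alignment: values are matched by label, not by position.
# Labels found in only one operand ("x" and "w") produce NaN.
total = a + b
print(total)

# Explicit alignment: conform `a` to a chosen set of labels.
# "q" does not exist in `a`, so it becomes NaN.
aligned = a.reindex(["z", "y", "q"])
print(aligned)
```

Because unmatched labels become NaN rather than raising an error, alignment and missing-data handling tend to be used together.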

 

 

Pandas Data Structures

 

1. Series

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

import pandas as pd
# Creating a Series
s = pd.Series([1, 3, 5, 7, 9])
print(s)

 

Output:

0    1
1    3
2    5
3    7
4    9
dtype: int64
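
The example above relies on the default integer index (0 to 4). As a small sketch of what "labeled" means, a Series can also be given explicit labels; the labels below are arbitrary and chosen only for illustration.

```python
import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index
s = pd.Series([1, 3, 5], index=["a", "b", "c"])

by_label = s["b"]        # label-based access
by_position = s.iloc[0]  # position-based access
labels = s.index.tolist()

print(by_label, by_position, labels)
```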
 

 

 

2. DataFrame

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a spreadsheet or SQL table, or a dict of Series objects.

# Creating a DataFrame
data = {
   'A': [1, 2, 3, 4],
   'B': [5, 6, 7, 8],
   'C': [9, 10, 11, 12]
}
df = pd.DataFrame(data)
print(df)

 

Output:

   A  B   C
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12
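
To illustrate the "dict of Series", "heterogeneous", and "size-mutable" parts of the description, here is a short sketch. The column names and values are invented for the example.

```python
import pandas as pd

# Two Series sharing the same row labels (hypothetical data)
names = pd.Series(["Ann", "Bob"], index=["r1", "r2"])
scores = pd.Series([9.5, 7.0], index=["r1", "r2"])

# A DataFrame built from a dict of Series: each Series becomes a column
people = pd.DataFrame({"name": names, "score": scores})

# Size-mutable: a new column can be added after construction
people["passed"] = people["score"] > 8

# Heterogeneous: each column keeps its own dtype
print(people.dtypes)
print(people.shape)
```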
 

 

 

Basic Operations in Pandas

 

1. Loading Data

Pandas can read data from various file formats:

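import sqlite3

# The file, database, and table names below are placeholders
# Reading a CSV file
df = pd.read_csv('file.csv')
# Reading an Excel file (.xlsx files need an optional engine such as openpyxl)
df = pd.read_excel('file.xlsx')
# Reading from a SQL database
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table_name', conn)
conn.close()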
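
The snippets above need files that may not exist on your machine. As a self-contained sketch, the same readers can be tried against in-memory data: a CSV string wrapped in io.StringIO and an in-memory SQLite database. The table name `people` and the sample rows are assumptions made for this example.

```python
import io
import sqlite3

import pandas as pd

# A small CSV document held in memory instead of on disk (hypothetical data)
csv_text = "id,name\n1,Ann\n2,Bob\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Write the rows to an in-memory SQLite database, then query them back
conn = sqlite3.connect(":memory:")
df_csv.to_sql("people", conn, index=False)
df_sql = pd.read_sql_query("SELECT * FROM people WHERE id > 1", conn)
conn.close()

print(df_csv)
print(df_sql)
```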
 

 

2. Viewing Data

You can quickly view the top and bottom entries of a DataFrame using head() and tail():

print(df.head())  # First 5 rows
print(df.tail())  # Last 5 rows
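
Both methods accept an optional row count. A brief sketch on a made-up ten-row DataFrame:

```python
import pandas as pd

# A ten-row DataFrame used only for demonstration
numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)  # first 3 rows instead of the default 5
last_two = numbers.tail(2)     # last 2 rows

print(first_three)
print(last_two)
print(numbers.shape)           # (rows, columns)
```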

 

 

3. Data Selection and Filtering

Selecting columns and rows:

# Selecting a column
print(df['A'])
# Selecting multiple columns
print(df[['A', 'B']])
# Selecting a row by integer position
print(df.iloc[0])  # First row
# Selecting a row by index label
print(df.loc[0])  # Row whose index label is 0
# Filtering rows
print(df[df['A'] > 2])
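
With the default integer index, iloc[0] and loc[0] return the same row, which can hide the difference between them. The sketch below uses a non-default index (the labels 10 to 40 are arbitrary) to separate position from label, and combines two filter conditions.

```python
import pandas as pd

# Index labels deliberately differ from row positions (hypothetical data)
df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]},
    index=[10, 20, 30, 40],
)

by_position = df.iloc[0]  # first row, regardless of its label
by_label = df.loc[10]     # row labelled 10; df.loc[0] would raise KeyError here

# Combine conditions with & (and) or | (or); each needs its own parentheses
mask = (df["A"] > 1) & (df["B"] < 8)

# loc accepts a boolean mask for rows and a list of column names
subset = df.loc[mask, ["A"]]
print(subset)
```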
 

 

 

4. Handling Missing Data

Pandas provides several functions to handle missing data:

# Checking for missing values
print(df.isnull())
# Dropping missing values
df_cleaned = df.dropna()
# Filling missing values
df_filled = df.fillna(value=0)
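
To see what these calls return, here is a small sketch on a DataFrame with two missing values. The data is invented, and filling with the column mean is shown only as one possible strategy, not a recommendation.

```python
import numpy as np
import pandas as pd

# A DataFrame with one missing value in each column (hypothetical data)
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, 6.0]})

# Count missing values per column
missing_per_column = df.isnull().sum()

# Drop every row that contains at least one missing value
df_cleaned = df.dropna()

# Replace missing values with a constant
df_filled = df.fillna(value=0)

# Replace missing values with each column's mean (NaN is skipped by mean())
df_mean_filled = df.fillna(df.mean())

print(missing_per_column)
print(df_cleaned)
print(df_mean_filled)
```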
 

 

5. Merging and Joining

Combining data from multiple DataFrames:

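# Merging DataFrames (inner join by default: only keys A and B appear in both)
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})
merged = pd.merge(df1, df2, on='key')
print(merged)  # The overlapping 'value' columns become value_x and value_y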
# Joining DataFrames on their index (left join by default)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['a', 'b'])
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]}, index=['a', 'b'])
joined = df1.join(df2)
print(joined)
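
The merge shown earlier keeps only keys present in both DataFrames. The sketch below reuses the same two frames to show two options that are often useful: `suffixes` to name overlapping columns and `how='outer'` with `indicator=True` to keep unmatched rows and record where each row came from. The suffix names are arbitrary choices for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["A", "B", "C"], "value": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["A", "B", "D"], "value": [4, 5, 6]})

# Inner join with explicit suffixes for the overlapping 'value' column
inner = pd.merge(df1, df2, on="key", suffixes=("_left", "_right"))

# Outer join keeps unmatched keys; missing values are filled with NaN.
# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
outer = pd.merge(df1, df2, on="key", how="outer", indicator=True)

print(inner)
print(outer)
```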
 

 

