Pandas is a popular open-source Python library for data manipulation and analysis. It provides the data structures and functions needed to work with structured (labeled, tabular) data. Pandas is built on top of NumPy and is well-suited to tasks such as data cleaning, data transformation, and data analysis.
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
import pandas as pd
# Creating a Series
s = pd.Series([1, 3, 5, 7, 9])
print(s)
Output:
0    1
1    3
2    5
3    7
4    9
dtype: int64
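To make the role of the index more concrete, here is a minimal sketch of a Series with custom labels. The variable name `temps` and the labels are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: daily temperatures labeled by weekday
temps = pd.Series([21.5, 19.0, 23.2], index=["mon", "tue", "wed"])

by_label = temps["tue"]       # lookup by index label
by_position = temps.iloc[0]   # lookup by integer position

print(temps.index.tolist())   # the axis labels, i.e. the index
print(by_label, by_position)
```

When no index is passed, as in the example above it, pandas falls back to the default integer labels 0, 1, 2, and so on.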
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a spreadsheet or SQL table, or a dict of Series objects.
# Creating a DataFrame
data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}
df = pd.DataFrame(data)
print(df)
Output:
   A  B   C
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12
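The "dict of Series" view and the size-mutable property can be sketched as follows. The names `names`, `scores`, and `people` are hypothetical. Note that the two Series are aligned on their index labels, so a label missing from one Series yields a missing value in that column.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative data)
names = pd.Series(["Ann", "Bob"], index=["r1", "r2"])
scores = pd.Series([9.5, 7.0, 8.2], index=["r1", "r2", "r3"])

# Columns may have different types; rows are aligned by label
people = pd.DataFrame({"name": names, "score": scores})

# Size-mutable: a new column can be added after construction
people["passed"] = people["score"] >= 8.0

print(people)
```

Row `r3` has a score but no name, so its `name` entry is filled with a missing-value marker rather than raising an error.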
Pandas can read data from various file formats and data sources:
# Reading a CSV file
df = pd.read_csv('file.csv')
# Reading an Excel file (.xlsx files need an Excel engine such as openpyxl installed)
df = pd.read_excel('file.xlsx')
# Reading from a SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table_name', conn)
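The snippets above assume that `file.csv`, `file.xlsx`, and `database.db` already exist. As a self-contained sketch that runs without any files on disk, the example below reads CSV text from an in-memory buffer and queries an in-memory SQLite database. The table name `people` and the sample rows are hypothetical.

```python
import io
import sqlite3

import pandas as pd

# CSV content held in memory instead of a file on disk (illustrative data)
csv_text = "id,name\n1,Ann\n2,Bob\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# In-memory SQLite database, populated from the DataFrame above
conn = sqlite3.connect(":memory:")
from_csv.to_sql("people", conn, index=False)

# Parameterized query: the ? placeholder is filled from params
from_sql = pd.read_sql_query(
    "SELECT * FROM people WHERE id > ?", conn, params=(1,)
)
conn.close()

print(from_sql)
```

Passing values through `params` rather than formatting them into the SQL string is generally the safer choice when the values come from user input.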
You can quickly view the top and bottom entries of a DataFrame using head() and tail():
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
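Both methods also accept a row count. A small sketch, using a hypothetical ten-row DataFrame named `numbers`:

```python
import pandas as pd

# Illustrative DataFrame with ten rows
numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)   # first 3 rows
last_two = numbers.tail(2)      # last 2 rows
default_head = numbers.head()   # defaults to 5 rows

print(first_three)
print(last_two)
```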
Selecting columns and rows:
# Selecting a column
print(df['A'])
# Selecting multiple columns
print(df[['A', 'B']])
# Selecting a row by integer position
print(df.iloc[0]) # First row, regardless of its label
# Selecting a row by index label
print(df.loc[0]) # Row whose index label is 0
# Filtering rows
print(df[df['A'] > 2])
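With the default integer index, `iloc[0]` and `loc[0]` happen to return the same row, which can hide the difference between them. The sketch below uses a hypothetical DataFrame named `frame` with non-default labels to separate the two, and also shows how filter conditions can be combined.

```python
import pandas as pd

# Illustrative DataFrame whose index labels differ from the positions
frame = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]},
    index=[10, 20, 30, 40],
)

first_by_position = frame.iloc[0]   # position 0, which has label 10
first_by_label = frame.loc[10]      # label 10, the same row

# Combine conditions with & (and) or | (or); each needs parentheses
both = frame[(frame["A"] > 1) & (frame["B"] < 8)]

# Filter rows and pick columns in a single .loc call
subset = frame.loc[frame["A"] > 2, ["B"]]

print(both)
print(subset)
```

Here `frame.loc[0]` would raise a `KeyError`, because no row carries the label 0.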
Pandas provides several functions to handle missing data:
# Checking for missing values
print(df.isnull())
# Dropping rows that contain any missing value (returns a new DataFrame)
df_cleaned = df.dropna()
# Filling missing values with 0 (returns a new DataFrame)
df_filled = df.fillna(value=0)
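To see these functions act on actual gaps, here is a minimal sketch on a hypothetical DataFrame named `raw` that contains `NaN` values. Filling with the column mean is just one possible strategy, shown here as an alternative to a constant.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value per column
raw = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 5.0, 6.0],
})

# Count missing values per column
missing_per_column = raw.isnull().sum()

# Keep only rows with no missing values
complete_rows = raw.dropna()

# Fill each column's gaps with that column's mean
filled_with_mean = raw.fillna(raw.mean())

print(missing_per_column)
print(complete_rows)
print(filled_with_mean)
```

None of these calls modifies `raw`; each returns a new object.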
Combining data from multiple DataFrames:
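# Merging DataFrames (inner join on 'key' by default; the overlapping
# 'value' columns are renamed value_x and value_y)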
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})
merged = pd.merge(df1, df2, on='key')
print(merged)
# Joining DataFrames on their index (left join by default)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['a', 'b'])
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]}, index=['a', 'b'])
joined = df1.join(df2)
print(joined)
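The merge example keeps only keys present in both DataFrames. As a sketch of how other join types behave, the code below reuses the same sample data under the hypothetical names `left` and `right` and compares the default inner join with an outer join. The `how`, `suffixes`, and `indicator` parameters are standard `pd.merge` options.

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C"], "value": [1, 2, 3]})
right = pd.DataFrame({"key": ["A", "B", "D"], "value": [4, 5, 6]})

# Default: inner join, keeps only keys found in both inputs
inner = pd.merge(left, right, on="key")

# Outer join: keeps every key; indicator=True adds a _merge column
# recording which side each row came from
outer = pd.merge(
    left,
    right,
    on="key",
    how="outer",
    suffixes=("_left", "_right"),
    indicator=True,
)

print(inner)
print(outer)
```

Keys `C` and `D` each appear in only one input, so the outer result holds a missing value on the side that lacks them.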