PYTHON

Data Manipulation and Analysis


Pandas is designed to simplify data manipulation and analysis in Python. It provides functions and methods for cleaning, transforming, analyzing, and visualizing tabular data, built around two core structures: the one-dimensional Series and the two-dimensional DataFrame.

 

Data Manipulation

 

1. Selecting and Filtering Data

  • Selecting Columns
import pandas as pd
# Creating a sample DataFrame
data = {
   'A': [1, 2, 3, 4],
   'B': [5, 6, 7, 8],
   'C': [9, 10, 11, 12]
}
df = pd.DataFrame(data)
# Selecting a single column
df['A']
# Selecting multiple columns
df[['A', 'B']]
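
The two selection forms above return different types, which matters for later method calls. The following is a minimal sketch reusing the sample data; the variable names single, subset, and one_col_frame are illustrative and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})

single = df['A']            # a single label returns a Series
subset = df[['A', 'B']]     # a list of labels returns a DataFrame
one_col_frame = df[['A']]   # a one-element list still returns a DataFrame

print(type(single).__name__)         # Series
print(type(one_col_frame).__name__)  # DataFrame
print(subset.shape)                  # (4, 2)
```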
 
  • Selecting Rows
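# Selecting rows by integer position
df.iloc[0]  # First row, returned as a Series
df.iloc[0:2]  # First two rows (the end position is excluded)
# Selecting rows by index label
df.loc[0]  # Row whose index label is 0
df.loc[0:2]  # Rows labeled 0 through 2 (the end label is included)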
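
With the default 0, 1, 2, ... index, positions and labels coincide, so the difference between iloc and loc is easy to miss. The sketch below uses a hypothetical index of 10, 20, 30, 40 (not in the original example) to make the distinction visible.

```python
import pandas as pd

# Hypothetical index labels chosen so that labels differ from positions
df = pd.DataFrame({'A': [1, 2, 3, 4]}, index=[10, 20, 30, 40])

by_position = df.iloc[0:2]  # positions 0 and 1; the end position is excluded
by_label = df.loc[10:30]    # labels 10 through 30; the end label is included

print(list(by_position.index))  # [10, 20]
print(list(by_label.index))     # [10, 20, 30]
```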
 
  • Filtering Rows
# Filtering rows based on a condition
df[df['A'] > 2]
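
Conditions can be combined as well. The following sketch, based on the same sample data, uses the element-wise operators & (and) and | (or); each condition needs its own parentheses because these operators bind more tightly than comparisons. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})

# Rows where both conditions hold
both = df[(df['A'] > 1) & (df['B'] < 8)]

# Rows where at least one condition holds
either = df[(df['A'] == 1) | (df['C'] == 12)]

# Rows whose value in A is one of several candidates
members = df[df['A'].isin([2, 4])]

print(both['A'].tolist())     # [2, 3]
print(either['A'].tolist())   # [1, 4]
print(members['A'].tolist())  # [2, 4]
```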
 

 

2. Data Cleaning

 

  • Handling Missing Data
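# Checking for missing values (returns a Boolean DataFrame of the same shape)
df.isnull()
# Dropping rows with missing values (returns a new DataFrame; df is unchanged)
df.dropna()
# Filling missing values with 0 (also returns a new DataFrame)
df.fillna(value=0)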
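
The sample DataFrame contains no missing values, so the calls above leave it looking unchanged. The sketch below uses a small hypothetical DataFrame, df_missing, with NaN entries to show what each method does.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df_missing = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, np.nan]})

missing_per_column = df_missing.isnull().sum()  # count of NaN values per column
dropped = df_missing.dropna()                   # keeps only rows with no NaN
filled = df_missing.fillna(value=0)             # replaces every NaN with 0

print(missing_per_column.tolist())  # [1, 1]
print(len(dropped))                 # 1
print(filled['A'].tolist())         # [1.0, 0.0, 3.0]
```

None of these calls modify df_missing itself; assign the result to a variable to keep it.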

 

  • Renaming Columns
# Renaming columns (returns a new DataFrame; assign the result to keep it)
df.rename(columns={'A': 'Alpha', 'B': 'Beta'})
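
A short sketch showing that rename leaves the original DataFrame untouched unless the result is assigned. The name renamed is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

renamed = df.rename(columns={'A': 'Alpha', 'B': 'Beta'})

print(list(df.columns))       # ['A', 'B']
print(list(renamed.columns))  # ['Alpha', 'Beta']
```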

 

 

3. Data Transformation

 

  • Adding and Modifying Columns
# Adding a new column
df['D'] = df['A'] + df['B']
# Modifying an existing column
df['A'] = df['A'] * 2

 

  • Applying Functions to Columns
# Applying a function to each element of a column
# (here equivalent to df['A'] * 2, which is usually faster on large data)
df['A'] = df['A'].apply(lambda x: x * 2)
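
For simple arithmetic, apply and a vectorized expression give the same result; apply is more useful when the logic does not map onto a built-in operation. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4]})

via_apply = df['A'].apply(lambda x: x * 2)  # calls the lambda once per element
vectorized = df['A'] * 2                    # operates on the whole column at once

# A per-element rule with branching, where apply is a natural fit
labels = df['A'].apply(lambda x: 'high' if x > 2 else 'low')

print(via_apply.equals(vectorized))  # True
print(labels.tolist())               # ['low', 'low', 'high', 'high']
```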

 

  • Sorting Data
# Sorting by a single column
df.sort_values(by='A')
# Sorting by multiple columns
df.sort_values(by=['A', 'B'])
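
sort_values sorts in ascending order by default and returns a new DataFrame. The sketch below, using hypothetical data with repeated values in A, shows descending order and a different direction per column.

```python
import pandas as pd

# Hypothetical data with ties in column A
df = pd.DataFrame({'A': [2, 1, 2, 1], 'B': [5, 6, 7, 8]})

# Descending order on a single column
desc = df.sort_values(by='A', ascending=False)

# A ascending, then B descending within each value of A
mixed = df.sort_values(by=['A', 'B'], ascending=[True, False])

print(desc['A'].tolist())   # [2, 2, 1, 1]
print(mixed['B'].tolist())  # [8, 6, 7, 5]
```

The original index labels travel with the rows; call reset_index(drop=True) on the result if a fresh 0, 1, 2, ... index is wanted.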

 

  • Merging and Joining DataFrames
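# Merging DataFrames on a shared key column (inner join by default)
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})
# Only keys 'A' and 'B' appear in both; the overlapping columns become value_x and value_y
merged_df = pd.merge(df1, df2, on='key')
# Joining DataFrames on their index
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['a', 'b'])
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]}, index=['a', 'b'])
# Result has columns A, B, C, D for index labels 'a' and 'b'
joined_df = df1.join(df2)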
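
The how parameter of pd.merge controls which keys are kept, and suffixes controls how overlapping column names are disambiguated. A minimal sketch using the same key/value frames as the merge example; the suffix strings are arbitrary choices.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})

# Inner join (the default): only keys present in both frames
inner = pd.merge(df1, df2, on='key')

# Outer join: keys from either frame, with NaN where a side has no match
outer = pd.merge(df1, df2, on='key', how='outer', suffixes=('_left', '_right'))

print(list(inner.columns))    # ['key', 'value_x', 'value_y']
print(inner['key'].tolist())  # ['A', 'B']
print(list(outer.columns))    # ['key', 'value_left', 'value_right']
print(len(outer))             # 4
```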

 

 

Data Analysis

 

1. Descriptive Statistics

Pandas provides a range of methods to calculate descriptive statistics:

# Calculating summary statistics
df.describe()
# Calculating individual statistics (each is computed per column by default)
df.mean()
df.median()
df.std()
df.var()
df.min()
df.max()
df.sum()
df.cumsum()  # Cumulative sum
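
A small sketch of what a few of these calls return for the original sample data, including axis=1 to aggregate across each row instead of down each column. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})

col_means = df.mean()       # one value per column
row_sums = df.sum(axis=1)   # one value per row
running = df['A'].cumsum()  # running total down column A
summary = df.describe()     # count, mean, std, min, quartiles, max per column

print(col_means.tolist())  # [2.5, 6.5, 10.5]
print(row_sums.tolist())   # [15, 18, 21, 24]
print(running.tolist())    # [1, 3, 6, 10]
print(list(summary.index))
```

Note that std and var use the sample formulas (dividing by n - 1) by default.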
 

 

2. Correlation and Covariance

# Calculating pairwise correlation
df.corr()
# Calculating pairwise covariance
df.cov()
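
In the sample DataFrame every column is a linear function of the others, so all correlations come out as 1. The sketch below uses hypothetical columns x, y, and z (not in the original example) that include a negative relationship.

```python
import pandas as pd

# Hypothetical data: y rises with x, z falls as x rises
df_xyz = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8], 'z': [4, 3, 2, 1]})

corr = df_xyz.corr()  # Pearson correlation by default, values in [-1, 1]
cov = df_xyz.cov()    # sample covariance (divides by n - 1)

print(round(corr.loc['x', 'y'], 6))  # 1.0
print(round(corr.loc['x', 'z'], 6))  # -1.0
print(round(cov.loc['x', 'y'], 4))   # 3.3333
```

Both methods return a square DataFrame indexed by column name on both axes, so individual pairs can be read with loc.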
 

 

3. Pivot Tables

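Pivot tables summarize data by grouping rows on one or more keys and aggregating a value column for each group:

# Creating a pivot table: mean of D for each combination of A (rows) and B (columns)
pivot_table = df.pivot_table(values='D', index='A', columns='B', aggfunc='mean')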
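
Aggregation is easier to see when groups repeat. The following sketch uses a hypothetical sales DataFrame (the column names region, product, and amount are invented for illustration) in which one region/product pair occurs twice.

```python
import pandas as pd

# Hypothetical data: 'North'/'pen' appears twice, so its values are averaged
sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North'],
    'product': ['pen', 'ink', 'pen', 'ink', 'pen'],
    'amount': [10, 20, 30, 40, 50],
})

table = sales.pivot_table(values='amount', index='region',
                          columns='product', aggfunc='mean')

print(table.loc['North', 'pen'])  # 30.0
print(table.loc['South', 'ink'])  # 40.0
```

Combinations that never occur in the data appear as NaN in the result; the fill_value parameter of pivot_table can supply a replacement.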
 

 

