Module 4: Exploratory Data Analysis for Finance

Learning Objectives

By the end of this module, you will:

Calculate and interpret financial returns (simple vs logarithmic)
Measure and analyze volatility and risk metrics
Understand and calculate correlation between assets
Perform time series analysis on financial data
Identify trends, patterns, and anomalies in market data
Calculate rolling statistics and moving averages
Build comprehensive exploratory analysis workflows
Create insightful summary statistics for portfolios

4.1 Understanding Returns: The Foundation of Financial Analysis

Why Returns Matter More Than Prices

In finance, we rarely analyze absolute prices. Instead, we focus on returns—the percentage change in value over time. Here's why:

1. Comparability: A $10 stock rising to $11 is a 10% return, while a $100 stock rising to $105 is only 5%, even though its dollar gain is larger. Returns put both on the same scale.

2. Stationarity: Prices typically trend upward or downward over time, so their mean and variance drift, which makes statistical analysis difficult. Returns tend to fluctuate around a more stable mean, making them better suited to analysis.

3. Portfolio Construction: A portfolio's simple return is the weighted average of its assets' simple returns; prices quoted on different scales cannot be combined this way.
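
To make the third point concrete, here is a minimal sketch with made-up holdings (the asset names, dollar amounts, and returns are illustrative assumptions, not market data). It checks that the weighted sum of the assets' simple returns matches the return computed from the change in total portfolio value.

```python
# Hypothetical two-asset portfolio; all numbers are illustrative
start_values = {"STOCK_A": 6000.0, "STOCK_B": 4000.0}   # dollars invested
asset_returns = {"STOCK_A": 0.10, "STOCK_B": -0.05}     # one-period simple returns

total_start = sum(start_values.values())
weights = {name: value / total_start for name, value in start_values.items()}

# Weighted sum of the assets' simple returns
portfolio_return = sum(weights[name] * asset_returns[name] for name in weights)

# Cross-check against the change in total portfolio value
total_end = sum(start_values[name] * (1 + asset_returns[name]) for name in start_values)
value_based_return = total_end / total_start - 1

print(f"Weights: {weights}")
print(f"Weighted-sum return: {portfolio_return:.4f}")
print(f"Value-based return:  {value_based_return:.4f}")
```

Both calculations agree, which is why portfolio analysis is normally done on simple returns rather than prices.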

Simple Returns

Simple returns (also called arithmetic returns) measure the percentage change from one period to the next.

Formula: R = (P_t - P_{t-1}) / P_{t-1}

Or equivalently: R = (P_t / P_{t-1}) - 1

import pandas as pd
import numpy as np
import yfinance as yf

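# Download stock data
# auto_adjust=False keeps the 'Adj Close' column; recent yfinance releases
# adjust prices by default and drop it. squeeze() turns a one-column
# DataFrame (returned by some versions) into a Series.
# Apply the same arguments in later snippets if 'Adj Close' is missing.
data = yf.download('AAPL', start='2023-01-01', end='2024-01-01', progress=False, auto_adjust=False)
prices = data['Adj Close'].squeeze()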

# Calculate simple returns
simple_returns = prices.pct_change()

# Alternative manual calculation
simple_returns_manual = (prices / prices.shift(1)) - 1

print("First 10 daily returns:")
print(simple_returns.head(10))

# Convert to percentage
simple_returns_pct = simple_returns * 100
print(f"\nAverage daily return: {simple_returns_pct.mean():.4f}%")

Logarithmic Returns

Logarithmic (log) returns use the natural logarithm of the price ratio.

Formula: R = ln(P_t / P_{t-1})

Why Use Log Returns?

Time Additivity: Log returns over consecutive periods sum to the log return for the whole span
Symmetry: A gain and a loss of equal log magnitude cancel exactly. With simple returns, a 50% gain followed by a 50% loss leaves you 25% below where you started
Statistical Properties: Log returns are often closer to normally distributed than simple returns, a property many financial models assume
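
The first two properties follow from the conversion between the two return types: log return = ln(1 + simple return), and simple return = exp(log return) - 1. A minimal sketch of that arithmetic, using only the standard library (the 50% figure is just an illustrative value):

```python
import math

simple_gain = 0.50                       # +50% simple return
log_gain = math.log(1 + simple_gain)     # equivalent log return
log_loss = -log_gain                     # log return of equal size, opposite sign

# Converting the log loss back to a simple return
simple_loss = math.exp(log_loss) - 1     # about -33.33%, not -50%

# Time additivity: log returns are summed, simple returns are compounded
total_log = log_gain + log_loss
total_simple = (1 + simple_gain) * (1 + simple_loss) - 1

print(f"Log gain: {log_gain:.4f}, log loss: {log_loss:.4f}")
print(f"Simple loss that undoes a 50% gain: {simple_loss * 100:.2f}%")
print(f"Total log return: {total_log:.4f}, total simple return: {total_simple:.4f}")
```

The two log returns cancel to zero, while the matching simple returns (+50% and about -33.33%) have different magnitudes.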

import numpy as np

# Calculate log returns
log_returns = np.log(prices / prices.shift(1))

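# Alternative: difference of log prices (stays a pandas Series aligned to the dates;
# np.diff would return a bare array that is one element shorter)
log_returns_alt = np.log(prices).diff()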

print("First 10 log returns:")
print(log_returns.head(10))

# Convert to percentage
log_returns_pct = log_returns * 100
print(f"\nAverage daily log return: {log_returns_pct.mean():.4f}%")

Comparing Simple and Log Returns

# For small returns, simple and log returns are very similar
comparison = pd.DataFrame({
    'Simple Returns': simple_returns,
    'Log Returns': log_returns,
    'Difference': simple_returns - log_returns
})

print("Comparison of returns:")
print(comparison.head(10))
print(f"\nAverage absolute difference: {abs(comparison['Difference']).mean():.6f}")

# For large returns, the difference becomes noticeable
large_change_example = pd.Series([100, 150, 100])  # 50% up, then 33.33% down
simple_ret = large_change_example.pct_change()
log_ret = np.log(large_change_example / large_change_example.shift(1))

print("\nExample with large price change:")
print(f"Simple returns: {simple_ret.values}")
print(f"Log returns: {log_ret.values}")
print(f"Sum of simple returns: {simple_ret.sum():.4f}")
print(f"Sum of log returns: {log_ret.sum():.4f}")  # Should be ~0 (back to starting price)

Multi-Period Returns

Cumulative Simple Returns

To calculate the total return over multiple periods with simple returns, you must compound them:

# Calculate cumulative returns (investment growth)
cumulative_return = (1 + simple_returns).cumprod() - 1

# Growth factor: value of $1 invested at the start (cumulative return + 1)
cumulative_growth = (1 + simple_returns).cumprod()

print("Cumulative returns:")
print(cumulative_return.tail())

# Total return over the entire period
total_return = cumulative_return.iloc[-1]
print(f"\nTotal return for the period: {total_return * 100:.2f}%")

# What $1000 would have become
initial_investment = 1000
final_value = initial_investment * (1 + total_return)
print(f"$1000 would have become: ${final_value:.2f}")

Cumulative Log Returns

With log returns, you simply sum them up:

# Cumulative log returns (just add them up)
cumulative_log_return = log_returns.cumsum()

print("Cumulative log returns:")
print(cumulative_log_return.tail())

# Convert to simple return
total_simple_return = np.exp(cumulative_log_return.iloc[-1]) - 1
print(f"\nTotal simple return: {total_simple_return * 100:.2f}%")

Annualizing Returns

To compare returns across different time periods, we annualize them.

# Calculate average daily return
avg_daily_return = simple_returns.mean()

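# Annualize by compounding the average daily return (assuming 252 trading days per year)
# Note: this uses the arithmetic mean, so it overstates the realized
# compound growth rate when returns are volatile
annual_return = (1 + avg_daily_return) ** 252 - 1

print(f"Average daily return: {avg_daily_return * 100:.4f}%")
print(f"Annualized return: {annual_return * 100:.2f}%")

# Alternative with log returns (reflects the realized compound growth rate)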
avg_daily_log_return = log_returns.mean()
annual_log_return = avg_daily_log_return * 252
annual_simple_from_log = np.exp(annual_log_return) - 1

print(f"\nAnnualized return (from log): {annual_simple_from_log * 100:.2f}%")
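
A third convention, not used in the snippet above, is the compound annual growth rate (CAGR), which is computed directly from the first and last price. The sketch below is a minimal illustration: the helper name `cagr` and the prices are assumptions made for the example, and it keeps the same 252-trading-day year.

```python
def cagr(start_price, end_price, trading_days, days_per_year=252):
    """Compound annual growth rate from two prices and the trading days between them."""
    if start_price <= 0 or trading_days <= 0:
        raise ValueError("start_price and trading_days must be positive")
    years = trading_days / days_per_year
    return (end_price / start_price) ** (1 / years) - 1

# Illustrative numbers: a price that goes from 100 to 121 over two trading years
two_year_cagr = cagr(100.0, 121.0, trading_days=504)
print(f"CAGR: {two_year_cagr * 100:.2f}%")
```

Because it depends only on the endpoints, CAGR answers "what constant yearly growth would have produced this result" and is unaffected by the path the price took in between.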

Practical Example: Complete Returns Analysis

import yfinance as yf
import pandas as pd
import numpy as np

def analyze_returns(ticker, start_date, end_date):
    """
    Comprehensive returns analysis for a stock
    """
    # Download data
    print(f"Analyzing {ticker}...")
    data = yf.download(ticker, start=start_date, end=end_date, progress=False)
    prices = data['Adj Close']
    
    # Calculate returns
    simple_returns = prices.pct_change()
    log_returns = np.log(prices / prices.shift(1))
    
    # Cumulative returns
    cumulative_returns = (1 + simple_returns).cumprod() - 1
    
    # Statistics
    stats = {
        'Ticker': ticker,
        'Start Date': prices.index[0].strftime('%Y-%m-%d'),
        'End Date': prices.index[-1].strftime('%Y-%m-%d'),
        'Starting Price': prices.iloc[0],
        'Ending Price': prices.iloc[-1],
        'Total Return': cumulative_returns.iloc[-1] * 100,
        'Avg Daily Return': simple_returns.mean() * 100,
        'Annualized Return': ((1 + simple_returns.mean()) ** 252 - 1) * 100,
        'Best Day': simple_returns.max() * 100,
        'Worst Day': simple_returns.min() * 100,
        'Positive Days': (simple_returns > 0).sum(),
        'Negative Days': (simple_returns < 0).sum(),
        'Win Rate': (simple_returns > 0).sum() / simple_returns.count() * 100  # count() skips the leading NaN
    }
    
    # Print results
    print("\n" + "="*60)
    print(f"RETURNS ANALYSIS: {ticker}")
    print("="*60)
    for key, value in stats.items():
        if isinstance(value, float):
            if 'Return' in key or 'Day' in key or 'Rate' in key:
                print(f"{key:.<30} {value:>8.2f}%")
            else:
                print(f"{key:.<30} {value:>8.2f}")
        else:
            print(f"{key:.<30} {value}")
    print("="*60)
    
    return simple_returns, log_returns, cumulative_returns

# Analyze a stock
returns, log_returns, cum_returns = analyze_returns('AAPL', '2023-01-01', '2024-01-01')

4.2 Measuring Volatility and Risk

What is Volatility?

Volatility measures how much an asset's returns vary over time. High volatility means larger price swings; low volatility means more stable prices. In finance, volatility is often used as a proxy for risk.

Standard Deviation: The Primary Volatility Measure

Standard deviation measures the dispersion of returns around their mean.

import yfinance as yf
import numpy as np
import pandas as pd

# Download data
data = yf.download('AAPL', start='2023-01-01', end='2024-01-01', progress=False)
returns = data['Adj Close'].pct_change().dropna()

# Calculate daily volatility
daily_volatility = returns.std()
print(f"Daily volatility: {daily_volatility * 100:.4f}%")

# Annualize volatility (multiply by square root of trading days)
annual_volatility = daily_volatility * np.sqrt(252)
print(f"Annualized volatility: {annual_volatility * 100:.2f}%")

Why Square Root of Time?

If returns are independent from one period to the next, variance scales linearly with time, so standard deviation (the square root of variance) scales with the square root of time. This is a fundamental principle in financial mathematics.
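
The square-root rule can be checked with a small simulation. The sketch below is illustrative only: it draws independent, normally distributed daily returns with an assumed 1% volatility, adds them up over 5-day blocks, and compares the measured block volatility with the daily volatility times sqrt(5).

```python
import math
import random
import statistics

rng = random.Random(42)                  # fixed seed so the run is repeatable
daily_sigma = 0.01                       # assumed 1% daily volatility
horizon = 5                              # days per block (about one trading week)
n_blocks = 20_000

# Simulate independent daily returns and add them up within each block
# (summing is exact for log returns and a close approximation for small simple returns)
daily = [rng.gauss(0.0, daily_sigma) for _ in range(n_blocks * horizon)]
block_returns = [sum(daily[i:i + horizon]) for i in range(0, len(daily), horizon)]

measured_daily = statistics.pstdev(daily)
measured_block = statistics.pstdev(block_returns)
predicted_block = measured_daily * math.sqrt(horizon)

print(f"Measured daily volatility:  {measured_daily * 100:.3f}%")
print(f"Measured 5-day volatility:  {measured_block * 100:.3f}%")
print(f"Daily volatility * sqrt(5): {predicted_block * 100:.3f}%")
```

The measured and predicted 5-day figures should agree closely. With real market data the match is looser, because actual returns are not perfectly independent.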

Rolling Volatility

Volatility changes over time. Rolling volatility shows how risk evolves.

# Calculate 30-day rolling volatility
rolling_vol = returns.rolling(window=30).std() * np.sqrt(252) * 100

print("Rolling volatility:")
print(rolling_vol.tail())

# Find periods of high and low volatility
high_vol_threshold = rolling_vol.quantile(0.75)
low_vol_threshold = rolling_vol.quantile(0.25)

print(f"\nHigh volatility threshold: {high_vol_threshold:.2f}%")
print(f"Low volatility threshold: {low_vol_threshold:.2f}%")

# Identify high volatility periods
high_vol_periods = rolling_vol[rolling_vol > high_vol_threshold]
print(f"\nNumber of high volatility days: {len(high_vol_periods)}")

Downside Deviation

Standard deviation treats upside and downside volatility equally. Downside deviation measures only the variability of returns that fall below a target (here 0%), so gains do not count as risk.

# Calculate downside deviation relative to a 0% target return:
# returns above the target contribute zero, shortfalls are squared and averaged
shortfall = returns.clip(upper=0)
downside_deviation = np.sqrt((shortfall ** 2).mean()) * np.sqrt(252)

print(f"Downside deviation (annual): {downside_deviation * 100:.2f}%")
print(f"Standard deviation (annual): {annual_volatility * 100:.2f}%")

# Downside deviation is typically lower than standard deviation

Maximum Drawdown

Maximum drawdown measures the largest peak-to-trough decline.

# Calculate cumulative returns
cumulative = (1 + returns).cumprod()

# Calculate running maximum
running_max = cumulative.expanding().max()

# Calculate drawdown
drawdown = (cumulative - running_max) / running_max

# Find maximum drawdown
max_drawdown = drawdown.min()

print(f"Maximum drawdown: {max_drawdown * 100:.2f}%")

# Find when it occurred
max_dd_date = drawdown.idxmin()
print(f"Max drawdown date: {max_dd_date.strftime('%Y-%m-%d')}")

# Find the peak before the drawdown
peak_date = running_max[:max_dd_date].idxmax()
print(f"Peak before drawdown: {peak_date.strftime('%Y-%m-%d')}")
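
The same logic can be traced by hand on a tiny series. The sketch below uses a made-up six-point price path (not market data) and plain Python so each intermediate step is visible.

```python
from itertools import accumulate

# Illustrative price path (not market data)
price_path = [100, 120, 90, 110, 130, 104]

# Highest price seen so far at each step
running_peak = list(accumulate(price_path, max))

# Fractional decline from that peak at each step
drawdowns = [price / peak - 1 for price, peak in zip(price_path, running_peak)]

max_drawdown = min(drawdowns)
trough_index = drawdowns.index(max_drawdown)
peak_index = price_path.index(running_peak[trough_index])

print(f"Running peak: {running_peak}")
print(f"Drawdowns:    {[round(d, 4) for d in drawdowns]}")
print(f"Maximum drawdown: {max_drawdown * 100:.2f}% "
      f"(peak at position {peak_index}, trough at position {trough_index})")
```

The path has two declines, 120 to 90 (25%) and 130 to 104 (20%); the maximum drawdown is the deeper one, even though the second starts from a higher peak.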

Value at Risk (VaR)

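VaR estimates a loss threshold that should not be exceeded over a specific time period at a given confidence level. The historical method used here reads that threshold directly from the observed return distribution.

# Calculate VaR at 95% confidence level (the 5th percentile of daily returns)
confidence_level = 0.95
var_95 = returns.quantile(1 - confidence_level)

print(f"Daily VaR (95%): {var_95 * 100:.2f}%")
print(f"This means: on 95% of days, losses did not exceed {abs(var_95) * 100:.2f}%")

# Annualized VaR (square-root-of-time scaling is a rough approximation that
# assumes independent returns with a mean close to zero)
annual_var_95 = var_95 * np.sqrt(252)
print(f"Annual VaR (95%): {annual_var_95 * 100:.2f}%")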

# For a $100,000 portfolio
portfolio_value = 100000
var_dollar = portfolio_value * abs(var_95)
print(f"\nFor a ${portfolio_value:,} portfolio:")
print(f"95% confident daily loss won't exceed: ${var_dollar:,.2f}")
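
The quantile approach above is usually called historical VaR. A common alternative, not covered in the snippet above, is parametric VaR, which assumes returns are normally distributed and derives the threshold from the mean and standard deviation. A minimal sketch with assumed inputs (the mean and volatility below are illustrative, not estimated from data):

```python
from statistics import NormalDist

mean_daily = 0.0005        # assumed average daily return (0.05%)
vol_daily = 0.0125         # assumed daily volatility (1.25%)
confidence = 0.95

# z-score of the lower tail: about -1.645 for 95% confidence
z = NormalDist().inv_cdf(1 - confidence)

# Return threshold that a normal distribution falls below 5% of the time
parametric_var = mean_daily + z * vol_daily

print(f"z-score: {z:.3f}")
print(f"Parametric daily VaR (95%): {parametric_var * 100:.2f}%")
```

Real return distributions tend to have fatter tails than the normal distribution, so parametric VaR can understate extreme losses; comparing it with the historical figure is a quick sanity check.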

Conditional Value at Risk (CVaR)

CVaR, also called Expected Shortfall, measures the average loss when VaR is exceeded.

# Calculate CVaR (average of returns worse than VaR)
cvar_95 = returns[returns <= var_95].mean()

print(f"CVaR (95%): {cvar_95 * 100:.2f}%")
print(f"When bad days happen, average loss is {abs(cvar_95) * 100:.2f}%")

# CVaR in dollars
cvar_dollar = portfolio_value * abs(cvar_95)
print(f"Expected loss on worst 5% of days: ${cvar_dollar:,.2f}")

Practical Example: Comprehensive Risk Analysis

import yfinance as yf
import pandas as pd
import numpy as np

def risk_analysis(ticker, start_date, end_date):
    """
    Comprehensive risk analysis for a stock
    """
    # Download data
    data = yf.download(ticker, start=start_date, end=end_date, progress=False)
    prices = data['Adj Close']
    returns = prices.pct_change().dropna()
    
    # Calculate cumulative returns for drawdown
    cumulative = (1 + returns).cumprod()
    running_max = cumulative.expanding().max()
    drawdown = (cumulative - running_max) / running_max
    
    # Volatility metrics
    daily_vol = returns.std()
    annual_vol = daily_vol * np.sqrt(252)
    # Downside deviation relative to a 0% target return (gains count as zero shortfall)
    downside_vol = np.sqrt((returns.clip(upper=0) ** 2).mean()) * np.sqrt(252)
    
    # VaR and CVaR
    var_95 = returns.quantile(0.05)
    cvar_95 = returns[returns <= var_95].mean()
    
    # Risk metrics
    risk_metrics = {
        'Ticker': ticker,
        'Daily Volatility (%)': daily_vol * 100,
        'Annual Volatility (%)': annual_vol * 100,
        'Downside Deviation (%)': downside_vol * 100,
        'Maximum Drawdown (%)': drawdown.min() * 100,
        'VaR 95% (%)': var_95 * 100,
        'CVaR 95% (%)': cvar_95 * 100,
        'Best Day (%)': returns.max() * 100,
        'Worst Day (%)': returns.min() * 100,
        'Positive Days': (returns > 0).sum(),
        'Negative Days': (returns < 0).sum()
    }
    
    # Print results
    print("\n" + "="*60)
    print(f"RISK ANALYSIS: {ticker}")
    print("="*60)
    for key, value in risk_metrics.items():
        if isinstance(value, float):
            print(f"{key:.<45} {value:>10.2f}")
        else:
            print(f"{key:.<45} {value:>10}")
    print("="*60)
    
    return risk_metrics

# Analyze risk
risk = risk_analysis('AAPL', '2023-01-01', '2024-01-01')

4.3 Correlation Analysis

Understanding Correlation

Correlation measures how two assets move together. It ranges from -1 to +1:

+1: Perfect positive correlation (move together)
0: No linear relationship (this does not by itself guarantee independence)
-1: Perfect negative correlation (move opposite)

Why Correlation Matters

Diversification: Combining assets with low correlation reduces portfolio risk
Hedging: Negatively correlated assets tend to gain when others fall, which can offset losses
Trading Strategies: Pairs trading looks for historically correlated assets whose prices temporarily diverge
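
The diversification effect can be quantified with the standard two-asset formula: portfolio variance = (w1*s1)^2 + (w2*s2)^2 + 2*w1*w2*rho*s1*s2, where w are weights, s are volatilities, and rho is the correlation. The sketch below is illustrative: the helper name `two_asset_volatility` and the 20% volatilities are assumptions chosen for the example.

```python
import math

def two_asset_volatility(w1, vol1, vol2, rho):
    """Volatility of a two-asset portfolio with weights w1 and 1 - w1."""
    w2 = 1 - w1
    variance = (w1 * vol1) ** 2 + (w2 * vol2) ** 2 + 2 * w1 * w2 * rho * vol1 * vol2
    return math.sqrt(max(variance, 0.0))   # guard against tiny negative rounding errors

# Equal weights, both assets at 20% volatility, varying correlation
for rho in (1.0, 0.5, 0.0, -0.5, -1.0):
    vol = two_asset_volatility(0.5, 0.20, 0.20, rho)
    print(f"correlation {rho:+.1f} -> portfolio volatility {vol * 100:.2f}%")
```

At a correlation of +1 the portfolio is exactly as volatile as its components; as correlation falls, portfolio volatility drops, reaching zero in the idealized case of perfect negative correlation.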

Calculating Correlation

import yfinance as yf
import pandas as pd

# Download data for multiple stocks
tickers = ['AAPL', 'MSFT', 'GOOGL', 'JPM', 'XOM']
data = yf.download(tickers, start='2023-01-01', end='2024-01-01', progress=False)
prices = data['Adj Close']
returns = prices.pct_change().dropna()

# Calculate correlation matrix
correlation_matrix = returns.corr()

print("Correlation Matrix:")
print(correlation_matrix.round(2))

# Get correlation between two specific stocks
aapl_msft_corr = returns['AAPL'].corr(returns['MSFT'])
print(f"\nAAPL-MSFT Correlation: {aapl_msft_corr:.3f}")

Interpreting Correlation

# High correlation pairs (>0.7)
high_corr = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if correlation_matrix.iloc[i, j] > 0.7:
            high_corr.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))

print("\nHighly Correlated Pairs (>0.7):")
for stock1, stock2, corr in high_corr:
    print(f"{stock1} - {stock2}: {corr:.3f}")

# Low correlation pairs (<0.3)
low_corr = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) < 0.3:
            low_corr.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))

print("\nLow Correlation Pairs (<0.3):")
for stock1, stock2, corr in low_corr:
    print(f"{stock1} - {stock2}: {corr:.3f}")

Rolling Correlation

Correlation changes over time. Rolling correlation shows how relationships evolve.

# Calculate 30-day rolling correlation between AAPL and MSFT
rolling_corr = returns['AAPL'].rolling(window=30).corr(returns['MSFT'])

print("Rolling correlation (last 10 days):")
print(rolling_corr.tail(10))

# Summary statistics
print(f"\nMean correlation: {rolling_corr.mean():.3f}")
print(f"Min correlation: {rolling_corr.min():.3f}")
print(f"Max correlation: {rolling_corr.max():.3f}")

Covariance

Covariance is related to correlation but not standardized (it depends on the volatility of each asset).

# Calculate covariance matrix
covariance_matrix = returns.cov()

print("Covariance Matrix:")
print(covariance_matrix)

# Annualize covariance
annual_cov_matrix = covariance_matrix * 252

print("\nAnnualized Covariance Matrix:")
print(annual_cov_matrix)

# Relationship: correlation = covariance / (std1 * std2)
aapl_std = returns['AAPL'].std()
msft_std = returns['MSFT'].std()
aapl_msft_cov = returns['AAPL'].cov(returns['MSFT'])
calculated_corr = aapl_msft_cov / (aapl_std * msft_std)

print(f"\nCorrelation from covariance: {calculated_corr:.3f}")
print(f"Direct correlation: {aapl_msft_corr:.3f}")
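
The same relationship works for a whole matrix at once: dividing each covariance by the product of the two assets' standard deviations yields the correlation matrix. A minimal NumPy sketch with a made-up covariance matrix (the values are illustrative, not market data):

```python
import numpy as np

# Illustrative annualized covariance matrix for two assets (not market data)
cov = np.array([[0.0400, 0.0180],
                [0.0180, 0.0900]])

vols = np.sqrt(np.diag(cov))             # standard deviations: 0.20 and 0.30
corr = cov / np.outer(vols, vols)        # divide each entry by std_i * std_j

print("Volatilities:", vols)
print("Correlation matrix:")
print(corr.round(3))
```

The diagonal becomes 1 (each asset is perfectly correlated with itself) and the off-diagonal entry is 0.018 / (0.20 * 0.30) = 0.3.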

Practical Example: Diversification Analysis

import yfinance as yf
import pandas as pd
import numpy as np

def diversification_analysis(tickers, start_date, end_date):
    """
    Analyze diversification potential across assets
    """
    # Download data
    data = yf.download(tickers, start=start_date, end=end_date, progress=False)
    prices = data['Adj Close']
    returns = prices.pct_change().dropna()
    
    # Calculate correlation matrix
    corr_matrix = returns.corr()
    
    # Calculate average correlation for each stock
    avg_correlations = {}
    for ticker in tickers:
        # Exclude correlation with itself
        other_tickers = [t for t in tickers if t != ticker]
        avg_corr = corr_matrix.loc[ticker, other_tickers].mean()
        avg_correlations[ticker] = avg_corr
    
    # Find best diversifiers (lowest average correlation)
    diversifiers = sorted(avg_correlations.items(), key=lambda x: x[1])
    
    print("\n" + "="*60)
    print("DIVERSIFICATION ANALYSIS")
    print("="*60)
    print("\nAverage Correlation with Other Assets:")
    print("-" * 40)
    for ticker, avg_corr in diversifiers:
        print(f"{ticker:.<20} {avg_corr:>10.3f}")
    
    print(f"\nBest diversifier: {diversifiers[0][0]} (avg corr: {diversifiers[0][1]:.3f})")
    print(f"Weakest diversifier: {diversifiers[-1][0]} (avg corr: {diversifiers[-1][1]:.3f})")
    
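    # Overall portfolio correlation
    # Average of all pairwise correlations (upper triangle, excluding the diagonal).
    # Indexing the upper triangle directly keeps pairs whose correlation is exactly 0.
    pair_idx = np.triu_indices_from(corr_matrix.values, k=1)
    avg_portfolio_corr = corr_matrix.values[pair_idx].mean()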
    
    print(f"\nOverall portfolio average correlation: {avg_portfolio_corr:.3f}")
    
    if avg_portfolio_corr < 0.5:
        print("✓ Good diversification (low correlation)")
    elif avg_portfolio_corr < 0.7:
        print("⚠ Moderate diversification")
    else:
        print("✗ Poor diversification (high correlation)")
    
    return corr_matrix, avg_correlations

# Analyze diversification
tickers = ['AAPL', 'MSFT', 'JPM', 'XOM', 'JNJ', 'WMT', 'GLD']
corr_matrix, avg_corrs = diversification_analysis(tickers, '2023-01-01', '2024-01-01')

4.4 Time Series Analysis

Identifying Trends

Trends are sustained directional movements in prices.

import yfinance as yf
import pandas as pd
import numpy as np

# Download data
data = yf.download('AAPL', start='2023-01-01', end='2024-01-01', progress=False)
prices = data['Adj Close']

# Calculate moving averages
ma_20 = prices.rolling(window=20).mean()
ma_50 = prices.rolling(window=50).mean()
ma_200 = prices.rolling(window=200).mean()

# Current trend identification
current_price = prices.iloc[-1]
current_ma20 = ma_20.iloc[-1]
current_ma50 = ma_50.iloc[-1]

print(f"Current Price: ${current_price:.2f}")
print(f"20-day MA: ${current_ma20:.2f}")
print(f"50-day MA: ${current_ma50:.2f}")

# Determine trend
if current_price > current_ma20 > current_ma50:
    print("\nTrend: Strong Uptrend ↑")
elif current_price > current_ma50:
    print("\nTrend: Uptrend ↑")
elif current_price < current_ma20 < current_ma50:
    print("\nTrend: Strong Downtrend ↓")
elif current_price < current_ma50:
    print("\nTrend: Downtrend ↓")
else:
    print("\nTrend: Sideways/Neutral →")

Seasonality Analysis

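Some stocks exhibit seasonal patterns. Note that one year of data gives only a single observation of each calendar month, so use several years before drawing conclusions.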

# Extract month from date
returns = prices.pct_change()
returns_with_month = pd.DataFrame({
    'returns': returns,
    'month': returns.index.month
})

# Calculate average daily return within each calendar month
monthly_avg = returns_with_month.groupby('month')['returns'].mean() * 100

print("Average Returns by Month:")
print("-" * 30)
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
for month, avg_return in monthly_avg.items():
    print(f"{month_names[month-1]:.<10} {avg_return:>8.3f}%")

# Best and worst months
best_month = monthly_avg.idxmax()
worst_month = monthly_avg.idxmin()
print(f"\nBest month: {month_names[best_month-1]} ({monthly_avg[best_month]:.3f}%)")
print(f"Worst month: {month_names[worst_month-1]} ({monthly_avg[worst_month]:.3f}%)")

Day of Week Effect

# Extract day of week
returns_with_dow = pd.DataFrame({
    'returns': returns,
    'day_of_week': returns.index.dayofweek  # Monday=0, Sunday=6
})

# Calculate average return by day
daily_avg = returns_with_dow.groupby('day_of_week')['returns'].mean() * 100

print("\nAverage Returns by Day of Week:")
print("-" * 30)
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
for day, avg_return in daily_avg.items():
    if day < 5:  # Exclude weekends
        print(f"{day_names[day]:.<15} {avg_return:>8.3f}%")

Autocorrelation

Autocorrelation measures how strongly a series is correlated with a lagged copy of itself, which indicates whether past returns carry information about future returns.

# Calculate autocorrelation at several lags
autocorr_1 = returns.autocorr(lag=1)
autocorr_5 = returns.autocorr(lag=5)
autocorr_20 = returns.autocorr(lag=20)

print("Autocorrelation:")
print(f"1-day lag: {autocorr_1:.4f}")
print(f"5-day lag: {autocorr_5:.4f}")
print(f"20-day lag: {autocorr_20:.4f}")

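# Interpretation: for n observations, sample autocorrelations within roughly
# +/- 2/sqrt(n) are statistically indistinguishable from zero
threshold = 2 / np.sqrt(returns.count())
if abs(autocorr_1) < threshold:
    print("\nNo significant 1-day autocorrelation - consistent with (but not proof of) an efficient market")
else:
    print("\nSignificant 1-day autocorrelation - possible short-term predictability")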
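
Under the hood, the lag-k autocorrelation is the Pearson correlation between the series and a copy of itself shifted by k periods. The sketch below illustrates this on synthetic data: the series is generated with a deliberate 30% carry-over from one value to the next (an assumption made purely for the demonstration, not a model of real returns).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable
noise = rng.normal(0.0, 0.01, size=500)

# Synthetic returns with built-in persistence: each value carries over
# 30% of the previous one (an AR(1) process, chosen only for illustration)
values = np.zeros(500)
for t in range(1, 500):
    values[t] = 0.3 * values[t - 1] + noise[t]
synthetic_returns = pd.Series(values)

lag1_builtin = synthetic_returns.autocorr(lag=1)
lag1_manual = synthetic_returns.corr(synthetic_returns.shift(1))

print(f"autocorr(lag=1):        {lag1_builtin:.4f}")
print(f"corr with shifted copy: {lag1_manual:.4f}")
```

The two numbers match, and both should land near the 0.3 persistence built into the data. Daily stock returns usually show values much closer to zero.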

Moving Average Crossovers

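Moving average crossovers are a classic technical analysis technique: a short-term average crossing above a long-term average (a golden cross) is read as bullish, and a cross below it (a death cross) as bearish.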

# Calculate crossovers
crossovers = pd.DataFrame({
    'Price': prices,
    'MA_20': ma_20,
    'MA_50': ma_50
})

# Build a signal: +1 when MA_20 is above MA_50, -1 when it is at or below,
# and 0 while MA_50 is still undefined (the first 49 rows)
crossovers['Signal'] = 0
crossovers.loc[crossovers['MA_20'] > crossovers['MA_50'], 'Signal'] = 1
crossovers.loc[crossovers['MA_20'] <= crossovers['MA_50'], 'Signal'] = -1

# Find where the signal flips: -1 -> +1 (diff of 2) is a golden cross,
# +1 -> -1 (diff of -2) is a death cross
crossovers['Position_Change'] = crossovers['Signal'].diff()

golden_crosses = crossovers[crossovers['Position_Change'] == 2]
death_crosses = crossovers[crossovers['Position_Change'] == -2]

print(f"\nGolden Crosses (bullish): {len(golden_crosses)}")
print(f"Death Crosses (bearish): {len(death_crosses)}")

if len(golden_crosses) > 0:
    print("\nMost recent Golden Cross:")
    print(golden_crosses.iloc[-1][['Price', 'MA_20', 'MA_50']])
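
To see why a difference of 2 or -2 marks a crossover, it helps to run the same logic on a series short enough to check by hand. The sketch below uses a made-up price path and much shorter windows (2 and 4 periods instead of 20 and 50); both choices are assumptions made to keep the example small.

```python
import pandas as pd

# Illustrative price path: falls, rallies, then falls again (not market data)
prices_demo = pd.Series([10, 9, 8, 7, 6, 7, 8, 9, 10, 11, 10, 9, 8, 7, 6], dtype=float)

fast = prices_demo.rolling(window=2).mean()
slow = prices_demo.rolling(window=4).mean()

# +1 when the fast average is above the slow one, -1 otherwise,
# 0 while the slow average is still undefined
signal = pd.Series(0, index=prices_demo.index)
signal[fast > slow] = 1
signal[fast <= slow] = -1

# A jump from -1 to +1 is a difference of 2; from +1 to -1 a difference of -2
change = signal.diff()
golden = change[change == 2].index.tolist()
death = change[change == -2].index.tolist()

print("Signal:", signal.tolist())
print("Golden cross positions:", golden)
print("Death cross positions:", death)
```

The first move from 0 to -1 (when the slow average becomes available) produces a difference of only -1, so it is correctly ignored rather than counted as a crossover.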

4.5 Descriptive Statistics for Portfolios

Building Summary Statistics

import yfinance as yf
import pandas as pd
import numpy as np

def comprehensive_stats(ticker, start_date, end_date):
    """
    Generate comprehensive descriptive statistics
    """
    # Download data
    data = yf.download(ticker, start=start_date, end=end_date, progress=False)
    prices = data['Adj Close']
    returns = prices.pct_change().dropna()
    
    # Calculate cumulative returns
    cumulative = (1 + returns).cumprod()
    
    # Risk metrics
    daily_vol = returns.std()
    annual_vol = daily_vol * np.sqrt(252)
    
    # Return metrics
    total_return = (prices.iloc[-1] / prices.iloc[0]) - 1
    avg_return = returns.mean()
    annual_return = (1 + avg_return) ** 252 - 1
    
    # Drawdown
    running_max = cumulative.expanding().max()
    drawdown = (cumulative - running_max) / running_max
    max_dd = drawdown.min()
    
    # Sharpe ratio (assuming 0% risk-free rate)
    sharpe = (avg_return / daily_vol) * np.sqrt(252) if daily_vol > 0 else 0
    
    # Sortino ratio (downside deviation relative to a 0% target return;
    # the numerator is annualized the same way as in the Sharpe ratio above)
    downside_std = np.sqrt((returns.clip(upper=0) ** 2).mean()) * np.sqrt(252)
    sortino = (avg_return * 252 / downside_std) if downside_std > 0 else 0
    
    # Skewness and excess kurtosis (pandas reports kurtosis relative to a
    # normal distribution, so 0 means normal-like tails)
    skew = returns.skew()
    kurt = returns.kurtosis()
    
    # Compile statistics
    stats = {
        'Total Return (%)': total_return * 100,
        'Annual Return (%)': annual_return * 100,
        'Annual Volatility (%)': annual_vol * 100,
        'Sharpe Ratio': sharpe,
        'Sortino Ratio': sortino,
        'Maximum Drawdown (%)': max_dd * 100,
        'Skewness': skew,
        'Excess Kurtosis': kurt,
        'Best Day (%)': returns.max() * 100,
        'Worst Day (%)': returns.min() * 100,
        'Win Rate (%)': (returns > 0).sum() / len(returns) * 100
    }
    
    return pd.Series(stats)

# Get stats for multiple stocks
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN']
all_stats = {}

for ticker in tickers:
    all_stats[ticker] = comprehensive_stats(ticker, '2023-01-01', '2024-01-01')

# Create comparison DataFrame
comparison = pd.DataFrame(all_stats).T

print("COMPREHENSIVE STATISTICS COMPARISON")
print("="*80)
print(comparison.round(2))
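
The function above assumes a 0% risk-free rate. When rates are materially above zero, the Sharpe ratio is computed on excess returns instead. The sketch below shows one way to do that; the helper name `sharpe_ratio`, the 4% rate, and the sample returns are illustrative assumptions, and the annual rate is converted to a daily rate by simple division.

```python
import math
import statistics

def sharpe_ratio(daily_returns, annual_risk_free=0.0, days_per_year=252):
    """Annualized Sharpe ratio from daily simple returns."""
    daily_rf = annual_risk_free / days_per_year      # simple (non-compounded) conversion
    excess = [r - daily_rf for r in daily_returns]
    vol = statistics.stdev(excess)                    # sample standard deviation
    if vol == 0:
        return 0.0
    return statistics.mean(excess) / vol * math.sqrt(days_per_year)

# Illustrative daily returns (not market data)
sample = [0.010, -0.005, 0.007, 0.002, -0.008, 0.004, 0.006, -0.003]

sharpe_zero_rf = sharpe_ratio(sample)
sharpe_4pct_rf = sharpe_ratio(sample, annual_risk_free=0.04)

print(f"Sharpe ratio (0% risk-free): {sharpe_zero_rf:.2f}")
print(f"Sharpe ratio (4% risk-free): {sharpe_4pct_rf:.2f}")
```

Subtracting a constant risk-free rate leaves the volatility unchanged and lowers the average excess return, so the ratio falls as the risk-free rate rises.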

Percentile Analysis

# Download data
data = yf.download('AAPL', start='2023-01-01', end='2024-01-01', progress=False)
returns = data['Adj Close'].pct_change().dropna()

# Calculate percentiles
percentiles = [1, 5, 10, 25, 50, 75, 90, 95, 99]
percentile_values = [returns.quantile(p/100) for p in percentiles]

print("Return Distribution Percentiles:")
print("-" * 40)
for p, value in zip(percentiles, percentile_values):
    print(f"Percentile {p:>2}: {value*100:>8.3f}%")

# Current vs historical
recent_return = returns.iloc[-1]
percentile_rank = (returns < recent_return).sum() / len(returns) * 100

print(f"\nMost recent return: {recent_return*100:.3f}%")
print(f"This is at the {percentile_rank:.1f}th percentile historically")

Performance Attribution

def performance_attribution(ticker, start_date, end_date):
    """
    Break down performance by time period
    """
    data = yf.download(ticker, start=start_date, end=end_date, progress=False)
    returns = data['Adj Close'].pct_change()
    
    # Add time periods
    df = pd.DataFrame({'returns': returns})
    df['year'] = df.index.year
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    
    # Calculate returns by period
    yearly = df.groupby('year')['returns'].apply(lambda x: (1 + x).prod() - 1) * 100
    quarterly = df.groupby(['year', 'quarter'])['returns'].apply(lambda x: (1 + x).prod() - 1) * 100
    monthly = df.groupby(['year', 'month'])['returns'].apply(lambda x: (1 + x).prod() - 1) * 100
    
    print(f"\nAnnual Returns for {ticker}:")
    print("-" * 30)
    for year, ret in yearly.items():
        print(f"{year}: {ret:>8.2f}%")
    
    print(f"\nQuarterly Returns:")
    print("-" * 30)
    for (year, quarter), ret in quarterly.tail(8).items():
        print(f"{year} Q{quarter}: {ret:>8.2f}%")
    
    return yearly, quarterly, monthly

# Analyze performance
yearly, quarterly, monthly = performance_attribution('AAPL', '2021-01-01', '2024-01-01')

4.6 Practice Exercises

Exercise 1: Returns Analysis Challenge

# Your task: Complete returns analysis for a stock of your choice
# 1. Download 3 years of data
# 2. Calculate both simple and log returns
# 3. Find the best and worst months
# 4. Calculate the Sharpe ratio
# 5. Determine the probability of positive returns
# 6. Compare 1-year returns vs 2-year returns vs 3-year returns

Exercise 2: Risk Comparison

# Your task: Compare risk profiles of different assets
# 1. Download data for: AAPL (stock), GLD (gold), TLT (bonds), SPY (S&P 500)
# 2. Calculate annualized volatility for each
# 3. Calculate maximum drawdown for each
# 4. Calculate VaR and CVaR for each
# 5. Rank them from safest to riskiest
# 6. Create a summary table comparing all metrics

Exercise 3: Correlation Study

# Your task: Explore correlations across sectors
# 1. Choose 2 stocks from each sector: Tech, Finance, Healthcare, Energy
# 2. Calculate correlation matrix
# 3. Find the most correlated pair
# 4. Find the least correlated pair
# 5. Calculate rolling 60-day correlation for most correlated pair
# 6. Identify when correlation was strongest and weakest

Exercise 4: Build an EDA Dashboard

# Your task: Create a comprehensive exploratory analysis function
# The function should take a ticker and date range and return:
# 1. Price summary (start, end, min, max, average)
# 2. Return statistics (mean, std, skew, kurtosis)
# 3. Risk metrics (volatility, max drawdown, VaR)
# 4. Performance metrics (total return, CAGR, Sharpe)
# 5. Calendar analysis (best/worst month, day of week effect)
# 6. Technical signals (MA crossovers, trend direction)
# Make the output clear and professionally formatted

Module 4 Summary

Congratulations! You've mastered exploratory data analysis for financial data.

What You've Accomplished

Returns Analysis

Understanding simple vs logarithmic returns
Calculating multi-period and cumulative returns
Annualizing returns for comparison
Analyzing return distributions

Risk Measurement

Measuring volatility with standard deviation
Calculating downside deviation
Finding maximum drawdown
Computing Value at Risk (VaR) and CVaR
Understanding different risk metrics

Correlation Analysis

Calculating correlation between assets
Using correlation for diversification
Analyzing rolling correlations
Understanding covariance

Time Series Analysis

Identifying trends with moving averages
Detecting seasonality patterns
Finding autocorrelation
Recognizing technical signals

Statistical Proficiency

Generating comprehensive descriptive statistics
Performing percentile analysis
Breaking down performance by time period
Building complete analytical frameworks

Real-World Skills

You can now:

Analyze any stock's risk and return profile
Compare assets objectively using multiple metrics
Identify patterns and trends in market data
Build portfolios with proper diversification
Perform professional-grade exploratory analysis
Create comprehensive financial reports

What's Next

In Module 5, we'll take these analytical skills and create stunning visualizations. You'll learn to build professional charts, graphs, and dashboards that communicate your insights effectively. Data analysis is powerful; data visualization makes it persuasive.

Before Moving Forward

Ensure you're comfortable with:

Calculating and interpreting returns
Understanding various risk metrics
Working with correlation matrices
Identifying trends in time series
Generating summary statistics

Practice Recommendations

Daily Practice: Analyze a different stock each day
Compare Sectors: Look at how different industries behave
Economic Events: Study how stocks react to news
Portfolio Thinking: Start considering combinations of assets
Documentation: Keep notes on interesting patterns you discover

The Foundation is Set

You now have the analytical toolkit used by professional quantitative analysts. These metrics—returns, volatility, correlation, Sharpe ratios—form the language of modern finance. You're not just crunching numbers; you're extracting insights that drive investment decisions.

The next step is making these insights visible and compelling through visualization.

Continue to Module 5: Data Visualization →