Module 4: Exploratory Data Analysis for Finance
Learning Objectives
By the end of this module, you will:
- Calculate and interpret financial returns (simple vs logarithmic)
- Measure and analyze volatility and risk metrics
- Understand and calculate correlation between assets
- Perform time series analysis on financial data
- Identify trends, patterns, and anomalies in market data
- Calculate rolling statistics and moving averages
- Build comprehensive exploratory analysis workflows
- Create insightful summary statistics for portfolios
4.1 Understanding Returns: The Foundation of Financial Analysis
Why Returns Matter More Than Prices
In finance, we rarely analyze absolute prices. Instead, we focus on returns—the percentage change in value over time. Here's why:
1. Comparability: A $10 stock rising to $11 (10% return) is more impressive than a $100 stock rising to $105 (5% return). Returns let us compare apples to apples.
2. Stationarity: Prices trend upward or downward over time, making statistical analysis difficult. Returns tend to fluctuate around a stable mean, making them more suitable for analysis.
3. Portfolio Construction: Returns can be aggregated across assets; prices cannot.
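To make point 3 concrete, here is a minimal sketch using made-up numbers (the weights and prices are hypothetical, chosen to match the $10 and $100 stocks above). It shows that a portfolio's simple return equals the weighted average of its assets' simple returns, and cross-checks that against the change in portfolio value.

```python
import numpy as np

# Hypothetical one-period data: two assets, portfolio weights summing to 1
weights = np.array([0.6, 0.4])
start_prices = np.array([10.0, 100.0])
end_prices = np.array([11.0, 105.0])

# Simple return of each asset (10% and 5%)
asset_returns = end_prices / start_prices - 1

# Portfolio return as the weighted average of the simple returns
portfolio_return = float(weights @ asset_returns)

# Cross-check: invest $1,000 according to the weights and track its value
initial_value = 1000.0
shares = initial_value * weights / start_prices
final_value = float(shares @ end_prices)
direct_return = final_value / initial_value - 1

print(f"Weighted-average return:     {portfolio_return:.4f}")
print(f"Return from portfolio value: {direct_return:.4f}")
```

Both figures agree. The same weighted average applied to log returns would only be an approximation, which is one reason simple returns are preferred for cross-asset aggregation.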
Simple Returns
Simple returns (also called arithmetic returns) measure the percentage change from one period to the next.
Formula: R = (P_t - P_{t-1}) / P_{t-1}
Or equivalently: R = (P_t / P_{t-1}) - 1
import pandas as pd
import numpy as np
import yfinance as yf
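# Download stock data
# Depending on the installed yfinance version, prices may be auto-adjusted by
# default (no 'Adj Close' column) and columns may be multi-level even for one
# ticker; auto_adjust=False and squeeze() should keep this example working
data = yf.download('AAPL', start='2023-01-01', end='2024-01-01', auto_adjust=False, progress=False)
prices = data['Adj Close'].squeeze()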
# Calculate simple returns
simple_returns = prices.pct_change()
# Alternative manual calculation
simple_returns_manual = (prices / prices.shift(1)) - 1
print("First 10 daily returns:")
print(simple_returns.head(10))
# Convert to percentage
simple_returns_pct = simple_returns * 100
print(f"\nAverage daily return: {simple_returns_pct.mean():.4f}%")
Logarithmic Returns
Logarithmic (log) returns use the natural logarithm of the price ratio.
Formula: R = ln(P_t / P_{t-1})
Why Use Log Returns?
- Time Additivity: Log returns over multiple periods simply sum up
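- Symmetry: With simple returns, a 50% gain followed by a 50% loss leaves you 25% below your starting point, even though the two returns sum to zero. Log returns of equal size and opposite sign cancel exactly, so a price that rises and then falls back to its starting level has log returns that sum to zero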
- Statistical Properties: Log returns are often more normally distributed, which is important for many financial models
import numpy as np
# Calculate log returns
log_returns = np.log(prices / prices.shift(1))
# Alternative using NumPy's diff and log
# (returns a plain NumPy array with one fewer element and no date index)
log_returns_alt = np.diff(np.log(prices))
print("First 10 log returns:")
print(log_returns.head(10))
# Convert to percentage
log_returns_pct = log_returns * 100
print(f"\nAverage daily log return: {log_returns_pct.mean():.4f}%")
Comparing Simple and Log Returns
# For small returns, simple and log returns are very similar
comparison = pd.DataFrame({
'Simple Returns': simple_returns,
'Log Returns': log_returns,
'Difference': simple_returns - log_returns
})
print("Comparison of returns:")
print(comparison.head(10))
print(f"\nAverage absolute difference: {abs(comparison['Difference']).mean():.6f}")
# For large returns, the difference becomes noticeable
large_change_example = pd.Series([100, 150, 100]) # 50% up, then 33.33% down
simple_ret = large_change_example.pct_change()
log_ret = np.log(large_change_example / large_change_example.shift(1))
print("\nExample with large price change:")
print(f"Simple returns: {simple_ret.values}")
print(f"Log returns: {log_ret.values}")
print(f"Sum of simple returns: {simple_ret.sum():.4f}")
print(f"Sum of log returns: {log_ret.sum():.4f}") # Should be ~0 (back to starting price)
Multi-Period Returns
Cumulative Simple Returns
To calculate the total return over multiple periods with simple returns, you must compound them:
# Calculate cumulative returns (investment growth)
cumulative_return = (1 + simple_returns).cumprod() - 1
# Alternative: cumulative product of (1 + return)
cumulative_growth = (1 + simple_returns).cumprod()
print("Cumulative returns:")
print(cumulative_return.tail())
# Total return over the entire period
total_return = cumulative_return.iloc[-1]
print(f"\nTotal return for the period: {total_return * 100:.2f}%")
# What $1000 would have become
initial_investment = 1000
final_value = initial_investment * (1 + total_return)
print(f"$1000 would have become: ${final_value:.2f}")
Cumulative Log Returns
With log returns, you simply sum them up:
# Cumulative log returns (just add them up)
cumulative_log_return = log_returns.cumsum()
print("Cumulative log returns:")
print(cumulative_log_return.tail())
# Convert to simple return
total_simple_return = np.exp(cumulative_log_return.iloc[-1]) - 1
print(f"\nTotal simple return: {total_simple_return * 100:.2f}%")
Annualizing Returns
To compare returns across different time periods, we annualize them.
# Calculate average daily return
avg_daily_return = simple_returns.mean()
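# Annualize by compounding the arithmetic mean (assuming 252 trading days per year).
# The arithmetic mean is never below the geometric mean, so this figure is an
# upper bound on the compound growth rate actually realized.
annual_return = (1 + avg_daily_return) ** 252 - 1
print(f"Average daily return: {avg_daily_return * 100:.4f}%")
print(f"Annualized return: {annual_return * 100:.2f}%")
# Alternative with log returns (this gives the geometric, i.e. compound, annual rate)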
avg_daily_log_return = log_returns.mean()
annual_log_return = avg_daily_log_return * 252
annual_simple_from_log = np.exp(annual_log_return) - 1
print(f"\nAnnualized return (from log): {annual_simple_from_log * 100:.2f}%")
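To see how far the two annualization methods can diverge, here is a minimal sketch on hypothetical data: a year of daily returns alternating between +2% and -2%. The series and variable names are illustrative assumptions, not market data.

```python
import numpy as np

# Hypothetical daily simple returns: +2%, -2%, repeated for 252 days
daily_returns = np.tile([0.02, -0.02], 126)

# Method 1: compound the arithmetic mean daily return
arithmetic_annual = (1 + daily_returns.mean()) ** 252 - 1

# Method 2: compound the actual returns, then rescale to a 252-day year
total_growth = np.prod(1 + daily_returns)
geometric_annual = total_growth ** (252 / len(daily_returns)) - 1

# Method 3: mean log return times 252, converted back to a simple return
log_returns = np.log1p(daily_returns)
from_log_annual = np.exp(log_returns.mean() * 252) - 1

print(f"Arithmetic annualization: {arithmetic_annual * 100:.2f}%")
print(f"Geometric annualization:  {geometric_annual * 100:.2f}%")
print(f"From log returns:         {from_log_annual * 100:.2f}%")
```

The arithmetic figure is essentially zero because the average daily return is zero, yet the investment actually loses roughly 5% over the year. Methods 2 and 3 agree with each other because both describe compound growth.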
Practical Example: Complete Returns Analysis
import yfinance as yf
import pandas as pd
import numpy as np
def analyze_returns(ticker, start_date, end_date):
    """
    Comprehensive returns analysis for a stock
    """
    # Download data
    print(f"Analyzing {ticker}...")
    data = yf.download(ticker, start=start_date, end=end_date, progress=False)
    prices = data['Adj Close']
    # Calculate returns
    simple_returns = prices.pct_change()
    log_returns = np.log(prices / prices.shift(1))
    # Cumulative returns
    cumulative_returns = (1 + simple_returns).cumprod() - 1
    # Statistics
    stats = {
        'Ticker': ticker,
        'Start Date': prices.index[0].strftime('%Y-%m-%d'),
        'End Date': prices.index[-1].strftime('%Y-%m-%d'),
        'Starting Price': prices.iloc[0],
        'Ending Price': prices.iloc[-1],
        'Total Return': cumulative_returns.iloc[-1] * 100,
        'Avg Daily Return': simple_returns.mean() * 100,
        'Annualized Return': ((1 + simple_returns.mean()) ** 252 - 1) * 100,
        'Best Day': simple_returns.max() * 100,
        'Worst Day': simple_returns.min() * 100,
        'Positive Days': (simple_returns > 0).sum(),
        'Negative Days': (simple_returns < 0).sum(),
        # count() ignores the NaN in the first row, unlike len()
        'Win Rate': (simple_returns > 0).sum() / simple_returns.count() * 100
    }
    # Print results
    print("\n" + "="*60)
    print(f"RETURNS ANALYSIS: {ticker}")
    print("="*60)
    for key, value in stats.items():
        if isinstance(value, float):
            if 'Return' in key or 'Day' in key or 'Rate' in key:
                print(f"{key:.<30} {value:>8.2f}%")
            else:
                print(f"{key:.<30} {value:>8.2f}")
        else:
            print(f"{key:.<30} {value}")
    print("="*60)
    return simple_returns, log_returns, cumulative_returns
# Analyze a stock
returns, log_returns, cum_returns = analyze_returns('AAPL', '2023-01-01', '2024-01-01')
4.2 Measuring Volatility and Risk
What is Volatility?
Volatility measures how much an asset's returns vary over time. High volatility means larger price swings; low volatility means more stable prices. In finance, volatility is often used as a proxy for risk.
Standard Deviation: The Primary Volatility Measure
Standard deviation measures the dispersion of returns around their mean.
import yfinance as yf
import numpy as np
import pandas as pd
# Download data
data = yf.download('AAPL', start='2023-01-01', end='2024-01-01', progress=False)
returns = data['Adj Close'].pct_change().dropna()
# Calculate daily volatility
daily_volatility = returns.std()
print(f"Daily volatility: {daily_volatility * 100:.4f}%")
# Annualize volatility (multiply by square root of trading days)
annual_volatility = daily_volatility * np.sqrt(252)
print(f"Annualized volatility: {annual_volatility * 100:.2f}%")
Why Square Root of Time?
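If returns in different periods are uncorrelated, variance scales linearly with time, so standard deviation (the square root of variance) scales with the square root of time. The sqrt(252) factor relies on that assumption and is only an approximation when returns are autocorrelated or volatility clusters.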
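The square-root rule can be checked numerically. The sketch below simulates independent, normally distributed daily log returns (an idealized assumption, with an arbitrary 1% daily volatility) and compares the volatility of the simulated annual returns with the scaled daily figure.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Simulate 10,000 independent "years" of 252 i.i.d. daily log returns
daily_sigma = 0.01
daily = rng.normal(loc=0.0, scale=daily_sigma, size=(10_000, 252))

# An annual log return is the sum of that year's daily log returns
annual = daily.sum(axis=1)

simulated_annual_vol = annual.std(ddof=1)
scaled_annual_vol = daily_sigma * np.sqrt(252)

print(f"Volatility of simulated annual returns: {simulated_annual_vol:.4f}")
print(f"Daily volatility * sqrt(252):           {scaled_annual_vol:.4f}")
```

The two numbers should agree to within sampling error. With real market data the match is looser, because real returns are not perfectly independent.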
Rolling Volatility
Volatility changes over time. Rolling volatility shows how risk evolves.
# Calculate 30-day rolling volatility
rolling_vol = returns.rolling(window=30).std() * np.sqrt(252) * 100
print("Rolling volatility:")
print(rolling_vol.tail())
# Find periods of high and low volatility
high_vol_threshold = rolling_vol.quantile(0.75)
low_vol_threshold = rolling_vol.quantile(0.25)
print(f"\nHigh volatility threshold: {high_vol_threshold:.2f}%")
print(f"Low volatility threshold: {low_vol_threshold:.2f}%")
# Identify high volatility periods
high_vol_periods = rolling_vol[rolling_vol > high_vol_threshold]
print(f"\nNumber of high volatility days: {len(high_vol_periods)}")
Downside Deviation
Standard deviation treats upside and downside volatility equally. Downside deviation measures only the volatility of returns that fall below a target, here 0%.
# Calculate a simplified downside deviation (standard deviation of the negative returns only)
negative_returns = returns[returns < 0]
downside_deviation = negative_returns.std() * np.sqrt(252)
print(f"Downside deviation (annual): {downside_deviation * 100:.2f}%")
print(f"Standard deviation (annual): {annual_volatility * 100:.2f}%")
# Downside deviation is typically lower than standard deviation
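The calculation above takes the standard deviation of the losing days only. A common textbook definition instead measures the root-mean-square shortfall below a target return across all days. The helper below is a minimal sketch of that variant; the function name, the sample returns, and the default target of 0 are illustrative assumptions.

```python
import numpy as np

def downside_deviation(returns, target=0.0, periods_per_year=252):
    """Annualized downside deviation relative to a target return (illustrative helper)."""
    returns = np.asarray(returns, dtype=float)
    # Shortfall below the target; returns at or above the target contribute zero
    shortfall = np.minimum(returns - target, 0.0)
    # Root-mean-square shortfall over ALL observations, not just the losing days
    return np.sqrt(np.mean(shortfall ** 2)) * np.sqrt(periods_per_year)

# Hypothetical daily returns
sample = np.array([0.01, -0.02, 0.015, -0.01, 0.005, 0.0])

dd_target = downside_deviation(sample)
std_of_losses = sample[sample < 0].std(ddof=1) * np.sqrt(252)

print(f"Target-based downside deviation: {dd_target * 100:.2f}%")
print(f"Std of negative returns only:    {std_of_losses * 100:.2f}%")
```

The two measures answer slightly different questions, so the numbers differ. Whichever you choose, use the same one for every asset you compare.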
Maximum Drawdown
Maximum drawdown measures the largest peak-to-trough decline.
# Calculate cumulative returns
cumulative = (1 + returns).cumprod()
# Calculate running maximum
running_max = cumulative.expanding().max()
# Calculate drawdown
drawdown = (cumulative - running_max) / running_max
# Find maximum drawdown
max_drawdown = drawdown.min()
print(f"Maximum drawdown: {max_drawdown * 100:.2f}%")
# Find when it occurred
max_dd_date = drawdown.idxmin()
print(f"Max drawdown date: {max_dd_date.strftime('%Y-%m-%d')}")
# Find the peak before the drawdown
peak_date = running_max[:max_dd_date].idxmax()
print(f"Peak before drawdown: {peak_date.strftime('%Y-%m-%d')}")
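To verify the drawdown logic by hand, it helps to run it on a tiny series where the answer is obvious. This sketch uses a hypothetical six-point price series and a helper name chosen for illustration. It works on prices directly, which gives the same drawdowns as working on cumulative returns.

```python
import pandas as pd

def max_drawdown(prices):
    """Return (max drawdown, peak label, trough label) for a price series."""
    running_max = prices.cummax()
    drawdown = prices / running_max - 1
    trough = drawdown.idxmin()
    # The peak is the highest price at or before the trough
    peak = prices.loc[:trough].idxmax()
    return drawdown.min(), peak, trough

# Hypothetical prices: peak of 120 falls to 90, later peak of 130 falls to 100
toy_prices = pd.Series([100, 120, 90, 110, 130, 100], dtype=float)

mdd, peak, trough = max_drawdown(toy_prices)
print(f"Maximum drawdown: {mdd * 100:.2f}% (peak at index {peak}, trough at index {trough})")
```

The fall from 120 to 90 is a 25% decline, which is larger than the later fall from 130 to 100 (about 23%), so the first episode is reported.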
Value at Risk (VaR)
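VaR estimates the loss that should not be exceeded over a specific time period at a given confidence level. It is a threshold rather than a worst case: larger losses still occur with probability 1 minus the confidence level.
# Calculate VaR at 95% confidence level
tail_probability = 0.05  # 1 - confidence level
var_95 = returns.quantile(tail_probability)
print(f"Daily VaR (95%): {var_95 * 100:.2f}%")
print(f"This means: 95% of days, losses won't exceed {abs(var_95) * 100:.2f}%")
# Annualized VaR (square-root-of-time scaling is a rough approximation
# that assumes independent returns with a mean close to zero)
annual_var_95 = var_95 * np.sqrt(252)
print(f"Annual VaR (95%): {annual_var_95 * 100:.2f}%")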
# For a $100,000 portfolio
portfolio_value = 100000
var_dollar = portfolio_value * abs(var_95)
print(f"\nFor a ${portfolio_value:,} portfolio:")
print(f"95% confident daily loss won't exceed: ${var_dollar:,.2f}")
Conditional Value at Risk (CVaR)
CVaR, also called Expected Shortfall, measures the average loss when VaR is exceeded.
# Calculate CVaR (average of returns worse than VaR)
cvar_95 = returns[returns <= var_95].mean()
print(f"CVaR (95%): {cvar_95 * 100:.2f}%")
print(f"When bad days happen, average loss is {abs(cvar_95) * 100:.2f}%")
# CVaR in dollars
cvar_dollar = portfolio_value * abs(cvar_95)
print(f"Expected loss on worst 5% of days: ${cvar_dollar:,.2f}")
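The historical VaR and CVaR steps can be wrapped in one small helper and sanity-checked on simulated data. This is a sketch under stated assumptions: the function name is hypothetical, and the fat-tailed Student-t returns are simulated rather than real.

```python
import numpy as np

def historical_var_cvar(returns, tail_prob=0.05):
    """Historical VaR (tail quantile) and CVaR (mean of returns at or below VaR)."""
    returns = np.asarray(returns, dtype=float)
    var = np.quantile(returns, tail_prob)
    cvar = returns[returns <= var].mean()
    return var, cvar

# Simulated fat-tailed daily returns (Student-t with 4 degrees of freedom)
rng = np.random.default_rng(seed=0)
simulated = rng.standard_t(df=4, size=5000) * 0.01

var_95, cvar_95 = historical_var_cvar(simulated, tail_prob=0.05)
var_99, cvar_99 = historical_var_cvar(simulated, tail_prob=0.01)

print(f"95%: VaR {var_95 * 100:.2f}%, CVaR {cvar_95 * 100:.2f}%")
print(f"99%: VaR {var_99 * 100:.2f}%, CVaR {cvar_99 * 100:.2f}%")
```

Two relationships always hold for the historical method: CVaR is at least as negative as VaR at the same level, and the 99% figures are at least as negative as the 95% figures.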
Practical Example: Comprehensive Risk Analysis
import yfinance as yf
import pandas as pd
import numpy as np
def risk_analysis(ticker, start_date, end_date):
    """
    Comprehensive risk analysis for a stock
    """
    # Download data
    data = yf.download(ticker, start=start_date, end=end_date, progress=False)
    prices = data['Adj Close']
    returns = prices.pct_change().dropna()
    # Calculate cumulative returns for drawdown
    cumulative = (1 + returns).cumprod()
    running_max = cumulative.expanding().max()
    drawdown = (cumulative - running_max) / running_max
    # Volatility metrics
    daily_vol = returns.std()
    annual_vol = daily_vol * np.sqrt(252)
    downside_returns = returns[returns < 0]
    downside_vol = downside_returns.std() * np.sqrt(252) if len(downside_returns) > 0 else 0
    # VaR and CVaR
    var_95 = returns.quantile(0.05)
    cvar_95 = returns[returns <= var_95].mean()
    # Risk metrics
    risk_metrics = {
        'Ticker': ticker,
        'Daily Volatility (%)': daily_vol * 100,
        'Annual Volatility (%)': annual_vol * 100,
        'Downside Deviation (%)': downside_vol * 100,
        'Maximum Drawdown (%)': drawdown.min() * 100,
        'VaR 95% (%)': var_95 * 100,
        'CVaR 95% (%)': cvar_95 * 100,
        'Best Day (%)': returns.max() * 100,
        'Worst Day (%)': returns.min() * 100,
        'Positive Days': (returns > 0).sum(),
        'Negative Days': (returns < 0).sum()
    }
    # Print results
    print("\n" + "="*60)
    print(f"RISK ANALYSIS: {ticker}")
    print("="*60)
    for key, value in risk_metrics.items():
        if isinstance(value, float):
            print(f"{key:.<45} {value:>10.2f}")
        else:
            print(f"{key:.<45} {value:>10}")
    print("="*60)
    return risk_metrics
# Analyze risk
risk = risk_analysis('AAPL', '2023-01-01', '2024-01-01')
4.3 Correlation Analysis
Understanding Correlation
Correlation measures how two assets move together. It ranges from -1 to +1:
- +1: Perfect positive correlation (move together)
- 0: No linear relationship (the assets may still be related in nonlinear ways)
- -1: Perfect negative correlation (move opposite)
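Note that a correlation of 0 only rules out a linear relationship. Two series can be completely dependent and still be uncorrelated, as this minimal sketch with constructed data shows.

```python
import numpy as np

# y is fully determined by x, but the relationship is symmetric, not linear
x = np.linspace(-1, 1, 201)
y = x ** 2

corr = np.corrcoef(x, y)[0, 1]
print(f"Correlation between x and x^2: {corr:.6f}")
```

The correlation is zero up to floating-point error even though knowing x tells you y exactly. For asset returns this means a low correlation does not guarantee that two assets will behave independently, for example during a market crash.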
Why Correlation Matters
- Diversification: Low correlation between assets reduces portfolio risk
- Hedging: Negative correlation can protect against losses
- Trading Strategies: Pairs trading exploits correlations
Calculating Correlation
import yfinance as yf
import pandas as pd
# Download data for multiple stocks
tickers = ['AAPL', 'MSFT', 'GOOGL', 'JPM', 'XOM']
data = yf.download(tickers, start='2023-01-01', end='2024-01-01', progress=False)
prices = data['Adj Close']
returns = prices.pct_change().dropna()
# Calculate correlation matrix
correlation_matrix = returns.corr()
print("Correlation Matrix:")
print(correlation_matrix.round(2))
# Get correlation between two specific stocks
aapl_msft_corr = returns['AAPL'].corr(returns['MSFT'])
print(f"\nAAPL-MSFT Correlation: {aapl_msft_corr:.3f}")
Interpreting Correlation
# High correlation pairs (>0.7)
high_corr = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if correlation_matrix.iloc[i, j] > 0.7:
            high_corr.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))
print("\nHighly Correlated Pairs (>0.7):")
for stock1, stock2, corr in high_corr:
    print(f"{stock1} - {stock2}: {corr:.3f}")
# Low correlation pairs (<0.3)
low_corr = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) < 0.3:
            low_corr.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))
print("\nLow Correlation Pairs (<0.3):")
for stock1, stock2, corr in low_corr:
    print(f"{stock1} - {stock2}: {corr:.3f}")
Rolling Correlation
Correlation changes over time. Rolling correlation shows how relationships evolve.
# Calculate 30-day rolling correlation between AAPL and MSFT
rolling_corr = returns['AAPL'].rolling(window=30).corr(returns['MSFT'])
print("Rolling correlation (last 10 days):")
print(rolling_corr.tail(10))
# Summary statistics
print(f"\nMean correlation: {rolling_corr.mean():.3f}")
print(f"Min correlation: {rolling_corr.min():.3f}")
print(f"Max correlation: {rolling_corr.max():.3f}")
Covariance
Covariance is related to correlation but not standardized (it depends on the volatility of each asset).
# Calculate covariance matrix
covariance_matrix = returns.cov()
print("Covariance Matrix:")
print(covariance_matrix)
# Annualize covariance
annual_cov_matrix = covariance_matrix * 252
print("\nAnnualized Covariance Matrix:")
print(annual_cov_matrix)
# Relationship: correlation = covariance / (std1 * std2)
aapl_std = returns['AAPL'].std()
msft_std = returns['MSFT'].std()
aapl_msft_cov = returns['AAPL'].cov(returns['MSFT'])
calculated_corr = aapl_msft_cov / (aapl_std * msft_std)
print(f"\nCorrelation from covariance: {calculated_corr:.3f}")
print(f"Direct correlation: {aapl_msft_corr:.3f}")
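The covariance matrix is also what links correlation to diversification: portfolio variance is the weights times the covariance matrix times the weights. The sketch below uses hypothetical volatilities, correlation, and weights to show a two-asset portfolio whose volatility is lower than the weighted average of its components.

```python
import numpy as np

# Hypothetical annualized volatilities and correlation for two assets
vol = np.array([0.20, 0.30])
corr = 0.25

# Build the covariance matrix from covariance = correlation * std1 * std2
off_diagonal = corr * vol[0] * vol[1]
cov = np.array([
    [vol[0] ** 2, off_diagonal],
    [off_diagonal, vol[1] ** 2],
])

# Equal-weight portfolio
weights = np.array([0.5, 0.5])
portfolio_vol = float(np.sqrt(weights @ cov @ weights))
weighted_avg_vol = float(weights @ vol)

print(f"Portfolio volatility:        {portfolio_vol * 100:.2f}%")
print(f"Weighted-average volatility: {weighted_avg_vol * 100:.2f}%")
```

The gap between the two figures is the diversification benefit. It shrinks toward zero as the correlation approaches +1.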
Practical Example: Diversification Analysis
import yfinance as yf
import pandas as pd
import numpy as np
def diversification_analysis(tickers, start_date, end_date):
    """
    Analyze diversification potential across assets
    """
    # Download data
    data = yf.download(tickers, start=start_date, end=end_date, progress=False)
    prices = data['Adj Close']
    returns = prices.pct_change().dropna()
    # Calculate correlation matrix
    corr_matrix = returns.corr()
    # Calculate average correlation for each stock
    avg_correlations = {}
    for ticker in tickers:
        # Exclude correlation with itself
        other_tickers = [t for t in tickers if t != ticker]
        avg_corr = corr_matrix.loc[ticker, other_tickers].mean()
        avg_correlations[ticker] = avg_corr
    # Find best diversifiers (lowest average correlation)
    diversifiers = sorted(avg_correlations.items(), key=lambda x: x[1])
    print("\n" + "="*60)
    print("DIVERSIFICATION ANALYSIS")
    print("="*60)
    print("\nAverage Correlation with Other Assets:")
    print("-" * 40)
    for ticker, avg_corr in diversifiers:
        print(f"{ticker:.<20} {avg_corr:>10.3f}")
    print(f"\nBest diversifier: {diversifiers[0][0]} (avg corr: {diversifiers[0][1]:.3f})")
    print(f"Worst diversifier: {diversifiers[-1][0]} (avg corr: {diversifiers[-1][1]:.3f})")
    # Overall portfolio correlation
    # Average of all pairwise correlations (upper triangle, excluding the diagonal).
    # Indexing the upper triangle directly keeps pairs whose correlation is exactly 0.
    upper_idx = np.triu_indices_from(corr_matrix.values, k=1)
    avg_portfolio_corr = corr_matrix.values[upper_idx].mean()
    print(f"\nOverall portfolio average correlation: {avg_portfolio_corr:.3f}")
    if avg_portfolio_corr < 0.5:
        print("✓ Good diversification (low correlation)")
    elif avg_portfolio_corr < 0.7:
        print("⚠ Moderate diversification")
    else:
        print("✗ Poor diversification (high correlation)")
    return corr_matrix, avg_correlations
# Analyze diversification
tickers = ['AAPL', 'MSFT', 'JPM', 'XOM', 'JNJ', 'WMT', 'GLD']
corr_matrix, avg_corrs = diversification_analysis(tickers, '2023-01-01', '2024-01-01')
4.4 Time Series Analysis
Identifying Trends
Trends are sustained directional movements in prices.
import yfinance as yf
import pandas as pd
import numpy as np
# Download data
data = yf.download('AAPL', start='2023-01-01', end='2024-01-01', progress=False)
prices = data['Adj Close']
# Calculate moving averages
ma_20 = prices.rolling(window=20).mean()
ma_50 = prices.rolling(window=50).mean()
ma_200 = prices.rolling(window=200).mean()
# Current trend identification
current_price = prices.iloc[-1]
current_ma20 = ma_20.iloc[-1]
current_ma50 = ma_50.iloc[-1]
print(f"Current Price: ${current_price:.2f}")
print(f"20-day MA: ${current_ma20:.2f}")
print(f"50-day MA: ${current_ma50:.2f}")
# Determine trend
if current_price > current_ma20 > current_ma50:
    print("\nTrend: Strong Uptrend ↑")
elif current_price > current_ma50:
    print("\nTrend: Uptrend ↑")
elif current_price < current_ma20 < current_ma50:
    print("\nTrend: Strong Downtrend ↓")
elif current_price < current_ma50:
    print("\nTrend: Downtrend ↓")
else:
    # Reached only when the price equals the 50-day MA (or a moving average is NaN)
    print("\nTrend: Sideways/Neutral →")
Seasonality Analysis
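Some stocks exhibit seasonal patterns. With a single year of data each month appears only once, so treat the averages below as descriptive; detecting genuine seasonality requires several years of history.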
# Extract month from date
returns = prices.pct_change()
returns_with_month = pd.DataFrame({
'returns': returns,
'month': returns.index.month
})
# Calculate average return by month
monthly_avg = returns_with_month.groupby('month')['returns'].mean() * 100
print("Average Returns by Month:")
print("-" * 30)
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
for month, avg_return in monthly_avg.items():
    print(f"{month_names[month-1]:.<10} {avg_return:>8.3f}%")
# Best and worst months
best_month = monthly_avg.idxmax()
worst_month = monthly_avg.idxmin()
print(f"\nBest month: {month_names[best_month-1]} ({monthly_avg[best_month]:.3f}%)")
print(f"Worst month: {month_names[worst_month-1]} ({monthly_avg[worst_month]:.3f}%)")
Day of Week Effect
# Extract day of week
returns_with_dow = pd.DataFrame({
'returns': returns,
'day_of_week': returns.index.dayofweek # Monday=0, Sunday=6
})
# Calculate average return by day
daily_avg = returns_with_dow.groupby('day_of_week')['returns'].mean() * 100
print("\nAverage Returns by Day of Week:")
print("-" * 30)
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
for day, avg_return in daily_avg.items():
    if day < 5: # Exclude weekends
        print(f"{day_names[day]:.<15} {avg_return:>8.3f}%")
Autocorrelation
Autocorrelation measures how strongly a return series is correlated with a lagged copy of itself, which indicates whether past returns carry linear information about future returns.
# Calculate autocorrelation
autocorr_1 = returns.autocorr(lag=1)
autocorr_5 = returns.autocorr(lag=5)
autocorr_20 = returns.autocorr(lag=20)
print("Autocorrelation:")
print(f"1-day lag: {autocorr_1:.4f}")
print(f"5-day lag: {autocorr_5:.4f}")
print(f"20-day lag: {autocorr_20:.4f}")
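# Interpretation (0.05 is an arbitrary rule of thumb, not a significance test)
if abs(autocorr_1) < 0.05:
    print("\nLow autocorrelation - returns show little linear dependence on the previous day")
else:
    print("\nSome autocorrelation present - check whether it is statistically significant")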
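A fixed cutoff says nothing about sampling noise. For a series with no true autocorrelation, sample autocorrelations fall within roughly ±1.96/sqrt(n) about 95% of the time. The sketch below applies that approximate band to simulated white-noise returns; the data and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Simulated daily returns with no true autocorrelation (hypothetical data)
rng = np.random.default_rng(seed=1)
noise_returns = pd.Series(rng.normal(loc=0.0, scale=0.01, size=250))

# Approximate 95% band for the sample autocorrelation of white noise
n_obs = len(noise_returns)
band = 1.96 / np.sqrt(n_obs)

autocorrs = {lag: noise_returns.autocorr(lag=lag) for lag in (1, 5, 20)}
for lag, rho in autocorrs.items():
    verdict = "outside" if abs(rho) > band else "inside"
    print(f"Lag {lag:>2}: {rho:+.4f} ({verdict} the +/-{band:.3f} band)")
```

With about 250 observations the band is roughly ±0.12, so a one-year sample can easily produce autocorrelations above 0.05 by chance alone.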
Moving Average Crossovers
A classic technical analysis technique.
# Calculate crossovers
crossovers = pd.DataFrame({
'Price': prices,
'MA_20': ma_20,
'MA_50': ma_50
})
# Find golden crosses (MA_20 crosses above MA_50 - bullish)
crossovers['Signal'] = 0
crossovers.loc[crossovers['MA_20'] > crossovers['MA_50'], 'Signal'] = 1
crossovers.loc[crossovers['MA_20'] <= crossovers['MA_50'], 'Signal'] = -1
# Find where signal changes (actual crossover points)
crossovers['Position_Change'] = crossovers['Signal'].diff()
golden_crosses = crossovers[crossovers['Position_Change'] == 2]
death_crosses = crossovers[crossovers['Position_Change'] == -2]
print(f"\nGolden Crosses (bullish): {len(golden_crosses)}")
print(f"Death Crosses (bearish): {len(death_crosses)}")
if len(golden_crosses) > 0:
    print("\nMost recent Golden Cross:")
    print(golden_crosses.iloc[-1][['Price', 'MA_20', 'MA_50']])
4.5 Descriptive Statistics for Portfolios
Building Summary Statistics
import yfinance as yf
import pandas as pd
import numpy as np
def comprehensive_stats(ticker, start_date, end_date):
    """
    Generate comprehensive descriptive statistics
    """
    # Download data
    data = yf.download(ticker, start=start_date, end=end_date, progress=False)
    prices = data['Adj Close']
    returns = prices.pct_change().dropna()
    # Calculate cumulative returns
    cumulative = (1 + returns).cumprod()
    # Risk metrics
    daily_vol = returns.std()
    annual_vol = daily_vol * np.sqrt(252)
    # Return metrics
    total_return = (prices.iloc[-1] / prices.iloc[0]) - 1
    avg_return = returns.mean()
    annual_return = (1 + avg_return) ** 252 - 1
    # Drawdown
    running_max = cumulative.expanding().max()
    drawdown = (cumulative - running_max) / running_max
    max_dd = drawdown.min()
    # Sharpe ratio (assuming 0% risk-free rate)
    sharpe = (avg_return / daily_vol) * np.sqrt(252) if daily_vol > 0 else 0
    # Sortino ratio (using downside deviation); the numerator is the arithmetic
    # annual return (avg_return * 252) so it is comparable with the Sharpe ratio
    downside_returns = returns[returns < 0]
    downside_std = downside_returns.std() * np.sqrt(252) if len(downside_returns) > 0 else 0
    sortino = (avg_return * 252 / downside_std) if downside_std > 0 else 0
    # Skewness and kurtosis (pandas reports excess kurtosis: normal distribution = 0)
    skew = returns.skew()
    kurt = returns.kurtosis()
    # Compile statistics
    stats = {
        'Total Return (%)': total_return * 100,
        'Annual Return (%)': annual_return * 100,
        'Annual Volatility (%)': annual_vol * 100,
        'Sharpe Ratio': sharpe,
        'Sortino Ratio': sortino,
        'Maximum Drawdown (%)': max_dd * 100,
        'Skewness': skew,
        'Kurtosis': kurt,
        'Best Day (%)': returns.max() * 100,
        'Worst Day (%)': returns.min() * 100,
        'Win Rate (%)': (returns > 0).sum() / len(returns) * 100
    }
    return pd.Series(stats)
# Get stats for multiple stocks
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN']
all_stats = {}
for ticker in tickers:
    all_stats[ticker] = comprehensive_stats(ticker, '2023-01-01', '2024-01-01')
# Create comparison DataFrame
comparison = pd.DataFrame(all_stats).T
print("COMPREHENSIVE STATISTICS COMPARISON")
print("="*80)
print(comparison.round(2))
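The Sharpe ratio in the function above assumes a 0% risk-free rate. A minimal sketch of how a nonzero rate could be incorporated follows; the helper name, the sample returns, and the 4% annual rate are illustrative assumptions, and dividing the annual rate by 252 is a simple approximation of the daily rate.

```python
import numpy as np

def sharpe_ratio(daily_returns, annual_risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from daily simple returns (illustrative helper)."""
    daily_returns = np.asarray(daily_returns, dtype=float)
    # Approximate daily risk-free rate by simple division (assumption)
    daily_rf = annual_risk_free / periods_per_year
    excess = daily_returns - daily_rf
    vol = excess.std(ddof=1)
    if vol == 0:
        return float("nan")
    return excess.mean() / vol * np.sqrt(periods_per_year)

# Hypothetical daily returns
sample = np.array([0.01, -0.005, 0.007, 0.002, -0.003, 0.004])

sharpe_zero_rf = sharpe_ratio(sample)
sharpe_4pct_rf = sharpe_ratio(sample, annual_risk_free=0.04)

print(f"Sharpe (0% risk-free): {sharpe_zero_rf:.2f}")
print(f"Sharpe (4% risk-free): {sharpe_4pct_rf:.2f}")
```

Subtracting a positive risk-free rate lowers the ratio, which matters most when comparing assets with modest returns.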
Percentile Analysis
# Download data
data = yf.download('AAPL', start='2023-01-01', end='2024-01-01', progress=False)
returns = data['Adj Close'].pct_change().dropna()
# Calculate percentiles
percentiles = [1, 5, 10, 25, 50, 75, 90, 95, 99]
percentile_values = [returns.quantile(p/100) for p in percentiles]
print("Return Distribution Percentiles:")
print("-" * 40)
for p, value in zip(percentiles, percentile_values):
    print(f"{p:>2}th percentile: {value*100:>8.3f}%")
# Current vs historical
recent_return = returns.iloc[-1]
percentile_rank = (returns < recent_return).sum() / len(returns) * 100
print(f"\nMost recent return: {recent_return*100:.3f}%")
print(f"This is at the {percentile_rank:.1f}th percentile historically")
Performance Attribution
def performance_attribution(ticker, start_date, end_date):
    """
    Break down performance by time period
    """
    data = yf.download(ticker, start=start_date, end=end_date, progress=False)
    returns = data['Adj Close'].pct_change()
    # Add time periods
    df = pd.DataFrame({'returns': returns})
    df['year'] = df.index.year
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    # Calculate returns by period
    yearly = df.groupby('year')['returns'].apply(lambda x: (1 + x).prod() - 1) * 100
    quarterly = df.groupby(['year', 'quarter'])['returns'].apply(lambda x: (1 + x).prod() - 1) * 100
    monthly = df.groupby(['year', 'month'])['returns'].apply(lambda x: (1 + x).prod() - 1) * 100
    print(f"\nAnnual Returns for {ticker}:")
    print("-" * 30)
    for year, ret in yearly.items():
        print(f"{year}: {ret:>8.2f}%")
    print("\nQuarterly Returns:")
    print("-" * 30)
    for (year, quarter), ret in quarterly.tail(8).items():
        print(f"{year} Q{quarter}: {ret:>8.2f}%")
    return yearly, quarterly, monthly
# Analyze performance
yearly, quarterly, monthly = performance_attribution('AAPL', '2021-01-01', '2024-01-01')
4.6 Practice Exercises
Exercise 1: Returns Analysis Challenge
# Your task: Complete returns analysis for a stock of your choice
# 1. Download 3 years of data
# 2. Calculate both simple and log returns
# 3. Find the best and worst months
# 4. Calculate the Sharpe ratio
# 5. Determine the probability of positive returns
# 6. Compare 1-year returns vs 2-year returns vs 3-year returns
Exercise 2: Risk Comparison
# Your task: Compare risk profiles of different assets
# 1. Download data for: AAPL (stock), GLD (gold), TLT (bonds), SPY (S&P 500)
# 2. Calculate annualized volatility for each
# 3. Calculate maximum drawdown for each
# 4. Calculate VaR and CVaR for each
# 5. Rank them from safest to riskiest
# 6. Create a summary table comparing all metrics
Exercise 3: Correlation Study
# Your task: Explore correlations across sectors
# 1. Choose 2 stocks from each sector: Tech, Finance, Healthcare, Energy
# 2. Calculate correlation matrix
# 3. Find the most correlated pair
# 4. Find the least correlated pair
# 5. Calculate rolling 60-day correlation for most correlated pair
# 6. Identify when correlation was strongest and weakest
Exercise 4: Build an EDA Dashboard
# Your task: Create a comprehensive exploratory analysis function
# The function should take a ticker and date range and return:
# 1. Price summary (start, end, min, max, average)
# 2. Return statistics (mean, std, skew, kurtosis)
# 3. Risk metrics (volatility, max drawdown, VaR)
# 4. Performance metrics (total return, CAGR, Sharpe)
# 5. Calendar analysis (best/worst month, day of week effect)
# 6. Technical signals (MA crossovers, trend direction)
# Make the output clear and professionally formatted
Module 4 Summary
Congratulations! You've mastered exploratory data analysis for financial data.
What You've Accomplished
Returns Analysis
- Understanding simple vs logarithmic returns
- Calculating multi-period and cumulative returns
- Annualizing returns for comparison
- Analyzing return distributions
Risk Measurement
- Measuring volatility with standard deviation
- Calculating downside deviation
- Finding maximum drawdown
- Computing Value at Risk (VaR) and CVaR
- Understanding different risk metrics
Correlation Analysis
- Calculating correlation between assets
- Using correlation for diversification
- Analyzing rolling correlations
- Understanding covariance
Time Series Analysis
- Identifying trends with moving averages
- Detecting seasonality patterns
- Finding autocorrelation
- Recognizing technical signals
Statistical Proficiency
- Generating comprehensive descriptive statistics
- Performing percentile analysis
- Breaking down performance by time period
- Building complete analytical frameworks
Real-World Skills
You can now:
- Analyze any stock's risk and return profile
- Compare assets objectively using multiple metrics
- Identify patterns and trends in market data
- Build portfolios with proper diversification
- Perform professional-grade exploratory analysis
- Create comprehensive financial reports
What's Next
In Module 5, we'll take these analytical skills and create stunning visualizations. You'll learn to build professional charts, graphs, and dashboards that communicate your insights effectively. Data analysis is powerful; data visualization makes it persuasive.
Before Moving Forward
Ensure you're comfortable with:
- Calculating and interpreting returns
- Understanding various risk metrics
- Working with correlation matrices
- Identifying trends in time series
- Generating summary statistics
Practice Recommendations
- Daily Practice: Analyze a different stock each day
- Compare Sectors: Look at how different industries behave
- Economic Events: Study how stocks react to news
- Portfolio Thinking: Start considering combinations of assets
- Documentation: Keep notes on interesting patterns you discover
The Foundation is Set
You now have the analytical toolkit used by professional quantitative analysts. These metrics—returns, volatility, correlation, Sharpe ratios—form the language of modern finance. You're not just crunching numbers; you're extracting insights that drive investment decisions.
The next step is making these insights visible and compelling through visualization.
Continue to Module 5: Data Visualization →

