Module 9: Statistical Analysis for Finance
Learning Objectives
By the end of this module, you will:
- Perform hypothesis testing on financial data
- Build and interpret regression models
- Understand and test for statistical significance
- Conduct time series analysis and forecasting
- Apply ARIMA and GARCH models
- Test for market efficiency and anomalies
- Calculate and interpret confidence intervals
- Perform factor analysis and attribution
- Build predictive models for returns
9.1 Hypothesis Testing in Finance
Understanding Hypothesis Testing
Hypothesis testing helps us determine if observed patterns in financial data are statistically significant or just random noise.
The Process:
- Formulate null hypothesis (H₀) and alternative hypothesis (H₁)
- Choose significance level (typically α = 0.05)
- Calculate test statistic
- Determine p-value
- Make decision: reject or fail to reject H₀
Testing Average Returns
import numpy as np
import pandas as pd
import yfinance as yf
from scipy import stats
import matplotlib.pyplot as plt
# Download stock data
# auto_adjust=False keeps the 'Adj Close' column (newer yfinance versions
# adjust prices in place by default and drop it)
data = yf.download('AAPL', start='2020-01-01', end='2024-01-01',
                   auto_adjust=False, progress=False)
returns = data['Adj Close'].pct_change().dropna()
# Test if average return is significantly different from zero
# H₀: μ = 0 (no average return)
# H₁: μ ≠ 0 (significant average return)
sample_mean = returns.mean()
sample_std = returns.std()
sample_size = len(returns)
# Calculate t-statistic
t_statistic = (sample_mean - 0) / (sample_std / np.sqrt(sample_size))
# Calculate p-value (two-tailed test)
p_value = 2 * (1 - stats.t.cdf(abs(t_statistic), df=sample_size - 1))
# Using scipy's t-test function
t_stat, p_val = stats.ttest_1samp(returns, 0)
print("Hypothesis Test: Are Average Returns Significant?")
print("="*60)
print(f"Null Hypothesis: Mean return = 0")
print(f"Alternative: Mean return ≠ 0")
print(f"\nSample Size: {sample_size}")
print(f"Sample Mean: {sample_mean*100:.4f}%")
print(f"Sample Std Dev: {sample_std*100:.4f}%")
print(f"\nT-statistic: {t_stat:.4f}")
print(f"P-value: {p_val:.4f}")
print(f"Significance Level: 0.05")
if p_val < 0.05:
    print(f"\nConclusion: REJECT null hypothesis (p < 0.05)")
    print(f"The average return is statistically significant.")
else:
    print(f"\nConclusion: FAIL TO REJECT null hypothesis (p >= 0.05)")
    print(f"The average return is not statistically significant.")
# Calculate confidence interval
confidence_level = 0.95
degrees_freedom = sample_size - 1
confidence_interval = stats.t.interval(
    confidence_level,
    degrees_freedom,
    loc=sample_mean,
    scale=sample_std / np.sqrt(sample_size)
)
print(f"\n95% Confidence Interval: [{confidence_interval[0]*100:.4f}%, {confidence_interval[1]*100:.4f}%]")
Comparing Two Stocks
# Download data for two stocks
aapl_data = yf.download('AAPL', start='2020-01-01', end='2024-01-01',
                        auto_adjust=False, progress=False)
msft_data = yf.download('MSFT', start='2020-01-01', end='2024-01-01',
                        auto_adjust=False, progress=False)
aapl_returns = aapl_data['Adj Close'].pct_change().dropna()
msft_returns = msft_data['Adj Close'].pct_change().dropna()
# Align dates
aligned_returns = pd.DataFrame({
    'AAPL': aapl_returns,
    'MSFT': msft_returns
}).dropna()
# Test if returns are different
# H₀: μ_AAPL = μ_MSFT
# H₁: μ_AAPL ≠ μ_MSFT
# Note: ttest_ind assumes equal variances; pass equal_var=False (Welch's test)
# if the two stocks have clearly different volatilities
t_stat, p_val = stats.ttest_ind(
    aligned_returns['AAPL'],
    aligned_returns['MSFT']
)
print("\n" + "="*60)
print("Comparing Two Stocks: AAPL vs MSFT")
print("="*60)
print(f"AAPL Mean Return: {aligned_returns['AAPL'].mean()*100:.4f}%")
print(f"MSFT Mean Return: {aligned_returns['MSFT'].mean()*100:.4f}%")
print(f"\nT-statistic: {t_stat:.4f}")
print(f"P-value: {p_val:.4f}")
if p_val < 0.05:
    print(f"\nConclusion: Returns are SIGNIFICANTLY DIFFERENT")
else:
    print(f"\nConclusion: No significant difference in returns")
Testing for Normality
Financial returns often deviate from the normal distribution, so it is worth testing that assumption before relying on methods that require it.
from scipy.stats import shapiro, jarque_bera, kstest
# Download data
data = yf.download('AAPL', start='2020-01-01', end='2024-01-01',
                   auto_adjust=False, progress=False)
returns = data['Adj Close'].pct_change().dropna()
# Shapiro-Wilk test
shapiro_stat, shapiro_p = shapiro(returns)
# Jarque-Bera test
jb_stat, jb_p = jarque_bera(returns)
# Kolmogorov-Smirnov test
ks_stat, ks_p = kstest(returns, 'norm',
                       args=(returns.mean(), returns.std()))
print("\n" + "="*60)
print("Testing for Normality")
print("="*60)
print("H₀: Returns are normally distributed")
print("H₁: Returns are not normally distributed")
print(f"\nShapiro-Wilk Test:")
print(f" Statistic: {shapiro_stat:.4f}")
print(f" P-value: {shapiro_p:.4f}")
print(f"\nJarque-Bera Test:")
print(f" Statistic: {jb_stat:.4f}")
print(f" P-value: {jb_p:.4f}")
print(f"\nKolmogorov-Smirnov Test:")
print(f" Statistic: {ks_stat:.4f}")
print(f" P-value: {ks_p:.4f}")
if jb_p < 0.05:
    print(f"\nConclusion: Returns are NOT normally distributed (p < 0.05)")
    print("This is common in financial data (fat tails, skewness)")
else:
    print(f"\nConclusion: Cannot reject normality assumption")
# Calculate skewness and kurtosis
skewness = returns.skew()
kurtosis = returns.kurtosis()
print(f"\nSkewness: {skewness:.4f}")
if abs(skewness) > 0.5:
    print(" (Distribution is skewed)")
else:
    print(" (Distribution is approximately symmetric)")
print(f"Excess Kurtosis: {kurtosis:.4f}")
if kurtosis > 0:
    print(" (Fat tails - more extreme values than normal distribution)")
else:
    print(" (Thin tails)")
9.2 Linear Regression
Single-Factor Model (Market Model)
import numpy as np
import pandas as pd
import yfinance as yf
from scipy import stats
import matplotlib.pyplot as plt
# Download stock and market data
stock = yf.download('AAPL', start='2020-01-01', end='2024-01-01',
                    auto_adjust=False, progress=False)
market = yf.download('^GSPC', start='2020-01-01', end='2024-01-01',
                     auto_adjust=False, progress=False)
stock_returns = stock['Adj Close'].pct_change().dropna()
market_returns = market['Adj Close'].pct_change().dropna()
# Align data
returns_df = pd.DataFrame({
    'Stock': stock_returns,
    'Market': market_returns
}).dropna()
# Perform regression
slope, intercept, r_value, p_value, std_err = stats.linregress(
    returns_df['Market'],
    returns_df['Stock']
)
# Calculate fitted values
fitted_values = intercept + slope * returns_df['Market']
residuals = returns_df['Stock'] - fitted_values
print("Linear Regression: Market Model")
print("="*60)
print("Model: R_stock = α + β × R_market + ε")
print(f"\nAlpha (Intercept): {intercept*100:.4f}%")
print(f"Beta (Slope): {slope:.4f}")
print(f"R-squared: {r_value**2:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Standard Error: {std_err:.4f}")
print(f"\nInterpretation:")
print(f"- For every 1% market return, stock returns {slope:.4f}%")
if slope > 1:
print(f"- Stock is MORE volatile than market (β > 1)")
elif slope < 1:
print(f"- Stock is LESS volatile than market (β < 1)")
else:
print(f"- Stock moves with market (β ≈ 1)")
if p_value < 0.05:
print(f"- Relationship is statistically significant")
else:
print(f"- Relationship is NOT statistically significant")
# Plot regression
plt.figure(figsize=(12, 8))
# Scatter plot
plt.scatter(returns_df['Market']*100, returns_df['Stock']*100,
            alpha=0.5, s=30, edgecolors='black', linewidth=0.5)
# Regression line
x_line = np.array([returns_df['Market'].min(), returns_df['Market'].max()])
y_line = (intercept + slope * x_line) * 100
plt.plot(x_line*100, y_line, 'r-', linewidth=2,
         label=f'y = {intercept*100:.4f} + {slope:.4f}x\nR² = {r_value**2:.4f}')
plt.xlabel('Market Return (%)', fontsize=12)
plt.ylabel('Stock Return (%)', fontsize=12)
plt.title('Market Model Regression: AAPL vs S&P 500',
          fontsize=16, fontweight='bold', pad=20)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='black', linewidth=0.5)
plt.axvline(x=0, color='black', linewidth=0.5)
plt.tight_layout()
plt.show()
# Residual analysis
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Residual plot
ax1.scatter(fitted_values*100, residuals*100, alpha=0.5, s=30)
ax1.axhline(y=0, color='red', linestyle='--', linewidth=1)
ax1.set_xlabel('Fitted Values (%)', fontsize=11)
ax1.set_ylabel('Residuals (%)', fontsize=11)
ax1.set_title('Residual Plot', fontsize=13, fontweight='bold')
ax1.grid(True, alpha=0.3)
# Q-Q plot
stats.probplot(residuals, dist="norm", plot=ax2)
ax2.set_title('Q-Q Plot', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Multi-Factor Model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Download multiple factors
spy = yf.download('^GSPC', start='2020-01-01', end='2024-01-01',
                  auto_adjust=False, progress=False)  # Market
tlt = yf.download('TLT', start='2020-01-01', end='2024-01-01',
                  auto_adjust=False, progress=False)  # Bonds
gld = yf.download('GLD', start='2020-01-01', end='2024-01-01',
                  auto_adjust=False, progress=False)  # Gold
# Calculate returns
factors_df = pd.DataFrame({
    'Stock': stock['Adj Close'].pct_change(),
    'Market': spy['Adj Close'].pct_change(),
    'Bonds': tlt['Adj Close'].pct_change(),
    'Gold': gld['Adj Close'].pct_change(),
}).dropna()
# Prepare data for regression
X = factors_df[['Market', 'Bonds', 'Gold']]
y = factors_df['Stock']
# Fit model
model = LinearRegression()
model.fit(X, y)
# Get results
r_squared = r2_score(y, model.predict(X))
print("\n" + "="*60)
print("Multi-Factor Regression Model")
print("="*60)
print(f"Model: R_stock = α + β₁×Market + β₂×Bonds + β₃×Gold + ε")
print(f"\nIntercept (α): {model.intercept_*100:.4f}%")
print(f"\nCoefficients:")
for factor, coef in zip(X.columns, model.coef_):
    print(f" {factor}: {coef:.4f}")
print(f"\nR-squared: {r_squared:.4f}")
# Compare to single-factor model
print(f"\nModel Comparison:")
print(f" Single-factor R²: {r_value**2:.4f}")
print(f" Multi-factor R²: {r_squared:.4f}")
print(f" Improvement: {(r_squared - r_value**2):.4f}")
# Statistical significance of coefficients
import statsmodels.api as sm
X_with_const = sm.add_constant(X)
ols_model = sm.OLS(y, X_with_const).fit()
print(f"\nCoefficient Significance:")
print(ols_model.summary().tables[1])
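When factors are correlated with each other, individual coefficient estimates become unstable. A quick multicollinearity check, as a sketch using statsmodels' variance inflation factor on the X_with_const matrix from above (a VIF above roughly 5-10 is a warning sign):
from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF measures how much each factor is explained by the other factors
for i, col in enumerate(X_with_const.columns):
    if col != 'const':
        vif = variance_inflation_factor(X_with_const.values, i)
        print(f"  {col}: VIF = {vif:.2f}")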
9.3 Time Series Analysis
Autocorrelation and Stationarity
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Download data
data = yf.download('AAPL', start='2020-01-01', end='2024-01-01',
                   auto_adjust=False, progress=False)
prices = data['Adj Close']
returns = prices.pct_change().dropna()
# Augmented Dickey-Fuller test for stationarity
# H₀: Series has unit root (non-stationary)
# H₁: Series is stationary
print("Testing for Stationarity")
print("="*60)
# Test prices
adf_prices = adfuller(prices.dropna())
print("\nPrices:")
print(f" ADF Statistic: {adf_prices[0]:.4f}")
print(f" P-value: {adf_prices[1]:.4f}")
print(f" Critical Values: {adf_prices[4]}")
if adf_prices[1] < 0.05:
    print(" Conclusion: Stationary (p < 0.05)")
else:
    print(" Conclusion: Non-stationary (p >= 0.05)")
# Test returns
adf_returns = adfuller(returns.dropna())
print("\nReturns:")
print(f" ADF Statistic: {adf_returns[0]:.4f}")
print(f" P-value: {adf_returns[1]:.4f}")
print(f" Critical Values: {adf_returns[4]}")
if adf_returns[1] < 0.05:
    print(" Conclusion: Stationary (p < 0.05)")
else:
    print(" Conclusion: Non-stationary (p >= 0.05)")
# Plot ACF and PACF
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Prices - ACF
plot_acf(prices.dropna(), lags=40, ax=axes[0, 0])
axes[0, 0].set_title('Prices: Autocorrelation Function', fontsize=12, fontweight='bold')
# Prices - PACF
plot_pacf(prices.dropna(), lags=40, ax=axes[0, 1])
axes[0, 1].set_title('Prices: Partial Autocorrelation Function', fontsize=12, fontweight='bold')
# Returns - ACF
plot_acf(returns.dropna(), lags=40, ax=axes[1, 0])
axes[1, 0].set_title('Returns: Autocorrelation Function', fontsize=12, fontweight='bold')
# Returns - PACF
plot_pacf(returns.dropna(), lags=40, ax=axes[1, 1])
axes[1, 1].set_title('Returns: Partial Autocorrelation Function', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()
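The ACF plot is a visual check; the Ljung-Box test makes it formal by jointly testing whether the first k autocorrelations are zero. A minimal sketch using statsmodels:
from statsmodels.stats.diagnostic import acorr_ljungbox
# H₀: no autocorrelation up to the given lag
lb_test = acorr_ljungbox(returns, lags=[10, 20], return_df=True)
print(lb_test)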
ARIMA Models
from statsmodels.tsa.arima.model import ARIMA
import warnings
warnings.filterwarnings('ignore')
# Fit ARIMA model to returns
# ARIMA(p, d, q) where:
# p = autoregressive order
# d = differencing order
# q = moving average order
# For returns (already stationary), d=0
model = ARIMA(returns.dropna(), order=(1, 0, 1))
results = model.fit()
print("\n" + "="*60)
print("ARIMA Model Results")
print("="*60)
print(results.summary())
# Forecast next 30 days
forecast_steps = 30
forecast = results.forecast(steps=forecast_steps)
forecast_index = pd.date_range(
    start=returns.index[-1] + pd.Timedelta(days=1),
    periods=forecast_steps,
    freq='B'  # business days, to match trading days
)
# Plot forecast
plt.figure(figsize=(14, 7))
# Historical returns
plt.plot(returns.index[-100:], returns.values[-100:],
         label='Historical Returns', linewidth=1.5, color='blue')
# Forecast
plt.plot(forecast_index, forecast.values,
         label='Forecast', linewidth=2, color='red', linestyle='--')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Returns', fontsize=12)
plt.title('ARIMA Forecast: Next 30 Days', fontsize=16, fontweight='bold', pad=20)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nForecast Statistics:")
print(f" Mean: {forecast.mean()*100:.4f}%")
print(f" Std Dev: {forecast.std()*100:.4f}%")
print(f" Min: {forecast.min()*100:.4f}%")
print(f" Max: {forecast.max()*100:.4f}%")
GARCH Models (Volatility Modeling)
from arch import arch_model
# Fit GARCH(1,1) model
# Models conditional volatility (heteroskedasticity)
returns_pct = returns * 100 # Convert to percentage for better scaling
garch_model = arch_model(
    returns_pct.dropna(),
    vol='Garch',
    p=1,  # ARCH order (lagged squared shocks)
    q=1   # GARCH order (lagged conditional variance)
)
garch_results = garch_model.fit(disp='off')
print("\n" + "="*60)
print("GARCH(1,1) Model Results")
print("="*60)
print(garch_results.summary())
# Plot conditional volatility
conditional_vol = garch_results.conditional_volatility
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10), sharex=True)
# Returns
ax1.plot(returns.index, returns*100, linewidth=1, alpha=0.7)
ax1.set_ylabel('Returns (%)', fontsize=11)
ax1.set_title('Daily Returns', fontsize=13, fontweight='bold')
ax1.grid(True, alpha=0.3)
# Conditional volatility
ax2.plot(conditional_vol.index, conditional_vol.values,
         linewidth=1.5, color='red')
ax2.set_ylabel('Volatility (%)', fontsize=11)
ax2.set_xlabel('Date', fontsize=11)
ax2.set_title('Conditional Volatility (GARCH)', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Forecast volatility
volatility_forecast = garch_results.forecast(horizon=30)
forecast_variance = volatility_forecast.variance.values[-1, :]
forecast_vol = np.sqrt(forecast_variance)
print(f"\nVolatility Forecast (Next 30 Days):")
print(f" Mean: {forecast_vol.mean():.4f}%")
print(f" Min: {forecast_vol.min():.4f}%")
print(f" Max: {forecast_vol.max():.4f}%")
9.4 Testing Market Efficiency
Random Walk Test
# Test if price changes are random (efficient market hypothesis)
data = yf.download('AAPL', start='2020-01-01', end='2024-01-01',
                   auto_adjust=False, progress=False)
returns = data['Adj Close'].pct_change().dropna()
# Runs test for randomness
from statsmodels.sandbox.stats.runs import runstest_1samp
# Convert returns to binary (positive/negative)
binary_returns = (returns > 0).astype(int)
# Runs test
runs_stat, runs_p = runstest_1samp(binary_returns, cutoff='mean')
print("Testing for Random Walk (Market Efficiency)")
print("="*60)
print("H₀: Returns are random (efficient market)")
print("H₁: Returns are not random (market inefficiency)")
print(f"\nRuns Test:")
print(f" Statistic: {runs_stat:.4f}")
print(f" P-value: {runs_p:.4f}")
if runs_p < 0.05:
    print(f"\nConclusion: Returns are NOT random (p < 0.05)")
    print("Evidence against market efficiency")
else:
    print(f"\nConclusion: Cannot reject randomness (p >= 0.05)")
    print("Consistent with market efficiency")
# Autocorrelation test
lag_1_autocorr = returns.autocorr(lag=1)
n = len(returns)
se = 1 / np.sqrt(n)
z_score = lag_1_autocorr / se
p_val_autocorr = 2 * (1 - stats.norm.cdf(abs(z_score)))
print(f"\nLag-1 Autocorrelation Test:")
print(f" Correlation: {lag_1_autocorr:.4f}")
print(f" Z-score: {z_score:.4f}")
print(f" P-value: {p_val_autocorr:.4f}")
if abs(lag_1_autocorr) > 2 * se:
    print(f" Significant autocorrelation detected")
else:
    print(f" No significant autocorrelation")
Variance Ratio Test
def variance_ratio_test(returns, q):
    """
    Lo-MacKinlay variance ratio test for a random walk.
    Under the random walk hypothesis, VR(q) should equal 1.
    """
    n = len(returns)
    # Variance of 1-period returns
    var_1 = np.var(returns, ddof=1)
    # Variance of overlapping q-period returns
    returns_q = returns.rolling(window=q).sum().dropna()
    var_q = np.var(returns_q, ddof=1)
    # Variance ratio
    vr = var_q / (q * var_1)
    # Asymptotic variance of VR(q) under the null (homoskedastic case)
    phi = 2 * (2*q - 1) * (q - 1) / (3 * q * n)
    test_stat = (vr - 1) / np.sqrt(phi)
    # P-value (two-tailed)
    p_value = 2 * (1 - stats.norm.cdf(abs(test_stat)))
    return vr, test_stat, p_value
print("\n" + "="*60)
print("Variance Ratio Test")
print("="*60)
for q in [2, 5, 10, 20]:
    vr, test_stat, p_val = variance_ratio_test(returns, q)
    print(f"\nHolding Period: {q} days")
    print(f" Variance Ratio: {vr:.4f}")
    print(f" Test Statistic: {test_stat:.4f}")
    print(f" P-value: {p_val:.4f}")
    if p_val < 0.05:
        print(f" Reject random walk (p < 0.05)")
    else:
        print(f" Consistent with random walk")
9.5 Factor Models and Attribution
Fama-French Three-Factor Model
# Download Fama-French factors from Kenneth French's data library
# For this example, we'll create synthetic factors
# Download stock data
stock_data = yf.download('AAPL', start='2020-01-01', end='2024-01-01',
                         auto_adjust=False, progress=False)
market_data = yf.download('^GSPC', start='2020-01-01', end='2024-01-01',
                          auto_adjust=False, progress=False)
stock_returns = stock_data['Adj Close'].pct_change().dropna()
market_returns = market_data['Adj Close'].pct_change().dropna()
# Create simplified factor data
# In practice, you'd download from Kenneth French's website
factors = pd.DataFrame({
    'Stock': stock_returns,
    'Mkt_RF': market_returns,  # Market minus risk-free
    'SMB': pd.Series(np.random.randn(len(stock_returns)) * 0.01,
                     index=stock_returns.index),  # Small minus Big (simplified)
    'HML': pd.Series(np.random.randn(len(stock_returns)) * 0.01,
                     index=stock_returns.index)   # High minus Low (simplified)
}).dropna()
# Fit three-factor model
X = factors[['Mkt_RF', 'SMB', 'HML']]
y = factors['Stock']
import statsmodels.api as sm
X_with_const = sm.add_constant(X)
ff_model = sm.OLS(y, X_with_const).fit()
print("Fama-French Three-Factor Model")
print("="*60)
print("Model: R = α + β_MKT×Mkt_RF + β_SMB×SMB + β_HML×HML + ε")
print("\n" + str(ff_model.summary()))
print(f"\nInterpretation:")
print(f"Alpha: {ff_model.params['const']*100:.4f}% (excess return after controlling for factors)")
if ff_model.pvalues['const'] < 0.05:
    print(" Alpha is statistically significant - skill or mispricing")
else:
    print(" Alpha is not statistically significant")
Rolling Factor Analysis
def rolling_factor_analysis(stock_returns, market_returns, window=252):
    """
    Calculate rolling beta, alpha, and R-squared.
    """
    results = pd.DataFrame(index=stock_returns.index)
    betas = []
    alphas = []
    r_squareds = []
    for i in range(window, len(stock_returns)):
        stock_window = stock_returns.iloc[i-window:i]
        market_window = market_returns.iloc[i-window:i]
        # Regression
        slope, intercept, r_value, _, _ = stats.linregress(
            market_window,
            stock_window
        )
        betas.append(slope)
        alphas.append(intercept)
        r_squareds.append(r_value**2)
    results['Beta'] = np.nan
    results['Alpha'] = np.nan
    results['R_Squared'] = np.nan
    results.iloc[window:, results.columns.get_loc('Beta')] = betas
    results.iloc[window:, results.columns.get_loc('Alpha')] = alphas
    results.iloc[window:, results.columns.get_loc('R_Squared')] = r_squareds
    return results
# Calculate rolling metrics
rolling_results = rolling_factor_analysis(
    factors['Stock'],
    factors['Mkt_RF'],
    window=252
)
# Plot rolling beta
fig, axes = plt.subplots(3, 1, figsize=(14, 12), sharex=True)
# Beta
axes[0].plot(rolling_results.index, rolling_results['Beta'],
             linewidth=1.5, color='#2E86AB')
axes[0].axhline(y=1, color='red', linestyle='--', linewidth=1, alpha=0.5)
axes[0].set_ylabel('Beta', fontsize=11)
axes[0].set_title('Rolling Beta (252-day window)', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)
# Alpha
axes[1].plot(rolling_results.index, rolling_results['Alpha']*100,
             linewidth=1.5, color='#A23B72')
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1, alpha=0.5)
axes[1].set_ylabel('Alpha (%)', fontsize=11)
axes[1].set_title('Rolling Alpha (252-day window)', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)
# R-squared
axes[2].plot(rolling_results.index, rolling_results['R_Squared'],
             linewidth=1.5, color='#06A77D')
axes[2].set_ylabel('R²', fontsize=11)
axes[2].set_xlabel('Date', fontsize=11)
axes[2].set_title('Rolling R² (252-day window)', fontsize=13, fontweight='bold')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Summary statistics
print("\n" + "="*60)
print("Rolling Factor Analysis Summary")
print("="*60)
print(f"\nBeta:")
print(f" Mean: {rolling_results['Beta'].mean():.4f}")
print(f" Std Dev: {rolling_results['Beta'].std():.4f}")
print(f" Min: {rolling_results['Beta'].min():.4f}")
print(f" Max: {rolling_results['Beta'].max():.4f}")
print(f"\nAlpha:")
print(f" Mean: {rolling_results['Alpha'].mean()*100:.4f}%")
print(f" Std Dev: {rolling_results['Alpha'].std()*100:.4f}%")
9.6 Predictive Models
Building a Return Prediction Model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Download data
data = yf.download('AAPL', start='2018-01-01', end='2024-01-01',
                   auto_adjust=False, progress=False)
# Create features
df = pd.DataFrame()
df['Returns'] = data['Adj Close'].pct_change()
df['Volume'] = data['Volume']
df['High_Low'] = (data['High'] - data['Low']) / data['Close']
df['Close_Open'] = (data['Close'] - data['Open']) / data['Open']
# Lagged features
for lag in [1, 2, 3, 5, 10]:
    df[f'Return_Lag_{lag}'] = df['Returns'].shift(lag)
    df[f'Volume_Lag_{lag}'] = df['Volume'].shift(lag)
# Technical indicators
df['SMA_20'] = data['Close'].rolling(window=20).mean()
df['SMA_50'] = data['Close'].rolling(window=50).mean()
df['Price_SMA20'] = (data['Close'] - df['SMA_20']) / df['SMA_20']
# RSI
delta = data['Close'].diff()
gain = delta.where(delta > 0, 0).rolling(window=14).mean()
loss = -delta.where(delta < 0, 0).rolling(window=14).mean()
rs = gain / loss
df['RSI'] = 100 - (100 / (1 + rs))
# Target: Next day's return
df['Target'] = df['Returns'].shift(-1)
# Drop NaN values
df = df.dropna()
# Prepare features and target
feature_cols = [col for col in df.columns if col not in ['Returns', 'Target', 'SMA_20', 'SMA_50']]
X = df[feature_cols]
y = df['Target']
# Split data (80% train, 20% test)
split_idx = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]
print("Building Predictive Model for Returns")
print("="*60)
print(f"Total observations: {len(df)}")
print(f"Training set: {len(X_train)}")
print(f"Test set: {len(X_test)}")
print(f"Number of features: {len(feature_cols)}")
# Train model
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=20,
    random_state=42
)
rf_model.fit(X_train, y_train)
# Make predictions
y_pred_train = rf_model.predict(X_train)
y_pred_test = rf_model.predict(X_test)
# Evaluate
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
print(f"\nModel Performance:")
print(f"Training R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")
print(f"Training RMSE: {train_rmse*100:.4f}%")
print(f"Test RMSE: {test_rmse*100:.4f}%")
# Feature importance
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)
print(f"\nTop 10 Most Important Features:")
print(feature_importance.head(10).to_string(index=False))
# Plot predictions vs actual
plt.figure(figsize=(14, 7))
plt.scatter(y_test*100, y_pred_test*100, alpha=0.5, s=30)
plt.plot([y_test.min()*100, y_test.max()*100],
         [y_test.min()*100, y_test.max()*100],
         'r--', linewidth=2, label='Perfect Prediction')
plt.xlabel('Actual Returns (%)', fontsize=12)
plt.ylabel('Predicted Returns (%)', fontsize=12)
plt.title(f'Return Predictions: Test Set (R² = {test_r2:.4f})',
          fontsize=16, fontweight='bold', pad=20)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Trading strategy based on predictions
strategy_returns = []
buy_hold_returns = []
for i in range(len(y_test)):
    actual_return = y_test.iloc[i]
    predicted_return = y_pred_test[i]
    # If predicted positive, go long; if predicted negative, stay out
    if predicted_return > 0:
        strategy_returns.append(actual_return)
    else:
        strategy_returns.append(0)
    buy_hold_returns.append(actual_return)
strategy_cumulative = (1 + pd.Series(strategy_returns)).cumprod()
buy_hold_cumulative = (1 + pd.Series(buy_hold_returns)).cumprod()
print(f"\nStrategy Backtest (Test Period):")
print(f"Strategy Return: {(strategy_cumulative.iloc[-1] - 1)*100:.2f}%")
print(f"Buy & Hold Return: {(buy_hold_cumulative.iloc[-1] - 1)*100:.2f}%")
print(f"Outperformance: {((strategy_cumulative.iloc[-1] / buy_hold_cumulative.iloc[-1]) - 1)*100:.2f}%")
9.7 Practice Exercises
Exercise 1: Event Study
# Your task: Conduct an event study
# 1. Choose a major corporate event (earnings announcement, merger, etc.)
# 2. Download stock data around the event date (-30 to +30 days)
# 3. Calculate abnormal returns (actual - expected based on market model)
# 4. Calculate cumulative abnormal returns (CAR)
# 5. Test if CAR is significantly different from zero
# 6. Visualize the event impact
# 7. Determine event window significance
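For steps 3-4, the core computation is small. A sketch, assuming alpha and beta were estimated on a pre-event window and that stock_returns and market_returns are aligned daily Series (the numbers here are hypothetical):
# Abnormal return = actual return minus the market-model expectation
alpha, beta = 0.0002, 1.1  # hypothetical estimates from the estimation window
abnormal = stock_returns - (alpha + beta * market_returns)
car = abnormal.cumsum()  # cumulative abnormal return over the event window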
Exercise 2: Pairs Trading Statistical Test
# Your task: Test for cointegration between two stocks
# 1. Choose two potentially related stocks
# 2. Download 2+ years of price data
# 3. Test for cointegration using Engle-Granger test
# 4. If cointegrated, calculate the spread
# 5. Test spread for mean reversion
# 6. Identify trading signals
# 7. Backtest the pairs trading strategy
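For step 3, statsmodels ships an Engle-Granger test. A sketch with KO and PEP as an example pair (any two related tickers work):
from statsmodels.tsa.stattools import coint
ko = yf.download('KO', start='2022-01-01', end='2024-01-01',
                 auto_adjust=False, progress=False)['Adj Close'].squeeze()
pep = yf.download('PEP', start='2022-01-01', end='2024-01-01',
                  auto_adjust=False, progress=False)['Adj Close'].squeeze()
pair = pd.concat([ko, pep], axis=1).dropna()
# H₀: no cointegration; a small p-value suggests a tradable long-run relation
t_stat, p_val, _ = coint(pair.iloc[:, 0], pair.iloc[:, 1])
print(f"Engle-Granger: t = {t_stat:.4f}, p = {p_val:.4f}")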
Exercise 3: Volatility Forecasting Competition
# Your task: Compare volatility forecasting methods
# 1. Download 3+ years of stock data
# 2. Build models: Historical volatility, EWMA, GARCH(1,1)
# 3. Use rolling window to forecast next day's volatility
# 4. Compare forecasts to realized volatility
# 5. Calculate forecast errors for each method
# 6. Determine which method is most accurate
# 7. Visualize forecast vs actual volatility
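For step 2, the EWMA (RiskMetrics-style) estimator is a one-liner in pandas. A sketch, assuming a daily return Series named returns and the conventional lambda of 0.94:
# EWMA variance: sigma²_t = lambda * sigma²_(t-1) + (1 - lambda) * r²_t
lam = 0.94
ewma_vol = np.sqrt(returns.pow(2).ewm(alpha=1 - lam, adjust=False).mean())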
Exercise 4: Market Anomaly Detection
# Your task: Test for the "Monday Effect"
# 1. Download 5+ years of S&P 500 data
# 2. Calculate returns by day of week
# 3. Test if Monday returns are different from other days
# 4. Use ANOVA to test across all days
# 5. Account for multiple testing (Bonferroni correction)
# 6. Check if pattern holds in different time periods
# 7. Discuss economic significance vs statistical significance
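For step 4, scipy's one-way ANOVA compares mean returns across all five weekdays at once. A sketch, assuming a daily return Series named returns with a DatetimeIndex:
# H₀: all weekdays have the same mean return
groups = [returns[returns.index.dayofweek == d] for d in range(5)]
f_stat, p_val = stats.f_oneway(*groups)
print(f"ANOVA across weekdays: F = {f_stat:.4f}, p = {p_val:.4f}")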
Module 9 Summary
Congratulations! You've mastered statistical analysis for finance.
What You've Accomplished
Hypothesis Testing
- Testing for significant returns and differences
- Comparing stocks and strategies statistically
- Understanding p-values and significance levels
- Testing distributional assumptions
Regression Analysis
- Building single and multi-factor models
- Interpreting coefficients and R-squared
- Calculating beta and alpha
- Performing residual analysis
Time Series
- Testing for stationarity
- Building ARIMA models for forecasting
- Modeling volatility with GARCH
- Understanding autocorrelation
Market Efficiency
- Testing the random walk hypothesis
- Variance ratio tests
- Runs tests for randomness
- Understanding market anomalies
Factor Models
- Fama-French three-factor model
- Rolling factor analysis
- Performance attribution
- Risk decomposition
Predictive Modeling
- Building machine learning models for returns
- Feature engineering from market data
- Model validation and testing
- Backtesting prediction strategies
Real-World Capabilities
You can now:
- Rigorously test financial hypotheses
- Build and validate regression models
- Forecast returns and volatility
- Test for market efficiency
- Decompose returns into factors
- Build predictive models with statistical rigor
- Make evidence-based investment decisions
Critical Statistical Thinking
Remember these key principles:
- Statistical vs Economic Significance: A result can be statistically significant but economically meaningless (or vice versa)
- In-Sample vs Out-of-Sample: Models that fit historical data well may not predict the future
- Data Snooping: Testing many hypotheses increases the chance of false positives
- Correlation ≠ Causation: Relationships don't imply causal mechanisms
- Assumptions Matter: All statistical tests have assumptions; verify them
What's Next
In Module 10, we'll bring everything together in a comprehensive real-world project. You'll apply all the skills you've learned—data acquisition, analysis, visualization, portfolio construction, modeling, and statistical testing—to build a complete financial analysis and trading system.
Before Moving Forward
Ensure you're comfortable with:
- Conducting hypothesis tests
- Building regression models
- Time series analysis basics
- Interpreting statistical results
- Understanding model limitations
Practice Recommendations
- Question Everything: Always test your assumptions statistically
- Replicate Studies: Try to reproduce published financial research
- Use Multiple Tests: Don't rely on a single statistical test
- Check Robustness: Test if results hold across different periods
- Stay Skeptical: Be wary of results that seem too good to be true
The Complete Toolkit
You now possess a comprehensive analytical framework:
- Data science fundamentals (Python, Pandas, NumPy)
- Financial domain knowledge
- Quantitative modeling skills
- Statistical rigor
- Visualization capabilities
- Portfolio theory
- Machine learning techniques
You're equipped to perform institutional-quality quantitative financial analysis. One final module remains: putting it all together.
Continue to Module 10: Real-World Capstone Project →

