Intro to Machine Learning with scikit-learn
This is the lesson where Python turns from a data tool into an AI tool. You will train your first machine learning model in scikit-learn in fewer than 20 lines of code. By the end you will know what a model is, what training and prediction mean, and how to tell if your model is any good.
This is not a deep machine learning course. You are not going to derive gradient descent or build a neural network from scratch. You are going to learn the workflow that 90 percent of practical machine learning uses, then get pointers on where to go deeper.
What You'll Learn
- The four steps of every supervised ML project: load, split, train, evaluate
- How to train a classifier with scikit-learn in 5 lines
- The difference between classification and regression
- Why train/test split matters and what overfitting looks like
What Is Machine Learning?
Machine learning is the practice of letting an algorithm learn patterns from data, instead of you writing rules by hand. Two main flavors for beginners:
- Classification — predict a label. Will this email be spam or not? Did this passenger survive or die? What kind of flower is this?
- Regression — predict a number. What will this house sell for? How many calories does this meal contain? What is the predicted temperature tomorrow?
Both share the same workflow.
The Four-Step Workflow
Every supervised ML project follows these four steps. Memorize them.
- Load the data into a DataFrame.
- Split the data into features (X) and labels (y), then split again into train and test sets.
- Train a model on the training data.
- Evaluate the model on the test data.
That is it. The complexity comes from choosing the right model and tuning it, but the skeleton is always the same.
Your First Model: Predict Iris Species
The iris dataset is the "Hello World" of machine learning. Three species of flowers, four measurements each. Your job: predict the species from the measurements.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
# 1. Load
iris = load_iris(as_frame=True)
df = iris.frame
X = df.drop(columns="target")
y = df["target"]
# 2. Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 3. Train
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
# 4. Evaluate
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions, target_names=iris.target_names))
Run it. You should see accuracy somewhere around 0.95 to 1.00. You just trained a decision tree that classifies iris flowers.
What Just Happened
Step 1 — load. load_iris returns a built-in scikit-learn dataset. Real projects use pd.read_csv() instead.
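As a sketch, the same step with a CSV would look like this (the filename and label column here are hypothetical, not part of scikit-learn):
import pandas as pd
# Hypothetical file: feature columns plus a "species" label column
df = pd.read_csv("iris.csv")
X = df.drop(columns="species")
y = df["species"]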
Step 2 — split. train_test_split randomly divides the data: 80 percent for training, 20 percent for testing. The random_state=42 is for reproducibility — same split every time you run.
Why split at all? Because you want to know how the model performs on data it has not seen. If you trained and evaluated on the same data, you would not learn anything about real-world performance — like a student grading their own exam.
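One optional refinement, sketched below: stratify the split so each species appears in the same proportion in both sets. Iris is balanced, so it changes little here, but it matters on imbalanced data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # preserve class proportions in both sets
)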
Step 3 — train. DecisionTreeClassifier is one of many classifiers. fit(X_train, y_train) is the line that does the actual learning: the tree finds rules in the features that predict the species.
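You can even print the rules the tree found, which makes "the actual learning" less mysterious. A quick sketch using the fitted model from above:
from sklearn.tree import export_text
# One line per split: the if/else rules the tree learned from the features
print(export_text(model, feature_names=list(X.columns)))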
Step 4 — evaluate. Accuracy is the fraction of test predictions that were correct. The classification report breaks accuracy down by class, plus precision and recall.
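The report does not show which classes get mistaken for which; the confusion matrix does. A sketch with the same predictions:
from sklearn.metrics import confusion_matrix
# Rows are true species, columns are predicted species
print(confusion_matrix(y_test, predictions))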
Try a Different Model
Replace DecisionTreeClassifier with RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
Random forests are an ensemble of decision trees. They almost always perform better than a single tree. Try it; you should see accuracy stay around 0.95 to 1.00 — iris is easy.
For a slightly harder problem, try a logistic regression on the breast cancer dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
make_pipeline chains preprocessing (scaling features to mean 0, std 1) with the model. Most real ML code uses pipelines because they prevent data leakage — you fit the scaler on training data only, then apply it consistently to test data.
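To make the leakage point concrete, here is a sketch of the anti-pattern next to what the pipeline does for you, reusing the variables above:
# Anti-pattern (leaky): fitting the scaler on ALL rows lets test-set
# statistics influence preprocessing before evaluation
# scaler = StandardScaler().fit(X)
# With the pipeline, model.fit(X_train, y_train) fit the scaler on the
# training rows only; predict applies those same statistics to new data
print("First five predictions:", model.predict(X_test)[:5])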
Regression: Predicting a Number
When the target is a number, use a regression model. Try predicting California house prices:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print("Mean absolute error:", mean_absolute_error(y_test, preds))
The metric is different. For classification you use accuracy or F1. For regression you use mean absolute error or root mean squared error. Bigger error means worse predictions.
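If you also want root mean squared error, one portable way is to take the square root of the mean squared error; a sketch reusing preds from above:
import numpy as np
from sklearn.metrics import mean_squared_error
# RMSE penalizes large errors more heavily than MAE
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("Root mean squared error:", rmse)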
Overfitting in One Sentence
A model that gets 100 percent on training data and 60 percent on test data is overfit — it memorized the training set instead of learning a general pattern. The fix: simpler model (smaller max_depth), more data, or regularization. AI tools will explain this in detail when asked:
Explain overfitting in plain English with a real example. What does it look like in scikit-learn output, and how do I fix it for a beginner like me?
You will get a clear, concrete answer with code.
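You can also watch overfitting happen yourself. A sketch on the iris split from the first example (rerun that load and split first if you ran the later examples, since they reuse the same variable names): remove the depth limit and compare train versus test accuracy. On a dataset this easy the gap stays small, but the pattern is what matters.
deep_tree = DecisionTreeClassifier(random_state=42)  # no max_depth: tree grows until it fits the training data
deep_tree.fit(X_train, y_train)
# A large gap between these two numbers is the signature of overfitting
print("Train accuracy:", deep_tree.score(X_train, y_train))
print("Test accuracy: ", deep_tree.score(X_test, y_test))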
A Practical Prompt for ML Help
Use this template when you want AI to help with a model:
I am a beginner. My data has columns: [list]. The target I want to predict is [name]. The target is [binary / multi-class / continuous]. I have [N] rows. Please:
- Recommend a starting model from scikit-learn and explain why
- Write the full pipeline: load, split, fit, predict, evaluate
- Suggest one alternative model I could try next, and how to compare results
The AI will pick something sensible (logistic regression for small binary problems, random forest as a strong default, gradient boosting when you want top performance) and explain the trade-offs.
What This Lesson Did Not Cover
- Cross-validation. Doing the train/test split many times to get a more reliable performance estimate.
- Hyperparameter tuning. Searching over model settings (max_depth, n_estimators, etc.) to find the best.
- Feature engineering. Creating new columns from existing ones to help the model.
- Deep learning. Neural networks via PyTorch or TensorFlow.
These are the next steps after you have the four-step workflow under your fingers.
Key Takeaways
- Every supervised ML project follows the same four steps: load, split, train, evaluate
- Always split before training; use random_state for reproducibility
- Classification predicts labels; regression predicts numbers — different metrics
- Random forest and logistic regression are excellent first defaults
- Overfitting means train accuracy is high but test accuracy is low — fix with simpler models or more data
- AI is great at picking a starting model and writing the pipeline; verify by reading the code

