class: middle, center, title-slide .center[.width-60[]] # AI Black Belt - Yellow Day 2/4: Learn to identify and solve supervised learning problems
--- class: middle ## Outline .inactive[[Day 1] Introduction to machine learning with Python] [Day 2] Learn to identify and solve supervised learning problems - Learn to recognize classification and regression problems in the wild. - Practice classification algorithms with Scikit-Learn. .inactive[[Day 3] Practice regression algorithms for sentiment analysis] .inactive[[Day 4] Learn how to let the machine tune itself] --- class: middle # Supervised learning --- class: middle .center.width-100[] .footnote[Credits: vas3k, [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/), 2018.] --- # Representing data
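To make this concrete, here is a minimal sketch of how the inputs $\mathbf{X}$ and outputs $\mathbf{y}$ are laid out in code; the iris dataset is used purely as an illustration and is not part of the course material:

```python
# Minimal sketch: scikit-learn expects X as a 2D array with one row per
# sample and one column per feature, and y as a 1D array of outputs.
# (The iris dataset is only an example, not part of the course data.)
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4): 150 samples, 4 features
print(y.shape)  # (150,):   one output value per sample
```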
.center.width-100[] .footnote[Credits: Andreas Mueller, [Introduction to Machine Learning with Scikit-Learn](https://github.com/amueller/ml-workshop-1-of-4/), 2019.] --- # Supervised learning .center.width-70[] .footnote[Credits: vas3k, [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/), 2018.] --- class: middle .center.width-70[] .footnote[Credits: vas3k, [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/), 2018.] --- class: middle Formally, given inputs $\mathbf{X}$ and outputs $\mathbf{y}$, we want to find a function $f$ such that $$\mathbf{y} \approx f(\mathbf{X}).$$ - in classification, the set of possible output values is finite and *symbolic*. - in regression, the set of possible output values is infinite and **numerical**. --- # In the wild .grid[ .kol-1-2[ ## Classification - Visual recognition - Spam filtering - Sentiment analysis - Medical diagnosis - Weather forecast - Customer segmentation - Recommendations ] .kol-1-2[ ## Regression - Weather forecast - Stock price forecast - Demand and sales volume analysis - Estimating the price of an Uber ride - Market value prediction - Playing games ] ] .exercice[Can you think of more examples?] --- class: middle .exercice[Is this a classification or a regression problem? What are the inputs and outputs?] --- class: middle, center, black-slide .width-45[] .width-45[] Would he survive the sinking of the Titanic? --- class: middle, center, black-slide .width-55[] .width-55[] Sorting cucumbers? --- class: middle, center, black-slide .width-80[] How long will it take to arrive at work if you leave home at 6:00 AM? --- class: middle, center, black-slide .width-80[] Are you sick? --- class: middle, center, black-slide .width-80[] What's the market value of my house? --- class: middle .exercice[Identify supervised learning problems for each of the following situations. What are the inputs and outputs? Brainstorm in small groups.] --- class: middle, black-slide .center.width-80[] .footnote[Credits: Andreas Mueller, [Introduction to Machine Learning with Scikit-Learn](https://github.com/amueller/ml-workshop-1-of-4/), 2019.] --- class: middle, black-slide .center.width-80[] .footnote[Credits: Andreas Mueller, [Introduction to Machine Learning with Scikit-Learn](https://github.com/amueller/ml-workshop-1-of-4/), 2019.] --- class: middle, center, black-slide .width-80[] --- class: middle, center, black-slide .width-80[] --- class: middle, center, black-slide .width-40[] --- class: middle, center, black-slide .width-80[] --- class: middle # Classification --- # Recap .grid[ .kol-1-2[
## Estimator API

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()    # 1. instantiate the model
clf.fit(X_train, y_train)         # 2. learn from the training data
y_pred = clf.predict(X_test)      # 3. predict on new data
clf.score(X_test, y_test)         # 4. evaluate the accuracy
```

] .kol-1-2[ .center.width-100[] ] ] --- class: middle Jump to `day2-01-scikit-learn.ipynb`. [](https://mybinder.org/v2/gh/AI-BlackBelt/yellow/master) --- # Choosing the right estimator
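Because every estimator exposes the same `fit`/`predict`/`score` interface, comparing candidate models is mostly a one-line change. Below is a minimal sketch; the dataset and the shortlist of models are illustrative choices, not prescriptions:

```python
# Minimal sketch: thanks to the shared estimator API, several classifiers
# can be tried on the same split just by swapping the constructor.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for clf in (LogisticRegression(max_iter=1000),
            KNeighborsClassifier(n_neighbors=5),
            DecisionTreeClassifier()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```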
.center.width-100[]

---

class: middle

## Linear model

Model the decision boundary as a hyperplane.

.center.width-60[]

.footnote[Credits: vas3k, [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/), 2018.]

---

class: middle

.center.width-100[]

```python
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)
```

---

class: middle

## K-Nearest neighbors

Predict the most frequent output value among the $K$ closest neighbors to the input $\mathbf{x}$ (or their average, for regression). Closeness is typically defined using the Euclidean distance.

.center.width-50[]

.footnote[Credits: Andreas Mueller, [Introduction to Machine Learning with Scikit-Learn](https://github.com/amueller/ml-workshop-1-of-4/), 2019.]

---

class: middle

.exercice[Practice K-Nearest neighbors in the classroom.]

---

class: middle

.center.width-100[]

```python
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)
```

---

class: middle

## Decision trees

Idea: build a partition of the input space using cuts orthogonal to the feature axes.

.center.width-100[]

.footnote[Credits: vas3k, [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/), 2018.]

---

class: middle

.center.width-100[]

```python
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X, y)
```

---

class: middle

## Naive Bayes (optional)

Idea: apply the Bayes rule to make predictions for $y$ given $\mathbf{x}$.

.center.width-100[]

.footnote[Credits: vas3k, [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/), 2018.]

---

class: middle

.center.width-100[]

```python
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X, y)
```

---

# Preprocessing

The performance of your system depends not only on the algorithm you pick, but also on how you **prepare** the data (see the sketch after these questions).

- How should you treat missing values?
- How do you encode categorical/symbolic features into numerical values?
- Does the scale of the features matter?
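One common answer to each of these questions, as a minimal sketch; the toy table, column names, and values below are made up purely for illustration:

```python
# Minimal sketch (illustrative toy data, not from the course material).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41],                  # numerical, one missing value
    "city": ["Liège", "Gent", "Liège", "Namur"],  # categorical/symbolic
    "income": [21000, 54000, 38000, 47000],       # numerical, large scale
})

# 1. Missing values: impute them, e.g. with the column median.
df[["age"]] = SimpleImputer(strategy="median").fit_transform(df[["age"]])

# 2. Categorical features: encode them numerically, e.g. as one-hot vectors.
city_onehot = OneHotEncoder().fit_transform(df[["city"]]).toarray()

# 3. Feature scale: it matters for many models (e.g. k-NN), so standardize.
age_income_scaled = StandardScaler().fit_transform(df[["age", "income"]])
```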
.center.width-50[]

.center.italic[Garbage in, garbage out.]

---

class: middle

## Transformer API

```python
from sklearn.preprocessing import StandardScaler

tf = StandardScaler()
tf.fit(X_train)                          # learn the scaling parameters on the training set only
X_train_scaled = tf.transform(X_train)
X_test_scaled = tf.transform(X_test)     # reuse them on the test set
```
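A transformer and an estimator can also be chained, so that the same preprocessing is applied consistently at fit and predict time. A minimal sketch, assuming the usual `X_train`, `y_train`, `X_test`, `y_test` split from the previous slides; the choice of scaler and classifier is illustrative:

```python
# Minimal sketch: chain a transformer and a classifier in a pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)   # the scaler is fitted on the training data only
model.score(X_test, y_test)   # the same scaling is reused before scoring
```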
.center.width-50[]

---

class: middle

Jump to

- `day2-02-census.ipynb`
- `day2-03-challenge.ipynb`

[](https://mybinder.org/v2/gh/AI-BlackBelt/yellow/master)