class: middle, center, title-slide .center[.width-60[]] # AI Black Belt - Yellow Day 3/4: Practice regression algorithms for sentiment analysis
--- class: middle ## Outline .inactive[[Day 1] Introduction to machine learning with Python] .inactive[[Day 2] Learn to identify and solve supervised learning problems] [Day 3] Practice regression algorithms for sentiment analysis - Learn to evaluate properly the performance of your models. - Practice regression algorithms with Scikit-Learn. - Implement a full ML pipeline: from raw to text to sentiment analysis. .inactive[[Day 4] Learn how to let the machine tune itself] --- class: middle ## Wrap-up of the day 2 challenge Jump to `day2-03-challenge.ipynb`. [](https://mybinder.org/v2/gh/AI-BlackBelt/yellow/master) --- class: middle # Model evaluation --- # Classification .center.width-70[] .footnote[Credits: vas3k, [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/), 2018.] --- class: middle ## Accuracy The accuracy is the proportion of correct predictions. ```python from sklearn.metrics import accuracy_score print(accuracy_score(y_test, clf.predict(X_test))) >>> 0.84 ``` .exercice[Is this an appropriate measure when classes are imbalanced?] --- class: middle ## Precision, recall and F-measure $$ \begin{aligned} \text{Precision} &= \frac{TP}{TP + FP} \\\\ \text{Recall} &= \frac{TP}{TP + FN} \\\\ F &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \end{aligned} $$ .center.width-90[] --- class: middle ```python from sklearn.metrics import precision_score from sklearn.metrics import recall_score from sklearn.metrics import fbeta_score print("Precision =", precision_score(y_test, clf.predict(X_test))) print("Recall =", recall_score(y_test, clf.predict(X_test))) print("F =", fbeta_score(y_test, clf.predict(X_test), beta=1)) >>> Precision = 0.8118811881188119 >>> Recall = 0.8631578947368421 >>> F = 0.8367346938775511 ``` --- class: middle ## ROC AUC Area under the curve of the false positive rate (FPR) against the true positive rate (TPR) as the decision threshold of the classifier is varied. .grid[ .kol-1-2[ .center.width-100[] ] .kol-1-2[ ```python from sklearn.metrics import roc_auc_score print("ROC AUC =", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]) >>> ROC AUC = 0.9297744360902256 ``` ] ] --- # Training and test data .center.width-70[] Why splitting? - The error estimated on training data is typically optimistic. It **does not** accurately represent the performance the model would have on new data. - The actual performance should be estimated on (independent) test data. .footnote[Credits: Andreas Mueller, [Introduction to Machine Learning with Scikit-Learn](https://github.com/amueller/ml-workshop-1-of-4/), 2019.] ??? Make the analogy with the classroom. E.g., estimate region from how long it took to arrive. --- class: middle Jump to `day3-01-evaluation.ipynb`. [](https://mybinder.org/v2/gh/AI-BlackBelt/yellow/master) .footnote[Credits: vas3k, [Machine Learning for Everyone](https://vas3k.com/blog/machine_learning/), 2018.] --- class: middle # Regression --- class: middle .center.width-70[] --- # Algorithms Supervised learning algorithms can often be used both for classification and regression. This includes: - decision trees - K-nearest neighbors - linear models - neural networks. --- class: middle .center.width-60[] --- # Metrics ## Mean squared error Measures the average squared difference between the estimated values and what is estimated. ```python from sklearn.metrics import mean_squared_error >>> y_true = [3, -0.5, 2, 7] >>> y_pred = [2.5, 0.0, 2, 8] >>> mean_squared_error(y_true, y_pred) 0.375 ``` --- class: middle ## R2 score The coefficient of determination represents the proportion of total variance explained by the model / total variance. - The best score is 1.0 and it can be negative. - A constant model would get a score of 0.0. ```python >>> from sklearn.metrics import r2_score >>> y_true = [3, -0.5, 2, 7] >>> y_pred = [2.5, 0.0, 2, 8] >>> r2_score(y_true, y_pred) 0.948... ``` --- class: middle Jump to - `day3-02-regression.ipynb` - `day3-03-boston.ipynb` - `day3-04-amazon.ipynb` [](https://mybinder.org/v2/gh/AI-BlackBelt/yellow/master)