Bloomberg ML EDU presents:

Foundations of Machine Learning

Instructor: David S. Rosenberg, Office of the CTO at Bloomberg

Understand the Concepts, Techniques and Mathematical Frameworks Used by Experts in Machine Learning

About This Course

Bloomberg presents "Foundations of Machine Learning," a training course that was initially delivered internally to the company's software engineers as part of its "Machine Learning EDU" initiative. This course covers a wide variety of topics in machine learning and statistical modeling. The primary goal of the class is to help participants gain a deep understanding of the concepts, techniques and mathematical frameworks used by experts in machine learning. It is designed to make valuable machine learning skills more accessible to individuals with a strong math background, including software developers, experimental scientists, engineers and financial professionals.

The 30 lectures in the course are embedded below, but may also be viewed in this YouTube playlist. The course includes a complete set of homework assignments, each containing a theoretical element and an implementation challenge with support code in Python, which is rapidly becoming the prevailing programming language for data science and machine learning in both academia and industry. This course also serves as a foundation on which more specialized courses and further independent study can build.

Please fill out this short online form to register for access to our course's Piazza discussion board. Applications are processed manually, so please be patient. You should receive an email directly from Piazza when you are registered. Common questions from this and previous editions of the course are posted in our FAQ.

The first lecture, Black Box Machine Learning, gives a quick start introduction to practical machine learning and only requires familiarity with basic programming concepts.

Highlights and Distinctive Features of the Course Lectures, Notes, and Assignments

  • Geometric explanation for what happens with ridge, lasso, and elastic net regression in the case of correlated random variables.
  • Investigation of when the penalty (Tikhonov) and constraint (Ivanov) forms of regularization are equivalent.
  • Concise summary of what we really learn about SVMs from Lagrangian duality.
  • Proof of representer theorem with simple linear algebra, emphasizing it as a way to reparametrize certain objective functions.
  • Guided derivation of the math behind the classic diamond/circle/ellipsoids picture that "explains" why L1 regularization gives sparsity (Homework 2, Problem 5)
  • From scratch (in numpy) implementation of almost all major ML algorithms we discuss: ridge regression with SGD and GD (Homework 1, Problems 2.5, 2.6, page 4), lasso regression with the shooting algorithm (Homework 2, Problem 3, page 4), kernel ridge regression (Homework 4, Problem 3, page 2), kernelized SVM with Kernelized Pegasos (Homework 4, 6.4, page 9), L2-regularized logistic regression (Homework 5, Problem 3.3, page 4), Bayesian linear regression (Homework 5, Problem 5, page 6), multiclass SVM (Homework 6, Problem 4.2, page 3), classification and regression trees (without pruning) (Homework 6, Problem 6), gradient boosting with trees for classification and regression (Homework 6, Problem 8), and a multilayer perceptron for regression (Homework 7, Problem 4, page 3).
  • Repeated use of a simple 1-dimensional regression dataset, so it's easy to visualize the effect of various hypothesis spaces and regularizations that we investigate throughout the course.
  • Investigation of how to derive a conditional probability estimate from a predicted score for various loss functions, and why it's not so straightforward for the hinge loss (i.e. the SVM) (Homework 5, Problem 2, page 1)
  • Discussion of numerical overflow issues and the log-sum-exp trick (Homework 5, Problem 3.2); a minimal sketch follows this list.
  • Self-contained introduction to the expectation maximization (EM) algorithm for latent variable models.
  • Development of a general computation graph framework from scratch, using numpy, in which you implement your own neural networks.
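
As a taste of the numerical issue mentioned above, here is a minimal sketch of the log-sum-exp trick (the function name and test values are illustrative, not taken from the homework):

```python
import numpy as np

def log_sum_exp(scores):
    """Compute log(sum(exp(scores))) without overflow.

    Shifting by the max leaves the result unchanged, since
    log(sum(exp(s))) = m + log(sum(exp(s - m))) for any constant m.
    """
    m = np.max(scores)
    return m + np.log(np.sum(np.exp(scores - m)))

scores = np.array([1000.0, 1001.0, 1002.0])
# The naive np.log(np.sum(np.exp(scores))) overflows to inf in float64;
# the shifted version returns roughly 1002.41.
print(log_sum_exp(scores))
```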

Prerequisites

The quickest way to see if the mathematics level of the course is for you is to take a look at this mathematics assessment, which is a preview of some of the math concepts that show up in the first part of the course.

  • Solid mathematical background, equivalent to a 1-semester undergraduate course in each of the following: linear algebra, multivariate differential calculus, probability theory, and statistics. The content of NYU's DS-GA-1002: Statistical and Mathematical Methods would be more than sufficient, for example.
  • Python programming required for most homework assignments.
  • Recommended: At least one advanced, proof-based mathematics course
  • Recommended: Computer science background up to a "data structures and algorithms" course
  • (HTF) refers to Hastie, Tibshirani, and Friedman's book The Elements of Statistical Learning
  • (SSBD) refers to Shalev-Shwartz and Ben-David's book Understanding Machine Learning: From Theory to Algorithms
  • (JWHT) refers to James, Witten, Hastie, and Tibshirani's book An Introduction to Statistical Learning

With the abundance of well-documented machine learning (ML) libraries, programmers can now "do" some ML, without any understanding of how things are working. And we'll encourage such "black box" machine learning... just so long as you follow the procedures described in this lecture. To make proper use of ML libraries, you need to be conversant in the basic vocabulary, concepts, and workflows that underlie ML. We'll introduce the standard ML problem types (classification and regression) and discuss prediction functions, feature extraction, learning algorithms, performance evaluation, cross-validation, sample bias, nonstationarity, overfitting, and hyperparameter tuning.

If you're already familiar with standard machine learning practice, you can skip this lecture.


We have an interactive discussion about how to reformulate a real and subtly complicated business problem as a formal machine learning problem. The real goal isn't so much to solve the problem, as to convey the point that properly mapping your business problem to a machine learning problem is both extremely important and often quite challenging. This course doesn't dwell on how to do this mapping, though see Provost and Fawcett's book in the references.

This is where our "deep study" of machine learning begins. We introduce some of the core building blocks and concepts that we will use throughout the remainder of this course: input space, action space, outcome space, prediction functions, loss functions, and hypothesis spaces. We present our first machine learning method: empirical risk minimization. We also highlight the issue of overfitting, which may occur when we find the empirical risk minimizer over too large a hypothesis space.
A recurring theme in machine learning is that we formulate learning problems as optimization problems. Empirical risk minimization was our first example of this. To do learning, we need to do optimization. In this lecture we cover stochastic gradient descent, which is today's standard optimization method for large-scale machine learning problems.
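
As a concrete illustration, here is a minimal numpy sketch of SGD on a toy least-squares problem (the data, step size, and epoch count are illustrative, not from the homework):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
eta = 0.01  # step size
for epoch in range(20):
    for i in rng.permutation(len(y)):
        # gradient of (1/2) * (w.x_i - y_i)^2 with respect to w
        grad = (X[i] @ w - y[i]) * X[i]
        w -= eta * grad

print(w)  # close to w_true
```
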
We introduce the notions of approximation error, estimation error, and optimization error. While these concepts usually show up in more advanced courses, they will help us frame our understanding of the tradeoffs between hypothesis space choice, data set size, and optimization run times. In particular, these concepts will help us understand why "better" optimization methods (such as quasi-Newton methods) may not find prediction functions that generalize better, despite finding better optima.

We introduce "regularization", our main defense against overfitting. We discuss the equivalence of the penalization and constraint forms of regularization (see ), and we introduce L1 and L2 regularization, the two most important forms of regularization for linear models. When L1 and L2 regularization are applied to linear least squares, we get "lasso" and "ridge" regression, respectively. We compare the "regularization paths" for lasso and ridge regression, and give a geometric argument for why lasso often gives "sparse" solutions. Finally, we present "coordinate descent", our second major approach to optimization. When applied to the lasso objective function, coordinate descent takes a particularly clean form and is known as the "shooting algorithm".

We continue our discussion of ridge and lasso regression by focusing on the case of correlated features, which is a common occurrence in machine learning practice. We will see that ridge solutions tend to spread weight equally among highly correlated features, while lasso solutions may be unstable in the case of highly correlated features. Finally, we introduce the "elastic net", a combination of L1 and L2 regularization, which ameliorates the instability of L1 while still allowing for sparsity in the solution. (Credit to Brett Bernstein for the excellent graphics.)

We start by discussing absolute loss and Huber loss. We consider them as alternatives to the square loss that are more robust to outliers. Next, we introduce our approach to the classification setting, introducing the notions of score, margin, and margin-based loss functions. We discuss basic properties of the hinge loss (i.e. the SVM loss), logistic loss, and the square loss, considered as margin-based losses. The interplay between the loss function we use for training and the properties of the prediction function we end up with is a theme we will return to several times during the course.

We introduce the basics of convex optimization and Lagrangian duality. We discuss weak and strong duality, Slater's constraint qualifications, and we derive the complementary slackness conditions. As far as this course is concerned, there are really only two reasons for discussing Lagrangian duality: 1) the complementary slackness conditions imply that SVM solutions are "sparse in the data", which has important practical implications for kernelized SVMs; 2) strong duality is a sufficient condition for the equivalence between the penalty and constraint forms of regularization.

This mathematically intense lecture may be safely skipped.


We define the soft-margin support vector machine (SVM) directly in terms of its objective function (L2-regularized, hinge loss minimization over a linear hypothesis space). Using our knowledge of Lagrangian duality, we find a dual form of the SVM problem, apply the complementary slackness conditions, and derive some interesting insights into the connection between "support vectors" and margin. Read the "SVM Insights from Duality" in the Notes below for a high-level view of this mathematically dense lecture.

Notably absent from the lecture is the hard-margin SVM and its standard geometric derivation. Although the derivation is fun, since we start from the simple and visually appealing idea of maximizing the "geometric margin", the hard-margin SVM is rarely useful in practice, as it requires separable data, which precludes any datasets with repeated inputs and label noise. One fixes this by introducing "slack" variables, which leads to a formulation equivalent to the soft-margin SVM we present. Once we introduce slack variables, I've personally found the interpretation in terms of maximizing the margin to be much hazier, and I find understanding the SVM in terms of "just" a particular loss function and a particular regularization to be much more useful for understanding its properties. That said, Brett Bernstein gives a very nice development of the geometric approach to the SVM, which is linked in the References below. At the very least, it's a great exercise in basic linear algebra.

Neither the lasso nor the SVM objective function is differentiable, and we had to do some work for each to optimize with gradient-based methods. It turns out, however, that gradient descent will essentially work in these situations, so long as you're careful about handling the non-differentiable points. To this end, we introduce "subgradient descent", and we show the surprising result that, even though the objective value may not decrease with each step, every step brings us closer to the minimizer.

This mathematically intense lecture may be safely skipped.

When using linear hypothesis spaces, one needs to encode explicitly any nonlinear dependencies on the input as features. In this lecture we discuss various strategies for creating features. Much of this material is taken, with permission, from Percy Liang's CS221 course at Stanford.

With linear methods, we may need a whole lot of features to get a hypothesis space that's expressive enough to fit our data -- there can be orders of magnitude more features than training examples. While regularization can control overfitting, having a huge number of features can make things computationally very difficult, if handled naively. For objective functions of a particular general form, which includes ridge regression and SVMs but not lasso regression, we can "kernelize", which can allow significant speedups in certain situations. In fact, with the "kernel trick", we can even use an infinite-dimensional feature space at a computational cost that depends primarily on the training set size.

In more detail, it turns out that even when the optimal parameter vector we're searching for lives in a very high-dimensional vector space (dimension being the number of features), a basic linear algebra argument shows that for certain objective functions, the optimal parameter vector lives in a subspace spanned by the training input vectors. Thus, when we have more features than training points, we may be better off restricting our search to the lower-dimensional subspace spanned by training inputs. We can do this by an easy reparameterization of the objective function. This result is referred to as the "representer theorem", and its proof can be given on one slide.
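
In symbols (notation chosen here for concreteness, not necessarily matching the slides):

```latex
Suppose $R:[0,\infty)\to\mathbb{R}$ is nondecreasing and
$L:\mathbb{R}^n\to\mathbb{R}$ is arbitrary, and consider
\[
  \min_{w\in\mathbb{R}^d}\; R\big(\lVert w\rVert\big)
    + L\big(\langle w,x_1\rangle,\dots,\langle w,x_n\rangle\big).
\]
Then some minimizer has the form $w^\ast=\sum_{i=1}^n \alpha_i x_i$.
Proof sketch: decompose $w = w_\parallel + w_\perp$ with
$w_\parallel \in \operatorname{span}(x_1,\dots,x_n)$. Then
$\langle w,x_i\rangle=\langle w_\parallel,x_i\rangle$ for every $i$ and
$\lVert w_\parallel\rVert\le\lVert w\rVert$, so replacing $w$ by
$w_\parallel$ never increases the objective.
```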

After reparameterization, we'll find that the objective function depends on the data only through the Gram matrix, or "kernel matrix", which contains the dot products between all pairs of training feature vectors. This is where things get interesting a second time: Suppose f is our featurization function. Sometimes the dot product between two feature vectors f(x) and f(x') can be computed much more efficiently than multiplying together corresponding features and summing. In such a situation, we write the dot products in terms of the "kernel function": k(x,x')=〈f(x),f(x')〉, which we hope to compute much more quickly than O(d), where d is the dimension of the feature space. The essence of a "kernel method" is to use this "kernel trick" together with the reparameterization described above. This allows one to use huge (even infinite-dimensional) feature spaces with a computational burden that depends primarily on the size of your training set. In practice, it's useful for small and medium-sized datasets for which computing the kernel matrix is tractable. Scaling kernel methods to large data sets is still an active area of research.
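
A minimal sketch of these ideas for kernel ridge regression with an RBF kernel (the toy data and the values of sigma and lam are illustrative, not the homework's):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), computed for all pairs."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

lam = 0.1
K = rbf_kernel(X, X)                                # the Gram ("kernel") matrix
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)  # dual coefficients

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf_kernel(X_test, X) @ alpha  # f(x) = sum_i alpha_i k(x, x_i)
print(y_pred)
```

Note that training and prediction touch the data only through kernel evaluations, so the feature space never has to be represented explicitly.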


This is our second "black-box" machine learning lecture. We start by discussing various models that you should almost always build for your data, to use as baselines and performance sanity checks. From there we focus primarily on evaluating classifier performance. We define a whole slew of performance statistics used in practice (precision, recall, F1, etc.). We also discuss the fact that most classifiers provide a numeric score, and if you need to make a hard classification, you should tune your threshold to optimize the performance metric of importance to you, rather than just using the default (typically 0 or 0.5). We also discuss the various performance curves you'll see in practice: precision/recall, ROC, and (my personal favorite) lift curves.


So far we have studied the regression setting, for which our predictions (i.e. "actions") are real-valued, as well as the classification setting, for which our score functions also produce real values. With this lecture, we begin our consideration of "conditional probability models", in which the predictions are probability distributions over possible outcomes. We motivate these models by discussion of the "CitySense" problem, in which we want to predict the probability distribution for the number of taxicab dropoffs at each street corner, at different times of the week. Given this model, we can then determine, in real-time, how "unusual" the amount of behavior is at various parts of the city, and thereby help you find the secret parties, which is of course the ultimate goal of machine learning.

In empirical risk minimization, we minimize the average loss on a training set. If our prediction functions are producing probability distributions, what loss functions will give reasonable performance measures? In this lecture, we discuss "likelihood", one of the most popular performance measures for distributions. We temporarily leave aside the conditional probability modeling problem, and focus on the simpler problem of fitting an unconditional probability model to data. We can use "maximum likelihood" to fit both parametric and nonparametric models. Once we have developed a collection of candidate probability distributions on training data, we select the best one by choosing the model that has the highest "hold-out likelihood", i.e. likelihood on validation data.
In this lecture we consider prediction functions that produce distributions from a parametric family of distributions. We restrict to the case of linear models, though later in the course we will show how to make nonlinear versions using gradient boosting and neural networks. We develop the technique through four examples: Bernoulli regression (logistic regression being a special case), Poisson regression, Gaussian regression, and multinomial logistic regression (our first multiclass method). We conclude by connecting this maximum likelihood framework back to our empirical risk minimization framework.

We review some basics of classical and Bayesian statistics. For classical "frequentist" statistics, we define statistics and point estimators, and discuss various desirable properties of point estimators. For Bayesian statistics, we introduce the "prior distribution", which is a distribution on the parameter space that you declare before seeing any data. We compare the two approaches for the simple problem of learning about a coin's probability of heads. Along the way, we discuss conjugate priors, posterior distributions, and credible sets. Finally, we give the basic setup for Bayesian decision theory, which is how a Bayesian would go from a posterior distribution to choosing an action.


In our earlier discussion of conditional probability modeling, we started with a hypothesis space of conditional probability models, and we selected a single conditional probability model using maximum likelihood or regularized maximum likelihood. In the Bayesian approach, we start with a prior distribution on this hypothesis space, and after observing some training data, we end up with a posterior distribution on the hypothesis space. For making conditional probability predictions, we can derive a predictive distribution from the posterior distribution. We explore these concepts by working through the case of Bayesian Gaussian linear regression. We also make a precise connection between MAP estimation in this model and ridge regression.

We begin our discussion of nonlinear models with tree models. We first describe the hypothesis space of decision trees, and we discuss some complexity measures we can use for regularization, including tree depth and the number of leaf nodes. The challenge starts when we try to find the regularized empirical risk minimizer (ERM) over this space for some loss function. It turns out finding this ERM is computationally intractable. We discuss a standard greedy approach to tree building, both for classification and regression, in the case that features take values in any ordered set. We also describe an approach for handling categorical variables (in the binary classification case) and missing values.

In this lecture, we define bootstrap sampling and show how it is typically applied in statistics to do things such as estimating variances of statistics and making confidence intervals. It can be used in a machine learning context for assessing model performance.


We motivate bagging as follows: Consider the regression case, and suppose we could create a bunch of prediction functions, say B of them, based on B independent training samples of size n. If we average together these prediction functions, the expected value of the average is the same as that of any one of the functions, but the variance decreases by a factor of 1/B -- a clear win! Of course, this would require an overall sample of size nB. The idea of bagging is to replace independent samples with bootstrap samples from a single data set of size n. Of course, the bootstrap samples are not independent, so much of our discussion is about when bagging does and does not lead to improved performance. Random forests were invented as a way to create conditions in which bagging works better.
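
A quick numerical check of the 1/B variance claim, using the sample mean as a stand-in prediction function and genuinely independent samples (the idealized setting, before bootstrap enters):

```python
import numpy as np

rng = np.random.default_rng(0)

def predictor_from_sample(sample):
    # a deliberately simple "prediction function": the sample mean
    return sample.mean()

n, B = 50, 100
# average of B predictors, each trained on an independent sample of size n
avg_preds = [np.mean([predictor_from_sample(rng.normal(size=n)) for _ in range(B)])
             for _ in range(2000)]
# a single predictor trained on one sample of size n
single_preds = [predictor_from_sample(rng.normal(size=n)) for _ in range(2000)]

print(np.var(single_preds) / np.var(avg_preds))  # close to B = 100
```
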

Although it's hard to find crisp theoretical results describing when bagging helps, conventional wisdom says that it helps most for models that are "high variance", which in this context means the prediction function may change a lot when you train with a new random sample from the same distribution, and "low bias", which basically means fitting the training data well. Large decision trees have these characteristics and are usually the model of choice for bagging. Random forests are just bagged trees with one additional twist: only a random subset of features are considered when splitting a node of a tree. The hope, very roughly speaking, is that by injecting this randomness, the resulting prediction functions are less dependent, and thus we'll get a larger reduction in variance. In practice, random forests are one of the most effective machine learning models in many domains.


Gradient boosting is an approach to "adaptive basis function modeling", in which we learn a linear combination of M basis functions, which are themselves learned from a base hypothesis space H. Gradient boosting may be used with any subdifferentiable loss function and over any base hypothesis space on which we can do regression. Regression trees are the most commonly used base hypothesis space. It is important to note that the "regression" in "gradient boosted regression trees" (GBRTs) refers to how we fit the basis functions, not the overall loss function. GBRTs are routinely used for classification and conditional probability modeling. They are among the most dominant methods in competitive machine learning (e.g. Kaggle competitions).

If the base hypothesis space H has a nice parameterization (say differentiable, in a certain sense), then we may be able to use standard gradient-based optimization methods directly. In fact, neural networks may be considered in this category. However, if the base hypothesis space H consists of trees, then no such parameterization exists. This is where gradient boosting is really needed.

For practical applications, it would be worth checking out the standard GBRT library implementations.

See the Notes below for fully worked examples of doing gradient boosting for classification, using the hinge loss, and for conditional probability modeling using both exponential and Poisson distributions. The code gbm.py illustrates L2-boosting and L1-boosting with decision stumps, for a one-dimensional regression dataset.
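
For readers without access to gbm.py, here is a minimal sketch in the same spirit: L2 boosting with decision stumps on a one-dimensional regression dataset (this is an illustration, not the course's code; the shrinkage and round count are arbitrary):

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares regression stump for 1-d inputs x and targets r."""
    best = None
    for s in np.unique(x):
        left, right = r[x <= s], r[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        pred_l, pred_r = left.mean(), right.mean()
        err = ((left - pred_l) ** 2).sum() + ((right - pred_r) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, pred_l, pred_r)
    _, s, pl, pr = best
    return lambda z: np.where(z <= s, pl, pr)

def l2_boost(x, y, n_rounds=50, shrinkage=0.1):
    """Gradient boosting with the square loss: each stump fits the residuals."""
    f = np.zeros_like(y)
    stumps = []
    for _ in range(n_rounds):
        residual = y - f  # negative gradient of (1/2) * (f - y)^2
        h = fit_stump(x, residual)
        stumps.append(h)
        f = f + shrinkage * h(x)
    return lambda z: shrinkage * sum(h(z) for h in stumps)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=200)
model = l2_boost(x, y)
```
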

Here we consider how to generalize the score-producing binary classification methods we've discussed (e.g. SVM and logistic regression) to multiclass settings. We start by discussing "One-vs-All", a simple reduction of multiclass to binary classification. This usually works just fine in practice, despite the interesting failure case we illustrate. However, One-vs-All doesn't scale to a very large number of classes, since we have to train a separate model for each class. This is the real motivation for presenting the "compatibility function" approach described in this lecture. The approach presented here extends to structured prediction problems, where the output space may be exponentially large. We didn't have time to define structured prediction in the lecture, but please see the slides and the SSBD book in the references.
Here we start our short unit on unsupervised learning. k-means clustering is presented first as an algorithm and then as an approach to minimizing a particular objective function. One challenge with clustering algorithms is that it's not obvious how to measure success. (See Section 22.5 of the SSBD book for a nice discussion.) When possible, I prefer to take a probabilistic modeling approach, as discussed in the next two lectures.
A Gaussian mixture model (GMM) is a family of multimodal probability distributions, which is a plausible generative model for clustered data. We can fit this model using maximum likelihood, and we can assess the quality of fit by evaluating the model likelihood on holdout data. While the "learning" phase of Gaussian mixture modeling is fitting the model to data, in the "inference" phase, we determine for any point drawn from the GMM the probability that it came from each of the k components. To use a GMM for clustering, we simply assign each point to the component that it is most likely to have come from. k-means clustering can be seen as a limiting case of a restricted form of Gaussian mixture modeling.
It turns out that fitting a Gaussian mixture model (GMM) by maximum likelihood is easier said than done: there is no closed form solution, and our usual gradient methods do not work well. The standard approach to maximum likelihood estimation in a Gaussian mixture model is the expectation maximization (EM) algorithm. In this lecture, we present the EM algorithm for a general latent variable model, of which GMM is a special case. We present the EM algorithm as a very basic "variational method" and indicate a few generalizations.
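
A minimal sketch of EM for a two-component one-dimensional GMM (the data and initialization are illustrative; a production version would also guard against degenerate components and check for convergence):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

# initial parameters of a two-component 1-d GMM
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(100):
    # E-step: responsibilities r[i, k] = P(component k | x_i)
    dens = pi * norm.pdf(x[:, None], mu, sigma)  # shape (n, 2)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)  # close to the mixing weights, means, and scales above
```
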
In the context of this course, we view neural networks as "just" another nonlinear hypothesis space. On the practical side, unlike trees and tree-based ensembles (our other major nonlinear hypothesis spaces), neural networks can be fit using gradient-based optimization methods. On the theoretical side, a large enough neural network can approximate any continuous function. We discuss the specific case of the multilayer perceptron for multiclass classification, which we view as a generalization of multinomial logistic regression from linear to nonlinear score functions.

Neural network optimization is amenable to gradient-based methods, but if the actual computation of the gradient is done naively, the computational cost can be prohibitive. Backpropagation is the standard algorithm for computing the gradient efficiently. We present the backpropagation algorithm for a general computation graph. The algorithm we present applies, without change, to models with "parameter tying", which include convolutional networks and recurrent neural networks (RNN's), the workhorses of modern computer vision and natural language processing. We illustrate backpropagation with one of the simplest models with parameter tying: regularized linear regression. Backpropagation for the multilayer perceptron, the standard introductory example, is presented in detail in the Notes below.
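
A minimal sketch of reverse-mode differentiation on a scalar computation graph (an illustration, not the course's computation graph framework; the recursive backward pass enumerates paths, which is fine for small graphs but not the efficient topological-order scheme the lecture develops):

```python
class Node:
    """A scalar value in a computation graph, with reverse-mode gradients."""
    def __init__(self, value, parents=()):
        # parents holds (parent_node, local_partial_derivative) pairs
        self.value, self.parents, self.grad = value, parents, 0.0

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def backward(self, upstream=1.0):
        # accumulate the chain-rule product along every path to this node
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

w, x, b = Node(2.0), Node(3.0), Node(0.5)
loss = w * x + b   # forward pass builds the graph
loss.backward()    # backward pass accumulates gradients
print(w.grad, x.grad, b.grad)  # 3.0, 2.0, 1.0
```
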

We point in the direction of many other topics in machine learning that should be accessible to students of this course, but that we did not have time to cover.

Assignments

GD, SGD, and Ridge Regression

Lasso Regression

SVM and Sentiment Analysis

Kernel Methods

Probabilistic Modeling

Multiclass, Trees, and Gradient Boosting

Computation Graphs, Backpropagation, and Neural Networks


Other tutorials and references

  • Carlos Fernandez-Granda's lecture notes provide a comprehensive review of the prerequisite material in linear algebra, probability, statistics, and optimization.
  • Brian Dalessandro's iPython notebooks from DS-GA-1001: Intro to Data Science
  • The Matrix Cookbook has lots of facts and identities about matrices and certain probability distributions.
  • Stanford CS229: "Review of Probability Theory"
  • Stanford CS229: "Linear Algebra Review and Reference"
  • Math for Machine Learning by Hal Daumé III


David S. Rosenberg


Prediction Assignment Writeup

Alaettin Serhan Mete

This document summarizes the work done for the Prediction Assignment Writeup project for the Coursera Practical Machine Learning course. It was created with the knitr package in RStudio, using the actual analysis code. The repository for this work can be found at https://github.com/amete/PracticalMachineLearningAssignment .

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

First, we begin by importing the data. One can simply download the training and testing datasets using:

Then, we load some useful packages using:

Now let's load the data into memory:

Now that we have the data in memory, let's get to the fun part. The first thing we should do is split the training data into two parts. We'll use 70% of this data to actually train our model and the remaining 30% to validate it:

Next, we need to prepare the data for modeling. If you look at the training data, you'll see that a number of variables have either no variance or a large fraction of missing values. These will not help us in any meaningful way, so let's remove them before modeling:

At this point, we have clean data that we can use for building models. We'll build two models: a random forest and a generalized boosted model. We'll train these on the training portion of the original training dataset and then test them on the validation portion (a rough Python equivalent of this workflow is sketched below):
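
The original R code chunks did not survive in this copy, so here is a hedged scikit-learn sketch of the same download/clean/split/train workflow (the target column name "classe", the "X" index column, and the cleaning thresholds are assumptions about this dataset):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# assumes pml-training.csv has been downloaded locally from the URL above
raw = pd.read_csv("pml-training.csv")

# drop columns that are mostly missing or have (near-)zero variance
raw = raw.loc[:, raw.isna().mean() < 0.5]
raw = raw.loc[:, raw.nunique() > 1]
# keep numeric predictors; "X" is assumed to be a row-index column
features = raw.select_dtypes("number").drop(columns=["X"], errors="ignore")

# 70/30 split of the training data into train and validation portions
X_train, X_val, y_train, y_val = train_test_split(
    features, raw["classe"], test_size=0.3, random_state=42)

for model in (RandomForestClassifier(n_estimators=200, random_state=42),
              GradientBoostingClassifier(random_state=42)):
    model.fit(X_train, y_train)
    print(type(model).__name__, accuracy_score(y_val, model.predict(X_val)))
```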

Then let's see how well these two models perform at predicting the values in the validation dataset. This can be easily accomplished by predicting the values in the validation set and then comparing the predictions with the actual values.

We can investigate our generalized boosted model a bit further to see which variables have the highest relative influence:


The above list shows the ranking of variables in our GBM. We see that num_window, roll_belt, and pitch_forearm are the most influential; plots of these variables demonstrate their discriminative power.


At this point we see that the random forest has marginally better performance (accuracy: 0.998) than the generalized boosted model (accuracy: 0.9876). We could go with either, or ensemble them, but that might be overkill at this point; in any case, they yield the same predictions. Let's test our model on the actual testing dataset:

On the expected out-of-sample error

Please note that since random forest is the chosen method, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. This is explained as follows: "In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows: Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree. Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven to be unbiased in many tests." The reader can find more information at: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr

Applied Machine Learning

Fall 2019 is here; here's what you need to know.

Course Description

Learn and apply key concepts of modeling, analysis and validation from Machine Learning, Data Mining and Signal Processing to analyze and extract meaning from data. Implement algorithms and perform experiments on images, text, audio and mobile sensor measurements. Gain working knowledge of supervised and unsupervised techniques including classification, regression, clustering, feature selection, association rule mining and dimensionality reduction.

Prerequisites

CS 2800 or equivalent, Linear Algebra, and experience programming with Python or Matlab, or permission of the instructor.

Room & Time

Tuesdays and Thursdays, 12:30PM-1:45PM, Bloomberg Center 131, Cornell Tech

Class number: 12453

Textbooks (Available for free)

  • Required: T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition), Springer-Verlag, 2008.
  • Recommended: Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin, Learning from Data, AMLBook, 2012.
  • Recommended: P. Harrington, Machine Learning in Action, Manning, 2012.
  • Recommended: A. Rajaraman, J. Leskovec and J. Ullman, Mining of Massive Datasets, v1.1.
  • Recommended: H. Daumé III, A Course in Machine Learning, v0.8.

Course Requirements and Grading

Grade Breakdown: Your grade will be determined by the assignments (30%), one prelim (30%), a final exam (30%), and in-class quizzes (10%).

Homework: There will be four assignments and an “assignment 0” for environment setup. Each assignment will have a due date for completion. Half of the points of the lowest-scoring assignment will count as extra credit: the total for homeworks 1, 2, 3, and 4 is calculated as (sum of scores) / 3.5, so the denominator counts only three and a half assignments.

Late Policy: Each student has a total of one slip day that may be used without penalty.

External Code: Unless otherwise specified, you are allowed to use well-known libraries such as scikit-learn, scikit-image, numpy, and scipy in the assignments. Any reference to or copy of public code should be properly cited in your submission (examples include GitHub, Wikipedia, and blogs). In some assignments you are NOT allowed to use any of the libraries above; please refer to the individual homework instructions for details.

Collaboration: You are encouraged (but not required) to work in groups of no more than 2 students on each assignment. Please indicate the name of your collaborator at the top of each assignment and cite any references you used (including articles, books, code, websites, and personal communications). If you’re not sure whether to cite a source, err on the side of caution and cite it. You may submit just one writeup for the group. Remember not to plagiarize: all solutions must be written by members of the group.

Quizzes: There will be surprise in-class quizzes to make sure you attend and pay attention to the class.

  • Past Prelim 0
  • Past Prelim 1
  • Past Prelim 2

Final Exam: December 3-10. The final exam will be hosted on Kaggle. You will develop an algorithm, prepare a professional paper, submit an anonymized version to the EasyChair conference system, and peer-review the work from other groups. You are strongly encouraged to work in a group of three students.


MGMT 4190/6560 Introduction to Machine Learning Applications @Rensselaer


Assignment 3


Save your working file in Google drive so that all changes will be saved as you work. MAKE SURE that your final version is saved to GitHub.

Before you turn this in, make sure everything runs as expected. First, restart the kernel (in the menu, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). You can speak with others regarding the assignment but all work must be your own.

This is a 30-point assignment.

You may find it useful to go through the notebooks from the course materials when doing these exercises.

If you attempt to fake passing the tests you will receive a 0 on the assignment and it will be considered an ethical violation.

Exercises - For and If and Functions

(1). Create a function list_step that accepts 3 variables (start, stop, step). The function returns a list starting at start, ending before stop, and increasing by step.

For example:

list_step(5, 19, 2)

Should return:

[5, 7, 9, 11, 13, 15, 17]
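
One possible implementation, treating stop as exclusive to match the example (a sketch, not the official solution):

```python
def list_step(start, stop, step):
    # stop is exclusive, matching the example above
    return list(range(start, stop, step))

print(list_step(5, 19, 2))  # [5, 7, 9, 11, 13, 15, 17]
```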

(2). Create a function list_divisible that accepts 3 variables (start, stop, divisible). Use a for loop to create a list of all numbers from start to stop which are divisible by divisible.

list_divisible(1, 50, 7)

[7, 14, 21, 28, 35, 42, 49]

(3). Create a function list_divisible_not that accepts 4 variables (start, stop, divisible, not_divisible). Use a for loop to create a list of all numbers from start to stop which are divisible by divisible but not divisible by not_divisible.

list_divisible_not(1, 100, 4, 3)

[4, 8, 16, 20, 28, 32, 40, 44, 52, 56, 64, 68, 76, 80, 88, 92]

Exercises - Titanic

The following exercises will use the titanic data from Kaggle. I’ve included it in the input folder just like Kaggle.

(4) What is the key difference between the train and the test?

(5) Create a new column family that is equal to SibSp * Parch for both the train and the test dataframes. DON'T use a for loop.

(6). While we can submit our answer to Kaggle to see how it will perform, we can also utilize our training data to assess accuracy. Accuracy is the percentage of predictions made correctly, i.e., the percentage of people for whom our prediction regarding their survival is correct. In other words, accuracy = (# correct predictions) / (total # of predictions). Create a function generate_accuracy which accepts two Pandas Series objects (predicted, actual) and returns the accuracy.

For example, when predicted and actual are two Pandas Series: generate_accuracy(predicted, actual)

For the sample data below, the function should return 50.0 (i.e., a percentage).
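
A minimal sketch of such a function (the sample Series here are illustrative, not the assignment's data):

```python
import pandas as pd

def generate_accuracy(predicted, actual):
    """Percentage of predictions that match the actual values."""
    return 100.0 * (predicted == actual).mean()

# illustrative sample data: half the predictions match
predicted = pd.Series([0, 1, 0, 1])
actual = pd.Series([0, 0, 0, 0])
print(generate_accuracy(predicted, actual))  # 50.0
```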

(7) Create a column PredEveryoneDies which is equal to 0 for everyone in both training and testing datasets.

(8) Find the accuracy of PredEveryoneDies in predicting Survived using the function generate_accuracy that you created earlier and assign it to the AccEveryoneDies variable.

(9) In both the training and testing datasets, create the column PredGender that is 1 if the person is a woman and 0 if the person is a man. (This is based on the "women and children first" protocol of shipwrecks.) Then set AccGender to the accuracy of PredGender in the train dataset.

(10). Create a generate_submission function that accepts a DataFrame, a target column, and a filename (df, target, filename) and writes out the submission file with just the passengerID and the Survived columns, where the Survived column is equal to the target column.

For example: submitdie = generate_submission(train, 'PredEveryoneDies', 'submiteveryonedies.csv')

Should return a dataframe with just passengerID and the Survived column.

Make sure your submission file prediction for Survived is an integer and not a float. If you submit a float it may not work.

(11). To use the women and children first protocol, we will need to use the age field. This has some missing values. We are going to replace null values in the train and test set with the median value for each.

For this particular question:

Set the variables AgeMissingTrain and AgeMissingTest using the count of the number of missing values in the train and test sets, respectively.

Set the variables AgeMedianTrain and AgeMedianTest using the median age of the train and test sets, respectively.

(12) For rows in which the age value is missing, set the age to the appropriate median value for the train/test set.

(13). In our initial calculation of the PredGender column, we made our prediction based on whether the individual was male or female. In accordance with the women and children first protocol, we hypothesize that our model could be improved by including whether the individual was a child in addition to gender. We also have a question: what age should we use to determine "child"? (People weren't likely to check for IDs.) We will check 2 ages, <13 and <18 (somewhat arbitrary, but we have to start somewhere), and see which yields a better accuracy.

Specifically, create 2 predictions as follows:

train['PredGenderAge13'] should be the prediction incorporating both Gender (women survive) and Age (children with Age<13 survived while those with Age>=13 died).

train['PredGenderAge18'] should be the prediction incorporating both Gender (women survive) and Age (children with Age<18 survived while those with Age>=18 died).

The analysis assumes that you have addressed missing values in the earlier step; you should do this for both the train and test dataframes.

(14). Calculate the accuracy for your new predictions. Use PredGenderAge13 in the training set to calculate AccGenderAge13 (you can use your function again!) and PredGenderAge18 to calculate AccGenderAge18 .

(15). You should find that the accuracy is higher when using 13 as a designation for a child than 18. What does this tell you about the role of age in surviving a shipwreck?

(16) Create a prediction file for the "women and children first" model using the test dataset and upload it to Kaggle. Go through the process of uploading it to Kaggle. Put down your Kaggle username so we can verify your prediction occurred.

Make sure your submission file prediction is an integer and not a float. If you submit a float it may not work.


gatoytoro / Assignment: Machine Learning Prediction

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "<center>\n <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n</center>\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "# **Space X Falcon 9 First Stage Landing Prediction**\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Assignment: Machine Learning Prediction\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Estimated time needed: **60** minutes\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Space X advertises Falcon 9 rocket launches on its website with a cost of 62 million dollars; other providers cost upward of 165 million dollars each, much of the savings is because Space X can reuse the first stage. Therefore if we can determine if the first stage will land, we can determine the cost of a launch. This information can be used if an alternate company wants to bid against space X for a rocket launch. In this lab, you will create a machine learning pipeline to predict if the first stage will land given the data from the preceding labs.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing\\_1.gif)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Several examples of an unsuccessful landing are shown here:\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Most unsuccessful landings are planed. Space X; performs a controlled landing in the oceans.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Objectives\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Perform exploratory Data Analysis and determine Training Labels\n\n* create a column for the class\n* Standardize the data\n* Split into training data and test data\n\n\\-Find best Hyperparameter for SVM, Classification Trees and Logistic Regression\n\n* Find the method performs best using test data\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": ""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "***\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Import Libraries and Define Auxiliary Functions\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We will import the following libraries for the lab\n"
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": "# Pandas is a software library written for the Python programming language for data manipulation and analysis.\nimport pandas as pd\n# NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays\nimport numpy as np\n# Matplotlib is a plotting library for python and pyplot gives us a MatLab like plotting framework. We will use this in our plotter function to plot data.\nimport matplotlib.pyplot as plt\n#Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics\nimport seaborn as sns\n# Preprocessing allows us to standarsize our data\nfrom sklearn import preprocessing\n# Allows us to split our data into training and testing data\nfrom sklearn.model_selection import train_test_split\n# Allows us to test parameters of classification algorithms and find the best one\nfrom sklearn.model_selection import GridSearchCV\n# Logistic Regression classification algorithm\nfrom sklearn.linear_model import LogisticRegression\n# Support Vector Machine classification algorithm\nfrom sklearn.svm import SVC\n# Decision Tree classification algorithm\nfrom sklearn.tree import DecisionTreeClassifier\n# K Nearest Neighbors classification algorithm\nfrom sklearn.neighbors import KNeighborsClassifier"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "This function is to plot the confusion matrix.\n"
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": "def plot_confusion_matrix(y,y_predict):\n \"this function plots the confusion matrix\"\n from sklearn.metrics import confusion_matrix\n\n cm = confusion_matrix(y, y_predict)\n ax= plt.subplot()\n sns.heatmap(cm, annot=True, ax = ax); #annot=True to annotate cells\n ax.set_xlabel('Predicted labels')\n ax.set_ylabel('True labels')\n ax.set_title('Confusion Matrix'); \n ax.xaxis.set_ticklabels(['did not land', 'land']); ax.yaxis.set_ticklabels(['did not land', 'landed'])"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Load the dataframe\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Load the data\n"
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>FlightNumber</th>\n <th>Date</th>\n <th>BoosterVersion</th>\n <th>PayloadMass</th>\n <th>Orbit</th>\n <th>LaunchSite</th>\n <th>Outcome</th>\n <th>Flights</th>\n <th>GridFins</th>\n <th>Reused</th>\n <th>Legs</th>\n <th>LandingPad</th>\n <th>Block</th>\n <th>ReusedCount</th>\n <th>Serial</th>\n <th>Longitude</th>\n <th>Latitude</th>\n <th>Class</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>2010-06-04</td>\n <td>Falcon 9</td>\n <td>6104.959412</td>\n <td>LEO</td>\n <td>CCAFS SLC 40</td>\n <td>None None</td>\n <td>1</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>NaN</td>\n <td>1.0</td>\n <td>0</td>\n <td>B0003</td>\n <td>-80.577366</td>\n <td>28.561857</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>2012-05-22</td>\n <td>Falcon 9</td>\n <td>525.000000</td>\n <td>LEO</td>\n <td>CCAFS SLC 40</td>\n <td>None None</td>\n <td>1</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>NaN</td>\n <td>1.0</td>\n <td>0</td>\n <td>B0005</td>\n <td>-80.577366</td>\n <td>28.561857</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>2013-03-01</td>\n <td>Falcon 9</td>\n <td>677.000000</td>\n <td>ISS</td>\n <td>CCAFS SLC 40</td>\n <td>None None</td>\n <td>1</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>NaN</td>\n <td>1.0</td>\n <td>0</td>\n <td>B0007</td>\n <td>-80.577366</td>\n <td>28.561857</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4</td>\n <td>2013-09-29</td>\n <td>Falcon 9</td>\n <td>500.000000</td>\n <td>PO</td>\n <td>VAFB SLC 4E</td>\n <td>False Ocean</td>\n <td>1</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>NaN</td>\n <td>1.0</td>\n <td>0</td>\n <td>B1003</td>\n <td>-120.610829</td>\n <td>34.632093</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5</td>\n <td>2013-12-03</td>\n <td>Falcon 9</td>\n <td>3170.000000</td>\n <td>GTO</td>\n <td>CCAFS SLC 40</td>\n <td>None None</td>\n <td>1</td>\n <td>False</td>\n <td>False</td>\n <td>False</td>\n <td>NaN</td>\n <td>1.0</td>\n <td>0</td>\n <td>B1004</td>\n <td>-80.577366</td>\n <td>28.561857</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " FlightNumber Date BoosterVersion PayloadMass Orbit LaunchSite \\\n0 1 2010-06-04 Falcon 9 6104.959412 LEO CCAFS SLC 40 \n1 2 2012-05-22 Falcon 9 525.000000 LEO CCAFS SLC 40 \n2 3 2013-03-01 Falcon 9 677.000000 ISS CCAFS SLC 40 \n3 4 2013-09-29 Falcon 9 500.000000 PO VAFB SLC 4E \n4 5 2013-12-03 Falcon 9 3170.000000 GTO CCAFS SLC 40 \n\n Outcome Flights GridFins Reused Legs LandingPad Block \\\n0 None None 1 False False False NaN 1.0 \n1 None None 1 False False False NaN 1.0 \n2 None None 1 False False False NaN 1.0 \n3 False Ocean 1 False False False NaN 1.0 \n4 None None 1 False False False NaN 1.0 \n\n ReusedCount Serial Longitude Latitude Class \n0 0 B0003 -80.577366 28.561857 0 \n1 0 B0005 -80.577366 28.561857 0 \n2 0 B0007 -80.577366 28.561857 0 \n3 0 B1003 -120.610829 34.632093 0 \n4 0 B1004 -80.577366 28.561857 0 "
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "data = pd.read_csv(\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_2.csv\")\n\n# If you were unable to complete the previous lab correctly you can uncomment and load this csv\n\n# data = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/dataset_part_2.csv')\n\ndata.head()"
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>FlightNumber</th>\n <th>PayloadMass</th>\n <th>Flights</th>\n <th>Block</th>\n <th>ReusedCount</th>\n <th>Orbit_ES-L1</th>\n <th>Orbit_GEO</th>\n <th>Orbit_GTO</th>\n <th>Orbit_HEO</th>\n <th>Orbit_ISS</th>\n <th>...</th>\n <th>Serial_B1058</th>\n <th>Serial_B1059</th>\n <th>Serial_B1060</th>\n <th>Serial_B1062</th>\n <th>GridFins_False</th>\n <th>GridFins_True</th>\n <th>Reused_False</th>\n <th>Reused_True</th>\n <th>Legs_False</th>\n <th>Legs_True</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1.0</td>\n <td>6104.959412</td>\n <td>1.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>...</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2.0</td>\n <td>525.000000</td>\n <td>1.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>...</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3.0</td>\n <td>677.000000</td>\n <td>1.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>...</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>4.0</td>\n <td>500.000000</td>\n <td>1.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>...</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>5.0</td>\n <td>3170.000000</td>\n <td>1.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>...</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>5</th>\n <td>6.0</td>\n <td>3325.000000</td>\n <td>1.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>...</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7.0</td>\n <td>2296.000000</td>\n <td>1.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>...</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>7</th>\n <td>8.0</td>\n <td>1316.000000</td>\n <td>1.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>...</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n 
<td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n </tr>\n <tr>\n <th>8</th>\n <td>9.0</td>\n <td>4535.000000</td>\n <td>1.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>...</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n </tr>\n <tr>\n <th>9</th>\n <td>10.0</td>\n <td>4428.000000</td>\n <td>1.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>...</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n </tr>\n </tbody>\n</table>\n<p>10 rows \u00d7 83 columns</p>\n</div>",
"text/plain": " FlightNumber PayloadMass Flights Block ReusedCount Orbit_ES-L1 \\\n0 1.0 6104.959412 1.0 1.0 0.0 0.0 \n1 2.0 525.000000 1.0 1.0 0.0 0.0 \n2 3.0 677.000000 1.0 1.0 0.0 0.0 \n3 4.0 500.000000 1.0 1.0 0.0 0.0 \n4 5.0 3170.000000 1.0 1.0 0.0 0.0 \n5 6.0 3325.000000 1.0 1.0 0.0 0.0 \n6 7.0 2296.000000 1.0 1.0 0.0 0.0 \n7 8.0 1316.000000 1.0 1.0 0.0 0.0 \n8 9.0 4535.000000 1.0 1.0 0.0 0.0 \n9 10.0 4428.000000 1.0 1.0 0.0 0.0 \n\n Orbit_GEO Orbit_GTO Orbit_HEO Orbit_ISS ... Serial_B1058 \\\n0 0.0 0.0 0.0 0.0 ... 0.0 \n1 0.0 0.0 0.0 0.0 ... 0.0 \n2 0.0 0.0 0.0 1.0 ... 0.0 \n3 0.0 0.0 0.0 0.0 ... 0.0 \n4 0.0 1.0 0.0 0.0 ... 0.0 \n5 0.0 1.0 0.0 0.0 ... 0.0 \n6 0.0 0.0 0.0 1.0 ... 0.0 \n7 0.0 0.0 0.0 0.0 ... 0.0 \n8 0.0 1.0 0.0 0.0 ... 0.0 \n9 0.0 1.0 0.0 0.0 ... 0.0 \n\n Serial_B1059 Serial_B1060 Serial_B1062 GridFins_False GridFins_True \\\n0 0.0 0.0 0.0 1.0 0.0 \n1 0.0 0.0 0.0 1.0 0.0 \n2 0.0 0.0 0.0 1.0 0.0 \n3 0.0 0.0 0.0 1.0 0.0 \n4 0.0 0.0 0.0 1.0 0.0 \n5 0.0 0.0 0.0 1.0 0.0 \n6 0.0 0.0 0.0 1.0 0.0 \n7 0.0 0.0 0.0 1.0 0.0 \n8 0.0 0.0 0.0 1.0 0.0 \n9 0.0 0.0 0.0 1.0 0.0 \n\n Reused_False Reused_True Legs_False Legs_True \n0 1.0 0.0 1.0 0.0 \n1 1.0 0.0 1.0 0.0 \n2 1.0 0.0 1.0 0.0 \n3 1.0 0.0 1.0 0.0 \n4 1.0 0.0 1.0 0.0 \n5 1.0 0.0 1.0 0.0 \n6 1.0 0.0 0.0 1.0 \n7 1.0 0.0 0.0 1.0 \n8 1.0 0.0 1.0 0.0 \n9 1.0 0.0 1.0 0.0 \n\n[10 rows x 83 columns]"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "X = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_3.csv')\n\n# If you were unable to complete the previous lab correctly you can uncomment and load this csv\n\n# X = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/dataset_part_3.csv')\n\nX.head(10)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## TASK 1\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Create a NumPy array from the column <code>Class</code> in <code>data</code>, by applying the method <code>to_numpy()</code> then\nassign it to the variable <code>Y</code>,make sure the output is a Pandas series (only one bracket df\\['name of column']).\n"
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "numpy.ndarray"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "Y = data['Class'].to_numpy()\ntype(Y)"
},
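{
"cell_type": "markdown",
"metadata": {},
"source": "As an aside (not part of the original lab): single brackets <code>df['col']</code> select a Pandas Series, while double brackets <code>df[['col']]</code> select a DataFrame. A minimal sketch of the difference, assuming <code>data</code> is already loaded:\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Illustrative sketch: single vs. double brackets (assumes `data` is loaded)\nprint(type(data['Class']))      # pandas Series\nprint(type(data[['Class']]))    # pandas DataFrame\n# .to_numpy() on the Series gives the 1-D array expected for Y\nprint(data['Class'].to_numpy().shape)"
},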
{
"cell_type": "markdown",
"metadata": {},
"source": "## TASK 2\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Standardize the data in <code>X</code> then reassign it to the variable <code>X</code> using the transform provided below.\n"
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "StandardScaler()"
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "# students get this \nscaler = preprocessing.StandardScaler().fit(X)\nscaler"
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "array([[-1.71291154e+00, -1.94814463e-16, -6.53912840e-01, ...,\n -8.35531692e-01, 1.93309133e+00, -1.93309133e+00],\n [-1.67441914e+00, -1.19523159e+00, -6.53912840e-01, ...,\n -8.35531692e-01, 1.93309133e+00, -1.93309133e+00],\n [-1.63592675e+00, -1.16267307e+00, -6.53912840e-01, ...,\n -8.35531692e-01, 1.93309133e+00, -1.93309133e+00],\n ...,\n [ 1.63592675e+00, 1.99100483e+00, 3.49060516e+00, ...,\n 1.19684269e+00, -5.17306132e-01, 5.17306132e-01],\n [ 1.67441914e+00, 1.99100483e+00, 1.00389436e+00, ...,\n 1.19684269e+00, -5.17306132e-01, 5.17306132e-01],\n [ 1.71291154e+00, -5.19213966e-01, -6.53912840e-01, ...,\n -8.35531692e-01, -5.17306132e-01, 5.17306132e-01]])"
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "X = scaler.transform(X)\nX"
},
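{
"cell_type": "markdown",
"metadata": {},
"source": "A side note beyond the lab: the cell above fits the scaler on the full dataset before the train/test split, which leaks test-set statistics into training. A leakage-free variant (a sketch, usable only after <code>X_train</code> and <code>X_test</code> exist in TASK 3 below) would fit on the training data only:\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Leakage-free sketch: fit the scaler on training data only, then apply\n# the same transform to the test data. Kept commented out because the lab\n# intentionally scales before splitting; run it only after TASK 3.\n# scaler = preprocessing.StandardScaler().fit(X_train)\n# X_train = scaler.transform(X_train)\n# X_test = scaler.transform(X_test)"
},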
{
"cell_type": "markdown",
"metadata": {},
"source": "We split the data into training and testing data using the function <code>train_test_split</code>. The training data is divided into validation data, a second set used for training data; then the models are trained and hyperparameters are selected using the function <code>GridSearchCV</code>.\n"
},
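{
"cell_type": "markdown",
"metadata": {},
"source": "To make the role of the validation folds concrete, here is a minimal sketch (not required by the lab) of what <code>GridSearchCV</code> does internally for a single candidate setting, using <code>cross_val_score</code> on the full standardized data:\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Sketch: 10-fold cross-validation score for one candidate model.\n# GridSearchCV repeats this for every parameter combination and keeps the best.\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import cross_val_score\n\nscores = cross_val_score(LogisticRegression(C=1.0), X, Y, cv=10)\nprint(scores.mean())"
},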
{
"cell_type": "markdown",
"metadata": {},
"source": "## TASK 3\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Use the function train_test_split to split the data X and Y into training and test data. Set the parameter test_size to 0.2 and random_state to 2. The training data and test data should be assigned to the following labels.\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<code>X_train, X_test, Y_train, Y_test</code>\n"
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": "X_train, X_test, Y_train, Y_test = train_test_split( X, Y, test_size=0.2, random_state=4)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "we can see we only have 18 test samples.\n"
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "(18,)"
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "Y_test.shape"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## TASK 4\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Create a logistic regression object then create a GridSearchCV object <code>logreg_cv</code> with cv = 10. Fit the object to find the best parameters from the dictionary <code>parameters</code>.\n"
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": "parameters ={'penalty':['l2'],'C':[0.01,0.1,1],'solver':['lbfgs']}"
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "GridSearchCV(cv=10, estimator=LogisticRegression(),\n param_grid={'C': [0.01, 0.1, 1], 'penalty': ['l2'],\n 'solver': ['lbfgs']})"
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "parameters ={'penalty':['l2'],'C':[0.01,0.1,1],'solver':['lbfgs']}# l1 lasso l2 ridge\n\nlogreg_cv= GridSearchCV(LogisticRegression(), parameters, cv=10)\nlogreg_cv.fit(X_train, Y_train)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We output the <code>GridSearchCV</code> object for logistic regression. We display the best parameters using the data attribute <code>best_params\\_</code> and the accuracy on the validation data using the data attribute <code>best_score\\_</code>.\n"
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "tuned hpyerparameters :(best parameters) {'C': 0.01, 'penalty': 'l2', 'solver': 'lbfgs'}\naccuracy : 0.8357142857142857\n"
}
],
"source": "print(\"tuned hpyerparameters :(best parameters) \",logreg_cv.best_params_)\nprint(\"accuracy :\",logreg_cv.best_score_)"
},
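{
"cell_type": "markdown",
"metadata": {},
"source": "Beyond <code>best_params_</code> and <code>best_score_</code>, the fitted object also exposes <code>cv_results_</code>, a dictionary of per-candidate statistics. A short sketch of how one might inspect it:\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Sketch: mean and spread of validation accuracy for every candidate in the grid\nimport pandas as pd\n\nresults = pd.DataFrame(logreg_cv.cv_results_)\nprint(results[['params', 'mean_test_score', 'std_test_score']])"
},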
{
"cell_type": "markdown",
"metadata": {},
"source": "## TASK 5\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Calculate the accuracy on the test data using the method <code>score</code>:\n"
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "0.7222222222222222"
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "logreg_cv.score(X_test, Y_test)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Lets look at the confusion matrix:\n"
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWgAAAEWCAYAAABLzQ1kAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAfrUlEQVR4nO3dd5xdVbn/8c93EiCUJBAIHUzQgAqYiAEpgkFQ6cVCaP4UwcCVdrEgIBcQxHaFC/4ulggqhoAUKdKREikihoRQQxEQCAkQAin0zMxz/9h74CTMzClz9px1Zr5vXvs15+x9ztpPZobnrHn2WmsrIjAzs/S0NDoAMzPrnBO0mVminKDNzBLlBG1mlignaDOzRDlBm5klygnaekzS8pKulrRA0qU9aOcASTfVM7ZGkHS9pK82Og5rfk7Q/Yik/SXdK+k1SXPyRPKpOjT9JWANYNWI+HKtjUTE5Ij4XB3iWYKkcZJC0uVL7R+d759SYTunSLqg3OsiYueIOL/GcM3e5QTdT0j6FnAW8COyZLo+8Etgzzo0/wHg8YhorUNbRZkLbC1p1ZJ9XwUer9cJlPH/U1Y3/mXqByQNBU4FDo+IyyPi9YhYHBFXR8R389csJ+ksSbPz7SxJy+XHxkmaJenbkl7Ke98H5cd+AJwEjM975gcv3dOUNCLvqQ7Mn39N0lOSFkl6WtIBJfvvLHnf1pKm5qWTqZK2Ljk2RdJpku7K27lJ0mrdfBveAa4E9s3fPwDYB5i81PfqbEnPSVooaZqkbfP9OwEnlPw77y+J43RJdwFvABvk+w7Jj/9K0mUl7f9U0i2SVOnPz/ovJ+j+YStgEHBFN6/5PrAlMAYYDWwBnFhyfE1gKLAOcDBwjqRVIuJksl75xRGxUkSc110gklYEfgHsHBGDga2BGZ28bhhwbf7aVYEzgWuX6gHvDxwErA4sC3ynu3MDfwT+X/7488DDwOylXjOV7HswDLgQuFTSoIi4Yal/5+iS93wFmAAMBp5Zqr1vAx/LP3y2JfvefTW8xoJVwAm6f1gVeLlMCeIA4NSIeCki5gI/IEs8HRbnxxdHxHXAa8BGNcbTDmwiafmImBMRD3fyml2BJyJiUkS0RsRFwKPA7iWv+X1EPB4RbwKXkCXWLkXE34FhkjYiS9R/7OQ1F0TEvPycZwDLUf7f+YeIeDh/z+Kl2nsDOJDsA+YC4MiImFWmPTPACbq/mAes1lFi6MLaLNn7eybf924bSyX4N4CVqg0kIl4HxgOHAXMkXSvpwxXE0xHTOiXPX6ghnknAEcD2dPIXRV7GmZmXVeaT/dXQXekE4LnuDkbEP4GnAJF9kJhVxAm6f7gbeAvYq5vXzCa72Ndhfd7/53+lXgdWKHm+ZunBiLgxIj4LrEXWK/5tBfF0xPR8jTF1mAR8E7gu792+Ky9BfI+sNr1KRKwMLCBLrABdlSW6LVdIOpysJz4bOLbmyK3fcYLuByJiAdmFvHMk7SVpBUnLSNpZ0s/yl10EnChpeH6x7SSyP8lrMQPYTtL6+QXK4zsOSFpD0h55LfptslJJWydtXAdsmA8NHChpPPBR4JoaYwIgIp4GPk1Wc1/aYKCVbMTHQEknAUNKjr8IjKhmpIakDYEfkpU5vgIcK2lMbdFbf+ME3U9ExJnAt8gu/M0l+7P8CLKRDZAlkXuBB4AHgen5vlrO9Vfg4rytaSyZVFvILpzNBl4hS5bf7KSNecBu+WvnkfU8d4uIl2uJaam274yIzv46uBG4nmzo3TNkf3WUli86JuHMkzS93HnyktIFwE8j4v6IeIJsJMikjhEyZt2RLyabmaXJPWgzs0Q5QZuZ1Zmk3+WTuh4q2TdM0l8lPZF/XaVcO07QZmb19wdgp6X2HQfcEhGjgFvy591yDdrMrACSRgDXRMQm+fPHgHERMUfSWsCUiOh2ElR3Exca6tgR+/mTw95nn9Y3Gx2CJWjsrCt7vLbJ4pefqjjnLDv8g4eSTe/vMDEiJpZ52xoRMQcgT9KrlztPsgnazCxVeTIul5B7zAnazAygvbP5UnX1oqS1SkocL5V7gy8SmpkBtLVWvtXmL2RrkJN/varcG9yDNjMDItrr1paki4BxZIuUzQJOBn4CXCLpYOBZoOzdh5ygzcwA2uuXoCNivy4O7VBNO07QZmYAdexB14sTtJkZ9MZFwqo5QZuZgXvQZmapitpHZxTGCdrMDOp6kbBenKDNzMAlDjOzZPkioZlZotyDNjNLlC8SmpklyhcJzczSFOEatJlZmlyDNjNLlEscZmaJcg/azCxRbYsbHcH7OEGbmYFLHGZmyXKJw8wsUe5Bm5klygnazCxN4YuEZmaJcg3azCxRLnGYmSXKPWgzs0S5B21mlij3oM3MEtXqBfvNzNLkHrSZWaJcgzYzS5R70GZmiXIP2swsUe5Bm5klyqM4zMwSFdHoCN7HCdrMDFyDNjNLVoIJuqXRAZiZJSHaK9/KkHSMpIclPSTpIkmDagnJCdrMDKCtrfKtG5LWAY4CxkbEJsAAYN9aQnKJw8wM6l3iGAgsL2kxsAIwu5ZG3IM2M4MsQVe4SZog6d6SbUJHMxHxPPBz4FlgDrAgIm6qJST3oM3MoKqJKhExEZjY2TFJqwB7AiOB+cClkg6MiAuqDck9aDMzINqj4q2MHYGnI2JuRCwGLge2riUm96DNzKCeNehngS0lrQC8CewA3FtLQ07QZmZQdnRGpSLiHkmXAdOBVuA+uiiHlOMEbWYGdR3FEREnAyf3tB0naDMzSHImoRN0wgYutwyHXXwSA5dbhpYBA3jw+nv46/9c1uiwLAGb3j2RttffhLZ2orWNmbt+p9EhNT8vlmTVaH17MRP3/yHvvPE2LQMH8M3LTuGxKTN49r5/NTo0S8DjXz6R1lcXNTqMvqM/9KAlLQK6/CiKiCH1Pmdf9s4bbwMwYOAABgwcQCT4KW/WJ5QfPtfr6p6gI2IwgKRTgReASYCAA4DB9T5fX6cWcfQ1P2LVD6zJ3yfdxHMznmx0SJaCCEZdeAoEzJ18Iy9PrmmimpWq0yiOeiqyxPH5iPhkyfNfSboH+FlXb8inS04A+NywsYwe/KECw2sO0R6ctcvxDBqyAl/9zbdYY8N1efHxWY0Oyxrs0b2PY/GLrzJw1aFseNEpvPWvWbx2zyONDqupRYIljiJnErZJOkDSAEktkg4Auv2IioiJETE2IsY6OS/prYVv8OQ/ZrLRp0c3OhRLwOIXXwWgdd4C5t9wDyuOGdXgiPqA9qh86yVFJuj9gX2AF/Pty/k+q9CKwwYzaMgKQDaiY9Q2mzD3yZoWxbI+pGX55WhZcdC7j4dsN4Y3H3u2wVH1AXVcD7peCitxRMS/yRYMsRoNXn0Vxp/xH7S0tKAW8cC1/2Dmrfc1OixrsIHDV+ZD5x4HgAYM4JUrb2fhFP9e9Fh/uEjYQdJw4BvAiNLzRMTXizpnX/PCo89y9q7HNzoMS8w7z77II587ptFh9D2t/esi4VXAHcDNlKk9m5k1XC+WLipVZIJeISK+V2D7Zmb1k2CJo8iLh
NdI2qXA9s3M6iba2yveekuRPeijgRMkvQ0sJpusEp5JaGZJSrAHXeQoDs8aNLPm0Z8SNLx7b65RwKCOfRFxe5HnNDOrSX+a6i3pELIyx7rADGBL4G7gM0Wd08ysVhXca7DXFXmR8Ghgc+CZiNge+Dgwt8DzmZnVLsGp3kWWON6KiLckIWm5iHhU0kYFns/MrHYJLpZUZIKeJWll4Ergr5JeBbyQhJmlKcESR5GjOPbOH54i6TZgKHBDUeczM+uR/pCgJQ3rZPeD+deVgFfqfU4zs56Ktv5R4phGdssrlezreB7ABgWc08ysZ/pDDzoiRta7TTOzoqU4zM539TYzg/7RgzYza0rplaCdoM3MAKI1vQxd2ExCSZMq2WdmloT2KrZeUmQPeuPSJ5IGAJ8o8HxmZjVL8SJh3XvQko6XtAj4mKSFkhblz18iuw2WmVl6EuxB1z1BR8SP87Wg/zsihkTE4HxbNSJ8B1QzS1K0R8VbbylyqvfxkvYAtst3TYmIa4o6n5lZj6R3jbDQ9aB/DGwBTM53HS1pG/eizSxF0droCN6vyIuEuwJjIrJ7mUs6H7gPcII2s+REgj3oqmrQklaR9LEq3rJyyeOh1ZzLzKxX1fEioaSVJV0m6VFJMyVtVUtIZXvQkqYAe+SvnQHMlfS3iPhWmbf+GLgvX2pUZLVo957NLEl17kGfDdwQEV+StCywQi2NVFLiGBoRC/N7DP4+Ik6W9EC5N0XERXly35wsQX8vIl6oJUgzs6LVK0FLGkLWIf0aQES8A7xTS1uVlDgGSloL2AeodhRGC/Ay8CqwoaTtyrzezKwhok0Vb5ImSLq3ZJtQ0tQGZPdf/b2k+ySdK2nFWmKqpAd9KnAjcGdETJW0AfBEuTdJ+ikwHniY96o2AdxeS6BmZkWqpgcdEROBiV0cHghsBhwZEfdIOhs4DvivamMqm6Aj4lLg0pLnTwFfrKDtvYCNIuLtaoMyM+tt0a7yL6rMLGBWRNyTP7+MLEFXrcsELen/k/V4OxURR5Vp+ylgGcAJ2sySV68adES8IOk5SRtFxGPADsAjtbTVXQ/63pqie88bwAxJt1CSpCtI7GZmvS6ibj1ogCOByfkIjqeAg2pppMsEHRHnlz6XtGJEvF5F23/JNzOz5NVzmF1EzADG9rSdSsZBbwWcR3ZH7vUljQYOjYhvlgnw/O6Om5mlpL2trj3ouqhkmN1ZwOeBeQARcT/vLYBkZtYnRLsq3npLRWtxRMRz0hJBtRUTjplZY/Rm4q1UJQn6OUlbA5EXvI8CZhYblplZ74r0bqhSUYI+jGxe+TrA82STVg7v6sWSrqb74Xl7VBmjmVnhmrIHHREvAwdU0ebP869fANYELsif7wf8u5rgzMx6S52H2dVFJaM4NiDrQW9J1jO+Gzgmn1H4PhHxt/x9p0VE6cXEqyV5mreZJamtSUdxXAhcAqwFrE027fuiCt43PE/uAEgaCQyvJUgzs6JFqOKtt1RSg1ZETCp5foGkIyp43zHAFEkdPe0RwKFVxmdm1iuaqgYtaVj+8DZJxwF/IitxjAeuLddwRNwgaRTw4XzXo144ycxS1WyjOKaRJeSOj5XS3m8Ap3X2JkmfiYhbJX1hqUMflEREXF5ztGZmBWmqHnREjKyxzU8DtwK7d9Ys4ARtZslpa6/qFq29oqKZhJI2AT4KDOrYFxF/7Oy1EXFy/rWm1ZvMzBqh2UocAEg6GRhHlqCvA3YG7gQ6TdCSur2ZbEScWXWUZmYFa2/GcdDAl4DRwH0RcZCkNYBzu3n94PzrRmQ3jO1YcnR3fLsrM0tUU05UAd6MiHZJrfndal8iuylipyLiBwCSbgI2i4hF+fNTKLl1lplZSpqyxAHcK2ll4LdkIzteA/5ZwfvWZ8lbjb9DNha6ImfOdmfb3u/02Xc0OgTro5qyxFGyMP+vJd0ADImIBypoexLwT0lXkI3e2BvwIv5mlqSmGsUhabPujkXE9O4ajojTJV0PbJvvOigi7qstTDOzYiVY4ei2B31GN8cC+Ey5xvMk3m0iNzNLQVOVOCJi+94MxMyskZp1FIeZWZ9Xx5t6140TtJkZELgHbWaWpNYESxxlx5Uoc6Ckk/Ln60vaovjQzMx6T6CKt95SycC/XwJbkd1TEGARcE5hEZmZNUB7FVtvqaTE8cmI2EzSfQAR8aqkZQuOy8ysVzVrDXqxpAHk47glDSfNC55mZjVLMalVkqB/AVwBrC7pdLLV7U4sNCozs17W1ow96IiYLGkasAPZ7a/2ioiZhUdmZtaLErzjVUUL9q8PvAFcXbovIp4tMjAzs97U3ow9aLI7eHfcPHYQMBJ4DNi4wLjMzHpVsy2WBEBEbFr6PF/l7tAuXm5m1pSa9SLhEiJiuqTNiwjGzKxR2tWEJY6lbgLbAmwGzC0sIjOzBmhrdACdqKQHPbjkcStZTfrPxYRjZtYY9R7Fkc8fuRd4PiJ2q6WNbhN0foKVIuK7tTRuZtYsChjFcTQwExhSawNdrsUhaWBEtJGVNMzM+rSoYitH0rrArsC5PYmpux70P8mS8wxJfwEuBV7vOBgRl/fkxGZmKammxCFpAjChZNfEiJhY8vws4FiWLBFXrZIa9DBgHtk9CDvGQwfgBG1mfUY1w+zyZDyxs2OSdgNeiohpksb1JKbuEvTq+QiOh3gvMb8bX09OamaWmrb6laC3AfaQtAvZ5L4hki6IiAOrbai7BD0AWAk6rZw7QZtZn1KviSoRcTxwPEDeg/5OLckZuk/QcyLi1FoaNTNrNs02kzC9aTVmZgUp4paEETEFmFLr+7tL0DvU2qiZWbNpqh50RLzSm4GYmTVSs071NjPr85pywX4zs/6gqUocZmb9iRO0mVmiUpzc4QRtZoZr0GZmyfIoDjOzRLUnWORwgjYzwxcJzcySlV7/2QnazAxwD9rMLFmtSq8P7QRtZoZLHGZmyXKJw8wsUR5mZ2aWqPTSsxO0mRngEoeZWbLaEuxDO0GbmeEetJlZssI9aDOzNKXYg25pdADWvc9/bhwPP3Q7jz5yJ8d+9/BGh2MNcuKPzmS7XfdlrwMPe3ffgoWLOOToE9hl/MEccvQJLFi4qIERNr92ouKttzhBJ6ylpYVfnH06u+1+IJuO3p7x4/fiIx8Z1eiwrAH22uWz/PrMHy6x79xJl7Dl2DFcd/F5bDl2DOddcEmDousbooqttzhBJ2yLzT/Ok0/+m6effpbFixdzySVXscfun290WNYAY8dsytAhg5fYd9sdd7PnzjsCsOfOO3Lr7Xc3IrQ+o5WoeOstTtAJW3udNXlu1ux3n896fg5rr71mAyOylMx7dT7DVxsGwPDVhvHK/AUNjqi5RRX/9ZZCLhJK+kJ3xyPi8i7eNwGYAKABQ2lpWbGA6JqH9P6bpEWkd6XZrC9I8SJhUaM4ds+/rg5sDdyaP98emAJ0mqAjYiIwEWDgsuv0+0z0/Kw5rLfu2u8+X3edtZgz58UGRmQpWXWVlZn78isMX20Yc19+hWErD210SE0txWF2hZQ4IuKgiDiIrJ7+0Yj4YkR8Edi4iPP1
VVPvncGHPjSSESPWY5lllmGfffbk6mtuanRYlohxn9qSq66/GYCrrr+Z7bfdqsERNbf2KrbeUvQ46BERMafk+YvAhgWfs89oa2vj6P88keuuvZABLS384fyLeeSRxxsdljXAd0/+CVPve4D58xeyw14H8s2Dv8IhX9mHb//Xj7j8mhtZa43hnPnD7zc6zKbWlmD5UEXWNCX9LzAKuIisN70v8K+IOLLce13isM68OfuORodgCVpmtQ3ef8GmSvt/YO+Kc86Fz1zR4/NVotAedEQcIWlvYLt818SIuKLIc5qZ1SLFGnRvTPWeDiyKiJslrSBpcER4ypOZJSXFURyFjoOW9A3gMuA3+a51gCuLPKeZWS3641Tvw4FtgIUAEfEE2dA7M7Ok1GuiiqT1JN0maaakhyUdXWtMRZc43o6IdzomXEgaSJp3ljGzfq6OozhagW9HxHRJg4Fpkv4aEY9U21DRPei/SToBWF7SZ4FLgasLPqeZWdXqVeKIiDkRMT1/vAiYSVberVrRCfo4YC7wIHAocB1wYsHnNDOrWjUTVSRNkHRvyTahszYljQA+DtxTS0xFD7NrB36bb2ZmyapmmF3pshRdkbQS8GfgPyNiYS0xFbVY0oN0U2uOiI8VcV4zs1rVc3SGpGXIkvPkrhaHq0RRPejd8q8dtwCZlH89AHijoHOamdWsXrOqlY2KOA+YGRFn9qStQhJ0RDwDIGmbiNim5NBxku4CTi3ivGZmtWqrXw96G+ArwIOSZuT7ToiI66ptqOhhditK+lRE3AkgaWugfy/ybGZJqleJI893dVmro+gEfTDwO0kdC9XOB75e8DnNzKqW4s0wih7FMQ0YLWkI2cp5viePmSWpN6dwV6rQBC1pOeCLwAhgYMeMwohwDdrMktIfV7O7ClgATAPeLvhcZmY1S3HB/qIT9LoRsVPB5zAz67EUSxxFT/X+u6RNCz6HmVmPpbjcaNE96E8BX5P0NFmJQ0B4JqGZpabfjeIAdi64fTOzukixxFH0MLuOGYWrA4OKPJeZWU/0u1EckvYAzgDWBl4CPkC2NurGRZ7XzKxabZHeXQmLvkh4GrAl8HhEjAR2AO4q+JxmZlWLiIq33lJ0gl4cEfOAFkktEXEbMKbgc5qZVa0/juKYny9afTswWdJLZPfrMjNLSoo16KJ70HsCbwLHADcATwK7F3xOM7OqtUdUvPWWokdxvF7y9Pwiz2Vm1hMp9qCLuuXVIjq/5VXHRJUhRZzXzKxWKY7iKOqOKoOLaNfMrCi9WbqoVNEXCc3MmkK/KXGYmTUb96DNzBLlHrSZWaLaoq3RIbyPE7SZGf1zuVEzs6bQ75YbNTNrFu5Bm5klyqM4zMwS5VEcZmaJ6jdTvc3Mmo1r0GZmiXIN2swsUe5Bm5klyuOgzcwS5R60mVmiPIrDzCxRvkhoZpaoFEscRd/V28ysKUQV/5UjaSdJj0n6l6Tjao3JPWgzM+rXg5Y0ADgH+CwwC5gq6S8R8Ui1bTlBm5lR1xr0FsC/IuIpAEl/AvYE+k6Cbn3neTU6hlRImhARExsdh6XFvxf1VU3OkTQBmFCya2LJz2Id4LmSY7OAT9YSk2vQzWFC+ZdYP+TfiwaJiIkRMbZkK/2g7CzR19Q9d4I2M6uvWcB6Jc/XBWbX0pATtJlZfU0FRkkaKWlZYF/gL7U0lGwN2pbgOqN1xr8XCYqIVklHADcCA4DfRcTDtbSlFAdnm5mZSxxmZslygjYzS5QTdA9IOkXSd/LHp0rasZPXjJN0TZ3Od0I3x/4tabU6nee1erRjtanX91/SCEkP1aMtawwn6DqJiJMi4uaCT9NlgjazvscJukqSvp8vgnIzsFHJ/j9I+lL+eCdJj0q6E/hCF+18TdLlkm6Q9ISkn5Uc20/Sg5IekvTTfN9PgOUlzZA0uUyMV0qaJunhfMZTx/7XJJ0u6X5J/5C0Rr5/pKS7JU2VdFoPvj1WR5JWknSLpOn578Oe+f4RkmZK+m3+M75J0vL5sU/kP9+7gcMb+g+wHnOCroKkT5CNafw4WeLdvJPXDAJ+C+wObAus2U2TY4DxwKbAeEnrSVob+Cnwmfz45pL2iojjgDcjYkxEHFAm1K9HxCeAscBRklbN968I/CMiRgO3A9/I958N/CoiNgdeKNO29Z63gL0jYjNge+AMSR2z1EYB50TExsB84Iv5/t8DR0XEVr0drNWfE3R1tgWuiIg3ImIhnQ8+/zDwdEQ8EdkYxgu6ae+WiFgQEW+RLaTyAbKkPyUi5kZEKzAZ2K7KOI+SdD/wD7IZTaPy/e8AHfXwacCI/PE2wEX540lVnsuKI+BHkh4AbiZb42GN/NjTETEjfzwNGCFpKLByRPwt3++fZZPzRJXqVTJwvNLB5W+XPG4j+3n0aJEoSeOAHYGtIuINSVOAQfnhxfHewPeO83XwgPj0HAAMBz4REYsl/Zv3fpZL/+4sT/a7459jH+IedHVuB/aWtLykwWRljKU9CoyU9MH8+X5VnuMe4NOSVsvXld0P6OgRLZa0TJn3DwVezZPzh4EtKzjnXWSlG8iSgqVhKPBSnpy3J/sLq0sRMR9YIOlT+S7/LJucE3QVImI6cDEwA/gzcEcnr3mLbJWxa/OLhM9UeY45wPHAbcD9wPSIuCo/PBF4oMxFwhuAgfmfxaeRlTnKORo4XNJUsqRgaZgMjJV0L1myfbSC9xwEnJNfJHyzyOCseJ7qbWaWKPegzcwS5QRtZpYoJ2gzs0Q5QZuZJcoJ2swsUU7Q9j6S2vI1Px6SdKmkFXrQVukaJedK+mg3rx0naesaztHpSn6VrPBX7cpxpSsYmhXNCdo607HmxyZk08MPKz2YT6CpWkQcEhGPdPOScUDVCdqsr3KCtnLuAD6U925vk3Qh8KCkAZL+O18B7wFJhwIo87+SHpF0LbB6R0OSpkgamz/eKV+l7f58xbYRZB8Ex+S9920lDZf05/wcUyVtk7931XwFt/sk/YYKpsd3tcJffuyMPJZbJA3P931Q2UqD0yTdkc/KXLrNo/J/5wOS/lTj99esS16Lw7okaSCwM9nsRIAtgE0i4uk8yS2IiM0lLQfcJekmspX+NiJboW8NskWgfrdUu8PJVvzbLm9rWES8IunXwGsR8fP8dRcC/xMRd0pan+wmnB8BTgbujIhTJe1KNnOznK/n51gemCrpzxExj2yFv+kR8W1JJ+VtH0E2a/OwiHhC0ieBX5KtMFjqOGBkRLwtaeVKvqdm1XCCts4sL2lG/vgO4Dyy0sM/I+LpfP/ngI911JfJpoiPIlt576KIaANmS7q1k/a3BG7vaCsiXukijh2Bj763wiZD8jVQtiNfZzsirpX0agX/pqMk7Z0/7ljhbx7QTjZ9H7KVBy+XtFL+77205NzLddLmA8BkSVcCV1YQg1lVnKCtM29GxJjSHXmier10F3BkRNy41Ot2ofyKapWuutZCtirfEmtK5LFUvEZBmRX+lhb5eecv/T3oxK5kHxZ7AP8laeN8iVizunAN2mp1I/AfHavrSdpQ0opkK/7tm9eo1yJbaH5pd5Ot2Dc
yf++wfP8iYHDJ624iKzeQv25M/vB28pXaJO0MrFIm1u5W+GsBOv4K2J+sdLIQeFrSl/NzSNLo0gYltQDrRcRtwLHAysBKZeIwq4p70Farc8kW/J+urEs7F9gLuIKsVvsg8DjvLZX6roiYm9ewL88T3UvAZ4GrgcuU3drpSOAospXZHiD7Xb2d7ELiD4CLJE3P23+2TKw3AIfl7TzGkiv8vQ5sLGkasIDsDjeQfQD8StKJwDLAn8hWF+wwALhA2SL5IquVzy8Th1lVvJqdmVmiXOIwM0uUE7SZWaKcoM3MEuUEbWaWKCdoM7NEOUGbmSXKCdrMLFH/B0UGgIdxnOi0AAAAAElFTkSuQmCC\n",
"text/plain": "<Figure size 432x288 with 2 Axes>"
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": "yhat=logreg_cv.predict(X_test)\nplot_confusion_matrix(Y_test,yhat)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Examining the confusion matrix, we see that logistic regression can distinguish between the different classes. We see that the major problem is false positives.\n"
},
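{
"cell_type": "markdown",
"metadata": {},
"source": "To quantify the false positives, a sketch (assuming <code>yhat</code> from the cell above) that derives the raw counts with scikit-learn's <code>confusion_matrix</code>:\n"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Sketch: extract TN / FP / FN / TP counts from the predictions above\nfrom sklearn.metrics import confusion_matrix\n\ntn, fp, fn, tp = confusion_matrix(Y_test, yhat).ravel()\nprint('false positives:', fp, 'of', len(Y_test), 'test samples')"
},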
{
"cell_type": "markdown",
"metadata": {},
"source": "## TASK 6\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Create a support vector machine object then create a <code>GridSearchCV</code> object <code>svm_cv</code> with cv - 10. Fit the object to find the best parameters from the dictionary <code>parameters</code>.\n"
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": "parameters = {'kernel':('linear', 'rbf','poly','rbf', 'sigmoid'),\n 'C': np.logspace(-3, 3, 5),\n 'gamma':np.logspace(-3, 3, 5)}\nsvm = SVC()"
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "GridSearchCV(cv=10, estimator=SVC(),\n param_grid={'C': array([1.00000000e-03, 3.16227766e-02, 1.00000000e+00, 3.16227766e+01,\n 1.00000000e+03]),\n 'gamma': array([1.00000000e-03, 3.16227766e-02, 1.00000000e+00, 3.16227766e+01,\n 1.00000000e+03]),\n 'kernel': ('linear', 'rbf', 'poly', 'rbf', 'sigmoid')})"
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "svm_cv= GridSearchCV(svm, parameters, cv=10)\nsvm_cv.fit(X_train, Y_train)"
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "tuned hpyerparameters :(best parameters) {'C': 1.0, 'gamma': 0.03162277660168379, 'kernel': 'sigmoid'}\naccuracy : 0.8625\n"
}
],
"source": "print(\"tuned hpyerparameters :(best parameters) \",svm_cv.best_params_)\nprint(\"accuracy :\",svm_cv.best_score_)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## TASK 7\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Calculate the accuracy on the test data using the method <code>score</code>:\n"
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "0.7777777777777778"
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "svm_cv.score(X_test, Y_test)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can plot the confusion matrix\n"
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWgAAAEWCAYAAABLzQ1kAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAfBElEQVR4nO3dd5xdVbnG8d+TAgRIAoHQApiggApKREC6oSi9iVdAsCAYuNJULIAICtjuFS54L+qNoGISooAUKQJSIuWiphBqkCihhEQIgRR6MvPeP/YeOBlm5pQ5+5x1Ms+Xz/7MObustTJzeGfNu9daWxGBmZmlp1+zG2BmZl1zgDYzS5QDtJlZohygzcwS5QBtZpYoB2gzs0Q5QFuvSRok6XpJiyRd2YtyjpR0az3b1gyS/ijpc81uh7U+B+g+RNKnJU2V9LKkeXkg2bkORX8SWBdYKyL+rdZCImJiRHy8Du1ZjqQxkkLS1Z32b5Xvn1xhOd+RNKHceRGxT0RcVmNzzd7iAN1HSPoqcCHwfbJgujHwU+CgOhT/LuDxiFhWh7KKMh/YUdJaJfs+BzxerwqU8f9TVjf+MPUBkoYC5wAnRMTVEfFKRCyNiOsj4uv5OStLulDS3Hy7UNLK+bExkuZIOlXS83nv++j82HeBs4DD8p75MZ17mpJG5j3VAfn7z0t6QtISSbMlHVmy/56S63aUNCVPnUyRtGPJscmSzpV0b17OrZLW7uHb8CZwLXB4fn1/4FPAxE7fq4skPSNpsaRpknbJ9+8NnFHy73ygpB3fk3Qv8CqwSb7v2Pz4zyRdVVL+jyTdLkmV/vys73KA7ht2AFYBrunhnG8B2wOjga2A7YAzS46vBwwFRgDHABdLWjMizibrlf8uIlaPiEt7aoik1YCfAPtExGBgR2BGF+cNA27Mz10LuAC4sVMP+NPA0cA6wErA13qqG/gN8Nn89V7AI8DcTudMIfseDAMuB66UtEpE3Nzp37lVyTWfAcYCg4GnOpV3KvDB/JfPLmTfu8+F11iwCjhA9w1rAS+USUEcCZwTEc9HxHzgu2SBp8PS/PjSiLgJeBnYvMb2tANbShoUEfMi4pEuztkPmBUR4yNiWURMAh4DDig551cR8XhEvAZcQRZYuxUR/wcMk7Q5WaD+TRfnTIiIBXmd5wMrU/7f+euIeCS/Zmmn8l4FjiL7BTMBOCki5pQpzwxwgO4rFgBrd6QYurEBy/f+nsr3vVVGpwD/KrB6tQ2JiFeAw4DjgXmSbpT03gra09GmESXv/1VDe8YDJwK70cVfFHkaZ2aeVllI9ldDT6kTgGd6OhgRfwOeAET2i8SsIg7QfcN9wOvAwT2cM5fsZl+HjXnnn/+VegVYteT9eqUHI+KWiPgYsD5Zr/gXFbSno03P1timDuOBLwE35b3bt+QpiG+S5abXjIg1gEVkgRWgu7REj+kKSSeQ9cTnAt+oueXW5zhA9wERsYjsRt7Fkg6WtKqkgZL2kfQf+WmTgDMlDc9vtp1F9id5LWYAu0raOL9BeXrHAUnrSjowz0W/QZYqaeuijJuAzfKhgQMkHQa8H7ihxjYBEBGzgY+S5dw7GwwsIxvxMUDSWcCQkuPPASOrGakhaTPgPLI0x2eAb0gaXVvrra9xgO4jIuIC4KtkN/7mk/1ZfiLZyAbIgshU4EHgIWB6vq+Wuv4E/C4vaxrLB9V+ZDfO5gIvkgXLL3VRxgJg//zcBWQ9z/0j4oVa2tSp7Hsioqu/Dm4B/kg29O4psr86StMXHZNwFkiaXq6ePKU0AfhRRDwQEbPIRoKM7xghY9YT+WaymVma3IM2M0uUA7SZWZ1J+mU+qevhkn3DJP1J0qz865rlynGANjOrv18De3fadxpwe0RsCtyev++Rc9BmZgWQNBK4ISK2zN//HRgTEfMkrQ9MjogeJ0H1NHGhqX4z4ij/5jCzinz22Qm9Xttk6QtPVBxzVhr+7uPIpvd3GBcR48pctm5EzAPIg/Q65epJNkCbmaUqD8blAnKvOUCbmQG0dzVfqq6ek7R+SYrj+XIX+CahmRlA27LKt9r8gWwNcvKv15W7wD1oMzMgor1uZUmaBIwhW6RsDnA28EPgCknHAE8DZZ8+5ABtZgbQXr8AHRFHdHNoj2rKcYA2MwOoYw+6XhygzcygETcJq+YAbWYG7kGbmaUqah+dURgHaDMzqOtNwnpxgDYzA6c4zMyS5ZuEZmaJcg/azCxRvkloZpYo3yQ0M0tThHPQZmZpcg7azCxRTnGYmSXKPWgzs0S1LW12C97BAdrMDJziMDNLllMcZmaJcg/azCxRDtBmZmkK3yQ0M0uUc9BmZolyisPMLFHuQZuZJco9aDOzRLkHbWaWqGVesN/MLE3uQZuZJco5aDOzRLkHbWaWKPegzcwS5R60mVmiPIrDzCxREc1uwTs4QJuZgXPQZmbJSjBA92t2A8zMkhDtlW9lSPqKpEckPSxpkqRVammSA7SZGUBbW+VbDySNAE4GtomILYH+wOG1NMkpDjMzqHeKYwAwSNJSYFVgbi2FuAdtZgZZgK5wkzRW0tSSbWxHMRHxLPBj4GlgHrAoIm6tpUnuQZuZQVUTVSJiHDCuq2OS1gQOAkYBC4ErJR0VEROqbZJ70GZmQLRHxVsZewKzI2J+RCwFrgZ2rKVN7kGbmUE9c9BPA9tLWhV4DdgDmFpLQQ7QZmZQdnRGpSLir5KuAqYDy4D76SYdUo4DtJkZ1HUUR0ScDZzd23IcoM3MwDMJrXrqJ/a/5Tx2v+zUZjfFEuLPRQEiKt8axAE6ce89dm8WzappjLutwPy5KEAV46Abpe4BWtISSYu72+pd34ps1fWHseEeo5k1aXKzm2IJ8eeiIO1R+dYgdc9BR8RgAEnnAP8CxgMCjgQG17u+Fdm23z2KaedNYuDqg5rdFEuIPxcFqdMojnoqMsWxV0T8NCKWRMTiiPgZcGhPF5ROn7zzlVkFNi19I/YczesvLObFh55sdlMsIf5cFCfa2yveGqXIURxtko4EfgsEcATQ46+o0umTvxlxVHqPN2igdbbZjA0/vjUjdt+K/isPZODgQez8k3/nnpN/1uymWRP5c1GgBqYuKqUo6I6kpJHARcBOZAH6XuDLEfFkJdf39QBdat0d3scWx+/LHZ87v9lNsYT4c/G2zz47Qb0t45XzKo85q53Z+/oqUVgPOg/EBxVVvplZXSXYgy4sQEsaDnwRGFlaT0R8oag6V1TP3TeT5+6b2exmWGL8uaizZendJCwyB30dcDdwG2Vyz2ZmTVfFcqONUmSAXjUivllg+WZm9ZNgiqPIYXY3SNq3wPLNzOqmrw2zOwU4Q9IbwFKyySoREUMKrNPMrDYJ9qCLHMXhWYNm1jr6UoCGt57NtSmwSse+iLiryDrNzGqS4FTvIofZHUuW5tgQmAFsD9wH7F5UnWZmtargWYMNV+RNwlOAbYGnImI34EPA/ALrMzOrXV9Yza7E6xHxuiQkrRwRj0navMD6zMxql+ATVYoM0HMkrQFcC/xJ0kuAVxg3szQlmOIoc
hTHIfnL70i6ExgK3FxUfWZmvdIXArSkYV3sfij/ujrwYr3rNDPrrWjrGymOaWTLi5Yux9fxPoBNCqjTzKx3+kIPOiJG1btMM7OipTjMrtCJKmZmLcMB2swsUemloB2gzcwAYll6EbqwmYSSxleyz8wsCe1VbA1SZA96i9I3kvoDHy6wPjOzmqV4k7DuPWhJp0taAnxQ0mJJS/L3z5M9BsvMLD0J9qDrHqAj4gf5WtD/GRFDImJwvq0VEafXuz4zs3qI9qh4a5Qip3qfLulAYNd81+SIuKGo+szMeiW9e4SFrgf9A2A7YGK+6xRJO7kXbWYpimXNbsE7FXmTcD9gdET2LHNJlwH3Aw7QZpacSLAHXVUOWtKakj5YxSVrlLweWk1dZmYNVcebhJLWkHSVpMckzZS0Qy1NKtuDljQZODA/dwYwX9KfI+KrZS79AXB/vtSoyHLR7j2bWZLq3IO+CLg5Ij4paSVg1VoKqSTFMTQiFufPGPxVRJwt6cFyF0XEpDy4b0sWoL8ZEf+qpZFmZkWrV4CWNISsQ/p5gIh4E3izlrIqSXEMkLQ+8Cmg2lEY/YAXgJeAzSTtWuZ8M7OmiDZVvEkaK2lqyTa2pKhNyJ6/+itJ90u6RNJqtbSpkh70OcAtwD0RMUXSJsCschdJ+hFwGPAIb2dtArirloaamRWpmh50RIwDxnVzeACwNXBSRPxV0kXAacC3q21T2QAdEVcCV5a8fwI4tIKyDwY2j4g3qm2UmVmjRbvKn1SZOcCciPhr/v4qsgBdtW4DtKT/JuvxdikiTi5T9hPAQMAB2sySV68cdET8S9IzkjaPiL8DewCP1lJWTz3oqTW17m2vAjMk3U5JkK4gsJuZNVxE3XrQACcBE/MRHE8AR9dSSLcBOiIuK30vabWIeKWKsv+Qb2ZmyavnMLuImAFs09tyKhkHvQNwKdkTuTeWtBVwXER8qUwDL+vpuJlZStrb6tqDrotKhtldCOwFLACIiAd4ewEkM7MVQrSr4q1RKlqLIyKekZZrVFsxzTEza45GBt5KVRKgn5G0IxB5wvtkYGaxzTIza6xI74EqFQXo48nmlY8AniWbtHJCdydLup6eh+cdWGUbzcwK15I96Ih4ATiyijJ/nH/9BLAeMCF/fwTwZDWNMzNrlDoPs6uLSkZxbELWg96erGd8H/CVfEbhO0TEn/Przo2I0puJ10vyNG8zS1Jbi47iuBy4Algf2IBs2vekCq4bngd3ACSNAobX0kgzs6JFqOKtUSrJQSsixpe8nyDpxAqu+wowWVJHT3skcFyV7TMza4iWykFLGpa/vFPSacBvyVIchwE3lis4Im6WtCnw3nzXY144ycxS1WqjOKaRBeSOXyulvd8Azu3qIkm7R8Qdkj7R6dC7JRERV9fcWjOzgrRUDzoiRtVY5keBO4ADuioWcIA2s+S0tVf1iNaGqGgmoaQtgfcDq3Tsi4jfdHVuRJydf61p9SYzs2ZotRQHAJLOBsaQBeibgH2Ae4AuA7SkHh8mGxEXVN1KM7OCtbfiOGjgk8BWwP0RcbSkdYFLejh/cP51c7IHxnYsOXoAftyVmSWqJSeqAK9FRLukZfnTap8neyhilyLiuwCSbgW2jogl+fvvUPLoLDOzlLRkigOYKmkN4BdkIzteBv5WwXUbs/yjxt8kGwtdkS/Mv7PSU60PeW3u3c1ugq2gWjLFUbIw/88l3QwMiYgHKyh7PPA3SdeQjd44BPAi/maWpJYaxSFp656ORcT0ngqOiO9J+iOwS77r6Ii4v7ZmmpkVK8EMR4896PN7OBbA7uUKz4N4j4HczCwFLZXiiIjdGtkQM7NmatVRHGZmK7w6PtS7bhygzcyAwD1oM7MkLUswxVF2XIkyR0k6K3+/saTtim+amVnjBKp4a5RKBv79FNiB7JmCAEuAiwtrkZlZE7RXsTVKJSmOj0TE1pLuB4iIlyStVHC7zMwaqlVz0Esl9Scfxy1pOGne8DQzq1mKQa2SAP0T4BpgHUnfI1vd7sxCW2Vm1mBtrdiDjoiJkqYBe5A9/urgiJhZeMvMzBoowSdeVbRg/8bAq8D1pfsi4ukiG2Zm1kjtrdiDJnuCd8fDY1cBRgF/B7YosF1mZg3VaoslARARHyh9n69yd1w3p5uZtaRWvUm4nIiYLmnbIhpjZtYs7WrBFEenh8D2A7YG5hfWIjOzJmhrdgO6UEkPenDJ62VkOenfF9McM7PmqPcojnz+yFTg2YjYv5YyegzQeQWrR8TXaynczKxVFDCK4xRgJjCk1gK6XYtD0oCIaCNLaZiZrdCiiq0cSRsC+wGX9KZNPfWg/0YWnGdI+gNwJfBKx8GIuLo3FZuZpaSaFIekscDYkl3jImJcyfsLgW+wfIq4apXkoIcBC8ieQdgxHjoAB2gzW2FUM8wuD8bjujomaX/g+YiYJmlMb9rUU4BeJx/B8TBvB+a32tebSs3MUtNWvxT0TsCBkvYlm9w3RNKEiDiq2oJ6CtD9gdWhy8y5A7SZrVDqNVElIk4HTgfIe9BfqyU4Q88Bel5EnFNLoWZmrabVZhKmN63GzKwgRTySMCImA5Nrvb6nAL1HrYWambWalupBR8SLjWyImVkztepUbzOzFV5LLthvZtYXtFSKw8ysL3GANjNLVIqTOxygzcxwDtrMLFkexWFmlqj2BJMcDtBmZvgmoZlZstLrPztAm5kB7kGbmSVrmdLrQztAm5nhFIeZWbKc4jAzS5SH2ZmZJSq98OwAbWYGOMVhZpastgT70A7QZma4B21mlqxwD9rMLE0p9qD7NbsB1rO9Pj6GRx6+i8cevYdvfP2EZjfHmuTM71/ArvsdzsFHHf/WvkWLl3DsKWew72HHcOwpZ7Bo8ZImtrD1tRMVb43iAJ2wfv368ZOLvsf+BxzFB7bajcMOO5j3vW/TZjfLmuDgfT/Gzy84b7l9l4y/gu23Gc1Nv7uU7bcZzaUTrmhS61YMUcXWKA7QCdtu2w/xz38+yezZT7N06VKuuOI6Djxgr2Y3y5pgm9EfYOiQwcvtu/Pu+zhonz0BOGifPbnjrvua0bQVxjKi4q1RHKATtsGI9Xhmzty33s95dh4bbLBeE1tkKVnw0kKGrz0MgOFrD+PFhYua3KLWFlX81yiF3CSU9ImejkfE1d1cNxYYC6D+Q+nXb7UCWtc6pHc+JC0ivTvNZiuCFG8SFjWK44D86zrAjsAd+fvdgMlAlwE6IsYB4wAGrDSiz0eiZ+fMY6MNN3jr/YYj1mfevOea2CJLyVprrsH8F15k+NrDmP/CiwxbY2izm9TSUhxmV0iKIyKOjoijyfLp74+IQyPiUGCLIupbUU2ZOoP3vGcUI0duxMCBA/nUpw7i+htubXazLBFjdt6e6/54GwDX/fE2dttlhya3qLW1V7E1StHjoEdGxLyS988BmxVc5wqjra2NU758JjfdeDn9+/Xj15f9jkcffbzZzbIm+PrZP2TK/Q+ycOFi9jj4KL50zGc49jOf4tRvf5+rb7iF9dcdzgXnfavZzWxpbQmmD1VkTlPS/wCbApPIetOHA/+I
iJPKXesUh3Xltbl3N7sJlqCBa2/yzhs2Vfr0uw6pOOZc/tQ1va6vEoX2oCPiREmHALvmu8ZFxDVF1mlmVosUc9CNmOo9HVgSEbdJWlXS4IjwlCczS0qKozgKHQct6YvAVcD/5rtGANcWWaeZWS364lTvE4CdgMUAETGLbOidmVlS6jVRRdJGku6UNFPSI5JOqbVNRac43oiINzsmXEgaQJpPljGzPq6OoziWAadGxHRJg4Fpkv4UEY9WW1DRPeg/SzoDGCTpY8CVwPUF12lmVrV6pTgiYl5ETM9fLwFmkqV3q1Z0gD4NmA88BBwH3AScWXCdZmZVq2aiiqSxkqaWbGO7KlPSSOBDwF9raVPRw+zagV/km5lZsqoZZle6LEV3JK0O/B74ckQsrqVNRS2W9BA95Joj4oNF1GtmVqt6js6QNJAsOE/sbnG4ShTVg94//9rxCJDx+dcjgVcLqtPMrGb1mlWtbFTEpcDMiLigN2UVEqAj4ikASTtFxE4lh06TdC9wThH1mpnVqq1+PeidgM8AD0make87IyJuqragoofZrSZp54i4B0DSjkDfXuTZzJJUrxRHHu/qslZH0QH6GOCXkjoWql0IfKHgOs3MqpbiwzCKHsUxDdhK0hCylfP8TB4zS1Ijp3BXqtAALWll4FBgJDCgY0ZhRDgHbWZJ6Yur2V0HLAKmAW8UXJeZWc1SXLC/6AC9YUTsXXAdZma9lmKKo+ip3v8n6QMF12Fm1mspLjdadA96Z+DzkmaTpTgEhGcSmllq+twoDmCfgss3M6uLFFMcRQ+z65hRuA6wSpF1mZn1Rp8bxSHpQOB8YAPgeeBdZGujblFkvWZm1WqL9J5KWPRNwnOB7YHHI2IUsAdwb8F1mplVLSIq3hql6AC9NCIWAP0k9YuIO4HRBddpZla1vjiKY2G+aPVdwERJz5M9r8vMLCkp5qCL7kEfBLwGfAW4GfgncEDBdZqZVa09ouKtUYoexfFKydvLiqzLzKw3UuxBF/XIqyV0/cirjokqQ4qo18ysVimO4ijqiSqDiyjXzKwojUxdVKrom4RmZi2hz6Q4zMxajXvQZmaJcg/azCxRbdHW7Ca8gwO0mRl9c7lRM7OW0OeWGzUzaxXuQZuZJcqjOMzMEuVRHGZmieozU73NzFqNc9BmZolyDtrMLFHuQZuZJcrjoM3MEuUetJlZojyKw8wsUb5JaGaWqBRTHEU/1dvMrCVEFf+VI2lvSX+X9A9Jp9XaJvegzcyoXw9aUn/gYuBjwBxgiqQ/RMSj1ZblAG1mRl1z0NsB/4iIJwAk/RY4CFhxAvSyN59Vs9uQCkljI2Jcs9thafHnor6qiTmSxgJjS3aNK/lZjACeKTk2B/hILW1yDro1jC1/ivVB/lw0SUSMi4htSrbSX5RdBfqauucO0GZm9TUH2Kjk/YbA3FoKcoA2M6uvKcCmkkZJWgk4HPhDLQUlm4O25TjPaF3x5yJBEbFM0onALUB/4JcR8UgtZSnFwdlmZuYUh5lZshygzcwS5QDdC5K+I+lr+etzJO3ZxTljJN1Qp/rO6OHYk5LWrlM9L9ejHKtNvb7/kkZKergeZVlzOEDXSUScFRG3FVxNtwHazFY8DtBVkvStfBGU24DNS/b/WtIn89d7S3pM0j3AJ7op5/OSrpZ0s6RZkv6j5NgRkh6S9LCkH+X7fggMkjRD0sQybbxW0jRJj+Qznjr2vyzpe5IekPQXSevm+0dJuk/SFEnn9uLbY3UkaXVJt0uann8eDsr3j5Q0U9Iv8p/xrZIG5cc+nP987wNOaOo/wHrNAboKkj5MNqbxQ2SBd9suzlkF+AVwALALsF4PRY4GDgM+ABwmaSNJGwA/AnbPj28r6eCIOA14LSJGR8SRZZr6hYj4MLANcLKktfL9qwF/iYitgLuAL+b7LwJ+FhHbAv8qU7Y1zuvAIRGxNbAbcL6kjllqmwIXR8QWwELg0Hz/r4CTI2KHRjfW6s8Bujq7ANdExKsRsZiuB5+/F5gdEbMiG8M4oYfybo+IRRHxOtlCKu8iC/qTI2J+RCwDJgK7VtnOkyU9APyFbEbTpvn+N4GOfPg0YGT+eidgUv56fJV1WXEEfF/Sg8BtZGs8rJsfmx0RM/LX04CRkoYCa0TEn/P9/lm2OE9UqV4lA8crHVz+RsnrNrKfR68WiZI0BtgT2CEiXpU0GVglP7w03h743lFfBw+IT8+RwHDgwxGxVNKTvP2z7PzZGUT22fHPcQXiHnR17gIOkTRI0mCyNEZnjwGjJL07f39ElXX8FfiopLXzdWWPADp6REslDSxz/VDgpTw4vxfYvoI67yVL3UAWFCwNQ4Hn8+C8G9lfWN2KiIXAIkk757v8s2xxDtBViIjpwO+AGcDvgbu7OOd1slXGbsxvEj5VZR3zgNOBO4EHgOkRcV1+eBzwYJmbhDcDA/I/i88lS3OUcwpwgqQpZEHB0jAR2EbSVLJg+1gF1xwNXJzfJHytyMZZ8TzV28wsUe5Bm5klygHazCxRDtBmZolygDYzS5QDtJlZohyg7R0kteVrfjws6UpJq/airNI1Si6R9P4ezh0jacca6uhyJb9KVvirduW40hUMzYrmAG1d6VjzY0uy6eHHlx7MJ9BULSKOjYhHezhlDFB1gDZbUTlAWzl3A+/Je7d3SroceEhSf0n/ma+A96Ck4wCU+R9Jj0q6EVinoyBJkyVtk7/eO1+l7YF8xbaRZL8IvpL33neRNFzS7/M6pkjaKb92rXwFt/sl/S8VTI/vboW//Nj5eVtulzQ83/duZSsNTpN0dz4rs3OZJ+f/zgcl/bbG769Zt7wWh3VL0gBgH7LZiQDbAVtGxOw8yC2KiG0lrQzcK+lWspX+NidboW9dskWgftmp3OFkK/7tmpc1LCJelPRz4OWI+HF+3uXAf0XEPZI2JnsI5/uAs4F7IuIcSfuRzdws5wt5HYOAKZJ+HxELyFb4mx4Rp0o6Ky/7RLJZm8dHxCxJHwF+SrbCYKnTgFER8YakNSr5nppVwwHaujJI0oz89d3ApWSph79FxOx8/8eBD3bkl8mmiG9KtvLepIhoA+ZKuqOL8rcH7uooKyJe7KYdewLvf3uFTYbka6DsSr7OdkTcKOmlCv5NJ0s6JH/dscLfAqCdbPo+ZCsPXi1p9fzfe2VJ3St3UeaDwERJ1wLXVtAGs6o4QFtXXouI0aU78kD1Suku4KSIuKXTeftSfkW1Sldd60e2Kt9ya0rkbal4jYIyK/x1Fnm9Czt/D7qwH9kviwOBb0vaIl8i1qwunIO2Wt0C/HvH6nqSNpO0GtmKf4fnOer1yRaa7+w+shX7RuXXDsv3LwEGl5x3K1m6gfy80fnLu8hXapO0D7Bmmbb2tMJfP6Djr4BPk6VOFgOzJf1bXockbVVaoKR+wEYRcSfwDWANYPUy7TCrinvQVqtLyBb8n66sSzsfOBi4hixX+xDwOG8vlfqWiJif57CvzgPd88DHgOuBq5Q92ukk4GSyldkeJPus3kV2I/G7wCRJ0/Pyny7T1puB4/Ny/s7yK/y9AmwhaRqwiOw
JN5D9AviZpDOBgcBvyVYX7NAfmKBskXyR5coXlmmHWVW8mp2ZWaKc4jAzS5QDtJlZohygzcwS5QBtZpYoB2gzs0Q5QJuZJcoB2swsUf8PZZWSGMsti3cAAAAASUVORK5CYII=\n",
"text/plain": "<Figure size 432x288 with 2 Axes>"
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": "yhat=svm_cv.predict(X_test)\nplot_confusion_matrix(Y_test,yhat)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## TASK 8\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Create a decision tree classifier object then create a <code>GridSearchCV</code> object <code>tree_cv</code> with cv = 10. Fit the object to find the best parameters from the dictionary <code>parameters</code>.\n"
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": "parameters = {'criterion': ['gini', 'entropy'],\n 'splitter': ['best', 'random'],\n 'max_depth': [2*n for n in range(1,10)],\n 'max_features': ['auto', 'sqrt'],\n 'min_samples_leaf': [1, 2, 4],\n 'min_samples_split': [2, 5, 10]}\n\ntree = DecisionTreeClassifier()"
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "GridSearchCV(cv=10, estimator=DecisionTreeClassifier(),\n param_grid={'criterion': ['gini', 'entropy'],\n 'max_depth': [2, 4, 6, 8, 10, 12, 14, 16, 18],\n 'max_features': ['auto', 'sqrt'],\n 'min_samples_leaf': [1, 2, 4],\n 'min_samples_split': [2, 5, 10],\n 'splitter': ['best', 'random']})"
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "tree_cv= GridSearchCV(tree, parameters, cv=10)\ntree_cv.fit(X_train, Y_train)"
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "tuned hpyerparameters :(best parameters) {'criterion': 'gini', 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 4, 'min_samples_split': 10, 'splitter': 'random'}\naccuracy : 0.9071428571428573\n"
}
],
"source": "print(\"tuned hpyerparameters :(best parameters) \",tree_cv.best_params_)\nprint(\"accuracy :\",tree_cv.best_score_)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## TASK 9\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Calculate the accuracy of tree_cv on the test data using the method <code>score</code>:\n"
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "0.7222222222222222"
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "tree_cv.score(X_test, Y_test)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can plot the confusion matrix\n"
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWgAAAEWCAYAAABLzQ1kAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAfrUlEQVR4nO3dd5xdVbn/8c93EiCUJBAIHUzQgAqYiAEpgkFQ6cVCaP4UwcCVdrEgIBcQxHaFC/4ulggqhoAUKdKREikihoRQQxEQCAkQAin0zMxz/9h74CTMzClz9px1Zr5vXvs15+x9ztpPZobnrHn2WmsrIjAzs/S0NDoAMzPrnBO0mVminKDNzBLlBG1mlignaDOzRDlBm5klygnaekzS8pKulrRA0qU9aOcASTfVM7ZGkHS9pK82Og5rfk7Q/Yik/SXdK+k1SXPyRPKpOjT9JWANYNWI+HKtjUTE5Ij4XB3iWYKkcZJC0uVL7R+d759SYTunSLqg3OsiYueIOL/GcM3e5QTdT0j6FnAW8COyZLo+8Etgzzo0/wHg8YhorUNbRZkLbC1p1ZJ9XwUer9cJlPH/U1Y3/mXqByQNBU4FDo+IyyPi9YhYHBFXR8R389csJ+ksSbPz7SxJy+XHxkmaJenbkl7Ke98H5cd+AJwEjM975gcv3dOUNCLvqQ7Mn39N0lOSFkl6WtIBJfvvLHnf1pKm5qWTqZK2Ljk2RdJpku7K27lJ0mrdfBveAa4E9s3fPwDYB5i81PfqbEnPSVooaZqkbfP9OwEnlPw77y+J43RJdwFvABvk+w7Jj/9K0mUl7f9U0i2SVOnPz/ovJ+j+YStgEHBFN6/5PrAlMAYYDWwBnFhyfE1gKLAOcDBwjqRVIuJksl75xRGxUkSc110gklYEfgHsHBGDga2BGZ28bhhwbf7aVYEzgWuX6gHvDxwErA4sC3ynu3MDfwT+X/7488DDwOylXjOV7HswDLgQuFTSoIi4Yal/5+iS93wFmAAMBp5Zqr1vAx/LP3y2JfvefTW8xoJVwAm6f1gVeLlMCeIA4NSIeCki5gI/IEs8HRbnxxdHxHXAa8BGNcbTDmwiafmImBMRD3fyml2BJyJiUkS0RsRFwKPA7iWv+X1EPB4RbwKXkCXWLkXE34FhkjYiS9R/7OQ1F0TEvPycZwDLUf7f+YeIeDh/z+Kl2nsDOJDsA+YC4MiImFWmPTPACbq/mAes1lFi6MLaLNn7eybf924bSyX4N4CVqg0kIl4HxgOHAXMkXSvpwxXE0xHTOiXPX6ghnknAEcD2dPIXRV7GmZmXVeaT/dXQXekE4LnuDkbEP4GnAJF9kJhVxAm6f7gbeAvYq5vXzCa72Ndhfd7/53+lXgdWKHm+ZunBiLgxIj4LrEXWK/5tBfF0xPR8jTF1mAR8E7gu792+Ky9BfI+sNr1KRKwMLCBLrABdlSW6LVdIOpysJz4bOLbmyK3fcYLuByJiAdmFvHMk7SVpBUnLSNpZ0s/yl10EnChpeH6x7SSyP8lrMQPYTtL6+QXK4zsOSFpD0h55LfptslJJWydtXAdsmA8NHChpPPBR4JoaYwIgIp4GPk1Wc1/aYKCVbMTHQEknAUNKjr8IjKhmpIakDYEfkpU5vgIcK2lMbdFbf+ME3U9ExJnAt8gu/M0l+7P8CLKRDZAlkXuBB4AHgen5vlrO9Vfg4rytaSyZVFvILpzNBl4hS5bf7KSNecBu+WvnkfU8d4uIl2uJaam274yIzv46uBG4nmzo3TNkf3WUli86JuHMkzS93HnyktIFwE8j4v6IeIJsJMikjhEyZt2RLyabmaXJPWgzs0Q5QZuZ1Zmk3+WTuh4q2TdM0l8lPZF/XaVcO07QZmb19wdgp6X2HQfcEhGjgFvy591yDdrMrACSRgDXRMQm+fPHgHERMUfSWsCUiOh2ElR3Exca6tgR+/mTw95nn9Y3Gx2CJWjsrCt7vLbJ4pefqjjnLDv8g4eSTe/vMDEiJpZ52xoRMQcgT9KrlztPsgnazCxVeTIul5B7zAnazAygvbP5UnX1oqS1SkocL5V7gy8SmpkBtLVWvtXmL2RrkJN/varcG9yDNjMDItrr1paki4BxZIuUzQJOBn4CXCLpYOBZoOzdh5ygzcwA2uuXoCNivy4O7VBNO07QZmYAdexB14sTtJkZ9MZFwqo5QZuZgXvQZmapitpHZxTGCdrMDOp6kbBenKDNzMAlDjOzZPkioZlZotyDNjNLlC8SmpklyhcJzczSFOEatJlZmlyDNjNLlEscZmaJcg/azCxRbYsbHcH7OEGbmYFLHGZmyXKJw8wsUe5Bm5klygnazCxN4YuEZmaJcg3azCxRLnGYmSXKPWgzs0S5B21mlij3oM3MEtXqBfvNzNLkHrSZWaJcgzYzS5R70GZmiXIP2swsUe5Bm5klyqM4zMwSFdHoCN7HCdrMDFyDNjNLVoIJuqXRAZiZJSHaK9/KkHSMpIclPSTpIkmDagnJCdrMDKCtrfKtG5LWAY4CxkbEJsAAYN9aQnKJw8wM6l3iGAgsL2kxsAIwu5ZG3IM2M4MsQVe4SZog6d6SbUJHMxHxPPBz4FlgDrAgIm6qJST3oM3MoKqJKhExEZjY2TFJqwB7AiOB+cClkg6MiAuqDck9aDMzINqj4q2MHYGnI2JuRCwGLge2riUm96DNzKCeNehngS0lrQC8CewA3FtLQ07QZmZQdnRGpSLiHkmXAdOBVuA+uiiHlOMEbWYGdR3FEREnAyf3tB0naDMzSHImoRN0wgYutwyHXXwSA5dbhpYBA3jw+nv46/9c1uiwLAGb3j2RttffhLZ2orWNmbt+p9EhNT8vlmTVaH17MRP3/yHvvPE2LQMH8M3LTuGxKTN49r5/NTo0S8DjXz6R1lcXNTqMvqM/9KAlLQK6/CiKiCH1Pmdf9s4bbwMwYOAABgwcQCT4KW/WJ5QfPtfr6p6gI2IwgKRTgReASYCAA4DB9T5fX6cWcfQ1P2LVD6zJ3yfdxHMznmx0SJaCCEZdeAoEzJ18Iy9PrmmimpWq0yiOeiqyxPH5iPhkyfNfSboH+FlXb8inS04A+NywsYwe/KECw2sO0R6ctcvxDBqyAl/9zbdYY8N1efHxWY0Oyxrs0b2PY/GLrzJw1aFseNEpvPWvWbx2zyONDqupRYIljiJnErZJOkDSAEktkg4Auv2IioiJETE2IsY6OS/prYVv8OQ/ZrLRp0c3OhRLwOIXXwWgdd4C5t9wDyuOGdXgiPqA9qh86yVFJuj9gX2AF/Pty/k+q9CKwwYzaMgKQDaiY9Q2mzD3yZoWxbI+pGX55WhZcdC7j4dsN4Y3H3u2wVH1AXVcD7peCitxRMS/yRYMsRoNXn0Vxp/xH7S0tKAW8cC1/2Dmrfc1OixrsIHDV+ZD5x4HgAYM4JUrb2fhFP9e9Fh/uEjYQdJw4BvAiNLzRMTXizpnX/PCo89y9q7HNzoMS8w7z77II587ptFh9D2t/esi4VXAHcDNlKk9m5k1XC+WLipVZIJeISK+V2D7Zmb1k2CJo8iLh
NdI2qXA9s3M6iba2yveekuRPeijgRMkvQ0sJpusEp5JaGZJSrAHXeQoDs8aNLPm0Z8SNLx7b65RwKCOfRFxe5HnNDOrSX+a6i3pELIyx7rADGBL4G7gM0Wd08ysVhXca7DXFXmR8Ghgc+CZiNge+Dgwt8DzmZnVLsGp3kWWON6KiLckIWm5iHhU0kYFns/MrHYJLpZUZIKeJWll4Ergr5JeBbyQhJmlKcESR5GjOPbOH54i6TZgKHBDUeczM+uR/pCgJQ3rZPeD+deVgFfqfU4zs56Ktv5R4phGdssrlezreB7ABgWc08ysZ/pDDzoiRta7TTOzoqU4zM539TYzg/7RgzYza0rplaCdoM3MAKI1vQxd2ExCSZMq2WdmloT2KrZeUmQPeuPSJ5IGAJ8o8HxmZjVL8SJh3XvQko6XtAj4mKSFkhblz18iuw2WmVl6EuxB1z1BR8SP87Wg/zsihkTE4HxbNSJ8B1QzS1K0R8VbbylyqvfxkvYAtst3TYmIa4o6n5lZj6R3jbDQ9aB/DGwBTM53HS1pG/eizSxF0droCN6vyIuEuwJjIrJ7mUs6H7gPcII2s+REgj3oqmrQklaR9LEq3rJyyeOh1ZzLzKxX1fEioaSVJV0m6VFJMyVtVUtIZXvQkqYAe+SvnQHMlfS3iPhWmbf+GLgvX2pUZLVo957NLEl17kGfDdwQEV+StCywQi2NVFLiGBoRC/N7DP4+Ik6W9EC5N0XERXly35wsQX8vIl6oJUgzs6LVK0FLGkLWIf0aQES8A7xTS1uVlDgGSloL2AeodhRGC/Ay8CqwoaTtyrzezKwhok0Vb5ImSLq3ZJtQ0tQGZPdf/b2k+ySdK2nFWmKqpAd9KnAjcGdETJW0AfBEuTdJ+ikwHniY96o2AdxeS6BmZkWqpgcdEROBiV0cHghsBhwZEfdIOhs4DvivamMqm6Aj4lLg0pLnTwFfrKDtvYCNIuLtaoMyM+tt0a7yL6rMLGBWRNyTP7+MLEFXrcsELen/k/V4OxURR5Vp+ylgGcAJ2sySV68adES8IOk5SRtFxGPADsAjtbTVXQ/63pqie88bwAxJt1CSpCtI7GZmvS6ibj1ogCOByfkIjqeAg2pppMsEHRHnlz6XtGJEvF5F23/JNzOz5NVzmF1EzADG9rSdSsZBbwWcR3ZH7vUljQYOjYhvlgnw/O6Om5mlpL2trj3ouqhkmN1ZwOeBeQARcT/vLYBkZtYnRLsq3npLRWtxRMRz0hJBtRUTjplZY/Rm4q1UJQn6OUlbA5EXvI8CZhYblplZ74r0bqhSUYI+jGxe+TrA82STVg7v6sWSrqb74Xl7VBmjmVnhmrIHHREvAwdU0ebP869fANYELsif7wf8u5rgzMx6S52H2dVFJaM4NiDrQW9J1jO+Gzgmn1H4PhHxt/x9p0VE6cXEqyV5mreZJamtSUdxXAhcAqwFrE027fuiCt43PE/uAEgaCQyvJUgzs6JFqOKtt1RSg1ZETCp5foGkIyp43zHAFEkdPe0RwKFVxmdm1iuaqgYtaVj+8DZJxwF/IitxjAeuLddwRNwgaRTw4XzXo144ycxS1WyjOKaRJeSOj5XS3m8Ap3X2JkmfiYhbJX1hqUMflEREXF5ztGZmBWmqHnREjKyxzU8DtwK7d9Ys4ARtZslpa6/qFq29oqKZhJI2AT4KDOrYFxF/7Oy1EXFy/rWm1ZvMzBqh2UocAEg6GRhHlqCvA3YG7gQ6TdCSur2ZbEScWXWUZmYFa2/GcdDAl4DRwH0RcZCkNYBzu3n94PzrRmQ3jO1YcnR3fLsrM0tUU05UAd6MiHZJrfndal8iuylipyLiBwCSbgI2i4hF+fNTKLl1lplZSpqyxAHcK2ll4LdkIzteA/5ZwfvWZ8lbjb9DNha6ImfOdmfb3u/02Xc0OgTro5qyxFGyMP+vJd0ADImIBypoexLwT0lXkI3e2BvwIv5mlqSmGsUhabPujkXE9O4ajojTJV0PbJvvOigi7qstTDOzYiVY4ei2B31GN8cC+Ey5xvMk3m0iNzNLQVOVOCJi+94MxMyskZp1FIeZWZ9Xx5t6140TtJkZELgHbWaWpNYESxxlx5Uoc6Ckk/Ln60vaovjQzMx6T6CKt95SycC/XwJbkd1TEGARcE5hEZmZNUB7FVtvqaTE8cmI2EzSfQAR8aqkZQuOy8ysVzVrDXqxpAHk47glDSfNC55mZjVLMalVkqB/AVwBrC7pdLLV7U4sNCozs17W1ow96IiYLGkasAPZ7a/2ioiZhUdmZtaLErzjVUUL9q8PvAFcXbovIp4tMjAzs97U3ow9aLI7eHfcPHYQMBJ4DNi4wLjMzHpVsy2WBEBEbFr6PF/l7tAuXm5m1pSa9SLhEiJiuqTNiwjGzKxR2tWEJY6lbgLbAmwGzC0sIjOzBmhrdACdqKQHPbjkcStZTfrPxYRjZtYY9R7Fkc8fuRd4PiJ2q6WNbhN0foKVIuK7tTRuZtYsChjFcTQwExhSawNdrsUhaWBEtJGVNMzM+rSoYitH0rrArsC5PYmpux70P8mS8wxJfwEuBV7vOBgRl/fkxGZmKammxCFpAjChZNfEiJhY8vws4FiWLBFXrZIa9DBgHtk9CDvGQwfgBG1mfUY1w+zyZDyxs2OSdgNeiohpksb1JKbuEvTq+QiOh3gvMb8bX09OamaWmrb6laC3AfaQtAvZ5L4hki6IiAOrbai7BD0AWAk6rZw7QZtZn1KviSoRcTxwPEDeg/5OLckZuk/QcyLi1FoaNTNrNs02kzC9aTVmZgUp4paEETEFmFLr+7tL0DvU2qiZWbNpqh50RLzSm4GYmTVSs071NjPr85pywX4zs/6gqUocZmb9iRO0mVmiUpzc4QRtZoZr0GZmyfIoDjOzRLUnWORwgjYzwxcJzcySlV7/2QnazAxwD9rMLFmtSq8P7QRtZoZLHGZmyXKJw8wsUR5mZ2aWqPTSsxO0mRngEoeZWbLaEuxDO0GbmeEetJlZssI9aDOzNKXYg25pdADWvc9/bhwPP3Q7jz5yJ8d+9/BGh2MNcuKPzmS7XfdlrwMPe3ffgoWLOOToE9hl/MEccvQJLFi4qIERNr92ouKttzhBJ6ylpYVfnH06u+1+IJuO3p7x4/fiIx8Z1eiwrAH22uWz/PrMHy6x79xJl7Dl2DFcd/F5bDl2DOddcEmDousbooqttzhBJ2yLzT/Ok0/+m6effpbFixdzySVXscfun290WNYAY8dsytAhg5fYd9sdd7PnzjsCsOfOO3Lr7Xc3IrQ+o5WoeOstTtAJW3udNXlu1ux3n896fg5rr71mAyOylMx7dT7DVxsGwPDVhvHK/AUNjqi5RRX/9ZZCLhJK+kJ3xyPi8i7eNwGYAKABQ2lpWbGA6JqH9P6bpEWkd6XZrC9I8SJhUaM4ds+/rg5sDdyaP98emAJ0mqAjYiIwEWDgsuv0+0z0/Kw5rLfu2u8+X3edtZgz58UGRmQpWXWVlZn78isMX20Yc19+hWErD210SE0txWF2hZQ4IuKgiDiIrJ7+0Yj4YkR8Edi4iPP1
VVPvncGHPjSSESPWY5lllmGfffbk6mtuanRYlohxn9qSq66/GYCrrr+Z7bfdqsERNbf2KrbeUvQ46BERMafk+YvAhgWfs89oa2vj6P88keuuvZABLS384fyLeeSRxxsdljXAd0/+CVPve4D58xeyw14H8s2Dv8IhX9mHb//Xj7j8mhtZa43hnPnD7zc6zKbWlmD5UEXWNCX9LzAKuIisN70v8K+IOLLce13isM68OfuORodgCVpmtQ3ef8GmSvt/YO+Kc86Fz1zR4/NVotAedEQcIWlvYLt818SIuKLIc5qZ1SLFGnRvTPWeDiyKiJslrSBpcER4ypOZJSXFURyFjoOW9A3gMuA3+a51gCuLPKeZWS3641Tvw4FtgIUAEfEE2dA7M7Ok1GuiiqT1JN0maaakhyUdXWtMRZc43o6IdzomXEgaSJp3ljGzfq6OozhagW9HxHRJg4Fpkv4aEY9U21DRPei/SToBWF7SZ4FLgasLPqeZWdXqVeKIiDkRMT1/vAiYSVberVrRCfo4YC7wIHAocB1wYsHnNDOrWjUTVSRNkHRvyTahszYljQA+DtxTS0xFD7NrB36bb2ZmyapmmF3pshRdkbQS8GfgPyNiYS0xFbVY0oN0U2uOiI8VcV4zs1rVc3SGpGXIkvPkrhaHq0RRPejd8q8dtwCZlH89AHijoHOamdWsXrOqlY2KOA+YGRFn9qStQhJ0RDwDIGmbiNim5NBxku4CTi3ivGZmtWqrXw96G+ArwIOSZuT7ToiI66ptqOhhditK+lRE3AkgaWugfy/ybGZJqleJI893dVmro+gEfTDwO0kdC9XOB75e8DnNzKqW4s0wih7FMQ0YLWkI2cp5viePmSWpN6dwV6rQBC1pOeCLwAhgYMeMwohwDdrMktIfV7O7ClgATAPeLvhcZmY1S3HB/qIT9LoRsVPB5zAz67EUSxxFT/X+u6RNCz6HmVmPpbjcaNE96E8BX5P0NFmJQ0B4JqGZpabfjeIAdi64fTOzukixxFH0MLuOGYWrA4OKPJeZWU/0u1EckvYAzgDWBl4CPkC2NurGRZ7XzKxabZHeXQmLvkh4GrAl8HhEjAR2AO4q+JxmZlWLiIq33lJ0gl4cEfOAFkktEXEbMKbgc5qZVa0/juKYny9afTswWdJLZPfrMjNLSoo16KJ70HsCbwLHADcATwK7F3xOM7OqtUdUvPWWokdxvF7y9Pwiz2Vm1hMp9qCLuuXVIjq/5VXHRJUhRZzXzKxWKY7iKOqOKoOLaNfMrCi9WbqoVNEXCc3MmkK/KXGYmTUb96DNzBLlHrSZWaLaoq3RIbyPE7SZGf1zuVEzs6bQ75YbNTNrFu5Bm5klyqM4zMwS5VEcZmaJ6jdTvc3Mmo1r0GZmiXIN2swsUe5Bm5klyuOgzcwS5R60mVmiPIrDzCxRvkhoZpaoFEscRd/V28ysKUQV/5UjaSdJj0n6l6Tjao3JPWgzM+rXg5Y0ADgH+CwwC5gq6S8R8Ui1bTlBm5lR1xr0FsC/IuIpAEl/AvYE+k6Cbn3neTU6hlRImhARExsdh6XFvxf1VU3OkTQBmFCya2LJz2Id4LmSY7OAT9YSk2vQzWFC+ZdYP+TfiwaJiIkRMbZkK/2g7CzR19Q9d4I2M6uvWcB6Jc/XBWbX0pATtJlZfU0FRkkaKWlZYF/gL7U0lGwN2pbgOqN1xr8XCYqIVklHADcCA4DfRcTDtbSlFAdnm5mZSxxmZslygjYzS5QTdA9IOkXSd/LHp0rasZPXjJN0TZ3Od0I3x/4tabU6nee1erRjtanX91/SCEkP1aMtawwn6DqJiJMi4uaCT9NlgjazvscJukqSvp8vgnIzsFHJ/j9I+lL+eCdJj0q6E/hCF+18TdLlkm6Q9ISkn5Uc20/Sg5IekvTTfN9PgOUlzZA0uUyMV0qaJunhfMZTx/7XJJ0u6X5J/5C0Rr5/pKS7JU2VdFoPvj1WR5JWknSLpOn578Oe+f4RkmZK+m3+M75J0vL5sU/kP9+7gcMb+g+wHnOCroKkT5CNafw4WeLdvJPXDAJ+C+wObAus2U2TY4DxwKbAeEnrSVob+Cnwmfz45pL2iojjgDcjYkxEHFAm1K9HxCeAscBRklbN968I/CMiRgO3A9/I958N/CoiNgdeKNO29Z63gL0jYjNge+AMSR2z1EYB50TExsB84Iv5/t8DR0XEVr0drNWfE3R1tgWuiIg3ImIhnQ8+/zDwdEQ8EdkYxgu6ae+WiFgQEW+RLaTyAbKkPyUi5kZEKzAZ2K7KOI+SdD/wD7IZTaPy/e8AHfXwacCI/PE2wEX540lVnsuKI+BHkh4AbiZb42GN/NjTETEjfzwNGCFpKLByRPwt3++fZZPzRJXqVTJwvNLB5W+XPG4j+3n0aJEoSeOAHYGtIuINSVOAQfnhxfHewPeO83XwgPj0HAAMBz4REYsl/Zv3fpZL/+4sT/a7459jH+IedHVuB/aWtLykwWRljKU9CoyU9MH8+X5VnuMe4NOSVsvXld0P6OgRLZa0TJn3DwVezZPzh4EtKzjnXWSlG8iSgqVhKPBSnpy3J/sLq0sRMR9YIOlT+S7/LJucE3QVImI6cDEwA/gzcEcnr3mLbJWxa/OLhM9UeY45wPHAbcD9wPSIuCo/PBF4oMxFwhuAgfmfxaeRlTnKORo4XNJUsqRgaZgMjJV0L1myfbSC9xwEnJNfJHyzyOCseJ7qbWaWKPegzcwS5QRtZpYoJ2gzs0Q5QZuZJcoJ2swsUU7Q9j6S2vI1Px6SdKmkFXrQVukaJedK+mg3rx0naesaztHpSn6VrPBX7cpxpSsYmhXNCdo607HmxyZk08MPKz2YT6CpWkQcEhGPdPOScUDVCdqsr3KCtnLuAD6U925vk3Qh8KCkAZL+O18B7wFJhwIo87+SHpF0LbB6R0OSpkgamz/eKV+l7f58xbYRZB8Ex+S9920lDZf05/wcUyVtk7931XwFt/sk/YYKpsd3tcJffuyMPJZbJA3P931Q2UqD0yTdkc/KXLrNo/J/5wOS/lTj99esS16Lw7okaSCwM9nsRIAtgE0i4uk8yS2IiM0lLQfcJekmspX+NiJboW8NskWgfrdUu8PJVvzbLm9rWES8IunXwGsR8fP8dRcC/xMRd0pan+wmnB8BTgbujIhTJe1KNnOznK/n51gemCrpzxExj2yFv+kR8W1JJ+VtH0E2a/OwiHhC0ieBX5KtMFjqOGBkRLwtaeVKvqdm1XCCts4sL2lG/vgO4Dyy0sM/I+LpfP/ngI911JfJpoiPIlt576KIaANmS7q1k/a3BG7vaCsiXukijh2Bj763wiZD8jVQtiNfZzsirpX0agX/pqMk7Z0/7ljhbx7QTjZ9H7KVBy+XtFL+77205NzLddLmA8BkSVcCV1YQg1lVnKCtM29GxJjSHXmier10F3BkRNy41Ot2ofyKapWuutZCtirfEmtK5LFUvEZBmRX+lhb5eecv/T3oxK5kHxZ7AP8laeN8iVizunAN2mp1I/AfHavrSdpQ0opkK/7tm9eo1yJbaH5pd5Ot2Dc
yf++wfP8iYHDJ624iKzeQv25M/vB28pXaJO0MrFIm1u5W+GsBOv4K2J+sdLIQeFrSl/NzSNLo0gYltQDrRcRtwLHAysBKZeIwq4p70Farc8kW/J+urEs7F9gLuIKsVvsg8DjvLZX6roiYm9ewL88T3UvAZ4GrgcuU3drpSOAospXZHiD7Xb2d7ELiD4CLJE3P23+2TKw3AIfl7TzGkiv8vQ5sLGkasIDsDjeQfQD8StKJwDLAn8hWF+wwALhA2SL5IquVzy8Th1lVvJqdmVmiXOIwM0uUE7SZWaKcoM3MEuUEbWaWKCdoM7NEOUGbmSXKCdrMLFH/B0UGgIdxnOi0AAAAAElFTkSuQmCC\n",
"text/plain": "<Figure size 432x288 with 2 Axes>"
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": "yhat = tree_cv.predict(X_test)\nplot_confusion_matrix(Y_test,yhat)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## TASK 10\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Create a k nearest neighbors object then create a <code>GridSearchCV</code> object <code>knn_cv</code> with cv = 10. Fit the object to find the best parameters from the dictionary <code>parameters</code>.\n"
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": "parameters = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],\n 'p': [1,2]}\n\nKNN = KNeighborsClassifier()"
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "GridSearchCV(cv=10, estimator=KNeighborsClassifier(),\n param_grid={'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],\n 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n 'p': [1, 2]})"
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "knn_cv= GridSearchCV(KNN, parameters, cv=10)\nknn_cv.fit(X_train, Y_train)"
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "tuned hpyerparameters :(best parameters) {'algorithm': 'auto', 'n_neighbors': 4, 'p': 1}\naccuracy : 0.8767857142857143\n"
}
],
"source": "print(\"tuned hpyerparameters :(best parameters) \",knn_cv.best_params_)\nprint(\"accuracy :\",knn_cv.best_score_)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## TASK 11\n"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Calculate the accuracy of tree_cv on the test data using the method <code>score</code>:\n"
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "0.7777777777777778"
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": "knn_cv.score(X_test, Y_test)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can plot the confusion matrix\n"
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWgAAAEWCAYAAABLzQ1kAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAfBElEQVR4nO3dd5xdVbnG8d+TAgRIAoHQApiggApKREC6oSi9iVdAsCAYuNJULIAICtjuFS54L+qNoGISooAUKQJSIuWiphBqkCihhEQIgRR6MvPeP/YeOBlm5pQ5+5x1Ms+Xz/7MObustTJzeGfNu9daWxGBmZmlp1+zG2BmZl1zgDYzS5QDtJlZohygzcwS5QBtZpYoB2gzs0Q5QFuvSRok6XpJiyRd2YtyjpR0az3b1gyS/ijpc81uh7U+B+g+RNKnJU2V9LKkeXkg2bkORX8SWBdYKyL+rdZCImJiRHy8Du1ZjqQxkkLS1Z32b5Xvn1xhOd+RNKHceRGxT0RcVmNzzd7iAN1HSPoqcCHwfbJgujHwU+CgOhT/LuDxiFhWh7KKMh/YUdJaJfs+BzxerwqU8f9TVjf+MPUBkoYC5wAnRMTVEfFKRCyNiOsj4uv5OStLulDS3Hy7UNLK+bExkuZIOlXS83nv++j82HeBs4DD8p75MZ17mpJG5j3VAfn7z0t6QtISSbMlHVmy/56S63aUNCVPnUyRtGPJscmSzpV0b17OrZLW7uHb8CZwLXB4fn1/4FPAxE7fq4skPSNpsaRpknbJ9+8NnFHy73ygpB3fk3Qv8CqwSb7v2Pz4zyRdVVL+jyTdLkmV/vys73KA7ht2AFYBrunhnG8B2wOjga2A7YAzS46vBwwFRgDHABdLWjMizibrlf8uIlaPiEt7aoik1YCfAPtExGBgR2BGF+cNA27Mz10LuAC4sVMP+NPA0cA6wErA13qqG/gN8Nn89V7AI8DcTudMIfseDAMuB66UtEpE3Nzp37lVyTWfAcYCg4GnOpV3KvDB/JfPLmTfu8+F11iwCjhA9w1rAS+USUEcCZwTEc9HxHzgu2SBp8PS/PjSiLgJeBnYvMb2tANbShoUEfMi4pEuztkPmBUR4yNiWURMAh4DDig551cR8XhEvAZcQRZYuxUR/wcMk7Q5WaD+TRfnTIiIBXmd5wMrU/7f+euIeCS/Zmmn8l4FjiL7BTMBOCki5pQpzwxwgO4rFgBrd6QYurEBy/f+nsr3vVVGpwD/KrB6tQ2JiFeAw4DjgXmSbpT03gra09GmESXv/1VDe8YDJwK70cVfFHkaZ2aeVllI9ldDT6kTgGd6OhgRfwOeAET2i8SsIg7QfcN9wOvAwT2cM5fsZl+HjXnnn/+VegVYteT9eqUHI+KWiPgYsD5Zr/gXFbSno03P1timDuOBLwE35b3bt+QpiG+S5abXjIg1gEVkgRWgu7REj+kKSSeQ9cTnAt+oueXW5zhA9wERsYjsRt7Fkg6WtKqkgZL2kfQf+WmTgDMlDc9vtp1F9id5LWYAu0raOL9BeXrHAUnrSjowz0W/QZYqaeuijJuAzfKhgQMkHQa8H7ihxjYBEBGzgY+S5dw7GwwsIxvxMUDSWcCQkuPPASOrGakhaTPgPLI0x2eAb0gaXVvrra9xgO4jIuIC4KtkN/7mk/1ZfiLZyAbIgshU4EHgIWB6vq+Wuv4E/C4vaxrLB9V+ZDfO5gIvkgXLL3VRxgJg//zcBWQ9z/0j4oVa2tSp7Hsioqu/Dm4B/kg29O4psr86StMXHZNwFkiaXq6ePKU0AfhRRDwQEbPIRoKM7xghY9YT+WaymVma3IM2M0uUA7SZWZ1J+mU+qevhkn3DJP1J0qz865rlynGANjOrv18De3fadxpwe0RsCtyev++Rc9BmZgWQNBK4ISK2zN//HRgTEfMkrQ9MjogeJ0H1NHGhqX4z4ij/5jCzinz22Qm9Xttk6QtPVBxzVhr+7uPIpvd3GBcR48pctm5EzAPIg/Q65epJNkCbmaUqD8blAnKvOUCbmQG0dzVfqq6ek7R+SYrj+XIX+CahmRlA27LKt9r8gWwNcvKv15W7wD1oMzMgor1uZUmaBIwhW6RsDnA28EPgCknHAE8DZZ8+5ABtZgbQXr8AHRFHdHNoj2rKcYA2MwOoYw+6XhygzcygETcJq+YAbWYG7kGbmaUqah+dURgHaDMzqOtNwnpxgDYzA6c4zMyS5ZuEZmaJcg/azCxRvkloZpYo3yQ0M0tThHPQZmZpcg7azCxRTnGYmSXKPWgzs0S1LW12C97BAdrMDJziMDNLllMcZmaJcg/azCxRDtBmZmkK3yQ0M0uUc9BmZolyisPMLFHuQZuZJco9aDOzRLkHbWaWqGVesN/MLE3uQZuZJco5aDOzRLkHbWaWKPegzcwS5R60mVmiPIrDzCxREc1uwTs4QJuZgXPQZmbJSjBA92t2A8zMkhDtlW9lSPqKpEckPSxpkqRVammSA7SZGUBbW+VbDySNAE4GtomILYH+wOG1NMkpDjMzqHeKYwAwSNJSYFVgbi2FuAdtZgZZgK5wkzRW0tSSbWxHMRHxLPBj4GlgHrAoIm6tpUnuQZuZQVUTVSJiHDCuq2OS1gQOAkYBC4ErJR0VEROqbZJ70GZmQLRHxVsZewKzI2J+RCwFrgZ2rKVN7kGbmUE9c9BPA9tLWhV4DdgDmFpLQQ7QZmZQdnRGpSLir5KuAqYDy4D76SYdUo4DtJkZ1HUUR0ScDZzd23IcoM3MwDMJrXrqJ/a/5Tx2v+zUZjfFEuLPRQEiKt8axAE6ce89dm8WzappjLutwPy5KEAV46Abpe4BWtISSYu72+pd34ps1fWHseEeo5k1aXKzm2IJ8eeiIO1R+dYgdc9BR8RgAEnnAP8CxgMCjgQG17u+Fdm23z2KaedNYuDqg5rdFEuIPxcFqdMojnoqMsWxV0T8NCKWRMTiiPgZcGhPF5ROn7zzlVkFNi19I/YczesvLObFh55sdlMsIf5cFCfa2yveGqXIURxtko4EfgsEcATQ46+o0umTvxlxVHqPN2igdbbZjA0/vjUjdt+K/isPZODgQez8k3/nnpN/1uymWRP5c1GgBqYuKqUo6I6kpJHARcBOZAH6XuDLEfFkJdf39QBdat0d3scWx+/LHZ87v9lNsYT4c/G2zz47Qb0t45XzKo85q53Z+/oqUVgPOg/EBxVVvplZXSXYgy4sQEsaDnwRGFlaT0R8oag6V1TP3TeT5+6b2exmWGL8uaizZendJCwyB30dcDdwG2Vyz2ZmTVfFcqONUmSAXjUivllg+WZm9ZNgiqPIYXY3SNq3wPLNzOqmrw2zOwU4Q9IbwFKyySoREUMKrNPMrDYJ9qCLHMXhWYNm1jr6UoCGt57NtSmwSse+iLiryDrNzGqS4FTvIofZHUuW5tgQmAFsD9wH7F5UnWZmtargWYMNV+RNwlOAbYGnImI34EPA/ALrMzOrXV9Yza7E6xHxuiQkrRwRj0navMD6zMxql+ATVYoM0HMkrQFcC/xJ0kuAVxg3szQlmOIoc
hTHIfnL70i6ExgK3FxUfWZmvdIXArSkYV3sfij/ujrwYr3rNDPrrWjrGymOaWTLi5Yux9fxPoBNCqjTzKx3+kIPOiJG1btMM7OipTjMrtCJKmZmLcMB2swsUemloB2gzcwAYll6EbqwmYSSxleyz8wsCe1VbA1SZA96i9I3kvoDHy6wPjOzmqV4k7DuPWhJp0taAnxQ0mJJS/L3z5M9BsvMLD0J9qDrHqAj4gf5WtD/GRFDImJwvq0VEafXuz4zs3qI9qh4a5Qip3qfLulAYNd81+SIuKGo+szMeiW9e4SFrgf9A2A7YGK+6xRJO7kXbWYpimXNbsE7FXmTcD9gdET2LHNJlwH3Aw7QZpacSLAHXVUOWtKakj5YxSVrlLweWk1dZmYNVcebhJLWkHSVpMckzZS0Qy1NKtuDljQZODA/dwYwX9KfI+KrZS79AXB/vtSoyHLR7j2bWZLq3IO+CLg5Ij4paSVg1VoKqSTFMTQiFufPGPxVRJwt6cFyF0XEpDy4b0sWoL8ZEf+qpZFmZkWrV4CWNISsQ/p5gIh4E3izlrIqSXEMkLQ+8Cmg2lEY/YAXgJeAzSTtWuZ8M7OmiDZVvEkaK2lqyTa2pKhNyJ6/+itJ90u6RNJqtbSpkh70OcAtwD0RMUXSJsCschdJ+hFwGPAIb2dtArirloaamRWpmh50RIwDxnVzeACwNXBSRPxV0kXAacC3q21T2QAdEVcCV5a8fwI4tIKyDwY2j4g3qm2UmVmjRbvKn1SZOcCciPhr/v4qsgBdtW4DtKT/JuvxdikiTi5T9hPAQMAB2sySV68cdET8S9IzkjaPiL8DewCP1lJWTz3oqTW17m2vAjMk3U5JkK4gsJuZNVxE3XrQACcBE/MRHE8AR9dSSLcBOiIuK30vabWIeKWKsv+Qb2ZmyavnMLuImAFs09tyKhkHvQNwKdkTuTeWtBVwXER8qUwDL+vpuJlZStrb6tqDrotKhtldCOwFLACIiAd4ewEkM7MVQrSr4q1RKlqLIyKekZZrVFsxzTEza45GBt5KVRKgn5G0IxB5wvtkYGaxzTIza6xI74EqFQXo48nmlY8AniWbtHJCdydLup6eh+cdWGUbzcwK15I96Ih4ATiyijJ/nH/9BLAeMCF/fwTwZDWNMzNrlDoPs6uLSkZxbELWg96erGd8H/CVfEbhO0TEn/Przo2I0puJ10vyNG8zS1Jbi47iuBy4Algf2IBs2vekCq4bngd3ACSNAobX0kgzs6JFqOKtUSrJQSsixpe8nyDpxAqu+wowWVJHT3skcFyV7TMza4iWykFLGpa/vFPSacBvyVIchwE3lis4Im6WtCnw3nzXY144ycxS1WqjOKaRBeSOXyulvd8Azu3qIkm7R8Qdkj7R6dC7JRERV9fcWjOzgrRUDzoiRtVY5keBO4ADuioWcIA2s+S0tVf1iNaGqGgmoaQtgfcDq3Tsi4jfdHVuRJydf61p9SYzs2ZotRQHAJLOBsaQBeibgH2Ae4AuA7SkHh8mGxEXVN1KM7OCtbfiOGjgk8BWwP0RcbSkdYFLejh/cP51c7IHxnYsOXoAftyVmSWqJSeqAK9FRLukZfnTap8neyhilyLiuwCSbgW2jogl+fvvUPLoLDOzlLRkigOYKmkN4BdkIzteBv5WwXUbs/yjxt8kGwtdkS/Mv7PSU60PeW3u3c1ugq2gWjLFUbIw/88l3QwMiYgHKyh7PPA3SdeQjd44BPAi/maWpJYaxSFp656ORcT0ngqOiO9J+iOwS77r6Ii4v7ZmmpkVK8EMR4896PN7OBbA7uUKz4N4j4HczCwFLZXiiIjdGtkQM7NmatVRHGZmK7w6PtS7bhygzcyAwD1oM7MkLUswxVF2XIkyR0k6K3+/saTtim+amVnjBKp4a5RKBv79FNiB7JmCAEuAiwtrkZlZE7RXsTVKJSmOj0TE1pLuB4iIlyStVHC7zMwaqlVz0Esl9Scfxy1pOGne8DQzq1mKQa2SAP0T4BpgHUnfI1vd7sxCW2Vm1mBtrdiDjoiJkqYBe5A9/urgiJhZeMvMzBoowSdeVbRg/8bAq8D1pfsi4ukiG2Zm1kjtrdiDJnuCd8fDY1cBRgF/B7YosF1mZg3VaoslARARHyh9n69yd1w3p5uZtaRWvUm4nIiYLmnbIhpjZtYs7WrBFEenh8D2A7YG5hfWIjOzJmhrdgO6UEkPenDJ62VkOenfF9McM7PmqPcojnz+yFTg2YjYv5YyegzQeQWrR8TXaynczKxVFDCK4xRgJjCk1gK6XYtD0oCIaCNLaZiZrdCiiq0cSRsC+wGX9KZNPfWg/0YWnGdI+gNwJfBKx8GIuLo3FZuZpaSaFIekscDYkl3jImJcyfsLgW+wfIq4apXkoIcBC8ieQdgxHjoAB2gzW2FUM8wuD8bjujomaX/g+YiYJmlMb9rUU4BeJx/B8TBvB+a32tebSs3MUtNWvxT0TsCBkvYlm9w3RNKEiDiq2oJ6CtD9gdWhy8y5A7SZrVDqNVElIk4HTgfIe9BfqyU4Q88Bel5EnFNLoWZmrabVZhKmN63GzKwgRTySMCImA5Nrvb6nAL1HrYWambWalupBR8SLjWyImVkztepUbzOzFV5LLthvZtYXtFSKw8ysL3GANjNLVIqTOxygzcxwDtrMLFkexWFmlqj2BJMcDtBmZvgmoZlZstLrPztAm5kB7kGbmSVrmdLrQztAm5nhFIeZWbKc4jAzS5SH2ZmZJSq98OwAbWYGOMVhZpastgT70A7QZma4B21mlqxwD9rMLE0p9qD7NbsB1rO9Pj6GRx6+i8cevYdvfP2EZjfHmuTM71/ArvsdzsFHHf/WvkWLl3DsKWew72HHcOwpZ7Bo8ZImtrD1tRMVb43iAJ2wfv368ZOLvsf+BxzFB7bajcMOO5j3vW/TZjfLmuDgfT/Gzy84b7l9l4y/gu23Gc1Nv7uU7bcZzaUTrmhS61YMUcXWKA7QCdtu2w/xz38+yezZT7N06VKuuOI6Djxgr2Y3y5pgm9EfYOiQwcvtu/Pu+zhonz0BOGifPbnjrvua0bQVxjKi4q1RHKATtsGI9Xhmzty33s95dh4bbLBeE1tkKVnw0kKGrz0MgOFrD+PFhYua3KLWFlX81yiF3CSU9ImejkfE1d1cNxYYC6D+Q+nXb7UCWtc6pHc+JC0ivTvNZiuCFG8SFjWK44D86zrAjsAd+fvdgMlAlwE6IsYB4wAGrDSiz0eiZ+fMY6MNN3jr/YYj1mfevOea2CJLyVprrsH8F15k+NrDmP/CiwxbY2izm9TSUhxmV0iKIyKOjoijyfLp74+IQyPiUGCLIupbUU2ZOoP3vGcUI0duxMCBA/nUpw7i+htubXazLBFjdt6e6/54GwDX/fE2dttlhya3qLW1V7E1StHjoEdGxLyS988BmxVc5wqjra2NU758JjfdeDn9+/Xj15f9jkcffbzZzbIm+PrZP2TK/Q+ycOFi9jj4KL50zGc49jOf4tRvf5+rb7iF9dcdzgXnfavZzWxpbQmmD1VkTlPS/wCbApPIetOHA/+I
```python
yhat = knn_cv.predict(X_test)
plot_confusion_matrix(Y_test, yhat)
```

[Output: confusion-matrix plot]
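
The `plot_confusion_matrix` helper is defined earlier in the notebook (not shown in this excerpt); a minimal sketch of such a helper, assuming scikit-learn, seaborn, and matplotlib, could look like:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred):
    """Heatmap of the confusion matrix for a classifier's predictions."""
    cm = confusion_matrix(y_true, y_pred)
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, fmt='d', ax=ax)  # annotate each cell with its count
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix')
    plt.show()
```
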
## TASK 12

Find the method that performs best:
```python
tree_cv
```

Output:

```
GridSearchCV(cv=10, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [2, 4, 6, 8, 10, 12, 14, 16, 18],
                         'max_features': ['auto', 'sqrt'],
                         'min_samples_leaf': [1, 2, 4],
                         'min_samples_split': [2, 5, 10],
                         'splitter': ['best', 'random']})
```
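
The fitted GridSearchCV object exposes the winning configuration through its standard scikit-learn attributes, for example:

```python
# Best hyperparameter combination and its cross-validated score
print("Tuned hyperparameters:", tree_cv.best_params_)
print("Best CV accuracy:", tree_cv.best_score_)
```
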
## Authors

[Joseph Santarcangelo](https://www.linkedin.com/in/joseph-s-50398b136/) has a PhD in Electrical Engineering; his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.
## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description |
| ----------------- | ------- | ------------- | ----------------------- |
| 2021-08-31 | 1.1 | Lakshmi Holla | Modified markdown |
| 2020-09-20 | 1.0 | Joseph | Modified multiple areas |
Copyright © 2020 IBM Corporation. All rights reserved.

Machine Learning Foundations

Notes for the MLF course

Assignments

Each task should have its own report and IPython Notebook. Once again, we emphasize the report: it should contain all your questions and their properly supported statistical answers. Use figures, pictures, and tables. DO NOT PUT ANY CODE IN THE REPORT.

Table of contents

  • Assignment 1
  • Assignment 2
  • Assignment 3
  • Final Project
  • Description
  • Announcements
  • Class Logistics

Live Session Plan

  • Assignments and Final Project Submission Guidelines

DataSci 207: Applied Machine Learning

Lecture: Mon, Tue, Thu; office hours: Tue, 8-9 am PT.

This course provides a practical introduction to the rapidly growing field of machine learning: training predictive models that generalize to new data. We start with linear and logistic regression and implement gradient descent, the core engine for training, for these algorithms. With these key building blocks, we work our way up to understanding widely used neural network architectures, focusing on intuition and implementation with TensorFlow/Keras. While the course centers on neural networks, we will make sure to cover key ideas in unsupervised learning and nonparametric modeling.
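
As a taste of that core engine, here is a minimal, illustrative numpy sketch of batch gradient descent for linear regression (a sketch for orientation, not course-provided code):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_steps=1000):
    """Fit y ~ X @ w + b by minimizing mean squared error with batch gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        resid = X @ w + b - y               # predictions minus targets
        w -= lr * (2 / n) * (X.T @ resid)   # gradient of MSE with respect to w
        b -= lr * (2 / n) * resid.sum()     # gradient of MSE with respect to b
    return w, b
```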

Along the way, weekly short coding assignments will connect lectures with concrete data and real applications. A more open-ended final project will tie together crucial concepts in experimental design and analysis with models and training.

This class meets for one 90-minute class period each week.

All materials for this course are posted on GitHub in the form of Jupyter notebooks.

  • Please fill out this PRE-COURSE survey so I can get to know a bit more about you and your programming background.
  • Due to a large number of private Slack inquiries, I encourage you to first read this website for commonly asked questions.
  • Any questions regarding course content and organization (including assignments and final project) should be posted on my Slack channel. You are strongly encouraged to answer other students' questions when you know the answer.
  • If there are private matters specific to you (e.g., special accommodations), please contact me directly.
  • If you miss a class, watch the recording and inform me here .
  • If you want to stay up to date with recent work in AI/ML, start by looking at the conferences NeurIPS and ICML .
  • ML study guidelines: Stanford's super cheatsheet .

Core data science courses: research design, storing and retrieving data, exploring and analyzing data.

Undergraduate-level probability and statistics. Linear algebra is recommended.

Python (v3).

Jupyter and JupyterLab notebooks. You can install them on your computer using pip or Anaconda . More information here .

Git(Hub), including clone/commit/push from the command line. You can sign up for an account here.

If you have an Apple M1 Mac, this .sh script will install everything for you (credit goes to one of my former students, Michael Tay).

Mac/Windows/Linux are all acceptable to use.

  • Raschka & Mirjalili (RM) , Python machine learning: Machine learning and deep learning with Python, scikit-learn, and TensorFlow 2.
  • Weekly coding assignments, submitted via GitHub and Digital Campus (see notes below).
  • You will present your final project in class during the final session. You are allowed to work in teams (see notes below).
  • You will submit your code and presentation slides via GitHub (see notes below).
Supervised and Unsupervised Learning

| Week | Lecture | Readings | Deadlines (Sunday of the week, 11:59 pm PT) |
| --- | --- | --- | --- |
| Jan 09-15 | Introduction and Framing | | |
| Jan 16-22 | Linear Regression - part 1 | RM (10; 13, intro to TensorFlow only) | |
| Jan 23-29 | Linear Regression - part 2 | RM (4, 2) | |
| Jan 30 - Feb 05 | Logistic Regression - part 1 | RM (3; 6, pp. 211-219) | Group, question, and dataset for final project |
| Feb 06-12 | Logistic Regression - part 2 | RM (3; 6, pp. 211-219) | |
| Feb 13-19 | Feedforward Neural Networks | RM (12, 13, 14) | |
| Feb 20-26 | KNN, Decision Trees, and Ensembles | RM (3, 7) | |
| Feb 27 - March 05 | K-Means and PCA | RM (11) | Assignment 7; baseline presentation: slides |
| March 06-12 | Sequence modelling and embeddings | RM (8, 16) | |
| March 13-19 | Convolutional Neural Networks | RM (15) | |
| March 20-26 | Network Architecture and Debugging ML algorithms | | |
| March 27 - April 02 | Spring Break | | |
| April 03-09 | Fairness in ML | | |
| April 10-16 | Advanced Topics: RNN, Transformers, BERT | | |
| April 17-23 (Week 14) | Final presentations | | Final presentation: slides and code |

Communication channel

| Sections | Slack channel |
| --- | --- |
| 1, 2, 99 | #datasci-207-2023-spring-ci |

For the final project you will form a group (3-4 people is ideal; 2-5 people are allowed; one-person groups are not allowed). Grades will be calibrated by group size. Your group can only include members from the section in which you are enrolled.

Do not just re-run an existing code repository; at the minimum, you must demonstrate the ability to perform thoughtful data preprocessing and analysis (e.g., data cleaning, model training, hyperparameter selection, model evaluation).

The topic of your project is totally flexible (see also below some project ideas).

  • week 04: inform me here about your group, question and dataset you plan to use.
  • week 08: prepare the baseline presentation of your project. You will present in class (no more than 10 min).
  • week 14: prepare the final presentation of your project. You will present in class (no more than 10 min).
  • Second Sight through Machine Learning
  • Can we predict solar panel electricity production using equipment and weather data?
  • Predict Stock Portfolio Returns using News Headlines
  • Pneumonia Detection from Chest Xrays
  • Predicting Energy Usage from Publicly Available Building Performance Data
  • Can we Predict What Movies will be Well Received?
  • ML for Music Genre Classification
  • Predicting Metagenome Sample Source Environment from Protein Annotations
  • California Wildfire Prediction
  • Title, Authors
  • What is the question you will be working on? Why is it interesting?
  • What is the data you will be using? Include data source, size of dataset, main features to be used. Please also include summary statistics of your data.
  • What prediction algorithms do you plan to use? Please describe them in detail.
  • How will you evaluate your results? Please describe your chosen performance metrics and/or statistical tests in detail.
  • (15%) Motivation: Introduce your question and why the question is interesting. Explain what has been done before in this space. Describe your overall plan to approach your question. Provide a summary of your results.
  • (15%) Data: Describe in detail the data that you are using, including the source(s) of the data and relevant statistics.
  • (15%) Approach: Describe in detail the models (baseline + improvement over baseline) that you use in your approach.
  • (30%) Experiments: Provide insight into the effect of different hyperparameter choices. Please include tables, figures, and graphs to illustrate your experiments.
  • (10%) Conclusions: Summarize the key results, what has been learned, and avenues for future work.
  • (15%) Code submission: Provide link to your GitHub repo. The code should be well commented and organized.
  • Contributions: Specify the contributions of each author (e.g., data processing, algorithm implementation, slides etc).
  • Create a GitHub repo for Assignments 1-10. Upload the homework's .ipynb file to Gradescope each week before the deadline.
  • Create a team GitHub repo for Final Project. This repo will contain your code as well as PowerPoint slides. Add me as a contributor if your repo is private (my username is corneliailin), and add the link to your repo here

Final grades will be determined by computing the weighted average of programming projects, final group project, and participation.

The baseline grading range for this course is: A for 93 or above, A- for 90 or above, B+ for 87 or above, B for 83 or above, B- for 80 or above, C+ for 77 or above, C for 73 or above, C- for 70 or above, D+ for 67 or above, D for 63 or above, D- for 60 or above, and F for 59 or below.

Participation: 5%
Assignments: 65%
Final project: 30%
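
For example, hypothetical scores of 90 (participation), 85 (assignments), and 95 (final project) combine to 0.05 × 90 + 0.65 × 85 + 0.30 × 95 = 88.25, a B+ under the baseline range above.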

Integrating a diverse set of experiences is important for a more comprehensive understanding of machine learning. I will make an effort to read papers and hear from a diverse group of practitioners; still, limits exist on this diversity in the field of machine learning. I acknowledge that there may be both overt and covert biases in the material due to the lens through which it was created. I would like to nurture a learning environment that supports a diversity of thoughts, perspectives and experiences, and honors your identities (including race, gender, class, sexuality, religion, ability, veteran status, etc.) in the spirit of the UC Berkeley Principles of Community.

To help accomplish this, please contact me or submit anonymous feedback through I School channels if you have any suggestions to improve the quality of the course. If you have a name and/or set of pronouns that you prefer I use, please let me know. If something was said in class (by anyone) or you experience anything that makes you feel uncomfortable, please talk to me about it. If you feel that your performance in the class is being impacted by experiences outside of class, please don't hesitate to come and talk with me. I want to be a resource for you. Also, anonymous feedback is always an option and may lead me to make a general announcement to the class, if necessary, to address your concerns.

As a participant in teamwork and course discussions, you should also strive to honor the diversity of your classmates.

If you prefer to speak with someone outside of the course, MICS Academic Director Lisa Ho, I School Assistant Dean of Academic Programs Catherine Cronquist Browning, and the UC Berkeley Office for Graduate Diversity are excellent resources. Also see the following link.

Teach with GitHub Classroom

Learn how to set up your classroom and assignments.

Manage classrooms

You can create and manage a classroom for each course that you teach using GitHub Classroom.

About assignments

You can use individual or group assignments in GitHub Classroom to teach students and grade their work.

Use the Git and GitHub starter assignment

You can use the Git & GitHub starter assignment to give students an overview of Git and GitHub fundamentals.

Create an individual assignment

You can create an assignment for students in your course to complete individually.

Create a group assignment

You can create a collaborative assignment for teams of students who participate in your course.

Editing an assignment

You can edit existing assignments in your course.

Extending an assignment's deadline for an individual or group

You can grant individual students and groups extensions to allow them more time to submit an assignment.

Monitor students' progress with the assignment overview page

You can use the assignment overview page to track the progress of each student or team on an assignment.

Reuse an assignment

You can reuse existing assignments in more than one classroom, including classrooms in a different organization.

Create an assignment from a template repository

You can create an assignment from a template repository to provide starter code, documentation, and other resources to your students.

Leave feedback with pull requests

You can leave feedback for your students in a special pull request within the repository for each assignment.

Use autograding

You can automatically provide feedback on code submissions from your students by configuring tests to run in the assignment repository.

Using GitHub Classroom with GitHub CLI

You can use gh , the GitHub command line interface, to work with GitHub Classroom directly from your command line.

Register a learning management system with GitHub Classroom

You can configure an LTI-compliant learning management system (LMS) with GitHub Classroom.

Connect a learning management system course to a classroom

You can configure an LTI-compliant learning management system (LMS) course to connect to GitHub Classroom so that you can import a roster for your classroom.

Open access | Published: 01 July 2024

Predicting glycan structure from tandem mass spectrometry via deep learning

  • James Urban (1,2)
  • Chunsheng Jin (3)
  • Kristina A. Thomsson (3)
  • Niclas G. Karlsson (4) (ORCID: orcid.org/0000-0002-3045-2628)
  • Callum M. Ives (5) (ORCID: orcid.org/0000-0003-0511-1220)
  • Elisa Fadda (6)
  • Daniel Bojar (1,2) (ORCID: orcid.org/0000-0002-3008-7851)

Nature Methods (2024)

Subjects:

  • Data mining
  • Glycobiology
  • Machine learning

Glycans constitute the most complicated post-translational modification, modulating protein activity in health and disease. However, structural annotation from tandem mass spectrometry (MS/MS) data is a bottleneck in glycomics, preventing high-throughput endeavors and relegating glycomics to a few experts. Trained on a newly curated set of 500,000 annotated MS/MS spectra, here we present CandyCrunch, a dilated residual neural network predicting glycan structure from raw liquid chromatography–MS/MS data in seconds (top-1 accuracy: 90.3%). We developed an open-access Python-based workflow of raw data conversion and prediction, followed by automated curation and fragment annotation, with predictions recapitulating and extending expert annotation. We demonstrate that this can be used for de novo annotation, diagnostic fragment identification and high-throughput glycomics. For maximum impact, this entire pipeline is tightly interlaced with our glycowork platform and can be easily tested at https://colab.research.google.com/github/BojarLab/CandyCrunch/blob/main/CandyCrunch.ipynb . We envision CandyCrunch to democratize structural glycomics and the elucidation of biological roles of glycans.

As the most abundant post-translational modification, glycans are frequently dysregulated and mechanistically involved in diseases ranging from cancer 1 to metabolic disorders 2 . The exact structure of complex carbohydrates is often key in mediating their function 3 , such as sialic acid only facilitating influenza infection in a particular linkage orientation 4 . From biomarkers to mechanistic understanding 1 , 2 , structural resolution thus is relevant for integrating and using glycan information for biomedical gains. In the context of systems biology, glycans are routinely measured via mass spectrometry (MS)-based glycomics 5 , providing insights into which structures or substructures are dysregulated, which can be further analyzed with various methods 6 , 7 .

Currently, structural determination of glycans is, at best, semi-manual and proceeds structure by structure 8 . Since different glycan structures can result in the same mass, structural isomers are routinely separated via liquid chromatography (LC) 9 , followed by fragmentation into smaller substructures by MS, conceptually akin to shotgun sequencing. Current in-depth workflows are hard to parallelize, with a general trade-off between resolution and scale 10 . All this has relegated structural glycomics to a few experts, inaccessible to most life science researchers.

Extensive work by Harvey and others 8 , 11 , 12 has demonstrated that, in principle, most substructures 13 , linkages 14 and monosaccharides 15 have diagnostic fragments or intensity ratios. Using this fine structural information that is contained within MS/MS spectra, along with basic biosynthetic assumptions, it is thus frequently possible to achieve high-resolution annotations of native glycans. In practice, however, annotation is often restricted to essentially topological assignments, not least due to time-constraints. Nuances of diagnostic indicators are challenging for humans to decrypt manually or encode programmatically, especially at scale and accommodating diverse experimental setups, as each linkage and monosaccharide can be affected by its sequence context 16 . This combinatorial explosion, combined with rich data, is promising for scalable artificial intelligence (AI) approaches which can learn complex mapping functions, as recently demonstrated by endeavors such as AlphaFold2 (ref. 17 ).

So far, computational attempts to automate MS-based glycomics 18 , 19 , 20 , 21 , 22 , 23 did not engage with deep learning. Rather, they relied on various search methods, to search for either possible topologies given a precursor ion mass or suitable reference spectra, loose constraints that may yield unphysiological predictions. Their primary limitations are scale and annotation resolution, ranging from composition to glycan topology. Neither linkage type nor monosaccharide stereoisomers are commonly resolved during this algorithmic sequencing. Additional hurdles to their wider adoption include poor generalizability, as none of them employ a rigorous train–test mentality, a standard practice in machine learning to evaluate methods on held-out data to prevent overfitting. Many tools were designed for very specific problems and were often tested on few spectra 18 , 19 , 20 , precluding their usage in many experimental setups.

Recent efforts in related fields, particularly in proteomics 24 , 25 , have employed scalable deep learning strategies in MS analysis. Proteomics has partially similar challenges to glycomics, for example, precursor structure elucidation given fragment ions. We thus posit that the translation of analogous methods to structural glycomics, combined with domain-inspired additions such as biosynthetic constraints and building on the accumulated work of many years of glycomics analysts, could be a major leap forward for the field and the usage of glycomics in the broader life sciences.

We present a scalable and accurate workflow for predicting glycan structure from liquid chromatography with tandem mass spectrometry (LC–MS/MS) data, centered on our deep learning model, CandyCrunch. Using a large-scale, curated set of tandem spectra from diverse experimental setups, CandyCrunch predicts glycan structure with high accuracy (~90%), outperforms existing methods on this task and matches/extends expert annotations on unseen data. This is facilitated by various domain-specific advancements, for example, considering glycan structure similarity in the loss function. We embedded this into a downstream workflow converting predictions into interpretable results, further reducing false positive rates, and estimating relative abundances; all in seconds. This workflow includes CandyCrumbs, a comprehensive MS/MS fragment annotation plug-in we developed here. We used this to uncover diagnostic fragments and more complex fragmentation behavior at scale, underpinned by molecular dynamics simulations. Finally, we annotate novel glycomes, analyze biosynthetic constraints at scale and demonstrate that our pipeline can be used in high-throughput glycomics. Our methods are accessible within a Python package ( https://github.com/BojarLab/CandyCrunch ), a free-standing Google Colab notebook at https://colab.research.google.com/github/BojarLab/CandyCrunch/blob/main/CandyCrunch.ipynb and a command line interface available via our Python package (further usage description at https://github.com/BojarLab/CandyCrunch ).

CandyCrunch predicts glycan structure via domain knowledge

Reasoning that the fragmentation patterns and propensities (that is, intensity ratios) in MS/MS are predictive of glycan structure—a relationship that is used by human experts in annotation—we set out to learn this association via machine learning. For this, we collected and curated an unprecedentedly large set of annotated LC–MS/MS spectra that derive from glycans (Fig. 1a,b and Methods ). We envision that, even beyond our efforts here, this dataset will be a valuable resource for data-driven approaches in glycomics. Crucially, this dataset aims to provide a representative view over current glycomics data, with a total of nearly 500,000 labeled MS/MS spectra from >2,000 glycomics experiments, encompassing all major eukaryotic glycan classes ( N -linked, O -linked, glycosphingolipid, milk oligosaccharides) and the most common experimental setups for glycomics. The exact composition of this dataset, broken down by glycan classes and experimental parameters, can be found in Supplementary Table 1 . To avoid overrepresenting some classes (for example, core 1 O -glycan), we then limited each class to a maximum of 1,000 spectra in the independent test set (see Methods for details) and used the remaining ~450,000 spectra to train our model on the most likely glycans in a multiclass classification setup ( Methods ).

Figure 1:

a , b , Overview of the curated dataset of glycomics LC–MS/MS by glycan class ( a ) and source ( b ). Diagonal bars indicate positive ion mode data. The numbers correspond to spectra with annotations. c , Schematic view of CandyCrunch model architecture. d , Pipeline of curating glycan predictions from raw file to final output table. e , Evaluating top-1 accuracy on the independent test set ( Methods ; see ref. 43 ) across different levels of resolution. f , Learned representations of all spectra in the test set are shown via t -distributed stochastic neighbor embedding ( t -SNE), colored by glycan class. Examples are illustrated with their glycan structures. g , Excerpt from an example prediction output using our Colab notebook on the file JC_171002Y1.mzML (ref. 44 ). h , Proportional Venn diagram of the comparison of CandyCrunch and Glycoforest on the raw file JC_131210PMpx5.mzML (ref. 18 ), not used for training CandyCrunch but used for developing Glycoforest. Shown are topologies (Glycoforest does not output full structures) matching those of the human annotator for each model (see Supplementary Fig. 11 for detailed comparison). All masses shown are from reduced glycans. Glycans here and in the entire paper are drawn using GlycoDraw 45 according to the Symbol Nomenclature for Glycans (SNFG). Conv, convolutional layer; d , dilation; MO/GSL, milk oligosaccharides/glycosphingolipids; PGC, porous graphitized carbon chromatography; RT, retention time.

This resulted in our dilated residual neural network, CandyCrunch, a model architecture suited to MS data 25 . Since experimental parameters such as the ion mode drastically change fragmentation patterns, it uses the MS/MS spectrum, retention time, precursor ion m/z and experimental parameters (for example, LC type, ion mode and so on) as input and predicts glycan rankings as its output (Fig. 1c ), using information from these different sources of input which are only partly redundant (Supplementary Tables 2 and 3 ). We note that we neither claim, nor sought to obtain, the most frugal model for this task, but rather the most performant and flexible, without noticeable hardware limitations (CandyCrunch can be readily used on a typical laptop). Our current binning strategy lowers the effective resolution of the mass spectrometer. Yet we note that, for the moment, analyzing the data at higher resolution, more closely approximating the true instrument resolution, does not give rise to higher accuracy (Supplementary Table 4 ), as most fragments are uniquely specified by our current binning method (Supplementary Fig. 1 and Methods ). Further, capturing minute mass differences such as between CH 4 and O (0.036 Da) would require impractically fine binning and is invalidated by the large proportion of low-resolution data in our training dataset. Available options to run CandyCrunch are shown in Supplementary Fig. 2 . The model is part of a pipeline applied to a raw file (for example, .mzML or .mzXML files), which groups predictions based on mass and retention isomers and further curates predictions with, for example, diagnostic ions (Fig. 1d ). We confirmed that this grouping procedure even succeeded in the case of retention time overlaps between peaks (Supplementary Fig. 3 ), although we caution that biological samples may contain more closely co-eluting structures that are not disambiguated by human annotators in the data used to train CandyCrunch.

If precursor ion intensities are available in the raw file, this pipeline can also estimate relative abundances. These abundances correlate well with those gained by LC peak area integration (Supplementary Fig. 4 ), a state-of-the-art approach for estimating relative abundances. We caution that overlapping isomer peaks may lead to moderate uncertainties in their quantification. Overall, CandyCrunch is highly performant, with an accuracy of ~90% of the top-ranked structure prediction in the independent test set (Fig. 1e ), performing comparably across glycans (Supplementary Fig. 5a,b ) and across different MS setups, glycan classes and derivatized glycans (Supplementary Tables 2 , 5 and 6 ), albeit with lower performance on data-poorer categories such as permethylated glycans. We also note that any evaluation is partly confounded by different annotation qualities, which may be, for instance, substantiated by exoglycosidase treatment in some cases but not in others, resulting in more ambiguous ‘ground truths’. Consequently, higher-quality data further improve performance, reaching up to ~95% accuracy currently (Supplementary Table 6 ). Custom loss functions estimating structural distance to the ground truth, and many more domain knowledge-inspired modifications ( Methods ), ensure that even erroneous predictions are structurally close to the correct solution. We quantified this statement by analyzing that structures with more shared motifs need fewer of their ‘own’ spectra to reach high prediction accuracy, indicating effective cross-training (Supplementary Fig. 6 ). Our approach also includes incompletely resolved structures, so that prediction uncertainty can be meaningfully conveyed via, for instance, missing linkage information (indicated by a higher topology accuracy than structure accuracy; Fig. 1e ). Further, the prediction score is a meaningful indicator of confidence and, when comparing top-1 predictions of the same structures, is higher for correct predictions (Supplementary Table 7 ).
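
The paper does not print its loss function; purely as an illustration, a structure-aware classification loss of the kind described could be sketched as a soft-label cross-entropy over a precomputed glycan-similarity matrix (all names here are illustrative, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def structure_aware_loss(logits, target_idx, sim):
    """Cross-entropy against soft labels spread over structurally similar glycans.

    logits:     (batch, n_classes) raw model outputs
    target_idx: (batch,) indices of the ground-truth glycan
    sim:        (n_classes, n_classes) structural-similarity matrix, rows summing to 1
    """
    soft_targets = sim[target_idx]               # similarity-derived label distribution
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```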

Learned representations of spectra by CandyCrunch cluster by glycan sequence and glycan class (Fig. 1f ), demonstrating that the model has learned to accommodate experimental variability. Further, structurally related glycans, even within the same class, tend to cluster together in the learned representation space. This can be quantified by comparing the cosine distance of learned representations of pairs of glycans with their structural distance, revealing that the co-clustering described by the representations is indeed suggestive of structural relatedness of glycans (two-sided Mantel test of correlating the two resulting cosine distance matrices; P  < 0.001), already alluded to via Supplementary Fig. 6 above.

In framing CandyCrunch as a multiclass classification problem (that is, ranking the likelihood of pre-defined glycans), we minimized the chance for unphysiological glycans in the output, a very real possibility otherwise, given the sparsity of real glycan sequences among possible sequences 26 . However, this made zero-shot predictions—predicting a glycan sequence that was absent from our training set—conceptually infeasible. As repositories such as GlycoPOST do not catalog all physiological glycans, and glycomics studies, such as mucin-type O -glycomics 27 or milk glycomics 28 , routinely discover new structures, we set out to augment our pipeline to allow for, limited, zero-shot prediction outside our 3,391 defined glycans.

Reasoning that glycans in a biological sample tend to be biosynthetically related, that is, contain precursors/intermediates of larger biosynthetic pathways, we turned to our recently developed method of constructing glycan biosynthetic networks 7 . Applying this method to a typical CandyCrunch output (Supplementary Fig. 7 ) revealed the existence of necessary intermediate structures that were absent from our predictions but would explain spectra without a valid prediction. We thus added this routine as an optional step in our inference workflow, to facilitate a certain subset of physiological zero-shot predictions, which we support empirically (Supplementary Fig. 8 ). We caution that this additional workflow step is only expected to add value if mixtures of related glycans, such as in cells, blood or tissue, are analyzed, not purified synthetic structures.

CandyCrunch is fundamentally database-independent but can be further enhanced by methods leveraging databases, such as defined within glycowork 29 , to augment predictions downstream. By carefully selecting a suitable subset of reference structures (for example, by taxonomy, glycan class or tissue), matches for unexplainable spectra could be proposed. These potential matches were then cross-checked for diagnostic ions as well as ranked by biosynthetic compatibility with true predictions. This, again, allowed for a certain subset of zero-shot predictions. It should be noted that this procedure still balanced the theoretical constraint of physiological glycans with the reality of encountering novel structures in biological samples. Our final inference workflow then also contained this latter expansion, resulting in a ranked prediction output that can be further investigated by the researcher (Fig. 1g ). It should be noted, however, that the default in the provided notebook is not to run zero-shot inference, as this requires much more expert review than our regular model-based inference. We also developed a workflow for batches of samples, which accommodates shifts in retention time by grouping peaks across samples, resulting in improved predictions (Supplementary Fig. 9 ).

Next, we compared CandyCrunch with alternative approaches to this problem. As a preface, we should note that no current approach combines CandyCrunch’s advantages of scale, generalizability, performance and its flexibility in usage (Supplementary Fig. 10a ). Further, most methods are maintained for only the briefest of periods and are no longer realistically accessible. Thus, we had to effectively constrain ourselves to compare CandyCrunch on individual raw files that were specifically used to build these alternative approaches, while we excluded them during training. Still, in direct comparison with state-of-the-art methods such as Glycoforest 18 on challenging fish mucin glycans, CandyCrunch demonstrated a greater overlap with manual expert annotations (Fig. 1h ; 62.5% versus Glycoforest’s 40.6%) and a substantially higher structural resolution (Supplementary Fig. 11 ). In addition, by tethering CandyCrunch and the below-mentioned CandyCrumbs to our glycowork ecosystem 29 and by providing everything open-source, we substantially increase the chances for the long-term viability of our presented methods.

Applied to fully unseen datasets, CandyCrunch routinely achieved high performance (Supplementary Table 8 ; topology: 75.3% top-1 accuracy, structure: 72.4% top-1 accuracy) and potentially can extend expert annotations by correctly capturing additional structures and isomers (Supplementary Fig. 12 ). The additional predictions in this sample partly even stemmed from remnant glycans from the previous sample, showcasing the exceptional sensitivity of our model. We would like to highlight here that the cross-training of CandyCrunch on all glycan classes yielded performance synergy, as a model only trained on O -glycans performed worse for predicting O -glycans (Supplementary Table 8 ; topology: 72.3% accuracy, structure: 66.9% accuracy) than the model trained on all classes. We posit that this was due to the structure-based loss function we used for training, as well as shared information between spectra of different classes, stemming from shared glycan motifs across classes (for example, Neu5Ac-Hex).

The speed and relatively low resource requirements of CandyCrunch (Supplementary Fig. 13a ) mean that samples can be exhaustively analyzed, without practical constraints to the most abundant structures, which is a routine necessity in human analysis. In its typical application, CandyCrunch also makes fewer assumptions about what is or should be present in a sample, enhancing the chances for novel discoveries. This means, for example, that co-released N -glycans can be detected in O -glycan preparations (Supplementary Fig. 14 ).

CandyCrumbs facilitates automated diagnostic ion discovery

When analyzed by humans, fragment ions are usually annotated via the Domon–Costello nomenclature 30 and used for elucidating the structure of a glycan. While there are programs that automate this assignment 21 , 22 , they either are only accessible via graphical user interfaces or only provide annotations for simple fragment ions. We thus decided to implement an exhaustive Python-based solution to this problem, CandyCrumbs, which is also freely available via the CandyCrunch Python package. Given a candidate glycan sequence and fragment peaks, CandyCrumbs can automatically and rapidly (Supplementary Fig. 13b ) annotate fragment ions in Domon–Costello and International Union of Pure and Applied Chemistry (IUPAC)-condensed nomenclature (Fig. 2a and Methods ). Compared with alternative approaches, this presents the most feature-complete and rapid implementation of this task (Supplementary Fig. 10b ).

Figure 2:

a , Schematic view of the CandyCrumbs workflow for automatic fragment ion annotation. b – e , Negative ion mode spectra of reduced glycans with prediction confidence between 0.9 and 1.0 for Fucα1-2Galβ1-3GalNAc/Galβ1-4GlcNAcβ1-3Fuc ( b ), Neu5Acα2-3Galβ1-3GalNAc/Galβ1-3(Neu5Acα2-6)GalNAc ( c ), GlcNAcβ1-3(Neu5Acα2-6)GalNAc/GalNAcα1-3(Neu5Acα2-6)GalNAc ( d ) and GlcNAcβ1-3(Neu5Gcα2-6)GalNAc/GalNAcα1-3(Neu5Gcα2-6)GalNAc ( e ) were averaged and juxtaposed. Fragments exhibiting differential abundance were labeled by CandyCrumbs in the Domon–Costello nomenclature 30 . f , g , Negative ion mode spectra of reduced glycans with prediction confidence between 0.6 and 1.0 for Neu5Acα2-3Galβ1-4GlcNAcβ1-2Manα1-3(Neu5Acα2-6Galβ1-4GlcNAcβ1-2Manα1-6)Manβ1-4GlcNAcβ1-4GlcNAc/Neu5Acα2-3Galβ1-4GlcNAcβ1-2Manα1-3(Neu5Acα2-6Galβ1-4GlcNAcβ1-2Manα1-6)Manβ1-4GlcNAcβ1-4GlcNAc ( f ) and Neu5Acα2-6Galβ1-4GlcNAcβ1-2Manα1-3(Galβ1-4GlcNAcβ1-2Manα1-6)Manβ1-4GlcNAcβ1-4GlcNAc/Neu5Acα2-3Galβ1-4GlcNAcβ1-2Manα1-3(Galβ1-4GlcNAcβ1-2Manα1-6)Manβ1-4GlcNAcβ1-4GlcNAc ( g ) were averaged, juxtaposed and labeled similar to b – e . Doubly charged fragment ions are colored gray.

Further, we used several domain knowledge-inspired heuristics and probability rules to highlight the most probable fragments (Supplementary Fig. 15 and Methods ), if multiple fragmentation options could result in an m/z value that was acceptable at a given threshold. We then also integrated CandyCrumbs within the aforementioned open-access Colab notebook (at https://colab.research.google.com/github/BojarLab/CandyCrunch/blob/main/CandyCrunch.ipynb ) for full flexibility. Our implementation of CandyCrumbs then allowed us to use it in a high-throughput setting and integrate it into CandyCrunch workflows, such as for identifying diagnostic ions at scale as discussed below, to aid expert annotation of challenging cases.

Reference spectra are routinely used as high-quality examples in semi-manual annotation 31 . As ‘spectrum quality’ is an ill-defined and subjective characteristic, we aimed to quantify this aspect by using calibrated 32 prediction confidence of CandyCrunch as a proxy, with the reasoning that a more confidently assessed spectrum is a higher-quality spectrum with more information for effective prediction. Rather than one reference spectrum, that is, the usual approach 31 , we then extracted hundreds to thousands of high-quality spectra for a given structure from our dataset and engaged in highly powered statistical comparisons between isomers. This identified numerous diagnostic ions and/or ratios for topologically distinct (Fig. 2b,c ) and identical (Fig. 2d,e ) isomers, with large effect sizes. This also extended to other glycan classes and, for example, facilitated detecting conserved fragmentation differences of linkages (for example, stronger B 3 ion in α2-6 versus α2-3) across glycan backbones (Fig. 2f,g ) and recapitulated known effects from the literature 33 , such as a higher stability of α2-6 versus α2-3 in negative mode (see B 1 ion in Fig. 2g ). Importantly, these differences diminished, and eventually vanished, with lower-quality spectra (Supplementary Fig. 16 ). We then analyzed the predictiveness of these diagnostic features when reducing spectrum quality. Intriguingly, some diagnostic features, even if they were not the strongest initial signal, remained predictive even for medium- to low-quality spectra (Supplementary Fig. 17 ), making them promising candidates for aiding annotation.
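
As an illustration of the kind of highly powered comparison described here (not the authors' code), Welch's t-test plus an effect size for one candidate diagnostic fragment bin might look like:

```python
import numpy as np
from scipy import stats

def compare_fragment(spectra_a, spectra_b, bin_idx):
    """Welch's t-test plus Cohen's d for one fragment bin across two isomers' spectra.

    spectra_a, spectra_b: (n_spectra, n_bins) arrays of normalized fragment intensities
    bin_idx: index of the m/z bin holding the candidate diagnostic fragment
    """
    a, b = spectra_a[:, bin_idx], spectra_b[:, bin_idx]
    t, p = stats.ttest_ind(a, b, equal_var=False)    # Welch's t-test
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (a.mean() - b.mean()) / pooled_sd            # effect size (Cohen's d)
    return t, p, d
```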

Similarities between Neu5Ac and Neu5Gc versions of the same isomers (Fig. 2d,e ) suggested molecular determinants of fragmentation propensities. We thus first analyzed all high-quality O -glycan spectra juxtaposing composition-matched glycans containing GalNAcα1-3 or GlcNAcβ1-3, confirming systematic fragmentation propensities on a global scale (Supplementary Fig. 18 ).

Molecular dynamics supports diagnostic fragmentation

In the abovementioned scenario (Fig. 2d,e ), our conclusion was that GlcNAcβ1-3(Siaα2-6)GalNAc fragmented along the HexNAc-HexNAc axis, while GalNAcα1-3(Siaα2-6)GalNAc fragmented along the Sia-HexNAc linkage. To elucidate how structural properties of these molecules could give rise to these differences in fragmentation behavior, we engaged in molecular dynamics simulations of both isomers.

The fragmentation pattern of the GlcNAcβ1-3(Siaα2-6)GalNAc glycan displayed evidence of a charge-induced fragmentation mechanism (Fig. 2d,e ). In agreement with this, we saw evidence of the carboxylic acid moiety of the terminal sialic acid interacting with the hydrogen of the C6 hydroxyl group of the terminal HexNAc sugar (Fig. 3 ). The interaction sampled 11.9% of our cumulative 2-μs simulations of GlcNAcβ1-3(Neu5Acα2-6)GalNAc. As these simulations were conducted in aqueous solution, rather than the vacuum in which fragmentation occurs, the frequency of this interaction will be far greater during in vacuo fragmentation, owing to the absence of water molecules competing for hydrogen bonding. This suggests that the charge-induced fragmentation mechanism of GlcNAcβ1-3(Neu5Acα2-6)GalNAc proceeds via removal of a proton from the terminal HexNAc sugar, resulting in fragmentation along the HexNAc-HexNAc axis.

Figure 3:

a , b , Kernel density estimate distribution of the distance between the center of geometry of the carboxyl group of the sialic acid and the hydrogen of the hydroxyl group of C6 of the terminal HexNAc residues for the closed ( a ) and open ( b ) reducing GalNAc residue for both GlcNAcβ1-3(Neu5Acα2-6)GalNAc (blue) and GalNAcα1-3(Neu5Acα2-6)GalNAc (yellow green). The plots show how in GlcNAcβ1-3(Neu5Acα2-6)GalNAc, the carboxyl group is able to interact with the hydroxyl of the C6 of the HexNAc. However, this interaction is not observed in GalNAcα1-3(Neu5Acα2-6)GalNAc. c , A representative snapshot of the structure of GlcNAcβ1-3(Neu5Acα2-6)GalNAc is shown, with the interaction between the two moieties displayed by a dashed line (orange). KDE, kernel density estimation.

Conversely, simulations of GalNAcα1-3(Neu5Acα2-6)GalNAc were not able to sample this interaction (occurrence < 0.1%). As a result, fragmentation of this glycan occurs along the Neu5Ac-HexNAc linkage instead.

Furthermore, during the ionization of both of the glycans, reductive β-elimination would result in the reducing end GalNAc being reduced to an alditol. As this linearized structure may result in increased flexibility, we also conducted molecular dynamics simulations of both glycans with a linearized reducing GalNAc. These simulations yielded a similar insight to those described previously. In the reduced GlcNAcβ1-3(Neu5Acα2-6)GalNAc glycan, the carboxyl group of the terminal sialic acid interacted with the hydrogen of the C6 hydroxyl group of the terminal HexNAc sugar during 6.8% of the simulated time. Again, the reduced GalNAcα1-3(Neu5Acα2-6)GalNAc was not able to sample this interaction (occurrence < 0.1%).

We therefore concluded that the identified fragmentation behavior can be used to distinguish between these two isomers, an endeavor that is otherwise challenging without specific enzymatic digestion. This implied that we could use our CandyCrunch and CandyCrumbs-powered approach to distinguish very close structural isomers based on diagnostic fragmentation behaviors, beyond single diagnostic ions or ratios and more akin to how human experts would distinguish them.

New biological insights via CandyCrunch and CandyCrumbs

Striving towards AI-assisted glycomics, we propose our platform as a means to enhance human analysts by (1) saving time, (2) making annotations more robust and (3) analyzing samples more comprehensively. We illustrate the latter point with de novo predictions of murine intestinal glycans that were too low in abundance to be included in the original annotation but revealed, for example, the presence of Neu5Gc-containing glycans and low levels of sialyl-Tn antigen in these samples (Supplementary Fig. 19 ). Importantly, we do not claim that human analysts could not have annotated these structures in principle, but rather that very real time and resource constraints make this frequently infeasible in practice. This limitation is lifted by CandyCrunch.

To demonstrate that we could apply our developed methods to truly novel samples, we analyzed the serum N -glycome of southern bluefin tuna ( Thunnus maccoyii ), which was measured within GPST000182 (ref. 34 ) but never reported in an annotated manner. This resulted in over 50 glycans, including high-mannose, hybrid and complex structures, with features such as bisecting GlcNAc, core and antenna fucosylation, Neu5Gc and multi-antennae N -glycans (Fig. 4a and Supplementary Fig. 20 ). In our comprehensive database within glycowork, not a single glycan from T. maccoyii has been reported so far, demonstrating that these pipelines can facilitate new discoveries.

Figure 4:

a , Serum N -glycome of the southern bluefin tuna ( T. maccoyii ). Shown are the precursor ion intensities, arrayed by LC retention time. Representative structures that are meant to illustrate the identified sequence diversity are shown via the SNFG. Next to each structure, we show the cosine similarity of the shown spectrum and the averaged spectrum of all negative ion mode spectra of reduced glycans of the predicted structure with a confidence above 0.5 (see Fig. 2 for background). b , O -glycan reactions are path-dependent. For every situation in which two glycosyltransferases competed for the same substrate ( n  = 1,003 biosynthetic networks), we analyzed which order of reactions was experimentally observed across our networks. Box plots used the median as the center line and the 25th (Q1) and 75th (Q3) percentiles as the lower and upper edges of the box. The whiskers extend to the first data point within Q1 − 1.5 × IQR (interquartile range) and to the last data point within Q3 + 1.5 × IQR and outlier values outside this range are depicted as diamonds. c , O -glycan networks decomposed into biosynthetic communities relating to core structures. We detected communities via the Louvain method and calculated their pairwise Jaccard distances, shown here as a hierarchically clustered heatmap. d , Community corresponding to core 5 O -glycans. Clustering of the distance matrix from c using OPTICS (Ordering Points To Identify the Clustering Structure) 46 resulted in conserved communities broadly corresponding to O -glycan cores, with the one from core 5 being shown here as a network, nodes scaled by degree. e , Clustering cancer cell line O -glycomes. Predicted O -glycomes of AML cell lines (GPST000214) and differentiated colorectal cancer cell lines (CaCo-2, GPST000256), via a CandyCrunch model not trained on these datasets, are shown via t -SNE ( n  = 103), using glycan abundance as features. f , Differential glycan expression between AML and colorectal cancer cell lines. Given the predicted glycomes of e , we used the get_volcano function from glycowork to test differential expression at the motif level (two-tailed Welch’s t -test), shown as a volcano plot. Differentially expressed glycans are drawn inversely scaled by corrected P value (Holm–Šídák correction for multiple testing). FC, fold change.

We also wanted to highlight how predictions could be used downstream to derive new insights from aggregating glycomics studies. This can even be done in the context of already performed glycomics experiments, distilling results from the accumulated data of many years of study. For this, we re-used the total 250,000 O -glycan spectra mentioned in the context of Fig. 2 to construct biosynthetic networks 7 . As described above, this process filled in the gaps of unobserved intermediates in the biosynthesis of observed structures. A key benefit here is that all datasets have been analyzed by the same annotator (CandyCrunch), eliminating an important source of heterogeneity 35 . Applied to our dataset, this resulted in 1,003 biosynthetic networks (corresponding to 1,003 glycomics experiments measuring O -glycans) that we used to analyze systematic effects in that glycan class. This revealed that some intermediates were never measured (Supplementary Fig. 21 and Supplementary Table 9 ), such as the reducing end GalNAc (likely due to the mass range of the mass spectrometer used), while others, such as Gal3Sβ1-3GalNAc, were nearly always reliably measured whenever larger structures that included this building block as a substructure were present in a sample. We believe that this approach might shed light on subsets of the O -glycome that are currently hard to measure, as we here, again 7 , noted the peculiar absence of GlcNAc-terminated structures from measured glycans as a trend.

Further analyses across our networks then allowed us to compare the reaction order of glycosyltransferases, reinforcing the highly dominant nature of galactosyltransferases 7 (Fig. 4b ). Decomposing the biosynthetic networks into communities unveiled several conserved clusters that were modular and occurred in many of our datasets (Fig. 4c ). Further investigation resulted in the observation that these clusters corresponded to the O -glycan core structures and their respective biosynthetic extensions (Fig. 4d ). In general, these proved to be relatively modular, except for cases such as cores 1 and 2, which showed some biosynthetic overlap. We envision that this rapid decomposition of many networks into biosynthetic subcategories will prove useful for comparing and understanding the eventual terminal motifs that will be exposed in these different O -glycan cores, as well as their biosynthesis.

As a proof of concept, to demonstrate the capabilities of CandyCrunch for high-throughput analysis, we next predicted the O -glycomes of acute myeloid leukemia (AML) cell lines (GPST000214 (ref. 36 )) and differentiated colorectal cancer cell lines (CaCo-2, GPST000256 (ref. 37 )). With a total of 103 glycomics raw files for this analysis, we could show that the predicted glycomes of AML and colorectal cancer cell lines formed distinct clusters (Fig. 4e ), which both were separate from the blanks used in GPST000256. We then engaged in a differential glycan expression analysis to investigate what distinguished these clusters. While there was considerable intra-cluster heterogeneity, this analysis revealed that the colorectal cell lines on average were more enriched in structures containing fucosylated galactose and remnant N -glycans, while the AML cell lines exhibited higher levels of sialylated glycans and Lewis structures (Fig. 4f ). This set of analyses shows that CandyCrunch can be applied to large sets of glycomics measurements and eventually be used in conjunction with other glycowork functionality to reveal dysregulated glycans and glycan motifs, directly from LC–MS/MS raw files.

We present here generalizable methods to (1) predict glycan structures from LC–MS/MS data using deep learning (CandyCrunch) and (2) automatically annotate fragment ions in higher-order tandem mass spectrometry spectra (CandyCrumbs). Proven performance on blinded data with ground truth labels (Fig. 1h , Supplementary Figs. 11 , 12 , 14 and 19 and Supplementary Table 8 ) cements the usefulness of CandyCrunch. Both CandyCrunch and CandyCrumbs are suited for high-throughput usage and can scale to large datasets as well as extremely diverse glycans and experimental setups. With the high performance that we demonstrate here, we are confident that these pipelines will be useful both for experts, accelerating and augmenting their workflows, as well as for less experienced users, similar to how automated workflows in other systems biology disciplines have democratized access to state-of-the-art methods 38 , 39 . We have demonstrable experience with maintaining software over longer periods (via our glycowork platform) and, since we ourselves are active users of CandyCrunch for our core research, have a natural incentive to further develop this technology.

Our approach is ultimately limited by the representativeness of available data. While CandyCrunch is applicable to all major glycan classes and most experimental setups (for now limited to electrospray ionization-type setups), we do note that the very best results can be expected for reduced glycans in negative mode, particularly O -glycans or free oligosaccharides. This is both a result of high-quality data in those cases and particular efforts in fine-tuning our pipeline for optimal results, as they intersected most with our own research interests and capabilities. In general, compelling results can be expected for samples similar to our training data, strongly enriched in mammalian and fish samples (Supplementary Fig. 22 ), and we expect to perform worse, on average, on remote samples such as from invertebrates. We envision that, with increasing data, this will improve. We thus urge the community to make their glycomics data (as well as high-quality annotations) available through platforms such as GlycoPOST 40 , as this will improve approaches such as CandyCrunch, and ultimately advance glycobiology and its applications.

We recognize that, as with any model, CandyCrunch predictions are imperfect, exhibiting false negative and false positive predictions, which occasionally might not resemble errors made by humans. Particularly, non-CandyCrunch glycan additions within our pipeline, via biosynthetic networks and database queries, exhibit a more tentative character and should be further evaluated by experts. For ideal results, we always recommend predictions to be further refined by experts. We are, however, convinced that CandyCrunch predictions can raise result quality and comprehensiveness for both experts and novices, in addition to the considerable increase in throughput. Lastly, during data curation, we assumed expert annotations within our training data to be correct, which may retain analyst bias, such as preferential annotation of type II versus type I LacNAc structures in N -glycans without conclusive evidence. We do note, however, that the annotations that we trained on were, in part, informed by other sources of information, such as third-generation product ion spectra or exoglycosidase digestions. Once sufficient data become available, future work may also extend this approach to higher-order tandem mass spectrometry spectra and/or exoglycosidase treatments, with more detailed structural information.

Beyond the fact that the zero-shot capabilities of CandyCrunch are limited, we would also like to note that, while we support common derivatizations such as permethylation, we do not currently support every type of glycan modification within CandyCrunch and CandyCrumbs. Specialized methods, such as azidosugars 41 , are at the moment beyond our scope. Once sufficient raw data of new modifications become available, CandyCrunch can be easily retrained (the CandyCrunch package includes a training script, and models can be retrained in less than 12 h on a free Google Colab instance).

We are also enthusiastic about the potential of upcoming methods to simulate high-resolution fragmentation spectra via deep learning 42 , which could be adapted for AI-glycomics in future work and aid either training or the evaluation of prediction results. Further, once sufficient data from either high-resolution mass spectrometers or absolute normalizations of retention time (for example, via glucose units) become available, we expect CandyCrunch to reach even higher performance. While we focus on glycomics here, we envision that analogous efforts in glycoproteomics could also advance and accelerate the field. Overall, we conclude that our presented methods not only pave the way for AI-enhanced structural glycomics but also enable many other avenues ranging from systematic comparisons over data science to glycoinformatics. This is facilitated by our large, curated dataset and the ability to quantify spectrum quality, engaging in analyses at scale for many different aspects of glycomics data.

Tandem mass spectra from electrospray ionization experiments stemmed from repositories such as GlycoPOST 40 , MassIVE, UniCarb-DB 31 , UniCarb-DR and NIST, as well as from individual publications with associated public raw data. A full list of the 196 data sources can be found in Supplementary Table 10 . All raw files were converted into the open-access format .mzML using the msconvert software 47 . A custom script using the pymzML package 48 (v.2.5.2) or pyteomics 49 (v.4.6) was used to extract all spectra at the MS/MS level, together with their stored precursor ion m/z and retention time, if available. This extraction functionality is now available as the process_mzML_stack function within our CandyCrunch package (v.0.3.0), next to an analogous process_mzXML_stack function. We extracted up to 1,000 fragment peaks of the highest intensity per spectrum, if available. Then, spectra were retained that fell within ±0.5-Da m/z and ±2-min retention time of reported glycan peaks in the associated publications. All retained spectra were kept for self-supervised training, paired with the information of the respective glycan class, while only spectra that could be unambiguously linked to structures described in the respective publications were kept for supervised training. This resulted in a total number of 625,547 glycan spectra, of which 489,103 spectra were labeled with a defined structure and could be used for training, the latter stemming from 3,391 unique glycan structures (Supplementary Table 11 ). The full dataset can be found at Zenodo under https://doi.org/10.5281/zenodo.10029271 (ref. 43 ).
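
A minimal outline of this extraction step with pymzML (the packaged process_mzML_stack function is the authoritative version; this sketch assumes centroided spectra and glosses over error handling):

```python
import pymzml

def extract_msms(path, top_n=1000):
    """Yield (peaks, precursor_mz, retention_time) for each MS/MS spectrum in an .mzML file."""
    run = pymzml.run.Reader(path)
    for spec in run:
        if spec.ms_level != 2:
            continue
        # keep at most the top_n most intense fragment peaks
        peaks = sorted(spec.peaks("centroided"), key=lambda p: p[1], reverse=True)[:top_n]
        prec_mz = spec.selected_precursors[0]["mz"] if spec.selected_precursors else None
        rt = spec.scan_time_in_minutes()
        yield peaks, prec_mz, rt
```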

Data processing

We first removed all spectra with a retention time below 2 min as noise. Retention times were then normalized for each individual sample by dividing absolute retention times by the respective maximal retention time (or by 30, if the maximum extracted retention time was below 30). Missing retention times were assigned a value of zero. Fragment intensities were normalized for each spectrum by dividing the intensity of each peak by the total intensity of the spectrum. Intensities were then binned into 2,048 equal-sized m/z windows from the observed minimum (39.714) up to a maximum of 3,000. Additionally, the m/z remainder (that is, the difference between the m/z of the highest-intensity peak of a bin and the left bin edge) was calculated for each bin, as suggested by Altenburg et al. 25, allowing the model to learn exact peak locations despite binning. We explicitly emphasize that this procedure, combined with the fact that most bins contain only one peak (Supplementary Fig. 1), allows us to exceed the nominal mass resolution of 1.45 Da that our binning creates. Glycan class, MS ion mode, ion trap type, LC type and glycan modification type were coded as integers to allow for learned embeddings.
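
A minimal numpy sketch of this binning, assuming peaks are given as m/z and intensity arrays; the bin count and m/z range come from the text, while everything else is illustrative.

import numpy as np

MZ_MIN, MZ_MAX, N_BINS = 39.714, 3000.0, 2048  # bin width of ~1.45 Da

def bin_spectrum(mz, intensity):
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    intensity = intensity / intensity.sum()  # normalize to total intensity
    edges = np.linspace(MZ_MIN, MZ_MAX, N_BINS + 1)
    binned = np.zeros(N_BINS)
    remainder = np.zeros(N_BINS)
    idx = np.clip(np.digitize(mz, edges) - 1, 0, N_BINS - 1)
    for b in np.unique(idx):
        mask = idx == b
        binned[b] = intensity[mask].sum()
        # m/z remainder: offset of the bin's most intense peak from the
        # left bin edge, so exact peak positions survive the binning.
        top_mz = mz[mask][np.argmax(intensity[mask])]
        remainder[b] = top_mz - edges[b]
    return binned, remainder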

For training and evaluation, we capped each glycan structure at 1,000 randomly sampled spectra in the independent test set, to avoid the assessment being dominated by frequently observed but simple glycans. The 489,103 spectra were split 85/15 into train and test sets at the level of samples, ensuring that spectra of one sample were not found in both sets, which makes the generalizability estimate more robust. Classes in the test set that would otherwise have constituted zero-shot prediction were subsequently moved into the train set.
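
One way to reproduce such a sample-level split is scikit-learn's GroupShuffleSplit, sketched below; the feature matrix and sample identifiers are illustrative placeholders, and the paper's own splitting code may differ.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

spectra = np.zeros((100, 2048))             # placeholder spectrum features
sample_ids = np.repeat(np.arange(10), 10)   # which sample each spectrum came from

# 85/15 split grouped by sample, so spectra from one sample never occur
# in both the train and the test set.
gss = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
train_idx, test_idx = next(gss.split(spectra, groups=sample_ids))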

Model architecture

CandyCrunch is a dilated residual neural network, with additional embedded inputs, that predicts glycan structures from tandem mass spectra in a multiclass classification setup.

For the processing of binned intensities and m/z remainders, a one-dimensional convolution layer was followed by a leaky rectified linear unit (ReLU) and six residual dilated convolutions, with dilations of 1, 2, 4, 8, 16 and 32. Then, we used max-pooling with a kernel size of 20 and a fully connected layer to bring this output to a dimensionality of 1,024. Glycan class, MS ion mode, ion trap type, LC type and glycan modification type were each embedded into a dimensionality of 24. Precursor m/z and normalized retention time were also each brought to a dimensionality of 24 via a fully connected layer, a layer normalization and a leaky ReLU. Then, all inputs were concatenated and passed through two sets of fully connected layers, layer normalizations, leaky ReLUs and dropout (at a rate of 0.2). Finally, a last fully connected layer yielded the class probabilities. In total, CandyCrunch comprised 12,375,084 trainable parameters.
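
The convolutional trunk can be sketched in PyTorch as follows. The dilations, pooling kernel and 1,024-dimensional output come from the text; the channel count and kernel sizes are assumptions, so this is an illustration rather than the published configuration.

import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.act = nn.LeakyReLU()

    def forward(self, x):
        return x + self.act(self.conv(x))  # residual connection

class SpectrumTrunk(nn.Module):
    # Sketch of the spectrum branch; 64 channels and kernel size 3
    # are assumptions.
    def __init__(self, channels=64, out_dim=1024):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv1d(2, channels, 3, padding=1),
                                  nn.LeakyReLU())
        self.blocks = nn.Sequential(*[DilatedResBlock(channels, d)
                                      for d in (1, 2, 4, 8, 16, 32)])
        self.pool = nn.MaxPool1d(20)
        self.fc = nn.Linear(channels * (2048 // 20), out_dim)

    def forward(self, binned, remainder):
        # Stack binned intensities and m/z remainders as two input channels.
        x = torch.stack([binned, remainder], dim=1)
        x = self.pool(self.blocks(self.stem(x)))
        return self.fc(x.flatten(1))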

Model training

All models were trained in PyTorch 50 (v.2.1.0) using two Zotac GeForce RTX 4090 Trinity GPUs. CandyCrunch was initialized via He initialization. All models were trained for up to 200 epochs with a batch size of 256, with early stopping after 12 epochs without improvement in the test loss.

We set the learning rate at 0.0001, with a schedule to reduce the learning rate to a fifth after four epochs with no improvement in test loss. As a base optimizer we used AdamW with a weight decay of 2 × 10−5, which was further modified via adaptive sharpness-aware minimization (ASAM) 51 to ensure a generalizable final model.
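
In PyTorch terms, this corresponds roughly to the configuration below; the ASAM wrapper around the optimizer is omitted for brevity, and the model is a stand-in placeholder.

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the CandyCrunch model

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              weight_decay=2e-5)
# Reduce the learning rate to a fifth after four epochs without
# improvement in the test loss.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.2, patience=4)

# Inside the training loop, once per epoch:
# scheduler.step(test_loss)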

Data augmentation was applied only to the training set and included random (1) low-intensity peak removal, (2) peak intensity jitter and (3) new peak addition for the binned spectrum, as proposed previously for MS 52, as well as adduct formation of the precursor ion (acetate/sodium adducts) and random noise on the precursor m/z (±0.5 Da) and retention time (±10%).
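
A sketch of what the three binned-spectrum augmentations might look like; all probabilities and magnitudes below are illustrative placeholders rather than the published settings.

import numpy as np

def augment_binned(x, rng):
    x = x.copy()
    nz = x > 0
    if nz.any():
        # (1) Randomly remove low-intensity peaks.
        low = nz & (x < np.quantile(x[nz], 0.1))
        x[low & (rng.random(x.shape) < 0.5)] = 0.0
    # (2) Jitter peak intensities.
    x *= rng.normal(1.0, 0.05, x.shape)
    # (3) Occasionally add small new peaks.
    x[rng.random(x.shape) < 0.001] += 0.01 * x.max()
    return np.clip(x, 0.0, None)

binned = np.zeros(2048)
binned[[10, 500]] = [0.7, 0.3]
augmented = augment_binned(binned, np.random.default_rng(0))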

As our base loss, we used PolyLoss 53, with an additional label smoothing of 0.1 and epsilon = 1. We note that the label smoothing employed here, as well as the fact that the annotators of many of our datasets used additional information to refine their annotations (for example, third-generation product ion spectra, exoglycosidases), at least in part counteracts potential concerns about label uncertainty. We also used two additional loss terms, informed by domain knowledge, that were added to the PolyLoss term: a structure distance loss and a composition distance loss. Both involved the calculation of a distance matrix, based on pairwise cosine distances of fingerprint vectors of either the number of mono- and disaccharide motifs or the base composition of two glycans. All operations on glycans were performed using glycowork 29 (v.1.1.0). Then, the class probabilities for each input sample, transformed via a softmax activation, were multiplied by the structure distance vector and the composition distance vector (that is, the distances to the target glycan), followed by averaging to obtain the loss terms. This procedure preferentially penalizes confidently predicted but structurally dissimilar glycans and improves performance as well as the meaningfulness of errors.
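
A compact sketch of this loss composition: a Poly-1 variant of PolyLoss plus distance-weighted penalties. The precomputed distance matrices of pairwise cosine distances are assumed as inputs, and the equal weighting of the three terms is an illustrative assumption.

import torch
import torch.nn.functional as F

def poly1_loss(logits, target, epsilon=1.0, smoothing=0.1):
    # Poly-1 PolyLoss: cross-entropy plus epsilon * (1 - p_t).
    ce = F.cross_entropy(logits, target, label_smoothing=smoothing,
                         reduction="none")
    pt = F.softmax(logits, -1).gather(1, target.unsqueeze(1)).squeeze(1)
    return (ce + epsilon * (1.0 - pt)).mean()

def distance_loss(logits, target, dist_matrix):
    # Weight class probabilities by their distance to the target glycan;
    # confident but dissimilar predictions are penalized most.
    probs = F.softmax(logits, -1)
    return (probs * dist_matrix[target]).mean()

# Illustrative composition (tensors defined elsewhere):
# loss = (poly1_loss(logits, target)
#         + distance_loss(logits, target, structure_dist)
#         + distance_loss(logits, target, composition_dist))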

We first engaged in supervised training on annotated MS/MS spectra. Then, using the trained model, we predicted glycan structures for our unannotated spectra to enable self-supervised training. Spectra with a prediction score above 0.7 were merged with the original training dataset, followed by a deduplication step. Specifically, as described above, we retained the same test set and again formed a training dataset capped at 1,000 examples per glycan, followed by retraining.
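
The pseudo-labeling step amounts to something like the sketch below; the model and data loader are placeholders, and only the 0.7 confidence threshold is taken from the text.

import torch
import torch.nn.functional as F

def pseudo_label(model, unlabeled_loader, threshold=0.7):
    # Keep only unannotated spectra that the trained model predicts
    # with a confidence above the threshold.
    pseudo = []
    model.eval()
    with torch.no_grad():
        for batch in unlabeled_loader:
            probs = F.softmax(model(batch), dim=-1)
            conf, pred = probs.max(dim=-1)
            for i in torch.nonzero(conf > threshold).flatten():
                pseudo.append((batch[i], int(pred[i])))
    return pseudo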

Model inference

To predict glycan structures from unannotated raw files, all tandem spectra were extracted via pymzML as described above and processed as described for the general data processing. Then, we grouped m/z precursor ions by scanning for discontinuities larger than 0.5 Da in the extracted spectra. Within these m/z groups, we searched for structural isomers by analyzing their retention time in chunks of 0.5 min. While this may lead to overlaps between isomer peaks, this is not an inherent problem, as long as co-elution is not perfect, as different chunks will still retain the respective isomers as the dominant species, which will be reflected in the final output table. For each retention time group, we averaged all spectra for input of a robust averaged spectrum to CandyCrunch and extracted the median spectrum, to have a representative spectrum for each glycan entity in the sample. We first retrieved the top 25 predictions for each averaged spectrum, using the trained CandyCrunch model. We then employed a single-parameter variant of Platt Scaling 32 to calibrate the prediction confidence before the softmax layer, using a scaling factor of 1.15 that was estimated via the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm. Using test-time augmentation, we averaged the predictions of five independent inferences that were modified with the same data augmentation strategy as employed during training.
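
The precursor grouping can be sketched as a simple gap-based split, shown below with illustrative input values; note that the retention-time dimension uses fixed 0.5-min chunks in the text, for which this gap-based helper is only an approximation.

import numpy as np

def group_by_gap(values, gap):
    # Sort values and start a new group wherever two consecutive
    # values differ by more than `gap` (0.5 Da for precursor m/z).
    values = np.asarray(values)
    order = np.argsort(values)
    breaks = np.where(np.diff(values[order]) > gap)[0] + 1
    return np.split(order, breaks)

precursor_mzs = [366.1, 366.2, 528.2, 528.3, 690.3]  # illustrative
mz_groups = group_by_gap(precursor_mzs, 0.5)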

Next, we used domain knowledge to automatically filter out predictions with (1) a prediction probability below a threshold of 0.01, (2) the wrong glycan class, (3) the wrong mass, even when considering multiply charged ion forms, or (4) no corroborating diagnostic ions in their fragment lists. Domain-specific exceptions were made, such as allowing cross-class predictions if the prediction confidence was extraordinarily high (above 0.2; justified by the fact that O-linked glycan samples often contain remnant N-linked glycans, and so on). Finally, predictions were deduplicated by merging any mass/retention windows that resulted in identical predictions.
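
Schematically, the filter cascade looks like the sketch below. The mass check is a simplified negative-mode calculation, and has_diagnostic_ions is a hypothetical helper standing in for the package's internal logic; none of this mirrors the actual implementation.

def mass_matches(pred_mass, precursor_mz, tol=0.5, max_charge=3):
    # Simplified check of [M - zH]z- ions against the precursor m/z
    # (proton mass 1.00728), for charge states 1..max_charge.
    return any(abs((pred_mass - z * 1.00728) / z - precursor_mz) < tol
               for z in range(1, max_charge + 1))

def passes_domain_filters(pred, spectrum, sample_class):
    if pred["confidence"] < 0.01:                       # (1) probability floor
        return False
    if pred["glycan_class"] != sample_class and pred["confidence"] <= 0.2:
        return False                                    # (2) class check, with exception
    if not mass_matches(pred["mass"], spectrum["precursor_mz"]):
        return False                                    # (3) mass check
    return has_diagnostic_ions(pred, spectrum)          # (4) hypothetical helper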

Lastly, we used biosynthetic knowledge to refine our predictions, conceptualized in the canonicalize_biosynthesis function within CandyCrunch. Using the subgraph_isomorphism function from glycowork and starting from the largest predicted glycan, we searched for top-1 predictions of biosynthetic precursors in the whole prediction dataframe. For each prediction at mass M, we added 0.1 to its prediction confidence for each unique biosynthetic precursor among the top-1 predictions at masses M−1, M−2, …, M−n. If this changed the order of predictions, we re-ordered them according to their scores. Thereafter, scores were re-normalized to 1 and up to the top five predictions were retained. This procedure not only improved the accuracy of our results but also increased the meaningfulness and consistency of both correct and wrong predictions (that is, wrong predictions were structurally closer to the ground truth after this procedure).
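
In rough pseudocode terms, the rescoring works as follows; the precursor counts are assumed to have been computed elsewhere (via subgraph_isomorphism in glycowork), and the data layout is illustrative.

def biosynthetic_rescoring(candidates, precursor_counts, bonus=0.1, top_k=5):
    # candidates: list of {"glycan": str, "score": float}; precursor_counts
    # maps each glycan to its number of unique biosynthetic precursors
    # among top-1 predictions at lower masses.
    for cand in candidates:
        cand["score"] += bonus * precursor_counts.get(cand["glycan"], 0)
    total = sum(c["score"] for c in candidates)
    for cand in candidates:
        cand["score"] /= total  # re-normalize scores to 1
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:top_k]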

Spectra without valid predictions but with valid compositions, cross-referenced with relevant databases within glycowork, were also retained and subjected to as many of the abovementioned domain filters as possible. Whenever available, top-1 predictions were paired with their GlyTouCan ID 54. The whole inference workflow, including the elements described below, is available via the wrap_inference function in the CandyCrunch package. Available options for running the function are shown in Supplementary Table 12 and described in the documentation of the CandyCrunch package (https://github.com/BojarLab/CandyCrunch).

For the case of multiple samples from the same experiment, we also added the wrap_inference_batch function to the CandyCrunch package. This expanded workflow aligns retention times across samples, if possible and suitable, to build a prediction library and ensure that shifts in retention time between samples are accommodated.

Zero-shot prediction

For a given sample, all retained top-1 predictions were used to construct a biosynthetic network as described previously 7, using the implementation within glycowork. For milk oligosaccharides, this also included evolutionary pruning, as pre-calculated species networks were available. Then, we calculated whether any of the inferred biosynthetic precursors would explain the mass and composition of glycan spectra without a valid prediction. Matches within a mass difference of 0.5 Da, including multiply charged ions, were retained as additional predictions beyond our model-defined library of predictable glycans. While direct model predictions were awarded the evidence category ‘strong’, the biosynthetic network intermediaries merited the category ‘medium’.

Next, we checked for missed Neu5Gc-substituted Neu5Ac-glycans and vice versa (that is, a mass difference of 16 Da per substitution, with the corresponding diagnostic ions). Similarly, in the case of an O-glycan sample, we checked for missed GlcNAc6S-substituted GlcNAc-glycans and vice versa (connected to the reducing-end GalNAc). Additionally, we used a suitable subset of the glycowork-stored database, of the right taxonomic section and glycan class, to search for possible matches to compositions without predictions. Predictions from both of these routines were annotated with the evidence label ‘weak’.

After these additional routines to enable predictions outside of our defined list of glycans, we again employed the domain knowledge-informed filters mentioned above. This ensured that glycans introduced via these methods still had empirical support in the underlying data. Predictions from these routines were also subjected to the canonicalize_biosynthesis workflow from above (although ‘bonus’ points were awarded only for biosynthetic precursors from the ‘strong’ category), to allow for prioritization of the most probable structures.

Fragment annotation via CandyCrumbs

The final prediction of the CandyCrunch model was used as a starting point for fragment annotation and converted into a directed graph using NetworkX (v.3.0), with each monosaccharide forming a node and each linkage labeling an edge. A randomized enumeration method 55 was implemented to find all induced connected subgraphs. After filtering for modifications that were physically possible based on linkage numbers, each terminal monosaccharide on the subgraphs was permuted to create cross-ring or glycosidic bond fragmentations. Each possible global modification was also added to each fragment. The mass of each theoretical fragment was calculated and then matched with observed masses in MS/MS spectra. Finally, the fragments were converted into Domon–Costello 30 and IUPAC-condensed nomenclature. If multiple fragment possibilities could explain a given m/z value, a prioritization scheme was applied (Supplementary Fig. 15), which emphasized the prior likelihood of each fragment option and the evidence for the remaining fragments in a given tandem spectrum. We note that fragment prioritization is an optional step in this workflow and can be disabled if all possible fragments are desired. CandyCrumbs is available via CandyCrunch.analysis.CandyCrumbs in our developed Python package.
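
The subgraph enumeration at the heart of this step can be illustrated with NetworkX. The exhaustive version below is only practical for small glycan graphs, whereas the paper uses a randomized enumeration method 55; the example glycan and edge attributes are illustrative.

import networkx as nx
from itertools import combinations

def induced_connected_subgraphs(glycan_graph):
    # Enumerate all induced connected subgraphs of a directed glycan
    # graph (nodes: monosaccharides, edges: linkages).
    nodes = list(glycan_graph.nodes)
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            sub = glycan_graph.subgraph(subset)
            if nx.is_weakly_connected(sub):
                yield sub

g = nx.DiGraph()
g.add_edge("GalNAc", "GlcNAc", linkage="b1-3")
g.add_edge("GalNAc", "Neu5Ac", linkage="a2-6")
fragments = list(induced_connected_subgraphs(g))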

Molecular dynamics simulation

Initial conformations for the GlcNAcβ1-3(Neu5Acα2-6)GalNAc and GalNAcα1-3(Neu5Acα2-6)GalNAc glycans were obtained using the Carbohydrate Builder tool of the GLYCAM-Web server 56. Four structures were produced for each glycan, with different combinations of the α2-6 torsion angles. This approach provided different initial starting points for the simulations and thus maximized the sampling of conformational space. Each glycan was parameterized with the GLYCAM06-j1 forcefield 57, and a cuboid solvent box of TIP3P water molecules was created to produce a minimum solute distance of 15 Å. In the case of the reduced glycan structures, the open GalNAc was parameterized using the GAFF2 forcefield 58. A single Na+ ion was included in each system to neutralize its net charge. These systems were then converted into GROMACS topology files using Acpype 59. For each initial starting conformation of each system, a 500-ns simulation was performed using GROMACS 2022.4 (ref. 60), resulting in 2 μs of simulation time for each respective system.

Biosynthetic network analysis

For all networks constructed and analyzed in this work, we used the functionality within the glycowork.network.biosynthesis module (v.1.1.0). Our analyses closely followed those described by Thomès et al. 7 Briefly, the analysis of glycosyltransferase competition was performed by analyzing diamond-like network motifs via the trace_diamonds and find_diamonds functions within glycowork. We thereby quantified the proportion of networks presenting a given case of glycosyltransferase competition and counted how often each alternative order of reactions was experimentally observed among these. This allowed us to analyze which reaction order dominated across (1) glycan contexts and (2) networks. The differences shown in Fig. 4 were further filtered to require (1) at least two glycan sequence contexts, (2) a mean difference of at least 30 and (3) a corrected P value below 0.01.

Biosynthetic communities were extracted by applying the get_communities function from glycowork to reaction path preference-pruned biosynthetic networks 7. Conserved communities were detected by first calculating a distance matrix based on pairwise Jaccard distances, followed by clustering these distances using the OPTICS algorithm as implemented in scikit-learn (v.1.2.2), with a minimum of 50 samples per cluster.
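
A condensed sketch of this clustering step, assuming a binary community-membership matrix as input; the matrix below is a random placeholder.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
membership = rng.random((200, 50)) > 0.5  # illustrative binary membership matrix

# Pairwise Jaccard distances, clustered with OPTICS on the precomputed
# matrix (minimum of 50 samples per cluster); labels of -1 mark noise.
D = squareform(pdist(membership, metric="jaccard"))
labels = OPTICS(min_samples=50, metric="precomputed").fit_predict(D)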

Statistical analyses

Comparisons between two groups were performed via one-tailed or two-tailed Welch’s t-tests. In all cases, significance was defined as P < 0.05. All multiple testing was corrected with a Holm–Šídák correction. All statistical testing was performed in Python 3.9 using the statsmodels package (v.0.13.5) and the scipy package (v.1.10.1). Effect sizes were calculated as Cohen’s d using glycowork (v.1.1.0). Correlations between distance matrices were assessed via two-sided Mantel tests as implemented in scikit-bio (v.0.5.8).
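
For reference, the core of this testing workflow in scipy/statsmodels looks like the following; the group data are random placeholders.

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
comparisons = [(rng.normal(size=30), rng.normal(size=30)) for _ in range(5)]

# Two-sided Welch's t-tests (unequal variances) across all comparisons,
# followed by Holm-Sidak correction at alpha = 0.05.
pvals = [ttest_ind(a, b, equal_var=False).pvalue for a, b in comparisons]
reject, p_corrected, _, _ = multipletests(pvals, alpha=0.05,
                                          method="holm-sidak")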

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All relevant data, including their data provenance with accession IDs from GlycoPOST 40, MassIVE, UniCarb-DB 31, UniCarb-DR or NIST, can be found at Zenodo via https://doi.org/10.5281/zenodo.10029271 (ref. 43) or are contained within Supplementary Tables 10 and 11. The 196 data sources are listed in Supplementary Table 10.

Code availability

All relevant code is integrated into glycowork (v.1.1.0) and/or can be found at https://github.com/BojarLab/CandyCrunch . CandyCrunch and CandyCrumbs can also be readily accessed at https://colab.research.google.com/github/BojarLab/CandyCrunch/blob/main/CandyCrunch.ipynb .

References

1. Guo, Y., Jia, W., Yang, J. & Zhan, X. Cancer glycomics offers potential biomarkers and therapeutic targets in the framework of 3P medicine. Front. Endocrinol. 13, 970489 (2022).

2. Cvetko, A. et al. Plasma N-glycome shows continuous deterioration as the diagnosis of insulin resistance approaches. BMJ Open Diabetes Res. Care 9, e002263 (2021).

3. Varki, A. Biological roles of glycans. Glycobiology 27, 3–49 (2017).

4. Zhao, C. & Pu, J. Influence of host sialic acid receptors structure on the host specificity of influenza viruses. Viruses 14, 2141 (2022).

5. Rudd, P. M. et al. in Essentials of Glycobiology (eds Varki, A. et al.) Ch. 51 (Cold Spring Harbor Laboratory Press, 2022).

6. Bao, B. et al. Correcting for sparsity and interdependence in glycomics by accounting for glycan biosynthesis. Nat. Commun. 12, 4988 (2021).

7. Thomès, L., Karlsson, V., Lundstrøm, J. & Bojar, D. Mammalian milk glycomes: connecting the dots between evolutionary conservation and biosynthetic pathways. Cell Rep. 42, 112710 (2023).

8. Everest-Dass, A. V., Abrahams, J. L., Kolarich, D., Packer, N. H. & Campbell, M. P. Structural feature ions for distinguishing N- and O-linked glycan isomers by LC-ESI-IT MS/MS. J. Am. Soc. Mass Spectrom. 24, 895–906 (2013).

9. Veillon, L. et al. Characterization of isomeric glycan structures by LC-MS/MS: liquid phase separations. Electrophoresis 38, 2100–2114 (2017).

10. Trbojević-Akmačić, I. et al. High-throughput glycomic methods. Chem. Rev. 122, 15865–15913 (2022).

11. Harvey, D. J. Analysis of carbohydrates and glycoconjugates by matrix-assisted laser desorption/ionization mass spectrometry: an update for 2019–2020. Mass Spectrom. Rev. https://doi.org/10.1002/mas.21806 (2023).

12. Chai, W., Piskarev, V. & Lawson, A. M. Negative-ion electrospray mass spectrometry of neutral underivatized oligosaccharides. Anal. Chem. 73, 651–657 (2001).

13. Yu, J. et al. Distinctive MS/MS fragmentation pathways of glycopeptide-generated oxonium ions provide evidence of the glycan structure. Chemistry 22, 1114–1124 (2016).

14. De Leoz, M. L. A., Simón-Manso, Y., Woods, R. J. & Stein, S. E. Cross-ring fragmentation patterns in the tandem mass spectra of underivatized sialylated oligosaccharides and their special suitability for spectrum library searching. J. Am. Soc. Mass Spectrom. 30, 426–438 (2019).

15. Li, W., Hou, C., Li, Y., Wu, C. & Ma, J. HexNAcQuest: a tool to distinguish O-GlcNAc and O-GalNAc. J. Am. Soc. Mass Spectrom. 33, 2008–2012 (2022).

16. Toghi Eshghi, S. et al. Classification of tandem mass spectra for identification of N- and O-linked glycopeptides. Sci. Rep. 6, 37189 (2016).

17. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature https://doi.org/10.1038/s41586-021-03819-2 (2021).

18. Horlacher, O. et al. Glycoforest 1.0. Anal. Chem. 89, 10932–10940 (2017).

19. Chen, Z. et al. GlycoDeNovo2: an improved MS/MS-based de novo glycan topology reconstruction algorithm. J. Am. Soc. Mass Spectrom. 33, 436–445 (2022).

20. Kumozaki, S., Sato, K. & Sakakibara, Y. A machine learning based approach to de novo sequencing of glycans from tandem mass spectrometry spectrum. IEEE/ACM Trans. Comput. Biol. Bioinform. 12, 1267–1274 (2015).

21. Ceroni, A. et al. GlycoWorkbench: a tool for the computer-assisted annotation of mass spectra of glycans. J. Proteome Res. 7, 1650–1659 (2008).

22. Klein, J. & Zaia, J. glypy: an open source glycoinformatics library. J. Proteome Res. 18, 3532–3537 (2019).

23. Vakhrushev, S. Y., Dadimov, D. & Peter-Katalinić, J. Software platform for high-throughput glycomics. Anal. Chem. 81, 3252–3260 (2009).

24. Yilmaz, M. et al. Sequence-to-sequence translation from mass spectra to peptides with a transformer model. Preprint at bioRxiv https://doi.org/10.1101/2023.01.03.522621 (2023).

25. Altenburg, T., Giese, S. H., Wang, S., Muth, T. & Renard, B. Y. Ad hoc learning of peptide fragmentation from mass spectra enables an interpretable detection of phosphorylated and cross-linked peptides. Nat. Mach. Intell. 4, 378–388 (2022).

26. Bojar, D., Powers, R. K., Camacho, D. M. & Collins, J. J. Deep-learning resources for studying glycan-mediated host-microbe interactions. Cell Host Microbe 29, 132–144.e3 (2021).

27. Jin, C. et al. Structural diversity of human gastric mucin glycans. Mol. Cell. Proteom. 16, 743–758 (2017).

28. Jin, C., Lundstrom, J., Korhonen, E., Luis, A. S. & Bojar, D. Breast milk oligosaccharides contain immunomodulatory glucuronic acid and LacdiNAc. Mol. Cell. Proteom. 22, 100635 (2023).

29. Thomès, L., Burkholz, R. & Bojar, D. Glycowork: a Python package for glycan data science and machine learning. Glycobiology https://doi.org/10.1093/glycob/cwab067 (2021).

30. Domon, B. & Costello, C. E. A systematic nomenclature for carbohydrate fragmentations in FAB-MS/MS spectra of glycoconjugates. Glycoconj. J. 5, 397–409 (1988).

31. Hayes, C. A. et al. UniCarb-DB: a database resource for glycomic discovery. Bioinformatics 27, 1343–1344 (2011).

32. Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. Preprint at http://arxiv.org/abs/1706.04599 (2017).

33. Seymour, J. L., Costello, C. E. & Zaia, J. The influence of sialylation on glycan negative ion dissociation and energetics. J. Am. Soc. Mass Spectrom. 17, 844–854 (2006).

34. Coff, L. et al. Profiling the glycome of Cardicola forsteri, a blood fluke parasitic to bluefin tuna. Int. J. Parasitol. 52, 1–12 (2022).

35. Kawahara, R. et al. Community evaluation of glycoproteomics informatics solutions reveals high-performance search strategies for serum glycopeptide analysis. Nat. Methods 18, 1304–1316 (2021).

36. Blöchl, C. et al. Integrated N- and O-glycomics of acute myeloid leukemia (AML) cell lines. Cells 10, 3058 (2021).

37. Madunić, K. et al. O-glycomic and proteomic signatures of spontaneous and butyrate-stimulated colorectal cancer cell line differentiation. Mol. Cell. Proteom. 22, 100501 (2023).

38. Russo, F. & Angelini, C. RNASeqGUI: a GUI for analysing RNA-seq data. Bioinformatics 30, 2514–2516 (2014).

39. Malm, E. K., Srivastava, V., Sundqvist, G. & Bulone, V. APP: an Automated Proteomics Pipeline for the analysis of mass spectrometry data based on multiple open access tools. BMC Bioinformatics 15, 441 (2014).

40. Watanabe, Y., Aoki-Kinoshita, K. F., Ishihama, Y. & Okuda, S. GlycoPOST realizes FAIR principles for glycomics mass spectrometry data. Nucleic Acids Res. 49, D1523–D1528 (2021).

41. Laughlin, S. T. & Bertozzi, C. R. Metabolic labeling of glycans with azido sugars and subsequent glycan-profiling and visualization via Staudinger ligation. Nat. Protoc. 2, 2930–2944 (2007).

42. Murphy, M. et al. Efficiently predicting high resolution mass spectra with graph neural networks. Preprint at https://arxiv.org/abs/2301.11419 (2023).

43. Urban, J. et al. Predicting glycan structure from tandem mass spectrometry via deep learning. Zenodo https://doi.org/10.5281/zenodo.10029271 (2024).

44. Kouka, T. et al. Computational modeling of O-linked glycan biosynthesis in CHO cells. Molecules 27, 1766 (2022).

45. Lundstrøm, J., Urban, J., Thomès, L. & Bojar, D. GlycoDraw: a python implementation for generating high-quality glycan figures. Glycobiology 33, 927–934 (2023).

46. Ankerst, M., Breunig, M. M., Kriegel, H.-P. & Sander, J. OPTICS: ordering points to identify the clustering structure. SIGMOD Rec. 28, 49–60 (1999).

47. Adusumilli, R. & Mallick, P. in Proteomics Vol. 1550 (eds Comai, L. et al.) 339–368 (Springer, 2017).

48. Kösters, M. et al. pymzML v2.0: introducing a highly compressed and seekable gzip format. Bioinformatics 34, 2513–2514 (2018).

49. Levitsky, L. I., Klein, J. A., Ivanov, M. V. & Gorshkov, M. V. Pyteomics 4.0: five years of development of a python proteomics framework. J. Proteome Res. 18, 709–714 (2019).

50. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Preprint at https://arxiv.org/abs/1912.01703 (2019).

51. Kwon, J., Kim, J., Park, H. & Choi, I. K. ASAM: adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. Preprint at https://arxiv.org/abs/2102.11600 (2021).

52. Huber, F., van der Burg, S., van der Hooft, J. J. J. & Ridder, L. MS2DeepScore: a novel deep learning similarity measure to compare tandem mass spectra. J. Cheminform. 13, 84 (2021).

53. Leng, Z. et al. PolyLoss: a polynomial expansion perspective of classification loss functions. Preprint at https://arxiv.org/abs/2204.12511 (2022).

54. Tiemeyer, M. et al. GlyTouCan: an accessible glycan structure repository. Glycobiology 27, 915–919 (2017).

55. Wernicke, S. in Algorithms in Bioinformatics Vol. 3692 (eds Casadio, R. & Myers, G.) 165–177 (Springer, 2005).

56. GLYCAM-Web (Complex Carbohydrate Research Center, University of Georgia, 2005).

57. Kirschner, K. N. et al. GLYCAM06: a generalizable biomolecular force field. Carbohydrates. J. Comput. Chem. 29, 622–655 (2008).

58. Wang, J., Wolf, R. M., Caldwell, J. W., Kollman, P. A. & Case, D. A. Development and testing of a general amber force field. J. Comput. Chem. 25, 1157–1174 (2004).

59. Sousa Da Silva, A. W. & Vranken, W. F. ACPYPE—AnteChamber PYthon Parser interfacE. BMC Res. Notes 5, 367 (2012).

60. Abraham, M. J. et al. GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1–2, 19–25 (2015).

Acknowledgements

This work was funded by a Branco Weiss Fellowship–Society in Science awarded to D.B., by the Knut and Alice Wallenberg Foundation and by the University of Gothenburg, Sweden. The Science Foundation of Ireland (SFI) Frontiers for the Future Programme is gratefully acknowledged for financial support of C.M.I. (grant no. 20/FFP-P/8809). We also thank C. Fogarty for his assistance in parameterizing the open GalNAc residue for molecular dynamics simulations. We thank SciLifeLab and BioMS (Swedish research council) for providing financial support to the Proteomics Core Facility, Sahlgrenska Academy. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Open access funding provided by University of Gothenburg.

Author information

Authors and affiliations

Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg, Sweden

James Urban & Daniel Bojar

Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, Gothenburg, Sweden

Proteomics Core Facility at Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden

Chunsheng Jin & Kristina A. Thomsson

Section of Pharmacy, Department of Life Sciences and Health, Faculty of Health Sciences, Oslo Metropolitan University, Oslo, Norway

Niclas G. Karlsson

Department of Chemistry and Hamilton Institute, Maynooth University, Maynooth, Ireland

Callum M. Ives

School of Biological Sciences, University of Southampton, Southampton, UK

Elisa Fadda

Contributions

D.B. and J.U. conceived the method. D.B. curated the dataset. D.B., E.F., C.M.I. and J.U. performed computational analyses. D.B., C.M.I. and J.U. prepared the figures. C.J., N.G.K. and K.A.T. confirmed spectra annotations and provided domain expertise for method development and application. D.B. and E.F. supervised. All authors wrote and edited the paper.

Corresponding author

Correspondence to Daniel Bojar .

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Kiyoko Aoki-Kinoshita and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Figs. 1–22.

Reporting Summary

Peer review file

Supplementary tables

Supplementary Tables 1–12.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

Urban, J., Jin, C., Thomsson, K.A. et al. Predicting glycan structure from tandem mass spectrometry via deep learning. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02314-6

Received: 13 June 2023

Accepted: 17 May 2024

Published: 01 July 2024

DOI: https://doi.org/10.1038/s41592-024-02314-6
