Contains optional labs and solutions to the programming assignments for the Machine Learning Specialization by Stanford University on Coursera (2023), taught by Prof. Andrew Ng.


  • Jupyter Notebook 94.9%
  • Python 5.1%

Bloomberg ML EDU presents:

Foundations of Machine Learning

Understand the Concepts, Techniques and Mathematical Frameworks Used by Experts in Machine Learning

About This Course

Bloomberg presents "Foundations of Machine Learning," a training course that was initially delivered internally to the company's software engineers as part of its "Machine Learning EDU" initiative. This course covers a wide variety of topics in machine learning and statistical modeling. The primary goal of the class is to help participants gain a deep understanding of the concepts, techniques and mathematical frameworks used by experts in machine learning. It is designed to make valuable machine learning skills more accessible to individuals with a strong math background, including software developers, experimental scientists, engineers and financial professionals.

The 30 lectures in the course are embedded below, but may also be viewed in this YouTube playlist. The course includes a complete set of homework assignments, each containing a theoretical element and an implementation challenge with support code in Python, which is rapidly becoming the prevailing programming language for data science and machine learning in both academia and industry. This course also serves as a foundation on which more specialized courses and further independent study can build.

Please fill out this short online form to register for access to our course's Piazza discussion board. Applications are processed manually, so please be patient. You should receive an email directly from Piazza when you are registered. Common questions from this and previous editions of the course are posted in our FAQ.

The first lecture, Black Box Machine Learning, gives a quick-start introduction to practical machine learning and only requires familiarity with basic programming concepts.

Highlights and Distinctive Features of the Course Lectures, Notes, and Assignments

  • Geometric explanation for what happens with ridge, lasso, and elastic net regression in the case of correlated random variables.
  • Investigation of when the penalty (Tikhonov) and constraint (Ivanov) forms of regularization are equivalent.
  • Concise summary of what we really learn about SVMs from Lagrangian duality.
  • Proof of representer theorem with simple linear algebra, emphasizing it as a way to reparametrize certain objective functions.
  • Guided derivation of the math behind the classic diamond/circle/ellipsoids picture that "explains" why L1 regularization gives sparsity (Homework 2, Problem 5)
  • From-scratch (in numpy) implementations of almost all major ML algorithms we discuss: ridge regression with SGD and GD (Homework 1, Problems 2.5, 2.6, page 4), lasso regression with the shooting algorithm (Homework 2, Problem 3, page 4), kernel ridge regression (Homework 4, Problem 3, page 2), kernelized SVM with Kernelized Pegasos (Homework 4, 6.4, page 9), L2-regularized logistic regression (Homework 5, Problem 3.3, page 4), Bayesian linear regression (Homework 5, Problem 5, page 6), multiclass SVM (Homework 6, Problem 4.2, p. 3), classification and regression trees (without pruning) (Homework 6, Problem 6), gradient boosting with trees for classification and regression (Homework 6, Problem 8), and a multilayer perceptron for regression (Homework 7, Problem 4, page 3)
  • Repeated use of a simple 1-dimensional regression dataset, so it's easy to visualize the effect of various hypothesis spaces and regularizations that we investigate throughout the course.
  • Investigation of how to derive a conditional probability estimate from a predicted score for various loss functions, and why it's not so straightforward for the hinge loss (i.e. the SVM) (Homework 5, Problem 2, page 1)
  • Discussion of numerical overflow issues and the log-sum-exp trick (Homework 5, Problem 3.2)
  • Self-contained introduction to the expectation maximization (EM) algorithm for latent variable models.
  • Develop a general computation graph framework from scratch, using numpy, and implement your neural networks in it.
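As a taste of the numerical issues mentioned in the highlights above, the log-sum-exp trick can be sketched in a few lines of numpy. This snippet is my own illustration, not the homework's support code:

```python
import numpy as np

def log_sum_exp(scores):
    """Compute log(sum(exp(scores))) without overflow.

    Exponentiating large scores directly overflows to inf; subtracting
    the max first keeps every exponent <= 0, and adding the max back
    afterwards gives the mathematically identical result.
    """
    m = np.max(scores)
    return m + np.log(np.sum(np.exp(scores - m)))

scores = np.array([1000.0, 1001.0, 1002.0])
# the naive np.log(np.sum(np.exp(scores))) overflows to inf here,
# while the stabilized version returns a finite value near 1002.41
print(log_sum_exp(scores))
```

The same identity is what keeps the softmax and logistic-regression losses stable when raw scores get large.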


The quickest way to see if the mathematics level of the course is for you is to take a look at this mathematics assessment, which is a preview of some of the math concepts that show up in the first part of the course.

  • Solid mathematical background, equivalent to a 1-semester undergraduate course in each of the following: linear algebra, multivariate differential calculus, probability theory, and statistics. The content of NYU's DS-GA-1002: Statistical and Mathematical Methods would be more than sufficient, for example.
  • Python programming required for most homework assignments.
  • Recommended: At least one advanced, proof-based mathematics course
  • Recommended: Computer science background up to a "data structures and algorithms" course
  • (HTF) refers to Hastie, Tibshirani, and Friedman's book The Elements of Statistical Learning
  • (SSBD) refers to Shalev-Shwartz and Ben-David's book Understanding Machine Learning: From Theory to Algorithms
  • (JWHT) refers to James, Witten, Hastie, and Tibshirani's book An Introduction to Statistical Learning


GD, SGD, and Ridge Regression

Lasso Regression

SVM and Sentiment Analysis

Kernel Methods

Probabilistic Modeling

Multiclass, Trees, and Gradient Boosting

Computation Graphs, Backpropagation, and Neural Networks


Other tutorials and references

  • Carlos Fernandez-Granda's lecture notes provide a comprehensive review of the prerequisite material in linear algebra, probability, statistics, and optimization.
  • Brian Dalessandro's iPython notebooks from DS-GA-1001: Intro to Data Science
  • The Matrix Cookbook has lots of facts and identities about matrices and certain probability distributions.
  • Stanford CS229: "Review of Probability Theory"
  • Stanford CS229: "Linear Algebra Review and Reference"
  • Math for Machine Learning by Hal Daumé III


David S. Rosenberg

Teaching Assistants


DataSci 207: Applied Machine Learning

Lecture: Mo, Tu, Th; office hours: Tu, 8-9 am PT.

This course provides a practical introduction to the rapidly growing field of machine learning: training predictive models that generalize to new data. We start with linear and logistic regression and implement gradient descent for these algorithms, the core engine for training. With these key building blocks, we work our way to understanding widely used neural network architectures, focusing on intuition and implementation with TensorFlow/Keras. While the course centers on neural networks, we will make sure to cover key ideas in unsupervised learning and nonparametric modeling.

Along the way, weekly short coding assignments will connect lectures with concrete data and real applications. A more open-ended final project will tie together crucial concepts in experimental design and analysis with models and training.

This class meets for one 90-minute class period each week.

All materials for this course are posted on GitHub in the form of Jupyter notebooks.

  • Please fill out this PRE-COURSE survey so I can get to know a bit more about you and your programming background.
  • Due to a large number of private Slack inquiries, I encourage you to first read this website for commonly asked questions.
  • Any questions regarding course content and organization (including assignments and final project) should be posted on my Slack channel. You are strongly encouraged to answer other students' questions when you know the answer.
  • If there are private matters specific to you (e.g., special accommodations), please contact me directly.
  • If you miss a class, watch the recording and inform me here .
  • If you want to stay up to date with recent work in AI/ML, start by looking at the conferences NeurIPS and ICML .
  • ML study guidelines: Stanford's super cheatsheet .

Core data science courses: research design, storing and retrieving data, exploring and analyzing data.

Undergraduate-level probability and statistics. Linear algebra is recommended.

Python (v3).

Jupyter and JupyterLab notebooks. You can install them on your computer using pip or Anaconda. More information here.

Git(Hub), including clone/commit/push from the command line. You can sign up for an account here.

If you have an Apple Silicon (M1) Mac, this .sh script will install everything for you (credit goes to one of my former students, Michael Tay).

Mac/Windows/Linux are all acceptable to use.

  • Raschka & Mirjalili (RM), Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2.
  • Weekly coding assignments, submitted via GitHub and Digital Campus (see notes below).
  • You will present your final project in class during the final session. You are allowed to work in teams (see notes below).
  • You will submit your code and presentation slides via GitHub (see notes below).

Communication channel

For the final project you will form a group (3-4 people is ideal; 2-5 people are allowed; no one-person groups). Grades will be calibrated by group size. Your group can only include members from the section in which you are enrolled.

Do not just re-run an existing code repository; at the minimum, you must demonstrate the ability to perform thoughtful data preprocessing and analysis (e.g., data cleaning, model training, hyperparameter selection, model evaluation).

The topic of your project is totally flexible (see also below some project ideas).

  • week 04: inform me here about your group, question and dataset you plan to use.
  • week 08: prepare the baseline presentation of your project. You will present in class (no more than 10 min).
  • week 14: prepare the final presentation of your project. You will present in class (no more than 10 min).
  • Second Sight through Machine Learning
  • Can we predict solar panel electricity production using equipment and weather data?
  • Predict Stock Portfolio Returns using News Headlines
  • Pneumonia Detection from Chest Xrays
  • Predicting Energy Usage from Publically Available Building Performance Data
  • Can we Predict What Movies will be Well Received?
  • ML for Music Genre Classification
  • Predicting Metagenome Sample Source Environment from Protein Annotations
  • California Wildfire Prediction
  • Title, Authors
  • What is the question you will be working on? Why is it interesting?
  • What is the data you will be using? Include data source, size of dataset, main features to be used. Please also include summary statistics of your data.
  • What prediction algorithms do you plan to use? Please describe them in detail.
  • How will you evaluate your results? Please describe your chosen performance metrics and/or statistical tests in detail.
  • (15%) Motivation: Introduce your question and why the question is interesting. Explain what has been done before in this space. Describe your overall plan to approach your question. Provide a summary of your results.
  • (15%) Data: Describe in detail the data that you are using, including the source(s) of the data and relevant statistics.
  • (15%) Approach: Describe in detail the models (baseline + improvement over baseline) that you use in your approach.
  • (30%) Experiments: Provide insight into the effect of different hyperparameter choices. Please include tables, figures, and graphs to illustrate your experiments.
  • (10%) Conclusions: Summarize the key results, what has been learned, and avenues for future work.
  • (15%) Code submission: Provide link to your GitHub repo. The code should be well commented and organized.
  • Contributions: Specify the contributions of each author (e.g., data processing, algorithm implementation, slides etc).
  • Create a GitHub repo for Assignments 1-10. Upload the homework's .ipynb file to Gradescope each week before the deadline.
  • Create a team GitHub repo for Final Project. This repo will contain your code as well as PowerPoint slides. Add me as a contributor if your repo is private (my username is corneliailin), and add the link to your repo here

Integrating a diverse set of experiences is important for a more comprehensive understanding of machine learning. I will make an effort to read papers and hear from a diverse group of practitioners, still, limits exist on this diversity in the field of machine learning. I acknowledge that it is possible that there may be both overt and covert biases in the material due to the lens with which it was created. I would like to nurture a learning environment that supports a diversity of thoughts, perspectives and experiences, and honors your identities (including race, gender, class, sexuality, religion, ability, veteran status, etc.) in the spirit of the UC Berkeley Principles of Community.

To help accomplish this, please contact me or submit anonymous feedback through I School channels if you have any suggestions to improve the quality of the course. If you have a name and/or set of pronouns that you prefer I use, please let me know. If something was said in class (by anyone) or you experience anything that makes you feel uncomfortable, please talk to me about it. If you feel like your performance in the class is being impacted by experiences outside of class, please don’t hesitate to come and talk with me. I want to be a resource for you. Also, anonymous feedback is always an option, and may lead me to make a general announcement to the class, if necessary, to address your concerns.

As a participant in teamwork and course discussions, you should also strive to honor the diversity of your classmates.

If you prefer to speak with someone outside of the course, MICS Academic Director Lisa Ho, I School Assistant Dean of Academic Programs Catherine Cronquist Browning, and the UC Berkeley Office for Graduate Diversity are excellent resources. Also see the following link.

Coursera Machine Learning



Table of Contents

  • Brief intro
  • Video lectures index
  • Programming exercise tutorials
  • Programming exercise test cases
  • Useful resources
  • Extra information

  • Online E-Books

Additional Information

Most of the course is about hypothesis functions and minimizing cost functions.

A hypothesis is a function that we believe (or hope) is similar to the true function, the target function that we want to model. In the context of email spam classification, it would be the rule we came up with that allows us to separate spam from non-spam emails.

Cost Function

The cost function, or Sum of Squared Errors (SSE), is a measure of how far our hypothesis is from the optimal hypothesis. The closer our hypothesis matches the training examples, the smaller the value of the cost function. Ideally, we would like J(θ) = 0.
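Using the convention from the course that scales the SSE by 1/(2m), a minimal numpy version of J(θ) for linear regression might look like this (a sketch of my own, with made-up toy data):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((h_theta(x_i) - y_i)^2), with h_theta(x) = x . theta."""
    m = len(y)
    residuals = X @ theta - y
    return residuals @ residuals / (2 * m)

# toy data for y = 2x; the first column of ones is the intercept term
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(cost(np.array([0.0, 2.0]), X, y))  # perfect hypothesis -> 0.0
```

As the text says, the better theta fits the training examples, the smaller this value gets.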

Gradient Descent

Gradient descent is an iterative minimization method. The gradient of the error function always points in the direction of steepest ascent of the error function. Thus, we can start with a random weight vector and repeatedly follow the negative gradient (scaled by a learning rate alpha).
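This procedure can be sketched for the squared-error cost above in a few lines (my own toy example; the values of alpha and the iteration count are arbitrary choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Minimize 1/(2m) * ||X theta - y||^2 by following the negative gradient."""
    m, n = X.shape
    theta = np.zeros(n)                   # an arbitrary starting weight vector
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m  # gradient of the cost at theta
        theta -= alpha * grad             # step against the direction of steepest ascent
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(gradient_descent(X, y))  # converges toward [0, 2], since y = 2x
```

Too large an alpha makes the iterates diverge; too small an alpha makes convergence painfully slow, which is why the course spends time on choosing it.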

Difference Between the Cost Function and Gradient Descent

Bias and Variance

When we discuss prediction models, prediction errors can be decomposed into two main subcomponents we care about: error due to “bias” and error due to “variance”. There is a tradeoff between a model’s ability to minimize bias and variance. Understanding these two types of error can help us diagnose model results and avoid the mistake of over- or under-fitting.
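A small simulation makes the tradeoff concrete (my own sketch, not from the course): fit polynomials of low and high degree to a noisy sample of a sine curve. The low-degree model underfits (high bias), while the high-degree model chases the noise (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)  # noise-free ground truth

def errors(degree):
    """Train and test mean squared error of a polynomial fit of the given degree."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train, test

# degree 1: high training AND test error (bias); degree 9: much lower
# training error, but typically a worse fit to the true curve (variance)
print(errors(1))
print(errors(9))
```

Diagnosing which side of this tradeoff a model sits on (compare training error to held-out error) is exactly how one avoids over- or under-fitting in practice.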


Hypothesis and Cost Function Table

Regression with Pictures

  • Linear Regression
  • Logistic Regression

Week 1 - Due 07/16/17:

  • Welcome - pdf - ppt
  • Linear regression with one variable - pdf - ppt
  • Linear Algebra review (Optional) - pdf - ppt
  • Lecture Notes

Week 2 - Due 07/23/17:

  • Linear regression with multiple variables - pdf - ppt
  • Octave tutorial pdf
  • Programming Exercise 1: Linear Regression - pdf - Problem - Solution
  • Program Exercise Notes

Week 3 - Due 07/30/17:

  • Logistic regression - pdf - ppt
  • Regularization - pdf - ppt
  • Programming Exercise 2: Logistic Regression - pdf - Problem - Solution

Week 4 - Due 08/06/17:

  • Neural Networks: Representation - pdf - ppt
  • Programming Exercise 3: Multi-class Classification and Neural Networks - pdf - Problem - Solution

Week 5 - Due 08/13/17:

  • Neural Networks: Learning - pdf - ppt
  • Programming Exercise 4: Neural Networks Learning - pdf - Problem - Solution

Week 6 - Due 08/20/17:

  • Advice for applying machine learning - pdf - ppt
  • Machine learning system design - pdf - ppt
  • Programming Exercise 5: Regularized Linear Regression and Bias v.s. Variance - pdf - Problem - Solution

Week 7 - Due 08/27/17:

  • Support vector machines - pdf - ppt
  • Programming Exercise 6: Support Vector Machines - pdf - Problem - Solution

Week 8 - Due 09/03/17:

  • Clustering - pdf - ppt
  • Dimensionality reduction - pdf - ppt
  • Programming Exercise 7: K-means Clustering and Principal Component Analysis - pdf - Problems - Solution

Week 9 - Due 09/10/17:

  • Anomaly Detection - pdf - ppt
  • Recommender Systems - pdf - ppt
  • Programming Exercise 8: Anomaly Detection and Recommender Systems - pdf - Problems - Solution

Week 10 - Due 09/17/17:

  • Large scale machine learning - pdf - ppt

Week 11 - Due 09/24/17:

  • Application example: Photo OCR - pdf - ppt
  • Linear Algebra Review and Reference, by Zico Kolter
  • CS229 Lecture notes
  • CS229 Problems
  • Financial time series forecasting with machine learning techniques
  • Octave Examples

Online E-Books

  • Introduction to Machine Learning by Nils J. Nilsson
  • Introduction to Machine Learning by Alex Smola and S.V.N. Vishwanathan
  • Introduction to Data Science by Jeffrey Stanton
  • Bayesian Reasoning and Machine Learning by David Barber
  • Understanding Machine Learning, © 2014 by Shai Shalev-Shwartz and Shai Ben-David
  • Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman
  • Pattern Recognition and Machine Learning, by Christopher M. Bishop

Course Status


  • What are the top 10 problems in deep learning for 2017?
  • When will the deep learning bubble burst?

Statistical Models and Research Venues

  • HMM - Hidden Markov Model
  • CRFs - Conditional Random Fields
  • LSI - Latent Semantic Indexing
  • MRF - Markov Random Fields
  • SIGIR - Special Interest Group on Information Retrieval
  • ACL - Association for Computational Linguistics
  • NAACL - The North American Chapter of the Association for Computational Linguistics
  • EMNLP - Empirical Methods in Natural Language Processing
  • NIPS - Neural Information Processing Systems


Applied Machine Learning in Python

Introduction

This book aims to provide an accessible introduction to applying machine learning with Python, in particular using the scikit-learn library. I assume that you’re already somewhat familiar with Python and the libraries of the scientific Python ecosystem. If you find that you have a hard time following some of the details of numpy, matplotlib, and pandas, I highly recommend you look at Jake VanderPlas’ Python Data Science Handbook.

Scope and Goals

After reading this book, you will be able to do exploratory data analysis on a dataset, evaluate potential machine learning solutions, and implement and evaluate them. The focus of the book is on tried-and-true methodology for solving real-world machine learning problems. However, we will not go into the details of productionizing and deploying the solutions. We will mostly focus on what’s known as tabular data, i.e. data that would usually be represented as a pandas DataFrame, Excel spreadsheet, or CSV file. While we will discuss working with text data in a later chapter, there are many more advanced techniques, for which I’ll point you towards Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. We will not look at image recognition, video or speech data, or time series forecasting, though many of the core concepts described in this book also apply there.

What is machine learning?

Machine learning, also known as predictive modeling in statistics, is a research field and a collection of techniques to extract knowledge from data, often used to automate decision-making processes. Applications of machine learning are pervasive in technology, in particular in complex websites such as Facebook, Amazon, YouTube, or Google. These sites use machine learning to personalize the experience, show relevant content, decide on advertisements, and much more. Without machine learning, none of these services would look anything like they do today. Outside the web, machine learning has also become integral to commercial applications in manufacturing, logistics, material design, financial markets, and many more. Finally, over recent years, machine learning has become essential to research in practically all data-driven sciences, including physics, astronomy, biology, medicine, earth sciences, and the social sciences.

There are three main sub-areas of machine learning: supervised learning, unsupervised learning, and reinforcement learning, each of which applies to a somewhat different setting. We’ll discuss each in turn and give some examples of how they can be used.

Supervised Learning

Supervised learning is by far the most commonly used in practice. In supervised learning, a model is built from a dataset of input-output pairs, where the input is known as the features or independent variables, which we’ll denote by \(x\), and the output is known as the target or label, which we’ll denote by \(y\). The input here is a representation of an entity of interest, say a customer of your online shop, represented by their age, location, and shopping history. The output is a quantity of interest that we want our model to predict, say whether they would buy a particular product if we recommend it to them. To build a model, we need to collect many such pairs, i.e. we need to build records of many customers and their decisions about whether or not they bought the product after a recommendation was shown to them. Such a set of input-output pairs for the purpose of building a supervised machine learning model is called a training set.


Once we have collected this dataset, we can (attempt to) build a supervised machine learning model that will make a prediction for a new user who wasn’t included in the training dataset. That might enable us to make better recommendations, i.e. only show recommendations to users who are likely to buy.


The name supervised learning comes from the fact that during learning, the dataset contains the correct targets, which act as a supervisor for the model training.

For both regression and classification, it’s important to keep in mind the concept of generalization. Let’s say we have a regression task. We have features, that is, data vectors x_i, and targets y_i, drawn from a joint distribution. We now want to learn a function f such that f(x) is approximately y, not on this training data, but on new data drawn from this distribution. This is what’s called generalization, and it is a core distinction from function approximation. In principle we don’t care how well we do on the x_i; we only care how well we do on new samples from the distribution. We’ll go into much more detail about generalization in about a week, when we dive into supervised learning.

Classification and Regression

\(^1\) There are many other kinds of supervised learning tasks such as ranking or probability estimation, however, we will focus on classification and regression, the most commonly used tasks, in this book.

There are two main kinds of supervised learning tasks, called classification and regression \(^1\). If the target of interest \(y\) that we want to predict is a continuous quantity, the task is a regression problem. If it is discrete, i.e. one of several distinct choices, then it is a classification problem. For example, predicting the time it will take a patient to recover from an illness, say measured in days, is a regression task. We might want our model to predict whether a patient will be ready to leave the hospital 3.5 days after admission, or 5, or 10. This is regression because time is clearly a continuous quantity, and there is a clear sense of ordering and distance between the different possible predictions. If the correct prediction is that the patient can leave after 4.5 days, but instead we predict 5, that might not be exactly correct, but it might still be a useful prediction. Even 6 might be somewhat useful, while 20 would be totally wrong.

\(^2\) This might be more naturally formulated as a multi-label task, which is basically a series of binary classification tasks. There could be more than one medication that leads to success, so this could be phrased as a yes/no question for each candidate.

An example of a classification task would be predicting which of a set of medications a patient would respond best to \(^2\). Here, we have a fixed set of disjoint candidates that are known a priori, and there is usually no order or sense of distance between the classes. If medication A is the best, then predicting any other medication is a mistake, so we need to predict exactly the right outcome for the prediction to be accurate. A very common instance of classification is the special case of binary classification, where there are exactly two choices. Often this can be formulated as a “yes/no” question to which you want to predict an answer. Examples of this are “is this email spam?”, “is there a pedestrian on the street?”, “will this customer buy this product?” or “should we run an X-ray on this patient?”.

The distinction between classification and regression is important, as it changes the algorithms we will use and the way we measure success. For classification, a common metric is accuracy, the fraction of correctly classified examples, i.e. the fraction of times the model predicted the right class. For regression, on the other hand, a common metric is mean squared error, the average squared distance from the prediction to the correct answer. In other words, in regression you want the prediction to be close to the truth, while in classification you want to predict exactly the correct class. In practice, the difference is a bit more subtle, and we will discuss model evaluation in depth in a later chapter.

Usually it’s quite clear whether a task is classification or regression, but there are some cases that could be solved using either approach. A somewhat common example is ratings in the 5-star rating system that’s popular on many online platforms. Here, the possible ratings are one star, two stars, three stars, four stars, and five stars. These are discrete choices, so you could apply a classification algorithm. On the other hand, there is a clear ordering, and if the real answer is one star, predicting two stars is probably better than predicting five stars, which means it might be more appropriate to use regression. Which one is more appropriate depends on the particular algorithm you’re using and how it integrates into your larger workflow.

Generalization

When building a model for classification or regression, keep in mind that what we’re interested in is applying the model to new data for which we do not know the outcome. If we build a model for detecting spam emails, but it only works on emails in the training set, i.e. emails the model has seen during model building, it will be quite useless. What we want from a spam detection algorithm is to predict reasonably well whether a new email, one not included in the training set, is spam. The ability of a supervised model to make accurate predictions on new data is called generalization and is the core goal of supervised learning. Without asking for generalization, an algorithm could solve the spam detection task on the training data by just storing all the data and, when presented with one of these emails, looking up what the correct answer was. This approach is known as memorization, but it’s impossible to apply to new data.
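The memorization strategy is easy to write down, which makes its limitation obvious: it has no answer at all for inputs it hasn't stored. (A sketch of my own, with made-up emails.)

```python
class Memorizer:
    """A 'model' that just stores every training pair in a lookup table."""

    def fit(self, emails, labels):
        self.table = dict(zip(emails, labels))
        return self

    def predict(self, email):
        # perfect recall on training emails, no answer for anything new
        return self.table.get(email)

model = Memorizer().fit(["cheap pills!!", "meeting at 3"], ["spam", "not spam"])
print(model.predict("cheap pills!!"))    # "spam": seen during training
print(model.predict("lunch tomorrow?"))  # None: no generalization at all
```

A real spam filter instead has to extract patterns (words, phrases, sender features) that carry over to emails it has never seen.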

Conditions for success

For a supervised learning model to generalize well, i.e. for it to be able to make accurate predictions on new data, some key assumptions must be met:

First, the necessary information for making the correct prediction actually needs to be encoded in the training data . For example, if I try to learn to predict a fair coin flip before the coin is tossed, I won’t be able to build a working machine learning model, no matter what I choose as the input features. The process is largely (perhaps entirely) random, and the information to make a prediction is just not available. More technically, one might say the process has high intrinsic randomness that we cannot overcome by building better models. While you’re unlikely to encounter a case as extreme (and obvious) as a coin toss, many processes in the real world are quite random (such as the behavior of people), and it’s impossible to make entirely accurate predictions for them.

In other cases, a prediction might be possible in principle, but we might not have provided the right information to the model. For example, it might be possible for a machine learning model to learn to diagnose pneumonia in a patient, but not if the only information about the patient that we present to the model is their shopping habits and wardrobe. If we use a chest x-ray as a representation of the patient, together with a collection of symptoms, we will likely have better success. Even if the information is represented in the input, learning might still fail if the model is unable to extract the information. For example, visual stimuli are very easy for humans to interpret, but in general much harder for machine learning algorithms to understand. Consequently, it would be much harder for a machine to determine whether a graffiti is offensive from a photograph than if the same information were presented as a text file.

Secondly, the training dataset needs to be large and varied enough to capture the variability of the process . In other words, the training data needs to be representative of the whole process, not just a small portion of it. Humans are very good at abstracting properties, and a child will be able to understand what a car is after seeing only a handful of examples. Machine learning algorithms, on the other hand, require a lot of variability to be present. For example, to learn what a car looks like, an algorithm likely needs to see pictures of vans, trucks, and sedans; pictures from the front, the side, and above; pictures parking and in traffic; pictures in rain and in sunshine, in a garage and outdoors; maybe even pictures taken by a phone camera and pictures taken by a news camera. As we said before, the whole point of supervised learning is to generalize, so we want our model to apply to new settings. However, how new a setting can be depends on the representation of the data and the algorithm in question. If the algorithm has only ever seen trucks, it might not recognize a sedan. If the algorithm has never seen a snow-covered car, it’s unlikely to recognize one. Photos (also known as natural images in machine learning) are an extreme example, as they have a lot of variability and so often require a lot of training data. If your data has a simple structure, or the relationship between your features and your target is simple, then only a handful of training examples might be enough.

example of simple training task

Third and finally, the data that the model is applied to needs to be generated by the same process as the data the model was trained on . A model can only generalize to data that in essence adheres to the same rules and has the same structure. If I collect data about public transit ridership in Berlin and use it to make predictions in New York, my model is likely to perform poorly. While I might be able to measure the same things, say the number of people at stations, population density, holidays, etc., there are so many differences between data collected in Berlin and data collected in New York that it’s unlikely a model trained on one could predict well on the other. As another example, let’s say you train an image recognition model for recognizing hot dogs on a dataset of stock photos, and you want to deploy it to an app using a phone camera. This is also likely to fail, as stock photography doesn’t resemble photos taken by users pointing their phones. Stock photography is professionally produced and well-lit, the angles are carefully chosen, and often the food is altered to show it in its best light (have you noticed how food in a restaurant never looks like it does in a commercial?). In short, machine learning requires you to use a training dataset that was generated by the same process as the data the model will be applied to.

Mathematical Background

From a mathematical standpoint, supervised learning assumes that there is a joint distribution \(p(x, y)\) and that the training dataset consists of independent, identically distributed (i.i.d.) samples from this joint distribution. The model is then applied to new data sampled from the same distribution, but for which \(y\) is unknown. The model is used to estimate \(p(y | x)\) , or more commonly the mode of this distribution, i.e. the most likely value for \(y\) to take given the \(x\) we observed. In the case of learning to predict a fair coin flip, you could actually learn a very accurate model of \(p(y | x)\) , one that predicts heads and tails with equal probability. There is no way to predict the particular outcome itself, though.
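To make this concrete, here is a minimal sketch with a hypothetical data-generating process: a coin whose bias depends on a binary feature \(x\) . Estimating \(p(y | x)\) by simple counting recovers the conditional probabilities, but for the fair coin no predictor can beat 50% accuracy:

```python
import random

random.seed(0)

# Hypothetical process: y is a coin flip whose bias depends on a binary
# feature x, with p(y=1 | x=0) = 0.5 and p(y=1 | x=1) = 0.9.
def sample(n):
    data = []
    for _ in range(n):
        x = random.randint(0, 1)
        p_heads = 0.5 if x == 0 else 0.9
        y = 1 if random.random() < p_heads else 0
        data.append((x, y))
    return data

train = sample(10_000)  # i.i.d. samples from the joint distribution p(x, y)

# Estimate p(y=1 | x) from the training data by counting.
def estimate(data):
    counts = {0: [0, 0], 1: [0, 0]}  # x -> [total, heads]
    for x, y in data:
        counts[x][0] += 1
        counts[x][1] += y
    return {x: heads / total for x, (total, heads) in counts.items()}

p_hat = estimate(train)
# The mode of p(y | x) is the best possible prediction. For x = 0 the
# coin is fair, so even a perfect model cannot beat 50% accuracy there.
```

The estimated `p_hat` will be close to the true conditional probabilities, which illustrates the point in the text: the model of \(p(y | x)\) can be very accurate even when individual outcomes remain unpredictable.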

The third requirement for success is easily expressed as saying that the test data is sampled i.i.d. from the same distribution \(p(x, y)\) that the training data was generated from.

Unsupervised Learning ¶

In unsupervised machine learning, we are usually just given data points \(x\) , and the goal is to learn something about the structure of the data. This is usually a more open-ended task than what we saw in supervised learning. This kind of task is called unsupervised because even during training, there is no “supervision” providing a correct answer. There are several sub-categories of unsupervised learning that we’ll discuss in Chapter 3, in particular clustering, dimensionality reduction, and signal decomposition. Clustering is the task of finding coherent groups within a dataset, for example subgroups of customers that behave in a similar way, say “students”, “new parents” and “retirees”, each with a distinct shopping pattern. However, here, in contrast to classification, the groups are not pre-defined. We might not know what the groups are, how many groups there are, or even if there is a coherent way to define any groups. There might also be several different ways the data could be grouped: say you’re looking at portraits. One way to group them could be by whether the subject wears glasses or not. Another way could be by the direction they are facing. Yet another might be hair color or skin color. If you tell an algorithm to cluster the data, you don’t know which aspect it will pick up on, and usually manually inspecting the groups or clusters is the only way to interpret the results.
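As a minimal sketch of the clustering idea, here is a toy one-dimensional k-means on made-up customer data (the group sizes, basket values, and the naive initialization are all illustrative simplifications; real implementations are more careful):

```python
import random

random.seed(42)

# Toy 1-D dataset: three hypothetical customer groups with different
# average basket sizes. The group labels are *not* given to the algorithm.
data = ([random.gauss(10, 1) for _ in range(50)]     # "students"
        + [random.gauss(30, 1) for _ in range(50)]   # "new parents"
        + [random.gauss(60, 1) for _ in range(50)])  # "retirees"

def kmeans_1d(points, k, n_iter=20):
    pts = sorted(points)
    # Naive deterministic initialization: spread the initial centers
    # across the data range (real implementations use smarter schemes).
    centers = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

centers = kmeans_1d(data, k=3)  # recovers centers roughly at 10, 30 and 60
```

Note that the algorithm recovers the three groups without ever seeing a label, which is exactly what distinguishes clustering from classification: we would still have to inspect the clusters ourselves to decide that they correspond to “students”, “new parents” and “retirees”.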

Two other, related, unsupervised learning tasks are dimensionality reduction and signal decomposition. In these, we are not looking for groups in the data, but for underlying factors of variation that are potentially more semantic than the original representation. Going back to the example of portraits, an algorithm might find that head orientation, lighting and hair color are important aspects of the image that vary independently. In dimensionality reduction, we are usually looking for a representation that is lower-dimensional, i.e. that has fewer variables than the original feature space. This can be particularly useful for visualizing datasets with many features, by projecting them into a two-dimensional space that’s easily plotted. Another common application of signal decomposition is topic modeling of text data. Here, we are trying to find topics among a set of documents, say news articles, court documents, or social media posts. This is related to clustering, with the difference that each document can be assigned multiple topics: topics in the news could be politics, religion, sports and economics, and an article could be about both politics and economics.
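As a sketch of the dimensionality reduction idea, the following toy example finds the direction of largest variation in synthetic two-dimensional data and projects onto it, which is the core of principal component analysis (the data and the closed-form 2x2 eigenvector computation are illustrative simplifications; in practice you would use a library):

```python
import math
import random

random.seed(0)

# Toy 2-D data with one dominant factor of variation: points near the
# line y = 2x, plus a little noise.
data = [(t + random.gauss(0, 0.1), 2 * t + random.gauss(0, 0.1))
        for t in [random.uniform(-1, 1) for _ in range(200)]]

def leading_direction(points):
    """Leading principal direction of 2-D data (closed form for 2x2)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x, _ in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / n
    c = sum((y - my) ** 2 for _, y in points) / n
    # Largest eigenvalue of a symmetric 2x2 matrix.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Corresponding (unnormalized) eigenvector: (b, lam - a).
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

vx, vy = leading_direction(data)
# Reduce to one dimension: project each 2-D point onto this direction.
projected = [x * vx + y * vy for x, y in data]
```

The recovered direction is close to \((1, 2)/\sqrt{5}\) , the underlying factor that generated the data, so the one-dimensional `projected` values retain almost all of the variation.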

Both clustering and signal decomposition are most commonly used in exploratory analysis, where we are trying to understand the data. They are less commonly used in production systems, as they lend themselves less easily to automating a decision process. Sometimes signal decomposition is used as a mechanism to extract more semantic features from a dataset, on top of which a supervised model is learned. This can be particularly useful if there is a large amount of data, but only a small amount of annotated data, i.e. data for which the outcome \(y\) is known.

Reinforcement Learning ¶

The third main family of machine learning tasks is reinforcement learning, which is quite different from the other two. Both supervised and unsupervised learning basically work on a dataset that was collected and stored, from which we then build a model. Potentially, this model is then applied to new data in the future. In reinforcement learning, on the other hand, there is no real notion of a dataset. Instead, reinforcement learning is about a program (usually known as an agent) interacting with a particular environment. Through this interaction, the agent learns to achieve a particular goal. A good example of this is a program learning to play a video game. Here, the agent would be an AI playing the game, while the environment would be the game itself, i.e. the world in which it plays out. The agent observes the environment and has a choice of actions (say, moving forward, moving backward, and jumping), and each of these actions will result in the environment being in a new state (i.e. with the agent placed a bit forward, or backward, or falling into a hole). Given the new state of the environment, the agent again can choose an action, and the environment will be in a new state as a consequence.


Fig. 1 The reinforcement learning cycle. ¶

The learning in reinforcement learning happens with so-called rewards , which need to be specified by the data scientist building the system. The agent is trained to seek rewards (hence the name reinforcement learning), and will find series of actions that maximize its reward. In a game, a reward could be given to the agent every time it scores points, or just once when it wins the game. In the second case, there might be a long delay between the agent taking an action and the agent winning the game, and one of the main challenges in reinforcement learning is dealing with such settings (this is known as the credit assignment problem: which of my actions should get the credit for winning the game?).
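The interaction loop and reward-driven updates described above can be sketched with tabular Q-learning on a toy corridor environment (the environment, reward, and hyperparameters here are all made up for illustration; real reinforcement learning systems are vastly more elaborate):

```python
import random

random.seed(0)

# A minimal hypothetical environment: a corridor of 5 cells. The agent
# starts in cell 0 and receives a reward of 1 only when it reaches cell 4.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # move left, move right

def step(state, action):
    new_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if new_state == GOAL else 0.0
    return new_state, reward, new_state == GOAL

# Q-learning: learn the value Q[state][action] of each action in each
# state from rewards alone, without ever being told the "right" moves.
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, eps = 0.5, 0.9, 0.1

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit what we know, sometimes explore.
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = 0 if Q[state][0] > Q[state][1] else 1
        new_state, reward, done = step(state, ACTIONS[a])
        # The reward propagates backwards through the Q-values over many
        # episodes, which is how Q-learning copes with delayed rewards.
        target = reward + gamma * max(Q[new_state])
        Q[state][a] += alpha * (target - Q[state][a])
        state = new_state

# After training, the greedy policy in every state is "move right".
policy = ["right" if Q[s][1] > Q[s][0] else "left" for s in range(GOAL)]
```

Even though only the final step is rewarded, the discounted value estimates back up from the goal state, so the agent solves the credit assignment problem for this tiny task purely from trial and error.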

Compared with supervised learning, reinforcement learning is a much more indirect way to specify the learning problem: we don’t provide the algorithm with the correct answer (i.e. the correct sequence of actions to win the game); instead, we only reward the agent once it achieves a goal. Perhaps surprisingly, this can work quite well in practice. It is like learning a game without anyone ever telling you the rules, or what the goal of the game is, only telling you whether you lost or won at the end. As you might expect, it might take you many, many tries to figure out the game.

However, algorithms are notoriously patient, and researchers have been able to use reinforcement learning to create programs that can play a wide variety of complex games. Potentially one of the most surprising and impressive feats was learning to play the ancient Chinese board game of Go at a superhuman level.

citations etc, numbers of games, years…

When this was publicized in TODO, many researchers in the area were shocked, as the game was known to be notoriously hard, and many believed it could not be learned by any known algorithms. While the initial work used some human knowledge, later systems learned to play the game from scratch, i.e. without any rewards other than for winning the game, by repeatedly playing against themselves. The resulting programs are now playing at a superhuman level, meaning that they are basically unbeatable, even by the best human players in the world. Similar efforts are now underway for other games, in particular computer games like StarCraft II and DOTA.

Algorithms achieved superhuman performance in the game of chess long before this, in the year TODO with the famous matches between Kasparov and Deep Blue. Chess has far fewer possible moves, and games are much shorter sequences of actions than in Go or StarCraft, which makes it much easier to devise algorithms to play chess.

Reinforcement learning also has a long history in other areas, in particular robotics, where it is used for learning and tuning behaviors such as walking or grasping. While many impressive achievements have been made with reinforcement learning, there are several aspects that limit its broad usefulness. A potential application of reinforcement learning could be self-driving cars. However, as mentioned above, reinforcement learning usually requires many attempts or iterations before it learns a specific task. If I wanted a car to learn to park, it might fail thousands or hundreds of thousands of times first. Unfortunately, in the real world this is impractical: self-driving cars are very expensive, and we don’t want to crash them over and over again. It might also be risky for the person conducting the experiment. With thousands of attempts, even if the car doesn’t crash, the gas will run out, and the person having to reset the experiment every time will probably get very tired very quickly. Therefore, reinforcement learning is most successful when there is a good way to simulate the environment, as is the case with games, and with some aspects of robotics. For learning how to park, a simulation might actually work well, as the sensing of other cars and the steering of the car can be simulated well. However, for really learning how to drive, a car would need to be able to deal with a variety of situations, such as different weather conditions, crowded streets, people running on the street, kids chasing balls, navigating detours, and many other scenarios. Simulating these in a realistic way is very hard, and so reinforcement learning is much harder to apply in the physical world.

A setting that has attracted some attention, and might become more relevant soon, is online platforms that are not games. You could think of a social media timeline as an agent that gets rewarded for you looking at it. Right now, this is often formulated as a supervised learning task (TODO or more accurately active learning). However, your interactions with social media are not usually independent events; your behavior online is shaped by what is being presented to you, and what was shown to you in the past might influence what is shown to you in the future. A maybe somewhat cynical analogy would be to think of the timeline as an agent playing you, winning whenever you stay glued to the screen (or click an ad or buy a product). I’m not aware that this has been implemented anywhere, but as computational capacity increases and algorithms become more sophisticated, it is a natural direction to explore.

Reinforcement learning is a fascinating topic, but far beyond the scope of this book. For an introduction, see TODO Sutton and Barto. For an overview of modern approaches, see TODO.

As you might have noticed in the table of contents, this book mostly concerns itself with supervised and unsupervised learning, and we will not discuss reinforcement learning any further. As a matter of fact, the book heavily emphasizes supervised learning, which has found the largest success among the three in practical applications so far. While all three of these areas are interesting in their own right, when you see an application of machine learning, or when someone says they are using machine learning for something, chances are they mean supervised learning, which is arguably the best understood of the three, and the easiest to productionize and analyze.

Isn’t this just statistics? ¶

A very common question I get is “isn’t machine learning just statistics?”, so I want to quickly address how the approach in this book differs from the approach taken in a statistics class or textbook. The machine learning community and the statistics community have some historical differences (ML being born much later, and from within computer science), but they study many of the same subjects. So I don’t think it makes sense to say that one thing is statistics and the other thing is machine learning. However, there is usually a somewhat different emphasis in the kinds of problems and questions that are addressed in each, and I think it’s important to distinguish these tasks. Much of statistics deals with inference , which means that given a dataset, we want to make statements that hold for the dataset (often called the population in statistics) as a whole. Machine learning, on the other hand, often emphasizes prediction , which means we are looking to make statements about each sample, in other words individual-level statements. Asking “do people that take zinc get sick less often?” is an inference question, as it asks about whether something happens on average over the whole population. A related prediction question would be “will this particular person get sick if they take zinc?”. The answer to the inference question would be either “yes” or “no”, and using hypothesis testing methodology this statement could have an effect size and a significance level attached to it. The answer to the prediction question would be a prediction for each sample of interest, or maybe even a program that can make predictions given information about a new patient.

As you can see, these are two fundamentally different kinds of questions, and require fundamentally different kinds of tools to answer them. This book solely looks at the prediction task, and we consider a model a good model if it can make good predictions. We do not claim that the model allows us to make any statistical or even causal statements that hold for the whole population, or the process that generated the dataset.

There are some other interesting differences between the kind of prediction questions studied in supervised machine learning and the inference questions traditionally studied in statistics; in particular, machine learning usually assumes that we have access to data that was generated by the process we want to model, and that all samples are created equal (i.i.d.). Statistical inference usually makes no such assumptions, and instead assumes that we have some knowledge about the structure of the process that generated the data. As an example, consider predicting a presidential election outcome. As of this writing, there are 58 past elections to learn from. For a machine learning task, this is by no means enough observations to learn from. But even worse, these samples are not created equal. The circumstances of the first election are clearly different than they will be for the next election. The economic and societal situation will be different, as will be the candidates. So really, we have no examples whatsoever from the process that we’re interested in. However, by understanding all the differences from previous elections, we might still be able to make accurate forecasts using statistical modeling.

is this math? should this been in a math section?

If you’re interested in a discussion of prediction vs inference and how they relate, I highly recommend the seminal paper Statistical Modeling: The Two Cultures by Leo Breiman.

Some of my favorite machine learning textbooks are written by statisticians (the subfield is called predictive modeling), and there are certainly machine learning researchers that work on inference questions, so I think making a distinction between statistics and machine learning is not that useful. However, if you look at how a statistics textbook teaches, say, logistic regression, the intention is likely to be inference, and so the methods will be different from this book, where the emphasis is on prediction, and you should keep this in mind.

This is not to say that one is better than the other in any sense, but that it’s important to pick the right tool for the job. If you want to answer an inference question, the tools in this book are unlikely to help you, but if you want to make accurate predictions, they likely will.

The bigger picture ¶

This book is mostly technical in nature, with an emphasis on practical programming techniques. However, there are some important guiding principles for developing machine learning solutions that are often forgotten by practitioners who find themselves deep in the technical aspects. In this section, I want to draw your attention to what I think are crucial aspects of using machine learning in applications. It might seem a bit dry for now, but I encourage you to keep these ideas in mind while working through the rest of the book, and maybe come back here at the end, once you’ve gotten your feet a bit wet.

The machine learning process ¶

Outside of the depicted process is the formulation of the problem and the definition of measures, both of which are critical, but usually not part of the loop. The actual machine learning process itself starts with data collection, which might mean mining historical data, labeling data by hand, running simulations, or even performing actual physical experiments. Once the data is collected, it needs to be processed into a format suitable for machine learning, which we’ll discuss in more detail in Chapter TODO. Before building the model, exploratory data analysis and visualization are essential to form or confirm intuitions about the structure of the data, to spot potential data quality issues, to select suitable candidate models, and potentially to generate new features. The next step, model building, usually involves building several candidate models, tweaking them, and comparing them. Once a model is selected, it is usually evaluated first in an off-line manner, that is, using already collected data.

Unclear, should we talk about shadow-running models?

Then, potentially, it is further validated in a live setting with current data. Finally, the model is deployed into the production environment. For a web app, deployment might be deployment in the software sense: deploying a service that takes user data, runs the model, and renders some outcome on your website. For industrial applications, deployment could mean integrating your defect detection into an assembly line and discarding defective parts; if your model is for evaluating real estate, deployment might mean buying highly valued properties.

This process is depicted as a circle, as deployment usually generates new data, or informs future data collection, and restarts the process. While I drew a circle, this is really more than one loop; in fact, it is a fully connected graph, where after each step you might decide to go back to previous steps and start over, improving your model or your process. At any point, you might find data quality issues, figure out new, informative ways to represent the data, or find out that your model doesn’t perform as well as you thought. Each time, you might decide to improve any of the previously taken steps. Usually there are many iterations before reaching integration and deployment for the first time, as using an unsuitable model might represent a substantial risk to your project.

The rest of the book will focus on model building and evaluation, which are at the core of machine learning. However, for a successful project, all of the steps in the process are important. Formulating the problem, collecting data, and establishing success metrics are often at least as crucial as selecting the right model and tweaking it. Given the technical nature of the material presented in this book, it’s easy to lose sight of how critical all the steps of the process are. We will discuss some of these in a bit more detail now.

The role of data ¶

Clearly the data used for building and evaluating a machine learning model is a crucial ingredient. Data collection is often overlooked in machine learning education, where students usually look at fixed datasets, and the same is true for online competitions and platforms such as kaggle . However, in practice, data collection is usually part of building any machine learning application, and there is usually a choice to collect additional data, or to change the data collection. Having more data can be the difference between a model that’s not working and a model that outperforms human judgement, in particular if you can collect data that covers the variability that you will encounter in prediction. Sometimes it might be possible to collect additional features that make the task much easier, and selecting what data to collect is often as critical as selecting the right model. Usually it’s easier to throw away data later than to add new fields to the data collection. It’s common for data scientists to start working on a model only to discover that a critical aspect of the process was not logged, and a task that could have been easy becomes extremely hard.

data quality?

Potentially one of the most ingenious ways to capture labeled training data is ReCAPTCHA . It provides a service to verify that a web user is not a bot by having them solve visual tasks. The solutions are then used as ground truth annotations for training machine learning models.

Depending on the problem you’re tackling, the effort and cost of data collection can vary widely. In some settings, the data is basically free and endless . Say you want to predict how much attention a post will receive on social media. As long as your post is similar to other posts on the platform, you can obtain arbitrary amounts of training data by looking at existing posts and collecting the number of likes, comments, and other engagement. This data-rich situation often appears when you are trying to predict the future, and you can observe the labels of past data simply by waiting, i.e. seeing how many people like a photo. In some cases the same might be true for showing ads or recommendations, where you are able to observe past behavior of users, or in content moderation, where users might flag offending content for you. This assumes that the feedback loop is relatively short and the events repeat often, though. If you work in retail, the two data points that are most crucial (at least in the US) are Black Friday and Christmas. And while you might be able to observe them, you can only observe them once a year, and if you make a bad decision, you might go out of business before observing them again.

Another common situation is automating a business process that was previously done manually. Usually collecting the answers is not free in this setting, but it’s often possible to collect additional data by manual annotation . The price of collecting more data then depends on the level of qualification required to create accurate labels, and the time involved. If you want to detect personal attacks in your online community, you can likely use a crowd-sourcing platform or a contractor to get reasonable labels. If your decision requires expert knowledge, say, which period a painting was created in, hiring an expert might be much more expensive or even impossible. In this situation, it’s often interesting to ask yourself what is more cost-effective: spending time building and tuning a complex machine learning model, or collecting more data and potentially getting results with less effort. We will discuss how to make this decision in TODO.

Finally, there are situations where getting additional data is infeasible or impossible; in these situations, people speak of precious data . Examples of this could be the outcome of a drug trial, which is lengthy and expensive and where collecting additional data might not be feasible. Or the simulation of a complex physical system, or observations from a scientific measurement. Maybe each sample corresponds to a new microchip architecture for which you want to model energy efficiency. These settings are those where tweaking your model and diving deep into the problem might pay off, but such situations are overall rather rare in practice.

say why machines are good at some things and bad at others? like medical imaging?

Feedback loops in data collection ¶

One particular aspect that is often neglected in data collection is that deploying a machine learning model might change the process generating the data. A simple example of this would be a spammer who, once a model is able to flag their content, changes their strategy or content so as to no longer be detected. Clearly, the data here changed as a consequence of deploying the model, and a model that might have been able to accurately identify spam in an offline setting might not work in practice. In this example, there is adversarial intent, and the spammers intentionally try to defeat the model. However, similar changes might happen incidentally, and still invalidate a previous model. For example, when building systems for product recommendation, the model often relies on data that was collected using some other recommendation scheme, and the choice of this scheme clearly influences what data will be collected. If a streaming platform never suggests a particular movie, it’s unlikely to be seen by many users, and so it will not show up in the user data that’s collected, and so a machine learning algorithm will not recommend it, creating a feedback loop that will lead to the movie being ignored. There is a whole subset of machine learning devoted to this kind of interactive data collection, called active learning , where the data that is collected is closely related to the model that’s being built. This area also has a close relation to reinforcement learning.
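To see how such a loop can arise even without adversarial intent, here is a hypothetical rich-get-richer simulation: five equally good movies, a recommender that recommends in proportion to past clicks (i.e. in proportion to the logged data), and users who click any recommendation with the same probability. Early random luck tends to compound, so the logged click counts drift away from uniform even though the movies are identical:

```python
import random

random.seed(3)

# Hypothetical setup: 5 movies of equal quality; the recommender only
# suggests movies in proportion to how often they were clicked before.
n_movies = 5
clicks = [1] * n_movies  # start with one click each (uniform prior)

for _ in range(1000):
    # Recommend proportionally to past clicks, i.e. train on logged data.
    total = sum(clicks)
    r = random.uniform(0, total)
    movie, acc = 0, clicks[0]
    while r > acc:
        movie += 1
        acc += clicks[movie]
    # Every recommendation is clicked with the same probability (all
    # movies are equally good), so any inequality comes from the loop.
    if random.random() < 0.5:
        clicks[movie] += 1

# Share of logged clicks per movie; under unbiased logging each movie
# would hover around 1/5.
shares = [c / sum(clicks) for c in clicks]
```

The distribution of `shares` reflects the recommendation history, not the (identical) quality of the movies, which is exactly the kind of bias that invalidates naive training on logged data.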

Given the existence of these feedback loops, it’s important to ensure that your model performs well not only in an offline test, but also in a production environment. Often this is hard to simulate, as you might not be able to anticipate the reaction of your users to deploying an algorithm. In this case, using A/B testing might be a way to evaluate your system more rigorously.

A/B testing
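As a hedged sketch of how the outcome of an A/B test could be analyzed, here is a simple two-proportion z-test using the normal approximation (the conversion numbers are hypothetical, and real experiments need more care with experiment design and multiple testing):

```python
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for an A/B experiment (normal approximation).

    conv_a / conv_b: number of conversions in variants A and B,
    n_a / n_b: number of users shown each variant.
    Returns the z statistic and a two-sided p-value.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis "A and B are equal".
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via math.erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: the new model (B) converts 5.5% of 10,000
# users, vs. 5.0% of 10,000 users for the old system (A).
z, p_value = ab_test(conv_a=500, n_a=10_000, conv_b=550, n_b=10_000)
# Despite the apparent lift, this difference is not significant at the
# 5% level, so it would be premature to declare the new model better.
```

This illustrates a common outcome: a seemingly meaningful lift in a surrogate metric can be well within the noise, which is why sample sizes and significance need to be considered before acting on A/B test results.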

A particularly nefarious example of this feedback loop has been observed (TODO citation) in what is known as predictive policing . The premise of predictive policing is to send police patrols to neighborhoods where they expect to observe crime, at times that they expect to observe crime. However, if police are sent to a neighborhood, they are likely to find criminal activity there (even if it might be minor); and clearly they will not find criminal activity in neighborhoods they did not patrol. Historically, police patrols in certain US cities have focused on non-white neighborhoods, and given this historical data, predictive policing methods steered patrols to these same neighborhoods. This then led them to observe more crime there, leading to more data showing crime in these neighborhoods, leading to more patrols being sent there, and so on.

Metrics and evaluation ¶

One of the most important parts of machine learning is defining the goal, and defining a way to measure progress toward that goal. The first part of this is having the right data for evaluating your model: data that reflects the way the model will be used in production. Equally important is establishing a measure of impact for your task. Usually your application is driven by some ultimate goal, such as user engagement, revenue, keeping patients healthy, or any number of possible motivations. The question is how your machine learning solution will impact this goal. It’s important to note that the goal is rarely, if ever, to make accurate predictions. It’s not some evaluation metric that counts, but the real-world impact of the decisions made by your model.

\(^3\) Here and in the following I will talk of your goals in terms of a business, however, if your goal is scientific discovery, health or justice, the same principles apply.

There are some common hurdles in measuring the impact of your model. Often, the effect on the bottom line is only very indirect. If you’re removing fake news from your social media platform, this will not directly increase your ad revenue, and removing a particular fake news article will probably have no measurable impact. However, curating your platform will help maintain a brand image and might drive users to your platform, which in turn will create more revenue. But this effect is likely to be mixed in with many other effects, and much delayed in time. So data scientists often rely on surrogate metrics, measures that relate to intermediate business goals \(^3\) and that can be measured more directly, such as user engagement or click-through rate.

The problem with such surrogate metrics is that they might not capture what you assume they capture. I heard an (if not true, then at least illustrative) anecdote about optimizing the placement of an ad on a shopping website. An optimization algorithm placed it right next to the search button, in the same color as the search button, which resulted in the most clicks. However, when analyzing the results more closely, the team found that the clicks were caused by users missing the search button and accidentally clicking the ad, resulting not in any sales, but in irritated users who had to go back and search again.

There is usually a hierarchy of measurements, from the accuracy of a model on an offline holdout dataset, which is easy to calculate but can be misleading in several ways, to more business-specific metrics that can be evaluated on an online system, to the actual business goal. Moving from evaluating just the model to the whole process, and then to how the process integrates into your business, makes evaluation more complex and more risky. Usually, evaluation on all levels is required if possible: if a model does well in offline tests, it can be tried in a user study. If the user study is promising, it can be deployed more widely, and potentially an outcome on the actual objective can be observed. However, often we have to be satisfied with surrogate metrics at some level, as it’s unlikely that each model will have a measurable impact on the bottom line of a complex product.

One aspect that I find is often overlooked by junior data scientists is establishing a baseline. If you are employing machine learning in any part of your process, you should have a baseline of not employing machine learning. What are your gains if you do? What if you replace your deep neural network with the simplest heuristic that you can come up with? How will it affect your users? There are cases in which the difference between 62% accuracy and 63% accuracy can have a big impact on the bottom line, but more often than not, small improvements in the model will not drastically alter the overall process and result.
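As a small illustration of why the baseline matters (with made-up labels and predictions), a majority-class baseline can be computed in a few lines, and it puts a model's accuracy in perspective:

```python
from collections import Counter

# Hypothetical labels on a held-out set, plus predictions from some model.
y_true = [0] * 80 + [1] * 20
y_model = [0] * 78 + [1, 1] + [0] * 10 + [1] * 10  # a mediocre model

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# The simplest possible baseline: always predict the most common class.
majority = Counter(y_true).most_common(1)[0][0]
y_baseline = [majority] * len(y_true)

acc_baseline = accuracy(y_true, y_baseline)  # 0.80 without any learning
acc_model = accuracy(y_true, y_model)        # 0.88 for the model
```

An "88% accurate" model sounds impressive in isolation, but next to an 80% trivial baseline the actual gain is far more modest, and whether those extra eight points justify the development and maintenance cost is a business question, not a modeling one.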

When developing any machine learning solution, always keep in mind how your model will fit into the overall process, what consequences your predictions have, and how to measure the overall impact of the decisions made by the model.

When to use and not to use machine learning ¶

As you might be able to tell from the existence of this book, I'm excited about machine learning and an avid advocate. However, I think it is crucial not to fall victim to hype, and to carefully consider whether a particular situation calls for a machine learning solution. Many machine learning practitioners get caught up in the (fascinating) details of algorithms and datasets, but lose perspective of the bigger picture. To the data scientist with a machine learning hammer, too often everything looks like a classification nail. In general, I would recommend restricting yourself to supervised learning in most practical settings; in other words, if you do not have a training dataset for which you know the outcome, it will be very hard to create an effective solution. As mentioned before, machine learning will be most useful for making individual-level predictions, not for inference. I also already laid out some prerequisites for using supervised learning in the respective section above. Let's assume all of these criteria are met and you carefully chose your business-relevant metrics. This still doesn't mean machine learning is the right solution for your problem. There are several aspects that need to be balanced: on the one hand, there is the positive effect a successful model can have; on the other hand, there is the cost of developing the initial solution. Is it worth your time as a data scientist to attack this problem, or are there problems where you can have a bigger impact? There is also the even greater cost of maintaining a machine learning solution in a production environment [SHG+14] . Machine learning models are often opaque and hard to maintain. The exact behavior depends on the training data, and if the data changes (maybe the trends on social media change, the political climate changes, or a new competitor appears), the model needs to be adjusted.
A model might also make unexpected predictions, potentially leading to costly errors or annoyed customers. All of these issues need to be weighed against the potential benefits of using a model.

Your default should be not to use machine learning, unless you can demonstrate that your solution improves the overall process and impacts relevant business goals, while being robust to possible changes in the data and potentially even to adversarial behavior. Try hard to come up with heuristics that outperform any model you develop, and always compare your model to the simplest approach and the simplest model that you can think of. And keep in mind: don't evaluate these via model accuracy; evaluate them on something relevant to your process.

Ethical aspects of machine learning ¶

One aspect of machine learning that has only recently been getting significant attention is ethics. The field of ethics in technology is quite broad, and machine learning and data science raise many of the same questions that are associated with any use of technology. However, there are many situations where machine learning quite directly impacts individuals, for example when hiring decisions, credit approvals or even risk assessments in the criminal justice system [BHJ+18] are powered by machine learning. Given the complexity of machine learning algorithms, the intricate dependencies on the training data, and the potential for feedback loops, it is often hard to assess the impact that deploying an algorithm or model can have on individuals. However, that by no means relieves data scientists of the responsibility to investigate potential issues of bias and discrimination in machine learning. There is a growing community that investigates fairness, accountability and transparency in machine learning and data science, providing tools to detect and address issues in algorithms and datasets. On the other hand, there are some who question algorithmic solutions to ethical issues, and ask for a broader perspective on the impact of data science and machine learning on society [KHD19] [FL20] . This is a complex topic, and so far there is little consensus on best practices and concrete steps. However, most researchers and practitioners agree that fairness, accountability and transparency are essential principles for the future of machine learning in society. While approaches to fair machine learning are beyond the scope of this book, I want to encourage you to keep the issues of bias and discrimination in mind. Real-world examples, such as the use of predictive policing, racial discrimination in criminal risk assessment, or gender discrimination in ML-driven hiring, unfortunately abound. If your application involves humans in any capacity (and most do), make sure to pay special attention to these topics, and research best practices for evaluating your process, data, and modeling.

Scaling up ¶

This book focuses on using Python and scikit-learn for machine learning. One of the main limitations of scikit-learn that I'm often asked about is that it is usually restricted to a single machine, not a cluster (though there are some ways around this in some cases). My standpoint is that for most applications, using a single machine is enough, easier, and potentially even faster [RND+12] . Once your data is processed to a point where you can start your analysis, few applications require more than at most several gigabytes of data, and many applications only require megabytes. These workloads are easily handled on modern machines: even if your machine does not have enough memory, it's quick and cheap to rent machines with hundreds of GB of RAM from a cloud provider and do all your machine learning in memory on a single machine. If this is possible, I would encourage you to go with this solution. The interactivity and simplicity that come from working on a single machine are hard to beat, and the number of libraries available is far greater for local computation. Even if your raw data, say user logs, runs to many terabytes, the data you extract for machine learning may be only in the hundreds of megabytes, so after preparing the data in a distributed environment such as Spark, you can transition to a single machine for your machine learning workflow. Clearly there are situations when a single machine is not enough; large tech companies often use their own bespoke in-house systems to learn models on immense data streams. However, most projects don't operate on the scale of the Facebook timeline or Google Search, and even if your production environment requires truly large amounts of data, prototyping on a subset on a single machine can be helpful for a quick exploratory analysis or a prototype. Avoid premature optimization and start small, where small these days might mean hundreds of gigabytes.

Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: the state of the art. Sociological Methods & Research , 2018.

Sina Fazelpour and Zachary C Lipton. Algorithmic fairness from a non-ideal perspective. In AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES) 2020 . 2020.

Os Keyes, Jevan Hutson, and Meredith Durbin. A mulching proposal: analysing and improving an algorithmic system for turning the elderly into high-nutrient slurry. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing systems , 1–11. 2019.

Antony Rowstron, Dushyanth Narayanan, Austin Donnelly, Greg O’Shea, and Andrew Douglas. Nobody ever got fired for using hadoop on a cluster. In Proceedings of the 1st International Workshop on Hot Topics in Cloud Data Processing , 1–5. 2012.

David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine learning: the high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop) . 2014.

Prediction Assignment Writeup

Alaettin Serhan Mete

This document summarizes the work done for the Prediction Assignment Writeup project for the Coursera Practical Machine Learning course. It was created using the knitr package in RStudio from the actual analysis code. The repository for this work can be found at .

Using devices such as Jawbone Up , Nike FuelBand , and Fitbit , it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it . In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: (see the section on the Weight Lifting Exercise Dataset).

The training data for this project are available here:

The test data are available here:

First, we begin by acquiring the data. One can simply download the training and testing datasets using:

Then, we load some useful packages using:

Now let's load the data into memory:

Now that we have the data in memory, let's get to the fun part. The first thing we should do is split the training data into two parts. We'll use 70% of this data to actually train our model and the remaining 30% to validate it:
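The original analysis uses R's caret package; as a rough sketch of the same idea in Python (synthetic stand-in data, scikit-learn instead of caret), a 70/30 stratified split looks like:

```python
# Sketch of a 70/30 train/validation split (synthetic stand-in data).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # placeholder feature matrix
y = np.arange(50) % 5               # placeholder labels, 5 classes

X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42)
print(len(X_train), len(X_val))  # 35 15
```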

Next, we need to prepare the data for modeling. If you look at the training data, you'll see that there are a number of variables that have either no variance or a large fraction of missing values. These will not help us in any meaningful way. Therefore, let's clean them up before modeling:
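In pandas terms, the cleanup amounts to dropping mostly-missing and zero-variance columns. This sketch uses made-up column names and thresholds, not the actual sensor data:

```python
# Illustrative cleanup: drop columns that are mostly missing or constant.
# The column names and thresholds are made up for this sketch.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "useful":   [1.0, 2.0, 3.0, 4.0],
    "constant": [7.0, 7.0, 7.0, 7.0],           # zero variance
    "sparse":   [np.nan, np.nan, np.nan, 1.0],  # 75% missing
})

df = df.loc[:, df.isna().mean() <= 0.5]  # drop mostly-missing columns
df = df.loc[:, df.nunique() > 1]         # drop zero-variance columns
print(list(df.columns))  # ['useful']
```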

At this point, we have clean data that we can use for building models. We'll build two models: a random forest and a generalized boosted model . We'll train them on the training portion of the original training dataset and then test them on the validation portion of the original training dataset:

Then let's see how well these two models perform at predicting the values in the validation dataset. This can be easily accomplished by predicting the values in the validation set and then comparing the predictions with the actual values.
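The scikit-learn analogue of this fit-and-compare step looks as follows. This is a sketch on synthetic data; the original analysis uses R's randomForest and gbm packages via caret:

```python
# Sketch: fit a random forest and a gradient boosted model, then compare
# their accuracy on a held-out validation set (synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7,
                                                  random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

rf_acc = accuracy_score(y_val, rf.predict(X_val))
gbm_acc = accuracy_score(y_val, gbm.predict(X_val))
print(f"RF: {rf_acc:.3f}, GBM: {gbm_acc:.3f}")
```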

We can investigate our generalized boosted model a bit further to see which variables have the highest relative influence:


The above list shows the ranking of variables in our GBM. We see that num_window , roll_belt , and pitch_forearm are the most influential ones. We can check out a few plots demonstrating their power:


At this point we see that the random forest has marginally better performance (accuracy: 0.998) than the generalized boosted model (accuracy: 0.9876). We could go with either one, or ensemble them, but that might be overkill at this point. In any case, they yield the same result. Let's test our model on the actual testing dataset:

On the expected out of sample error

Please note that since the random forest method is chosen, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. This is explained as: "In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows: Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree. Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven to be unbiased in many tests." The reader can find more information at:
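The out-of-bag (OOB) estimate described in the quote is also exposed by scikit-learn's random forest; a minimal sketch on synthetic data (not the original R analysis):

```python
# Sketch: the OOB error is computed from the ~1/3 of samples each tree
# never saw during bootstrapping, so no separate validation set is
# required for this particular estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
oob_error = 1.0 - rf.oob_score_
print(f"OOB error estimate: {oob_error:.3f}")
```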

Machine Learning DS-GA 1003 · Spring 2019 · NYU Center for Data Science

About this course

This course covers a wide variety of topics in machine learning and statistical modeling. While mathematical methods and theoretical aspects will be covered, the primary goal is to provide students with the tools and principles needed to solve the data science problems found in practice. This course also serves as a foundation on which more specialized courses and further independent study can build. A tentative syllabus can be found here .

This course was designed as part of the core curriculum for the Center for Data Science's Masters degree in Data Science . Other interested students who satisfy the prerequisites are welcome to take the class as well. This class is intended as a continuation of DS-GA-1001 Intro to Data Science , which covers some important, fundamental data science topics that may not be explicitly covered in this DS-GA class (e.g. data cleaning, cross-validation, and sampling bias).

We will use Piazza for class discussion. Rather than emailing questions to the teaching staff, please post your questions on Piazza , where they will be answered by the instructor, TAs, graders, and other students. For questions that are not specific to the class, you are also encouraged to post to Stack Overflow for programming questions and Cross Validated for statistics and machine learning questions. Please also post a link to these postings in Piazza, so others in the class can answer the questions and benefit from the answers. An anonymized version of our Piazza board is also available.

Other information:

  • Course details can be found in the syllabus .
  • The Course Calendar contains all class meeting dates.
  • All course materials are stored in a GitHub repository . Check the repository to see when something was last updated.
  • For registration information, please contact Kathryn Angeles .
  • The course conforms to NYU’s policy on academic integrity for students .


  • DS-GA-1001: Intro to Data Science or its equivalent
  • DS-GA-1002: Statistical and Mathematical Methods or its equivalent
  • Solid mathematical background , equivalent to a 1-semester undergraduate course in each of the following: linear algebra, multivariate calculus (primarily differential calculus), probability theory, and statistics. (The coverage in the 2015 version of DS-GA 1002, linked above, is sufficient.)
  • Python programming required for most homework assignments.
  • Recommended: Computer science background up to a "data structures and algorithms" course
  • Recommended: At least one advanced, proof-based mathematics course
  • Some prerequisites may be waived with permission of the instructor
  • You can also self-assess your preparation by filling out the Prerequisite Questionnaire

Homework (40%) + Midterm Exam (30%) + Final Exam (30%)

Many homework assignments will have problems designated as “optional”. At the end of the semester, strong performance on these problems may lift the final course grade by up to half a letter grade (e.g. B+ to A- or A- to A), especially for borderline grades. You should view the optional problems primarily as a way to engage with more material, if you have the time. Along with the performance on optional problems, we will also consider significant contributions to Piazza and in-class discussions for boosting a borderline grade.

Important Dates

  • Midterm Exam (100 min) Tuesday, March 12th, 5:20–7pm.
  • Final Exam (100 min) Thursday, May 16th, 6-7:50pm (confirmed).
  • See Assignments section for homework-related deadlines.

The cover of Elements of Statistical Learning

Other tutorials and references


Late Policy: Homeworks are due at 11:59 PM on the date specified. Homeworks will still be accepted for 48 hours after this time but will have a 20% penalty.

Collaboration Policy: You may discuss problems with your classmates. However, you must write up the homework solutions and the code from scratch, without referring to notes from your joint session. In your solution to each problem, you must write down the names of any person with whom you discussed the problem—this will not affect your grade.

Homework Submission: Homework should be submitted through Gradescope . If you have not used Gradescope before, please watch this short video: "For students: submitting homework." At the beginning of the semester, you will be added to the Gradescope class roster. This will give you access to the course page, and the assignment submission form. To submit assignments, you will need to:

  • Upload a single PDF document containing all the math, code, plots, and exposition required for each problem.
  • Where homework assignments are divided into sections, please begin each section on a new page .
  • You will then select the appropriate page ranges for each homework problem, as described in the "submitting homework" video.

Homework Feedback: Check Gradescope to get your scores on each individual problem, as well as comments on your answers. Since Gradescope cannot distinguish between required and optional problems, final homework scores, separated into required and optional parts, will be posted on NYUClasses .

Typesetting your homework

Due: January 1st, 11:59 PM

Due: February 9th, 11:59 PM

Due: February 18th, 11:59 PM

Due: February 25th, 11:59 PM

Due: March 8th, 11:59 PM

Due: April 5th, 11:59 PM

Due: April 29th, 11:59 PM

Due: May 10th, 11:59 PM


A photo of Julia Kempe

Julia Kempe

[email protected]

Julia is the Director of the NYU Center for Data Science (CDS). She is a professor of Computer Science and Mathematics at CDS and the NYU Courant Institute.

A photo of David Rosenberg

David Rosenberg

[email protected]

David is a data scientist in the office of the CTO at Bloomberg L.P. Formerly he was Chief Scientist of YP Mobile Labs at YP .

Section Leaders

A photo of Sreyas Mohan

Sreyas Mohan (Head TA)

[email protected]

Sreyas is a second year PhD student in the Data Science Program at CDS working with Prof. Carlos Fernandez-Granda and Prof. Eero Simoncelli.

A photo of Xintian Han

Xintian Han

[email protected]

Xintian is a second year PhD student in the Data Science Program at CDS working with Prof. Rajesh Ranganath.

A photo of Sanyam Kapur

Sanyam Kapur (Head Grader)

[email protected]

Sanyam is a Masters Student in Computer Science at NYU Courant. He is currently working towards improving Markov Chain Monte Carlo methods.

A photo of Aakash Kaku

Aakash Kaku

Aakash is a second-year Masters student in the Data Science program at NYU. He is interested in solving problems in the healthcare domain using machine learning.

A photo of Mingsi Long

Mingsi Long

Mingsi is a second year student in the Data Science Program at NYU CDS.

A photo of Mihir Rana

Mihir is a Master's student in Data Science at the NYU Center for Data Science, interested in computer vision, reinforcement learning, and natural language understanding.

A photo of Tingyan Xiang

Tingyan Xiang

Tingyan is a second-year Masters student in the Data Science program at NYU.

A photo of Yi Zhou

Yi is a second year student at the CS department at NYU Tandon.


MGMT 4190/6560 Introduction to Machine Learning Applications @Rensselaer

Interact on Colab

Assignment 1 ¶

Before you start working on this assignment, please click File -> Save a copy in Drive ¶

Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). You can speak with others regarding the assignment but all work must be your own.

This is a 30 point assignment. ¶

Before you begin ¶

Please work through each of these notebooks, which will give you some understanding of the Google Colab environment.

Working with Notebooks in Colaboratory ¶

Overview of Colaboratory

Guide to Markdown

Importing libraries and installing dependencies

Saving and loading notebooks in GitHub

Working with Data ¶

Some of this is a bit more advanced, but at this point just make sure you know where the code is for how to upload and download a file.

Loading data: Drive, Sheets, and Google Cloud Storage

Run these Cells ¶

This will setup the automated testing environment on Colab

Question 1.

In the next cell:

a. Assign the value for x to 150

b. Set the value for y to 13 times x

c. Set the value for z to y divided by x squared.
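A possible solution sketch for these three steps (variable names per the prompt):

```python
# Question 1, one possible answer.
x = 150          # a. assign x the value 150
y = 13 * x       # b. y is 13 times x
z = y / x ** 2   # c. z is y divided by x squared
print(x, y, z)
```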

Question 2.

Packages are a really important component of most programming languages.

In the overview, you learned about tab completion as a way to explore Python objects. This can be really useful. Let's use it to find the factorial of 15. Assign the result to the variable m .
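Tab-completing on the math module should lead you to math.factorial; a sketch of the expected assignment:

```python
# Tab completion on `math.` reveals math.factorial, which answers the prompt.
import math

m = math.factorial(15)
print(m)  # 1307674368000
```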

Question 3.

Markdown is a useful aspect of Jupyter Notebooks.


Double click on cell below to open it for markdown editing. There is no test for this question.

Header For the above header, make it an h1 tag using markdown.

Sub-Header For the above sub-header, make it an h5 tag using markdown. (Embed this image)

Question 4.

Installing Packages ¶

Python packages are an important part of data science and critical to leveraging the broader Python ecosystem.

You typically have two options when installing a package. You can install it with Conda or pip .

The ! in a Jupyter notebook means that the line is processed on the command line and not by the Python interpreter.

If you try to import something and get an error, it is usually a sign that you need to install a package.

Install the fastparquet Package to be able to work with Parquet Files ¶

CSV (comma-delimited) files are great for humans to read and understand.

For “big data” though, it isn’t a great long term storage option (inefficient/slow).

Parquet is a columnar storage format. It makes dealing with lots of columns fast.

fastparquet is a Python package for dealing with Parquet files.

Apache Spark also natively reads Parquet Files.

Look here for instructions on installing the fastparquet package.
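Assuming the linked instructions boil down to a pip install (the usual case on Colab), the cell would look like this; recall that the leading ! hands the line to the shell rather than the Python interpreter:

```shell
# In a notebook cell (Colab/Jupyter):
#   !pip install fastparquet
# The same command in a plain terminal drops the "!":
pip install fastparquet
```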

Show All Columns in a Pandas Dataframe ¶

Notice there is a ... which indicates you are only seeing some of the columns, and the output has been truncated.

Read this article and find how to show all the columns of a pandas dataframe.
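The usual answer (an assumption about what the linked article recommends) is to raise pandas' display limit:

```python
# Tell pandas not to truncate the column display (None = no limit).
import pandas as pd

pd.set_option("display.max_columns", None)
```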

Question 5.

Importing CSV into a Pandas Dataframe ¶

Comma delimited files are a common way of transmitting data.

Data for different columns is separated by a comma.

It is possible to open a CSV in different ways, but Pandas is the easiest.

Data structured like CSVs is extremely common and known as tabular data.

Pandas will give access to many useful methods for working with data.

pandas is often imported as the abbreviated pd .

You can also get help by using a ? after the method call. For example, to find the docstring for the read_csv function you could execute:

pd.read_csv? or help(pd.read_csv)


Get CSVs from the Web/Github. ¶

You can also get a CSV directly from a web url.

View this file in your web browser. You won’t be able to load this into pandas.

To get the link you can load, you need to click on the raw button. That should lead to this url:


It should run without errors.

Click File -> Download .ipynb to download the assignment. Then Upload it to Assignment 1 in the LMS. ¶

This work is licensed under the Creative Commons Attribution 4.0 International license agreement.


Jump to: [Homeworks] [Projects] [Quizzes] [Exams]

There will be one homework (HW) for each topical unit of the course. Due about a week after we finish that unit.

These are intended to build your conceptual analysis skills plus your implementation skills in Python.

  • HW0 : Numerical Programming Fundamentals
  • HW1 : Regression, Cross-Validation, and Regularization
  • HW2 : Evaluating Binary Classifiers and Implementing Logistic Regression
  • HW3 : Neural Networks and Stochastic Gradient Descent
  • HW4 : Trees
  • HW5 : Kernel Methods and PCA

After completing each unit, there will be a 20 minute quiz (taken online via gradescope).

Each quiz will be designed to assess your conceptual understanding about each unit.

Probably 10 questions. Most questions will be true/false or multiple choice, with perhaps 1-3 short answer questions.

You can view the conceptual questions in each unit's in-class demos/labs and homework as good practice for the corresponding quiz.

There will be three larger "projects" throughout the semester:

  • Project A: Classifying Images with Feature Transformations
  • Project B: Classifying Sentiment from Text Reviews
  • Project C: Recommendation Systems for Movies

Projects are meant to be open-ended and encourage creativity. They are meant to be case studies of applications of the ML concepts from class to three "real world" use cases: image classification, text classification, and recommendations of movies to users.

Each project will be due approximately 4 weeks after being handed out. Start early! Do not wait until the last few days.

Projects will generally be centered around a particular methodology for solving a specific task and involve significant programming (with some combination of developing core methods from scratch or using existing libraries). You will need to consider some conceptual issues, write a program to solve the task, and evaluate your program through experiments to compare the performance of different algorithms and methods.

Your main deliverable will be a short report (2-4 pages), describing your approach and providing several figures/tables to explain your results to the reader.

You’ll be assessed on effort, the sophistication of your technical approach, the clarity of your explanations, the evidence that you present to support your evaluative claims, and the performance of your implementation. A high-performing approach with little explanation will receive little credit, while a careful set of experiments that illuminate why a particular direction turned out to be a dead end may receive close to full credit.

UvA - Machine Learning 1

Lectures and slides for the UvA Master AI course Machine Learning 1.

Welcome to the public page for the course Machine Learning 1. The course is part of the Artificial Intelligence master's program at the University of Amsterdam . The course is developed by the Amsterdam Machine Learning Lab and currently taught by Erik Bekkers .

This page presents an overview of the course including links to the lectures ( this Youtube channel ) and the corresponding annotated slides in pdf. Students enrolled for the course are referred to Canvas for extra materials such as practice exercises, homework assignments, lab assignments (jupyter notebooks) and additional resources.

In this lecture series we closely follow the Pattern Recognition and Machine Learning book by Bishop. Relevant chapters are indicated at the start of each video.

The contents of this page and the video lectures are licensed under a Creative Commons Attribution 4.0 International License .


Weekly overview

  • Lecture 1.1 ( video , pdf ): Introduction to the course, administrative announcements
  • Lecture 1.2 ( video , pdf ): What is Machine Learning?
  • Lecture 1.3 ( video , pdf ): Types of Machine Learning
  • Lecture 1.4 ( video , pdf ): Probability Theory, Bayes Theorem

Lecture 1.5 ( video , pdf ): Probability Theory: An Example

  • Lecture 2.1 ( video , pdf ): Expectation, Variance, Covariance
  • Lecture 2.2 ( video , pdf ): Gaussian Distribution
  • Lecture 2.3 ( video , pdf ): Maximum Likelihood Estimation
  • Lecture 2.4 ( video , pdf ): Maximum Likelihood Estimation: An Example
  • Lecture 2.5 ( video , pdf ): Maximum A Posteriori
  • Lecture 2.6 ( video , pdf ): Bayesian Prediction
  • Lecture 3.1 ( video , pdf ): Linear Regression With Basis Functions
  • Lecture 3.2 ( video , pdf ): Linear Regression via Maximum Likelihood
  • Lecture 3.3 ( video , pdf ): Stochastic Gradient Descent
  • Lecture 3.4 ( video , pdf ): Underfitting and Overfitting

Lecture 3.5 ( video , pdf ): Regularized Least Squares

  • Lecture 4.1 ( video , pdf ): Model Selection
  • Lecture 4.2 ( video , pdf ): Bias Variance Decomposition
  • Lecture 4.3 ( video , pdf ): Gaussian Posteriors
  • Lecture 4.4 ( video , pdf ): Sequential Bayesian Learning
  • Lecture 4.5 ( video , pdf ): Bayesian Predictive Distributions
  • Lecture 5.1 ( video , pdf ): Equivalent Kernel
  • Lecture 5.2 ( video , pdf ): Bayesian Model Comparison
  • Lecture 5.3 ( video , pdf ): Model Evidence Approximation/Empirical Bayes
  • Lecture 5.4 ( video , pdf ): Classification With Decision Regions
  • Lecture 5.5 ( video , pdf ): Decision Theory

Lecture 5.6 ( video , pdf ): Probabilistic Generative Models

  • Lecture 6.1 ( video , pdf ): Probabilistic Generative Models: Maximum Likelihood
  • Lecture 6.2 ( video , pdf ): Probabilistic Generative Models: Discrete Data (Naive Bayes)
  • Lecture 6.3 ( video , pdf ): Discriminant Functions
  • Lecture 6.4 ( video , pdf ): Discriminant Functions: Least Squares Regression
  • Lecture 6.5 ( video , pdf ): Discriminant Functions: Perceptron
  • Lecture 7.1 ( video , pdf ): Classification With Basis Functions
  • Lecture 7.2 ( video , pdf ): Probabilistic Discriminative Models: Logistic Regression
  • Lecture 7.3 ( video , pdf ): Logistic Regression: Stochastic Gradient Descent

Lecture 7.4 ( video , pdf ): Logistic Regression: Newton Raphson

  • Lecture 8.1 ( video , pdf ): Neural Networks
  • Lecture 8.2 ( video , pdf ): Neural Networks: Universal Approximation Theorem
  • Lecture 8.3 ( video , pdf ): Neural Networks: Losses
  • Lecture 8.4 ( video , pdf ): Neural Networks: Stochastic Gradient Descent
  • Lecture 8.5 ( video , pdf ): Neural Networks: Backpropagation
  • Lecture 9.1 ( video , pdf ): Unsupervised Learning - Latent Variable Models
  • Lecture 9.2 ( video , pdf ): K-Means Clustering
  • Lecture 9.3 ( video , pdf ): Lagrange Multipliers

Lecture 9.4 ( video , pdf ): Gaussian Mixture Models and The Expectation Maximization Algorithm

  • Lecture 10.1 ( video , pdf ): Principal Component Analysis: Maximum Variance
  • Lecture 10.2 ( video , pdf ): Principal Component Analysis: Minimum Reconstruction Error
  • Lecture 10.3 ( video , pdf ): Probabilistic Principal Component Analysis
  • Lecture 10.4 ( video , pdf ): Non-Linear Principal Component Analysis (Kernel-PCA and Auto-Encoders)
  • Lecture 11.1 ( video , pdf ): Kernelizing Linear Models
  • Lecture 11.2 ( video , pdf ): The Kernel Trick
  • Lecture 11.3 ( video , pdf ): Support Vector Machines: Maximum Margin Classifiers
  • Lecture 11.4 ( video , pdf ): Inequality Constraint Optimization (The Dual Lagrangian)
  • Lecture 11.5 ( video , pdf ): Support Vector Machines: Kernel SVM

Lecture 11.6 ( video , pdf ): Support Vector Machines: Soft Margin Classifiers

  • Lecture 12.1 ( video , pdf ): Some Useful Properties of Gaussians
  • Lecture 12.2 ( video , pdf ): Kernelizing Bayesian Regression
  • Lecture 12.3 ( video , pdf ): Gaussian Processes
  • Lecture 12.4 ( video , pdf ): Gaussian Processes: With An Exponential Kernel
  • Lecture 12.5 ( video , pdf ): Gaussian Processes: Regression
  • Lecture 13.1 ( video , pdf ): Model Combination Methods (vs Bayesian Model Averaging)
  • Lecture 13.2 ( video , pdf ): Bootstrapping and Feature Bagging
  • Lecture 13.3 ( video , pdf ): AdaBoost
  • Lecture 13.4 ( video , pdf ): Decision Trees and Random Forests
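The homework assignments pair these lectures with implementation challenges in Python. As a flavor of the algorithms covered, here is a minimal, dependency-free sketch of k-means clustering (Lecture 9.2); the function and helper names are our own, not the course's support code:

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize with k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid's cluster.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if its cluster emptied out).
        new_centroids = [
            centroid([p for p, l in zip(points, labels) if l == j])
            if j in labels else centroids[j]
            for j in range(k)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, labels
```

For example, kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2) separates the two obvious clusters. Production implementations (e.g. scikit-learn's KMeans) add smarter initialization such as k-means++ and multiple restarts.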


Each task should have its own report and IPython Notebook. Once again, we emphasize the report: it should contain all your questions and your properly supported statistical answers. Use figures, pictures, and tables. DO NOT PUT ANY CODE IN THE REPORT.

Table of contents

  • Assignment 1
  • Assignment 2
  • Assignment 3
  • Final Project


What’s New in vcpkg (March 2024)


Augustin Popa

March 29th, 2024

This blog post summarizes changes to the vcpkg package manager as part of the 2024.03.19 and 2024.03.25 releases as well as changes to vcpkg documentation throughout March. This month’s vcpkg release includes an arm64ec platform expression, more flexibility when mixing static and dynamic libraries, diagnostics improvements, a change in the binary caching ABI calculation, and bug fixes.

Some stats for this period:

  • There are now 2,415 total libraries available in the vcpkg public registry.
  • 21 new ports were added to the open-source registry. A port is a versioned recipe for building a package from source, such as a C or C++ library.
  • 405 updates were made to existing ports. As always, we validate each change to a port by building all the other ports that depend on, or are depended on by, the library being updated, for our 13 main triplets .
  • 32 contributors submitted PRs, issues, or participated in discussions in the main repo.
  • The main vcpkg repo has over 5,900 forks and 21,200 stars on GitHub.
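To make "port" concrete: each port carries a vcpkg.json manifest alongside its build recipe. A hypothetical manifest (the name, version, and description below are invented for illustration; fmt and zlib are real ports in the registry) might look like:

```json
{
  "name": "example-lib",
  "version": "1.2.3",
  "description": "A hypothetical C++ library packaged for vcpkg.",
  "dependencies": [
    "fmt",
    { "name": "zlib", "platform": "!uwp" }
  ]
}
```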

vcpkg changelog (2024.03.19 + 2024.03.25 releases)

Highlights from these releases, including several improvements to diagnostics output:

  • Added an arm64ec platform expression to assist with creating triplets targeting arm64ec (PR: Microsoft/vcpkg-tool#1352 , thanks @autoantwort!).
  • Removed the warning about the default triplet change that was implemented in a previous release (PR: Microsoft/vcpkg-tool#1351 ).
  • DLLs can now be used in static triplets if BuildPolicy::DLLS_IN_STATIC_LIBRARY is turned on; those DLLs still have their CRT linkage and architecture checked. Static libraries can also now be used in dynamic triplets if their CRT linkage matches (PR: Microsoft/vcpkg-tool#1358 ).
  • New behavior: vcpkg will output a message with the path to the compiler it is using in certain cases where there is ambiguity about the compiler version (PR: Microsoft/vcpkg-tool#1328 ).
  • Installing ports with --only-downloads will no longer affect the ABI calculation used by the binary caching feature (PR: Microsoft/vcpkg-tool#1363 ).
  • Fixed a bug that could cause a vcpkg command to exit prematurely if a ‘)’ character was present in certain path assignments (PR: Microsoft/vcpkg-tool#1354 , thanks @jcbrill!).
  • Fixed a bug preventing certain versioning schemes from working with the versioning overrides feature (PR: Microsoft/vcpkg-tool#1329 ).
  • Fixed a regression that broke binary caching for the Meson tool port (PR: Microsoft/vcpkg#37585 ).
  • Fixed a crash that could occur when running vcpkg install against a manifest with a non-existent port and version override. (PR: Microsoft/vcpkg#36994 , thanks @dforsten!).
  • Various other minor bugfixes.
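For context on the arm64ec addition: platform expressions appear in a port manifest's "supports" field and in per-dependency "platform" fields. A hypothetical port that opts out of arm64ec (port and dependency names invented) could declare:

```json
{
  "name": "example-port",
  "version": "1.0.0",
  "supports": "!arm64ec",
  "dependencies": [
    { "name": "example-dep", "platform": "!arm64ec" }
  ]
}
```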

Documentation changes

  • Understanding default features in vcpkg
  • Added Default Features section to versioning troubleshooting guide.
  • Unified some manifest mode articles and fixed links.
  • Changed CMakePresets.json sample files to use schema version 1.
  • Various minor improvements throughout the docs.
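For reference, a minimal CMakePresets.json using schema version 1 and wiring in the vcpkg toolchain (this sketch assumes a VCPKG_ROOT environment variable pointing at your vcpkg checkout) could look like:

```json
{
  "version": 1,
  "cmakeMinimumRequired": { "major": 3, "minor": 19, "patch": 0 },
  "configurePresets": [
    {
      "name": "vcpkg",
      "generator": "Ninja",
      "binaryDir": "${sourceDir}/build",
      "cacheVariables": {
        "CMAKE_TOOLCHAIN_FILE": "$env{VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake"
      }
    }
  ]
}
```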

If you have any suggestions for our documentation, please submit an issue in our GitHub repo or use the button “This page” at the bottom of a particular article.


Total ports available for tested triplets

While vcpkg supports a much larger variety of target platforms and architectures (as community triplets ), the list above is validated exhaustively to ensure updated ports don’t break other ports in the catalog.

Thank you to our contributors

vcpkg couldn’t be where it is today without contributions from our open-source community. Thank you for your continued support! The following people contributed to the vcpkg , vcpkg-tool , or vcpkg-docs repos in this release:

  • jiayuehua (84 commits)
  • dg0yt (32 commits)
  • Neumann-A (13 commits)
  • waywardmonkeys (8 commits)
  • Thomas1664 (8 commits)
  • miyanyan (7 commits)
  • SchaichAlonso (4 commits)
  • Be-ing (3 commits)
  • Osyotr (3 commits)
  • AenBleidd (3 commits)
  • RT2Code (3 commits)
  • autoantwort (3 commits)
  • moritz-h (2 commits)
  • omarhogni (2 commits)
  • xiaozhuai (2 commits)
  • coryan (2 commits)
  • Honeybunch (1 commit)
  • DragonJoker (1 commit)
  • horenmar (1 commit)
  • Tradias (1 commit)
  • fran6co (1 commit)
  • yurybura (1 commit)
  • eao197 (1 commit)
  • chausner (1 commit)
  • VelocityRa (1 commit)
  • dforsten (1 commit)
  • jcbrill (1 commit)

Is your company looking for a better C/C++ dependency management experience?

We are partnering with companies to help them get started with vcpkg and overcome any initial hurdles. We have also been making product and documentation changes based on feedback we receive from these partnerships. If you are interested in trying out vcpkg or just have some thoughts to share with us, feel free to reach out at [email protected] .

You can find the full 2024.03.19 release notes on GitHub for the main repo. Recent updates to the vcpkg tool can be viewed on the vcpkg-tool Releases page . To contribute to documentation, visit the vcpkg-docs repo . If you're new to vcpkg or curious about how a package manager can make your life easier as a C/C++ developer, check out the vcpkg website .

If you would like to contribute to vcpkg and its library catalog, or want to give us feedback on anything, check out our GitHub repo . Please report bugs or request updates to ports in our issue tracker or join more general discussion in our discussion forum .


Augustin Popa Senior Product Manager, Microsoft C++ Team


