    • 0:00:02KEVIN XU: Hey everyone!
    • 0:00:03Welcome to our introduction to ML seminar.
    • 0:00:06I'm Kevin Xu, a sophomore at the college,
    • 0:00:09studying computer science and physics.
    • 0:00:11ZAD CHIN: I'm Zad, I'm a sophomore at Harvard College,
    • 0:00:13studying computer science and maths.
    • 0:00:15KEVIN XU: And we're so excited that you could join us here today.
    • 0:00:19We know that we're in the middle of final test week,
    • 0:00:21and so everybody's just a little stressed.
    • 0:00:23So, hopefully, we can give a very interesting, and hopefully
    • 0:00:27fun presentation about what to look forward
    • 0:00:30to as you start implementing your final projects.
    • 0:00:32So just some logistics first.
    • 0:00:34We will have some set points where we'll take a few questions from the audience.
    • 0:00:39So feel free to type them into the chat as you think of them,
    • 0:00:43and somebody will probably read them to us when we pause for questions.
    • 0:00:49Yeah, other than that, let's just get straight into it.
    • 0:00:55Right so, of course, our seminar is about machine learning, or ML.
    • 0:00:59And, so, the first question, which is not so obvious,
    • 0:01:02is what is machine learning?
    • 0:01:03And, so, I'm sure a lot of you have heard about the developments that
    • 0:01:08have been made in this field, a lot about neural networks, and perhaps
    • 0:01:12reinforcement learning.
    • 0:01:13And a popular topic, of course, is game playing,
    • 0:01:17where some computers have solved games through these complicated machine
    • 0:01:22learning algorithms and neural networks.
    • 0:01:24And, so, we just want to give a quick overview of what
    • 0:01:29exactly falls under machine learning.
    • 0:01:31And this is actually a very broad category
    • 0:01:33that includes a lot of different
    • 0:01:36ideas and subfields.
    • 0:01:38And, so, there are three main ones.
    • 0:01:40There's unsupervised learning, supervised learning,
    • 0:01:42and reinforcement learning.
    • 0:01:43And a lot of the neural networks that you think about,
    • 0:01:47when you think about machine learning, and reinforcement learning,
    • 0:01:50fall under one of these categories.
    • 0:01:53But, at the end of the day, ML is about taking a lot of data
    • 0:01:57and having the computer, or an algorithm, or a program, process
    • 0:02:01the important parts of that data.
    • 0:02:04Recognize the patterns in the data, and attempt
    • 0:02:07to use that data to generalize to a bigger data
    • 0:02:09set that you might be given.
    • 0:02:11So Zad now is going to present a little bit more about what exactly
    • 0:02:15these subfields in ML are.
    • 0:02:17ZAD CHIN: So the first thing we want to know is about supervised learning,
    • 0:02:20so what is supervised learning, right?
    • 0:02:22So, given some labeled data, say for example,
    • 0:02:24we have a data set of cat and dog pictures.
    • 0:02:27So, how can a machine learn to predict the labels that
    • 0:02:30generalize to unseen data?
    • 0:02:32So what we do is we put a data set of cats and dogs
    • 0:02:35pictures, and then put it into a machine learning model.
    • 0:02:37And let the machine learning model kind of learn itself on how
    • 0:02:40to differentiate between cats and dogs.
    • 0:02:42And then we have some testing data set, we maybe have cats and dogs, or maybe
    • 0:02:45other images, and the machine would kind of like label, oh, this is a cat,
    • 0:02:49or this is a dog, or this is not a cat or a dog.
    • 0:02:52So this is kind of like a supervised machine
    • 0:02:54learning, whereby we actually label the data so the machine can learn from it.
    • 0:02:59So, the next big idea in machine learning
    • 0:03:01is also unsupervised learning, where the machine actually recognizes
    • 0:03:04the pattern itself from the data.
    • 0:03:06So like, we don't label anything, we don't label it's a dog, or a cat,
    • 0:03:10we just show all the images to the machine learning model,
    • 0:03:14and the machine will be like, oh, this feature kind of resembles a cat,
    • 0:03:17so this is a cat.
    • 0:03:18And this feature kind of resembles a dog, so it's a dog.
    • 0:03:21So that's kind of like a difference between unsupervised learning
    • 0:03:24and supervised learning.
    • 0:03:25And most of the time, unsupervised learning
    • 0:03:26includes, like, clustering, for example, on very high-dimensional ICU data.
    • 0:03:31Like who is most likely to get readmitted,
    • 0:03:33and what are the features that those people are going to be readmitted
    • 0:03:36or not?
    • 0:03:36So that's kind of like a difference between supervised and unsupervised
    • 0:03:39learning.
    • 0:03:40And next, Kevin is going to talk more about what a neural network is,
    • 0:03:43and why it is important in machine learning.
    • 0:03:46KEVIN XU: So, as Zad mentioned, a lot of the time you're dealing with data
    • 0:03:50sets that are hugely multidimensional.
    • 0:03:54And all that means is, there's a lot of different categories that you could
    • 0:03:57consider, such as in an election: the population, the demographics, how
    • 0:04:02likely the other candidates are, et cetera, et cetera.
    • 0:04:04And you just have all of these categories of data
    • 0:04:06that you need to somehow compile together and get some reasonable kind
    • 0:04:10of prediction as a result.
    • 0:04:12And this is where neural networking shines.
    • 0:04:15And if you continue researching this field,
    • 0:04:19and perhaps continue with this kind of topic,
    • 0:04:22you're probably going to see this kind of picture a lot.
    • 0:04:25And at first this looks very incomprehensible.
    • 0:04:28It's just a giant graph with a bunch of nodes and lines.
    • 0:04:31But this is actually just a very brief picture of what neural networking is.
    • 0:04:37On the left, you can consider these as input nodes, and on the right
    • 0:04:40you have output nodes, and in the middle you have hidden nodes.
    • 0:04:42But this doesn't really tell you anything about what this does.
    • 0:04:46And, so, you can think about this giant network of graphs,
    • 0:04:49or however you might imagine it, as just a black box function.
    • 0:04:53And this black box function takes in your input data,
    • 0:04:56so the different categories, your multidimensional input data,
    • 0:04:58and then it spits out the output data that
    • 0:05:01should be good enough for a prediction.
    • 0:05:03So, in the most simplified case, let's just
    • 0:05:06talk about a game, such as tic-tac-toe.
    • 0:05:09And, so, your input data just might be the state that you're currently in.
    • 0:05:13So, what squares are filled, whose turn is it, et cetera, et cetera.
    • 0:05:16And your output data might just be one single number,
    • 0:05:19a heuristic between zero and one.
    • 0:05:21And all this number tells you is how good your current board position is.
    • 0:05:26So obviously, if you can make three in a row in your next turn,
    • 0:05:29your heuristic should be very good, it should be one.
    • 0:05:32And if you're going to lose in the next turn it should be zero.
    • 0:05:34So, obviously, not every state falls under this case.
    • 0:05:39So, there's a continuous range of numbers between zero and one
    • 0:05:42that your black box function should be able to retrieve, or accurately
    • 0:05:48predict from this input data.
    • 0:05:50And this is the job of everything in the middle, right?
    • 0:05:53How do these variables relate, what function
    • 0:05:55or what functions you should apply to these variables in the middle
    • 0:05:59of processing, et cetera, et cetera.
    • 0:06:03And, so, the overall goal of neural networking
    • 0:06:06is to design this middle function, right?
    • 0:06:09How do you get the computer to find these patterns, and to make--
    • 0:06:15create this black box function in a reasonable manner.
    • 0:06:18And this is something that's really hard for humans to do,
    • 0:06:20because there's so much data.
    • 0:06:22And, so, we give it to the machine instead.
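
To make the "black box" picture a bit more concrete, here is a minimal sketch (not the presenters' code) of a tiny fixed network that maps a tic-tac-toe-style board vector to a single heuristic score between zero and one. The weights are random placeholders; real training would adjust them.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(9, 4))   # 9 input nodes -> 4 hidden nodes (weights are arbitrary here)
W2 = rng.normal(size=(4, 1))   # 4 hidden nodes -> 1 output node

def heuristic(board):
    """board: length-9 vector of -1/0/+1 for O/empty/X squares."""
    hidden = np.tanh(board @ W1)                 # hidden-layer activations
    score = 1 / (1 + np.exp(-(hidden @ W2)))     # sigmoid squashes the output into (0, 1)
    return float(score[0])

print(heuristic(np.array([1, 0, 0, 0, -1, 0, 0, 0, 0])))
```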
    • 0:06:25ZAD CHIN: So, a general outline in a machine learning
    • 0:06:27project, or machine learning research, for example,
    • 0:06:30we start with a data collection.
    • 0:06:31So there are a lot of different ways where we can collect data.
    • 0:06:34Some significant ways like finding sources of data
    • 0:06:37is like, scraping it from websites, API calls, or readily available data
    • 0:06:41sets such as Kaggle, or the UCI ML repository.
    • 0:06:47But do take note that scraping raw data, for example, from API calls,
    • 0:06:50or even websites, may take a long time to clean up
    • 0:06:53the data, especially when there are a lot of empty rows and stuff.
    • 0:06:55So the next step we do is normally data exploration.
    • 0:06:58And initial data exploration is normally very powerful,
    • 0:07:01and it gives you great insight on what features to take in,
    • 0:07:04what features to drop, whether your data is biased.
    • 0:07:07And understanding the dimension of the data is actually very important.
    • 0:07:11So the third one is, of course, choosing an ML model.
    • 0:07:14Choose a simple model to start with, say for example,
    • 0:07:17what is the type of problem you want to solve, right?
    • 0:07:19We'll go deeper into that in an example
    • 0:07:22later, but, like, some questions to consider are:
    • 0:07:25is it a regression or classification problem?
    • 0:07:27Or whether to actually use the supervised or unsupervised machine
    • 0:07:30learning model.
    • 0:07:31There are some models that you can choose from.
    • 0:07:33And the fourth one, after you choose a model,
    • 0:07:35and code it out with all the libraries, you get the result.
    • 0:07:38You want to test it out.
    • 0:07:39What is the accuracy score of your model,
    • 0:07:42what is the precision, what is the recall score of your model.
    • 0:07:45And how can we improve the model by fine tuning the parameters
    • 0:07:49or even using more sophisticated models?
    • 0:07:51Say, for example, you started with logistic regression,
    • 0:07:53maybe you can move on to neural networks.
    • 0:07:55So, I think we can take a few questions now if you guys have any.
    • 0:08:01If not, we can move on to the examples that we have.
    • 0:08:05SPEAKER: OK, there's one question, from Prateek.
    • 0:08:09"Is there any difference between data science and machine learning?"
    • 0:08:17ZAD CHIN: So--
    • 0:08:18KEVIN XU: Yeah, do you want to take this, Zad?
    • 0:08:19ZAD CHIN: Yeah, sure.
    • 0:08:20I think data science is more--
    • 0:08:22I think machine learning is under data science.
    • 0:08:24You can correct me if I'm wrong.
    • 0:08:25I think, like, data scientists somehow also use machine learning model
    • 0:08:30to help them to analyze the data.
    • 0:08:32So, basically, what it means for data science is that like,
    • 0:08:35we try to understand the pattern, or what useful information
    • 0:08:41can a huge amount of data tell us, right?
    • 0:08:43So, I feel like, machine learning model can be a very good stepping stone
    • 0:08:47into data science.
    • 0:08:48And also, there are statistics as well, like pure statistics,
    • 0:08:51such as chi-square test, which is not really machine learning but also
    • 0:08:56a part of data science.
    • 0:08:58KEVIN XU: Right, just echoing what Zad said, ML is a technique.
    • 0:09:03Or a-- you can always think of it as a methodology
    • 0:09:08to approach the study of how you can extrapolate data.
    • 0:09:12And, so, there's a lot that falls under this, that falls under machine learning.
    • 0:09:17But there are also a lot of things that you can do
    • 0:09:19that are not machine learning, right?
    • 0:09:22And, so, I hope that answers your question.
    • 0:09:25SPEAKER: OK, no other questions, you can continue.
    • 0:09:30KEVIN XU: Zad will have to [? click. ?]
    • 0:09:34ZAD CHIN: So you guys can go to this link.
    • 0:09:36If you want to make a copy of the notebook yourself and follow up when we
    • 0:09:40actually present it, you can go to this tinyurl.com/ML-notebook-1.
    • 0:09:46We will jump straight to the notebook that we can show you.
    • 0:09:49So there are a few sections, if you want, in this table of contents.
    • 0:09:53And I'll close it, so we can have a better view of [? it. ?]
    • 0:09:57So let's get started.
    • 0:09:58We talked a lot about how we can start with machine learning research.
    • 0:10:02So, in this particular notebook, we will try
    • 0:10:05to use the Iris data set, one of the most famous data sets
    • 0:10:08about machine learning.
    • 0:10:10So, in a supervised learning, we will try
    • 0:10:12to use two different models, which is k-nearest neighbor
    • 0:10:16and logistic regression.
    • 0:10:18And, the second part, we will jump into reinforcement
    • 0:10:20learning with a different example.
    • 0:10:22So, as we talked about earlier, the ML development process
    • 0:10:26includes data collection, train-test split, identifying the correct ML problem
    • 0:10:31and ML model to use, evaluating the performance of the algorithm,
    • 0:10:34and trying different models, retraining, and retesting.
    • 0:10:38So let's get started.
    • 0:10:39The first thing you want to do is, of course, import the necessary libraries,
    • 0:10:42like Pandas, NumPy, plotting library, train-test library.
    • 0:10:46So you get straight and run this.
    • 0:10:49So to run a cell in Jupyter Notebook or like, Google Colab,
    • 0:10:53you just press Shift-Enter.
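
For reference, a first notebook cell along these lines would bring in the libraries mentioned here; the exact import list is an assumption, since the cell itself isn't reproduced in the transcript.

```python
import pandas as pd                      # DataFrames for the tabular data
import numpy as np                       # numerical helpers
import matplotlib.pyplot as plt          # basic plotting
import seaborn as sns                    # nicer statistical plots
from sklearn.model_selection import train_test_split  # the train-test split helper
```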
    • 0:10:57And, so, the next thing I'll do is to import
    • 0:10:59the data set that we talked about, which is the Iris data set.
    • 0:11:01So this DataFrame.head is to print out--
    • 0:11:05it's to give you an idea of what the DataFrame actually looks like.
    • 0:11:08So we can see that if we run this, there is sepal length, and then petal length
    • 0:11:14and petal width, and there's target.
    • 0:11:15Target is the three different things that we have, which is, I think,
    • 0:11:19different kinds of iris.
    • 0:11:22And the next thing we want to do is try to explore
    • 0:11:24what is the shape of the DataFrame, for example, right?
    • 0:11:27Now, we know that there are 150 rows and five different columns,
    • 0:11:31here we can see there's five different columns, and there's 150 rows.
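
One common way to build a DataFrame like the one described is scikit-learn's bundled copy of the Iris data; the seminar notebook may load it differently, so treat this as a hedged sketch.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target      # 0, 1, 2 correspond to setosa, versicolor, virginica

print(df.head())    # first five rows: sepal/petal measurements plus the target column
print(df.shape)     # (150, 5): 150 rows, 5 columns
```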
    • 0:11:35So, first of all, the first thing we want to do
    • 0:11:37is to actually explore and analyze the data.
    • 0:11:40That's because it's important to do initial data exploration because it
    • 0:11:44might be biased, it might be noisy, and it
    • 0:11:46shows the relationship between different features and the target,
    • 0:11:49which helps us better train our ML model.
    • 0:11:50So I put some links here that you can use next time.
    • 0:11:54But we will start by putting up a pie
    • 0:11:56chart of the different targets.
    • 0:11:58Like, there's setosa, virginica, and versicolor.
    • 0:12:03That's a different type of iris.
    • 0:12:05We can see that each of them is, kind of like,
    • 0:12:0733%, so they are represented very equally and balanced in the labeled data set.
    • 0:12:12And this is to plot histograms of the different features--
    • 0:12:18this axis is the count, and then this is the width.
    • 0:12:22So you can see that most of the--
    • 0:12:24say, for example, sepal width, we can see, falls in the 3.0 to 3.5 range,
    • 0:12:29and it's roughly normally distributed.
    • 0:12:33And this is more like a graph to see how different kinds of length and width
    • 0:12:38vary according to species.
    • 0:12:41And it takes some time, but I think this is very useful.
    • 0:12:44And this is basically a pairwise plot
    • 0:12:46of how the features and the target influence each other.
    • 0:12:50It takes some time to plot, but it will be cool.
    • 0:12:54So say, for example, look at this.
    • 0:12:57We can see that like, petal length has a very huge influence on the target.
    • 0:13:01You can see that for target zero, they have really,
    • 0:13:04really significantly smaller petal lengths, as opposed
    • 0:13:07to target two which actually has a very clear distinct feature.
    • 0:13:10And moving forward, to actually-- you know like sometimes,
    • 0:13:13when you look at a graph, it's not intuitive?
    • 0:13:15So a very, very good way to plot the intuition--
    • 0:13:20to calculate, to see the relationship between different features,
    • 0:13:23is to plot a heat map.
    • 0:13:26You can see that when we plot a heat map,
    • 0:13:28you can see the high correlation between petal length and target.
    • 0:13:31This is what we want for the label, right: petal length and target,
    • 0:13:34and petal width and target.
    • 0:13:36So before we pass it into the machine learning model, we can also
    • 0:13:41drop-- we can actually drop, which means delete,
    • 0:13:43the columns of sepal width and length, because they don't really affect--
    • 0:13:47they don't really correlate with the target that we actually care about.
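
A rough sketch of the exploration plots described above (pie chart of the targets, pairwise plot, and correlation heat map), assuming the `df` DataFrame from the earlier sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pie chart of the three classes; each should be roughly 33%.
df["target"].value_counts().plot.pie(autopct="%.0f%%")
plt.show()

# Pairwise plot of every feature against every other, colored by class.
sns.pairplot(df, hue="target")
plt.show()

# Heat map of pairwise correlations; petal length/width correlate strongly with target.
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()
```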
    • 0:13:51So the next thing we want to do is to train-test split.
    • 0:13:53The importance of train-test split can be learned more about here,
    • 0:13:55but we will just move on.
    • 0:13:57So, basically, when we have a data set, we
    • 0:13:59can't just put everything in your training sample.
    • 0:14:02We need to split it into maybe 70/30 split, which is 70% training, then
    • 0:14:0730% testing, or an 80/20 split, which is
    • 0:14:1280% training and 20% testing.
    • 0:14:15Depending on the size of your data set, some people
    • 0:14:17go with 70/30, if you have a huge data set you can actually do 80/20 split.
    • 0:14:22So, in here, because we don't have a huge data set,
    • 0:14:25we only have like 150 rows, so I decided
    • 0:14:28to go with 70/30 split, which we can see that the test size is 0.3.
    • 0:14:35And we have, after splitting, we get 105 rows in the training data set,
    • 0:14:40and 45 rows in the testing data set.
    • 0:14:44And we can also check whether we actually split it correctly,
    • 0:14:48and this is a very important step when we actually split.
    • 0:14:52So, now, we have five different columns, right?
    • 0:14:54We have sepal length, sepal width, petal length, and petal width.
    • 0:14:57And we have target, and this is what we want to test it out.
    • 0:15:01So we need to split the training data set into both x and y, where we have--
    • 0:15:05these are called features, and then this is the target that we want.
    • 0:15:09And these are all features that are in the testing data set
    • 0:15:11and this is the target.
    • 0:15:13So we split that, and all of this can be found--
    • 0:15:19this is actually under Pandas, if this notation seems really, really weird
    • 0:15:22to you.
    • 0:15:24It is very well documented for the Pandas DataFrame.
    • 0:15:27And this is still head, and we see what it looks like--
    • 0:15:31after I split it, you can see that train x now has four things.
    • 0:15:36Four different features that we want to train on, and the head shows
    • 0:15:40the attributes of it, which correspond to those features.
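
A hedged sketch of the 70/30 split and the features/target separation described here, assuming the `target` column name from the earlier sketch:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])   # the four feature columns
y = df["target"]                  # the label we want to predict

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)   # 70% training, 30% testing

print(X_train.shape, X_test.shape)   # (105, 4) (45, 4) with a 150-row data set
```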
    • 0:15:45So, next, what you want to do is identify correct problem and ML model.
    • 0:15:49So in this problem it's more like a classification problem.
    • 0:15:52So, for example, if you are provided with several input variables,
    • 0:15:55a classification model will actually try to predict
    • 0:15:58the class of the data.
    • 0:16:02Say, for example, here is a classification problem,
    • 0:16:04because we are given the features sepal length, sepal width, petal length
    • 0:16:08and petal width, and we want to find what target, which
    • 0:16:11is what iris species it is.
    • 0:16:13So that's a classification problem.
    • 0:16:16As opposed to a regression problem, where we are given certain features
    • 0:16:19and want to predict a petal length or petal width-- that's a regression problem.
    • 0:16:22You can read more about it here, I'm not going through it for time's sake.
    • 0:16:27And before we move on to test out that model,
    • 0:16:30I want to talk about evaluation metrics so we can keep running and see it.
    • 0:16:34So there are a few eval-- before we dive into evaluation metrics,
    • 0:16:39we need to understand what is a true false--
    • 0:16:41true positive, true negative, and false positive, and false negative.
    • 0:16:45So a true positive is when the model correctly predicts the positive class.
    • 0:16:50Say, for example, you have a one, and then the model actually correctly
    • 0:16:54predicts the one class.
    • 0:16:56And, similarly, a true negative is an outcome where the model actually
    • 0:17:00correctly predicts the negative class.
    • 0:17:02Let me give you an example.
    • 0:17:03So, basically, say we have a cat data set and a not cat data set.
    • 0:17:08So a true positive is whereby a model correctly predicts that it is a cat.
    • 0:17:14A true negative is when the model actually correctly predicts
    • 0:17:17that it is not a cat.
    • 0:17:19So that's the difference between true positive and true negative.
    • 0:17:22A false positive is an outcome where the model incorrectly
    • 0:17:25predicts the positive class.
    • 0:17:26Say, for example, I have a picture that is not a cat,
    • 0:17:29but then my model predicts it as a cat.
    • 0:17:32That's a false positive.
    • 0:17:34And a false negative is an outcome where the model incorrectly
    • 0:17:36predicts a negative class.
    • 0:17:38That's when I give--
    • 0:17:39I feed in a cat picture, but the model says, this is not a cat.
    • 0:17:44So that's a false negative.
    • 0:17:47So in terms of the different kinds of metrics
    • 0:17:49that we use in judging the performance of an ML model,
    • 0:17:53we have accuracy, recall, precision.
    • 0:17:55And we also have another called the ROC, which is the receiver
    • 0:17:58operating characteristic curve.
    • 0:18:00And AUC, which is most often reported
    • 0:18:04along with the ROC, and which is the area under that curve.
    • 0:18:07So accuracy, by its meaning, is the fraction of predictions the model got right.
    • 0:18:11Say, let's take the example of the cat and non-cat data set, right?
    • 0:18:17So the fraction of--
    • 0:18:19out of the 150 predictions, for example, if my model
    • 0:18:24predicts 96% of the pictures correctly, that's the accuracy.
    • 0:18:28And, also, you have recall, which is what proportion of actual positives
    • 0:18:32are identified correctly.
    • 0:18:33So mathematically, it's defined as true positives over true positives plus
    • 0:18:37false negatives.
    • 0:18:38And precision is true positives over true positives plus false positives, sorry.
    • 0:18:43So, it's like, what proportion of positive identification
    • 0:18:46were actually correct.
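
As a worked toy example of these definitions, using the cat versus not-cat framing (the counts below are invented purely for illustration):

```python
# Hypothetical outcome counts on a 100-picture test set.
TP = 40   # cat pictures correctly called "cat"
TN = 45   # non-cat pictures correctly called "not cat"
FP = 5    # non-cat pictures wrongly called "cat"
FN = 10   # cat pictures wrongly called "not cat"

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # fraction of all predictions that are right
precision = TP / (TP + FP)                    # of everything predicted "cat", how much really is a cat
recall    = TP / (TP + FN)                    # of all real cats, how many we actually found

print(accuracy, precision, recall)   # 0.85 0.888... 0.8
```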
    • 0:18:47And I thought it might be a bit overwhelming
    • 0:18:50to hear a lot like, true positive, true negative, it's like,
    • 0:18:53oh my god, this is so much.
    • 0:18:56I included a lot of different resources that you can actually go and read.
    • 0:19:00I think they are pretty good resources whereby you can learn more, and get
    • 0:19:04accustomed to all of these terms.
    • 0:19:06And, so, let's first dive in to the first machine learning--
    • 0:19:10the supervised model that we're going to talk about,
    • 0:19:12which is k-nearest neighbor.
    • 0:19:14So the k, in k-nearest neighbors, is the number of nearest neighbors
    • 0:19:18we wish to take the vote from.
    • 0:19:19Let's do an example, right?
    • 0:19:21So this is a dot I want to label, whether it's a blue or a red dot.
    • 0:19:25With a k value of three, which is this-- what's it called--
    • 0:19:30solid line circle, you can see that we have
    • 0:19:33two red triangles but one blue square.
    • 0:19:38So, in this case, this dot will be classified by the KNN label
    • 0:19:43as the red triangle.
    • 0:19:45And what about if it's larger, right?
    • 0:19:47So what about k value of five?
    • 0:19:50So it would be the dotted line circle.
    • 0:19:52So you can see that in this dotted line circle,
    • 0:19:54we can see that if we classify this green dot, we have three blue squares,
    • 0:20:02but we have two red triangles.
    • 0:20:04So this green dot, based on this dotted line circle,
    • 0:20:07will be classified as a blue square.
    • 0:20:12So we can see that actually choosing a k value in the KNN algorithm
    • 0:20:18is very important.
    • 0:20:18And that's where all the time consuming parts come in,
    • 0:20:21because we want to choose the correct model that doesn't overfit or underfit
    • 0:20:24the data.
    • 0:20:26So some of the ways I can do it is plot a graph of accuracy versus k value,
    • 0:20:30or graph of error rate versus k value.
    • 0:20:33But, most of the time, we just test with a random one and we move on from there.
    • 0:20:37And there are also different distance metrics that we can use.
    • 0:20:40If we learned maths, we also learned about the Euclidean distance,
    • 0:20:43whereby it's like, you draw a triangle, and the hypotenuse is the Euclidean distance.
    • 0:20:46Or we have the Manhattan distance, whereby
    • 0:20:48it's like the distance from point to point along a grid, like city blocks.
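
A tiny sketch of the two distance metrics just mentioned, for two example points a and b:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))           # grid / city-block distance: 7.0

print(euclidean, manhattan)
```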
    • 0:20:52So the KNN resources that I found really, really helpful in explaining
    • 0:20:57will be here as well.
    • 0:20:58So, without further ado, let's get started with model testing.
    • 0:21:01So I use an SKlearn model to--
    • 0:21:05I pass in my number of neighbors, which is three, to a KNN classifier,
    • 0:21:09and I train it.
    • 0:21:11And after that I used-- this is the prediction-- after I train it,
    • 0:21:14I just used the model to predict my testing data set,
    • 0:21:17to see how accurate it is on the data set itself.
    • 0:21:21So, after I run it, you can see that the accuracy score was around 93%.
    • 0:21:27That means that of all the positive--
    • 0:21:30of all of the predictions that the ML model made, 93% of them are correct.
    • 0:21:36And what about the precision score?
    • 0:21:39We got 0.9476, around 95%.
    • 0:21:43That means that out of the total positive predictions,
    • 0:21:4695% of them are actually correct.
    • 0:21:48That's pretty high.
    • 0:21:49And the recall of the KNN is around 93%.
    • 0:21:52I'm so sorry, I will change this later.
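
A hedged sketch of the KNN step just described, reusing the split from the earlier sketch; the notebook's exact code and random split may differ, so the scores will not match exactly.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 nearest neighbors
knn.fit(X_train, y_train)                   # train on the 105-row training split
y_pred = knn.predict(X_test)                # predict the 45 held-out rows

print(accuracy_score(y_test, y_pred))
# Iris has three classes, so precision and recall need an averaging scheme.
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
```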
    • 0:21:54But, maybe, let's test around with a different number of neighbors.
    • 0:21:58Let's do about 10, which is basically increasing the neighborhood size.
    • 0:22:03We can see that it actually affects the--
    • 0:22:06I think, just now, we have 93% accuracy.
    • 0:22:09But now, we can see that if we fit it with 10 neighbors, which
    • 0:22:13is increasing the radius of the circle, you can see that the--
    • 0:22:18both the precision, and accuracy, and recall score
    • 0:22:20actually increased from around 93% to 95%.
    • 0:22:24So that's why constantly testing and evaluating the model
    • 0:22:28is actually very important.
    • 0:22:29So, let's move on to another one, which is logistic regression.
    • 0:22:34Logistic regression is mainly based on this S-shaped sigmoid curve,
    • 0:22:38and it is fitted using MLE, maximum likelihood estimation.
    • 0:22:44Giving an example would be, say, for example, I was given a bunch of data
    • 0:22:48and I was trying to predict whether someone is COVID positive
    • 0:22:52or COVID negative.
    • 0:22:53So, what we will do is, we will plot a graph like this, a sigmoid curve.
    • 0:22:57And then, say the data was here, right, and I'll be like,
    • 0:23:00oh, because this data was around more than 90, more than half of this curve,
    • 0:23:04I would classify it as one.
    • 0:23:06But if it's here, we'll classify it as COVID negative.
    • 0:23:10So that's how this model generally works,
    • 0:23:13is that depending on where the data points are,
    • 0:23:15we classify whether it's positive or negative based on where it is.
    • 0:23:20So, in this case, we also try to run a logistic regression
    • 0:23:24model on the Iris data set itself.
    • 0:23:26And I also include a few resources where you can
    • 0:23:30understand logistic regression more.
    • 0:23:32So if we run the logistic regressions I think it's actually--
    • 0:23:41[INAUDIBLE]
    • 0:23:43So, if you run the logistic regression, you
    • 0:23:45can see that the accuracy score is actually less than the one
    • 0:23:50that we had before.
    • 0:23:51We have 95% of accuracy with KNN model, but we only have around 93%
    • 0:24:00with logistic regression.
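
And a matching sketch for the logistic regression comparison, on the same split as above; max_iter is raised only so the solver converges on this small data set.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression(max_iter=1000)  # higher max_iter so the solver converges
logreg.fit(X_train, y_train)                # the model maps a weighted feature sum through the sigmoid
print(accuracy_score(y_test, logreg.predict(X_test)))
```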
    • 0:24:02And that's how we actually, so--
    • 0:24:05alternatively, you can use other approaches or ML
    • 0:24:08models that we suggested here--
    • 0:24:10sorry to keep jumping--
    • 0:24:11which is decision tree, support vector machine, or neural network.
    • 0:24:15I remember when we ran it with a support vector machine,
    • 0:24:18it has really good accuracy and prediction,
    • 0:24:20while neural network is also a pretty good way to try to classify it.
    • 0:24:24So that's all for me, you can move on to--
    • 0:24:27KEVIN XU: Yeah, so, great, Zad, a wonderful job
    • 0:24:33of highlighting how you can use a regression,
    • 0:24:38or you can use ML to take a data set and attempt to predict future elements
    • 0:24:46that may be part of this data set, right?
    • 0:24:47And, so, this is one of the core pieces of ML.
    • 0:24:50And, don't be worried if you couldn't follow all of that.
    • 0:24:54Or, you don't really understand how the syntax for the code works.
    • 0:24:58It's learning new libraries, and machine learning
    • 0:25:01is very heavily dependent on previously written libraries.
    • 0:25:05It's a lot of work to develop your own algorithm for the machine learning
    • 0:25:11to take place.
    • 0:25:12And, so, a lot of the case it's the--
    • 0:25:14a lot of the time it's the case of seeing how your data looks,
    • 0:25:18and then trying to choose the best model.
    • 0:25:21Which is a sequence of algorithms that make this black box function,
    • 0:25:24as we talked about earlier, to actually get accurate predictions.
    • 0:25:29So, yeah, feel free to ask questions-- or pose questions about this.
    • 0:25:34But I will be going on to kind of the flip side of machine learning.
    • 0:25:38Instead of using previous data to predict elements of the same data set,
    • 0:25:44we're going to start talking about how can we get the machine to calculate
    • 0:25:50or like-- we're going to start talking about games, and reinforcement
    • 0:25:53learning.
    • 0:25:54And, so, one of the most popular uses for reinforcement learning
    • 0:25:57is, of course, trying to solve games, right?
    • 0:26:00So you have a game, and you want to build an artificial intelligence,
    • 0:26:06or an AI, that best wins your game, or that
    • 0:26:10always plays the best move possible.
    • 0:26:14And this is actually loaded with theory about how data sets work, how the game
    • 0:26:21itself works, and there's a lot of math and logic behind it.
    • 0:26:25But, at the end of the day, the idea is, given what you know about the game,
    • 0:26:29can you get your computer to train on the game such
    • 0:26:33that it always has a pretty good idea of what the next best move is?
    • 0:26:39And, so, I've built just a really silly game here.
    • 0:26:43We don't have to worry too much about the structure of the game itself,
    • 0:26:48but in general, essentially, when you start the game
    • 0:26:52you're presented with two doors.
    • 0:26:53You choose one of the doors, and then you're
    • 0:26:55presented with another two doors.
    • 0:26:57And you choose one of the doors again, and behind that door,
    • 0:27:01there's a randomly generated value between zero
    • 0:27:05and a certain number that corresponds with the door.
    • 0:27:08So if you think about this kind of tree-like structure,
    • 0:27:11you have two, and then two more, so there's four total doors at the end.
    • 0:27:14And each one of them has a number assigned to them.
    • 0:27:16Perhaps like nine-- or in this case, I think, three, nine, one, and 20.
    • 0:27:21All right, so obviously, the door that is associated with 20
    • 0:27:26is going to, on average, give you better points, or reward,
    • 0:27:29or whatever this point system works out as, right?
    • 0:27:33Than the door that has a one associated with it.
    • 0:27:36And, so, the question is, can you get the computer
    • 0:27:39to realize which door is the best, simply
    • 0:27:43by playing the game a bunch of times?
    • 0:27:45And this is the concept of reinforcement learning.
    • 0:27:48You just play it a bunch of times.
    • 0:27:50And for things that turn out well, as in you've got a lot of points,
    • 0:27:54and well, yeah, you've got a lot of points,
    • 0:27:56then the computer should be more likely to choose that option in the future.
    • 0:28:00And for doors where you didn't get a lot of points,
    • 0:28:03it should be less likely to choose that door in the future.
    • 0:28:06So we don't have to worry too much about the overall structure of this code,
    • 0:28:11but right now, I have it set up such that the computer just simply
    • 0:28:16plays random moves every time.
    • 0:28:17So it goes through one of the two initial doors, with equal probability,
    • 0:28:22and then it chooses one of the next two doors with equal probability.
    • 0:28:25So you expect that the expected value is going
    • 0:28:29to be pretty low, because these aren't in general pretty--
    • 0:28:32or, yeah.
    • 0:28:33So it won't be the highest that it could possibly be, right?
    • 0:28:37So if we run this code, we see we got an expected value about 4.125.
    • 0:28:45So, on average, the computer is scoring four points, right?
    • 0:28:50And, so, this is definitely not the best we can do, right?
    • 0:28:53This is a totally random--
    • 0:28:54the computer is playing totally randomly,
    • 0:28:56like this is completely stupid for the computer to do.
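
A guess at the structure of the toy door game (the presenters' actual code isn't shown in the transcript): two doors, then two more, and a reward drawn uniformly between zero and the number attached to the final door, using the values 3, 9, 1, and 20 mentioned in the talk. Playing randomly averages around 4.125, matching the number above.

```python
import random

FINAL_REWARDS = [3, 9, 1, 20]   # maximum reward behind each of the four final doors

def play_randomly():
    first = random.randint(0, 1)     # pick one of the first two doors at random
    second = random.randint(0, 1)    # then one of the next two doors at random
    leaf = first * 2 + second        # which of the four final doors we ended up at
    return random.uniform(0, FINAL_REWARDS[leaf])

games = [play_randomly() for _ in range(100_000)]
print(sum(games) / len(games))   # roughly (1.5 + 4.5 + 0.5 + 10) / 4 = 4.125
```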
    • 0:28:59So we want to attempt to get it to figure out which doors are better.
    • 0:29:04And this actually doesn't have to--
    • 0:29:06this-- to implement reinforcement learning, a lot of the time neural
    • 0:29:10networking and other aspects of machine learning
    • 0:29:12are incorporated with reinforcement learning.
    • 0:29:15So you take in the data, and you pass it through another regression,
    • 0:29:21or prediction, and that also helps you find out which move is best.
    • 0:29:25But, in this case, because the game is so simple,
    • 0:29:28you can simply just hard code the training in.
    • 0:29:31And, so, if we take a look here, and this will be a little brief,
    • 0:29:36so if you want to ask about the logic, go ahead in the chat,
    • 0:29:39but I'll try not to waste too much time trying to explain
    • 0:29:41what every line of code does.
    • 0:29:43So for every move you take, you consider the state that the move resides in.
    • 0:29:50So, the current state of the board, so, which doors you have in front of you,
    • 0:29:53and which possible doors you can go through.
    • 0:29:55So, if you start at the beginning, you have the first two doors, right?
    • 0:29:58So, your state is just at the beginning, and you have two possible choices.
    • 0:30:02And, so, if you label every state in the system with a heuristic value that is--
    • 0:30:08that basically tells you the goodness of that state,
    • 0:30:11like how desirable is it to be in that state
    • 0:30:15if the goal is to accumulate points.
    • 0:30:18Then, what you can do is, have the computer simply
    • 0:30:21go through the states that have the highest goodness value, right?
    • 0:30:25So how do we actually calculate this goodness value?
    • 0:30:28Well, we just play the game a bunch of times.
    • 0:30:30So for every time you play the game, you're keeping an internal track,
    • 0:30:36or the computer is keeping an internal track, of what the current state
    • 0:30:40heuristic value is.
    • 0:30:42And then, it makes a move based on what it thinks
    • 0:30:46is the best move in this case.
    • 0:30:48So, here, once we make the move, we then find out what the next state is,
    • 0:30:57and how good the next state is as a result of the move.
    • 0:31:00So using this accumulation, we can kind of backtrack and see
    • 0:31:05how good was the current state that we were in before we made the move,
    • 0:31:08and how good of a move did we make?
    • 0:31:10And, so, of course, by incrementing these, right, you
    • 0:31:13increase the probability to make good moves and you increase-- you decrease,
    • 0:31:17relatively, the probability to make bad moves.
    • 0:31:23And, so, if we throw this training in that calculates the heuristic model,
    • 0:31:33and we recompile, we see we got an expected value of about 10,
    • 0:31:37which is about twice as big as we had previously.
    • 0:31:40So the computer has gotten way better at this game.
    • 0:31:42And if you actually take a look into the data structure, which
    • 0:31:45I won't at the moment, because it looks really complicated,
    • 0:31:47and well, it's just a big dictionary, and I don't think
    • 0:31:50it will help you understand what is going on here.
    • 0:31:53If you actually take a look, you'll see that the end result
    • 0:31:56of the training, so for the training, we played the game 10,000 times.
    • 0:32:00And we evaluated the goodness of every move for those 10,000 games.
    • 0:32:07And you'll find that after this training,
    • 0:32:10the probability of choosing the door that leads to this 20 door,
    • 0:32:16sorry I went a little far, so that leads to this 20 door at the end
    • 0:32:20is almost one.
    • 0:32:21It's like, 0.9997 or something like that.
    • 0:32:24And, so, the computer has basically figured out,
    • 0:32:27without us telling the computer anything except for the game, right, the game
    • 0:32:33state, and what the results are, how to beat this game.
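
A deliberately simplified sketch of the tabular idea described here: it collapses the two door choices into a single pick of one of the four final doors, keeps one value estimate per door, nudges that estimate toward each observed reward, and mostly picks the best-looking door while still exploring occasionally. It is an illustration of the concept, not the presenters' code.

```python
import random

FINAL_REWARDS = [3, 9, 1, 20]
values = [0.0, 0.0, 0.0, 0.0]    # learned "goodness" estimate for each final door
ALPHA, EPSILON = 0.1, 0.1        # learning rate and exploration rate

for _ in range(10_000):
    if random.random() < EPSILON:
        leaf = random.randint(0, 3)                      # occasionally explore a random door
    else:
        leaf = max(range(4), key=lambda i: values[i])    # otherwise take the best-looking door
    reward = random.uniform(0, FINAL_REWARDS[leaf])
    values[leaf] += ALPHA * (reward - values[leaf])      # move the estimate toward what we saw

best = max(range(4), key=lambda i: values[i])
print(values, best)   # the door with max reward 20 wins out, giving roughly 10 per game
```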
    • 0:32:40And, so, this is the goal of reinforcement learning.
    • 0:32:43And I think this is a very interesting thing, that has--
    • 0:32:49there's a lot of application in chess, and go,
    • 0:32:52and this is the basic core idea of how people have solved these games.
    • 0:32:57You want to make good moves, is what it boils down to.
    • 0:33:02Which sounds very simple, but in execution can
    • 0:33:05be more complicated than it seems.
    • 0:33:09Let's go--
    • 0:33:09SPEAKER: Kevin, and Zad, can we take a couple of questions now?
    • 0:33:12Oh perfect timing, yay.
    • 0:33:13KEVIN XU: --that's what we're planning on doing right now.
    • 0:33:16SPEAKER: There we go, all right.
    • 0:33:17There have been several since the last little break there.
    • 0:33:22OK, from Angela, "Was Person of Interest a realistic example
    • 0:33:26of machine learning?"
    • 0:33:29KEVIN XU: Person of Interest?
    • 0:33:32ZAD CHIN: I'm sorry, can you repeat the question?
    • 0:33:34SPEAKER: That's the question, it was probably about the earlier--
    • 0:33:37maybe Angela, you can write back in the chat?
    • 0:33:40I'm going to just keep going on and continue answering,
    • 0:33:43but if you want to clarify that in the chat.
    • 0:33:46From Maria, "Do you ever feel like machine learning sometimes
    • 0:33:50have very serious consequences, like in elections for example?"
    • 0:33:56ZAD CHIN: Yes, we do feel like, in terms of machine learning,
    • 0:34:00the impact of machine learning model, especially,
    • 0:34:02I mean it can be both bad and good.
    • 0:34:04That's why, at the last slide of our slides,
    • 0:34:06we also talk about how machine learning actually
    • 0:34:09generates deepfake, or like, privacy intrusion, the alignment problem.
    • 0:34:15So we actually include some machine learning ethics
    • 0:34:17that we hope to actually share with you as well.
    • 0:34:20But generally, in terms of machine learning in an election,
    • 0:34:25I do think that that's true.
    • 0:34:26Because a lot of the time, advertisement model in like Facebook, or Google,
    • 0:34:30they use a lot of machine learning model to predict the kind of person
    • 0:34:34that you are, and recommend--
    • 0:34:36recommender system is actually a part of machine learning.
    • 0:34:38It's a very huge research topic in machine learning.
    • 0:34:41And, so, if you want to know more it will be recommender system at Google
    • 0:34:46or Facebook, I would--
    • 0:34:47SPEAKER: OK, and then we have another question about cybersecurity.
    • 0:34:52So is that--
    • 0:34:53I think we can leave that for later, I think you have some slides on that.
    • 0:34:56Is that correct, Zad?
    • 0:34:58Yeah, OK.
    • 0:34:59ZAD CHIN: Yeah, we do.
    • 0:35:00SPEAKER: All right, so, let me just--
    • 0:35:02this is from James, "How does a computer or machine recognize objects
    • 0:35:06by itself in unsupervised learning?
    • 0:35:11ZAD CHIN: It depends.
    • 0:35:12So for example, if you say like, a data point, for example, currently
    • 0:35:17I'm doing research on the ICU data set.
    • 0:35:19We have a lot of features, and then, so what we do is we paste in the features,
    • 0:35:24and then it will be represented in like, let me give you a simple example.
    • 0:35:29Say, for example, we recommend-- we have x and y feature.
    • 0:35:32And we recommend it on--
    • 0:35:34we just plot, like we just put the points in.
    • 0:35:37And then how the machine learning knows is that they try to cluster the points.
    • 0:35:40One of the very good ways to actually know, like in unsupervised learning,
    • 0:35:45is clustering.
    • 0:35:46So basically, say, for example, I took a point, right?
    • 0:35:48Like what is the neighbor of the point, and how it should relate,
    • 0:35:51how strong it actually relates.
    • 0:35:53So, maybe like, for example, the points are very separated,
    • 0:35:57or the points are like a block, or the points are not related at all.
    • 0:36:01So, basically, one of the very good ways of unsupervised learning
    • 0:36:06is also clustering, that's like for discrete data set.
    • 0:36:10If you say for image data set, for example, most
    • 0:36:13of the time we use something called a convolutional neural network, which
    • 0:36:16is CNN, which basically passes the image through a lot of filters.
    • 0:36:19And I think in CS50, we also have this pset
    • 0:36:23where we talk about edge detection.
    • 0:36:25I think it's under the--
    • 0:36:27it's one of the psets.
    • 0:36:28And that's an edge detection, whereby we draw out the edge itself.
    • 0:36:33And that's also a part of the CNN, whereby
    • 0:36:35we try to pass through different filters and networks.
    • 0:36:38And then, we draw the edge to recognize what is the image itself.
    • 0:36:42So if you are interested in knowing about how machine learning actually
    • 0:36:46understands images, I would recommend CNN.
    • 0:36:49If you're asking about-- if you mean general unsupervised learning,
    • 0:36:52there are a few methods we can name here, like clustering, or autoencoders,
    • 0:36:55which are a part of neural networks.
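
For the clustering part of this answer, a minimal k-means sketch on synthetic 2D points (the data is made up just to show the idea):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),   # one blob of points
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),   # a second, well-separated blob
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_[:5])          # which cluster each point was assigned to
print(kmeans.cluster_centers_)     # the two cluster centers it discovered
```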
    • 0:36:58SPEAKER: OK, why don't we take two more, and then we'll continue again.
    • 0:37:01So, Amir asks, "Which step is preferred first?
    • 0:37:04Data analysis or machine learning?"
    • 0:37:09ZAD CHIN: I would-- yeah, go ahead.
    • 0:37:10KEVIN XU: So, this is very context-dependent at times.
    • 0:37:14But in general, you don't want to take some data set that you just gathered
    • 0:37:20and immediately throw it into machine learning.
    • 0:37:23Because, when you take data in the real world, as opposed to generated data,
    • 0:37:29there's a lot of noise.
    • 0:37:31There's a lot of inconsistencies, there is
    • 0:37:33a lot of things that can go wrong, just when you take normal data.
    • 0:37:37And, so, if you just throw something random into an ML program,
    • 0:37:42it will not necessarily get you the results you want.
    • 0:37:45And most of the time it won't, because there's so much noise in the data
    • 0:37:48that it's really difficult to identify the patterns.
    • 0:37:52And, so, generally, when you're designing this ML--
    • 0:37:57type of sequence of steps, you want to at least screen your data before you
    • 0:38:04throw it into any type of program.
    • 0:38:06That way, you can catch early on if things are going to go wrong, right?
    • 0:38:10And like, in the worst case, you just might
    • 0:38:12have to retake the entire data set because it's not valuable, right?
    • 0:38:16And, so, it's kind of like a screening process, at least the initial data
    • 0:38:20analysis, before you can actually try and be productive with that data set.
    • 0:38:24ZAD CHIN: Adding it on, I feel like it's very important
    • 0:38:26to do data analysis first.
    • 0:38:27This is because say, for example, you get a data of like COVID positive
    • 0:38:30and COVID negative patients.
    • 0:38:32And actually it is very dangerous.
    • 0:38:34So, for example, if your data set is very
    • 0:38:36biased toward COVID negative person, and you just
    • 0:38:38pass it to a logistic regression model, and the model would
    • 0:38:41be like, oh, since 98% of the people are actually COVID negative,
    • 0:38:45right, then I can just predict, oh nine--
    • 0:38:48out of the test example I gave, I will just predict everybody as negative.
    • 0:38:53So I still get a 98% accuracy, which is actually very, very dangerous.
    • 0:38:56And that's why--
    • 0:38:57that's why you want to know how many of each label we have,
    • 0:39:00before we actually pass it to the machine learning model.
    • 0:39:03Because these kind of biases can happen.
    • 0:39:05Machines can be just like, oh, because 98% of people are actually negative,
    • 0:39:10so I can just predict everyone is negative.
    • 0:39:12And I get a 98% accuracy, right?
    • 0:39:14So it's very, very important to do the data analysis
    • 0:39:18before you actually train the data, especially
    • 0:39:20on this kind of highly biased data itself.
    • 0:39:24SPEAKER: OK, and the last question, from Victor, "Do all machine
    • 0:39:26learning models use neutral--
    • 0:39:28neural networks?"
    • 0:39:32ZAD CHIN: No.
    • 0:39:32KEVIN XU: Yeah, no.
    • 0:39:33So neural networks are--
    • 0:39:35convolutional neural networks, these two things,
    • 0:39:38or the same thing but specialized, are a specific subset
    • 0:39:43of the machine learning that you can do.
    • 0:39:45And, so, I think you probably asked this before I showed
    • 0:39:48the reinforcement learning example.
    • 0:39:50But you don't have to use neural networks at all when you're
    • 0:39:53trying to get the machine to learn.
    • 0:39:57It is very-- neural networks are very powerful, because you're
    • 0:40:02able to take a lot of data and have the computer generate
    • 0:40:07the relationships between the data.
    • 0:40:09But it's not always necessary to use a neural network when
    • 0:40:14you're trying to learn from a data set, or have the computer learn from a data
    • 0:40:18set, as you saw in reinforcement learning, right?
    • 0:40:20You can simply have the computer attempt to make its own conclusions based
    • 0:40:26on what state, or what type of data you give it, and what it wants to achieve.
    • 0:40:36ZAD CHIN: Right.
    • 0:40:37KEVIN XU: So, yeah, we'll take some more questions at the end.
    • 0:40:41So yeah, feel free to continue posting them in the chat.
    • 0:40:44But, for now, since we've probably hit you with a lot, and things we can do,
    • 0:40:50and things we can't do, and possible things,
    • 0:40:52we want to give some tips on what is reasonable to consider
    • 0:40:56if you're actually attempting this-- to implement this in a CS50 final project.
    • 0:41:00And, so, Zad has made this great graph here
    • 0:41:03that goes kind of in difficulty level from the left to right.
    • 0:41:06And, so, at the very left you have supervised learning
    • 0:41:09and unsupervised learning, which is--
    • 0:41:12requires some effort, but not huge amounts of dedication.
    • 0:41:16Although, this is always context-dependent as well.
    • 0:41:19And then, reinforcement learning will probably take more time, simply
    • 0:41:23because you not only have to provide the data,
    • 0:41:26but you often have to build an infrastructure that
    • 0:41:29can interpret the data.
    • 0:41:30So, in the case of the game, right, you have to build something that actually--
    • 0:41:35you have to actually build the game into Python.
    • 0:41:38Where the game has to take in an input state somehow,
    • 0:41:42and it has to return to you the new state and the results.
    • 0:41:45And, so, there's all this extra infrastructure
    • 0:41:49that you need before you can actually run any ML,
    • 0:41:52and sometimes this takes longer than running the actual ML.
    • 0:41:55I actually spent longer trying to get this infrastructure
    • 0:41:58to work out than actually implementing the reinforcement learning.
    • 0:42:01So this is like very--
    • 0:42:04something to be cautious of, if you're interested in doing something
    • 0:42:07like solving a game.
    • 0:42:09But, of course, it is certainly doable if you
    • 0:42:13are willing to put in the extra time to do it.
    • 0:42:16And then, with convolutional neural networks, and deep learning,
    • 0:42:20and some of the higher stuff, we caution against it
    • 0:42:24unless you are very familiar with this kind of construct.
    • 0:42:27And it requires some-- quite a bit of in depth knowledge.
    • 0:42:32So you don't really have to worry about that.
    • 0:42:35And, so, quickly, I just want to mention how you can actually
    • 0:42:39implement these things.
    • 0:42:40So we used Google Colab which runs Jupyter Notebook which
    • 0:42:45is a Python interpreter that runs cell by cell.
    • 0:42:48So the nice thing about Google Colab is that, well, there's two nice things.
    • 0:42:52One is that it interprets the code cell by cell.
    • 0:42:54So you can change a cell without having to rerun
    • 0:42:57the script from the top down, which is great
    • 0:42:59if your things take forever to run, as is often the case in machine learning.
    • 0:43:04And the other nice thing about Google Colab is that it's cloud computed.
    • 0:43:08So there's a GPU on the server end that does all this,
    • 0:43:11and then returns it to you over the web.
    • 0:43:14And, so you won't try and break your machine
    • 0:43:17trying to process like 10,000 images.
    • 0:43:21But, on the other hand, if you feel comfortable running it on your device,
    • 0:43:25it's definitely certainly doable.
    • 0:43:26I ran the reinforcement learning code just fine
    • 0:43:29on just a terminal on my device.
    • 0:43:33And that-- those are the kind of things that don't really
    • 0:43:35take too much processing.
    • 0:43:37So, yeah, just keep that in mind when you are thinking
    • 0:43:42about how to actually implement this.
    • 0:43:46ZAD CHIN: So, next, I will talk about the useful Python libraries
    • 0:43:49that we--
    • 0:43:50that we normally use for machine learning.
    • 0:43:53The first one we will go into is data.
    • 0:43:55How do we get data, right?
    • 0:43:56So, in terms of mining data from online, we
    • 0:43:59can go for BeautifulSoup, which is kind of like a scraping library
    • 0:44:03that someone wrote about.
    • 0:44:04It's all linked, so you guys can press that,
    • 0:44:06and the slides are available on the CS50 website.
    • 0:44:09And, also, Scrapy, which is a very good scraping library.
    • 0:44:13And in some of like, built in data sources that are very well documented,
    • 0:44:18I would recommend Kaggle, it's a Google platform
    • 0:44:21with a lot of machine learning data.
    • 0:44:23And UCI, University of California Irvine machine learning repo--
    • 0:44:27also a lot of data sets available.
    • 0:44:29And you can also scrape from websites or API calls,
    • 0:44:32with BeautifulSoup or Scrapy, or even your own API call.
    • 0:44:35And, for data pre-processing, I'm a very huge fan of Pandas.
    • 0:44:39So I highly, highly recommend using Pandas,
    • 0:44:42like loading your CSV into Pandas, and just working from there.
    • 0:44:45It's actually much more efficient and useful.
    • 0:44:47And there's also NumPy and SciPy, which is also like things that you normally
    • 0:44:51use in terms of calculating mean, median,
    • 0:44:54and a lot of very useful functions to analyze
    • 0:44:56the data or pre-process the data.
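
A tiny sketch of the pandas workflow mentioned here; the CSV file name is hypothetical.

```python
import pandas as pd

df = pd.read_csv("my_data.csv")   # hypothetical file name, just for illustration
print(df.head())                  # peek at the first rows
print(df.describe())              # count, mean, std, min/max per numeric column
```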
    • 0:44:59So the next thing is Python library, which is like visualization.
    • 0:45:03The most simple basic visualization library that is available online is
    • 0:45:08Matplotlib, very useful, super well documented,
    • 0:45:11a lot of examples online that you can use.
    • 0:45:13Seaborn is a better kind of visualization,
    • 0:45:16where you can choose your own color map, or better kind of visualization.
    • 0:45:19The best visualization that I can actually think of, maybe not the best,
    • 0:45:23but like--
    • 0:45:23Plotly is a very highly visual--
    • 0:45:26very engaging, highly visual, very nice visualization library
    • 0:45:31that Python has.
    • 0:45:33So, in terms of ML models, there are a lot of built in libraries.
    • 0:45:37You can also do your own ML model from scratch,
    • 0:45:39in Python, which actually increases your understanding of the model itself.
    • 0:45:44But if you want to save time, if you just want to use the model itself,
    • 0:45:47there are a lot of built in libraries.
    • 0:45:49Say, for example, SKlearn is a very good library for beginners,
    • 0:45:52there are various classification, regression, and clustering algorithms
    • 0:45:56that we can just use.
    • 0:45:57And there is also TensorFlow, I think a Google kind of like--
    • 0:46:02a Google deep neural network, Python library.
    • 0:46:05It is used for various tasks, for training and inference,
    • 0:46:08mostly on deep neural networks.
    • 0:46:09And they have pretty good documentation online and on YouTube and really
    • 0:46:13good tutorials.
    • 0:46:15We also have PyTorch, from Facebook, which is also used for neural networks,
    • 0:46:18computer vision, and NLP.
    • 0:46:20And we have Keras, which is a high-level API that runs on top of TensorFlow.
    • 0:46:23And it's primarily for developing and evaluating deep learning
    • 0:46:26models as well.
    • 0:46:26Those are really, really useful libraries that you can use--
    • 0:46:28and they are really well documented, there are a lot of tutorials
    • 0:46:31online that we actually recommend, and they are also
    • 0:46:33linked so you guys can actually have a look too.
    • 0:46:36In terms of ML resources and support from CS50 and beyond,
    • 0:46:40we actually recommend you to go on Ed if you have any bugs
    • 0:46:43that you don't know how to fix.
    • 0:46:44The team-- the CS50 team is really, really happy to help you on Ed.
    • 0:46:48And you also have CS50 Intro to AI classes
    • 0:46:51online too, so you guys can have a look at it.
    • 0:46:53There's also a lot of Python library documentation online, where I highly,
    • 0:46:58highly recommend you to look over the examples
    • 0:47:00before you actually start coding.
    • 0:47:02Kaggle is a place where all the data scientists go.
    • 0:47:05There are data sets, there are also example codes that you can try out,
    • 0:47:10and also competitions that you can join.
    • 0:47:12Very good place to learn about data science, and more
    • 0:47:15about machine learning.
    • 0:47:16And there are a few blogs that we found very useful as tools.
    • 0:47:19Data science blog is by Medium, and Machine Learning Mastery blog
    • 0:47:22is a free blog that's really, really helpful.
    • 0:47:24Of course, we have our favorite, Stack Overflow and GitHub, really,
    • 0:47:27really good ML resources and support there as well.
    • 0:47:31So let's talk about ML ethics, where we actually
    • 0:47:34see a lot of questions about what is best.
    • 0:47:37Machine learning can do so much good stuff, right?
    • 0:47:39Like it is used in health care, education, anywhere
    • 0:47:42that you can think of.
    • 0:47:43But, at the same time, it has its own dangers.
    • 0:47:47So, for example, one of the things that we see the most is deepfakes.
    • 0:47:50And I think David Malan actually did a deepfake example video in last year's
    • 0:47:54CS50, which is actually very exciting.
    • 0:47:56I highly, highly recommend you guys watch it;
    • 0:47:58I actually linked it here so you guys can watch it later.
    • 0:48:01And another thing about machine learning is that it's actually a black box.
    • 0:48:05So interpretability in machine learning
    • 0:48:08is a huge topic that a lot of machine learning practitioners
    • 0:48:11are actually talking about.
    • 0:48:12One of them is Professor Finale Doshi-Velez, a professor at Harvard.
    • 0:48:16She actually does a lot of work on interpretability in AI
    • 0:48:20and in healthcare as well.
    • 0:48:22So I highly recommend you watch her TED Talk, if you want to.
    • 0:48:25And the other thing about machine learning
    • 0:48:27is AI bias and fairness; we hear a lot about it.
    • 0:48:29So there is a specific course by MIT that talks
    • 0:48:32about AI bias and fairness, a really well documented video,
    • 0:48:35highly recommended.
    • 0:48:37And, as you can see, we have a lot of fairness, transparency,
    • 0:48:41and privacy issues that are related to machine learning.
    • 0:48:44And there's a really good book that I listed here as well.
    • 0:48:46It's called The Alignment Problem, by an author at UC Berkeley.
    • 0:48:49It's about how we can align machine learning with our human values.
    • 0:48:52All of this stuff is linked, so you have more to read
    • 0:48:55if you'd like to know more about machine learning ethics
    • 0:48:59and how it actually can be dangerous.
    • 0:49:01That's not to say I don't support it;
    • 0:49:03machine learning is very helpful, but we need to be mindful
    • 0:49:06that it can be perilous as well.
    • 0:49:09So, that's all from us, and we will take some questions.
    • 0:49:13And, yeah.
    • 0:49:15SPEAKER: OK, wonderful.
    • 0:49:16So, let's go to Doris, "Is there a need for bigger capacity of laptop
    • 0:49:20for machine learning?
    • 0:49:21I'm using an old Mac."
    • 0:49:24KEVIN XU: This is actually a great question.
    • 0:49:27This is truly dependent on the data sets that you're working with.
    • 0:49:32A lot of the time, at least in universities,
    • 0:49:34when they're dealing with huge amounts of data,
    • 0:49:38they have to process it on a cluster,
    • 0:49:40which is a group of server computers
    • 0:49:43that are way more powerful than anything
    • 0:49:46you could purchase individually.
    • 0:49:48But, for the sake of implementing a small ML
    • 0:49:51project, for the sake of learning about ML,
    • 0:49:53or for the sake of doing a fun small project, such as attempting
    • 0:49:58to write an image recognition
    • 0:50:01model with just a bunch of images, this is something
    • 0:50:04that is very doable on most machines.
    • 0:50:08Of course, it really depends--
    • 0:50:12it will be hardware dependent here,
    • 0:50:15but I would say that, short of very, very old hardware,
    • 0:50:23you should be able to at least get the script working.
    • 0:50:26But, of course, if it doesn't, cloud computing is always an option.
    • 0:50:30Google Colab is free, which is great,
    • 0:50:34and it's honestly not any different from just accessing the [? IDE. ?]
    • 0:50:38And, so, we really recommend that, if things
    • 0:50:42run really slowly on your machine, you
    • 0:50:45look into the cloud with Google Colab.
    • 0:50:48SPEAKER: OK, [? Aviral, ?] "Is it possible
    • 0:50:52to use two types of different algorithms at the same time
    • 0:50:56to increase the accuracy of the model?"
    • 0:51:01ZAD CHIN: It's not usually two models at the same time.
    • 0:51:04I think it is--
    • 0:51:05normally we actually try different models,
    • 0:51:08and each model has its own strengths and weaknesses, I would say.
    • 0:51:11So, for example, KNN might fit one particular data set very well.
    • 0:51:14But then, in the example that you
    • 0:51:18have, KNN versus logistic regression, it's
    • 0:51:21not like we can just merge the two together.
    • 0:51:23I think both KNN and logistic regression have their own advantages
    • 0:51:27and disadvantages.
    • 0:51:28The idea is to test your data against the different kinds of ML models
    • 0:51:33that are available out there,
    • 0:51:35and to see which model performs better on the current data
    • 0:51:40set.
    • 0:51:41So, for example, you can see that KNN performed really
    • 0:51:43well on the data set that we had just now,
    • 0:51:45but that doesn't mean that KNN will always perform better on other data
    • 0:51:48sets that you are testing.
    • 0:51:50So it's actually highly recommended to try different models on your data.
    • 0:51:52But on the idea of integrating two machine learning models together
    • 0:51:56to get higher accuracy, it might be possible,
    • 0:51:59but I'm not really sure about it.
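For what it's worth, combining models this way is a standard technique called ensembling, which goes a bit beyond what the seminar covered; here is a minimal sketch with scikit-learn's VotingClassifier, assuming the X_train/y_train split from the earlier scikit-learn sketch:

    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Combine several models; the ensemble takes a majority vote of their predictions.
    ensemble = VotingClassifier(estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3)),
    ])
    ensemble.fit(X_train, y_train)
    print(ensemble.score(X_test, y_test))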
    • 0:52:01KEVIN XU: I would say the closest analogy is-- well,
    • 0:52:05actually, let's just go back to it.
    • 0:52:07If you can recall the nodes that we have, right?
    • 0:52:11You can think of each set of nodes as like, a different transformation
    • 0:52:14that you do to the data.
    • 0:52:15So what is often the case is that your data will go through many, many layers,
    • 0:52:20especially for things that are very complicated.
    • 0:52:23And, so, this is not necessarily applying two different machine learning
    • 0:52:26models, but you are--
    • 0:52:28this kind of transformation is very iterative.
    • 0:52:31And, so, this is like--
    • 0:52:34we call them layers, so like you pass the data through one layer,
    • 0:52:37and then you pass it through another layer, iteratively, until you get
    • 0:52:40to this end output layer.
    • 0:52:42So this actually goes into the specifics of how the network works,
    • 0:52:46but you will often have to use multiple layers
    • 0:52:50to get good results for your data set, because real data is always complicated.
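A minimal PyTorch sketch of that layering idea (the layer sizes are arbitrary; the point is just that data flows through one layer after another):

    import torch
    import torch.nn as nn

    # Data passes through each layer in order: input -> hidden -> hidden -> output.
    model = nn.Sequential(
        nn.Linear(8, 16),
        nn.ReLU(),
        nn.Linear(16, 8),
        nn.ReLU(),
        nn.Linear(8, 1),
        nn.Sigmoid(),
    )

    x = torch.rand(5, 8)      # a batch of 5 made-up samples with 8 features each
    print(model(x).shape)     # torch.Size([5, 1]) -- one output per sample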
    • 0:52:56SPEAKER: This is--
    • 0:52:57please excuse my pronunciation here-- [? Effie, ?]
    • 0:53:00who writes, "Does ML help in cybersecurity?"
    • 0:53:07ZAD CHIN: Wait, the question is, "Does ML affect cybersecurity?"
    • 0:53:09KEVIN XU: Or does it--
    • 0:53:10SPEAKER: "Does ML help in cybersecurity?"
    • 0:53:12ZAD CHIN: Oh, I think it does.
    • 0:53:14So, for example, I think I heard a lot of information
    • 0:53:17about how banks actually use ML to detect fraudulent transactions.
    • 0:53:22So, in that sense, there are a lot
    • 0:53:24of ways in which ML actually helps to prevent cybersecurity attacks,
    • 0:53:29or to detect cybersecurity attacks, especially in a large company.
    • 0:53:33Because it's actually really good at trying
    • 0:53:35to find patterns, or unusual patterns, in a huge data set.
    • 0:53:39So I definitely think that it has a very huge application in cybersecurity
    • 0:53:43itself.
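As one hedged illustration of the fraud-detection idea (not how any particular bank actually does it), an anomaly detector such as scikit-learn's IsolationForest can flag unusual rows in a table of transactions; the feature columns here are entirely made up:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Made-up transaction features: amount and hour of day for 1,000 transactions.
    rng = np.random.default_rng(0)
    transactions = np.column_stack([
        rng.normal(50, 10, 1000),      # typical amounts around $50
        rng.integers(8, 22, 1000),     # typical hours between 8am and 10pm
    ])

    # Fit an anomaly detector, then score a suspicious-looking transaction.
    detector = IsolationForest(contamination=0.01, random_state=0)
    detector.fit(transactions)
    print(detector.predict([[5000, 3]]))   # -1 means flagged as an outlier, 1 means normal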
    • 0:53:44KEVIN XU: Yeah, and if there's one takeaway
    • 0:53:46that we want you to have from this, it's that the goal of-- what
    • 0:53:51ML is really good at doing is taking a lot of data
    • 0:53:53and finding connections between those data points.
    • 0:53:56So anything you can think of that needs to process a huge amount of data,
    • 0:54:00you can almost always apply some form of ML.
    • 0:54:04If this data is meaningful, of course.
    • 0:54:08So, in terms of cybersecurity, actually
    • 0:54:11building firewalls and those kinds of things
    • 0:54:14doesn't necessarily have as much to do
    • 0:54:16with processing huge amounts of data.
    • 0:54:19But, as Zad said, with fraudulent account
    • 0:54:23accesses and things like that, that is something
    • 0:54:25that is very applicable to ML.
    • 0:54:28SPEAKER: OK.
    • 0:54:29[? Madhi ?] asks, "What is the AI application in video games nowadays?
    • 0:54:34And, as a machine learning developer, can I work in the game field?"
    • 0:54:40KEVIN XU: Oh yeah, so, I can take this question.
    • 0:54:42Yeah, so, I don't know if you've seen, but OpenAI,
    • 0:54:47which is an independent AI research lab (not part of Tesla or Google),
    • 0:54:55is a team that has been developing machine learning
    • 0:54:59game software for quite a while.
    • 0:55:01And a couple of years ago, they made big news with Dota 2, which
    • 0:55:06is one of the popular MOBA games: they managed
    • 0:55:12to write an AI that beat a team of professionals.
    • 0:55:14So this is totally something that is doable today,
    • 0:55:18and I recommend you check out OpenAI--
    • 0:55:25the stuff that they've been doing.
    • 0:55:27It's really cool, and it's definitely very marketable
    • 0:55:32if you're interested in that kind of stuff.
    • 0:55:34ZAD CHIN: I think in terms of games, like DeepMind also created AlphaGo,
    • 0:55:37which is like one of--
    • 0:55:38I don't know whether that's considered a video game, but it's a game.
    • 0:55:41It was just super huge at the time, because it beat one of the best Go
    • 0:55:45players on the planet.
    • 0:55:46It was AlphaGo versus a human.
    • 0:55:48There's also a documentary about it, called AlphaGo.
    • 0:55:51Yeah, highly recommended.
    • 0:55:53Super--
    • 0:55:53KEVIN XU: Yes, so, furthermore, in terms of not just video games,
    • 0:55:59people are still developing things to
    • 0:56:03just test what we can use this for.
    • 0:56:09So, games have different classifications,
    • 0:56:11and Carnegie Mellon actually built an AI that beat top poker professionals a few years ago.
    • 0:56:15Which is crazy, because, in poker, you don't always have all the information.
    • 0:56:20So there's so much more extrapolation that you
    • 0:56:22need to do from a given subset of information.
    • 0:56:25And, so, it's actually very interesting how far machine learning
    • 0:56:30has come, being able to take a comparatively small set of data
    • 0:56:34and generalize it to big things.
    • 0:56:40SPEAKER: OK, and [? Yashvi ?] asks, "Just as you mentioned,
    • 0:56:43we trained the computer for game 1,000 times.
    • 0:56:46Was that the training part or the testing part from train-test?"
    • 0:56:50KEVIN XU: Ah, yeah, great.
    • 0:56:51So, when we did reinforcement learning, the training over 1,000 times
    • 0:56:56was simply to train the computer.
    • 0:57:00When you're doing reinforcement learning,
    • 0:57:04you don't really have a test set, as opposed
    • 0:57:07to when you have a big data set of images, where you can use 80% of them
    • 0:57:11to train, and then verify that your algorithm, or your regression,
    • 0:57:15is correct with the remaining 20%.
    • 0:57:16In the case of the game, once you've trained your AI,
    • 0:57:21you just have your AI play the game.
    • 0:57:22In that case, that is your testing phase.
    • 0:57:25It's like, did the AI win?
    • 0:57:27Like in our case, it was like, how many points did the AI get on average?
    • 0:57:31And, so, that is the testing phase, which doesn't actually
    • 0:57:35use a data set but rather just uses the game itself.
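For the supervised case mentioned there, the 80/20 train-test split is typically a single call in scikit-learn; a minimal sketch (X and y stand for whatever labeled features and labels you have):

    from sklearn.model_selection import train_test_split

    # Hold out 20% of the labeled data for testing; train on the remaining 80%.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)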
    • 0:57:38SPEAKER: OK.
    • 0:57:39"What is the--" This is from [? Madhi ?] again, "What is the application--"
    • 0:57:43Oh, no I already asked that one, sorry about that.
    • 0:57:47OK from HW, "Besides Google TensorFlow, what other ML platforms
    • 0:57:53should we look into?
    • 0:57:54Anything from IBM or Microsoft?"
    • 0:57:57I think you answered that one, right?
    • 0:58:00KEVIN XU: It's--
    • 0:58:00OK, so the thing about libraries is that choosing the best library
    • 0:58:07can be very hard.
    • 0:58:08And this is something that requires some knowledge, and requires
    • 0:58:11you to just jump into ML and see what works and what doesn't.
    • 0:58:14So what we recommend is that you don't concentrate so much
    • 0:58:20on optimizing everything right now.
    • 0:58:23You want to get something that works with the knowledge and the skills
    • 0:58:27that you have first, before you improve and try to optimize
    • 0:58:31with algorithms that suit your data set particularly well, and things like that.
    • 0:58:34So--
    • 0:58:36ZAD CHIN: I think the other thing about trying
    • 0:58:38to find an ML library is to check whether it is well documented,
    • 0:58:42and whether a lot of people have actually tried it as well.
    • 0:58:45I personally don't know of any library
    • 0:58:47from IBM or Microsoft-- maybe I'm just not aware of them,
    • 0:58:51but I don't really know of one.
    • 0:58:52But I think IBM and Microsoft both have really strong research teams,
    • 0:58:56which I really admire as well.
    • 0:58:58In terms of TensorFlow, though, it is really, really well documented.
    • 0:59:01There are a lot of examples out there, a lot of Stack Overflow posts about it,
    • 0:59:05which is actually very important.
    • 0:59:06Because when you hit a bottleneck,
    • 0:59:08you need someone to help you, someone who understands your problem.
    • 0:59:11And since everybody is using it and it's so well documented, that help is easier to find.
    • 0:59:14So I would recommend TensorFlow if you are beginning,
    • 0:59:17but like, if you want to be more sophisticated,
    • 0:59:19you can start to build your own neural network.
    • 0:59:22And I think that will be the most sophisticated thing that you can do.
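And if you do eventually want to try building a tiny neural network yourself, here is a minimal NumPy sketch (one hidden layer, trained on the classic XOR problem with plain gradient descent), just to show there's no magic inside:

    import numpy as np

    # XOR: the classic example a single linear model cannot learn.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 8)), np.zeros((1, 8))   # hidden layer parameters
    W2, b2 = rng.normal(size=(8, 1)), np.zeros((1, 1))   # output layer parameters

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    for step in range(10000):
        # Forward pass: input -> hidden layer -> output.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)

        # Backward pass: gradients of the squared error, layer by layer.
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)

        # Gradient descent update with a fixed learning rate.
        W2 -= 1.0 * (h.T @ d_out)
        b2 -= 1.0 * d_out.sum(axis=0, keepdims=True)
        W1 -= 1.0 * (X.T @ d_h)
        b1 -= 1.0 * d_h.sum(axis=0, keepdims=True)

    print(out.round())   # should be close to [[0], [1], [1], [0]] after training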
    • 0:59:25KEVIN XU: And of course, we don't expect you guys, in CS50,
    • 0:59:28to build your own right now.
    • 0:59:29But, in the future, just as a consideration,
    • 0:59:33there's a lot of statistical theory that goes into what machine learning is.
    • 0:59:37And, so, there are Harvard classes, and MIT classes,
    • 0:59:40that you can take that are entirely about ML and applying it.
    • 0:59:45And if you are interested in talking about that,
    • 0:59:48feel free to reach out to one of us, and we'd be happy to chat with you.
    • 0:59:52ZAD CHIN: Yeah.
    • 0:59:52And, so, before we actually end, please take one minute to actually fill out
    • 0:59:56this feedback form, which is tinyurl.com/ML50-feedback.
    • 1:00:00And we really thank you so much for coming--
    • 1:00:03it's really late, or really early, I don't know what time zone you're in.
    • 1:00:06But thank you so much for coming;
    • 1:00:08we really enjoyed giving this seminar, and we were very excited to share this.
    • 1:00:12And we hope that you learned something too.
    • 1:00:14So--
    • 1:00:17KEVIN XU: So, yeah, I think we'll be ending the recording now,
    • 1:00:20but might stay a little bit after to answer any further questions.