CS50 Video Player
    • 0:00:00Introduction
    • 0:00:19Outliers
    • 0:02:09temps.R
    • 0:07:21Transforming Vectors
    • 0:10:53Logical Expressions
    • 0:18:56Logical Operators
    • 0:32:16chicks.R
    • 0:51:18Logical Functions
    • 1:04:12Menus
    • 1:19:40Conditionals
    • 1:24:52sales.R
    • 0:00:00[MUSIC PLAYING]
    • 0:00:19CARTER ZENKE: Well, hello, one and all, and welcome back
    • 0:00:22to CS50's Introduction to Programming with R. My name is Carter Zenke.
    • 0:00:26And in this lecture, we'll learn all about transforming data.
    • 0:00:29We'll see how to remove unwanted pieces of data, how to subset our data
    • 0:00:33and find certain pieces that we want to take a look at, and ultimately, how
    • 0:00:36to take different data from different sources
    • 0:00:38and combine it into one single data set.
    • 0:00:40So let's go ahead and jump right on in.
    • 0:00:43Now, whether or not you're familiar with statistics or data science,
    • 0:00:46you might have heard of this idea of an outlier, where
    • 0:00:49an outlier is some piece of data that falls outside some standard range.
    • 0:00:52Now, here, for instance, is a graph of average temperatures in January
    • 0:00:56up here in the Northeast United States.
    • 0:00:58Notice first on the y-axis, I have the temperature in degrees Fahrenheit.
    • 0:01:02That's what we use up here in the US.
    • 0:01:03And then down below, I have the day of the month, 1 through 31.
    • 0:01:07And it seems to me like these bars represent individual days of the month.
    • 0:01:11And how high or low they go represents the average temperature on that day.
    • 0:01:17Now, in the Northeast US, it can get pretty cold
    • 0:01:19by default, kind of all the way down towards 0 degrees.
    • 0:01:22But it could also get as warm as, let's say, 50 degrees
    • 0:01:25or so, as kind of shown by most of these bars.
    • 0:01:27But in this data, it seems like there are a few days that
    • 0:01:30fell outside of that range.
    • 0:01:32Like, if I look down here on day 2, that seemed
    • 0:01:35like a really cold day, somewhere like negative 10, negative 15 degrees.
    • 0:01:38Day 4 seemed even colder, like negative 20 or so.
    • 0:01:42And then day 7, that was really warm for January up here.
    • 0:01:46It was, like, 60 degrees or higher.
    • 0:01:47So it seems like these would be the outliers in this data
    • 0:01:51set of temperatures.
    • 0:01:53And for one reason or another, you might hope, as a scientist, a data scientist,
    • 0:01:57or a statistician, to remove these outliers altogether and conduct
    • 0:02:01some analysis without them involved.
    • 0:02:04So let's see if we can solve this problem of outliers now using R.
    • 0:02:08We'll come back over here to RStudio, our old friend, our IDE,
    • 0:02:12or our Integrated Development Environment,
    • 0:02:14that allowed us to write R code and to write R programs.
    • 0:02:18So we saw this function last time called file.create
    • 0:02:22that allowed me to create a new file, in which I could write some R code.
    • 0:02:26So I'll go ahead and type that same thing here, file.create.
    • 0:02:29And in this case, I'll call this one temps.R for temperatures here.
    • 0:02:35And I'll hit Enter.
    • 0:02:36And now I see TRUE, which again means this file was, in fact, created.
    • 0:02:40And as we saw last time, I can go to my File Explorer
    • 0:02:44over here, which shows my working directory, the place I'm
    • 0:02:47going to store these R files by default. And I can click on temps.R.
    • 0:02:52And I'll open it in what's called my file editor,
    • 0:02:55where I can write more than one line of R code.
    • 0:02:59Now, as we saw last time, one thing you often want to do in R
    • 0:03:03is read some data from some file.
    • 0:03:05And we saw these CSV files, comma separated value files
    • 0:03:09that could store tables of data.
    • 0:03:11Well, it turns out that R can also work with all kinds of other file
    • 0:03:15formats, one of which is particular to R. This is called an R data file.
    • 0:03:21And it turns out that using an R data file,
    • 0:03:23you can store R's data structures, like vectors, data frames
    • 0:03:27like we saw last time, in a file itself such that when I load them,
    • 0:03:32I just see exactly what was in the environment in terms
    • 0:03:35of that same vector or that same data frame.
    • 0:03:37So let me try doing that.
    • 0:03:39And to load an R data file, I can use this function conveniently called load.
    • 0:03:45So I'll type load here followed by some parentheses.
    • 0:03:48And now, I could type the name of the R data file I want to open.
    • 0:03:53Now, my colleague, let's say, has given me a file called temps.RData.
    • 0:03:57So I could open it using load("temps.RData"), just like this.
    • 0:04:02And now, let me run this line of R code.
    • 0:04:05I can do so if I type Command Enter on a Mac or Control Enter on Windows.
    • 0:04:10I could also click this run button here.
    • 0:04:12Let me hit Command Enter.
    • 0:04:14And I'll see, well, nothing, really.
    • 0:04:17But if I look in my environment now, if I open this other pane over here
    • 0:04:21called Environment, I should actually see
    • 0:04:23that I now have a vector called temps that seems
    • 0:04:27to have 31 numbers as part of it here.
    • 0:04:31So why don't I try to find, first off, the average temperature in all
    • 0:04:36of January?
    • 0:04:37And if I want to find an average, I could
    • 0:04:39use this other function called mean, where we often call an average a mean.
    • 0:04:44Well, I could type mean here and then give it
    • 0:04:46this same vector of temperatures.
    • 0:04:48And if I run this line of R code, I'll hit Enter and see the mean,
    • 0:04:52the average of these temperatures was roughly 22.74 degrees Fahrenheit.
    • 0:04:57Now, if you're not familiar with averages or means, all I've done here
    • 0:05:01is I've summed up all the values in this vector.
    • 0:05:04And I have divided by the number of values
    • 0:05:06that I have, producing some kind of typical value of the data set,
    • 0:05:10also called the average.
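As a rough sketch of the steps so far, the snippet below writes a small vector to an R data file, loads it back, and takes its mean. The lecture's temps.RData is not reproduced here, so the eight values and the temporary file path are assumptions for illustration; only positions 1, 2, 3, 4, and 7 match temperatures quoted in the lecture.

```r
# Hypothetical stand-in for the lecture's 31-value temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65, 40)

# Save the vector to an R data file in a temporary directory (assumed path).
path <- file.path(tempdir(), "temps.RData")
save(temps, file = path)

# Remove the vector, then restore it from the file with load().
rm(temps)
load(path)

# mean() sums the values and divides by how many there are.
average <- mean(temps)
by_hand <- sum(temps) / length(temps)
print(average)  # 20 for this stand-in data
```

With the real temps.RData in the working directory, the save step would be unnecessary and load("temps.RData") alone would restore the vector.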
    • 0:05:12So this then tells us that in January, it
    • 0:05:15seems like our average temperature is somewhere around 22 degrees Fahrenheit.
    • 0:05:19But that's not why we're here.
    • 0:05:21We're here because some of these data points seem to be a little anomalous.
    • 0:05:24We had some really cold days and some really hot days.
    • 0:05:27And maybe you want to remove those days altogether
    • 0:05:30before we run this temperature analysis.
    • 0:05:33So let me actually take a peek at this entire vector.
    • 0:05:36I can do so by simply typing the name of the vector
    • 0:05:39and hitting Command Enter to see it down in my console.
    • 0:05:42And here are each of those 31 values.
    • 0:05:46So one thing you might notice is that I can see these outliers now in the data
    • 0:05:51below.
    • 0:05:51It seems like that second day, it seemed really cold.
    • 0:05:54Well, that day actually had an average temperature of negative 15 degrees
    • 0:05:58Fahrenheit.
    • 0:05:59And that fourth day, that was about negative 20 degrees.
    • 0:06:01And same thing here.
    • 0:06:03Looks like the seventh day was all the way up
    • 0:06:05at 65, which is pretty warm over here.
    • 0:06:08So one thing you might want to do is actually pull out these outliers
    • 0:06:12to use them in my code.
    • 0:06:13And we saw last time, I could use this method of indexing
    • 0:06:17into this particular vector, that is, trying to find particular values
    • 0:06:21and pull them out to use in my code using their positions in this vector.
    • 0:06:26Now, it seemed like that second day was particularly cold.
    • 0:06:30So I could find that temperature by using temps
    • 0:06:32bracket 2, where 2 represents that second element in our vector.
    • 0:06:36If I want to find it, I could use bracket 2.
    • 0:06:39And I'll see, in fact, I get back negative 15.
    • 0:06:42Same thing for the other one.
    • 0:06:44I could use temps bracket 4.
    • 0:06:45And that shows me negative 20, that other outlier in our data set.
    • 0:06:49I could also use temps bracket 7, and that
    • 0:06:52would show me this really warm temperature
    • 0:06:54overall in this same vector.
    • 0:06:56But this is where we left off last time.
    • 0:06:59And what I want to do now ideally is not have these outliers represented
    • 0:07:04individually, but really have a vector or a list of those outliers
    • 0:07:09to work with.
    • 0:07:10And I'd argue that I don't quite know how to do that just yet.
    • 0:07:14But I can show you one trick we can use in R to get back
    • 0:07:18a vector from a current vector.
    • 0:07:21So let's think through what we've already done.
    • 0:07:23We saw last time, if we wanted to get some element from a vector,
    • 0:07:27we could use the same bracket notation that we even just now used.
    • 0:07:32I could use bracket notation and say, give me the second element
    • 0:07:35inside of this temps vector.
    • 0:07:37And this is known as indexing into this vector.
    • 0:07:40I take the position of the element I want to find, put it in brackets,
    • 0:07:43and I get back that very same element.
    • 0:07:46So again, temps bracket 4 is negative 20, and temps bracket 7 is 65.
    • 0:07:51But it turns out that cleverly in R, we don't always
    • 0:07:54have to provide a single index.
    • 0:07:57If we want instead a vector from this current vector, maybe a vector that
    • 0:08:02includes only some values, well, I could actually
    • 0:08:05give, as the index, not a single index, but a vector of indexes.
    • 0:08:11And I could actually index into this vector using a vector of indexes.
    • 0:08:15So let's take a look at that.
    • 0:08:17I could instead type something like this.
    • 0:08:18Give me 2, 4, and 7, those elements at these positions, 2, 4, and 7.
    • 0:08:25And notice here, I'm using this c function
    • 0:08:27we saw earlier, which stands for combine.
    • 0:08:29This makes for me a vector that includes 2, 4, and 7.
    • 0:08:34And now I'm indexing into temps using not a single value,
    • 0:08:37but a vector of indexes.
    • 0:08:39And what I'll get back is as follows.
    • 0:08:41I'll kind of mark these as the ones I want to grab.
    • 0:08:43And I will grab them out and turn them into their own vector
    • 0:08:47for me to work with in R.
    • 0:08:49So let's go ahead and try this transformation of this vector in R
    • 0:08:53and see what we get back.
    • 0:08:54Go back to my computer.
    • 0:08:56And I'll go back to RStudio, where we have our same temps vector.
    • 0:09:00But now I don't want these individual values.
    • 0:09:03I want a vector of the outliers.
    • 0:09:06So I could modify how I'm indexing into this temps vector.
    • 0:09:10And I could use instead a vector to index into it.
    • 0:09:14I want to get back those values at locations 2, 4, and 7.
    • 0:09:18And if I hit Command Enter here, I'll see
    • 0:09:21I now have a vector of those outliers.
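A minimal sketch of indexing with a vector of indexes, assuming a hypothetical eight-value temps vector rather than the lecture's 31 values (positions 2, 4, and 7 match the outliers quoted in the lecture):

```r
# Hypothetical stand-in for the lecture's temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65, 40)

# A single index returns a single element.
second <- temps[2]            # -15

# A vector of indexes, built with c(), returns a vector of elements.
outliers <- temps[c(2, 4, 7)] # -15 -20 65
print(outliers)
```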
    • 0:09:25And that's pretty cool.
    • 0:09:26I think we can do a lot with this.
    • 0:09:28But one thing I haven't done yet is removed them.
    • 0:09:31Like, if I still look at temps now, I'll see
    • 0:09:34that those vectors-- or those elements are still part of my vector.
    • 0:09:37I haven't taken them out to remove them altogether.
    • 0:09:40If I wanted to do that, well, I'll need to take a different approach.
    • 0:09:44And one thing I can do in R is use a simple minus sign or a dash
    • 0:09:50and prefix my c function here, my vector of indexes.
    • 0:09:54And what this will tell R is I don't want you to grab these.
    • 0:09:58I actually want you to remove them.
    • 0:10:01This minus sign says take the elements at these indexes and drop them.
    • 0:10:05Remove them from this vector.
    • 0:10:07So now, if I run this line of code on line three, what do I see?
    • 0:10:12Well, all of my temperatures.
    • 0:10:14But you'll notice that I'm now missing some.
    • 0:10:16I'm missing those elements that were previously at positions 2, 4, and 7,
    • 0:10:20or those outliers.
    • 0:10:22So let's visualize this too.
    • 0:10:24One thing that I've done over here is I've said,
    • 0:10:26I actually want you to remove these values.
    • 0:10:29And I've done so by putting this dash in front of this particular index,
    • 0:10:33this vector of indexes here.
    • 0:10:35And what R will now do is highlight these essentially
    • 0:10:38and say, OK, I know you want to remove these particular elements.
    • 0:10:41And it will then return to me, give me back,
    • 0:10:43a vector that no longer includes those elements.
    • 0:10:46It becomes shorter, so to speak, just like this.
    • 0:10:48So now, back in R, I'm able to remove those elements from my vector.
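The minus-sign form can be sketched the same way. The temps values here are again a hypothetical stand-in, so only the positions of the outliers mirror the lecture:

```r
# Hypothetical stand-in for the lecture's temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65, 40)

# A minus sign in front of the index vector drops those positions
# instead of selecting them.
without_outliers <- temps[-c(2, 4, 7)]
print(without_outliers)          # 15 20 30 25 40

# The result is shorter; temps itself is unchanged.
print(length(temps))             # 8
print(length(without_outliers))  # 5
```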
    • 0:10:54Now, let's come back over here.
    • 0:10:55And let's see what more we could do with this.
    • 0:10:58Well, one thing I wouldn't want to be in this scenario
    • 0:11:01is the person who has to go through and find all of these particular outliers
    • 0:11:06and tell me what their indexes are.
    • 0:11:08Like, if I had to go through thousands of pieces of data
    • 0:11:11and figure out which ones were the outliers
    • 0:11:13and which ones weren't, well, I'd kind of be wasting my time.
    • 0:11:16What I'd love to do instead is really ask a question.
    • 0:11:21Is this piece of data an outlier, or is it not?
    • 0:11:24Ask this yes or no question.
    • 0:11:26And it turns out that in R, we can actually
    • 0:11:28express those kinds of questions using a tool called a logical expression.
    • 0:11:34A logical expression.
    • 0:11:35Now, a logical expression allows us, as programmers,
    • 0:11:38to express these yes or no questions and get back a yes or no answer.
    • 0:11:42In particular, logical expressions often use what we're
    • 0:11:44going to call comparison operators.
    • 0:11:47And here are a few of them here.
    • 0:11:49Notice this one, this double equal sign, stands for equality.
    • 0:11:53Allows me to compare two values, a left one and a right one, and ask,
    • 0:11:56are they equal, or are they not?
    • 0:11:59Now, this next operator, this exclamation point equals,
    • 0:12:02that stands for not equals.
    • 0:12:04It will take a value on the left and a value on the right and say,
    • 0:12:07are these two values not equal?
    • 0:12:10And similarly for the other one down here,
    • 0:12:12you might have seen this greater than sign in grade school.
    • 0:12:14This one stands for greater than.
    • 0:12:15This one stands for greater than or equal to, this one less than,
    • 0:12:18this one less than or equal to.
    • 0:12:20But these comparison operators allow us to compare different values
    • 0:12:24and get back a yes or no response.
    • 0:12:27And actually, true to their name, these logical expressions
    • 0:12:30return to us what's called in R a logical, where a logical is simply
    • 0:12:34this value that is either true or false, yes or no.
    • 0:12:38And so you'll see these values occur throughout your time in using R,
    • 0:12:41capital T-R-U-E and capital F-A-L-S-E. These represent yes or no.
    • 0:12:48TRUE or FALSE.
    • 0:12:49Is this comparison true or not?
    • 0:12:52Now, you might also see them in terms of just T and F.
    • 0:12:55This is shorthand for these same logicals.
    • 0:12:58But in general, you might often see TRUE or FALSE here.
    • 0:13:02So let's see if I could use these logical expressions to make
    • 0:13:05my job a whole lot easier now as a programmer.
    • 0:13:08I don't have to find these actual indexes going through data one
    • 0:13:11by one by one.
    • 0:13:12Come back to my code over here.
    • 0:13:15And why don't I go back to RStudio.
    • 0:13:17So here, I have these indexes that I found
    • 0:13:20by kind of combing through my data.
    • 0:13:22But it would be nice if I could have R tell me whether some piece of data
    • 0:13:26is an outlier or not.
    • 0:13:27Well, one thing I can do is maybe try to find
    • 0:13:30those temperatures that are lower than we usually see,
    • 0:13:32like less than 0 degrees.
    • 0:13:34Below 0 degrees is kind of this common benchmark for it was really cold.
    • 0:13:37So let's look maybe first at the first element in this temps vector
    • 0:13:42and ask the question, was that temperature lower than or less
    • 0:13:47than 0 degrees?
    • 0:13:49And this is my first logical expression.
    • 0:13:52Now, if I were to run this line of code, hit Command Enter here,
    • 0:13:56what do I get back?
    • 0:13:57Well, FALSE.
    • 0:13:58So it seems like temps bracket 1, if I were to run this and show you
    • 0:14:02what that actually is equal to, 15.
    • 0:14:0415, of course, is not less than 0.
    • 0:14:08Now, what if I did it for the second one?
    • 0:14:10I could ask that same question, temps bracket 2.
    • 0:14:12And then I could say 1 over here.
    • 0:14:15And now I have TRUE.
    • 0:14:16So it seems like temps bracket 2 is negative 15.
    • 0:14:21So in that case-- actually, let me change this.
    • 0:14:23This is not 1.
    • 0:14:24It should be less than 0.
    • 0:14:25So temps bracket 2 less than 0.
    • 0:14:27Negative 15 is certainly less than 0.
    • 0:14:30I could keep going and ask the same question for temps bracket 3.
    • 0:14:32Is temps bracket 3 less than 0?
    • 0:14:35Well, it turns out it's not.
    • 0:14:36If I see temps bracket 3 down here, looks like that value is 20.
    • 0:14:41So I've gotten some of the way there.
    • 0:14:44I'm able to ask these questions of individual pieces of data.
    • 0:14:47But I'd argue my job, my life isn't that much easier right now.
    • 0:14:52I still have to go through all of these indices, temps bracket 4, temps
    • 0:14:56bracket 5, and so on.
    • 0:14:57And my job is still to write lots and lots of R code to ask these questions.
    • 0:15:03Now, thankfully, these comparison-- or these operators
    • 0:15:08here, they allow me to actually give an entire vector as input.
    • 0:15:13They're what we would call vectorized.
    • 0:15:15So I could, on line three, instead of giving a single value from this vector,
    • 0:15:19I could give it the entire vector and get back a vector in response.
    • 0:15:23I could run line three, Command Enter here.
    • 0:15:26And now, I have a whole vector of TRUE or FALSE values, these logical values.
    • 0:15:32This is what's called a logical vector.
    • 0:15:34And notice here that for every element inside temps,
    • 0:15:38I actually asked this same question.
    • 0:15:40Is this element less than 0?
    • 0:15:42Is this element less than 0?
    • 0:15:43And I see it seems like the second and the fourth are less than 0,
    • 0:15:48just like we saw in our data.
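To illustrate the vectorized comparison, here is a small sketch using a hypothetical eight-value temps vector (the lecture's vector has 31 values; the first four and the seventh here match the ones quoted):

```r
# Hypothetical stand-in for the lecture's temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65, 40)

# Comparing one element gives one logical value.
first_below_zero <- temps[1] < 0   # FALSE, since 15 is not less than 0

# Comparing the whole vector asks the same question of every element
# and returns a logical vector of the same length.
below_zero <- temps < 0
print(below_zero)  # FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
```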
    • 0:15:51So let me pause here and ask, what questions do we
    • 0:15:55have on these logical expressions and these logical comparison operators?
    • 0:16:00AUDIENCE: Can I access the inner tuple in the list?
    • 0:16:03CARTER ZENKE: So a question about tuples and lists,
    • 0:16:05which are other structures we have in R. Tuples are similar to vectors,
    • 0:16:09but they actually store more than one storage mode,
    • 0:16:12for instance, both numeric and character types.
    • 0:16:15We'll focus more on tuples and lists a little
    • 0:16:17later on, but not particularly right now, though.
    • 0:16:20Any other questions?
    • 0:16:21AUDIENCE: When you used the deletion operator with the minus sign,
    • 0:16:25is that modifying our source data?
    • 0:16:27CARTER ZENKE: Good question.
    • 0:16:28So when I use that negative and I got back
    • 0:16:30a vector that excluded some values, the question is,
    • 0:16:33did that kind of save as a new vector?
    • 0:16:35Did it change our environment at all?
    • 0:16:37And the answer is I get to decide that myself.
    • 0:16:40I go back to my code over here.
    • 0:16:42Let me go back to what we did before, where I had temps here as a vector.
    • 0:16:47And I decided to, in this case, access individual elements of it,
    • 0:16:51like 2, 4, and 7.
    • 0:16:53I instead wanted to remove those.
    • 0:16:55If I wanted to actually update temps to remove those in future lines of code
    • 0:17:00as well, I would need to reassign this vector.
    • 0:17:03I would say temps is reassigned, in this case,
    • 0:17:06to the result of excluding these particular indexes here.
    • 0:17:09So I'm first going to remove these elements, 2, 4, and 7,
    • 0:17:12and reassign it back to temps.
    • 0:17:14And now, below this line of code, temps will always
    • 0:17:17exclude those values for me.
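The reassignment described in this answer might look like the sketch below. The data is a hypothetical stand-in for the lecture's vector:

```r
# Hypothetical stand-in for the lecture's temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65, 40)

# Without assignment, the shorter vector is only printed;
# temps keeps all of its values.
temps[-c(2, 4, 7)]
length_before <- length(temps)   # 8

# Assigning the result back to temps makes the removal stick
# for every later line of code.
temps <- temps[-c(2, 4, 7)]
length_after <- length(temps)    # 5
print(temps)                     # 15 20 30 25 40
```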
    • 0:17:19A good question.
    • 0:17:22OK.
    • 0:17:22So we've seen how we can ask these questions in R code
    • 0:17:26to determine which of these values are outliers.
    • 0:17:30And in fact, we can use these logical vectors, these logical expressions,
    • 0:17:34to actually figure out automatically at which indexes
    • 0:17:38we had these particular values being true or false.
    • 0:17:42We can use a function called which, where
    • 0:17:45which takes, as input, this vector of logical values
    • 0:17:48and tells me which ones are true.
    • 0:17:51Or more particularly, it tells me the indices of which ones are true.
    • 0:17:55Here, I'll run line three, and I get back both 2 and 4.
    • 0:17:59So it seems like if I look at the logical vector
    • 0:18:01itself, which was temps less than 0, notice
    • 0:18:06how the second element of this vector is TRUE, and so is the fourth.
    • 0:18:10So if I were to use which, which would tell me
    • 0:18:13at which indices is this logical vector true.
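A short sketch of which, assuming the same hypothetical eight-value temps vector used for illustration (not the lecture's full data):

```r
# Hypothetical stand-in for the lecture's temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65, 40)

# which() takes a logical vector and returns the positions that are TRUE.
cold_days <- which(temps < 0)
print(cold_days)  # 2 4
```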
    • 0:18:17So pretty helpful now.
    • 0:18:19But I'd argue that I'm not really asking the question I wanted to ask.
    • 0:18:23Like, I wanted to ask, is this piece of data an outlier?
    • 0:18:27And an outlier can be either low or high.
    • 0:18:30So here, I've been focusing on outliers that are low.
    • 0:18:33But I also want to find outliers that are high,
    • 0:18:36let's say greater than 60 degrees.
    • 0:18:38So for that, I could use another logical expression,
    • 0:18:41like temps greater than, let's say, 60.
    • 0:18:44And if I run or evaluate this logical expression, what will I see?
    • 0:18:49Well, I'll see FALSE, FALSE, FALSE, FALSE.
    • 0:18:51But I will see TRUE for that seventh day because that
    • 0:18:54was a pretty high temperature there.
    • 0:18:56So there has to be a way for me to combine,
    • 0:18:59let's say, these logical expressions and ask the question I want to ask.
    • 0:19:03And it turns out we can do so in R using what we'll call logical operators.
    • 0:19:08Logical operators let us combine two or more logical expressions
    • 0:19:13to ask a more complex question in code.
    • 0:19:16Now, you might notice that I asked the question, is this value less than 0,
    • 0:19:22or is it greater than 60?
    • 0:19:25You often want to combine logical expressions
    • 0:19:27with this idea of and or or.
    • 0:19:30And in fact, R gives you a way to do just that.
    • 0:19:33Here, I have two symbols.
    • 0:19:34One is the ampersand, and one is this vertical pipe.
    • 0:19:37The ampersand represents and.
    • 0:19:40I can combine two logical expressions and use an and between them
    • 0:19:45with this ampersand.
    • 0:19:46If I want to use an or, for instance, I could use this bar here.
    • 0:19:49This represents or for me.
    • 0:19:51So for instance, let's say I wanted to ask a question,
    • 0:19:54is this temperature below 0 or greater than 60?
    • 0:19:58I would put those two logical expressions
    • 0:20:00on either side of this vertical pipe.
    • 0:20:02And the pipe would symbolize that if either of those expressions is true,
    • 0:20:06then the entire thing is true.
    • 0:20:08For and, by contrast, both expressions on either side
    • 0:20:12have to be true for the entire expression now to be true.
    • 0:20:16And you can think of this a bit like English.
    • 0:20:18Something is only true if this and that are true as well.
    • 0:20:22Now, unlike our comparison operators that we saw earlier,
    • 0:20:26these logical operators actually work differently
    • 0:20:30for vectors of logicals and single logical values.
    • 0:20:34So these single symbols, ampersand and the vertical bar,
    • 0:20:38those work for vectors of logicals.
    • 0:20:41If you have single logical values that you want to combine,
    • 0:20:45you need to use this double character set here, ampersand ampersand
    • 0:20:49or vertical bar vertical bar.
    • 0:20:51These work for the single value TRUE or FALSE, whereas these work for vectors
    • 0:20:56of TRUE or FALSE.
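For single logical values, the doubled operators could be used roughly as follows. The variable names and the choice of day 2 are assumptions for illustration, not code from the lecture:

```r
# Hypothetical stand-in for the lecture's temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65, 40)

# Pull out one temperature so each comparison yields a single TRUE or FALSE.
second <- temps[2]   # -15

# || is TRUE if either single value is TRUE.
is_outlier <- second < 0 || second > 60    # TRUE

# && is TRUE only if both single values are TRUE.
in_range <- second >= 0 && second <= 60    # FALSE

print(is_outlier)
print(in_range)
```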
    • 0:20:58So let's try actually implementing this now in code
    • 0:21:01to see if I can get at my question now.
    • 0:21:04How can I find the outliers in this data set?
    • 0:21:07Well, here, I have my two logical expressions.
    • 0:21:10And I want to combine them to represent one larger logical expression.
    • 0:21:14Well, as I said before, I'm interested in whether a temperature is below 0
    • 0:21:19or if it's above 60, just like this.
    • 0:21:23So this now is my full logical expression.
    • 0:21:26And I can evaluate it or run it if I do Command Enter on line three.
    • 0:21:31And now I'll see I've kind of combined my different expressions.
    • 0:21:35I still see that these second and fourth values,
    • 0:21:39this expression is true for those.
    • 0:21:41They are less than 0.
    • 0:21:42But I also see that on the element 7 here, that value is greater than 60.
    • 0:21:47And so now that is true as well.
    • 0:21:49If either of these expressions is true, less than 0 or greater than 60,
    • 0:21:53I'll then see a TRUE in this logical vector.
    • 0:21:57And now I can go back to using which.
    • 0:21:59I could use which to figure out at which indexes, which indices,
    • 0:22:04these particular values are stored.
    • 0:22:07So it seems like 2, 4, and 7.
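The combined expression can be sketched like this, once more on a hypothetical eight-value temps vector whose outliers sit at the same positions as in the lecture:

```r
# Hypothetical stand-in for the lecture's temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65, 40)

# The single | combines two logical vectors element by element.
is_outlier <- temps < 0 | temps > 60
print(is_outlier)  # FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE

# which() then reports the positions where the combined vector is TRUE.
outlier_days <- which(is_outlier)
print(outlier_days)  # 2 4 7
```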
    • 0:22:12OK, so I think we're making some pretty good progress here.
    • 0:22:15We've gone from using individual indices to now using entire logical vectors
    • 0:22:20to automatically find for us at which places
    • 0:22:23we have this condition being true.
    • 0:22:26Some other functions to be aware of are these.
    • 0:22:29One you might be curious about is this one called any.
    • 0:22:32Any.
    • 0:22:32Any takes as input a logical vector and returns TRUE
    • 0:22:37if any of these values in that logical vector are true.
    • 0:22:41So here, I'm effectively asking not which values are outliers, but are
    • 0:22:46any of them outliers?
    • 0:22:47A yes or no question.
    • 0:22:48And I'll get back, in this case, yes, that some of these values are outliers.
    • 0:22:53There are, in other words, some values TRUE inside of this logical vector.
    • 0:22:58I could also ask this question.
    • 0:23:01Are all of these values outliers?
    • 0:23:03Kind of a nonsensical question at this point,
    • 0:23:05but you might use it in other cases.
    • 0:23:07Are all of these values outliers?
    • 0:23:11I can give this function, all, that same logical vector as input, run this,
    • 0:23:15and I'll see FALSE.
    • 0:23:16No.
    • 0:23:16Not all of them are outliers.
    • 0:23:19If any of them are false, I'll get back FALSE.
    • 0:23:23I need instead for all of the values in this logical vector to be true for all
    • 0:23:28to return TRUE as well.
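A brief sketch of any and all on the outlier expression, assuming a hypothetical temps vector in place of the lecture's data:

```r
# Hypothetical stand-in for the lecture's temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65, 40)

is_outlier <- temps < 0 | temps > 60

# any() is TRUE if at least one element of the logical vector is TRUE.
some_outliers <- any(is_outlier)   # TRUE

# all() is TRUE only if every element is TRUE.
only_outliers <- all(is_outlier)   # FALSE

print(some_outliers)
print(only_outliers)
```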
    • 0:23:30All right.
    • 0:23:31So one thing we might be wanting to do now is kind of tidy this up a bit.
    • 0:23:36And so I could try to find those values in my temps vector
    • 0:23:42by now using these logical expressions.
    • 0:23:44And I could write that as follows.
    • 0:23:46Temps bracket.
    • 0:23:47And then in this case, let me go ahead and say which.
    • 0:23:50And then let me type in the logical expression we decided on earlier.
    • 0:23:53I'll say temps less than 0 or temps greater than 60.
    • 0:23:58And now, what will happen is first, I'll evaluate this logical expression,
    • 0:24:02finding all the values for which this expression is true.
    • 0:24:05Then which will convert that into some set of indices, at which point
    • 0:24:10I'll pass those into temps.
    • 0:24:12And now, if I run line three, I see my outliers
    • 0:24:15without me going through the data myself.
    • 0:24:18I could also decide to remove these values
    • 0:24:21if I tried to use a minus sign here.
    • 0:24:23Let's try this out.
    • 0:24:24And I should see that same result, but now just dropping
    • 0:24:28or removing those outliers altogether.
    • 0:24:31But it turns out that which here is actually kind of redundant,
    • 0:24:35that R allows me to do the following.
    • 0:24:39I could actually index into my temps vector using nothing other
    • 0:24:44than a logical vector.
    • 0:24:45And what R will do is give me back all of the elements
    • 0:24:49for which this logical expression evaluates to TRUE.
    • 0:24:53I think it's worth visualizing this.
    • 0:24:54And we'll call this taking a subset with a logical vector.
    • 0:24:58So let's imagine, for instance, we have our vector called temps
    • 0:25:01and our logical vector now called filter, for instance.
    • 0:25:04And notice how the values, both FALSE and TRUE in filter, align with those
    • 0:25:09values I either want to keep or remove in temps.
    • 0:25:12The values I want to remove?
    • 0:25:13Well, those align with FALSE.
    • 0:25:15The values I want to keep, those align with TRUE.
    • 0:25:18So now, instead of providing to temps some numbers,
    • 0:25:20some indices to subset this vector, I could provide this logical vector
    • 0:25:24instead, filter, just like this.
    • 0:25:26And I'll mark those values to be either kept or removed,
    • 0:25:29aligning now with that TRUE or FALSE value we saw in filter.
    • 0:25:33And once I complete this subset, I'll be left only with those values
    • 0:25:37that aligned with TRUE or those values I wanted to keep,
    • 0:25:40negative 15, negative 20, and 65 now.
    • 0:25:44I'm going to come back to RStudio.
    • 0:25:45I will go over to my console.
    • 0:25:47And why don't I try just running this line of code as it is?
    • 0:25:51I know that this logical expression evaluates to a logical vector.
    • 0:25:56If I wanted to, I can make this more explicit.
    • 0:25:59Like we do on the slides, I could say my filter, my filter here,
    • 0:26:02as if I'm trying to remove some values but keep others,
    • 0:26:05is this evaluation here.
    • 0:26:07And now, inside of temps, I can put filter just like this.
    • 0:26:11And now, if I run line three, inside of filter is this logical vector.
    • 0:26:16I can then use this logical vector to subset,
    • 0:26:19to access some elements of temps, but not others.
    • 0:26:22Run line four.
    • 0:26:22And now I get back those particular outliers.
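Subsetting with a logical vector, with and without which, might be sketched as below. The name filter follows the lecture; the data is a hypothetical eight-value stand-in:

```r
# Hypothetical stand-in for the lecture's temps vector.
temps <- c(15, -15, 20, -20, 30, 25, 65, 40)

# Store the logical vector under a name, as on the slides.
filter <- temps < 0 | temps > 60

# Indexing with which() first converts TRUE positions to numeric indexes.
via_which <- temps[which(filter)]

# Indexing with the logical vector directly keeps the elements
# that line up with TRUE.
via_logical <- temps[filter]

print(via_logical)  # -15 -20 65
```

For data with no missing values, as here, both forms return the same elements, which is why which is redundant in this case.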
    • 0:26:27OK.
    • 0:26:28Now, what questions do we have on these logical vectors
    • 0:26:32and using them, in this case, as a way to index into
    • 0:26:35or take a subset of our vector here?
    • 0:26:39All right.
    • 0:26:39So seeing none, let's go ahead and keep going.
    • 0:26:41And let's introduce one more thing here.
    • 0:26:44So I promised that we would try to actually remove
    • 0:26:46these outliers altogether.
    • 0:26:48And one thing I've done so far is I've found the outliers
    • 0:26:52and put them in their own separate vector.
    • 0:26:54I haven't actually removed them.
    • 0:26:55Now, one thing that's helpful when you work with these logical expressions
    • 0:26:58is the idea of kind of inverting the result you've gotten.
    • 0:27:02If I get a TRUE value, maybe I actually want
    • 0:27:04to get the opposite, like a FALSE value.
    • 0:27:07Here, I could do the following.
    • 0:27:08Let's say I want to filter to only those temperatures that are actually
    • 0:27:12not outliers.
    • 0:27:14This logical expression here represents an element being an outlier.
    • 0:27:17I could, though, negate this and say, I want
    • 0:27:20to find a value that actually is not an outlier by putting in front of this
    • 0:27:25this exclamation point here.
    • 0:27:27This exclamation point means not.
    • 0:27:29It takes a TRUE value and converts it to FALSE or a FALSE value
    • 0:27:33and converts it to TRUE.
    • 0:27:35So let's try this.
    • 0:27:36I'll run line three just like this.
    • 0:27:39And I'll update my logical vector.
    • 0:27:41Now I'll run line four.
    • 0:27:43And I'll see that now I'm actually getting access
    • 0:27:46to only those elements that are, in this case, not outliers.
    • 0:27:50So again, this value, this exclamation point, this symbol,
    • 0:27:54allows us to take a logical expression that
    • 0:27:57evaluates to either TRUE or FALSE and negate it, get the opposite of that,
    • 0:28:01in this case, TRUE, or in this other case, FALSE.
    • 0:28:05All right.
    • 0:28:05Let's see what else we can do.
    • 0:28:07I'll come back to my RStudio over here.
    • 0:28:09And one thing we also did is we wrapped this logical expression, in this case,
    • 0:28:14in parentheses.
    • 0:28:15This allows me to treat the entire thing as one.
    • 0:28:18Notice how I had two here, one temps less than 0 and one
    • 0:28:22temps greater than 60.
    • 0:28:24In this case, though, I wanted to negate the entire thing.
    • 0:28:28So I wrapped that, in this case, in parentheses.
    • 0:28:31And now I think we've kind of solved our problem.
    • 0:28:34We've gone from, in this case, using these individual indexes to creating,
    • 0:28:39in this case, a vector that excludes those outliers altogether.
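A short sketch of the negation described above, again assuming a made-up `temps` vector. It also shows why the parentheses matter: in R, `!` binds more loosely than the comparison operators but more tightly than `|`, so without parentheses only the first comparison is negated.

```r
# Made-up January temperatures (not the lecture's data)
temps <- c(15, 20, -15, 32, 41, -20, 28, 65, 37)

# Negate the whole "is an outlier" expression
not_outlier <- !(temps < 0 | temps > 60)
temps[not_outlier]

# Without parentheses this means (!(temps < 0)) | (temps > 60),
# so 65 is still kept
temps[!temps < 0 | temps > 60]
```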
    • 0:28:45Now let's complete our analysis.
    • 0:28:46I'll go ahead and try to save, at this point, a vector that
    • 0:28:50doesn't include outliers.
    • 0:28:52And I'll call it no outliers.
    • 0:28:54So I'll go ahead and take my vector temps, just like this.
    • 0:28:59And I'll try to find, again, those values that were not outliers.
    • 0:29:03I'll index into it using my logical vector, temps less than 0
    • 0:29:08or temps, in this case, greater than 60.
    • 0:29:11And negating that, that means that this logical vector
    • 0:29:14is taking the opposite now.
    • 0:29:16And I could, if I wanted to, then find a vector of outliers,
    • 0:29:20just like this, temps and then bracket and then saying temps less than 0
    • 0:29:24or temps greater than 60 now not negated.
    • 0:29:27And now I have two vectors, one that excludes the outliers and one
    • 0:29:32that includes the outliers.
    • 0:29:34And now, finally, if I wanted to save these vectors here,
    • 0:29:37I could use this function called save, which, similar to load,
    • 0:29:41allows me to create an R data file instead of loading it
    • 0:29:45into my environment here.
    • 0:29:48If I type save, I can also then give save the actual vector
    • 0:29:53I want to save to this R data file.
    • 0:29:55I'll save, let's say, no outliers.
    • 0:29:58And then the next argument is one called file.
    • 0:30:01I could say file equals and then say no_outliers.RData.
    • 0:30:07And if I run this line of code, line six, I'll now have,
    • 0:30:11in my File Explorer, this R data file that says no outliers.
    • 0:30:15And we can now save exactly this vector to my computer.
    • 0:30:19And same thing now for outliers.
    • 0:30:21I could save that one to a file called outliers.RData as well.
    • 0:30:27And I would argue this is our entire program,
    • 0:30:29to open and load some vector, to find those outliers and to remove them,
    • 0:30:34and now finally, to save them to their own separate files.
    • 0:30:38I could run this entire file with source up here
    • 0:30:40and get all these results saved to my computer.
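Put together, the program described above might look roughly like this. It is a sketch under assumptions: `temps` is a made-up vector standing in for the one the lecture loads, and the files are written to a temporary directory (`out_dir` is a name introduced here) rather than the working directory.

```r
# Made-up temperatures standing in for the lecture's loaded temps vector
temps <- c(15, 20, -15, 32, 41, -20, 28, 65, 37)

# One vector without the outliers, one with only the outliers
no_outliers <- temps[!(temps < 0 | temps > 60)]
outliers <- temps[temps < 0 | temps > 60]

# Save each vector to its own R data file
out_dir <- tempdir()  # assumption: a scratch directory for this sketch
save(no_outliers, file = file.path(out_dir, "no_outliers.RData"))
save(outliers, file = file.path(out_dir, "outliers.RData"))
```

Calling `load()` on either file later restores the vector under the same name it was saved with.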
    • 0:30:45Now, before we move on, what questions do we have on these logical vectors
    • 0:30:49or on this saving and loading of our data files?
    • 0:30:54AUDIENCE: Do we have if statements in R?
    • 0:30:56CARTER ZENKE: Yeah, a good question.
    • 0:30:57So we have heard, in other languages, of these things called if statements
    • 0:31:00to let you ask questions in other ways.
    • 0:31:02We'll actually see those in a little bit as well.
    • 0:31:07Let's take one more question here.
    • 0:31:09AUDIENCE: What kind of data file is the type R data?
    • 0:31:12Is it like a CSV file or--
    • 0:31:14CARTER ZENKE: Yeah, a great question.
    • 0:31:15So a difference between a CSV file and an R data file
    • 0:31:19is that a CSV file, at the end of the day, is just plain text.
    • 0:31:22You can open it and see the text you have in your data file
    • 0:31:25separated by commas.
    • 0:31:26An R data file, though, lets us save an actual R data
    • 0:31:31structure, like a vector or a data frame, to a file
    • 0:31:34and load it and put it back into our environment.
    • 0:31:37So an R data file is not plain text.
    • 0:31:40But it does allow us to save an actual vector of data, a data frame,
    • 0:31:43and make it easy to load that data later on.
    • 0:31:46So R data files are particular to R and its own data structures,
    • 0:31:50a way of organizing data, like these vectors and data frames,
    • 0:31:52unlike a CSV, which can be used across many different languages altogether.
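To make the contrast concrete, here is a small sketch that writes the same made-up data frame (`january`, a name introduced only for this example) both ways, to temporary files.

```r
# A tiny made-up data frame for illustration
january <- data.frame(day = 1:3, temp = c(15, -15, 65))

csv_path <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")

write.csv(january, csv_path, row.names = FALSE)  # plain text, comma-separated
save(january, file = rdata_path)                 # R's own format, not plain text

# The CSV can be read back as ordinary lines of text
readLines(csv_path)

# The R data file restores the object itself, under its original name
rm(january)
load(rdata_path)
str(january)
```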
    • 0:31:56A good question.
    • 0:31:59OK, so we've seen here how to remove unwanted pieces of data
    • 0:32:03and how to do so using these things called logical expressions.
    • 0:32:07Up next, we'll see how to take subsets of data
    • 0:32:09and find those pieces of data we're actually interested in
    • 0:32:11and ask questions of that piece of data instead.
    • 0:32:14See you all in five.
    • 0:32:16Well, we're back.
    • 0:32:17And so we previously saw how to remove unwanted pieces of data,
    • 0:32:21like these outliers, using these things called logical expressions.
    • 0:32:25Up next, we'll see how to apply those very same tools
    • 0:32:28to now entire tables of data to find some subset of that data we're actually
    • 0:32:33interested in.
    • 0:32:34Now, to do that, we need to use this next data
    • 0:32:36set, which is a data set involving these very cute baby chickens.
    • 0:32:40And in particular, we have a table of data
    • 0:32:42here, where each row represents an individual baby chick
    • 0:32:46and how they grew up over two weeks of the very beginning of their lives.
    • 0:32:50Here, notice how every row represents a single chick.
    • 0:32:53And every column has some piece of data about that chick.
    • 0:32:57So here, on column one, this chick column
    • 0:33:00represents a number for each chick, identifying each chick uniquely.
    • 0:33:05Now, this feed column tells us what kind of food
    • 0:33:08that baby chick ate over the course of two weeks.
    • 0:33:11And then this weight column tells us how much
    • 0:33:13they weighed in grams at the end of the first two weeks of their life.
    • 0:33:17Notice here how the feed column has food like casein,
    • 0:33:20which is kind of like a protein, fava, which is like a fava bean,
    • 0:33:24if you're familiar.
    • 0:33:25And then the weight column has their weight, in this case, in grams.
    • 0:33:28So in this case, chick one seemed to have eaten casein
    • 0:33:32and weighed 368 grams at the end of the first two weeks of their life.
    • 0:33:37Now, one thing we'd be interested in is figuring out, well,
    • 0:33:40what is the average weight of any given chick in this data set?
    • 0:33:44We could certainly do that.
    • 0:33:45We could look at all of the values in the weight column and average those
    • 0:33:49and come to the conclusion that the average chick weighed some amount.
    • 0:33:53But I'd argue it's more interesting to find how much each chick weighed
    • 0:33:58depending on what they ate, like how much, for instance,
    • 0:34:01did the chicks who ate casein weigh, and how much did
    • 0:34:04the chicks who ate fava weigh?
    • 0:34:06And what does that tell us about which food is
    • 0:34:08more nutritious for these baby chicks?
    • 0:34:11So let's see how we can use these same tools of logical expressions
    • 0:34:15to now subset a data table like this and ultimately figure out
    • 0:34:19these different averages across these individual different food groups.
    • 0:34:23Let's come back to RStudio here.
    • 0:34:25And I'll aim to create now a program that can subset this data
    • 0:34:28and find for me the average weight of these chicks based on the kinds of food
    • 0:34:32they ate over time.
    • 0:34:34So why don't I create a new file here.
    • 0:34:36I'll do so using file.create.
    • 0:34:38And I'll call this file chicks.R for it's
    • 0:34:41going to be chicks that we're going to grow up and see how they do.
    • 0:34:45So now I'll open my File Explorer.
    • 0:34:47And I'll see I have this chicks.R file along
    • 0:34:50with a new file called chicks.csv.
    • 0:34:53So my data in this table is stored inside of this file called chicks.csv.
    • 0:34:59Why don't I go ahead and open this.
    • 0:35:01And I can do so in the same way we saw last time,
    • 0:35:04using this function called read.csv.
    • 0:35:07So I'll type read.csv and the name of the file I want to open, in this case,
    • 0:35:12chicks.csv.
    • 0:35:14And of course, read.csv will return to me
    • 0:35:17a data frame that is a table of data that
    • 0:35:20is now represented in R's own format.
    • 0:35:23I'll say that this data frame is called chicks.
    • 0:35:26And if I run line one, I'll now have that data frame
    • 0:35:30stored in my environment pane.
    • 0:35:32If I want to view this, I could use that same function we saw earlier, view,
    • 0:35:36and I could then give chicks as input.
    • 0:35:38And now I see I have my table of chicks and the various foods they ate.
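A self-contained sketch of this loading step. Since the lecture's chicks.csv is not reproduced here, the example first writes a tiny stand-in file to a temporary directory; apart from chick 1 (casein, 368 grams), the rows are made up.

```r
# Write a tiny stand-in for chicks.csv (made-up rows except chick 1)
csv_path <- file.path(tempdir(), "chicks.csv")
writeLines(
  c("chick,feed,weight",
    "1,casein,368",
    "2,casein,390",
    "3,fava,140"),
  csv_path
)

# read.csv returns a data frame
chicks <- read.csv(csv_path)

# In RStudio, View(chicks) opens the table viewer; head() prints it instead
head(chicks)
```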
    • 0:35:43So true to the slides here, we have individual chicks
    • 0:35:47numbered to represent that individual particular chick.
    • 0:35:50We have different kinds of feed or food the chicks were given.
    • 0:35:53I see casein, fava, linseed, which is like flaxseed, if you're familiar,
    • 0:35:58meatmeal, which involves various kinds of meat, soybean,
    • 0:36:01the actual plant bean, and sunflower seeds.
    • 0:36:05And here, we have our weight column.
    • 0:36:07Now, I'll notice that unlike on the slides, like below fava here,
    • 0:36:11I do seem to have some NA values.
    • 0:36:13Like, the linseed value seems to be NA.
    • 0:36:16Same with this one here for chick 9.
    • 0:36:19Same for 11 and 12.
    • 0:36:20Now, these NAs could mean a variety of things.
    • 0:36:23They might mean we didn't measure this chick.
    • 0:36:26They might mean we measured it incorrectly
    • 0:36:28and didn't want to include that data.
    • 0:36:29But regardless, NA, as we learned last time, stands for Not Available.
    • 0:36:34There could be some data point here, but there isn't.
    • 0:36:37So probably we need to handle that as we go through and do this analysis here.
    • 0:36:42Now, I'll go back to my chicks.R file.
    • 0:36:45And one thing I could do just off the bat
    • 0:36:47is figure out, how much do the chicks weigh
    • 0:36:50on average, across all different kinds of feed?
    • 0:36:53If I wanted to find that out, I could use the mean function,
    • 0:36:57as we saw just a little bit ago, and then give it as input
    • 0:37:00the vector representing the weight column in chicks.
    • 0:37:04And so here, all I'm doing again is accessing
    • 0:37:07the weight column of chicks, which, as we learned last time, is a vector.
    • 0:37:13mean will take that vector and hopefully produce for me
    • 0:37:15the average weight of these chicks.
    • 0:37:18I'll run line two, and I'll see, hm.
    • 0:37:21I'll see NA.
    • 0:37:24Well, let me go back to my data table again.
    • 0:37:28I mean, I see NA values.
    • 0:37:31But why do you think I would get an NA now
    • 0:37:35if I try to find the average of the values in the weight column?
    • 0:37:39Let me turn it over to our audience here.
    • 0:37:41Why do you think I would get NA if I have NAs in the vector of weights
    • 0:37:47I'm trying to find the average of?
    • 0:37:49AUDIENCE: I think because it's interrupting the other values.
    • 0:37:53CARTER ZENKE: Yeah.
    • 0:37:54So it's kind of you might say corrupting other values in some way.
    • 0:37:58Or it's trying to maybe modify them in some way.
    • 0:38:01Now, one thing particularly about these NA values
    • 0:38:04is that they mean something special.
    • 0:38:05There should be data here, but there isn't.
    • 0:38:08And if you're doing statistics or data science,
    • 0:38:10that's actually a really good indicator that you
    • 0:38:12should make a deliberate choice about what you want to do about those values.
    • 0:38:16You could remove them.
    • 0:38:18You could substitute some new value for it.
    • 0:38:20But what you shouldn't do is just ignore them and treat them
    • 0:38:23like they don't even exist.
    • 0:38:24And so R has a way of telling me now, look, you have NA values here.
    • 0:38:29You need to make a decision of what you want to do in order to actually compute
    • 0:38:33what you're trying to compute.
    • 0:38:34So one thing I could do, which feels most natural, I think, for this case,
    • 0:38:39is simply remove those NA values.
    • 0:38:42And if I wanted to do that, I could actually
    • 0:38:44use one of mean's other parameters, which
    • 0:38:46I learned from the documentation is called na.rm.
    • 0:38:50So recall from last time, if I want this function
    • 0:38:52to have more than one argument, I separate each with a comma.
    • 0:38:56I'll say comma here and then na.rm equals.
    • 0:39:01It turns out from the documentation, na.rm is either
    • 0:39:05going to be equal to TRUE or FALSE.
    • 0:39:08na.rm stands for whether I should remove,
    • 0:39:12rm, these NA values before I compute the average.
    • 0:39:17By default, na.rm is FALSE.
    • 0:39:20I won't remove them.
    • 0:39:21But if I don't remove them, mean won't know how to handle them
    • 0:39:25and so can't compute the mean.
    • 0:39:26But if I were to remove them instead, that is,
    • 0:39:29to make this parameter, this argument, TRUE,
    • 0:39:32well, then I would be able to compute the average because I
    • 0:39:34will have dropped or removed those NA values
    • 0:39:37and then computed the average from the rest of those values that
    • 0:39:41are in my weight column.
    • 0:39:42So let me run line two here now that the na.rm parameter is set to TRUE.
    • 0:39:47And I'll see that the average weight across all the chicks
    • 0:39:50seems to be 280.77 grams or so.
    • 0:39:54So a healthy weight for these chicks.
    • 0:39:57Now, what I argued was more interesting was
    • 0:40:00the idea of trying to find how much the chicks weighed
    • 0:40:03depending on what they ate.
    • 0:40:05And we could use that to figure out, what
    • 0:40:06is the healthiest kind of meal for these chicks?
    • 0:40:10Well, one thing I might be interested in first is how much on average
    • 0:40:14do the chicks who ate casein weigh?
    • 0:40:16But for that, I'm going to need to only deal with the chicks who ate casein.
    • 0:40:21So one way to do that would be to subset my data frame.
    • 0:40:26Only find the rows for which the feed column is equal to casein.
    • 0:40:31As we saw last time, there is a way to do this
    • 0:40:33based on the indices of this particular data of the rows here.
    • 0:40:38Notice how on the left-hand side, I have individual numbers
    • 0:40:41for each of these rows.
    • 0:40:42These are the indices of these rows.
    • 0:40:45If I wanted row one, well, I could use bracket notation and ask for row one.
    • 0:40:50If I wanted row two, I could do the same thing.
    • 0:40:53So I'll go back to my chicks.R code, and I'll
    • 0:40:56try that as a first step towards this.
    • 0:40:58I'll say chicks as my data frame.
    • 0:41:01And we saw last time that we can use a bracket
    • 0:41:03notation to access individual values or elements of this data frame.
    • 0:41:08Now, because a data frame is 2D, it took two values, one for the row
    • 0:41:13and one for the column, two indices to represent
    • 0:41:16the position of the row we want and the position of the column we want.
    • 0:41:20Turns out that by convention, the row number
    • 0:41:23comes first followed by the column number, separated, of course,
    • 0:41:27by this comma.
    • 0:41:28So if I wanted the first row, I could do this one here, that first row.
    • 0:41:34And I want all the columns.
    • 0:41:35So I'll leave this part blank.
    • 0:41:37If I run line three now, what will I see?
    • 0:41:40Well, I'll see, just in this case, row one.
    • 0:41:44Now, like our vectors that we saw earlier,
    • 0:41:47these data frames can take more than just individual indices as input.
    • 0:41:51They can also take a vector of indices.
    • 0:41:54So let's try that.
    • 0:41:55I'll give, in this case, chicks a vector of indices
    • 0:41:59that will then return to me all the rows for which the feed column equals
    • 0:42:03casein.
    • 0:42:04That seems to me, just based on eyeballing here,
    • 0:42:06that it's these rows, one, two, and three.
    • 0:42:09So I could use the 1, 2, and 3 here, create a vector of those values,
    • 0:42:15and then get back, in this case, all three of those rows.
    • 0:42:20So now I have indexed into my data frame's rows now using a vector.
    • 0:42:26And I've gotten back all the rows that I care about.
    • 0:42:29So why don't we call this one, at least for now, casein chicks.
    • 0:42:33Why don't I actually try to save this particular smaller
    • 0:42:36subset of my data frame in this object called casein chicks.
    • 0:42:39And now, if I wanted to find the mean or the average weight for those chicks,
    • 0:42:44I could use mean.
    • 0:42:46But then I could ask for the weight column from the casein
    • 0:42:50chick data frame, this subset of our previous data frame.
    • 0:42:53So now I'll run line four.
    • 0:42:55And I'll see that the casein chicks seem to weigh
    • 0:42:58significantly more than other chicks, 379 grams on average.
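The row indexing walked through above, as a sketch on an illustrative stand-in data frame (the values are not the lecture's chicks.csv).

```r
# Illustrative stand-in for the chicks data frame
chicks <- data.frame(
  chick = 1:9,
  feed = rep(c("casein", "fava", "linseed"), each = 3),
  weight = c(368, 390, 379, 140, 160, 150, NA, 229, NA)
)

# Row index first, column index second; a blank column index means all columns
chicks[1, ]

# A vector of row indices returns several rows at once
casein_chicks <- chicks[c(1, 2, 3), ]

# Average weight of just those rows
mean(casein_chicks$weight)
```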
    • 0:43:04Now, what might we want to use now that we've
    • 0:43:08seen how inefficient this might be?
    • 0:43:10Well, as we saw before, I often don't want to use individual indices.
    • 0:43:14You could imagine me, the programmer, going through and trying to find,
    • 0:43:17OK, well, 1 through 3 is casein, 4 through 6 is fava, 7 through 9
    • 0:43:21is linseed.
    • 0:43:21That's not how I want to spend my time.
    • 0:43:24There is a very minor improvement I could
    • 0:43:26make to this, which is as follows.
    • 0:43:28I could actually represent this same vector with the following syntax.
    • 0:43:34I could use 1 colon 3.
    • 0:43:37I've saved myself a few keystrokes, and I've
    • 0:43:40gotten in return the very same vector.
    • 0:43:43This colon here, when it's between two individual numbers,
    • 0:43:47gives us a sequential vector, all numbers from 1 through 3 inclusive.
    • 0:43:52And I can prove it to you in the console if I ran this line of code down below.
    • 0:43:551 colon 3.
    • 0:43:57Hit Enter.
    • 0:43:58I'll see I get a vector 1 through 3 inclusive.
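A quick sketch of the colon syntax, using an illustrative stand-in for the chicks data frame.

```r
# The colon operator builds a sequence of consecutive integers
1:3
4:6

# Illustrative stand-in for the chicks data frame
chicks <- data.frame(
  chick = 1:9,
  feed = rep(c("casein", "fava", "linseed"), each = 3),
  weight = c(368, 390, 379, 140, 160, 150, NA, 229, NA)
)

# 1:3 selects the same rows as c(1, 2, 3)
chicks[1:3, ]
```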
    • 0:44:02Maybe I could do the same for, let's say, the chicks that are eating fava.
    • 0:44:06Well, I could go 4 through 6 and get back those particular row indices.
    • 0:44:10But at the end of the day, I'm still actually defining
    • 0:44:15the indices at which this particular condition is true.
    • 0:44:17I could rely on something better.
    • 0:44:20I could probably rely on these logical expressions and use those instead.
    • 0:44:25So what kind of logical expression could help us out here?
    • 0:44:29Well, we might notice that we really care
    • 0:44:31about those chicks for which the feed column is equal to casein.
    • 0:44:36So I could try to make a logical expression that
    • 0:44:39involves this feed column of chicks.
    • 0:44:42Why not try that.
    • 0:44:43I'll go back to chicks.R. And now I'll try this logical expression here.
    • 0:44:48Chicks and the feed column therein, when is that equal to casein?
    • 0:44:55So recall that this is my logical expression.
    • 0:44:59And because one part of it includes a vector,
    • 0:45:02I'll get back a vector of logicals of TRUE or FALSE values.
    • 0:45:06Let me evaluate this expression by hitting Command Enter.
    • 0:45:10And now I'll see I get back this vector of TRUE or FALSE.
    • 0:45:14And it seems to me, if I look at this vector over here,
    • 0:45:16that these first three values in the feed column are equal to TRUE.
    • 0:45:21TRUE, TRUE.
    • 0:45:22TRUE.
    • 0:45:23Are equal to casein, in fact.
    • 0:45:24So TRUE, TRUE, and TRUE.
    • 0:45:26These are equal to casein.
    • 0:45:27The rest, though, are not.
    • 0:45:29They're FALSE.
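The comparison just evaluated can be sketched like this, on an illustrative stand-in data frame with three casein rows followed by other feeds.

```r
# Illustrative stand-in for the chicks data frame
chicks <- data.frame(
  chick = 1:9,
  feed = rep(c("casein", "fava", "linseed"), each = 3),
  weight = c(368, 390, 379, 140, 160, 150, NA, 229, NA)
)

# Comparing a whole column to one value yields one TRUE or FALSE per row
chicks$feed == "casein"
```

Here the result is TRUE for the first three rows and FALSE for the remaining six.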
    • 0:45:31Now, one thing to notice when you're working with data frames
    • 0:45:34is that really, these elements of this particular column
    • 0:45:38called feed, these kind of correspond to the rows of the data frame.
    • 0:45:43If I go back to my visualization of my data frame,
    • 0:45:48I might notice that the first three values in the feed column, well, those
    • 0:45:53correspond to the first three rows in my data frame.
    • 0:45:57And similar to vectors, data frames can actually
    • 0:46:01be subset with logical vectors.
    • 0:46:04So let's see how that could work here.
    • 0:46:07I have to keep in mind this relationship between the first elements of my column
    • 0:46:12and the actual rows of my data frame.
    • 0:46:15But I think we'll see how we could use these expressions to help
    • 0:46:17us subset this data frame.
    • 0:46:19Why don't we visualize it a bit like this, where before, we had seen
    • 0:46:24that we had a data frame called chicks.
    • 0:46:27And we could access it using bracket notation,
    • 0:46:29entering in the indices for the rows or for the columns.
    • 0:46:33But if I had some separate logical vector,
    • 0:46:36like the one I just created, and I called it, let's say, filter, just
    • 0:46:39for simplicity, I might notice that all of those same TRUEs and FALSEs, they
    • 0:46:46align now with the rows of my data frame.
    • 0:46:49So here, for instance, this logical vector
    • 0:46:52was created by comparing the values of feed with casein.
    • 0:46:56Those first three values were, in fact, equal to casein.
    • 0:46:59But the kind of revelation here is that these same elements now
    • 0:47:03correspond to rows of my data frame.
    • 0:47:07I could take this very same logical vector and put it into the place
    • 0:47:11where I would actually ask for the different rows of my data frame.
    • 0:47:15And I would get back the following, something like this.
    • 0:47:19I would mark, so to speak, certain rows to be kept at the end of this execution
    • 0:47:24here and certain rows to be removed.
    • 0:47:26And I would ultimately end up with only those rows for which
    • 0:47:30the logical vector evaluated to TRUE.
    • 0:47:32I would have, in fact, a subset of my data
    • 0:47:35without touching any of the actual individual indices.
    • 0:47:38So let's try it in R. I'll come back to RStudio here.
    • 0:47:42And I will do as follows.
    • 0:47:45I will try to kind of prevent myself from using individual indices.
    • 0:47:50And I will instead use this logical expression.
    • 0:47:53Similar to the slides, why don't I just call this logical vector filter, just
    • 0:47:57like this.
    • 0:47:59And why don't I run line three.
    • 0:48:01Now I have, in the case of filter, what do I have?
    • 0:48:05I have a logical vector.
    • 0:48:08Now, I could use this logical vector to index into, to find a subset of,
    • 0:48:14my actual data frame here if I use it instead of some individual indices
    • 0:48:19to index into this data frame.
    • 0:48:21Now, if I run line five, I'll have subset my data frame.
    • 0:48:26And if I run line six now, I'll see exactly the same result.
    • 0:48:30And I can even show you what casein chicks looks like.
    • 0:48:33Let me show you in the console here.
    • 0:48:35I'll see I, in fact, have the chicks that ate, in this case, casein.
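The same subset, now driven by a logical vector instead of hand-picked indices. This is a sketch on an illustrative stand-in data frame, not the lecture's file.

```r
# Illustrative stand-in for the chicks data frame
chicks <- data.frame(
  chick = 1:9,
  feed = rep(c("casein", "fava", "linseed"), each = 3),
  weight = c(368, 390, 379, 140, 160, 150, NA, 229, NA)
)

# One TRUE or FALSE per row of the data frame
filter <- chicks$feed == "casein"

# Placing the logical vector in the row position keeps only the TRUE rows
casein_chicks <- chicks[filter, ]

mean(casein_chicks$weight)
casein_chicks
```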
    • 0:48:41I could change this filter, though.
    • 0:48:43Let's say I want the chicks who ate something like linseed.
    • 0:48:46I could use linseed here.
    • 0:48:48And now, let me rename casein chicks to linseed chicks
    • 0:48:52and find out how much they weighed, those chicks who ate linseed.
    • 0:48:56I'll rerun my code top to bottom.
    • 0:48:58On line three, I'll change my filter.
    • 0:49:01I'll get back a logical vector representing those elements of feed
    • 0:49:04that were equal to linseed.
    • 0:49:06And then on line five, I'll go ahead and subset my data frame again.
    • 0:49:10And now I'll have only those chicks--
    • 0:49:12only those chicks who ate linseed.
    • 0:49:14And now, could I find the mean if I run line six?
    • 0:49:17And so it seems like the NAs are still involved here.
    • 0:49:21I need to now do the na.rm here equal to TRUE.
    • 0:49:25I want to remove the NA values.
    • 0:49:27And I could find, on average, how much those chicks who ate linseed weighed.
    • 0:49:31Seems like it was 229.
    • 0:49:34Grams, that is.
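Switching the filter to linseed, as a sketch on an illustrative stand-in data frame in which two of the three linseed weights are missing.

```r
# Illustrative stand-in for the chicks data frame
chicks <- data.frame(
  chick = 1:9,
  feed = rep(c("casein", "fava", "linseed"), each = 3),
  weight = c(368, 390, 379, 140, 160, 150, NA, 229, NA)
)

filter <- chicks$feed == "linseed"
linseed_chicks <- chicks[filter, ]

# The subset still contains NA weights, so the plain mean is NA
mean(linseed_chicks$weight)

# Removing them lets mean compute an average of what remains
mean(linseed_chicks$weight, na.rm = TRUE)
```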
    • 0:49:35So let's go ahead and think through other improvements
    • 0:49:37we could make to this program.
    • 0:49:39Now, as I just saw, I don't want to have to write na.rm equals TRUE every time
    • 0:49:45I encounter these NA values.
    • 0:49:47What I would love to do instead is actually just filter out these NA
    • 0:49:50values to begin with, maybe load my data set, but then as soon as I do,
    • 0:49:55remove all the rows that have an NA value for the weight column.
    • 0:49:59So for that, I could probably still use a logical expression.
    • 0:50:03And one that comes to mind might be something like as follows.
    • 0:50:07Let's say I want to figure out first which elements of the weight column
    • 0:50:12or really which rows in my data frame are equal to NA.
    • 0:50:17Or let's say maybe not equal to.
    • 0:50:19So I'll do chicks here.
    • 0:50:21And I'll find the weight column of chicks.
    • 0:50:24And I'll ask the question, which ones, in this case, are equal to NA?
    • 0:50:29So I can maybe remove them later on.
    • 0:50:31And you might notice that I get this little yellow squiggly sign in R
    • 0:50:36and this little warning that says, "use is.na to check
    • 0:50:39whether expression evaluates to NA."
    • 0:50:41I'm going to ignore that for now.
    • 0:50:42I'm just going to run line three here and see what we get.
    • 0:50:46We'll see I get a vector of NA values.
    • 0:50:49And this has to do with the fact that R really
    • 0:50:52wants you to know that NA values exist.
    • 0:50:54If you have an NA value in your logical expression,
    • 0:50:57it's going to make everything else NA because R wants you to decide, what
    • 0:51:01are you going to do with this NA value?
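A small sketch of why comparing against NA with `==` does not work, using an illustrative stand-in data frame.

```r
# Illustrative stand-in for the chicks data frame
chicks <- data.frame(
  chick = 1:9,
  feed = rep(c("casein", "fava", "linseed"), each = 3),
  weight = c(368, 390, 379, 140, 160, 150, NA, 229, NA)
)

# Comparing anything to NA yields NA, so every element of the result is NA,
# even for rows whose weight is present
na_check <- chicks$weight == NA
na_check
```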
    • 0:51:05So it seems like this approach won't work.
    • 0:51:07But thankfully, R does have other functions
    • 0:51:10that we can use to be more deliberate about checking
    • 0:51:13for NA values in some given vector or in some given data frame.
    • 0:51:18Now, in R, these are known as logical functions, functions
    • 0:51:21that can return to us a logical value.
    • 0:51:23And there are a lot of logical functions that
    • 0:51:25are based on these special values we saw in R last time.
    • 0:51:29You could imagine the is.infinite function.
    • 0:51:33We saw last time there was a special value called infinite, or Inf, that allowed us
    • 0:51:36to represent a very, very large number.
    • 0:51:38You could use is.infinite to test if some value is infinite.
    • 0:51:43You could also use, as we just saw, is.na.
    • 0:51:47is.na looks at some given value and returns TRUE
    • 0:51:51if that value literally is NA.
    • 0:51:54If it's not, it returns FALSE.
    • 0:51:56Same for is.nan, or is dot not a number, for a special value called NaN.
    • 0:52:01Well, this tests for that value.
    • 0:52:03And same for is.null, for that special value called NULL we saw last time.
    • 0:52:06That will return TRUE if we have the null value or FALSE if we don't.
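The four logical functions mentioned above, each applied to a few sample values chosen for illustration.

```r
# is.infinite: TRUE for Inf and -Inf
is.infinite(c(1, Inf, -Inf))

# is.na: TRUE wherever a value is NA
is.na(c(368, NA, 229))

# is.nan: TRUE for NaN ("not a number"), such as the result of 0 / 0
is.nan(c(1, NaN, 0 / 0))

# is.null: TRUE only for NULL itself
is.null(NULL)
is.null(0)
```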
    • 0:52:11But I think the one we're going to care about here is is.na.
    • 0:52:14So let's try that one out.
    • 0:52:16I'll come back to my code over here.
    • 0:52:19And why don't I try to use is.na on this weight column in chicks.
    • 0:52:25I can pass, as input to is.na, this particular vector,
    • 0:52:29this column called weight.
    • 0:52:31And now, if I run line three, well, I'll get back
    • 0:52:35a vector of logicals, a logical vector.
    • 0:52:38And I should actually see which, in this case, elements of the weight column
    • 0:52:43are equal to NA.
    • 0:52:44So it seems like-- and I might want to use which here.
    • 0:52:47But it seems like one, two, three, four, five, six, seven, the seventh value
    • 0:52:51seems to be NA.
    • 0:52:53Maybe the later one too.
    • 0:52:54Let's actually use which for this.
    • 0:52:55I'll come back to RStudio.
    • 0:52:57And why don't I use which.
    • 0:52:59Let's say which values, which indi--
    • 0:53:03which elements of the weight column are equal to NA.
    • 0:53:07And I'll see that it in fact seems to be the 7th, 9th, 11th,
    • 0:53:1312th, and 18th rows in chicks.
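A sketch of `is.na` and `which` on an illustrative stand-in data frame. This one has only nine rows, with missing weights in rows 7 and 9, so the indices differ from the lecture's.

```r
# Illustrative stand-in for the chicks data frame
chicks <- data.frame(
  chick = 1:9,
  feed = rep(c("casein", "fava", "linseed"), each = 3),
  weight = c(368, 390, 379, 140, 160, 150, NA, 229, NA)
)

# One TRUE or FALSE per element: is this weight NA?
is.na(chicks$weight)

# which() turns the logical vector into the indices that are TRUE
which(is.na(chicks$weight))
```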
    • 0:53:17Now, that seems helpful.
    • 0:53:19But I would ideally like to find those values that aren't
    • 0:53:22equal to NA and keep those instead.
    • 0:53:26So if I wanted to negate this expression here,
    • 0:53:29as we saw before, I could use the exclamation point,
    • 0:53:32this not operator, that says if you gave me a FALSE, give me instead a TRUE.
    • 0:53:37If you gave me a TRUE, give me instead a FALSE.
    • 0:53:40So this will test which values are now not NA in that weight column.
    • 0:53:45I'll run line three.
    • 0:53:47And now we'll see we have more TRUEs than FALSEs, representing
    • 0:53:51all those values in our weight column that are not, in this case, NA.
    • 0:53:56So if I wanted to subset this data frame,
    • 0:53:59I could use the same kind of trick we saw
    • 0:54:01earlier of realizing that these individual elements of this vector
    • 0:54:06correspond to the rows of my data frame.
    • 0:54:09And I could subset, in this case, chicks as follows.
    • 0:54:13We could say chicks and give it this logical expression, which
    • 0:54:16in fact returns to me a logical vector, and then use that logical vector
    • 0:54:20to subset the chicks data frame to now only include
    • 0:54:24those rows that, in this case, have a weight that is not equal to NA.
    • 0:54:30Now, it would be good for me to maybe save this
    • 0:54:34as the most recent version of chicks.
    • 0:54:36Now, on lines one and two, I'm loading the chicks data frame.
    • 0:54:40And I'm now saying immediately I'm going to remove any NA values in the weight
    • 0:54:44column, just like this.
    • 0:54:46So now, when I use mean later on, I won't
    • 0:54:49need to use na.rm because I'll know that all those NA values in the weight
    • 0:54:53column are gone for good.
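A minimal sketch of the steps just described, assuming a small made-up data frame in place of the real chicks.csv (the rows and values here are hypothetical):

```r
# Hypothetical stand-in for chicks.csv: a few rows, some with missing weights
chicks <- data.frame(
  feed = c("casein", "fava", "linseed", "soybean", "soybean"),
  weight = c(368, NA, 229, NA, 248)
)

is.na(chicks$weight)         # logical vector, TRUE where weight is NA
which(is.na(chicks$weight))  # positions of the NA values
!is.na(chicks$weight)        # negated: TRUE where weight is present

# Keep only the rows whose weight is not NA (note the trailing comma,
# which means "all columns")
chicks <- chicks[!is.na(chicks$weight), ]

mean(chicks$weight)          # na.rm is no longer needed here
```

The logical vector has one element per row, which is why it can serve as a row index into the data frame.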
    • 0:54:57Now, there is one more way to subset these data frames as
    • 0:55:01opposed to using this logical expression that is kind of serving as an index
    • 0:55:06into this data frame.
    • 0:55:07There is actually a function called subset that works on data frames
    • 0:55:12and takes both a data frame and a logical vector as input,
    • 0:55:16returning for us all the rows for which that logical expression is true.
    • 0:55:20That logical vector evaluates to TRUE.
    • 0:55:23So let's try this.
    • 0:55:25Why don't I instead use subset here.
    • 0:55:27I want to subset my data frame to only find those rows where weight is not
    • 0:55:32equal to NA.
    • 0:55:34Well, I could still use subset.
    • 0:55:35I could use subset here, which means the subset function,
    • 0:55:38and I could pass, as the first input to subset, the chicks data frame.
    • 0:55:43And now, as the second input, the second argument,
    • 0:55:46I now need to give it a logical expression to evaluate, to see,
    • 0:55:50which rows to keep and which rows to exclude.
    • 0:55:53Now, one thing I could say is not is.na.
    • 0:55:58So this means any row that is not equal to NA.
    • 0:56:01And I could then give the weight column of chicks as input.
    • 0:56:06Notice here the syntax is a little bit different.
    • 0:56:08I no longer need to use the dollar sign notation to actually access
    • 0:56:13the row or the column of chicks.
    • 0:56:16I instead just type in the column itself.
    • 0:56:18And this works because subset takes as input the data frame.
    • 0:56:22It will assume if I say weight, I'm talking about, in this case,
    • 0:56:26the column in chicks.
    • 0:56:28So this should have the same result. If I run line one and then line two,
    • 0:56:33if I view now chicks, I should see that all of those
    • 0:56:37weights that were previously NA are gone from my data set.
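The same filtering with subset might look like the sketch below. The data frame is again a hypothetical stand-in for chicks.csv; the point is that weight is written without the chicks$ prefix.

```r
# Hypothetical stand-in for chicks.csv
chicks <- data.frame(
  feed = c("casein", "fava", "linseed", "soybean", "soybean"),
  weight = c(368, NA, 229, NA, 248)
)

# subset() evaluates the condition inside the data frame,
# so the column can be named directly
chicks <- subset(chicks, !is.na(weight))

chicks
```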
    • 0:56:42I could even use this, let's say, later on to figure out how much on average
    • 0:56:46the chicks who ate, let's say, soybean weigh.
    • 0:56:50Why don't I use subset again.
    • 0:56:52I'll make an object called soybean chicks, just like this.
    • 0:56:56And I will then subset the chicks data frame, the latest version of it.
    • 0:57:01And I'll try to make sure that, in this case, the feed column equals,
    • 0:57:05what did we say?
    • 0:57:06Soybean.
    • 0:57:07Equals soybean.
    • 0:57:09Again, because I'm now using the subset function,
    • 0:57:12I don't need to tell R that the feed column belongs to chicks.
    • 0:57:17Subset will do that work for me.
    • 0:57:19I can just give the column name and ask, where is it equal to soybean?
    • 0:57:23And now subset will return to me all the rows in chicks
    • 0:57:27where this expression is true.
    • 0:57:30Let me run line four then.
    • 0:57:31And let's see what's inside of soybean chicks.
    • 0:57:35We'll see that now I have that subset of my data frame.
    • 0:57:40And I could now run analyses like mean to determine, how much on average
    • 0:57:46did those particular chicks weigh?
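As a rough illustration of that last step, assuming made-up weights and an object name, soybean_chicks, chosen here to match what is said aloud:

```r
# Hypothetical data with the NA weights already removed
chicks <- data.frame(
  feed = c("casein", "soybean", "soybean", "fava"),
  weight = c(368, 248, 271, 160)
)

# Keep only the rows where the feed column equals "soybean"
soybean_chicks <- subset(chicks, feed == "soybean")

# Average weight of just those chicks
mean(soybean_chicks$weight)
```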
    • 0:57:50All right.
    • 0:57:51Now, one more thing to keep in mind is that if I were to view this chicks data
    • 0:57:56frame, just like this, if I'm being very astute,
    • 0:58:00I might notice something a little bit off about it.
    • 0:58:03So I have the individual numbers representing each chick here.
    • 0:58:08But data frames in R also have what's called row names,
    • 0:58:12individual indices for our rows.
    • 0:58:15And if I wanted to find those row names, I
    • 0:58:18could use this rownames as a function.
    • 0:58:21And I could run rownames on line four.
    • 0:58:24And these are the row names of this data frame.
    • 0:58:28Now, if you're being a little observant, what do you notice?
    • 0:58:33Now that we've run line two, what might be missing
    • 0:58:37from these indices of our data frame?
    • 0:58:431, 2, 3, 4, 5.
    • 0:58:46What are we missing in the end?
    • 0:58:48AUDIENCE: I think it's the NA or not available variables.
    • 0:58:52CARTER ZENKE: Yeah, so we're missing, in this case, all of those row names
    • 0:58:56that previously corresponded to those rows that
    • 0:58:59had an NA value in the weight column.
    • 0:59:01So we have 1, 2, 3, 4, 5, 6, and where's 7?
    • 0:59:05Well, 7 we saw earlier actually had an NA value in the weight column.
    • 0:59:09So we removed it.
    • 0:59:10But it's really not good practice for me to actually have these row names not
    • 0:59:15now ascend one after the other in sequential order,
    • 0:59:18to have these missing values here.
    • 0:59:20So I need to reset them.
    • 0:59:22And I can do that using a special value that we saw earlier called null.
    • 0:59:26I'll come back to RStudio here.
    • 0:59:29And if I want to reset the row names for this chicks data set,
    • 0:59:35I could do as follows.
    • 0:59:36I could not just print row names or see what they are.
    • 0:59:40I could assign them some value.
    • 0:59:42And R has a handy trick, where if I assign the row names of some data frame
    • 0:59:47to be NULL, capital N-U-L-L, that will reset them to count sequentially 1 up
    • 0:59:54through the number of rows we have.
    • 0:59:56Now, null, remember, meant literally nothing.
    • 1:00:00There's intentionally no value at all here.
    • 1:00:02It means nothing at all.
    • 1:00:03But when I assign this value to be the data frame's row names,
    • 1:00:07it kind of gets rid of them.
    • 1:00:08And R decides to build them back in.
    • 1:00:11So let's try this.
    • 1:00:12I'll run line four.
    • 1:00:13And now, I'll check on the row names again.
    • 1:00:16And I'll see that we're back to now being in sequential order.
    • 1:00:20So whenever you take a subset of your data,
    • 1:00:23consider updating the row names to make sure
    • 1:00:25that things are staying just as they should and you have the actual row
    • 1:00:28names in ascending order to index your data, in this case, properly.
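A small sketch of the gap in row names and the reset, using a hypothetical four-row data frame rather than the real chicks data:

```r
# Hypothetical stand-in for chicks.csv; row 2 has a missing weight
chicks <- data.frame(
  feed = c("casein", "fava", "linseed", "soybean"),
  weight = c(368, NA, 229, 248)
)

chicks <- subset(chicks, !is.na(weight))

# The surviving rows keep their original names, so "2" is missing
before <- rownames(chicks)
before                      # "1" "3" "4"

# Assigning NULL discards the names; R rebuilds them as 1..nrow
rownames(chicks) <- NULL
rownames(chicks)            # "1" "2" "3"
```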
    • 1:00:34Now, what final questions do we have on subsetting these data frames?
    • 1:00:42What questions do we have?
    • 1:00:44AUDIENCE: So when you introduce the is.na function in conjunction
    • 1:00:54with the which function, we had the indices that had NA on them
    • 1:00:59on the weights vector.
    • 1:01:02Would we have an easy way to count how many NAs we had in the vector?
    • 1:01:10Because maybe if we had a bigger data frame,
    • 1:01:14we would have a hard time counting the number of indices that it returned.
    • 1:01:19CARTER ZENKE: No, a really good question, Bruno.
    • 1:01:21And so one thing we'd be asking ourselves is, how do I figure out exactly how
    • 1:01:25many NAs I had in the first place?
    • 1:01:28Well, we can use a little handy trick of these logical values, the TRUE or FALSE
    • 1:01:32values, which is that at the end of the day, a TRUE corresponds to a 1,
    • 1:01:37and a FALSE corresponds to a 0.
    • 1:01:40So let's actually see this in action and see
    • 1:01:41how we can actually count up our number of these TRUE or FALSE values.
    • 1:01:46I'll come back to RStudio here.
    • 1:01:48And our question was, how many NA values did
    • 1:01:51we have in the weight column of chicks?
    • 1:01:55Well, we used, remember, is.na to test and see
    • 1:02:00which elements of the weight column were equal to NA.
    • 1:02:04If I use is.na here, I get back this logical vector.
    • 1:02:08And actually, right now, all of them are FALSE because I actually
    • 1:02:11am still working with the updated version of chicks
    • 1:02:13that removed those NA values.
    • 1:02:14Let me run line one, which will reload the CSV.
    • 1:02:18And now let me run line three, which now has those NA values added back in.
    • 1:02:23Now I'll see that some of these values are TRUE,
    • 1:02:26that there are some places in the weight column of chicks that are equal to NA.
    • 1:02:32Now, a useful trick when you're trying to count up these kinds of values
    • 1:02:37is to keep in mind that TRUE underneath the hood corresponds to the number 1,
    • 1:02:42and FALSE underneath the hood corresponds to the number 0.
    • 1:02:46And I think if I were to do this, if I were to do, in the R console,
    • 1:02:49as.integer, this value TRUE, this would take the value TRUE
    • 1:02:55and show me its true integer representation.
    • 1:02:58Let me run Enter here.
    • 1:02:59I see 1.
    • 1:03:00Let me do as.integer for FALSE to see what it really is underneath the hood.
    • 1:03:05That seems like it's a 0.
    • 1:03:08So I could take this vector of TRUEs and FALSEs, and I could sum it,
    • 1:03:14just like this, where sum will allow me to count up
    • 1:03:17all the possible values in here.
    • 1:03:19And because TRUE is always equal to 1 and FALSE is always
    • 1:03:23equal to 0, what I'll really get back is the number of TRUEs
    • 1:03:26that are inside this vector or the number of values in the weight
    • 1:03:31column of chicks that were equal to NA.
    • 1:03:34So I'll run line three, and I'll see that there were five values, five
    • 1:03:38values in chicks that were equal to NA.
    • 1:03:40If I view chicks now, I think we should see,
    • 1:03:44if we count for ourselves, one, two, three, four,
    • 1:03:48and then down below, five, exactly five values of NA.
    • 1:03:52So you can keep this in mind when you're trying
    • 1:03:54to count up your number of NA values that you might have.
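The counting trick can be sketched as follows. The weights are made up, so the count here is 2 rather than the five found in the lecture's data.

```r
# Hypothetical weights with two missing values
chicks <- data.frame(
  feed = c("casein", "fava", "linseed", "soybean", "soybean"),
  weight = c(368, NA, 229, NA, 248)
)

as.integer(TRUE)   # 1
as.integer(FALSE)  # 0

# sum() treats each TRUE as 1 and each FALSE as 0,
# so this counts the NA values in the column
na_count <- sum(is.na(chicks$weight))
na_count           # 2
```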
    • 1:03:59OK.
    • 1:03:59We'll take a quick break here and come back
    • 1:04:01to talk more about how we can not just choose the subset of data ourselves,
    • 1:04:05as programmers, but give the user more control over choosing
    • 1:04:08which subset of data they want to see.
    • 1:04:10We'll be back in five.
    • 1:04:12Well, we're back.
    • 1:04:14And so we've seen so far how to take subsets of our data.
    • 1:04:17But what we'll do now is turn more control over to the user
    • 1:04:20and let them choose a subset of data they want to see.
    • 1:04:23Now, R in general has this idea of a menu,
    • 1:04:25where you could present the user with some options they could choose from.
    • 1:04:28First, we show them our feed data.
    • 1:04:30We could ask them which subset of data they want to see.
    • 1:04:33Is it the casein subset, the fava subset, the linseed subset, and so on?
    • 1:04:37And the user could type in down below which number subset they want to see,
    • 1:04:41whether it's 1 for casein, 2 for fava, or 3 for linseed.
    • 1:04:45So let's go and implement something like this in R now and show the user
    • 1:04:49the subset of data that they want to see.
    • 1:04:51I'll come back over to RStudio here.
    • 1:04:53And I actually already have a program typed up here,
    • 1:04:55one that will implement a bit of this idea already.
    • 1:04:58So notice here how I am still reading in my chicks.csv file.
    • 1:05:02And I'm now removing any weights that are NA, just like we saw before.
    • 1:05:06I'm now going to determine which options I should show to the user.
    • 1:05:10And I could do that using this function called unique,
    • 1:05:13where I'll pass in the feed column of chicks
    • 1:05:15and get back all the possible options that are inside of that feed column.
    • 1:05:19And then down below, what will I do?
    • 1:05:22Well, I'll prompt the user with options using this new function
    • 1:05:25we haven't seen yet called cat.
    • 1:05:27Cat actually concatenates character strings
    • 1:05:30and prints them out all at the same time.
    • 1:05:32So here, I'll cat or print the 1 dot followed by the first feed
    • 1:05:38option, probably casein, in this case.
    • 1:05:40Then on the next line, I will cat 2 followed by the second feed option, which will
    • 1:05:45be something like linseed, let's say.
    • 1:05:47And I'll go through all of my possible feed options.
    • 1:05:50And at the very end, I will ask the user to enter some feed type, some number
    • 1:05:54of the subset that they want to see.
    • 1:05:57So let's see this in action here.
    • 1:05:59I'll go ahead and go to the top and click Source now.
    • 1:06:02And hm.
    • 1:06:04So some things seem to be working here.
    • 1:06:07I have actually the feed options being shown as I want them to be shown.
    • 1:06:11But what I don't see are these options on new lines.
    • 1:06:15Like, I would rather have 1.
    • 1:06:17space casein followed by 2.
    • 1:06:19space fava, not all of these on the same line.
    • 1:06:22So I think we'll need some new character here to solve this problem.
    • 1:06:26And in fact, R does have a special character that we can
    • 1:06:28actually use to solve this problem.
    • 1:06:31In general, these kinds of characters are called escape characters.
    • 1:06:35And one escape character is this one here,
    • 1:06:37backslash n, which if I were to use it, it won't print out a backslash n
    • 1:06:42to my console.
    • 1:06:43It will instead print out a new line.
    • 1:06:46And this backslash t?
    • 1:06:47Well, this is actually a special one too.
    • 1:06:49If I type backslash t, I won't see backslash t.
    • 1:06:53I'll instead see a tab.
    • 1:06:55So these are helpful for us.
    • 1:06:56And in general, these escape characters don't actually
    • 1:06:59print out the way you type them.
    • 1:07:00They print out something special, like a new line or a tab or something
    • 1:07:03else entirely for other escape characters too.
    • 1:07:06So let's use now backslash n and see if that can help solve our problem.
    • 1:07:10I'll come back over to RStudio.
    • 1:07:12And let me now add in this backslash n to each of my cat functions here.
    • 1:07:17I will also concatenate, on each line, this backslash n, just like this.
    • 1:07:23And hopefully, when I finish typing all this in,
    • 1:07:25I'll be able to see each of these feed options on some new line of my console
    • 1:07:31here.
    • 1:07:31Backslash n and backslash n.
    • 1:07:34And all I'm doing here is actually adding in some new lines
    • 1:07:38to concatenate to each of my options.
    • 1:07:40So let me clear my terminal down below.
    • 1:07:43And I'll click Source now.
    • 1:07:45And now I'll see that all of these options are on their own new line
    • 1:07:49because what I'm doing is first printing out 1.
    • 1:07:53Then I'm going to print out the first feed option.
    • 1:07:56Then I'm going to cat or print out this backslash n to move to that next line
    • 1:08:00here, ultimately allowing me to see all of these options top to bottom.
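A sketch of the menu at this stage, one cat call per option. The name feed_options and its three values are assumptions for illustration; in the lecture the vector comes from unique applied to the feed column.

```r
# Hypothetical feed options; the lecture derives these with unique()
feed_options <- c("casein", "fava", "linseed")

# Without "\n" all three calls would print on a single line.
# cat() separates its arguments with a space by default.
cat("1.", feed_options[1], "\n")
cat("2.", feed_options[2], "\n")
cat("3.", feed_options[3], "\n")
```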
    • 1:08:05Now, let's pause here and ask, what questions
    • 1:08:07do we have on these escape characters or this program so far?
    • 1:08:11AUDIENCE: As we concluded from the first two lectures,
    • 1:08:13I think the programming with R is not safe enough because it
    • 1:08:19saves arguments or variables.
    • 1:08:21Then after it, you can't change it, or you can't access the first element.
    • 1:08:27So how we can--
    • 1:08:28how we can program defensively with these available features?
    • 1:08:34CARTER ZENKE: Yeah, a good question.
    • 1:08:36And I like the way you're thinking.
    • 1:08:37We need to think of how we can program defensively.
    • 1:08:40And so one way to think defensively here is
    • 1:08:42to think through what possible input the user could give us.
    • 1:08:45If I look at this particular prompt, I offer the user
    • 1:08:49that they could type in 1 through 5 here.
    • 1:08:51But what if they typed in a 0 or a 7?
    • 1:08:55They could very well do that.
    • 1:08:56And so we'll see how we can actually handle
    • 1:08:58those kinds of cases in a little bit.
    • 1:09:01But first, I would argue that this, although it works,
    • 1:09:05isn't exactly the best designed program we could write.
    • 1:09:08I do have the right kind of menu for the user to see,
    • 1:09:11but I could probably improve the design of my code too.
    • 1:09:14So let's come back to RStudio and think through how
    • 1:09:16we could improve the design of this code using R's vectorized features.
    • 1:09:22So here, if you notice, on line 9 through 14,
    • 1:09:27there's no reason for me to type all these lines of code.
    • 1:09:30And if you find yourself ever accessing one element of a vector after another
    • 1:09:35just to print something out to the screen,
    • 1:09:36you could probably think to yourself, there
    • 1:09:38has to be a better way to do this.
    • 1:09:41And in fact, there is.
    • 1:09:42One thing that you might often think about
    • 1:09:44is transforming your output to the user and turning it into a vector itself.
    • 1:09:50So here, I have all of my formatted options
    • 1:09:53in terms of individual lines of code.
    • 1:09:56But it would be really, really nice if I had
    • 1:09:58a vector of these formatted options.
    • 1:10:00And I could then pass that vector to cat, for instance.
    • 1:10:04Now, cat can take a full vector as input and separate
    • 1:10:09those character-- separate those elements
    • 1:10:11with some character I tell it to.
    • 1:10:13Now, for instance, I could, if I had this vector called, let's say--
    • 1:10:18why don't we call it formatted options.
    • 1:10:21And that is a vector itself.
    • 1:10:23I could pass that vector to cat and tell it, in this case,
    • 1:10:26to separate every element with a backslash n.
    • 1:10:29And so long as this vector of formatted options
    • 1:10:32included 1 for casein, 2 for linseed, and so on,
    • 1:10:36it would then be able to print all of them
    • 1:10:38out at once separated by a new line, exactly what we just did,
    • 1:10:42but now using only one line of code.
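To make the idea concrete, here is a sketch in which the vector of formatted options is simply typed out by hand; the name formatted_options and its contents are assumed for illustration.

```r
# Hypothetical, hand-written vector of menu lines
formatted_options <- c("1. casein", "2. fava", "3. linseed")

# cat() prints every element, placing sep between them
cat(formatted_options, sep = "\n")
```

Because sep only goes between elements, no newline follows the last option; a trailing "\n" can be printed separately if one is wanted.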
    • 1:10:46Now the challenge is, though, how do I get these formatted options
    • 1:10:50in terms of their own vector?
    • 1:10:51And how can I pass them, in this case, to cat?
    • 1:10:54Well, I think we need another part of our program now.
    • 1:10:56I'll say let's make a section to format, to format our options
    • 1:11:01and to do so a little better than we did before.
    • 1:11:05So I claim that ideally, we want to create
    • 1:11:08an object called formatted options that looks a bit like this.
    • 1:11:12This object is a vector.
    • 1:11:14And it includes, for the user, all of their menu options.
    • 1:11:18So this is six total options, each one here, 1 for casein, 2 for fava,
    • 1:11:233 for linseed.
    • 1:11:24And notice how I've kind of appended these numbers, in each case, 1.
    • 1:11:28space the food option, 2.
    • 1:11:30space the food option, 3.
    • 1:11:32space and the food option.
    • 1:11:34Now, I'm kind of noticing a pattern in this vector here,
    • 1:11:38which is that for the most part, every option
    • 1:11:41I have begins with a number 1 to 6 down here.
    • 1:11:46Then we have a period followed by a space in every element of this vector.
    • 1:11:51And then the next thing I see is we have whatever food option
    • 1:11:55corresponds to this particular option, like casein, fava, linseed,
    • 1:11:58or meatmeal.
    • 1:11:59Now, when you're using R and you're using vectors,
    • 1:12:02it really pays to think in a vectorized way.
    • 1:12:06So I could actually think about this single vector
    • 1:12:08as the combination of three different ones, these right here.
    • 1:12:13Maybe I have one vector of numbers 1 through 6,
    • 1:12:17one vector of just that dot space, which I've quoted here to show the space,
    • 1:12:22in fact, one vector of just those dot spaces,
    • 1:12:24and one vector which we already have of those feed options to show to the user.
    • 1:12:29And it would be really nice if I had a function
    • 1:12:32to basically combine these various vectors into a single one.
    • 1:12:36Take these three and concatenate them into one single list
    • 1:12:40of formatted options.
    • 1:12:42Now, you actually already know what that vector is.
    • 1:12:46In fact, that vector-- or not that vector.
    • 1:12:48That function, you know what that function is.
    • 1:12:50That function is paste and its sibling, paste0.
    • 1:12:53Paste can still work with these vectors but concatenate them now element-wise.
    • 1:12:59So let's try using paste to vectorize our formatting here and improve
    • 1:13:03the design of this code in R. Come back to RStudio here.
    • 1:13:08And again, our goal is to create this vector called formatted options that
    • 1:13:13has the number prefix to each of our options to show to the user.
    • 1:13:18Now, if I wanted to do that, I claimed we could use paste0.
    • 1:13:22But instead of giving paste0 several individual options,
    • 1:13:26I could give it a few different vectors.
    • 1:13:28So maybe the first vector to give to it is the number vector.
    • 1:13:32I want to first begin my input with those numbers.
    • 1:13:35And so I could do as follows.
    • 1:13:37I could say 1 colon 6.
    • 1:13:39That represents the number of the--
    • 1:13:43the number vector that I have.
    • 1:13:45If I go down to the console here, I can prove to you
    • 1:13:47that 1 colon 6, that is, in fact, a vector of 1 through 6.
    • 1:13:52OK.
    • 1:13:52Now, the next part was to incorporate that dot space in the middle.
    • 1:13:57And I claim, before I show you this, that I can actually
    • 1:14:01get away with not putting this in its own vector,
    • 1:14:04but instead putting it as a single value.
    • 1:14:06And R will repeat that value for me or recycle it for me, as we'll see.
    • 1:14:10Then the third input, in this case, is the actual option
    • 1:14:13that the user should see in terms of the feed options.
    • 1:14:16So I'll type feed options here, which as we saw, looking at our console here,
    • 1:14:20is just a vector of the options we want to show the user.
    • 1:14:25So visually, what I've done here looks a bit as follows.
    • 1:14:28I've given as input to paste0 these three
    • 1:14:31vectors here, one of numbers 1 through 6, one of this single element,
    • 1:14:36dot space, and one of our feed options, casein, fava, linseed, and so on.
    • 1:14:41And when I concatenate all of these together,
    • 1:14:42I'll get back a vector of six elements element-wise, concatenating these here.
    • 1:14:47So the first one seems pretty straightforward.
    • 1:14:49I'll take 1, concatenate it with dot space, concatenate that with casein,
    • 1:14:53and I'll get back 1.
    • 1:14:54space casein.
    • 1:14:56But the problem becomes, what do I do on this next element?
    • 1:14:59Well, 2 concatenates with what?
    • 1:15:02Turns out that R actually recycles this single value to the next element too,
    • 1:15:06a bit like this.
    • 1:15:07So I'll now concatenate 2.
    • 1:15:09space fava, and I'll get 2.
    • 1:15:11space fava.
    • 1:15:12I'll recycle this value again for linseed, getting 3.
    • 1:15:16space linseed and recycle it again and again and again
    • 1:15:19until I reach the end of the full length of these vectors
    • 1:15:21here, getting, in the end, my full list of formatted options.
    • 1:15:25So let me come back now to RStudio.
    • 1:15:27And let me try to see what's inside of formatted options.
    • 1:15:31Let me go over here.
    • 1:15:33And let me first run, let's say, line 9.
    • 1:15:38Let me now see what's inside of formatted options.
    • 1:15:40And here, we actually see our formatted vector of options to print to the user.
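The element-wise concatenation and recycling can be sketched like this. The six feed names and the underscore-style object names are assumptions, since the transcript only gives them as spoken words.

```r
# Hypothetical vector of six feed options
feed_options <- c("casein", "fava", "linseed",
                  "meatmeal", "soybean", "sunflower")

# paste0() works element-wise: 1 with "casein", 2 with "fava", ...
# The single string ". " is recycled for every element.
formatted_options <- paste0(1:6, ". ", feed_options)

formatted_options
cat(formatted_options, sep = "\n")
```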
    • 1:15:47Now, what questions do we have, if any, on how paste
    • 1:15:51has now handled these vectors as input?
    • 1:15:54AUDIENCE: Could we make our concatenation
    • 1:16:00a little bit more flexible, maybe using the length of our feed options vector?
    • 1:16:06Because maybe if we added another chicks that ate additional foods,
    • 1:16:15maybe we could make it a little bit more adaptable.
    • 1:16:19So that is my question.
    • 1:16:20CARTER ZENKE: Yeah, a good question on making our program more
    • 1:16:22adaptable and flexible here.
    • 1:16:24Let's go ahead and try to implement that and see what it could do for us.
    • 1:16:27I'll come back to RStudio here.
    • 1:16:29And let's go back to our program.
    • 1:16:31And I think you've rightly noticed that if we ever had more than, for instance,
    • 1:16:35six feed options, this would no longer work.
    • 1:16:38What's more flexible would be to actually
    • 1:16:40dynamically find the length of the feed options we have
    • 1:16:43or how many we have in total.
    • 1:16:44And I could do that using this function called length, just like this.
    • 1:16:48And as input to length, I'll give this feed options vector.
    • 1:16:52And length will return to me now how many elements are inside
    • 1:16:55of that vector.
    • 1:16:57For instance, if I go down to the console
    • 1:16:59and show you what this evaluates to, I can clear my console here and type this
    • 1:17:04in, 1 colon length of feed options.
    • 1:17:07And I'll see 1 through 6.
    • 1:17:09But if the length was ever 7 or 8 or 9 or 10,
    • 1:17:11I would get back 1 through 7, 8, 9, or 10, making this more dynamic overall.
    • 1:17:17So a great improvement to make here.
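A sketch of the more flexible version, assuming a hypothetical seventh feed ("oats") has been added so that a hard-coded 1:6 would no longer fit:

```r
# Hypothetical options, now seven of them
feed_options <- c("casein", "fava", "linseed", "meatmeal",
                  "soybean", "sunflower", "oats")

# The number vector now follows the length of feed_options
formatted_options <- paste0(1:length(feed_options), ". ", feed_options)

cat(formatted_options, sep = "\n")
```

R also provides seq_along(feed_options), which yields the same sequence here and additionally returns an empty sequence when the vector is empty, whereas 1:0 counts down from 1 to 0.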
    • 1:17:19I think there's still other improvements we can make, though.
    • 1:17:22So if I were to run this program as a user,
    • 1:17:25and I were to enter the feed type I wanted to view, like casein, well,
    • 1:17:29I don't actually see anything.
    • 1:17:30So I'll need to now figure out how to find the subset of data
    • 1:17:33the user has asked for.
    • 1:17:35Well, if I go down to the bottom of my program now,
    • 1:17:37I could write that piece of code.
    • 1:17:41Let me make a comment here that says Print selected option.
    • 1:17:44And I'll go ahead and try to find the subset of data the user asked for.
    • 1:17:48Now, they've given me a number, like 1, 2, 3, 4, 5, or 6.
    • 1:17:53I'll probably need to convert that to the feed option they hope to see.
    • 1:17:57So why don't I make a new object, one called selected feed,
    • 1:18:01like this, that will really take the user's number
    • 1:18:04and convert it to the actual character representation,
    • 1:18:07whether it's casein or linseed or so on?
    • 1:18:09To do that, I could still use the feed options
    • 1:18:11vector, which has, of course, our feed options as characters inside of them.
    • 1:18:15And maybe I could use as the index the user's number
    • 1:18:18they selected because if they asked for number 1,
    • 1:18:20they want the first feed option, or number 2, the second feed option,
    • 1:18:23and so on.
    • 1:18:24So here, I'll index in using the user's feed choice
    • 1:18:28and get back now their selected feed as a character.
    • 1:18:31And finally, I could print out the subset of data they had asked for.
    • 1:18:35So I'll print the subsetted version of chicks,
    • 1:18:39where the feed column is equal to the user's selected feed, just like this.
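One way this step might look, with the user's number hard-coded so the sketch runs without a prompt. The data and the names feed_choice and selected_feed are assumptions based on what is said aloud.

```r
# Hypothetical data standing in for the cleaned chicks data frame
chicks <- data.frame(
  feed = c("casein", "casein", "fava", "linseed"),
  weight = c(368, 390, 160, 229)
)
feed_options <- unique(chicks$feed)

# In the lecture this number is typed by the user; fixed here
feed_choice <- 2

# Convert the number to the feed name at that position
selected_feed <- feed_options[feed_choice]

# Print the subset of chicks that ate the selected feed
chosen <- subset(chicks, feed == selected_feed)
print(chosen)
```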
    • 1:18:44So now my program should hopefully work a little bit better.
    • 1:18:46If I were to save it and click Source, I'll now be able to type in, let's say,
    • 1:18:511.
    • 1:18:52And I'll see that subset that corresponds to the casein chicks.
    • 1:18:55Let me go ahead and clear my terminal again and click Source.
    • 1:18:58And what if I did 2?
    • 1:18:59Well, I'll see the fava chicks.
    • 1:19:01That seems to be going pretty well for me.
    • 1:19:03But as we've talked about, I think it's worth thinking defensively here still.
    • 1:19:08So if I click on Source, what if I were being malicious as a user,
    • 1:19:12and I typed in something like this?
    • 1:19:130.
    • 1:19:14What will we get?
    • 1:19:15I'll hit Enter.
    • 1:19:16Hm.
    • 1:19:17So I won't see really a friendly output at all.
    • 1:19:20I'll see this empty data frame.
    • 1:19:22And I'll also see zero rows or zero length row names.
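A short sketch of why 0 produces an empty data frame, using hypothetical data: indexing a vector with 0 returns a zero-length vector, and comparing the feed column against that matches no rows.

```r
# Hypothetical data standing in for the cleaned chicks data frame
chicks <- data.frame(
  feed = c("casein", "fava", "linseed"),
  weight = c(368, 160, 229)
)
feed_options <- unique(chicks$feed)

# Index 0 selects nothing, giving character(0)
selected_feed <- feed_options[0]
length(selected_feed)   # 0

# No row can match, so the result has the columns but no rows
empty <- subset(chicks, feed == selected_feed)
print(empty)
```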
    • 1:19:26Ideally, I would show the user something different, something
    • 1:19:28like invalid choice, for instance.
    • 1:19:30But to do this, I think we'll need more tools in our toolkit.
    • 1:19:34I'll need to be able to respond to what the user has entered
    • 1:19:38and take some other path in my program.
    • 1:19:40Now, thankfully, in R, we have access to what
    • 1:19:44are called conditionals, where conditionals
    • 1:19:46let us run some piece of code conditionally,
    • 1:19:48depending on whether some logical expression is true or false.
    • 1:19:51We have, in particular, a keyword called if that will run some block of code
    • 1:19:57if some condition or logical expression is true.
    • 1:20:00So let's try out this if keyword here and see
    • 1:20:03if it can help us out in our program.
    • 1:20:05I'll come back to RStudio.
    • 1:20:07And maybe before we decide to show the user their selected subset,
    • 1:20:12what if I were to handle this invalid case?
    • 1:20:15I might do something like this.
    • 1:20:16I could say Handle maybe invalid input.
    • 1:20:19And why don't I use this if keyword.
    • 1:20:22I'll say if.
    • 1:20:24And then in parentheses, I'll supply some logical expression,
    • 1:20:27some condition that if it is true, I'll do
    • 1:20:30some code that I'll indent and put inside these curly
    • 1:20:33braces here this body of our if statement.
    • 1:20:36Hm.
    • 1:20:36So what should my condition be?
    • 1:20:39Maybe if the feed choice is less than 1, so it's 0, negative 1,
    • 1:20:45negative 2, or so on, or let's say, or the feed choice is greater than 6,
    • 1:20:51just like this, I think that should handle things for us.
    • 1:20:54And notice here, we're actually seeing now this double bar for the
    • 1:20:58or because we're comparing now two single true or false values, not
    • 1:21:02a vector of values here.
    • 1:21:04So what do I want to do if this condition is true?
    • 1:21:07I want to tell the user that they entered an invalid choice, just
    • 1:21:11like this.
    • 1:21:12Let's try it.
    • 1:21:13I'll go ahead and click Source now.
    • 1:21:14And notice how if I do enter a valid choice, like 1,
    • 1:21:19I don't see that line of code that says cat invalid choice
    • 1:21:22because this condition was not true.
    • 1:21:25If it's not true, I won't do the code that is inside of these braces here.
    • 1:21:29But what if this condition is true?
    • 1:21:31I enter some number like 0.
    • 1:21:33Let me try this.
    • 1:21:34I'll click Source.
    • 1:21:35And now I'll type 0.
    • 1:21:36And I'll see-- well, I'll see invalid choice.
    • 1:21:39But I still see that output I didn't want to see.
    • 1:21:43Now, why is that?
    • 1:21:44Well, if I go back to my program here and I read it top to bottom,
    • 1:21:48well, it seems like if I enter 0, I will print out invalid choice.
    • 1:21:53But then I'll still go on and show the subset
    • 1:21:55that I didn't want to show in the first place.
    • 1:21:58So thankfully, we do have other keywords that
    • 1:22:00can make these conditions kind of mutually exclusive.
    • 1:22:03Either do this, or do that.
    • 1:22:05And these keywords look a bit like this.
    • 1:22:07We have one called else if and one called else.
    • 1:22:11So let's use these here as well.
    • 1:22:13I'll come back to my program.
    • 1:22:15And what if I wanted to consider what I should
    • 1:22:17do when the user enters a valid choice?
    • 1:22:20Well, I don't want to print out invalid choice.
    • 1:22:23And I do want to print out the right subset.
    • 1:22:25So let's say, in the case, that the user has entered an invalid choice.
    • 1:22:28I only want to print out invalid choice and not the subset
    • 1:22:31that they want to see.
    • 1:22:32I'll type else here.
    • 1:22:33And now I'll make this kind of mutually exclusive.
    • 1:22:36I'll take this code and put it here.
    • 1:22:38And now, what will happen is if the user enters an invalid choice, like 0,
    • 1:22:44I will print out Invalid choice.
    • 1:22:46But I will not do the code that is now inside of this else block.
    • 1:22:50Let me try it.
    • 1:22:51I'll click Source.
    • 1:22:52And I will then type 0.
    • 1:22:54And now I'll only see Invalid choice.
    • 1:22:57What if I did something else?
    • 1:22:58What if I did source and I did, let's say, 1?
    • 1:23:01Well, now I see exactly the right output.
    • 1:23:04So these conditions here are kind of mutually exclusive.
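The if and else pairing described above might be sketched like this. Wrapping it in a small function called `respond` is an assumption made here so both paths can be tried side by side; the lecture's program reads the choice from the user instead.

```r
# Hypothetical helper: returns the message the program would show.
respond <- function(feed_choice) {
  if (feed_choice < 1 || feed_choice > 6) {
    # Only this branch runs when the choice is out of range.
    "Invalid choice"
  } else {
    # Only this branch runs otherwise, so the subset is never shown by mistake.
    paste("Showing subset for option", feed_choice)
  }
}

cat(respond(0), "\n")
cat(respond(1), "\n")
```

Exactly one of the two branches runs for any given input, which is what makes them mutually exclusive.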
    • 1:23:07Now, we could use the else if keyword, which lets us say else and then
    • 1:23:12ask if some condition is true again.
    • 1:23:15Else if, let's say, maybe the feed choice is valid.
    • 1:23:18I'll say feed choice is maybe greater than our feed choices between, let's
    • 1:23:24say, 1, so greater than or equal to 1.
    • 1:23:26And let's say the feed choice is less than or equal to 6,
    • 1:23:31so between 1 and 6 inclusive.
    • 1:23:33This, I would argue, would still work.
    • 1:23:35We're going to first check if the input is invalid.
    • 1:23:39And if it's not, we're going to check if it is valid.
    • 1:23:41So I'll click Source here, and now I'll run top to bottom.
    • 1:23:44I'll type maybe 0, and I'll see Invalid choice.
    • 1:23:48If I do here maybe a 1, I'll see the casein chicks as well.
    • 1:23:52But I think this is a little less efficient
    • 1:23:55than simply having just an else here.
    • 1:23:57Well, why?
    • 1:23:58Well, kind of logically-- if the input is not invalid,
    • 1:24:03it kind of has to be valid.
    • 1:24:04So why should I ask this question again if it is valid or not?
    • 1:24:08I could remove this if here and simply use an else.
    • 1:24:11But an else if is good if you still have one more question you want to ask,
    • 1:24:15if some other condition is not true.
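As one illustration of a case where else if earns its place, suppose there really is a third question to ask. The sketch below assumes the input could be missing (`NA`), which is not a case the lecture handles; `classify_choice` is a hypothetical helper name.

```r
# Hypothetical helper with three mutually exclusive outcomes.
classify_choice <- function(feed_choice) {
  if (is.na(feed_choice)) {
    # First question: did we get a number at all?
    "Not a number"
  } else if (feed_choice < 1 || feed_choice > 6) {
    # Second question, asked only if the first condition was FALSE.
    "Invalid choice"
  } else {
    # No further question needed: anything left must be valid.
    "Valid choice"
  }
}

print(classify_choice(NA))
print(classify_choice(0))
print(classify_choice(3))
```

The final branch is a plain else for the reason given above: once the earlier conditions have failed, there is nothing left to check.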
    • 1:24:19Let me go ahead and clear this here and go back to what we had before.
    • 1:24:22I'll click Source.
    • 1:24:23And now I'll clear my terminal.
    • 1:24:24And actually, let me get out of this program
    • 1:24:26by typing Control C. Let me click Source now.
    • 1:24:28I'll type 1 for casein, see those chicks.
    • 1:24:31And I'll type Source ag-- click Source again.
    • 1:24:33And now I'll type 0.
    • 1:24:34And I'll see Invalid choice.
    • 1:24:36So I think this is really the best designed version of our program yet.
    • 1:24:40We can handle these various cases of user input
    • 1:24:42and show the user the output they want to see now
    • 1:24:45making use of these conditionals.
    • 1:24:46And so when we come back, we'll see how to combine data from different sources.
    • 1:24:50We'll be back in five.
    • 1:24:52We're back.
    • 1:24:53And so we've seen so far how to remove unwanted pieces of data
    • 1:24:57from our data frames, from our vectors.
    • 1:24:59And we've also seen how to subset our data as well.
    • 1:25:03Now we'll take a look at how we can combine data from different sources
    • 1:25:07into one big data set.
    • 1:25:10Now, for this, we'll introduce the idea of an e-commerce kind of data set,
    • 1:25:15where here, let's say some giant like Amazon
    • 1:25:17is trying to keep track of customers and the purchases that they made.
    • 1:25:21So here in this table, every row corresponds to some purchase
    • 1:25:25made on something like amazon.com.
    • 1:25:27Notice how every customer here has their own unique ID.
    • 1:25:31And one identifies me, and one might identify you.
    • 1:25:34But at the end of the day, every customer has their own unique ID.
    • 1:25:38Now, for every transaction, every checkout on Amazon, for instance,
    • 1:25:42we might keep track of the sale amount, how much this user spent on amazon.com.
    • 1:25:47So it seems like user 9971, they spent $29 when they checked out.
    • 1:25:52User 7934, they spent $71 and so on.
    • 1:25:57Now, when you have lots and lots of this kind of data,
    • 1:26:00it might actually not be stored all in one table.
    • 1:26:03It might be partitioned across several different tables, a bit like this.
    • 1:26:07And it will be your job as the programmer
    • 1:26:09to combine data from these different sources
    • 1:26:12into one data set so you can answer and ask
    • 1:26:15the questions you have about this data.
    • 1:26:18Let's go back to RStudio and actually show
    • 1:26:20an example of combining data from these different sources.
    • 1:26:23So here, in RStudio, I will create a program
    • 1:26:28called sales, where I'm trying to combine sales
    • 1:26:31data from different parts of the year.
    • 1:26:33I'll name this file sales.R. And I'll create it.
    • 1:26:36Now, if I go to my File Explorer over here,
    • 1:26:39I'll notice that I have that program sales.R.
    • 1:26:43But I also have these four CSV files.
    • 1:26:47It seems like one is called Q1.
    • 1:26:49The other is called Q2 and Q3 and Q4.
    • 1:26:53Now, we saw last time this idea of Q representing a question,
    • 1:26:58like in a poll given to some potential voters.
    • 1:27:00Here, though, Q means something different.
    • 1:27:03If you're familiar with business, you might
    • 1:27:04have heard of the fiscal year, kind of similar to the calendar
    • 1:27:07year, but the year in which they actually
    • 1:27:09keep track of accounting and so on.
    • 1:27:10It turns out that that year is broken down into four different parts
    • 1:27:14called quarters, three months at a time.
    • 1:27:16So Q1 stands for the first quarter in the fiscal year, Q2,
    • 1:27:21the second quarter, Q3, Q4, and so on.
    • 1:27:24So these are the four parts of the year of sales that this company had.
    • 1:27:29Now, we were given this data in terms of each of those quarters.
    • 1:27:34Why?
    • 1:27:34Maybe a colleague just gave it to us like that.
    • 1:27:36We need to figure out how to piece this data together now.
    • 1:27:38So let's open up sales.R and see how we could accomplish that task.
    • 1:27:43Come back to my computer here.
    • 1:27:45And let me open up sales.R. And now, let me
    • 1:27:48see if I can first read in each of these individual data files.
    • 1:27:53Maybe I'll call the first one simply Q1 for the first quarter, the first three
    • 1:27:59months of this fiscal year.
    • 1:28:00I'll read the CSV called Q1.csv.
    • 1:28:04And I'll do the same for Q2, Q2.csv.
    • 1:28:09The same for Q3.csv and now the same for Q4.csv, just like this.
    • 1:28:17And now, if I were to run all four of these lines of code top to bottom,
    • 1:28:21I could do so with Source.
    • 1:28:22And I would see in my environment now, I would
    • 1:28:26see that I, in fact, have four data frames, one for each CSV.
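A rough sketch of the reading step follows. So that it runs anywhere, it first writes two tiny stand-in files to a temporary folder; in the lecture the files already exist and would be read directly by name. The column names `customer_id` and `sale_amount` and the Q2 rows are assumptions for illustration.

```r
# Write two tiny stand-in CSV files to a temporary folder.
# (The lecture's real files are Q1.csv through Q4.csv; these rows are illustrative.)
folder <- tempdir()
write.csv(
  data.frame(customer_id = c(9971, 7934), sale_amount = c(29, 71)),
  file.path(folder, "Q1.csv"),
  row.names = FALSE
)
write.csv(
  data.frame(customer_id = c(1001, 1002, 1003), sale_amount = c(15, 120, 48)),
  file.path(folder, "Q2.csv"),
  row.names = FALSE
)

# Read each file into its own data frame, one per quarter.
Q1 <- read.csv(file.path(folder, "Q1.csv"))
Q2 <- read.csv(file.path(folder, "Q2.csv"))

# Same columns in both, but different numbers of rows.
print(names(Q1))
print(names(Q2))
print(c(nrow(Q1), nrow(Q2)))
```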
    • 1:28:31Let's take a look at one of them.
    • 1:28:33So I'll view Q1.
    • 1:28:35View Q1.
    • 1:28:36And I'll see the very same table we saw a little bit earlier.
    • 1:28:40I'll see customer IDs in one column and sale amounts in the other.
    • 1:28:44Remember, every row here represents some purchase that
    • 1:28:47was made from this commerce company.
    • 1:28:50OK.
    • 1:28:51So it seems like Q1 and even Q2 and even if we look at Q3 now,
    • 1:28:57they all seem to have the same structure, the same number of columns,
    • 1:29:02but perhaps different numbers of rows.
    • 1:29:04And this is helpful for us.
    • 1:29:06If we ever have data frames that have
    • 1:29:10the same number of columns
    • 1:29:13and the same names of columns as these
    • 1:29:16have, we can combine them using a function called rbind.
    • 1:29:21Rbind is typed like this.
    • 1:29:23It's literally the character r and then bind.
    • 1:29:25And r does not stand for R the language.
    • 1:29:28It stands for row, row bind.
    • 1:29:30We're going to bind the rows of these various data frames into one big data
    • 1:29:35frame.
    • 1:29:36So rbind takes as input several data frames to combine via their rows.
    • 1:29:42I could first give it Q1 and then Q2 and Q3 and Q4.
    • 1:29:46And now, if I save this result in terms of its own object called,
    • 1:29:51let's say, just sales, for the total sales for the year,
    • 1:29:53if I run this line of code on line six and I view, let's say, sales,
    • 1:29:58I should now see that I have a really big data frame.
    • 1:30:02And to prove it to you, let me go look at my environment over here.
    • 1:30:06Let me make this a little bigger over here.
    • 1:30:08So you might notice that on the right-hand side,
    • 1:30:10I have Q1 and Q2 and Q3 and Q4.
    • 1:30:13Each one has about 2,500 observations.
    • 1:30:16And now sales at the end has about 10,000 observations, or 10,000 rows.
    • 1:30:21Really, it's the combination of each of these rows stacked
    • 1:30:24on top of each other.
    • 1:30:25But I think it's worth visualizing too exactly what we're doing with rbind.
    • 1:30:29Let me show you some slides to depict just what we did here.
    • 1:30:33I'll come back to our slides and show you, let's take two example data
    • 1:30:36frames, one called Q1 and one called Q2.
    • 1:30:40We want to combine by their rows using here rbind.
    • 1:30:44Well, what happens when rbind runs and takes in, as input, Q1 and then Q2?
    • 1:30:49Well, effectively, it takes that first data
    • 1:30:51frame it has, and it keeps those rows at the top of this new data frame.
    • 1:30:56But then it takes the new data frames, like Q2
    • 1:30:59here, and adds those rows at the bottom of this top data frame.
    • 1:31:03For instance, a bit like this.
    • 1:31:05Notice how I took Q2 over here and kind of added it, bound it by the rows
    • 1:31:09at the bottom of Q1, making this one longer data frame.
    • 1:31:14I've done this here for Q1 and Q2 and Q3 and Q4.
    • 1:31:18I can give as many data frames as input to rbind as I want.
    • 1:31:21All I'm doing here is adding row after row
    • 1:31:24after row to make this data frame even longer.
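The stacking described above can be tried on two small data frames. The column names and the Q2 rows below are made up for illustration; only the first two sale amounts echo the table shown earlier.

```r
# Two small data frames with the same column names.
Q1 <- data.frame(customer_id = c(9971, 7934), sale_amount = c(29, 71))
Q2 <- data.frame(customer_id = c(1001, 1002, 1003), sale_amount = c(15, 120, 48))

# rbind keeps Q1's rows at the top and adds Q2's rows underneath.
sales <- rbind(Q1, Q2)

print(sales)
print(nrow(sales))
```

The result has as many rows as the inputs combined, and the same columns as each input.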
    • 1:31:27So let's go back into RStudio.
    • 1:31:29And let's see what is inside of my sales table here, the entire thing.
    • 1:31:34I've lost a bit of information, namely in which quarter each of these sales
    • 1:31:40occurred.
    • 1:31:41Like, do they occur in quarter one or quarter two
    • 1:31:43or quarter three or quarter four?
    • 1:31:45I don't know anymore.
    • 1:31:47So we should probably be a bit careful about combining these.
    • 1:31:50And instead, first, maybe add a column to each of these data
    • 1:31:54frames, maybe one called quarter that tells us exactly what quarter
    • 1:31:58this sale was recorded in.
    • 1:32:00So in the Q1 table, maybe I'll add this column called quarter.
    • 1:32:05And recall from last time, if we want to add a column, we "wish it,"
    • 1:32:10quote unquote, into existence.
    • 1:32:11I simply type the data frame's name, followed by a dollar sign,
    • 1:32:14followed by the column I want to exist.
    • 1:32:16And then I assign it some value.
    • 1:32:20Now, in this case, I would love for the quarter column
    • 1:32:24to just show Q1 for every single row.
    • 1:32:27And if I want that to be the case, I need only type Q1 in quotes.
    • 1:32:32And now, if I reread Q1 and run line two, and now, if I, let's say, view Q1,
    • 1:32:40this data frame here, well, I'll see I have a new column called quarter.
    • 1:32:44And throughout all the rows, I've set that column equal to Q1.
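Here is a small sketch of adding that column. Assigning a single string works because R repeats it for every row; the column names other than `quarter` are assumptions.

```r
# Stand-in for the first quarter's data.
Q1 <- data.frame(customer_id = c(9971, 7934), sale_amount = c(29, 71))

# Naming a column that does not exist yet creates it.
# The single value "Q1" is repeated down every row.
Q1$quarter <- "Q1"

print(Q1)
```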
    • 1:32:50So pretty helpful.
    • 1:32:52But now, if I go back to trying to combine these data frames,
    • 1:32:56what might happen?
    • 1:32:57If I go down to line eight now, I'll run line eight, and oops.
    • 1:33:02I see an error in rbind, which tells me the number of columns of arguments
    • 1:33:07do not match.
    • 1:33:09And I think it's a little obvious what's happened here.
    • 1:33:12So Q1 now has three columns.
    • 1:33:15But Q2, Q3, Q4, these other arguments to rbind, those, in this case,
    • 1:33:20only have two.
    • 1:33:21So we need to make sure we're combining data frames that
    • 1:33:24have the same number of columns.
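The mismatch can be reproduced on a small scale. In this sketch, `tryCatch` is used only so the script keeps running and the error text can be inspected; the lecture simply runs the line and reads the error. The data values are illustrative.

```r
Q1 <- data.frame(customer_id = c(9971, 7934), sale_amount = c(29, 71))
Q2 <- data.frame(customer_id = c(1001, 1002), sale_amount = c(15, 120))

# Q1 now has three columns while Q2 still has two.
Q1$quarter <- "Q1"

# Capture the error message instead of stopping the script.
result <- tryCatch(
  rbind(Q1, Q2),
  error = function(e) conditionMessage(e)
)

print(result)
```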
    • 1:33:26We want to join them at least by row.
    • 1:33:29So let's fix this.
    • 1:33:30Go back to RStudio.
    • 1:33:31And let's go ahead and just make sure that every table has
    • 1:33:34its own column called quarter and that that column is
    • 1:33:37equal to whatever quarter the sales appeared in, so Q2 for Q2
    • 1:33:43and then Q3 for Q3 and then Q4 for Q4, just like this.
    • 1:33:55Now, I can rerun this code top to bottom using Source.
    • 1:33:58I see everything worked just as well.
    • 1:34:00And now when I view sales, I now have that other column
    • 1:34:03called quarter that can allow me to differentiate
    • 1:34:06between individual quarters now of sales.
    • 1:34:09So helpful when I combine this data frame to keep track
    • 1:34:12of where each piece of data came from.
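Put together, the fix might look like the sketch below, shortened to two quarters with made-up rows. Once every data frame carries its own `quarter` column, the combined table can still be split back apart by quarter.

```r
Q1 <- data.frame(customer_id = c(9971, 7934), sale_amount = c(29, 71))
Q2 <- data.frame(customer_id = c(1001, 1002), sale_amount = c(15, 120))

# Label every data frame before combining, so the column counts match.
Q1$quarter <- "Q1"
Q2$quarter <- "Q2"

sales <- rbind(Q1, Q2)

# The quarter column lets us recover where each row came from.
q2_sales <- sales[sales$quarter == "Q2", ]
print(q2_sales)
```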
    • 1:34:15Now, one kind of last flourish here that can actually
    • 1:34:18show us another new feature of R is going
    • 1:34:20to be trying to categorize this data.
    • 1:34:23So we combined it.
    • 1:34:25But one thing I want to do is figure out which rows
    • 1:34:28were particularly high-value sales.
    • 1:34:31Maybe my boss wants me to figure out which
    • 1:34:33customers were spending the most money.
    • 1:34:35Well, ideally, we'd want to create a new column
    • 1:34:38and have it be based on the values of some other column.
    • 1:34:41For instance, let's say this is our table again, this one called sales.
    • 1:34:47I still have the same customer ID and the same sale amount.
    • 1:34:50But now I want to categorize this data, to add another column that tells me
    • 1:34:55whether a sale amount was a high-value transaction
    • 1:34:59or if it was just a regular one.
    • 1:35:00So this could look a bit like this.
    • 1:35:02Maybe I add this column called value for the value of this sale.
    • 1:35:07And if it's over 100, I'll mark it, I'll flag it as high-value.
    • 1:35:11But if it's not, well, I'll just make it a regular old sale.
    • 1:35:14And this could help me later on find a subset of my data
    • 1:35:18that includes only those high-value transactions and those customers who
    • 1:35:22spent more money than usual.
    • 1:35:24So let's try to actually add in this value column.
    • 1:35:27And it turns out that to do so, we make use of those same conditionals
    • 1:35:31we just saw.
    • 1:35:32Come back to RStudio here.
    • 1:35:35And why don't we try this.
    • 1:35:38Ideally, I might create some kind of logical expression on sales.
    • 1:35:43I would say if the sales, the sale amount column,
    • 1:35:47is now greater than, in this case, 100, and if it is,
    • 1:35:52well, I want to create a column that has high value for those particular rows.
    • 1:35:58Otherwise, just regular.
    • 1:35:59So let me run this particular logical expression, line 15.
    • 1:36:03And I'll get back this really long logical vector.
    • 1:36:06I see a few TRUEs in there.
    • 1:36:09So it seems like there are a few rows where someone just spent over $100.
    • 1:36:12But now my job is to create a vector that if this sale amount was
    • 1:36:17greater than 100, shows high value, and if it wasn't, shows just regular.
    • 1:36:22Well, I could use a conditional.
    • 1:36:24But I could use a special kind of conditional
    • 1:36:26that R has, one that works really well with vectors
    • 1:36:29and producing vectors as well.
    • 1:36:31This is called ifelse, written as one word, as a function now.
    • 1:36:35ifelse is a function.
    • 1:36:36And its first argument is going to be the logical expression
    • 1:36:40to actually evaluate for every row.
    • 1:36:44So here, I have sales, sale amount greater than 100.
    • 1:36:47And if this is true, my second argument to ifelse
    • 1:36:51will be the value I want to see in the resulting vector.
    • 1:36:55So I want to see High Value here.
    • 1:36:58And the third argument will be, what if it's a case it's not true?
    • 1:37:02Else, in this case.
    • 1:37:03I want to see Regular.
    • 1:37:05And now, with these three arguments, ifelse will return to me
    • 1:37:09a vector where if this condition is true, I'll see High Value.
    • 1:37:13If it's not true, I'll see Regular.
    • 1:37:16Let's try it.
    • 1:37:17I'll run line 15.
    • 1:37:18And now I'll see a similar vector.
    • 1:37:22But now, all of those TRUEs are replaced by High Value, and all of those FALSEs
    • 1:37:28are replaced by Regular.
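A minimal sketch of the three-argument call, using a short made-up vector of sale amounts in place of the real column:

```r
# Illustrative sale amounts, including one exactly at the threshold.
sale_amount <- c(29, 71, 120, 100, 250)

# First argument: the condition, checked for every element.
# Second argument: the value to use where the condition is TRUE.
# Third argument: the value to use where it is FALSE.
value <- ifelse(sale_amount > 100, "High Value", "Regular")

print(value)
```

Note that a sale of exactly 100 comes out as Regular here, since the condition asks for strictly greater than 100.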
    • 1:37:29So it seems to me like this allows me to create
    • 1:37:32some new column for my data frame.
    • 1:37:34I could then assign this vector as a column in my data frame.
    • 1:37:39I could say sales dollar sign, and then maybe I'll
    • 1:37:42make a new column called-- we called it value before.
    • 1:37:44I'll assign that vector produced by ifelse now to the value column in sales.
    • 1:37:50And if I run this line and now view sales, just like this,
    • 1:37:54I should see that I now have this new column called value.
    • 1:37:57And if I were to visually sort by sale amount to find those high-value transactions,
    • 1:38:02I would see all of those now are marked as High Value.
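As a sketch of this last step, the new column can be assigned and then used to pull out or sort the high-value rows in code rather than by eye. The data below is made up, and sorting with `order` is one possible approach, not necessarily what is done on screen.

```r
# Stand-in for the combined sales table.
sales <- data.frame(
  customer_id = c(9971, 7934, 1001, 1002),
  sale_amount = c(29, 71, 120, 250)
)

# Store the vector produced by ifelse as a new column.
sales$value <- ifelse(sales$sale_amount > 100, "High Value", "Regular")

# Keep only the high-value transactions.
high_value <- sales[sales$value == "High Value", ]

# Or sort the whole table from largest sale to smallest.
sorted <- sales[order(sales$sale_amount, decreasing = TRUE), ]

print(high_value)
print(sorted)
```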
    • 1:38:05So you've seen here how to do a lot of things in this lecture,
    • 1:38:08how to subset our data, how to use conditionals
    • 1:38:11to take multiple paths in our programs, and finally, how
    • 1:38:14to combine data from different sources.
    • 1:38:16Next time, we'll dive even deeper into functions,
    • 1:38:18writing some of our very own.
    • 1:38:20We'll see you next time.