CS50 Video Player
    • 🧁

    • 🍪

    • 🍐

    • 🍿
    • 0:00:00Introduction
    • 0:15:37Flat-file Databases
    • 0:48:04Relational Databases
    • 1:12:06Break
    • 1:23:44Halloween Costume Contest
    • 1:25:13Shows
    • 1:35:06Schema
    • 1:40:14Data Types and Constraints
    • 1:41:40Primary and Foreign Keys
    • 2:15:45Indexes
    • 2:24:04Python and SQL
    • 2:27:50Race Conditions
    • 2:34:07SQL injection attacks
    • 0:15:37All right, this is CS 50
    • 0:15:41this is week seven on the day before all. Hellos eve.
    • 0:15:45Uh Today, we will introduce um yet another language.
    • 0:15:48The goal of which is not to introduce another language for language's sake.
    • 0:15:52But to really begin to emphasize that when it
    • 0:15:54comes to the world of software programming engineering,
    • 0:15:57more generally, like there's often different tools for different jobs.
    • 0:16:00And if you
    • 0:16:01were to try to use C to solve all of our future problems, like it would actually be
    • 0:16:05painful as you've already seen how much more uh you know,
    • 0:16:08how much more pleasant it is to solve certain problems with Python.
    • 0:16:11But today, quite quickly,
    • 0:16:12we'll realize that even Python is actually not the
    • 0:16:14most pleasant way to solve a lot of problems,
    • 0:16:15especially when it comes to data, data at scale.
    • 0:16:18And in fact,
    • 0:16:19as we transition in the final weeks of CS 50 to the world of web programming.
    • 0:16:23And if you so choose with your final project, mobile programming,
    • 0:16:25you'll actually need more tools in your tool kit than see in Python alone.
    • 0:16:29In fact, today we'll introduce a database centric language called SQL
    • 0:16:33sequel next week we'll explore markup languages like HTML
    • 0:16:37and CS S a bit of javascript as well.
    • 0:16:39And then we'll synthesize all of this together
    • 0:16:41and to the end of the class as some of you might for your final projects as well.
    • 0:16:45But before we do that and talk about data, let's actually start to gather some data.
    • 0:16:50So if you could visit this URL here on your phone or laptop or if easier,
    • 0:16:56here's a barcode version there of you can point your camera at this barcode here
    • 0:17:01and it's gonna pull up a relatively short Google form.
    • 0:17:04That's just gonna ask you a couple of questions
    • 0:17:07so that we can collect some actual live data
    • 0:17:10and actually play around with it initially in Python.
    • 0:17:13So if you go to that URL there, you'll see a Google form.
    • 0:17:16And if you haven't been able to scan it quite yet, that's fine.
    • 0:17:20Just kind of look over the shoulder of the person next to you,
    • 0:17:22but you or they will see a little something like this.
    • 0:17:25So among the questions will be,
    • 0:17:26which is your favorite language on the list only thus,
    • 0:17:29scratch C and Python.
    • 0:17:31And below that,
    • 0:17:32you'll see another question asking about your
    • 0:17:34favorite problem across the problem sets.
    • 0:17:37Thus far each of them is radio buttons, which means you'll be able to select one
    • 0:17:41button for each of those questions.
    • 0:17:44And ultimately,
    • 0:17:45what's going to be nice is that if you've
    • 0:17:47never used Google Forms before as an administrator,
    • 0:17:49all of that data is actually going to end up being not only in Google forms,
    • 0:17:52but also if you so choose in Google spreadsheets,
    • 0:17:55which is an example of course of a spreadsheet software
    • 0:17:57learn spirit to apple numbers on Macs or Microsoft Excel on different platforms.
    • 0:18:02And Google Sheets is going to allow us to store all of that data in rows and columns.
    • 0:18:06And so since Google made both of these products, they integrated one with the other.
    • 0:18:10So in fact, if I on my laptop here in another window, open this up, let me flip over.
    • 0:18:16Here's the live spreadsheet and we'll see that the very first person who
    • 0:18:19buzzed in really liked Python as did a lot of other people thereafter.
    • 0:18:23But hello world was your
    • 0:18:26in Python, which is great. There's a couple of votes for scratch here.
    • 0:18:29If we scroll down,
    • 0:18:30there's one hold out for c who really liked credit in this case here.
    • 0:18:34And if we scroll down further,
    • 0:18:36it looks like Python and scratch are in there a few more CS and so on and so forth.
    • 0:18:41So suppose that we want to actually now analyze this data.
    • 0:18:43Now,
    • 0:18:44any of you have used Excel or numbers or Google spreadsheets know that it
    • 0:18:47comes with built in functions and formulas and we can do all of that.
    • 0:18:50But suppose there was a huge amount of data or suppose that this stage
    • 0:18:53it was not coming in via Google sheets but via your own web application or
    • 0:18:57your own mobile application and it's just ending up in some kind of spreadsheet.
    • 0:19:01Well, wouldn't it be nice if we could actually analyze that kind of data with code?
    • 0:19:04And in fact, you can,
    • 0:19:06and the simplest way to store a bunch of data isn't with anything fancy in fact,
    • 0:19:11but just like a literal text file,
    • 0:19:13something ending in dot txt or maybe more commonly dot CSV.
    • 0:19:17In fact, what we'd call a flat file database is literally just code for a like
    • 0:19:21of text file containing all of your data.
    • 0:19:24But your data typically has delimiter that separate some values from others.
    • 0:19:28In fact,
    • 0:19:28the most common approach dare say is to use
    • 0:19:31indeed comma separated values or CS V files.
    • 0:19:35And what that means is that in simple text alone, just asking maybe unicode,
    • 0:19:39you sort of mimic the idea of rows and columns by using new
    • 0:19:43line characters like backslash N to represent row after row after row.
    • 0:19:47That's pretty straightforward to just move the cursor to the next line.
    • 0:19:50And because text files aren't graphical,
    • 0:19:52so there's no notion of like vertical bars that you can
    • 0:19:54put between what you and I think of as columns,
    • 0:19:57you literally just use a comma or some other
    • 0:19:59such symbol to separate one value from another.
    • 0:20:03So in fact, let me go ahead and do this even if you've never done this before,
    • 0:20:05it turns out in Google sheets and also Excel and numbers, you can export your data,
    • 0:20:10not in some proprietary Apple or Microsoft or Google format,
    • 0:20:14but in kind of a globally uh portable format known as
    • 0:20:18dot CS V.
    • 0:20:19So let me go up to file,
    • 0:20:20I will go to download and notice I can export this in a bunch of different formats.
    • 0:20:25But the one I care about for today is going to be dot CS V on my Mac.
    • 0:20:28That's gonna put it into my downloads folders.
    • 0:20:30And what I'm gonna go ahead here and do in just a
    • 0:20:32moment is let me go ahead and open up uh VS code.
    • 0:20:36Let me go ahead and drag that text file that I downloaded into my code space.
    • 0:20:42And in just a moment, we'll see that
    • 0:20:45we have
    • 0:20:46if I type Ls here a file, if I type LS here, a file called CS 50 2023 and so forth,
    • 0:20:52I'm actually going to go ahead and rename this file quite quickly.
    • 0:20:55And what I'm really going to upload to my
    • 0:20:58uh code space here is a file called Favorites dot CS V.
    • 0:21:02So in short, I've downloaded this file,
    • 0:21:03it gave Google gave it a long default file name.
    • 0:21:06I've renamed it to Favorites dot CS V.
    • 0:21:08Let me go ahead and actually open it now in VS
    • 0:21:10code here and this now is the exact same data.
    • 0:21:13If you're still submitting the form,
    • 0:21:14I'm afraid you didn't make the cut off So we're not going to see more
    • 0:21:17data ending up in this file because I've literally downloaded a copy of it.
    • 0:21:20But indeed, if I scroll through,
    • 0:21:22we'll see that it's got some 399 lines of data or technically 398
    • 0:21:28because the very first one is what we call like a header row,
    • 0:21:30which just describes what each of these columns means.
    • 0:21:34Now notice that even though the Google
    • 0:21:35Sheets interface actually put all of this data
    • 0:21:38in proper graphical rows and columns,
    • 0:21:41you can still see the rows because they're just separate lines and you
    • 0:21:44can kind of see the columns by focusing on like the comma here,
    • 0:21:47the comma here, the comma here and then also the comma here, the comma here,
    • 0:21:52the comma here and so forth.
    • 0:21:53Now a quick subtlety. Hello
    • 0:21:55world is sort of a two word problem name and it it itself has a comma in it.
    • 0:22:00What's to stop me from confusing the comma in hello world
    • 0:22:06from like the commas that Google apparently inserted into this file.
    • 0:22:10Yeah, so automatically you all did not type this, we did not type it into the form,
    • 0:22:14but Google is smart enough as is apple and, and uh Microsoft when they export CS vs
    • 0:22:19to somehow escape seemingly dangerous characters or
    • 0:22:22characters that could just break things.
    • 0:22:24And in this case,
    • 0:22:24a convention in the CS V world is just to quote any strings
    • 0:22:28that themselves have commas so that any program you're writing that reads,
    • 0:22:31this file doesn't get confused.
    • 0:22:33So the only thing that came in automatically from Google is just this time stamp here
    • 0:22:37based on the time of day in our own local time zone that was added automatically.
    • 0:22:42So we have three and not just two columns.
    • 0:22:44So with that said, we have three columns, time stamp language and problem.
    • 0:22:49The latter two of which came from all of you.
    • 0:22:51Let's actually write some code that analyzes this data.
    • 0:22:54Let's figure out what the most popular something or other is here.
    • 0:22:57So I'll close the CS V file.
    • 0:22:59Let me go ahead and use what seems to be folks favorite language
    • 0:23:02thus far and write a file called favorites dot P for instance.
    • 0:23:06And I'm gonna use Python to open that CS V file, analyze it,
    • 0:23:10crunch some numbers and output some statistics if you will.
    • 0:23:13So the easiest way to manipulate CS V files as you
    • 0:23:16might have gleaned is not to just open the file yourself.
    • 0:23:19Look for commas, allocate memory or anything like that.
    • 0:23:21In C in Python, you can literally
    • 0:23:24just import CS V which is a Python module that gives you CS V related functionality
    • 0:23:30in Python. There's a bunch of ways to open files. One way to open a file is like this.
    • 0:23:36In Python, you can say file equals open,
    • 0:23:39similar to F open and C you can specify the name of the file,
    • 0:23:43you want to open like favorites dot CS V.
    • 0:23:45And you can explicitly say you want to open file for reading, quote
    • 0:23:49unquote R just like f open. Strictly speaking in Python read is implied.
    • 0:23:54So if you omit the second argument, it will still work.
    • 0:23:56But for parody with F open, I'll do the same here,
    • 0:23:59then you can maybe do something with file and then at the end,
    • 0:24:04you can close the file like this.
    • 0:24:06So in Python, there really is a mapping between F open and F close.
    • 0:24:10But in Python, the functions are called open
    • 0:24:12and close for short.
    • 0:24:13But a more common way, a more Python way so to speak,
    • 0:24:16ie the way people tend to do it in Python is
    • 0:24:19actually to use a keyword that didn't exist in C where you
    • 0:24:22instead say with and you say with open this file name as
    • 0:24:26a specific file name and then indent it inside of that.
    • 0:24:30Now you can do whatever you want with the file and the implication of using with,
    • 0:24:33even though it's not obvious from the keyword itself is
    • 0:24:36that the file will be automatic closed for you later.
    • 0:24:38So this is just a minor Python convenience so that you don't
    • 0:24:41have to remember to close the file unless something go wrong,
    • 0:24:44it will just close automatically as soon as you're outside of this with block.
    • 0:24:48So how do I go about reading a CS V?
    • 0:24:50The simplest way is to actually give yourself a variable called maybe reader.
    • 0:24:55Like if I want to read the CS V, I'll call my variable reader set that equal to the
    • 0:24:59the return value of a function that comes with
    • 0:25:02Python CS V module called reader in lower case and you just pass in the file name.
    • 0:25:08So the first line there on line three opens the file
    • 0:25:10and gives me access to the bytes there in line four.
    • 0:25:13Now actually uses this library that comes with Python to just go read it for me,
    • 0:25:18figure out where the commas are so that it can
    • 0:25:19hand me line after line the data in the file.
    • 0:25:22Now the first piece of data in the file though was what,
    • 0:25:25what's the first very first row?
    • 0:25:28So it was that header row with time stamp language problem.
    • 0:25:32I actually wanna skip that because that's not data.
    • 0:25:34That's what we might call metadata that's describing my actual data.
    • 0:25:38So one way to deal with this, I can literally just say next reader and next,
    • 0:25:41even though the semantics here are a little weird, it just means no, no,
    • 0:25:44give me the next line from that reader instead of the first by default.
    • 0:25:48And now inside of this with uh block,
    • 0:25:51I can do something like this for each row
    • 0:25:54in that reader. Let's do something super simple.
    • 0:25:57Initially, let's just print out row bracket one,
    • 0:26:01so row bracket one. So what's going on here?
    • 0:26:04Well, the CS V reader in Python is going to return to you inside of this loop,
    • 0:26:09one row after another, each of those rows though has how many columns, three,
    • 0:26:14the time stamp the language and the problem.
    • 0:26:16And just like in c our lists in Python are zero index. So 012.
    • 0:26:21So if I'm printing out row bracket, one, that's the second or middle column.
    • 0:26:26So what's this gonna print
    • 0:26:28row after row,
    • 0:26:30each of the languages that you all uh replied with?
    • 0:26:33So that's all this isn't doing any kind of analytics yet.
    • 0:26:35I'm just kind of going through the motions to
    • 0:26:37at least print out some data of interest.
    • 0:26:39So let me run this Python a favorite stop pie enter
    • 0:26:42and it happens super fast. But if I scroll back in my term
    • 0:26:46window, there is all of that raw data.
    • 0:26:48So this is to say once you know the function names, once you know the keywords,
    • 0:26:52it's actually pretty simple in Python to just get up and running
    • 0:26:54with a file and start looking at the data there in.
    • 0:26:57But it turns out that it's a little
    • 0:26:59sub optimal
    • 0:27:01uh to use the reader alone. It turns out there's better ways to do this.
    • 0:27:06And let me make this clear as follows. I don't strictly need a variable.
    • 0:27:09But let me actually declare a variable called favorite set,
    • 0:27:12it equal to row bracket one and then print out that favorite value.
    • 0:27:16This is not doing anything new. It's just declaring an additional variable.
    • 0:27:20But I wanted to highlight the fact that I'm just kind of trusting that row bracket.
    • 0:27:24One is the problem or is the language I care about is language is language.
    • 0:27:28But suppose one of you
    • 0:27:30or I go into the Google spreadsheet and like, you know, anyone might,
    • 0:27:33with a spreadsheet,
    • 0:27:33you might start moving things around and you might
    • 0:27:35swap some of the columns left and right.
    • 0:27:38You might delete one of the columns,
    • 0:27:39add something else in short spreadsheets are arguably fragile.
    • 0:27:43And that like it's pretty easy in the gooey,
    • 0:27:44the graphical user interface to change them around.
    • 0:27:47And so my code in Python accordingly is arguably kind of fragile because I'm
    • 0:27:51just hoping that row bracket one is always the data that I care about.
    • 0:27:56So what would be marginally better?
    • 0:27:57Well,
    • 0:27:57let's actually use that header row instead and more common
    • 0:28:01arguably in Python is not to use a simple reader,
    • 0:28:04but instead what we would call a dictionary reader.
    • 0:28:06I'm gonna change this to dict reader capital D capital R
    • 0:28:10and that's it for the change to line four on line five.
    • 0:28:13I'm gonna get rid of the next line because I don't want to skip the header.
    • 0:28:16Now, what dict reader does, which reader does not,
    • 0:28:20is it automatically analyzes that first line in the file
    • 0:28:23figures out what are all of your columns called.
    • 0:28:26And thereafter, when you iterate over this reader,
    • 0:28:29what each of your rows now is it's no longer a list of size of size three, bracket,
    • 0:28:36zero, bracket, one bracket, two,
    • 0:28:37each row that you get back in this loop on line five is now wonderfully
    • 0:28:42a dictionary.
    • 0:28:43Instead the keys of which are from the header field, time stamp language problem.
    • 0:28:48The values of which are whatever each of you typed in again and again.
    • 0:28:53So how do I change the
    • 0:28:55I no longer have these numeric indices?
    • 0:28:57Because row is no longer a list, it's a dictionary.
    • 0:29:00So if I literally want the language that you typed in,
    • 0:29:03I can use row bracket, quote
    • 0:29:05unquote language treating row is a DT not as a list anymore.
    • 0:29:10Now, this is again, more robust because if you move the columns around on me,
    • 0:29:13code is still going to work at least so long
    • 0:29:16as you don't rename the columns at the very top.
    • 0:29:18So that's still one assumption I'm making.
    • 0:29:21All right. Well, beyond that,
    • 0:29:23what can I actually do here?
    • 0:29:24Well, just to be clear, I don't strictly speaking, need this variable.
    • 0:29:28So no need to highlight this again.
    • 0:29:30So let me just simplify the code a little bit and get rid
    • 0:29:32of this variable and instead just print out the language in that row
    • 0:29:36as a quick check. Let me rerun
    • 0:29:38uh Python of favorites dot P and it seems to still work it,
    • 0:29:43even though there's a lot of output,
    • 0:29:44we're not gonna check all 399 lines but it looks
    • 0:29:46like it printed out all of those popular languages.
    • 0:29:49Well, what more can we actually do?
    • 0:29:51Well,
    • 0:29:51let's actually now start to crunch some numbers and figure out how many people,
    • 0:29:55like scratch how many people like c how many people like Python, like,
    • 0:29:58let's start to analyze this.
    • 0:29:59So maybe the most pedantic way to do this in Python would be to
    • 0:30:04kind of just create some variables and do all of the counting myself.
    • 0:30:08So let me actually go ahead and do this. Let me delete this code for now.
    • 0:30:11And after opening the reader,
    • 0:30:12let me create a variable called scratch and set it equal to zero,
    • 0:30:16a variable called C set, it equal to zero, a lan
    • 0:30:18uh variable called Python. Set it equal to zero
    • 0:30:21just so that I have three counters similar to what we did in week, zero, week one,
    • 0:30:25anytime we counted anything,
    • 0:30:27honestly, this looks a little stupid and it's not wrong. In fact, this
    • 0:30:31is how you would do it in Python. But Python also has some clever syntax.
    • 0:30:35If you want to be really cool, you could do scratch comma C comma Python
    • 0:30:41equals 000 and do all three at once if you like that.
    • 0:30:45So it tightens up the code a little bit even though the effect is ultimately the same.
    • 0:30:48Now, let's go ahead and iterate over this file row by row by row.
    • 0:30:53And if we see scratch increment, the scratch counter.
    • 0:30:55If we see c increment, the C counter, if we see Python increment,
    • 0:30:58the Python counter instead.
    • 0:31:00So how do I do this?
    • 0:31:01Well, I could do something like four each row in the reader just like before,
    • 0:31:05let me go ahead and get that favorite variable
    • 0:31:08and set it equal to the language in that dictionary
    • 0:31:11that just came back as part of this iteration.
    • 0:31:13And now I can do something like this. If favorite equals equals scratch,
    • 0:31:19then with my indentation, I can do scratch plus equals one. L.
    • 0:31:25If favorite equals equals C, I can go ahead and increment C plus equals one.
    • 0:31:33And I don't think I want
    • 0:31:34just to be safe, even though we only saw three options, I think just to be super safe.
    • 0:31:38L if favorite equals
    • 0:31:40equals Python, then let's go ahead and increment Python.
    • 0:31:43The one thing we can't do that we could do in C is the plus plus trick.
    • 0:31:46So plus equals is as close as we can get. So what have I done inside of this loop?
    • 0:31:51I've just incremented each of those counters by one. If I see scratch or C or Python
    • 0:31:56again and again, all right outside of the loop and outside of the width block.
    • 0:32:01Because once I'm done reading all of the rows,
    • 0:32:03I might as well let the width clause close the
    • 0:32:06file automatically. Let's just go ahead and print out some value.
    • 0:32:09So I'm gonna go ahead and print out
    • 0:32:11scratch colon and, and then inside of there,
    • 0:32:14let's print out whatever the value of that variable is,
    • 0:32:17let's then go ahead and print out C colon and
    • 0:32:21then whatever the value of the C variable is.
    • 0:32:23And then lastly let's print out Python colon and whatever the value is.
    • 0:32:27And now I, I made a three typos here.
    • 0:32:30This is not going to print the values. What do I need to do?
    • 0:32:33Sorry.
    • 0:32:35Yeah,
    • 0:32:35I'm missing the f in front of each of these strings so that I actually get formatted.
    • 0:32:39And that is to say the variables get interpolated inside
    • 0:32:42of the curly braces like we saw last week.
    • 0:32:44All right.
    • 0:32:44So honestly,
    • 0:32:45that's kind of a decent amount of code like 18 or so lines of code just to count
    • 0:32:49the number of responses. But let's see if I got it right.
    • 0:32:52Let me open my terminal and run Python of favorites dot P.
    • 0:32:56And now I see by an overwhelming amount uh folks like Python followed by C,
    • 0:33:02followed by scratch in that, in that,
    • 0:33:05in that order. But that was a decent amount of code to have to write.
    • 0:33:08And it turns out there's actually better ways of doing this, more Python ways,
    • 0:33:11more program
    • 0:33:12ways of doing this.
    • 0:33:13And if we think back to one of our universal data structures,
    • 0:33:16think back to how we preached last week and the week
    • 0:33:19before the value of these dictionaries more generally like the CS
    • 0:33:23V module is clearly a fan of them because that's what
    • 0:33:25Dick Reader is giving us dictionary after dictionary after dictionary.
    • 0:33:28And this was the general idea of a dictionary.
    • 0:33:30It associates keys with values much like you might in
    • 0:33:33a two column table on a chalkboard or the like,
    • 0:33:36well, this is kind of what I need, right?
    • 0:33:38Like if I want to keep track of how many people said scratch and C and Python,
    • 0:33:43you know, if I had a piece of chalk,
    • 0:33:44I could just kind of write scratch and C and Python is
    • 0:33:48three keys and then with my chalk and maybe an eraser,
    • 0:33:51keep track of the values.
    • 0:33:53They all start at zero. Then I add
    • 0:33:54at
    • 0:33:54two at three or maybe on a chalkboard, I actually use hash marks,
    • 0:33:58but like a dictionary is like kind
    • 0:33:59of the perfect data structure for just associating
    • 0:34:02something like scratch C Python with
    • 0:34:04something else keys with values respectively.
    • 0:34:08So this is gonna look a little weirder,
    • 0:34:10but it's going to be pretty conventional to do something like this.
    • 0:34:13Instead, let me go back into vs code.
    • 0:34:16I'll close my terminal window and let me go ahead and actually delete
    • 0:34:20a lot of this because I can simplify this further.
    • 0:34:23Let me go ahead and now give myself maybe a variable, just generically called counts
    • 0:34:29and set that equal to an empty dictionary.
    • 0:34:32And you can actually do this in a couple of ways.
    • 0:34:33You can literally write out dict with nothing in parenthesis,
    • 0:34:36which will give you an empty dictionary like the picture on the board.
    • 0:34:39But a little more conventional is to just use teeth,
    • 0:34:41key strokes and use two curly braces with nothing inside.
    • 0:34:44That gives me an empty dictionary like
    • 0:34:47this picture here.
    • 0:34:48Now, my loop is gonna be the same I'm gonna do for each row in the reader.
    • 0:34:53I'm gonna go ahead and grab the favorite language for this person.
    • 0:34:56So favorite equals row, quote unquote
    • 0:34:59language. But now I'm gonna do something a little different.
    • 0:35:02There's two scenarios here.
    • 0:35:03Either I have seen this language before and I want to increment it by one
    • 0:35:08or if I've never seen this language before because like the loop just started,
    • 0:35:13what should I initialize the count for this language too?
    • 0:35:18So one because I've only seen it once. Exactly.
    • 0:35:20So now let me go ahead and do exactly that.
    • 0:35:23If this current favorite that I am seeing in the row
    • 0:35:27is already in the counts dictionary.
    • 0:35:30And in Python, this is literally how you ask that question.
    • 0:35:33If favorite in counts that will check. Is there a key with this name?
    • 0:35:38Scratch C or Python?
    • 0:35:39If so go into that location in the counts dictionary index into it at the
    • 0:35:45favorite location because favorite is a string.
    • 0:35:47It's either quote unquote scratch C or Python and just increment it
    • 0:35:52by
    • 0:35:53one like this.
    • 0:35:54Else as you noted, if it's not there implicitly, then counts bracket favorite
    • 0:36:00should probably be set equal not to zero because
    • 0:36:03we're literally are seeing it in the current row.
    • 0:36:06Let's initialize it to one.
    • 0:36:07And thereafter, if we see it again,
    • 0:36:09it's gonna be plus equals one plus equals one plus equals one.
    • 0:36:13So now outside of that loop, outside of the width block,
    • 0:36:15let me do this for each favorite in those counts.
    • 0:36:19And this too,
    • 0:36:19in Python is a trick if you want to iterate over all of the keys in a dictionary
    • 0:36:24that is if you want to iterate over the left hand column of all of these keys,
    • 0:36:27you literally can say for something in
    • 0:36:30that dictionary. So for favorite in
    • 0:36:32this is giving me a variable called favorite and
    • 0:36:35updating it automatically top to bottom in that dictionary.
    • 0:36:38Let's go ahead and print out
    • 0:36:40an F string
    • 0:36:41that's going to say whatever the name of that language is colon
    • 0:36:45and whatever the value of that language is in that their dictionary.
    • 0:36:52So again, logically, the only thing that's new is this,
    • 0:36:55I'm now using a one dictionary instead of three variables to keep
    • 0:36:59track of three things like updating this chalkboard with three different things.
    • 0:37:03Scratch C Python.
    • 0:37:05And the last thing I'm doing,
    • 0:37:06which is a little different is once I have that dictionary,
    • 0:37:09whether there's three languages or maybe tomorrow,
    • 0:37:11there'll be fourth because we're going to introduce
    • 0:37:13today. Well, this will iterate over all of those keys and print out the values.
    • 0:37:18All right.
    • 0:37:18So if I didn't do anything wrong, if I do Python of favorites dot pi and hit, enter
    • 0:37:23there, we have it and it happens to be in a different order this time.
    • 0:37:26That's because we saw Python first,
    • 0:37:28we then saw scratch and eventually we saw C
    • 0:37:31but if we wanted to sort these differently,
    • 0:37:33we actually could
    • 0:37:34with some different code. But in short, what have we done?
    • 0:37:37We've created this kind of structure and memory with three keys,
    • 0:37:40Python C and scratch.
    • 0:37:42Because each time we encounter such a language from you all,
    • 0:37:46we either set our counter to one or increment it
    • 0:37:49by one. Instead
    • 0:37:51any questions
    • 0:37:52on this code or this general idea of using dictionaries as
    • 0:37:57kind of like a little cheat sheet for doing some work,
    • 0:37:59some math in this way,
    • 0:38:01super common
    • 0:38:03paradigm.
    • 0:38:05All right. Well, let me tweak this a little bit.
    • 0:38:08Um Right now in my output,
    • 0:38:09we're seeing Python scratch and see maybe for the sake of discussion,
    • 0:38:13suppose we want to sort this by key, we can actually do that.
    • 0:38:15Let me close my terminal temporarily and it turns out in Python,
    • 0:38:18there's a bunch of ways to do this.
    • 0:38:20But the simplest way to sort a dictionary by
    • 0:38:22key is literally to use a function called sorted,
    • 0:38:25comes with Python that just does it for you.
    • 0:38:28And even if you pass it a dictionary,
    • 0:38:29it will sort that dictionary by the left hand
    • 0:38:31column so you can iterate it over alphabetically instead.
    • 0:38:35So if I go back now to VS code, if I open my terminal window and I rerun Python on.
    • 0:38:40Favorite, do
    • 0:38:40I?
    • 0:38:40Now that I've added the sorted call, we should now see,
    • 0:38:43just because it's sorted alphabetically instead.
    • 0:38:46Now that's not that useful.
    • 0:38:49Especially if we had lots of languages,
    • 0:38:50you probably don't care about it being alphabetized as much as you care about it,
    • 0:38:53being ranked by which is the most popular, which is the least popular.
    • 0:38:57And for that, there's a bunch of ways to do this in Python.
    • 0:38:59And I think the simplest way to sort by value the right hand
    • 0:39:03column instead of the left hand column is probably to make this change.
    • 0:39:07Instead let me close my terminal temporarily.
    • 0:39:10Let me still use the sorted function which by default sorts by key.
    • 0:39:15But let's change it to be as follows.
    • 0:39:17Let's change it to sort by a function called counts dot get, which is a little weird.
    • 0:39:24But this comes back to yes,
    • 0:39:25last week's brief discussion of object oriented programming or oop.
    • 0:39:29Remember in Python that almost everything is like an object of some sort.
    • 0:39:33And in is an object. A dictionary is an object.
    • 0:39:36Um A string is an object which is to say that not only do these things have values
    • 0:39:41like quote unquote hello world or 50 these variables.
    • 0:39:44These objects can also have functions built into them. A
    • 0:39:47K
    • 0:39:47A methods. So it turns out that because counts is a dictionary because I made it so
    • 0:39:53that counts dictionary like any dictionary in
    • 0:39:56Python comes with a function called get.
    • 0:39:58And if you just tell the sorted function to use that built in method,
    • 0:40:02it will actually for every key get its value, get its value,
    • 0:40:05get its value and sort effectively by the right hand column
    • 0:40:09instead of the left.
    • 0:40:10Now we'll see down the line perhaps more sophisticated ways of using this.
    • 0:40:14But for now, this just overrides the default behavior and sorts the dictionary,
    • 0:40:18not by key but by value instead.
    • 0:40:20All right. So now watch this, if I run Python of favorites dot P
    • 0:40:24once more previously,
    • 0:40:26it was in the order in which the languages appeared first in the CS V file,
    • 0:40:30then it was sorted alphabetically.
    • 0:40:33Now it should be sorted by
    • 0:40:34value
    • 0:40:36and indeed scratch is the least with 40 C is the next with some of the eight.
    • 0:40:40Python is the biggest with 280. That's not much of a top 10 or top three list.
    • 0:40:44Let's actually reverse it.
    • 0:40:46And the easiest way in Python to do that is to pass a third argument into sorted.
    • 0:40:50And you would know this by just reading the documentation,
    • 0:40:52you can literally say reverse equals true capital T.
    • 0:40:57And now if I rerun this one last time,
    • 0:41:00Python, a favorite stop P,
    • 0:41:01I'll see the same values but with the whole thing reversed in order long story short,
    • 0:41:06even though this might feel like a slog like adding this and looking up this so
    • 0:41:10much easier than like in C where you would have had to like figure out what,
    • 0:41:14how does bubble sort work?
    • 0:41:15Let me implement bubble sort selection sort any of those
    • 0:41:17sorting algorithms or use some other library in Python,
    • 0:41:20you just get a lot more for free, so to speak.
    • 0:41:22It's just built in once you get comfy with the documentation
    • 0:41:25and to be clear, this is an argument as is this as is this? But in Python,
    • 0:41:32we have not only positional arguments which are based
    • 0:41:35on what position they're in left to right,
    • 0:41:37just like C you also have these named
    • 0:41:39parameters whereby they have explicit names that you
    • 0:41:42can use yourself to make clear that you're using this one but not this other one.
    • 0:41:47More parameters in Python can be optional than in C.
    • 0:41:51All right. Any questions
    • 0:41:54about that technique yet.
    • 0:41:56And if you're feeling like this is starting to take the fun out of Python,
    • 0:41:58like that's actually kind of the point of doing this like
    • 0:42:01the hard way.
    • 0:42:03All right. Well, let's do it one other way. That's marginally better.
    • 0:42:06It turns out in Python there really is this reach uh ecosystem of libraries,
    • 0:42:11the code that comes with the language itself or even third parties and coming
    • 0:42:14with the language is another uh module
    • 0:42:18called the collections module or package here.
    • 0:42:21Whereby if I use from collections,
    • 0:42:24I can import something called counter capital C.
    • 0:42:27And it turns out if this felt a little bit painful to create a dictionary yourself,
    • 0:42:31initialize it maybe to zero or one like this
    • 0:42:34turns out you have the same problem that people before you have had.
    • 0:42:37And so there's another way to do this, you can create a variable called counts,
    • 0:42:41set it equal to counter capital C open parent, close print.
    • 0:42:44And this is a different type of object. It's a different type of object in Python that
    • 0:42:49counting capabilities built in.
    • 0:42:51And so if I actually want to use this counter,
    • 0:42:53instead I can do this for each row in the reader.
    • 0:42:56Let's go ahead and grab the favorite language from that row just like before.
    • 0:43:01And without doing any of that headache of like if l if or any of this,
    • 0:43:05you can literally just index into that counter using favorite quote
    • 0:43:10unquote scratch or C or Python
    • 0:43:13and
    • 0:43:14incremented by one what the counter class is gonna do for you. So to speak.
    • 0:43:18Another example of object oriented programming and
    • 0:43:20counts is now an object thereof.
    • 0:43:22What this whole feature of counter is gonna do
    • 0:43:25for you is it's gonna automatically initialize everything to
    • 0:43:28zero even if you've never seen it before and
    • 0:43:30then you can just blindly start incrementing it.
    • 0:43:32So in short, there's just more pleasant ways sometimes to do something
    • 0:43:35in Python as well.
    • 0:43:37All right. How about. Lastly,
    • 0:43:40let's make things maybe.
    • 0:43:43Uh oh actually let's do this. We can even simplify
    • 0:43:47the sorting here. Let me actually take this one step further.
    • 0:43:50Instead of manually figuring out how to sort this, I'm gonna do this four
    • 0:43:55each
    • 0:43:56favorite
    • 0:43:59and the count thereof in the counts variables, most common functions return value,
    • 0:44:06go ahead and print out this as well.
    • 0:44:09So in short again, a bit new syntax, but what's going on here?
    • 0:44:12Well, it turns out that this counts class and
    • 0:44:15sorry, this counter class and in turn,
    • 0:44:17this counts variable comes with a function built in
    • 0:44:20that you would only know from the documentation.
    • 0:44:22It's literally called most underscore common.
    • 0:44:24And what it returns to you when you call it
    • 0:44:27is
    • 0:44:28um a pair of key value, key value.
    • 0:44:32And so this too is a trick in Python that we did
    • 0:44:34not have in C if you want to iterate over something,
    • 0:44:37but grab two variables at a time on each iteration like this,
    • 0:44:41you separate them by commas and can get favorite count, favorite count,
    • 0:44:45favorite count.
    • 0:44:46So if I run this now Python a favorite stop
    • 0:44:49P
    • 0:44:49this two just works and it's getting a little simpler,
    • 0:44:53a little tighter than before than if we had actually done it all manually.
    • 0:44:59Lastly, here's a code that's the shortest version thereof.
    • 0:45:02We're down to like 1415 lines.
    • 0:45:04If I wanted to change this to analyze the most popular problem thus far in the class.
    • 0:45:11How do I go about changing the code to print out
    • 0:45:15top to bottom the most popular problem or problems?
    • 0:45:19What line should change? Yeah.
    • 0:45:21So
    • 0:45:22yeah,
    • 0:45:23line 10 because I've written this in kind of
    • 0:45:25a general purpose way and using dictionaries with keys,
    • 0:45:28it suffices to change language to quote unquote problem because
    • 0:45:30that was the third column from the CS V.
    • 0:45:32And so now if you're curious,
    • 0:45:33let's actually make uh my terminal window a bit bigger Python of favorite do P enter?
    • 0:45:39And OK, tragically,
    • 0:45:40we peaked early with Hello World is the most popular problem thus far,
    • 0:45:44followed by filter, then scratch
    • 0:45:46peaked even earlier, Mario DNA and so forth and a bunch of others thereafter.
    • 0:45:52So based on this sample size, here's the ranking of the problems thus far.
    • 0:45:56So we got it more, more Hello World problems in the weeks to come.
    • 0:46:00All right. So
    • 0:46:02now that we've done that in that way,
    • 0:46:05let's just make this program slightly interactive and see how
    • 0:46:08we can really uh take a fundamentally different approach.
    • 0:46:12I'm gonna go into VS code.
    • 0:46:13I'm gonna keep everything the same except that at the bottom,
    • 0:46:16I'm gonna get rid of this loop because I don't want any more,
    • 0:46:20print out everything I wanna look up specific counts.
    • 0:46:23Like how popular was this problem? How popular was this other problem?
    • 0:46:26And what I'm gonna go ahead and do
    • 0:46:27is to create a variable called favorite set. It equal to
    • 0:46:31I could use get string in the CS 50 library.
    • 0:46:33But we saw last week there's no need to for strings.
    • 0:46:35Certainly,
    • 0:46:35let me just use the input function that comes with
    • 0:46:38Python and prompt the human for their favorite problem.
    • 0:46:41And then let me go ahead and print out for instance,
    • 0:46:44an F string containing whatever their favorite is, colon and whatever the count,
    • 0:46:50uh count is thereof of that favorite close
    • 0:46:54quote. So let me open my terminal window, let me run Python of favorite stop P enter.
    • 0:47:00And if I type in hello comma world looks like 65 people indeed like that one.
    • 0:47:05If I run Python of favorite stop P again, I type in scratch. Now we see that one.
    • 0:47:10If I type in anything else, I'm gonna get its specific value.
    • 0:47:13So this is to say,
    • 0:47:14not only can we write Python code to kind of analyze some
    • 0:47:17data pretty tightly versus the manual code we wrote out earlier,
    • 0:47:22you can also make these programs interactive as well.
    • 0:47:24And this is gonna be a super common paradigm, right?
    • 0:47:26If you go into the world of like consulting analytics data science,
    • 0:47:29more generally like among your um roles is going to be like to analyze data,
    • 0:47:34to ask questions of data.
    • 0:47:35Get back the answer, be asked, questions of data, get back the answer.
    • 0:47:38Honestly,
    • 0:47:39life gets pretty tedious even though you've only been
    • 0:47:41programming in Python perhaps for like one week,
    • 0:47:43week and a half.
    • 0:47:44Now
    • 0:47:44when you have to write code to solve all of the world's problems,
    • 0:47:48and there's this sort of 10
    • 0:47:49in programming that programmers tend to avoid writing code as much as they can.
    • 0:47:54Because ideally you would solve problems with the right tool for the job,
    • 0:47:57minimizing the number of lines of code you actually write.
    • 0:48:01So how do we actually get to that point?
    • 0:48:03Well, instead of just dealing with CS V files, pure text,
    • 0:48:07turns out there's an entire world of proper databases,
    • 0:48:10not flat file databases where you store everything in text files,
    • 0:48:13but like a database program,
    • 0:48:15like a piece of soft are running on a computer
    • 0:48:17running on a server that's always listening for you.
    • 0:48:20It's got a lot of memory, it's got a lot of space and inter a lot of data
    • 0:48:23and it supports a database specific language that makes it much easier,
    • 0:48:28much faster to ask questions of the very same data.
    • 0:48:32It's a relational database in the sense too
    • 0:48:34that it's not even necessarily one spreadsheet,
    • 0:48:36like one set of rows and columns.
    • 0:48:38You can have two sheets, three sheets,
    • 0:48:4030 sheets uh across which there might very well be
    • 0:48:43relationships or relations.
    • 0:48:46So SQL or sequel is a database
    • 0:48:49specific language stands for structured query language.
    • 0:48:51That's a declarative language whereby you're not gonna be in the habit with SQL,
    • 0:48:55typically of writing loops and conditionals and this kind of thing you're
    • 0:49:00instead going to describe the data that you want to get back.
    • 0:49:03You're gonna describe the question that you want the answer to.
    • 0:49:06And we'll do this using a relatively small grammar that is to say there
    • 0:49:10not that many keywords in SQL.
    • 0:49:12It's a pretty small language,
    • 0:49:14but it's going to allow us to eliminate dozens of lines of Python code.
    • 0:49:17Perhaps sequel follows this crud paradigm.
    • 0:49:20So crud, which simply means that in a relational database,
    • 0:49:24you can really only do four things.
    • 0:49:26You can create data,
    • 0:49:27read data that is like look at it or analyze
    • 0:49:30it somehow uh update the data or delete the data.
    • 0:49:33So crud for short, and that really speaks to just how relatively simple
    • 0:49:37the world is,
    • 0:49:38even though we'll just scratch the surface of some of its capabilities today.
    • 0:49:41And you'll explore more over time
    • 0:49:43specifically in sequel.
    • 0:49:45There's going to be other keywords that map to those four ideas.
    • 0:49:48Technically, you don't just create data in the world of SQL.
    • 0:49:50You can also insert data like inserting more rows into a sheet
    • 0:49:54and it's not the word read that people use, people say to select data,
    • 0:49:58but they mean to read data, which is sort of the opposite of writing or creating data.
    • 0:50:02But the U and the D are the same except
    • 0:50:04that there's also a keyword in sequel known as drop,
    • 0:50:07which lets you very destructively very dangerously
    • 0:50:09like delete entire database tables as well.
    • 0:50:12So how do we do this? And what's the connection to our favorites data thus far?
    • 0:50:17Well,
    • 0:50:17here is the syntax and this language called sequel via which you can create a table.
    • 0:50:23So the,
    • 0:50:24the jargon is a little different.
    • 0:50:26But the ideas are exactly the same from the world of spreadsheets,
    • 0:50:28what you call a sheet
    • 0:50:31in a spreadsheet.
    • 0:50:32The database world calls a table.
    • 0:50:34It's a table of rows and columns, but it's the exact same idea.
    • 0:50:38You're going to have discretion over what to call the table.
    • 0:50:41Just like you can call a spreadsheet something or else.
    • 0:50:43And you can also specify the types of data that
    • 0:50:46you want to store in your rows and columns.
    • 0:50:48And it's going to go a little more deeply than just
    • 0:50:49formatting it like in Excel and Numbers and Google spreadsheets,
    • 0:50:52you can actually control maybe how big the data
    • 0:50:55could be depending on the database you're actually using
    • 0:50:58in CS 50. We're going to use a light version of SQL literally
    • 0:51:01language called
    • 0:51:02an implementation of SQL called SQL light,
    • 0:51:05which has really all of the core functionality that you would
    • 0:51:07see in the real world and with larger more scalable systems.
    • 0:51:11But it's going to allow us to focus on a lot of the building blocks.
    • 0:51:14And SQL Light is actually really popular on macs, pcs and phones.
    • 0:51:17Nowadays,
    • 0:51:18a lot of the data that games and other applications
    • 0:51:20on your phone might store actually have a file,
    • 0:51:23a binary file with zeros and ones that's in the sequel light format.
    • 0:51:27So if you do a mobile app, for instance, for your final project,
    • 0:51:30you'll have an opportunity to play with something like this.
    • 0:51:33Well, how do you actually run SQL Light Three?
    • 0:51:35It's just a command built into your code space.
    • 0:51:38So this is a program you could install on your own mac,
    • 0:51:40your own PC or the like we'll do everything as we've
    • 0:51:43done before in the cloud and actually use your code space.
    • 0:51:45And by that, I mean,
    • 0:51:46we can just start to play with this data now using sequel instead of Python.
    • 0:51:53So let me do this.
    • 0:51:53Let me open up my terminal window here and let me go ahead and
    • 0:51:56maximize my terminal window just because we'll focus now on the files here.
    • 0:52:01Recall that I have a file called favorites dot CS V and that CS V file is just text.
    • 0:52:06But let me load it into a proper database.
    • 0:52:09So I can actually use this other language called SQL on it
    • 0:52:12to do this. I'm gonna run SQL LI
    • 0:52:14three, which just means the third version of it.
    • 0:52:16And I'm going to create a new database called favorites dot DB.
    • 0:52:20That's just a convention.
    • 0:52:21But it means here comes a database that I'm going to create.
    • 0:52:24Notice. I'm not using the tabbed code editor.
    • 0:52:26I'm not using the code command because the code command is generally for text files.
    • 0:52:30SQL light three is going to create a binary file zeros and ones.
    • 0:52:33Ultimately, when I run that, it's going to ask me to verify.
    • 0:52:36Yes, I'm going to hit y and then enter and now I'm at the sequel Light Prompt,
    • 0:52:40which is not the dollar sign.
    • 0:52:42It literally says SQL light with an angle bracket.
    • 0:52:44Now, one time only I want to go ahead and load favorites dot CS V into this database.
    • 0:52:51So I can actually play around with it using not Python,
    • 0:52:54but this new language called sequel.
    • 0:52:56And the way I'm going to do this is as follows, I'm going to do dot mode CS V
    • 0:53:00enter
    • 0:53:01and that just puts SQL light into CS V mode.
    • 0:53:04It has different modes for different file formats.
    • 0:53:06I'm going to do dot import and then I'm gonna specify the file that I want to import,
    • 0:53:11which is favorites dot CS V.
    • 0:53:13And then this one's up to me,
    • 0:53:14what is the name of the table I want to
    • 0:53:16create and table again is essentially synonymous with sheets.
    • 0:53:19So I'm gonna call everything the same. I'm gonna call my table favorites as well.
    • 0:53:23So what this command
    • 0:53:25essentially going to do is all of those lines of Python code that open the file,
    • 0:53:30read it row by row and do something with it.
    • 0:53:32This is just built into SQL light.
    • 0:53:33It's gonna load the whole darn CS V into this new
    • 0:53:37favorites dot DB file and then that's it for now.
    • 0:53:40I'm going to go ahead and literally type dot quit to get out of SQL light.
    • 0:53:43I'm back at my dollar sign prompt. If I type L si have not only favorites dot CS V,
    • 0:53:48I also have favorites dot DB now as well.
    • 0:53:52A brand new file and in that file now is an
    • 0:53:56optimized version of the CS V file in that DB file.
    • 0:53:59Now is a version of the data that's gonna lend itself to crud operations,
    • 0:54:04creating reading update and deleting using this new language called sequel.
    • 0:54:10All right. So how do I get into this?
    • 0:54:11Well, let me clear my terminal window and sort of pretend that I'm doing this.
    • 0:54:14Now, the next day, I've already created the database. That's a one time operation.
    • 0:54:17Once you've got the data, now I'm gonna go ahead and again, run sequel light three
    • 0:54:21favorites dot DB just to open the file again.
    • 0:54:25But it's already now, all of the data is in there
    • 0:54:28just as a teaser
    • 0:54:30and let me go ahead and do this dot Schema
    • 0:54:35is a sequel like command that just shows me the schema of this
    • 0:54:39database table and we'll see more about this in a little bit.
    • 0:54:42But for now,
    • 0:54:43this is showing me essentially the sequel command that was automatically
    • 0:54:48run when I imported this database the first time around.
    • 0:54:52And for now just notice that it mentions time stamp, it mentions language,
    • 0:54:56it mentioned
    • 0:54:57problem very loosely. It calls each of those texts.
    • 0:55:00So we're not trying very hard to distinguish one type of data from another.
    • 0:55:04It's all text but notice create table if not exists favorites.
    • 0:55:07This is essentially the create table syntax that
    • 0:55:10I alluded to earlier via which you can create
    • 0:55:13a table in a SQL light database. But more on that in just a bit
    • 0:55:18here now is how we can actually get at the data in that database. It turns out
    • 0:55:24that
    • 0:55:25we can select one or more columns from a database table using syntax like this.
    • 0:55:31Literally the keyword select,
    • 0:55:32then the name of one or more columns that are in
    • 0:55:35that database and then from the specific table that you carry about
    • 0:55:39and notice that in capital letters here are all of the sequel,
    • 0:55:42specific keywords select and from in particular
    • 0:55:46and in lower case by convention,
    • 0:55:47here are the placeholders for the columns that you or I
    • 0:55:50have created and the tables that you or I have created.
    • 0:55:53So if I go back to SQL light here,
    • 0:55:54let me just clear with control L which will just freshen
    • 0:55:57up the screen here so we can focus on what's new.
    • 0:55:59If I wanna select everything from the table called favorites. Here's what I can do.
    • 0:56:06Select star from favorites, semicolon and do forgive me.
    • 0:56:11Semicolons are back for sequel in this case. But
    • 0:56:14select star from favorites uses a syntax. You might not be familiar with star here.
    • 0:56:18Has nothing to do with pointers. Star is a wild card.
    • 0:56:21It means give me everything no matter what it's called from this particular table.
    • 0:56:25When I hit, enter, what we're going to see is
    • 0:56:28the entire contents of the favorite table.
    • 0:56:32That's the result of having imported that CS V into this database.
    • 0:56:35So when I hit enter,
    • 0:56:36there is all of that data and it's SQL light just to be friendly.
    • 0:56:40It's using what we might call aski art.
    • 0:56:42Like just very simple text like hyphens and vertical bars and pluses on the corner
    • 0:56:47to make it look pretty and make it look like it is a proper table.
    • 0:56:50But what you're really seeing is the contents
    • 0:56:52of favorites dot DB specifically in that table
    • 0:56:55specifically, if I only care about languages,
    • 0:56:58let me try something more specific than star select language from favorites,
    • 0:57:03semicolon.
    • 0:57:04This is gonna give me just a single column.
    • 0:57:06Now of all of the favorites that you selected for language specifically,
    • 0:57:10this is a little overwhelming to see all 399 or so pieces of data.
    • 0:57:13So let me actually truncate it a little bit. Let me do select language from
    • 0:57:18favorites limit 10.
    • 0:57:21So we're about to see that there's little tricks you can use to sort of
    • 0:57:24uh tweak the
    • 0:57:26of the language in order to get back more or less data. In fact,
    • 0:57:30turns out there's a bunch of keywords like these
    • 0:57:33built into SQL much like Google spreadsheets, apple numbers,
    • 0:57:37Microsoft Excel and certainly Python is a lot of functionality that
    • 0:57:40you just sort of get for free with the language.
    • 0:57:42If you want to calculate an average count, the number of things in a file,
    • 0:57:45get the unique or distinct values, force everything to lower case,
    • 0:57:49force everything to upper case, get the maximum value, minimum value,
    • 0:57:52much like spreadsheets.
    • 0:57:53If you're familiar with that world, you get all
    • 0:57:55that functionality in sequel, but also more.
    • 0:57:58So for instance, if I go back to my terminal window here,
    • 0:58:00let me go ahead and select the total number of favorites in this table,
    • 0:58:04the total number of rows that you all input it.
    • 0:58:07So I could do select star from favorites semicolon.
    • 0:58:11And then I could literally like start counting these like 1234.
    • 0:58:14There's clearly a better way.
    • 0:58:16And indeed, on our list of functions was a count function.
    • 0:58:20And so the way I can use that in sequel is like this select
    • 0:58:24count of star.
    • 0:58:26So pass star in as an argument to the count function,
    • 0:58:29you don't care what columns you're counting,
    • 0:58:30just count them all from favorites semicolon.
    • 0:58:34And now you're actually gonna get back like a little baby table that has just one row,
    • 0:58:38one column inside of which one cell of which has the total actual count.
    • 0:58:44And it's 398 because 399 recall included the actual header row from the file.
    • 0:58:50All right. So suppose you want to
    • 0:58:53um actually note that this is the exact same thing
    • 0:58:55as counting a specific column because every column has,
    • 0:58:59every row has the same number of columns.
    • 0:59:01Three, we could just say, select the count of languages or select the count of, uh,
    • 0:59:07problems.
    • 0:59:08All of those are going to give me back the same answer.
    • 0:59:11It is therefore conventional and sequel.
    • 0:59:13If you're just trying to count the number of rows, ah,
    • 0:59:14don't even worry about what they're called.
    • 0:59:16Just do count star to get back everything more simply.
    • 0:59:20All right. But what if we want to get back the distinct languages?
    • 0:59:22And we didn't know a priori that this came
    • 0:59:24from a Google form with three radio buttons.
    • 0:59:26Well, we could do something like this. We could select the distinct
    • 0:59:30languages from the favorites table enter and that gives me Python scratch C
    • 0:59:36because distinct is one of the other functions that comes with SQL.
    • 0:59:39This is obviously very easily countable with my human eyes.
    • 0:59:42But if I wanted to do this more dynamically,
    • 0:59:44I could change this to be count the distinct languages
    • 0:59:48and just like in C just like in Python,
    • 0:59:50just like in scratch,
    • 0:59:51I can nest these functions and pass the output of one into the input of another.
    • 0:59:55If I hit enter now, I now get three in this case here.
    • 1:00:00OK. Let me pause to see if there's
    • 1:00:02any questions or confusion just yet. Yeah,
    • 1:00:08it does. Sequel. Eight
    • 1:00:11SQL Light three is a program and it's
    • 1:00:13an implementation of the sequel Light language,
    • 1:00:17which itself is a lightweight version of what the world known as SQL
    • 1:00:21uh which is a very convoluted way of saying there's lots of humans in the world,
    • 1:00:24not everyone agrees.
    • 1:00:25What sequel should be.
    • 1:00:26Microsoft might disagree with Oracle might disagree with other companies as well.
    • 1:00:30So there's a common subset of sequel in the
    • 1:00:33world that almost everyone knows and learns and uses.
    • 1:00:36But there are also some vendor specific features.
    • 1:00:39Sequel li tries to distill things really into the essence.
    • 1:00:41And so that's what you increasingly see on Android, on I Os, on Macs and P CS as well.
    • 1:00:46So we use it because it's
    • 1:00:47relatively canonical.
    • 1:00:48Good question.
    • 1:00:50All right.
    • 1:00:51So let's do a few other things by introducing a few other keywords
    • 1:00:54without trying all of these right now here in this list is a
    • 1:00:59bunch of new keywords that are gonna give us even finer control and
    • 1:01:03we saw limit already and that sort of just limits the output.
    • 1:01:06But you can also have what are called predicates.
    • 1:01:09Like you can literally use the keyword
    • 1:01:10where to start filtering the data without using
    • 1:01:13an if and an L if and an L and an LF and so forth,
    • 1:01:15you can just in one line express something conditionally,
    • 1:01:18you can order the data and you can even group similar data together.
    • 1:01:22So what do I mean by this? Let me go back to BS code here
    • 1:01:25and let me play around with a few different queries. Let me select
    • 1:01:28maybe the count of rows from favorites, which previously was going to be
    • 1:01:35uh, 398 if I just get back all of the rows.
    • 1:01:38But suppose I only want to know how many of you like to see.
    • 1:01:41I can then say something like where the language in each row equals quote unquote. C.
    • 1:01:47And the convention here is to use single quotes though sequel Light is tolerant
    • 1:01:51other formats as well. If I hit enter here, I'll see.
    • 1:01:54Indeed as we saw with Python,
    • 1:01:56the 78 number that honestly took what 1314 15 lines of code.
    • 1:02:00Now I've distilled that kind of query into a single line
    • 1:02:04of SQL code instead by using this built in functionality.
    • 1:02:08Suppose I really want to get specific and how many of you really liked?
    • 1:02:12Hello World in C as your favorite? Well, I could change this query and just
    • 1:02:18your dollar sign prompt your, your shell.
    • 1:02:20You can go up and down in your history in SQL light to save keystrokes.
    • 1:02:23You can use Boolean logic and I can say language equal C and
    • 1:02:28maybe problem equals quote unquote.
    • 1:02:31Hello world and the number of you that like that problem was seven.
    • 1:02:35So really, really early on like hello
    • 1:02:37world NC now notice a couple of key differences. One I'm using and and not Amper
    • 1:02:42Sand,
    • 1:02:43Ampersand like NC,
    • 1:02:44I'm using single equal signs.
    • 1:02:47So SQL behaves like scratch does,
    • 1:02:49which is not like Python or C why
    • 1:02:51different people have implemented different languages differently,
    • 1:02:53equals equals equality.
    • 1:02:56Uh In
    • 1:02:57the world of SQL for comparing things left and right.
    • 1:03:00All right, things are now going to get a little more interesting. But
    • 1:03:03the whole goal of all of that Python code was to
    • 1:03:06sort of analyze the ranking of languages and popularity thereof.
    • 1:03:10Turns out in SQL, once you have the vocabulary,
    • 1:03:13it's pretty easy to do something like that.
    • 1:03:15I'm gonna do this, I'm gonna select all of the languages in the table,
    • 1:03:20but I'm also gonna select the count there of
    • 1:03:24and then I'm gonna do that from the favorites table, but I'm gonna group by
    • 1:03:29language because I claimed a moment ago,
    • 1:03:31that group by is another one of our key
    • 1:03:33phrases in sequel that's gonna let us group data.
    • 1:03:36And what this effectively means that is that if you've got this
    • 1:03:39table with a lot of duplicate languages again and again and again,
    • 1:03:43you can group by that column and essentially smush all of the Python rows together,
    • 1:03:48all of the scratch rows together, all of the
    • 1:03:50crow
    • 1:03:50together. But
    • 1:03:52figure out how many of those rows just got smushed together,
    • 1:03:56effectively doing all of that dictionary leg work or the counter
    • 1:04:00leg work that I did in 1315 lines of Python code.
    • 1:04:04So if I had enter here, this now is the motivation for what we're now starting to do,
    • 1:04:09I have distilled into a single line of code in a language called sequel.
    • 1:04:14What indeed took me more than a dozen lines of Python code just to get back an answer.
    • 1:04:19And I can do the same thing with problem.
    • 1:04:21I can just change language here for instance, to problem instead.
    • 1:04:24But per this list, I can not only group things I can order them.
    • 1:04:27So if you actually want to get like a top 10 or a top three list, well,
    • 1:04:30let's just change this query slightly before the semicolon.
    • 1:04:33Let me order by
    • 1:04:35the count of those rows, uh semicolon.
    • 1:04:39And now what I get is from smallest to largest 40 78 28 2 180.
    • 1:04:43If you want to flip that, that's fine.
    • 1:04:45By default order by uses ascending order, abbreviated A sc.
    • 1:04:50If you want to do descending order desc, you can do that as well.
    • 1:04:54And now we have a top three list from largest to smallest.
    • 1:04:58Now, honestly, this is a bit of a mouthful to use count star over here,
    • 1:05:02count star over here.
    • 1:05:04There's a nice city in sequel two where you can create little aliases of sorts.
    • 1:05:07So if I use the same query again,
    • 1:05:10let me scroll over to the left.
    • 1:05:11I can actually use the keyword as here and I can rename this weird looking column,
    • 1:05:17count star to anything I want.
    • 1:05:19Like I can rename it to N
    • 1:05:21and then at the end of this query,
    • 1:05:23I can order by N essentially creating a synonym if you will for one versus the other.
    • 1:05:28So if I hit enter now, same exact thing.
    • 1:05:30But my little baby table that came back, not a technical term has two columns,
    • 1:05:34one of which is more simply called N now
    • 1:05:36instead of counts star just makes it minorly more
    • 1:05:39convenient in your actual sequel code to reference things
    • 1:05:42that might actually be a little annoying to type.
    • 1:05:45Lastly,
    • 1:05:45suppose we want to get like a top one list and we just want the most popular language.
    • 1:05:50Honestly, I can just do limit one enter that gives me just this tiny little table,
    • 1:05:55a temporary table really with one row.
    • 1:05:58And honestly, if I don't even care about uh what the language is,
    • 1:06:01I can omit that entirely just see how many people
    • 1:06:05really like the most popular language 280 in this case.
    • 1:06:08But of course, it's more interesting to see
    • 1:06:10what it actually is.
    • 1:06:12So in short, just by turning these knobs like syntactically,
    • 1:06:14it's relatively easy to start getting at more
    • 1:06:17and more data and more answers there too.
    • 1:06:20Questions
    • 1:06:22on this. Thus far
    • 1:06:25any questions? No.
    • 1:06:27OK.
    • 1:06:27Well, suppose that um this week, for instance,
    • 1:06:30one of our new problems is gonna be called 50 V
    • 1:06:33and it's gonna allow you to explore the world of
    • 1:06:34sequel in the context of a place called 50.
    • 1:06:36Will suppose that that suddenly becomes your favorite problem.
    • 1:06:39Well, how can we go about adding more data to a database?
    • 1:06:42Well, we've seen create table for creating the table.
    • 1:06:44We've seen select for selecting data there from turns out there's also an insert
    • 1:06:50into a command that you can use to insert new data into a table.
    • 1:06:55Now,
    • 1:06:55I did this sort of in bulk by just importing that whole CS
    • 1:06:58V file and SQL light three did it all for me automatically.
    • 1:07:01But in the real world, if not, if you don't have a captive audience,
    • 1:07:04every one of whom is submitting the form at the same time.
    • 1:07:07But maybe it's an application that's running 24 7,
    • 1:07:10you're going to get more and more data over time, just like Google
    • 1:07:13itself.
    • 1:07:14So if you write code like this, you can insert one row at a time,
    • 1:07:18one row at a time and actually change the data in your table.
    • 1:07:21So just as a check,
    • 1:07:23let me do select star from favorites enter just to see all of the data.
    • 1:07:27And the last data we got was at 1:41 p.m. in 21 seconds. Suppose.
    • 1:07:32Now I've decided I want to insert one new row.
    • 1:07:35I can do this, insert
    • 1:07:38into favorites and then I have to specify what columns do I want to insert into.
    • 1:07:42I'm gonna insert a new language uh column and a new uh problem column, time stamp.
    • 1:07:49I could, I don't really wanna look up the time.
    • 1:07:51So I'm gonna leave that one blank and I'm gonna put
    • 1:07:53in values as follows for this SQL for the language,
    • 1:07:56which wasn't even an option on the form earlier and 50 V
    • 1:08:01for the name of the problem
    • 1:08:03semicolon.
    • 1:08:04So there's a bit of dichotomy here in the first set of parentheses,
    • 1:08:06you specify a comma separated list of the columns that you want to put data into.
    • 1:08:11In the second set of parentheses,
    • 1:08:12you actually specify the values that you wanna put into those columns.
    • 1:08:16So when I hit enter, nothing seems to happen,
    • 1:08:19which in general is a good thing at my terminal.
    • 1:08:21But if I now rerun, select star from favorites, we will see.
    • 1:08:26Voila a brand new row.
    • 1:08:28We don't know what time or date it was inputted at.
    • 1:08:30In fact, we see an old friend null which indicates the absence of a value,
    • 1:08:34but we do indeed see that sequel and 50 V
    • 1:08:37is actually now in there.
    • 1:08:38So in the world of sequel null is has nothing to do with pointers or addresses.
    • 1:08:43The world of sequel is just using the same word
    • 1:08:45to represent the same idea that there's no data here,
    • 1:08:49but it has nothing to do with like actual memory addresses in this case.
    • 1:08:52But
    • 1:08:53suppose that you don't wanna do that and like, no, no, no,
    • 1:08:54let's just delete that 50 bill hasn't even been released yet.
    • 1:08:57Nor in fact, we've even finished talking about sequel. How do we delete data
    • 1:09:01from a database table. Well, there's a delete from command.
    • 1:09:04Let me go back to VS code here.
    • 1:09:06Let me go ahead and clear my terminal just to keep things clean.
    • 1:09:09Let me go ahead and delete from favorites and let me not
    • 1:09:13hit, enter here after a semicolon.
    • 1:09:15This is one of the most destructive things you can do
    • 1:09:18uh as a database administrator.
    • 1:09:20If you Google around, there are horror stories of interns in the real world,
    • 1:09:24executing commands like this at their companies.
    • 1:09:26This will delete everything from favorites.
    • 1:09:29So if you ever do this, remember we told you not to today.
    • 1:09:32But if we add a where clause here,
    • 1:09:35only delete rows where the time stamp column is null,
    • 1:09:40this is more reasonable.
    • 1:09:42Um And frankly,
    • 1:09:43any companies you work for should also have backups of their database.
    • 1:09:45So we shouldn't even be reading about these horror stories but
    • 1:09:48such as the real world.
    • 1:09:49So this is going to delete any row from the favorites
    • 1:09:53table where time stamp capital T because that's how Google did it
    • 1:09:56is null. I go ahead and hit, enter,
    • 1:09:59nothing seems to happen. But if I do select star from favorite semicolon
    • 1:10:05that now row is again gone.
    • 1:10:07So you can use these predicates these where conditions coupled
    • 1:10:11with select coupled with delete and other operations as well.
    • 1:10:15What if I actually want to make a change?
    • 1:10:17So if you wanna update existing data like this, well, we could do this
    • 1:10:21I could update this table,
    • 1:10:23I could set one column equal to this value where some condition is true.
    • 1:10:28So how might this work?
    • 1:10:29Well, let me boldly claim that a lot of you are really gonna like 50 V
    • 1:10:32in the world of sequel. So all of these favorites are sort of passe now.
    • 1:10:37So let's do this update uh the favorites setting
    • 1:10:41the language column equal to quote unquote sequel.
    • 1:10:45And oops, sorry,
    • 1:10:47let me go ahead and update the favorites
    • 1:10:50table setting the language column equal to sequel,
    • 1:10:53quote unquote.
    • 1:10:54And with a comma,
    • 1:10:55let me go ahead and also update the problem column to be equal to quote unquote 50 V.
    • 1:11:01I'm not gonna have any kind of wear here which means like this is just
    • 1:11:04gonna do its thing on all of the call on all of the rows.
    • 1:11:06So if I had to enter, nothing seems to have happened.
    • 1:11:09If I now do select star from favorites,
    • 1:11:11everyone's favorite is literally that problem.
    • 1:11:15So this too is destructive. Like unlike
    • 1:11:17the real digital world,
    • 1:11:18there's no like control Z or undo that like
    • 1:11:21you better have made a backup of your database.
    • 1:11:23Otherwise
    • 1:11:24that's not a good thing.
    • 1:11:26Um In this case, I do have the CS V file so I could just delete my favorites dot DB file.
    • 1:11:30I could
    • 1:11:30re import the data.
    • 1:11:31So I haven't really lost anything of importance,
    • 1:11:33but you could in the case of the real world and any data you're actually working on.
    • 1:11:38So just to make the point, let me go ahead and delete from favorites, semicolon,
    • 1:11:43enter,
    • 1:11:44let me reselect. There's no data there anymore. And in fact, if I do select count
    • 1:11:49star uh from
    • 1:11:52favorites, we'll see as much that the answer is,
    • 1:11:55in fact zero because everything has now been deleted.
    • 1:11:59Any questions then
    • 1:12:00on that code there.
    • 1:12:05No. All right. So if not too scared yet, let's go ahead and take our 10 minute break.
    • 1:12:08Now Halloween candy is served and we'll be back in 10.
    • 1:23:44All right.
    • 1:23:47So we are back and before we dive back into sequel and some Real World data, uh,
    • 1:23:51it turns out unbeknownst to me, we've had a, um, a Halloween costume contest.
    • 1:23:56So it's now time to announce the winners of this year's CS 50 costume contest.
    • 1:24:00If our two winners would like to come on up,
    • 1:24:02who I'm told during break? Dressed up as me.
    • 1:24:11Come on over. Would you like to introduce yourselves?
    • 1:24:16Hi, everyone. I'm David. I am living in Matthews as a first year.
    • 1:24:20I'm planning on studying gov
    • 1:24:21and computer science.
    • 1:24:24Hi, everyone. I'm David.
    • 1:24:26I'm a second year in Mather and I'm planning on just studying computer science.
    • 1:24:31Well, thank you. We have some
    • 1:24:34Oreos for you.
    • 1:24:36Thank you both so much. Did anyone else dress like this?
    • 1:24:38We have two more Oreos if you'd like
    • 1:24:41intentionally.
    • 1:24:44Oh, that's pretty good. Ok. Yes, we have one more winner. Come on down.
    • 1:24:50No,
    • 1:24:55thank you.
    • 1:24:56This is intentional. Oh, ok.
    • 1:25:00Hello. My name is David. Um I'm from Canada and I'm a first year.
    • 1:25:04I'm not sure what I'm gonna study.
    • 1:25:06Ok, welcome as well. Thank you. All right.
    • 1:25:09So,
    • 1:25:12so,
    • 1:25:13um up until now we've played around with the data that you all gave us,
    • 1:25:16which was based very simply on like
    • 1:25:18your favorite language and your favorite problems.
    • 1:25:20But it turns out there's a lot of like real world data in indeed the real world,
    • 1:25:24some of which is quite voluminous.
    • 1:25:26And indeed there can be not just dozens or hundreds but thousands,
    • 1:25:30hundreds of thousands, even millions of rows in the biggest of databases.
    • 1:25:34And so what we thought we do in the latter part of today is really actually
    • 1:25:37get our hands dirty with a real world data set from the internet movie database,
    • 1:25:41otherwise known as IM DB.
    • 1:25:43And in fact, if you go to IM DB dot com,
    • 1:25:45you'll be able to answer via their web interface.
    • 1:25:48Some of the very questions we'll do today using SQL alone.
    • 1:25:51But what you'll find ultimately is that what websites
    • 1:25:54like im DB dot com or their mobile app
    • 1:25:57versions thereof are probably doing is yes giving you
    • 1:26:00a nice pretty graphical interface to type queries.
    • 1:26:03But underneath the hood,
    • 1:26:04they are passing your input into sequel
    • 1:26:08queries or similar queries that they formed.
    • 1:26:11Most of they're just waiting for placeholders like
    • 1:26:13the keywords that you're actually searching for.
    • 1:26:16So let's go ahead and experiment maybe with just some,
    • 1:26:18some real world data initially before we consider how to actually store it at scale.
    • 1:26:22So let me open up just for the sake of discussion,
    • 1:26:24an actual empty spreadsheet just so I have some rows and columns uh to play with.
    • 1:26:29And let me propose that we want to model TV shows from the real world.
    • 1:26:33How could we go about doing this?
    • 1:26:35Well, maybe I could start in this first column,
    • 1:26:37a so to speak and I could create a title column
    • 1:26:40and maybe a column for the star of that show.
    • 1:26:43And a very popular show, of course, is the office.
    • 1:26:46So I might put this into the second cell in that first column and under star,
    • 1:26:50I could put someone like Steve Carell, but of course,
    • 1:26:53he wasn't the only star of the show.
    • 1:26:56There are others as well.
    • 1:26:57And so if I wanna put in someone like Rain Wilson, well,
    • 1:27:00maybe I need a second star column so Rain Wilson.
    • 1:27:03But even as early as the first season, there was also another star in the credits.
    • 1:27:07Uh John
    • 1:27:09Krasinski, uh so he was the star Jenna Fisher uh was top credited in the first season.
    • 1:27:15So Jenna Fisher and then BJ Novak uh Harvard Alum uh
    • 1:27:19was also in the first seasons of opening credits as well.
    • 1:27:23So we've got all uh 12345 of these folks that play a
    • 1:27:27miss me,
    • 1:27:27anyone's name. But here's the beginnings of a real world data set.
    • 1:27:30And we could imagine doing this for like everyone's favorite shows,
    • 1:27:33adding more and more rows.
    • 1:27:35But let's consider as we often do,
    • 1:27:37not just the correctness of this implementation,
    • 1:27:40but the design
    • 1:27:41like it's pretty straightforward, like it's very readable.
    • 1:27:44So I think it's good in that sense.
    • 1:27:46But if you start to nit pick what's poorly designed,
    • 1:27:49even in the world of spreadsheets about what I've done here.
    • 1:27:52Assuming that the next row is another show, the next row is another show and so forth.
    • 1:27:58What's bad about this? Yeah.
    • 1:28:01Yeah.
    • 1:28:01So each row is gonna have a different number of columns and even
    • 1:28:04I kind of couldn't make up my mind from the get go like,
    • 1:28:06do I have just one star column or two or maybe now I'm up to five for even bigger shows.
    • 1:28:10And later in the office when more people got top billing,
    • 1:28:13we're gonna need more than five columns for stars.
    • 1:28:15So that's fine.
    • 1:28:16Like we can clearly scroll to the right and just keep adding more columns.
    • 1:28:20But there should be something about that design that like rubs you the wrong way.
    • 1:28:24Like something feels a little off if some rows have this many columns,
    • 1:28:27others have this many
    • 1:28:28data would be very jagged along the right hand side, it would be very sparse,
    • 1:28:32which would be another way to describe it.
    • 1:28:34There's probably a better way.
    • 1:28:35So maybe I should kind of flip the data around and maybe a
    • 1:28:39better approach here would be to just have one column for star.
    • 1:28:43So let me actually um let me do this,
    • 1:28:45let me just move like Rain Wilson over here and John
    • 1:28:48Krasinski over here and Jenna Fisher over here and BJ
    • 1:28:53Novak over here as well,
    • 1:28:56get rid of all of these superfluous, identically named columns.
    • 1:28:59And now this is sort of better because now I
    • 1:29:01can have any number of stars in the vertical,
    • 1:29:05although it's a little weird to leave this blank.
    • 1:29:07So maybe I should kind of copy paste here. So in some sense, this is better
    • 1:29:12in that now, I only have one title column,
    • 1:29:15one star column and I can just keep adding row row, row row
    • 1:29:18for each show and its stars. But what's now poorly designed about this?
    • 1:29:23Yeah.
    • 1:29:26Yeah, I'm repeating the title and in general copy paste,
    • 1:29:28repeating yourself in code has generally been a bad thing.
    • 1:29:31It's generally gotten us in trouble if I make a,
    • 1:29:33a change or maybe a typographical error somewhere, maybe it propagates elsewhere.
    • 1:29:38And if nothing else, it's just a lot of wasted space.
    • 1:29:40Like if this is actually gonna be stored in a database in a CS V file,
    • 1:29:43why are you duplicating the same string again and again and again,
    • 1:29:46for large TV shows that's just wasteful.
    • 1:29:48It just doesn't seem wise.
    • 1:29:50So how can we eliminate that? Redundancy.
    • 1:29:52Well, unfortunately, in the world of spreadsheets,
    • 1:29:55things kind of now escalate quickly to be just kind of annoying.
    • 1:29:59But let me do it one nonetheless, with just a small bit of data,
    • 1:30:02let me propose that we do this instead,
    • 1:30:04let me create not one sheet but maybe multiple sheets and assume
    • 1:30:08that there's some kind of relationship or relation across these sheets.
    • 1:30:11So just to be pedantic, let me call this sheet, not the default sheet one,
    • 1:30:15but let's call this shows.
    • 1:30:16And in this sheet, I'm gonna have a title column for every show.
    • 1:30:20And I think I'm gonna be proactive here.
    • 1:30:22I'm gonna start giving every show a unique ID number.
    • 1:30:26Much like Harvard affiliates have Harvard I DS Yale
    • 1:30:28affiliates have Yale ID numbers and so forth.
    • 1:30:30Let's go ahead and give each show its
    • 1:30:32own unique identifier for reasons we'll soon see.
    • 1:30:35So for the office,
    • 1:30:37let me just for consistency with the actual internet movie database.
    • 1:30:40I'm gonna give it a unique number of 386676.
    • 1:30:44The specifics don't really matter,
    • 1:30:45but that happens to be what people in the real world actually do.
    • 1:30:49But that's it for TV shows,
    • 1:30:50even though I could imagine there being many more in this sheet.
    • 1:30:53Let me create another sheet now here and I'll call it people and in the people sheets,
    • 1:30:58let me keep track of all of those TV stars.
    • 1:31:01So one will be one column will be name
    • 1:31:03another will be also called ID here, but it's gonna be
    • 1:31:07a, a person ID, not a
    • 1:31:09uh show ID.
    • 1:31:10And here we have, for instance, Steve Carell, we have Rain Wilson, we have John
    • 1:31:16in
    • 1:31:17Ski. We have Jenna Fisher
    • 1:31:20and we have BJ Novak.
    • 1:31:22And this is gonna be a little tedious,
    • 1:31:24but just to be consistent with reality turns out that according to IM DB.
    • 1:31:28Steve Carell's unique number in the world is 136797.
    • 1:31:32Rains is 1933988. John's is 1024677. Jenna's is 278979.
    • 1:31:42And lastly BJ Novak's is one more
    • 1:31:4445983.
    • 1:31:48Ok. So now we have the same people, but they each have a unique ID number.
    • 1:31:52Lastly,
    • 1:31:52let's associate those shows with those people in a
    • 1:31:55way that avoids the two problems we identified earlier,
    • 1:31:58which was having a variable number of columns in
    • 1:32:01one case versus redundancy in the second case,
    • 1:32:04let's really tighten things up.
    • 1:32:06So nothing is in duplicate. That doesn't actually need to be.
    • 1:32:10So I'm going to create a third
    • 1:32:12uh sheet here and I'll call it stars like the TV stars for these shows.
    • 1:32:16And what I'm gonna do is have only two columns, a show ID
    • 1:32:20and a person ID.
    • 1:32:22I could write these in different ways,
    • 1:32:23but it's conventional in the database world to use snake case, so to speak,
    • 1:32:27where everything is under uh everything is
    • 1:32:29lowercase with underscores instead of spaces.
    • 1:32:32And for show ID.
    • 1:32:33Well, we're only mocking up one show for now, but I'm gonna go ahead and say 386676,
    • 1:32:38which is the
    • 1:32:39I claimed and now I'm gonna go ahead and have all of those person
    • 1:32:42ids and this time it's OK to copy paste if only to save time.
    • 1:32:46So I'm going to grab all of these five stars ids
    • 1:32:49paste them there.
    • 1:32:50And I am going to indulge by duplicating the
    • 1:32:53show ID even though we didn't like that earlier.
    • 1:32:56Now,
    • 1:32:56this indeed is kind of escalated quickly because
    • 1:32:58none of these sheets are very useful to look
    • 1:33:01at to the human eye because none of them has like the complete picture we have show
    • 1:33:05in one people in another and then like this cryptic mapping of numbers in the third.
    • 1:33:10But I propose that this is sort of the right way to implement data.
    • 1:33:15If your goal is to have a canonical source of truth
    • 1:33:18for every show and every person that is to say,
    • 1:33:20you only say the show's name once in one place, you only show, you only write the,
    • 1:33:25the TV stars names once in one place and you associate them,
    • 1:33:30you relate one with the other
    • 1:33:31by way of this third sheet here.
    • 1:33:34So if you've not seen it already notice that if the office has this
    • 1:33:38unique ID 386676 notice in the stars table that same value appears multiple times.
    • 1:33:45But what this third sheet is doing is associating
    • 1:33:49that same show ID with 12345 different people.
    • 1:33:54Now, I can see that it's similar in spirit to what we already indicted.
    • 1:33:58It is bad design a moment ago, the office, the office, the office, the office,
    • 1:34:01but think about our world of C in the world of C and really computers in
    • 1:34:04general data takes up finite amount of space
    • 1:34:07typically like an integer is four bytes,
    • 1:34:0932 bits.
    • 1:34:10So even though yes, I'm duplicating this value, it's just the same four bytes,
    • 1:34:14four bytes, four bytes, four bytes.
    • 1:34:16It's not the space off I ce uh null character.
    • 1:34:22It's not the same 11 bytes again and again,
    • 1:34:24it's just a number and numbers tend to be much more efficient.
    • 1:34:27The computers can crunch numbers much more quickly.
    • 1:34:29Duplicating numbers is in general allowed or smiled upon.
    • 1:34:33Duplicating strings will get you into trouble.
    • 1:34:36So with that said,
    • 1:34:37is this a useful spreadsheet?
    • 1:34:39Now, like would you wanna be handed this in your job and ask questions like, hey,
    • 1:34:43who stars in the office?
    • 1:34:44Like you can answer it but you have to look up one sheet, then another,
    • 1:34:48then a third or you need to use like the
    • 1:34:49look up or special functions in Excel or Google sheets.
    • 1:34:52I mean, you're just creating a lot of work for yourself. But
    • 1:34:55academically, if you will systematically,
    • 1:34:57this has a lot of merit because we've avoided all duplication.
    • 1:35:01We've sort of normalized the data, so to speak, by factoring out any duplications.
    • 1:35:06So where are we going with this?
    • 1:35:08Well, it turns out that we'll play now with some actual data from the real world,
    • 1:35:12from the actual internet movie database.
    • 1:35:14And in a moment it's gonna look a little something like this.
    • 1:35:16This is an artist's rendition of five different sorry, six different taste.
    • 1:35:22So not one but six different tables that we
    • 1:35:25created using freely available internet movie database data.
    • 1:35:29They kindly provide not C
    • 1:35:31SVS but T
    • 1:35:32SVS
    • 1:35:32tab separated values, which are essentially the same thing.
    • 1:35:35But you look for tab characters instead of commas
    • 1:35:37in order to convert them as we did into our own format SQL lights in this case.
    • 1:35:42But we'll see that there is a file that we've made available from today,
    • 1:35:46but you can download up the courses website called shows
    • 1:35:49dot DB and it contains all of the same information.
    • 1:35:53And in that shows dot DB file, there are indeed these six tables,
    • 1:35:56but let's focus on just two of them initially.
    • 1:35:58Like IM DB,
    • 1:35:59the internet movie database is all about rating
    • 1:36:01TV shows and tracking that kind of information.
    • 1:36:03So let's actually take a look at some of these ratings
    • 1:36:05and figure out how we can actually answer actual questions.
    • 1:36:08So let me go over to VS code
    • 1:36:10and let me run SQL light of shows dot DB, which is a file that already exists.
    • 1:36:15There's no CS vs no tsvs. We did all of this for you already.
    • 1:36:18When I hit enter, I get my SQL Light prompt.
    • 1:36:21And the first thing I like to do whenever I'm
    • 1:36:22playing with a sequel database for the first time,
    • 1:36:24maybe I got it from a class or my boss or the like is just to wrap my
    • 1:36:28mind around what's inside of the database because you're
    • 1:36:31not typically gonna be provided with pictures like this,
    • 1:36:33you're just,
    • 1:36:34you're gonna be given a file. So let me just select some data.
    • 1:36:37Let me select star from the show's table.
    • 1:36:40I don't really wanna see all of it.
    • 1:36:42So let me just limit it to the 1st 10 shows in the table.
    • 1:36:45And here we can infer what each of the, what the show's table looks like.
    • 1:36:49Every show has an ID A title the year in which it debuted apparently.
    • 1:36:53And the number of episodes as of last night when we exported the data.
    • 1:36:57So
    • 1:36:57that seems to reflect that this picture and
    • 1:37:00this is technically an entity relationship diagram,
    • 1:37:02a a standard way of depicting things.
    • 1:37:04And you'll see that in our picture shows indeed have an ID column, title column,
    • 1:37:09year column and episodes column.
    • 1:37:11Well, what about these ratings?
    • 1:37:12Well, according to the picture that has a show ID, a rating and a votes column,
    • 1:37:16so let's go back to VS code here.
    • 1:37:18And let's do select star from
    • 1:37:21uh ratings limit 10 just to wrap our mind around some of the data.
    • 1:37:25And there we have a show ID on the left,
    • 1:37:27we have rating in the middle which seems to be like a
    • 1:37:29floating point value and then votes which seems to be an integer.
    • 1:37:32So we have some different types of data here,
    • 1:37:34but there's a lot of data. In fact, if I do this now select star from
    • 1:37:39uh shows, let's not select all of the data. Let's do select count stars from shows.
    • 1:37:45And in this database, there are 214,000 shows in this database.
    • 1:37:50So we're well past the 398 rows that we've been talking about thus far.
    • 1:37:54So it turns out per this diagram,
    • 1:37:57there's actually a standard relationship between these two tables,
    • 1:38:01shows and ratings respect.
    • 1:38:02And it's what we would call a 1 to 1 relationship whereby every show in
    • 1:38:07this design has one rating.
    • 1:38:09And this is indicated technically,
    • 1:38:10if you look at what the arrowheads look like on these diagrams,
    • 1:38:13this indicates that it's a 1 to 1 relationship,
    • 1:38:15which means every show has one rating,
    • 1:38:17which means every row in the shows table
    • 1:38:20has a corresponding row in the ratings table.
    • 1:38:22Strictly speaking, they could be in the same table,
    • 1:38:25you could just kind of join them together and make one wider table.
    • 1:38:27But IM DB keeps the data separate.
    • 1:38:29So we two kept them separate in two separate tables here.
    • 1:38:33So what does this actually mean in practice?
    • 1:38:35Well, let's actually take a look in VS code at the schema for these tables.
    • 1:38:40Let me clear my screen
    • 1:38:41and let me do dot schema but specifically look at
    • 1:38:44the schema or the design of the show's table.
    • 1:38:47So you can do dot
    • 1:38:48of shows and any command in SQL light with a
    • 1:38:50dot is SQL light specific um in the real world,
    • 1:38:54if you're using other products like oracle or Postgres or my SQL or others,
    • 1:38:59they have different commands.
    • 1:39:00But anything else that we've been typing,
    • 1:39:02especially the capitalized keywords is indeed uh standard sequel.
    • 1:39:06If I hit enter here, here is what the show's table apparently looks like.
    • 1:39:11In other words,
    • 1:39:11here is the create table command that we the staff ran in order to create
    • 1:39:15this table for you and then we imported a bunch of data into it.
    • 1:39:18Every show has an ID has a title, has a year, has a number of episodes,
    • 1:39:24but there's more detail here.
    • 1:39:25Apparently the ID is an integer. The title is text and it is not null, cannot be null.
    • 1:39:30The year is numeric, whatever that means.
    • 1:39:33And the episodes is an integer when that is now familiar and then primary key.
    • 1:39:37So there's some other stuff going on there which will come back to.
    • 1:39:40But let me also do dot schema ratings enter
    • 1:39:43and we'll see a couple of other data types here,
    • 1:39:46there's show ID which is still an integer but not null rating,
    • 1:39:49which is a real number A K A float,
    • 1:39:52but it too cannot be null and then some number
    • 1:39:54of votes which is an integer cannot be null.
    • 1:39:56And then there's mentioned a foreign key.
    • 1:39:58So unlike our data set for the favorites a bit ago,
    • 1:40:01which we just did automatically and imported this database by us.
    • 1:40:05And by IM DB has been more thoughtful
    • 1:40:07designed where there's actually some relationships across
    • 1:40:11multiple tables rather than previously just one.
    • 1:40:14Now in the world of SQL,
    • 1:40:16we have indeed different data types.
    • 1:40:18For instance, we have these five primarily one cutely named Blob,
    • 1:40:22which is actually binary large object,
    • 1:40:25which generally means like a file or some piece
    • 1:40:27of data that zeros and ones though generally,
    • 1:40:29it's best to store files on file systems like in folders on disk, so to speak,
    • 1:40:34not in your database,
    • 1:40:36there's integers, we've seen, there's numeric,
    • 1:40:38which numeric is more like dates and times,
    • 1:40:40things that are numbers but not necessarily integers or floating point values.
    • 1:40:44There's reels which do have decimal points in them. And then there's just text
    • 1:40:48in other SQL databases in the real world,
    • 1:40:51in your future jobs or your future classes that you might use SQL.
    • 1:40:54Again in there are even more data types in other databases, oracle, my SQL,
    • 1:41:00Postgres and so forth.
    • 1:41:01But these are sort of representative of them. They just get more precise
    • 1:41:04and other systems.
    • 1:41:05But there's also some keywords that we've seen already that you can
    • 1:41:08specify when designing a database that this column cannot be null.
    • 1:41:11If you want to make sure that no one can insert or update data,
    • 1:41:16unbeknownst to you that is null.
    • 1:41:17You can impose that when creating the table.
    • 1:41:20And unlike Excel and Google spreadsheets and apple numbers,
    • 1:41:23which will generally let the human type in or not type in anything they want
    • 1:41:27with a database. You have more protections over the integrity of your data.
    • 1:41:31Moreover, you can specify of columns,
    • 1:41:33values must be unique if you want to avoid duplicates,
    • 1:41:36like you don't want the same person to be able to register twice for your website,
    • 1:41:39thereby making sure they have one unique email address.
    • 1:41:43Your database can help with that too.
    • 1:41:45You don't have to rely on like Python to check if it already exists.
    • 1:41:49But there's this other feature
    • 1:41:51of relational databases that is databases that
    • 1:41:54have multiple tables across which there are relationships
    • 1:41:57and that's these keywords we saw briefly a moment ago, primary key
    • 1:42:00and foreign key
    • 1:42:02and we started to scratch the surface here.
    • 1:42:06It turns out what I was doing was
    • 1:42:07actually best practice in the world of relational databases
    • 1:42:11I gave or really IM DB gave every show in the world, a unique ID.
    • 1:42:16And that
    • 1:42:17ID in this case,
    • 1:42:18386676 is a numeric value and integer that uniquely identifies that TV show.
    • 1:42:24In other words, this is the primary key for this table,
    • 1:42:29technically a sheet.
    • 1:42:30But I'm using that just because it's easier to type in than my black and white window.
    • 1:42:34This ID column is the primary key for shows as I was mocking up earlier.
    • 1:42:39What is the primary key in the people sheet here? It is also ID.
    • 1:42:44It's a different ID, but it's by convention often called the same thing.
    • 1:42:48But this people column called ID is its primary key.
    • 1:42:52And you could perhaps see where this is going.
    • 1:42:54Those same numbers also happen to appear in this third table.
    • 1:42:59But in that context,
    • 1:43:01they're sort of foreign keys like they didn't come from this star sheet,
    • 1:43:05but they are in the star sheet. So they're sort of relatively foreign to it.
    • 1:43:09So
    • 1:43:10foreign keys is simply the presence of primary keys
    • 1:43:13in some other table in some other data set.
    • 1:43:16And so it's just a description of relativity,
    • 1:43:18but the foreign key is the column that uniquely identifies your data.
    • 1:43:22Foreign keys is just the appearance of those same numbers
    • 1:43:25elsewhere.
    • 1:43:27So what does this mean if we go back to VS code here?
    • 1:43:30You'll see that when we created this table using I MD BS real data,
    • 1:43:34we specified that in our show's table which is bigger
    • 1:43:38than the one I mocked up with Google sheets.
    • 1:43:39There has not only an ID and a title,
    • 1:43:41it again also has year in which the show debuted and the total number of episodes
    • 1:43:45because that's juicy data that comes from IM DB beyond what I mocked up a moment ago
    • 1:43:50in the ratings table. Meanwhile, there's a show ID, which is also an in
    • 1:43:55just like this ID.
    • 1:43:56But as the name implies,
    • 1:43:58this show ID column is actually going to be a
    • 1:44:01foreign key that references the shows tables ID column.
    • 1:44:06So this is the relational and relational databases.
    • 1:44:09These are two tables that have a relationship and that
    • 1:44:11relationship is that show ID is sort of referring to
    • 1:44:16this actual ID here.
    • 1:44:18And it allows us essentially conceptually to link these two tables together.
    • 1:44:23So what does this actually mean? Well, let me go ahead and do this.
    • 1:44:26Let me go back to BS code here.
    • 1:44:28I'll clear my terminal and let's play around with some of this data.
    • 1:44:31So let's go ahead and do this just uh to to experiment select star from ratings where
    • 1:44:37let's get all of like the good shows where just like uh rotten Tomatoes will do the
    • 1:44:41off at like 6.0 out of 10. So where rating is greater than equal to 6.0
    • 1:44:47and just so I don't overwhelm my screen. Let me just limit this to the 1st 10 results.
    • 1:44:51In other words,
    • 1:44:51this is sequel syntax for selecting all of the ratings that
    • 1:44:55are at least 6.0 or higher from that table enter.
    • 1:44:59And we see just the 1st 10 of them,
    • 1:45:01not the top 10 because we've not sorted or grouped or anything like that,
    • 1:45:04but the 1st 10 in the table.
    • 1:45:06So what is interesting here is that we've seen just some,
    • 1:45:10a sampling of the data if you will.
    • 1:45:12But this isn't all that interesting here.
    • 1:45:14Let me actually distill this just to the show ID because in other words,
    • 1:45:18I want to know 10 good shows to watch.
    • 1:45:20So let me just select show ID.
    • 1:45:22So same result, but uh less data, it's just that first column thereof.
    • 1:45:27Now, this is gonna be a little annoying.
    • 1:45:29But if I wanna find out the names of these shows,
    • 1:45:33think about the picture from whence we came.
    • 1:45:36All of the show's names are in the show's table,
    • 1:45:39but all of the show's ratings are in the ratings table.
    • 1:45:43So even if I do select star from ratings,
    • 1:45:46I'm never gonna know what show I'm looking at.
    • 1:45:47Like what the heck is show ID? 62614.
    • 1:45:50Well, I could do this,
    • 1:45:51I could select star from shows where the ID of the show equals 62614,
    • 1:45:59semicolon enter.
    • 1:46:01Ok. So I could watch this show from 1981.
    • 1:46:04Let me do another one, select star from shows where ID equals 63881.
    • 1:46:09So I'm just grabbing the second ID from here.
    • 1:46:12Ok. So Cat weasel a kids show from 1970. All right.
    • 1:46:15So I'll watch that So now let's do another one.
    • 1:46:18I'll just copy this like,
    • 1:46:19suffice it to say this is not the best way to look up data
    • 1:46:23where I'm literally like copying and pasting values from one query into the next.
    • 1:46:27But this is where sequel gets a bit powerful.
    • 1:46:29I can have nested queries, I can put one inside of the other.
    • 1:46:33So let me instead do this, let me clear the screen
    • 1:46:35and let me instead do this select
    • 1:46:38star from shows where the idea of the show is in the following list of I DS select.
    • 1:46:48And actually I'll do this on a separate lines,
    • 1:46:50uh,
    • 1:46:50select show id from ratings where the rating value
    • 1:46:57is greater than or equal to 6.0 semicolon.
    • 1:47:01So I've separated this onto two lines.
    • 1:47:03The dot dot dot is just a continuation character which means same query,
    • 1:47:06multiple lines, but the parentheses are deliberate just like grade school math.
    • 1:47:10I want what's in parenthesis to happen first.
    • 1:47:13And so what the database will do is we'll select as before all of the show I
    • 1:47:17DS from the ratings table where the rating value is at least 6.0 out of 10.
    • 1:47:22And that's gonna return to me effectively a list, some kind of collection of show I DS
    • 1:47:27which previously I was copying and pasting. Now,
    • 1:47:30the database will do the leg work for me.
    • 1:47:33It will now select everything from the show's table
    • 1:47:35where the idea of the show is in that list
    • 1:47:38of values and it's actually gonna be more than 10 unless I go in there and say
    • 1:47:42limit 10, which I can do.
    • 1:47:44So let me go ahead and hit enter now.
    • 1:47:46And now I see more useful information,
    • 1:47:48not just the ratings information which in a
    • 1:47:51vacuum tells me nothing about what to watch.
    • 1:47:53Now, I see the show ID, the title the year and the episodes,
    • 1:47:58but notably
    • 1:48:00what is, and if I want to distill this into just the title,
    • 1:48:02let me actually go back here and instead do select just the title from
    • 1:48:07shows where the ID is in this whole list and I'll re execute it
    • 1:48:12by just copying and pasting the same.
    • 1:48:13The only difference now is instead of star, I'm selecting title.
    • 1:48:16Here's now how like the data analyst at IM DB might be
    • 1:48:19selecting 10 shows that are really good to watch according to ratings.
    • 1:48:24But what of course is missing from the output,
    • 1:48:27whether I do star or just title
    • 1:48:30like what's missing. Yeah,
    • 1:48:32like the actual rating,
    • 1:48:33like I know these are at least 6.0 but like which is 7.0 which
    • 1:48:37is 10.0 it'd be nice to actually combine the data in some way.
    • 1:48:41So we can actually do that too because it turns out that when
    • 1:48:43you have two tables in the world of sequel or even more,
    • 1:48:46you can actually join them together,
    • 1:48:49you can join them
    • 1:48:50together. Literally using
    • 1:48:52a keyword called joint.
    • 1:48:54And you can do this as follows.
    • 1:48:56Let me kind of propose with a sample data set,
    • 1:48:58these two tables and dot dot dot just means we don't really care about the specifics,
    • 1:49:02we just care about the structure.
    • 1:49:03So on the left here is a simplified version of my shows table that has a show ID
    • 1:49:08and a show title, but I've omitted a year and episodes as just intellectually,
    • 1:49:13uh distracting from the basic structure here.
    • 1:49:16But they're there in the real table.
    • 1:49:17On the right hand side here, we have the ratings table with just two of its columns,
    • 1:49:21the show ID and the rating,
    • 1:49:23but I've omitted the votes because it
    • 1:49:24doesn't really add anything to the discussion.
    • 1:49:26But let me propose this,
    • 1:49:28notice that these two tables and these two rows they're
    • 1:49:31in definitely have commonalities like they both have the same ID
    • 1:49:36in the left table. It's 386676 A K A the primary key of that row.
    • 1:49:41But it's also appearing in duplicate in the right hand table 386676.
    • 1:49:47And in that context, it's a foreign key. The point though is that they're the same.
    • 1:49:50So wouldn't it be nice if I could sort of treat one table here,
    • 1:49:53one table here and if my fingertips represent these
    • 1:49:55identical values, kind of like glue them together.
    • 1:49:58So I get one wider table with all of the information together to satisfy
    • 1:50:02your concern that we don't even know what the ratings are of those shows.
    • 1:50:06Well, let me go ahead and do this just for artists rendition.
    • 1:50:08Let me flip title and ID, which has no functional effect.
    • 1:50:11It's just gonna put the numbers closer together on the screen.
    • 1:50:14Let me then like literally highlight the fact that these two numbers are identical.
    • 1:50:18And let me propose that we do the equivalent of this, we somehow join these two tables
    • 1:50:23on that common value.
    • 1:50:24Strictly speaking, I don't need both values because they're duplicate.
    • 1:50:27So I don't care if one of them goes away. But what I'd really like to do is select indeed
    • 1:50:32a temporary table that is the joined version of
    • 1:50:36that original data.
    • 1:50:38And frankly, I don't really care as the user, what the idea is, heck,
    • 1:50:40all I care about is what show to watch and what its rating is.
    • 1:50:43Give me the title
    • 1:50:44and the rating.
    • 1:50:45All of these numbers are again, metadata, things that the computer cares about,
    • 1:50:48but we humans probably do not.
    • 1:50:51So how can we implement that idea of taking one data set that
    • 1:50:54has a relationship with this data set and somehow combine it together?
    • 1:50:58Well, let me go back to vs code here. Let me clear my screen
    • 1:51:01and this is gonna be a bit cryptic at first, but it's very step by step.
    • 1:51:06Let me do this select star
    • 1:51:07from show
    • 1:51:10but not from shows alone. Let me join it with the ratings table.
    • 1:51:14So let me select, start everything from shows, joined with ratings,
    • 1:51:18but I need to tell the database. Well, what do I wanna join things on?
    • 1:51:21Like, what are my fingertips specifically?
    • 1:51:23I want them to join on those common integers. So I can literally say on
    • 1:51:28and then I can specify one table on the left.
    • 1:51:30Shows dot ID on the left should equal the ratings tables, show ID column on the right.
    • 1:51:38Again, if I'm joining shows with ratings,
    • 1:51:41it's called ID in one, it's called show ID in the other. But it's the exact same thing.
    • 1:51:46In fact, if I rewind, this is where we came from two tables with the same value.
    • 1:51:51So with this query here, if I go ahead and now specify, not just that,
    • 1:51:55but let me further say where rating is greater than or equal to six 0.0 and heck,
    • 1:52:01let's limit it to 10
    • 1:52:02just fits on the screen. So it's more of a mouthful.
    • 1:52:05But when I hit enter, now,
    • 1:52:07we have a wider table that indeed contains everything
    • 1:52:10star from having joined these two tables left.
    • 1:52:14And right now again,
    • 1:52:15I don't really care about much of this data
    • 1:52:17like year and episodes and definitely not the DS.
    • 1:52:20So let me actually hit up,
    • 1:52:22let me go to the beginning of the query and let me just select the title of the show
    • 1:52:26and the rating of the show, the query is getting a little long and it's wrapping,
    • 1:52:29but it's the same query, but except for star, except instead of star,
    • 1:52:33I've done title comma rating.
    • 1:52:35Now, when I hit enter,
    • 1:52:36like this is the list that would have been nice to see the first time around.
    • 1:52:39Show me 10 shows with a rating of 6.0 or higher.
    • 1:52:43But remind me what the rating actually is.
    • 1:52:45So maybe I can prioritize the sevens, the eights, the nines
    • 1:52:48and even the tens. If any
    • 1:52:51any questions about this technique of joining two tables,
    • 1:52:57this sort of solves the problem that we created in the world of this sheet where I was
    • 1:53:01just kind of playing around where I sort of
    • 1:53:03moved all the data into its separate locations,
    • 1:53:06which is not at all pleasant to use.
    • 1:53:07But with sequel would join, you can still get any of the data you want. Yeah.
    • 1:53:20Correct. So yes, I should have called that out more explicitly in my query here.
    • 1:53:25I was using dot notation which we've seen in Python we've
    • 1:53:27seen in C it means something similar in spirit here,
    • 1:53:29but it has nothing to do with structure objects.
    • 1:53:31In this case, it has to do with tables and columns.
    • 1:53:34So shows dot ID just makes clear that I want the ID column from the
    • 1:53:38shows table to line up with the show ID column from the ratings table.
    • 1:53:43Strictly speaking,
    • 1:53:44I don't need to do that because in this case there's no ambiguity.
    • 1:53:49One table has a column called ID. The other table has a column called show ID.
    • 1:53:53So certainly the database can just figure this out for me.
    • 1:53:56But for best practice and explicit, uh for the sake of being explicit,
    • 1:54:00using the dot notation and table names can help,
    • 1:54:03especially if there's some common language across them.
    • 1:54:06All right. Well, let's go back to the bigger data set here.
    • 1:54:08These are all six tables in IM DB.
    • 1:54:11We focused for just a moment there on like shows and ratings alone.
    • 1:54:14But what about genres?
    • 1:54:15So genres like comedy and documentary and drama and so forth,
    • 1:54:18turns out that this actually implements a different type of relationship.
    • 1:54:22Previously, we saw a 1 to 1 relationship,
    • 1:54:25but it turns out that IM DB supports what's called
    • 1:54:28a one to many relationship when it comes to genres.
    • 1:54:30Why? Well, shows like the office I do think are generally considered common
    • 1:54:34and that's it.
    • 1:54:35But there's certainly other TV shows that
    • 1:54:37might have multiple genres associated with them.
    • 1:54:40Maybe it's comedy and a bit of romance thrown in like rom coms and so forth.
    • 1:54:44So you can imagine some shows having two or three or more genres.
    • 1:54:49And so one to many means that one show can have many genres,
    • 1:54:531 to 1 would mean like one show can have one rating as we've seen.
    • 1:54:57So why don't we go ahead and focus maybe on, um,
    • 1:55:00how about
    • 1:55:02a
    • 1:55:02weary, like this. Let me go back to VS code here. Clear my screen
    • 1:55:05and let's just look at some of those genres, select star from genres
    • 1:55:09and then I'll limit it to 10.
    • 1:55:11And again, I do this just to wrap my mind around a new data set.
    • 1:55:14I could look at the schema, but that tends to be more cryptic.
    • 1:55:16I just want to look at the raw data. OK?
    • 1:55:18It looks like here, there are a bunch of genres, comedy, adventure comedy.
    • 1:55:23So two comedies which is interesting, oh, interesting family action sci fi family.
    • 1:55:28So the values here are duplicated,
    • 1:55:30which it turns out is not the best design of IM DB.
    • 1:55:32We literally just imported the data as they implement it.
    • 1:55:35But notice that show ID 62614 is a comedy
    • 1:55:40but so is show 63881
    • 1:55:43and so is show 65270,
    • 1:55:46so it turns out that in the real world sometimes data is somewhat messy.
    • 1:55:50There's duplication of comedy, comedy, comedy, but such is the way I MD BS data is.
    • 1:55:55But what's more interesting to me for now is notice this,
    • 1:55:58this show ID in three rows is the same.
    • 1:56:02So there's some show out there that's considered by the world to be an adventure,
    • 1:56:06a comedy and a family show.
    • 1:56:08So let's see what that is. Let me just highlight and copy that value 63881
    • 1:56:11and do this select star from shows where the idea of the show equals that value.
    • 1:56:17And it turns out we saw briefly before it's
    • 1:56:19a show from the 19 seventies called Cat Weasel,
    • 1:56:21which falls into all three of those categories.
    • 1:56:24So by using a one to many relationship, sort of depicted by this picture here,
    • 1:56:29you can implement that same idea without having that jagged edge.
    • 1:56:33When we looked at the spreadsheet earlier in an earlier version of this, we had star,
    • 1:56:37star, star star, which we could do again, genre, genre,
    • 1:56:40genre genre.
    • 1:56:41But now we instead have two separate tables
    • 1:56:44where this many to many relationship is implemented
    • 1:56:47across. So let's actually play around with it.
    • 1:56:49Let me go back to um
    • 1:56:51vs code here and let's actually take a look at the schema for
    • 1:56:55genres and we'll see that it's pretty small as the picture suggests.
    • 1:56:59It's called genres. Every row has a show ID, which is an integer cannot be null.
    • 1:57:04Uh It has a genre which is text cannot be null.
    • 1:57:07And that show ID is a foreign key in this table that references an ID column
    • 1:57:11in the show's table so very similar in spirit. And so it really is
    • 1:57:15just kind of on the honor system that we're
    • 1:57:17only putting one row for each show in ratings,
    • 1:57:20but zero or more shows in the genre's table for shows as well.
    • 1:57:24So what can we do once we want to tinker with genres? Well, let me do this.
    • 1:57:28How about we select, uh, the show ID
    • 1:57:32from the genres table where the genre is comedy?
    • 1:57:35Like, I'm in the mood for some comedy.
    • 1:57:37Let's see, all of the available comedies except now let's just limit it to 10.
    • 1:57:40Here are the show I DS for 10 comedies according to the internet movie database.
    • 1:57:45Well, that's not very interesting. I care about the title so we can do that
    • 1:57:49select title from shows where the idea of the show is not equal to. But rather in
    • 1:57:56the following sub query
    • 1:57:58will select show ID from genres where genre equals quote unquote,
    • 1:58:03comedy limit 10 just to keep things simple.
    • 1:58:07So same queries before.
    • 1:58:09But now I'm using it as a nested query to
    • 1:58:11select the actual titles whose I DS match those there enter
    • 1:58:15and there are those titles for 10 comedies,
    • 1:58:18maybe a couple of which we've seen Cat Weasel we've seen before.
    • 1:58:21But what if we want to know?
    • 1:58:22Let's see,
    • 1:58:24maybe we want to get
    • 1:58:26all of the, let's flip it around. Cat Weasel keeps coming up.
    • 1:58:29Why don't we figure out with a query? What all of its genres are?
    • 1:58:33So Cat Weasel, let's see. So let's do, uh
    • 1:58:35let's see, select
    • 1:58:37star
    • 1:58:38from
    • 1:58:39genres limit 10 because I know it's in the top the 1st 10.
    • 1:58:43This was 63881 was Kat Weasel's ID. So let's do this.
    • 1:58:46So select genre from genres where the show ID equals that value OK,
    • 1:58:53so there's the same query as we did before.
    • 1:58:55Can we make this dynamic? Well, we can too, what if I instead more dynamic,
    • 1:58:59do select genre from genres where the show ID
    • 1:59:04and not in, if I'm looking for a specific show, now, I can actually do equals.
    • 1:59:08And in my sub queer,
    • 1:59:10I could do this select ID from shows where the title of the show equals quote unquote
    • 1:59:15cat weasel semicolon enter.
    • 1:59:19So again, even though I'm typing these very quickly,
    • 1:59:21I'm really just composing like similar smaller ideas that we've seen before into
    • 1:59:25larger and larger queries to just get at more of this data.
    • 1:59:28So what's really going on underneath the hood when
    • 1:59:30you can kind of think of it like this?
    • 1:59:32If we've got this relationship between shows and genres,
    • 1:59:35here's an excerpt from shows and I didn't
    • 1:59:38bother showing the thousands of other shows.
    • 1:59:40Here's an excerpt from genres on the, right.
    • 1:59:42What is that queer? What are we essentially trying to do?
    • 1:59:46Well, let me flip this around here.
    • 1:59:48Let me highlight the fact that this is the same. This is the same, this is the same.
    • 1:59:53So wouldn't it be nice if I could kind of get these all together?
    • 1:59:55Well, if I join these tables, we're actually gonna notice an interesting artifact.
    • 1:59:59If I join them together as we did before with ratings,
    • 2:00:02I'm gonna kinda need to fill in the the gap there, right?
    • 2:00:05Because this is not a table like tables,
    • 2:00:07by definition always have the same number of rows and columns,
    • 2:00:10like you can't have gaps in them like this.
    • 2:00:12So the simplest thing to do is just to fill that in this way.
    • 2:00:15But if I were to try to combine two tables that have this one to many relationship,
    • 2:00:21you're actually going to get duplication.
    • 2:00:23It's not duplication in the original tables,
    • 2:00:26but in the temporary tables otherwise known as a result set
    • 2:00:29that's coming back to us. So what do I mean by this?
    • 2:00:32Well, if we actually implement this same idea as before,
    • 2:00:35where we try to join these two tables, let me propose that we do it with this syntax.
    • 2:00:41Let me do select star from shows, join genres, which is just like we did with ratings.
    • 2:00:47But now let's join it on shows dot ID
    • 2:00:50equals genres dot show ID.
    • 2:00:53But let's just do this for cat weasel
    • 2:00:55where ID equals 63881 semicolon with the ratings.
    • 2:01:00It worked perfectly because it was a 1 to 1 relationship.
    • 2:01:03So the rose just got wider if you will.
    • 2:01:05But now because it's a one to many relationship, when you execute these queries,
    • 2:01:10you are gonna get back, duplicate data.
    • 2:01:12But this is OK, it's considered OK, because this is sort of ephemeral.
    • 2:01:16These result sets these temporary tables exist just for us to look
    • 2:01:19at the data just for us to crunch the numbers somehow.
    • 2:01:22It's not actually stored in duplicate in the database itself.
    • 2:01:27Um If I wanted to tighten this further though,
    • 2:01:29let me actually get rid of the star and let me just do title
    • 2:01:32genre.
    • 2:01:33And indeed, we can now see, OK,
    • 2:01:35cat weasel three times has three different categories, but generally,
    • 2:01:38we don't even care about that.
    • 2:01:39So I can even whittle this query down to just selecting genre
    • 2:01:43and that too will just give me the result, effectively hiding the duplication.
    • 2:01:47But when you join data with a one to many relationship,
    • 2:01:50you're temporarily going to get duplicates,
    • 2:01:52which is actually useful because it's very easy then to get at the show's title
    • 2:01:56no matter where you are in some loop.
    • 2:01:59All right. Well, what more can we do here?
    • 2:02:01Well,
    • 2:02:01let me propose that we revisit the main database here
    • 2:02:04with six tables and let's look at perhaps the juiciest and
    • 2:02:07the one that's really what most people use IM DB for
    • 2:02:10is to look up like shows and people they're in,
    • 2:02:12let's focus on these three tables.
    • 2:02:14And we can infer from this diagram
    • 2:02:16that there's now for the first time, three tables involved in a relationship.
    • 2:02:20There's people, there are shows,
    • 2:02:22but I've proposed this intermediary stars table much like I temporarily in Google
    • 2:02:26sheets gave us a a third sheet to kind of link the two together
    • 2:02:30this stars table.
    • 2:02:31We're about to see his purpose in life is to join two other tables together.
    • 2:02:36And in fact, it's only gonna have two columns, show ID
    • 2:02:39and person ID.
    • 2:02:40So what this is gonna do for us is implement this idea many to many relationship. Why?
    • 2:02:45Because any TV show can obviously have many people in it,
    • 2:02:48but one person can presumably star in many different shows.
    • 2:02:52Like Steve Carell has been in multiple shows, not just the office.
    • 2:02:55So when you have a many to many relationship,
    • 2:02:57you actually do need this third table to sort of bridge the two any number of times.
    • 2:03:02But it's gonna make our life a little more unpleasant to get the
    • 2:03:05data we want because it's gonna add some additional steps if you will.
    • 2:03:08So let me do this,
    • 2:03:10suppose that I want to get everything I know about the office.
    • 2:03:14Well,
    • 2:03:14let's start with a single query here in DS
    • 2:03:16code select star from shows where title equals quote unquote
    • 2:03:20the office. And I should see,
    • 2:03:22oh,
    • 2:03:23interesting several attempts at creating a TV
    • 2:03:25show called The Office over the years.
    • 2:03:27You can perhaps infer the year in which the most popular of them began
    • 2:03:312005.
    • 2:03:32So I presume this is the one we all know and have watched at least in the US,
    • 2:03:36which is this 1386676, which matches the ID that I very carefully used earlier.
    • 2:03:41So let me actually be a little more deliberate
    • 2:03:44where the title equals the office and the year equals 2005.
    • 2:03:48That query now gets us the office that we all know in the US and perhaps love.
    • 2:03:53But now let's actually do something like, get all of the people who starred in it,
    • 2:03:57at least according to Im DB, whoever had top billing.
    • 2:04:00So how can I do this?
    • 2:04:02Well, unfortunately, in the show's table,
    • 2:04:04there are no people and there's no stars even.
    • 2:04:08But I could do a nested query like this. Why don't I select
    • 2:04:11the person ID
    • 2:04:13from the stars table?
    • 2:04:16Where
    • 2:04:18whoops uh where
    • 2:04:21the I, the where um
    • 2:04:24sorry,
    • 2:04:27where show ID equals and then in parentheses,
    • 2:04:31let me do that same query as before and for time's sake, I'll just copy paste
    • 2:04:35so that we get back the one and only office in that sub query.
    • 2:04:39So what I'm gonna do is sort of take an intermediate step,
    • 2:04:41a baby step if you will right now I have found in the show's table,
    • 2:04:45the office I care about.
    • 2:04:46But if I want to get to the people table,
    • 2:04:48I have to kind of take a step through the stars table,
    • 2:04:51this intermediate table just to get anywhere close to the names of those people.
    • 2:04:54So what can I get from the stars table?
    • 2:04:56Well,
    • 2:04:57why don't I at least select all of the person I DS
    • 2:05:00in that table that are somehow associated with the same show ID?
    • 2:05:04So in VS code,
    • 2:05:05what I'm doing is this select person ID from that intermediate stars table where
    • 2:05:09the show ID in question is whatever the show ID is for the office,
    • 2:05:13I could literally type 386676, but I'm trying to do this more dynamically.
    • 2:05:17So I've used the nested query instead. All right, this is correct. Whoops.
    • 2:05:21Ironically
    • 2:05:22where show ID,
    • 2:05:25uh, like personally from stars where show ID
    • 2:05:31Oros if you catch it before I do,
    • 2:05:33it's the person I date for him to,
    • 2:05:37oh, select I,
    • 2:05:39oh, thank you. Oh, who said that?
    • 2:05:41Ok, I owe you, Batman. I owe Batman Oreos after class. Thank you, Batman.
    • 2:05:46That's de facto uh Yes, other good prize for that costume.
    • 2:05:50Ok. So let me fix this. My apologies.
    • 2:05:53So let's go ahead and select person ID from stars where show ID equals
    • 2:05:58and this is where I messed up before I did select star from shows.
    • 2:06:02But I can't look for a show ID equaling an entire row of information.
    • 2:06:06I instead need to do select ID from shows where the show's title equals,
    • 2:06:13quote unquote the office and the year of that show is 2005.
    • 2:06:18And just to call this out
    • 2:06:19much like in
    • 2:06:20C I'm quoting strings, but I'm not quoting numbers.
    • 2:06:24It's not necessary for something like 2005. And just to be super clear too,
    • 2:06:28I have generally adopted a style already today of capitalizing any sequel,
    • 2:06:32keywords like select like from like where and so forth
    • 2:06:36and then using lower case for everything else strictly speaking,
    • 2:06:38that's not necessary, stylistically,
    • 2:06:40we would encourage you to be in the habit of using uppercase
    • 2:06:43for your sequel keywords because they just pop more on the screen.
    • 2:06:45It sort of makes things more readable, but
    • 2:06:47strictly speaking,
    • 2:06:48sequel itself does not care about that capitalization of keywords.
    • 2:06:52All right. Now, let me cross my fingers and now I get back this list of person I DS.
    • 2:06:56And again, my goal is to figure out who is in the office that debuted in 2005.
    • 2:07:00This is not that interesting because I don't know who any of these people are,
    • 2:07:03but here's where we can do one additional step and nest a nested query.
    • 2:07:08So let me actually select the names from the
    • 2:07:11people table where the idea of those people is in
    • 2:07:16and then on a new line just to make it pretty
    • 2:07:18where the ID
    • 2:07:20equals this query, which I'll paste here
    • 2:07:24and then that equals and I'll in,
    • 2:07:26I'll another line and indent further this query here.
    • 2:07:30So just to save time, I'm copying and pasting the previous query,
    • 2:07:32but I'm wrapping it with one outermost query.
    • 2:07:35Now, that's saying select names from people where
    • 2:07:38the idea of those people is in this result set where
    • 2:07:42the idea of those shows is in this result set.
    • 2:07:45So the parentheses make clear,
    • 2:07:48ideally what's happening in what order from inside out enter
    • 2:07:52and there we have it at least according to IM DB for the latest season.
    • 2:07:56Like this is the top build stars that are in this here database.
    • 2:08:01All right. So how can we do something else?
    • 2:08:03Well, let me just do it in the other direction.
    • 2:08:05Suppose we want to see all of Steve Carell shows, not just the office.
    • 2:08:08Well, let me do this one.
    • 2:08:09A little quickly select title from shows where the
    • 2:08:12idea of that show is in the following list.
    • 2:08:15Well, what's that?
    • 2:08:17What's that list? Let me go ahead and select sh oh, thank you.
    • 2:08:22I don't have Extra Oreos though for you. So, um,
    • 2:08:25I'll just fix this. I'm sorry,
    • 2:08:26select title from shows where the ID of the show is in.
    • 2:08:32And then here I'm gonna do select show ID from stars where person ID equals.
    • 2:08:39And then here indenting for clarity, I'm gonna select Steve
    • 2:08:43ID by saying,
    • 2:08:44select ID from people where the name of that person is quote unquote Steve Carell.
    • 2:08:50And so in this way, I'm sort of
    • 2:08:53ask, I'm writing the sequel query sort of back uh in reverse.
    • 2:08:57I'm asking first what I care about.
    • 2:08:59But before I can even answer that, I have to answer this nested query.
    • 2:09:02What is Steve Carell's Id
    • 2:09:04once I have that, what are all the show I DS that that person ID has been?
    • 2:09:07And, and then please tell me what the title of all of those shows is.
    • 2:09:11Let me go ahead and cross my fingers and voila some of these,
    • 2:09:14you might have heard of some of you might not have.
    • 2:09:16But if you were to go on im DB dot com and search for Steve Carell,
    • 2:09:19you would presumably see
    • 2:09:21this here. List of shows that he's been in in some particular order.
    • 2:09:26Um Just to show you two other syntax.
    • 2:09:29But let me not emphasize this because it will look complicated.
    • 2:09:32There are other ways to solve the same problem.
    • 2:09:35If you prefer the approach of joining,
    • 2:09:37we can actually join not just two but three tables together,
    • 2:09:41but question first.
    • 2:09:43No, just stretching first. So
    • 2:09:46two final ways to execute the same idea.
    • 2:09:49But the first of them that I just did is arguably relatively simpler.
    • 2:09:53You could do this. I could select the title from the show's table by joining it on
    • 2:10:00the stars table on the shows I
    • 2:10:04column
    • 2:10:04equaling the stars tables show ID column.
    • 2:10:08And then I can further join it on the people table on stars dot person ID
    • 2:10:14equaling
    • 2:10:15people dot ad.
    • 2:10:16So this is a mouthful and even I am kind of crossing my fingers
    • 2:10:18that I didn't screw up when transcribing it from my my print out here.
    • 2:10:22But what I'm effectively doing is joining 123 tables all together by telling
    • 2:10:28the database how to join the shows table with the stars table.
    • 2:10:33And the people table specifically the way to bridge that picture per the diagram is
    • 2:10:38to specify that shows dot ID should be lined up with stars dot Show ID
    • 2:10:44and
    • 2:10:45stars dot person ID should be lined up with people ID.
    • 2:10:50And that's it,
    • 2:10:50that essentially allows us to connect these three tables with their common fields.
    • 2:10:54If I hit enter, now I'm gonna get back somewhat slowly. Actually,
    • 2:10:59a
    • 2:10:59really long list with some duplication of all of those particular shows.
    • 2:11:05Oh, actually,
    • 2:11:05all shows in the database because I didn't practice what I'm preaching.
    • 2:11:09I wanted to search for just Steve Carell.
    • 2:11:11What you're seeing is the entirety of the tens of thousands of TV. Shows.
    • 2:11:14Control C is your friend. Let me go ahead and reload sequel Light and let me
    • 2:11:19type that again. Let me type that once more.
    • 2:11:21Sorry, select title from shows, join stars on shows dot ID equals stars dot Show ID.
    • 2:11:28Join people on stars dot person ID equals people
    • 2:11:35ID. Where this was the part I left out name equals Steve Carell.
    • 2:11:40And if I didn't screw up by typing so fast, enter,
    • 2:11:44now we get a little more slowly than before
    • 2:11:47those same shows that Steve Carell starred in.
    • 2:11:50So this is just to say there's another way of doing this,
    • 2:11:52but maybe a third way which is a little simpler than that explicitly joining them.
    • 2:11:56In that way,
    • 2:11:56you can alternatively still do this select title from shows stars and people,
    • 2:12:03you just literally enumerate with commas,
    • 2:12:06what three tables you want to join somehow
    • 2:12:08and then you can instead of using join.
    • 2:12:10You can just use where clauses to kind of make sure they line up properly.
    • 2:12:14You can say where shows dot id equals stars dot Show ID
    • 2:12:19and people ID equals stars dot person id and
    • 2:12:24name equals quote unquote, Steve Carell.
    • 2:12:27And I realize this is hard to keep track of everything.
    • 2:12:29Now, all these darn darn different ways to do this,
    • 2:12:31this is just to say that there's different approaches to solving the same problem.
    • 2:12:35And for different people,
    • 2:12:36you might think about things a little more uh differently than someone else.
    • 2:12:39If I hit, enter here this too, it's a little slower than the nest, it selects it seems,
    • 2:12:43but it does in fact give us that same answer.
    • 2:12:47And just for thoroughness, if I go back to our diagram,
    • 2:12:49besides the tables we've seen,
    • 2:12:52there's actually another writers table in there as well.
    • 2:12:55If you're curious to see what writers is,
    • 2:12:57let's just glance at that real fast in vs code.
    • 2:12:59Let me do dot schema writers.
    • 2:13:01And it's actually almost the same as stars except this case in sh in writers,
    • 2:13:06we are associating a show with person ID,
    • 2:13:09both of which in this context are foreign keys
    • 2:13:11that indeed reference back shows and people ID,
    • 2:13:14which again if I do this schema stars,
    • 2:13:16which we didn't see before is structurally the same.
    • 2:13:19So the relationship is essentially embodied in this case by IM DB and in turn by us,
    • 2:13:24by way of the table's name
    • 2:13:26TV, stars or writers thereof.
    • 2:13:29All right. I know that's a lot. Any questions
    • 2:13:33before we take a higher level step back? Yeah.
    • 2:13:40A
    • 2:13:40good question.
    • 2:13:41Does SQL provide any way to figure out the
    • 2:13:42mapping between tables without looking at the database?
    • 2:13:45Short answer? No, like this is the dia well, that's not quite fair
    • 2:13:48depending on the database you're using,
    • 2:13:50you might be able to click a button for instance and
    • 2:13:51get a nice pretty picture like this that shows the relationships.
    • 2:13:54Indeed, we use software to generate this.
    • 2:13:56We didn't do this diagram, for instance, by hand, Sequel, Light
    • 2:13:59itself does not provide you with that in SQL. Li the best you can do is run dot schema.
    • 2:14:04And if you don't specify a table name,
    • 2:14:05you'll get everything from the table described.
    • 2:14:08Once you get comfortable with SQL though,
    • 2:14:09the idea is that you can read the text and sort of infer what the structure is.
    • 2:14:12But yes, the graphical programs can generate prettier pictures like this,
    • 2:14:16but it depends on the software you're using.
    • 2:14:18Yeah.
    • 2:14:21OK.
    • 2:14:22Sequel is not case sensitive with respect to its keywords,
    • 2:14:26but table names and other things that you chose.
    • 2:14:29You've got to be consistent with how you capitalize them.
    • 2:14:32I've done everything in lower case,
    • 2:14:33but that tends to be one convention other
    • 2:14:35people might use Camel case where you alternate caps
    • 2:14:38appropriately.
    • 2:14:39All right.
    • 2:14:40So let's take a higher level,
    • 2:14:42look at this and also consider some of the actual real world
    • 2:14:44problems that tragically are still with us in some form today.
    • 2:14:47Notice that some of the queries we executed a bit ago were actually relatively slow,
    • 2:14:52whereas I hit enter and got a lot of my results like that.
    • 2:14:54Those last two queries where I was joining all of those
    • 2:14:57tables looking for Steve Carell's shows were actually relatively slow.
    • 2:15:00And let's try to take a simpler case. Let me do this in SQL light.
    • 2:15:03You can actually time your queries by running dot timer and then turning it on.
    • 2:15:07This is just gonna keep track.
    • 2:15:09Now of how many seconds or milliseconds any of your queries take,
    • 2:15:12if you're curious to figure out what's fast or what's slow,
    • 2:15:14let me do something relatively simple like this select star from shows
    • 2:15:18where the title of the show equals quote unquote the office semicolon.
    • 2:15:22All right. That was pretty darn fast and it took 0.044 seconds. In reality.
    • 2:15:28If you care further,
    • 2:15:29you can break this time down into user time
    • 2:15:31like spent in my account versus system time,
    • 2:15:34which means spent in the operating system essentially.
    • 2:15:36But we'll focus on the real clock,
    • 2:15:38wall clock time of 0.000 0.044 seconds pretty darn fast,
    • 2:15:44but we can actually do better because it turns out the world of sequel,
    • 2:15:48you can create what are called indexes, which is a fancy way of saying
    • 2:15:52a data structure
    • 2:15:54that makes it faster to perform queries like selects and even other queries as well.
    • 2:15:59In a index, you can use syntax like this, create index,
    • 2:16:02the name of the index on a specific table on the specific columns.
    • 2:16:06And what I mean by this is if you know that your application like im
    • 2:16:09DB dot com or their mobile app is gonna search on certain columns frequently.
    • 2:16:14You can prepare the database in advance to build up
    • 2:16:16fancy data structures and memory so we can get back answers even faster than that.
    • 2:16:21So case in point,
    • 2:16:22let me go back to VS code here and let me create an index called
    • 2:16:26whatever title index for instance on the
    • 2:16:29show's table specifically on the title column.
    • 2:16:32So that's simply the Syntex for telling the database in advance index
    • 2:16:36this column because I'm gonna do a lot of searching on it.
    • 2:16:39So I want the queries to be fast, enter,
    • 2:16:41it took a moment, it took almost half a second.
    • 2:16:43But the index I only have to build once even though technically
    • 2:16:46you have to maintain it over time if you're doing updates,
    • 2:16:48deletes and inserts.
    • 2:16:49But now let me do the same query as before.
    • 2:16:52Select star from shows where title equals quote unquote the office
    • 2:16:57previously that query took 0.044 seconds when I hit enter.
    • 2:17:01Now.
    • 2:17:02Boom, I mean it takes no time at all or less time than it's even keeping track of,
    • 2:17:08in terms of significant digits.
    • 2:17:09Now, that might not seem like a big deal to us, humans and our human eyes.
    • 2:17:12But if you've got hundreds thousands millions of users as maybe the real I am
    • 2:17:16B dot com. Has you just saved yourself a fortune in servers and complexity? Why?
    • 2:17:22Because the same server can clearly now handle way more people per unit
    • 2:17:26of time per second because each query takes less and less time.
    • 2:17:29I mean, we're all too familiar here.
    • 2:17:31And at Yale surely with certain university applications that are just so darn
    • 2:17:34slow when you click a button and the stupid thing spins and makes you
    • 2:17:37wait and wait a lot of the time that can be explained by
    • 2:17:40like poor database design or databases that might not have been indexed properly.
    • 2:17:44So when you're searching for some course, for instance, in the course catalog,
    • 2:17:47it's taking forever because underneath the hood,
    • 2:17:50it's essentially doing linear search over everything.
    • 2:17:53But by contrast, in a relational database, when you create an index in advance,
    • 2:17:57because you have a hunch that maybe users are
    • 2:17:59gonna search on that column like uh show titles.
    • 2:18:03Essentially you're building up in memory what's called a B tree,
    • 2:18:05which is not a binary tree.
    • 2:18:07It's still a tree though if you think back to week five.
    • 2:18:10But it's a very short fat tree instead where every node might have two or three
    • 2:18:15or 30 Children, which essentially pulls the height of the tree way up,
    • 2:18:20which is to say that when you search for some value in A B tree,
    • 2:18:23it's invariably going to be in the leaves.
    • 2:18:25So the shorter the tree is the fewer steps it takes to find the value you care about.
    • 2:18:30So when you run create table,
    • 2:18:31that kind of data structure is being magically created for you by the database.
    • 2:18:37So it's not a lame linear search from top to bottom through the entire column.
    • 2:18:42It's not a simple linear search through the
    • 2:18:45higher column, top to bottom.
    • 2:18:48So with that said, we can see this really with more complicated queries.
    • 2:18:52And let me go back to VS code here.
    • 2:18:54Let me propose to run that same slow query before even though it's fine.
    • 2:18:58If you're not comfortable with the syntax, it was relatively slow though to watch.
    • 2:19:02So let's do select title from shows stars and people
    • 2:19:07where shows dot id equals stars dot Show ID
    • 2:19:11and people dot id equals stars dot
    • 2:19:15person id
    • 2:19:17and name equal Steve Carell.
    • 2:19:20So this was the last of those queries that just searches
    • 2:19:22for all of Steve Carell's TV shows without using joins explicitly.
    • 2:19:26But by just enumerating all three tables in question
    • 2:19:28and then using where to cleverly connect the dots,
    • 2:19:31so to speak.
    • 2:19:32But my timer is still on. So notice now when I hit enter,
    • 2:19:35it doesn't just feel slow.
    • 2:19:37It actually took 2.763 seconds like that's slow, that's expensive,
    • 2:19:42that's gonna annoy your users,
    • 2:19:43that's gonna annoy your students if the database is sort of
    • 2:19:46thinking and thinking and thinking and taking that much time.
    • 2:19:48But let's note this, that same query I just executed touched a bunch of calls
    • 2:19:55and it turns out that whenever you declare a primary key in a database,
    • 2:19:58at least in SQL light,
    • 2:19:59you get an index for free like primary keys are automatically indexed.
    • 2:20:03So searching for a number in that column super fast, not linear search.
    • 2:20:07It's something logarithmic most likely or ideally
    • 2:20:09closer to something like constant time.
    • 2:20:11Even
    • 2:20:12here though I'm touching not just shows dot id,
    • 2:20:16but I'm also filtering on stars dot show id.
    • 2:20:19So a foreign key, foreign keys are not indexed by default.
    • 2:20:22I'm looking at people ID,
    • 2:20:23that's a primary key that's indexed but stars dot person
    • 2:20:26ID not indexed by default is a foreign key.
    • 2:20:29Lastly,
    • 2:20:29I'm filtering by name in the in the people table names are not indexed by default.
    • 2:20:36So I'm touching three separate columns, two foreign keys,
    • 2:20:39one name field that have no fancy tree structure built for them.
    • 2:20:43But I can do that. Let me go down to
    • 2:20:46uh my terminal here.
    • 2:20:48Let me create one index called say person index though I could call it anything I want
    • 2:20:53on the stars table on the person ID column.
    • 2:20:56So that index is that foreign key took a 1.7 seconds but I only
    • 2:21:01have to do it once create index uh show index on the stars table,
    • 2:21:06show ID.
    • 2:21:07So another foreign key is getting its own index took 1.4 seconds,
    • 2:21:11but it's a one time thing.
    • 2:21:12And lastly let's index all of those actors
    • 2:21:14names create index called name index on the
    • 2:21:18people table on the name column enter that
    • 2:21:22took 1.0 seconds but at one time operation.
    • 2:21:25So what essentially I built up like three of
    • 2:21:27these trees in memory now specifically for these columns.
    • 2:21:30So now
    • 2:21:32recall previously that that slow query, if I scroll back up,
    • 2:21:38that took, what was it? 2.7 seconds, I think 2.7 seconds previously. But if I now run
    • 2:21:46uh the same thing,
    • 2:21:47select title from shows stars, people where shows ID equals stars dot Show ID
    • 2:21:54and people ID equals stars dot person ID and
    • 2:21:59name equals Steve. So close Carell
    • 2:22:03same query before previously took 2.7 seconds,
    • 2:22:07which was the most annoying of them yet.
    • 2:22:08Now when I hit enter
    • 2:22:10boom, 0.001 seconds,
    • 2:22:14which is the difference again between bunches of linear searches and
    • 2:22:18in this case, searching a fancier week five style
    • 2:22:21uh beet tree in this case. So indexes matter.
    • 2:22:24So what's then maybe the the trade off here?
    • 2:22:28Like why not index every column in every table?
    • 2:22:30Because this is feeling great like we're speeding
    • 2:22:32things up by factors of like 1000 practically.
    • 2:22:35What's the trade off?
    • 2:22:37Lots and lots of memory or space. Yeah.
    • 2:22:39So you're just trading off space for time,
    • 2:22:41which we said a couple of weeks ago is an
    • 2:22:42acceptable trade off depending on what resources you have.
    • 2:22:45But it's probably an over correction to index everything,
    • 2:22:49especially since it will slightly slow down, inserts,
    • 2:22:51updates and deletes because you have to maintain this tree structure.
    • 2:22:54So it doesn't devolve back into uh like a linked list or something linear.
    • 2:22:58But in fact, um being selective about it is perhaps the best strategy.
    • 2:23:04All right, so that we can now solve some other problems.
    • 2:23:08More generally, let me just connect two dots. Even though we focus today on SQL
    • 2:23:13specifically for databases,
    • 2:23:14you can actually combine one language with another and solve different problems.
    • 2:23:19And so, in fact, let me do this, let me revisit our favorites dot pie from earlier.
    • 2:23:23But let me actually now use
    • 2:23:26the favorites database as follows. Let me go into vs code here.
    • 2:23:30Uh Let me remove favorites dot DB because if you recall, it made everyone's um
    • 2:23:37count previously was uh became 50 V. So let me remove that file.
    • 2:23:42Let me run sequel light three on favorites dot DB. Again.
    • 2:23:47Uh Let me create the file an new. Let me set the mode to CS V again.
    • 2:23:51Let me import that file called favorites dot CS
    • 2:23:54V into an identical table as before called favorites
    • 2:23:57and then quit. So I've just reset things to my backup if you will from the CS V file.
    • 2:24:02So I again have a favorites dot DB.
    • 2:24:04Let me now minimize my terminal window here and reopen favorites dot pie.
    • 2:24:09And let me just go ahead and get rid of that
    • 2:24:10version entirely and focus this time on not talking to a CS
    • 2:24:14file, opening it iterating over the rows.
    • 2:24:16We can actually use Python to execute SQL queries
    • 2:24:20and kind of get the best of both worlds.
    • 2:24:22So let me do this here.
    • 2:24:24Let me from the CS 50 Python library import our own sequel functionality.
    • 2:24:30And this is a training wheel.
    • 2:24:31We still provide for sequel because it's just much
    • 2:24:34easier than using the industry standard libraries for sequel.
    • 2:24:36It just is painful uh for simple tasks.
    • 2:24:39So now let me create a variable called DB for database.
    • 2:24:42I'm going to set it equal to the sequel function that CS 50 row
    • 2:24:45and this is gonna look weird.
    • 2:24:46But the way you open a DB file in Python,
    • 2:24:49whether it's with CS fifty's library or someone else's,
    • 2:24:52you say SQL light colon slash slash slash favorites dot DB.
    • 2:24:57So weird syntax, but it's commonplace three slashes,
    • 2:25:00not two like a URL is usually now let's use a variable called favorite and set it
    • 2:25:05equal to the return value of input by asking the human for their favorite TV show.
    • 2:25:09And now previously we opened up a CS V file iterated over it, looking for the show,
    • 2:25:15they typed in show they typed in
    • 2:25:17or rather um
    • 2:25:19oh sorry, not the show the, the problem or the problem that we typed in.
    • 2:25:22Let me instead do this.
    • 2:25:24Let me set,
    • 2:25:25use a DB variables, execute function which comes with the CS 50 library.
    • 2:25:30And let me execute this SQL query,
    • 2:25:32select count star as N from favorites
    • 2:25:38where problem equals question mark and the question
    • 2:25:42mark is a little weird but think of it for now like CS percent S
    • 2:25:45comma favorite
    • 2:25:47gonna plug in whatever the human typed in into that query where the question mark is.
    • 2:25:51So no percent S I'm using a question mark in this world of sequel.
    • 2:25:55This is going to give me back a temporary table and I'm going to store that temporary
    • 2:25:59table in a variable called rows because the
    • 2:26:01temporary table is essentially zero or more rows.
    • 2:26:03So I'm going to name my variable rows.
    • 2:26:05And then if I want to get back the first and only row,
    • 2:26:09I can literally do like row equals rows,
    • 2:26:11bracket zero just to be pedantic just to get literally
    • 2:26:13the first row in that temporary table or result set.
    • 2:26:17And now if I want to print back the N column there in,
    • 2:26:20I can do print
    • 2:26:22row, quote unquote N.
    • 2:26:24So let me take a step back and do this. Let me go into SQL light three of favorites dot DB
    • 2:26:31and let me literally type something like this here.
    • 2:26:34I'm in SQL light three at the bottom.
    • 2:26:36And the problem I'm searching for, for instance is scratch semicolon.
    • 2:26:40Notice that is how in SQL in the command line of SQL light three,
    • 2:26:44I can get back the answer I want.
    • 2:26:46But what if I want to make a Python program that
    • 2:26:47queries the database for that value and then print something out
    • 2:26:50in two weeks time.
    • 2:26:51What if I want to do that to make a web application like im
    • 2:26:54DB dot com or mobile app that writes code to generate a user interface.
    • 2:26:58But that pulls the data from a database.
    • 2:27:00You're not going to have your human users using SQL Light three,
    • 2:27:03you're going to generate the output for them.
    • 2:27:05So let me close my terminal window here. Rather let me close out of sqlite
    • 2:27:09three. Let me now run Python of favorites dot P
    • 2:27:13enter.
    • 2:27:15I'm prompted for my favorite using the input function I type in scratch and hit enter
    • 2:27:19and there my 34.
    • 2:27:21So this is an incredibly common practice to use one language for what it's best at.
    • 2:27:25Like sequel is best at reading data from databases.
    • 2:27:28Python is maybe best in this case for like creating
    • 2:27:31a user interface or eventually making a web application,
    • 2:27:34but it's certainly fine to use one inside of the other
    • 2:27:37the documentation for this library.
    • 2:27:39If and when you get curious is that this URL
    • 2:27:41here along with a lot of CS fifty's own documentation,
    • 2:27:44but there are some problems nonetheless, we,
    • 2:27:47we might encounter in this world and we thought we'd end on these challenges.
    • 2:27:50It turns out
    • 2:27:52that in the world of sequel, even though we haven't touched upon this yet,
    • 2:27:55you're generally working not with like hundreds of
    • 2:27:57people in their favorite languages and problems,
    • 2:27:59not even thousands of movies, but like millions of things in the database,
    • 2:28:03like Instagram posts or tiktok videos or the like, that's,
    • 2:28:06those are huge databases with millions of rows.
    • 2:28:09The problem with sequel and really databases in
    • 2:28:12general is if you have like thousands,
    • 2:28:14millions of things happening at once, things can get out of
    • 2:28:17order and like your math can be wrong,
    • 2:28:18you can lose track of how many likes something has.
    • 2:28:20And so for instance, as of last night,
    • 2:28:22this remains the most popular Instagram post two dates in the world.
    • 2:28:26Uh It was clicked on by so many people and you
    • 2:28:30might think that's pretty straightforward to keep track of click,
    • 2:28:32click, click,
    • 2:28:33but not when there's millions of devices in the world and
    • 2:28:36thousands of servers probably at meta running Instagram's uh back end.
    • 2:28:41So how do you actually keep track of all of these likes?
    • 2:28:44Well,
    • 2:28:44maybe meta is using code like this to implement the counter for likes on Instagram.
    • 2:28:51Maybe they are using lines of code similar to what we just wrote DB dot execute.
    • 2:28:56Select the current number of likes from the posts table where the idea
    • 2:29:01of the post equals whatever the one is that the user clicked on.
    • 2:29:04So the post a moment ago
    • 2:29:06presumably has a unique, a
    • 2:29:07primary key and that number just gets plugged in here when meta wants to figure out.
    • 2:29:12All right, someone just clicked on this post.
    • 2:29:14Let's figure out what the current number of likes.
    • 2:29:15So we can add one to it and update the database.
    • 2:29:18So this query here gives us a temporary table containing the
    • 2:29:22current number of likes before you or someone else clicked.
    • 2:29:25Maybe then we just declare a variable called likes to get at the first rows,
    • 2:29:29likes column.
    • 2:29:30So this is just Python syntax similar to what I just to get the
    • 2:29:33actual number you care about like 34 or whatever million it is here.
    • 2:29:37But then suppose there's a second database query and a third line of
    • 2:29:41code that updates the posts table setting the likes equal to this value
    • 2:29:47where the idea of the post is this value.
    • 2:29:49So these question marks are similar again to spirit to print F's percent S
    • 2:29:53they're placeholders for things that are going to be plugged in after the commas.
    • 2:29:58So if I want to update the number of likes
    • 2:29:59to be whatever the current number is plus one,
    • 2:30:02I put it there as a second argument. And then I plug in the ID of that post.
    • 2:30:06The problem though with large systems like the Metas, the Googles,
    • 2:30:10the Microsoft and others of the world is that
    • 2:30:12they are executing code like this on multiple servers,
    • 2:30:15thousands of servers that might be executing slightly out of order,
    • 2:30:19one might be faster, one might be slower, which is to say,
    • 2:30:23even though these three lines of code represent what should happen when I click on
    • 2:30:27that post and you click on that post and you click on that post.
    • 2:30:30The lines of code chronologically might get shuffled a little bit
    • 2:30:34like maybe this line of code gets executed for me and
    • 2:30:37then it gets executed for you and then you and then
    • 2:30:39the server moves on to the next line of code.
    • 2:30:41So it's sort of multitasking, handling lots of users at once.
    • 2:30:45The problem with this is that you run into a race condition
    • 2:30:48of sorts where the servers are sort of racing to handle one user
    • 2:30:51but other users requests are happening at the same time.
    • 2:30:55So the in analog,
    • 2:30:57the analogy that I was taught years ago in an operating
    • 2:30:59systems class actually pertains to something like a refrigerator here.
    • 2:31:03So we have a mini refrigerator here.
    • 2:31:04Suppose you've got one in your dorm or your house room and you
    • 2:31:07uh come home one day and you really like milk.
    • 2:31:10So you open the fridge and you look inside and there's, oh, we're out of milk.
    • 2:31:14So you close the fridge,
    • 2:31:16you walk out to like CBS or somewhere else and go to get more milk.
    • 2:31:19Meanwhile, though, your roommate comes home, who in this story also likes milk.
    • 2:31:23And so they decide that, oh, I'm out of milk in the fridge.
    • 2:31:26So they maybe head out, maybe they follow a different path to go get more milk as well.
    • 2:31:31Some number of minutes pass. You both come home later on like, oh, darn it.
    • 2:31:35Now, we have like twice as much milk as we need.
    • 2:31:36We don't really like it that much and some of it's gonna go sour now.
    • 2:31:39So it's wasted like we made a mistake.
    • 2:31:41We should not have bought twice as much milk now. Stupid story.
    • 2:31:45But like the point is that both the
    • 2:31:47you made decisions based on the state of a variable.
    • 2:31:51But the problem was that variable was in the process
    • 2:31:54of being updated when someone else looked at it,
    • 2:31:57the first person in the story was now on their way to the store.
    • 2:32:00So the variable was about to be incremented in terms of quantity of milk,
    • 2:32:04but the other person didn't know that yet.
    • 2:32:05So they too tried to increment it. And in that case, we ended up with too much milk.
    • 2:32:09But suppose what might happen here
    • 2:32:12is similar in spirit, suppose that that post at some point
    • 2:32:15time had just 1 million likes.
    • 2:32:17And suppose this line of code got executed for me for you
    • 2:32:20and for you after all three of us clicked on it.
    • 2:32:23Well, the value of our likes variable in meta servers might be 1 million,
    • 2:32:271 million and 1 million, they therefore update 1 million to be 1 million plus one.
    • 2:32:32And so what they update the database to be is 1,000,001,
    • 2:32:35but they do it three times 1 million, +11 million, 1 1,000,001.
    • 2:32:39But they've lost two of those likes
    • 2:32:42because they might
    • 2:32:43inspected the variable while some other server,
    • 2:32:46some other users like was being processed.
    • 2:32:49So long story short, when you have lots of data on lots of servers,
    • 2:32:53all of which is happening very quickly,
    • 2:32:55you run into these so called race conditions and code like this can be dangerous
    • 2:32:59even though it might not look incorrect at a glance.
    • 2:33:02Thankfully, in the world of sequel, though you won't generally have to do this.
    • 2:33:05Certainly for problems. That's sake.
    • 2:33:07There are solutions to this but too many engineers in the world don't know,
    • 2:33:11don't remember this or don't appreciate this reality.
    • 2:33:14There are keywords in certain databases that let you instead begin a transaction,
    • 2:33:19which means essentially that these three lines
    • 2:33:22of code should either all happen together
    • 2:33:25or not at all. They should be atomic.
    • 2:33:27That is to say they should all happen without
    • 2:33:29interruption or they just shouldn't happen at all.
    • 2:33:32And that ensures that you can uh that the math
    • 2:33:36does not go wrong because my like will get counted and
    • 2:33:39then you're like, and then you're like,
    • 2:33:40as opposed to them being intermingled and lost track of accordingly.
    • 2:33:44So in Python, using the CS 50 library
    • 2:33:47you could wrap these three lines of code
    • 2:33:49by saying begin transaction to the database,
    • 2:33:52commit the transaction to the database.
    • 2:33:54And that relatively simple solution avoids this problem of race conditions.
    • 2:33:59But this too is a topic. If you Google invariably, you'll see that this is a problem
    • 2:34:03that has hit various servers and apps over time.
    • 2:34:07But there's one other problem that the world is still not very good at.
    • 2:34:10And that's known as a sequel injection attack.
    • 2:34:13And it turns out that even with what we've been doing, even code like this,
    • 2:34:16here is all too easily vulnerable to being hacked if you will,
    • 2:34:21that is misused in some way.
    • 2:34:23Unless you practice what I'm preaching, which is using placeholders like this.
    • 2:34:28It turns out that it's very dangerous to take user input in generally like most
    • 2:34:32of your users might be nice, good people,
    • 2:34:34but there's always gonna be someone who's malicious or curious or
    • 2:34:37just execute something you don't expect and things can go wrong.
    • 2:34:40So in the world of SQL, here's what can happen.
    • 2:34:42For instance, here's the login screen for uh Yale accounts.
    • 2:34:45Here's the uh comparable screen for Harvard accounts and all of us in
    • 2:34:48their habit of using one of these screens or something similar or another,
    • 2:34:52you're often asked for like a user name or email address and a password.
    • 2:34:55But suppose that
    • 2:34:57Harvard or Yale or Google or Microsoft or wherever
    • 2:35:00are taking user input from forms like this be it on the
    • 2:35:02web or on a mobile app and they're just plugging your input
    • 2:35:06into a SQL query that they wrote in advance that just is
    • 2:35:09waiting for your user name or password to be plugged in.
    • 2:35:12So they can ask a complete query of the database.
    • 2:35:15It turns out that if I'm a bit malicious or curious,
    • 2:35:18I could maybe type in some funky syntax
    • 2:35:21fields like this that I know have special meaning to certain databases.
    • 2:35:25And it turns out in the world of sequel sequel light, in particular,
    • 2:35:29single quotes are clearly important because I've been using them all
    • 2:35:32day long for strings and I didn't mention this yet.
    • 2:35:35But the comment character in SQL light is dash dash.
    • 2:35:38If you want the rest of a line to be ignored, you just say dash dash.
    • 2:35:41So it's not hash it's not slash slash like in Python and C respectively,
    • 2:35:45it's dash dash or two hyphen.
    • 2:35:46So
    • 2:35:47suppose that I'm a hacker or curious student and I want to see if
    • 2:35:50like Harvard knows what it's doing when it comes to SQL injection attacks.
    • 2:35:53I could literally type in maybe my user name with a single quote and then dash dash.
    • 2:35:58Well, why would I do this? Well, suppose for the sake of discussion that some
    • 2:36:02developer at Harvard
    • 2:36:04or any website really has written a line of code like this to
    • 2:36:07get the to check if the user name and password just typed in
    • 2:36:11match what's in the database? So how might I do this?
    • 2:36:14And if so let the user log in, show them their account or whatever.
    • 2:36:18So here's the line of code in question.
    • 2:36:20Um Select star from users where user name
    • 2:36:24equals question mark and password equals question mark.
    • 2:36:26This is correct. This is green because it's good.
    • 2:36:29It is not vulnerable to attack because I'm using these placeholders
    • 2:36:33here which even though we've implemented in the CS 50 library,
    • 2:36:36most sequel libraries support the same syntax, but it can vary by system.
    • 2:36:40This is safe.
    • 2:36:41What is not safe is to use some of the stuff you learned last week where you
    • 2:36:45can just use F strings or format your
    • 2:36:47own strings and interpolate values with curly braces.
    • 2:36:50For instance, suppose you took me at my word last week that you can do this in Python.
    • 2:36:54This code here, unfortunately, in yellow is not safe.
    • 2:36:58Here's a format string in Python. Here's the beginning of a SQL query.
    • 2:37:02Here's a placeholder for the user name, user name,
    • 2:37:04here's a placeholder for the user's password and
    • 2:37:06I've proactively correctly put single quotes around them.
    • 2:37:10The problem is if you just blindly plug user input
    • 2:37:14into premade queries like this,
    • 2:37:17they can finish your thought for you in ways you don't expect if you trust the users.
    • 2:37:22For instance, if I plug in for my user name,
    • 2:37:24mailing at Harvard dot edu, single quote dash dash.
    • 2:37:28Notice what happens. Here's the single quote from the query. Here's what I typed in.
    • 2:37:34But wait a minute,
    • 2:37:35it looks like this single quote that I typed in finishes.
    • 2:37:38The thought that the developer started the dash
    • 2:37:41dash means heck ignore the rest of that.
    • 2:37:43And indeed,
    • 2:37:44I've left it grayed out because what effectively happens is that only executed
    • 2:37:48is what's in yellow here at the moment. Why?
    • 2:37:51Because everything after that closed quote which
    • 2:37:53finishes the developer's thought is just ignored.
    • 2:37:56So that, I mean,
    • 2:37:56this is literally an example of how you can hack
    • 2:37:58into a database by injecting sequel like dash dash.
    • 2:38:01This is an example of sequel silly as it is. It's a comment
    • 2:38:05that tells the database to ignore the password. So what does this mean?
    • 2:38:08Well, of course,
    • 2:38:09you're gonna get Rose back now because if
    • 2:38:11you're only searching for someone by user name,
    • 2:38:13it doesn't matter what their password is what they typed in.
    • 2:38:16You've essentially blacked out that part and
    • 2:38:18you're not even checking the password.
    • 2:38:20The effect then would be that you could log in
    • 2:38:22as me or Carter or anyone else just by name
    • 2:38:25their user names. If the Harvard developer wrote code in this way.
    • 2:38:30And even though I'm pretty sure Harvard key does not suffer from this.
    • 2:38:33So many darn websites have in the past.
    • 2:38:35And if you Google SQL injection attacks, search by and Google, for instance,
    • 2:38:39like the past month or the past year,
    • 2:38:41you will tragically likely see results because
    • 2:38:43humans continue to make this mistake.
    • 2:38:45The solution though ultimately
    • 2:38:47is actually just use placeholders,
    • 2:38:49just use the library that escapes potentially dangerous input.
    • 2:38:54And this looks a little weird,
    • 2:38:55but in c we saw that the escape character was a backslash and that
    • 2:38:59may backslash n or backslash something else
    • 2:39:01be treated specially weirdly in sequel.
    • 2:39:04It tends to be just another single quote. So if you do quote, quote,
    • 2:39:08that actually means I, I mean, a literal quote.
    • 2:39:10It's not like the empty string or nothing in between it. So it looks weird
    • 2:39:14but long story short,
    • 2:39:15if you use a library like CS fifties or anything
    • 2:39:18in the real world that handles escaping of user input,
    • 2:39:21the whole query you can think of as being now good and green again because it doesn't
    • 2:39:25matter what the human types in any scary
    • 2:39:27characters will be properly escaped by the database.
    • 2:39:31What's the equivalent then
    • 2:39:33uh when it comes to?
    • 2:39:35Um,
    • 2:39:36so it'll depend on the library you're using,
    • 2:39:38but it almost always is the syntax using question marks
    • 2:39:41or some similar placeholder.
    • 2:39:44All right.
    • 2:39:45So with that said,
    • 2:39:46you are now inducted into the hall of people
    • 2:39:49who know now a little something about databases,
    • 2:39:52we've only just scratched the surface of using the language,
    • 2:39:54but it's now something we'll use to
    • 2:39:55build up more and more interesting applications,
    • 2:39:57especially for final projects when we soon transition
    • 2:39:59to web programming or mobile app development.
    • 2:40:01If you go that route,
    • 2:40:03but you'll soon be able to speak a language language literally and figuratively
    • 2:40:07that those before you have acquired as well. So you are now qualified to understand
    • 2:40:12this sort of internet joke.
    • 2:40:14This is someone who if I zoom in
    • 2:40:16was trying to get out of paying some camera based tolls by tricking the state or
    • 2:40:21the city into deleting or dropping their whole
    • 2:40:23database drop means delete the whole thing,
    • 2:40:25not just the rows they're in.
    • 2:40:26Ok. So maybe not that funny.
    • 2:40:28But this is now with the, the note will end on similar to other XK CD comics.
    • 2:40:32We've introduced you to like every CS person out there has seen knows this comment.
    • 2:40:37So if you ever refer to with a wink,
    • 2:40:39uh of uh if you ever refer to little bobby tables with a wink,
    • 2:40:43if it's a computer scientist on the other end, they'll know whom you mean?
    • 2:40:51Ok. There we go.
    • 2:40:55All right, tough crowd. All right, Batman.
    • 2:40:57Come on down for your cookies and we'll see you next time.
  • CS50.ai
Shortcuts
Before using a shortcut, click at least once on the video itself (to give it "focus") after closing this window.
Play/Pause spacebar or k
Rewind 10 seconds left arrow or j
Fast forward 10 seconds right arrow or l
Previous frame (while paused) ,
Next frame (while paused) .
Decrease playback rate <
Increase playback rate >
Toggle captions on/off c
Toggle mute m
Toggle full screen f or double-click video