CS50 Video Player
    • 🧁

    • 🍭

    • 🍇

    • 🍿
    • 0:00:00Introduction
    • 0:01:17SQL
    • 0:03:21Data Collection
    • 0:07:46Flat File Databases
    • 0:09:24Reading from a CSV
    • 0:14:26Filtering Duplicates
    • 0:21:21Counting Titles
    • 0:28:53Sorting Data
    • 0:30:26Lambda Functions
    • 0:36:07Cleaning Data
    • 0:39:48Regular Expressions
    • 0:45:50Search
    • 0:48:17Relational Databases
    • 0:51:20Create, Read, Update, Delete
    • 0:53:32sqlite3
    • 0:57:09SELECT
    • 1:01:58WHERE
    • 1:04:37DELETE
    • 1:05:59UPDATE
    • 1:11:31Redundancies
    • 1:18:00Table Relationships
    • 1:27:49INSERT
    • 1:30:10Python and SQL
    • 1:39:55shows.db
    • 1:41:49SQL Types
    • 1:43:11Many-to-Many Relationships
    • 1:48:27Indexes
    • 1:59:14JOIN
    • 2:03:38SQL Injection Attacks
    • 2:11:15Race Conditions
    • 2:20:10Transactions
    • 2:21:26This was CS50
    • 0:00:00[MUSIC PLAYING]
    • 0:00:03[MUSIC PLAYING]
    • 0:01:18DAVID J. MALAN: This is CS50.
    • 0:01:20And this is week 7, the week, here, of Halloween.
    • 0:01:23Indeed, special thanks to CS50's own Valerie and her mom
    • 0:01:26for having created this very festive scenery, and all past ones as well.
    • 0:01:30Today, we pick up where we left off last time,
    • 0:01:33which, recall, we introduced Python.
    • 0:01:35And that was our big transition from C, where suddenly things
    • 0:01:38started to look new again, probably, syntactically.
    • 0:01:40But also, probably things hopefully started to feel easier.
    • 0:01:44Well, with that said, problem set 6 certainly added some challenges,
    • 0:01:48and you did some new things.
    • 0:01:49But hopefully you've begun to appreciate that with Python, just a lot more stuff
    • 0:01:53is easier to do.
    • 0:01:54You get more out of the box with the language itself.
    • 0:01:56And that's going to be so useful over the coming weeks
    • 0:01:59as we transition further to introducing something called databases today,
    • 0:02:03web programming next week and the week after.
    • 0:02:06So that by term's end, and perhaps even for your final project,
    • 0:02:08you really are building something from scratch
    • 0:02:11using all of these various tools somehow together.
    • 0:02:15So before we do that, though, today, let's consider
    • 0:02:17what we weren't really able to do last week, which was actually
    • 0:02:22create and store data ourselves.
    • 0:02:25In Python, we've played around with the CSV, comma-separated values library.
    • 0:02:30And you've been able to read in CSVs from disk,
    • 0:02:33so to speak, that is, from files in your programming environments.
    • 0:02:37But we haven't necessarily started saving data, persisting data ourselves.
    • 0:02:41And that's a huge limitation, because pretty much all of the examples
    • 0:02:44we've done thus far with a couple of exceptions
    • 0:02:46have involved my providing input at the keyboard or even vocally.
    • 0:02:49But then nothing happens to it.
    • 0:02:51It disappears the moment the program quits,
    • 0:02:53because it was only being stored in memory.
    • 0:02:55But today, we'll start to focus all the more on storing things on disk,
    • 0:02:59that is, storing things in files and folders
    • 0:03:02so that you can actually write programs that remember
    • 0:03:04what it is the human did last time.
    • 0:03:06And ultimately, you can actually make mobile or web apps
    • 0:03:09that actually begin to grow, and grow, and grow their data sets,
    • 0:03:12as might happen if you get more and more users, for instance, on a website.
    • 0:03:15To play, then, with this new capability of being able to write files,
    • 0:03:19let's go ahead and just collect some data.
    • 0:03:22In fact, those of you here in person, if you
    • 0:03:24want to pull up this URL on your phone or laptop,
    • 0:03:26that's going to lead you to a Google Form.
    • 0:03:29And that Google Form is going to ask you in just a moment for really
    • 0:03:34just your favorite TV show.
    • 0:03:35And it's going to ask you to categorize it
    • 0:03:37according to a genre, like comedy, or drama, or action, or musical,
    • 0:03:41or something like that.
    • 0:03:42And this is useful, because if you've ever
    • 0:03:44used a Google Form before, or Microsoft's equivalent with Office 365,
    • 0:03:47it's a really useful mechanism at just collecting data from users,
    • 0:03:50and then ultimately, putting it into a spreadsheet form.
    • 0:03:54So this is a screenshot of the form that those of you
    • 0:03:58here in person or tuning in on Zoom are currently filling out.
    • 0:04:01It's asking only two questions.
    • 0:04:02What's the title of your favorite TV show?
    • 0:04:04And what are one or more genres into which your TV show falls?
    • 0:04:09And I'll go ahead and pivot now to the view
    • 0:04:12that I'll be able to see as the person who created this form, which
    • 0:04:15is quite simply a Google spreadsheet.
    • 0:04:17Google Forms has this nice feature, if you've never noticed,
    • 0:04:19that allows you to export your data to a Google Spreadsheet.
    • 0:04:22And then from there, we can actually grab the file
    • 0:04:25and download it to my own Mac or your own PC
    • 0:04:27so that we can actually play around with the data that's come in.
    • 0:04:30So in fact, let me go ahead and slide over
    • 0:04:32to this, the live Google Spreadsheet.
    • 0:04:36And you'll see, probably, a whole bunch of familiar TV shows here,
    • 0:04:40all coming in.
    • 0:04:41And if we keep scrolling, and scrolling, and scrolling--
    • 0:04:43only 46, 47.
    • 0:04:45There we go, up to 50 plus already.
    • 0:04:47If you need that URL again here, if you're just tuning in,
    • 0:04:50you can go to this URL here.
    • 0:04:52And in just a moment, we'll have a bunch of data
    • 0:04:55with which we can start to experiment.
    • 0:04:59I'll give you a moment or so there.
    • 0:05:06All right.
    • 0:05:08Let me hang in there a little longer.
    • 0:05:09OK, we've got over 100 submissions.
    • 0:05:11Good.
    • 0:05:12Good, even more coming in now.
    • 0:05:15And we can see them coming in live.
    • 0:05:16Here, let me switch back to the spreadsheet.
    • 0:05:18The list is growing, and growing, and growing.
    • 0:05:21And in just a moment--
    • 0:05:22let me give Carter a moment to help me export it in real time.
    • 0:05:26Carter, just give me a heads up when it's reasonable for me
    • 0:05:29to download this file.
    • 0:05:32All right, and I'll begin to do this very slowly.
    • 0:05:35So I'm going to go up to the File menu, if you've never done this before.
    • 0:05:37Download-- you can download a whole bunch of formats, one in Excel.
    • 0:05:40But more simply, and the one we'll start to play with here,
    • 0:05:42is comma-separated values.
    • 0:05:43So CSV files we used this past week, why are they useful?
    • 0:05:47Now that you've played with them or used them in past real world,
    • 0:05:50what's the utility of a CSV file versus something like Excel, for instance?
    • 0:05:55Why CSV in the first place?
    • 0:05:59Any instincts?
    • 0:06:00Yeah?
    • 0:06:00AUDIENCE: Because it's just a text file?
    • 0:06:02DAVID J. MALAN: OK, so storage is compelling.
    • 0:06:04A simple text file with ASCII or Unicode text is probably pretty small.
    • 0:06:07I like that.
    • 0:06:08Other thoughts?
    • 0:06:10AUDIENCE: Structure of it?
    • 0:06:11DAVID J. MALAN: Yeah, well said.
    • 0:06:12It's just a simple text format, but using conventions
    • 0:06:14like commas you can represent the idea of columns using new lines,
    • 0:06:18backslash ends invisibly at the end of your lines,
    • 0:06:20you can create the idea of rows.
    • 0:06:22So it's a very simple way of implementing what
    • 0:06:24we might call a flat-file database.
    • 0:06:26It's a way of storing data in a flat, that is,
    • 0:06:28very simple file that's just pure ASCII or Unicode text.
    • 0:06:32And more compellingly, I dare say, is that with a CSV file,
    • 0:06:35it's completely portable.
    • 0:06:36Something is portable in the world of computing
    • 0:06:38if it means you can use it on a Mac or a PC running this operating system,
    • 0:06:41or this other one.
    • 0:06:42And portability is nice because if I were to download an Excel file,
    • 0:06:45there'd be a whole bunch of people in this room and online
    • 0:06:47who couldn't download it because they haven't bought Microsoft Excel
    • 0:06:50or installed it.
    • 0:06:50Or if they have a Mac, or if it's a .numbers file in the Mac world,
    • 0:06:55a PC user might not be able to download it.
    • 0:06:57So a CSV is indeed very portable.
    • 0:06:59So I'm going to go ahead and download, quite simply, the CSV
    • 0:07:02version of this file.
    • 0:07:04That's going to put it onto my own Mac's Downloads folder.
    • 0:07:06And let me go ahead here, and in just a moment, let me just simplify the name.
    • 0:07:11Because it actually downloads it at a pretty large name.
    • 0:07:14And give me just one moment here, and you'll see that, indeed,
    • 0:07:18on my Mac I have a file called favorites.csv.
    • 0:07:21I shortened the name real quick.
    • 0:07:22And now what I'm going to do is go over to VS Code, and in VS Code,
    • 0:07:28I'm going to open my File Explorer.
    • 0:07:30And if I minimize my window here for a moment, a handy feature of VS Code
    • 0:07:34is that you can just drag and drop a file, for instance, into your Explorer.
    • 0:07:37And voila, it's going to automatically upload it for you.
    • 0:07:40So let me go ahead and full screen here, close my Explorer,
    • 0:07:43temporarily close my Terminal window.
    • 0:07:45And you'll see here a CSV file, favorites.csv.
    • 0:07:48And the first row, by convention, has whatever
    • 0:07:51the columns were in Google Spreadsheets, or Office 365,
    • 0:07:54in Excel online, timestamp, comma, title, comma, genres.
    • 0:07:58Then, we have timestamps, which indicates
    • 0:08:00when people started submitting.
    • 0:08:01Looks like a couple of people were super eager to get
    • 0:08:03started an hour or two ago.
    • 0:08:05And then, you have the title next, after a comma.
    • 0:08:09But there's kind of a curiosity after that.
    • 0:08:12Sometimes I see the genre like comedy, comedy, comedy,
    • 0:08:15but sometimes it's like crime, comma, drama, or action, comma, crime, comma,
    • 0:08:19drama.
    • 0:08:20And those things are quoted.
    • 0:08:22And yet, I didn't do any quotes.
    • 0:08:24You probably didn't type any quotes.
    • 0:08:25Where are those quotes coming from in this CSV file?
    • 0:08:30Why are they there if we infer?
    • 0:08:31Yeah?
    • 0:08:32AUDIENCE: [INAUDIBLE]
    • 0:08:35DAVID J. MALAN: Yeah, so you have a corner case, if you will.
    • 0:08:38Because if you're using commas, as you described,
    • 0:08:40to separate your data into what are effectively columns, well,
    • 0:08:44you've painted yourself into a corner if your actual data
    • 0:08:47has commas in it itself.
    • 0:08:48So what Google has done, what Microsoft does, what Apple does
    • 0:08:51is, they quote any strings of text that themselves
    • 0:08:54have commas so that these are now English grammatical commas,
    • 0:08:58not CSV specific commas.
    • 0:09:00So it's a way of escaping your data, if you will.
    • 0:09:03And escaping just means to call out a symbol in a special way
    • 0:09:06so it's not misinterpreted as something else.
    • 0:09:08All right, so this is all to say that we now
    • 0:09:10have all of this data with which we can play in the form of what we'll
    • 0:09:13start calling a flat-file database.
    • 0:09:15So suppose I wanted to now start manipulating this data,
    • 0:09:19and I want to store it ultimately, indeed, in this CSV format.
    • 0:09:22How can I actually start to read this data,
    • 0:09:24maybe clean it up, maybe do some analytics on it
    • 0:09:26and actually figure out, what's the most popular show among those who submitted
    • 0:09:30here over the past few minutes?
    • 0:09:32Well, let me go ahead and close this.
    • 0:09:34Let me go ahead, then, and open up, for instance, just my Terminal window.
    • 0:09:38And let's code up a file called favorites.py.
    • 0:09:41And let's go ahead and iteratively start simple by just opening up this file
    • 0:09:46and printing out what's inside of it.
    • 0:09:47So you might recall that we can do this by doing something like import CSV
    • 0:09:51to give myself some CSV reading functionality.
    • 0:09:54Then, I can go ahead and do something like with open, the name of the file
    • 0:09:59that I want to open in read mode.
    • 0:10:02Quote, unquote, "r" means to read it.
    • 0:10:03And then, I can say as file, or whatever other name
    • 0:10:06for a variable to say that I want to open this file,
    • 0:10:09and essentially store some kind of reference to it in that variable
    • 0:10:12called file.
    • 0:10:13Then, I can give myself a reader, and I can say csv.reader,
    • 0:10:16passing in that file as input.
    • 0:10:18And this is the magic of that library.
    • 0:10:19It deals with the process of opening it, reading it, and giving you
    • 0:10:23back something that you can just iterate over, like with a for loop
    • 0:10:26I do want to skip the first row, and recall that I can do this.
    • 0:10:30Next, reader, is this little trick that just says, ignore the first row.
    • 0:10:33Because the first one is special.
    • 0:10:35It said timestamp, title, genres.
    • 0:10:37That's not your data, that was mine.
    • 0:10:39But this means now that I've skipped that first row.
    • 0:10:41Everything hereafter is going to be the title of a show
    • 0:10:44that you all like, so let me do this.
    • 0:10:46For row in the reader, let's go ahead and print out the title
    • 0:10:51of the show each of you typed in.
    • 0:10:52How do I get at the title of the show each of you typed in?
    • 0:10:57It's somewhere inside of row.
    • 0:10:58Row recalls a list.
    • 0:11:00So what do I want to type next in order to get
    • 0:11:02at the title of the current row just as a quick check here?
    • 0:11:08What do I want to type to get at the title of the row,
    • 0:11:11keeping in mind, again, that it was timestamp, title, genres?
    • 0:11:15Yeah?
    • 0:11:16AUDIENCE: [INAUDIBLE]
    • 0:11:16DAVID J. MALAN: So row bracket 1 would give me
    • 0:11:18the second column, 0 index, that is, the one in the middle with the title.
    • 0:11:22So this program isn't that interesting yet,
    • 0:11:24but it's a quick and dirty way to figure out, all right, what's my data
    • 0:11:27look like?
    • 0:11:27Let me actually just do a little bit of a check here
    • 0:11:29and see if it contains the data I think it does.
    • 0:11:32Let me maximize my Terminal window here.
    • 0:11:34Let me run Python of favorites.py, hitting Enter.
    • 0:11:37And you'll see now a purely textual list of all of the shows
    • 0:11:42you all seem to like here.
    • 0:11:44But what's noteworthy about it?
    • 0:11:47Specific shows aside, judgment aside as to people's
    • 0:11:49TV tastes, what's interesting or noteworthy about the data that
    • 0:11:54might create some problems for us if we start to analyze this data,
    • 0:11:58and figure out what's the most popular?
    • 0:11:59How many people like this or that?
    • 0:12:02What do you think?
    • 0:12:03Yeah?
    • 0:12:04AUDIENCE: User errors [INAUDIBLE].
    • 0:12:07DAVID J. MALAN: Yeah, there might be user errors,
    • 0:12:09or just stylistic differences that give the appearance that one show
    • 0:12:12is different from the other.
    • 0:12:14For instance, here.
    • 0:12:16Let's see if I can see an example on the screen here.
    • 0:12:20Yeah, so friends here is an all lowercase, Friends here is capitalized.
    • 0:12:24No big deal.
    • 0:12:25We can sort of mitigate that.
    • 0:12:26But this is just a tiny example of where data in the real world
    • 0:12:29can get messy fast.
    • 0:12:30And that probably wasn't even a typo.
    • 0:12:32It was just someone not caring as much to capitalize it, and that's fine.
    • 0:12:37Your users are going to type what they're going to type.
    • 0:12:39So let's see if we can't now begin to get at more specific data,
    • 0:12:43and maybe even clean some of this data up.
    • 0:12:45Let me go back into my file called favorites.py here,
    • 0:12:50and let's actually do something a little more user friendly for me.
    • 0:12:54Instead of a reader, recall that there was this dictionary reader that's
    • 0:12:58just a little more user friendly.
    • 0:12:59And it means I can type in dictionary reader here, passing in the same file.
    • 0:13:04But now, when I iterate over this reader variable, what is each row?
    • 0:13:10When using a DictReader instead of a reader, recall, and this
    • 0:13:14is just a peculiarity of the CSV library,
    • 0:13:17this gives me back, not a list of cells, but what instead,
    • 0:13:22which is marginally more user friendly for me?
    • 0:13:24Yeah?
    • 0:13:25AUDIENCE: [INAUDIBLE]
    • 0:13:27DAVID J. MALAN: Yeah.
    • 0:13:28I can now use open bracket, quotes, and the title.
    • 0:13:30Because what's coming back now is a dict object, that is,
    • 0:13:33a dictionary which has keys and values.
    • 0:13:37The keys of which are the column headings.
    • 0:13:39The values of which are the data I actually care about.
    • 0:13:41So this is just marginally better because, one, it's
    • 0:13:44just way more obvious to me, the author of this code, what it is I'm
    • 0:13:47getting at.
    • 0:13:48I don't remember what column the title was.
    • 0:13:50Was it 0?
    • 0:13:50Was it 1?
    • 0:13:51Was it 2?
    • 0:13:51That's something you're going to forget over time.
    • 0:13:53And God forbid someone changes the data by just dragging and dropping
    • 0:13:56the columns in Excel, or Apple Numbers, or Google Spreadsheets.
    • 0:13:59That's going to break all of your numeric indices.
    • 0:14:01And so a dictionary reader is arguably just
    • 0:14:03better design because it's more robust against changes
    • 0:14:07and potential errors like that.
    • 0:14:08Now the effect of this change isn't going to be really any different.
    • 0:14:12If I run Python of favorites.py, voila, I get all of the same results.
    • 0:14:16But I've now not made any assumptions as to where each of the columns
    • 0:14:21actually is numerically.
    • 0:14:22All right.
    • 0:14:23Well, let's go ahead and now filter out some duplicates.
    • 0:14:26Because there's a lot of commonality among some of the shows here, so let's
    • 0:14:29see if we can't filter out duplicates.
    • 0:14:32If I'm reading a CSV file top to bottom, what intuitively might be the logic
    • 0:14:38I want to implement to filter out duplicates?
    • 0:14:41It's not going to be quite as simple as a simple function that does it for me.
    • 0:14:44I'm going to have to build this.
    • 0:14:47But logically, if you're reading a file from top to bottom,
    • 0:14:50how might you go about, in Python or just any context,
    • 0:14:54getting rid of duplicate values?
    • 0:14:58Yeah, what do you think?
    • 0:14:59AUDIENCE: [INAUDIBLE]
    • 0:15:05DAVID J. MALAN: Sure.
    • 0:15:06I could use a list and I could add each title to the list,
    • 0:15:09but first check if I put this into the list before.
    • 0:15:13So let's try a little something like that.
    • 0:15:15Let me go ahead and create a variable at the top of my program here.
    • 0:15:18I'll call it titles, for instance, initialize to an empty list,
    • 0:15:21open bracket, close bracket.
    • 0:15:23And then, inside of my loop here, instead of printing it out,
    • 0:15:27let's start to make a decision.
    • 0:15:29So if the current row's title is in the titles list
    • 0:15:39I don't want to put it there.
    • 0:15:40And actually, let me invert the logic so I'm doing something proactively.
    • 0:15:43So if it's not the case that row bracket title is in titles,
    • 0:15:48then, go ahead and do something like titles.append the current row's title.
    • 0:15:56And recall that we saw .append a week or so ago,
    • 0:15:59where it just allows you to append to the current list.
    • 0:16:01And then, what can I do at the very end, after I'm all
    • 0:16:04done reading the whole file?
    • 0:16:06Why don't I go ahead and say, for title in titles,
    • 0:16:09go ahead and print out the current title?
    • 0:16:12So it's two loops now, and we can come back to the quality of that design.
    • 0:16:16But let me go ahead here and rerun Python of favorites.py.
    • 0:16:19Let me increase the size of my Terminal window so we can focus just on this,
    • 0:16:22and hit Enter.
    • 0:16:23And now, I'm just skimming.
    • 0:16:28I don't think I'm seeing duplicates, although I
    • 0:16:31am seeing some near duplicates.
    • 0:16:34For instance, there's Friends again.
    • 0:16:36And if we keep going, and going, and going, and going,
    • 0:16:40there's Friends again.
    • 0:16:41Oh, interesting, so that's curious that I seem to have multiple Friends,
    • 0:16:47and I have this one here, too.
    • 0:16:48So how might we clean this up further?
    • 0:16:51I like your instincts, and it's a step closer to it.
    • 0:16:53What are we going to have to do to really filter out
    • 0:16:55those near duplicates?
    • 0:16:57Any thoughts?
    • 0:16:59AUDIENCE: You could set everything to lower [INAUDIBLE]..
    • 0:17:04DAVID J. MALAN: Yeah.
    • 0:17:04What are the common mistakes to summarize?
    • 0:17:06We could ignore the capitalization altogether and maybe
    • 0:17:09just force everything to lowercase, or everything to uppercase.
    • 0:17:12Doesn't matter which, but let's just be consistent.
    • 0:17:13And for those of you who might have accidentally or instinctively hit
    • 0:17:16the spacebar at the beginning of your input or even at the end,
    • 0:17:19we can strip that off, too.
    • 0:17:21Stripping whitespace is a common thing just to clean up user input.
    • 0:17:24So let me go back into my code here, and let me go ahead
    • 0:17:27and tweak the title a little bit.
    • 0:17:30Let me say that the current title inside of this loop
    • 0:17:33is not going to be just the current row's title.
    • 0:17:36But let me go ahead and strip off, from the left and the right implicitly,
    • 0:17:40any whitespace.
    • 0:17:41If you read the documentation for the strip function, it does just that.
    • 0:17:44It gets rid of whitespace to the left, whitespace to the right.
    • 0:17:47And then, if I want to force everything to maybe uppercase,
    • 0:17:50I can just uppercase the entire string.
    • 0:17:52And remember, what's handy about Python is you can chain some of these function
    • 0:17:56calls together by just using dots again and again.
    • 0:17:59And that just takes whatever just happened,
    • 0:18:01like the whitespace got stripped off, then, it additionally
    • 0:18:04uppercases the whole thing as well.
    • 0:18:06So now, I'm going to just check whether this specific title is in titles.
    • 0:18:10And if not, I'm going to go ahead and append that title,
    • 0:18:14massaged into this different format, if you will.
    • 0:18:17So I'm throwing away some information.
    • 0:18:19I'm sacrificing all of the nuances of your grammar and input
    • 0:18:23to the form itself.
    • 0:18:25But at least I'm trying to canonicalize size, that is,
    • 0:18:27standardize what the data actually looks like.
    • 0:18:29So let me go ahead and run Python of favorites.py again and hit Enter.
    • 0:18:33Oh, and this is just user error.
    • 0:18:35Maybe you haven't seen this before.
    • 0:18:36This just looks like a mistake on my part.
    • 0:18:40I meant to say not even uppercase.
    • 0:18:43That's completely wrong.
    • 0:18:44The function is called upper, now that I think of it.
    • 0:18:46All right.
    • 0:18:46Let's go and increase the size of the Terminal window again.
    • 0:18:49Run Python of favorites.py.
    • 0:18:50And now, it's a little more overwhelming to look at because it's not sorted yet
    • 0:18:55and it's all capitalized.
    • 0:18:57But I don't think I'm seeing multiple Friends, so to speak.
    • 0:19:03There's one Friends up here and that's it.
    • 0:19:06I'm back up at my prompt already.
    • 0:19:07So we seem now to be filtering out duplicates.
    • 0:19:10Now, before we dive in further and clean this up further than this,
    • 0:19:13what else could we have done?
    • 0:19:14Well, it turns out that in Python 2 you often
    • 0:19:17do get a lot of functionality built into the language.
    • 0:19:19And I'm kind of implementing myself the idea of a set.
    • 0:19:22If you think back to mathematics, a set is typically
    • 0:19:24something with a bunch of values that has duplicates filtered out.
    • 0:19:28Recall that Python already has this for us.
    • 0:19:31And we saw it really briefly when I whipped up the dictionary
    • 0:19:34implementation a couple of weeks back.
    • 0:19:36So I could actually define my titles to be a set instead of a list,
    • 0:19:40and this would just modestly allow me to refine my code here, such
    • 0:19:46that I don't have to bother checking for duplicates anyway.
    • 0:19:48I can instead just say something like, titles.add
    • 0:19:52the current title, like this.
    • 0:19:55Marginally better design if you know that a set exists because you're just
    • 0:19:58getting more functionality out of this.
    • 0:20:00All right, so let's clean the data up further.
    • 0:20:02We've now gone ahead and fixed the problem of case sensitivity.
    • 0:20:06We threw away whitespace in case someone had hit the spacebar
    • 0:20:08with some of the input.
    • 0:20:09Let's go ahead now and sort these things by the titles themselves.
    • 0:20:13So instead of just printing out the titles in the same order
    • 0:20:17you all inputted them, but filtering out duplicates as we go, let me go ahead
    • 0:20:22and use another function in Python you might not have seen,
    • 0:20:24which is literally called sorted, and will
    • 0:20:26take care of the process of actually sorting titles for you.
    • 0:20:31Let me go ahead and increase the font size of my Terminal,
    • 0:20:34run Python of favorites.py, and hit Enter.
    • 0:20:36And now you can really see how many of these shows start with the word "the"
    • 0:20:40or do not.
    • 0:20:41Now it's a little easier to wrap our minds around,
    • 0:20:43just because it's at least sorted alphabetically.
    • 0:20:46But now you can really see some of the differences in people's inputs.
    • 0:20:50So far, so good.
    • 0:20:51But a few of you decided to stylize Avatar in three different ways here.
    • 0:20:56Brooklyn 99 is a couple of different ways here.
    • 0:20:59And I think if we keep going we'll see further and further variances that we
    • 0:21:02did not fix by focusing on whitespace and capitalization alone.
    • 0:21:06So already here, this is only, what, 100 plus, 200 rows.
    • 0:21:09Already real-world data starts to get messy quickly,
    • 0:21:12and that might not bode well when we actually
    • 0:21:14want to keep around real data from real users.
    • 0:21:16You can imagine an actual website or a mobile application
    • 0:21:19dealing with this kind of thing on scale.
    • 0:21:21Well, let's go ahead and do this.
    • 0:21:23Let's actually figure out the popularity of these various shows
    • 0:21:26by now iterating over my data, and keeping track of how many of you
    • 0:21:31inputted a given title.
    • 0:21:33We're going to ignore the problems like Brooklyn 99 and the Avatar.
    • 0:21:37Sorry, yeah, Avatar, where there was things
    • 0:21:42that were different beyond just whitespace and capitalization.
    • 0:21:47But let's go ahead and keep track of, now,
    • 0:21:48how many of you inputted each of these titles.
    • 0:21:52So how can I do this?
    • 0:21:53I'm still going to take this approach of iterating over
    • 0:21:55the CSV file from top to bottom.
    • 0:21:58We've used a couple of data structures thus far,
    • 0:22:00a list to keep track of titles, or a set to keep track of titles.
    • 0:22:04But what if I now want to keep around a little more information?
    • 0:22:07For each title, I want to keep around how many times I've seen it before.
    • 0:22:13I'm not doing that yet.
    • 0:22:14I'm throwing away the total number of times I see these shows.
    • 0:22:17How could I start to keep that around?
    • 0:22:20AUDIENCE: Use a dictionary.
    • 0:22:21DAVID J. MALAN: We could use a dictionary, and how?
    • 0:22:23Elaborate on that.
    • 0:22:25AUDIENCE: [INAUDIBLE]
    • 0:22:26DAVID J. MALAN: Perfect, really good instincts.
    • 0:22:28Using a dictionary, insofar as it lets us
    • 0:22:29store keys and values, that is, associate something with something
    • 0:22:33else.
    • 0:22:33This is why a dictionary or hash tables more
    • 0:22:36generally are such a useful, practical data structure.
    • 0:22:39Because they just let you remember stuff in some kind of structured way.
    • 0:22:43So if the keys are going to be the titles I've seen,
    • 0:22:45the values could be the number of times I've seen each of those titles.
    • 0:22:49And so it's kind of like just having a two-column table on paper.
    • 0:22:54For instance, if I were going to do this on a piece of paper,
    • 0:22:56I might just have two columns here, where
    • 0:22:59maybe this is the title that I've seen, and this is the count over here.
    • 0:23:04This is, in effect, a dictionary in Python.
    • 0:23:08It's two columns, keys on the left, values on the right.
    • 0:23:11And this, if I can implement in code, will actually
    • 0:23:13allow me to store this data, and then maybe do some simple arithmetic
    • 0:23:17to figure out which is the most popular.
    • 0:23:19So let's do this.
    • 0:23:20Let me go ahead and change my titles to not be a list, not be a set.
    • 0:23:24Let's have it be a dictionary instead, either doing this, or more succinctly,
    • 0:23:29two curly braces that are empty gives me an empty dictionary automatically.
    • 0:23:33What do I now want to do?
    • 0:23:34I think most of my code can stay the same.
    • 0:23:37But down here, I don't want to just blindly add titles
    • 0:23:40to the data structure.
    • 0:23:42I somehow need to keep track of the count.
    • 0:23:45And unfortunately, if I just do this-- let's do titles,
    • 0:23:48bracket, title, plus equals 1.
    • 0:23:53This is a reasonable first attempt at this.
    • 0:23:56Because what am I doing?
    • 0:23:57If titles is a dictionary and I want to look up the current title therein,
    • 0:24:02the syntax for that, like before, is titles, bracket, and then
    • 0:24:05the key you want to use to index into the dictionary.
    • 0:24:08It's not a number in this case, it's an actual word, a title.
    • 0:24:12And you're just going to increment it by one,
    • 0:24:14and then eventually I'll come back and finish my second loop
    • 0:24:17and do things in terms of the order.
    • 0:24:19But for now, let's just keep track of the total counts.
    • 0:24:23Let me go ahead and increase my Terminal window.
    • 0:24:25Let me do Python of favorites.py and hit Enter.
    • 0:24:29Huh.
    • 0:24:30How I Met Your Mother is giving me a key error.
    • 0:24:34What does that mean?
    • 0:24:36And why am I seeing this?
    • 0:24:39And in fact, just to give a little bit of a breadcrumb here,
    • 0:24:42let me zoom out here.
    • 0:24:44Let me open up the CSV file again real quickly.
    • 0:24:46And wow, we didn't even get past the second row in the file
    • 0:24:50or the first show in the file.
    • 0:24:52Notice that How I Met Your Mother, somewhat lowercased,
    • 0:24:54is the very first show in therein.
    • 0:24:57What's your instinct for why this is happening?
    • 0:24:59AUDIENCE: You don't have a starting point.
    • 0:25:01DAVID J. MALAN: I don't have a starting point.
    • 0:25:03I'm adding one to what?
    • 0:25:04I'm blindly indexing into the dictionary using a key, How I Met Your Mother,
    • 0:25:09that doesn't yet exist in the dictionary.
    • 0:25:11And so Python throws what's called a key error
    • 0:25:14because the key you're trying to use just doesn't exist yet.
    • 0:25:17So logically, how could we fix this?
    • 0:25:21We're close.
    • 0:25:21We got half of the problem solved, but I'm not handling the obvious, now,
    • 0:25:25case of nothing being there.
    • 0:25:26Yeah?
    • 0:25:27AUDIENCE: Creating a counter.
    • 0:25:28DAVID J. MALAN: Creating a--
    • 0:25:29AUDIENCE: Counter.
    • 0:25:30DAVID J. MALAN: Creating the counter itself.
    • 0:25:32So maybe I could do something like this.
    • 0:25:34Let me close my Terminal window and let me ask a question first.
    • 0:25:38If the current title is in the dictionary already, if title in titles,
    • 0:25:44that's going to give me a true-false answer it turns out.
    • 0:25:47Then, I can safely say, titles, bracket, title, plus equals 1.
    • 0:25:52And recall, this is just shorthand notation for the same thing as in C,
    • 0:25:56title plus 1.
    • 0:25:58Whoops, typo.
    • 0:25:59Don't do that.
    • 0:26:00That's the same thing as this but it's a little more succinct
    • 0:26:02just to say plus equals 1.
    • 0:26:04Else, if it's logically not the case that the current title is in the titles
    • 0:26:09dictionary, then I probably want to say titles, bracket, title equals?
    • 0:26:13Feel free to just shout it out.
    • 0:26:15AUDIENCE: Zero.
    • 0:26:15DAVID J. MALAN: Zero.
    • 0:26:16I just have to put some value there so that the key itself is also there.
    • 0:26:21All right.
    • 0:26:21So now that I've got this going on, let me go ahead
    • 0:26:24and undo my sorting temporarily.
    • 0:26:26And now let me go ahead and do this.
    • 0:26:29I can, as a quick check, let me go ahead and just run the code
    • 0:26:33as is, Python of favorites.py.
    • 0:26:35I'm back in business.
    • 0:26:37It's printing correctly, no key errors, but it's not sorted.
    • 0:26:39And I'm not seeing any of the counts.
    • 0:26:41Let me just quickly add the counts, and there's
    • 0:26:43a couple of ways I could do this.
    • 0:26:45I could, say, print out the title, and then, maybe, let's do something like--
    • 0:26:52how about just, comma, titles, bracket, title?
    • 0:26:57So I'm going to print two things at once,
    • 0:26:59both the current title in the dictionary,
    • 0:27:01and whatever its value is by indexing into it.
    • 0:27:04Let me increase my Terminal window.
    • 0:27:06Let me run Python of favorites.py, Enter, and OK.
    • 0:27:10Huh.
    • 0:27:11Huh.
    • 0:27:14None of you said a whole lot of TV shows, it seems.
    • 0:27:17What's the logical error here?
    • 0:27:21What did I do wrong if I look back at my code here?
    • 0:27:25Yeah?
    • 0:27:25Why so many 0s?
    • 0:27:27AUDIENCE: [INAUDIBLE]
    • 0:27:29DAVID J. MALAN: Exactly.
    • 0:27:30To summarize, I initialized the count to 0 the first time I saw it,
    • 0:27:35but I should have initialized it at least to 1 because I just saw it.
    • 0:27:38Or I should change my code a bit.
    • 0:27:40So for instance, if I go back in here, the simplest fix
    • 0:27:42is probably to initialize to 1, because on this iteration of the loop,
    • 0:27:46obviously, I'm seeing this title for the very first time.
    • 0:27:49Or I could change my logic a little bit.
    • 0:27:51I could do something like this instead.
    • 0:27:53If the current title is not in titles, then I could initialize it to 0.
    • 0:27:58And then I could get rid of the else, and now blindly index
    • 0:28:02into the titles dictionary.
    • 0:28:04Because now, on line 11, I can trust that lines 9 and 10
    • 0:28:09took care of the initialization for me if need be.
    • 0:28:12Which one is better?
    • 0:28:13I don't know.
    • 0:28:13This one's a little nicer, maybe because it's one line fewer.
    • 0:28:17But I think both approaches are perfectly reasonable and well-designed.
    • 0:28:20But the key thing, no pun intended, is that we
    • 0:28:22have to make sure the key exists before we presume to actually incrue.
    • 0:28:27Oh, this is wrong.
    • 0:28:28This is incorrect code.
    • 0:28:30What did I do wrong?
    • 0:28:33OK, yes.
    • 0:28:33There we go.
    • 0:28:34So otherwise, everyone would have liked this show once, and no matter
    • 0:28:37how many people said the same thing.
    • 0:28:39Now the code is as it should be.
    • 0:28:41So let me go ahead and open up my Terminal window again.
    • 0:28:43Let me run Python of favorites.py, and now we see more reasonable counts.
    • 0:28:47Some shows weren't that popular.
    • 0:28:49There's just 1s and maybe 2s.
    • 0:28:50But I bet if we sort these things we can start to see a little more detail.
    • 0:28:56So how else can we do this?
    • 0:28:57Well, turns out, when dealing with a dictionary like this--
    • 0:29:04let's go ahead and just sort the titles themselves.
    • 0:29:07So let's reintroduce the sorted function as I did before, but no other changes.
    • 0:29:12Let me go ahead now and run Python of favorites.py.
    • 0:29:14Now it's just a little easier to wrap your mind around it
    • 0:29:17because at least it's alphabetical.
    • 0:29:18But it's not sorted by value, it's sorted by key.
    • 0:29:22But sure enough, if we scroll down, there's something down here,
    • 0:29:26for instance, like, let's see, The Office.
    • 0:29:29That's definitely going to be a contender
    • 0:29:30for most popular, 15 responses.
    • 0:29:32But let's see what's actually going to bubble up to the top.
    • 0:29:35Unfortunately, the sorted function only sorts dictionaries by keys
    • 0:29:40by default, not by values.
    • 0:29:44But it turns out, in Python, if you read the documentation
    • 0:29:47for the sorted function, you can actually
    • 0:29:49pass in other arguments that tell it how to sort things.
    • 0:29:54For instance, if I want to do things in reverse order,
    • 0:29:57I can add a second parameter to the sorted function called reverse.
    • 0:30:02And it's a named parameter.
    • 0:30:03You literally say, reverse equals true, so
    • 0:30:05that the position of it in the comma-separated list doesn't matter.
    • 0:30:08If I now rerun this after increasing my Terminal window,
    • 0:30:11you'll see now that it's in the opposite order.
    • 0:30:13Now adventure and Anne with an E is at the bottom
    • 0:30:16of the output instead of the top.
    • 0:30:18How can I tell it to sort by values instead of by key?
    • 0:30:26Well, let's go ahead and do this.
    • 0:30:28Let me go ahead and define a function.
    • 0:30:30I'm just going to call it f to keep things simple.
    • 0:30:33And this f function is going to take a title as input.
    • 0:30:36And given a given title, it's going to return the value of that title.
    • 0:30:41So actually, maybe a better name for this would be get value,
    • 0:30:44and/or we could come up with something else as well.
    • 0:30:46The purpose of the get value function, to be clear,
    • 0:30:49is to take it as input a title and then return the corresponding value.
    • 0:30:54Why is this useful?
    • 0:30:55Well, it turns out that the sorted function in Python,
    • 0:30:57according to its documentation, also takes a key parameter,
    • 0:31:01where you can pass in, crazy enough, the name of a function
    • 0:31:05that it will use in order to determine what it should sort by, by the key,
    • 0:31:11or by the value, or in other cases, even other types of data as well.
    • 0:31:16So there's a curiosity here, though, that's very deliberate.
    • 0:31:19Key is the name of the parameter, just like reverse
    • 0:31:21was the name of this other parameter.
    • 0:31:23The value of it, though, is not a function call.
    • 0:31:26It's a function name.
    • 0:31:27Notice I am not doing this, no parentheses.
    • 0:31:30I'm instead passing in get value, the function I wrote, by its name.
    • 0:31:35And this is a feature of Python and certain other languages.
    • 0:31:37Just like variables, you can actually pass whole functions
    • 0:31:41around so that they can be called for you later on by someone else.
    • 0:31:45So what this means is that the sorted function written by Python,
    • 0:31:48they didn't know what you're going to want to sort by today.
    • 0:31:51But if you provide them with a function called get value, or anything else, now
    • 0:31:55their sorted function will use that function
    • 0:31:58to determine, OK, if you don't want to sort by the key of the dictionary, what
    • 0:32:01do you want to sort by?
    • 0:32:02This is going to tell it to sort by the value
    • 0:32:05by returning the specific value we care about.
    • 0:32:08So let me go ahead now and rerun this after increasing my Terminal, Python
    • 0:32:12of favorites.py, Enter.
    • 0:32:14Here we have now an example of all of the titles you all
    • 0:32:17typed in, albeit forced to uppercase and with any whitespace thrown out.
    • 0:32:22And now, The Office is an easy win over Friends,
    • 0:32:24versus Community, versus Game of Thrones, Breaking Bad, and then
    • 0:32:27a lot of variants thereafter.
    • 0:32:29So there's a lot of steps to go through.
    • 0:32:31This isn't that bad once you've done it once,
    • 0:32:33and you know what these functions are, and you
    • 0:32:35know that these parameters exist.
    • 0:32:36But it's a lot of work.
    • 0:32:38That's 17 lines of code just to analyze a CSV file
    • 0:32:42that you all created by way of those Google Form submissions.
    • 0:32:45But it took me a lot of work just to get simple answers out of it.
    • 0:32:48And indeed, that's going to be among the goals
    • 0:32:50for today, ultimately, is, how can we just make this easier?
    • 0:32:52It's one thing to learn new things in Python,
    • 0:32:54but if we can avoid writing code, or this much code,
    • 0:32:57that's going to be a good thing.
    • 0:32:58And so one other technique we can introduce here
    • 0:33:01that does allow us to write a little less code
    • 0:33:03is, we can actually get rid of this function.
    • 0:33:05It turns out, in Python, if you just need to make a function
    • 0:33:09but it's going to be used and then essentially thrown away,
    • 0:33:11it's not something you're going to be reusing in multiple places--
    • 0:33:14it's not like a library function that you want to keep around--
    • 0:33:17you can actually just do this.
    • 0:33:19You can change the value of this key parameter
    • 0:33:23to be what's called a lambda function, which
    • 0:33:25is a fancy way of saying a function that technically has no name.
    • 0:33:29It's an anonymous function.
    • 0:33:30Why does it have no name?
    • 0:33:32Well, it's kind of stupid that I invented this name on line 13.
    • 0:33:35I used it on line 16, and then I never again used it.
    • 0:33:38If there's only being used in one place, why bother giving it a name at all?
    • 0:33:42So if you instead, in Python, say lambda,
    • 0:33:45and then type out the name of the parameter
    • 0:33:47you want this anonymous function to take,
    • 0:33:50you can then say, go ahead and return this value.
    • 0:33:54Now let's notice the inconsistencies here.
    • 0:33:57When you use this special lambda keyword that says, hey Python,
    • 0:33:59give me an anonymous function, a function with no name,
    • 0:34:02it then says, Python, this anonymous function will take one parameter.
    • 0:34:06Notice there's no parentheses.
    • 0:34:08And that's deliberate, if confusing.
    • 0:34:10It just tightens things up a little bit.
    • 0:34:12Notice that there's no return keyword, which similarly tightens things
    • 0:34:16up a bit, albeit inconsistently.
    • 0:34:18But this line of code I've just highlighted
    • 0:34:22is actually identical in functionality to this.
    • 0:34:26But it throws away the word [INAUDIBLE].
    • 0:34:28It throws away the word get value.
    • 0:34:29It throws away the parentheses, and it throws away the return keyword just
    • 0:34:33to tighten things up.
    • 0:34:34And it's well suited for a problem like this
    • 0:34:37where I just want to pass in a tiny little function that
    • 0:34:39does something useful.
    • 0:34:41But it's not something I'm going to reuse.
    • 0:34:42It doesn't need multiple lines to take up space.
    • 0:34:45It's just a nice, elegant one liner.
    • 0:34:47That's all a lambda function does.
    • 0:34:48It allows you to create an anonymous function right then and there.
    • 0:34:51And then the function you're passing it to, like sorted, will use it as before.
    • 0:34:57Indeed, if I run Python of favorites.py after growing my Terminal window,
    • 0:35:00the result is exactly the same.
    • 0:35:02And we see at the bottom here all of those small results.
    • 0:35:06Are any questions, then, on this syntax, on these ideas?
    • 0:35:10The goal here has been to write a Python program that just starts
    • 0:35:13to analyze or clean up data like this.
    • 0:35:18Yeah?
    • 0:35:19AUDIENCE: [INAUDIBLE]
    • 0:35:23DAVID J. MALAN: Could you use the lambda if it's just returning immediately?
    • 0:35:26It's really meant for one line of code, generally.
    • 0:35:29So you don't use the return keyword.
    • 0:35:31You just say what it is you want to return.
    • 0:35:33AUDIENCE: [INAUDIBLE]
    • 0:35:37DAVID J. MALAN: Good question.
    • 0:35:38Could you do more in that one line if it's
    • 0:35:40got to be a more involved algorithm?
    • 0:35:42Yes, but you would just ultimately return the value in question.
    • 0:35:45In short, if it's getting at all sophisticated
    • 0:35:47you don't use the lambda function in Python.
    • 0:35:49You go ahead and actually just define a name for it,
    • 0:35:52even if it's a one-off name.
    • 0:35:54JavaScript, another language we'll look at in a few weeks,
    • 0:35:56makes heavier use, I dare say, of lambda functions.
    • 0:35:59And those can actually be multiple, multiple lines,
    • 0:36:02but Python does not support that instinct.
    • 0:36:05All right.
    • 0:36:06So let's go ahead and do one other thing.
    • 0:36:07Office was clearly popping out of the code here quite a bit.
    • 0:36:10Let's go ahead and write a slightly different program
    • 0:36:12that maybe just focuses on The Office for the moment,
    • 0:36:15just focuses on The Office.
    • 0:36:16So let me go ahead and throw most of this code away, up until this point
    • 0:36:21when I'm inside of my inner loop.
    • 0:36:23And let me go ahead, and I don't even want the global variable here.
    • 0:36:26All I want to do is focus on the current title.
    • 0:36:28How could I detect if someone likes The Office?
    • 0:36:30Well, I could say something like--
    • 0:36:33how about this?
    • 0:36:34So counter equals 0.
    • 0:36:36We'll just focus on The Office.
    • 0:36:38If title equals, equals The Office, I could then go ahead and say,
    • 0:36:43counter plus equals 1.
    • 0:36:47I don't need a key.
    • 0:36:48There's no dictionary involved now.
    • 0:36:49It's just a simple integer variable.
    • 0:36:51And then, down here I'll say something like,
    • 0:36:55number of people who like The Office is, whatever this value is.
    • 0:37:00And I'll put in counter in curly braces, and then I'll
    • 0:37:03turn this whole thing into an F string.
    • 0:37:05All right, let me go ahead and run this.
    • 0:37:07Python of favorites.py, Enter.
    • 0:37:10Number of people who like The Office is 15.
    • 0:37:12All right, so that's great.
    • 0:37:13But let's go ahead now and deliberately muddy the data a bit.
    • 0:37:17All of you were very nice in that you typed in The Office.
    • 0:37:21But you can imagine someone just typing Office,
    • 0:37:23for instance, maybe there, maybe there.
    • 0:37:25And many people might just write Office, you could imagine.
    • 0:37:28Didn't happen here, but suppose it did, and probably
    • 0:37:30would have if we had even more and more submissions over time.
    • 0:37:33Now let's go ahead and rerun this program, no changes to the code.
    • 0:37:37Now only 13 people like The Office.
    • 0:37:39So let's fix this.
    • 0:37:40The data is now as I mutated it to have a couple Offices, and many The Offices.
    • 0:37:45How could I change my Python code to now count both of those situations?
    • 0:37:51What could I change up here in order to improve this situation?
    • 0:37:55Any thoughts?
    • 0:37:57Yeah?
    • 0:37:58AUDIENCE: You write the title [INAUDIBLE]..
    • 0:38:02DAVID J. MALAN: Yeah, so I could just ask two questions like that.
    • 0:38:05If title equals The Office, or title equals, equals just
    • 0:38:09Office, for instance.
    • 0:38:10And I'm still don't have to worry about capitalization.
    • 0:38:12I don't have to worry about spaces because I at least threw that all away.
    • 0:38:15Now I can go ahead and rerun this code.
    • 0:38:18Let me go run it a third time.
    • 0:38:19OK, so we're back up to 15.
    • 0:38:22So I like that.
    • 0:38:25You could imagine this not scaling very well.
    • 0:38:28Avatar had three different permutations, and there were some others
    • 0:38:31if we dug deeper that there might have been more variants.
    • 0:38:34Could we do something a little more general purpose?
    • 0:38:36Well, we could do something like this.
    • 0:38:38If Office in the title--
    • 0:38:41this is kind of a cool thing you can do with Python.
    • 0:38:44It's very English-like, just ask the question, albeit tersely.
    • 0:38:46This, interesting, just got me into trouble.
    • 0:38:50Now, all of a sudden, we're up to 16.
    • 0:38:53Does anyone know what the other one is?
    • 0:38:56AUDIENCE: Someone put V Office.
    • 0:38:58DAVID J. MALAN: What Office?
    • 0:38:59AUDIENCE: Someone entered a V Office, [INAUDIBLE]..
    • 0:39:04DAVID J. MALAN: Oh, interesting.
    • 0:39:05Yes, so they hit The.
    • 0:39:07OK.
    • 0:39:08[APPLAUSE]
    • 0:39:11DAVID J. MALAN: OK.
    • 0:39:12Someone did that, sure.
    • 0:39:13So The V Office.
    • 0:39:17OK, this one's actually going to be hard to correct for.
    • 0:39:19I can't really think of a general--
    • 0:39:21well, this is actually a good example of data gets messy fast.
    • 0:39:25And you could imagine doing something where, OK, we
    • 0:39:28could have like 26 conditions if someone said The A Office, or The B Office,
    • 0:39:32right?
    • 0:39:33You could imagine doing that.
    • 0:39:34But then there's surely going to be other typos that are possible.
    • 0:39:37So that's actually a hard one to fix.
    • 0:39:39But it turns out we got lucky and now this is actually the accurate count.
    • 0:39:44But the data is itself messy.
    • 0:39:46Let me show another way that just adds another tool to our toolkit.
    • 0:39:49It turns out that there's this feature in many programming languages, Python
    • 0:39:54among them, called regular expressions.
    • 0:39:57And this is actually a really powerful technique
    • 0:39:59that we'll just scratch the surface of here.
    • 0:40:00But it's going to be really useful, actually, maybe toward final projects,
    • 0:40:04in web programming, any time you want to clean up data or validate data.
    • 0:40:08And actually, just to make this clear, give me a moment
    • 0:40:11before I switch screens here.
    • 0:40:14And let me open up a Google Form from scratch.
    • 0:40:18Give me just a moment to create something real quick.
    • 0:40:22If you've never noticed this before when creating a Google Form,
    • 0:40:25you can do a question.
    • 0:40:28And if you want the user to type in something
    • 0:40:30very specific as a short text answer like this,
    • 0:40:33you might know that there's toggles like this in Google's world,
    • 0:40:35like you can require it.
    • 0:40:36Or you can do response validation.
    • 0:40:39You could say, what's your email?
    • 0:40:41And then you could say something like, text is an email.
    • 0:40:47So here's an example in Google Forms how you can validate users' input.
    • 0:40:52But a feature most of you have probably never noticed, or cared about, or used,
    • 0:40:57is this thing called a regular expression, where
    • 0:40:59you can actually define a pattern.
    • 0:41:01And I could actually reimplement that same idea by doing something like this.
    • 0:41:04I can say, let the user type in anything represented by .star, then an at sign,
    • 0:41:11then something else, then a literal period, then, for instance,
    • 0:41:16something else.
    • 0:41:17So it's very cryptic, admittedly, at first glance.
    • 0:41:19But this means any character 0 more times.
    • 0:41:23This means any character 0 more times.
    • 0:41:26This means a literal period, because apparently
    • 0:41:28dot means any character in the context of these patterns.
    • 0:41:32Then this thing means any character 0 more times.
    • 0:41:36So I should actually be a little more nitpicky.
    • 0:41:38You don't want 0 or more times, you want 1 or more times.
    • 0:41:41So this with the plus means any character 1 or more time.
    • 0:41:45So there has to be something there.
    • 0:41:47And I think I want the same thing here 1 or more times, 1 or more times.
    • 0:41:51Or heck, if I want to restrict this form in some sense to edu addresses,
    • 0:41:56I could change that last thing to literally .edu.
    • 0:41:59And so long story short, even though this looks,
    • 0:42:01I'm sure, pretty cryptic, there's this mini language built into Python,
    • 0:42:06and JavaScript, and Java, and other languages that allows you to express
    • 0:42:09patterns in a standardized way.
    • 0:42:12And this pattern is actually something we can implement in code, too.
    • 0:42:15And let me switch back to Python for a second just
    • 0:42:18to do the same kind of idea.
    • 0:42:19Let me toggle back to my code here.
    • 0:42:22Let me put up, for instance, a summary of what it is you can do.
    • 0:42:26And here's just a quick summary of some of the available symbols.
    • 0:42:33A period may represent any character. .star or .asterisks means 0 or more
    • 0:42:39characters.
    • 0:42:39So the dot means anything, so it can be A or nothing.
    • 0:42:43It can be B or nothing.
    • 0:42:44It can be A, B, A, B, C. It can be any combination of 0 or more characters.
    • 0:42:49Change that to a plus and you now express one or more characters.
    • 0:42:52Question mark means something is optional.
    • 0:42:55Caret symbol means start matching at the beginning of the user's input.
    • 0:42:59Dollar sign means stop matching at the end of the user's input.
    • 0:43:05So we won't play with all of these just now.
    • 0:43:07But let me go over here and actually tackle this Office problem.
    • 0:43:11Let me go ahead and import a new library called the regular expression library,
    • 0:43:15import re.
    • 0:43:17And then, down here, let me say this.
    • 0:43:20If re.search, this pattern.
    • 0:43:25Let's just search for Office, quote, unquote, in the current title.
    • 0:43:30Then we're going to go ahead and increase the counter.
    • 0:43:32So it turns out that the regular expression library
    • 0:43:35has a function called search that takes as its first argument a pattern,
    • 0:43:39and then, as its second argument the string you
    • 0:43:41want to analyze for that pattern.
    • 0:43:44So it's sort of looking for a needle in this haystack, from left to right.
    • 0:43:47Let me go ahead now and run this version of the program, Enter.
    • 0:43:52And now I screwed up because I forgot my colon, but that's old stuff.
    • 0:43:56Enter.
    • 0:43:57Huh.
    • 0:43:59Number of people who like The Office is now 0.
    • 0:44:01So this seems like a big--
    • 0:44:02thank you-- big step backwards.
    • 0:44:05What did I do wrong?
    • 0:44:08Yeah?
    • 0:44:08AUDIENCE: [INAUDIBLE]
    • 0:44:10DAVID J. MALAN: Yeah.
    • 0:44:11I forced all my input to uppercase, so I probably need to do this.
    • 0:44:14So we'll come back to other approaches there.
    • 0:44:16Let me rerun it now.
    • 0:44:17OK, now we're back up to 16.
    • 0:44:19But I could even, let's say--
    • 0:44:22I could tolerate just The Office.
    • 0:44:25How about this, or how about something like, or The Office?
    • 0:44:30Let me do this instead.
    • 0:44:32And let me use these other special characters.
    • 0:44:34This caret sign means the beginning of the string.
    • 0:44:37This dollar sign weirdly represents the end of the string.
    • 0:44:40I'm adding in some parentheses just like in math, just to add another symbol
    • 0:44:44here, the or symbol here.
    • 0:44:46And this is saying start matching at the beginning of the user string.
    • 0:44:50Check if the beginning of the string is Office, or the beginning of the string
    • 0:44:54is The Office.
    • 0:44:55And then, you better be at the end of the string.
    • 0:44:58So they can't keep typing words before or after that input.
    • 0:45:01Let me go ahead and rerun the program.
    • 0:45:03And now we're down to 15, which used to be our correct answer,
    • 0:45:07but then we noticed The V Office.
    • 0:45:10How can we deal with that?
    • 0:45:12It's going to be messier to deal with that.
    • 0:45:16How about if I tolerate any character represented by dot
    • 0:45:21in between The and Office?
    • 0:45:23Now if I rerun it, now I really have this expressive capability.
    • 0:45:28So this is only to say, there are so many ways in languages, in general,
    • 0:45:32to solve problems.
    • 0:45:33And some of these tools are more sophisticated than others.
    • 0:45:35This is one that you've actually probably glanced at but never used
    • 0:45:38in the context of Google Forms for years if you're
    • 0:45:41in the habit of creating these for student groups or other activities.
    • 0:45:43But it's now something you can start to leverage.
    • 0:45:45And we're just scratching the surface of what's actually possible with this.
    • 0:45:49But let's now do one final example just using some Python code here.
    • 0:45:53And let's actually write a program that's
    • 0:45:55a little more general purpose that allows me to search for any given title
    • 0:46:00and figure out its popularity.
    • 0:46:01So let me go ahead and simplify this.
    • 0:46:04Let's get rid of our regular expressions.
    • 0:46:06Let's go ahead and continue capitalizing the title.
    • 0:46:09And let's go ahead to--
    • 0:46:11at the beginning of this program, and first ask the user for the title
    • 0:46:16they want to search for.
    • 0:46:17So title equals, let's ask the user for input,
    • 0:46:20which is essentially the same thing as our CS50 get_string function.
    • 0:46:23Ask them for the title.
    • 0:46:24And then whatever they type in, let's go ahead and strip whitespace
    • 0:46:28and uppercase the thing again.
    • 0:46:30And now, inside of my loop, I could say something like this.
    • 0:46:35If the current row's title after stripping whitespace and forcing
    • 0:46:42it to uppercase, too, equals the user's title, then, go ahead
    • 0:46:46and maybe increment a counter.
    • 0:46:49So I still need that counter back.
    • 0:46:51So let me go ahead and define this maybe in here, counter equals 0.
    • 0:46:56And then, at the very end of this program,
    • 0:46:58let me go ahead and print out just the popularity
    • 0:47:01of whatever the human typed in.
    • 0:47:03So again, the only difference is I'm asking the human for some input
    • 0:47:06this time.
    • 0:47:06I'm initializing my counter to 0, then I'm
    • 0:47:09searching for their title in the CSV file
    • 0:47:12by doing the same massaging of the data by forcing it to uppercase
    • 0:47:15and getting rid of the whitespace.
    • 0:47:18So now, when I run Python of favorites.py, Enter,
    • 0:47:21I could type in the office all lowercase even, and now we're down to 13.
    • 0:47:3013, why?
    • 0:47:34Oh, that's correct.
    • 0:47:36Because I'm the one that went in and removed those The keywords a bit ago.
    • 0:47:40If we fixed those, we would be back up to 15.
    • 0:47:43If we added support for The V Office, we would be up to 16 as well.
    • 0:47:47All right, any questions then on these various manipulations?
    • 0:47:50And if you're feeling like, oh, my god, this
    • 0:47:52is so much Python code just to do simple things, that's the point.
    • 0:47:55And indeed, even though it's a powerful language
    • 0:47:57and can solve these kinds of problems, we had to write almost 20 lines of code
    • 0:48:00just to ask a single question like this.
    • 0:48:03But any questions on how we did this, or on any of these building
    • 0:48:07blocks along the way?
    • 0:48:10Anything here?
    • 0:48:11No?
    • 0:48:12All right.
    • 0:48:12That was a lot.
    • 0:48:13Let's take a five-minute break here.
    • 0:48:14When we come back, we'll do it better.
    • 0:48:17So we are back.
    • 0:48:18And the rest of today is ultimately about, how
    • 0:48:20can we store, and manipulate, and change, and retrieve data
    • 0:48:24more efficiently than we might by just writing raw code?
    • 0:48:28This isn't to say that you shouldn't use Python to do the kinds of things
    • 0:48:31that we just did.
    • 0:48:32And in fact, it might be super common if you're getting a lot of messy input
    • 0:48:37from users that you might want to clean it up.
    • 0:48:39And maybe the best way to do that is to write a program so that step-by-step
    • 0:48:42you can make all of the requisite changes and fixes
    • 0:48:44like we did with The Office, for instance, again and again,
    • 0:48:47and reuse that code, especially if more and more submissions are
    • 0:48:50coming through.
    • 0:48:51But another theme of today, ultimately, is
    • 0:48:53that sometimes there are different, if not better tools for the same job.
    • 0:48:57And in fact, now at this point in the term,
    • 0:48:59as we begin to introduce not just Python, but in a moment
    • 0:49:02a language called SQL, and next week, a language called JavaScript,
    • 0:49:06and the week after that, synthesizing a whole lot of these languages
    • 0:49:09together is to just kind of paint a picture of how
    • 0:49:12you might decide what the trade-offs are between using this tool, or this tool,
    • 0:49:15or this other tool.
    • 0:49:16Because undoubtedly you can solve problems moving forward
    • 0:49:19in many different ways with many different tools.
    • 0:49:22So let's give you another tool, one with which
    • 0:49:25you can implement a proper relational database.
    • 0:49:28What we just saw in the form of CSV files
    • 0:49:31are what we might call flat-file databases.
    • 0:49:33Again, just a very simple file, flat in that there's no hierarchy to it.
    • 0:49:37It's just rows and columns.
    • 0:49:39And that is all ultimately storing ASCII or Unicode text.
    • 0:49:44A relational database, though, is something that's actually
    • 0:49:47closer to a proper spreadsheet program.
    • 0:49:50A CSV is an individual sheet, if you will,
    • 0:49:53from a spreadsheet when you export it.
    • 0:49:55If you had multiple sheets in a spreadsheet,
    • 0:49:57you would have to export multiple CSVs.
    • 0:49:59And that gets annoying quickly in code if you
    • 0:50:01have to open up this CSV, this CSV, all of which
    • 0:50:03represent different sheets or tabs in a proper spreadsheet.
    • 0:50:07A relational database is more like a spreadsheet program
    • 0:50:11that you, a programmer, now can interact with.
    • 0:50:14You can write data to it.
    • 0:50:16You can read data from it, and you can have multiple sheets, a.k.a.,
    • 0:50:19tables storing all of your data.
    • 0:50:22So whereas Excel and numbers in Google spreadsheet
    • 0:50:24are meant to be reused really by humans with their mouse and their keyboard,
    • 0:50:27clicking, and pointing, and manipulating things graphically,
    • 0:50:29a relational database using a language called
    • 0:50:32SQL is one in which the programmer has similar capabilities,
    • 0:50:37but doing so in code.
    • 0:50:39Specifically, using a language called SQL, and at a scale
    • 0:50:42that's much grander than spreadsheets alone.
    • 0:50:45In fact, if you try on your Mac or PC to open a spreadsheet that's
    • 0:50:48got tens of thousands of rows, it'll probably
    • 0:50:51work fine, hundreds of thousands of rows, millions of rows, no way.
    • 0:50:54At some point your Mac or PC is going to struggle
    • 0:50:57to open particularly large data sets.
    • 0:50:59And that, too, is where proper databases come
    • 0:51:01into play and proper languages for databases come
    • 0:51:04into play, when it's all about scale.
    • 0:51:06And indeed, most any mobile app or web app today that you or someone else
    • 0:51:09might write should probably plan on lots of data if it's successful.
    • 0:51:13So we need the right tools for that problem.
    • 0:51:15So fortunately, even though we're about to learn yet another language,
    • 0:51:19it only does four things fundamentally, known by this silly acronym, CRUD.
    • 0:51:24SQL, this language for databases, supports the ability
    • 0:51:27to create data, read data, update data, and delete data.
    • 0:51:32That's it.
    • 0:51:33There's a few more keywords that exist in this language called SQL
    • 0:51:36that we'll soon see.
    • 0:51:37But at the end of the day, even if you're
    • 0:51:39starting to feel like this is a lot very quickly,
    • 0:51:42it all boils down to these four basic operations.
    • 0:51:44And the four commands in SQL, if you will,
    • 0:51:47functions in a sense that implement those four ideas happen to be these.
    • 0:51:51They're almost the same but with some slight variance.
    • 0:51:54The ability to create or insert data is the C. The ability to select data
    • 0:51:59is the R, or read.
    • 0:52:01Update is the same.
    • 0:52:02Delete is the same, but drop is also a keyword as well.
    • 0:52:05So we'll see these and a few other keywords
    • 0:52:06in SQL that, at the end of the day, just allow you to create, read, and update
    • 0:52:10data using verbs, if you will, like these.
    • 0:52:14So to do that, what's the syntax going to be?
    • 0:52:18Well, we won't get into the weeds too quickly on this.
    • 0:52:20But here's a representative syntax of how
    • 0:52:22you can create using this language called SQL, in your very
    • 0:52:25own database, a brand new table.
    • 0:52:28This is so easy in Excel, and Google Spreadsheets, and Apple Numbers.
    • 0:52:30You want a new sheet, you click the plus button.
    • 0:52:32You get a new tab.
    • 0:52:33You give it a name, and boom, you're done.
    • 0:52:35In the world of programming, though, if you want to create the analogue of that
    • 0:52:40spreadsheet in the computer's memory, you create something called a table,
    • 0:52:43like a sheet, that has a name, and then in parentheses has one or more columns.
    • 0:52:48But unlike Google Spreadsheets, and Apple Numbers, and Excel,
    • 0:52:51you have to decide as the programmer what types of data
    • 0:52:55you're going to be storing in each of these columns.
    • 0:52:57Now even though Excel, and Google Spreadsheets,
    • 0:52:59and Numbers does allow you to format or present data in different ways,
    • 0:53:03it's not strongly typed data like it is, for instance, when we were using C.
    • 0:53:07And heck, even in Python there's underlying data types.
    • 0:53:10Even if you don't have to type them explicitly,
    • 0:53:12databases are going to want to know, are you storing integers?
    • 0:53:14Are you storing real numbers or floats?
    • 0:53:16Are you storing text?
    • 0:53:17Why?
    • 0:53:18Because especially as your data scales, the more hints
    • 0:53:20you give the database about your data, the more performance it can be,
    • 0:53:24the faster it can help you get at and store that data.
    • 0:53:27So types are about to be important again,
    • 0:53:29but there's not going to be that many of them, fortunately.
    • 0:53:31Now how can I go about converting, for instance, some real data,
    • 0:53:34like that from you, my favorites.csv file,
    • 0:53:37into a proper relational database?
    • 0:53:39Well, it turns out that using SQL I can do this
    • 0:53:42in VS Code on my own Mac, or PC, or in the cloud
    • 0:53:45here by just importing the CSV into a database.
    • 0:53:48We'll see eventually how to do this manually.
    • 0:53:50For now, I'm going to use more of an automated process.
    • 0:53:52So let me go over to VS Code here.
    • 0:53:54Let me type ls to see where we left off before.
    • 0:53:57I had two files favorites.csv, which I downloaded from Google Spreadsheets.
    • 0:54:01Recall that I made a couple of changes.
    • 0:54:02We deleted a couple of Thes from the file for The Office.
    • 0:54:06But this is the same file as before, and then we
    • 0:54:08have favorites.py, which we'll set aside for now.
    • 0:54:11I'm going to go ahead now and run a command SQLite3.
    • 0:54:14So in the world of relational databases, there's
    • 0:54:18many different products out there, many different software that
    • 0:54:23implements the SQL language.
    • 0:54:25Microsoft has their own.
    • 0:54:26There's something called MySQL that's been very popular for years.
    • 0:54:30Facebook, for instance, used it early on.
    • 0:54:32PostgreSQL, Microsoft Access Server, Oracle,
    • 0:54:35and maybe a whole bunch of other product names
    • 0:54:36you might have encountered over time, which
    • 0:54:38is to say there's many different types of tools, and servers,
    • 0:54:42and software in which you can use SQL.
    • 0:54:44We're going to use a very lightweight version of the SQL language
    • 0:54:47today called SQLite.
    • 0:54:49This is the version of SQL that's generally
    • 0:54:51used on iPhones and Android devices these days.
    • 0:54:54If you download an app that stores data like your own contacts,
    • 0:54:56typically is stored using SQLite.
    • 0:54:59Because it's fairly lightweight, but you can still store hundreds,
    • 0:55:02thousands, even tens of thousands of pieces of data
    • 0:55:05even using this lightweight version thereof.
    • 0:55:07SQLite3 is like version 3 of this tool.
    • 0:55:10We're going to go ahead and run SQLite3 with a file called favorites.db.
    • 0:55:16It's conventional in the world of SQLite to name your file something.db.
    • 0:55:20I'm going to create a database called favorites.db.
    • 0:55:22Once I'm inside of the program, now I'm going to go ahead and enter CSV Mode.
    • 0:55:27Again, not something you have to memorize,
    • 0:55:28just something you can look up as needed.
    • 0:55:30And then, I'm going to import favorites.csv
    • 0:55:34into a table, that is, a sheet, if you will, called favorites as well.
    • 0:55:40Now I'm going to hit Enter and I'm going to go ahead and exit the program
    • 0:55:44altogether and type ls.
    • 0:55:45Now I have three files in my current directory--
    • 0:55:48the CSV file, the Python file from before, and now favorites.db.
    • 0:55:52But if I did this right, all of the data you all typed into the CSV file
    • 0:55:56has now been loaded into a proper database where I can now use
    • 0:55:59this SQL language to access it instead.
    • 0:56:03So let's go ahead again and run SQLite3 of favorites.db, which now exists.
    • 0:56:07And now, at the SQLite prompt I can start
    • 0:56:10to play around and see what this data is.
    • 0:56:13For instance, I can look, by typing .schema,
    • 0:56:16at what the schema is of my data, what's the design.
    • 0:56:19Now no thought was put into the design of this data at the moment
    • 0:56:22because I automated the whole process.
    • 0:56:23Once we start creating our own databases we'll
    • 0:56:26give more thought to the data types and the columns that we have.
    • 0:56:29But we can see what SQLite presumed I wanted just
    • 0:56:34by importing the data by default.
    • 0:56:36What the import command did for me a moment ago is essentially the syntax.
    • 0:56:41It automated the process of creating a table, if it doesn't exist,
    • 0:56:44called favorites.
    • 0:56:45And then notice, in parentheses it gave me three columns--
    • 0:56:48timestamp, title, and genres, which were inferred, obviously, from the CSV.
    • 0:56:53All three of which have been decreed to be text.
    • 0:56:56Again, once we're more comfortable we'll create our own tables,
    • 0:56:59choose our own types and column names.
    • 0:57:01But for now, I just automated the whole process just
    • 0:57:03to get us started by using this built-in import command as well.
    • 0:57:08All right.
    • 0:57:08So what now can I begin to do?
    • 0:57:11Well, if I wanted to, for instance, start playing around with data therein,
    • 0:57:16I might execute a couple of different commands.
    • 0:57:23Let me find the right one here-- one of which would be select.
    • 0:57:28Select being one of our most versatile tools
    • 0:57:31to select data from this database.
    • 0:57:33So if I have these three columns here-- timestamp,
    • 0:57:35title, and genres, suppose I want to select all of the titles.
    • 0:57:39Doing that earlier in Python required importing the CSV library,
    • 0:57:43opening the file, creating a reader or a DictReader, iterating over every row,
    • 0:57:48adding every title to a dictionary or just printing it out,
    • 0:57:51and dot, dot, dot.
    • 0:57:52There was a dozen or so lines of code when we first began.
    • 0:57:55Now, how about this?
    • 0:57:56Select title from favorites, semicolon, done.
    • 0:58:01So now, with this particular language, the output is very textual
    • 0:58:05and it's simulating what it looks like if it were more graphical by creating
    • 0:58:08this table, so to speak.
    • 0:58:10Select title from favorites is a distillation
    • 0:58:14in a different language called SQL of all the lines of code
    • 0:58:17I wrote early on when we first started playing with favorites.py.
    • 0:58:20SQL is therefore optimized for reading, and creating, and updating,
    • 0:58:25and ultimately, deleting data.
    • 0:58:27So here's perhaps a better tool for the job once you have the data.
    • 0:58:30Tossing it into a more powerful, versatile format
    • 0:58:34might allow you now to get more work done more quickly
    • 0:58:37without having to reinvent the wheel.
    • 0:58:38Someone else has figured out how to select data like this.
    • 0:58:41What more can I do here?
    • 0:58:43Well, let me go ahead and pull up, in a moment, just a little bit
    • 0:58:47of a cheat sheet here.
    • 0:58:49Give me one second to find this.
    • 0:58:53So suppose I want to now select data a little more powerfully.
    • 0:58:58So here's what I just did in a canonical way.
    • 0:59:00So select typically works like this.
    • 0:59:01You select columns from a specific table, semicolon.
    • 0:59:05Unfortunately, stupid semicolons are back.
    • 0:59:08Select columns from table then, is the generic form of what I just did.
    • 0:59:12More specifically, I selected one column called title from favorites.
    • 0:59:17Favorites is the name of the table.
    • 0:59:18Semicolon ends my thought.
    • 0:59:19Suppose I wanted to get two things, like the genres that each of you inputted.
    • 0:59:23I could instead do select title, comma, genres from favorites,
    • 0:59:28and then, a semicolon, and Enter.
    • 0:59:30It's going to look a little ugly on my screen
    • 0:59:32because some of these titles and--
    • 0:59:33OK, one of you really went all out with Community.
    • 0:59:37You can see that it's just wrapping in an ugly way,
    • 0:59:40but it's just now showing me two columns.
    • 0:59:43If we scroll up to the very top again, the left most of one,
    • 0:59:46Black Mirror went all out, too.
    • 0:59:48Thank you.
    • 0:59:49And now, OK, we're going to have to clean some of these up.
    • 0:59:52Game of Thrones, good comedy, yes.
    • 0:59:57Keep going, keep going, keep going.
    • 0:59:59So now we've selected two of the columns that we care about.
    • 1:00:02There it is.
    • 1:00:03OK, so it's crazy wide because of all of those genres.
    • 1:00:06But it allows me to select exactly the data I want.
    • 1:00:09Let's go back to the titles, though, and perhaps start playing around
    • 1:00:12with some modifiers here.
    • 1:00:13For instance, it turns out, using SQL there's a lot of functionality
    • 1:00:18built into the language.
    • 1:00:19You've got a lot of functions, similar to Excel or Google Spreadsheets
    • 1:00:22where you can have formulas.
    • 1:00:24SQL provides you with some of the same heuristics that
    • 1:00:26allow you to apply operations like these on entire columns.
    • 1:00:30For instance, you can take averages, count the total,
    • 1:00:32get the distinct values, force things to lowercase, uppercase, min,
    • 1:00:36and max, and so forth.
    • 1:00:37So let's try distinct, for instance.
    • 1:00:39Let me go back to my Terminal, and let's say, select,
    • 1:00:43how about the distinct titles from the favorites table?
    • 1:00:48Enter.
    • 1:00:49I didn't bother selecting the genres because I
    • 1:00:51want it to be a little prettier.
    • 1:00:52And you can see here that we have just the distinct titles,
    • 1:00:58except for issues of formatting.
    • 1:01:00So whitespace is going to be an issue again.
    • 1:01:02Capitalization is going to be a thing again.
    • 1:01:04So there's a trade-off.
    • 1:01:05One of the things I was doing in Python was forcing everything to uppercase
    • 1:01:09and then getting rid of whitespace.
    • 1:01:10But we could combine some of these.
    • 1:01:12I could do something like force every title to uppercase,
    • 1:01:15then get the distinct value.
    • 1:01:16And that's actually going to get rid of some of those values as well.
    • 1:01:19And again, I did it all in one simple line that was fast.
    • 1:01:21So let me pull up at the bottom of the screen again.
    • 1:01:24I selected distinct upper titles from favorites,
    • 1:01:28and that did everything for me at once in just one breath.
    • 1:01:31Suppose I want to get the total number of counts of titles.
    • 1:01:33How about select count of all of those titles from favorites?
    • 1:01:40Semicolon, Enter, and now you get back a mini table
    • 1:01:43that contains just your answer, 158 in this case.
    • 1:01:47So that's the total number of, not distinct,
    • 1:01:50but total titles that we had in the file.
    • 1:01:52And we could continue to manipulate the data further using, again,
    • 1:01:56functions like these here.
    • 1:01:58But there's also additional filtration we can do.
    • 1:02:01We can also qualify our selections by saying where some condition is true.
    • 1:02:07So just as in Scratch, and C, and Python, you have Boolean expressions,
    • 1:02:10you can have the same in SQL as well, where I can filter my data where
    • 1:02:16something is true or false.
    • 1:02:19Like allows me to do approximations.
    • 1:02:21If I want to get something that's like The Office
    • 1:02:23but not necessarily T-H-E, space, Office,
    • 1:02:26I could do pattern matching using like here.
    • 1:02:29Order by, limit, and grouped by are other commands I can execute, too.
    • 1:02:33So let me go back and do a couple of these here.
    • 1:02:35How about, let me just get, oh, I don't know, all of the titles from favorites
    • 1:02:42but limit it to 10 results.
    • 1:02:44That might be one thing that's helpful to see if you just care about some
    • 1:02:47of the data at the top there instead.
    • 1:02:50How about, select all of the titles from favorites, where the title itself
    • 1:02:56is like, quote, unquote, "Office?"
    • 1:02:59And this will give me only two answers.
    • 1:03:02Those are the two rows, recall, that I mutated by getting rid of the word The.
    • 1:03:07Notice that like allows me too tolerate uppercase and lowercase.
    • 1:03:11Because if I instead just use the equal sign,
    • 1:03:14and in SQL a single equal sign does, in fact, mean equality.
    • 1:03:21For comparison's sake, it's not doing assignment.
    • 1:03:23This is not how you assign data in SQL.
    • 1:03:26I got back no answers there.
    • 1:03:27So indeed, the equal sign is giving me literal answers
    • 1:03:31that searches just for what I typed in.
    • 1:03:33How could I get all of these?
    • 1:03:35Well, similar in spirit to regular expressions but not quite as powerful
    • 1:03:39in SQL, I could do something like this.
    • 1:03:41I can select the title from favorites where the title is like, quote,
    • 1:03:45unquote, "Office."
    • 1:03:46But I can add, a bit weirdly, percent signs to the left and the right.
    • 1:03:52So the language SQL supports the same notion of pattern matching
    • 1:03:57but much more limited out of the box.
    • 1:03:59If we want more powerful regular expressions
    • 1:04:01we probably do want to use Python instead.
    • 1:04:03But the percent sign here means 0 or more characters
    • 1:04:06on the left, 0 or more characters on the right.
    • 1:04:09So this will just grab any title that contains O-F-F-I-C-E in it in that
    • 1:04:14order.
    • 1:04:15And now I get all 16, it would seem, of those results, again.
    • 1:04:19How do I know it's 16?
    • 1:04:20Well, I can just get the count of those titles
    • 1:04:23and get back that answer instead as well.
    • 1:04:26So again, it takes some getting used to, the vocabulary
    • 1:04:29and the syntax that you can use.
    • 1:04:30There's these building blocks and others.
    • 1:04:32But SQL is really designed, again, for creating, reading, updating,
    • 1:04:35and deleting data.
    • 1:04:36For instance, I've never really been a fan of Friends, for instance.
    • 1:04:41So right now if I do select, how about title from favorites
    • 1:04:46where title like, quote, unquote, Friends with the percent signs?
    • 1:04:53We can see that there's a whole bunch of them.
    • 1:04:55That's how many exactly.
    • 1:04:56Let's just do a quick count.
    • 1:04:57So that's nine of them.
    • 1:04:59Well, delete from favorites.
    • 1:05:03OK, you and me, delete from favorites, where title like Friends, Enter.
    • 1:05:10Nothing seems to happen, but bye-bye Friends.
    • 1:05:14[APPLAUSE]
    • 1:05:15DAVID J. MALAN: Thank you.
    • 1:05:19So now we've actually changed the data.
    • 1:05:21And this is what's compelling about a proper database.
    • 1:05:25Yes, you could technically write Python code that not only reads the CSV file,
    • 1:05:29but also writes it.
    • 1:05:30You can change using quote, unquote, "A" for append,
    • 1:05:32or quote, unquote, "W" for write, instead of quote, unquote,
    • 1:05:35"R" for read alone.
    • 1:05:37But it's definitely a little more involved to do that in Python.
    • 1:05:39But with SQL, you can update the data in real time.
    • 1:05:42And if I were actually running a web application here or a database
    • 1:05:45for a mobile app, that change, theoretically,
    • 1:05:47would be reflected everywhere on your own devices
    • 1:05:50if you're somehow talking to this application.
    • 1:05:52So that's the direction we're headed.
    • 1:05:53This other thing has been bothering me.
    • 1:05:55So select, how about title from favorites, where title equals,
    • 1:06:02what was it?
    • 1:06:03The V Office, was it?
    • 1:06:06Yeah, it was that one.
    • 1:06:07How about we update favorites by setting title
    • 1:06:12equal to The Office, where title equals quote, unquote, "The V Office"
    • 1:06:19semicolon?
    • 1:06:20And now, if I select the same thing again
    • 1:06:22I can go up and down with my arrow keys quickly.
    • 1:06:24Now there is no The V Office.
    • 1:06:27We've actually changed that value.
    • 1:06:29How about genres?
    • 1:06:30Select genres from favorites, where the title is title
    • 1:06:35equals Game of Thrones, semicolon.
    • 1:06:39These were kind of long, and I don't really agree with all of that.
    • 1:06:43So how about we update favorites, set genres equal to, sure,
    • 1:06:49action, adventure, sure, drama?
    • 1:06:52OK, so it's a decent list.
    • 1:06:54Fantasy, sure, thriller, war.
    • 1:06:56OK, anything really but comedy, I would say.
    • 1:07:01Let's go ahead and hit Enter now.
    • 1:07:03And now, if I select genres again, same query, now we've canonicalized that.
    • 1:07:07We've thrown data away.
    • 1:07:09So whether or not that is right is probably
    • 1:07:11a bit subjective and argumentative.
    • 1:07:13But I have at least cleaned up my data, which is, again, the U in CRUD.
    • 1:07:16Create, read, update, delete, you can do it that easily.
    • 1:07:20Beware using delete.
    • 1:07:22Beware worse using drop, whereby you can drop an entire table.
    • 1:07:26But via these kinds of commands, can we actually now
    • 1:07:29manipulate our data much more rapidly and with single thoughts.
    • 1:07:33And in fact, if you're an aspiring statistician, or data scientist,
    • 1:07:36or analyst in the real world, SQL is such a commonly used language
    • 1:07:40because it allows you to really dive into data quickly, and ask
    • 1:07:43questions of the data, and get back answers quite quickly.
    • 1:07:45And this is a simple data set.
    • 1:07:47You can do this with much larger data sets as we soon will, too.
    • 1:07:51Or any questions on what we've seen of SQL thus far?
    • 1:07:55Only scratched the surface, but again, it
    • 1:07:57boils down to creating, reading, updating, and deleting data.
    • 1:08:03Questions here?
    • 1:08:04All right.
    • 1:08:05Well, let's consider the design of this data.
    • 1:08:07Recall that if I do .schema, that shows me the design of my table,
    • 1:08:11the so-called schema of my data.
    • 1:08:13This is OK.
    • 1:08:15It gets the job done, and frankly, everything the user typed in
    • 1:08:17was arguably text, including the timestamp, which is the date and time.
    • 1:08:21But so the data set itself is somewhat simple.
    • 1:08:24But if we look at the data set itself, especially genres, let's do this.
    • 1:08:29Select genres from favorites.
    • 1:08:31And let me point out one other thing stylistically, too.
    • 1:08:34I am very deliberately capitalizing all of the special SQL keywords,
    • 1:08:39and I'm lowercasing all of the column names and the table names.
    • 1:08:42This is a convention, and honestly, it just helps you read,
    • 1:08:45I think, the code when you're co-mingling your names for columns
    • 1:08:49and tables with proper SQL keywords.
    • 1:08:52But I could just as easily do select genres from favorites,
    • 1:08:58but again, the SQL specific keywords don't quite jump out as much.
    • 1:09:01So stylistically, we would recommend this, selecting genres
    • 1:09:04from favorites, semicolon.
    • 1:09:06So here is where--
    • 1:09:09oh.
    • 1:09:11OK, that was not intended.
    • 1:09:12I accidentally made every show, including
    • 1:09:15The Office about action, adventure, drama, fantasy, thriller, and war.
    • 1:09:20How did I do that accidentally?
    • 1:09:24What did I do wrong?
    • 1:09:25AUDIENCE: [INAUDIBLE]
    • 1:09:30DAVID J. MALAN: Yeah.
    • 1:09:31So beware, this is funny.
    • 1:09:32I think I did say beware around this time.
    • 1:09:34So the SQL database took me-- literally, I updated favorites,
    • 1:09:37setting genres equal to that, semicolon, end of thought.
    • 1:09:41I really wanted to say where title equals,
    • 1:09:43quote, unquote, "Game of Thrones."
    • 1:09:46Unfortunately, there isn't an undo command or time machine
    • 1:09:49with a SQL database, so the best we can do here
    • 1:09:51is, let's actually get rid of favorites.db.
    • 1:09:56Let's run SQLite of favorites.db again, which now will be recreated.
    • 1:10:02Let me change myself into CSV mode.
    • 1:10:04Let me import, into my favorites table, the CSV file.
    • 1:10:09And now, Friends is back, for better or for worse,
    • 1:10:14but so are all of our genres.
    • 1:10:18If I now reload the file and do select, star, from--
    • 1:10:21sorry.
    • 1:10:21Select genres from favorites, that was the result I was getting.
    • 1:10:25It's much messier, but that's because some of these are quite long.
    • 1:10:28But now we're back to the original data.
    • 1:10:30Lesson here, be sure to back up your work.
    • 1:10:32All right.
    • 1:10:33So what more can we now do with this data?
    • 1:10:36Well, I don't love the design of the genres table for a couple of reasons.
    • 1:10:40One, we didn't have any sort of validation,
    • 1:10:43but user input is going to be messy.
    • 1:10:45There's just a lot of redundancy in here.
    • 1:10:47Let's go ahead and do this.
    • 1:10:49Let me select all the comedies you all typed in.
    • 1:10:51So select title from favorites, where genres equals,
    • 1:10:57quote, unquote, "comedy."
    • 1:11:00OK, so there's all of the shows that are explicitly comedies.
    • 1:11:05But I think there might actually be others.
    • 1:11:08Let me scroll back up here.
    • 1:11:11Comedy, drama.
    • 1:11:12What was a comedy and a drama?
    • 1:11:14How about let's search for the-- oops, let me copy paste comedy, comma, drama.
    • 1:11:19OK, so The Office, in this case, was considered comedy and drama, Billions,
    • 1:11:23It's Always Sunny in Philadelphia, and Gilmore Girls as well.
    • 1:11:27But notice that I get many more when I just search for comedy.
    • 1:11:31So the catch here is that, because I have all of these genres implemented
    • 1:11:36the way Google did, as a comma-separated list,
    • 1:11:39it's actually really hard and messy to get at any show, all of the shows
    • 1:11:43that are somewhere described as comedy.
    • 1:11:46Because if I search for quote, unquote, "comedy," the only answers
    • 1:11:49I'm going to get are this one, whatever that show is, this one, whatever
    • 1:11:53that show is, this one.
    • 1:11:55But I'm not going to get this one.
    • 1:11:56I'm not going to get this one.
    • 1:11:58Why?
    • 1:11:59If I'm searching for, where genres equals, quote, unquote, "comedy,"
    • 1:12:02why am I missing those other shows?
    • 1:12:07Why am I missing?
    • 1:12:07Yeah?
    • 1:12:08AUDIENCE: [INAUDIBLE]
    • 1:12:11DAVID J. MALAN: Exactly.
    • 1:12:12It's not just a comedy, it's a comedy and a drama,
    • 1:12:14and a comedy or a news show, and so forth.
    • 1:12:16So I have to search for these commas, so this gets messy quickly, right?
    • 1:12:20Let me copy this so I can do this.
    • 1:12:22Let me search for where genres equals comedy.
    • 1:12:26How about, or genres equals comedy, drama, or genres
    • 1:12:33equals this whole thing, comedy, news, talk show?
    • 1:12:36I'm going to get more and more results.
    • 1:12:38But that's not going to scale well.
    • 1:12:39What could I do instead of enumerating with ors
    • 1:12:42all of the different permutations of genres, do you think?
    • 1:12:45AUDIENCE: [INAUDIBLE]
    • 1:12:49DAVID J. MALAN: Yeah.
    • 1:12:50So I could use the keyword is, similar in Python to the word in.
    • 1:12:54I could use the like keyword so that so long
    • 1:12:56as the genres is like comedy somewhere in there,
    • 1:13:02that's going to give me all of them, so long as the word comedy is in there.
    • 1:13:05But let me go ahead and just open the form from earlier.
    • 1:13:12Let me see if I can open this real quick before I toggle over.
    • 1:13:15If we look back at the form, recall that there
    • 1:13:16were all of those radio buttons asking for the specific genres
    • 1:13:22into which something fell.
    • 1:13:24And if I open this, let me full screen here and now open the original form.
    • 1:13:29You'll see all of the genres here, none of which
    • 1:13:32are that worrisome except for a corner case is jumping out at me.
    • 1:13:38Where might the like keyword alone get me into trouble?
    • 1:13:43It's not with comedy.
    • 1:13:44I'm OK with comedy.
    • 1:13:45AUDIENCE: Music and musical?
    • 1:13:47DAVID J. MALAN: Yeah, music and musical are deliberately on the list here.
    • 1:13:51Because, one, they're separate genres.
    • 1:13:53But if I just search for something that's like music,
    • 1:13:56I'm going to accidentally suck in all of the musicals, which
    • 1:13:58might not be what I intend.
    • 1:13:59If music is a music video or whatever, and musical is actually
    • 1:14:03a different type of show, I don't want to just do that.
    • 1:14:06So it seems just very messy.
    • 1:14:08I could probably hack something together with-- maybe add some commas in there,
    • 1:14:11or something like this.
    • 1:14:12But this is just not a good design for the data.
    • 1:14:15Google has done it this way because it's just
    • 1:14:17simple to actually keep the user's data all in a single column,
    • 1:14:22and just as they did, separate it by commas.
    • 1:14:24But this is a real messy way to use CSV is
    • 1:14:28by putting comma-separated values in your comma-separated values.
    • 1:14:32Arguably, the folks at Google probably just did this
    • 1:14:35because it's just simpler.
    • 1:14:36And they didn't want to give people multiple sheets
    • 1:14:38or complicate things using some other weirder character than commas alone.
    • 1:14:42But I bet there's a better way for us to do this.
    • 1:14:44And let me go ahead and do this.
    • 1:14:45Let me go back into my code here.
    • 1:14:47And in just a moment, I'm going to grab a program
    • 1:14:50that I wrote in advance that's going to use Python to open up the CSV file,
    • 1:14:54iterate over all of the rows, and load the data into two tables this time,
    • 1:14:59two tables, one called shows, and one called genres,
    • 1:15:02so as to actually separate these two things out.
    • 1:15:05Give me just a moment to grab the code.
    • 1:15:07And when I run this, I'll only have to run it once.
    • 1:15:10Let me go ahead and run Python in a moment,
    • 1:15:13and I'll reveal the results in a sec.
    • 1:15:15This is going to be version 8 of the code online.
    • 1:15:18When I do this, let me go ahead and open up this file.
    • 1:15:22Give me a second to move it into this directory.
    • 1:15:26Version 8, OK.
    • 1:15:28So here we have version 8 of this that's available online
    • 1:15:31that's going to do the following.
    • 1:15:32And I'll gloss over some of the details just so
    • 1:15:34that we don't get stuck in the weeds of some of this code.
    • 1:15:38I'm going to be using, at the top of this program,
    • 1:15:41as we'll soon see, a CS50 library, not for the sake of get_string,
    • 1:15:45or get_int, or get_float, but because there's some built-in SQL
    • 1:15:48functionality that we didn't discuss a couple of weeks back with the CS50
    • 1:15:51library itself.
    • 1:15:52But inside of the CS50 library we'll see there is a special function called
    • 1:15:56SQL that gives you the ability using this weird URL-like looking thing,
    • 1:16:01technically called a URI, that allows me to open a file called favorites.db.
    • 1:16:06And long story short, all of the subsequent code
    • 1:16:08is going to iterate over this favorites.csv file that we downloaded.
    • 1:16:12And it's going to import it into the SQLite database,
    • 1:16:16but it's going to use two tables instead of just one.
    • 1:16:19So give me just a moment to run this, and then I'll
    • 1:16:21reveal the actual results.
    • 1:16:23This is going to be run on favorites.csv.
    • 1:16:31And taking a look here, give me just a moment.
    • 1:16:39Oh, give me a sec.
    • 1:16:43Come on.
    • 1:16:45Come on.
    • 1:16:46This program should not be taking this long.
    • 1:16:49Sorry.
    • 1:16:50Let's open this real fast.
    • 1:16:54Whoops, not that file.
    • 1:16:59OK.
    • 1:16:59Let me just skim this code real quick to see where we've gone wrong.
    • 1:17:02[INAUDIBLE] reader.
    • 1:17:04Reader, title, show ID in certain two shows.
    • 1:17:10[INAUDIBLE] genres split, DB execute.
    • 1:17:15All right.
    • 1:17:15This is me debugging in real time.
    • 1:17:17All those times we encourage you to use print, this is me actually using print.
    • 1:17:22We'll see how quickly I can recover from this.
    • 1:17:24Python of favorites version 8.
    • 1:17:29OK, so here's me debugging in real time.
    • 1:17:32It's printing it.
    • 1:17:32Oh, maybe I just didn't wait long enough.
    • 1:17:34OK, so here we go.
    • 1:17:35What I'm doing is printing out the dictionary that represents
    • 1:17:39each row that you all typed in.
    • 1:17:41And we're actually making progress.
    • 1:17:43All right.
    • 1:17:44I was too impatient and didn't wait long enough.
    • 1:17:46So in a moment--
    • 1:17:47there we go.
    • 1:17:47All right, so all we have to do sometimes is wait.
    • 1:17:50Let me go ahead now and open this file using SQLite3.
    • 1:17:54So in SQLite3 I now have a different version of favorites.db.
    • 1:17:58I named it number 8 for consistency.
    • 1:17:59Once I've run the program I can do .schema to look inside of it.
    • 1:18:03And here's what the two tables in this database are going to look like.
    • 1:18:07I've created a table called shows, this time to represent all of the TV shows
    • 1:18:10that are favorites, that has two columns.
    • 1:18:14One is called ID, one is called Title.
    • 1:18:16But now I'm going to start taking out for a spin some
    • 1:18:19of the other features of SQL.
    • 1:18:20And besides there being text, it turns out there's a data type called integer.
    • 1:18:24Besides there being a data type called text,
    • 1:18:26there's also a special key phrase that you can specify
    • 1:18:30that the title can never be null.
    • 1:18:31Think back to our use of null in C. Think back
    • 1:18:35to the keyword none in Python.
    • 1:18:37This is a database constraint that allows you to ensure that none of you
    • 1:18:40can't have of favorite TV show.
    • 1:18:42If you submit the form, you have to have typed in a title for it
    • 1:18:46to end up in our database here.
    • 1:18:48And you'll notice one other new feature.
    • 1:18:50It turns out, on this table I'm defining what's
    • 1:18:53called a primary key, specifically to be the ID column.
    • 1:18:56More on that in just a moment.
    • 1:18:58Meanwhile, the second table my code has created for me, as we'll soon see,
    • 1:19:02gives me a column called show ID, and then, a genre,
    • 1:19:08the value of which is text that can also not be null.
    • 1:19:11And then more on this in a moment.
    • 1:19:12This table has what we're going to call a foreign key,
    • 1:19:15specifically the show ID column that references shows ID.
    • 1:19:20So before we get into the weeds of this, this
    • 1:19:23is now a way of creating the relation in relational database.
    • 1:19:27If I have two tables now, not just one, they can somehow
    • 1:19:31be linked together by a common column.
    • 1:19:34In other words, the shows column--
    • 1:19:37shows table is going to give me a table with two columns--
    • 1:19:41an ID and a title.
    • 1:19:43Every title you gave me, I'm going to assign a unique value.
    • 1:19:46The genre's table, meanwhile, is going to associate individual genres
    • 1:19:51singular with that same idea.
    • 1:19:54And the result of this, to pop back to the Terminal here, is, let's do this.
    • 1:20:01Select star from shows of this new database,
    • 1:20:05and you'll see that I've given, indeed, all of the shows you all typed
    • 1:20:09in unique identifiers.
    • 1:20:10I didn't filter out duplicates or do anything beyond just forcing everything
    • 1:20:13to uppercase.
    • 1:20:14So there's going to be some duplicates here because I didn't
    • 1:20:16want to get rid of anyone's data.
    • 1:20:18But you'll see that, indeed, I've given everyone
    • 1:20:21a unique identifier, from the very first person who
    • 1:20:24typed How I Met Your Mother, all the way down to input number 158.
    • 1:20:28Meanwhile, if I do select star from genres, which is now a table, not just
    • 1:20:35a column in the original data, now you'll
    • 1:20:38see a much better design for this data.
    • 1:20:43Notice what I've done here.
    • 1:20:44Let me go all the way to the top and you'll see two columns, one of which
    • 1:20:47is called show ID, the other of which is called genre.
    • 1:20:51And again, I wrote some code to do this because I
    • 1:20:53had to take Google's messy output where everything was separated by commas.
    • 1:20:56I had to tear away the commas and then put each genre into this table
    • 1:21:00by itself.
    • 1:21:01Even though we haven't introduced the syntax via which
    • 1:21:04we can reconstitute the data and reassociate
    • 1:21:07your genres with your titles, why, at a glance,
    • 1:21:10might this be a better design now?
    • 1:21:13Even though I've doubled the number of tables from one to two,
    • 1:21:16why is this probably on the direction toward a better design?
    • 1:21:21What might your instincts be?
    • 1:21:26Why is this cleaner?
    • 1:21:27Again, first time with SQL, why is it better, perhaps,
    • 1:21:31that we've done this with our genre's table?
    • 1:21:33Can I come to you?
    • 1:21:35Why might this be better?
    • 1:21:36Yeah.
    • 1:21:37Oh, just because we had the conversation before about the commas.
    • 1:21:41AUDIENCE: [INAUDIBLE]
    • 1:21:42DAVID J. MALAN: Exactly.
    • 1:21:42It's as simple as that.
    • 1:21:43We've cleaned up the data by giving every genre, every word in the genres
    • 1:21:49column in the original Google Spreadsheet
    • 1:21:51its own cell in this table, if you will.
    • 1:21:54And now notice show ID might appear multiple times.
    • 1:21:56Whoever typed in How I Met Your Mother, they only associated one genre with it.
    • 1:22:01And so we see that show ID 1 is a comedy.
    • 1:22:04But whoever typed in--
    • 1:22:05I forget the name of the second show offhand.
    • 1:22:07But that person, whoever was assigned show ID 2 checked off
    • 1:22:11a whole bunch of the genre's boxes.
    • 1:22:13That happened again with show ID 3, 4.
    • 1:22:17Persons 5, 6, 7 only checked one box.
    • 1:22:20And so you can see now that we've associated the data with what we
    • 1:22:25might call a one-to-many relationship.
    • 1:22:27A one-to-many relationship, whereby for every one show in the show's table,
    • 1:22:33it can now have many genres associated with it, each of which
    • 1:22:36is represented by a separate row here.
    • 1:22:41So again, if I go ahead and select star from shows--
    • 1:22:44let's limit it to the first 10 just to focus on a subset of the data.
    • 1:22:48How I Met Your Mother, The Sopranos was the second input there.
    • 1:22:51It would seem that now that I've created the data in this way,
    • 1:22:55I could ideally somehow search the data, but a little more correctly.
    • 1:22:59I don't have to worry about the commas.
    • 1:23:01I don't have to worry about the hackish approach of music
    • 1:23:03being a substring of musical.
    • 1:23:05But how can I actually get back at this data?
    • 1:23:08Well, let's go ahead and do this.
    • 1:23:09Suppose I did want to get back maybe all of the comedies.
    • 1:23:13All of the comedies, no matter whether the person checked just the comedy
    • 1:23:17box or multiple boxes instead.
    • 1:23:19How now, given that I have two tables, could I
    • 1:23:23go about selecting only the titles of comedies?
    • 1:23:27I've actually made the problem a little harder,
    • 1:23:29but again, SQL is going to give me a solution for this.
    • 1:23:32The problem is that if I want to search for comedies,
    • 1:23:35I have to check the genres table first.
    • 1:23:37And then what's that going to give me?
    • 1:23:39If I search the genres table for comedies,
    • 1:23:43what's that going to give me back potentially?
    • 1:23:46Yeah?
    • 1:23:47AUDIENCE: Show ID.
    • 1:23:48DAVID J. MALAN: Maybe show ID.
    • 1:23:49So let me try that.
    • 1:23:49Let me do select show ID from genres, where the genre in a given row
    • 1:23:55equals quote, unquote, "comedy."
    • 1:23:57No commas, no like, no percent signs.
    • 1:24:00Because literally, that column now is singular words, like comedy, or drama,
    • 1:24:04or the like.
    • 1:24:05Let me go ahead and hit Enter here.
    • 1:24:07OK, so I got back a whole bunch of ID numbers.
    • 1:24:10Now this could very quickly get annoying.
    • 1:24:13It looks like show ID 1, 2, 4, 5, 6, 7, 9, and so forth, are all comedies.
    • 1:24:17So I could do something really crazy like, select title from shows,
    • 1:24:23where ID equals 1, or ID equals 2.
    • 1:24:29This is not going to scale very well, but this
    • 1:24:32is why SQL is especially powerful.
    • 1:24:34You can actually compose one SQL question from multiple ones.
    • 1:24:39So let's do this.
    • 1:24:40Why don't I select the title where the ID of the show
    • 1:24:44is in the following list of IDs?
    • 1:24:48Select show ID from genres, where the specific genre is, quote, unquote,
    • 1:24:54"comedy."
    • 1:24:55So I've got two SQL queries.
    • 1:24:58One is deliberately nested inside of parentheses.
    • 1:25:01That's going to give me back that whole list of show IDs.
    • 1:25:04But that's exactly what I want to then look up
    • 1:25:07the titles for by selecting title from shows where the ID of the show
    • 1:25:11is in that big, tall list.
    • 1:25:13And so now if I hit Enter, I get back only those shows
    • 1:25:18that were somehow flagged as comedy, whether you in the audience
    • 1:25:22checked one box for comedy, two boxes, or all of the boxes.
    • 1:25:26Somehow we teased out comedy, again, just
    • 1:25:28by using that Python script, which loaded this data not
    • 1:25:31into one big table, but instead, two.
    • 1:25:33And if we want to clean this up, let's do a couple of things.
    • 1:25:36Let's, outside of the parentheses, do order by title.
    • 1:25:40This is a way of sorting the data in SQL very easily.
    • 1:25:43Now we have a whole list of the same titles that are now sorted.
    • 1:25:48And what was the keyword with which I could filter out duplicates?
    • 1:25:52Yeah, distinct.
    • 1:25:53So let's try this.
    • 1:25:54Same query, but let's select only the distinct titles from that whole query.
    • 1:25:59And notice, I've very deliberately done it this way.
    • 1:26:01And to this day, any time I'm using SQL, I
    • 1:26:03don't just start at the beginning and type out my whole thought,
    • 1:26:06and just get it right on the first try.
    • 1:26:07I very commonly start with the subquery, if you
    • 1:26:10will, the thing in parentheses, just to get myself
    • 1:26:12one step toward what I care about.
    • 1:26:14Then I add to it.
    • 1:26:15Then I add to it.
    • 1:26:16Then I add to it, just like we've encouraged in Python and C,
    • 1:26:18taking baby steps in order to get to the answer you actually care about,
    • 1:26:22like this one now.
    • 1:26:23And other than this mistake, which we didn't
    • 1:26:26fix because I re-imported the data after accidentally changing everyone's genre,
    • 1:26:30we now have an alphabetized list of all of the same data.
    • 1:26:35But now it's better designed, because we have it split across these two tables.
    • 1:26:40Oh, thank you.
    • 1:26:41OK, just thanks.
    • 1:26:44What questions do we have, if any here?
    • 1:26:48Questions on this approach?
    • 1:26:53Yeah?
    • 1:26:55AUDIENCE: [INAUDIBLE]
    • 1:27:00DAVID J. MALAN: Oh, now that we have a database,
    • 1:27:02how do we transfer it to a CSV?
    • 1:27:04There are ways to do that.
    • 1:27:06And in fact, there's a command within SQLite
    • 1:27:08that allows you to export your data back to a CSV file.
    • 1:27:11If you want to email it to someone and you
    • 1:27:13want them to be able to open it in Excel, or Google Spreadsheets, or Apple
    • 1:27:16Numbers, or the like, you can go in the other direction.
    • 1:27:19Generally though, once you're in the world of SQL
    • 1:27:21you're probably storing your data there long term.
    • 1:27:24And you're probably updating it, maybe deleting it, adding to it,
    • 1:27:27and so forth.
    • 1:27:28For instance, the one command I did not show earlier
    • 1:27:30is, suppose someone forgot a show.
    • 1:27:33Let's see, did I see this in the output?
    • 1:27:35All right, so Curb Your Enthusiasm.
    • 1:27:37Saw that last night.
    • 1:27:38It was just, yeah.
    • 1:27:39Did anyone see it last night?
    • 1:27:41No?
    • 1:27:41All right, well, just the one person that checked that box, so you and me.
    • 1:27:44What's another show that didn't make the list?
    • 1:27:46How about Seinfeld?
    • 1:27:48It's now on Netflix, apparently.
    • 1:27:49So insert into shows.
    • 1:27:55What do we want to insert?
    • 1:27:56Well, we want to insert maybe an ID and a title.
    • 1:27:59But I don't actually care what the ID is,
    • 1:28:02so I'm just going to insert a title.
    • 1:28:03And the value I'm going to give to that title
    • 1:28:05is going to be, quote, unquote, "Seinfeld."
    • 1:28:08And then, let me go ahead and hit semicolon.
    • 1:28:11Nothing seems to happen, but let me rerun the big query
    • 1:28:14from before looking for comedies.
    • 1:28:16And unfortunately, Seinfeld has not yet been flagged as a comedy,
    • 1:28:19so let's get this right, too.
    • 1:28:21What intuitively I'm going to have to do to associate, now,
    • 1:28:25Seinfeld with my comedies?
    • 1:28:27I just inserted into the show's table.
    • 1:28:30What more needs to happen before we can flag Seinfeld as a comedy?
    • 1:28:33AUDIENCE: [INAUDIBLE]
    • 1:28:35DAVID J. MALAN: Say again?
    • 1:28:36AUDIENCE: [INAUDIBLE]
    • 1:28:37DAVID J. MALAN: Yeah.
    • 1:28:38So I need to insert into the genres table two things now,
    • 1:28:42a show ID, like this, and then, the name of the genre,
    • 1:28:48which presumably is comedy.
    • 1:28:49What values do I want to insert?
    • 1:28:50Well, the show ID, I better grab that.
    • 1:28:52Oh, I don't even know what it is.
    • 1:28:54I'm going to have to figure out what that is.
    • 1:28:55So I could do this in a couple of ways.
    • 1:28:57Let me do a one-time thing.
    • 1:28:59Select star from shows, where title equals,
    • 1:29:02quote, unquote, "Seinfeld" semicolon 159.
    • 1:29:06So now I could do, insert into genres a show ID
    • 1:29:11and a genre name, the values 159, and, quote, unquote, "comedy" semicolon,
    • 1:29:20Enter.
    • 1:29:21And now, if I scroll back in my history and execute that really big query
    • 1:29:24again, looking for all distinct comedies,
    • 1:29:26now Seinfeld has made the list.
    • 1:29:29But I did this manually so I didn't actually capitalize it.
    • 1:29:32Let's clean that up.
    • 1:29:33Let's do update.
    • 1:29:35Let's do update my shows.
    • 1:29:37Set title equals to Seinfeld semicolon.
    • 1:29:43No?
    • 1:29:43OK, thank you, where title equals, quote, unquote, "Seinfeld."
    • 1:29:47Let's not make that mistake again.
    • 1:29:49Enter.
    • 1:29:50And now, if I execute that really big query, now Seinfeld is,
    • 1:29:53indeed, considered a comedy.
    • 1:29:56So where are we going with this?
    • 1:29:57Well, thus far we've been doing all of this pretty manually.
    • 1:30:00And this is absolutely what an analyst, a data scientist type person
    • 1:30:02might do if just manipulating a pretty large data set just
    • 1:30:05to get at interesting answers that might be across one,
    • 1:30:08two, or even many more tables.
    • 1:30:09Eventually, in a few weeks, we're going to start to automate all of this
    • 1:30:12by writing code in Python that generates SQL to do this.
    • 1:30:16If you go to most any website on the internet today,
    • 1:30:19and you, for instance, log in, odds are you're typing a username and password,
    • 1:30:23clicking Submit.
    • 1:30:24What's then happening?
    • 1:30:25Well, the website might not be implemented in Python
    • 1:30:28but it's probably implemented in some language, Python, JavaScript, Java,
    • 1:30:31Ruby, something else.
    • 1:30:33And that language is probably using something like a relational database
    • 1:30:38to use SQL to get your username, get your password,
    • 1:30:41and compare the two against what you've typed in.
    • 1:30:43And actually, it's hopefully not getting your actual password,
    • 1:30:46but something called the hash thereof.
    • 1:30:48But there's probably a database involved doing that.
    • 1:30:50When you buy something on Amazon.com and you click Check Out,
    • 1:30:53odds are there's some code on Amazon's server
    • 1:30:56that's looking at what it is you added to your shopping cart,
    • 1:30:59and then maybe using a for loop of some sort, in Python or another language.
    • 1:31:03It's doing a whole bunch of SQL inserts to store in their database what
    • 1:31:08it is you bought.
    • 1:31:09There's other types of databases, too, but SQL databases,
    • 1:31:12or relational databases are quite popular.
    • 1:31:14So let's go ahead and write one other program here
    • 1:31:16in Python that now merges these two languages together, whereby
    • 1:31:21I'm going to use SQL inside of a Python program
    • 1:31:24so I can implement my logic of my program in Python,
    • 1:31:28step-by-step, line-by-line.
    • 1:31:30But when I want to get at some data I can actually talk to a SQL database.
    • 1:31:34So let me go ahead and open favorites.py.
    • 1:31:39And let me go ahead and throw away some of what we did earlier and really
    • 1:31:45just now add a SQL to the mix.
    • 1:31:47From the CS50 library, let's import the SQL function.
    • 1:31:51This will be useful to use because most third-party libraries
    • 1:31:54that deal with SQL and Python are more complicated than they need to be.
    • 1:31:57So I think you'll find this library easier to use.
    • 1:32:00Let's then do the following.
    • 1:32:01Create a variable called db for database.
    • 1:32:03But I could call it anything I want.
    • 1:32:05Let's use that you URI, which is a fancy way of saying something
    • 1:32:09that looks like a URL, but that actually opens up a database on disk, that is,
    • 1:32:17in the current folder.
    • 1:32:19Let's now ask the user for a title by prompting them for a, quote, unquote,
    • 1:32:22"title" like this.
    • 1:32:23And let's strip off any whitespace just so that the data is not messy.
    • 1:32:27And then, let's go ahead and do this.
    • 1:32:30And this is the new logic.
    • 1:32:32I'm going to go ahead now and write a line of code that uses Python
    • 1:32:36to talk to the original favorites.db.
    • 1:32:39So again, I'm not using the two-table database, which is in favorites8.db.
    • 1:32:43I'm using the original that we imported from your own data,
    • 1:32:47and I'm going to do the following.
    • 1:32:49I'm going to use db.execute to execute a SQL command inside of Python.
    • 1:32:54I'm going to select the count of shows from the favorites
    • 1:33:02table, where the title the user typed in is like this question mark.
    • 1:33:09And why I'm doing that is as follows.
    • 1:33:11Just like in C, when we had percent S, in SQL for now,
    • 1:33:15the analogue is going to be a question mark.
    • 1:33:17So same idea, different syntax.
    • 1:33:18Instead of percent S, it's just a question mark.
    • 1:33:21And using a comma outside of this first string, using CS50's execute
    • 1:33:26function I can pass in a SQL string, a command,
    • 1:33:29then any arguments I want to plug into the question marks therein.
    • 1:33:33So the goal at hand is to actually write a program that's
    • 1:33:36going to search favorites.csv, a.k.a., favorites.db for the total number
    • 1:33:42of people that liked a particular show.
    • 1:33:45So this is going to select the count of people from the favorites table
    • 1:33:49where the title they typed in is like whatever the user has just now typed
    • 1:33:53in.
    • 1:33:53This db execute function returns a list.
    • 1:33:56It returns a list of rows.
    • 1:33:57And you would only know that by my telling you or reading
    • 1:34:00the documentation.
    • 1:34:01And therefore, if I want to get back to the total count,
    • 1:34:04I'm going to go ahead and grab the first row from those rows.
    • 1:34:08Because it's only going to give me back the count.
    • 1:34:11And then I'm going to go ahead and print out that row's first value.
    • 1:34:16But it's going to be a little weird.
    • 1:34:17Technically the column is going to be called "count" star, quote, unquote,
    • 1:34:21which is a little weird.
    • 1:34:22Let me add one more feature to the mix.
    • 1:34:24You can actually give nicknames to columns
    • 1:34:26that are coming back, especially if they are the result of functions like this.
    • 1:34:30I can just call that column counter, in all lowercase.
    • 1:34:34That means I can now say get back the counter key inside of this dictionary.
    • 1:34:41So just to recap, what have we done?
    • 1:34:43We've imported the CS50 library SQL function.
    • 1:34:46We've, with this line of code, opened the favorites.db
    • 1:34:49file that you and I created earlier by importing your CSV into SQLite.
    • 1:34:54I'm now just asking the user for a title they want to search for.
    • 1:34:58I'm now executing this SQL query on that database,
    • 1:35:02plugging in whatever the human typed in as their title
    • 1:35:05in order to get back a total count.
    • 1:35:07And I'm giving the count a nickname, an alias of counter,
    • 1:35:10just so it's more self-explanatory.
    • 1:35:13This function, db execute, no matter what, always returns a list of rows,
    • 1:35:18even if there's only one row inside of it.
    • 1:35:20So this line of code just gives me the first and only row.
    • 1:35:23And then, this goes inside of that row, which it turns out is a dictionary,
    • 1:35:27and gives me the key counter and the value it corresponds to.
    • 1:35:33So what, to be clear, is this doing?
    • 1:35:35Let's go ahead and run this manually in my Terminal window first.
    • 1:35:38Let me run SQLite3 on favorites--
    • 1:35:41Well, let's do this.
    • 1:35:43On favorites.db, let me import the data again.
    • 1:35:47So mode csv.import in from favorites.csv into a favorites table.
    • 1:35:54So I've just recreated the same data set that you all
    • 1:35:57gave me earlier in favorites.db.
    • 1:35:59If I were to do this manually, let's search for The Office again.
    • 1:36:02Select, count star from favorites, where title like, and let's
    • 1:36:09just manually type it in for now, The Office.
    • 1:36:12We'll search for the one with the word The, semicolon.
    • 1:36:15I get back 12.
    • 1:36:16But technically, notice what I get back.
    • 1:36:18I technically get back a miniature table containing one column and one row.
    • 1:36:25What if I want to rename that column?
    • 1:36:27That's where the as keyword comes in.
    • 1:36:28So select count star as counter.
    • 1:36:31Notice what happens, Enter.
    • 1:36:32I just get back-- same simple table, but I've
    • 1:36:35renamed the column to be counter just because it's a little more
    • 1:36:38self-explanatory as to what it is.
    • 1:36:40So what am I doing with this line of code?
    • 1:36:43This line of code is returning to me that miniature temporary table
    • 1:36:47in the form of a list of dictionaries.
    • 1:36:50The list contains one row, as we'll see, and it
    • 1:36:55contains one column, as we'll see, the key for which is counter.
    • 1:37:00So let's now run the code itself.
    • 1:37:02I'm going to get out of SQLite3 and I'm going to run Python of favorites.py.
    • 1:37:07Enter.
    • 1:37:08I'm being prompted for a title.
    • 1:37:09I'm going to type in The Office and cross my fingers, and there's that 12.
    • 1:37:13Why is it 12?
    • 1:37:14Well, there's a typo again because I re-imported the CSV.
    • 1:37:17I had deleted two of the Thes, so we're back at the original data set.
    • 1:37:21So there's 12 total that have, quote, unquote, "The Office"
    • 1:37:25in the title like that.
    • 1:37:27So what have we done?
    • 1:37:28We've combined some Python with some SQL,
    • 1:37:31but we've relegated all of the complexity of searching for something,
    • 1:37:34the selecting of something, gotten rid of all
    • 1:37:36of the with keyword, the open keyword, the for loop,
    • 1:37:39the reader the DictReader, and all of that.
    • 1:37:41And it's just one line of SQL now, using the best of both worlds.
    • 1:37:46All right, any questions on what we've just done here or how any of this
    • 1:37:51works?
    • 1:37:54Any questions here?
    • 1:37:56Yeah?
    • 1:37:57AUDIENCE: [INAUDIBLE]
    • 1:38:01DAVID J. MALAN: When does this function return more than one row?
    • 1:38:04Was that the question?
    • 1:38:05AUDIENCE: Yeah.
    • 1:38:05DAVID J. MALAN: Yeah.
    • 1:38:06So let's do that by changing the problem at hand.
    • 1:38:08This program was designed just to select the total count.
    • 1:38:10Let's go ahead and select, for instance, all
    • 1:38:16of the ways you all typed in The Office by selecting the title this time.
    • 1:38:24If I do this in SQLite3, let me go ahead and do this again
    • 1:38:28after increasing my Terminal window.
    • 1:38:29Let's do it manually.
    • 1:38:30Select title from favorites, where the title is like,
    • 1:38:35quote, unquote, "The Office," semicolon.
    • 1:38:38I get back all of these different rows, and we didn't even notice this one.
    • 1:38:41There's actually another little typo in there
    • 1:38:43with some capitalization of the E, and the C, and the E. That
    • 1:38:47would be an example of a query that gives me back therefore
    • 1:38:50for multiple rows.
    • 1:38:52So let's now change my Python program.
    • 1:38:53If I now, in my Python program, do this, I get back a whole bunch of rows
    • 1:38:59containing all of those titles.
    • 1:39:01I can now do, for row in rows, I can print out the current row's title,
    • 1:39:06and now manipulate all of those things together.
    • 1:39:09Let me keep both on the screen.
    • 1:39:10Let me run Python of favorites.py.
    • 1:39:12And that for loop now should iterate, what, 10 or more times,
    • 1:39:16once for each of those titles.
    • 1:39:18And indeed, if I type in The Office again, Enter.
    • 1:39:22Whoops.
    • 1:39:24Row title.
    • 1:39:25What did I do wrong?
    • 1:39:26Oh, I should not be renaming title to counter this time.
    • 1:39:30So that's just a dumb mistake on my part.
    • 1:39:31Let me rerun it again.
    • 1:39:33And now I should see after typing in The Office,
    • 1:39:36Enter, a whole bunch of The Offices.
    • 1:39:38And because I'm using like, even the missed
    • 1:39:40capitalizations are coming through, because like is case insensitive.
    • 1:39:43It doesn't matter if it's uppercase or lowercase.
    • 1:39:45Whereas had I used the equal sign I would get back only the same ones
    • 1:39:50capitalized correctly.
    • 1:39:51All right, any questions on this next?
    • 1:39:55All right, so let's transition to a larger, juicier data
    • 1:39:59set, and consider some of the issues that
    • 1:40:01arise when actually now using SQL and skating toward a world in which we're
    • 1:40:05using SQL for mobile apps, web apps, and generally speaking,
    • 1:40:08very large data sets.
    • 1:40:10So let's start with a larger data set just like that.
    • 1:40:14Give me just a moment to switch screens over to what we have for you today,
    • 1:40:19which is an actual relational database that we've created out
    • 1:40:22of a real-world data set from IMDb.
    • 1:40:26So InternetMovieDatabase.com is a website
    • 1:40:29where you can search for TV shows, and movies, and actors,
    • 1:40:31and so forth, all using their database behind the scenes.
    • 1:40:34IMDb wonderfully makes their data set available as not CSV files,
    • 1:40:39but TSV files, tab-separated values.
    • 1:40:42And so what we did is, before class we downloaded those TSV files.
    • 1:40:46We wrote a Python program similar to my favorites8.py file
    • 1:40:50earlier that read in all of those TSV files,
    • 1:40:53created some SQL tables in an IMDb database
    • 1:40:58for you in SQLite that has multiple tables and multiple columns.
    • 1:41:03So let's go and wrap our minds around what's actually in this data set.
    • 1:41:07Let me go back to VS Code here, and in just a moment,
    • 1:41:10I'm going to go ahead and copy the file, which we've named shows.db.
    • 1:41:15And I'm going to go ahead and increase my Terminal and do SQLite3 of shows.db.
    • 1:41:20Whenever playing around with a SQLite database for the first time,
    • 1:41:23typing .schema is perhaps a good place to start to give you a sense
    • 1:41:26of what's in there.
    • 1:41:27And things just escalated quickly.
    • 1:41:29There's a lot in this data set, because, indeed, there's
    • 1:41:31going to be tens of hundreds of thousands of rows in this data set,
    • 1:41:35and also problem set 7, where we'll look at the movie side of things
    • 1:41:39and not just the TV shows.
    • 1:41:40So what is the schema that we have created for you
    • 1:41:43from IMDb's actual real-world data?
    • 1:41:47One, there's a table called shows.
    • 1:41:48And notice we've just added whitespace by hitting Enter a bunch of times
    • 1:41:51to make it a little more stylistically readable.
    • 1:41:54The shows table has an ID column, a title column, a year,
    • 1:41:57and the total number of episodes for a given show.
    • 1:42:00And the types of those columns are integer, text, numeric, and integer.
    • 1:42:05So it turns out there's actually a few different data
    • 1:42:08types that are worth being aware of when it comes to creating tables themselves.
    • 1:42:13In fact, in SQLite there's five data types, and only five,
    • 1:42:18fortunately, one of which is, indeed, integer, negative or positive,
    • 1:42:21numeric, which is kind of a catchall for dates and times,
    • 1:42:25things that are numeric but are not just integers,
    • 1:42:27and not just real numbers, for instance.
    • 1:42:29Real number is what we've generally thought of as float up until now.
    • 1:42:33Text, of course, is just text, but notice
    • 1:42:35that you don't have to worry about how big it is.
    • 1:42:37Like in Python, it will size to fit.
    • 1:42:39And then there's BLOB, which is binary large object, which
    • 1:42:41is for just raw 0s and 1s, like for files or things like that.
    • 1:42:45But we'll generally use the other four of these.
    • 1:42:47And so, indeed, when we imported this data for you
    • 1:42:50we decided that every show would be given an ID, which is just an integer.
    • 1:42:56Every show has, of course, a title, which should not be null.
    • 1:42:59Otherwise, why is it in the database?
    • 1:43:01Every show has a year, which is numeric according
    • 1:43:04to that definition a moment ago.
    • 1:43:06And the total number of episodes for a show is going to be an integer.
    • 1:43:09What now is with these primary keys that we mentioned earlier, too?
    • 1:43:13A primary key is the column that uniquely identifies all of the data.
    • 1:43:18In our case, with the favorites, I automatically
    • 1:43:21gave each of your submissions a unique ID so that even if two or more of you
    • 1:43:24typed in The Office, your submission still
    • 1:43:27had a unique identifier, a number that allowed me to then correlate it
    • 1:43:32with your genres, just as we saw a moment ago.
    • 1:43:36In this version of IMDb, there's also genres.
    • 1:43:39But they don't come from us, they come from IMDb.com.
    • 1:43:42And so a genre has a show ID, and a genre just like our database.
    • 1:43:46But these are real-world genres with a bit more filtration.
    • 1:43:49Notice, though, just like my version, there's a foreign key.
    • 1:43:54A foreign key is the appearance of another table's primary key
    • 1:43:59in its own table.
    • 1:44:02So when you have a table like genres, which is somehow
    • 1:44:05cross referencing the original shows table, if shows have a primary key
    • 1:44:10called ID, and those same numbers appear in the genres table
    • 1:44:15under the column called show ID, by definition, show ID is a foreign key.
    • 1:44:20It's the same numbers but it's foreign in the sense
    • 1:44:22that the number is being used in this table,
    • 1:44:25even though it's officially defined primarily in this other table.
    • 1:44:29This is what we mean by relational databases.
    • 1:44:31You have multiple tables with some column in common, numbers typically.
    • 1:44:37And those numbers allow you to line the two tables up in such a way
    • 1:44:40that you can reconnect the shows with their genres,
    • 1:44:44just like we did with our smaller data set a moment ago.
    • 1:44:46This logic is extended further.
    • 1:44:49Notice that the IMDb database we've created for you has a stars table,
    • 1:44:53like TV show stars, the actors therein.
    • 1:44:56And that table, interestingly, has no mention of people
    • 1:45:00and no mention of shows, per se.
    • 1:45:02It only has a column called show ID, which is an integer,
    • 1:45:05and a person ID, which is an integer.
    • 1:45:08Meanwhile, if we scrolled down to the bottom,
    • 1:45:14you will see a table called people.
    • 1:45:17And we have decided in IMDb's world that every person in the TV show world
    • 1:45:23will have a unique identifier that's a number, a name that's text, a birth
    • 1:45:27date, which is numeric, and then, again, specifying that ID
    • 1:45:31is going to be their primary key.
    • 1:45:35So what's going on here?
    • 1:45:36Well, it turns out that TV stars and writers are both types of people.
    • 1:45:42So using this relational database, notice the road we're going down.
    • 1:45:47We're factoring out commonalities.
    • 1:45:49And if a person can be different things in life,
    • 1:45:52well, we're defining them first and foremost as people.
    • 1:45:55And then, notice these two tables are almost the same.
    • 1:45:58The stars table has a show ID, which is a number,
    • 1:46:00and a person ID, which is a number, which
    • 1:46:02allows us via this middleman table, if you will, to link people with TV shows.
    • 1:46:10Similarly, the writers table allows us to connect shows with people, too,
    • 1:46:16by just recording those numbers.
    • 1:46:18So if we go into this data set, let's do the following.
    • 1:46:20Let's do select star from people semicolon.
    • 1:46:24So a huge amount of data is coming back.
    • 1:46:27This is hundreds of thousands of rows now based on the ID numbers alone.
    • 1:46:31So this is real-world data now flying across the screen.
    • 1:46:34There's a lot of people in the TV show business, not just actors and writers,
    • 1:46:38but others as well.
    • 1:46:39It's still going.
    • 1:46:40There's a lot of data there.
    • 1:46:41So my god, if you had to do anything manual in this data
    • 1:46:44set it's probably not going to work out very well.
    • 1:46:46And actually, we're up to, what, a million people in this data
    • 1:46:49set, plus, which would mean this probably isn't even
    • 1:46:51going to open very well in Excel, or Google Spreadsheets, or Apple Numbers.
    • 1:46:55SQL probably is the better approach here.
    • 1:46:57Let's search for someone specific, like select star
    • 1:47:00from people, where name equals Steve Carell, for instance, sticking
    • 1:47:06with comedies.
    • 1:47:06All right, so there's Steve Carell.
    • 1:47:08He is person number 136,797, born in 1962.
    • 1:47:14And that's as much data as we have on Steve Carell here.
    • 1:47:16How do we figure out what shows, for instance, he's in?
    • 1:47:19Well, let's see, select star from shows, semicolon.
    • 1:47:23There's a crazy number of shows out there in the IMDb database.
    • 1:47:27And you can see it here again flying across the screen.
    • 1:47:30Feels like we're going to have to employ some techniques in order
    • 1:47:33to get at all of Steve Carell's shows.
    • 1:47:37So how are we going to do that?
    • 1:47:39Well, god, this is a lot of data here.
    • 1:47:41And in fact, yeah, we have, what, 15 million
    • 1:47:45shows plus in this data set, too.
    • 1:47:47So doing things efficiently is now going to start to matter.
    • 1:47:50So let's actually do this.
    • 1:47:51Let me select a specific show.
    • 1:47:53Select star from shows where title equals, quote, unquote, "The Office."
    • 1:47:58And there presumably shouldn't be typos in this data
    • 1:48:00because it comes from the real website IMDb.com.
    • 1:48:03Let's get back to show.
    • 1:48:05Turns out there's been a lot of The Offices out in the world.
    • 1:48:08The one that started in 2005 is the one that we want,
    • 1:48:12presumably the most popular with 188 episodes.
    • 1:48:14How can we get just that?
    • 1:48:16Maybe we could do and year equals, how about 2005?
    • 1:48:21All right, so now we've got back just the ID of The Office
    • 1:48:25that we care about.
    • 1:48:26And let's do this, too.
    • 1:48:27Let me turn on a timer within SQLite just
    • 1:48:30to get a sense of running time now.
    • 1:48:31Let me do that again.
    • 1:48:33Select star from shows, where title equals The Office,
    • 1:48:36and year equals 2005.
    • 1:48:37And let's keep it simple.
    • 1:48:38Let's just do titles for now.
    • 1:48:40Enter.
    • 1:48:41All right, so not terribly long.
    • 1:48:43It found it pretty fast, but it looks like it took how much real time?
    • 1:48:460.02 seconds, not bad for just a title.
    • 1:48:50But just to plant a seed, it turns out that we can probably
    • 1:48:53speed even this up.
    • 1:48:54Let me do this.
    • 1:48:54Let me create something called an index, which is another use of the C
    • 1:48:58in CRUD for creating something.
    • 1:49:00And I'm going to call this title index.
    • 1:49:02And I'm going to create it on the shows table,
    • 1:49:07specifically on the title column.
    • 1:49:09And we'll see in a moment what this is going to do for me.
    • 1:49:11Enter.
    • 1:49:12Took a moment, like 0.349 seconds, to create something called an index.
    • 1:49:17But now watch, if I select star from shows searching for The Office again,
    • 1:49:21previously it took me 0.021 seconds.
    • 1:49:24Not bad, but now, wow.
    • 1:49:27Literally no time at all, or so low that it wasn't really measurable.
    • 1:49:31And I'll do it again just to get a sense of things.
    • 1:49:33Still quite low.
    • 1:49:34Now even though 0.021 seconds, not crazy long, imagine now having a lot of data,
    • 1:49:40a lot of users running a real website or real mobile app.
    • 1:49:42Every millisecond we can start to shave off is going to be compelling.
    • 1:49:45So what is it we just did?
    • 1:49:47Well, we actually just created something called an index.
    • 1:49:51And this is a nice way to tie in, now, some
    • 1:49:53of our week 5 discussion of data structures,
    • 1:49:56and our week 3 discussion of running times.
    • 1:49:58An index in a database is some kind of fancy data
    • 1:50:01structure that allows the database to do better than linear search.
    • 1:50:06Literally, as you just saw, these tables are crazy long or tall right now,
    • 1:50:10very linear, that is.
    • 1:50:12And so when I first searched for The Office,
    • 1:50:14it was literally doing linear search, top to bottom, looking at as many as,
    • 1:50:17what, a million plus rows.
    • 1:50:19That's relatively slow.
    • 1:50:20It's not that slow, 0.021 seconds.
    • 1:50:23But that's relatively slow just theoretically, algorithmically,
    • 1:50:27doing anything linearly.
    • 1:50:28But if you instead create an index using syntax
    • 1:50:32like this, which I just did, creating an index on the title column of the show's
    • 1:50:37table, that's like giving the database a clue in advance saying,
    • 1:50:41hey, I know I'm going to search on this column in this table a lot.
    • 1:50:44Do something with data structures to speed things up.
    • 1:50:47And so if you think back to our discussion of data structures,
    • 1:50:50maybe it's using a tree.
    • 1:50:51Maybe it's using a trie or a hash table, some fancier two-dimensional data
    • 1:50:56structure is generally going to lift the data up creating right maybe a tree
    • 1:50:59structure.
    • 1:51:00So it's just much faster to find data, especially
    • 1:51:03if it's sorting it now based on title, and not
    • 1:51:06just storing it in one long list.
    • 1:51:08And in fact, in the world of relational databases,
    • 1:51:10the type of structure that's often used in a database
    • 1:51:12is something called a B-tree.
    • 1:51:13It's not a binary tree.
    • 1:51:15Different use of the letter B, but it looks a little something like the trees
    • 1:51:19we've seen.
    • 1:51:19It's not binary because some of the nodes
    • 1:51:21might have more than two children or fewer,
    • 1:51:24but it's a very wide but relatively shallow tree.
    • 1:51:28It's not very tall.
    • 1:51:29And the upside of that is that if your data is stored in this tree,
    • 1:51:33the database can find it more quickly.
    • 1:51:36And the reason it took half a second, a third of a second to build the index
    • 1:51:41is because SQLite needed to take some non-zero amount of time
    • 1:51:45to just build up this tree in memory.
    • 1:51:47And it has algorithms for doing so based on alphabetization or other techniques.
    • 1:51:51But you spend a bit of time up front, a third of a second.
    • 1:51:55And then thereafter, wow.
    • 1:51:57Every subsequent query, if I keep doing it again and again,
    • 1:52:00is going to be crazy low, 0.000, maybe 0.001.
    • 1:52:04But an order of magnitude, a factor of 10 or 100 faster than it
    • 1:52:08previously was earlier.
    • 1:52:10So we have these indexes which allow us to get at data faster.
    • 1:52:14But what if we want to actually get data that's
    • 1:52:17now across these multiple tables?
    • 1:52:19How can we do that?
    • 1:52:20And how might these indices or indexes help further?
    • 1:52:23Well, it turns out there is a way that we've seen already
    • 1:52:26indirectly to join two tables together.
    • 1:52:29Previously, when I selected the ID of The Office,
    • 1:52:33and then I searched for it in the other table using select in a nested query,
    • 1:52:37I was joining two tables together.
    • 1:52:40And it turns out there's a couple of ways to do this.
    • 1:52:42Let's go ahead now and, for instance, find all of Steve Carell's TV shows.
    • 1:52:46Not just The Office but all of them, too.
    • 1:52:48Unfortunately, if we look at our schema, shows up here have no mention of TV--
    • 1:52:56oh, shows over here has no mention of the TV stars in them.
    • 1:53:01And people have no mention of shows.
    • 1:53:05We somehow need to use this table here to connect the two.
    • 1:53:09And this is called a join table, in the sense that using two integer columns--
    • 1:53:14it joins the two tables together logically.
    • 1:53:17And so if you're savvy enough with SQL, you can do what I did with my hands
    • 1:53:21earlier and like recombine tables by using these common IDs,
    • 1:53:26these integers together.
    • 1:53:27So let me do this.
    • 1:53:28Let me go ahead and figure out, step-by-step, Steve Carell's shows.
    • 1:53:32So how am I going to do this?
    • 1:53:33Well, if I select star from people, where name equals Steve Carell,
    • 1:53:39fortunately, there's only one of them.
    • 1:53:40So this gives me back his name, his ID, and his birth year.
    • 1:53:46But it's really only his ID that I care about.
    • 1:53:48Why?
    • 1:53:49Because in order to get back his shows, I need to link person ID with show ID.
    • 1:53:55So I need to know his ID number.
    • 1:53:57So what could I do with this?
    • 1:53:59Well, remember the schema and the stars table.
    • 1:54:04I've just gotten, from the people table, Steve Carell's ID.
    • 1:54:07I bet by transitivity I could now use his person ID, his ID,
    • 1:54:13to get back all of his show IDs.
    • 1:54:15And then once I've got all of his show IDs, I can take it one step further
    • 1:54:19and get back all of his shows' titles.
    • 1:54:21So the answer is actually English words and not just random, seemingly,
    • 1:54:25integers.
    • 1:54:26So let me go ahead and do this.
    • 1:54:27Let me, again, get Steve Carell's ID number, but not star.
    • 1:54:31Star represents everything.
    • 1:54:33It's a wildcard character in SQL.
    • 1:54:35Let me just select the ID of Steve Carell.
    • 1:54:38And that gives me back 136,797.
    • 1:54:41And it's only giving me back one value.
    • 1:54:43The thing called ID is just the column heading up above.
    • 1:54:46Now, suppose I want to select all of the show IDs
    • 1:54:51that Steve Carell is affiliated with.
    • 1:54:53Let me select Show ID from stars, where the person ID in stars
    • 1:55:00happens to equal Steve Carell's ID.
    • 1:55:03So again, I'm building up my answer in reverse and taking these baby steps.
    • 1:55:07On the right, in parentheses, I'm getting Steve Carell's ID.
    • 1:55:11On the left, I am now selecting all of the show IDs
    • 1:55:15that have some connection with that person ID in the stars table.
    • 1:55:19This answer, too, is not going to be that illuminating.
    • 1:55:21It's just a whole bunch of integers that have no meaning to me as a human.
    • 1:55:25But let's take this one step further.
    • 1:55:27And even though my code is getting long, I
    • 1:55:29could hit Enter and format it nicely, especially
    • 1:55:31if I were doing this in a code file.
    • 1:55:33But I'm just doing it interactively for now.
    • 1:55:35Let's now select all of the titles from the shows table,
    • 1:55:39where the ID of the show is in this following previous query.
    • 1:55:48So again, the query is getting long.
    • 1:55:49But notice, it's the third and last step.
    • 1:55:52Select title from the shows table, where the ID of the show
    • 1:55:55is in the list of all of the show IDs that
    • 1:55:58came back from the stars table searching for Steve Carell's person ID.
    • 1:56:02How did we get that person ID?
    • 1:56:03Let me scroll to the end.
    • 1:56:04Well, I selected, in my innermost parentheses, Steve Carell's own ID.
    • 1:56:11So now, when I hit Enter, voila.
    • 1:56:13I get all of Steve Carell's TV shows up until now.
    • 1:56:16And if I want to tidy this up further, I can use the same tricks as before.
    • 1:56:19Order by title, semicolon.
    • 1:56:22Now I've got it all alphabetized as before.
    • 1:56:25So again, with SQL comes the ability to search--
    • 1:56:28I mean, look how quickly we do this, 0.094 seconds
    • 1:56:31to search across three different tables to get back this answer.
    • 1:56:34But my data is now all neatly designed in individual tables,
    • 1:56:38which is going to be important now that the data set is so large.
    • 1:56:42But let me take this one step further.
    • 1:56:44Let me go ahead and do this.
    • 1:56:46Let me go ahead and point out that with this query,
    • 1:56:51notice that I'm searching on--
    • 1:56:54let's say I'm searching on a person ID here.
    • 1:56:58And at the end here, I'm searching on a name column here.
    • 1:57:02So let me actually go ahead and do this.
    • 1:57:05Let me go ahead and see if we can't speed this up.
    • 1:57:09This query at the moment takes 0.092 seconds.
    • 1:57:13Let's see if we can't speed this up further by just quickly creating
    • 1:57:15a few more of those B-trees in the databases memory.
    • 1:57:18Create an index called person index, and I'm going to do this on the stars table
    • 1:57:24on the person ID column.
    • 1:57:25Enter.
    • 1:57:26It's taking a moment, taking a moment.
    • 1:57:28That's almost a full second because that's a big table.
    • 1:57:30Let's create another index called show index on the stars table.
    • 1:57:35Why?
    • 1:57:35Because I want to search by the show ID also.
    • 1:57:37That was part of my big query.
    • 1:57:39Takes a moment.
    • 1:57:40OK, just more than about 2/3 of a second.
    • 1:57:43Now let's create one last one, another index called name index,
    • 1:57:46but I could call these things anything I want, on the people table.
    • 1:57:49Why?
    • 1:57:49Because I'm also searching on the name column.
    • 1:57:51So in short, I'm creating indexes on each
    • 1:57:53of the columns that are somehow involved in my search query,
    • 1:57:57going from one table to the other.
    • 1:57:59Now let's go back to the previous query, which, recall, took--
    • 1:58:07I think I erased it, 0.091.
    • 1:58:10All right.
    • 1:58:10Well, it was roughly this order of magnitude.
    • 1:58:12We're not seeing the data now.
    • 1:58:14But let me go ahead and run my original big query once.
    • 1:58:17And boom, we're down to almost nothing.
    • 1:58:20So again, creating these indexes in memory
    • 1:58:22has the effect of rapidly speeding up our computation time.
    • 1:58:27Now if you've ever used, for instance, the my.harvard course shopping tool
    • 1:58:31here on campus, or Yale's analogue, you might wonder, why is the thing so slow?
    • 1:58:35This could be one of the reasons why large data sets with thousands of rows,
    • 1:58:39thousands of courses tend to be slow, if,
    • 1:58:41and I'm only conjecturing, if the database isn't properly indexed.
    • 1:58:44If you're building your own web application
    • 1:58:46and you're finding that users are waiting and waiting,
    • 1:58:49and things are spinning and spinning, what might be among the problems?
    • 1:58:52Well, it could absolutely just be bad algorithms and bad code that you wrote.
    • 1:58:55Or it might be that you haven't thought about, well,
    • 1:58:58what column should be optimized for searches and filtration
    • 1:59:01like I've done here in order to speed up subsequent queries?
    • 1:59:05Again, from the outside in, we can only conjecture.
    • 1:59:07But ultimately, this is just one of the things that
    • 1:59:11explains performance problems as well.
    • 1:59:14All right, let's point out just a couple of final syntactic things,
    • 1:59:17and then we'll consider, bigger picture, some problems
    • 1:59:19that might arise in this world.
    • 1:59:22If these nested, nested queries start to get a little much,
    • 1:59:26there are other ways, just so you've seen it,
    • 1:59:29that you can execute similar logic in SQL.
    • 1:59:31For instance, if I know in advance that I
    • 1:59:34want to connect Steve Carell to his show IDs and to their titles,
    • 1:59:39we can do something more like this.
    • 1:59:41Select title from the people table, joined with the stars table on people
    • 1:59:52ID equals stars.personID.
    • 1:59:54So what am I doing?
    • 1:59:55New syntax.
    • 1:59:56And again, this is not something you'll have to memorize or ingrain right away.
    • 1:59:59But just so you've seen other approaches, select title from people
    • 2:00:04join stars.
    • 2:00:05This is an explicit way to say, take the people table in one hand, the stars
    • 2:00:09table in the other hand, and somehow join them
    • 2:00:12as I keep doing with my fingertips here.
    • 2:00:13How to join them?
    • 2:00:15Join them so that the people, the ID column in the people table lines up
    • 2:00:20with the person ID in the stars table.
    • 2:00:23But that's not quite everything.
    • 2:00:25I could also say, join further on the shows table,
    • 2:00:28where the stars show ID equals the shows ID column.
    • 2:00:35So what am I doing here?
    • 2:00:36That's saying, go further and join the stars table with the show's table,
    • 2:00:45joining the show ID column with the ID column.
    • 2:00:48Again, this starts to get a little messy to think about.
    • 2:00:51But now I can just say, where name equals, quote, unquote, "Steve Carell."
    • 2:00:55I can do in one query what previously took me three nested queries
    • 2:00:59and get back the same answers.
    • 2:01:00And I can still add in my order by title to get back the result.
    • 2:01:05And if I do this a little more neatly, let me type this out a little
    • 2:01:09differently.
    • 2:01:10Let me type this out by adding a new line-- ah, I can't do that here.
    • 2:01:16I'm going to leave it alone for now.
    • 2:01:18We can type it on multiple lines in other contexts.
    • 2:01:20And let me do one last thing.
    • 2:01:23Do I want to show that?
    • 2:01:25I'm going to show it, but this is not something
    • 2:01:26you should ingrain just yet either.
    • 2:01:28Select title from people, stars, and shows.
    • 2:01:31If you know in advance that you want to do something with all three tables,
    • 2:01:34you can just enumerate them, one table name after the other.
    • 2:01:38And then you can say where people.ID equals stars.personID.
    • 2:01:43And now I'm hitting Enter so that it formats
    • 2:01:45a little more readably on my screen.
    • 2:01:47And stars.showID equals shows.ID, and lastly, name equals Steve Carell.
    • 2:01:55In short, you specify that you want to select data from all three
    • 2:01:59of these tables.
    • 2:02:00And then you tell the database how to combine foreign keys with primary keys,
    • 2:02:06that is, the columns that have those integers in common.
    • 2:02:09If I hit Enter now, I get the same exact results, ever
    • 2:02:12more so if I also add in an order by title.
    • 2:02:16Oops.
    • 2:02:17All right.
    • 2:02:18That's why I didn't want to do this earlier.
    • 2:02:20I have to go back through my history multiple times to actually get back
    • 2:02:23the multi-line query this time.
    • 2:02:24All right.
    • 2:02:25That was a lot all at once.
    • 2:02:27But this is only to say that, even as we make the design of the data
    • 2:02:31more sophisticated, and we put some of it over here, some of it
    • 2:02:33over here, some of it over here so as to avoid duplication of data, weird hacks
    • 2:02:38like putting commas in the data, we can still get back all of the answers
    • 2:02:41that we might want across these several tables.
    • 2:02:44And using indexes, we can significantly speed up these processes
    • 2:02:48so as to handle 10 times as many, a 100 times as many users
    • 2:02:52on the same actual database.
    • 2:02:53There is going to be a downside.
    • 2:02:55And thinking back to our discussion of algorithms and data
    • 2:02:57structures in past weeks, what might be a downside of creating these indexes?
    • 2:03:02Because as of now, I created four separate indexes on the name column,
    • 2:03:06the title column, and some other columns, too.
    • 2:03:09Why wouldn't I just go ahead and index everything
    • 2:03:11if it's clearly speeding things up?
    • 2:03:14Memory, so space.
    • 2:03:15Any time you're starting to benefit time wise in computer science,
    • 2:03:18odds are you're sacrificing space, or vice versa.
    • 2:03:21And probably indexing absolutely everything is a little dumb
    • 2:03:25because you're going to waste way more space than you might actually need.
    • 2:03:29So figuring out where the right inflection point is
    • 2:03:31is part of the process of designing and just getting better at these things.
    • 2:03:36Now unfortunately, a whole lot of things can go wrong in this world,
    • 2:03:40and they continue to in the real world with people using SQL databases.
    • 2:03:44And in fact, here on out, if you're reading
    • 2:03:46something technical about SQL databases, and websites being hacked in some form,
    • 2:03:51and passwords leaking out, unfortunately, all too often
    • 2:03:54it is because of what are called SQL injection attacks.
    • 2:03:57And just to give you a sense now to counterbalance,
    • 2:03:59maybe [INAUDIBLE] enthusiasm for like, oh, that
    • 2:04:01was neat how we can do things so quickly.
    • 2:04:03With great power comes responsibility in this world, too.
    • 2:04:06And so many people introduce bugs into their code
    • 2:04:09by not quite appreciating how it is the data is getting into your application.
    • 2:04:17So what do I mean by that?
    • 2:04:18Here, for instance, is a typical login screen for Yale.
    • 2:04:21And here's the analogue for Harvard where you're prompted,
    • 2:04:23every day probably, for your username and your password,
    • 2:04:26your email address and your password here.
    • 2:04:28Suppose, though, that behind this login page,
    • 2:04:32whether Harvard's or Yale's, there's some website.
    • 2:04:35And that website is using SQL underneath the hood
    • 2:04:38to store all of the Harvard or Yale people's
    • 2:04:40usernames, passwords, ID numbers, courses, transcripts,
    • 2:04:43all of that stuff.
    • 2:04:44So there's a SQL database underneath the website.
    • 2:04:47Well, what might go wrong with this process?
    • 2:04:50Unfortunately, there's some special syntax in SQL
    • 2:04:52just like there is in C and Python.
    • 2:04:54For instance, there are comments in SQL, too.
    • 2:04:56If you do two hyphens, dash, dash, that's a comment in SQL.
    • 2:05:00And if you, the programmer, aren't sufficiently distrustful of your users,
    • 2:05:06such that you defend against potentially adversarial attacks,
    • 2:05:09you might do something like this.
    • 2:05:11Suppose that I somewhat maliciously or curiously log in
    • 2:05:16by typing my username, Malan@harvard.edu, and then maybe
    • 2:05:19a single quote and a dash, dash.
    • 2:05:20Why?
    • 2:05:21Because I'm trying to suss out if there is a vulnerability here
    • 2:05:24to a SQL injection attack.
    • 2:05:26Do not do this in general.
    • 2:05:27But if I were the owner of the website trying to see if I've made any mistake,
    • 2:05:31I might try using potentially dangerous characters in my input.
    • 2:05:35Dangerous how?
    • 2:05:36Because single quote is used for quoting things in SQL, as we've seen--
    • 2:05:40single quotes or double quotes.
    • 2:05:41Dash, dash, I claim now, is used for commenting.
    • 2:05:44But let's now imagine what the code underneath the hood
    • 2:05:47might be for something like Yale's login or Harvard's login.
    • 2:05:52What if it's code that looks like this?
    • 2:05:54So let me read it from left to right.
    • 2:05:56Suppose that they are using something like CS50's own execute function,
    • 2:06:00and they've got some SQL typed into the website that
    • 2:06:03says select star from users, where username equals this,
    • 2:06:07and password equals that.
    • 2:06:09And they're plugging in username and password.
    • 2:06:12So what am I doing here?
    • 2:06:13Well, when the user types their username password, hits Enter,
    • 2:06:16I probably want to select that user from my database
    • 2:06:18to see if the username and passwords match.
    • 2:06:21So the underlying SQL might be, select star
    • 2:06:23from users, where username equals question mark,
    • 2:06:25and password equals question mark.
    • 2:06:27Users is the table.
    • 2:06:28One column is username.
    • 2:06:29One column is password.
    • 2:06:31All right.
    • 2:06:32And if we get back one row, presumably Malan@harvard.edu
    • 2:06:37exists with that password.
    • 2:06:38We should let him proceed from there on out.
    • 2:06:41So that's some pseudo code, if you will, for this scenario.
    • 2:06:45What if, though, this code is not as well written as it currently
    • 2:06:49is, and isn't using question marks?
    • 2:06:51So the question mark syntax is a fairly common SQL thing,
    • 2:06:54where the question marks are used as placeholders,
    • 2:06:56just like in printf, percent S was.
    • 2:06:59But this function, db.execute from CS50's library
    • 2:07:02and third-party libraries as well, is also
    • 2:07:05doing some good stuff with these question marks,
    • 2:07:07and defending against the following attack.
    • 2:07:09Suppose that you were not using a third-party library like ours
    • 2:07:12and you were just manually constructing your SQL queries like this.
    • 2:07:16You were to do something like this instead using an f-string in Python.
    • 2:07:19You're comfortable with format strings now.
    • 2:07:21You've gotten into the habit of using curly braces and plugging in values.
    • 2:07:24Suppose that you, the aspiring programmer,
    • 2:07:27is just using techniques that you've been taught.
    • 2:07:29So you have an f-string with select star from users,
    • 2:07:32where username equals, quote, unquote, "username" in curly braces.
    • 2:07:36And password equals, quote, unquote, "password" in curly braces.
    • 2:07:41As of what, two weeks ago, this was perfectly
    • 2:07:44legitimate technique in Python to plug in values into a string.
    • 2:07:49But notice if you are using single quotes yourself
    • 2:07:53and the user has typed in single quotes to their input, what
    • 2:07:58could go wrong here?
    • 2:07:59Where are we going with this if you're just blindly plugging user input
    • 2:08:04into your own prepared string of text?
    • 2:08:08Yeah?
    • 2:08:09AUDIENCE: [INAUDIBLE]
    • 2:08:14DAVID J. MALAN: Yeah.
    • 2:08:15Worst case, they could insert what is actually SQL code into your database
    • 2:08:20as follows.
    • 2:08:21Generally speaking, if you're using special syntax like single quotes
    • 2:08:25to surround the user's input, you'd better
    • 2:08:27hope that they don't have an apostrophe in their name.
    • 2:08:29Or you better hope that they don't type a single quote as well.
    • 2:08:31Because what if their single quote finishes your single quote instead,
    • 2:08:36and then the rest of this is somehow ignored?
    • 2:08:37Well, let's consider how this might happen.
    • 2:08:39Let me go ahead in here.
    • 2:08:40This got a little blurry here, but let me
    • 2:08:42plug in here-- wow, that looks awful.
    • 2:08:44Let me fix the red.
    • 2:08:47Just change this to white so it's more readable.
    • 2:08:50What happens if the user does this instead?
    • 2:08:56They type in, like I did into the screenshot,
    • 2:08:59'Malan@harvard.edu,' single quote, dash, dash.
    • 2:09:02What has just happened logically, even though we've
    • 2:09:05only just begun with SQL today?
    • 2:09:06Well, select star from users, where username equals Malan@harvard.edu, end
    • 2:09:12quote.
    • 2:09:13What's bad about the rest of this?
    • 2:09:17Dash, dash, I claim, means a comment, which
    • 2:09:19means my color coding is going to be a little blurry again.
    • 2:09:22But everything after the dash, dash is just ignored.
    • 2:09:25The logic, then, of the SQL query, then, is
    • 2:09:27to just say, select Malan@harvard.edu from the database,
    • 2:09:30not even checking the password anymore.
    • 2:09:33Therefore, you will get back at least one row.
    • 2:09:36So length of rows will equal 1, and so presumably the rest of the pseudo code
    • 2:09:41logs the user in, gives them access to my my.harvard account,
    • 2:09:44or whatever it is.
    • 2:09:45And they've pretended to be me simply by using a single quote and a dash,
    • 2:09:50dash in the username field.
    • 2:09:52Again, please don't go start doing this later today
    • 2:09:54on Harvard, Yale, or other websites.
    • 2:09:56But it could be as simple as that.
    • 2:09:57Why?
    • 2:09:58Because the programmer practiced what they
    • 2:10:00were taught, which was just to use curly braces to plug in,
    • 2:10:04in f-strings, values.
    • 2:10:05But if you don't understand how the user's input is going to be used,
    • 2:10:08and if you don't distrust your users fundamentally, for every good person
    • 2:10:12out there there's going to be, unfortunately,
    • 2:10:14some adversary who just wants to try to find fault in your code or hack
    • 2:10:19into your data set.
    • 2:10:20This is what's known as a SQL injection attack,
    • 2:10:22because the user can type something that happens to be or look like SQL,
    • 2:10:27and trick your database into doing something it didn't intend to,
    • 2:10:31like, for instance, logging the user in.
    • 2:10:34Worst case, they could even do something else.
    • 2:10:36Maybe the user types a semicolon, then the word drop, or the word update.
    • 2:10:41You could imagine doing semicolon update table grades, where
    • 2:10:45name equals Malan, and set the grade equal to A instead of B,
    • 2:10:49or something like that.
    • 2:10:50The ability to inject SQL into the database
    • 2:10:53means you can do anything you want with the data set, either constructively,
    • 2:10:56or worse, destructively.
    • 2:11:00And now, just a quick, little cartoon that should now make sense.
    • 2:11:09OK, to, like, one of us, two of us.
    • 2:11:13Awkwardly somewhat funny.
    • 2:11:14All right, so let's move on to one last condition.
    • 2:11:16There's one other problem that can go awry here.
    • 2:11:19Oh, and I should explain this.
    • 2:11:20So this is an allusion to the son, Robert, having typed in semicolon.
    • 2:11:25The word drop, table, students, and doing some of the same technique.
    • 2:11:28This is humor that only CS people would understand
    • 2:11:31because it's the mom realizing, oh, her son's doing a SQL injection
    • 2:11:35attack onto the database.
    • 2:11:36Less funny when you explain it, but once you notice the syntax, that's all this
    • 2:11:39is an allusion to.
    • 2:11:40All right.
    • 2:11:41So one final threat, now that you are graduating
    • 2:11:44to the world of proper databases and away from CSV files alone.
    • 2:11:49Things can go wrong when using databases,
    • 2:11:52and honestly, even using CSV files if you have multiple users.
    • 2:11:55And thus far, you and I have had the luxury
    • 2:11:57in almost every program we've written that it's just me using my code.
    • 2:12:00It's just you using your code.
    • 2:12:01And even if your teaching fellow or TA is using it, probably
    • 2:12:04not at the same time.
    • 2:12:06But the world gets interesting if you start putting your code on phones,
    • 2:12:10on websites, such that now you might have two users literally trying
    • 2:12:14to log in at the same time, literally clicking a button
    • 2:12:17at the same, or nearly the same time.
    • 2:12:19What happens, then, if a computer is trying
    • 2:12:22to handle requests from two different people at once,
    • 2:12:25as might happen all the time on a website?
    • 2:12:27You might get what are called race conditions.
    • 2:12:29And this is a problem in computing in general, not just with SQL, not just
    • 2:12:33with Python, really just any time you have shared data,
    • 2:12:36like a database, as follows.
    • 2:12:39This apparently is one of the most liked Instagram posts ever.
    • 2:12:43It is literally just a picture of an egg.
    • 2:12:46Has anyone clicked on this egg?
    • 2:12:47Like, a couple?
    • 2:12:48Oh, OK.
    • 2:12:48Wow.
    • 2:12:49All right, so yes.
    • 2:12:50So go search for this photo if you'd like to add to the likes on Instagram.
    • 2:12:53The account is world_record_egg.
    • 2:12:56This is just a screenshot of Instagram of that picture of an egg.
    • 2:12:59If you're in the habit of using Instagram,
    • 2:13:00or like any social media site, there's some equivalent of a like button
    • 2:13:03or a heart button these days.
    • 2:13:04And that's actually a really hard problem.
    • 2:13:06Such a simple idea to count the number of likes something
    • 2:13:10has, but that means someone has to click on it.
    • 2:13:12Your code has to detect the click.
    • 2:13:14Your code has to update the database, and then do it again and again,
    • 2:13:17even if multiple people are perhaps right now clicking on that same egg.
    • 2:13:22And unfortunately, bad things can happen if two people try to do something
    • 2:13:28at the same time on a computer.
    • 2:13:30How might this happen?
    • 2:13:31So here's some more code, half pseudocode, half Python code here,
    • 2:13:35as follows.
    • 2:13:36Suppose that what happens when you, literally, right now, maybe click
    • 2:13:40on the like button on the Instagram post.
    • 2:13:43Suppose that code, like the following, is executed on Facebook servers.
    • 2:13:47db.execute of select likes from posts where ID equals question mark.
    • 2:13:54All right.
    • 2:13:54So what am I assuming here?
    • 2:13:57I'm assuming that that photograph has a unique ID.
    • 2:13:59It's some big integer, whatever it was, randomly assigned.
    • 2:14:02I'm assuming that when you click on the heart
    • 2:14:05the unique ID is somehow sent to Instagram servers
    • 2:14:08so that their code can call it ID.
    • 2:14:10And I'm assuming that Instagram is using its SQL database
    • 2:14:13and selecting, from a posts table, the current number of likes
    • 2:14:17of that egg for that given ID number.
    • 2:14:21Why?
    • 2:14:21Because I need to know how many likes it already has if I want to add one to it
    • 2:14:24and then update the database.
    • 2:14:26I need to select the data, then I need to update the data here.
    • 2:14:29All right.
    • 2:14:30So in some Python code here, let's store, in a variable called
    • 2:14:33likes, whatever comes back in the first row from the likes column.
    • 2:14:37Again, this is new syntax specific to our library,
    • 2:14:40but a common way of getting back first row and the column called
    • 2:14:43likes therein.
    • 2:14:44So at this point in the story, likes is storing
    • 2:14:46the total number of likes, in the millions or whatever it is,
    • 2:14:49of that particular egg.
    • 2:14:50Then I do this.
    • 2:14:51Execute update posts, set the number of likes
    • 2:14:56equal to this value, where the ID of the post equals this value.
    • 2:15:00What do I want to update the likes to?
    • 2:15:02Whatever likes currently is plus 1, and then plugging in the ID.
    • 2:15:06So a simple idea, right?
    • 2:15:08I'm checking the value of the likes, and maybe it's 10.
    • 2:15:11I'm changing 10 to 11 and then updating the table.
    • 2:15:15But a problem can arise if two people have
    • 2:15:18clicked on that egg at roughly the same time, or literally, the same time.
    • 2:15:23Why is that?
    • 2:15:24Well, in the world of databases and servers,
    • 2:15:26and the Instagrams of the world have thousands of physical servers nowadays.
    • 2:15:31So they can support millions, billions even, of users nowadays.
    • 2:15:35What can go wrong?
    • 2:15:36Well, typically code like this is not what we'll call atomic.
    • 2:15:40To be atomic means that it all executes together or not at all.
    • 2:15:44Rather, code typically is executed, as you might imagine, line by line.
    • 2:15:49And if your code is running on a server that multiple people have access to,
    • 2:15:53which is absolutely the case for an app like Instagram,
    • 2:15:55if you and I click on the heart at roughly the same time,
    • 2:15:58for efficiency, the computer, the server, owned by Instagram,
    • 2:16:02might execute this line of code for me.
    • 2:16:04Then it might execute this line of code for you.
    • 2:16:06Then this line of code for me, then this line of code for you,
    • 2:16:09then this line of code for me, then this line of code for you.
    • 2:16:11That is to say, our queries might get intermingled chronologically.
    • 2:16:16Because it'd be a little obnoxious if, when you're using Instagram,
    • 2:16:19I'm blocked out while you're interacting with the site.
    • 2:16:22It'd be a lot nicer for efficiency and fairness if somehow they
    • 2:16:24do a little bit of work for me, a little bit of work for you,
    • 2:16:27and back and forth, and back and forth, equitably on the server.
    • 2:16:30So that's what typically happens by default. These lines of code
    • 2:16:33get executed independently.
    • 2:16:35And they can happen in alternating order with other users.
    • 2:16:39You can get them combined like this.
    • 2:16:41Same order top to bottom, but other things might happen in between.
    • 2:16:45So suppose that the number of likes at the very beginning was 10.
    • 2:16:50And suppose that Carter and I both click on that egg at roughly the same time.
    • 2:16:54And suppose this line of code gets executed for me,
    • 2:16:56and that gives me a value in likes, ultimately, of 10.
    • 2:16:59Suppose, then, that the computer takes a break from dealing with my request,
    • 2:17:03does the same code for Carter, and gets back
    • 2:17:05what value for the current number of likes?
    • 2:17:07Also 10 for Carter.
    • 2:17:09Because mine has not been recorded yet.
    • 2:17:11At this point in the story, somewhere in the computer's memory
    • 2:17:13there's a likes variable for me, storing 10.
    • 2:17:16There's a likes variable storing 10 for Carter.
    • 2:17:19Then this line of code executes for me.
    • 2:17:21It updates the database to be likes plus 1, which stores 11 in the database.
    • 2:17:25Then Carter's code is executed, updating the same row in the database
    • 2:17:30to 11, unfortunately.
    • 2:17:34Because his value of likes happened to be the same value of mine.
    • 2:17:38And so the metaphor here, that if we had a refrigerator on stage we would
    • 2:17:41actually act out, is something that was taught to me years ago in an operating
    • 2:17:44systems class, whereby the most similar analogue in the real world would be
    • 2:17:50if you've got a mini fridge in your dorm room.
    • 2:17:52And one of you and your roommates comes home, opens the fridge, and realizes,
    • 2:17:58oh, we're out of milk, was how the story went in my day.
    • 2:18:01So you close the refrigerator, and you walk across the street, go to CVS,
    • 2:18:05and get in line to buy some milk.
    • 2:18:06Meanwhile, your roommate comes home.
    • 2:18:08They, too, inspect the state of your refrigerator, a.k.a., a variable,
    • 2:18:12open the door, and realizes, oh, we're out of milk.
    • 2:18:15I'll go get more milk.
    • 2:18:15Close the fridge, go across the street, and head
    • 2:18:18to maybe a different store, or the line is long enough
    • 2:18:20that you don't see each other at the store.
    • 2:18:21So long story short, you both eventually get home, open the door, and damn it,
    • 2:18:25now there's milk from your other roommate
    • 2:18:27there because you both made a decision on this
    • 2:18:30based on the state of a variable that you independently examined.
    • 2:18:35And you didn't somehow communicate.
    • 2:18:37Now in the real world, this is absolutely solvable.
    • 2:18:40How would you fix this or avoid this problem in the real world?
    • 2:18:44Literally, own roommate, own fridge.
    • 2:18:45AUDIENCE: Text your roommate [INAUDIBLE]..
    • 2:18:47DAVID J. MALAN: Perfect.
    • 2:18:48Let them know, so somehow communicate.
    • 2:18:50And in fact, the terminology here would be multiple threads
    • 2:18:52can somehow intercommunicate by having shared state,
    • 2:18:55like the iMessage thread on your phone.
    • 2:18:57You could leave a note.
    • 2:18:58You could, more dramatically, lock the refrigerator somehow,
    • 2:19:01thereby making the milk purchasing process atomic.
    • 2:19:05The fundamental problem is that for efficiency, again,
    • 2:19:08computers tend to intermingle logic that needs
    • 2:19:11to happen when it's happening across multiple users just for fairness' sake,
    • 2:19:15for scheduling sake.
    • 2:19:17You need to make sure that all three of these lines of code
    • 2:19:19execute for me, and then for Carter, and then for you
    • 2:19:23if you want to ensure that this count is correct.
    • 2:19:25And for years, when social media was first getting off the ground,
    • 2:19:29this was a super hard problem.
    • 2:19:31Twitter used to go down all of the time, and tweets,
    • 2:19:33and retweets were a thing that were similarly happening
    • 2:19:36with a very high frequency.
    • 2:19:37These are hard problems to solve.
    • 2:19:39And thankfully, there are solutions.
    • 2:19:40And we won't get into the weeds of how you might use these things,
    • 2:19:43but know that there are solutions in the form of things
    • 2:19:46called locks, which I use that word deliberately with the fridge.
    • 2:19:49Software locks can allow you to protect a variable so no one else can
    • 2:19:53look at it until you're done with it.
    • 2:19:55There are things called transactions, which
    • 2:19:57allow you to do the equivalent of sending a message to, or really locking
    • 2:20:01out your roommate from accessing that same variable, too,
    • 2:20:04but for slightly less amount of time.
    • 2:20:06There are solutions to these problems.
    • 2:20:08So for instance, in Python, the same code now in green
    • 2:20:12might look a little something like this.
    • 2:20:14When you know that something has to happen all at once,
    • 2:20:17altogether, you first begin a transaction, and you do your thing,
    • 2:20:21and then you commit the transaction at the very end.
    • 2:20:23Here, too, though, there's going to be a downside.
    • 2:20:25Typically, the more you use transactions in this way,
    • 2:20:29potentially the higher the probability is
    • 2:20:31that you're going to box someone out or make Carter's request a little slower.
    • 2:20:35Why?
    • 2:20:35Because we can't interact at the same time.
    • 2:20:37Or you might make his request fail if he tries to update
    • 2:20:39something that's already been updated.
    • 2:20:41So you generally want to have as few lines of code
    • 2:20:44together in between these transactions so that you get in and you get out.
    • 2:20:47And you go to CVS and you get back really fast so as to not
    • 2:20:50cause these kind of performance things.
    • 2:20:52So things indeed escalated quickly today.
    • 2:20:55The original goal was just to solve problems using a different language
    • 2:20:58more effectively than Python.
    • 2:20:59But as soon as you have these more powerful techniques,
    • 2:21:01a whole new set of problems arises.
    • 2:21:03Takes practice to get comfortable with.
    • 2:21:05But ultimately, this is all leading us toward the introduction next week
    • 2:21:09of web programming with HTML, CSS, and some JavaScript.
    • 2:21:12The week after, bringing Python and SQL back into the mix.
    • 2:21:15So that by term's end, we've really now used
    • 2:21:16all of these different languages for what they're best at.
    • 2:21:19And over the next few weeks, the goal is to make sure you're understanding
    • 2:21:22and comfortable with what each of these things is good and bad for.
    • 2:21:25Let's go ahead and wrap here.
    • 2:21:27I'll stick around for questions.
    • 2:21:28We'll see you next time.
    • 2:21:30[MUSIC PLAYING]
  • CS50.ai
Shortcuts
Before using a shortcut, click at least once on the video itself (to give it "focus") after closing this window.
Play/Pause spacebar or k
Rewind 10 seconds left arrow or j
Fast forward 10 seconds right arrow or l
Previous frame (while paused) ,
Next frame (while paused) .
Decrease playback rate <
Increase playback rate >
Toggle captions on/off c
Toggle mute m
Toggle full screen f or double-click video