CS50 Video Player
    • 0:00:00 Introduction
    • 0:00:15 Language
    • 0:04:55 Syntax and Semantics
    • 0:10:23 Context-Free Grammar
    • 0:20:35 nltk
    • 0:28:00 n-grams
    • 0:30:28 Tokenization
    • 0:38:00 Markov Models
    • 0:42:41 Bag-of-Words Model
    • 0:46:38 Naive Bayes
    • 1:09:18 Information Retrieval
    • 1:12:06 tf-idf
    • 1:21:04 Information Extraction
    • 1:30:13 WordNet
    • 1:32:06 Word Representation
    • 1:38:18 word2vec
    • 0:00:00[MUSIC PLAYING]
    • 0:00:17SPEAKER 1: OK, welcome back, everyone, to our final topic in an introduction
    • 0:00:21to artificial intelligence with Python.
    • 0:00:23And today, the topic is language.
    • 0:00:25So thus far in the class, we've seen a number
    • 0:00:27of different ways of interacting with AI, artificial intelligence,
    • 0:00:31but it's mostly been happening in the way of us formulating problems
    • 0:00:34in ways that AI can understand-- learning to speak the language of AI,
    • 0:00:38so to speak, by trying to take a problem and formulate it as a search problem,
    • 0:00:41or by trying to take a problem and make it a constraint satisfaction problem--
    • 0:00:45something that our AI is able to understand.
    • 0:00:47Today, we're going to try and come up with algorithms and ideas that
    • 0:00:50allow our AI to meet us halfway, so to speak--
    • 0:00:53to allow our AI to be able to understand, and interpret, and get
    • 0:00:56some sort of meaning out of human language--
    • 0:00:58the type of language, the spoken language,
    • 0:01:00like English, or some other language that we naturally speak.
    • 0:01:03And this turns out to be a really challenging task for AI.
    • 0:01:06And it really encompasses a number of different types of tasks
    • 0:01:09all under the broad heading of natural language processing,
    • 0:01:13the idea of coming up with algorithms that
    • 0:01:15allow our AI to be able to process and understand natural language.
    • 0:01:19So these tasks vary in terms of the types of tasks
    • 0:01:22we might want an AI to perform, and therefore, the types of
    • 0:01:24algorithms that we might use.
    • 0:01:25But some common tasks that you might see
    • 0:01:28are things like automatic summarization.
    • 0:01:30You give an AI a long document, and you would like for the AI
    • 0:01:33to be able to summarize it, come up with a shorter
    • 0:01:35representation of the same idea, but still in some kind of natural language,
    • 0:01:39like English.
    • 0:01:40Something like information extraction-- given a whole corpus of information
    • 0:01:44in some body of documents or on the internet,
    • 0:01:46for example, we'd like for our AI to be able to extract
    • 0:01:49some sort of meaningful semantic information out of all of that content
    • 0:01:54that it's able to look at and read.
    • 0:01:56Language identification-- the task of, given a page,
    • 0:01:59can you figure out what language that document is written in?
    • 0:02:01This is the type of thing you might see if you use a web browser where,
    • 0:02:04if you open up a page in another language,
    • 0:02:06that web browser might ask you, oh, I think it's in this language-- would
    • 0:02:09you like me to translate into English for you, for example?
    • 0:02:12And that language identification process is a task
    • 0:02:15that our AI needs to be able to do, which is then related
    • 0:02:17to machine translation, the process of taking text in one language
    • 0:02:21and translating it into another language-- which there's
    • 0:02:24been a lot of research and development on really
    • 0:02:26over the course of the last several years.
    • 0:02:28And it keeps getting better, in terms of how
    • 0:02:30it is that AI is able to take text in one language
    • 0:02:33and transform that text into another language as well.
    • 0:02:37In addition to that, we have topics like named entity recognition.
    • 0:02:40Given some sequence of text, can you pick out what the named entities are?
    • 0:02:43These are names of companies, or names of people,
    • 0:02:46or names of locations for example, which are often relevant or important parts
    • 0:02:50of a particular document.
    • 0:02:51Speech recognition is a related task, not to do with text that is written,
    • 0:02:55but text that is spoken-- being able to process audio and figure out,
    • 0:02:58what are the actual words that are spoken there?
    • 0:03:01And if you think about smart home devices, like Siri or Alexa,
    • 0:03:04for example, these are all devices that are now
    • 0:03:06able to listen to us when we speak, figure out
    • 0:03:09what words we are saying, and draw some sort of meaning out of that as well.
    • 0:03:13We've talked about how you could formulate something,
    • 0:03:15for instance, as a hidden Markov model to be able to draw
    • 0:03:17those sorts of conclusions.
    • 0:03:19Text classification, more generally, is a broad category
    • 0:03:22of types of ideas, whenever we want to take some kind of text
    • 0:03:25and put it into some sort of category.
    • 0:03:27And we've seen these classification type problems
    • 0:03:29and how we can use statistical machine learning approaches
    • 0:03:31to be able to solve them.
    • 0:03:32We'll be able to do something very similar with natural language,
    • 0:03:35though we may need to make a couple of adjustments that we'll see soon.
    • 0:03:38And then something like word sense disambiguation,
    • 0:03:41the idea that, unlike in the language of numbers,
    • 0:03:45where AI has very precise representations of everything, words
    • 0:03:48are a little bit fuzzy, in terms of their meaning,
    • 0:03:50and words can have multiple different meanings--
    • 0:03:52and natural language is inherently ambiguous,
    • 0:03:55and we'll take a look at some of those ambiguities in due time today.
    • 0:03:58But one challenging task, if you want an AI
    • 0:04:00to be able to understand natural language,
    • 0:04:02is being able to disambiguate or differentiate
    • 0:04:05between different possible meanings of words.
    • 0:04:08If I say a sentence like, I went to the bank, you need to figure out,
    • 0:04:12do I mean the bank where I deposit and withdraw money or do
    • 0:04:14I mean the bank like the river bank?
    • 0:04:16And different words can have different meanings
    • 0:04:18that we might want to figure out.
    • 0:04:19And based on the context in which a word appears--
    • 0:04:21the wider sentence, or paragraph, or paper
    • 0:04:23in which a particular word appears--
    • 0:04:25that might help to inform how it is that we
    • 0:04:27disambiguate between different meanings or different senses
    • 0:04:31that a word might have.
    • 0:04:32And there are many other topics within natural language processing,
    • 0:04:35many other algorithms that have been devised
    • 0:04:37in order to deal with and address these sorts of problems.
    • 0:04:40And today, we're really just going to scratch the surface,
    • 0:04:42looking at some of the fundamental ideas that are behind many of these ideas
    • 0:04:46within natural language processing, within this idea of trying to come up
    • 0:04:49with AI algorithms that are able to do something meaningful with the languages
    • 0:04:53that we speak everyday.
    • 0:04:55And so to introduce this idea, when we think about language,
    • 0:04:58we can often think about it in a couple of different parts.
    • 0:05:01The first part refers to the syntax of language.
    • 0:05:04This is more to do with just the structure of language
    • 0:05:07and how it is that that structure works.
    • 0:05:09And if you think about natural language, syntax is one of those things
    • 0:05:13that, if you're a native speaker of a language,
    • 0:05:15it comes pretty readily to you.
    • 0:05:16You don't have to think too much about it.
    • 0:05:18If I give you a sentence from Sir Arthur Conan Doyle's Sherlock Holmes,
    • 0:05:21for example, a sentence like this--
    • 0:05:23"just before 9:00 o'clock, Sherlock Holmes stepped briskly into the room"--
    • 0:05:27I think we could probably all agree that this
    • 0:05:29is a well-formed grammatical sentence.
    • 0:05:31Syntactically, it makes sense, in terms of the way
    • 0:05:34that this particular sentence is structured.
    • 0:05:37And syntax applies not just to natural language, but to programming languages
    • 0:05:40as well.
    • 0:05:40If you've ever seen a syntax error in a program that you've written,
    • 0:05:44it's likely because you wrote some sort of program
    • 0:05:47that was not syntactically well-formed.
    • 0:05:49The structure of it was not a valid program.
    • 0:05:52In the same way, we can look at English sentences, or sentences
    • 0:05:54in any natural language, and make the same kinds of judgments.
    • 0:05:57I can say that this sentence is syntactically well-formed.
    • 0:06:01When all the parts are put together, all these words are in this order,
    • 0:06:04it constructs a grammatical sentence, or a sentence that most people would agree
    • 0:06:08is grammatical.
    • 0:06:09But there are also grammatically ill-formed sentences.
    • 0:06:11A sentence like, "just before Sherlock Holmes
    • 0:06:149 o'clock stepped briskly the room"--
    • 0:06:16well, I think we would all agree that this is not a well-formed sentence.
    • 0:06:19Syntactically, it doesn't make sense.
    • 0:06:22And this is the type of thing that, if we want our AI, for example,
    • 0:06:25to be able to generate natural language--
    • 0:06:27to be able to speak to us the way a chatbot would speak to us,
    • 0:06:30for example--
    • 0:06:31well then our AI is going to need to be able to know this distinction somehow,
    • 0:06:34is going to be able to know what kinds of sentences are grammatical,
    • 0:06:37what kinds of sentences are not.
    • 0:06:39And we might come up with rules or ways to statistically learn these ideas,
    • 0:06:42and we'll talk about some of those methods as well.
    • 0:06:45Syntax can also be ambiguous.
    • 0:06:47It's not just that some sentences are well-formed and some are not--
    • 0:06:50there are certain ways that you could take a sentence
    • 0:06:54and potentially construct multiple different structures for that sentence.
    • 0:06:58A sentence like, "I saw the man on the mountain with a telescope," well,
    • 0:07:01this is grammatically well-formed-- syntactically, it makes sense--
    • 0:07:05but what is the structure of the sentence?
    • 0:07:07Is it the man on the mountain who has the telescope, or am
    • 0:07:10I seeing the man on the mountain and I am using the telescope in order
    • 0:07:13to see the man on the mountain?
    • 0:07:15There's some interesting ambiguity here, where it could have potentially
    • 0:07:19two different types of structures.
    • 0:07:21And this is one of the ideas that we'll come back to also,
    • 0:07:23in terms of how to think about dealing with AI when natural language is
    • 0:07:27inherently ambiguous.
    • 0:07:29So that then is syntax, the structure of language,
    • 0:07:32and getting an understanding for how it is
    • 0:07:34that, depending on the order and placement of words,
    • 0:07:36we can come up with different structures for language.
    • 0:07:38But in addition to language having structure, language also has meaning.
    • 0:07:42And now we get into the world of semantics, the idea of,
    • 0:07:44what it is that a word, or a sequence of words,
    • 0:07:47or a sentence, or an entire essay actually means?
    • 0:07:51And so a sentence like, "just before 9:00, Sherlock Holmes
    • 0:07:54stepped briskly into the room," is a different sentence
    • 0:07:58from a sentence like, "Sherlock Holmes stepped briskly into the room just
    • 0:08:01before 9:00."
    • 0:08:03And yet they have effectively the same meaning.
    • 0:08:06They're different sentences, so an AI reading
    • 0:08:08them would recognize them as different, but we as humans
    • 0:08:11can look at both the sentences and say, yeah,
    • 0:08:13they mean basically the same thing.
    • 0:08:15And maybe, in this case, it was just because I moved the order of the words
    • 0:08:18around.
    • 0:08:18Originally, 9 o'clock was near the beginning of the sentence.
    • 0:08:21Now 9 o'clock is near the end of the sentence.
    • 0:08:23But you might imagine that I could come up with a different sentence entirely,
    • 0:08:26a sentence like, "a few minutes before 9:00, Sherlock Holmes
    • 0:08:29walked quickly into the room."
    • 0:08:31And OK, that also has a very similar meaning,
    • 0:08:34but I'm using different words in order to express that idea.
    • 0:08:37And ideally, AI would be able to recognize
    • 0:08:40that these two sentences, these different sets of words that
    • 0:08:43are similar to each other, have similar meanings,
    • 0:08:46and to be able to get at that idea as well.
    • 0:08:49Then there are also ways that a syntactically well-formed sentence
    • 0:08:52might not mean anything at all.
    • 0:08:54A famous example from linguist Noam Chomsky is this sentence here--
    • 0:08:57"colorless green ideas sleep furiously."
    • 0:09:00Syntactically, that sentence is perfectly fine.
    • 0:09:03Colorless and green are adjectives that modify the noun ideas.
    • 0:09:07Sleep is a verb.
    • 0:09:08Furiously is an adverb.
    • 0:09:09These are correct constructions, in terms of the order of words,
    • 0:09:12but it turns out this sentence is meaningless.
    • 0:09:15If you tried to ascribe meaning to the sentence, what does it mean?
    • 0:09:18And it's not easy to be able to determine
    • 0:09:20what it is that it might mean.
    • 0:09:21Semantics itself can also be ambiguous, given that different structures can
    • 0:09:25have different types of meanings.
    • 0:09:26Different words can have different kinds of meanings,
    • 0:09:29so the same sentence with the same structure
    • 0:09:31might end up meaning different types of things.
    • 0:09:33So my favorite example from the LA times is
    • 0:09:35a headline that was in the Los Angeles Times a little while back.
    • 0:09:39The headline says, "Big rig carrying fruit crashes on 210 freeway,
    • 0:09:43creates jam."
    • 0:09:44So depending on how it is you look at the sentence--
    • 0:09:46how you interpret the sentence-- it can have multiple different meanings.
    • 0:09:50And so here too are challenges in this world of natural language processing,
    • 0:09:53being able to understand both the syntax of language
    • 0:09:56and the semantics of language.
    • 0:09:58And today, we'll take a look at both of those ideas.
    • 0:10:00We're going to start by talking about syntax
    • 0:10:02and getting a sense for how it is that language is structured,
    • 0:10:05and how we can start by coming up with some rules, some ways
    • 0:10:09that we can tell our computer, tell our AI what types of things
    • 0:10:12are valid sentences, what types of things are not valid sentences.
    • 0:10:16And ultimately, we'd like to use that information
    • 0:10:19to be able to allow our AI to draw meaningful conclusions,
    • 0:10:21to be able to do something with language.
    • 0:10:23And so to do so, we're going to start by introducing
    • 0:10:25the notion of formal grammar.
    • 0:10:27And what formal grammar is all about is this: a formal grammar
    • 0:10:30is a system of rules that generates sentences in a language.
    • 0:10:34I would like to know what are the valid English sentences--
    • 0:10:38not in terms of what they mean--
    • 0:10:39just in terms of their structure-- their syntactic structure.
    • 0:10:42What structures of English are valid, correct sentences?
    • 0:10:45What structures of English are not valid?
    • 0:10:47And this is going to apply in a very similar way to other natural languages
    • 0:10:50as well, where language follows certain types of structures.
    • 0:10:54And we intuitively know what these structures mean,
    • 0:10:56but it's going to be helpful to try and really formally define
    • 0:10:59what the structures mean as well.
    • 0:11:01There are a number of different types of formal grammar
    • 0:11:04all across what's known as the Chomsky hierarchy of grammars.
    • 0:11:07And you may have seen some of these before.
    • 0:11:09If you've ever worked with regular expressions before,
    • 0:11:11those belong to a class of regular languages.
    • 0:11:14They correspond to regular languages, which is a particular type of language.
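As a tiny illustration of that correspondence (a generic example, not from the lecture): a regular expression defines a regular language, namely the set of all strings that the pattern matches.

```python
import re

# A regular expression defines a regular language: the set of all strings
# the pattern matches. Here the language is {"ab", "aab", "aaab", ...}.
pattern = re.compile(r"a+b")

print(bool(pattern.fullmatch("aaab")))  # True: "aaab" is in the language
print(bool(pattern.fullmatch("ba")))    # False: "ba" is not
```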
    • 0:11:19But also on this hierarchy is a type of grammar
    • 0:11:21known as a context-free grammar.
    • 0:11:23And this is the one we're going to spend the most
    • 0:11:25time on taking a look at today.
    • 0:11:27And what a context-free grammar is is a way
    • 0:11:31of generating sentences in a language via what
    • 0:11:34are known as rewriting rules-- replacing one symbol with other symbols.
    • 0:11:39And we'll take a look in a moment at just what that means.
    • 0:11:42So let's imagine, for example, a simple sentence in English,
    • 0:11:45a sentence like, "she saw the city"--
    • 0:11:48a valid, syntactically well-formed English sentence.
    • 0:11:52But we'd like for some way for our AI to be able to look at the sentence
    • 0:11:55and figure out, what is the structure of the sentence?
    • 0:12:00If you imagine an AI in a question answering format--
    • 0:12:02if you want to ask the AI a question like, what did she see,
    • 0:12:05well, then the AI wants to be able to look at this sentence
    • 0:12:08and recognize that what she saw is the city-- to be able to figure that out.
    • 0:12:13And it requires some understanding of what
    • 0:12:15it is that the structure of this sentence really looks like.
    • 0:12:19So where do we begin?
    • 0:12:20Each of these words-- she, saw, the, city--
    • 0:12:23we are going to call terminal symbols.
    • 0:12:25These are symbols in our language-- where each of these words is just
    • 0:12:28a symbol--
    • 0:12:29where this is ultimately what we care about generating.
    • 0:12:32We care about generating these words.
    • 0:12:34But each of these words we're also going to associate
    • 0:12:37with what we're going to call a non-terminal symbol.
    • 0:12:40And these non-terminal symbols initially are going to look kind of like parts
    • 0:12:43of speech, if you remember back to like English grammar--
    • 0:12:46where she is a noun, saw is a V for verb,
    • 0:12:49the is a D. D stands for determiner.
    • 0:12:52These are words like the, and a, and and, for example.
    • 0:12:55And then city-- well, city is also a noun, so an N goes there.
    • 0:12:59So each of these--
    • 0:13:00N, V, and D--
    • 0:13:01these are what we might call non-terminal symbols.
    • 0:13:04They're not actually words in the language.
    • 0:13:07She saw the city-- those are the words in the language.
    • 0:13:10But we use these non-terminal symbols to generate the terminal symbols,
    • 0:13:14the terminal symbols which are like, she saw the city--
    • 0:13:16the words that are actually in a language like English.
    • 0:13:20And so in order to translate these non-terminal symbols into terminal
    • 0:13:24symbols, we have what are known as rewriting rules,
    • 0:13:27and these rules look something like this.
    • 0:13:29We have N on the left side of an arrow, and the arrow
    • 0:13:32says, if I have an N non-terminal symbol,
    • 0:13:35then I can turn it into any of these various different possibilities
    • 0:13:39that are separated with a vertical line.
    • 0:13:42So a noun could translate into the word she.
    • 0:13:45A noun could translate into the word city, or car, or Harry,
    • 0:13:49or any number of other things.
    • 0:13:50These are all examples of nouns, for example.
    • 0:13:53Meanwhile, a determiner, D, could translate into the, or a, or an.
    • 0:13:58V for verb could translate into any of these verbs.
    • 0:14:01P for preposition could translate into any of those prepositions--
    • 0:14:04to, on, over, and so forth.
    • 0:14:06And then ADJ for adjective can translate into any of these possible adjectives
    • 0:14:11as well.
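Those terminal rewriting rules can be sketched as a plain Python mapping. The N, D, and P word lists are the ones read aloud above; the V entries are the verbs used in later examples, and the ADJ list mixes "big" (from a later example) with placeholder adjectives, since the slide's full lists aren't spelled out in the audio.

```python
# Terminal rewriting rules: each non-terminal maps to the words it may
# rewrite to. V and most ADJ entries are assumptions (see note above).
RULES = {
    "N":   ["she", "city", "car", "Harry"],
    "D":   ["the", "a", "an"],
    "V":   ["saw", "walked"],
    "P":   ["to", "on", "over"],
    "ADJ": ["big", "blue", "old"],
}

def parts_of_speech(word):
    """Return every non-terminal that can rewrite to the given terminal."""
    return [symbol for symbol, words in RULES.items() if word in words]

print(parts_of_speech("city"))  # ['N']
```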
    • 0:14:12So these then are rules in our context-free grammar.
    • 0:14:15When we are defining what it is that our grammar is,
    • 0:14:18what is the structure of the English language or any other language,
    • 0:14:21we give it these types of rules saying that a noun could
    • 0:14:24be any of these possibilities, a verb could be any of those possibilities.
    • 0:14:29But it turns out we can then begin to construct other rules where
    • 0:14:32it's not just one non-terminal translating into one terminal symbol.
    • 0:14:37We're always going to have one non-terminal on the left-hand side
    • 0:14:40of the arrow, but on the right-hand side of the arrow,
    • 0:14:42we could have other things.
    • 0:14:43We could even have other non-terminal symbols.
    • 0:14:46So what do I mean by this?
    • 0:14:48Well, we have the idea of nouns-- like she, city, car, Harry, for example--
    • 0:14:53but there are also noun phrases--
    • 0:14:55phrases that work as nouns--
    • 0:14:57that are not just a single word, but that are multiple words.
    • 0:15:00Like the city is two words that, together, operate
    • 0:15:04as what we might call a noun phrase.
    • 0:15:06It's multiple words, but they're together operating as a noun.
    • 0:15:08Or if you think about a more complex expression, like the big city--
    • 0:15:12three words all operating as a single noun--
    • 0:15:15or the car on the street--
    • 0:15:17multiple words now, but that entire set of words operates kind of like a noun.
    • 0:15:22It substitutes as a noun phrase.
    • 0:15:25And so to do this, we'll introduce the notion
    • 0:15:27of a new non-terminal symbol called NP, which will stand for noun phrase.
    • 0:15:32And this rewriting rule says that a noun phrase could be a noun--
    • 0:15:36so something like she is a noun, and therefore, it
    • 0:15:39can also be a noun phrase--
    • 0:15:40but a noun phrase could also be a determiner, D, followed by a noun--
    • 0:15:46so two ways we can have a noun phrase in this very simple grammar.
    • 0:15:49Of course, the English language is more complex than just this,
    • 0:15:51but a noun phrase is either a noun or it is a determiner followed by a noun.
    • 0:15:57So for the first example, a noun phrase that is just a noun,
    • 0:16:00that would allow us to generate noun phrases like she,
    • 0:16:04because a noun phrase is just a noun, and a noun
    • 0:16:07could be the word she, for example.
    • 0:16:10Meanwhile, if we wanted to look at one of the examples of these, where
    • 0:16:13a noun phrase becomes a determiner and a noun,
    • 0:16:16then we get a structure like this.
    • 0:16:18And now we're starting to see the structure of language
    • 0:16:21emerge from these rules in a syntax tree, as we'll call it,
    • 0:16:24this tree-like structure that represents the syntax of our natural language.
    • 0:16:29Here, we have a noun phrase, and this noun phrase
    • 0:16:31is composed of a determiner and a noun, where the determiner is the word the,
    • 0:16:36according to that rule, and noun is the word city.
    • 0:16:40So here then is a noun phrase that consists of multiple words inside
    • 0:16:43of the structure.
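A quick sketch of everything those two NP rules generate, using the small vocabulary from the demo grammar that appears later (nouns she/city/car, determiners the/a). Note that the toy rules happily overgenerate odd phrases like "a she", since nothing in the grammar distinguishes pronouns from common nouns.

```python
# NP -> N | D N, applied exhaustively to a toy vocabulary.
NOUNS = ["she", "city", "car"]
DETERMINERS = ["the", "a"]

# NP -> N yields each bare noun; NP -> D N yields every determiner-noun pair.
noun_phrases = list(NOUNS) + [f"{d} {n}" for d in DETERMINERS for n in NOUNS]

print(noun_phrases)
```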
    • 0:16:45And using this idea of taking one symbol and rewriting it using other symbols--
    • 0:16:50that might be terminal symbols, like the and city,
    • 0:16:52but might also be non-terminal symbols, like D for determiner or N for noun--
    • 0:16:57then we can begin to construct more and more complex structures.
    • 0:17:01In addition to noun phrases, we can also think about verb phrases.
    • 0:17:04So what might a verb phrase look like?
    • 0:17:06Well, a verb phrase might just be a single verb.
    • 0:17:09In a sentence like "I walked," walked is a verb,
    • 0:17:13and that is acting as the verb phrase in that sentence.
    • 0:17:17But there are also more complex verb phrases that aren't just a single word,
    • 0:17:21but that are multiple words.
    • 0:17:22If you think of the sentence like "she saw the city," for example,
    • 0:17:25saw the city is really that entire verb phrase.
    • 0:17:29It's capturing what it is that she is doing, for example.
    • 0:17:33And so our verb phrase might have a rule like this.
    • 0:17:35A verb phrase is either just a plain verb
    • 0:17:38or it is a verb followed by a noun phrase.
    • 0:17:43And we saw before that a noun phrase is either a noun
    • 0:17:45or it is a determiner followed by a noun.
    • 0:17:48And so a verb phrase might be something simple,
    • 0:17:50like a verb phrase that is just a verb.
    • 0:17:52And that verb could be the word walked for example.
    • 0:17:55But it could also be something more sophisticated,
    • 0:17:57something like this one, where we begin to see a larger syntax tree,
    • 0:18:01where the way to read the syntax tree is that a verb
    • 0:18:04phrase is a verb and a noun phrase, where
    • 0:18:07that verb could be something like saw.
    • 0:18:09And this is a noun phrase we've seen before, this noun phrase that
    • 0:18:12is the city-- a noun phrase composed of the determiner the and the noun
    • 0:18:17city all put together to construct this larger verb phrase.
    • 0:18:21And then just to give one more example of a rule,
    • 0:18:23we could also have a rule like this--
    • 0:18:24sentence S goes to noun phrase and a verb phrase.
    • 0:18:28The basic structure of a sentence is that it is
    • 0:18:30a noun phrase followed by verb phrase.
    • 0:18:32And this is a formal grammar way of expressing the idea
    • 0:18:35that you might have learned when you learned English grammar, when you read
    • 0:18:38that a sentence is like a subject and a verb, subject and action--
    • 0:18:42something that's happening to a particular noun phrase.
    • 0:18:45And so using this structure, we could construct
    • 0:18:47a sentence that looks like this.
    • 0:18:49A sentence consists of a noun phrase and a verb phrase.
    • 0:18:53A noun phrase could just be a noun, like the word she.
    • 0:18:56The verb phrase could be a verb and a noun phrase,
    • 0:18:58where-- this is something we've seen before-- the verb is saw
    • 0:19:00and the noun phrase is the city.
    • 0:19:03And so now look what we've done here.
    • 0:19:05What we've done is, by defining a set of rules,
    • 0:19:08there are algorithms that we can run that take these words--
    • 0:19:11and the CYK algorithm, for example, is one example of this if you want to look
    • 0:19:15into that--
    • 0:19:15where you start with a set of terminal symbols, like she saw the city,
    • 0:19:20and then using these rules, you're able to figure out,
    • 0:19:22how is it that you go from a sentence to she saw the city?
    • 0:19:26And it's all through these rewriting rules.
    • 0:19:28So the sentence is a noun phrase and a verb phrase.
    • 0:19:31A verb phrase could be a verb and a noun phrase, so on and so forth,
    • 0:19:34where you can imagine taking this structure
    • 0:19:37and figuring out how it is that you could generate a parse tree--
    • 0:19:41a syntax tree-- for that set of terminal symbols, that set of words.
    • 0:19:46And if you tried to do this for a sentence that was not grammatical,
    • 0:19:49something like "saw the city she," well, that wouldn't work.
    • 0:19:53There'd be no way to take a sentence and use
    • 0:19:56these rules to be able to generate that sentence that
    • 0:19:58is not inside of that language.
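As a minimal sketch of that idea (a naive top-down expansion in plain Python, not the CYK chart algorithm the lecture mentions), we can check whether the toy grammar's rewriting rules can derive a given sequence of terminal symbols:

```python
# The toy context-free grammar from the lecture, as Python data: each
# non-terminal maps to its possible expansions (rewriting rules).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["N"], ["D", "N"]],
    "VP": [["V"], ["V", "NP"]],
    "D":  [["the"], ["a"]],
    "N":  [["she"], ["city"], ["car"]],
    "V":  [["saw"], ["walked"]],
}

def derives(symbols, words):
    """True if the symbol sequence can rewrite to exactly this word sequence."""
    if not symbols:
        return not words                 # all symbols consumed: must have used all words
    first, rest = symbols[0], symbols[1:]
    if first not in GRAMMAR:             # terminal symbol: must match the next word
        return bool(words) and words[0] == first and derives(rest, words[1:])
    # Non-terminal: try every rewriting rule for it.
    return any(derives(expansion + rest, words) for expansion in GRAMMAR[first])

def grammatical(sentence):
    return derives(["S"], sentence.split())

print(grammatical("she saw the city"))   # True
print(grammatical("saw the city she"))   # False
```

Starting from S and repeatedly rewriting non-terminals, "she saw the city" can be reached, while no sequence of rewrites produces "saw the city she" -- exactly the distinction described above.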
    • 0:20:01So this sort of model can be very helpful
    • 0:20:03if the rules are expressive enough to express
    • 0:20:06all the ideas that you might want to express inside of natural language.
    • 0:20:09Of course, using just the simple rules we have here,
    • 0:20:12there are many sentences that we won't be able to generate-- sentences
    • 0:20:14that we might agree are grammatical and syntactically well-formed,
    • 0:20:18but that we're not going to be able to construct using these rules.
    • 0:20:21And then, in that case, we might just need
    • 0:20:23to have some more complex rules in order to deal with those sorts of cases.
    • 0:20:28And so this type of approach can be powerful
    • 0:20:30if you're dealing with a limited set of rules and words
    • 0:20:33that you really care about dealing with.
    • 0:20:35And one way we can actually interact with this in Python
    • 0:20:37is by using a Python library called NLTK, short for natural language
    • 0:20:42toolkit, which we'll see a couple of times today,
    • 0:20:44which has a wide variety of different functions and classes
    • 0:20:47that we can take advantage of that are all
    • 0:20:49meant to deal with natural language.
    • 0:20:51And one such algorithm that it has is the ability to parse
    • 0:20:54a context-free grammar, to be able to take some words
    • 0:20:57and figure out according to some context-free grammar,
    • 0:20:59how would you construct the syntax tree for it?
    • 0:21:02So let's go ahead and take a look at NLTK
    • 0:21:04now by examining how we might construct some context-free grammars with it.
    • 0:21:09So here inside of cfg0--
    • 0:21:12cfg's short for context-free grammar--
    • 0:21:14I have a sample context-free grammar which has rules that we've seen before.
    • 0:21:19So sentence goes to noun phrase followed by a verb phrase.
    • 0:21:22Noun phrase is either a determiner and a noun or a noun.
    • 0:21:25Verb phrase is either a verb or a verb and a noun phrase.
    • 0:21:29The order of these things doesn't really matter.
    • 0:21:32Determiners could be the word the or the word a.
    • 0:21:34A noun could be the word she, city, or car.
    • 0:21:37And a verb could be the word saw or it could be the word walked.
    • 0:21:42Now, using NLTK, which I've imported here at the top,
    • 0:21:45I'm going to go ahead and parse this grammar
    • 0:21:47and save it inside of this variable called parser.
    • 0:21:50Next, my program is going to ask the user for input.
    • 0:21:52Just type in a sentence, and dot split will just
    • 0:21:55split it on all of the spaces, so I end up
    • 0:21:57getting each of the individual words.
    • 0:22:00We're going to save that inside of this list called sentence.
    • 0:22:03And then we'll go ahead and try to parse the sentence, and for each sentence
    • 0:22:08we parse, we're going to pretty print it to the screen,
    • 0:22:10just so it displays in my terminal.
    • 0:22:12And we're also going to draw it.
    • 0:22:13It turns out that NLTK has some graphics capacity,
    • 0:22:16so we can really visually see what that tree looks like as well.
    • 0:22:19And there are multiple different ways a sentence might be parsed,
    • 0:22:22which is why we're putting it inside of this for loop.
    • 0:22:24And we'll see why that can be helpful in a moment too.
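Putting the pieces just described together, the code might look roughly like this (a reconstruction from the walkthrough, not the exact distribution code; the input() prompt and the graphical tree.draw() call described above are replaced with a hard-coded sentence so the sketch runs non-interactively):

```python
import nltk

# The grammar from the walkthrough, in NLTK's CFG notation.
grammar = nltk.CFG.fromstring("""
    S -> NP VP

    NP -> D N | N
    VP -> V | V NP

    D -> "the" | "a"
    N -> "she" | "city" | "car"
    V -> "saw" | "walked"
""")

# Parse the grammar into a parser object.
parser = nltk.ChartParser(grammar)

# The lecture's cfg0.py reads this with input("Sentence: ").split()
# and also calls tree.draw() to open a graphical window.
sentence = "she saw the city".split()

# A sentence may have multiple parses, hence the loop.
for tree in parser.parse(sentence):
    tree.pretty_print()
```

Running this on "she saw the city" prints one tree; an ungrammatical ordering like "saw the city she" yields no trees, so the loop simply produces no output.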
    • 0:22:27All right, now that I have that, let's go ahead and try it.
    • 0:22:30I'll cd into cfg, and we'll go ahead and run cfg0.
    • 0:22:34So it then is going to prompt me to type in a sentence.
    • 0:22:37And let me type in a very simple sentence-- something
    • 0:22:39like she walked, for example.
    • 0:22:42Press Return.
    • 0:22:43So what I get is, on the left-hand side, you
    • 0:22:45can see a text-based representation of the syntax tree.
    • 0:22:48And on the right side here-- let me go ahead and make it bigger--
    • 0:22:51we see a visual representation of that same syntax tree.
    • 0:22:55This is how it is that my computer has now parsed the sentence she walked.
    • 0:22:59It's a sentence that consists of a noun phrase and a verb phrase,
    • 0:23:02where each phrase is just a single noun or verb, she and then walked--
    • 0:23:06same type of structure we've seen before,
    • 0:23:09but this now is our computer able to understand
    • 0:23:11the structure of the sentence, to be able to get
    • 0:23:13some sort of structural understanding of how it is that parts of the sentence
    • 0:23:17relate to each other.
    • 0:23:19Let me now give it another sentence.
    • 0:23:21I could try something like she saw the city, for example--
    • 0:23:25the words we were dealing with a moment ago.
    • 0:23:27And then we end up getting this syntax tree out of it--
    • 0:23:31again, a sentence that has a noun phrase and a verb phrase.
    • 0:23:34The noun phrase is fairly simple.
    • 0:23:35It's just she.
    • 0:23:36But the verb phrase is more complex.
    • 0:23:38It is now saw the city, for example.
    • 0:23:42Let's do one more with this grammar.
    • 0:23:44Let's do something like she saw a car.
    • 0:23:47And that is going to look very similar--
    • 0:23:49that we also get she.
    • 0:23:50But our verb phrase is now different.
    • 0:23:51It's saw a car, because there are multiple possible determiners
    • 0:23:55in our language and multiple possible nouns.
    • 0:23:57I haven't given this grammar that many words,
    • 0:23:59but if I gave it a larger vocabulary, it would then
    • 0:24:01be able to understand more and more different types of sentences.
    • 0:24:06And just to give you a sense of some added complexity we could add here,
    • 0:24:09the more complex our grammar, the more rules we add,
    • 0:24:12the more different types of sentences we'll
    • 0:24:14then have the ability to generate.
    • 0:24:15So let's take a look at cfg1, for example,
    • 0:24:18where I've added a whole number of other different types of rules.
    • 0:24:21I've added the adjective phrases, where we can have multiple adjectives inside
    • 0:24:25of a noun phrase as well.
    • 0:24:27So a noun phrase could be an adjective phrase followed by a noun phrase.
    • 0:24:31If I wanted to say something like the big city,
    • 0:24:33that's an adjective phrase followed by a noun phrase.
    • 0:24:37Or we could also have a noun and a prepositional phrase--
    • 0:24:40so the car on the street, for example.
    • 0:24:43On the street is a prepositional phrase, and we
    • 0:24:46might want to combine those two ideas together, because the car on the street
    • 0:24:50can still operate as something kind of like a noun phrase as well.
    • 0:24:53So no need to understand all of these rules in too much detail--
    • 0:24:56it starts to get into the nature of English grammar--
    • 0:24:59but now we have a more complex way of understanding these types of sentences.
    • 0:25:04So if I run Python cfg1--
    • 0:25:07and I can try typing something like she saw the wide street, for example--
    • 0:25:13a more complex sentence.
    • 0:25:14And if we make that larger, you can see what this sentence looks like.
    • 0:25:18I'll go ahead and shrink it a little bit.
    • 0:25:21So now we have a sentence like this-- she saw the wide street.
    • 0:25:26The wide street is one entire noun phrase,
    • 0:25:28saw the wide street is an entire verb phrase,
    • 0:25:31and she saw the wide street ends up forming that entire sentence.
    • 0:25:35So let's take a look at one more example to introduce this notion of ambiguity.
    • 0:25:40So I can run Python cfg1.
    • 0:25:42Let me type a sentence like she saw a dog with binoculars.
    • 0:25:48So there's a sentence, and here now is one possible syntax tree
    • 0:25:52to represent this idea--
    • 0:25:54she saw, the noun phrase a dog, and then the prepositional phrase
    • 0:25:59with binoculars.
    • 0:26:00And the way to interpret the sentence is that what it is that she saw was a dog.
    • 0:26:06And how did she do the seeing?
    • 0:26:07She did the seeing with binoculars.
    • 0:26:10And so this is one possible way to interpret this.
    • 0:26:13She was using binoculars.
    • 0:26:14Using those binoculars, she saw a dog.
    • 0:26:18But another possible way to parse that sentence
    • 0:26:21would be with this tree over here, where you have something
    • 0:26:25like she saw a dog with binoculars, where a dog with binoculars
    • 0:26:31forms an entire noun phrase of its own--
    • 0:26:33same words in the same order, but a different grammatical structure,
    • 0:26:37where now we have a dog with binoculars all inside of this noun phrase,
    • 0:26:41meaning what did she see?
    • 0:26:42What she saw was a dog, and that dog happened
    • 0:26:44to have binoculars with the dog-- so different ways to parse the sentence--
    • 0:26:49structures for the sentence-- even given the same possible sequence of words.
    • 0:26:53And NLTK's algorithm and this particular algorithm
    • 0:26:56has the ability to find all of these, to be
    • 0:26:58able to understand the different ways that you might
    • 0:27:00be able to parse a sentence and be able to extract some sort of useful meaning
    • 0:27:05out of that sentence as well.
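To see where that ambiguity comes from, here is a small library-free sketch (not NLTK's actual algorithm; the grammar and helper names are invented for illustration) that enumerates every parse of a sentence, and finds exactly two for she saw a dog with binoculars:

```python
# Toy grammar with prepositional phrases; terminals are lowercase words.
GRAMMAR = {
    "S":  [("NP", "VP")],
    "NP": [("D", "N"), ("N",), ("D", "N", "PP")],
    "VP": [("V", "NP"), ("V", "NP", "PP")],
    "PP": [("P", "NP")],
    "D":  [("a",), ("the",)],
    "N":  [("she",), ("dog",), ("binoculars",)],
    "V":  [("saw",)],
    "P":  [("with",)],
}

def parse(symbol, words, i):
    """Yield (tree, next_index) for every way symbol can derive a prefix of words[i:]."""
    for production in GRAMMAR[symbol]:
        # A production over terminals matches a literal word.
        if production[0] not in GRAMMAR:
            if i < len(words) and words[i] == production[0]:
                yield (symbol, words[i]), i + 1
            continue
        # Otherwise expand each child symbol in sequence.
        def expand(children, j, built):
            if not children:
                yield (symbol, *built), j
                return
            for subtree, k in parse(children[0], words, j):
                yield from expand(children[1:], k, built + [subtree])
        yield from expand(list(production), i, [])

def parses(sentence):
    """All complete parse trees for the sentence under GRAMMAR."""
    words = sentence.split()
    return [t for t, end in parse("S", words, 0) if end == len(words)]

for tree in parses("she saw a dog with binoculars"):
    print(tree)
```

One tree attaches "with binoculars" inside the noun phrase (the dog has the binoculars); the other attaches it to the verb phrase (the seeing was done with binoculars).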
    • 0:27:07So that then is a brief look at what we can do--
    • 0:27:11with the structure of language, using these context-free grammar
    • 0:27:16rules to be able to describe the structure of language.
    • 0:27:19But what we might also care about is understanding
    • 0:27:22how it is that these sequences of words are
    • 0:27:24likely to relate to each other in terms of the actual words themselves.
    • 0:27:29The grammar that we saw before could allow us to generate a sentence like,
    • 0:27:33I eat a banana, for example, where I is the noun phrase and eat a banana
    • 0:27:37is a verb phrase.
    • 0:27:39But it would also allow for sentences like, I
    • 0:27:41eat a blue car, for example, which is also syntactically well-formed
    • 0:27:46according to the rules, but is probably a sentence that a person is much
    • 0:27:50less likely to speak.
    • 0:27:51And we might want for our AI to be able to encapsulate
    • 0:27:54the idea that certain sequences of words are more or less likely than others.
    • 0:28:00So to deal with that, we'll introduce the notion of an n-gram,
    • 0:28:03and an n-gram, more generally, just refers to some sequence
    • 0:28:06of n items inside of our text.
    • 0:28:09And those items might take various different forms.
    • 0:28:12We can have character n-grams, which are just a contiguous
    • 0:28:15sequence of n characters-- so three characters in a row,
    • 0:28:18for example, or four characters in a row.
    • 0:28:20We can also have word n-grams, which are a contiguous
    • 0:28:23sequence of n words in a row from a particular sample of text.
    • 0:28:28And these end up proving quite useful, and we
    • 0:28:30can choose our n to decide how long our sequence is going to be.
    • 0:28:34So when n is 1, we're just looking at a single word or a single character.
    • 0:28:39And that is what we might call a unigram, just one item.
    • 0:28:42If we're looking at two characters or two words,
    • 0:28:45that's generally called a bigram-- so an n-gram
    • 0:28:47where n is equal to 2, looking at two words that are consecutive.
    • 0:28:51And then, if there are three items, you might
    • 0:28:53imagine we'll often call those trigrams-- so three characters
    • 0:28:56in a row or three words that happen to be in a contiguous sequence.
    • 0:29:00And so if we took a sentence, for example--
    • 0:29:04here's a sentence from, again, Sherlock Holmes--
    • 0:29:06"how often have I said to you that, when you
    • 0:29:08have eliminated the impossible, whatever remains,
    • 0:29:10however improbable, must be the truth."
    • 0:29:13What are the trigrams that we can extract from the sentence?
    • 0:29:16If we're looking at sequences of three words,
    • 0:29:18well, the first trigram would be how often
    • 0:29:21have-- just a sequence of three words.
    • 0:29:23And then we can look at the next trigram,
    • 0:29:25often have I. The next trigram is have I said.
    • 0:29:29Then I said to, said to you, to you that, for example--
    • 0:29:32those are all trigrams of words, sequences of three contiguous words
    • 0:29:36that show up in the text.
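Extracting those trigrams is just a sliding window over the tokens (a plain-Python sketch with a hypothetical helper name):

```python
def ngrams(tokens, n):
    """All contiguous sequences of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ("how often have I said to you that when you have eliminated the "
         "impossible whatever remains however improbable must be the truth").split()

# The first few trigrams: how often have / often have I / have I said / ...
print(ngrams(words, 3)[:4])
```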
    • 0:29:38And extracting those bigrams and trigrams, or n-grams more generally,
    • 0:29:43turns out to be quite helpful, because often,
    • 0:29:45when we're dealing with analyzing a lot of text,
    • 0:29:48it's not going to be particularly meaningful for us to try
    • 0:29:50and analyze the entire text at one time.
    • 0:29:53But instead, we want to segment that text into pieces that we
    • 0:29:57can begin to do some analysis of--
    • 0:29:59that our AI might never have seen this entire sentence before,
    • 0:30:03but it's probably seen the trigram to you that before,
    • 0:30:07because to you that is something that might have come up in other documents
    • 0:30:11that our AI has seen before.
    • 0:30:13And therefore, it knows a little bit about that particular sequence
    • 0:30:16of three words in a row-- or something like have I said,
    • 0:30:20another example of another sequence of three words that's probably
    • 0:30:24quite popular, in terms of where you see it inside the English language.
    • 0:30:28So we'd like some way to be able to extract these sorts of n-grams.
    • 0:30:32And how do we do that?
    • 0:30:33How do we extract sequences of three words?
    • 0:30:35Well, we need to take our input and somehow separate it
    • 0:30:39into all of the individual words.
    • 0:30:41And this is a process generally known as tokenization,
    • 0:30:45the task of splitting up some sequence into distinct pieces,
    • 0:30:48where we call those pieces tokens.
    • 0:30:50Most commonly, this refers to something like word tokenization.
    • 0:30:53I have some sequence of text and I want to split it up
    • 0:30:55into all of the words that show up in that text.
    • 0:30:58But it might also come up in the context of something
    • 0:31:01like sentence tokenization.
    • 0:31:02I have a long sequence of text and I'd like to split it up
    • 0:31:05into sentences, for example.
    • 0:31:08And so how might word tokenization work, the task of splitting up
    • 0:31:11our sequence of characters into words?
    • 0:31:13Well, we've also already seen this idea.
    • 0:31:15We saw word tokenization just a moment ago, when I
    • 0:31:18took an input sequence and I just called Python's split method on it, where
    • 0:31:22the split method took that sequence of words
    • 0:31:25and just separated it based on where the spaces showed up in that word.
    • 0:31:29And so if I had a sentence like, whatever remains, however improbable,
    • 0:31:33must be the truth, how would I tokenize this?
    • 0:31:37Well, the naive approach is just to say, anytime you see a space,
    • 0:31:41go ahead and split it up.
    • 0:31:42We're going to split up this particular string just by looking for spaces.
    • 0:31:46And what we get when we do that is a sentence like this--
    • 0:31:49whatever remains, however improbable, must be the truth.
    • 0:31:53But what you'll notice here is that, if we just split things
    • 0:31:56up in terms of where the spaces are, we end up keeping the punctuation around.
    • 0:32:00There's a comma after the word remains.
    • 0:32:02There's a comma after improbable, a period after truth.
    • 0:32:06And this poses a little bit of a challenge, when
    • 0:32:08we think about trying to tokenize things into individual words,
    • 0:32:11because if you're comparing words to each other, this word
    • 0:32:15truth with a period after it--
    • 0:32:16if you just string compare it, it's going
    • 0:32:18to be different from the word truth without a period after it.
    • 0:32:21And so this punctuation can sometimes pose a problem for us,
    • 0:32:23and so we might want some way of dealing with it-- either treating punctuation
    • 0:32:27as a separate token altogether or maybe removing that punctuation entirely
    • 0:32:30from our sequence as well.
    • 0:32:32So that might be something we want to do.
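A crude version of that cleanup strips surrounding punctuation from each token (a naive sketch for illustration, not how NLTK's tokenizer actually works):

```python
import string

def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    return [word.strip(string.punctuation).lower() for word in text.split()]

print(tokenize("Whatever remains, however improbable, must be the truth."))
```

With this approach, "truth." and "truth" compare as equal, since both become the token truth.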
    • 0:32:35But there are other cases where it becomes a little bit less clear.
    • 0:32:38If I said something like, just before 9 o'clock,
    • 0:32:40Sherlock Holmes stepped briskly into the room,
    • 0:32:43well, this apostrophe after 9 o'clock--
    • 0:32:46after the O in 9 o'clock-- is that something we should remove?
    • 0:32:48Should we split based on that as well, into O and clock?
    • 0:32:52There's some interesting questions there too.
    • 0:32:54And it gets even trickier if you begin to think about hyphenated words--
    • 0:32:57something like this, where we have a whole bunch of words
    • 0:33:00that are hyphenated and then you need to make a judgment call.
    • 0:33:03Is that a place where you're going to split things apart
    • 0:33:06into individual words, or are you going to consider frock-coat, and well-cut,
    • 0:33:09and pearl-grey to be individual words of their own?
    • 0:33:13And so those tend to pose challenges that we need to somehow deal with
    • 0:33:16and something we need to decide as we go about trying
    • 0:33:19to perform this kind of analysis.
    • 0:33:21Similar challenges arise when it comes to the world of sentence tokenization.
    • 0:33:25Imagine this sequence of sentences, for example.
    • 0:33:29If you take a look at this particular sequence of sentences,
    • 0:33:31you could probably imagine you could extract the sentences pretty readily.
    • 0:33:35Here is one sentence and here is a second sentence,
    • 0:33:38so we have two different sentences inside of this particular passage.
    • 0:33:43And the distinguishing feature seems to be the period--
    • 0:33:46that a period separates one sentence from another.
    • 0:33:48And maybe there are other types of punctuation
    • 0:33:50you might include here as well--
    • 0:33:52an exclamation point, for example, or a question mark.
    • 0:33:55But those are the types of punctuation that we know
    • 0:33:58tend to come at the end of sentences.
    • 0:34:00But it gets trickier again if you look at a sentence like this-- not just
    • 0:34:04talking to Sherlock, but instead of talking to Sherlock,
    • 0:34:07talking to Mr. Holmes.
    • 0:34:09Well now, we have a period at the end of Mr.
    • 0:34:11And so if you were just separating on periods,
    • 0:34:13you might imagine this would be a sentence,
    • 0:34:15and then just Holmes would be a sentence,
    • 0:34:17and then we'd have a third sentence down below.
    • 0:34:19Things do get a little bit trickier as you start
    • 0:34:23to imagine these sorts of situations.
    • 0:34:25And dialogue too starts to make this trickier as well--
    • 0:34:27that if you have these sorts of lines that are inside of something that--
    • 0:34:31he said, for example--
    • 0:34:33that he said this particular sequence of words
    • 0:34:35and then this particular sequence of words.
    • 0:34:37There are interesting challenges that arise there too,
    • 0:34:40in terms of how it is that we take the sentence
    • 0:34:42and split it up into individual sentences as well.
    • 0:34:46And these are just things that our algorithm needs to decide.
    • 0:34:48In practice, there are usually some heuristics that we can use.
    • 0:34:51We know there are certain occurrences of periods,
    • 0:34:53like the period after Mr., or in other examples where
    • 0:34:56we know that is not the beginning of a new sentence,
    • 0:34:59and so we can encode those rules into our AI
    • 0:35:01to allow it to be able to do this tokenization the way
    • 0:35:04that we want it to.
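One such heuristic can be sketched like this (the abbreviation list and helper name are invented for illustration; real sentence tokenizers are considerably more sophisticated):

```python
# Words that end in a period without ending a sentence.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st."}

def sentence_tokenize(text):
    """Naive splitter: end a sentence at . ! ? unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".!?" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(sentence_tokenize("He spoke to Mr. Holmes. Then he left."))
```

Without the abbreviation check, the period after Mr. would wrongly end the first sentence.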
    • 0:35:06So once we have this ability to tokenize a particular passage--
    • 0:35:09take the passage, split it up into individual words--
    • 0:35:12from there, we can begin to extract what the n-grams actually are.
    • 0:35:17So we can actually take a look at this by going
    • 0:35:20into a Python program that will serve the purpose of extracting
    • 0:35:23these n-grams.
    • 0:35:24And again, we can use NLTK, the Natural Language Toolkit, in order
    • 0:35:27to help us here.
    • 0:35:28So I'll go ahead and go into ngrams and we'll take a look at ngrams.py.
    • 0:35:33And what we have here is we are going to take
    • 0:35:36some corpus of text, just some sequence of documents,
    • 0:35:39and use all those documents and extract what the most popular n-grams happen
    • 0:35:43to be.
    • 0:35:44So in order to do so, we're going to go ahead and load data from a directory
    • 0:35:48that we specify in the command line argument.
    • 0:35:50We'll also take in a number n as a command line argument
    • 0:35:53as well, specifying
    • 0:35:55how many words we're going to look at in sequence.
    • 0:36:00Then we're going to go ahead and just count up all of the nltk.ngrams.
    • 0:36:05So we're going to look at all of the grams across this entire corpus
    • 0:36:09and save it inside this variable ngrams.
    • 0:36:11And then we're going to look at the most common ones
    • 0:36:14and go ahead and print them out.
    • 0:36:15And so in order to do so, I'm not only using NLTK--
    • 0:36:18I'm also using counter, which is built into Python as well, where I can just
    • 0:36:21count up, how many times do these various different grams appear?
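Counting the most common n-grams follows the same pattern the program uses (shown here on a toy corpus rather than the Holmes stories; this is a hypothetical sketch, not the actual ngrams.py):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous sequences of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = "it was the best of times it was the worst of times".split()

# Tally every bigram and report the most frequent ones.
counts = Counter(ngrams(corpus, 2))
print(counts.most_common(3))
```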
    • 0:36:25So we'll go ahead and show that.
    • 0:36:27We'll go into ngrams, and I'll say something like python ngrams--
    • 0:36:31and let's just first look for the unigrams, sequences
    • 0:36:34of one word inside of a corpus.
    • 0:36:37And the corpus that I've prepared is I have
    • 0:36:39all of the-- or some of these stories from Sherlock Holmes
    • 0:36:42all here, where each one is just one of the Sherlock Holmes stories.
    • 0:36:47And so I have a whole bunch of text here inside of this corpus,
    • 0:36:50and I'll go ahead and provide that corpus as a command line argument.
    • 0:36:54And now what my program is going to do is
    • 0:36:55it's going to load all of the Sherlock Holmes stories into memory--
    • 0:36:59or all the ones that I've provided in this corpus at least--
    • 0:37:01and it's just going to look for the most popular unigrams,
    • 0:37:04the most popular sequences of one word.
    • 0:37:07And it seems the most popular one is just the word the, used 9,700 times;
    • 0:37:12followed by I, used 5,000 times; and, used about 5,000 times--
    • 0:37:15the kinds of words you might expect.
    • 0:37:18So now let's go ahead and check for bigrams, for example, ngrams 2, holmes.
    • 0:37:24All right, again, sequences of two words now that appear multiple times--
    • 0:37:28of the, in the, it was, to the, it is, I have-- so on and so forth.
    • 0:37:32These are the types of bigrams that happen
    • 0:37:34to come up quite often inside this corpus of the Sherlock
    • 0:37:37Holmes stories.
    • 0:37:38And it probably is true across other corpora as well,
    • 0:37:41but we could only find out if we actually tested it.
    • 0:37:43And now, just for good measure, let's try
    • 0:37:45one more-- maybe try three, looking now for trigrams that happen to show up.
    • 0:37:50And now we get it was the, one of the, I think that, out of the.
    • 0:37:54These are sequences of three words now that
    • 0:37:56happen to come up multiple times across this particular corpus.
    • 0:38:00So what are the potential use cases here?
    • 0:38:02Now we have some sort of data.
    • 0:38:04We have data about how often particular sequences of words
    • 0:38:07show up in this particular order, and using that,
    • 0:38:11we can begin to do some sort of predictions.
    • 0:38:13We might be able to say that, if you see the words it was,
    • 0:38:18there's a reasonable chance the word that
    • 0:38:19comes after them should be the word a.
    • 0:38:22And if I see the words one of, it's reasonable to imagine
    • 0:38:26that the next word might be the word the, for example,
    • 0:38:29because we have this data about trigrams, sequences of three words
    • 0:38:32and how often they come up.
    • 0:38:33And now, based on two words, you might be
    • 0:38:36able to predict what the third word happens to be.
    • 0:38:40And one model we can use for that is a model we've actually seen before.
    • 0:38:43It's the Markov model.
    • 0:38:45Recall again that the Markov model really
    • 0:38:47just refers to some sequence of events that happen one time
    • 0:38:50step after another, where every unit has some ability
    • 0:38:54to predict what the next unit is going to be--
    • 0:38:57or maybe the past two units predict what the next unit is going to be,
    • 0:39:00or the past three predict what the next one is going to be.
    • 0:39:03And we can use a Markov model and apply it
    • 0:39:05to language for a very naive and simple approach
    • 0:39:08at trying to generate natural language, at getting our AI
    • 0:39:11to be able to speak English-like text.
    • 0:39:14And the way it's going to work is we're going to say something like, come up
    • 0:39:18with some probability distribution.
    • 0:39:20Given these two words, what is the probability
    • 0:39:23distribution over what the third word could possibly
    • 0:39:25be based on all the data?
    • 0:39:27If you see it was, what are the possible third words we might have?
    • 0:39:30How often do they come up?
    • 0:39:32And using that information, we can try and construct
    • 0:39:35what we expect the third word to be.
    • 0:39:37And if you keep doing this, the effect is
    • 0:39:39that our Markov model can effectively start
    • 0:39:42to generate text that
    • 0:39:45was not in the original corpus, but that sounds
    • 0:39:48kind of like the original corpus.
    • 0:39:49It's using the same sorts of rules that the original corpus was using.
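Concretely, a minimal trigram Markov generator might look like this (a toy corpus and hypothetical helper names, far simpler than what a real library does):

```python
import random
from collections import defaultdict

def build_model(tokens):
    """Map each pair of consecutive words to the words observed after that pair."""
    model = defaultdict(list)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)].append(w3)
    return model

def generate(model, seed_pair, length, rng):
    """Repeatedly pick a likely next word given the last two words."""
    words = list(seed_pair)
    for _ in range(length):
        options = model.get((words[-2], words[-1]))
        if not options:
            break
        words.append(rng.choice(options))
    return " ".join(words)

tokens = "the cat sat on the mat and the cat ran to the door".split()
model = build_model(tokens)
print(generate(model, ("the", "cat"), 5, random.Random(0)))
```

Because words that appear more often after a pair are listed more often, `rng.choice` samples them in proportion to their observed frequency.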
    • 0:39:54So let's take a look at an example of that
    • 0:39:56as well, where here now, I have another corpus that I have here,
    • 0:40:01and it is the corpus of all of the works of William Shakespeare.
    • 0:40:04So I've got a whole bunch of stories from Shakespeare, and all of them
    • 0:40:09are just inside of this big text file.
    • 0:40:12And so what I might like to do is look at what all of the n-grams are--
    • 0:40:16maybe look at all the trigrams inside of shakespeare.txt--
    • 0:40:20and figure out, given two words, can I predict
    • 0:40:23what the third word is likely to be?
    • 0:40:24And then just keep repeating this process--
    • 0:40:26I have two words--
    • 0:40:27predict the third word; then, from the second and third word,
    • 0:40:29predict the fourth word; and from the third and fourth word,
    • 0:40:31predict the fifth word, ultimately generating random sentences that
    • 0:40:36sound like Shakespeare, that are using similar patterns of words
    • 0:40:39that Shakespeare used, but that never actually showed up in Shakespeare
    • 0:40:43itself.
    • 0:40:44And so to do so, I'll show you generator.py,
    • 0:40:47which, again, is just going to read data from a particular file.
    • 0:40:50And I'm using a Python library called markovify, which is just
    • 0:40:54going to do this process for me.
    • 0:40:56So there are libraries out here that can just train on a bunch of text
    • 0:40:59and come up with a Markov model based on that text.
    • 0:41:02And I'm going to go ahead and just generate
    • 0:41:04five random sentences.
    • 0:41:07So we'll go ahead and go in to markov.
    • 0:41:11I'll run the generator on shakespeare.txt.
    • 0:41:14What we'll see is it's going to load that data, and then here's what we get.
    • 0:41:18We get five different sentences, and these
    • 0:41:21are sentences that never showed up in any Shakespeare play,
    • 0:41:24but that are designed to sound like Shakespeare,
    • 0:41:27that are designed to just take two words and predict,
    • 0:41:30given those two words, what would Shakespeare have been likely to choose
    • 0:41:34as the third word that follows it.
    • 0:41:35And you know, these sentences probably don't have any meaning.
    • 0:41:38It's not like the AI is trying to express any sort of underlying meaning
    • 0:41:41here.
    • 0:41:42It's just trying to understand, based on the sequence
    • 0:41:44of words, what is likely to come after it as a next word, for example.
    • 0:41:50And these are the types of sentences that it's able to come up with,
    • 0:41:53just generating.
    • 0:41:54And if you ran this multiple times, you would end up getting different results.
    • 0:41:58I could run this again and get an entirely different set
    • 0:42:01of five different sentences that also are
    • 0:42:04supposed to sound kind of like the way that Shakespeare's sentences sounded
    • 0:42:08as well.
    • 0:42:10And so that then was a look at how it is we
    • 0:42:12can use Markov models to be able to naively attempt generating language.
    • 0:42:16The language doesn't mean a whole lot right now.
    • 0:42:18You wouldn't want to use the system in its current form
    • 0:42:21to do something like machine translation,
    • 0:42:23because it wouldn't be able to encapsulate any meaning,
    • 0:42:26but we're starting to see now that our AI is getting a little bit better
    • 0:42:30at trying to speak our language, at trying
    • 0:42:31to be able to process natural language in some sort of meaningful way.
    • 0:42:36So we'll now take a look at a couple of other tasks
    • 0:42:38that we might want our AI to be able to perform.
    • 0:42:41And one such task is text categorization, which really is just
    • 0:42:44a classification problem.
    • 0:42:46And we've talked about classification problems already,
    • 0:42:48these problems where we would like to take some object
    • 0:42:51and categorize it into a number of different classes.
    • 0:42:54And so the way this comes up in text is anytime you have some sample of text
    • 0:42:58and you want to put it inside of a category, where I want to say something
    • 0:43:02like, given an email, does it belong in the inbox or does it belong in spam?
    • 0:43:06Which of these two categories does it belong in?
    • 0:43:08And you do that by looking at the text and being
    • 0:43:12able to do some sort of analysis on that text to be able to draw conclusions,
    • 0:43:16to be able to say that, given the words that show up in the email,
    • 0:43:20I think this probably belongs in the inbox,
    • 0:43:22or I think it probably belongs in spam instead.
    • 0:43:25And you might imagine doing this for a number
    • 0:43:27of different types of classification problems of this sort.
    • 0:43:30So you might imagine that another common example of this type of idea
    • 0:43:34is something like sentiment analysis, where I want to analyze,
    • 0:43:37given a sample of text, does it have a positive sentiment
    • 0:43:41or does it have a negative sentiment?
    • 0:43:43And this might come up in the case of product reviews on a website,
    • 0:43:47for example, or feedback on a website, where you have a whole bunch of data--
    • 0:43:50samples of text that are provided by users of a website--
    • 0:43:53and you want to be able to quickly analyze, are these reviews positive,
    • 0:43:57are the reviews negative, what is it that people
    • 0:43:59are saying, just to get a sense for what it is that people are saying,
    • 0:44:03to be able to categorize text into one of these two different categories.
    • 0:44:08So how might we approach this problem?
    • 0:44:10Well, let's take a look at some sample product reviews.
    • 0:44:13Here are some sample product reviews that we might come up with.
    • 0:44:16My grandson loved it.
    • 0:44:16So much fun.
    • 0:44:17Product broke after a few days.
    • 0:44:20One of the best games I've played in a long time.
    • 0:44:22Kind of cheap and flimsy.
    • 0:44:23Not worth it.
    • 0:44:24Different product reviews that you might imagine seeing on Amazon, or eBay,
    • 0:44:28or some other website where people are selling products, for instance.
    • 0:44:31And we humans can pretty easily categorize these
    • 0:44:34into positive sentiment or negative sentiment.
    • 0:44:37We'd probably say that the first and the third one, those
    • 0:44:39are positive sentiment messages.
    • 0:44:41The second one and the fourth one, those are probably
    • 0:44:44negative sentiment messages.
    • 0:44:46But how could a computer do the same thing?
    • 0:44:48How could it try and take these reviews and assess, are they positive
    • 0:44:53or are they negative?
    • 0:44:55Well, ultimately, it depends upon the words
    • 0:44:57that happen to be in these particular reviews-- inside
    • 0:45:02of these particular sentences.
    • 0:45:03For now we're going to ignore the structure
    • 0:45:06and how the words are related to each other,
    • 0:45:08and we're just going to focus on what the words actually are.
    • 0:45:11So there are probably some key words here, words like loved,
    • 0:45:14and fun, and best.
    • 0:45:16Those probably show up in more positive reviews, whereas words
    • 0:45:20like broke, and cheap, and flimsy--
    • 0:45:23well, those are words that probably are more
    • 0:45:24likely to come up inside of negative reviews, instead of positive reviews.
    • 0:45:29So one way to approach this sort of text analysis idea
    • 0:45:33is to say, let's, for now, ignore the structures of these sentences-- to say,
    • 0:45:37we're not going to care about how it is the words relate to each other.
    • 0:45:40We're not going to try and parse these sentences to construct
    • 0:45:43the grammatical structure like we saw a moment ago.
    • 0:45:45But we can probably just rely on the words that were actually
    • 0:45:49used-- rely on the fact that the positive reviews are
    • 0:45:52more likely to have words like best, and loved, and fun,
    • 0:45:54and that the negative reviews are more likely to have the negative words
    • 0:45:58that we've highlighted there as well.
    • 0:46:00And this sort of model-- this approach to trying to think about language--
    • 0:46:03is generally known as the bag of words model,
    • 0:46:05where we're going to model a sample of text not by caring about its structure,
    • 0:46:09but just by caring about the unordered collection of words that
    • 0:46:12show up inside of a sample-- that all we care about
    • 0:46:16is what words are in the text.
    • 0:46:18And we don't care about what the order of those words is.
    • 0:46:20We don't care about the structure of the words.
    • 0:46:22We don't care what noun goes with what adjective
    • 0:46:25or how things agree with each other.
    • 0:46:26We just care about the words.
    • 0:46:28And it turns out this approach tends to work
    • 0:46:31pretty well for doing classifications like positive sentiment
    • 0:46:34or negative sentiment.
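As a data structure, a bag of words is just an unordered count of the words in a sample (a minimal sketch, assuming the text has already been tokenized by simple splitting):

```python
from collections import Counter

def bag_of_words(text):
    """Unordered word counts: order and structure are discarded."""
    return Counter(text.lower().split())

print(bag_of_words("my grandson loved it so much fun so much fun"))

# Order doesn't matter: these two samples produce the same bag.
print(bag_of_words("best games ever") == bag_of_words("ever best games"))
```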
    • 0:46:36And you could imagine doing this in a number of ways.
    • 0:46:38We've talked about different approaches to trying to solve classification style
    • 0:46:41problems, but when it comes to natural language,
    • 0:46:43one of the most popular approaches is that naive Bayes approach.
    • 0:46:48And this is one approach to trying to analyze the probability that something
    • 0:46:52is positive sentiment or negative sentiment,
    • 0:46:54or just trying to categorize some text into possible categories.
    • 0:46:58And it doesn't just work for text-- it works for other types of ideas
    • 0:47:01as well-- but it is quite popular in the world
    • 0:47:03of analyzing text and natural language.
    • 0:47:05And the naive Bayes approach is based on Bayes' rule, which
    • 0:47:09you might recall back from when we talked about probability,
    • 0:47:11that the Bayes' rule looks like this--
    • 0:47:14that the probability of some event b, given a
    • 0:47:17can be expressed using this expression over here.
    • 0:47:20Probability of b given a is the probability of a given b multiplied
    • 0:47:25by the probability of b divided by the probability of a.
    • 0:47:28And we saw that this came about as a result of just the definition
    • 0:47:32of conditional probability and looking at what it means for two events
    • 0:47:35to happen together.
    • 0:47:37This was our formulation then of Bayes' rule, which
    • 0:47:40turned out to be quite helpful.
    • 0:47:41We were able to predict one event in terms of another
    • 0:47:43by flipping the order of those events inside of this probability calculation.
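As a quick sanity check, Bayes' rule as just stated translates directly into code; the probabilities passed in below are arbitrary illustrative numbers, not data from the lecture.

```python
def bayes(p_a_given_b, p_b, p_a):
    """Bayes' rule: P(b|a) = P(a|b) * P(b) / P(a)."""
    return p_a_given_b * p_b / p_a

# With P(a|b) = 0.8, P(b) = 0.1, and P(a) = 0.2,
# Bayes' rule gives P(b|a) = 0.8 * 0.1 / 0.2 = 0.4.
p_b_given_a = bayes(0.8, 0.1, 0.2)
```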
    • 0:47:49And it turns out this approach is going to be quite helpful--
    • 0:47:51and we'll see why in a moment--
    • 0:47:53for being able to do this sort of sentiment analysis,
    • 0:47:55because I want to say, you know, what is the probability
    • 0:47:58that a message is positive, or what is the probability
    • 0:48:02that the message is negative?
    • 0:48:03And I'll go ahead and simplify this just using the emojis just
    • 0:48:06for simplicity-- probability of positive, probability of negative.
    • 0:48:10And that is what I would like to calculate,
    • 0:48:12but I'd like to calculate that given some information--
    • 0:48:15given information like here is a sample of text--
    • 0:48:18my grandson loved it.
    • 0:48:20And I would like to know not just what is the probability that any message is
    • 0:48:24positive, but what is the probability that the message is positive,
    • 0:48:27given my grandson loved it as the text of the sample?
    • 0:48:32So given this information that inside the sample are the words my grandson
    • 0:48:36loved it, what is the probability then that this is a positive message?
    • 0:48:41Well, according to the bag of words model, what we're going to do
    • 0:48:44is really ignore the ordering of the words--
    • 0:48:46not treat this as a single sentence that has some structure to it,
    • 0:48:50but just treat it as a whole bunch of different words.
    • 0:48:52We're going to say something like, what is the probability
    • 0:48:55that this is a positive message, given that the word my
    • 0:48:58was in the message, given that the word grandson was in the message,
    • 0:49:01given that the word loved was in the message, and given that the word it
    • 0:49:05was in the message?
    • 0:49:06The bag of words model here--
    • 0:49:07we're treating the entire sample as just a whole bunch
    • 0:49:11of different words.
    • 0:49:12And so this then is what I'd like to calculate, this probability--
    • 0:49:15given all those words, what is the probability
    • 0:49:18that this is a positive message?
    • 0:49:20And this is where we can now apply Bayes' rule.
    • 0:49:23This is really the probability of some b, given some a.
    • 0:49:28And that now is what I'd like to calculate.
    • 0:49:30So according to Bayes' rule, this whole expression is equal to--
    • 0:49:34well, it's the probability--
    • 0:49:35I switched the order of them--
    • 0:49:37it's the probability of all of these words,
    • 0:49:40given that it's a positive message, multiplied
    • 0:49:42by the probability that it is a positive message, divided
    • 0:49:46by the probability of all of those words.
    • 0:49:49So this then is just an application of Bayes' rule.
    • 0:49:51We've already seen where I want to express the probability of positive,
    • 0:49:56given the words, as related to somehow the probability of the words,
    • 0:50:02given that it's a positive message.
    • 0:50:04And it turns out that-- as you might recall back
    • 0:50:06when we talked about probability-- this denominator is
    • 0:50:09going to be the same.
    • 0:50:10Regardless of whether we're looking at positive or negative messages,
    • 0:50:13the probability of these words doesn't change,
    • 0:50:15because we don't have a positive or negative down below.
    • 0:50:18So we can just say that, rather than just say
    • 0:50:20that this expression up here is equal to this expression down below,
    • 0:50:23it's really just proportional to just the numerator.
    • 0:50:27We can ignore the denominator for now.
    • 0:50:29Using the denominator would get us an exact probability.
    • 0:50:32But it turns out that what we'll really just do
    • 0:50:34is figure out what the probability is proportional to, and at the end,
    • 0:50:38we'll have to normalize the probability distribution-- make
    • 0:50:41sure the probability distribution ultimately sums up to the number 1.
    • 0:50:46So now I've been able to formulate this probability--
    • 0:50:49which is what I want to care about--
    • 0:50:51as proportional to multiplying these two things together-- probability of words,
    • 0:50:56given positive message, multiplied by the probability of positive message.
    • 0:51:01But again, if you think back to our probability rules,
    • 0:51:04we can calculate this really as just a joint probability of all of these
    • 0:51:09things happening-- that the probability of positive message multiplied
    • 0:51:14by the probability of these words, given the positive message--
    • 0:51:17well, that's just the joint probability of all of these things.
    • 0:51:20This is the same thing as the probability
    • 0:51:23that it's a positive message, and my is in the message,
    • 0:51:27and grandson is in the sample, and loved is in the sample,
    • 0:51:30and it is in the sample.
    • 0:51:33So using that rule for the definition of joint probability,
    • 0:51:36I've been able to say that this entire expression is now
    • 0:51:40proportional to this sequence--
    • 0:51:43this joint probability of these words and this positive that's
    • 0:51:47in there as well.
    • 0:51:49And so now the interesting question is just how
    • 0:51:51to calculate that joint probability.
    • 0:51:54How do I figure out the probability that,
    • 0:51:55given some arbitrary message, that it is positive, and the word my is in there,
    • 0:51:59and the word grandson is in there, and the word loved is in there,
    • 0:52:03and the word it is in there?
    • 0:52:04Well, you'll recall that we can calculate a joint probability
    • 0:52:07by multiplying together all of these conditional probabilities.
    • 0:52:12If I want to know the probability of a, and b, and c,
    • 0:52:16I can calculate that as the probability of a times
    • 0:52:19the probability of b, given a, times the probability of c, given a and b.
    • 0:52:24I can just multiply these conditional probabilities together
    • 0:52:27in order to get the overall joint probability that I care about.
    • 0:52:31And we could do the same thing here.
    • 0:52:32I could say, let's multiply the probability
    • 0:52:35of positive by the probability of the word my showing up in the message,
    • 0:52:39given that it's positive, multiplied by the probability of grandson
    • 0:52:42showing up in the message, given that the word my is in there
    • 0:52:45and that it's positive, multiplied by the probability of loved,
    • 0:52:48given these three things, multiplied by the probability of it,
    • 0:52:51given these four things.
    • 0:52:53And that's going to end up being a fairly complex calculation to make,
    • 0:52:56one that we probably aren't going to have
    • 0:52:58a good way of knowing the answer to.
    • 0:53:00What is the probability that grandson is in the message, given
    • 0:53:04that it is positive and the word my is in the message?
    • 0:53:08That's not something we're really going to have a readily easy answer to,
    • 0:53:12and so this is where the naive part of naive Bayes comes about.
    • 0:53:15We're going to simplify this notion.
    • 0:53:16Rather than compute exactly what that probability distribution is,
    • 0:53:20we're going to assume that these words are
    • 0:53:23going to be effectively independent of each other,
    • 0:53:26if we know that it's already a positive message.
    • 0:53:28If it's a positive message, it doesn't change the probability
    • 0:53:32that the word grandson is in the message,
    • 0:53:34if I know that the word loved is in the message, for example.
    • 0:53:37And that might not necessarily be true in practice.
    • 0:53:39In the real world, it might not be the case
    • 0:53:41that these words are actually independent,
    • 0:53:43but we're going to assume it to simplify our model.
    • 0:53:45And it turns out that simplification still
    • 0:53:48lets us get pretty good results out of it as well.
    • 0:53:51And what we're going to assume is that the probability that all of these words
    • 0:53:55show up depend only on whether it's positive or negative.
    • 0:53:58I can still say that loved is more likely to come up
    • 0:54:01in a positive message than a negative message, which is probably true,
    • 0:54:04but we're also going to say that it's not going to change whether or not
    • 0:54:08loved is more likely or less likely to come up if I know that the word my is
    • 0:54:12in the message, for example.
    • 0:54:13And so those are the assumptions that we're going to make.
    • 0:54:16So while the top expression is proportional to this bottom expression,
    • 0:54:20we're going to say it's naively proportional to this expression,
    • 0:54:24probability of being a positive message.
    • 0:54:27And then, for each of the words that show up in the sample,
    • 0:54:30I'm going to multiply what's the probability that my
    • 0:54:33is in the message, given that it's positive,
    • 0:54:35times the probability of grandson being in the message, given
    • 0:54:37that it's positive-- and then so on and so forth
    • 0:54:40for the other words that happen to be inside of the sample.
    • 0:54:44And it turns out that these are numbers that we can calculate.
    • 0:54:47The reason we've done all of this math is to get to this point,
    • 0:54:50to be able to calculate this probability distribution that we care about,
    • 0:54:54given these terms that we can actually calculate.
    • 0:54:58And we can calculate them, given some data available to us.
    • 0:55:02And this is what a lot of natural language processing
    • 0:55:04is about these days.
    • 0:55:05It's about analyzing data.
    • 0:55:07If I give you a whole bunch of data with a whole bunch of reviews,
    • 0:55:10and I've labeled them as positive or negative,
    • 0:55:13then you can begin to calculate these particular terms.
    • 0:55:17I can calculate the probability that a message is positive just
    • 0:55:20by looking at my data and saying, how many
    • 0:55:22positive samples were there, and divide that by the number of total samples.
    • 0:55:26That is my probability that a message is positive.
    • 0:55:29What is the probability that the word loved is in the message, given
    • 0:55:32that it's positive?
    • 0:55:33Well, I can calculate that based on my data too.
    • 0:55:35Let me just look at how many positive samples have the word loved in it
    • 0:55:38and divide that by my total number of positive samples.
    • 0:55:41And that will give me an approximation for,
    • 0:55:44what is the probability that loved is going to show up inside of the review,
    • 0:55:47given that we know that the review is positive.
    • 0:55:51And so this then allows us to be able to calculate these probabilities.
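A minimal sketch of the two estimates just described, assuming we have lists of labeled samples; the tiny corpus below is made up for illustration.

```python
# Hypothetical labeled data, standing in for a real corpus of reviews.
positives = ["my grandson loved it", "it was great", "so much fun"]
negatives = ["kind of cheap", "not worth it"]

# P(positive): the number of positive samples divided by total samples.
p_positive = len(positives) / (len(positives) + len(negatives))

def p_word_given(word, samples):
    """Fraction of the samples whose words include the given word."""
    return sum(word in s.split() for s in samples) / len(samples)

# P(loved | positive): positive samples containing "loved",
# divided by the total number of positive samples.
p_loved_given_positive = p_word_given("loved", positives)
```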
    • 0:55:55So let's now actually do this calculation.
    • 0:55:56Let's calculate for the sentence, my grandson loved it.
    • 0:56:00Is it a positive or negative review?
    • 0:56:01How could we figure out those probabilities?
    • 0:56:04Well, again, this up here is the expression we're trying to calculate.
    • 0:56:07And here is the data that is available to us.
    • 0:56:10And the way to interpret this data in this case
    • 0:56:13is that, of all of the messages, 49% of them were positive and 51% of them
    • 0:56:19were negative.
    • 0:56:19Maybe online reviews tend to be a little bit more negative than they
    • 0:56:22are positive-- or at least based on this particular data
    • 0:56:24sample, that's what I have.
    • 0:56:26And then I have distributions for each of the various different words--
    • 0:56:31that, given that it's a positive message,
    • 0:56:34how many positive messages had the word my in them?
    • 0:56:38It's about 30%.
    • 0:56:39And for negative messages, how many of those had the word my in them?
    • 0:56:42About 20%-- so it seems like the word my comes up more often in positive
    • 0:56:47messages-- at least slightly more often based on this analysis here.
    • 0:56:52Grandson, for example-- maybe that showed up
    • 0:56:54in 1% of all positive messages and 2% of all negative messages
    • 0:56:58had the word grandson in it.
    • 0:57:00The word loved showed up in 32% of all positive messages, 8%
    • 0:57:05of all negative messages, for example.
    • 0:57:07And then the word it showed up in 30% of positive messages,
    • 0:57:1040% of negative messages-- again, just arbitrary data here just for example,
    • 0:57:15but now we have data with which we can begin to calculate this expression.
    • 0:57:19So how do I calculate multiplying all these values together?
    • 0:57:22Well, it's just going to be multiplying probability
    • 0:57:25that it's positive times the probability of my, given positive,
    • 0:57:29times the probability of grandson, given positive--
    • 0:57:32so on and so forth for each of the other words.
    • 0:57:34And if you do that multiplication and multiply all of those values together,
    • 0:57:37you get this, 0.00014112.
    • 0:57:42By itself, this is not a meaningful number,
    • 0:57:44but it's going to be meaningful if you compared this expression--
    • 0:57:48the probability that it's positive times the probability of all of the words,
    • 0:57:53given that I know that the message is positive,
    • 0:57:55and compare it to the same thing, but for negative sentiment messages
    • 0:57:59instead.
    • 0:57:59I want to know the probability that it's a negative message
    • 0:58:03times the probability of all of these words,
    • 0:58:05given that it's a negative message.
    • 0:58:07And so how can I do that?
    • 0:58:09Well, to do that, you just multiply probability of negative times
    • 0:58:13all of these conditional probabilities.
    • 0:58:15And if I take those five values, multiply all of them together,
    • 0:58:19then what I get is this value for negative: 0.00006528--
    • 0:58:26again, in isolation, not a particularly meaningful number.
    • 0:58:30What is meaningful is treating these two values as a probability distribution
    • 0:58:35and normalizing them, making it so that both of these values sum up to 1
    • 0:58:39the way a probability distribution should.
    • 0:58:41And we do so by adding these two up and then dividing each of these values
    • 0:58:45by their total in order to be able to normalize them.
    • 0:58:48And when we do that, when we normalize this probability distribution,
    • 0:58:51you end up getting something like this, positive 0.6837, negative 0.3163.
    • 0:58:58It seems like we've been able to conclude that we are about 68%
    • 0:59:02confident-- we think there's a probability of 0.68
    • 0:59:06that this message is a positive message-- my grandson loved it.
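The arithmetic just described can be verified directly; the priors and conditional probabilities below are the ones from the example data.

```python
# Priors and per-word conditional probabilities from the example data.
p_positive, p_negative = 0.49, 0.51
given_positive = {"my": 0.30, "grandson": 0.01, "loved": 0.32, "it": 0.30}
given_negative = {"my": 0.20, "grandson": 0.02, "loved": 0.08, "it": 0.40}

# Multiply each prior by each word's conditional probability.
score_pos, score_neg = p_positive, p_negative
for word in ["my", "grandson", "loved", "it"]:
    score_pos *= given_positive[word]
    score_neg *= given_negative[word]

# Normalize so the two values sum to 1, the way a probability
# distribution should.
total = score_pos + score_neg
prob_pos = score_pos / total
prob_neg = score_neg / total
```

Running this reproduces the unnormalized values 0.00014112 and 0.00006528 and the normalized distribution of roughly 0.68 positive, 0.32 negative.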
    • 0:59:09And why are we 68% confident?
    • 0:59:11Well, it seems like we're more confident than not because the word
    • 0:59:15loved showed up in 32% of positive messages,
    • 0:59:18but only 8% of negative messages.
    • 0:59:20So that was a pretty strong indicator.
    • 0:59:22And for the others, while it's true that the word
    • 0:59:25it showed up more often in negative messages,
    • 0:59:27it wasn't enough to offset that loved shows up
    • 0:59:30far more often in positive messages than negative messages.
    • 0:59:34And so this type of analysis is how we can apply naive Bayes.
    • 0:59:37We've just done this calculation.
    • 0:59:39And we end up getting not just a categorization of positive or negative,
    • 0:59:42but I get some sort of confidence level.
    • 0:59:44What do I think the probability is that it's positive?
    • 0:59:47And I can say I think it's positive with this particular probability.
    • 0:59:52And so naive Bayes can be quite powerful at trying to achieve this.
    • 0:59:55Using just this bag of words model, where all I'm doing
    • 0:59:58is looking at what words show up in the sample,
    • 1:00:00I'm able to draw these sorts of conclusions.
    • 1:00:03Now, one potential drawback-- something that you'll notice pretty quickly
    • 1:00:07if you start applying this rule exactly as is--
    • 1:00:10is what happens if 0's are inside this data somewhere.
    • 1:00:15Let's imagine, for example, this same sentence-- my grandson loved it--
    • 1:00:20but let's instead imagine that this value here, instead of being 0.01,
    • 1:00:24was 0, meaning inside of our data set, it has never
    • 1:00:28before happened that in a positive message the word grandson showed up.
    • 1:00:33And that's certainly possible.
    • 1:00:35If I have a pretty small data set, it's probably likely
    • 1:00:37that not all the messages are going to have the word grandson.
    • 1:00:40Maybe it is the case that no positive messages have ever
    • 1:00:43had the word grandson in it, at least in my data set.
    • 1:00:46But if it is the case that 2% of the negative messages
    • 1:00:49have still had the word grandson in it, then we
    • 1:00:52run into an interesting challenge.
    • 1:00:54And the challenge is this-- when I multiply all of the positive numbers
    • 1:00:57together and multiply all the negative numbers together to calculate these two
    • 1:01:00probabilities, what I end up getting is a positive value of 0.000.
    • 1:01:06I get pure 0's, because when I multiply all of these numbers
    • 1:01:10together-- when I multiply something by 0,
    • 1:01:12doesn't matter what the other numbers are-- the result is going to be 0.
    • 1:01:15And the same thing can be said of negative numbers as well.
    • 1:01:19So this then would seem to be a problem that, because grandson has never
    • 1:01:24showed up in any of the positive messages inside of our sample,
    • 1:01:27we're able to say-- we seem to be concluding that there is a 0%
    • 1:01:31chance that the message is positive.
    • 1:01:33And therefore, it must be negative, because the only cases where
    • 1:01:37we've seen the word grandson come up is inside of a negative message.
    • 1:01:39And in doing so, we've totally ignored all of the other probabilities
    • 1:01:43that a positive message is much more likely to have the word loved in it,
    • 1:01:46because we've multiplied by 0, which just
    • 1:01:49means none of the other probabilities can possibly matter at all.
    • 1:01:53So this then is a challenge that we need to deal with.
    • 1:01:55It means that we're likely not going to be
    • 1:01:57able to get the correct results if we just purely use this approach.
    • 1:02:00And it's for that reason there are a number of possible ways
    • 1:02:02we can try and make sure that we never multiply something by 0.
    • 1:02:06It's OK to multiply something by a small number,
    • 1:02:08because then it can still be counterbalanced
    • 1:02:10by other larger numbers, but multiplying by 0 means it's the end of the story.
    • 1:02:14You multiply a number by 0, and the output's
    • 1:02:16going to be 0, no matter how big any of the other numbers happen to be.
    • 1:02:21So one approach that's fairly common in naive Bayes is
    • 1:02:23this idea of additive smoothing, adding some value alpha to each of the values
    • 1:02:29in our distribution just to smooth the data a little bit.
    • 1:02:31One such approach is called Laplace smoothing,
    • 1:02:33which basically just means adding one to each value in our distribution.
    • 1:02:37So if I have 100 samples and zero of them contain the word grandson,
    • 1:02:43well then I might say that, you know what?
    • 1:02:45Instead, let's pretend that I've had one additional sample where the word
    • 1:02:49grandson appeared and one additional sample where the word grandson didn't
    • 1:02:53appear.
    • 1:02:53So I'll say, all right, now I have 1 out of 102--
    • 1:02:57so one sample that does have the word grandson out of 102 total.
    • 1:03:01I'm basically creating two samples that didn't exist before.
    • 1:03:05But in doing so, I've been able to smooth the distribution a little bit
    • 1:03:08to make sure that I never have to multiply anything by 0.
    • 1:03:12By pretending I've seen one more value in each category than I actually have,
    • 1:03:17this gets us that result of not having to worry
    • 1:03:19about multiplying a number by 0.
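Laplace smoothing as just described, matching the 1-out-of-102 example, might look like the following sketch (the function name is our own):

```python
def laplace(count, total, alpha=1):
    """Additive smoothing: pretend we saw alpha extra samples with the
    word and alpha extra samples without it, so no estimate is ever 0."""
    return (count + alpha) / (total + 2 * alpha)

# 0 of 100 positive samples contained "grandson"; after smoothing,
# the estimate becomes 1 out of 102 rather than 0.
p = laplace(0, 100)
```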
    • 1:03:22So this then is an approach that we can use in order
    • 1:03:24to try and apply naive Bayes, even in situations
    • 1:03:27where we're dealing with words that we might not necessarily have seen before.
    • 1:03:31And let's now take a look at how we could actually apply that in practice.
    • 1:03:35It turns out that NLTK, in addition to having the ability to extract
    • 1:03:38n-grams and tokenize things into words, also
    • 1:03:41has the ability to be able to apply naive Bayes on some samples of text,
    • 1:03:45for example.
    • 1:03:46And so let's go ahead and do that.
    • 1:03:48What I've done is, inside of sentiment, I've prepared a corpus of just
    • 1:03:52some reviews that I've generated, but you can imagine using real reviews.
    • 1:03:55I just have a couple of positive reviews-- it was great.
    • 1:03:58So much fun.
    • 1:03:58Would recommend.
    • 1:03:59My grandson loved it.
    • 1:04:00Those sorts of messages.
    • 1:04:01And then I have a whole bunch of negative reviews-- not worth it,
    • 1:04:04kind of cheap, really bad, didn't work the way we expected--
    • 1:04:07just one on each line.
    • 1:04:08A whole bunch of positive reviews and negative reviews.
    • 1:04:11And what I'd like to do now is analyze them somehow.
    • 1:04:15So here then is sentiment.py, and what we're going to do first
    • 1:04:19is extract all of the positive and negative sentences,
    • 1:04:23create a set of all of the words that were used across all of the messages,
    • 1:04:28and then we're going to go ahead and train NLTK's naive Bayes classifier
    • 1:04:33on all of this training data.
    • 1:04:34And what the training data effectively is, is I
    • 1:04:36take all of the positive messages and give them the label positive, all
    • 1:04:40the negative messages and give them the label negative,
    • 1:04:42and then I'll go ahead and apply this classifier to it, where I'd say,
    • 1:04:45I would like to take all of this training data
    • 1:04:48and now have the ability to classify it as positive or negative.
    • 1:04:52I'll then take some input from the user.
    • 1:04:53They can just type in some sequence of words.
    • 1:04:56And then I would like to classify that sequence
    • 1:04:59as either positive or negative, and then I'll
    • 1:05:01go ahead and print out what the probabilities of each happened to be.
    • 1:05:04And there are some helper functions here that just organize things in the way
    • 1:05:07that NLTK is expecting them to be.
    • 1:05:09But the key idea here is that I'm taking the positive messages,
    • 1:05:12labeling them, taking the negative messages,
    • 1:05:14labeling them, putting them inside of a classifier,
    • 1:05:16and then now trying to classify some new text that comes about.
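The lecture's sentiment.py uses NLTK's NaiveBayesClassifier for this. As a rough sketch of what that pipeline does under the hood, here is a from-scratch version with Laplace smoothing, trained on a few of the made-up reviews from the corpus; this is illustrative, not NLTK's actual implementation.

```python
# Label positive and negative samples, estimate smoothed word
# probabilities per label, and classify new text.
positives = ["it was great", "so much fun", "would recommend",
             "my grandson loved it"]
negatives = ["not worth it", "kind of cheap", "really bad",
             "didn't work the way we expected"]

vocabulary = {w for s in positives + negatives for w in s.split()}

def word_probabilities(samples):
    """P(word in sample | label), with Laplace smoothing so that
    no estimate is ever exactly 0."""
    return {w: (sum(w in s.split() for s in samples) + 1) / (len(samples) + 2)
            for w in vocabulary}

p_word_pos = word_probabilities(positives)
p_word_neg = word_probabilities(negatives)

def classify(text):
    """Return the normalized probability that the text is positive."""
    score_pos = len(positives) / (len(positives) + len(negatives))
    score_neg = len(negatives) / (len(positives) + len(negatives))
    for w in text.split():
        if w in vocabulary:          # ignore words we've never seen
            score_pos *= p_word_pos[w]
            score_neg *= p_word_neg[w]
    return score_pos / (score_pos + score_neg)
```

Words like great that appear only in positive samples pull the distribution toward positive, while words that show up equally on both sides contribute roughly equal factors and so make little difference.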
    • 1:05:21So let's go ahead and try it.
    • 1:05:23I'll go ahead and go into sentiment, and we'll run Python sentiment,
    • 1:05:26passing in as input that corpus that contains
    • 1:05:29all of the positive and negative messages--
    • 1:05:31because depending on the corpus, that's going to affect the probabilities.
    • 1:05:34The effectiveness of our ability to classify
    • 1:05:36is entirely dependent on how good our data is, and how much data we have,
    • 1:05:41and how well they happen to be labeled.
    • 1:05:42So now I can try something and say--
    • 1:05:44let's try a review like, this was great--
    • 1:05:47just some review that I might leave.
    • 1:05:49And it seems that, all right, there is a 96% chance it estimates
    • 1:05:53that this was a positive message--
    • 1:05:544% chance that it was negative, likely because the word great
    • 1:05:58shows up inside of the positive messages,
    • 1:06:00but doesn't show up inside of the negative messages.
    • 1:06:03And that might be something that our AI is able to capitalize on.
    • 1:06:06And really, what it's going to look for are the differentiating words--
    • 1:06:09that if the probability of words like this and was
    • 1:06:12is pretty similar between positive and negative messages,
    • 1:06:15then the naive Bayes classifier isn't going
    • 1:06:17to end up using those values as having some sort of importance
    • 1:06:21in the algorithm.
    • 1:06:21Because if they're the same on both sides,
    • 1:06:23you multiply that value for both positive and negative,
    • 1:06:26you end up getting about the same thing.
    • 1:06:28What ultimately makes the difference in naive Bayes
    • 1:06:30is when you multiply by a value that's much bigger for one category
    • 1:06:34than for another category-- when one word like great
    • 1:06:36is much more likely to show up in one type of message
    • 1:06:39than another type of message.
    • 1:06:41And that's one of the nice things about naive Bayes:
    • 1:06:43without me telling it that great
    • 1:06:45is more important to care about than this or was,
    • 1:06:48Naive Bayes can figure that out based on the data.
    • 1:06:50It can figure out that this shows up about the same amount of time
    • 1:06:53between the two, but great, that is a discriminator,
    • 1:06:56a word that can be different between the two types of messages.
    • 1:07:00So I could try it again--
    • 1:07:01type in a sentence like, lots of fun, for example.
    • 1:07:04This one it's a little less sure about--
    • 1:07:0662% chance that it's positive, 37% chance that it's negative-- maybe
    • 1:07:10because there aren't as clear discriminators
    • 1:07:12or differentiators inside of this data.
    • 1:07:15I'll try one more--
    • 1:07:16say kind of overpriced.
    • 1:07:20And all right, now 95%, 96% sure that this
    • 1:07:23is a negative sentiment-- likely because of the word
    • 1:07:25overpriced, because it's shown up in a negative sentiment expression
    • 1:07:29before, and therefore, it thinks, you know what, this is probably
    • 1:07:31going to be a negative sentence.
    • 1:07:34And so naive Bayes has now given us the ability to classify text.
    • 1:07:37Given enough training data, given enough examples,
    • 1:07:40we can train our AI to be able to look at natural language, human words,
    • 1:07:44figure out which words are likely to show up
    • 1:07:46in positive as opposed to negative sentiment messages,
    • 1:07:48and categorize them accordingly.
    • 1:07:50And you could imagine doing the same thing
    • 1:07:52anytime you want to take text and group it into categories.
    • 1:07:55If I want to take an email and categorize it
    • 1:07:58as a good email or as a spam email, you could apply a similar idea.
    • 1:08:01Try and look for the discriminating words,
    • 1:08:04the words that make it more likely to be a spam email or not,
    • 1:08:07and just train a naive Bayes classifier to be able to figure out
    • 1:08:10what that distribution is and to be able to figure out how to categorize
    • 1:08:14an email as good or as spam.
    • 1:08:15Now, of course, it's not going to be able to give us a definitive answer.
    • 1:08:19It gives us a probability distribution, something like 63%
    • 1:08:22positive, 37% negative.
    • 1:08:25And that might be why our spam filters and our emails sometimes make mistakes,
    • 1:08:29sometimes think that a good email is actually spam or vice
    • 1:08:32versa, because ultimately, the best that it can do
    • 1:08:36is calculate a probability distribution.
    • 1:08:37If natural language is ambiguous, we can usually
    • 1:08:40just deal in the world of probabilities to try and get
    • 1:08:42an answer that is reasonably good, even if we aren't able to guarantee for sure
    • 1:08:47that it is the number that we actually expect for it to be.
    • 1:08:50That then was a look at how we can begin to take some text
    • 1:08:54and to be able to analyze the text and group it into some sorts of categories.
    • 1:08:59But ultimately, in addition to just being able to analyze text and categorize it,
    • 1:09:04we'd like to be able to figure out information about the text,
    • 1:09:08get some sort of meaning out of the text as well.
    • 1:09:11And this starts to get us in the world of information,
    • 1:09:13of being able to try and take data in the form of text
    • 1:09:16and retrieve information from it.
    • 1:09:18So one type of problem is known as information retrieval, or IR,
    • 1:09:22which is the task of finding relevant documents in response to a query.
    • 1:09:26So this is something like you type in a query into a search engine,
    • 1:09:30like Google, or you're typing in something
    • 1:09:32into some system that's going to look for-- inside of a library catalog,
    • 1:09:35for example-- that's going to look for responses to a query.
    • 1:09:38I want to look for documents that are about the US constitution or something,
    • 1:09:43and I would like to get a whole bunch of documents
    • 1:09:45that match that query back to me.
    • 1:09:47But you might imagine that what I really want to be able to do
    • 1:09:50is, in order to solve this task effectively,
    • 1:09:53I need to be able to take documents and figure out,
    • 1:09:55what are those documents about?
    • 1:09:57I want to be able to say what is it that these particular documents are
    • 1:10:01about-- what of the topics of those documents--
    • 1:10:03so that I can then more effectively be able to retrieve information
    • 1:10:08from those particular documents.
    • 1:10:10And this refers to a set of tasks generally known as topic modeling,
    • 1:10:13where I'd like to discover what the topics are for a set of documents.
    • 1:10:17And this is something that humans could do.
    • 1:10:19A human could read a document and tell you, all right,
    • 1:10:21here's what this document is about, and give maybe
    • 1:10:23a couple of topics for who are the important people in this document, what
    • 1:10:27are the important objects in the document-- can probably tell you
    • 1:10:30that kind of thing.
    • 1:10:32But we'd like for our AI to be able to do the same thing.
    • 1:10:35Given some document, can you tell me what the important words
    • 1:10:38in this document are?
    • 1:10:39What are the words that set this document apart
    • 1:10:42that I might care about if I'm looking at documents
    • 1:10:44based on keywords, for example?
    • 1:10:47And so one instinctive idea-- an intuitive idea that probably makes
    • 1:10:49sense--
    • 1:10:50is let's just use term frequency.
    • 1:10:53Term frequency is just defined as the number of times
    • 1:10:56a particular term appears in a document.
    • 1:10:58If I have a document with 100 words and one particular word shows up 10 times,
    • 1:11:03it has a term frequency of 10.
    • 1:11:05It shows up pretty often.
    • 1:11:06Maybe that's going to be an important word.
    • 1:11:09And sometimes, you'll also see this framed
    • 1:11:10as a proportion of the total number of words, so 10 words out of 100.
    • 1:11:14Maybe it has a term frequency of 0.1, meaning 10% of all of the words
    • 1:11:19are this particular word that I care about.
    • 1:11:21Ultimately, that doesn't change the relative
    • 1:11:23importance of words within any one particular document;
    • 1:11:26the two framings are the same idea.
    • 1:11:27The idea is look for words that show up more frequently, because those
    • 1:11:31are more likely to be the important words inside of a corpus of documents.
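As a quick sketch (not the course's actual code), term frequency can be computed with a simple counter; the document string here is a made-up example:

```python
# Minimal sketch of term frequency: count how often each word appears
# in a document, both as a raw count and as a proportion of all words.
from collections import Counter

def term_frequencies(document):
    """Return the raw count of each lowercase word in a document string."""
    return Counter(document.lower().split())

# A made-up ten-word document for illustration.
doc = "the cat sat on the mat and the cat slept"
tf = term_frequencies(doc)
proportion = tf["the"] / sum(tf.values())  # 3 occurrences out of 10 words
```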
    • 1:11:35And so let's go ahead and give that a try.
    • 1:11:37Let's say I wanted to find out what the Sherlock Holmes stories are about.
    • 1:11:40I have a whole bunch of Sherlock Holmes stories
    • 1:11:42and I want to know, in general, what are they about?
    • 1:11:45What are the important characters?
    • 1:11:47What are the important objects?
    • 1:11:49What are the important parts of the story, just in terms of words?
    • 1:11:52And I'd like for the AI to be able to figure that out on its own,
    • 1:11:55and we'll do so by looking at term frequency--
    • 1:11:57by looking at, what are the words that show up the most often?
    • 1:12:01So I'll go ahead and go into the tfidf directory.
    • 1:12:06You'll see why it's called that in a moment.
    • 1:12:08But let's first open up tf0.py, which is going to calculate the top 10 term
    • 1:12:14frequencies-- or maybe top five term frequencies
    • 1:12:17for a corpus of documents, a whole bunch of documents
    • 1:12:19where each document is just a story from Sherlock Holmes.
    • 1:12:22We're going to load all the data into our corpus
    • 1:12:26and we're going to figure out, what are all of the words that
    • 1:12:29show up inside of that corpus?
    • 1:12:32And we're going to basically just assemble all
    • 1:12:35of the number of the term frequencies.
    • 1:12:36We're going to calculate, how often do each of these terms
    • 1:12:39appear inside of the document?
    • 1:12:41And we'll print out the top five.
    • 1:12:43And so there are some data structures involved that you
    • 1:12:45can take a look at if you'd like to.
    • 1:12:47The exact code is not so important, but it is the idea of what we're doing.
    • 1:12:50We're taking each of these documents and first sorting them.
    • 1:12:54We're saying, take all the words that show up
    • 1:12:56and sort them by how often each word shows up.
    • 1:13:00And let's go ahead and just, for each document, save the top five
    • 1:13:04terms that happen to show up in each of those documents.
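A rough sketch of that idea (the actual tf0.py differs in its details and helper functions) might look like this, with a tiny made-up corpus:

```python
# Sketch of the tf0 idea: for each document in a corpus, sort words by
# how often they show up and keep the top five terms.
from collections import Counter

corpus = {
    "story1": "the dog chased the cat and the cat ran up a tree".split(),
    "story2": "a ship sailed the sea and the sea was calm and blue".split(),
}

top_terms = {}
for name, words in corpus.items():
    counts = Counter(words)
    # most_common sorts terms from most to least frequent
    top_terms[name] = [term for term, count in counts.most_common(5)]
```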
    • 1:13:07So again, some helper functions you can take a look at if you're interested.
    • 1:13:10But the key idea here is that all we're going to do
    • 1:13:13is run tf0 on the Sherlock Holmes stories.
    • 1:13:18And what I'm hoping to get out of this process is I am hoping to figure out,
    • 1:13:21what are the important words in Sherlock Holmes, for example?
    • 1:13:25So we'll go ahead and run this and see what we get.
    • 1:13:29And it's loading the data.
    • 1:13:30And here's what we get.
    • 1:13:31For this particular story, the important words are the, and and, and I,
    • 1:13:36and to, and of.
    • 1:13:37Those are the words that show up more frequently.
    • 1:13:39In this particular story, it's the, and and, and I, and a, and of.
    • 1:13:45This is not particularly useful to us.
    • 1:13:47We're using term frequencies.
    • 1:13:48We're looking at what words show up the most frequently in each
    • 1:13:50of these various different documents, but what we get naturally
    • 1:13:54are just the words that show up a lot in English.
    • 1:13:57The words the, and of, and a happen to show up a lot in English,
    • 1:14:00and therefore, they happen to show up a lot in each
    • 1:14:02of these various different documents.
    • 1:14:04This is not a particularly useful metric for us
    • 1:14:06to be able to analyze what words are important,
    • 1:14:08because these words are just part of the grammatical structure of English.
    • 1:14:12And it turns out we can categorize words into a couple of different categories.
    • 1:14:17These words happen to be known as what we might call function words, words
    • 1:14:21that have little meaning on their own, but that
    • 1:14:23are used to grammatically connect different parts of a sentence.
    • 1:14:26These are words like am, and by, and do, and is, and which,
    • 1:14:29and with, and yet-- words that, on their own, what do they mean?
    • 1:14:32It's hard to say.
    • 1:14:33They get their meaning from how they connect
    • 1:14:35different parts of the sentence.
    • 1:14:36And these function words are what we might call a closed class of words
    • 1:14:40in a language like English.
    • 1:14:41There's really just some fixed list of function words,
    • 1:14:44and they don't change very often.
    • 1:14:46There's just some list of words that are commonly
    • 1:14:48used to connect other grammatical structures in the language.
    • 1:14:52And that's in contrast with what we might call content words, words
    • 1:14:56that carry meaning independently-- words like algorithm,
    • 1:14:58category, computer, words that actually have some sort of meaning.
    • 1:15:02And these are usually the words that we care about.
    • 1:15:05These are the words where we want to figure out,
    • 1:15:07what are the important words in our document?
    • 1:15:10We probably care about the content words more
    • 1:15:12than we care about the function words.
    • 1:15:15And so one strategy we could apply is just ignore all of the function words.
    • 1:15:20So here in tf1.py, I've done the same exact thing,
    • 1:15:26except I'm going to load a whole bunch of words from a function_words.txt
    • 1:15:31file, inside of which are just a whole bunch of function words in alphabetical
    • 1:15:35order.
    • 1:15:36These are just a whole bunch of function words
    • 1:15:38that are just words that are used to connect other words in English,
    • 1:15:41and someone has just compiled this particular list.
    • 1:15:44And these are the words that I just want to ignore.
    • 1:15:46If any of these words shows up, let's just ignore it as one of the top terms,
    • 1:15:49because these are not words that I probably care about
    • 1:15:52if I want to analyze what the important terms inside of a document
    • 1:15:56happen to be.
    • 1:15:57So in tf1, what we're ultimately doing is,
    • 1:16:01if the word is in my set of function words,
    • 1:16:05I'm just going to skip over it, just ignore any of the function words
    • 1:16:08by continuing on to the next word and then
    • 1:16:11just calculating the frequencies for those words instead.
    • 1:16:14So I'm going to pretend the function words aren't there,
    • 1:16:16and now maybe I can get a better sense for what
    • 1:16:19terms are important in each of the various different Sherlock Holmes
    • 1:16:23stories.
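Sketched in code (with a tiny stand-in for function_words.txt; the real list is far longer), the filtering looks something like this:

```python
# Sketch of the tf1 idea: skip function words when counting term
# frequencies, so only content words get ranked.
from collections import Counter

# A tiny stand-in for function_words.txt -- the real file is much longer.
FUNCTION_WORDS = {"the", "a", "an", "and", "of", "to", "in", "i"}

def content_term_frequencies(words):
    counts = Counter()
    for word in words:
        if word in FUNCTION_WORDS:
            continue  # ignore function words entirely
        counts[word] += 1
    return counts

words = "the hound of the baskervilles and the hound howled".split()
counts = content_term_frequencies(words)
```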
    • 1:16:24So now let's run tf1 on the Sherlock Holmes corpus and see what we get now.
    • 1:16:29And let's look at, what is the most important term in each of the stories?
    • 1:16:32Well, it seems like, for each of the stories,
    • 1:16:34the most important word is Holmes.
    • 1:16:36I guess that's what we would expect.
    • 1:16:38They're all Sherlock Holmes stories.
    • 1:16:39And Holmes is not a function word.
    • 1:16:40It's not the, or a, or an, so it wasn't ignored.
    • 1:16:44But Holmes and man--
    • 1:16:46these are probably not what I mean when I say, what are the important words?
    • 1:16:50Even though Holmes does show up the most often,
    • 1:16:52it's not giving me a whole lot of information here
    • 1:16:54about what each of the different Sherlock Holmes stories
    • 1:16:57are actually about.
    • 1:16:59And the reason why is because Sherlock Holmes shows up in all the stories,
    • 1:17:02and so it's not meaningful for me to say that this story is about Sherlock
    • 1:17:06Holmes if I want to try and figure out the different topics
    • 1:17:09across the corpus of documents.
    • 1:17:11What I really want to know is, what words show up
    • 1:17:13in this document that show up less frequently in the other documents,
    • 1:17:18for example?
    • 1:17:19And so to get at that idea, we're going to introduce the notion
    • 1:17:22of inverse document frequency.
    • 1:17:25Inverse document frequency is a measure of how common,
    • 1:17:29or rare, a word happens to be across an entire corpus of words.
    • 1:17:33And mathematically, it's usually calculated like this--
    • 1:17:35as the logarithm of the total number of documents
    • 1:17:39divided by the number of documents containing the word.
    • 1:17:43So if a word like Holmes shows up in all of the documents,
    • 1:17:47well, then the total number of documents
    • 1:17:50and the number of documents containing Holmes are going to be the same number.
    • 1:17:55So when you divide these two, you'll get 1, and the logarithm of 1
    • 1:17:58is just 0.
    • 1:18:00And so what we get is, if Holmes shows up in all of the documents,
    • 1:18:04it has an inverse document frequency of 0.
    • 1:18:07And you can think now of inverse document frequency
    • 1:18:09as a measure of how rare a word
    • 1:18:13in this particular document is: if a word doesn't show up
    • 1:18:16across many documents at all, this number is going to be much higher.
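That formula is easy to sketch directly (the documents here are made-up sets of words):

```python
# Sketch of inverse document frequency:
# idf(word) = log(total documents / documents containing the word)
import math

documents = [
    {"holmes", "watson", "london"},
    {"holmes", "moor", "hound"},
    {"holmes", "watson", "letter"},
]

def idf(word, documents):
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing)

# "holmes" appears in all three documents: idf = log(3/3) = 0.
# "hound" appears in only one: idf = log(3/1), a much higher value.
```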
    • 1:18:21And this then gets us to a model known as tf-idf,
    • 1:18:24which is a method for ranking what words are important in the document
    • 1:18:28by multiplying these two ideas together.
    • 1:18:30Multiply term frequency, or TF, by inverse document frequency, or IDF,
    • 1:18:37where the idea here now is that how important a word is
    • 1:18:39depends on two things.
    • 1:18:41It depends on how often it shows up in the document using
    • 1:18:44the heuristic that, if a word shows up more often,
    • 1:18:46it's probably more important.
    • 1:18:47And we multiply that by inverse document frequency IDF,
    • 1:18:51because if the word is rarer, but it shows up in the document,
    • 1:18:54it's probably more important than if the word shows up
    • 1:18:57across most or all of the documents, because then it's probably
    • 1:19:00a less important factor in what the different topics
    • 1:19:02across the different documents in the corpus happen to be.
    • 1:19:06And so now let's go ahead and apply this algorithm on the Sherlock Holmes
    • 1:19:11corpus.
    • 1:19:13And here's tfidf.
    • 1:19:15Now what I'm doing is, for each of the documents,
    • 1:19:18for each word, I'm calculating its TF score,
    • 1:19:22term frequency, multiplied by the inverse document
    • 1:19:25frequency of that word-- not just looking at a single value,
    • 1:19:28but multiplying these two values together
    • 1:19:30in order to compute the overall values.
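A condensed sketch of that computation (with a made-up two-document corpus; the lecture's tfidf.py is organized differently):

```python
# Sketch of tf-idf: multiply each word's term frequency in a document
# by its inverse document frequency across the corpus.
import math
from collections import Counter

docs = [
    "holmes examined the hound on the moor".split(),
    "holmes wrote watson a letter".split(),
]

def tfidf_scores(doc_index, docs):
    tf = Counter(docs[doc_index])
    scores = {}
    for word, count in tf.items():
        containing = sum(1 for d in docs if word in d)
        scores[word] = count * math.log(len(docs) / containing)
    return scores

scores = tfidf_scores(0, docs)
# "holmes" shows up in both documents, so its idf -- and tf-idf -- is 0;
# "hound" is unique to the first document, so it scores higher.
```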
    • 1:19:33And now, if I run tfidf on the Holmes corpus,
    • 1:19:37this is going to try and get us a better approximation for what's
    • 1:19:40important in each of the stories.
    • 1:19:41And it seems like it's trying to extract here
    • 1:19:44probably like the names of characters that
    • 1:19:46happen to be important in the story-- characters that show up
    • 1:19:49in this story that don't show up in the other story--
    • 1:19:51and prioritizing the more important characters that
    • 1:19:53happen to show up more often.
    • 1:19:56And so this then might be a better analysis of what types of topics
    • 1:20:00are more or less important.
    • 1:20:02I also have another corpus, which is a corpus of all of the Federalist
    • 1:20:05Papers from American history.
    • 1:20:07If I go ahead and run tfidf on the Federalist Papers,
    • 1:20:11we can begin to see what the important words in each
    • 1:20:14of the various different Federalist Papers happen to be--
    • 1:20:16that Federalist Paper Number 61, it seems, is a lot about elections.
    • 1:20:22In Federalist Paper Number 66, it's about the Senate and impeachments.
    • 1:20:25You can start to extract what the important terms and what
    • 1:20:28the important words are just by looking at what things
    • 1:20:32don't show up across many of the documents,
    • 1:20:34but show up frequently enough in certain documents.
    • 1:20:38And so this can be a helpful tool for trying
    • 1:20:40to figure out this kind of topic modeling,
    • 1:20:43figuring out what it is that a particular document happens
    • 1:20:47to be about.
    • 1:20:48And so this then is starting to get us into this world of semantics,
    • 1:20:53what it is that things actually mean when we're talking about language.
    • 1:20:56Now, we're no longer going to think of text as just a bag of words,
    • 1:20:59where we treat a sample of text as just a whole bunch of words
    • 1:21:02and we don't care about the order.
    • 1:21:04Now, when we get into the world of semantics,
    • 1:21:06we really do start to care about what it is that these words actually mean,
    • 1:21:10how it is these words relate to each other,
    • 1:21:12and in particular, how we can extract information out of that text.
    • 1:21:17Information extraction is the task of extracting knowledge
    • 1:21:20from our documents-- figuring out, given a whole bunch of text,
    • 1:21:23can we automate the process of having an AI look at those documents,
    • 1:21:28and get out what the useful or relevant knowledge inside those documents
    • 1:21:31happens to be?
    • 1:21:33So let's take a look at an example.
    • 1:21:34I'll give you two samples from news articles.
    • 1:21:37Here up above is a sample of a news article from the Harvard Business
    • 1:21:40Review that was about Facebook.
    • 1:21:42Down below is an example of a Business Insider article from 2018
    • 1:21:45that was about Amazon.
    • 1:21:47And there's some information here that we might
    • 1:21:49want an AI to be able to extract--
    • 1:21:51information, knowledge about these companies
    • 1:21:54that we might want to extract.
    • 1:21:55And in particular, what I might want to extract is--
    • 1:21:58let's say I want to know data about when companies were founded--
    • 1:22:02that I wanted to know that Facebook was founded in 2004,
    • 1:22:05Amazon founded in 1994--
    • 1:22:07that that is important information that I happen to care about.
    • 1:22:10Well, how do we extract that information from the text?
    • 1:22:13What is my way of being able to understand this text
    • 1:22:15and figure out, all right, Facebook was founded in 2004?
    • 1:22:18Well, what I can look for are templates or patterns, things
    • 1:22:22that happened to show up across multiple different documents that give me
    • 1:22:26some sense for what this knowledge happens to mean.
    • 1:22:28And what we'll notice is a common pattern
    • 1:22:30between both of these passages, which is this phrasing here.
    • 1:22:34When Facebook was founded in 2004, comma--
    • 1:22:37and then down below, when Amazon was founded in 1994, comma.
    • 1:22:42And those two templates end up giving us a mechanism for trying to extract
    • 1:22:47information-- that this notion, when company was founded in year comma,
    • 1:22:53this can tell us something about when a company was founded,
    • 1:22:56because if we set our AI loose on the web,
    • 1:22:58and let it look at a whole bunch of papers or a whole bunch of articles,
    • 1:23:01and it finds this pattern--
    • 1:23:03when blank was founded in blank, comma--
    • 1:23:06well, then our AI can pretty reasonably conclude
    • 1:23:09that there's a good chance that this is going to be like some company,
    • 1:23:13and this is going to be like the year that company was founded, for example--
    • 1:23:17might not be perfect, but at least it's a good heuristic.
    • 1:23:20And so you might imagine that, if you wanted
    • 1:23:22to train an AI to be able to look for information,
    • 1:23:25you might give the AI templates like this--
    • 1:23:27not only give it a template like when company blank was founded in blank,
    • 1:23:31but give it like, the book blank was written by blank, for example.
    • 1:23:34Just give it some templates where it can search the web,
    • 1:23:37search a whole big corpus of documents, looking for templates that match that,
    • 1:23:41and if it finds that, then it's able to figure out,
    • 1:23:44all right, here's the company and here's the year.
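One very simple way to encode such a template is as a regular expression (a hypothetical sketch; the sentences below are paraphrased examples):

```python
# Sketch of template-based extraction: a regular expression encoding the
# pattern "when <company> was founded in <year>," pulls out the pairs.
import re

TEMPLATE = re.compile(r"[Ww]hen (\w+) was founded in (\d{4}),")

text = ("When Facebook was founded in 2004, it was a college site. "
        "Back when Amazon was founded in 1994, the web was young.")

pairs = TEMPLATE.findall(text)  # list of (company, year) tuples
```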
    • 1:23:47But of course, that requires us to write these templates.
    • 1:23:50It requires us to figure out, what is the structure of this information
    • 1:23:53likely going to look like?
    • 1:23:54And it might be difficult to know.
    • 1:23:56The different websites are, of course, going to do this differently.
    • 1:23:58This type of method isn't going to be able to extract all of the information,
    • 1:24:01because if the words are slightly in a different order,
    • 1:24:04it won't match on that particular template.
    • 1:24:06But one thing we can do is, rather than give our AI the template,
    • 1:24:11we can give AI the data.
    • 1:24:13We can tell the AI, Facebook was founded in 2004 and Amazon was founded in 1994,
    • 1:24:19and just tell the AI those two pieces of information,
    • 1:24:22and then set the AI loose on the web.
    • 1:24:24And now the idea is that the AI can begin to look for where Facebook and 2004
    • 1:24:30show up together, where Amazon and 1994 show up together,
    • 1:24:33and it can discover these templates for itself.
    • 1:24:36It can discover that this kind of phrasing--
    • 1:24:38when blank was founded in blank--
    • 1:24:40tends to relate Facebook to 2004, and it relates Amazon to 1994,
    • 1:24:45so maybe it will hold the same relation for others as well.
    • 1:24:49And this ends up being-- this automated template
    • 1:24:51generation ends up being quite powerful, and we'll go ahead
    • 1:24:54and take a look at that now as well.
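A highly simplified sketch of that template-discovery idea (not the actual search.py; the corpus string and the 20-character window are made-up illustrations):

```python
# Sketch of automated template generation: given known (company, year)
# pairs, find the text between them, then reuse that text as a template
# to extract new pairs from the corpus.
import re

known_pairs = [("Facebook", "2004"), ("Amazon", "1994")]
corpus = ("When Facebook was founded in 2004, few predicted its growth. "
          "Back when Amazon was founded in 1994, online retail was new. "
          "When Google was founded in 1998, search was already crowded.")

# Discover the "middle" text that connects each known pair.
middles = set()
for left, right in known_pairs:
    match = re.search(re.escape(left) + r"(\W.{1,20}?\W)" + re.escape(right),
                      corpus)
    if match:
        middles.add(match.group(1))

# Reuse each discovered middle as a template to find new pairs.
results = set()
for middle in middles:
    results.update(re.findall(r"(\w+)" + re.escape(middle) + r"(\w+)", corpus))
```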
    • 1:24:56What I have here inside of templates directory
    • 1:24:59is a file called companies.csv, and this is all of the data
    • 1:25:03that I am going to give to my AI.
    • 1:25:04I'm going to give it the pair Amazon, 1994 and Facebook, 2004.
    • 1:25:09And what I'm going to tell my AI to do is
    • 1:25:11search a corpus of documents for other data--
    • 1:25:14these pairs like this-- other relationships.
    • 1:25:16I'm not telling the AI that this is a company and the date
    • 1:25:18that it was founded.
    • 1:25:19I'm just giving it Amazon, 1994 and Facebook, 2004
    • 1:25:23and letting the AI do the rest.
    • 1:25:25And what the AI is going to do is it's going to look through my corpus--
    • 1:25:28here's my corpus of documents--
    • 1:25:30and it's going to find, like inside of Business Insider,
    • 1:25:33that we have sentences like, back when Amazon was founded in 1994, comma--
    • 1:25:38and that kind of phrasing is going to be similar to this Harvard Business Review
    • 1:25:42story that has a sentence like, when Facebook was founded in 2004--
    • 1:25:46and it's going to look across a number of other documents
    • 1:25:49for similar types of patterns to be able to extract that kind of information.
    • 1:25:53And what it will do is, if I go ahead and run,
    • 1:25:56I'll go ahead and go into templates.
    • 1:25:58So I'll say python search.py.
    • 1:26:01I'm going to look for data like the data in companies.csv
    • 1:26:05inside of the companies directory, which contains a whole bunch of news articles
    • 1:26:08that I've curated in advance.
    • 1:26:10And here's what I get--
    • 1:26:12Google 1998, Apple 1976, Microsoft 1975--
    • 1:26:15so on and so forth--
    • 1:26:16Walmart 1962, for example.
    • 1:26:18These are all of the pieces of data that happened
    • 1:26:20to match that same template that we were able to find before.
    • 1:26:23And how was it able to find this?
    • 1:26:25Well, it's probably because, if we look at the Forbes article,
    • 1:26:29for example, that it has a phrase in it like, when Walmart was founded in 1962,
    • 1:26:34comma-- that it's able to identify these sorts of patterns
    • 1:26:38and extract information from them.
    • 1:26:39Now, granted, I have curated all these stories in advance
    • 1:26:42in order to make sure that there is data that it's able to match on.
    • 1:26:46And in practice, it's not always going to be in this exact format
    • 1:26:49when you're seeing a company related to the year in which it was founded,
    • 1:26:52but if you give the AI access to enough data-- like all of the data of text
    • 1:26:56on the internet-- and just have the AI crawl the internet looking
    • 1:26:58for information, it can very reliably, or with some probability,
    • 1:27:02try and extract information using these sorts of templates
    • 1:27:05and be able to generate interesting sorts of knowledge.
    • 1:27:08And the more knowledge it learns, the more new templates
    • 1:27:10it's able to construct, looking for constructions that
    • 1:27:13show up in other locations as well.
    • 1:27:15So let's take a look at another example.
    • 1:27:17And then I'll here show you presidents.csv,
    • 1:27:20where I have two presidents and their inauguration date--
    • 1:27:23so George Washington 1789, Barack Obama 2009 for example.
    • 1:27:28And I also am going to give to our AI a corpus that
    • 1:27:31just contains a single document, which is the Wikipedia
    • 1:27:34article for the list of presidents of the United States, for example--
    • 1:27:37just information about presidents.
    • 1:27:39And I'd like to extract from this raw HTML document on a web page information
    • 1:27:45about the president.
    • 1:27:45So I can say search in presidents.csv.
    • 1:27:50And what I get is a whole bunch of data about presidents
    • 1:27:53and what year they were likely inaugurated, by looking
    • 1:27:56for patterns that matched--
    • 1:27:58Barack Obama 2009, for example--
    • 1:28:00looking for these sorts of patterns that happened
    • 1:28:02to give us some clues as to what it is that a story happens to be about.
    • 1:28:07So here's another example.
    • 1:28:08If I open up the olympics directory, here is a scraped version
    • 1:28:12of the Olympic home page that has information
    • 1:28:15about various different Olympics.
    • 1:28:16And maybe I want to extract Olympic locations and years
    • 1:28:20from this particular page.
    • 1:28:21Well, the way I can do that is using the exact same algorithm.
    • 1:28:24I'm just saying, all right, here are two Olympics and where they were located--
    • 1:28:29so 2012 London, for example.
    • 1:28:32Let me go ahead and just run this process,
    • 1:28:35Python search, on olympics.csv, look at all the Olympic data set,
    • 1:28:39and here I get some information back.
    • 1:28:41Now, this information-- not totally perfect.
    • 1:28:43There are a couple of examples that are obviously not
    • 1:28:45quite right, because my template might have been a little bit too general.
    • 1:28:48Maybe it was looking for a broad category of things
    • 1:28:51and certain strange things happened to capture on that particular template.
    • 1:28:55So you could imagine adding rules to try and make this process more intelligent,
    • 1:28:58making sure the thing on the left is just a year, for example,
    • 1:29:02and doing other sorts of analysis.
    • 1:29:04But purely just based on some data, we are
    • 1:29:07able to extract some interesting information using some algorithms.
    • 1:29:10And all search.py is really doing here is it is taking my corpus of data,
    • 1:29:16finding templates that match it--
    • 1:29:18here, I'm filtering down to just the top two templates that happen to match--
    • 1:29:22and then using those templates to extract results from the data
    • 1:29:26that I have access to, being able to look for all of the information
    • 1:29:30that I care about.
    • 1:29:31And that's ultimately what's going to help me
    • 1:29:33print out those results and figure out what the matches happen to be.
    • 1:29:38And so information extraction is another powerful tool
    • 1:29:41when it comes to trying to extract information.
    • 1:29:43But of course, it only works in very limited contexts.
    • 1:29:46It only works when I'm able to find templates that look exactly
    • 1:29:49like this in order to come up with some sort of match that
    • 1:29:53is able to connect this to some pair of data,
    • 1:29:55that this company was founded in this year.
    • 1:29:57What I might want to do, as we start to think about the semantics of words,
    • 1:30:01is to begin to imagine some way of coming up with definitions
    • 1:30:04for all words, being able to relate all of the words in a dictionary
    • 1:30:08to each other, because that's ultimately what's going to be necessary if we want
    • 1:30:12our AI to be able to communicate.
    • 1:30:13We need some representation of what it is that words mean.
    • 1:30:18And one approach to doing this is a famous data set called WordNet.
    • 1:30:22And what WordNet is is a human-curated data set--
    • 1:30:24researchers have curated together a whole bunch of words,
    • 1:30:27their definitions, their various different senses--
    • 1:30:29because a word might have multiple different meanings--
    • 1:30:31and also how those words relate to one another.
    • 1:30:35And so what we mean by this is--
    • 1:30:36I can show you an example of WordNet.
    • 1:30:38WordNet comes built into NLTK.
    • 1:30:40Using NLTK, you can download and access WordNet.
    • 1:30:44So let me go into WordNet, and go ahead and run WordNet,
    • 1:30:48and extract information about a word-- a word like city, for example.
    • 1:30:52Go ahead and press Return.
    • 1:30:53And here is the information that I get back about a city.
    • 1:30:56It turns out that city has three different senses, three
    • 1:30:59different meanings, according to WordNet.
    • 1:31:01And it's really just kind of like a dictionary, where
    • 1:31:03each sense is associated with its meaning-- just some definition
    • 1:31:07provided by a human.
    • 1:31:08And then it's also got categories, for example, that a word belongs to--
    • 1:31:13that a city is a type of municipality, a city
    • 1:31:15is a type of administrative district.
    • 1:31:18And that allows me to relate words to other words.
    • 1:31:20So one of the powers of WordNet is the ability to take one word
    • 1:31:24and connect it to other related words.
    • 1:31:28If I do another example, let me try the word house, for instance.
    • 1:31:33I'll type in the word house and see what I get back.
    • 1:31:36Well, all right, the house is a kind of building.
    • 1:31:38The house is somehow related to a family unit.
    • 1:31:42And so you might imagine trying to come up
    • 1:31:43with these various different ways of describing a house.
    • 1:31:46It is a building.
    • 1:31:47It is a dwelling.
    • 1:31:48And researchers have just curated these relationships
    • 1:31:51between these various different words to say that a house is a type of building,
    • 1:31:55that a house is a type of dwelling, for example.
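With NLTK installed, `wordnet.synsets("house")` gives those senses and `.hypernyms()` gives the "is a kind of" links; as a self-contained toy (made-up miniature data, not the real WordNet), the relation looks like this:

```python
# Toy stand-in for WordNet's hypernym ("is a kind of") relation.
# The real data set is far richer and is accessed via nltk.corpus.wordnet.
hypernyms = {
    "house": "building",
    "building": "structure",
    "city": "municipality",
    "municipality": "administrative_district",
}

def hypernym_chain(word):
    """Follow 'is a kind of' links upward, as WordNet lets you do."""
    chain = [word]
    while chain[-1] in hypernyms:
        chain.append(hypernyms[chain[-1]])
    return chain
```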
    • 1:31:58But this type of approach, while certainly
    • 1:32:01helpful for being able to relate words to one another,
    • 1:32:04doesn't scale particularly well.
    • 1:32:06As you start to think about language changing,
    • 1:32:08as you start to think about all the various different relationships
    • 1:32:11that words might have to one another, this challenge of word representation
    • 1:32:16ends up being difficult. What we've done is just
    • 1:32:18defined a word as just a sentence that explains what it is that that word is,
    • 1:32:23but what we really would like is some way
    • 1:32:26to represent the meaning of a word in a way
    • 1:32:28that our AI is going to be able to do something useful with it.
    • 1:32:31Anytime we want our AI to be able to look at texts
    • 1:32:33and really understand what that text means,
    • 1:32:35to relate text and words to similar words
    • 1:32:38and understand the relationship between words,
    • 1:32:40we'd like some way that a computer can represent this information.
    • 1:32:44And what we've seen all throughout the course
    • 1:32:46multiple times now is the idea that, when
    • 1:32:48we want our AI to represent something, it
    • 1:32:51can be helpful to have the AI represent it using numbers--
    • 1:32:54that we've seen that we can represent utilities in a game,
    • 1:32:57like winning, or losing, or drawing, as a number--
    • 1:32:591, negative 1, or a 0.
    • 1:33:01We've seen other ways that we can take data and turn it
    • 1:33:04into a vector of features, where we just have
    • 1:33:06a whole bunch of numbers that represent some particular piece of data.
    • 1:33:11And if we ever want to pass words into a neural network,
    • 1:33:14for instance, to be able to say, given some word,
    • 1:33:16translate this sentence into another sentence,
    • 1:33:18or to be able to do interesting classifications with neural networks
    • 1:33:21on individual words, we need some representation of words
    • 1:33:26just in terms of vectors--
    • 1:33:27a way to represent words, just by using individual numbers
    • 1:33:31to define the meaning of a word.
    • 1:33:34So how do we do that?
    • 1:33:35How do we take words and turn them into vectors
    • 1:33:37that we can use to represent the meaning of those words?
    • 1:33:40Well, one way is to do this.
    • 1:33:42If I have four words that I want to encode, like he wrote a book,
    • 1:33:46I can just say, let's let the word he be this vector--
    • 1:33:491, 0, 0, 0.
    • 1:33:51Wrote will be 0, 1, 0, 0.
    • 1:33:53A will be 0, 0, 1, 0.
    • 1:33:56Book will be 0, 0, 0, 1.
    • 1:33:59Effectively, what I have here is what's known as a one-hot representation
    • 1:34:03or a one-hot encoding, which is a representation of meaning,
    • 1:34:06where meaning is a vector that has a single 1 in it and the rest are 0's.
    • 1:34:10The location of the 1 tells me the meaning of the word--
    • 1:34:14that a 1 in the first position means he--
    • 1:34:17a 1 in the second position means wrote.
    • 1:34:19And every word in the dictionary is going
    • 1:34:21to be assigned to some representation like this, where we just
    • 1:34:24assign one place in the vector that has a 1 for the word
    • 1:34:28and 0 for the other words.
    • 1:34:29And now I have representations of words that
    • 1:34:31are different for a whole bunch of different words.
    • 1:34:33This is this one-hot representation.
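The one-hot scheme just described can be sketched in a few lines of Python, using the four-word example from above:

```python
def one_hot(vocabulary):
    """Map each word to a vector with a single 1 in its own position."""
    size = len(vocabulary)
    encoding = {}
    for i, word in enumerate(vocabulary):
        vector = [0] * size  # all zeros...
        vector[i] = 1        # ...except a 1 at this word's index
        encoding[word] = vector
    return encoding

vectors = one_hot(["he", "wrote", "a", "book"])
print(vectors["he"])    # [1, 0, 0, 0]
print(vectors["book"])  # [0, 0, 0, 1]
```

Note that every vector's length equals the vocabulary size, which is exactly the scaling problem raised next.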
    • 1:34:36So what are the drawbacks of this?
    • 1:34:38Why is this not necessarily a great approach?
    • 1:34:40Well, here, I am only creating enough vectors
    • 1:34:42to represent four words in a dictionary.
    • 1:34:45If you imagine a dictionary with 50,000 words that I might want to represent,
    • 1:34:49now these vectors get enormously long.
    • 1:34:51These are 50,000 dimensional vectors to represent
    • 1:34:54a vocabulary of 50,000 words-- that he is a 1 followed by all these 0's.
    • 1:34:58Wrote has a whole bunch of 0's in it.
    • 1:35:01That's not a particularly tractable way of trying to represent words,
    • 1:35:05if I'm going to have to deal with vectors of length 50,000.
    • 1:35:09Another problem-- a subtler problem--
    • 1:35:12is that ideally, I'd like for these vectors
    • 1:35:14to somehow represent meaning in a way that I can extract
    • 1:35:17useful information out of-- that if I have the sentence he wrote a book
    • 1:35:21and he authored a novel, well, wrote and authored are going to be two
    • 1:35:26totally different vectors.
    • 1:35:28And book and novel are going to be two totally different vectors inside
    • 1:35:32of my vector space that have nothing to do with each other.
    • 1:35:35The 1 is just located in a different position.
    • 1:35:38And really, what I would like to have happen
    • 1:35:40is for wrote and authored to have vectors
    • 1:35:43that are similar to one another, and for book and novel
    • 1:35:47to have vector representations that are similar to one another,
    • 1:35:49because they are words that have similar meanings.
    • 1:35:52Because their meanings are similar, ideally, I'd like for--
    • 1:35:56when I put them in vector form and use a vector to represent meanings,
    • 1:35:59I would like for those vectors to be similar to one another as well.
    • 1:36:04So rather than this one-hot representation,
    • 1:36:06where we represent a word's meaning by just giving it a vector that is one
    • 1:36:10in a particular location, what we're going to do--
    • 1:36:12which is a bit of a strange thing the first time you see it--
    • 1:36:15is what we're going to call a distributed representation.
    • 1:36:18We are going to represent the meaning of a word as just
    • 1:36:21a whole bunch of different values-- not just a single 1 and the rest 0's,
    • 1:36:25but a whole bunch of values.
    • 1:36:26So for example, in he wrote a book, he might just be a big vector.
    • 1:36:31Maybe it's 50 dimensions, maybe it's 100 dimensions, but certainly less
    • 1:36:34than like tens of thousands, where each value is just some number--
    • 1:36:39and same thing for wrote, and a, and book.
    • 1:36:42And the idea now is that, using these vector representations,
    • 1:36:45I'd hope that wrote and authored have vector representations that
    • 1:36:48are pretty close to one another.
    • 1:36:50Their distance is not too far apart-- and same with the vector
    • 1:36:52representations for book and novel.
    • 1:36:56So this is going to be the goal of a lot of what statistical machine learning
    • 1:37:00approaches to natural language processing are about:
    • 1:37:02using these vector representations of words.
    • 1:37:06But how on earth do we define a word as just a whole bunch
    • 1:37:10of these sequences of numbers?
    • 1:37:11What does it even mean to talk about the meaning of a word?
    • 1:37:16The famous quote that answers this question
    • 1:37:18is from a British linguist in the 1950s, JR Firth, who said, "You shall
    • 1:37:22know a word by the company it keeps."
    • 1:37:28And what we mean by that is the idea that we
    • 1:37:30can define a word in terms of the words that show up around it, that we can get
    • 1:37:35at the meaning of a word based on the context in which that word happens
    • 1:37:39to appear.
    • 1:37:40That if I have a sentence like this, four words in sequence--
    • 1:37:43for blank he ate--
    • 1:37:46what goes in the blank?
    • 1:37:47Well, you might imagine that, in English,
    • 1:37:49the types of words that might fill in the blank are words like breakfast,
    • 1:37:52or lunch, or dinner.
    • 1:37:53These are the kinds of words that fill in that blank.
    • 1:37:56And so if we want to define, what does lunch or dinner mean,
    • 1:38:00we can define it in terms of what words happen
    • 1:38:03to show up around it-- that if a word shows up
    • 1:38:07in a particular context and another word happens to show up
    • 1:38:09in very similar context, then those two words are probably
    • 1:38:13related to each other.
    • 1:38:15They probably have a similar meaning to one another.
    • 1:38:18And this then is the foundational idea of an algorithm
    • 1:38:20known as word2vec, which is a model for generating word vectors.
    • 1:38:24You give word2vec a corpus of documents, just a whole bunch of texts,
    • 1:38:28and what word2vec will produce is vectors for each word.
    • 1:38:34And there are a number of ways that it can do this.
    • 1:38:36One common way is through what's known as the skip-gram architecture, which
    • 1:38:40basically uses a neural network to predict context words,
    • 1:38:44given a target word-- so given a word like lunch,
    • 1:38:47use a neural network to try and predict, given the word lunch, what
    • 1:38:50words are going to show up around it.
    • 1:38:53And so the way we might represent this is
    • 1:38:55with a big neural network like this, where
    • 1:38:57we have one input cell for every word.
    • 1:39:00Every word gets one node inside this neural network.
    • 1:39:04And the goal is to use this neural network to predict,
    • 1:39:07given a target word, a context word.
    • 1:39:09Given a word like lunch, can I predict the probabilities of other words,
    • 1:39:14showing up in a context of one word away or two words away, for instance,
    • 1:39:18in some sort of window of context?
    • 1:39:21And if you just give the AI, this neural network, a whole bunch of data of words
    • 1:39:27and what words show up in context, you can train a neural network
    • 1:39:30to do this calculation, to be able to predict, given a target word--
    • 1:39:34can I predict what those context words ultimately should be?
    • 1:39:39And it will do so using the same methods we've
    • 1:39:41talked about-- back propagating the error from the context word
    • 1:39:43back through this neural network.
    • 1:39:46And what you get is, if we use a single layer--
    • 1:39:48just a single layer of hidden nodes--
    • 1:39:50what I get is, for every single one of these words, I get--
    • 1:39:54from this word, for example, I get five edges, each of which
    • 1:39:59has a weight to each of these five hidden nodes.
    • 1:40:02In other words, I get five numbers that effectively
    • 1:40:05are going to represent this particular target word here.
    • 1:40:10And the number of hidden nodes I choose in this middle layer here--
    • 1:40:13I can pick that.
    • 1:40:14Maybe I'll choose to have 50 hidden nodes or 100 hidden nodes.
    • 1:40:17And then, for each of these target words,
    • 1:40:19I'll have 50 different values or 100 different values,
    • 1:40:22and those values we can effectively treat as the vector
    • 1:40:26numerical representation of that word.
    • 1:40:29And the general idea here is that, if words are similar--
    • 1:40:33if two words show up in similar contexts, meaning, using the same target words,
    • 1:40:37I'd like to predict similar context words--
    • 1:40:40well, then these vectors and these values I choose in these vectors
    • 1:40:43here-- these numerical values for the weight of these edges
    • 1:40:45are probably going to be similar, because for two different words that
    • 1:40:49show up in similar contexts, I would like
    • 1:40:51for these values that are calculated to ultimately
    • 1:40:55be very similar to one another.
    • 1:40:58And so ultimately, the high-level way you can picture this
    • 1:41:01is that what this word2vec training method is
    • 1:41:02going to do is, given a whole bunch of words, where initially,
    • 1:41:06recall, we initialize these weights randomly and just pick
    • 1:41:09random weights that we choose.
    • 1:41:11Over time, as we train the neural network,
    • 1:41:14we're going to adjust these weights, adjust the vector representations
    • 1:41:17of each of these words so that gradually,
    • 1:41:20words that show up in similar contexts grow closer to one another,
    • 1:41:24and words that show up in different contexts
    • 1:41:27get farther away from one another.
    • 1:41:29And as a result, hopefully I get vector representations
    • 1:41:32of words like breakfast, and lunch, and dinner that are similar to one another,
    • 1:41:36and then words like book, and memoir, and novel
    • 1:41:39are also going to be similar to one another as well.
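The training loop just described can be sketched as a toy skip-gram model in numpy. This is not the optimized word2vec implementation (which adds tricks like negative sampling); the tiny corpus, the one-word window, and the five hidden nodes here are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus; real word2vec trains on millions of words.
corpus = ("for breakfast he ate eggs for lunch he ate rice "
          "for dinner he ate fish").split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
V, H = len(vocab), 5  # vocabulary size, hidden (embedding) dimension

# (target, context) pairs from a window of one word on either side.
pairs = [(index[corpus[i]], index[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

W_in = rng.normal(0, 0.1, (V, H))   # each row becomes a word's vector
W_out = rng.normal(0, 0.1, (H, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Train: predict the context word from the target word, backpropagating
# the cross-entropy error through both weight matrices.
lr = 0.05
for epoch in range(200):
    for target, context in pairs:
        h = W_in[target].copy()   # hidden layer = the target word's vector
        p = softmax(h @ W_out)    # predicted distribution over context words
        err = p
        err[context] -= 1         # gradient: p minus one-hot(context)
        grad_in = W_out @ err
        W_out -= lr * np.outer(h, err)
        W_in[target] -= lr * grad_in

# Words that appear in similar contexts (breakfast, lunch, dinner) should
# now have similar rows in W_in.
def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(W_in[index["lunch"]], W_in[index["dinner"]]))
```

The rows of `W_in` are exactly the edge weights into the hidden layer described above, treated as the words' vector representations.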
    • 1:41:42So using this algorithm, we're able to take a corpus of data
    • 1:41:46and just train our computer, train this neural network to be able to figure out
    • 1:41:50what vector, what sequence of numbers is going
    • 1:41:52to represent each of these words-- which is, again, a bit of a strange concept
    • 1:41:55to think about representing a word just as a whole bunch of numbers.
    • 1:41:59But we'll see in a moment just how powerful this really can be.
    • 1:42:02So we'll go ahead and go into vectors, and what I have inside of vectors.py--
    • 1:42:08which I'll open up now--
    • 1:42:09is I'm opening up words.txt, which is a pretrained model that just--
    • 1:42:14I've already run word2vec and it's already given me
    • 1:42:17a whole bunch of vectors for each of these possible words.
    • 1:42:19And I'm just going to take like 50,000 of them
    • 1:42:22and go ahead and save their vectors inside of a dictionary called words.
    • 1:42:26And then I've also defined some functions called distance,
    • 1:42:29closest_words, which gets me the closest words to a particular word,
    • 1:42:33and then closest_word, which just gets me the one closest word, for example.
    • 1:42:38And so now let me try doing this.
    • 1:42:39Let me open up the Python interpreter and say something like,
    • 1:42:43from vectors import star--
    • 1:42:46just import everything from vectors.
    • 1:42:48And now let's take a look at the meanings of some words.
    • 1:42:51Let me look at the word city, for example.
    • 1:42:55And here is a big array that is the vector representation of the word
    • 1:43:01city.
    • 1:43:01And this doesn't mean anything, in terms of what these numbers exactly are,
    • 1:43:04but this is how my computer is representing
    • 1:43:07the meaning of the word city.
    • 1:43:08We can do a different word, like the word house,
    • 1:43:11and here then is the vector representation of the word house,
    • 1:43:14for example-- just a whole bunch of numbers.
    • 1:43:17And this is encoding somehow the meaning of the word house.
    • 1:43:20And how do I get at that idea?
    • 1:43:22Well, one way to measure how good this is is by looking at,
    • 1:43:24what is the distance between various different words?
    • 1:43:29There are a number of ways you can define distance.
    • 1:43:31In the context of vectors, one common way is what's
    • 1:43:33known as the cosine distance, which has to do with measuring
    • 1:43:35the angle between vectors.
    • 1:43:37But in short, it's just measuring, how far apart
    • 1:43:40are these two vectors from each other?
    • 1:43:42So if I take a word like the word book, how far away is it from itself--
    • 1:43:47how far away is the word book from book--
    • 1:43:49well, that's zero.
    • 1:43:50The word book is zero distance away from itself.
    • 1:43:54But let's see how far away the word book is from a word like breakfast,
    • 1:43:59where we're going to say one is very far away, zero is not far away.
    • 1:44:03All right, book is about 0.64 away from breakfast.
    • 1:44:07They seem to be pretty far apart.
    • 1:44:09But let's now try and calculate the distance from the word book
    • 1:44:12to the word novel, for example.
    • 1:44:16Now, those two words are closer to each other--
    • 1:44:180.34.
    • 1:44:19The vector representation of the word book
    • 1:44:21is closer to the vector representation of the word novel
    • 1:44:25than it is to the vector representation of the word breakfast.
    • 1:44:28And I can do the same thing and, say, compare breakfast to lunch,
    • 1:44:34for example.
    • 1:44:35And those two words are even closer together.
    • 1:44:37They have an even more similar relationship
    • 1:44:40between one word and another.
    • 1:44:42So now it seems we have some representation of words,
    • 1:44:45representing a word using vectors, that allows us to be able to say something
    • 1:44:49like words that are similar to each other
    • 1:44:52ultimately have a smaller distance that happens to be between them.
    • 1:44:55And this turns out to be incredibly powerful to be
    • 1:44:58able to represent the meaning of words in terms of their relationships
    • 1:45:01to other words as well.
    • 1:45:03I can tell you as well--
    • 1:45:05I have a function called closest_words that
    • 1:45:06basically just takes a whole bunch of words
    • 1:45:09and gets all the closest words to it.
    • 1:45:11So let me get the closest words to book, for example,
    • 1:45:15and maybe get the 10 closest words.
    • 1:45:18We'll limit ourselves to 10.
    • 1:45:20And right.
    • 1:45:21Book is obviously closest to itself-- the word book--
    • 1:45:24but is also closely related to books, and essay, and memoir, and essays,
    • 1:45:27and novella, anthology.
    • 1:45:29And why are these the words that it computed as close to it?
    • 1:45:32Well, because based on the corpus of information
    • 1:45:34that this algorithm was trained on, the vectors
    • 1:45:38arose based on what words show up in a similar context--
    • 1:45:41that the word book shows up in contexts similar to those of words
    • 1:45:45like memoir and essays, for example.
    • 1:45:47And if I do something like--
    • 1:45:49let me get the closest words to city--
    • 1:45:53you end up getting city, town, township, village.
    • 1:45:56These are words that happen to show up in a similar context to the word city.
    • 1:46:02Now, where things get really interesting is that, because these are vectors,
    • 1:46:05we can do mathematics with them.
    • 1:46:07We can calculate the relationships between various different words.
    • 1:46:11So I can say something like, all right, what if I had man and king?
    • 1:46:16These are two different vectors, and this is a famous example
    • 1:46:18that comes out of word2vec.
    • 1:46:20I can take these two vectors and just subtract them from each other.
    • 1:46:24This line here, the distance here, is another vector
    • 1:46:28that represents like king minus man.
    • 1:46:30Now, what does it mean to take a word and subtract another word?
    • 1:46:33Normally, that doesn't make sense.
    • 1:46:34In the world of vectors, though, you can take some vector, some
    • 1:46:37sequence of numbers, subtract some other sequence of numbers,
    • 1:46:40and get a new vector, get a new sequence of numbers.
    • 1:46:43And what this new sequence of numbers is effectively going to do
    • 1:46:46is it is going to tell me, what do I need to do to get from man to king?
    • 1:46:52What is the relationship then between these two words?
    • 1:46:54And this is some vector representation of what makes--
    • 1:46:58takes us from man to king.
    • 1:47:00And we can then take this value and add it to another vector.
    • 1:47:04You might imagine that the word woman, for example,
    • 1:47:07is another vector that exists somewhere inside of this space,
    • 1:47:10somewhere inside of this vector space.
    • 1:47:12And what might happen if I took this same idea, king
    • 1:47:15minus man-- took that same vector and just added it to woman?
    • 1:47:19What will we find around here?
    • 1:47:22It's an interesting question we might ask,
    • 1:47:24and we can answer it very easily, because I have vector representations
    • 1:47:27of all of these things.
    • 1:47:30Let's go back here.
    • 1:47:31Let me look at the representation of the word man.
    • 1:47:34Here's the vector representation of man.
    • 1:47:36Let's look at the representation of the word king.
    • 1:47:38Here's the representation of the word king.
    • 1:47:41And I can subtract these two.
    • 1:47:42What is the vector representation of king minus man?
    • 1:47:46It's this array right here--
    • 1:47:48whole bunch of values.
    • 1:47:49So king minus man now represents the relationship between king and man
    • 1:47:53in some sort of numerical vector format.
    • 1:47:55So what happens then if I add woman to that?
    • 1:48:00Whatever took us from man to king, go ahead and apply that same vector
    • 1:48:04to the vector representation of the word woman,
    • 1:48:07and that gives us this vector here.
    • 1:48:10And now, just out of curiosity, let's take this expression
    • 1:48:15and find, what is the closest word to that expression?
    • 1:48:20And amazingly, what we get is we get the word queen--
    • 1:48:25that somehow, when you take the distance between man and king--
    • 1:48:28this numerical representation of how man is related to king--
    • 1:48:32and add that same notion, king minus man,
    • 1:48:34to the vector representation of the word woman.
    • 1:48:37What we get is we get the vector representation, or something close
    • 1:48:40to the vector representation of the word queen,
    • 1:48:43because this distance somehow encoded the relationship between these two
    • 1:48:48words.
    • 1:48:48And when you run it through this algorithm,
    • 1:48:50it's not programmed to do this, but if you just try and figure
    • 1:48:53out how to predict words based on context words,
    • 1:48:55you get vectors that are able to make these SAT-like analogies out
    • 1:48:59of the information that has been given.
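The king minus man plus woman arithmetic just described can be sketched with tiny made-up vectors. The two dimensions here loosely encode "royalty" and "gender" purely for illustration; real word2vec vectors have many more dimensions, all learned from a corpus rather than hand-written:

```python
import numpy as np

# Made-up two-dimensional vectors: [royalty, gender], for illustration only.
words = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
}

def closest_word(embedding, exclude=()):
    """Word whose vector is nearest (Euclidean) to the given embedding."""
    candidates = (w for w in words if w not in exclude)
    return min(candidates, key=lambda w: np.linalg.norm(words[w] - embedding))

# king - man captures "what takes us from man to king"; applying that
# same offset to woman should land near queen.
result = closest_word(words["king"] - words["man"] + words["woman"],
                      exclude=("king", "man", "woman"))
print(result)  # queen
```

Excluding the input words, as here, is a common convention in analogy demos, since the nearest neighbor is often one of the inputs themselves.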
    • 1:49:02So there are more examples of this.
    • 1:49:03We can say, all right, let's figure out, what
    • 1:49:06is the distance between Paris and France?
    • 1:49:10So Paris and France are words.
    • 1:49:12They each have a vector representation.
    • 1:49:14This then is a vector representation of the distance between Paris and France--
    • 1:49:18what takes us from France to Paris.
    • 1:49:21And let me go ahead and add the vector representation of England to that.
    • 1:49:26So this then is the vector representation
    • 1:49:29of going Paris minus France plus England--
    • 1:47:35so the distance between France and Paris as vectors.
    • 1:49:38Add the England vector, and let's go ahead
    • 1:49:40and find the closest word to that.
    • 1:49:47And it turns out to be London.
    • 1:49:48You do this relationship, the relationship between France and Paris.
    • 1:49:51Go ahead and add the England vector to it, and the closest vector to that
    • 1:49:55happens to be the vector for the word London.
    • 1:49:57We can do more examples.
    • 1:49:58I can say, let's take the word for teacher--
    • 1:50:00that vector representation and-- let me subtract
    • 1:50:03the vector representation of school.
    • 1:50:05So what I'm left with is, what takes us from school to teacher?
    • 1:50:09And apply that vector to a word like hospital and see,
    • 1:50:14what is the closest word to that--
    • 1:50:15turns out the closest word is nurse.
    • 1:50:17Let's try a couple more examples-- the closest word to ramen, for example.
    • 1:50:23Subtract the word Japan.
    • 1:50:25So what is the relationship between Japan and ramen?
    • 1:50:28Add the word for America to that.
    • 1:50:30Want to take a guess as to what you might get as a result?
    • 1:50:33Turns out you get burritos as the relationship.
    • 1:50:35If you do the subtraction, do the addition,
    • 1:50:38this is the answer that you happen to get as a consequence of this as well.
    • 1:50:42So these very interesting analogies arise
    • 1:50:44in the relationships between these two words--
    • 1:50:46that if you just map out all of these words into a vector space,
    • 1:50:50you can get some pretty interesting results as a consequence of that.
    • 1:50:54And this idea of representing words as vectors turns out
    • 1:50:58to be incredibly useful and powerful anytime
    • 1:51:01we want to be able to do some statistical work with
    • 1:51:04regards to natural language, to be able to have--
    • 1:51:06represent words not just as their characters,
    • 1:51:09but to represent them as numbers, numbers that say something
    • 1:51:12or mean something about the words themselves,
    • 1:51:14and somehow relate the meaning of a word to other words that
    • 1:51:18might happen to exist--
    • 1:51:19so many tools then for being able to work inside
    • 1:51:23of this world of natural language.
    • 1:51:24Natural language is tricky.
    • 1:51:26We have to deal with the syntax of language and the semantics of language,
    • 1:51:29but we've really just seen the beginning of some of the ideas that are
    • 1:51:33underlying a lot of natural language processing-- the ability to take text,
    • 1:51:37extract information out of it, get some sort of meaning out of it,
    • 1:51:40generate sentences maybe by having some knowledge of the grammar or maybe just
    • 1:51:43by looking at probabilities of what words are likely to show up based
    • 1:51:47on other words that have shown up previously--
    • 1:51:49and then finally, the ability to take words
    • 1:51:52and come up with some distributed representation of them, to take words
    • 1:51:55and represent them as numbers, and use those numbers
    • 1:51:58to be able to say something meaningful about those words as well.
    • 1:52:02So this then is yet another topic in this broader
    • 1:52:04heading of artificial intelligence.
    • 1:52:06And just as I look back at where we've been now,
    • 1:52:08we started our conversation by talking about the world of search,
    • 1:52:11about trying to solve problems like tic-tac-toe by searching
    • 1:52:14for a solution, by exploring our various different possibilities
    • 1:52:17and looking at what algorithms we can apply to be able to efficiently
    • 1:52:21try and search a space.
    • 1:52:22We looked at some simple algorithms and then looked at some optimizations
    • 1:52:25we could make to those algorithms, and ultimately, that
    • 1:52:28was in service of trying to get our AI to know things about the world.
    • 1:52:31And this has been a lot of what we've talked about today as well,
    • 1:52:34trying to get knowledge out of text-based information,
    • 1:52:37the ability to take information, draw conclusions based on that information.
    • 1:52:41If I know these two things for certain, maybe I
    • 1:52:43can draw a third conclusion as well.
    • 1:52:46That then was related to the idea of uncertainty.
    • 1:52:49If we don't know something for sure, can we
    • 1:52:51predict something, figure out the probabilities of something?
    • 1:52:54And we saw that again today in the context
    • 1:52:56of trying to predict whether a tweet or whether a message
    • 1:52:59is positive sentiment or negative sentiment,
    • 1:53:01and trying to draw that conclusion as well.
    • 1:53:04Then we took a look at optimization-- the sorts
    • 1:53:05of problems where we're looking for a global or local maximum
    • 1:53:09or minimum.
    • 1:53:10This has come up time and time again, especially most recently
    • 1:53:13in the context of neural networks, which are really just a kind of optimization
    • 1:53:16problem where we're trying to minimize the total amount of loss
    • 1:53:20based on the setting of our weights of our neural network,
    • 1:53:23based on the setting of what vector representations for words we
    • 1:53:26happen to choose.
    • 1:53:27And those ultimately helped us to be able to solve
    • 1:53:30learning-related problems-- the ability to take a whole bunch of data,
    • 1:53:33and rather than us tell the AI exactly what to do,
    • 1:53:37let the AI learn patterns from the data for itself.
    • 1:53:40Let it figure out what makes an inbox message different from a spam message.
    • 1:53:43Let it figure out what makes a counterfeit
    • 1:53:45bill different from an authentic bill, and being
    • 1:53:47able to draw that analysis as well.
    • 1:53:49And one of the big tools in learning that we used
    • 1:53:52were neural networks, these structures that
    • 1:53:54allow us to relate inputs to outputs by training these internal networks
    • 1:53:58to learn some sort of function that maps us from some input to some output--
    • 1:54:02ultimately yet another model in this language of artificial intelligence
    • 1:54:05that we can use to communicate with our AI.
    • 1:54:08Then finally today, we looked at some ways
    • 1:54:10that AI can begin to communicate with us, looking at ways
    • 1:54:12that AI can begin to get an understanding for the syntax
    • 1:54:16and the semantics of language to be able to generate sentences,
    • 1:54:19to be able to predict things about text that's written in a spoken
    • 1:54:23language or a written language like English,
    • 1:54:25and to be able to do interesting analysis there as well.
    • 1:54:27And there's so much more in active research that's
    • 1:54:30happening all over the areas within artificial intelligence today,
    • 1:54:33and we've really only just seen the beginning of what AI has to offer.
    • 1:54:36So I hope you enjoyed this exploration into this world
    • 1:54:39of artificial intelligence with Python.
    • 1:54:41A big thank you to the course's teaching staff and the production team
    • 1:54:44for making this class possible.
    • 1:54:45This was an Introduction to Artificial Intelligence with Python.