The video covers tokenization, which breaks raw text into smaller units called tokens, such as words or sentences. These tokens help the model capture context when building NLP models.
- [Instructor] So we have seen how to tokenize the words and sentences, building up a dictionary of all the words to make a corpus. The next step is to turn these sentences into lists of values based on the tokens that we have generated from the tokenizer object; these lists are called sequences. So first of all, we're going to use the same Tokenizer class, so let's quickly import that from the Keras preprocessing text module. Let's quickly run this. Once that is done, the next step is to define the training sentences. For that, we have defined a Python list called "train_sentences" with three sentences: "It is a sunny day," "It is a cloudy day," and "Will it rain today?" I've also added a question mark just to see how the tokenizer handles it. So let's quickly define these sentences. The next step is to train the tokenizer. For that, first of all, we need to instantiate the Tokenizer class, and we have provided the num_words hyperparameter, which is set to 100. Then we train the tokenizer using the fit_on_texts function, as we saw earlier, passing the training sentences that we have defined above as the argument. The next step is to store the word encoding dictionary that the tokenizer generates, which we get from the word_index attribute. Store it, run the cell. Now comes the main part, which is creating sequences of these training sentences. For that, we are going to use the texts_to_sequences function, and we are going to pass the training sentences on which we want to run it. It will create the sequences for all the sentences defined in this list. Let's run that. Now, we are going to print both the word encoding dictionary, which is word_index, that is, the vocabulary, and the sequences that we have generated. So let's quickly do that.
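The steps just described can be put together in one cell. This is a sketch assuming TensorFlow 2.x, where the Tokenizer class lives under `tensorflow.keras.preprocessing.text`:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# The three training sentences from the video.
train_sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today?",
]

# Instantiate the tokenizer, keeping at most the 100 most frequent words.
tokenizer = Tokenizer(num_words=100)

# Train the tokenizer on the sentences; this builds the vocabulary.
tokenizer.fit_on_texts(train_sentences)

# The word encoding dictionary (word -> integer index).
word_index = tokenizer.word_index
print(word_index)

# Turn each sentence into a list of token indices.
sequences = tokenizer.texts_to_sequences(train_sentences)
print(sequences)
```

Note that the tokenizer lowercases text and strips punctuation by default, which is why the question mark in "Will it rain today?" does not appear in the vocabulary.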
So you can see in the output of the cell, we have word_index, which is the word encoding dictionary. All of the words have been encoded: "it" as one, "is" as two, "a" as three, and so on and so forth, up to "today," which is encoded as nine. And we have the sequences of words. The training sentences list had three sentences, and we can see three sequences over here: one, two, three, five, four; one, two, three, six, four; and seven, one, eight, nine. So let's look at a sample sentence and its sequence. We have "It is a sunny day," which is encoded as "one, two, three, five, four," and you can match each of these encodings with the word_index dictionary; every word is encoded according to the keys defined in the dictionary. Now, the next step is to use this tokenizer on new sentences. So we have defined new sentences, which contain new words as well: "Will it be raining today?" where "be" and "raining" are new, and "It is a pleasant day," where "pleasant" is a new word. So let's see how the tokenizer that we have trained on the training sentences operates on these new sentences. To create these sequences, again, we use the texts_to_sequences method and we pass the new sentences. So we are creating the sequences of the new sentences that we have defined. Let's quickly run that. Now, if we print the new sentences and the new sequences, you see we have five words in the first sentence, whereas we have a sequence of only three values. Basically, the tokenizer is not able to find the encoding for "raining" or the encoding for "be," because it has only been trained on the training sentences. And again, "It is a pleasant day" also contains five words, whereas its sequence contains only four. So that is a problem we see. Now, how do we handle that? For that, we can change how we define the tokenizer.
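The missing-word behavior can be demonstrated directly. In this sketch (again assuming the TensorFlow 2.x tokenizer), unseen words are simply skipped, so the sequences come out shorter than the sentences:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

train_sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today?",
]
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(train_sentences)

# Sentences containing words the tokenizer has never seen:
# "be", "raining", and "pleasant".
new_sentences = [
    "Will it be raining today?",
    "It is a pleasant day",
]

# Unknown words are silently dropped from the output sequences,
# so a five-word sentence can yield only three or four values.
new_sequences = tokenizer.texts_to_sequences(new_sentences)
print(new_sequences)
```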
So while instantiating the tokenizer, we can pass this out-of-vocabulary parameter, which is called "oov_token," and it defines the encoding for all the words that are not available inside our word encoding dictionary. No matter how big your data is or how many words you train on, there's always a chance that you will encounter a new word, and for that, we have this out-of-vocabulary token. So now, let's quickly train this new tokenizer that we have defined over here, with the oov_token, on the training sentences, the same three sentences defined above. And then we are going to create the word encoding dictionary. Now, if we use this same tokenizer to create new sequences, basically using the texts_to_sequences method on the new sentences, let's see what it generates. You see, first, in the word encoding dictionary, we now have the out-of-vocabulary token, which is given an encoding of one. And if you look at the encoding of the new sentences, where we had five words in both of the sentences, now "be" and "raining" are encoded as one, and we can actually see that by printing the new sequences. So "Will it be raining today?" now contains five encodings, basically five values, in its sequence. The two values which are encoded as one are the out-of-vocabulary token, because "be" and "raining" are not available in our dictionary. And the same goes for the next sentence: "pleasant" is not available in the word encoding dictionary, and thus it is encoded as one, which is the code for the out-of-vocabulary token. So this is how we can create sequences. Now, there are different manipulations that we are going to apply so that all of the sequences are of the same length, so that the deep learning model, or the neural network, can actually process them. We are going to look at how we can manipulate those sequence lengths in the upcoming video.
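Putting the fix together, here is a sketch of the oov_token version (assuming TensorFlow 2.x; the string "&lt;OOV&gt;" is just a conventional choice, any placeholder string works):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

train_sentences = [
    "It is a sunny day",
    "It is a cloudy day",
    "Will it rain today?",
]

# Reserve a dedicated token for out-of-vocabulary words.
# It is always assigned index 1, shifting all other words up by one.
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(train_sentences)
print(tokenizer.word_index)

new_sentences = [
    "Will it be raining today?",
    "It is a pleasant day",
]

# Unknown words now map to the OOV index instead of disappearing,
# so every sequence keeps one value per word.
new_sequences = tokenizer.texts_to_sequences(new_sentences)
print(new_sequences)
```

Because the sequences now preserve sentence length, the only remaining mismatch is that different sentences still produce sequences of different lengths, which is what padding (covered in the next video) addresses.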