This video focuses on padding sequences using the Keras pad_sequences() method. This function transforms a list (of length num_samples) of sequences (lists of integers) into a 2D NumPy array of shape (num_samples, num_timesteps).
- So far we have seen how to create sequences from the sentences in the actual data. But these sequences can be of different lengths, which is a problem for deep learning models: we need a consistent input size for the model, and we can achieve that using padding. Now, TensorFlow, or you could say the Keras API, actually offers a padding function that does all the heavy lifting for us. So here, from the Keras preprocessing sequence module, we are going to import the pad_sequences function. This is the one we're going to be using, and we have imported it here in the fourth cell.

All right. Now the next step is to define the training sentences on which we are going to train our tokenizer. So first of all, this is a Python list, train_sentences, where I've defined four sentences of different lengths. 'It will rain', the first sentence, contains three words. 'The weather is cloudy' contains four words, and we have 'Will it be raining today?', which has five words. And the last one, 'It is a super hot day', contains six words.

Now the next step, as we have seen previously, is to instantiate our tokenizer with our out-of-vocabulary encoding, so we pass the parameter oov_token. Once that is defined, the next step is to train this tokenizer using the fit_on_texts method, passing the training sentences, and the last step is to store the word index, the word encoding dictionary that gets generated, using the word_index attribute. So let's quickly run that as well. Now comes the next step, which is creating the sequences of all the training sentences. For that, as we have seen, we can use the texts_to_sequences method and pass the training sentences. Okay, we have the sequences generated as well. Now comes the next very important step. We have sequences of different lengths, but we can make them of equal length using padding, and for that we are using pad_sequences.
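The steps described so far can be sketched like this (a minimal sketch, assuming TensorFlow 2.x, where the Tokenizer and pad_sequences live under tensorflow.keras.preprocessing; the variable names are mine, chosen to match the narration):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# The four training sentences of different lengths (3, 4, 5 and 6 words)
train_sentences = [
    'It will rain',
    'The weather is cloudy',
    'Will it be raining today?',
    'It is a super hot day',
]

# Instantiate the tokenizer with an out-of-vocabulary token
tokenizer = Tokenizer(oov_token='<OOV>')

# Train the tokenizer on the sentences
tokenizer.fit_on_texts(train_sentences)

# The word-to-integer encoding dictionary
word_index = tokenizer.word_index

# Turn each sentence into a sequence of integers
sequences = tokenizer.texts_to_sequences(train_sentences)

# Pad every sequence to the length of the longest one
# (pad_sequences pre-pads with zeros by default)
padded = pad_sequences(sequences)

print(word_index)
print(sequences)
print(padded)
```

Because the longest sentence has six words, the padded result is a 4-by-6 array, with the shorter sequences filled out with leading zeros.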
This is the function we imported above, and to it we pass the sequences we generated right above. Once we run this, the padded sequences are stored in this variable, padded. Let's quickly print all of these things. First of all, let's print the word encoding dictionary, word_index. Next, let's print the training sentences, just for our convenience. Then we can print the sequences and compare them with the padded sequences that we generated using the pad_sequences function. Let's quickly run the cell as well.

So you can see we have the dictionary at the top: in the first line, all of the words have been encoded. The second line contains all the sentences that were defined in the training sentences, so 'It will rain', 'The weather is cloudy', all of those sentences are there. And then we have the sequences of all of those training sentences: 2, 3, 5, then 6, 7, 4, 8, and so on. We have 3, 4, 5, and then 6 values, based on the length of each of those sentences. Now, when we pad these sequences, you see we get zeros. If you look at the first sentence's padded sequence, we have three zeros placed before 2, 3, 5, so we are actually looking at pre-padding: zeros have been inserted just to make the length of each of these sequences equal. Accordingly, in the second sentence we had just four words, so we had to put two zeros, and in the third sentence we had five, so just one zero. The last sentence, 'It is a super hot day', had six values in its sequence, so it didn't need any padding at all. But now all of the sequences are of equal length, which is six. Okay. Now, let's say we want to have a max length defined and need to customize our padded sequences, or let's say, instead of pre-padding, we want to add these zeros at the end.
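To make the pre-padding behaviour concrete, here is a tiny pure-Python sketch of what the default case does (pre-padding with zeros up to the longest sequence). The helper name pre_pad is mine for illustration, not a Keras API, and the integer sequences are the illustrative values from the video:

```python
def pre_pad(sequences, value=0):
    """Pad each sequence at the front with `value` to the length of the longest one."""
    maxlen = max(len(seq) for seq in sequences)
    return [[value] * (maxlen - len(seq)) + seq for seq in sequences]

# Sequences of length 3, 4, 5 and 6, like the ones in the video
seqs = [[2, 3, 5], [6, 7, 4, 8], [3, 2, 9, 10, 11], [2, 4, 12, 13, 14, 15]]

padded = pre_pad(seqs)
print(padded[0])  # the 3-word sentence gets three leading zeros: [0, 0, 0, 2, 3, 5]
```

The real pad_sequences additionally returns a NumPy array and supports truncation, but the zero-filling logic is the same idea.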
We can actually achieve that as well: we can customize our padded sequences using some parameters. So here I have called the pad_sequences function, passing the sequences we generated above. Then I am passing this padding parameter, which takes a string that defines whether you want pre-padding or post-padding. So if you want post-padding, you can type it like this. You can also set the max length of your sequences: here it has taken six as the max length without any maxlen parameter, but if we define, let's say, 5, that is going to be the max length for our padded sequences. So if there is any sentence that contains more than five words, it will be truncated, and whether you want to truncate words from the start of the sentence or from the end, you can define as well, using the truncating parameter. We want words to be truncated post, basically from the end.

Now, if we run this function, we have created the padded sequences using these parameters. And if we look at the padded sequences, you see all of the sequences are now of length five. And you see the last sentence, which had 2, 4, 12, 13, 14, 15, as you can see over here: the 15 has been truncated from the end, because we are using post truncating, and all the zeros have been added at the end, because we are using post padding. So this is how we can customize the padded sequences. And now we are at a stage where we can actually start passing this pre-processed data to our model. So we are going to work on a real-world data set and perform all of these operations in the next video.
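Under the same TensorFlow 2.x assumption as before, the customized call looks like this, using the illustrative integer sequences from the video:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sequences of length 3, 4, 5 and 6, like the ones in the video
sequences = [[2, 3, 5], [6, 7, 4, 8], [3, 2, 9, 10, 11], [2, 4, 12, 13, 14, 15]]

padded = pad_sequences(sequences,
                       padding='post',     # add the zeros at the end, not the front
                       maxlen=5,           # every row becomes exactly length 5
                       truncating='post')  # drop extra values from the end

print(padded)
```

The three-word sentence becomes [2, 3, 5, 0, 0] (zeros at the end because of post-padding), and the six-value last sequence loses its final value, 15, because of post-truncating.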