Learn how to work with objects. This video covers the BeautifulSoup library, BeautifulSoup objects, and NavigableString objects.
- [Teacher] Now that we've finished with tag objects, let's start looking at the NavigableString object. The first thing we want to do is make sure that you have installed Beautiful Soup, and you would do that through the Anaconda command line. I showed you how to do that in section 6.1. So in this case we have already installed Beautiful Soup, and then we just want to check that our version of Python is compatible with the demonstration we're about to do. This notebook was written for Python 3.7, so let's check the version we are running here. I'll say import sys and then I'll just print sys.version. Run that and okay, cool. So we are at 3.7. Next let's go ahead and import Beautiful Soup into this environment. We will say from bs4 import BeautifulSoup, and run that. Now we have our Beautiful Soup library imported into our IPython environment. In this demonstration I'm going to show you how to work with NavigableString objects. NavigableString objects are used as containers for chunks of text that are stored inside of tags. So let's go back to our example from 6.1 with the soup object. We'll create a variable here called soup_object, and then we will call the BeautifulSoup constructor and pass it an h1 tag. So we'll create a string, and the opening tag is going to be h1 attribute_1="heading level 1", and then it's going to read Future Trends in IoT in 2018, and then we will just close this h1 tag. Okay. So it looks like I put a quote there which I didn't need, so I need to remove that, and then we have two quotes there, so let's clean this up a little bit. Now we want this to be parsed with lxml, so let's just pass a parameter that says that: 'lxml'. Okay, and then I'll just double-check my syntax and everything that I've typed out here. Now let's create an object called tag and assign it the h1 tag.
So to do that we'll just say tag is equal to soup_object.h1, and then let's call the type function on the tag object to verify that we have indeed created a tag object. So we'll say type, pass in our tag object, and run this. Okay, so you can see that we have indeed created a tag object here, so let's verify the name of that tag. It should be h1, so I'll just write tag.name and run that, and yes, it is indeed h1, so we have an h1 tag. But if you just wanted to isolate the string object from within this tag object, you can just say tag.string and it'll print out the string. Cool, so here we have it, that's our string from within this tag. I want to show you here that tag.string is actually a separate object of its own. So let's call the type function and pass in tag.string, and when we run this we see that tag.string is actually a NavigableString. So let's play with this a little bit. I will create a new variable, let's call it our_navigable_string, set it equal to tag.string, and then print it out. I'll just copy this and run that. So basically what this is saying is that our NavigableString is now this Future Trends in IoT in 2018. If you wanted to replace the string inside the tag, you can just call the replace_with method on the NavigableString and pass in a replacement string. So let's try that out, and we will replace this Future Trends in IoT in 2018 with NaN. We're going to say our_navigable_string, call the replace_with method, and pass in Not a Number, so 'NaN', and then just print our string again. So we will say tag.string and see what it prints out, and as you can see, now our string is just NaN. We have replaced this Future Trends in IoT with Not a Number. Okay, now let's look at how to utilize NavigableStrings.
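The steps so far can be sketched as one short script. This is a minimal sketch, not the notebook itself: the video passes 'lxml' as the parser, while this version assumes the built-in 'html.parser' so that it runs without the separate lxml package. For a snippet this small, both parsers expose the same h1 tag and string.

```python
import sys

from bs4 import BeautifulSoup, NavigableString, Tag

# Check the Python version (the notebook in the video was written for 3.7).
print(sys.version)

# Build a soup object from a single h1 tag.
# Assumption: 'html.parser' stands in for the 'lxml' parser used in the video.
soup_object = BeautifulSoup(
    '<h1 attribute_1="heading level 1">Future Trends in IoT in 2018</h1>',
    "html.parser",
)

# Access the h1 tag and confirm what kind of object it is.
tag = soup_object.h1
print(type(tag))   # a bs4 Tag
print(tag.name)    # the tag's name

# The text inside the tag is a separate NavigableString object.
print(tag.string)
print(type(tag.string))

# Keep a reference to the string, then swap it for a replacement.
our_navigable_string = tag.string
our_navigable_string.replace_with("NaN")
print(tag.string)  # the tag now holds the replacement text
```

Note that replace_with changes what the tag contains; the our_navigable_string variable still refers to the original, now detached, string.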
So I'm giving you this HTML document, and this is going to come preloaded in your Jupyter notebook just because it's easier for us to use for demonstration purposes. What we're going to do is convert this to a parse tree like we did in the previous section. So right now I'll just run this cell. If there are one or more string objects within a parse tree, you can easily isolate them. One way to do that is by calling the stripped_strings generator, which returns all of the strings within the object. Strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of each string is removed. So for this example, the stripped_strings generator passes through each string object in the parse tree and strips the whitespace, and then we print out a printable representation of each string. So let's just try this out here. We'll say for string in our_soup_object.stripped_strings, and then for each of these stripped strings let's print out a representation of that string. So put a colon, new line, call the print function, and since we want to print a representation of the string, we'll write repr, pass in the string, and then run this. All right, so now you can see that our strings have been pretty much cleaned up. We just have a list of strings here without the tags or any of the markup. The last thing I want to show you in this demonstration is how to access parent tag objects within a parse tree. So let's create a new object called first_link, and we'll set it equal to the a tag from within the parse tree. So first_link is equal to our_soup_object.a, okay, and then we'll print out our first link. Okay cool, so this is actually the first link in our document and the text that it contains.
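Here is a rough sketch of the stripped_strings loop and the first-link lookup. The notebook's preloaded HTML is not reproduced in this transcript, so html_doc below is a hypothetical stand-in with placeholder text and a placeholder URL, and 'html.parser' is again assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML document preloaded in the notebook.
html_doc = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Sample heading</h1>
    <p>
      <a href="https://example.com/">First link text</a>
      followed by more text.
    </p>
  </body>
</html>
"""

# Convert the document to a parse tree.
our_soup_object = BeautifulSoup(html_doc, "html.parser")

# stripped_strings skips whitespace-only strings and trims the rest.
for string in our_soup_object.stripped_strings:
    print(repr(string))

# Attribute access with a tag name returns the first matching tag.
first_link = our_soup_object.a
print(first_link)
```

Printing with repr keeps the quotes around each string, which makes it easy to confirm that the leading and trailing whitespace is gone.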
So in our document, this text here is actually clickable, and it redirects to a bit.ly link. If we wanted to access the parent of that first link, we could just say first_link.parent, and we see that we now have the parent tag of this first link. Now the NavigableString object of first_link is a string that reads "last month Ericsson Digital invited me." Let me show you. We will say first_link.string and run that, and we're basically going into this link and pulling only the string from within it, right. So that is now printed out as the string object. And lastly, we can also retrieve the parent of that NavigableString object. To do that we would just say first_link.string.parent and run that. So in this case the parent of the NavigableString is the a tag, which is sort of self-evident, right. So now that you know how to work with objects in Beautiful Soup, we're going to go into using Beautiful Soup for data parsing.
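The parent lookups from this last part can be sketched as follows. The markup is again a hypothetical stand-in for the notebook's document (placeholder text and URL), parsed with the built-in 'html.parser' rather than 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup: a paragraph that contains one link.
html_doc = (
    '<p>Intro text <a href="https://example.com/">First link text</a>'
    " and more.</p>"
)
our_soup_object = BeautifulSoup(html_doc, "html.parser")

# The first a tag in the parse tree.
first_link = our_soup_object.a

# .parent on a tag returns the tag that encloses it (here, the p tag).
print(first_link.parent)

# .string returns the NavigableString held by the link.
print(first_link.string)

# .parent on that NavigableString returns the tag holding it (the a tag).
print(first_link.string.parent)
```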
- Why use Python for working with data
- Filtering and selecting data
- Concatenating and transforming data
- Data visualization best practices
- Visualizing data
- Creating a plot
- Creating statistical data graphics
- Performing basic math and linear algebra
- Correlation analysis
- Multivariate analysis
- Data sourcing via web scraping
- Introduction to natural language processing
- Collaborative analytics with Plotly