From the course: Web Scraping with Python

Crawling a website

- [Instructor] So in this chapter, we're going to switch gears a bit and focus on creating web crawlers that collect data from Wikipedia. Now, before you get any bright ideas, it is a completely useless thing to build web crawlers for Wikipedia in real life. If you want data from Wikipedia, they have fantastic APIs you can use, right? So scraping them is really just a waste of your time and their servers. However, they do have a lot of links and articles and things you can click through, one link to the next, and their HTML is really stable. It doesn't change very often. And for this reason, it's a really great site to use when you're learning web crawling or when you're teaching web crawling. So if you do feel inclined and you do want to crawl them as practice, I would highly encourage you to chip in towards the cost of their servers. You know, I give them 20 bucks every now and then, and they are a really great organization.

Anyway, let's start a new web crawler. We're going to call this one article scraper rather than naming it Wikipedia scraper, because the type of thing we want to scrape is an article. So scrapy startproject article_scraper. All right. We can name the individual spiders after the sites they scrape the articles from, but the project itself is going to be called article_scraper. Oops, sorry about that. We definitely want to do startproject and not star project. All right. So then we navigate to the article_scraper directory and to the spiders directory, and then let's make our Wikipedia spider. So scrapy genspider wikipedia en.wikipedia.org; that's going to be our domain, the English-language one. Right, so let's check it out, see what we have. Spiders, wikipedia. Great.

So the first thing we want to do is extend Scrapy's CrawlSpider class instead of scrapy.Spider. And we need to make sure we import this, so from scrapy.spiders import CrawlSpider. The other thing we're going to import is Rule, which we'll get into in a second. And we also want to import LinkExtractor, so from scrapy.linkextractors import LinkExtractor. Rules, LinkExtractors, and CrawlSpiders work really well together, as we're going to see. The start URL needs to be HTTPS, en.wikipedia.org, and let's actually make our start URL /wiki/Kevin_Bacon. You may have heard of Kevin Bacon the actor, also of course the namesake of the game Six Degrees of Kevin Bacon.

So let's fill in a few things that we want to collect from each page in the parse function. Let's get the title: response.xpath, that's going to be h1/text, .get, or response.xpath, h1/i/text, .get. For works of literature or movies or things like that, the title is actually italicized, so we want to fall back to the italicized version if we don't find the first one. Then we have the URL. That's just going to be response.url, pretty straightforward. And the last edited date, at the bottom of Wikipedia pages: this page was last edited on the 27th of October, 2020. There's the timestamp there. We want to grab that too. So the last edited date is going to be response.xpath, li where the ID is footer-info-lastmod, and get the inner text with .get. Okay, great.

So now we have something that looks just like the scraper we wrote in the last chapter, but what makes it a crawler and not just a scraper? We need to give it rules for the links to follow. We can do this by using a Scrapy Rule object, imported at the top, and that takes a Scrapy LinkExtractor object as its first argument.
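Before wiring up that rule, here's a rough sketch of the parse method we just filled in, sitting inside the spider class that genspider generated. It's a reconstruction from the narration rather than the exact code from the video, and the XPath expressions are my best guess at Wikipedia's markup:

```python
def parse(self, response):
    yield {
        # Plain article titles sit directly in the <h1>; titles of
        # creative works (films, novels) are wrapped in an <i> tag,
        # so fall back to that if the first expression finds nothing.
        'title': (response.xpath('//h1/text()').get()
                  or response.xpath('//h1/i/text()').get()),
        'url': response.url,
        # The "This page was last edited on ..." line in the page footer.
        'last_edited': response.xpath(
            '//li[@id="footer-info-lastmod"]/text()').get(),
    }
```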
So the first argument is going to be a Scrapy LinkExtractor. The callback is going to be, let's call the function parse_info, and then just change this name. I like to use parse_info for CrawlSpiders, since CrawlSpider uses parse internally, and parse for regular spiders. And then we want it to follow links, so follow equals True. This just means that every time it finds an internal URL, it follows it, then it keeps following and following and finding new internal URLs, basically going on ad infinitum.

So let's take a look at that LinkExtractor object. This is the thing that actually parses the HTML page and finds new Wikipedia links to visit. It takes an allow argument. I actually have it in my clipboard right here. Unfortunately, regular expressions are outside the scope of this course, but what this regular expression means is that we want pages at /wiki and then /some text, sort of like this Kevin Bacon URL. And we do want to exclude any URLs that contain colons in them. If you look at Wikipedia, pages that contain colons in the URL are things like the special pages, the discussion pages, talk pages, the random article page, right? So what we want are just those URLs that look like /wiki/some text.

And now we can run this spider using the usual commands. So scrapy runspider wikipedia.py. All right, great. And you can see what it's doing: it keeps crawling to new Wikipedia pages, and the only way it'll stop, unless Wikipedia runs out of pages, which is pretty unlikely, is with a Control + C. And since Scrapy handles requests concurrently, you may have to press it a couple of times, but there you go. Your first Wikipedia crawler.
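For reference, here's roughly what the finished wikipedia.py ends up looking like. This is a sketch rather than the exact file from the video; in particular, the regular expression is an assumption standing in for the one pasted from the clipboard, written to match the description above of allowing /wiki/... paths and excluding anything with a colon:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class WikipediaSpider(CrawlSpider):
    name = 'wikipedia'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Kevin_Bacon']

    # Follow only article-style links (/wiki/Some_Title) and skip URLs
    # with a colon in the path (Talk:, Special:, File:, and so on).
    # This regex is an assumption matching the description in the video.
    rules = [
        Rule(LinkExtractor(allow=r'/wiki/((?!:).)*$'),
             callback='parse_info', follow=True),
    ]

    # Renamed from parse to parse_info: CrawlSpider uses parse internally
    # for its own link-following logic, so we shouldn't override it.
    def parse_info(self, response):
        yield {
            'title': (response.xpath('//h1/text()').get()
                      or response.xpath('//h1/i/text()').get()),
            'url': response.url,
            'last_edited': response.xpath(
                '//li[@id="footer-info-lastmod"]/text()').get(),
        }
```

You can run this from the spiders directory with scrapy runspider wikipedia.py, as in the video, or from the project root with scrapy crawl wikipedia, and stop it with Control + C.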
