From the course: Web Scraping with Python

Challenge: Using CNN's sitemap

(bright music) - What if a client comes to you with a web scraping project and says, "I want a database of all CNN articles from the last 10 years, and I need to be reasonably certain that this is a complete collection, gathered in an ordered and systematic way"? So what do you do? Remember, a Scrapy crawler, like the kind we worked with in Chapter 2, might reach all the pages eventually, but it might not, because of the random path it takes through the website. It's hard to crawl in an orderly way, so this type of solution isn't going to meet your client's needs. But is there a way you can take five minutes, do a little exploration, and go back to the client and say, "10 years of articles? I can give you 20." Or maybe it's, "Sorry, I can only do the last eight years." Is there a way you can quickly estimate how many articles are available, which can translate into how long it'll take to scrape? So take a few minutes…
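
The transcript cuts off before the solution, so what follows is only a minimal sketch of one way to get the quick estimate described above, assuming the site publishes a standard XML sitemap (the challenge title suggests it does). The helper names `list_locs` and `estimate_seconds`, the sample XML, and the `example.com` URLs are all hypothetical and are not taken from the course. In practice the XML would be downloaded from whichever sitemap URL the site's `robots.txt` advertises.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_locs(xml_text, tag):
    """Return the <loc> text of every direct child named `tag`.

    Use tag="sitemap" for a sitemap index and tag="url" for a URL set.
    """
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(f"sm:{tag}/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def estimate_seconds(n_pages, delay_seconds=1.0):
    """Rough lower bound on scrape time: one request per page, fixed delay."""
    return n_pages * delay_seconds


# Hypothetical sample data standing in for downloaded sitemap files.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemaps/2023-01.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemaps/2023-02.xml</loc></sitemap>
</sitemapindex>"""

SAMPLE_URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/2023/01/01/story-a</loc></url>
  <url><loc>https://example.com/2023/01/02/story-b</loc></url>
  <url><loc>https://example.com/2023/01/03/story-c</loc></url>
</urlset>"""

if __name__ == "__main__":
    child_sitemaps = list_locs(SAMPLE_INDEX, "sitemap")
    article_urls = list_locs(SAMPLE_URLSET, "url")
    print(len(child_sitemaps), "child sitemaps listed in the index")
    print(len(article_urls), "article URLs in one child sitemap")
    print(estimate_seconds(len(article_urls), delay_seconds=2.0), "seconds at 2 s per page")
```

Counting the entries in each child sitemap and summing them gives an article total without visiting any article page, and because sitemaps are often split by date, the earliest child sitemap also hints at how many years back the archive goes.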
