From the course: Web Scraping with Python
Challenge: Using CNN's sitemap - Python Tutorial
Challenge: Using CNN's sitemap
What if a client comes to you with a web scraping project and says, "I want a database of all CNN articles from the last 10 years, and I need to be reasonably certain that this is a complete collection, gathered in an ordered and systematic way." So what do you do? Remember, a Scrapy crawler like the kind we worked with in Chapter 2 might reach all the pages eventually, but it might not, because of the unpredictable path it takes through the website. It's hard to crawl in an orderly way, so this type of solution isn't going to meet your client's needs.

But is there a way you can take five minutes, do a little exploration, and go back to the client and say, "10 years of articles? I can give you 20." Or maybe it's, "Sorry, I can only do the last eight years." Is there a way to quickly estimate how many articles are available, which in turn tells you how long the scrape will take? So take a few minutes…
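One way to approach this kind of quick estimate is to count the `<loc>` entries in a site's sitemap files. The following is a minimal sketch, not the instructor's solution. It assumes the site publishes sitemaps in the standard sitemaps.org XML format. The sample XML, the example.com URLs, and the helper names `extract_locs` and `estimate_hours` are all hypothetical; a real run would download the XML from the site instead of using inline strings.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample data standing in for downloaded sitemap files.
SAMPLE_INDEX = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemaps/articles-2023-01.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemaps/articles-2023-02.xml</loc></sitemap>
</sitemapindex>
"""

SAMPLE_URLSET = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/2023/01/01/story-a.html</loc></url>
  <url><loc>https://example.com/2023/01/02/story-b.html</loc></url>
  <url><loc>https://example.com/2023/01/03/story-c.html</loc></url>
</urlset>
"""


def extract_locs(xml_text):
    """Return (kind, locs), where kind is 'index' or 'urlset'.

    An index lists child sitemap files; a urlset lists page URLs.
    """
    root = ET.fromstring(xml_text.strip())
    if root.tag.endswith("sitemapindex"):
        kind, path = "index", "sm:sitemap/sm:loc"
    else:
        kind, path = "urlset", "sm:url/sm:loc"
    locs = [el.text.strip() for el in root.findall(path, NS) if el.text]
    return kind, locs


def estimate_hours(n_pages, seconds_per_page=1.0):
    """Rough scrape time, assuming one request per page at a fixed delay."""
    return n_pages * seconds_per_page / 3600


index_kind, child_sitemaps = extract_locs(SAMPLE_INDEX)
urlset_kind, article_urls = extract_locs(SAMPLE_URLSET)

print(index_kind, len(child_sitemaps))
print(urlset_kind, len(article_urls))
print(estimate_hours(7200, seconds_per_page=0.5))
```

Counting the child sitemaps in the index and sampling a few of them for their URL counts gives a rough total without crawling a single article page. The seconds-per-page figure is an assumption that should reflect whatever request delay the scrape will actually use.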