From the course: Web Scraping with Python

What is web scraping? - Python Tutorial

What is web scraping?

- [Instructor] So what is web scraping? Let's start this off by going over some web scraping terms. You may have heard web scraping called many things. Screen scraping and web harvesting are what I think of as older terms for web scraping, a little bit out of date. Web crawling, web spiders, and spidering all refer to moving from one webpage to another via links. So, you can scrape data from a single page, or you can crawl across multiple pages, scraping data from each one. Of course, sometimes all these terms are used interchangeably, but in this course, we'll use the term scraping for getting data off of a single page, and crawling for the act of moving from page to page. But we'll get more into those details in the second chapter. Bots or web bots refer more to automated interaction with websites or web apps. This interaction may be with the goal of scraping data or with the goal of crawling from page to page, but it's the bot that performs this interaction. And the concept of interaction or automation of a web app is something that we'll explore more in chapter three. You may be familiar with the traditional definition of web scraping that goes something like: an automated program that requests an HTML webpage or DOM meant for humans and then parses the displayed data. I find this definition to be a little bit limiting. Not all data is HTML, right? Browsers are useful tools and their DOM can be a great interpretation of this web data, but sometimes we don't really need or care about it. Is all data meant for humans? Sometimes it is, sometimes it isn't. Sometimes it's somewhere in between, right? Let's look at a definition of web scraping that I like to use: a program that requests and parses any data on the web, especially in an unexpected way. So this doesn't necessarily have to be automated. Of course, many web scrapers are. This could be any data: text files, videos, or HTML.
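To make the scraping versus crawling distinction concrete, here is a minimal sketch, not code from the course. It assumes a tiny in-memory "site" (the `PAGES` dictionary, the `PageParser` class, and the `scrape` and `crawl` functions are all hypothetical names invented for illustration) so it runs without any network access. `scrape` pulls data off a single page; `crawl` follows links from page to page and scrapes each one.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site" so the sketch runs without network access.
PAGES = {
    "/": '<h1>Home</h1><a href="/a">A</a><a href="/b">B</a>',
    "/a": '<h1>Page A</h1><a href="/b">B</a>',
    "/b": '<h1>Page B</h1><a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Collects the <h1> text (the scraped data) and every link href."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.title += data


def scrape(url):
    """Scraping: get data off a single page."""
    parser = PageParser()
    parser.feed(PAGES[url])
    return parser.title, parser.links


def crawl(start):
    """Crawling: move from page to page via links, scraping each one."""
    seen, queue, titles = set(), [start], {}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue  # skip pages already visited or outside the site
        seen.add(url)
        title, links = scrape(url)
        titles[url] = title
        queue.extend(links)
    return titles
```

Calling `scrape("/")` returns the title and links of one page, while `crawl("/")` visits every page reachable from the start page. A real crawler would fetch pages over HTTP instead of reading from a dictionary.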
I think the essence of web scraping, when most people talk about it, is this sort of unexpected or almost hacker-like use of the data. Maybe this data was even meant to be parsed by a computer program, just not your computer program. So web scraping is a lot of data detective work, really getting down and dirty, and understanding what's going on behind the scenes in order to repurpose the application for your own uses. Some uses of web scraping might be a crawler that scans medical patient boards looking for experiences with drug combinations, so lots of big data collection, natural language processing classification. You could have automated UI testing of a company's web app. Here, obviously you have permission of the people who own the app, but the app's code wasn't designed to be interacted with by a bot. So you're not scraping the data and parsing it in a traditional sense, but you're putting the data into a database. You are logging test failures and successes, and that's a type of data. You can have a bot that interacts with an airline flight search app, monitoring price changes. So here, there's a similar app interaction. It's logging data that gets uncovered at the end of that interaction. You probably don't have the permission of the airline company. But what if you're collecting that exact same data, only this time, your bot just makes requests to the airline's public API? Is requesting JSON data from a URL somehow fundamentally different from requesting a web application page meant for humans? And the thing is, there's no real answer to that question. The definition of web scraping is totally arbitrary. It's this kind of nebulous field that encompasses many skills and practices: application security, networking, data science, natural language processing, law, data architecture. The bad news: this course will not teach all of these fields. But what I do want to teach is the sense of how to look at a web scraping problem and break it down correctly.
You want to figure out the first step and the second, and if you do need to bring in outside help, say from a data scientist, you have a nice, neat, contained little problem for them that slots in well with the rest of your web scraping project. And I do want to leave you with this probably apocryphal quote from Michelangelo about how he sculpted the statue of David. "It's easy. You just chip away the stone that doesn't look like David." So really what web scraping is, is being able to look at a website, a huge collection of links and media and ads and garbage, and cut through all of that and see only your database. You want to chip away everything that isn't your data. And the first step to doing that, of course, is learning how the internet works, over in the second video. So I hope you will come join me.
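As a rough illustration of the airline question raised in the transcript (is requesting JSON from an API really different from requesting a page meant for humans?), the sketch below pulls the same price out of both kinds of response. Everything here is an assumption made for illustration: the sample markup, the JSON body, the flight code, and the names `PriceParser`, `price_from_html`, and `price_from_json` are hypothetical and do not come from the course or from any real airline. Both functions do the same job: chip away everything that isn't the data.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample responses; not real airline markup or a real API payload.
HTML_PAGE = '<div class="ad">Buy now!</div><span class="price">199.00</span>'
JSON_BODY = '{"flight": "XX123", "price": 199.00}'


class PriceParser(HTMLParser):
    """Ignores everything except the text inside <span class="price">."""

    def __init__(self):
        super().__init__()
        self.price = None
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.price = float(data)


def price_from_html(html):
    """Parse a page meant for humans and keep only the price."""
    parser = PriceParser()
    parser.feed(html)
    return parser.price


def price_from_json(body):
    """Parse a response meant for programs and keep only the price."""
    return float(json.loads(body)["price"])
```

Under these assumptions, both functions return the same number from very different inputs, which is the point the transcript makes: the request-and-parse shape is the same whether the data was formatted for a person or for a program.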
