From the course: Introduction to Spark SQL and DataFrames
Eliminating duplicates in DataFrames
- [Instructor] Now, when we're working with DataFrames, Spark provides some ways to de-duplicate data, so let's take a look at how to do that. The data files that we've been working with, the location and temperature data in our utilization files, don't have any duplicate data, so we'll take this as an opportunity to also look at how we can create small data sets to work with within our Jupyter Notebook session. The first thing I want to do is import some code that we'll need from the PySpark SQL package, so I'll specify from pyspark.sql import Row, and we have that. Now I'm going to create a DataFrame by entering data manually here in the notebook, and I'm going to call this DataFrame dup because it's going to have duplicate data in it. To do that, I specify sc, which stands for SparkContext. It's a global variable that gives us…