From the course: Advanced Pandas

Reshaping dataframes (pivot, stack)

From the course: Advanced Pandas

Start my 1-month free trial

Reshaping dataframes (pivot, stack)

- [Instructor] After you've got your data, and you've began the data prep phase of your work, there's often a hurdle you've got to jump before you can really extract inside. Often your data simply is not in the appropriate format, or level of granularity needed to begin creating those snazzy visualizations or machine learning models. Luckily, pandas offer several easy to use and flexible functions for you to reshape your data. The first of these is called pivot. This allows you to take away a level of granularity from your rows and pivot that to your columns. You simply identify the index you want to use, the variable you want to pivot from your rows to your columns, in which values you want to populate your data frame. Note, pivot doesn't perform any kind of aggregation. So for this to work, the index and columns you choose, will have to be unique combinations. To show this, we'll create the following data frame. Notice there's two teams per region, and revenue and cost data for each region and team. Great, now let's pivot team to columns, using region as our index. For values, we'll use revenue. Now you see, region is our index and teams is now up top. The values here represent revenue. Next, we'll use stack to work in the opposite direction. That is, we'll pivot column labels to our rows. But first we need to set a multi-index for our data frame. To do this, we use set index and specify region and team. All right, now I'll call the stack function and create a new data frame called stacked. Excellent, so now you can see individual rows for revenue and cost and the values for both now lives within the same column. Similarly, we can reverse this process with unstack, that is we'll pivot row labels back to our columns. And now you see revenue and costs each have their own column once again. Note that this is simply the default behavior and you can actually specify at what level in your multi-index you want to unstack. The default is negative one or the inner most index, and we could change that to negative two or negative three, to unstack by team or region respectively, or better yet you can specifically name the index you want to unstack. For example, let's specify region. Great, now we've got region up top. One other way to reshape your data frames to be mindful of is melt. This allows you to specify ID variables from your columns which will stay intact. All remaining columns or measure variables will be pivoted from the column to the row level. Returning to our original data frame, we're going to designate region and team as our ID variables, then pivot revenue and cost back to the row level. Great, now all of the reshaping approaches so far require that we have unique combinations of the row and column variables that we're pivoting if that's not the case in your data. So for example, say we had multiple revenue entries for each region and team, you'd want to use the handy pivot table function. Here, we set an index, determine which variables we want pivoted to the columns if any, then designate our values. We'll go with revenue here. By default, pivot table uses mean to aggregate, but you can specify the method you want to use as well. Excellent, the average revenue by region is higher for team one than for team two. Now, one last thing, let's add region to columns to see how that changes it. Now we have the average for each region and team. Basically, pivot table acts like a pandas group by, then unstacks by the variable you specify in columns. Between these methods, there is always a way to get your data in the format you need to take it to the next level. When in doubt, PivotTable is the most flexible of the bunch and is a great place to start.

Contents