From the course: Faster pandas

Boolean indexing

- [Instructor] You can use Boolean indexing to select data based on a criterion. Let's see an example. So ipython, and then import pandas as pd. And you say df = pd.read_csv('cart.csv') that we have. And if you look at the data frame, we have some cart information: the Customer, the Item, the Amount, and the Item Price. We can do df['Item Price'] > 10, and you'll get a series of Boolean values. If we assign these values to a mask, we can use this mask to select only the rows in the data frame that have True in the mask. So df[mask] will give us only the rows in which the item price is bigger than 10. We usually write it in short as df[df['Item Price'] > 10]. We can also combine conditions using Boolean operators such as and, or, and not, which pandas writes as &, |, and ~.

Selection using Boolean indexing is much faster than handwritten Python code. Let's load some bigger data to see. So import sqlite3, and then conn = sqlite3.connect('logs.db', detect_types=sqlite3.PARSE_DECLTYPES). And our data frame now is going to use pd.read_sql, and we say df = pd.read_sql('SELECT * FROM logs', conn). And if we look at the data frame, we have 10,000 rows.

Let's count how many errors, meaning rows where the status code is bigger or equal to 400, are in the data frame. First, the pure Python way. So %%timeit, and total = 0, and for _, row in df.iterrows(): if row['status_code'] >= 400: total += 1. And this is 2.1 seconds. And now the Boolean indexing version. So %timeit len(df[df['status_code'] >= 400]). And this is 856 microseconds. We have a million microseconds in a second. So 2_100_000/856. And we get about a 2,500 times speedup by switching to Boolean indexing.
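The steps described above can be tried without the course files. The following is a minimal sketch, not the instructor's exact session: the two small data frames are made-up stand-ins for cart.csv and the logs table, with only the column names 'Item Price', 'Amount', and 'status_code' taken from the transcript.

```python
import pandas as pd

# Hypothetical stand-in for cart.csv: column names follow the transcript,
# the rows are invented for illustration.
df = pd.DataFrame({
    'Customer': ['alice', 'bob', 'carol', 'dave'],
    'Item': ['lamp', 'pen', 'mug', 'tape'],
    'Amount': [1, 3, 2, 4],
    'Item Price': [250.0, 4.5, 12.0, 9.99],
})

# A comparison on a column yields a Boolean Series (the mask).
mask = df['Item Price'] > 10

# Indexing with the mask keeps only the rows where the mask is True.
expensive = df[mask]

# Combine conditions with & (and), | (or), ~ (not).
# Each comparison needs its own parentheses.
bulk_and_expensive = df[(df['Item Price'] > 10) & (df['Amount'] >= 2)]
cheap_or_single = df[(df['Item Price'] <= 10) | (df['Amount'] == 1)]
not_expensive = df[~mask]

# Hypothetical stand-in for the logs table.
logs = pd.DataFrame({'status_code': [200, 404, 500, 301, 200, 403]})

# Pure Python way: loop over the rows one at a time.
loop_total = 0
for _, row in logs.iterrows():
    if row['status_code'] >= 400:
        loop_total += 1

# Boolean indexing way: one vectorized comparison, then count the rows.
mask_total = len(logs[logs['status_code'] >= 400])

print(expensive)
print(loop_total, mask_total)
```

Both counting approaches give the same number. On a frame this small the timing difference is not meaningful, so the speedup quoted in the transcript should be measured on realistically sized data with %timeit.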
