From the course: Data Visualization: A Lesson and Listen Series

Lesson: Visualizing large data sets

(techno music) - Big data is one of those terms that's so overused as to be nearly pointless. It shares that stage with other terms like data science, another buzzword that has lost nearly all meaning. And I'm not going to define big data here except to say I'm referring to large, complex data sets with a volume and scope that would make running a calculation in Excel and generating a pie chart either impossible or at least a disservice to the complexity available in the data.

For example, say you had the entire 2010 census data set and you wanted to show geographic distribution, population density, and racial diversity of every single human in America at that time. That's around 300 million data records with a racial identifier as well as location information at the census block level, which is about 2/3 of an acre. Could you aggregate this data at a national level and simply create a pie chart of the overall racial breakdown of every human in America? Of course, if that's your goal. You may not win any awards, but it might do the trick. But the bigger and more interesting question may be, can you show the entire data set in a way that's engaging, compelling, and leads to interesting insights that can't be found by simply aggregating the data? Well, you know I'm going to say yes.

And maybe you've seen the Racial Dot Map created by Dustin Cable at the University of Virginia in 2013, which is one of my favorite examples of visualizations of big data. Let's take a look at it. On this map, one dot represents one person. And we have different color dots for each race. You can see some patterns such as the clear concentration of African Americans in the Southeast, and the concentrations of Hispanics in Texas and California, Asians in L.A. and San Francisco. You can also see a lot of purple, which is a combination of colors. For instance, in my hometown of Boston, what you seem to have is a mixture of races just based on that purple hue that you can see.
This aggregated overall view is interesting. But let's see what happens when we zoom in. Now you can see the large blocks of color change. At each zoom level we see more information, a more nuanced view of the data. Instead of an aggregated average, at a certain point I can zoom in enough to see individual census blocks. I'm getting closer and closer to seeing one dot per person. So, instead of a single combined color value when you're zoomed out and seeing thousands of people at a time, now I'm seeing individuals. It's now revealed that Boston is made up of different segregated neighborhoods where towns like Winthrop, which is nearly entirely white, are clearly contrasted against towns like Quincy, which has a large Asian population, and neighborhoods like Roxbury, which has a large concentration of African Americans.

Let's quickly look at another example on a similar theme: the American population again. This time looking at every single person who has immigrated to the United States from 1790 to 2016. Once again, we're looking here at one dot per person. And each ring from the center outward is a year. Like tree rings. Color represents the region of origin of each immigrant. So it's clear that in the late 18th and early 19th century, the center of this diagram, the vast majority of immigrants were coming from Europe. There was this big wave of Canadian immigrants in the late 19th century. During and right after World War II immigration really shrank. And soon thereafter began the wave of immigrants from Latin America and Asia while Europeans were still coming in in large numbers as well. This is pretty big data that still allows a great, high-level overview impression while also allowing a nuanced view as well.

These two examples reflect the kinds of decisions you need to make when visualizing big data. Of course, you have to think about your audience. And you have to make very careful decisions about how deep to go with your data when presenting it to them. Is the high-level, aggregated view good enough? Or is a highly nuanced, deep-dive view what they require to make decisions? 'Cause that's usually what it's about, making decisions.

And then you have to think about technology. Data collection and transformation can be a huge task when working with big data. And then displaying it can bring a host of challenges. For instance, think about the Racial Dot Map. No mapping technology could display 300 million dots in your browser and perform at anything other than a snail's pace. So, the creator had to write code to generate different map tiles to layer on top of Google Maps. One set of tiles for each zoom level. And each tile has a collection of dots organized by census block. Sometimes, one dot of one color per person if zoomed in close enough. Sometimes, one dot of a calculated color value based on the percentage of people of different races within some larger geography. The methodology, math, and coding are complex to say the least.

But the first thing was the idea. Coming up with what they wanted to communicate that an audience would find useful and insightful then led them to technical solutions that were feasible. As always in data communications, the creative idea precedes the technical solution. That is the most important lesson I can teach you. Know what you're trying to communicate at whatever level of detail is necessary and then you can probably find a technology or creative technical hack to make it work.

Next up, we'll talk to Elijah Meeks, Senior Data Visualization Engineer at Netflix, who works with big data every day. We'll be talking about best practices and strategies for visualizing big data as well as the state of the data visualization field, which is in constant flux.
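To make the tiling idea described for the Racial Dot Map a little more concrete, here is a minimal Python sketch. It is not the map creator's actual code. It assumes the standard Web Mercator tile-numbering formula used by most web maps, and the names PALETTE, tile_for, blend_color, dots_for_block, and the detail_zoom threshold are all hypothetical, chosen only for illustration. The group labels and colors are placeholders, not the real map's categories or palette.

```python
import math

# Hypothetical palette: placeholder group labels mapped to RGB colors.
# These are NOT the colors or categories used by the real Racial Dot Map.
PALETTE = {
    "A": (0, 0, 255),    # blue
    "B": (255, 0, 0),    # red
    "C": (0, 160, 0),    # green
}


def tile_for(lat, lon, zoom):
    """Return the (x, y) Web Mercator tile containing a point at a zoom level.

    Each zoom level splits the world into 2**zoom by 2**zoom tiles, so a
    separate set of tiles exists for every zoom level.
    """
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


def blend_color(counts):
    """Average the group colors, weighted by each group's share of people."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot blend colors for an empty population")
    return tuple(
        round(sum(PALETTE[group][channel] * n for group, n in counts.items()) / total)
        for channel in range(3)
    )


def dots_for_block(counts, zoom, detail_zoom=13):
    """Return (color, people_represented) pairs for one census block.

    Zoomed in (zoom >= detail_zoom, an assumed threshold): one dot per person
    in that person's group color. Zoomed out: a single blended dot standing
    in for everyone in the block.
    """
    total = sum(counts.values())
    if total == 0:
        return []
    if zoom >= detail_zoom:
        return [(PALETTE[group], 1) for group, n in counts.items() for _ in range(n)]
    return [(blend_color(counts), total)]


if __name__ == "__main__":
    block = {"A": 2, "B": 1}           # made-up counts for one block
    print(tile_for(42.36, -71.06, 0))  # roughly Boston, at the whole-world zoom
    print(dots_for_block(block, zoom=5))
    print(dots_for_block(block, zoom=14))
```

A block that is two-thirds group A (blue) and one-third group B (red) blends to a purple hue when zoomed out, which is the effect described for Boston, and resolves into three individual dots once the zoom passes the threshold. A real pipeline would also have to place each dot at a position inside its block and render the tiles to images, which this sketch leaves out.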