From the course: Data Science for Java Developers

What is data science anyway? - Java Tutorial

- [Narrator] So before we start getting into all the different data science skills, stuff like data visualization, machine learning algorithms, and all that fun stuff, the first question that you probably have, or should have at any rate, is: what exactly is data science? Data science is the study of data, but at a more fundamental level, data science is the study of how to take data and make some sort of useful sense out of it. So this might mean doing something like taking customers' purchase histories at a supermarket and using that data to determine which items should be placed next to each other in order to increase sales. Or it might mean taking traffic data and using that to improve the layout of roads. Or it might even mean taking all of the users' posts on a social media site and creating a model from that data that allows us to predict if a given user is suffering from depression or mania. And in fact, Facebook has actually done this. So as you can see, there are a lot of opportunities like this to make sense out of data in such a way that it helps people or businesses, and that is the realm of data science.
Another very effective way that I've seen it described is visually, in this Venn diagram that was first published by a guy called Drew Conway on his blog about a decade ago. The idea here is that data science lies at the intersection of three distinct skill groups. One, hacking skills, that is, technical prowess such as coding and software development. Two, math and statistics. Now, sorry if that made some of you cringe there, but it is a pretty vital part of data science nonetheless. And the last skill here, number three, is something called substantive expertise. In other words, a sort of down-to-earth practicality about how to apply theoretical concepts and discoveries to real-world problems.
Now, the names of these three skills might be a little opaque for some of you, so if it helps, you can just think of them as tech skills, math skills, and business skills. That's pretty much the point that's being made here.
So moving on here, you may have noticed that so far, besides Java, I haven't actually mentioned any specific data science-related technologies as part of the definition of what data science is, and that's no accident. You see, when we're approaching data science for the first time, as with pretty much any other tech-related field, we need to remember that the technology is not the field. With so many domain-specific technologies, programming languages, and frameworks around, the tendency has been for people to start defining the fields themselves in terms of the current hot technologies that are being used in the field. So you might hear that data science is learning to use Python, or that data science is learning to use Hadoop or BigQuery or Apache Spark or any of those other technology buzzwords. And while those really are all incredibly useful technologies in the field, we've got to remember, again, that the technology is not the field.
Another question that you probably have about data science is how data science is different from things like big data, data analytics, and so on, right? Those other kinds of technology- and data-related buzzwords. Now, I'm not going to go into too much detail about the differences here, mostly because these differences aren't really as concrete as some people like to think. So I'll explain more about why I think that is in a minute, but first, here's a very general rundown of the theoretical differences between the fields that I just listed. So first, I'm going to start off with data science. Data science is focused on using things like statistics, machine learning, techniques for gathering and cleaning data, et cetera, to gain insights from data. That's the sort of theoretical definition of data science.
Big data, on the other hand, is focused on working with huge amounts of data effectively. So when you're working with large amounts of data, on the scale of terabytes or petabytes, there are going to be some strategies that you're going to have to learn in order to work with that amount of data effectively. And lastly, we have data analytics. Data analytics is generally focused on automating processes for drawing conclusions from data.
The fact is that when most companies pick one title or another, it's either completely at random or it's based on which title the human resources department thinks will catch people's eye in the job posting more effectively. That's just the reality of these things. So the other reason that you shouldn't be too focused on the differences between these fields is that, at the end of the day, nearly every person who gets hired in the data science, big data, or data analytics field will end up wearing all three hats at some point or other without even thinking about it. In general, there are very few data scientists who don't have to work with large amounts of data and perform data analytics, and the same is true in reverse for all three fields.
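
To make the supermarket example from the transcript a little more concrete, here is a minimal Java sketch, not taken from the course, of one simple way to look for items that are often bought together: count how many baskets contain each pair of items. The class name `ItemPairCounts`, the method `countPairs`, and the sample baskets are all hypothetical, and raw pair counts are only a starting point compared with what a real analysis would use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class ItemPairCounts {

    // Counts how many baskets contain each unordered pair of distinct items.
    // Keys have the form "itemA|itemB", with the two names in alphabetical order.
    public static Map<String, Integer> countPairs(List<List<String>> baskets) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> basket : baskets) {
            // Sort and de-duplicate so that "bread|milk" and "milk|bread" share one key.
            List<String> items = new ArrayList<>(new TreeSet<>(basket));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + "|" + items.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up purchase histories, one inner list per shopping trip.
        List<List<String>> baskets = List.of(
                List.of("bread", "milk", "eggs"),
                List.of("bread", "milk"),
                List.of("milk", "eggs"),
                List.of("bread", "butter"));

        // Print the pairs, most frequently co-purchased first.
        countPairs(baskets).entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```

Pairs with the highest counts are candidates for being shelved near each other. Deciding whether that actually increases sales is where the statistics and the business knowledge from the Venn diagram come in.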
