From the course: Data Science Foundations: Data Mining in Python

Tools for data mining - Python Tutorial

Tools for data mining

- [Instructor] If you're going to go out mining, then you need the right tools. When you're mining for data gold, that means software. And the two most common tools for that are Python and R. Now, as it happens, we have two versions of this course, with one showing how to mine data in Python and the other one in R.
There are good reasons for choosing these two languages. Python is currently the most popular language for data science and machine learning. It has the advantage of being a general-purpose language, which means millions of people know how to use it because it can program basically anything. It's also well adapted for data. Python is also really easy to learn, as it has a clean, readable syntax. R is a different kind of language. It's a programming language that was specifically developed for data analysis. Consequently, it's very popular with scientists and researchers, and it's what I learned first as a way of working with data.
Either one works great for mining data, but I do want you to be aware of some of the other options. I want to make sure to mention applications with point-and-click GUIs, or graphical user interfaces, that are developed with data work in mind. Some of these are free. Some are expensive. Some are similar to spreadsheets. Some have extensive programming abilities. They include apps like SPSS, which is very popular for analyzing data in the social sciences, where it was first developed, and SAS, which also has strong programming abilities. jamovi, a free, open source application that resembles SPSS but runs on R, is one of my personal favorites. JASP is another free, open source application that strongly resembles SPSS and jamovi. And then there's Tableau, which is specifically for data visualization, but it's one of the go-to tools for people who are trying to get a feel for large datasets.
In addition to these apps, one language that anybody who's going to work with data should learn is SQL, that's Structured Query Language.
And it's usually pronounced sequel. This is for accessing databases. And especially if you're working in a large organizational setting, you're going to have a lot of data in databases. SQL has a great learning ROI, or return on investment. It doesn't take much to learn most of what you're going to need in SQL and get started accessing, cleaning, and querying data.
I also want to mention spreadsheets, so the rows and columns of Microsoft Excel and Google Sheets. The reason these are important is that they're the single most common data tools in the world, and that's my personal belief. Everybody has them. And there's a good chance that if you're working with a client, they may provide you with data in spreadsheets, and they may want the results back in the same form. They're great for browsing data. They're great for making very simple graphs, and anybody who's going to work with data owes it to themselves to be very comfortable with both of these spreadsheets.
Now, I want to give you my personal philosophy about data tools. I see an order to things. Generally, I recommend that people start with the simplest tools, and that would be spreadsheets, until the data you're working with or the analysis you need to do becomes unwieldy. Then consider the next step up. That's an application like jamovi or SPSS. These allow you to do a lot of specialized procedures for analyzing and mining data. Or use a data mining specific application like RapidMiner or KNIME. But only then should you consider taking the final step to programming languages. They are, of course, the most flexible and powerful, but often they're also the most difficult to learn and to use proficiently.
I do want to give you a little bit of data about programming languages and apps for context. KDnuggets is one of the most popular sites for information about data mining. KD stands for Knowledge Discovery. And this is their most recent survey of software use among practicing data miners.
And what you'll see here is that Python is number one on that list, where 66% of the respondents said that they use Python on a regular basis. 51% said they use a software application called RapidMiner, and, you know, it's popular, but the people who ran the survey say that may have been because the survey was actively promoted on a RapidMiner discussion list. R is third at 47%. Please note that Excel, the spreadsheet, is here at 35%. Anaconda is really a distribution of Python. And then there's the language SQL here at the end. So it lets you know some of the things that are frequently used in data mining.
Another perspective on this is postings for data science jobs on the job site Indeed. And this is from 2017, so it's a little old, but the pattern is consistent over time. What you see again is Python is here at the top. SQL is close behind. It is still an important thing in data science. Then if you come down here, you'll see that R is still in demand. Any one of these can be very useful tools for data mining.
And so consequently, I want you to choose your tools wisely. You have great flexibility and choices. I'm going to be emphasizing Python and R in the two versions of this data mining course, but you have a lot of other options. Almost any tool you choose can do almost any analysis if you give it enough work, but do consider some things. Look at the client's requirements. Do they use a particular software application or language that they want you to use? Obviously you'd want to work with that. Python and R are common, but use what works best for you and the particular project you're working on. And that can get you started on mining your data for valuable insights.
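
The transcript recommends SQL for accessing, cleaning, and querying data, and Python as a general-purpose language for data work. As a rough illustration of how the two can combine, here is a minimal sketch using Python's built-in `sqlite3` module. The `sales` table, its columns, the sample rows, and the `total_by_region` helper are all hypothetical and are not taken from the course.

```python
import sqlite3


def total_by_region(rows):
    """Load (region, amount) rows into an in-memory table and total them with SQL.

    The table name `sales` and its columns are made up for this sketch.
    """
    conn = sqlite3.connect(":memory:")  # temporary database, nothing written to disk
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        query = """
            SELECT region, SUM(amount) AS total
            FROM sales
            WHERE amount IS NOT NULL   -- simple cleaning step: skip missing amounts
            GROUP BY region
            ORDER BY region
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Hypothetical sample data, including one missing value
sample = [("east", 10.0), ("west", 5.0), ("east", 2.5), ("west", None)]
totals = total_by_region(sample)
print(totals)
```

In an organizational setting the connection would point at an existing database rather than an in-memory one, but the query itself would look much the same.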
