From the course: The Data Science of Economics, Banking, and Finance, with Barton Poulson

Data science and money

- [Instructor] In courtroom cross-examinations and in wedding proposals, the general rule is: never ask a question unless you already know the answer. Now, wouldn't it be great if you could already know the answer to your financial questions? Like whether a particular investment will be profitable, or whether a loan application will get accepted, or whether your credit card is safe online? While it's not possible to be 100 percent certain about the future, data science can provide useful answers to those questions and a host of others in economics, banking, and finance. In this course, we're going to take a non-technical look at how data science applies to a wide range of financial topics, from fraud detection in cryptocurrencies to trend analysis in economic policy. But we'll begin by talking about how data science applies to money in general. Specifically, when I say Data Science and Money, what do I mean? Well, data science is the combination of mathematics and statistics with computer programming, in applied settings. And Data Science and Money is the application of those techniques, from data science, to solve practical problems in finance, banking, and economics. Now, there are a few things that this is similar to. It's like traditional data approaches in finance, but there's more to it than that. It's not just tables of numbers and spreadsheets, although you can learn an incredible amount by browsing through a spreadsheet. On the other hand, you may be familiar with the work of the quants, that is, people who do quantitative analysis at the Wall Street firms; that's very similar to what we're talking about here. And let's ask another obvious question, Why Apply Data Science to Money? Well, I can think of three reasons. Number one: you want to be able to identify unseen possibilities for profit, especially in a rapidly evolving and competitive environment. It also allows you to increase customer loyalty, find new markets, and so on.
You can identify and quantify financial risk and take steps to reduce it to acceptable levels in an extremely fast-changing environment. And the overall goal is greater profitability, or return on investment for yourself, or more often, for the people who've entrusted you with their financial future. All three of these are excellent reasons to want to get the extra insight and power of data science when dealing with financial data. Now, let's look at where your data actually comes from. There are Traditional Sources of Data for Finance. That includes economic indicators, like macroeconomic measures, GDP, and so on; performance of investments over time, so you get the history of a stock; and records of behaviors of clients, what they have done and whether they have made their payments on time and so forth. But data science allows you to go beyond that: you can get Unconventional Sources of Data. That can include things like unstructured text, like tweets and social media posts. You can look at sequence data, where you can see how things change specifically over time and whether a sequence in place one predicts a sequence in place two. You can also look at biometric markers, so, for instance, how a person types on their keyboard, how long it takes them to type in their name, whether they used the left or the right shift key, how they squiggle the mouse to find the cursor when it disappears. All these can be used as signatures to identify a person and to provide extra insight. And then network data, which specifically means networks of social relationships, the graph data that we have here. It's a very complicated kind of data, but data science allows you to get extra insight into that to be used in financial situations. Now, any time you're working in data science you have to make decisions, and here are a few you have to make with financial data. Number one, always: what is the acceptable level of risk? Specifically, there's more than one kind of risk.
There are false positives and false negatives. A false positive, say, for instance, with loan default, would be somebody who you think would default, but they wouldn't, they'd make their payments. And a false negative would be somebody who you think would be safe, but actually would default. Those are completely independent judgments and you have to put values separately on each of them. You also get to decide about the Sources of your data. Do you want existing in-house data, social media data, open data, or third-party data? All of those involve issues of time, and cost, and quality. How important is speed, or (near) real-time analysis? That's going to change the kind of data you can use and the algorithms that you can use. And then, finally, the Importance of transparency, being able to know what's actually happening, versus the really impressive precision of black box models, the machine learning algorithms that are very hard to interpret. These are decisions that you have to make before you get started on a particular data science project. But no matter what you do, you then have to choose a tool. And I'm going to start with the most obvious tool: it's Microsoft Excel. The spreadsheet is the universal container for data, and even if you're going to do all of your analysis in some other program, it will probably come through Excel. And this is a great way of sorting through data and letting you see what's there. For a better look, you want to do data visualization. And one of the best ways to do that is in Tableau: Tableau Desktop, Tableau Server, Tableau Public. It's a fabulous interactive application for visualizing and exploring your data. And then, finally, a huge amount of data science work is done in programming languages. The two most common are Python, that's a general-purpose programming language with packages, like Quantopian, that are excellent for use with financial data.
And R, a statistical programming language, the one that I prefer, that has packages like quantmod that are specifically developed for working with economics data. Now, no matter what tools you use, you're going to have to decide about what data science methods you're going to use. There's a list of usual suspects that fall under the rubric of machine learning; these are a whole broad category of algorithms, or different ways of processing data. Some really common ones include k-nearest neighbors; regression, with a lot of variations on that; decision trees and random forests; and artificial neural networks and deep learning, which is getting a huge amount of attention. Now, this course is non-technical so I'm not going to demonstrate these. I simply want you to know that these are some common choices, and there are a lot of others. But there are a huge number of choices that depend in part on what your goals are, what you're trying to accomplish, and what your constraints are in working with your data. No matter what you do, it's also important to remember that there are sometimes Potential Problems that come up when working with data. With data science, probably the biggest one is something called a flash crash, which has happened a few times. This is when automated trading gets out of hand and produces an enormous change in the value of a security in an extremely brief period of time. The most recent is from June 2017, when the market for the cryptocurrency Ethereum went from $319 to 10 cents in seconds because of the way that these algorithms interacted. It later rebounded, but people did lose a lot of money during that brief time. Another problem is when relevant data is not included. The algorithms can only work with what's there in the data and so you need to make sure they have the most important and most accurate information in your data sets.
There is a risk called overfitting, where you develop a model that's too closely tailored to the data you currently have and it doesn't generalize well. That's related to the problem of extrapolation, where you have a pattern that works well up to a certain point, but as we all know, past performance does not guarantee future results and it's risky to go into the future beyond what you have. Then there's the issue I've already mentioned of black box models, a lot of machine learning algorithms that can give really impressive precision in their predictions, but it is very hard to know exactly what they're doing to get to that. And then, finally, related to that is the issue of unintentionally using protected information. If you have demographic information, things like gender, and age, and race, and ethnicity, in your data set, and you're processing loan applications, well, that creates a world of trouble if those things make it into your algorithm unawares, 'cause they're not allowed to be used in the loan application. And so you do need to be careful about how your algorithms work and how you implement them. Let me give a quick example from Fraud Detection just very, very briefly. The Wisconsin Department of Health Services provides almost $9 billion in benefits to more than 1.3 million people annually. And the Office of the Inspector General, in Wisconsin, in 2016 created a data analytics section. And their work was able to save the Department of Health Services over $50 million in that one year. $32 million were saved from the detection of benefit overpayments, and $18 million saved from cost avoidance. And so there was tremendous return on investment for this intervention, the use of sophisticated data analytic techniques for looking at the services provided in Wisconsin. And that's just an example of what can be done anywhere in a multitude of situations. Very briefly, before we finish, let's just review, again, the Benefits of Applying Data Science to Finance.
First is increased opportunities, second is the ability to identify and quantify financial risk, including reduced fraud, and then third, it allows you to better reach the overall goal of greater profitability, or return on investment.
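As a side note to the earlier point that false positives and false negatives need separate values placed on them, here is a minimal Python sketch of that idea for the loan-default example. Everything in it is an assumption made for illustration: the function names, the eight made-up applicants, and the dollar costs (500 for wrongly flagging a good borrower, 5,000 for missing a real default) do not come from the course.

```python
def confusion_counts(actual, predicted):
    """Tally the four outcomes. True means 'defaults' (the positive class)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for did_default, flagged in zip(actual, predicted):
        if flagged and did_default:
            counts["tp"] += 1   # flagged, and really defaulted
        elif flagged and not did_default:
            counts["fp"] += 1   # false positive: flagged, but would have paid
        elif not flagged and did_default:
            counts["fn"] += 1   # false negative: looked safe, but defaulted
        else:
            counts["tn"] += 1   # looked safe, and paid
    return counts


def total_cost(counts, cost_fp, cost_fn):
    """Weight each kind of error by its own, separately chosen cost."""
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn


# Hypothetical outcomes for eight applicants (made-up data).
actual = [True, False, False, True, False, False, True, False]
predicted = [True, True, False, False, False, False, True, False]

counts = confusion_counts(actual, predicted)
# Hypothetical costs: a missed default is assumed to hurt ten times more
# than turning away a good borrower.
cost = total_cost(counts, cost_fp=500, cost_fn=5000)
print(counts, cost)
```

Changing `cost_fp` and `cost_fn` independently is the whole point: the same model can look acceptable or unacceptable depending on how much each kind of mistake is judged to cost.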
