From the course: Data Science Methodologies: Making Business Sense

Where to begin?

- [Instructor] The spirit of data science projects is all about inquiry and exploration. Sometimes we have the data, but we are not sure what value we can extract from it. Other times, we see the value in the data, but we are not sure what kind of product we should build to extract that value. Should it be a feature in an existing web application, or should it be an independent mobile app? And above all, how does it align with the larger business goal? How can I turn my idea into a compelling story so that my stakeholders agree to support me in this initiative?

In the bank's direct marketing example, we started with a high-level business objective of improving direct marketing efficiency. The context was human agents contacting customers to offer long-term deposits, and the data mining goal was to identify the customer characteristics that influence their decision to accept or reject the offer. In other words, find the most relevant features. Using Scrum's approach of stating the requirement from the end user's point of view, we stated the user story as: as a direct marketer, I want to call customers with a high likelihood of offer acceptance. Then we broke this somewhat vague user story into more specific and smaller data science stories stated in the data scientist's language. As we articulate the stories, they go into a list of user stories arranged in a prioritized order of implementation, called a product backlog. We picked the topmost stories from the product backlog and planned to implement them in the upcoming sprints. The list of stories to be implemented in one sprint is called a sprint backlog. For now, we chose just the first story from our product backlog.

Let us start with understanding the data. We find that there are four datasets available at the URL listed here. The full dataset has over 41,000 samples and 20 features. The second has ten percent of the samples with all features. The third set has all samples but only 17 features, and the smallest one has ten percent of the samples and 17 features. The fourth one is also cleaner, as it doesn't have missing values, so I decided to start with that.

The next step is data preparation, which involves understanding the features, their data types, missing values, and so on. Our dataset is about customers, their financial status, and the last direct marketing campaign contact made with them. Let us say I decide to use the personal and financial features as my minimum viable dataset to start with. So here are the steps to prepare my data. First, I will read the data into a data frame and then drop the campaign-related columns. Of the eight features I am keeping, age and balance are numeric, whereas the others are categorical, so I will use standardization for the numeric data. I plan to use a random forest classifier as my first model, which doesn't take categorical data as input, so I will perform one-hot encoding on the categorical data. Finally, I will split the data into training and test datasets and then use the training set to train my model.

All these steps are already coded in my Jupyter notebook. As you can see here, first I read the data into a data frame, assigning the column names, as the data doesn't have a header. You can see here some sample data, which includes the campaign-related data as well. Since I'm not interested in it as of now, I dropped columns eight to 16, and then I can see the eight features that I am interested in.
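As a rough guide, the loading and column-dropping step might look like the sketch below. The file name, separator, and column names are assumptions based on the UCI Bank Marketing dataset and may differ from the actual notebook.

```python
import pandas as pd

# Column names follow the UCI Bank Marketing dataset layout; the file name
# and separator are assumptions for illustration.
columns = ["age", "job", "marital", "education", "default", "balance",
           "housing", "loan", "contact", "day", "month", "duration",
           "campaign", "pdays", "previous", "poutcome", "y"]

# The data has no header row, so the column names are supplied explicitly.
df = pd.read_csv("bank.csv", sep=";", header=None, names=columns)

# Drop the campaign-related columns (indices 8 through 15), keeping the
# eight personal and financial features plus the target column y.
df = df.drop(df.columns[8:16], axis=1)
print(df.head())
```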
Next, I performed one-hot encoding on the categorical data. So first I extract the six features that have categorical values and then perform the encoding here in these statements. The encoded data frame shows, for example, the job feature split into columns, each column representing one type of job: admin, blue collar, entrepreneur, and so on. And for each row representing one customer, the column that has a one indicates that customer's job, and the rest are encoded as zeros.

The next step is to process the numeric data. The numeric data is in columns zero and five, for age and balance. So I'm using the StandardScaler from scikit-learn and putting the scaled values in a numeric data frame. When I view its head, I can see that it has scaled values for age and balance. Now we need to combine the two data frames, one-hot encoded and numeric, to make the final data frame. Finally, our data is ready to be split into train and test datasets and then fed into the random forest classifier. This is where I'm performing the training, and once the model is ready, I can test its performance. Here is the confusion matrix and here is the model accuracy score, not bad to start with. A condensed sketch of these steps follows at the end of this section. At this point, a data scientist would typically go back to reprocess the data and retrain the model to get better performance.

But remember the process we started with. We picked up one user story from our sprint backlog, understood and prepared the data, and then trained and evaluated the model. At this stage, we should think about how to deploy the model so that it can be used in an application by the end users. We need to create our minimum viable product, and to be able to do that, we need to go beyond modeling and look into software development architecture. Let us do that next.
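Putting the remaining preparation, training, and evaluation steps together, a condensed sketch might look like the following. It assumes the column names from the loading step above, uses pandas' get_dummies for one-hot encoding (the notebook may use scikit-learn's OneHotEncoder instead), and picks an illustrative test size and random seed.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# One-hot encode the six categorical features.
categorical_cols = ["job", "marital", "education", "default", "housing", "loan"]
encoded = pd.get_dummies(df[categorical_cols])

# Standardize the two numeric features, age and balance.
scaler = StandardScaler()
numeric = pd.DataFrame(scaler.fit_transform(df[["age", "balance"]]),
                       columns=["age", "balance"], index=df.index)

# Combine the encoded and scaled frames into the final feature matrix,
# and map the yes/no target to 1/0.
X = pd.concat([encoded, numeric], axis=1)
y = (df["y"] == "yes").astype(int)

# Split into train and test sets, train the random forest, and evaluate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
```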
