With new techniques and buzzwords entering the industry all the time, it can become quite tempting to opt for the latest fads when you need to select a modeling technique. However, using the latest one doesn’t always mean using the best one for the task at hand. Tom Zougas, Data Science Manager at TransUnion Canada, looks at how to determine when to use new modeling techniques—and when not to.
Too many techniques, too little time
Data scientists are under increasing pressure to follow the latest and greatest trends in modeling techniques. On the one hand, you have organizations that want to know what their modeling teams are doing to implement the newest innovations in artificial intelligence. On the other, you have new data scientists, fresh from school, who are eager to apply all the techniques they have recently learned.
It’s easy to get distracted by the latest modeling techniques. However, more advanced models can be complicated to use, interpret, and deploy. This can often detract from solving the business problem at hand.
In this blog we will cover three core data models, how to approach your modeling decision, and what to consider when choosing to implement a new model. A more detailed look at these core models, the various learning algorithms (including newer models) and the holistic modeling process can be found in a paper I wrote for the SAS Global Forum [1].
Always start with the business problem
As simple as it sounds, we often forget to start at the beginning. It is essential to consider what the business problem is, and what data you have available, before embarking on your modeling decision journey. A few simple questions can help determine what the model needs to achieve. These include:
- Does the business need to rank order data based on certain outcomes?
- Does the business want to categorize a record or observation?
- Does the business want to predict a numerical quantity?
You may find yourself trying to decide between intricate modeling techniques when the business problem could be solved by simply using one of the core models described in this blog. You could then spend considerable time and effort working out how to fit a brand-new model into your current infrastructure, only for it to yield the same results a core model would have. Starting with the business problem can save a lot of time, effort and frustration down the line.
Consider the core models
A recent KDNuggets poll [2] identified the top three data science and machine learning models used in 2017:
1. Regression: Curve-fitting to understand the relationship between variables
2. Clustering: Grouping records that are more similar to each other than to those in other groups
3. Decision trees: Identifying relationships in data as a series of rules
While none of these core models is particularly novel, they are effective at representing the relationships between the variables in the data, when such relationships exist. They are simple, they are interpretable, and they can provide a reliable baseline to measure other models against. These three core models should be included in any basic modeling toolkit. In fact, they are already included in many software packages and languages, which makes them easily accessible to data scientists.
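To make the idea of a baseline concrete, here is a minimal sketch of a one-variable regression fitted by ordinary least squares, together with an error measure that a later model could be compared against. The function names and the toy data are assumptions made up for illustration; they do not come from the paper or from any particular software package.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for a single input variable.

    Returns (slope, intercept) of the best-fitting straight line.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
baseline_predictions = [slope * x + intercept for x in xs]
baseline_mae = mean_absolute_error(ys, baseline_predictions)
```

Any more elaborate model would then need to beat `baseline_mae` on data it has not seen before the extra complexity is worth considering.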
To decide which core model to apply, you should take a look at your data. If you have a dataset containing input attributes (variables, characteristics) and a known output (outcome, target) associated with each record or observation, then you can follow a supervised learning approach. If the dataset does not contain a known output, then you can opt for unsupervised learning.
The three core modeling techniques outlined above cover both approaches: regression and decision trees can be applied for supervised learning, and clustering can be applied for unsupervised learning.
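The decision described above can be sketched as a small helper. This is only an illustration under assumptions: the function name, the record layout (a list of dictionaries) and the field names are hypothetical, and a real check would also look at data quality.

```python
def suggest_core_models(records, target_field):
    """Suggest a learning approach and matching core models.

    `records` is a list of dicts; `target_field` is the name of the
    known-output column, if the dataset has one (hypothetical layout).
    """
    has_known_output = bool(records) and all(
        record.get(target_field) is not None for record in records
    )
    if has_known_output:
        return "supervised", ["regression", "decision tree"]
    return "unsupervised", ["clustering"]


# Hypothetical records: one dataset with a known outcome, one without.
labeled = [{"income": 50, "defaulted": 0}, {"income": 20, "defaulted": 1}]
unlabeled = [{"income": 50}, {"income": 20}]

labeled_choice = suggest_core_models(labeled, "defaulted")
unlabeled_choice = suggest_core_models(unlabeled, "defaulted")
```

Here `labeled_choice` points to regression and decision trees, while `unlabeled_choice` points to clustering.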
Work out what doesn’t work
Once the model is complete, and you know what dataset you are dealing with, you should consider how the model will be used. This is where the business problem comes in again: will the model be used for insights and reporting, or will it be used in a production environment?
By looking at the business problem you can assess whether the model you selected will address it and whether the effort required to deploy the model will be worth it. The beauty of the core models lies in their consistency: they are often easier to deploy because their coding tends to be more straightforward.
If you start running into problems, then it’s time to consider whether the issue lies with your model or with the actual data. Check the data’s quality and characteristics, and double-check that the data is meaningful in the context of the problem to be solved. This in-depth understanding of data is the true differentiator of a data scientist: without it, they can’t confidently choose an effective modeling technique to apply.
When to consider different (and new) modeling techniques
If the core model isn’t delivering on what you want it to do, and there does not appear to be anything wrong with the data, then you might consider new modeling techniques. Some of these include:
- Deep Learning: Sophisticated machine learning techniques which identify very complicated patterns in the data.
- Ensemble: Combines simpler models into a single model that can pick up patterns no individual model can detect on its own.
- Elastic Net: A regularized regression model (combining the lasso and ridge penalties) that is designed to work well with high-dimensional data and many correlated variables.
- Automated Machine Learning: Sifts through a wide range of modeling algorithms and, based on the performance criteria, decides which models to use.
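As one illustration of the techniques listed above, the ensemble idea can be sketched by averaging the predictions of several simple models. This is a minimal sketch of one common way to combine models (plain averaging); the two component models below are hypothetical stand-ins invented for the example, not fitted models.

```python
from statistics import mean


def ensemble_predict(models, x):
    """Average the predictions of several simple models for one input."""
    return mean(model(x) for model in models)


# Two hypothetical simple models, invented for illustration only.
def linear_model(x):
    """Stand-in for a fitted regression line."""
    return 2.0 * x + 1.0


def rule_model(x):
    """Stand-in for a one-rule decision tree."""
    return 10.0 if x > 3 else 4.0


models = [linear_model, rule_model]
prediction = ensemble_predict(models, 4.0)  # (9.0 + 10.0) / 2 = 9.5
```

Other ensemble methods weight or sequence the component models differently, so averaging should be read as the simplest case rather than the definition.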
Again, you can start with the business problem. Explore the new techniques and see which one you believe will solve it best. You might even find that various characteristics of new techniques overlap with characteristics of core models, which means you may want to take another look at the core models to assess their ability to solve the problem at hand.
All that is left to do is to assess the model’s performance. Did it solve the business problem? If yes, then you can put the model into production. If not, then you should go back to the business problem and systematically work through the challenges, one by one, and consider what you need to do to address them.
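One way to frame this assessment is to compare the new model's error with the baseline error from a core model, and only move on if the improvement clears a threshold the business agrees on. The sketch below assumes a single error metric where lower is better; the function name and the 5% default threshold are hypothetical choices made for illustration.

```python
def is_worth_adopting(baseline_error, candidate_error, min_relative_gain=0.05):
    """Return True if the candidate beats the baseline by enough.

    Errors are "lower is better". `min_relative_gain` is an assumed
    threshold (5% by default) and should reflect the deployment effort.
    """
    if baseline_error <= 0:
        # A perfect baseline leaves nothing to improve on.
        return False
    relative_gain = (baseline_error - candidate_error) / baseline_error
    return relative_gain >= min_relative_gain


# Hypothetical error values, invented for illustration only.
large_gain = is_worth_adopting(0.20, 0.15)   # 25% improvement
small_gain = is_worth_adopting(0.20, 0.198)  # about 1% improvement
```

With these made-up numbers, `large_gain` is `True` and `small_gain` is `False`, so only the first candidate would justify the extra deployment work.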
So, are new modeling techniques worth it?
Core models provide a good starting point, and often the baseline they deliver is sufficient to satisfy the business objective. In such cases, a newer model might be overly complex or difficult to deploy. This doesn’t, however, mean you should avoid using new models altogether. The trick is in identifying when core models won’t solve your business problem, and making sure the required effort is worth the benefit. No model can fix issues with underlying data, so make sure you have the full picture before deciding. At the end of the day, it’s all about focusing on solving your business problem and using the best-suited model for your analysis. Just remember that core models are at the core of data modeling for a reason.