Why do your machine learning strategies fail? In this post, I will try to give you some perspective on the main causes.
Machine learning has emerged as a must-have tool for every professional data team, capable of supplementing processes, making better and more accurate predictions, and overall boosting our capacity to use data. However, discussing ML applications in principle differs from actually using ML models in production – at scale.
Machine learning (ML) has evolved from a theoretical concept into a powerful tool that most of us use every day, from planning your weekend bike route on Google Maps to discovering your next binge-worthy program on Netflix.
According to Gartner’s October 2020 report on technology trends, just 53% of initiatives made it from prototype to production, and that is at firms with some level of AI experience. The failure rate is likely far higher at firms still attempting to build a data-driven culture, with some estimates reaching around 90%.
Data-first digital businesses like Google, Facebook, and Amazon are using machine learning to revolutionize daily life, yet many other well-funded and highly skilled teams are still battling to get their efforts off the ground. Why is this happening? And how can we get it right?
1. Asking The Wrong Question:
If you ask the wrong questions, you will get the wrong answers. An example comes to mind from the finance business: detecting fraud. The first question asked may be, “Is this transaction fraudulent or not?” To make that judgment, you will need a dataset containing instances of both fraudulent and non-fraudulent transactions on which to train and test a model.
This dataset is built with human involvement. In other words, a group of subject matter experts (SMEs), in our case experts in fraud detection, establish the classification of the data. In the dataset, a true label corresponds to a fraudulent transaction and a false label to a non-fraudulent one.
Therefore, the experts categorize this dataset based on the fraudulent activity they have encountered in the past. As a result, the model trained on this data will only detect fraud that satisfies the traditional pattern of deception. If a bad actor devises a new method of committing fraud, our system will be incapable of detecting it.
As an alternative, “Is this transaction odd or not?” could be a better question to ask. It would not check for a transaction that has already been confirmed as dishonest, but rather for any transaction that does not fit the typical signature of a legitimate one.
Even the strongest fraud detection algorithms depend on people to review predicted fraudulent transactions and validate the model’s conclusions. One disadvantage of this technique is that it is more prone to false positives than the previous approach.
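As a rough illustration of the anomaly-style framing, here is a minimal sketch (the transaction amounts and the 3-sigma threshold are invented for the example) that flags a new transaction when it falls far outside the signature of past ones, rather than matching it against known fraud patterns:

```python
import statistics

def is_unusual(amount, history, threshold=3.0):
    """Flag a new transaction if it deviates strongly from the
    typical signature of past transaction amounts."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    return abs(amount - mean) > threshold * stdev

# Hypothetical routine past amounts for one account.
history = [20, 35, 18, 42, 25, 30, 22, 27]

is_unusual(28, history)    # fits the signature -> False
is_unusual(5000, history)  # far outside it     -> True
```

A real system would use a proper anomaly detector over many features, but the framing is the same: score deviation from normal instead of matching previously confirmed fraud.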
2. Trying To Use Machine learning To Tackle The Wrong Problem:
A typical error here is failing to focus on the business use case. When developing your requirements, keep this question in mind at all times: “Will using machine learning to solve this challenge offer significant value to the business?” When you divide your challenge into subtasks, the initial tasks should focus on answering this question.
For example, suppose you have a brilliant concept for an Artificial Intelligence product and want to start selling it. Assume it is a service in which you submit a full-body image to a website, and the AI software calculates your dimensions to make a fitted suit for your body type and size. Let us discuss some of the tasks that will be required to complete this project:
- Create AI/ML technology to calculate body measurements from photographs.
- Create a website and a mobile app to engage with your consumers.
- Conduct a feasibility study to assess whether or not there is a market for this service.
Technologists, data scientists, and coders are all eager to design and code. We love it. Therefore, they may be tempted to begin with the first two tasks. “Let us do a quick prototype to see how it works” is something said quite often.
As you can understand, business-wise it would be a terrible error to conduct the feasibility study only after completing the first two tasks, and then have the research reveal that there is no market for the idea. A lot of resources get wasted because of bad planning. It seems like an obvious example, but bad planning happens more often than it should.
The root of keeping a machine learning strategy from failing is asking, “Is machine learning necessary to solve this business problem?”
3. Not Having Enough Data:
Some of my projects have been in the life sciences area, and one issue we have encountered is the inability to access specific data at any cost. Because the life sciences sector is extremely concerned about the storage and transmission of protected health information (PHI), most accessible datasets strip this information out.
A person’s location, for example, may have a statistically significant influence on their health: someone from Mississippi may be more likely than someone from Connecticut to develop diabetes. However, because this information may not be at hand, we cannot use it.
Another example is from the finance sector. This area contains some of the most fascinating and essential datasets, but much of this information is very sensitive and well-guarded, and access to it may be severely restricted. Relevant findings, however, will be impossible to obtain without this access. All that to say: not having enough data can cause your machine learning strategies to fail.
4. Not Having The Correct Data:
Even if you have the best models, using inaccurate or dirty data can lead to poor predictions. Supervised learning, in particular, requires labeled data.
In many situations, this labeling is done by hand, which can introduce errors. As an extreme hypothetical example, consider the MNIST dataset for a minute. Assume our model classifies its handwritten-digit images with 100 percent accuracy when tested against it. Now, when that same model encounters pictures of human faces in production, it is only 20 percent accurate.
Even though this is a simplistic example, training your model on the wrong data, data that does not reflect all the possible inputs it will see in production, leads to a bad product. And a bad product is a failed machine learning project.
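A toy sketch of that failure mode (all numbers and distributions invented): a simple classifier that is near-perfect on its training distribution collapses when production data comes from somewhere it never saw:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: class 0 clustered around -1, class 1 around +1.
x_train = np.concatenate([rng.normal(-1, 0.3, 500), rng.normal(1, 0.3, 500)])
y_train = np.concatenate([np.zeros(500), np.ones(500)])

# Learned rule: predict 1 when x is above the midpoint of the class means.
mid = (x_train[y_train == 0].mean() + x_train[y_train == 1].mean()) / 2

def predict(x):
    return (x > mid).astype(int)

train_acc = (predict(x_train) == y_train).mean()  # near 1.0

# Production data drifted: class 0 examples now appear around +2,
# a region the training data never covered.
x_prod = rng.normal(2, 0.3, 500)
y_prod = np.zeros(500)
prod_acc = (predict(x_prod) == y_prod).mean()  # near 0.0
```

The model did nothing wrong by its own lights; the training data simply did not represent what production would look like.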
5. Having Too Much Data:
There is no such thing as too much data in principle. In practice, despite enormous gains in storage and processing cost and performance, we are still constrained by physical time and space limits.
So, one of the most significant duties a data scientist has right now is to carefully choose the data sources that they believe will influence generating accurate model predictions.
Assume you are attempting to forecast the weight of a baby at delivery. Intuitively, the mother’s age appears to be an essential feature to include; her name is unlikely to be significant, but her address may be.
The MNIST dataset is another example that springs to mind. The majority of the information in the MNIST photos is in the middle of the image, so you can probably get away with eliminating the border without losing much information.
Again, human involvement and judgment were required in this scenario to determine that eliminating a particular amount of border pixels would have a negligible influence on predictions. Another method for reducing dimensionality is to employ Principal Component Analysis (PCA) and T-distributed Stochastic Neighbor Embedding (t-SNE).
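The border-trimming idea is a hand-crafted dimensionality reduction; PCA automates the same instinct. Here is a small sketch with synthetic data (the shapes and noise levels are arbitrary) showing that when most of the variance lives in a few directions, projecting onto those directions loses almost nothing:

```python
import numpy as np

rng = np.random.default_rng(42)

# 200 samples in 10 dimensions, but the signal lives in only 2 directions.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + rng.normal(scale=0.01, size=(200, 10))  # tiny noise

# PCA via SVD of the centred data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)  # fraction of variance per component

# The top 2 components capture nearly all the variance, so we can
# project from 10 dimensions down to 2 with negligible loss.
X_reduced = Xc @ Vt[:2].T
```

Like the MNIST border example, this is a judgment call automated: PCA decides which directions carry negligible information instead of a human deciding which pixels do.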
Determining which of these traits will be helpful before running the models is still a difficult challenge for computers, but it is an area rife with opportunities to automate the process. Meanwhile, having too much data remains a significant stumbling block that can ruin any data science attempt.
Funnily enough, just as not having enough data can cause your machine learning strategies to fail, having too much data can have the same effect.
6. Hiring The Wrong People:
You would not trust your doctor to fix your automobile, nor would you trust a mechanic to perform your colonoscopy.
If you have a tiny data science practice, you may need to rely on one or a few employees to handle everything. That includes data collection and procurement, data cleaning and munging, feature engineering and creation, model selection, and model deployment in production.
However, as your company expands, you should think about recruiting specialists for each of these duties. The abilities necessary to be an ETL development expert may not always coincide with those important to be a Natural Language Processing (NLP) expert. Furthermore, having extensive subject knowledge may be helpful, if not vital, in different fields, such as biotech and finance.
Having a subject matter expert (SME) and a data scientist with strong communication skills, on the other hand, may be an acceptable option. Having the necessary specialized resources as your data science business expands is a difficult balancing act, but having the right resources and talent pool is one of the most crucial elements to your firm’s success.
7. Using The Wrong Tools:
In this context, many cases spring to mind. A common issue is the phrase “I have a hammer, and everything seems like a nail now.”
Here is an example from the industry: You recently sent your employees to MySQL training, and now that they are back, you need to set up an analytics pipeline.
They recommend using their shiny new tool while the training is still fresh in their minds. However, depending on the quantity of data your pipeline will handle and the amount of analytics you will need to run on the results, this option may be the wrong one for the task.
Many SQL services impose a hard restriction on the amount of data that can be stored in a single table. In this instance, a NoSQL service like MongoDB or a highly scalable columnar database like AWS Redshift may be a preferable option.
Not discussing which technology stack is correct for a project can cause your machine learning strategies to fail.
8. Not Having The Correct Model:
A model is a distilled version of reality; it cuts away extra fluff, noise, and complexity. A good model enables its users to concentrate on the critical features of reality that are relevant in a given area.
Keeping characteristics such as client email and address, for example, may be helpful in a marketing program.
In a medical environment, however, a patient’s height, weight, and blood type may be more essential. These simplifications rest on assumptions, which may hold in some cases but not in others. This implies that a model that works well in one context may not work well in another.
There is a well-known result in mathematics about this. According to the “No Free Lunch” (NFL) theorem, there is no one model that works best for every situation. Because the assumptions of a successful model for one domain may not hold for another, it is usual in data science to iterate over numerous models to discover the one that best suits a specific circumstance.
This model specificity is especially true in the case of supervised learning. Validation or cross-validation is a technique that is widely used to compare the predicted accuracy of different models of various complexity to determine the best model.
Furthermore, a given model may be trained using several techniques; for example, linear regression can be fitted using the normal equations or gradient descent.
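As a quick illustration of that point, here is a sketch (synthetic data; the learning rate and iteration count are arbitrary choices) fitting linear regression both ways and arriving at essentially the same weights:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y = 2 + 3x plus a little noise; first column is the bias.
X = np.c_[np.ones(100), rng.normal(size=(100, 1))]
y = X @ np.array([2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Route 1: the normal equation, a single linear solve.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: batch gradient descent on the same least-squares objective.
w_gd = np.zeros(2)
lr = 0.1
for _ in range(2000):
    grad = X.T @ (X @ w_gd - y) / len(y)  # gradient of mean squared error
    w_gd -= lr * grad

# Both routes recover approximately the same weights.
```

The trade-off the text describes shows up even here: the closed form is exact but requires solving a system that grows with the number of features, while gradient descent scales to huge feature sets at the cost of tuning and iteration.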
Depending on the application, it is necessary to understand the trade-offs between speed, accuracy, and complexity of various models and algorithms and to employ the best model for a specific domain.
9. Not Having The Proper Yardstick (correct evaluation metric):
It is essential in machine learning to evaluate the performance of a trained model. It is critical to assess how well the model performs against both training and test data. We use this information to choose the model to use, the hyperparameter settings, and whether or not the model is ready for production use.
It is critical to select the correct assessment measures for the job at hand when measuring model performance. There is much literature on metric selection, therefore we will not go into detail here, but essential criteria to consider when picking metrics are:
- The type of machine learning problem: supervised learning, unsupervised learning, or reinforcement learning.
- The type of supervised learning: binary classification, multiclass classification, or regression.
- The dataset type: If the data set is imbalanced, different metrics might be more suitable.
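A tiny illustration of the imbalanced-dataset point (the counts are made up): on data that is 98% negative, a model that always predicts “not fraud” scores high accuracy while catching zero fraud, which is why metrics like recall or F1 matter here:

```python
# 1000 transactions, only 20 of them fraudulent (label 1).
y_true = [1] * 20 + [0] * 980
y_pred = [0] * 1000  # the "always predict not-fraud" model

# Accuracy looks great: 980 of 1000 predictions are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.98

# Recall tells the real story: zero fraudulent transactions caught.
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)  # 0.0
```

On an imbalanced problem like fraud detection, choosing accuracy as the yardstick would declare this useless model a success.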
Final Words on Why Machine Learning Strategies Fail:
Overall, machine learning models can help people in ways they could never imagine. Countless firms have benefited from it, and in the coming years, more businesses will implement some form of machine learning in their business processes.
However, for businesses to succeed and make the most of ML models, a solid foundation must be established in the first place. Checking the data quality, finding the proper staff, and paying them properly, in conjunction with managerial assistance, may all aid in achieving the desired results.
Keep the above reasons in mind during your next machine learning project’s strategy meeting.
Content adapted from Alberto Artasanchez
If you made it this far in the article, thank you very much.
I hope this information was of use to you.
Feel free to use any information from this page. I’d appreciate it if you can simply link to this article as the source. If you have any additional questions, you can reach out to [email protected] or message me on Twitter. If you want more content like this, join my email list to receive the latest articles. I promise I do not spam.
If you liked this article, maybe you will like these too.
Are Data Science and Machine Learning the same?
Machine learning in Cybersecurity? [in 2021]
Top 10 Exploratory Data Analysis (EDA) libraries you have to try in 2021.
SQL Data Science: Most Common Queries all Data Scientists should know