Five key things to remember when implementing machine learning
Machine learning is ubiquitous. Online product recommendations and discount offers are already influencing daily shopping behaviours of millions of consumers across the globe. With growing adoption of machine learning, algorithms are increasingly taking control of critical decisions involving health insurance, medical diagnosis, mortgage approval and even job applications.
When done right, machine learning can help businesses improve revenue and profitability. However, improper deployment of machine-learning models can impair business performance, damage brand reputation and lead to lawsuits and fines from regulators. It is therefore important to keep five things in mind when implementing machine learning:
1. Invest in a robust data ecosystem
Data is key to machine learning. Data scientists widely acknowledge that more data usually beat cleverer machine-learning algorithms when it comes to producing better insights and predictions.
Modern businesses, especially in financial services, retail and healthcare, use many applications that generate and consume data. Availability of data of the right quality at the right time is often a concern, as data are usually distributed across multiple systems of record, customer engagement applications, operational data stores and data warehouses. Sometimes, the process of data collection may introduce latent biases that can be quite hard to detect. Metadata, such as data lineage, are very valuable for data scientists. Without a robust data ecosystem, it may be very challenging to obtain quality data, both for building machine-learning models and for running them once built.
2. Establish success criteria that business stakeholders understand
All machine-learning models are optimised against agreed success criteria for a specific goal. Hence, it is important to have a clear understanding of the business objectives to determine acceptable performance criteria. For a spam filter, a false negative (spam goes to the inbox) is more acceptable than a false positive (non-spam is marked as spam). Conversely, for fraud detection, a false positive (a normal transaction marked as possible fraud) is more acceptable than a false negative (a fraudulent transaction remains undetected). A medical diagnostic test to identify cancer needs to minimise false positives, that is, people who do not have cancer should not be diagnosed as positive. A false positive would be extremely stressful for a patient. The test should also avoid false negatives, that is, people who have cancer should not be diagnosed as negative. Otherwise, a person with cancer would miss timely detection and treatment.
It is important for data scientists to ensure that business stakeholders understand the nuances and trade-offs of a model under different operating conditions. Visualisation of model performance using easy-to-understand graphs is a great way to communicate the results to business stakeholders who may not be data scientists.
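The trade-off between false positives and false negatives can be made concrete with a small sketch. It assumes a model that outputs a score between 0 and 1 and a decision threshold chosen by the business. The scores, labels and thresholds are made-up illustrative values, and `confusion_counts` is a hypothetical helper rather than part of any library.

```python
def confusion_counts(scores, labels, threshold):
    """Count outcomes when every score >= threshold is flagged as positive."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            counts["tp"] += 1   # fraud correctly flagged
        elif flagged and label == 0:
            counts["fp"] += 1   # normal transaction flagged as fraud
        elif not flagged and label == 1:
            counts["fn"] += 1   # fraud that slipped through
        else:
            counts["tn"] += 1   # normal transaction left alone
    return counts


# Hypothetical fraud scores and true labels (1 = fraud, 0 = normal).
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

# A low threshold catches all the fraud but raises one false alarm.
low_threshold = confusion_counts(scores, labels, threshold=0.35)

# A high threshold raises no false alarms but misses one fraud case.
high_threshold = confusion_counts(scores, labels, threshold=0.70)

print(low_threshold)
print(high_threshold)
```

Showing stakeholders these counts at several thresholds is one simple way to discuss which kind of error the business would rather tolerate.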
3. Data preparation is the key to successful models
Data preparation includes data cleaning such as correcting invalid values, removing duplicates, filling in missing values and correcting known biases. Removing highly correlated input variables is a good way to reduce redundancy and noise in the data. Various techniques are used to transform input data to generate useful attributes (also known as features) that make the data better suited to the learning algorithms. This process is commonly referred to as ‘feature engineering’ and is a way to add more meaning to the data in a format that algorithms can understand. It is both a science and an art that can be perfected with experience and domain knowledge. For example, we can derive age from the date of birth for generating music recommendations. Similarly, it might be useful to extract the day of the week or hour of the day from a date for predicting energy demand. Another frequently used technique is to segment continuous values into multiple buckets or bins. Relatedly, categorical values are often converted into separate binary indicator columns (commonly known as ‘one-hot encoding’). This prevents a machine from inadvertently assigning different weights based on the absolute values of numeric codes that actually represent different classes or categories.
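The feature-engineering examples above can be sketched in a few lines using only the Python standard library. This is a minimal illustration, not a production pipeline: the function names, the sample dates and the category list are all hypothetical.

```python
from datetime import date, datetime


def age_in_years(date_of_birth, today):
    """Derive age from a date of birth, as of the given day."""
    years = today.year - date_of_birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def time_features(timestamp):
    """Extract day of week (Monday = 0) and hour of day from a timestamp."""
    return {"day_of_week": timestamp.weekday(), "hour": timestamp.hour}


def one_hot(value, categories):
    """Encode a categorical value as one binary indicator per category."""
    return [1 if value == category else 0 for category in categories]


# Illustrative inputs only.
age = age_in_years(date(1990, 6, 15), today=date(2024, 6, 14))
demand_features = time_features(datetime(2024, 1, 1, 18, 30))
sector = one_hot("retail", ["banking", "healthcare", "retail"])

print(age)              # 33
print(demand_features)  # {'day_of_week': 0, 'hour': 18}
print(sector)           # [0, 0, 1]
```

With one-hot encoding, no category is numerically "larger" than another, which is the point of the technique.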
It is good practice to avoid too many features in a model. For example, when building a model for credit card offer recommendation, it might be tempting to include attributes such as make and model of car, number of children and size of household, which may not have any material impact on the model. In fact, too many features generally increase noise and lead to complex models that do not work well on new data sets. Moreover, as the number of input features grows, the amount of data needed to train a model can grow exponentially.
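One informal way to see why data requirements can grow so quickly is to count feature combinations. The sketch below assumes, purely for illustration, that each feature is divided into ten ranges and that we would like at least one training example for every combination of ranges. Both the function name and the choice of ten bins are hypothetical.

```python
def combinations_to_cover(n_features, bins_per_feature=10):
    """Number of distinct range combinations across all features."""
    return bins_per_feature ** n_features


# Each extra feature multiplies the number of combinations by ten.
for n_features in (1, 2, 3, 6):
    print(n_features, combinations_to_cover(n_features))
```

Under this assumption, three features already give 1,000 combinations and six give a million. Real models do not need every combination covered, so this is an upper-bound intuition rather than a sizing rule.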
4. Third-party APIs can simplify processing of unstructured data
For handling unstructured data such as text, it is often advisable to use off-the-shelf third-party APIs (Application Programming Interfaces) rather than building home-grown text-mining solutions. Tasks such as part-of-speech tagging and named-entity recognition require an annotated corpus of tens of thousands of documents, and the meticulous annotation, tagging and labelling involved may take months. Cloud-based APIs such as the Google Cloud Natural Language API and the Microsoft Text Analytics API offer text-mining implementations that can be used to extract useful information from unstructured text, such as people, places, events and sentiment.
5. Do not ignore model governance
Machine-learning model evaluation and deployment should follow appropriate governance and monitoring processes. We may need to periodically retrain existing models to learn from recent data. Any updates should follow an established route-to-live process before deployment in production. Historical training and test data should be preserved to compare performance with old models when testing new changes.
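As a minimal sketch of one such governance check, a retrained model could be compared with the current one on the preserved test data before it is promoted. Everything here is a hypothetical stand-in: the two "models" are simple threshold rules, the test set is made up, and accuracy is used only because it is the simplest metric to show.

```python
def accuracy(model, examples):
    """Fraction of (input, label) examples the model predicts correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


def should_promote(candidate, current, preserved_test_set, min_gain=0.0):
    """Promote only if the candidate is at least as good as the current model."""
    candidate_score = accuracy(candidate, preserved_test_set)
    current_score = accuracy(current, preserved_test_set)
    return candidate_score >= current_score + min_gain


def current_model(score):
    # Stand-in for the model in production.
    return int(score >= 0.7)


def candidate_model(score):
    # Stand-in for the retrained model.
    return int(score >= 0.5)


# Hypothetical preserved test data: (input score, true label).
preserved_test_set = [(0.9, 1), (0.6, 1), (0.4, 0), (0.2, 0)]

promote = should_promote(candidate_model, current_model, preserved_test_set)
print(promote)  # True: the candidate scores 1.0 against 0.75
```

In practice the comparison would use the success criteria agreed with business stakeholders rather than plain accuracy.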
Realising a return on machine-learning investments requires planning and disciplined follow-through. It also requires equipping employees with the skills to adjust to rapid and ongoing technology changes. Keeping the above five points in mind will ease the adoption of machine learning and set organisations on the path to harnessing its full potential.
Saurabh Banerjee is senior specialist at Sapient Consulting. Views are personal.