8 Steps of a Machine Learning Project
By Kristina Fan, CEO and Founder, and Roy Lowrance, Chief Scientist, 7 Chord
Previously we wrote about when to bother with machine learning. Once that bridge has been crossed, the question becomes: how much time and money should one budget for an experimental machine learning project? The real answer will drive your boss bananas, so let’s keep this between us.
Experiment. Fail. Learn. Repeat.
Building a machine learning system is a highly iterative and experimental process. It does go through the well-known phases we describe below, but what you find in one phase often changes decisions made in earlier phases, which means revisiting and refining them.
Makes sense, right? Well, in many organizations with a traditional linear software development mindset, the backward movement is interpreted as failure. Therefore, how you structure the program and build expectations within your organization will be critical. This cultural conflict may be the biggest reason why some organizations are more successful at adopting AI than others.
When you consider the cost of building an ML system, there are at least four main budgetary considerations: a) outright cost of development; b) computing power; c) risk of execution or project extension; d) maintenance. The risk of project extension is very hard to estimate, and it is the strongest argument for buying a machine learning application instead of building it in-house.
Method to the Madness.
If you do decide to “try machine learning at home”, here’s the actual roadmap we followed at 7 Chord, along with the effort it took us to build the commercial version of BondDroid™ 2.0, which we soft-launched in July 2018.
Overall Project Timeline: Jan 2016 – June 2018
- BondDroid Beta: Jan 2016 – June 2017. Fully functional batch system.
- BondDroid 1.0: June 2017 – Jan 2018. Fully functional real-time system, limited scope and delivery methods, basic cyber-security.
- BondDroid 2.0: Jan 2018 – July 2018. Enterprise-ready cloud-based app.
- BondDroid 3.0: Summer 2019. Stay tuned!
In each of the three phases we went through the eight steps below, although the last three were refinements rather than brand new decisions:
Pick the target: Determine the decisions that the modeling exercise will inform and ensure that those decisions can be implemented and will be economically viable. This is when the problem statement is agreed upon, and the accuracy objectives are set.
Pick the yardstick for success: The final product should be fit for purpose. It is very important to pick an accuracy measure and target that are achievable yet sufficient to add economic value.
Define Data Acquisition Strategy: Identify critical data sets and analyze the ecosystem of your data providers. Can this data be sourced elsewhere? What is your competitive relationship with the key data providers? Can the dataset be replaced later and if it is unique, what are the terms of your agreement with the data provider? Does your organization have unique sources of data because of your informational advantage?
Prepare Data: This least enjoyable step is critical. Most teams reduce it to the mechanics of data cleaning, normalization and defining a data management framework. But the designer also needs to determine to what extent the data reflects the economic reality, so that noise can be filtered out in a systematic manner. And if your system works in real time like ours, it is of paramount importance that the noise-filtering procedure can handle uncertainty about the validity of your data.
Build Model: The designer picks a machine learning method and estimates its likely accuracy, computational cost and latency in the operational environment.
Test and Evaluate: Test your system in the operational environment. Because improvements to the model will be ongoing, it is important to set a clear back-testing strategy and roll-back policy.
Deploy: Move the prediction engine from the evaluation environment to the operational environment. Continue to measure accuracy. Prepare contingency plans and be ready to implement them when the model’s performance degrades.
Maintain: Many forget to budget for this one. Implement procedures to ensure that the model is performing as expected. In our case this includes real-time and end-of-day accuracy reports, as well as voluminous diagnostics that constantly measure the health of the system.
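To make the yardstick, back-testing and roll-back steps above more concrete, here is a minimal sketch in Python. It is not how BondDroid does it: the choice of mean absolute error as the yardstick, the function names, the tolerance parameter and the numbers are all assumptions made up for illustration.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Yardstick (an assumed choice): average absolute prediction error."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def should_roll_back(current_error, candidate_error, tolerance=0.0):
    """Hypothetical roll-back rule: reject the candidate model if its
    back-test error exceeds the current model's error by more than
    `tolerance`."""
    return candidate_error > current_error + tolerance


# Back-test both models on the same held-out history (made-up numbers).
actual = [100.0, 101.0, 99.5, 100.5]
current_preds = [100.2, 100.8, 99.9, 100.4]
candidate_preds = [100.1, 101.1, 99.6, 100.5]

current_error = mean_absolute_error(actual, current_preds)
candidate_error = mean_absolute_error(actual, candidate_preds)

if should_roll_back(current_error, candidate_error):
    print("Keep the current model")
else:
    print("Promote the candidate model")
```

The same error function can be reused for the real-time and end-of-day accuracy reports, so that the number you monitor in production is the number you agreed on when you picked the yardstick.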
Our advice is to iterate around the project cycle as quickly as possible. The team with the most iterations, all other things being equal, has the best chance of producing the best model. Start with a crude prototype that will be easy to implement and move to the next level from there. Even if you are planning to outsource ML design, building a crude prototype in-house will be critical for a successful engagement.
It is important that your design team includes highly skilled software engineers from the start, as this will ensure that the project is implementable in practice. We will discuss the optimal composition of your machine learning squad in the next part of the series: Who to hire if you are building machine learning in-house?
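How crude can a first prototype be? As one illustration, assuming your problem is predicting the next value of a numeric series, a baseline that simply repeats the last observation takes minutes to build and gives every later model a number to beat. The function names and data below are hypothetical.

```python
def naive_last_value_forecast(series):
    """Crude baseline: predict each point as the previous observation.

    Returns predictions aligned with series[1:].
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    return list(series[:-1])


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


history = [10.0, 10.5, 10.25, 11.0]  # made-up observations
preds = naive_last_value_forecast(history)
baseline_error = mean_absolute_error(history[1:], preds)
print(f"Baseline error to beat: {baseline_error}")
```

Each later iteration can then be judged by whether it beats this baseline by enough to justify its extra cost.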