Where should we expect there to be disruption in the next 5 years, and specifically across enterprise-grade machine learning? Additionally, what areas will companies likely focus on building, and what will be acquired by large incumbents?
To understand this, we need to know where we are today and what has been accomplished to date. Machine and deep learning have come a long way: we now have mature software packages, internal practices, and built-out infrastructure. Enterprises are picking this up by engaging externally as well as developing internally, and each area has its own level of maturity. Yet not everything is solved, and very few large firms are currently 'getting it right'. For the firms that are, I look at Apple's M&A activity with Xnor, Drive, and Voysis, and at their internally developed applications. Other technologies are also speeding up this space: cloud computing, better enterprise data warehouse strategies, and the proliferation of knowledge as researchers (hands-on-keyboard developers) gain notoriety and become stars in their own circles.
With this, enterprise machine and deep learning can be bucketed into three categories. We'll focus on development through deployment, not post-deployment (analytics, consumption, usage, etc.). I took these buckets from an excellent paper titled 'Cloudy with a high chance of DBMS', whose categorization I found very sound.
Model training and development: phase 1
There are some good tools here already (scikit-learn and CARRoT), and training in the cloud has proliferated. The community in this bucket is expanding to where data scientists are developing sophisticated processes for data exploration, feature engineering, model training, model selection, and model deployment (for example, MLflow and Apache Calcite). Training and development benefit most from large centralized data sets (enablers like Snowflake), scalable computing power (Amazon's EC2 G4 instances), and the latest hardware (enablers like Intel and their FPGA designs). This section is well covered but still under disruption, and a few winners have already begun to appear (DataRobot, H2O.ai, and SageMaker). What we can expect moving forward is that all models will be developed and trained in the cloud. The companies that master this with an enterprise salesforce will likely win.
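To make the phase-1 workflow concrete, here is a minimal sketch of data preparation, model training, and model selection using scikit-learn (the dataset and hyperparameter grid are illustrative, not from any particular enterprise setup):

```python
# A toy end-to-end phase-1 loop: feature scaling, training, and model
# selection via cross-validated grid search. Synthetic data stands in
# for an enterprise dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature engineering and the model live in one pipeline, so the whole
# artifact can be deployed as a single unit.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Model selection: search over regularization strength with 5-fold CV.
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)               # chosen hyperparameters
print(search.score(X_test, y_test))      # held-out accuracy
```

Tools like MLflow wrap exactly this kind of loop with experiment tracking and artifact storage, which is what makes moving it wholesale into the cloud straightforward.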
Model scoring: phase 2
In an oversimplified view, machine learning boils down to linear algebra, the sub-field of mathematics concerned with matrices, vectors, and linear transforms. Linear concepts span tabular datasets, one-hot encoding, and dimensionality reduction, and are embedded in recommender systems, NLP, and deep learning, to name a few. What's important to understand is that you don't need to be an ace in linear algebra to be effective at machine learning, and therefore at model scoring. If you want to squeeze more performance out of your models, though, strong linear algebra is non-negotiable.
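A small illustration of the point (the categories and weights are made up for the example): scoring a linear model over a batch of one-hot encoded rows is nothing more than a matrix-vector product.

```python
# Model scoring as linear algebra: one-hot encode categorical inputs,
# then score the whole batch with a single matrix-vector product.
import numpy as np

# One-hot encode a categorical column with three levels.
categories = ["red", "green", "blue"]
rows = ["green", "blue", "red", "green"]
X = np.array([[1.0 if c == r else 0.0 for c in categories] for r in rows])

# Weights and bias a trained linear model might have (illustrative values).
w = np.array([0.2, -0.5, 0.9])
b = 0.1

scores = X @ w + b   # the entire scoring step is one linear transform
print(scores)        # one score per input row: [-0.4, 1.0, 0.3, -0.4]
```

This is why scoring engines can be pushed into databases and edge devices: the runtime work is matrix arithmetic, not the full training framework.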
Usually, models are trained centrally and then deployed. But edge computing (being disrupted by companies like Mutable), federated learning (e.g., DataFleets), and data virtualization (e.g., AtScale) are all changing how models will be deployed in the future, on what data, and which inferences will even be necessary. This progression is changing how inference runs on data stored in DBMSs, and showing that models may not need to be trained centrally any longer. This area is still ripe for disruption. It's likely that in the future all models will be scored and stored in management environments (DBMSs). There are many providers in this space, but clear winners have yet to emerge.
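To see why central training may become optional, here is a rough sketch of federated averaging, the core idea behind federated learning: each site trains on its own data and only model weights, never raw rows, are shared and averaged. Everything here (the least-squares objective, the two-client setup) is an illustrative assumption, not any vendor's actual protocol.

```python
# Federated averaging on a toy least-squares problem: two "sites" hold
# private data; a server only ever sees their locally trained weights.
import numpy as np

def local_update(w, X, y, lr=0.1, steps=50):
    """One client's local gradient-descent steps on mean squared error."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Each client's data never leaves its site.
clients = []
for _ in range(2):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(10):  # communication rounds
    local_ws = [local_update(w.copy(), X, y) for X, y in clients]
    w = np.mean(local_ws, axis=0)  # server averages the weights

print(w)  # approaches true_w without ever pooling the data
```

The design choice worth noticing is what crosses the network: weight vectors of fixed size, regardless of how much data each site holds.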
Model governance and management (MLOps): phase 3
In my opinion, this sub-sector is the hotbed for disruption over the coming years, especially for startups and scaleups looking to build products at enterprise scale. Governance has yet to be fully addressed, but a lot of work is going into it.
The questions of how we secure, manage, access, permission, and roll over to new versions of models are still up for debate. You may think this must already be solved, but that maturity exists in software development, not yet in machine learning. Current efforts focus on improving the algorithms and training infrastructure themselves, but there is also a need for secure data access, provenance tracking, version management, and tools to surface possible model bias. Moving forward, provenance needs to be collected across all phases of a model's life; laws like GDPR (and potential lawsuits) only amplify this need.
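As a sketch of what that provenance collection might look like, here is a minimal, assumed design (not an existing product's API) for the kind of record MLOps tooling needs: which data snapshot, which code version, and who trained the model, hashed so the lineage is tamper-evident.

```python
# A toy provenance record for a trained model: the fields and names are
# hypothetical, chosen to show what a lineage entry must capture.
import hashlib
import json
import datetime

def provenance_record(model_name, version, data_hash, code_commit, trainer):
    record = {
        "model": model_name,
        "version": version,
        "data_hash": data_hash,      # hash of the training data snapshot
        "code_commit": code_commit,  # VCS revision of the training code
        "trained_by": trainer,
        "trained_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    # Hash the canonical JSON form so any later edit changes the id.
    payload = json.dumps(record, sort_keys=True).encode()
    record["record_id"] = hashlib.sha256(payload).hexdigest()
    return record

rec = provenance_record("churn-model", "1.3.0",
                        data_hash="sha256-of-snapshot",
                        code_commit="9f8e7d6",
                        trainer="data-science-team")
print(rec["record_id"])  # stable id stored alongside the model artifact
```

Appending one such record per phase (training, validation, deployment, rollback) is the whole-lifecycle provenance trail that GDPR-style accountability asks for.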
What the aforementioned alludes to is that triple-triple-double-double-double (TTDDD) companies will arise, if Amazon doesn't blitzscale first. We're seeing this in cloud computing today: Google arguably has the best product right now, but its enterprise sales team is playing catch-up to Bezos and Nadella. This innovation will come from a company that doesn't exist yet (or a current seed to Series C company) that can pivot its business model based on the research being released. These companies will also hit fringe use cases, doing one thing very well and taking a strictly vertical rather than horizontal approach.
Every year new research emerges, challenging previously assumed bottlenecks and opening horizons that once seemed theoretically out of reach. I'm personally excited to see companies implement new research in products consumed by enterprises at scale, where we feel and see the end product. For example, the research from an MIT professor and Ph.D. student on neural network pruning, dubbed the Lottery Ticket Hypothesis, is the type I find very interesting. If you know of a team implementing this work, please let me know; I would love to talk to them!
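For readers unfamiliar with the idea, here is a rough sketch of the iterative magnitude pruning at the heart of the Lottery Ticket Hypothesis: train, prune the smallest-magnitude weights, and rewind the survivors to their initial values. NumPy and random noise stand in for a real deep-learning framework and an actual training run.

```python
# Magnitude pruning with weight rewinding, in miniature. A flat weight
# vector stands in for a network; "training" is simulated with noise.
import numpy as np

rng = np.random.default_rng(42)
w_init = rng.normal(size=100)  # initial weights, saved for rewinding

# Stand-in for a training run (a real run would use SGD on a loss).
w_trained = w_init + rng.normal(scale=0.5, size=100)

# Prune the 20% smallest-magnitude trained weights via a binary mask.
threshold = np.quantile(np.abs(w_trained), 0.20)
mask = np.abs(w_trained) >= threshold

# Rewind surviving weights to their initial values: the "winning ticket"
# that, per the hypothesis, trains to comparable accuracy in isolation.
ticket = w_init * mask
print(mask.mean())  # fraction of weights kept, ~0.8
```

Repeating this prune-and-rewind loop over several rounds is what yields the very sparse subnetworks the paper reports; the payoff for enterprises is smaller, cheaper models at serving time.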