Data > AI Framework/Architecture

Source: Medium, Feb 2019

At its essence, machine learning works by extracting information from a dataset and transferring it to the model weights. A better model is more more efficient at this process (in terms of time and/or overall quality), but assuming some baseline of adequacy (that is, the model is actually learning something) better data will trump a better architecture.

To illustrate this point, let’s have a quick and dirty test. I created two simple convolutional networks, a “better” one, and a “worse” one. The final dense layer of the better model had 128 neurons, while the worse one had to make due with only 64. I trained them on subsets of the MNIST dataset of increasing size, and plotted the models’ accuracy on the test set vs the number of samples they were trained on.

The positive effect of training dataset size is obvious (at least until the models start to overfit and accuracy plateaus). My “better” model, blue line, clearly outperforms the “worse” model, green line. However, what I want to point out is that the accuracy of the “worse” model trained on 40 thousand samples is better than of the “better” model at 30 thousand samples!

Good engineering is always important, but if you are doing AI, the data is what creates the competitive advantage. The billion dollar question is, however, if you are going to be able to maintain your advantage.

Creating a dataset, however, is a different sort of problem. Usually, it requires a lot of manual human labor — and you can easily scale it by hiring more people. Or it could be that someone has the data — then all you have to do is pay for a license. In any case — the money makes it go a lot faster.

A much better option is to treat AI as a lever. You can take an existing, working business model and supercharge it with AI. For example, if you have a process which depends on human cognitive labor, automating it away will do wonders for your gross margins. Some examples I can think of are ECG analysisindustrial inspectionsatellite image analysis. What is also exciting here is that, because AI stays in the backend, you have some non-AI options to build and maintain your competitive advantage.

Building AI models may be very interesting, but what really matters is having better data than the competition. Maintaining a competitive advantage is hard, especially if you encounter a competitor who is richer than you, which is very likely to happen if your AI idea takes off. You should aim to create a scalable data collection process which is hard to reproduce by your competition. AI is well suited to disrupt industries which rely on the cognitive work of low qualified humans, as it allows to automate this work.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.