What No One Tells You About Real-Time Machine Learning

This year, I have heard and read a lot about real-time machine learning. People usually present it through an appealing business scenario: credit card fraud detection. They say that they can continuously update a credit card fraud detection model in real time (see “What is Apache Spark?”, “…real-time use cases…” and “Real time machine learning”). It sounds fantastic, but it does not look realistic to me. One important detail is missing from this scenario: a continuous flow of transactional data is not what model retraining needs. Instead, you need a continuous flow of labeled transactional data, that is, transactions already marked as Fraud or Not-Fraud.

Machine learning process

Creating labeled data is probably the slowest and most expensive step in most machine learning systems. Machine learning algorithms learn to detect fraudulent transactions from people, and people’s knowledge reaches the algorithm in the form of labeled data. Let’s see how this works in the fraud detection scenario.

1. Creating model

To train a credit card fraud model, you need a lot of example transactions, and each transaction should be labeled as Fraud or Not-Fraud. These labels have to be as accurate as possible! This is our labeled data set, and it is the input for supervised machine learning algorithms. Based on the labeled data, the algorithm trains the fraud detection model. The model is usually a binary classifier with True (Fraud) and False (Not-Fraud) classes.

The labeled data set plays a central role in this process. It is very easy to change the parameters of our algorithm such as the feature normalization method or loss function. We can change the algorithm itself from logistic regression to SVM or random forest for example. However, you cannot change the labeled data set. This information is predefined and your model should predict the labels that you already have.
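To make this concrete, here is a minimal sketch of swapping one of those parameters, the feature normalization method, while the labels stay exactly the same. Everything in it is an assumption made for illustration: the transactions are invented, the two features (amount and hour of day) are hypothetical, and a simple nearest-centroid classifier stands in for logistic regression, SVM or random forest to keep the example dependency-free.

```python
from statistics import mean, pstdev

# Hypothetical labeled transactions: (amount_usd, hour_of_day) plus a fixed label.
FEATURES = [(12.0, 14), (30.0, 10), (25.0, 18), (900.0, 3), (750.0, 2), (18.0, 12)]
LABELS = [False, False, False, True, True, False]  # True = Fraud


def min_max(column):
    """Rescale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def z_score(column):
    """Rescale one feature column to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def normalize(rows, method):
    """Apply one normalization method to every feature column."""
    columns = [method(list(col)) for col in zip(*rows)]
    return [tuple(row) for row in zip(*columns)]


def fit_centroids(rows, labels):
    """Nearest-centroid 'training': average the feature vectors of each class."""
    centroids = {}
    for cls in (True, False):
        members = [r for r, label in zip(rows, labels) if label == cls]
        centroids[cls] = tuple(mean(col) for col in zip(*members))
    return centroids


def predict(centroids, row):
    """Return the class whose centroid is closest to the row."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda cls: dist2(centroids[cls], row))


# The normalization method is a free choice; LABELS is not.
for method in (min_max, z_score):
    rows = normalize(FEATURES, method)
    model = fit_centroids(rows, LABELS)
    predictions = [predict(model, r) for r in rows]
    print(method.__name__, predictions == LABELS)
```

The loop changes only the preprocessing. Whatever we pick, the model is still judged against the same `LABELS` list, which is the part we cannot tune.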

2. How long does the data labeling process take?

How can we label the freshest transactions? If customers report fraudulent transactions or stolen credit cards, we can immediately mark those transactions as “Fraud”. What should we do with the rest of the transactions? We can assume that unreported transactions are “Not Fraud”, but how long should we wait to be sure? The last time my friend lost a credit card, she said, “I won’t report the missing credit card yet. Tomorrow I’ll go to the shop that I last visited and ask them if they found my credit card.” Fortunately, the store had found her card and returned it. I’m not an expert in credit card fraud (I’m only a good card user), but from my experience, we should wait at least a couple of days before marking transactions as “Not Fraud”.

In contrast, if somebody reports a fraudulent transaction, we can immediately label that transaction as “Fraud”. The person reporting it probably notices the fraud only several hours or a couple of days after the loss, but this is the best we can do.

As a result, our “freshest” labeled data set is limited to a few “Fraud” transactions that arrive with a delay of several hours or days and a lot of “Not Fraud” transactions that arrive with a delay of 2-3 days.
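The labeling rule described in this section can be sketched in a few lines. This is only an illustration under stated assumptions: the 3-day waiting period, the function name `label_transaction` and the sample transactions are all hypothetical, not taken from a real fraud system.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: wait this long before treating an unreported transaction as Not Fraud.
NOT_FRAUD_WAIT = timedelta(days=3)


def label_transaction(made_at: datetime,
                      reported_at: Optional[datetime],
                      now: datetime) -> Optional[bool]:
    """Return True (Fraud), False (Not Fraud), or None (no label yet)."""
    if reported_at is not None and reported_at <= now:
        return True   # a fraud report labels the transaction immediately
    if now - made_at >= NOT_FRAUD_WAIT:
        return False  # unreported for long enough: assume Not Fraud
    return None       # too fresh to label


now = datetime(2015, 10, 10, 12, 0)
transactions = [
    # (id, time of transaction, time of fraud report or None)
    ("t1", datetime(2015, 10, 6, 9, 0), None),
    ("t2", datetime(2015, 10, 9, 20, 0), datetime(2015, 10, 10, 8, 0)),
    ("t3", datetime(2015, 10, 10, 11, 0), None),
]
labels = {
    tx_id: label_transaction(made_at, reported_at, now)
    for tx_id, made_at, reported_at in transactions
}
print(labels)  # {'t1': False, 't2': True, 't3': None}
```

The one-hour-old transaction `t3` gets no label at all, which is exactly the gap a real-time retraining pipeline would have to live with. A report that arrives after the waiting period would also flip an earlier “Not Fraud” label to “Fraud”.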

3. Let’s try to speed up the labeling process

Our goal is to obtain the “freshest” labeled data possible. In practice, only the “Fraud” labels are fresh; for “Not Fraud” labels, we have to wait a few days. It might look like a good idea to build a model using only the fresh “Fraud” labeled data. However, we should understand that such a labeled data set is biased, which might lead to a lot of issues with the model.

Let’s imagine that a big new shopping center opened yesterday and we received a single fraud report about a single transaction from it. Our labeled data set will contain only one transaction from this shopping center, with a “Fraud” label. All other transactions from it are not labeled yet. The algorithm might decide that this shopping center is a strong fraud predictor, and all of its transactions will immediately be misclassified as “Fraud”, “in real time”. The advantages of real time bring real-time problems.
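A small sketch shows how strong this bias can look in numbers. The merchants, the transaction counts and the helper `fraud_rate_by_merchant` are all invented for illustration; the per-merchant fraud rate stands in for whatever signal a real model would pick up from a merchant feature.

```python
from collections import defaultdict

# Hypothetical labeled snapshot: (merchant, label); None means "not labeled yet".
snapshot = (
    [("old_store", True)]
    + [("old_store", False)] * 999
    + [("new_mall", True)]      # the single fraud report from yesterday
    + [("new_mall", None)] * 499  # everything else is still waiting for a label
)


def fraud_rate_by_merchant(rows):
    """Fraud share per merchant, computed only over rows that already have a label."""
    fraud = defaultdict(int)
    labeled = defaultdict(int)
    for merchant, label in rows:
        if label is None:
            continue  # unlabeled rows are invisible to a supervised algorithm
        labeled[merchant] += 1
        fraud[merchant] += int(label)
    return {merchant: fraud[merchant] / labeled[merchant] for merchant in labeled}


rates = fraud_rate_by_merchant(snapshot)
print(rates)  # {'old_store': 0.001, 'new_mall': 1.0}

# Assumption: after the waiting period, every unreported transaction becomes Not Fraud.
matured = [(m, False if label is None else label) for m, label in snapshot]
print(fraud_rate_by_merchant(matured))  # {'old_store': 0.001, 'new_mall': 0.002}
```

With only fresh labels, the new shopping center appears to be 100% fraud. Once the “Not Fraud” labels arrive, it looks like any other merchant.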

Conclusion

As we can see, credit card fraud detection does not look like the best scenario for real-time supervised machine learning. I was also unable to think of a good scenario from other business domains. I’d love to see good scenarios for real-time machine learning. Please share any information or ideas you have with the community.

16 thoughts on “What No One Tells You About Real-Time Machine Learning”

  1. I’m working on a real-time password and signature fraud detection system that uses biometric information, and I have been facing the same problem: there are very few negative samples. In the end I had to use a correlation coefficient to measure similarity, which makes the error rate very high. I also tried having everyone enter the same text, the way Coursera does with its Honor Code, but the results were not satisfactory either.

    Approaches like the Honor Code are probably the only useful methods.

    • Nan, thank you for sharing with us! Very interesting application. Do users enter the signature after logging in to your network or domain? If so, the quality of your labels should be good.

      Yes, the Honor Code can improve the quality of your data set. However, it won’t solve the issue. I think you still need people (judges) for a better result. It depends on the application, of course…

      • Well, actually a user is asked to write their signature 3 times during registration (on a smartphone; it’s a mobile application). And yes, users do enter the signature after login. They are asked to enter it whenever they perform a high-security or dangerous operation.

        What we did with the Honor Code was to treat the text entered by the user himself as positive samples and the text entered by others as negative samples. But this system doesn’t actually work well.

      • Hi Dmitry, I really like your article. May I translate it into Chinese and share it with my friends? I’ll credit you as the original author and link back to this article. I’ll also send you a link to the translation. How about that?

  2. I posted a link to this post on Reddit: https://www.reddit.com/r/MachineLearning/comments/3ny2qa/what_no_one_tells_you_about_realtime_machine/

    The link started an interesting discussion about real-time machine learning. You can find many interesting real-time ML examples there, although many of them are theoretical, to my mind. The best thought came from datatadadata (this is his or her username): “realtime training is only meaningful in cases where the past is not a good predictor of the future, apart from the very recent past.”

  3. Good real-time machine learning examples are recommendation engines. While browsing Amazon, or some other shopping website, I do want the recommendations to be related to what I just searched; having said that, I do realize this does not necessarily mean that the training is being done in real time!

    • Exactly. You do not need to retrain a recommendation model in real time after each click or purchase. The same model can work for months without retraining. After a purchase, your recommendations change simply because, according to the (old) model, you are likely to buy something related to your recent purchase.

      The interesting question is whether you need to retrain the model after each click or every 5 minutes. You probably do if user behaviour patterns can change within 5 minutes. Not your pattern or mine, but the pattern of all Amazon users. And I don’t think that happens every 5 minutes, or even every day.

  4. Well, in cases like this, where there are very few positive examples, you could try an anomaly detection / outlier detection algorithm. It helps to have labelled positive examples, but it can work without them. The tricky part is choosing parameters that work, that is, parameters that are likely to be very consistent for negative examples and not so for positive examples. You may also need to fine-tune the threshold.

    Examples of real-time machine learning without labelled data are, as mentioned, recommender systems. Many other applications, such as spam filters, image tagging and face recognition, require or at least heavily use labelled data.

    • Are you using unsupervised anomaly detection? Right, choosing parameters for unsupervised methods is a big issue.

      We need to separate unsupervised and supervised learning. With unsupervised learning, yes, labels are not needed. With supervised learning, you need labels in real time. Or, at least, you need a small share of actual labels for semi-supervised learning (spam filters).

  5. Interesting post. I’ve been thinking about similar issues and how they might interact with ideas from ‘active learning’. For example, you could score each new transaction with a ‘fraud probability’ and accept it into the training set only if that probability is very high or very low. For the transactions you are less certain about, you ‘poll’ the user to find out how the transaction should be labeled.

    I gather this kind of approach has the same difficulties with bias, though, and I don’t really know how you deal with that.

    • Right, ‘active learning’ is technically possible right now (with Apache Spark, for example). However, the machine learning methods might be much more complicated and unstable. First we need good methods; new technologies come later.
