What No One Tells You About Real-Time Machine Learning

October 7, 2015, by Dmitry Petrov

This year I have heard and read a lot about real-time machine learning. People usually present an appealing business scenario: credit card fraud detection. They say that they can continuously update a credit card fraud detection model in real time (see “What is Apache Spark?”, “…real-time use cases…” and “Real time machine learning”). It sounds fantastic, but it does not look realistic to me. One important detail is missing from this scenario: a continuous flow of transactional data is not what you need for model retraining. What you need is a continuous flow of labeled transactional data, where each transaction is already marked as Fraud or Not-Fraud.

Creating labeled data is probably the slowest and most expensive step in most machine learning systems. Machine learning algorithms learn to detect fraudulent transactions from people, and labeled data is how people's judgments reach the algorithm. Let’s see how it works for the fraud detection scenario.

1. Creating the model

To train a credit card fraud model, you need a lot of example transactions, and each transaction should be labeled as Fraud or Not-Fraud. These labels have to be as accurate as possible! This is our labeled data set, and it is the input for supervised machine learning algorithms. Based on the labeled data, the algorithm trains the fraud detection model. The model is usually a binary classifier with True (Fraud) and False (Not-Fraud) classes.

The labeled data set plays a central role in this process. It is very easy to change the parameters of our algorithm, such as the feature normalization method or the loss function. We can change the algorithm itself, for example from logistic regression to an SVM or a random forest. However, you cannot change the labeled data set. This information is predefined, and your model should predict the labels that you already have.
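To make the training step concrete, here is a minimal sketch of a binary fraud classifier fitted with plain logistic regression in pure Python. The toy transactions, the two features (amount and a "made abroad" flag), and the function names `train` and `predict_proba` are all assumptions invented for illustration; a real system would use far more data and a proper library.

```python
import math

# Hypothetical toy data, invented for illustration only.
# Features: [amount in thousands of dollars, 1 if made abroad else 0]
transactions = [
    [0.02, 0], [0.05, 0], [0.03, 0], [0.08, 0], [0.04, 1],  # Not-Fraud
    [0.90, 1], [0.75, 1], [0.82, 1],                         # Fraud
]
labels = [0, 0, 0, 0, 0, 1, 1, 1]  # 1 = Fraud (True), 0 = Not-Fraud (False)


def predict_proba(weights, bias, features):
    """Return the model's estimated probability that a transaction is Fraud."""
    z = bias + sum(w * v for w, v in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, targets, epochs=5000, learning_rate=0.5):
    """Fit logistic regression by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, target in zip(rows, targets):
            error = predict_proba(weights, bias, features) - target
            for j in range(n_features):
                grad_w[j] += error * features[j]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


weights, bias = train(transactions, labels)

# Score two unseen transactions: a large purchase abroad and a small local one.
p_suspicious = predict_proba(weights, bias, [0.80, 1])
p_ordinary = predict_proba(weights, bias, [0.03, 0])
print(f"large purchase abroad: {p_suspicious:.3f}")
print(f"small local purchase:  {p_ordinary:.3f}")
```

Notice that the only thing the code cannot produce on its own is the `labels` list. Everything else (features, algorithm, learning rate) can be changed freely.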

2. How long does the data labeling process take?

How can we label the freshest transactions? If customers report fraudulent transactions or stolen credit cards, we can immediately mark those transactions as “Fraud”. What should we do with the rest of the transactions? We can assume that unreported transactions are “Not Fraud”. But how long should we wait to be sure that they are not fraud? The last time my friend lost a credit card, she said, “I won’t report the missing credit card yet. Tomorrow I’ll go to the shop that I visited last and ask them if they found my credit card.” Fortunately, the store had found her credit card and returned it. I’m not an expert in credit card fraud (I’m only a good card user), but from my experience, we should wait at least a couple of days before marking transactions as “Not Fraud”.

In contrast, if somebody reports a fraudulent transaction, we can immediately label this transaction as “Fraud”. The person who reports the fraud probably notices it only several hours or a couple of days after the loss, but this is the best we can do.

As a result, our “freshest” labeled data set is limited to a few “Fraud” transactions that arrive with a delay of several hours or days, and a lot of “Not Fraud” transactions that arrive with a delay of 2-3 days.
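The labeling rule described in this section can be sketched as a small function. This is only an illustration under stated assumptions: the three-day waiting window (`MATURITY_WINDOW`) and the function name `label_transaction` are hypothetical, and a real fraud team would choose the window from its own reporting statistics.

```python
from datetime import datetime, timedelta

# Assumption: unreported transactions are treated as "Not Fraud" after 3 days.
MATURITY_WINDOW = timedelta(days=3)


def label_transaction(tx_time, reported_fraud, now, window=MATURITY_WINDOW):
    """Return "Fraud", "Not Fraud", or None if no label can be assigned yet."""
    if reported_fraud:
        # A fraud report gives us a label immediately.
        return "Fraud"
    if now - tx_time >= window:
        # Nobody complained for long enough, so we assume the transaction is fine.
        return "Not Fraud"
    # Too fresh and unreported: we simply do not know yet.
    return None


now = datetime(2015, 10, 7, 12, 0)
print(label_transaction(now - timedelta(hours=5), True, now))
print(label_transaction(now - timedelta(days=4), False, now))
print(label_transaction(now - timedelta(hours=5), False, now))
```

The third case is the important one: for the most recent transactions the function returns `None`, so they cannot be used for supervised training yet.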

3. Let’s try to speed up the labeling process

Our goal is to obtain the “freshest” labeled data possible. In fact, we have only fresh “Fraud” labels. For “Not Fraud” labels, we have to wait a few days. It might look like a good idea to build a model using only the fresh “Fraud” labeled data. However, we should understand that this labeled data set is biased, which might lead to a lot of issues with the models.

Let’s imagine that a big new shopping center opened yesterday and we got a single fraud report about a single transaction from this store. Our labeled data set will contain only one transaction from this shop, with a “Fraud” label. All other transactions from the shop are not labeled yet. The algorithm might decide that this shop is a strong fraud predictor, and all transactions from this shop will be misclassified as “Fraud” immediately, “in real time”. The advantages of real time bring real-time problems.
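A few lines of arithmetic show how strong this bias is. The sketch below assumes, purely for illustration, that the new shopping center processed 5,000 transactions on its first day and that exactly one of them was reported as fraud; the function name `observed_fraud_rate` is also hypothetical.

```python
def observed_fraud_rate(labels):
    """Fraud share among the transactions that already have a label."""
    labeled = [label for label in labels if label is not None]
    if not labeled:
        return None
    return sum(1 for label in labeled if label == "Fraud") / len(labeled)


# Assumption: 5,000 first-day transactions at the new store, one fraud report.
# Day 1: only the reported transaction has a label, the rest are still unknown.
fresh_labels = ["Fraud"] + [None] * 4999
# A few days later: the unreported transactions have matured into "Not Fraud".
matured_labels = ["Fraud"] + ["Not Fraud"] * 4999

fresh_rate = observed_fraud_rate(fresh_labels)
matured_rate = observed_fraud_rate(matured_labels)
print(f"fraud rate seen in real time:  {fresh_rate:.2%}")
print(f"fraud rate after the waiting period: {matured_rate:.2%}")
```

With only the fresh labels, every labeled transaction from the store is fraud, so the store looks like a perfect fraud predictor. Once the remaining labels arrive, the same store looks ordinary.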

Conclusion

As we can see, credit card fraud detection does not look like the best scenario for real-time supervised machine learning. I was also unable to think of a good scenario in other business domains. I’d love to see good scenarios for real-time machine learning. Please share any information or ideas you have with the community.

16 thoughts on “What No One Tells You About Real-Time Machine Learning”

Nan Li says:

October 8, 2015 at 6:04 am

I’m working on a real-time password and signature fraud detection system using biometric information, and I have been facing the same problem: there are few negative samples. In the end I had to use a correlation coefficient to calculate similarity, which makes the error rate very high. I also tried having everyone input the same text, as Coursera does with its Honor Code, and the result was not satisfactory either.

Approaches like the honor code are probably the only useful methods.

- Dmitry Petrov (@FullStackML) says:
  
  October 8, 2015 at 11:10 am
  
  Nan, thank you for sharing with us! Very interesting application. Do users enter the signature after logging in to your network\domain? If so, the quality of your labels should be good.
  
  Yes, the Honor Code can improve the quality of your data set. However, it won’t solve the issue. I think you still need people\judges for a better result. It depends on the application, of course…
  
  - Nan Li says:
    
    October 8, 2015 at 11:29 pm
    
    Well, actually a user is asked to write the signature 3 times (on a smartphone; it’s a mobile application) during registration. And yes, users do enter the signature after login. Users are asked to enter the signature when they perform a high-security or dangerous operation.
    
    What we did with the honor code was to treat the honor code entered by the user himself as positive samples and the ones entered by others as negative samples. But this system actually doesn’t work well.
    
  - Nan Li says:
    
    December 10, 2015 at 2:50 am
    
    Hi Dmitry, I really like your article. May I translate it into Chinese and share it with my friends? I’ll credit you as the original author and include a link to this page in the article. I’ll also send you a link to the translated article. How about that?
    
  - Dmitry Petrov says:
    
    December 11, 2015 at 12:03 am
    
    Hi Nan,
    
    Sure. Please feel free to translate to Chinese with a backlink.
    
  - Nan Li says:
    
    December 11, 2015 at 7:57 am
    
    here it is. http://www.jianshu.com/p/27a12e46fd7f thanks a lot.
    
  - Dmitry Petrov says:
    
    December 16, 2015 at 2:09 am
    
    Nan, thank you for the translation! In Chinese it looks amazing 🙂
    
Anastasia Kim (@thejister) says:

October 8, 2015 at 9:09 am

This is really exciting! Thank you for sharing your thoughts and ideas. I can’t wait to see what other things will be posted in the future! Awesome!

PerryZhao says:

October 8, 2015 at 11:19 pm

Reblogged this on 木秀于林.

Dmitry Petrov says:

October 9, 2015 at 7:14 pm

I posted a link to this post on Reddit: https://www.reddit.com/r/MachineLearning/comments/3ny2qa/what_no_one_tells_you_about_realtime_machine/

The link started an interesting discussion about real-time machine learning. You can find many interesting real-time ML examples there, although many of them are theoretical, to my mind. The best thought came from datatadadata (this is his or her username): “realtime training is only meaningful in cases where the past is not a good predictor of the future, apart from the very recent past.”

cn142 says:

October 13, 2015 at 9:07 am

Good real-time machine learning examples are recommendation engines. While browsing Amazon, or some other shopping website, I do want the recommendations to be related to what I just searched; having said that, I do realize this does not necessarily mean that the training is being done in real time!

- Dmitry Petrov says:
  
  October 13, 2015 at 11:15 am
  
  Exactly. You do not need to retrain a recommendation model in real time after each click or purchase. The same model can work for months without retraining. After a purchase, your recommendations change simply because, according to the (old) model, you are likely to buy something related to your recent purchase.
  
  The interesting question is: do you need to retrain the model after each click or every 5 minutes? You probably do if a user behaviour pattern can change in 5 minutes. Not your pattern or mine, but the pattern of all Amazon users. And I don’t think this happens every 5 minutes, or even every day.
  
Sampo says:

October 13, 2015 at 3:17 pm

Well, in cases like this, where there are very few positive examples, you could try using an anomaly detection / outlier detection algorithm. It helps to have the labelled positive examples, but it can work without them. The tricky part is choosing parameters that work, that is, parameters that are likely to be very consistent for negative examples and not so for positive examples. You may also need to fine-tune the threshold.

Examples of real-time machine learning without labelled data are, as mentioned, recommender systems. Many of them require or at least heavily use labelled data, like spam filters, image tagging, face recognition, etc.

- Dmitry Petrov says:
  
  October 13, 2015 at 10:26 pm
  
  Are you using an unsupervised anomaly detection? Right, there are big issues with unsupervised parameters.
  
  We need to separate unsupervised and supervised. With unsupervised – yes, labels are not needed. With supervised – you need labels in real-time. Or, at least, you need a small part of actual labels for semi-supervised (spam filters).
  
padarn says:

November 5, 2015 at 8:02 am

Interesting post. I’ve been thinking about similar issues and how they might interact with ideas from ‘active learning’. So for example if you classify each new transaction with a ‘fraud probability’, and accept new samples into the training either if they are very high or low probability. For those you are less certain of you ‘poll’ the user to see how the transaction should be labeled.

I gather this kind of approach has the same difficulties with bias though.. and I don’t really know how you deal with that.

- Dmitry Petrov says:
  
  November 6, 2015 at 3:32 am
  
  Right, ‘active learning’ is technically possible (by Apache Spark for example) right now. However, machine learning methods might be much more complicated and unstable. First, we need good methods, and new technologies later.
  