How to train an IsolationForest model to give the minimum number of false positives?

When using Isolation Forest for anomaly detection, should the model be trained on normal data only, or on a mix of normal and outlier data? Also, what is the best algorithm for anomaly detection on multivariate data? I want to minimize false positives.

  1. I am looking at a contamination level of less than 5%.
  2. Also, what is the best ML algorithm for anomaly detection on multivariate data, so that it gives the fewest false positives?

Note: I know that reducing false positives is a matter of tuning the model, but I wanted to know the most efficient algorithm. From blogs, I have understood that IsolationForest is one of the newest and most efficient unsupervised anomaly detection algorithms.

Mortise answered 21/4, 2018 at 14:19 Comment(2)
Cook's distance is an alternative. It is available in R, for example the Cook's Distance function here: rdocumentation.org/packages/car/versions/1.2-16/topics/… Rustin
Is there a library in Python for this? Mortise

Currently, scikit-learn v0.20.3 has isolation forests implemented. IForests are fairly good at handling high-dimensional, multivariate data:

"the data is recursively partitioned with axis-parallel cuts at randomly chosen partition points in randomly selected attributes, so as to isolate the instances into nodes with fewer and fewer instances until the points are isolated into singleton nodes containing one instance." -- Charu C. Aggarwal (in Chapter 5 of Outlier Analysis)
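To make the quoted idea concrete, here is a toy sketch of random axis-parallel partitioning in plain Python. It is not scikit-learn's implementation: the helper names `isolation_depth` and `mean_depth`, the synthetic 2-D data, and the shortcut of adding the query point to each random subsample are all assumptions made for illustration. It counts how many random cuts are needed to isolate a point, averaged over many random trees.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count random axis-parallel cuts needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))           # randomly selected attribute
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                              # no spread on this attribute: stop (toy simplification)
        return depth
    cut = rng.uniform(lo, hi)                 # randomly chosen partition point
    # Keep only the side of the cut that still contains the query point.
    if point[dim] < cut:
        side = [row for row in data if row[dim] < cut]
    else:
        side = [row for row in data if row[dim] >= cut]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trees=200, sample_size=64, seed=0):
    """Average isolation depth of `point` over many random subsamples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, sample_size) + [point]
        total += isolation_depth(point, sample, rng)
    return total / n_trees

# Hypothetical data: one Gaussian cluster, plus a far-away query point.
gen = random.Random(0)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(300)]
inlier, outlier = (0.0, 0.0), (8.0, 8.0)

inlier_depth = mean_depth(inlier, data)
outlier_depth = mean_depth(outlier, data)
print("mean cuts to isolate inlier: ", inlier_depth)
print("mean cuts to isolate outlier:", outlier_depth)
```

Points far from the bulk of the data tend to be isolated after fewer cuts, and that shorter average path is what an isolation forest turns into an anomaly score.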

I can't say for a fact that it gives the fewest false positives, because that really depends on many factors, including your training data. As far as I can tell, it does a good job of identifying anomalies and/or outliers (even with discrete time series).

You can set the contamination parameter to whatever percent your heart desires as long as it's a float in (0., 0.5).

"The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function."
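As a rough illustration of what that threshold means for false positives, the sketch below fits the model on synthetic data that contains no deliberate outliers at all and checks what fraction of the training points gets flagged. The data, the seed, and the `flagged_fraction` dictionary are assumptions made for the example; only `IsolationForest`, its `contamination`, `n_estimators` and `random_state` parameters, and `fit_predict` come from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 1000 points from a single Gaussian, no injected outliers.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))

flagged_fraction = {}
for c in (0.01, 0.04, 0.10):
    clf = IsolationForest(n_estimators=100, contamination=c, random_state=0)
    labels = clf.fit_predict(X)               # +1 = inlier, -1 = outlier
    flagged_fraction[c] = float(np.mean(labels == -1))
    print("contamination =", c, "-> fraction flagged:", flagged_fraction[c])
```

The flagged fraction on the training data should track `contamination` closely, whether or not those points are real anomalies. So if the value is set higher than the true outlier rate, the surplus shows up as false positives.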

The default in v0.20 is 0.1 (10%), so you could set contamination=0.04 (4%) instead. Newer scikit-learn releases (0.22 onward) default to contamination='auto'.

from sklearn.ensemble import IsolationForest

# Expect roughly 4% of the training samples to be outliers.
clf = IsolationForest(contamination=0.04)
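
If you have even a small labeled hold-out set, you can measure false positives directly instead of guessing. The following is a minimal sketch under stated assumptions: the synthetic data, the `make_data` helper, and the 4% outlier rate are invented for illustration, and "positive" here means "flagged as an outlier" (label -1).

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix

rng = np.random.RandomState(42)

def make_data(n_normal, n_outlier):
    """Hypothetical data: a Gaussian cluster plus uniformly scattered outliers."""
    normal = rng.normal(loc=0.0, scale=1.0, size=(n_normal, 2))
    outliers = rng.uniform(low=-8.0, high=8.0, size=(n_outlier, 2))
    X = np.vstack([normal, outliers])
    y = np.concatenate([np.ones(n_normal, dtype=int),
                        -np.ones(n_outlier, dtype=int)])
    return X, y

# Train on a mix with about 4% outliers; the labels are not used for fitting.
X_train, _ = make_data(960, 40)
X_test, y_test = make_data(480, 20)

clf = IsolationForest(n_estimators=100, contamination=0.04, random_state=42)
clf.fit(X_train)
y_pred = clf.predict(X_test)                  # +1 = inlier, -1 = outlier

# Rows are true labels, columns are predictions, both in the order [1, -1].
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[1, -1]).ravel()
print("false positives (normal points flagged):", fp)
print("false negatives (outliers missed):", fn)
print("false positive rate:", fp / (fp + tn))
```

Re-running this with different `contamination` values, or with a training set that contains only normal points, is one way to compare setups on your own data.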
Addy answered 9/5, 2019 at 0:56 Comment(1)
Thanks for your answer. Could you kindly have a look at a related post here? Juna
