How to improve my recommendation result? I am using Spark ALS implicit

First, I have some usage history of users' apps.

For example:
user1, app1, 3(launch times)
user2, app2, 2(launch times)
user3, app1, 1(launch times)

Basically, I have two requirements:

  1. Recommend apps for every user.
  2. Recommend similar apps for every app.

So I use the implicit-feedback ALS from Spark MLlib to implement it. At first, I just used the raw data to train the model, and the result was terrible. I think it may be caused by the range of the launch times, which runs from 1 to the thousands. So I processed the original data into a score, which I think reflects the true situation better and is more normalized:

score = lt / uMlt + lt / aMlt

score is the processed value used to train the model.
lt is the launch count of a (user, app) pair in the original data.
uMlt is the user's mean launch count in the original data: (total launch count of the user) / (number of apps this user has ever launched).
aMlt is the app's mean launch count in the original data: (total launch count of the app) / (number of users who have ever launched this app).
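
A minimal sketch of that preprocessing, assuming the raw data is an RDD of (userId, appId, launchTimes) triples (the variable and helper names here are my own, not from the original job):

import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// raw: (userId, appId, launchTimes) triples; launch times as Double for the divisions.
def toScores(raw: RDD[(Int, Int, Double)]): RDD[Rating] = {
  // uMlt: per-user mean launch count = total launches / number of apps the user launched.
  val uMlt = raw.map { case (u, _, lt) => (u, (lt, 1)) }
    .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
    .mapValues { case (sum, n) => sum / n }

  // aMlt: per-app mean launch count = total launches / number of users who launched it.
  val aMlt = raw.map { case (_, a, lt) => (a, (lt, 1)) }
    .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
    .mapValues { case (sum, n) => sum / n }

  raw.map { case (u, a, lt) => (u, (a, lt)) }
    .join(uMlt)
    .map { case (u, ((a, lt), um)) => (a, (u, lt, um)) }
    .join(aMlt)
    .map { case (a, ((u, lt, um), am)) => Rating(u, a, lt / um + lt / am) } // score = lt/uMlt + lt/aMlt
}
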
Here are some examples of the data after processing:

Rating(95788,20992,0.14167073369026184)
Rating(98696,20992,5.92363166809082)
Rating(160020,11264,2.261538505554199)
Rating(67904,11264,2.261538505554199)
Rating(268430,11264,0.13846154510974884)
Rating(201369,11264,1.7999999523162842)
Rating(180857,11264,2.2720916271209717)
Rating(217692,11264,1.3692307472229004)
Rating(186274,28672,2.4250855445861816)
Rating(120820,28672,0.4422124922275543)
Rating(221146,28672,1.0074234008789062)

After doing this, and aggregating apps that appear under different package names, the result seems better, but still not good enough.
I find that the feature values of users and products are very small, and most of them are negative.

Here are 3 example lines of product features, 10 dimensions each:

((CompactBuffer(com.youlin.xyzs.shoumeng, com.youlin.xyzs.juhe.shoumeng)),(-4.798973236574966E-7,-7.641608021913271E-7,6.040852440492017E-7,2.82689171626771E-7,-4.255948056197667E-7,1.815822798789668E-7,5.000047167413868E-7,2.0220664964654134E-7,6.386763402588258E-7,-4.289261710255232E-7))
((CompactBuffer(com.dncfcjaobhegbjccdhandkba.huojia)),(-4.769295992446132E-5,-1.7072002810891718E-4,2.1351299074012786E-4,1.6345139010809362E-4,-1.4456869394052774E-4,2.3657752899453044E-4,-4.508546771830879E-5,2.0895185298286378E-4,2.968782791867852E-4,1.9461760530248284E-4))
((CompactBuffer(com.tern.rest.pron)),(-1.219763362314552E-5,-2.8371430744300596E-5,2.9869115678593516E-5,2.0747662347275764E-5,-2.0555471564875916E-5,2.632938776514493E-5,2.934047643066151E-6,2.296348611707799E-5,3.8075613701948896E-5,1.2197584510431625E-5))

Here are 3 example lines of user features, 10 dimensions each:

(96768,(-0.0010857731103897095,-0.001926362863741815,0.0013726564357057214,6.345533765852451E-4,-9.048808133229613E-4,-4.1544197301846E-5,0.0014421759406104684,-9.77902309386991E-5,0.0010355513077229261,-0.0017878251383081079))
(97280,(-0.0022841691970825195,-0.0017134940717369318,0.001027365098707378,9.437055559828877E-4,-0.0011165080359205604,0.0017137592658400536,9.713359759189188E-4,8.947265450842679E-4,0.0014328152174130082,-5.738904583267868E-4))
(97792,(-0.0017802991205826402,-0.003464450128376484,0.002837196458131075,0.0015725698322057724,-0.0018932095263153315,9.185600210912526E-4,0.0018971719546243548,7.250450435094535E-4,0.0027060359716415405,-0.0017731878906488419))

So you can imagine how small the values get when I take the dot product of the feature vectors to compute the entries of the user-item matrix.
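
To make that concrete, here is a quick check using the first user row and the first product row from the dumps above (the pairing is arbitrary; it is only meant to show the magnitude):

// Predicted preference in ALS is the dot product of a user factor and a product factor.
def dot(u: Array[Double], p: Array[Double]): Double =
  u.zip(p).map { case (a, b) => a * b }.sum

val user96768 = Array(-0.0010857731103897095, -0.001926362863741815,
  0.0013726564357057214, 6.345533765852451E-4, -9.048808133229613E-4,
  -4.1544197301846E-5, 0.0014421759406104684, -9.77902309386991E-5,
  0.0010355513077229261, -0.0017878251383081079)
val product = Array(-4.798973236574966E-7, -7.641608021913271E-7,
  6.040852440492017E-7, 2.82689171626771E-7, -4.255948056197667E-7,
  1.815822798789668E-7, 5.000047167413868E-7, 2.0220664964654134E-7,
  6.386763402588258E-7, -4.289261710255232E-7)

println(dot(user96768, product)) // ≈ 5.5e-9, i.e. on the order of 1e-9
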

My questions here are:

  1. Is there any other way to improve the recommendation result?
  2. Do my features look right, or is something going wrong?
  3. Is my way of processing the original launch times (converting them to scores) right?

I put some code here. This is definitely a programming question, but it probably can't be solved by a few lines of code.

// Train the implicit-feedback ALS model on the preprocessed scores.
val model = ALS.trainImplicit(ratings, rank, iterations, lambda, alpha)
println("recommendForAllUser")
// Top-N apps per user, joined back to each user's MAC address, with app IDs mapped to package names.
val userTopKRdd = recommendForAllUser(model, topN).join(userData.map(x => (x._2._1, x._1))).map {
  case (uid, (appArray, mac)) => {
    (mac, appArray.map {
      case (appId, rating) => {
        val packageName = appIdPriorityPackageNameDict.value.getOrElse(appId, Constants.PLACEHOLDER)
        (packageName, rating)
      }
    })
  }
}
HbaseWriter.writeRddToHbase(userTopKRdd, "user_top100_recommendation", (x: (String, Array[(String, Double)])) => {
  val mac = x._1
  val products = x._2.map {
    case (packageName, rating) => packageName + "=" + rating
  }.mkString(",")
  val putMap = Map("apps" -> products)
  (new ImmutableBytesWritable(), Utils.getHbasePutByMap(mac, putMap))
})

print("recommendSimilarApp")
println("productFeatures ******")
model.productFeatures.take(1000).map{
  case (appId, features) => {
    val packageNameList = appIdPackageNameListDict.value.get(appId)
    val packageNameListStr = if (packageNameList.isDefined) {
      packageNameList.mkString("(", ",", ")")
    } else {
      "Unknow List"
    }
    (packageNameListStr, features.mkString("(", ",", ")"))
  }
}.foreach(println)
println("productFeatures ******")
model.userFeatures.take(1000).map{
  case (userId, features) => {
    (userId, features.mkString("(", ",", ")"))
  }
}.foreach(println)
// For each app, map its top-N similar apps to package names and emit one row per package name in the app's group.
val similarAppRdd = recommendSimilarApp(model, topN).flatMap {
  case (appId, similarAppArray) => {
    val groupedAppList = appIdPackageNameListDict.value.get(appId)
    if (groupedAppList.isDefined) {
      val similarPackageList = similarAppArray.map {
        case (destAppId, rating) => (appIdPriorityPackageNameDict.value.getOrElse(destAppId, Constants.PLACEHOLDER), rating)
      }
      groupedAppList.get.map(packageName => {
        (packageName, similarPackageList)
      })
    } else {
      None
    }
  }
}
HbaseWriter.writeRddToHbase(similarAppRdd, "similar_app_top100_recommendation", (x: (String, Array[(String, Double)])) => {
  val packageName = x._1
  val products = x._2.map {
    case (simPackageName, rating) => simPackageName + "=" + rating // renamed to avoid shadowing the outer packageName
  }.mkString(",")
  val putMap = Map("apps" -> products)
  (new ImmutableBytesWritable(), Utils.getHbasePutByMap(packageName, putMap))
})  

UPDATE:
I found something new about my data after reading the paper ("Collaborative Filtering for Implicit Feedback Datasets"). My data set is much sparser than the IPTV data set described in the paper.

Paper: 300,000 (users) 17,000 (products) 32,000,000 (observations)
Mine: 300,000 (users) 31,000 (products) 700,000 (observations)

So the user-item matrix of the paper's data set is filled at a ratio of 0.00627 = 32,000,000 / (300,000 * 17,000). My data set's ratio is about 0.000075 = 700,000 / (300,000 * 31,000). I think this means my user-item matrix is roughly 80 times sparser than the paper's.
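
A quick sanity check of the two fill ratios:

val paperDensity = 32000000.0 / (300000.0 * 17000.0) // ≈ 0.00627
val mineDensity  = 700000.0 / (300000.0 * 31000.0)   // ≈ 0.0000753
println(paperDensity / mineDensity)                  // ≈ 83, i.e. mine is roughly 80x sparser
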
Should this lead to a bad result? And is there any way to improve it?

Parliament answered 24/2, 2016 at 13:39 Comment(1)
Did you have any update on your issue? I have a similar problem. – Lisettelisha

There are two things you should try:

  1. Standardise your data so that it has zero mean and unit variance per user vector. This is a common step in lots of machine learning. It helps to reduce the effect of outliers, which cause the close-to-zero values you are seeing.
  2. Remove all users that have only a single app. The only thing you will learn from these users is a slightly better "mean" value for the app scores. They will not help you learn any meaningful relationships, though, which is what you really want. (A sketch covering both steps follows this list.)
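
A minimal sketch of both steps, assuming the same RDD[Rating] that feeds ALS.trainImplicit (the helper name is mine):

import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

def preprocess(ratings: RDD[Rating]): RDD[Rating] = {
  // Per-user count, mean, and standard deviation of the scores.
  val stats = ratings.map(r => (r.user, r.rating))
    .groupByKey()
    .mapValues { rs =>
      val n = rs.size
      val mean = rs.sum / n
      val stddev = math.sqrt(rs.map(x => (x - mean) * (x - mean)).sum / n)
      (n, mean, stddev)
    }

  ratings.map(r => (r.user, r))
    .join(stats)
    .collect {
      // Step 2: drop users with only a single app (and degenerate constant-score users).
      case (user, (r, (n, mean, stddev))) if n > 1 && stddev > 0 =>
        // Step 1: zero mean, unit variance per user.
        Rating(user, r.product, (r.rating - mean) / stddev)
    }
}
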

Having removed a user from the model, you will lose the ability to get recommendations for that user directly from the model by providing the user ID. However, they only have a single app rating anyway, so you can instead run a KNN search over the product matrix to find the apps most similar to that user's app; those become the recommendations.
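
A sketch of that KNN fallback, using cosine similarity over model.productFeatures (the helpers are mine, not part of MLlib):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val na = math.sqrt(a.map(x => x * x).sum)
  val nb = math.sqrt(b.map(x => x * x).sum)
  if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
}

// Top-k apps most similar to the single app a removed user has.
def similarApps(model: MatrixFactorizationModel, appId: Int, k: Int): Array[(Int, Double)] = {
  val target = model.productFeatures.lookup(appId).head
  model.productFeatures
    .filter { case (id, _) => id != appId }
    .map { case (id, features) => (id, cosine(target, features)) }
    .top(k)(Ordering.by(_._2))
}
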

Nakasuji answered 27/6, 2017 at 16:17 Comment(2)
Why would you remove the users that have only a single app? I don't understand your rationale. – Venison
Users with only one app (after standardization, as stated above) will have 100% or the max value in one column, and all the other columns' entries will be 0. This will not only increase sparsity but will also act as outliers. Hence, in a way, removing those users de-noises your dataset, which results in better training. – Defeasible
