Which algorithm/implementation for weighted similarity between users by their selected, distanced attributes?
Data Structure:

User has many Profiles
    (Limit - no more than one of each profile type per user, no duplicates)
Profiles has many Attribute Values
    (A user can have as many or as few attribute values as they like)
Attributes belong to a category
    (No overlap. This controls which attribute values a profile can have)

Example/Context:

I believe Stack Exchange lets one user have many profiles, since they differ per exchange site? In this problem:

  • Profile: Video, so a Video profile only contains Attributes of the Video category
  • Attributes: an Attribute in the Video category might be Genre
  • Attribute Values: e.g. Comedy, Action, Thriller are all Attribute Values

Profiles and Attributes are just ways of grouping Attribute Values on two levels. Without grouping (which is needed for weighting in 2. onwards), the relationship is just User hasMany Attribute Values.

Problem:

Give each user a similarity rating against each other user.

  1. Similarity based on All Attribute Values associated with the user.
    • Flat/one level
    • Unequal number of attribute values between two users
    • Attribute value can only be selected once per user, so no duplicates
    • Therefore, binary string/boolean array with Cosine Similarity?
  2. 1 + Weight Profiles
    • Give each profile a weight (totaling 1?)
    • Work out profile similarity, then multiply by weight, and sum?
  3. 1 + Weight Attribute Categories and Profiles
    • As an attribute belongs to a category, categories can be weighted
    • Similarity per category, weighted sum, then same by profile?
    • Or merge profile and category weights
  4. 3 + Distance between every attribute value
    • Table of similarity distance for every possible value vs value
    • Rather than similarity by value === value
    • 'Close' attributes contribute to overall similarity.
    • No idea how to do this one
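To make tasks 1 and 4 concrete, here is the kind of sketch I have in mind in Python (the attribute list, user selections and the similarity table are all made up for illustration):

```python
# Task 1: plain cosine similarity on boolean vectors.
# Task 4: soft matching via a value-vs-value similarity table.
import math

ALL_VALUES = ["comedy", "action", "thriller", "documentary"]

def to_vector(user_values):
    """Binary vector: 1.0 if the user selected that attribute value."""
    return [1.0 if v in user_values else 0.0 for v in ALL_VALUES]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

# Task 1: exact-match similarity (shared value: comedy)
alice = to_vector({"comedy", "action"})
bob = to_vector({"comedy", "thriller"})
print(cosine(alice, bob))  # 0.5

# Task 4: 'close' values contribute too. The table holds similarity
# (1 = identical) for value pairs; missing pairs count as 0.
SIM = {("action", "thriller"): 0.7}

def value_sim(a, b):
    if a == b:
        return 1.0
    return SIM.get((a, b)) or SIM.get((b, a)) or 0.0

def soft_similarity(values_a, values_b):
    """Average best-match score in both directions, so it's symmetric."""
    def one_way(src, dst):
        return sum(max(value_sim(a, b) for b in dst) for a in src) / len(src)
    return (one_way(values_a, values_b) + one_way(values_b, values_a)) / 2

print(soft_similarity({"comedy", "action"}, {"comedy", "thriller"}))  # 0.85
```

With the table in place, action and thriller partially match, so the soft score (0.85) is higher than the exact-match cosine (0.5).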

Fancy code and useful functions are great, but I'm really looking to fully understand how to achieve these tasks, so I think generic pseudocode is best.

Thanks!

Langill answered 11/2, 2014 at 18:40 Comment(4)
Do you need all these tasks done or are you just considering these approaches as possible solutions to one main goal (find similarity between users)? Can you please give us some context?Fortissimo
These are the approaches I've considered, 1 being the simplest, 4 being the most complex. I'd like to understand how to do each one, so yes, I need all of them, but as 1 would feed into understanding 2, etc., they're essentially all part of one solution. I'm open to any suggestions as to how to achieve these tasks, but I think they're the best way for me to compare users.Langill
What are attributes and categories? Can you please provide some examples? More generally, what are these profiles in real life? Say, are they profiles from Fb, LinkedIn, etc., or what? Also, what is your intuition behind weights (for both - profiles and categories). Sorry for asking so many question, but things like finding similarity always depends on concrete settings and task details.Fortissimo
Added to question. Does that make sense?Langill

First of all, remember that everything should be made as simple as possible, but not simpler. This rule applies to many areas, but in things like semantics, similarity and machine learning it is essential. Using several layers of abstraction (attributes -> categories -> profiles -> users) makes your model harder to understand and reason about, so I would omit as much of it as possible. This means it's highly preferable to keep a direct relation between users and attributes. So, basically, each of your users should be represented as a vector, where each variable (vector element) represents a single attribute.

If you choose such a representation, make sure all attributes make sense and have an appropriate type in this context. For example, you can represent 5 video genres as 5 distinct variables, but not as numbers from 1 to 5, since cosine similarity (and most other algorithms) will treat them incorrectly (e.g. multiplying thriller, represented as 2, by comedy, represented as 5, which actually makes no sense).
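The genre-encoding point can be illustrated with a minimal sketch (the genre list is hypothetical):

```python
# Encoding 5 genres as one number 1..5 breaks cosine similarity:
# thriller (2) times comedy (5) gives 10, which is meaningless.
# One-hot encoding gives each genre its own boolean variable instead.
GENRES = ["comedy", "thriller", "action", "drama", "horror"]

def one_hot(selected):
    """Each genre becomes its own 0/1 variable."""
    return [1 if g in selected else 0 for g in GENRES]

print(one_hot({"thriller"}))          # [0, 1, 0, 0, 0]
print(one_hot({"comedy", "horror"}))  # [1, 0, 0, 0, 1]
```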

It's OK to use distance between attributes when applicable, though I can hardly come up with an example in your setting.

At this point you should stop reading and try it out: a simple representation of users as vectors of attributes, plus cosine similarity. If it works well, leave it as is - overcomplicating a model is never good.

And if the model performs badly, try to understand why. Do you have enough relevant attributes? Are there too many noisy variables that only make things worse? Should some attributes really have larger importance than others? Depending on the answers, you may want to:

  1. Run feature selection to avoid noisy variables.
  2. Transform your variables, representing them in some other "coordinate system". For example, instead of using N variables for N video genres, you may use M other variables representing closeness to specific social groups. Say, a 1 for the "comedy" variable becomes 0.8 for the "children" variable, 0.6 for "housewife" and 0.9 for "old_people". Or anything else. Any kind of translation that seems more "correct" is OK.
  3. Use weights. Not weights for categories or profiles, but weights for individual attributes. But don't set these weights yourself; instead, run linear regression to find them.

Let me describe the last point in a bit more detail. Instead of plain cosine similarity, which (assuming x and y are normalized to unit length) is just the dot product:

cos(x, y) = x[0]*y[0] + x[1]*y[1] + ... + x[n]*y[n]

you may use weighted version:

cos(x, y) = w[0]*x[0]*y[0] + w[1]*x[1]*y[1] + ... + w[n]*x[n]*y[n]

The standard way to find such weights is to use some kind of regression (linear regression being the most popular). Normally, you collect a dataset (X, y) where X is a matrix with your data vectors as rows (e.g. details of houses being sold) and y is some kind of "correct answer" (e.g. the actual price each house was sold for). In your case, however, there's no correct answer for a single user vector - you can only define a correct answer for the similarity between two users. So why not use that? Just make each row of X a combination of 2 user vectors, and the corresponding element of y the similarity between them (which you assign yourself for the training dataset). E.g.:

X[k] = [ user_i[0]*user_j[0], user_i[1]*user_j[1], ..., user_i[n]*user_j[n] ]
y[k] = .75  // or whatever you assign to it
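A minimal sketch of this regression idea, assuming NumPy is available (the user vectors and similarity labels are invented for illustration):

```python
# Learn per-attribute weights w from hand-labelled similarity pairs.
# Each training row is the element-wise product of two user vectors;
# the target y is the similarity you assigned to that pair.
import numpy as np

users = {
    "alice": np.array([1.0, 1.0, 0.0]),
    "bob":   np.array([1.0, 0.0, 1.0]),
    "carol": np.array([0.0, 1.0, 1.0]),
}

# Hand-labelled training pairs: (user_i, user_j, similarity in [0, 1])
pairs = [("alice", "bob", 0.75), ("alice", "carol", 0.30), ("bob", "carol", 0.50)]

X = np.array([users[i] * users[j] for i, j, _ in pairs])
y = np.array([s for _, _, s in pairs])

# Ordinary least squares: find w minimizing ||X @ w - y||
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def weighted_dot(a, b):
    """Weighted version of the formula above: sum of w[i]*a[i]*b[i]."""
    return float(np.dot(w, a * b))

print(weighted_dot(users["alice"], users["bob"]))  # ~0.75
```

With more users you would build X from many labelled pairs; the learned w then scores any unlabelled pair of users.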

HTH

Fortissimo answered 12/2, 2014 at 14:24 Comment(6)
First paragraph, good point. I'll manipulate weighting for the calculation based on groups, rather than run multiple similarity calculations. Second, each attribute will have a value, 'importance'. Machine learning is my long-term goal. I'd like to use manual weights as you describe in 2. and implement some learning to modify those numbers based on user metrics. What you describe is what I meant by 'distance'. So you'd still use one calculation, but add in each variable with a modified value based on its relation to an attribute value already in the set?Langill
What's the difference between "importance" and "weights"? As for calculations, yes, I'd prefer single formula. There are several kinds of models that split attributes and weights into several parts (e.g. layers in neural networks), but in general the more components (and abstraction levels), the harder it is to reason about it, find its coefficients and so on.Fortissimo
Ok. The importance is, say, a score? Or worth? Some are more important than others, so I want them to count more, INITIALLY. This will be inversely proportional to the usage count. Then weights are used to modify the worth based on that initial importance. They are like weights, I guessLangill
Well, this makes sense for some optimization methods like stochastic gradient descent (which is quite common), but be prepared that other algorithms may change your initial importance coefficients drastically or even completely ignore them.Fortissimo
So what would you recommend overall if Weighted Cosine Similarity is not the best method?Langill
It's impossible to predict which method will do best - only experiments may tell you the truth. But weighted cosine similarity seems to be a good starting point. I just wanted to emphasize that combining manual and calculated weights may not always give good results. It all depends on concrete steps. You really need to try testing ideas. "Theory and practice sometimes clash. And when that happens, theory loses. Every single time." (c) Linus TorvaldsFortissimo
