Right approach to find similar products solely based on content and not on user history using machine learning algorithms

I have around 2-3 million products. Each product follows this structure

{
    "sku": "Unique ID of Product ( String of 20 chars )",
    "title":"Title of product eg Oneplus 5 - 6GB + 64GB ",
    "brand":"Brand of product eg OnePlus",
    "cat1":"First Category of Product Phone",
    "cat2":"Second Category of Product Mobile Phones",
    "cat3":"Third Category of Product Smart Phones",
    "price":500.00,
    "shortDescription":"Short description about the product ( Around 8 - 10 Lines )",
    "longDescription":"Long description about the product ( Around 50 - 60 Lines )"
}

The problem statement is

Find similar products based on content (product data) only. When an e-commerce user clicks on a product (SKU), I want to show products similar to that SKU in the recommendations.

For example, if the user clicks on apple iphone 6s silver, I will show these products in "Similar Products Recommendation":

1) apple iphone 6s gold or other color

2) apple iphone 6s plus options

3) apple iphone 6s options with other configurations

4) other apple iphones

5) other smart-phones in that price range

What I have tried so far

A) I have tried using 'user view' events to recommend similar products, but we do not have good enough data. It gives fine results, but only for a few products, so this template is not suitable for my use case.

B) One hot encoder + Singular Value Decomposition ( SVD ) + Cosine Similarity

I have trained my model on around 250 thousand products with dimension = 500, using a modification of this PredictionIO template. It is giving good results. I have not included the long description of the product in the training.
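
To make approach B concrete, here is a minimal sketch of a one-hot + SVD + cosine similarity pipeline. It uses scikit-learn and SciPy as an assumption; it is not the PredictionIO template, and the sample products, the choice of fields, and the function names `build_embeddings` and `most_similar` are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, normalize

# Hypothetical toy catalogue using a subset of the product fields.
products = [
    {"sku": "P1", "title": "apple iphone 6s silver 64GB", "brand": "Apple", "cat3": "Smart Phones"},
    {"sku": "P2", "title": "apple iphone 6s gold 64GB", "brand": "Apple", "cat3": "Smart Phones"},
    {"sku": "P3", "title": "oneplus 5 6GB 64GB", "brand": "OnePlus", "cat3": "Smart Phones"},
    {"sku": "P4", "title": "cotton t-shirt blue", "brand": "Acme", "cat3": "T-Shirts"},
]


def build_embeddings(products, n_components=3, seed=0):
    """One-hot encode the products, reduce with truncated SVD, L2-normalise rows."""
    titles = [p["title"] for p in products]
    categorical = [[p["brand"], p["cat3"]] for p in products]
    # Binary bag of words: one 0/1 column per title token.
    title_features = CountVectorizer(binary=True).fit_transform(titles)
    # One 0/1 column per distinct brand and per distinct category.
    cat_features = OneHotEncoder(handle_unknown="ignore").fit_transform(categorical)
    features = hstack([title_features, cat_features]).tocsr()
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    reduced = svd.fit_transform(features)
    # With unit-length rows, a dot product equals cosine similarity.
    return normalize(reduced)


def most_similar(embeddings, index, top_n=3):
    """Return (row index, cosine similarity) pairs for the closest products."""
    scores = embeddings @ embeddings[index]
    scores[index] = -np.inf  # exclude the query product itself
    order = np.argsort(-scores)[:top_n]
    return [(int(i), float(scores[i])) for i in order]


embeddings = build_embeddings(products)
for i, score in most_similar(embeddings, 0, top_n=2):
    print(products[i]["sku"], round(score, 3))
```

On this toy data the gold iPhone ranks first for the silver iPhone, followed by the other smartphone. At the scale of millions of products, the brute-force dot product in `most_similar` would likely need to be replaced by an approximate nearest-neighbour index.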

But I have some questions here

1) Is using One Hot Encoder and SVD the right approach in my use case?

2) Is there any way or trick to give extra weight to the title and brand attributes in the training?

3) Do you think it is scalable? I am trying to increase the number of products to 1 million and dimension = 800-1000, but it is taking a lot of time and the system hangs/stalls or runs out of memory. ( I am using Apache PredictionIO )

4) What should my dimension value be when I want to train on 2 million products?

5) How much memory would I need to deploy the SVD-trained model to compute in-memory cosine similarity for 2 million products?

What should I use in my use case so that I can give extra weight to my important attributes and still get good results with reasonable resources? Which machine learning algorithm would be best in this case?

Now that I've stated my objections to the posting, I will give some guidance on the questions:

1) "Right Approach" often doesn't exist in ML. The supreme arbiter is whether the result has the characteristics you need. Most importantly, is the accuracy what you need, and can you find a better method? We can't tell without having a significant subset of your data set.
2) Yes. Most training methods will adjust whatever factors improve the error (loss) function. If your chosen method (SVD or other) doesn't do this automatically, then alter the error function.
3) Yes, it's scalable. The basic inference process is linear in the data set size. You got poor results because you didn't scale up the hardware when you enlarged the data set; that's part of "scale up". You might also consider scaling out (more compute nodes).
4) Well, how should the dimension scale with the data set size? I believe that empirical evidence supports this being a log(n) relationship ... you'd want a dimension of 600-700. However, you should determine this empirically.
5) That depends on how you use the results. From what you've described, all you'll need is a sorted list of the N top matches, which requires only the references and the similarity (a simple float). That's trivial memory compared to the model size, a matter of N*8 bytes.
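
One simple way to act on the weighting advice, sketched here as an assumption and not as something the original setup does, is to scale the one-hot columns of the important fields before the SVD. Since SVD minimises squared reconstruction error, multiplying a column by w makes its error count w² times as much, which amounts to altering the error function. The helper `weighted_features`, the `demo` products, and the weight values below are all hypothetical; the sketch uses scikit-learn and SciPy and compares cosine similarities directly on the weighted features to show the effect.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy products; index 0 is the query product.
demo = [
    {"title": "iphone 6s silver", "brand": "Apple",
     "shortDescription": "smartphone with 64GB storage"},
    {"title": "iphone 6s gold", "brand": "Apple",
     "shortDescription": "phone in gold colour"},
    {"title": "galaxy s7 silver", "brand": "Samsung",
     "shortDescription": "smartphone with 64GB storage"},
]


def weighted_features(products, weights):
    """Build one binary token block per field and scale each block by its weight."""
    blocks = []
    for field, weight in weights.items():
        values = [p[field] for p in products]
        block = CountVectorizer(binary=True).fit_transform(values)
        blocks.append(block * float(weight))  # scale every column of this field
    return hstack(blocks).tocsr()


plain = cosine_similarity(weighted_features(
    demo, {"title": 1.0, "brand": 1.0, "shortDescription": 1.0}))
boosted = cosine_similarity(weighted_features(
    demo, {"title": 3.0, "brand": 3.0, "shortDescription": 1.0}))

print("equal weights:  same brand", plain[0, 1], "other brand", plain[0, 2])
print("title/brand x3: same brand", boosted[0, 1], "other brand", boosted[0, 2])
```

With equal weights the Samsung phone scores higher for the silver iPhone because the descriptions match; with title and brand weighted 3x, the gold iPhone ranks first. The same weighted matrix can be fed to the SVD step, and the weights themselves are best tuned empirically.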
