I have around 2-3 million products. Each product follows this structure:
{
"sku": "Unique ID of Product ( String of 20 chars )",
"title":"Title of product eg OnePlus 5 - 6GB + 64GB",
"brand":"Brand of product eg OnePlus",
"cat1":"First Category of Product Phone",
"cat2":"Second Category of Product Mobile Phones",
"cat3":"Third Category of Product Smart Phones",
"price":500.00,
"shortDescription":"Short description about the product ( Around 8 - 10 Lines )",
"longDescription":"Long description about the product ( Around 50 - 60 Lines )"
}
The problem statement is:
Find similar products based on content (the product data) only. When an e-commerce user clicks on a product ( SKU ), I want to show products similar to that SKU in the recommendations.
For example, if the user clicks on apple iphone 6s silver, I will show these products in "Similar Products Recommendation":
1) apple iphone 6s gold or other color
2) apple iphone 6s plus options
3) apple iphone 6s options with other configurations
4) other apple iphones
5) other smart-phones in that price range
What I have tried so far
A) I have tried using 'user view events' to recommend similar products, but we do not have good enough data. It gives good results, but only for a few products, so this template is not suitable for my use case.
B) One hot encoder + Singular Value Decomposition ( SVD ) + Cosine Similarity
I have trained my model on around 250 thousand products with dimension = 500, using a modified version of this prediction io template. It gives good results. I have not included the long description of the product in the training.
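For context, approach B boils down to something like the following. This is only a minimal NumPy sketch on a made-up four-product catalogue, not the PredictionIO template itself; the toy data, the `tokens` and `similar` helpers, and k = 3 are assumptions for illustration (my real run uses dimension = 500).

```python
import numpy as np

# Toy catalogue; the field values are made up for illustration.
products = [
    {"sku": "A1", "brand": "apple", "cat3": "smart phones", "title": "apple iphone 6s silver"},
    {"sku": "A2", "brand": "apple", "cat3": "smart phones", "title": "apple iphone 6s gold"},
    {"sku": "B1", "brand": "oneplus", "cat3": "smart phones", "title": "oneplus 5 64gb"},
    {"sku": "C1", "brand": "sony", "cat3": "headphones", "title": "sony wireless headphones"},
]

def tokens(p):
    # One feature per (field, value) pair; titles are split into words.
    feats = {("brand", p["brand"]), ("cat3", p["cat3"])}
    feats.update(("title", w) for w in p["title"].split())
    return feats

vocab = sorted(set().union(*(tokens(p) for p in products)))
index = {f: j for j, f in enumerate(vocab)}

# One-hot (multi-hot) matrix: rows are products, columns are features.
X = np.zeros((len(products), len(vocab)))
for i, p in enumerate(products):
    for f in tokens(p):
        X[i, index[f]] = 1.0

# Truncated SVD: keep the top-k singular directions as the reduced vectors.
k = 3  # assumption for the toy data; the real model uses 500
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = U[:, :k] * S[:k]

# Cosine similarity between all pairs of reduced vectors.
Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
sim = Zn @ Zn.T

def similar(sku, top=2):
    # Return the SKUs of the `top` most similar products, excluding the query.
    i = next(n for n, p in enumerate(products) if p["sku"] == sku)
    order = [j for j in np.argsort(-sim[i]) if j != i]
    return [products[j]["sku"] for j in order[:top]]

print(similar("A1"))
```

On this toy data the other iPhone colour ranks first for the silver iPhone, followed by the other smartphone, which is the ordering I am after.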
But I have some questions here
1) Is using One Hot Encoder and SVD the right approach for my use case?
2) Is there any way or trick to give extra weight to the title and brand attributes in the training?
3) Do you think it is scalable? I am trying to increase the number of products to 1 million and dimension = 800-1000, but it is taking a lot of time and the system hangs/stalls or runs out of memory. ( I am using apache prediction io )
4) What should my dimension value be when I want to train on 2 million products?
5) How much memory would I need to deploy the SVD trained model and compute in-memory cosine similarity for 2 million products?
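To frame questions 3 to 5, here is a rough lower bound for the reduced vectors alone. It assumes they are held as one dense matrix of 32-bit floats, which is my assumption and not something I have verified for PredictionIO; JVM and Spark overhead would come on top. The helper name `dense_vectors_gib` is made up for this sketch.

```python
def dense_vectors_gib(n_products, dim, bytes_per_value=4):
    """Size in GiB of an n_products x dim matrix of dense values."""
    return n_products * dim * bytes_per_value / 2**30

# Product counts and dimensions mentioned in this question.
for n, d in [(250_000, 500), (1_000_000, 1000), (2_000_000, 500), (2_000_000, 1000)]:
    print(f"{n:>9,} products x {d:>4} dims: {dense_vectors_gib(n, d):.2f} GiB")

# Precomputing every pairwise similarity is a different order of magnitude:
# an n x n matrix instead of n x dim.
full_tib = dense_vectors_gib(2_000_000, 2_000_000) / 1024
print(f"2,000,000 x 2,000,000 similarities: {full_tib:.1f} TiB")
```

So 2 million products at dimension = 500 is about 4 GB of raw vector data under these assumptions, while a full precomputed similarity matrix is clearly out of reach and similarities would have to be computed per query.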
What should I use in my use case so that I can give extra weight to my important attributes and still get good results with reasonable resources? Which machine learning algorithm would be best suited to this case?
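To illustrate what I mean by giving weight to important attributes: one idea would be to scale each field's one-hot features by a per-field factor before computing similarity (or before the SVD). The sketch below is plain Python on made-up products; `FIELD_WEIGHTS`, its values, and the helper names are hypothetical, and I do not know whether this is the recommended way to do it.

```python
import math

# Hypothetical per-field weights; the values are guesses that would need tuning.
FIELD_WEIGHTS = {"title": 3.0, "brand": 2.0, "cat3": 1.0, "shortDescription": 0.5}
UNIFORM = {field: 1.0 for field in FIELD_WEIGHTS}

def weighted_vector(product, weights):
    # Sparse one-hot vector: each (field, word) feature is set to the field weight.
    vec = {}
    for field, weight in weights.items():
        for word in str(product.get(field, "")).lower().split():
            vec[(field, word)] = weight
    return vec

def cosine(a, b):
    dot = sum(v * b.get(key, 0.0) for key, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

silver = {"title": "apple iphone 6s silver", "brand": "apple", "cat3": "smart phones",
          "shortDescription": "4g phone with 12mp camera"}
gold = {"title": "apple iphone 6s gold", "brand": "apple", "cat3": "smart phones",
        "shortDescription": "lte handset with retina display"}
other = {"title": "oneplus 5 64gb", "brand": "oneplus", "cat3": "smart phones",
         "shortDescription": "4g phone with 12mp camera"}

def score(a, b, weights):
    return cosine(weighted_vector(a, weights), weighted_vector(b, weights))

for name, weights in [("uniform", UNIFORM), ("weighted", FIELD_WEIGHTS)]:
    print(name, round(score(silver, gold, weights), 3), round(score(silver, other, weights), 3))
```

In this toy data, uniform weights rank the OnePlus phone (same description) slightly above the gold iPhone, while the title and brand weights put the gold iPhone clearly first.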