Recommendations using R with SimpleDB or BigQuery or using PHP with SimpleDB

Asked 19/8, 2011 at 12:33 Answered 20/8, 2011 at 6:16

Solved r hadoop amazon-simpledb mahout google-bigquery

I am currently working on a system that generates product recommendations like those on Amazon: "People who bought this also bought this..."

Current Scenario:

Extract the client's Google Analytics data and insert it into the database.
On the client's website, when a product page loads, an API call is made to get recommendations for the product being viewed.
When the API receives a product ID in a request, it looks in the database, retrieves the recommended product IDs (using association rules), and sends them as the response.
The list of product IDs is then processed at the client end to get the product details (image, price, ...), which are displayed on the website.
Currently I am using PHP and MySQL with the gapi package and a REST API, with storage on Amazon EC2.
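
To make the lookup step concrete, here is a minimal sketch of how "bought X, also bought Y" rules could be derived in batch and then served by product ID. It is written in Python purely for illustration (the setup described above is PHP and MySQL), and the function names, thresholds, and sample orders are all hypothetical rather than taken from the actual system.

```python
from collections import defaultdict
from itertools import combinations

def build_rules(orders, min_support=2, min_confidence=0.5):
    """Batch step: derive 'bought A -> also bought B' rules from past orders."""
    item_count = defaultdict(int)  # number of orders containing each product
    pair_count = defaultdict(int)  # number of orders containing both products
    for order in orders:
        items = sorted(set(order))
        for item in items:
            item_count[item] += 1
        for a, b in combinations(items, 2):
            pair_count[(a, b)] += 1
            pair_count[(b, a)] += 1

    rules = defaultdict(list)
    for (a, b), together in pair_count.items():
        confidence = together / item_count[a]  # estimate of P(B bought | A bought)
        if together >= min_support and confidence >= min_confidence:
            rules[a].append((b, confidence))
    # Strongest rules first; ties broken by product ID for stable output
    for a in rules:
        rules[a].sort(key=lambda r: (-r[1], r[0]))
    return dict(rules)

def recommend(rules, product_id, limit=5):
    """Serving step: return recommended product IDs for the viewed product."""
    return [pid for pid, _ in rules.get(product_id, [])[:limit]]

# Hypothetical orders, each a list of product IDs
orders = [
    ["p1", "p2", "p3"],
    ["p1", "p2"],
    ["p2", "p3"],
    ["p1", "p4"],
]
rules = build_rules(orders)
print(recommend(rules, "p1"))
```

In this sketch only `recommend` would run per API request; `build_rules` is the part that could be moved to a batch job.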

My question is: if I have to choose among the following, which would be the best choice for implementing the concept described above?

PHP with SimpleDB or BigQuery.
R language with BigQuery.
RHIPE (R and Hadoop) with SimpleDB.
Apache Mahout.

Please help!

Prana answered 19/8, 2011 at 12:33 Comment(0)

This isn't so easy to answer, because the constraints are fairly specialized.

The following considerations can be made, though:

BigQuery is not yet public. Thus, with a small user base, even if you are in the preview population, it will be harder to get advice on improvement.
Each of your options involves a modeling system and a storage system. Apache Mahout is not a storage mechanism, so it won't necessarily work on its own. I used to believe that its machine learning implementations were a pastiche of a few Google Summer of Code projects, but I've updated that view on the suggestion of a commenter. It still looks like it has rather uneven and spotty coverage of different algorithms, and it's not particularly clear how the components are supported or maintained. I encourage an evangelist for Mahout to address this.

As a result, this eliminates the 1st, 2nd, and 4th options.

What I don't quite get is the need for a real-time server to utilize Hadoop and RHIPE. That should be done in your batch processing for developing the recommendation models, not in real-time. I suppose you could use RHIPE as a simple one-stop front end for firing off queries.

I'd recommend using RApache instead of RHIPE, because you can get your packages and models pre-loaded. I see no advantage to using Hadoop in the front end, but it would be a very natural back end system for the model fitting.

(Update 1) Other interface options include Rserve (http://www.rforge.net/Rserve/) and possibly RStudio in server mode. There are R/PHP interfaces (see comments below), but I suspect it would be better to access R through HTTP or TCP/IP.

(Update 2) Addressing the whole process, the basic idea I see is that you could query the data from PHP and pass it to R or, if you wish to query from within R, look at the link in the comments (to the OmegaHat tools) or post a new question about R & SimpleDB. I'm sure someone else on SO would be able to give better insight on this particular connection. RApache would let you instantiate many R processes already prepared with packages loaded and data in RAM; thus you would only need to pass whatever data needs to be used for prediction. If your new data is a small vector then RApache should be fine, and that seems to be the case for the data being processed in real time.
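
The "load once, serve many" idea can be sketched in a language-neutral way. The following is a rough Python illustration only, not RApache or Rserve code: the class name, the JSON format of the batch output, and the product IDs are assumptions made up for the example.

```python
import json

class RecommendationService:
    """Long-lived worker: the model is loaded once and reused for every request."""

    def __init__(self, model_json):
        # Expensive step, done once at startup (the analogue of an R process
        # that already has its packages loaded and data in RAM).
        self.model = json.loads(model_json)
        self.requests_served = 0

    def handle(self, product_id, limit=3):
        # Cheap step, done per request: only the product ID has to be passed in.
        self.requests_served += 1
        return self.model.get(product_id, [])[:limit]

# Hypothetical output of the batch model-fitting job
batch_output = json.dumps({"p1": ["p2", "p7", "p9", "p4"], "p2": ["p1"]})

service = RecommendationService(batch_output)
print(service.handle("p1"))
print(service.handle("unknown"))
```

The point of the design is that nothing heavy happens inside `handle`; the model fitting stays in the batch back end.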

Froh answered 19/8, 2011 at 23:10 Comment(24)

I can't figure out why you think Mahout is a bunch of GSoC projects, or somehow not ready for use. Counting just the code I wrote myself, I can tell you I maintain it, improve it, have done so since 2005, and know it's used 'in anger' in production. Sorry you may have had a bad impression, but this is flatly wrong. – Octaviooctavius 20/8, 2011 at 6:11

I have two processes: 1> generating the recommendations from the raw data stored in MySQL; 2> after processing the data (i.e. applying the clustering and recommendation algorithms), storing the data in a database (SimpleDB or BigQuery). This database will be queried through API requests. Now, 1> will be processed in batch and 2> needs a real-time response. As per Sean's reply, it seems I can use Mahout for 1>, but for 2> I am still not clear about the right combination of database with Mahout. But according to you, Mahout is not ready for industrial usage. Please help me select the right combination. – Prana 20/8, 2011 at 7:42

@Sean: I mean no disrespect - it takes a lot of time to develop good modeling libraries, and working on Mahout is certainly a labor of love. Still, addressing the OP's question, it is a small project with only a few algorithms and rather spotty coverage across algorithm classes. I realize I may have been wrong about GSoC - in the past it looked like most of Mahout's coverage came from one-off projects. – Froh 20/8, 2011 at 12:5

(Continued) I realize that there is adult supervision and am glad that you and others mentor students to teach them about ML and scalability, in addition to your own contributions to the codebase. In light of this, I will update my answer. – Froh 20/8, 2011 at 12:7

@Sean, I've updated my answer. I apologize if my comment about not being ready to be used in production seems insulting. I've come to realize I have a pretty high standard for what is used in production. Mahout is far from alone in not making the cut. That's not to say that others may not use Mahout. I have seen many people do things I would not. – Froh 20/8, 2011 at 12:15

@Samridhi: I've updated my answer. Certainly Mahout can be used in an industrial application, so I over-stated my opinion. For industrial applications for which I am responsible, I would not use it, but there are a great many other systems I would not use, either. For me, being able to investigate the data, knowing that the statistics are rock-solid, and being able to vigorously investigate a model are critical. I simply don't see that I can do that yet with Mahout. On the other hand, most people do not do this, so they must make their own decisions about suitability. – Froh 20/8, 2011 at 12:21

@Iterator: After reading your updated answer, let's keep Mahout aside till I get more insight into it. Now, how good would it be to use R with SimpleDB? I mean, which modeling system & storage system do you suggest for optimum performance? – Prana 20/8, 2011 at 12:23

(Continued) If you do go the route of using Mahout, I certainly encourage you to get the book that @Sean has co-authored with several other key Mahout participants - "Mahout in Action". I'm sure it'll be a good book to have by your desk while developing your application. – Froh 20/8, 2011 at 12:24

@samridhi: It seems tools for this already exist: r.789695.n4.nabble.com/Amazon-SimpleDB-and-R-td905712.html However, as you're also familiar with PHP, you could simply issue requests via PHP and pass the data directly to R as you see fit - be it in RAM, via RApache, or some other method. A rather clumsy approach is here: stanford.edu/~mjockers/cgi-bin/drupal/node/25; an older module appears here: steve-chen.net/document/r/r_php – Froh 20/8, 2011 at 12:29

(Continued) However, I'd really recommend using RApache as a front end, and using that to access R. This way you need not spend a lot of time working on inter-language connections. Two other options are Rserve (rforge.net/Rserve) and RStudio in server mode. – Froh 20/8, 2011 at 12:31

@iterator: Once the choice of modeling system and storage system is finalized, I will definitely go for it. But right now, as I am comparing the possible choices, it would be really helpful to know your opinion on the use of R with SimpleDB. Also, I am considering the use of Mahout with SimpleDB. – Prana 20/8, 2011 at 12:35

(Continued) Hey, thanks a lot for your suggestions. For now, I will move ahead with the resources that you have suggested. – Prana 20/8, 2011 at 12:39

@samridhi: It's probably best to ask about these two connections (R+SimpleDB and Mahout+SimpleDB) as two additional separate questions, just to get precise answers on only these connections. I have pointed to resources for accessing SimpleDB, but your data flow looks like PHP could query SimpleDB and you can pass that to R via RApache, which seems both easier for you to code (it's in PHP) and a cleaner & faster way to access R. – Froh 20/8, 2011 at 12:41

Good luck! If you could post a separate question on R + SimpleDB, that would be great. I could do it, too, but encourage you. I'm sure many R users with big datasets would find the Q&A useful. – Froh 20/8, 2011 at 12:45

Hey, this is what I can sum up from our discussion: I will get the client's data from Google Analytics using PHP and store it in MySQL. I will pass this raw data to R, again through PHP, to generate recommendations, and store the processed data in SimpleDB (this is the data that has to be returned on API requests, hence SimpleDB instead of MySQL for speed and scalability). SimpleDB will return the recommended product IDs on an API request that contains the ID of the product for which recommendations are to be generated. – Prana 20/8, 2011 at 12:52

Yes, I will definitely post a separate question. Thanks a lot for your guidance. Please also correct me if the summary above is incorrect. – Prana 20/8, 2011 at 12:55

Your summary looks good, though I'm not quite clear on what you mean about SimpleDB returning the recommended product ID. – Froh 20/8, 2011 at 13:2

@Iterator: I have two processes. Step 1> generating the recommendations from the raw data stored in MySQL. Step 2> after processing the data (i.e. applying the clustering and recommendation algorithms), storing the generated recommendations in a database, this time SimpleDB, as this database will be queried through API requests to return the recommendations (I have mentioned details of the API in the question). Step 1> will be processed in batch, hence MySQL, and step 2> needs a real-time response, hence SimpleDB. Hope this clarifies the process. Based on this, is my summary correct? – Prana 20/8, 2011 at 13:32

It seems fine. I think you'll want to try out several of these methods to see what is easiest to maintain. – Froh 20/8, 2011 at 14:14

Yes, I am going through the links you have suggested. These resources are amazing. But I can't understand the use of RStudio except for development purposes. – Prana 20/8, 2011 at 14:28

RStudio has two kinds of interfaces: desktop and server. It's great for the desktop, though what I was advocating was consideration of its server functionality. I only mentioned it for completeness, and am not sure that that will be the best route, as I suspect RApache or Rserve will be a better fit. – Froh 20/8, 2011 at 15:12

@Froh I don't find it insulting, I am just not sure how much you know about the project. I am not sure how to be useful if all I know of your issue is, "well I would not use it in production" -- why? what piece? There are 20-30 algorithms in there; I don't think there's any generalization, good or bad, that applies to that many different things! What stats are you unsure about? If you want to explore a model, use R. I don't think that's what Mahout is for though. – Octaviooctavius 20/8, 2011 at 17:50

@Sean: Thank you for responding & understanding. It is a bit hard to describe my considerations in comments. :) However, I will take a look at your book and see what I need to reconsider. – Froh 20/8, 2011 at 18:5

@iterator: Thanks a lot! I will explore RStudio more, but RApache and Rserve really simplified my problem. But I am not sure whether, after using Rserve, I would still need RApache, as Rserve gives me the functionality of R in PHP. – Prana 21/8, 2011 at 11:55

If you want a real-time API for recommendations based on data in a database, Apache Mahout does this directly. You want to use ReloadFromJDBCDataModel, put on top a GenericItemBasedRecommender, and use the servlet-based wrapper in the examples module. It's probably a day or two of work to get familiar with the code and customize it to your needs, but it's pretty simple.

When you get past about 100M data points you would need to look at distributing the computation with Hadoop. That's a fair bit more complex. Mahout has a distributed recommender too, which you can customize.
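
For readers unfamiliar with what an item-based recommender does conceptually, here is a small self-contained sketch. To be clear, this is not Mahout's API or its implementation: it is plain Python using cosine similarity between items and a similarity-weighted average of the user's own ratings, and the function names and preference data are invented for illustration.

```python
from math import sqrt

def item_similarities(prefs):
    """Cosine similarity between items, from prefs = {user: {item: rating}}."""
    by_item = {}
    for user, ratings in prefs.items():
        for item, value in ratings.items():
            by_item.setdefault(item, {})[user] = value
    norms = {item: sqrt(sum(v * v for v in users.values()))
             for item, users in by_item.items()}
    sims = {}
    for a in by_item:
        for b in by_item:
            if a == b:
                continue
            common = by_item[a].keys() & by_item[b].keys()
            if common:  # only items rated by at least one common user
                dot = sum(by_item[a][u] * by_item[b][u] for u in common)
                sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims

def recommend(prefs, sims, user, limit=3):
    """Rank items the user has not rated by a similarity-weighted average."""
    seen = prefs[user]
    scores, weights = {}, {}
    for (a, b), s in sims.items():
        if a in seen and b not in seen:
            scores[b] = scores.get(b, 0.0) + s * seen[a]
            weights[b] = weights.get(b, 0.0) + s
    ranked = sorted(((scores[i] / weights[i], i) for i in scores if weights[i] > 0),
                    key=lambda t: (-t[0], t[1]))
    return [item for _, item in ranked[:limit]]

# Hypothetical preference data: user -> {item: rating}
prefs = {
    "u1": {"A": 5, "B": 4},
    "u2": {"A": 4, "B": 5, "C": 5},
    "u3": {"B": 2, "C": 1, "D": 4},
}
sims = item_similarities(prefs)
print(recommend(prefs, sims, "u1"))
```

The all-pairs similarity loop is quadratic in the number of items, which is fine for a toy example but is exactly the part that becomes expensive at scale.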

Octaviooctavius answered 20/8, 2011 at 6:16 Comment(9)

Hey, thanks for responding. As per your suggestion and my needs, I need a combination of data processing (R or Mahout) and data storage (SimpleDB or BigQuery). Is it good to use Mahout with SimpleDB? Also, how will Hadoop help me here? – Prana 20/8, 2011 at 7:32

Nothing in Mahout uses SimpleDB directly. If you want to use a remote data store, where access is relatively slow, see my article on integrating with Cassandra; you can perhaps reuse that approach (acunu.com/blogs/sean-owen/recommending-cassandra). Hadoop is for distributing a computation to parallelize and scale it. Don't use it unless you need it. If you have less than tens of millions of rows you don't need it. – Octaviooctavius 20/8, 2011 at 18:6

Hey, I went through your article; you made the concept really simple to grasp. Now, correct me if I am wrong: Cassandra's role is to hold the data, which in turn acts as input for Mahout. But this is what happens in my case... – Prana 21/8, 2011 at 11:40

(Continued) I have two processes. Step 1> generating the recommendations from the raw data stored in MySQL. Step 2> after processing the data (i.e. applying the clustering and recommendation algorithms using Mahout), storing the generated recommendations in a database, this time SimpleDB, as this database will be queried through API requests to return the recommendations (I have mentioned details of the API in the question). Step 1> will be processed in batch, hence MySQL, and step 2> needs a real-time response, hence SimpleDB. Should Cassandra replace SimpleDB here? – Prana 21/8, 2011 at 11:43

Hey Sean, it would be a great help if you could comment on the following architecture: 1> MySQL to store raw data; 2> process the data using Mahout; 3> store the output in SimpleDB (if possible) or Cassandra. – Prana 23/8, 2011 at 7:18

That's fine. You can put the data wherever you want. Doesn't need to be Cassandra, but could be. – Octaviooctavius 23/8, 2011 at 22:50

Hey, thanks a lot. Just one more query: if in future I want to apply a clustering algorithm to group users into clusters and then apply association rules specific to each cluster, will Mahout give me enough flexibility to tweak the existing algorithms according to my needs? According to this source, "iletken.com.tr/documents/mahout_review_by_iletken.pdf" (God knows how much to rely on it), it seems I will have difficulty in future. – Prana 24/8, 2011 at 7:35

I don't know how to answer that -- not sure exactly what you're doing or why. It's open-source, you can do what you want. The link you cite is old, and is from someone trying to sell their own product. I do not think it is a very good reference. But I also don't think it pertains to what you are doing. – Octaviooctavius 24/8, 2011 at 22:55

Yes, true, the source that I cited doesn't exactly pertain to my needs, but there were a few points which I thought would affect my implementation. But as you explained, it isn't a reliable source. Actually, I wanted to apply a clustering algorithm to users and, after the users are grouped into clusters (say Cluster 'A' and Cluster 'B') based on their similarity, apply association rules to generate recommendations specific to each cluster. My question was whether I can achieve this using Mahout. – Prana 25/8, 2011 at 7:30
