Crawling the Google Play store

3

11

I want to crawl the Google Play store to download the web pages of all Android applications (all pages under the base URL https://play.google.com/store/apps/). I checked the Play store's robots.txt file and it disallows crawling these URLs.

Also, when I browse the Google Play store, I can only see up to 3 pages of top applications for each category. How can I get the other application pages?

If anyone has tried crawling Google Play, please let me know the following:
a) Were you successful in crawling the Play store? If yes, how did you do it?
b) How can I crawl the hidden application pages that are not visible in the top apps for each category?
c) Is there a technique to download the applications themselves, and not just the web pages?

I already searched around and found the following links:

a) https://code.google.com/p/android-market-api/ 
b) https://code.google.com/p/android-marketplace-crawler/source/checkout 
c) http://mohsin-junaid.blogspot.co.uk/2012/12/how-to-install-android-marketplace.html 
d) http://mohsin-junaid.blogspot.in/2012/12/how-to-download-multiple-android-apks.html

Thanks!

Granthem answered 8/6, 2013 at 17:46 Comment(0)
7

First of all, Google Play's robots.txt does NOT disallow the pages with base "/store/apps".

If you want to crawl Google Play, you need to develop your own web crawler, parse the HTML pages, and extract the app meta-data you need (e.g. title, description, price). This topic has been covered in this other question. There are libraries that help with that, for instance:

The harder part is to "find" the app-pages to crawl. You could use 1) the Google Play Sitemap or 2) follow the app-links you find in every page you crawl as explained in the Link Extractor documentation (in case you plan to use Scrapy).
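Option 2 above (following the app links found on every crawled page) can be sketched with the Python standard library alone. This is only a minimal illustration, not Scrapy's Link Extractor: the names `AppLinkParser` and `extract_package_ids` are hypothetical, and it assumes app pages use the `/store/apps/details?id=<package>` URL form seen in the `market_url` field further down. Fetching pages, rate limiting and handling JavaScript-rendered content are left out.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlparse

BASE_URL = "https://play.google.com"


class AppLinkParser(HTMLParser):
    """Collects package ids from <a href> values that point at app detail pages."""

    def __init__(self):
        super().__init__()
        self.package_ids = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the store's base URL.
        url = urlparse(urljoin(BASE_URL, href))
        # Assumption: app pages live at /store/apps/details?id=<package>.
        if url.netloc != "play.google.com" or url.path != "/store/apps/details":
            return
        ids = parse_qs(url.query).get("id")
        if ids:
            self.package_ids.add(ids[0])


def extract_package_ids(html):
    """Return the set of package ids linked from one HTML page."""
    parser = AppLinkParser()
    parser.feed(html)
    return parser.package_ids
```

A crawler would call `extract_package_ids` on each downloaded page and add any package ids it has not seen yet to its queue of pages to visit.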

Another option is to use an open-source library based on ProtoBuf to fetch meta-data about an app. Here is the link to the project: https://code.google.com/archive/p/android-market-api. This library fetches app meta-data from Google Play on behalf of a valid Google account, but in this case too you need a crawler to "find" which apps are available and schedule their meta-data retrieval. This other open-source project can help you with that: https://code.google.com/archive/p/android-marketplace-crawler.

If you don't want to implement all this yourself, you could use a third-party managed service to access Android apps' meta-data through a JSON-based API. For instance, 42matters.com (the company I work for) offers an API for both Android and iOS to retrieve apps' meta-data. More details here:

https://42matters.com/app-market-data

In order to get the Title, Icon, Description, Downloads for an app you can use the "lookup" endpoint as documented here:

https://42matters.com/docs/app-market-data/android/apps/lookup

This is an example of the JSON response for the "Angry Birds Space Premium" app:

{
    "package_name": "com.rovio.angrybirdsspace.premium",
    "title": "Angry Birds Space Premium",
    "description": "Play over 300 interstellar levels across 10 planets...",
    "short_desc": "The #1 mobile game of all time blasts off into space!",
    "rating": 4.3046236038208,
    "category": "Arcade",
    "cat_key": "GAME_ARCADE",
    "cat_keys": [
        "GAME_ARCADE",
        "GAME",
        "FAMILY_EDUCATION",
        "FAMILY"
    ],
    "price": "$1.15",
    "downloads": "1,000,000 - 5,000,000",
    "version": "2.2.1",
    "content_rating": "Everyone",
    "promo_video": "https://www.youtube.com/embed/g6AL9YqRHaI?ps=play&vq=large&rel=0&autohide=1&showinfo=0&autoplay=1",
    "market_update": "2015-07-03T00:00:00+00:00",
    "screenshots": [
        "https://lh3.googleusercontent.com/ZmuBQzIy1G74coPrQ1R7fCeKdJmjTdpJhNrIHBOaFyM0N2EYdUPwZaQjnQUtiUDGmac=h310",
        "https://lh3.googleusercontent.com/Xg2Aq70ZH0SnNhtSKH7xg9jCfisWgmmq3C7xQbx6YMhTVAIRqlRJeH8GYtjxapb_qR4=h310",
        "https://lh3.googleusercontent.com/T4o5-2_UP82sj4fSSegbjrGmslNHlfvtEYuZacXMSOC55-7eyiKySw05lNF1QQGO2FeU=h310",
        "https://lh3.googleusercontent.com/f2ennaLdivFu5cQQaVPKsRcWxB8FS5T4Bkoy3l0iPW9-GDDnTVRhvR5kz6l4m8FL1c8=h310",
        "https://lh3.googleusercontent.com/H-9M03_-O9Df1nHr2-rUdjtk2aeBY3bAxnqSX3m2zh_aV8-K1t0qU1DxLXnK0GrDAw=h310"
    ],
    "created": "2012-03-22T08:24:00+00:00",
    "developer": "Rovio Entertainment Ltd.",
    "number_ratings": 20812,
    "price_currency": "$",
    "icon": "https://lh3.ggpht.com/aQaIEGrmba1ENSEgUtArdm3yhJUug7BRWlu_WaspoJusZyHv1rjlWtYqe_qRjE_Kmh1E=w300",
    "icon_72": "https://lh3.ggpht.com/aQaIEGrmba1ENSEgUtArdm3yhJUug7BRWlu_WaspoJusZyHv1rjlWtYqe_qRjE_Kmh1E=w72",
    "market_url": "https://play.google.com/store/apps/details?id=com.rovio.angrybirdsspace.premium&referrer=utm_source%3D42matters.com%26utm_medium%3Dapi"
}
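As a rough sketch of how a response like the one above could be consumed, the following Python parses the JSON and turns the `downloads` string into numeric bounds. The helper names `parse_download_range` and `summarize_app` are hypothetical, and the code assumes `downloads` always has the "low - high" form shown in this sample; other formats are not handled.

```python
import json


def parse_download_range(downloads):
    """Split a string such as "1,000,000 - 5,000,000" into two integers."""
    # Assumption: the field always has the "low - high" form shown above.
    low, high = (int(part.replace(",", "").strip()) for part in downloads.split("-"))
    return low, high


def summarize_app(raw_json):
    """Pick a few fields out of a lookup-style JSON response."""
    app = json.loads(raw_json)
    low, high = parse_download_range(app["downloads"])
    return {
        "package_name": app["package_name"],
        "title": app["title"],
        "rating": round(app["rating"], 1),
        "min_downloads": low,
        "max_downloads": high,
    }
```

With numeric bounds in hand, filtering a batch of responses by install count becomes a plain comparison on `min_downloads` and `max_downloads`.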

I hope this helps, otherwise feel free to get in touch with me. I know this topic quite well and can point you in the right direction.

Regards,

Andrea

Vulpine answered 27/9, 2016 at 10:35 Comment(4)
I can't see any API in 42matters which can be used for retrieving all applications. Let's say I want to collect the email addresses of all Google Play applications that have more than 10,000 installs and fewer than 5 million installs. 42matters only offers APIs for fetching an app by package name or by search term. - Coterie
@Coterie with the Advanced Query API 42matters.com/docs/app-market-data/android/apps/… you can retrieve exactly what you need by setting downloads_gte and downloads_lte respectively. Hope this helps :) - Vulpine
Those sitemaps appear to be exclusively Google Play Books pages. Do you know of sitemaps for Apps? - Libeler
@Libeler did you find sitemaps for Apps? - Fontanez
2

I have done this in Python before. What you need is a web test automation library called Selenium. It can execute JavaScript code and return the result to Python, so your program can click the "show more" button itself. Once you have all the links on a single category page, you can fetch the info for each app. A simple demo is here. Hope this helps.

Paulin answered 7/8, 2014 at 9:23 Comment(0)
2

Google doesn't disallow crawling of /store/apps pages.

There is no mention of "/store/apps" in robots.txt.

See https://play.google.com/robots.txt
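Rather than reading the file by eye, the same check can be done in Python with the standard library's `urllib.robotparser`. The sketch below works on robots.txt text passed in as a string; the helper name `is_allowed` is hypothetical, and the rules in the example are invented for illustration and are not the contents of Google Play's actual file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only (not Google Play's real robots.txt).
EXAMPLE_RULES = """\
User-agent: *
Disallow: /store/private/
"""

if __name__ == "__main__":
    print(is_allowed(EXAMPLE_RULES, "https://play.google.com/store/apps/details?id=com.example"))
```

To test against the live file instead, `RobotFileParser` can also load it directly with `set_url("https://play.google.com/robots.txt")` followed by `read()`.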

Gustav answered 9/1, 2015 at 8:0 Comment(0)
