web-crawler Questions
6
Solved
From the HTTP server's perspective.
Tartu asked 22/7, 2010 at 12:6
4
Every hour and a half I'm getting a flood of requests from http://www.facebook.com/externalhit_uatext.php.
I know what these requests should mean, but this behavior is very odd.
On a regular bas...
Grouch asked 19/3, 2012 at 16:27
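A quick way to confirm the flood really comes from Facebook's share crawler is to count its hits per minute in the access log. A minimal sketch, assuming an nginx-style combined log at a hypothetical path:

import re
from collections import Counter

ua = re.compile(r"facebookexternalhit")
per_minute = Counter()
# Log path and combined-log format are assumptions; adjust for your server
with open("/var/log/nginx/access.log") as log:
    for line in log:
        if ua.search(line) and "[" in line:
            # Timestamps look like [19/Mar/2012:16:27:00 ...]; keep up to the minute
            per_minute[line.split("[", 1)[1][:17]] += 1
print(per_minute.most_common(10))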
3
Solved
There's a way of excluding complete page(s) from Google's indexing. But is there a way to specifically exclude certain part(s) of a web page from Google's crawling? For example, exclude the side-ba...
Ailanthus asked 5/1, 2010 at 7:39
2
I have discovered through Google's webmaster tools that Google is crawling paths that look like links embedded in JSON in a <script type="application/json"> tag. This JSON is later parsed and...
Asti asked 9/11, 2017 at 20:3
10
Solved
The Facebook Crawler is hitting my servers multiple times every second and it seems to be ignoring both the Expires header and the og:ttl property.
In some cases, it is accessing the same og:image...
Tsarevna asked 30/3, 2018 at 16:2
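One mitigation is throttling just that user-agent at the application layer. A minimal stdlib sketch; the one-second window and two-hit limit are arbitrary assumptions, not Facebook-recommended values:

import time
from http.server import BaseHTTPRequestHandler, HTTPServer

WINDOW, LIMIT = 1.0, 2   # assumed throttle: at most 2 crawler hits per second
hits = []

class Throttle(BaseHTTPRequestHandler):
    def do_GET(self):
        if "facebookexternalhit" in self.headers.get("User-Agent", ""):
            now = time.time()
            hits[:] = [t for t in hits if now - t < WINDOW]
            if len(hits) >= LIMIT:
                self.send_response(429)  # Too Many Requests
                self.end_headers()
                return
            hits.append(now)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

HTTPServer(("", 8000), Throttle).serve_forever()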
4
I've been searching for npm packages, but they all seem unmaintained and rely on outdated user-agent databases. Is there a reliable and up-to-date package out there that helps me detect crawlers...
Deficiency asked 7/1, 2016 at 4:57
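Under the hood such packages mostly just match the User-Agent header against a maintained pattern list. A language-agnostic sketch of that technique in Python; the pattern list here is a tiny illustration, not a real database:

import re

# A few illustrative patterns; real packages ship hundreds, kept up to date
BOT_PATTERNS = re.compile(r"bot|crawler|spider|slurp|facebookexternalhit", re.IGNORECASE)

def is_crawler(user_agent):
    return bool(user_agent) and bool(BOT_PATTERNS.search(user_agent))

print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1)"))   # True
print(is_crawler("Mozilla/5.0 (Windows NT 10.0) Chrome/120"))  # False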
3
Solved
On these sites (https://coinalyze.net/ethereum-classic/liquidations/, BTC/USDT), I am able to add the following indicators into the graph [Liquidations, Long Liquidations, Short Liquidations, Aggregated L...
Trait asked 12/5, 2021 at 19:27
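Charts like these are usually fed by a JSON endpoint visible in the browser's network tab, which is easier to scrape than the rendered graph. A sketch of that approach; the endpoint URL and parameters below are purely hypothetical:

import requests  # third-party: pip install requests

# Hypothetical endpoint; find the real one in the devtools network tab
resp = requests.get(
    "https://coinalyze.net/api/liquidations",        # assumption
    params={"symbol": "BTCUSDT", "interval": "1h"},  # assumption
    headers={"User-Agent": "Mozilla/5.0"},
)
resp.raise_for_status()
print(resp.json())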
6
Solved
So my brother wanted me to write a web crawler in Python (self-taught) and I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library reference, but I have a few problems
1. ht...
Rhaetian asked 20/8, 2010 at 17:54
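For a first crawler the standard library alone is enough: fetch a page, collect its links, and visit them breadth-first. A minimal Python 3 sketch; the seed URL and the 20-page cap are placeholders:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(href)

seen, queue = set(), deque(["https://example.com/"])  # placeholder seed
while queue and len(seen) < 20:
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except (OSError, ValueError):
        continue  # skip unreachable pages and non-http links
    parser = LinkParser()
    parser.feed(html)
    queue.extend(urljoin(url, link) for link in parser.links)
print(len(seen), "pages visited")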
3
import matplotlib.pyplot as plt
import numpy as np
labels=['Siege', 'Initiation', 'Crowd_control', 'Wave_clear', 'Objective_damage']
markers = [0, 1, 2, 3, 4, 5]
str_markers = ["0", "...
Lifeless asked 20/10, 2018 at 21:17
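Labels like these usually end up on a radar (spider) chart. A sketch of one way to finish the snippet; the stat values are made-up placeholders:

import matplotlib.pyplot as plt
import numpy as np

labels = ['Siege', 'Initiation', 'Crowd_control', 'Wave_clear', 'Objective_damage']
stats = [3, 4, 2, 5, 3]  # placeholder values

angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
stats, angles = stats + stats[:1], angles + angles[:1]  # close the polygon

fig, ax = plt.subplots(subplot_kw=dict(polar=True))
ax.plot(angles, stats)
ax.fill(angles, stats, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
plt.show()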
3
Solved
I'm interested in automating reverse image search. Yandex in particular is great for busting catfishes, even better than Google Images. So, consider this Python code:
import requests
import webb...
Purulence asked 23/5, 2020 at 20:16
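One way to script this is to build the reverse-image-search URL yourself and hand it to the default browser. A sketch; the rpt=imageview query format is my assumption about Yandex's interface and may change:

import webbrowser
from urllib.parse import urlencode

image_url = "https://example.com/photo.jpg"  # placeholder image to look up
query = urlencode({"rpt": "imageview", "url": image_url})  # assumed format
webbrowser.open("https://yandex.com/images/search?" + query)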
4
Solved
I want to send a value for "User-agent" while requesting a webpage using Python Requests. I am not sure if it is okay to send this as part of the headers, as in the code below:
debug = {'verbo...
Garget asked 15/5, 2012 at 17:48
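Passing it through the headers dict is the supported way in Requests; the library merges it into the outgoing request:

import requests  # third-party: pip install requests

headers = {"User-Agent": "my-crawler/1.0 (+https://example.com/bot)"}  # placeholder UA
resp = requests.get("https://example.com", headers=headers)
print(resp.request.headers["User-Agent"])  # shows what was actually sent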
2
Solved
In every paper I have read about crawler proposals, I see that one important component is the DNS Resolver.
My question is:
Why is it necessary? Can't we just make a request to http://www.some-do...
Hannibal asked 28/10, 2012 at 5:12
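The short answer is scale: every fetch triggers an implicit DNS lookup, and a crawler visiting millions of URLs repeats lookups for the same hosts unless it caches and batches them itself, which is why papers treat the resolver as its own component. A toy sketch of the caching half:

import socket
from functools import lru_cache

@lru_cache(maxsize=65536)
def resolve(host):
    # One blocking lookup per distinct host; repeats hit the cache
    return socket.gethostbyname(host)

print(resolve("example.com"))
print(resolve("example.com"))  # cache hit, no second DNS round trip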
7
I am writing Python to crawl the Twitter space using Twitter-py. I have set the crawler to sleep for a while (2 seconds) between each request to api.twitter.com. However, after some time of running (a...
Academic asked 11/1, 2012 at 5:54
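A fixed 2-second sleep can still trip rate limits once the quota window fills; the usual remedy is exponential backoff on failure. A generic sketch, independent of Twitter-py:

import time

def fetch_with_backoff(call, max_tries=5, base_delay=2.0):
    # Retry `call`, doubling the pause after each failure
    delay = base_delay
    for attempt in range(1, max_tries + 1):
        try:
            return call()
        except Exception:  # in practice, catch the client's rate-limit error
            if attempt == max_tries:
                raise
            time.sleep(delay)
            delay *= 2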
4
Solved
I'm trying to program a simple web-crawler using the Requests module, and I would like to know how to disable its default keep-alive feature.
I tried using:
s = requests.session()
s.config['ke...
Karl asked 8/1, 2014 at 23:42
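The s.config dict belongs to the pre-1.0 Requests API and was removed; in current versions the usual workaround is to request connection close on each response:

import requests  # third-party: pip install requests

s = requests.Session()
# 'Connection: close' tells the server to tear the socket down after
# each response, which effectively disables keep-alive
s.headers["Connection"] = "close"
resp = s.get("https://example.com")
print(resp.headers.get("Connection"))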
4
I am a newbie to Python. I am running Python 2.7.3, 32-bit, on a 64-bit OS. (I tried 64-bit but it didn't work out.)
I followed the tutorial and installed scrapy on my machine. I have created o...
Salado asked 12/4, 2012 at 11:58
6
Solved
I am trying to scrape a website but I don't get some of the elements, because these elements are dynamically created.
I use cheerio in node.js and my code is below.
var request = require('req...
Crave asked 26/2, 2015 at 9:49
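cheerio only parses the HTML string it is given and never executes the page's scripts, so dynamically created elements are invisible to it. The fix is to render the page in a real browser first (Puppeteer is the usual choice in node; the same idea in Python with selenium, assuming the package and a browser driver are installed):

from selenium import webdriver  # third-party: pip install selenium

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL
html = driver.page_source          # the DOM *after* scripts have run
driver.quit()
print(len(html))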
4
I'm trying to crawl all links of a sitemap.xml to re-cache a website, but the recursive option of wget does not work; I only get this response:
Remote file exists but does not contain any link -- not re...
Tachylyte asked 27/6, 2013 at 3:37
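wget's recursive mode only follows links found in HTML (and CSS), and a sitemap is plain XML, so it sees nothing to recurse into. One workaround is to extract the <loc> entries yourself and fetch each URL; a stdlib sketch with a placeholder sitemap URL:

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP = "https://example.com/sitemap.xml"  # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}  # standard namespace

tree = ET.parse(urllib.request.urlopen(SITEMAP))
for loc in tree.findall(".//sm:loc", NS):
    url = loc.text.strip()
    urllib.request.urlopen(url).read()  # hit each page to re-warm the cache
    print("re-cached", url)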
6
Solved
How do you prevent email addresses from being gathered from web pages by email spiders? Does linking them with mailto: increase the likelihood of them being picked up? Is URL-encoding useful?
Obviously the best coun...
Palace asked 8/9, 2010 at 1:17
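One classic countermeasure is rendering the address as HTML character entities: browsers display it normally, while naive regex scrapers never see an '@'. It only defeats harvesters that skip entity decoding, but it costs nothing. A small sketch:

def obfuscate(address):
    # Encode every character as a decimal HTML entity
    return "".join(f"&#{ord(ch)};" for ch in address)

link = f'<a href="mailto:{obfuscate("user@example.com")}">email me</a>'
print(link)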
2
Solved
Until recently there were several ways to retrieve Instagram user media without the need for API authentication. But apparently the website has shut them all down.
Some of the old methods:
https:/...
Cantoris asked 16/4, 2018 at 7:49
2
I am trying to crawl data from a list of URLs. I had already done this with the code below and it succeeded yesterday without any error.
But today, when I came back and ran the code again, there was an er...
Befriend asked 28/8, 2023 at 19:32
4
I spent a lot of time searching for this.
In the end I combined a number of answers and it works. I'm sharing my answer and I'd appreciate it if anyone edits it or provides us with an eas...
Peculiar asked 21/1, 2015 at 15:1
5
Solved
I put a package on PyPI for the first time ~2 months ago, and have made some version updates since then. This week I noticed the download count tracking, and was surprised to see it had been downl...
Thynne asked 10/3, 2012 at 16:23
4
I would like to be able to tell if a site lets you upload files. I can think of two main ways sites do it and ideally I'd like to be able to detect both:
Button
Drag & Drop
PhantomJS document...
Lanellelanette asked 16/12, 2021 at 12:10
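For the button case the static HTML is often enough: look for an input element with type="file". A stdlib sketch; drag-and-drop handlers are wired up in JavaScript, so detecting those still needs a scripted browser like the PhantomJS route in the question:

import urllib.request
from html.parser import HTMLParser

class UploadDetector(HTMLParser):
    found = False
    def handle_starttag(self, tag, attrs):
        if tag == "input" and dict(attrs).get("type") == "file":
            self.found = True

html = urllib.request.urlopen("https://example.com").read().decode("utf-8", "replace")
detector = UploadDetector()
detector.feed(html)
print("file upload input found:", detector.found)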
3
Solved
I use Tor to crawl web pages.
I started the tor and polipo services and added:
class ProxyMiddleware(object):
    # overwrite process_request
    def process_request(self, request, spider):
        # Set the locatio...
Ticon asked 8/12, 2014 at 18:38
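The truncated middleware usually ends by pointing each request at the local HTTP proxy. A hedged completion, assuming polipo is listening on its default port 8123 and forwarding to Tor:

class ProxyMiddleware(object):
    # Route every request through the local polipo proxy,
    # which in turn forwards traffic to Tor
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8123'  # polipo default port (assumption)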
4
Solved
I would like to get the same result as this command line:
scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json
My script is as follows:
import scrapy
from linkedin_anonymo...
Massage asked 20/12, 2015 at 15:6
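Scrapy exposes this through CrawlerProcess, which accepts the same spider arguments the -a flags carry; in recent Scrapy the -o flag maps to the FEEDS setting. A sketch, assuming the script runs inside the project so the spider is discoverable by name:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Equivalent of: scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json
settings = get_project_settings()
settings.set("FEEDS", {"output.json": {"format": "json"}})  # the -o flag
process = CrawlerProcess(settings)
process.crawl("linkedin_anonymous", first="James", last="Bond")
process.start()  # blocks until the crawl finishes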