How to reliably reproduce curl_multi timeout while testing public proxies

Relevant information: issue 3602 on GitHub

I'm working on a project that gathers and tests public/free proxies, and I noticed that when I use the curl_multi interface for testing these proxies, I sometimes get many 28 (timeout) errors. This never happens if I test each proxy on its own.

The problem is that this issue is not reliably reproducible; it does not always show up, so it could be something in curl or something else.

Unfortunately, I'm not an experienced network debugger and I don't know how to debug this issue on a deeper level. However, I wrote 2 C testing programs (one of them was originally written by Daniel Stenberg, but I modified its output to match the format of the other C program). These 2 C programs test 407 public proxies using curl:

  1. with curl_multi interface (which has the problem)

  2. with curl on many threads, one curl easy handle per thread (which has no problem)

These are the 2 C programs I wrote for testing. I'm not a C developer, so please let me know about anything wrong you notice in them.
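
For readers without the repository at hand, here is a minimal, self-contained sketch of the curl_multi pattern being tested. This is not the actual linked program: the proxy addresses and target URL are made-up placeholders; only the libcurl calls themselves (curl_multi_perform, curl_multi_wait, curl_multi_info_read) are the real API.

#include <stdio.h>
#include <curl/curl.h>

/* Discard response bodies; only the per-transfer CURLcode matters here. */
static size_t discard(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    (void)ptr; (void)userdata;
    return size * nmemb;
}

int main(void)
{
    /* Hypothetical short proxy list; the real test used 407 entries. */
    static const char *proxies[] = { "1.2.3.4:8080", "5.6.7.8:3128" };
    int n = (int)(sizeof(proxies) / sizeof(proxies[0]));
    CURLM *multi;
    CURLMsg *msg;
    int still_running, msgs_left, i;

    curl_global_init(CURL_GLOBAL_DEFAULT);
    multi = curl_multi_init();

    for(i = 0; i < n; i++) {
        CURL *easy = curl_easy_init();
        curl_easy_setopt(easy, CURLOPT_URL, "http://example.com/");
        curl_easy_setopt(easy, CURLOPT_PROXY, proxies[i]);
        curl_easy_setopt(easy, CURLOPT_TIMEOUT, 30L); /* the 30-second limit */
        curl_easy_setopt(easy, CURLOPT_WRITEFUNCTION, discard);
        curl_multi_add_handle(multi, easy);
    }

    /* Drive all transfers from one thread. */
    do {
        CURLMcode mc = curl_multi_perform(multi, &still_running);
        if(mc == CURLM_OK && still_running)
            mc = curl_multi_wait(multi, NULL, 0, 1000, NULL);
        if(mc != CURLM_OK)
            break;
    } while(still_running);

    /* Tally results, e.g. CURLE_OK (0) vs CURLE_OPERATION_TIMEDOUT (28). */
    while((msg = curl_multi_info_read(multi, &msgs_left))) {
        if(msg->msg == CURLMSG_DONE) {
            printf("result: %d (%s)\n", (int)msg->data.result,
                   curl_easy_strerror(msg->data.result));
            curl_multi_remove_handle(multi, msg->easy_handle);
            curl_easy_cleanup(msg->easy_handle);
        }
    }

    curl_multi_cleanup(multi);
    curl_global_cleanup();
    return 0;
}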

This is the original PHP class that I used for reproducing the issue a month ago.

And these are the test results of the 2 C programs. You can see that the tests done with curl_multi time out, while the curl-threads results are stable (about 50 out of 407 proxies are working).

This is a sample from the test results. Note columns 4 and 5: curl-threads times out about ~170 times and successfully connects ~40 times, while curl_multi makes 0 successful connections and times out ~300 times out of 407 proxies.

column(1) : #
column(2) : time(UTC)
column(3) : total execution time (seconds)
column(4) : no error 0 (how many requests result in no error CURLE_OK)
column(5) : error 28 (how many requests result in error 28 CURLE_OPERATION_TIMEDOUT)
column(6) : error 7 (how many requests result in error 7 CURLE_COULDNT_CONNECT)
column(7) : error 35 (how many requests result in error 35 CURLE_SSL_CONNECT_ERROR)
column(8) : error 56 (how many requests result in error 56 CURLE_RECV_ERROR)
column(9) : other errors (how many requests result in errors other than the above)
column(10) : program that used the curl
column(11) : cURL version

c(1)  c(2)                c(3)  c(4)  c(5)  c(6)  c(7)  c(8)  c(9)  c(10)                                     c(11)
267   2019-3-28 01:58:01  40    43    176   183   1     4     0     C (curl - threads) (Linux Fedora)         7.59.0
268   2019-3-28 01:59:01  30    0     286   110   1     10    0     C (curl-multi one thread) (Linux Fedora)  7.59.0
269   2019-3-28 02:00:01  30    46    169   181   1     8     2     C (curl - threads) (Linux Fedora)         7.59.0
270   2019-3-28 02:01:01  31    0     331   74    1     1     0     C (curl-multi one thread) (Linux Fedora)  7.59.0
271   2019-3-28 02:02:01  30    42    173   186   1     4     1     C (curl - threads) (Linux Fedora)         7.59.0
272   2019-3-28 02:03:01  30    0     277   116   1     13    0     C (curl-multi one thread) (Linux Fedora)  7.59.0

Why does curl_multi inconsistently time out on most of the connections, while curl-threads never does this?

I downloaded Wireshark and used it to capture the traffic while each of the 2 C programs was running. I also filtered the traffic to the proxy list used by the 2 C programs and saved the files on GitHub.

the curl-threads program (the expected behavior)

63 successful connections and 158 timed-out connections out of 407 proxies.

the curl_multi program (the unexpected behavior)

0 successful connections and 272 timed-out connections out of 407 proxies.

You can open the .pcapng files in Wireshark and see the traffic recorded on my computer during both the expected and the unexpected behavior. I filtered the traffic to the 407 proxy IPs and left Wireshark open for a little while after the 30-second curl limit because I noticed some packets still showing up. I don't know Wireshark or this level of networking, but I thought this could be useful.


Note on the bandwidth:

Open the .pcapng file of the curl-threads program (the normal behavior) in Wireshark and go to Statistics > Conversations. You will see the per-conversation byte totals for each direction.


I have copied the data and saved it here on GitHub. Now calculate the sum of the bytes sent from A->B and B->A.

The ENTIRE bandwidth needed to work normally is about 692.8 KB.

Ellsworth answered 22/2, 2019 at 12:40 Comment(8)
Please check my comment on the GitHub issue. Also, in your code, it would be best to enable CURLOPT_VERBOSE. It may also be worth considering using the C version provided by badger on GitHub, for consistency.Geochemistry
Hello @JL2210 I have replied to your comment on GitHub. Regarding the C version, I just added the ability to aggregate the test results and print them to a file in the same format as the threads program, so I can put both programs' results in the same file and compare.Gayomart
I think I've made an edit that makes your question and your circumstances a bit more clear. Please review it and get back to me.Geochemistry
Is there a firewall on your network? Or something that could limit outbound connections?Geochemistry
Try running strace curl.Geochemistry
@JL2210 Thank you very much for the edits, I will check them. "Is there a firewall on your network? Or something that could limit outbound connections?" If there were something wrong with my network, the curl-threads program would have it too, but the threads program works fine while the curl-multi program reproduces the problem sometimes.Gayomart
Sorry about that... Anyway, how did this go for you?Geochemistry
@JL2210 Same thing. I made a little C program that uses curl for 1 request on 1 thread, and it works fine. curl_multi still produces error 28 timeouts; I don't use it anymore.Gayomart

I've gotten reproducible behavior and I'm waiting for badger on GitHub to reply. Try running a program like Ettercap to get more information.

Geochemistry answered 31/3, 2019 at 14:18 Comment(14)
I captured the traffic during both the normal and the abnormal behavior using Wireshark and saved the .pcapng files so anyone can analyze what is going on. I will update the question and the GitHub issue; I hope this can also help.Gayomart
OK. I can't use Wireshark; it requires Qt5, which hasn't built properly for me yet. I'll update this with more information when I get any.Geochemistry
"So to answer your question: Run those tests on a very slow/low bandwidth network."... So, why did the threads program works as expected on the same exact machine and network ?! they behaved constantly regarding successful connections curl-threads>43, curl-multi>0, curl-threads>46, curl-multi>0, curl-threads>42, curl_multi>0, ...Gayomart
I'm getting the error consistently on a slow/low bandwidth network, so doesn't that answer your question?Geochemistry
"I'm getting the error consistently on a slow/low bandwidth network, so doesn't that answer your question?." No James, because the threads program WORKS ! . offcurse on a very slow network both of them will not work, but on a good connection only the curl_multi program will show the issue sometimes, it even reproduced on a production server with a 500 Mbps up-link!Gayomart
I'm getting the error on the multi program. The threads program does work. At this point, what are you even asking? I reliably reproduced the behavior and I told you how to do so. Is that an answer or not? It's not an issue with curl. It's just absurd overloading of your network by multiple open connections in a single process.Geochemistry
The network explanation is not acceptable to me, because if the threads program works using the same network resources, then it is not a slow-network issue!!!! Plus, I reproduced it multiple times on a production server with powerful networking resources! I will update the question with these measurements.Gayomart
"and multiple open connections in a single process." ... This is logically acceptable reason more than the network reason, but IF this is the case, then that means curl_multi or any multiple connections on 1 thread model can never be as good as multi-threaded model when it comes to asynchronous connections. And I wish if I knew this before I spent 1 month and stop my project because of this lazy weak model that can never be as good as a multi-threaded solution :(Gayomart
Please see the note on the bandwidth in the question; it's only 692.8 KB!!! Even a dial-up connection can provide more than this in the 30 seconds we allowed for curl_multi. It can't be the bandwidth that makes curl_multi make 0 connections in 30 seconds!Gayomart
There's no way a dialup connection can sustain this. Maybe it can on a nice sunny day somewhere in Paradise, but not anywhere on Earth.Geochemistry
Let us continue this discussion in chat.Geochemistry
But according to this answer, and this, and this comment, it's not a problem for a single process to have 407 open connections.Gayomart
I removed my upvote, because you replaced 100% of your answer. I upvoted the original post when you said that you had reproduced the issue and had information you were waiting for Daniel on GitHub to confirm, but now you have replaced the post with "...may be overloading... may be blacklisting you... network may be limiting...", which doesn't help resolve the issue and adds more to the uncertainty around it, which is essentially what I posted the question to clear up. I didn't downvote though; thank you for your time and interest in the question.Gayomart
@Accountantم Sorry about that, I've reverted to revision 1, if that helps. However, it doesn't seem as if your issue has been resolved.Geochemistry

To me it looks like the problem is not with curl itself, but with making too many concurrent connections to the proxy servers. If the connections are being refused, you might be blacklisted permanently or for some period.

Check that by running your curl tests from your current IP and collecting stats: how many connections were established, how many were refused, how many timed out. Do it several times and take an average. Then switch to a server with a different IP and see what stats you get there. On the first run you should have much better statistics, and they will probably only get worse if you repeat the test from the new IP. A good idea might be to not use the whole pool of proxies for the stats, but to select a slice of them, check from your actual IP, and repeat that check from a new IP. That way, if the reason really is that you are abusing the service, you don't blacklist yourself at all the proxies and still have a group of 'untouched' proxies to test against from a new IP. Be aware that even if the proxies' IPs are in different locations, they can belong to the same service provider, which probably keeps one abuse list for all of its proxy servers, so if your request volume gets you flagged in one country, you can be blocked in another country as well, even before you connect to the other country's proxy.
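
If the concurrent-connection load is the suspect, one way to test that theory without rewriting the program as threads is to cap the multi handle's parallelism. A hedged two-line experiment: CURLMOPT_MAX_TOTAL_CONNECTIONS and CURLMOPT_MAX_HOST_CONNECTIONS are real libcurl multi options (available since 7.30.0), while the handle name multi matches the sketch in the question above and the cap values are arbitrary examples.

/* Cap simultaneously open connections on the multi handle; transfers over
   the cap are queued internally until a slot frees up. */
curl_multi_setopt(multi, CURLMOPT_MAX_TOTAL_CONNECTIONS, 10L);

/* Optionally also cap connections per host (per proxy, in this setup). */
curl_multi_setopt(multi, CURLMOPT_MAX_HOST_CONNECTIONS, 2L);

If the flood of error-28 results disappears with a low cap, that points at saturation from too many simultaneous connections rather than at the proxies themselves.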

If you still want to check whether this is curl or not, you can set up a test environment with multiple servers. You can pass this test environment to the curl maintainer so he can replicate the error. You can use Docker to create 10, 20 or 100 proxy servers and connect to them to see if curl has a problem or not.

  1. you will need Docker (it can be installed on Win/Mac/Linux)
  2. one of the proxy images, to create the proxies
  3. create a network for the containers (bridge should be OK)
  4. attach the containers to the network with --network
  5. it is good to set each proxy container's --ip
  6. make it possible for each proxy container to read its config and write an error log (so you can read why it disconnected, if that happens) by mounting the error log/config files/directories with --volume
  7. and then all proxy containers should be running

You can connect to a proxy that is running inside a container in two ways. If you would like to have curl outside these containers, then you need to expose these proxies' ports from the containers to the outside world (curl, in your case) with -p.

or

You may use another container image that has Linux + curl, for example Alpine Linux + curl, and connect it to the same network the same way as you do with the proxies. If you do that, you don't need to publish (expose) the proxies' ports and don't need to think about which proxy port to expose for each particular proxy.
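
With that setup, the test program only needs to point libcurl at each proxy container's address. A minimal sketch under stated assumptions: CURLOPT_PROXY and CURLOPT_CONNECTTIMEOUT are real libcurl options, while the container IP:port below is a made-up example of an address you would have assigned with --ip, and the URL is just a reachable target.

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
    CURL *easy;
    CURLcode res;

    curl_global_init(CURL_GLOBAL_DEFAULT);
    easy = curl_easy_init();
    /* Hypothetical container IP:port assigned with --ip; any URL
       reachable through the proxy works as a target. */
    curl_easy_setopt(easy, CURLOPT_URL, "http://example.com/");
    curl_easy_setopt(easy, CURLOPT_PROXY, "172.20.0.10:3128");
    curl_easy_setopt(easy, CURLOPT_CONNECTTIMEOUT, 10L);
    res = curl_easy_perform(easy);
    printf("result: %d (%s)\n", (int)res, curl_easy_strerror(res));
    curl_easy_cleanup(easy);
    curl_global_cleanup();
    return 0;
}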

At each step you can issue the command

docker ps -a

to see all containers and their status.

To stop and remove all containers (not the images they come from, just the running containers), in case you had some errors with a container that exited:

docker stop $(docker ps -aq) && docker rm $(docker ps -aq)

Or, to stop and remove a particular container:

docker stop <container-id>
docker rm <container-id>

To see all containers that are connected to the bridge network (the default):

docker network inspect bridge

If you confirm there really is a problem with connections to proxies that are on your local machine, then this is something the maintainer of curl can replicate.

Just put all the commands from above (create the proxies, connect them to the network, etc.) in a file, for example a replicate.sh script starting with

#!/bin/sh

and your commands here.

Save that file and then issue the command

chmod +x ./replicate.sh

to make it executable.

You can run it to double-check that everything is working as expected

./replicate.sh

and send it to the maintainer of curl so he can replicate the environment in which you experienced the problem.

If you don't want to write a lot of docker run commands for the proxies, you can use Docker Compose instead, which allows you to define the whole testing environment in one file.

If you run a lot of containers, you can limit the resources (for example, the memory) each of them consumes, which may help in the case of so many proxies.

Sirajuddaula answered 30/3, 2019 at 1:56 Comment(5)
Thank you very much Jimmix for your time. The possibility that I'm getting blacklisted is ruled out because the curl-threads program works just fine from the SAME NETWORK while the curl-multi program doesn't. Luckily, in the test results that I provided a sample of in the question, curl-multi didn't work at all (0 successful connections). Check column 4 in the sample and you will see curl-threads>43, curl-multi>0, curl-threads>46, curl-multi>0, curl-threads>42, curl_multi>0, ... (the time span between each test is 1 minute).Gayomart
Check the readme of the C programs and you will see how to run the tests yourself if you want, and you will know what I mean.Gayomart
Thank you very much for the Docker suggestion, but I think I don't need it, since the problem is already reproducible on the current installation; it is just not always reproducible. I have never used Docker, but I will bookmark your answer as a reference for when I come to try it :) thanks.Gayomart
@Accountantم I'm glad you reproduced the error. I read the readme you linked and the cron jobs. You may also like this way of using cron for every EVEN/ODD number of minutes.Sirajuddaula
It's not a blacklist issue, because the proxies don't know whether he's using the curl_multi API with a single thread or multiple curl_easy handles with one thread per handle, and this issue only occurs when he's using the curl_multi API with a single thread; it doesn't happen when he's using one thread per easy handle. If it were a blacklist issue, it would happen when using one thread per handle as well, but it doesn't.Dicot
