I'm writing a small crawler that fetches a URL multiple times, and I want all of the fetches to run at the same time (simultaneously).
I've written a little piece of code that should do that.
import thread
import time
from urllib2 import Request, urlopen, URLError, HTTPError

def getPAGE(FetchAddress):
    attempts = 0
    while attempts < 2:
        req = Request(FetchAddress, None)
        try:
            response = urlopen(req, timeout=8)  # fetching the url
            print "fetched url %s" % FetchAddress
        except HTTPError, e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', str(e.code) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in getPAGE.'
            print 'Reason: ', str(e) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            try:
                return response.read()
            except:
                print "there was an error with response.read()"
                return None
    return None

url = ("http://www.domain.com",)
for i in range(1, 50):
    thread.start_new_thread(getPAGE, url)
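As a side note, one way to check whether blocking waits overlap across threads at all is a timing sketch like this (my own experiment, not the crawler itself; `worker` and the 0.5 s sleep are stand-ins for the real fetch):

```python
import threading
import time

def worker(i):
    # stand-in for a blocking network call; sleep releases the GIL
    time.sleep(0.5)

start = time.time()
threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print(elapsed)  # roughly 0.5s, not 5s, because the waits overlap
```

If the total time is close to one sleep interval rather than ten, the threads really are waiting concurrently.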
From the Apache logs it doesn't seem like the threads are running simultaneously; there's a small gap between requests. It's almost undetectable, but I can see that the threads are not really parallel.
I've read about the GIL. Is there a way to bypass it without calling C/C++ code? I can't really understand how threading is even possible with the GIL, does Python basically run the next thread as soon as it finishes with the previous one?
Thanks.