Using a Tor proxy with Scrapy

2024/10/13 0:36:09

I need help setting up Tor on Ubuntu and using it within the Scrapy framework.

I did some research and found this guide:

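    import logging
    import telnetlib  # removed from the standard library in Python 3.13
    import time

    from scrapy.downloadermiddlewares.retry import RetryMiddleware

    logger = logging.getLogger(__name__)


    class RetryChangeProxyMiddleware(RetryMiddleware):
        def _retry(self, request, reason, spider):
            logger.info('Changing proxy')
            # Ask Tor's control port (9051) for a new identity before retrying.
            tn = telnetlib.Telnet('127.0.0.1', 9051)
            tn.read_until(b"Escape character is '^]'.", 2)
            tn.write(b'AUTHENTICATE "267765"\r\n')
            tn.read_until(b"250 OK", 2)
            tn.write(b"signal NEWNYM\r\n")
            tn.read_until(b"250 OK", 2)
            tn.write(b"quit\r\n")
            tn.close()
            time.sleep(3)  # give Tor a moment to build a new circuit
            logger.info('Proxy changed')
            return RetryMiddleware._retry(self, request, reason, spider)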
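Since `telnetlib` is no longer available on recent Python versions, the same control-port exchange can be sketched with a plain socket. This is only a minimal sketch: the function name `request_new_identity` is hypothetical, and the port 9051 and the password are assumptions that must match the `ControlPort` and `HashedControlPassword` entries in your torrc.

```python
import socket


def request_new_identity(password, host="127.0.0.1", port=9051, timeout=5.0):
    """Send AUTHENTICATE and SIGNAL NEWNYM to a Tor control port.

    Returns True only if both commands were answered with "250 OK".
    """
    with socket.create_connection((host, port), timeout=timeout) as sock, \
            sock.makefile("rb") as reader:

        def send(command):
            # Control-port commands are CRLF-terminated ASCII lines.
            sock.sendall(command.encode("ascii") + b"\r\n")
            return reader.readline().strip()

        if send('AUTHENTICATE "{}"'.format(password)) != b"250 OK":
            return False
        ok = send("SIGNAL NEWNYM") == b"250 OK"
        send("QUIT")  # close the session politely; the reply is not checked
        return ok
```

Inside `_retry`, a call such as `request_new_identity("267765")` could then stand in for the telnet lines, with the return value logged so that failed authentication does not go unnoticed.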

Then enable it in settings.py (note that the setting is named DOWNLOADER_MIDDLEWARES, with a trailing S):

    DOWNLOADER_MIDDLEWARES = {
        'spider.middlewares.RetryChangeProxyMiddleware': 600,
    }
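Because the class subclasses Scrapy's built-in RetryMiddleware, leaving the built-in one enabled would mean failed requests pass through two retry middlewares. A possible settings fragment, assuming a recent Scrapy where the built-in class lives at `scrapy.downloadermiddlewares.retry.RetryMiddleware` and the project module is named `spider` as above:

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # Setting a built-in middleware to None disables it.
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
    # The proxy-changing subclass takes over retry handling.
    "spider.middlewares.RetryChangeProxyMiddleware": 600,
}
```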

After that, you just need to send the requests through the local Tor proxy (polipo), which could be done with:

    tsocks scrapy crawl spider
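tsocks wraps the whole process from the outside. An alternative that stays inside Scrapy is to set the proxy per request through `request.meta['proxy']`, which Scrapy uses when downloading. The sketch below is not from the guide: the class name `TorProxyMiddleware` is hypothetical, and the address `http://127.0.0.1:8123` assumes polipo is listening on its default port and is configured to forward to Tor's SOCKS port.

```python
class TorProxyMiddleware:
    """Downloader middleware that routes requests through a local HTTP proxy."""

    def __init__(self, proxy_url="http://127.0.0.1:8123"):
        # Assumed polipo address; change it to match your setup.
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Keep any proxy that was set explicitly on the request.
        request.meta.setdefault("proxy", self.proxy_url)
        return None  # let Scrapy continue processing the request
```

It would be registered in DOWNLOADER_MIDDLEWARES like any other downloader middleware, for example with a priority below 750 so that it runs before the built-in HttpProxyMiddleware.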

Can anyone confirm that this method works and that you actually get different IPs?
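One way to check this yourself is to fetch the apparent IP through the proxy, signal NEWNYM, and compare the addresses over a few rounds. The sketch below keeps the network parts as injected callables so the loop can be tried offline. All names are hypothetical; the proxy address and the check URL (assumed to return JSON with an `"IP"` key) are assumptions, not something the guide specifies.

```python
import json
import time
import urllib.request


def fetch_ip_via_proxy(proxy_url="http://127.0.0.1:8123",
                       check_url="https://check.torproject.org/api/ip"):
    """Return the IP the check service sees when going through the proxy."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    )
    with opener.open(check_url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))["IP"]


def collect_exit_ips(fetch_ip, renew, rounds=3, wait=10.0):
    """Call fetch_ip, then renew, `rounds` times; return the IPs in order."""
    seen = []
    for _ in range(rounds):
        seen.append(fetch_ip())
        renew()
        time.sleep(wait)  # Tor may ignore NEWNYM signals sent too close together
    return seen
```

Passing `fetch_ip_via_proxy` as `fetch_ip` and a function that sends NEWNYM as `renew` gives a list of addresses. If every entry is identical, the identity change is probably not taking effect, although Tor can also legitimately reuse an exit node.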

Answer

I was using this snippet: http://snipplr.com/view/66992/use-a-random-user-agent-for-each-request/

Update: broken link fixed

