Scraping Google Analytics with Scrapy

2024/9/28 21:26:09

I have been trying to use Scrapy to get some data from Google Analytics, and although I'm a complete Python newbie I have made some progress. I can now log in to Google Analytics with Scrapy, but I need to make an AJAX request to get the data I want. I have tried to replicate my browser's HTTP request headers with the code below, but it doesn't seem to work. My error log says:

too many values to unpack

Could somebody help? I've been working on it for two days. I have the feeling that I'm very close, but I'm also very confused.

Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from scrapy.selector import Selector
import logging
from super.items import SuperItem
from scrapy.shell import inspect_response
import json


class LoginSpider(BaseSpider):
    name = 'super'
    start_urls = ['']

    def parse(self, response):
        return [FormRequest.from_response(response,
                                          formdata={'Email': 'Email'},
                                          callback=self.log_password)]

    def log_password(self, response):
        return [FormRequest.from_response(response,
                                          formdata={'Passwd': 'Password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=logging.ERROR)
            return
        # We've successfully authenticated, let's have some fun!
        else:
            print("Login Successful!!")
            return Request(url="",
                           method='POST',
                           headers=[{'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
                                     'Galaxy-Ajax': 'true',
                                     'Origin': '',
                                     'Referer': '',
                                     'User-Agent': 'My-user-agent',
                                     'X-GAFE4-XSRF-TOKEN': 'Mytoken'}],
                           callback=self.parse_tastypage, dont_filter=True)

    def parse_tastypage(self, response):
        response = json.loads(jsonResponse)
        inspect_response(response, self)
        yield item

And here is part of the log:

2016-03-28 19:11:39 [scrapy] INFO: Enabled item pipelines:
2016-03-28 19:11:39 [scrapy] INFO: Spider opened
2016-03-28 19:11:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-28 19:11:39 [scrapy] DEBUG: Telnet console listening on
2016-03-28 19:11:40 [scrapy] DEBUG: Crawled (200) <GET> (referer: None)
2016-03-28 19:11:46 [scrapy] DEBUG: Crawled (200) <POST> (referer:
2016-03-28 19:11:50 [scrapy] DEBUG: Redirecting (302) to <GET> from <POST>
2016-03-28 19:11:57 [scrapy] DEBUG: Redirecting (302) to <GET> from <GET>
2016-03-28 19:12:01 [scrapy] DEBUG: Crawled (200) <GET> (referer:
Login Successful!!
2016-03-28 19:12:01 [scrapy] ERROR: Spider error processing <GET> (referer:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/aminbouraiss/super/super/spiders/", line 42, in after_login
    callback=self.parse_tastypage, dont_filter=True)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/request/", line 35, in __init__
    self.headers = Headers(headers or {}, encoding=encoding)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/", line 12, in __init__
    super(Headers, self).__init__(seq)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/", line 193, in __init__
    self.update(seq)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/", line 229, in update
    super(CaselessDict, self).update(iseq)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/", line 228, in <genexpr>
    iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq)
ValueError: too many values to unpack
2016-03-28 19:12:01 [scrapy] INFO: Closing spider (finished)
2016-03-28 19:12:01 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6419,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 3,
 'downloader/request_method_count/POST': 2,
 'downloader/response_bytes': 75986,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/302': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 28, 23, 12, 1, 824033),
 'log_count/DEBUG': 6,

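Your error occurs because headers needs to be a dict, not a dict wrapped in a list. Scrapy iterates over whatever you pass and unpacks each element into a key and a value (the "for k, v in seq" line in your traceback). With a list, the only element is the whole dict, which has more than two keys, hence "too many values to unpack". Pass the dict directly: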

  headers={
      'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
      'Galaxy-Ajax': 'true',
      'Origin': '',
      'Referer': '',
      'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
  },
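
To see why the list version blows up, here is a minimal sketch that reproduces the error outside Scrapy. The function normalize_headers is a hypothetical, simplified stand-in (not Scrapy's actual code); it only mirrors the "for k, v in seq" loop visible in the traceback, and it is written for Python 3.

```python
from collections.abc import Mapping


def normalize_headers(seq):
    """Simplified, hypothetical stand-in for a headers container.

    Not Scrapy's real implementation: it only imitates the
    `for k, v in seq` unpacking shown in the traceback.
    """
    # A mapping is converted into (key, value) pairs first.
    pairs = seq.items() if isinstance(seq, Mapping) else seq
    # Every element must unpack into exactly two values.
    return {k.title(): v for k, v in pairs}


good = {'Galaxy-Ajax': 'true', 'User-Agent': 'My-user-agent', 'Origin': ''}
bad = [good]  # the same dict wrapped in a list, as in the question

print(normalize_headers(good))

try:
    normalize_headers(bad)
except ValueError as exc:
    # The only list element is the dict itself; unpacking it iterates
    # over its three keys, which cannot fit into `k, v`.
    print("ValueError:", exc)
```

The dict goes through cleanly, while the list raises the same kind of ValueError as in the question.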

Passing a plain dict fixes the current error, but you will then get a 411 (Length Required) response, because the request also needs a Content-Length header. If you add details of what you want to pull, I can show you how. You can see the output below:

2016-03-29 14:02:11 [scrapy] DEBUG: Redirecting (302) to <GET> from <GET>
2016-03-29 14:02:13 [scrapy] DEBUG: Crawled (200) <GET> (referer:
Login Successful!!
2016-03-29 14:02:14 [scrapy] DEBUG: Crawled (411) <POST> (referer:
2016-03-29 14:02:14 [scrapy] DEBUG: Ignoring response <411>: HTTP status code is not handled or not allowed
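
As a rough illustration of the Content-Length point, the sketch below url-encodes some form data and derives the header from the encoded body using only the standard library (Python 3). The helper build_post and the form fields are hypothetical; the question does not say which parameters the endpoint expects, so treat them as placeholders.

```python
from urllib.parse import urlencode


def build_post(form, base_headers):
    """Return (headers, body) for a url-encoded POST with Content-Length set."""
    body = urlencode(form).encode('utf-8')
    headers = dict(base_headers)  # copy, so the caller's dict is untouched
    headers['Content-Length'] = str(len(body))
    return headers, body


# Hypothetical form fields; the real endpoint's parameters are not given.
form = {'reportId': 'example', 'format': 'json'}
base_headers = {
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
    'Galaxy-Ajax': 'true',
}

headers, body = build_post(form, base_headers)
print(headers['Content-Length'], body)
```

In the spider, the resulting headers and body could then be passed to Request via its headers= and body= arguments, assuming the endpoint accepts a form-encoded body.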

