I have been trying to use Scrapy to get some data from Google Analytics and despite the fact that I'm a complete Python newbie I have made some progress. I can now login to Google Analytics by Scrapy but I need to make an AJAX request to get the data what I want. I have tried to replicate my browser's HTTP request header with the code below but it doesn't seem to work, my error log says
too many values to unpack
Could somebody help? I've been worked on it for two days, I have the feeling that I'm very close but I'm also very confused.
Here is the code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from scrapy.selector import Selector
import logging
from super.items import SuperItem
from scrapy.shell import inspect_response
import jsonclass LoginSpider(BaseSpider):name = 'super'start_urls = ['https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier']def parse(self, response):return [FormRequest.from_response(response,formdata={'Email': 'Email'},callback=self.log_password)]def log_password(self, response):return [FormRequest.from_response(response,formdata={'Passwd': 'Password'},callback=self.after_login)]def after_login(self, response):if "authentication failed" in response.body:self.log("Login failed", level=logging.ERROR)return# We've successfully authenticated, let's have some fun!else:print("Login Successful!!")return Request(url="https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0",method='POST',headers=[{'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8','Galaxy-Ajax': 'true','Origin': 'https://analytics.google.com','Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1','User-Agent': 'My-user-agent','X-GAFE4-XSRF-TOKEN': 'Mytoken'}],callback=self.parse_tastypage, dont_filter=True)def parse_tastypage(self, response):response = json.loads(jsonResponse)inspect_response(response, self)yield item
And here is part of the log:
2016-03-28 19:11:39 [scrapy] INFO: Enabled item pipelines:
[]
2016-03-28 19:11:39 [scrapy] INFO: Spider opened
2016-03-28 19:11:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-28 19:11:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-28 19:11:40 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier> (referer: None)
2016-03-28 19:11:46 [scrapy] DEBUG: Crawled (200) <POST https://accounts.google.com/AccountLoginInfo> (referer: https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr)
2016-03-28 19:11:50 [scrapy] DEBUG: Redirecting (302) to <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> from <POST https://accounts.google.com/ServiceLoginAuth>
2016-03-28 19:11:57 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA>
2016-03-28 19:12:01 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Login Successful!!
2016-03-28 19:12:01 [scrapy] ERROR: Spider error processing <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Traceback (most recent call last):File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 577, in _runCallbackscurrent.result = callback(current.result, *args, **kw)File "/Users/aminbouraiss/super/super/spiders/mySuper.py", line 42, in after_logincallback=self.parse_tastypage, dont_filter=True)File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/request/__init__.py", line 35, in __init__self.headers = Headers(headers or {}, encoding=encoding)File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/headers.py", line 12, in __init__super(Headers, self).__init__(seq)File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 193, in __init__self.update(seq)File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 229, in updatesuper(CaselessDict, self).update(iseq)File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 228, in <genexpr>iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq)
ValueError: too many values to unpack
2016-03-28 19:12:01 [scrapy] INFO: Closing spider (finished)
2016-03-28 19:12:01 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6419,'downloader/request_count': 5,'downloader/request_method_count/GET': 3,'downloader/request_method_count/POST': 2,'downloader/response_bytes': 75986,'downloader/response_count': 5,'downloader/response_status_count/200': 3,'downloader/response_status_count/302': 2,'finish_reason': 'finished','finish_time': datetime.datetime(2016, 3, 28, 23, 12, 1, 824033),'log_count/DEBUG': 6,