Question 1

I have been trying to use Scrapy to get some data from Google Analytics, and although I'm a complete Python newbie I have made some progress. I can now log in to Google Analytics with Scrapy, but I need to make an AJAX request to get the data I want. I have tried to replicate my browser's HTTP request headers with the code below, but it doesn't work. My error log says:

too many values to unpack

Could somebody help? I've been working on it for two days. I have the feeling that I'm very close, but I'm also very confused.

Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from scrapy.selector import Selector
import logging
from super.items import SuperItem
from scrapy.shell import inspect_response
import json


class LoginSpider(BaseSpider):
    name = 'super'
    start_urls = ['https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier']

    def parse(self, response):
        return [FormRequest.from_response(
            response,
            formdata={'Email': 'Email'},
            callback=self.log_password)]

    def log_password(self, response):
        return [FormRequest.from_response(
            response,
            formdata={'Passwd': 'Password'},
            callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=logging.ERROR)
            return
            # We've successfully authenticated, let's have some fun!
        else:
            print("Login Successful!!")
            return Request(
                url="https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0",
                method='POST',
                headers=[{
                    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
                    'Galaxy-Ajax': 'true',
                    'Origin': 'https://analytics.google.com',
                    'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
                    'User-Agent': 'My-user-agent',
                    'X-GAFE4-XSRF-TOKEN': 'Mytoken'
                }],
                callback=self.parse_tastypage, dont_filter=True)

    def parse_tastypage(self, response):
        response = json.loads(jsonResponse)
        inspect_response(response, self)
        yield item

And here is part of the log:

2016-03-28 19:11:39 [scrapy] INFO: Enabled item pipelines:
[]
2016-03-28 19:11:39 [scrapy] INFO: Spider opened
2016-03-28 19:11:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-28 19:11:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-28 19:11:40 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier> (referer: None)
2016-03-28 19:11:46 [scrapy] DEBUG: Crawled (200) <POST https://accounts.google.com/AccountLoginInfo> (referer: https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr)
2016-03-28 19:11:50 [scrapy] DEBUG: Redirecting (302) to <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> from <POST https://accounts.google.com/ServiceLoginAuth>
2016-03-28 19:11:57 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA>
2016-03-28 19:12:01 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Login Successful!!
2016-03-28 19:12:01 [scrapy] ERROR: Spider error processing <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/aminbouraiss/super/super/spiders/mySuper.py", line 42, in after_login
    callback=self.parse_tastypage, dont_filter=True)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/request/__init__.py", line 35, in __init__
    self.headers = Headers(headers or {}, encoding=encoding)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/headers.py", line 12, in __init__
    super(Headers, self).__init__(seq)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 193, in __init__
    self.update(seq)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 229, in update
    super(CaselessDict, self).update(iseq)
  File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 228, in <genexpr>
    iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq)
ValueError: too many values to unpack
2016-03-28 19:12:01 [scrapy] INFO: Closing spider (finished)
2016-03-28 19:12:01 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6419,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 3,
 'downloader/request_method_count/POST': 2,
 'downloader/response_bytes': 75986,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/302': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 28, 23, 12, 1, 824033),
 'log_count/DEBUG': 6,

Answer 1

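Your error occurs because headers needs to be a dict, not a list containing a dict. When Scrapy receives a list, it iterates over it expecting (key, value) pairs; the single item it finds is the whole dict, which cannot be unpacked into two names, hence "too many values to unpack". Pass the dict directly: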

  headers={
      'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
      'Galaxy-Ajax': 'true',
      'Origin': 'https://analytics.google.com',
      'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
      'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
  },
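To see why the list form fails, here is a minimal sketch that needs only the standard library. The normalize_headers function is a hypothetical, simplified stand-in for the pair-unpacking loop shown in the traceback (`for k, v in seq`); it is not Scrapy's actual implementation.

```python
def normalize_headers(seq):
    """Simplified stand-in for the unpacking loop in the traceback.

    A dict is turned into (key, value) pairs; anything else is assumed
    to already be an iterable of (key, value) pairs.
    """
    if isinstance(seq, dict):
        seq = seq.items()
    return {k: v for k, v in seq}


def try_headers(headers):
    """Return the normalized headers, or the error message on failure."""
    try:
        return normalize_headers(headers)
    except ValueError as exc:
        return "ValueError: %s" % exc


good = {
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
    'Galaxy-Ajax': 'true',
    'Origin': 'https://analytics.google.com',
}

ok_result = try_headers(good)     # plain dict: each item is a (key, value) pair
bad_result = try_headers([good])  # list: the only item is the dict itself

print(ok_result)
print(bad_result)
```

In the second call, the loop tries to unpack the three keys of the dict into the two names k and v, which raises the same ValueError as in the question.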

Using a plain dict fixes the current error, but the request will then get a 411 (Length Required) response, because you also need to specify the Content-Length. If you add what you want to pull, I will be able to show you how. You can see the output below:

2016-03-29 14:02:11 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA>
2016-03-29 14:02:13 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Login Successful!!
2016-03-29 14:02:14 [scrapy] DEBUG: Crawled (411) <POST https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0> (referer: https://analytics.google.com/analytics/web/?hl=fr&pli=1)
2016-03-29 14:02:14 [scrapy] DEBUG: Ignoring response <411 https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0>: HTTP status code is not handled or not allowed
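As a rough illustration of what the 411 is asking for, the sketch below builds a URL-encoded POST body and the Content-Length header that matches it, using only the Python 3 standard library. The build_form_post helper and the example_field name are hypothetical; the real form fields expected by the getPage endpoint are not given in the question.

```python
from urllib.parse import urlencode


def build_form_post(fields):
    """Encode form fields and compute the matching Content-Length.

    `fields` is a dict of form values. The field names used below are
    placeholders, not the ones Google Analytics actually expects.
    """
    body = urlencode(fields).encode('utf-8')
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
        # Content-Length is the size of the encoded body in bytes.
        'Content-Length': str(len(body)),
    }
    return headers, body


headers, body = build_form_post({'example_field': 'example value'})
print(headers['Content-Length'], body)
```

In Scrapy, a body like this can be passed through the body argument of Request, or the fields can be given to FormRequest via formdata, which URL-encodes them for you. Once a request has a non-empty body, the Content-Length is normally filled in by the HTTP client, so setting it by hand should rarely be necessary.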

Scraping Google Analytics by Scrapy
