Scraping Google Analytics by Scrapy

2024/9/28 21:26:09

I have been trying to use Scrapy to get some data from Google Analytics and despite the fact that I'm a complete Python newbie I have made some progress. I can now login to Google Analytics by Scrapy but I need to make an AJAX request to get the data what I want. I have tried to replicate my browser's HTTP request header with the code below but it doesn't seem to work, my error log says

too many values to unpack

Could somebody help? I've been worked on it for two days, I have the feeling that I'm very close but I'm also very confused.

Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request
from scrapy.selector import Selector
import  logging
from super.items import SuperItem
from scrapy.shell import inspect_response
import jsonclass LoginSpider(BaseSpider):name = 'super'start_urls = ['https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier']def parse(self, response):return [FormRequest.from_response(response,formdata={'Email': 'Email'},callback=self.log_password)]def log_password(self, response):return [FormRequest.from_response(response,formdata={'Passwd': 'Password'},callback=self.after_login)]def after_login(self, response):if "authentication failed" in response.body:self.log("Login failed", level=logging.ERROR)return# We've successfully authenticated, let's have some fun!else:print("Login Successful!!")return Request(url="https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0",method='POST',headers=[{'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8','Galaxy-Ajax': 'true','Origin': 'https://analytics.google.com','Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1','User-Agent': 'My-user-agent','X-GAFE4-XSRF-TOKEN': 'Mytoken'}],callback=self.parse_tastypage, dont_filter=True)def parse_tastypage(self, response):response = json.loads(jsonResponse)inspect_response(response, self)yield item

And here is part of the log:

2016-03-28 19:11:39 [scrapy] INFO: Enabled item pipelines:
[]
2016-03-28 19:11:39 [scrapy] INFO: Spider opened
2016-03-28 19:11:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-28 19:11:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-28 19:11:40 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier> (referer: None)
2016-03-28 19:11:46 [scrapy] DEBUG: Crawled (200) <POST https://accounts.google.com/AccountLoginInfo> (referer: https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr)
2016-03-28 19:11:50 [scrapy] DEBUG: Redirecting (302) to <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> from <POST https://accounts.google.com/ServiceLoginAuth>
2016-03-28 19:11:57 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA>
2016-03-28 19:12:01 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Login Successful!!
2016-03-28 19:12:01 [scrapy] ERROR: Spider error processing <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Traceback (most recent call last):File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 577, in _runCallbackscurrent.result = callback(current.result, *args, **kw)File "/Users/aminbouraiss/super/super/spiders/mySuper.py", line 42, in after_logincallback=self.parse_tastypage, dont_filter=True)File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/request/__init__.py", line 35, in __init__self.headers = Headers(headers or {}, encoding=encoding)File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/headers.py", line 12, in __init__super(Headers, self).__init__(seq)File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 193, in __init__self.update(seq)File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 229, in updatesuper(CaselessDict, self).update(iseq)File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 228, in <genexpr>iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq)
ValueError: too many values to unpack
2016-03-28 19:12:01 [scrapy] INFO: Closing spider (finished)
2016-03-28 19:12:01 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 6419,'downloader/request_count': 5,'downloader/request_method_count/GET': 3,'downloader/request_method_count/POST': 2,'downloader/response_bytes': 75986,'downloader/response_count': 5,'downloader/response_status_count/200': 3,'downloader/response_status_count/302': 2,'finish_reason': 'finished','finish_time': datetime.datetime(2016, 3, 28, 23, 12, 1, 824033),'log_count/DEBUG': 6,
Answer

Your error is because headers needs to be a dict, not a list inside a dict:

  headers={'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8','Galaxy-Ajax': 'true','Origin': 'https://analytics.google.com','Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1','User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',},

That will fix your current issue but you will get a 411 as you need to specify the content-length also, if you add what you want to pull from I will be able to show you how. You can see the output below:

2016-03-29 14:02:11 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA>
2016-03-29 14:02:13 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo)
Login Successful!!
2016-03-29 14:02:14 [scrapy] DEBUG: Crawled (411) <POST https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0> (referer: https://analytics.google.com/analytics/web/?hl=fr&pli=1)
2016-03-29 14:02:14 [scrapy] DEBUG: Ignoring response <411 https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0>: HTTP status code is not handled or not allowed
https://en.xdnf.cn/q/71290.html

Related Q&A

Pandas representative sampling across multiple columns

I have a dataframe which represents a population, with each column denoting a different quality/ characteristic of that person. How can I get a sample of that dataframe/ population, which is representa…

TensorFlow - Ignore infinite values when calculating the mean of a tensor

This is probably a basic question, but I cant find a solution:I need to calculate the mean of a tensor ignoring any non-finite values.For example mean([2.0, 3.0, inf, 5.0]) should return 3.333 and not …

encode unicode characters to unicode escape sequences

Ive a CSV file containing sites along with addresses. I need to work on this file to produce a json file that I will use in Django to load initial data to my database. To do that, I need to convert all…

Python: Regarding variable scope. Why dont I need to pass x to Y?

Consider the following code, why dont I need to pass x to Y?class X: def __init__(self):self.a = 1self.b = 2self.c = 3class Y: def A(self):print(x.a,x.b,x.c)x = X() y = Y() y.A()Thank you to…

Python/Pandas - partitioning a pandas DataFrame in 10 disjoint, equally-sized subsets

I want to partition a pandas DataFrame into ten disjoint, equally-sized, randomly composed subsets. I know I can randomly sample one tenth of the original pandas DataFrame using:partition_1 = pandas.Da…

How to fix pylint error Unnecessary use of a comprehension

With python 3.8.6 and pylint 2.4.4 the following code produces a pylint error (or recommendation) R1721: Unnecessary use of a comprehension (unnecessary-comprehension)This is the code: dict1 = {"A…

conv2d_transpose is dependent on batch_size when making predictions

I have a neural network currently implemented in tensorflow, but I am having a problem making predictions after training, because I have a conv2d_transpose operations, and the shapes of these ops are d…

How SelectKBest (chi2) calculates score?

I am trying to find the most valuable features by applying feature selection methods to my dataset. Im using the SelectKBest function for now. I can generate the score values and sort them as I want, b…

Refer to multiple Models in View/Template in Django

Im making my first steps with Python/Django and wrote an example application with multiple Django apps in one Django project. Now I added another app called "dashboard" where Id like to displ…

Can I use a machine learning model as the objective function in an optimization problem?

I have a data set for which I use Sklearn Decision Tree regression machine learning package to build a model for prediction purposes. Subsequently, I am trying to utilize scipy.optimize package to solv…