Given my limited knowledge of web scraping, I've come across an issue that is very complex for me, and I will try to explain it as best I can (so I'm open to suggestions or edits to my post).
I started using the web crawling framework Scrapy for my web scraping a long time ago, and it's still the one I use today. Recently I came across this website and found that Scrapy was not able to iterate over its pages, since the site uses fragment URLs
(#) to load the data for the next pages. I made a post about that problem (having no idea of the root cause yet): my post
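As far as I can tell, the fragment in the site's URLs is just base64-encoded JSON describing the search state, including the page number. For example, decoding the fragment from the URL in my code below (a small check, nothing more):

import base64
import json

# The part after '#' in the site's URL (copied from the start URL in my spider)
fragment = ("eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJh"
            "dGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ==")

print json.loads(base64.b64decode(fragment))
# -> {u'data': {u'countryId': u'ES', ...}, u'config': {u'page': u'0'}}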
After that, I realized that Scrapy can't handle this on its own without a JavaScript
interpreter or a browser imitation, which is why the Selenium
library was suggested to me. I read as much as I could about that library (e.g. example1, example2, example3 and example4). I also found this StackOverflow post that gives some leads on my issue.
So, finally, my biggest questions are:
1 - Is there any way to iterate/yield over the pages of the website shown above using Selenium along with Scrapy? This is the code I'm using so far, but it doesn't work...
EDIT:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# The required imports (these are what the code below uses)
import re
import json
import base64

import scrapy
from scrapy import Spider
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


def getBrowser():
    path_to_phantomjs = "/some_path/phantomjs-2.1.1-macosx/bin/phantomjs"
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
        "(KHTML, like Gecko) Chrome/15.0.87")
    browser = webdriver.PhantomJS(executable_path=path_to_phantomjs,
                                  desired_capabilities=dcap)
    return browser


class MySpider(Spider):
    name = "myspider"
    browser = getBrowser()

    def start_requests(self):
        the_url = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="
        yield scrapy.Request(url=the_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.get_page_links()

    def get_page_links(self):
        """ This first part goes through all available pages """
        for i in xrange(1, 3):  # 210
            new_data = {"data": {"countryId": "ES", "regionId": "920",
                                 "duration": 7, "minPersons": 1},
                        "config": {"page": str(i)}}
            json_data = json.dumps(new_data)
            new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json_data)
            self.browser.get(new_url)
            print "\nThe new URL is -> ", new_url, "\n"
            content = self.browser.page_source
            self.get_item_links(content)

    def get_item_links(self, body=""):
        if body:
            """ This second part goes through all available items """
            raw_links = re.findall(r'listclickable.+?>', body)
            links = []
            if raw_links:
                for raw_link in raw_links:
                    new_link = re.findall(r'data-link=\".+?\"', raw_link)[0].replace("data-link=\"", "").replace("\"", "")
                    links.append(str(new_link))
                if links:
                    ids = self.get_ids(links)
                    for link in links:
                        current_id = self.get_single_id(link)
                        print "\nThe Link -> ", link
                        # If the line below is commented out, the code works; otherwise it doesn't
                        yield scrapy.Request(url=link, callback=self.parse_room, dont_filter=True)

    def get_ids(self, list1=[]):
        if list1:
            ids = []
            for elem in list1:
                raw_id = re.findall(r'/[0-9]+', elem)[0].replace("/", "")
                ids.append(raw_id)
            return ids
        else:
            return []

    def get_single_id(self, text=""):
        if text:
            raw_id = re.findall(r'/[0-9]+', text)[0].replace("/", "")
            return raw_id
        else:
            return ""

    def parse_room(self, response):
        # More scraping code...
        pass
So this is essentially my problem. I'm almost sure that what I'm doing isn't the best way to go about it, hence my second question. And to avoid having to deal with these kinds of issues in the future, my third question.
2 - If the answer to the first question is no, how could I tackle this issue? I'm open to other means, if necessary.
3 - Can anyone point me to pages where I can learn how to combine web scraping with JavaScript and Ajax? Nowadays more and more websites use JavaScript and Ajax scripts to load their content.
Many thanks in advance!