Given my limited knowledge of web scraping, I've come across an issue that is very complex for me, and I will try to explain it as best I can (so I'm open to suggestions or edits to my post).
I started using the web crawling framework Scrapy for my web scraping a long time ago, and it's still the one I use today. Recently I came across this website and found that Scrapy was not able to iterate over its pages, since the site uses fragment URLs
(#) to load the data for the next pages. I made a post about that problem (having no idea of the root cause yet): my post
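As far as I can tell, the fragment in the site's URLs is just base64-encoded JSON describing the search state, including the page number. For example, decoding the fragment from the URL in my code below (a small check, nothing more):

import base64
import json

# The part after '#' in the site's URL (copied from the start URL in my spider)
fragment = ("eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJh"
            "dGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ==")

print json.loads(base64.b64decode(fragment))
# -> {u'data': {u'countryId': u'ES', ...}, u'config': {u'page': u'0'}}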
After that, I realized that Scrapy can't handle this on its own without a JavaScript
interpreter or a browser imitation, which is why the Selenium
library was suggested to me. I read as much as I could about that library (e.g. example1, example2, example3 and example4). I also found this StackOverflow post that gives some leads on my issue.
So, finally, my biggest questions are:
1 - Is there any way to iterate/yield over the pages of the website shown above using Selenium along with Scrapy? This is the code I'm using so far, but it doesn't work...
EDIT:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# The required imports (these are what the code below uses)
import re
import json
import base64

import scrapy
from scrapy import Spider
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


def getBrowser():
    path_to_phantomjs = "/some_path/phantomjs-2.1.1-macosx/bin/phantomjs"
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
        "(KHTML, like Gecko) Chrome/15.0.87")
    browser = webdriver.PhantomJS(executable_path=path_to_phantomjs,
                                  desired_capabilities=dcap)
    return browser


class MySpider(Spider):
    name = "myspider"
    browser = getBrowser()

    def start_requests(self):
        the_url = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="
        yield scrapy.Request(url=the_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.get_page_links()

    def get_page_links(self):
        """ This first part goes through all available pages """
        for i in xrange(1, 3):  # 210
            new_data = {"data": {"countryId": "ES", "regionId": "920",
                                 "duration": 7, "minPersons": 1},
                        "config": {"page": str(i)}}
            json_data = json.dumps(new_data)
            new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json_data)
            self.browser.get(new_url)
            print "\nThe new URL is -> ", new_url, "\n"
            content = self.browser.page_source
            self.get_item_links(content)

    def get_item_links(self, body=""):
        if body:
            """ This second part goes through all available items """
            raw_links = re.findall(r'listclickable.+?>', body)
            links = []
            if raw_links:
                for raw_link in raw_links:
                    new_link = re.findall(r'data-link=\".+?\"', raw_link)[0].replace("data-link=\"", "").replace("\"", "")
                    links.append(str(new_link))
                if links:
                    ids = self.get_ids(links)
                    for link in links:
                        current_id = self.get_single_id(link)
                        print "\nThe Link -> ", link
                        # If the line below is commented out, the code works; otherwise it doesn't
                        yield scrapy.Request(url=link, callback=self.parse_room, dont_filter=True)

    def get_ids(self, list1=[]):
        if list1:
            ids = []
            for elem in list1:
                raw_id = re.findall(r'/[0-9]+', elem)[0].replace("/", "")
                ids.append(raw_id)
            return ids
        else:
            return []

    def get_single_id(self, text=""):
        if text:
            raw_id = re.findall(r'/[0-9]+', text)[0].replace("/", "")
            return raw_id
        else:
            return ""

    def parse_room(self, response):
        # More scraping code...
        pass
So this is essentially my problem. I'm almost sure that what I'm doing isn't the best way to go about it, hence my second question. And to avoid having to deal with these kinds of issues in the future, my third question.
2 - If the answer to the first question is no, how could I tackle this issue? I'm open to other means, if necessary.
3 - Can anyone point me to pages where I can learn how to combine web scraping with JavaScript and Ajax? Nowadays more and more websites use JavaScript and Ajax scripts to load their content.
Many thanks in advance!