I have a website URL, like www.example.com.
I want to collect social information from this website, such as its Facebook URL (facebook.com/example), Twitter URL (twitter.com/example), etc., if they are available anywhere, on any page of the website.
How can I complete this task? Please suggest any tutorials, blogs, or technologies.
Since you don't know exactly where (on which page of the website) those links are located, you probably want to base your spider on the CrawlSpider class. Such a spider lets you define rules for link extraction and navigation through the website. See this minimal example:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com']

    # Follow every link that stays within example.com and pass each
    # fetched page to parse_page.
    rules = (
        Rule(LinkExtractor(allow_domains=('example.com',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Collect the href of every <a> element whose URL mentions
        # facebook.com or twitter.com on this page.
        item = dict()
        item['page'] = response.url
        item['facebook_urls'] = response.xpath(
            '//a[contains(@href, "facebook.com")]/@href').extract()
        item['twitter_urls'] = response.xpath(
            '//a[contains(@href, "twitter.com")]/@href').extract()
        yield item
This spider will crawl all pages of the example.com website and extract URLs containing facebook.com and twitter.com. If you save the spider to a file, you can run it standalone with scrapy runspider yourfile.py -o items.json, which writes the yielded items to a JSON file.
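If you only need the extraction step (not the crawling), the same idea can be sketched without Scrapy using just the standard library. This is a minimal, framework-free illustration of what the XPath expressions in the spider do on a single page; the sample HTML and class name are made up for the example:

```python
from html.parser import HTMLParser


class SocialLinkParser(HTMLParser):
    """Collects href values that mention facebook.com or twitter.com."""

    def __init__(self):
        super().__init__()
        self.facebook_urls = []
        self.twitter_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; value may be None.
        href = dict(attrs).get('href') or ''
        if 'facebook.com' in href:
            self.facebook_urls.append(href)
        elif 'twitter.com' in href:
            self.twitter_urls.append(href)


html = '''
<footer>
  <a href="https://www.facebook.com/example">Facebook</a>
  <a href="https://twitter.com/example">Twitter</a>
  <a href="/about">About</a>
</footer>
'''
parser = SocialLinkParser()
parser.feed(html)
print(parser.facebook_urls)  # ['https://www.facebook.com/example']
print(parser.twitter_urls)   # ['https://twitter.com/example']
```

In practice Scrapy's XPath selectors are more convenient, but this shows the matching logic is just a substring check on each anchor's href.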