How to extract social information from a given website?

2024/11/20 8:37:15

I have a website URL such as www.example.com.

I want to collect social information from this website, such as its Facebook URL (facebook.com/example), Twitter URL (twitter.com/example), etc., if they are available anywhere, on any page of the website.

How can I accomplish this task? Please suggest any tutorials, blogs, or technologies.

Answer

Since you don't know exactly where (on which page of the website) those links are located, you probably want to base your spider on Scrapy's CrawlSpider class. Such a spider lets you define rules for link extraction and navigation through the website. See this minimal example:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    start_urls = ['http://www.example.com']

    # Follow only links that stay on example.com and call parse_page on each response.
    rules = (
        Rule(LinkExtractor(allow_domains=('example.com', )), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = dict()
        item['page'] = response.url
        # Collect the href of every anchor whose href contains the social domain.
        item['facebook_urls'] = response.xpath('//a[contains(@href, "facebook.com")]/@href').extract()
        item['twitter_urls'] = response.xpath('//a[contains(@href, "twitter.com")]/@href').extract()
        yield item

This spider crawls the pages it can reach on the example.com website and, for each page, extracts the link URLs containing facebook.com or twitter.com.
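A substring test such as contains(@href, "facebook.com") also matches links that only mention the domain in a path or query string, for example share buttons. As a rough sketch of one way to tighten this, the helper below compares hostnames instead. It uses only the standard library; the names SOCIAL_DOMAINS and classify_social_links are hypothetical and not part of Scrapy or the answer above.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical mapping for illustration; extend it with other networks as needed.
SOCIAL_DOMAINS = {
    "facebook": "facebook.com",
    "twitter": "twitter.com",
}


def classify_social_links(page_url, hrefs):
    """Group hrefs by social network, matching on hostname rather than substring."""
    found = {name: [] for name in SOCIAL_DOMAINS}
    for href in hrefs:
        # Resolve relative and protocol-relative links against the page URL.
        absolute = urljoin(page_url, href.strip())
        host = (urlparse(absolute).hostname or "").lower()
        for name, domain in SOCIAL_DOMAINS.items():
            # Accept the bare domain and any subdomain (www., m., ...).
            if host == domain or host.endswith("." + domain):
                if absolute not in found[name]:  # skip duplicates on the same page
                    found[name].append(absolute)
    return found
```

Inside parse_page this could be called as classify_social_links(response.url, response.xpath('//a/@href').extract()), with the returned lists stored in the item in place of the two XPath queries.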

