How to crawl thousands of pages using Scrapy?

2024/9/20 17:23:36

I'm looking at crawling thousands of pages and need a solution. Every site has its own HTML code - they are all unique sites. No clean data feed or API is available. I'm hoping to load the captured data into some sort of database.

Any ideas on how to do this with Scrapy, if possible?

Answer

If I had to scrape clean data from thousands of sites, each with its own layout, structure, etc., I would implement the following approach (and actually have done so in some projects):

  1. Crawler - a Scrapy script that crawls these sites with all their subpages (that's the easiest part) and transforms them into plain text.
  2. NLP processing - some basic natural language processing (tokenizing, part-of-speech (POS) tagging, named-entity recognition (NER)) on the plain text.
  3. Classification - a classifier that uses the data from step 2 to decide whether a page contains the data we're looking for, either simple rule-based or, if needed, using machine learning. Pages that are suspected to contain usable data are passed on to the next step.
  4. Extraction - a grammar-based, statistical or machine-learning-based extractor that uses POS tags and NER tags (and any other domain-specific factors) to extract the data we're looking for.
  5. Clean up - some basic matching of duplicate records that were created in step 4. It may also be necessary to throw away records that had low confidence scores in steps 2 to 4.
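To make the shape of such a pipeline a bit more concrete, here is a minimal sketch of the plain-text conversion from step 1, a rule-based classifier (step 3), a naive extractor (step 4) and the clean-up (step 5). It is only an illustration under assumptions of mine: the target data is assumed to be product prices, the keyword list, the regular expression and all function names are hypothetical, and the standard library's `html.parser` stands in for whatever the Scrapy crawler would hand over. The crawl itself and the NLP tagging of step 2 are left out.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Step 1 (second half): turn a crawled HTML page into plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical signals for "this page contains a product price".
PRICE_RE = re.compile(r"\$\s?\d+(?:\.\d{2})?")
KEYWORDS = ("price", "in stock", "add to cart")


def score_page(text):
    """Step 3: rule-based score in [0, 1], the fraction of signals found."""
    lowered = text.lower()
    hits = sum(kw in lowered for kw in KEYWORDS)
    if PRICE_RE.search(text):
        hits += 1
    return hits / (len(KEYWORDS) + 1)


def extract_records(text, confidence):
    """Step 4 (very naive): one record per price-like string on the page."""
    return [
        {"price": m.group(0).replace(" ", ""), "confidence": confidence}
        for m in PRICE_RE.finditer(text)
    ]


def clean_up(records, min_confidence=0.5):
    """Step 5: drop low-confidence records and exact duplicates."""
    seen = set()
    kept = []
    for rec in records:
        if rec["confidence"] < min_confidence or rec["price"] in seen:
            continue
        seen.add(rec["price"])
        kept.append(rec)
    return kept


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head><body>"
        "<p>Price: $19.99</p><script>var x = '$5.00';</script>"
        "<p>In stock. Now only $19.99</p></body></html>"
    )
    text = html_to_text(page)
    print(clean_up(extract_records(text, score_page(text))))
```

In a real project the regular expression would be replaced by the POS/NER-based extractor described in step 4, and the cleaned records would be written to the database instead of printed.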

This goes way beyond building a Scrapy scraper, of course, and requires deep knowledge and experience in NLP and maybe machine learning.

Also, you can't expect to get anywhere close to 100% accurate results from such an approach. Depending on how the algorithms are adjusted and trained, such a system will either skip some of the valid data (false negatives), pick up data where there actually isn't any (false positives), or do a mix of both.
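The trade-off can be illustrated with a small sketch. The scored pages below are made-up sample data (a confidence score plus a hand-assigned label saying whether the page really contains valid data), and `error_counts` is a hypothetical helper, not part of any library. Raising the threshold removes false positives but adds false negatives, and lowering it does the opposite.

```python
def error_counts(scored, threshold):
    """Count false positives and false negatives at a given threshold.

    `scored` is a list of (confidence, is_valid) pairs; a page is accepted
    when its confidence is at least `threshold`.
    """
    tp = fp = fn = 0
    for confidence, is_valid in scored:
        accepted = confidence >= threshold
        if accepted and is_valid:
            tp += 1
        elif accepted and not is_valid:
            fp += 1
        elif not accepted and is_valid:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return {"fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Made-up labelled sample, for illustration only.
SAMPLE = [
    (0.95, True), (0.80, True), (0.70, False), (0.60, True),
    (0.40, False), (0.30, True), (0.10, False),
]

if __name__ == "__main__":
    for threshold in (0.9, 0.5, 0.2):
        print(threshold, error_counts(SAMPLE, threshold))
```

On this sample, a strict threshold of 0.9 gives no false positives but misses three valid pages, while a lenient threshold of 0.2 misses nothing but accepts two pages without valid data.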

Nonetheless, I hope my answer gives you a good picture of what is involved.

