I'm looking at crawling thousands of pages and need a solution. Every site has its own HTML markup - they are all unique sites. No clean data feed or API is available. I'm hoping to load the captured data into some sort of DB.
Any ideas on how to do this with scrapy if possible?
If I had to scrape clean data from thousands of sites, with each site having its own layout, structure, etc., I would implement (and actually have done so in some projects) the following approach - rough code sketches for each step follow the list:
- Crawler - a scrapy script that crawls these sites with all their subpages (that's the easiest part) and transforms them into plain text
- NLP Processing - some basic NLP (natural language processing) on the plain text: tokenizing, part-of-speech (POS) tagging, named entity recognition (NER)
- Classification - a classifier that uses the data from step 2 to decide whether a page contains the data we're looking for - either simple rule-based or, if needed, using machine learning. Pages that are suspected to contain usable data go into the next step:
- Extraction - a grammar-based, statistical or machine-learning-based extractor that uses the POS tags and NER tags (and any other domain-specific signals) to extract the data we're looking for
- Clean-up - some basic matching of duplicate records created in step 4; it may also be necessary to throw away records that got low confidence scores in steps 2 to 4.
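To make the steps more concrete, here are some rough Python sketches. For step 1, a minimal scrapy CrawlSpider that follows internal links and reduces each page to plain text; the spider name, the example.com seed URL and the pages.jsonl output file are placeholders for whatever sites and storage you actually use:

```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PlainTextSpider(CrawlSpider):
    name = "plaintext"
    # placeholder seeds - in practice load your thousands of sites from a file or DB
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        # drop script/style content, strip tags, collapse whitespace
        texts = response.xpath(
            "//body//text()[not(ancestor::script) and not(ancestor::style)]"
        ).getall()
        yield {"url": response.url,
               "text": " ".join(t.strip() for t in texts if t.strip())}

if __name__ == "__main__":
    process = CrawlerProcess(settings={
        "FEEDS": {"pages.jsonl": {"format": "jsonlines"}},  # one plain-text record per page
    })
    process.crawl(PlainTextSpider)
    process.start()
```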
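For step 2, one option (among many - NLTK or Stanza work just as well) is spaCy; this sketch assumes the English model en_core_web_sm is installed and simply returns tokens with POS tags plus named entities for one page's text:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model was downloaded beforehand

def nlp_features(plain_text):
    doc = nlp(plain_text)
    return {
        "tokens": [(tok.text, tok.pos_) for tok in doc],           # token + POS tag
        "entities": [(ent.text, ent.label_) for ent in doc.ents],  # entity text + NER label
    }

features = nlp_features("Acme Corp opened a new office in Berlin in March 2020.")
# features["entities"] would be something like
# [("Acme Corp", "ORG"), ("Berlin", "GPE"), ("March 2020", "DATE")]
```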
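For step 3, the rule-based variant can be as simple as counting keywords and entities; the keywords and thresholds below are made up for illustration, and the whole function could later be swapped for a trained classifier (e.g. scikit-learn) without changing the rest of the pipeline:

```python
# hypothetical keywords for whatever domain you're after
RELEVANT_KEYWORDS = {"price", "product", "model", "warranty"}

def looks_relevant(features, min_keyword_hits=2, min_entities=1):
    """Decide whether a page (as returned by nlp_features above) goes to extraction."""
    token_texts = {text.lower() for text, _pos in features["tokens"]}
    keyword_hits = len(RELEVANT_KEYWORDS & token_texts)
    return keyword_hits >= min_keyword_hits and len(features["entities"]) >= min_entities
```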
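For step 4, a toy grammar-like rule on top of the NER output: take sentences that contain both an ORG and a DATE entity and emit them as candidate records. The entity labels, the crude first-ORG/first-DATE pairing and the record fields are purely illustrative - the real rules depend entirely on what data you're after:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_records(plain_text):
    records = []
    for sent in nlp(plain_text).sents:
        orgs  = [ent.text for ent in sent.ents if ent.label_ == "ORG"]
        dates = [ent.text for ent in sent.ents if ent.label_ == "DATE"]
        if orgs and dates:
            # crude pairing: first ORG with first DATE in the sentence
            records.append({"org": orgs[0], "date": dates[0],
                            "sentence": sent.text, "confidence": 0.5})
    return records
```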
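And for step 5, naive clean-up: drop records below a (hypothetical) confidence threshold and collapse exact duplicates on a normalised key before loading everything into the DB:

```python
def clean_records(records, min_confidence=0.6):
    seen, cleaned = set(), []
    for rec in records:
        if rec.get("confidence", 1.0) < min_confidence:
            continue  # throw away low-confidence records (steps 2 to 4)
        key = (rec["org"].strip().lower(), rec["date"].strip().lower())
        if key in seen:
            continue  # basic duplicate matching
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```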
This goes way beyond building a scrapy scraper, of course, and requires solid knowledge and experience in NLP and possibly machine learning.
Also, you can't expect anywhere close to 100% accurate results from such an approach. Depending on how the algorithms are adjusted and trained, such a system will either skip some of the valid data (false negatives) or pick up data where there actually isn't any (false positives) ... or a mix of both.
Nonetheless, I hope my answer helps you get a good picture of what's involved.