I'm looking at crawling thousands of pages and need a solution. Every site has its own HTML markup - they are all unique sites. No clean data feed or API is available. I'm hoping to load the captured data into some sort of DB.
Any ideas on how to do this with scrapy if possible?
If I had to scrape clean data from thousands of sites, with each site having its own layout, structure, etc., I would implement (and actually have done so in some projects) the following approach - rough code sketches for each step follow the list:
- Crawler - a scrapy script that crawls these sites with all their subpages (that's the easiest part) and transforms them into plain text
- NLP Processing - some basic NLP (natural language processing) on the plain text: tokenizing, part-of-speech (POS) tagging, named entity recognition (NER)
- Classification - a classifier that uses the data from step 2 to decide whether a page contains the data we're looking for - either simple rule-based or, if needed, using machine learning. Pages that are suspected to contain usable data go into the next step:
- Extraction - a grammar-based, statistical or machine-learning-based extractor that uses the POS tags and NER tags (and any other domain-specific signals) to extract the data we're looking for
- Clean-up - some basic matching of duplicate records created in step 4; it may also be necessary to throw away records that got low confidence scores in steps 2 to 4.
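To make the steps more concrete, here are some rough Python sketches. For step 1, a minimal scrapy CrawlSpider that follows internal links and reduces each page to plain text; the spider name, the example.com seed URL and the pages.jsonl output file are placeholders for whatever sites and storage you actually use:

```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PlainTextSpider(CrawlSpider):
    name = "plaintext"
    # placeholder seeds - in practice load your thousands of sites from a file or DB
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        # drop script/style content, strip tags, collapse whitespace
        texts = response.xpath(
            "//body//text()[not(ancestor::script) and not(ancestor::style)]"
        ).getall()
        yield {"url": response.url,
               "text": " ".join(t.strip() for t in texts if t.strip())}

if __name__ == "__main__":
    process = CrawlerProcess(settings={
        "FEEDS": {"pages.jsonl": {"format": "jsonlines"}},  # one plain-text record per page
    })
    process.crawl(PlainTextSpider)
    process.start()
```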
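For step 2, one option (among many - NLTK or Stanza work just as well) is spaCy; this sketch assumes the English model en_core_web_sm is installed and simply returns tokens with POS tags plus named entities for one page's text:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model was downloaded beforehand

def nlp_features(plain_text):
    doc = nlp(plain_text)
    return {
        "tokens": [(tok.text, tok.pos_) for tok in doc],           # token + POS tag
        "entities": [(ent.text, ent.label_) for ent in doc.ents],  # entity text + NER label
    }

features = nlp_features("Acme Corp opened a new office in Berlin in March 2020.")
# features["entities"] would be something like
# [("Acme Corp", "ORG"), ("Berlin", "GPE"), ("March 2020", "DATE")]
```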
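For step 3, the rule-based variant can be as simple as counting keywords and entities; the keywords and thresholds below are made up for illustration, and the whole function could later be swapped for a trained classifier (e.g. scikit-learn) without changing the rest of the pipeline:

```python
# hypothetical keywords for whatever domain you're after
RELEVANT_KEYWORDS = {"price", "product", "model", "warranty"}

def looks_relevant(features, min_keyword_hits=2, min_entities=1):
    """Decide whether a page (as returned by nlp_features above) goes to extraction."""
    token_texts = {text.lower() for text, _pos in features["tokens"]}
    keyword_hits = len(RELEVANT_KEYWORDS & token_texts)
    return keyword_hits >= min_keyword_hits and len(features["entities"]) >= min_entities
```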
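For step 4, a toy grammar-like rule on top of the NER output: take sentences that contain both an ORG and a DATE entity and emit them as candidate records. The entity labels, the crude first-ORG/first-DATE pairing and the record fields are purely illustrative - the real rules depend entirely on what data you're after:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_records(plain_text):
    records = []
    for sent in nlp(plain_text).sents:
        orgs  = [ent.text for ent in sent.ents if ent.label_ == "ORG"]
        dates = [ent.text for ent in sent.ents if ent.label_ == "DATE"]
        if orgs and dates:
            # crude pairing: first ORG with first DATE in the sentence
            records.append({"org": orgs[0], "date": dates[0],
                            "sentence": sent.text, "confidence": 0.5})
    return records
```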
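And for step 5, naive clean-up: drop records below a (hypothetical) confidence threshold and collapse exact duplicates on a normalised key before loading everything into the DB:

```python
def clean_records(records, min_confidence=0.6):
    seen, cleaned = set(), []
    for rec in records:
        if rec.get("confidence", 1.0) < min_confidence:
            continue  # throw away low-confidence records (steps 2 to 4)
        key = (rec["org"].strip().lower(), rec["date"].strip().lower())
        if key in seen:
            continue  # basic duplicate matching
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```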
This goes way beyond building a scrapy scraper, of course, and requires solid knowledge and experience in NLP and possibly machine learning.
Also, you can't expect anywhere close to 100% accurate results from such an approach. Depending on how the algorithms are adjusted and trained, such a system will either skip some of the valid data (false negatives) or pick up data where there actually isn't any (false positives) ... or a mix of both.
Nonetheless, I hope my answer helps you get a good picture of what's involved.