Scrapy and celery `update_state`

2024/10/11 22:25:45

I have the following setup (Docker):

  • Celery linked to Flask setup which runs the Scrapy spider
  • Flask setup (obviously)
  • Flask setup gets a request for Scrapy -> fires up a worker to do the work

Now I want to report the Celery worker's progress back to the original Flask setup. BUT right now there is no way to call update_state() from inside the scraper, because it has no access to the original task (even though it is running inside the Celery task).
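For the Flask side of this, a minimal sketch of how a status endpoint might read the progress back. It assumes the result object behaves like Celery's AsyncResult, of which only the `state` and `info` attributes are used; `progress_payload` and the `items_scraped` key are hypothetical names, and a SimpleNamespace stands in for the real result so the snippet runs without Celery.

```python
from types import SimpleNamespace


def progress_payload(result):
    """Turn a Celery-style result object into a JSON-friendly dict.

    Only `state` and `info` are read. For a custom state such as
    'PROGRESS', `info` holds the meta dict passed to update_state();
    for a pending task it is None.
    """
    info = result.info if isinstance(result.info, dict) else {}
    return {"state": result.state, "progress": info}


# Stand-in for something like app.AsyncResult(task_id) (assumption).
fake_result = SimpleNamespace(state="PROGRESS", info={"items_scraped": 12})
payload = progress_payload(fake_result)
```

In a Flask view the returned dict could simply be passed to jsonify.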

As an aside: am I missing something about the structure of Scrapy? It would seem reasonable to assign arguments in __init__ for later use, but Scrapy seems to use the methods as lambda functions.
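On the aside: Scrapy callbacks such as self.parse are normally passed around as bound methods, not lambdas, so attributes set in __init__ should still be reachable inside them. Keyword arguments given to crawl() are also forwarded to the spider's constructor. The sketch below shows only the plain-Python mechanism; `DemoSpider`, `fake_engine` and `task_id` are hypothetical stand-ins, not Scrapy classes.

```python
class DemoSpider:
    """Hypothetical stand-in for a scrapy.Spider subclass (no Scrapy needed)."""

    def __init__(self, task_id=None):
        # Anything stored on self here stays available in every callback.
        self.task_id = task_id
        self.seen = []

    def parse(self, response):
        # Passed around as spider.parse, this is a bound method,
        # so self (and self.task_id) is still accessible when it runs.
        self.seen.append((self.task_id, response))


def fake_engine(callback, responses):
    """Stand-in for the framework invoking the callback later on."""
    for response in responses:
        callback(response)


spider = DemoSpider(task_id="abc123")
fake_engine(spider.parse, ["page-1", "page-2"])
```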


To answer some questions:

  • How are you using celery with scrapy? Scrapy runs inside a Celery task, not from the command line. I have also never heard of scrapyd; is it a subproject of Scrapy? I use a remote worker to fire off Scrapy from inside a Celery/Flask instance, so it is not the same thread that was instantiated by the original request; they are separate Docker instances.

task.update_state works great inside the Celery task, but as soon as we are 'in' the spider, we no longer have access to Celery. Any ideas?

From the item_scraped signal, issue Task.update_state(taskid, meta={}). You can also call it without the taskid if Scrapy happens to be running in a Celery task itself (as it defaults to self).

Is this something like a static way of accessing the current Celery task? I would love that.
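On the "static access" question: Celery exposes a `current_task` proxy in the top-level celery package, which resolves to the task being executed in the current thread. A hedged sketch of using it from spider code follows; `report_progress` is a hypothetical helper, and it can only work if the spider runs in the same worker process and thread as the task.

```python
def report_progress(meta, state="PROGRESS"):
    """Update the state of whichever Celery task is running this code.

    Returns True if an update was sent, False when called outside a task.
    """
    # Imported lazily so this module can be imported without Celery.
    from celery import current_task

    # The proxy is falsy when no task is being executed.
    if not current_task:
        return False

    # With no task id given, update_state() applies to the current task.
    current_task.update_state(state=state, meta=meta)
    return True
```

A spider callback or an item_scraped handler could then call report_progress({"scraped": n}) without being handed the task object.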

Answer

I'm not sure how you are firing your spiders, but I've faced the same issue you describe.

My setup is Flask as a REST API, which fires Celery tasks to start spiders upon request. I haven't gotten around to coding it yet, but here is what I was thinking of doing:

from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy import signals
from .your_celery import app
# MySpider is your own spider class; import it from your project as well


@app.task(bind=True)
def scrapping(self):
    def my_item_scrapped_handler(item, response, spider):
        meta = {
            # fill your state meta as required, based on the scraped item,
            # spider, or response object passed as parameters
        }
        # here self refers to the task, so you can call update_state when using bind
        self.update_state(state='PROGRESS', meta=meta)

    settings = get_project_settings()
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner(settings)
    d = runner.crawl(MySpider)
    d.addBoth(lambda _: reactor.stop())
    for crawler in runner.crawlers:
        crawler.signals.connect(my_item_scrapped_handler, signal=signals.item_scraped)
    reactor.run()

I'm sorry I can't confirm that it works, but I'll report back here as soon as I get around to testing it! I currently can't dedicate as much time to this project as I would like :(
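The handler in the task above leaves meta empty. One possible way to fill it is sketched here, also untested against a live worker: a small factory that counts scraped items and reports them through the task. `make_progress_handler`, `total` and `FakeTask` are hypothetical names; the only thing assumed about Celery is that a bound task offers update_state(state=..., meta=...).

```python
def make_progress_handler(task, total=None):
    """Build an item_scraped handler that reports progress through `task`."""
    scraped = 0

    def on_item_scraped(item, response, spider):
        nonlocal scraped
        scraped += 1
        meta = {"spider": getattr(spider, "name", None), "scraped": scraped}
        if total:
            # Only meaningful if the expected item count is known up front.
            meta["percent"] = round(100 * scraped / total, 1)
        task.update_state(state="PROGRESS", meta=meta)

    return on_item_scraped


class FakeTask:
    """Hypothetical stand-in for a bound Celery task, for local experiments."""

    def __init__(self):
        self.updates = []

    def update_state(self, state=None, meta=None):
        self.updates.append((state, meta))
```

Inside the bound task it would be connected the same way as before, for example crawler.signals.connect(make_progress_handler(self), signal=signals.item_scraped).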

Do not hesitate to contact me if you think I can help you any further!

Cheers, Ramiro

Sources:

  • CrawlerRunner crawlers property: https://doc.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerRunner.crawlers
  • Celery tasks docs:
    • Bound Tasks: http://docs.celeryproject.org/en/latest/userguide/tasks.html#bound-tasks
    • Custom states: http://docs.celeryproject.org/en/latest/userguide/tasks.html#custom-states
  • Scrapy signals: https://doc.scrapy.org/en/latest/topics/signals.html#signals
  • Running scrapy as scripts: https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script