How to reduce memory usage of threaded python code?

2024/10/3 19:16:07

I wrote about 50 classes that I use to connect and work with websites using mechanize and threading. They all work concurrently, but they don't depend on each other. So that means 1 class - 1 website - 1 thread. It's not particularly elegant solution, especially for managing the code, since lot of the code repeats in each class (but not nearly enough to make it into one class to pass arguments, as some sites may require additional processing of retrieved data in middle of methods - like 'login' - that others might not need). As I said, it's not elegant -- But it works. Needless to say I welcome all recommendations how to write this better without using 1 class for each website approach. Adding additional functionality or overall code management of each class is a daunting task.

However, I found out, that each thread takes about 8MB memory, so with 50 running threads we are looking at about 400MB usage. If it was running on my system I wouldn't have problem with that, but since it's running on a VPS with only 1GB memory, it's starting to be an issue. Can you tell me how to reduce the memory usage, or are there any other way to to work with multiple sites concurrently?

I used this quick test python program to test if it's the data stored in variables of my application that is using the memory, or something else. As you can see in following code, it's only processing sleep() function, yet each thread is using 8MB of memory.

from thread import start_new_thread
from time import sleepdef sleeper():try:while 1:sleep(10000)except:if running: raisedef test():global runningn = 0running = Truetry:while 1:start_new_thread(sleeper, ())n += 1if not (n % 50):print nexcept Exception, e:running = Falseprint 'Exception raised:', eprint 'Biggest number of threads:', nif __name__ == '__main__':test()

When I run this, the output is:

50
100
150
Exception raised: can't start new thread
Biggest number of threads: 188

And by removing running = False line, I can then measure free memory using free -m command in shell:

             total       used       free     shared    buffers     cached
Mem:          1536       1533          2          0          0          0
-/+ buffers/cache:       1533          2
Swap:            0          0          0

The actual calculation why I know it's taking about 8MB per thread is then simple by dividing dividing the difference of memory used before and during the the above test application is running, divided by maximum threads it managed to start.

It's probably only allocated memory, because by looking at top, the python process uses only about 0.6% of memory.

Answer
  • use less threads: ThreadPoolExecutor example. Install futures on Python 2.x
  • try asynchronous approach:
    • gevent example
    • twisted example
https://en.xdnf.cn/q/70695.html

Related Q&A

Connection is closed when a SQLAlchemy event triggers a Celery task

When one of my unit tests deletes a SQLAlchemy object, the object triggers an after_delete event which triggers a Celery task to delete a file from the drive.The task is CELERY_ALWAYS_EAGER = True when…

Python escape sequence \N{name} not working as per definition

I am trying to print unicode characters given their name as follows:# -*- coding: utf-8 -*- print "\N{SOLIDUS}" print "\N{BLACK SPADE SUIT}"However the output I get is not very enco…

Binary integer programming with PULP using vector syntax for variables?

New to the python library PULP and Im finding the documentation somewhat unhelpful, as it does not include examples using lists of variables. Ive tried to create an absolutely minimalist example below …

Nonblocking Scrapy pipeline to database

I have a web scraper in Scrapy that gets data items. I want to asynchronously insert them into a database as well. For example, I have a transaction that inserts some items into my db using SQLAlchemy …

python function to return javascript date.getTime()

Im attempting to create a simple python function which will return the same value as javascript new Date().getTime() method. As written here, javascript getTime() method returns number of milliseconds …

Pulling MS access tables and putting them in data frames in python

I have tried many different things to pull the data from Access and put it into a neat data frame. right now my code looks like this.from pandas import DataFrame import numpy as npimport pyodbc from sq…

Infinite loop while adding two integers using bitwise operations?

I am trying to solve a problem, using python code, which requires me to add two integers without the use of + or - operators. I have the following code which works perfectly for two positive numbers: d…

When is pygame.init() needed?

I am studying pygame and in the vast majority of tutorials it is said that one should run pygame.init() before doing anything. I was doing one particular tutorial and typing out the code as one does an…

mypy overrides in toml are ignored?

The following is a simplified version of the toml file example from the mypy documentation: [tool.mypy] python_version = "3.7" warn_return_any = true warn_unused_configs = true[[tool.mypy.ove…

/usr/bin/env: python2.6: No such file or directory error

I have python2.6, python2.7 and python3 in my /usr/lib/I am trying to run a file which has line given below in it as its first line#!/usr/bin/env python2.6after trying to run it it gives me following e…