Extended example to understand CUDA, Numba, CuPy, etc.

2024/10/14 3:18:27

Almost all examples of Numba, CuPy, etc. available online are simple array additions that show the speedup from going from a single CPU core/thread to a GPU. The documentation for individual commands mostly lacks good examples. This post is intended to provide a more comprehensive example.

The initial code is provided here. It's a simple model of the classic cellular automaton, the Game of Life. Originally, it doesn't even use NumPy, just plain Python and the Pyglet module for visualization.

My goal is to extend this code to a specific problem (which will be very large), but first I thought it best to optimize it for GPU usage.

The game_of_life.py file is this:

```
import random as rnd
import pyglet
#import numpy as np
#from numba import vectorize, cuda, jit


class GameOfLife:
    def __init__(self, window_width, window_height, cell_size, percent_fill):
        self.grid_width = int(window_width / cell_size)
        self.grid_height = int(window_height / cell_size)
        self.cell_size = cell_size
        self.percent_fill = percent_fill
        self.cells = []
        self.generate_cells()

    def generate_cells(self):
        for row in range(0, self.grid_height):
            self.cells.append([])
            for col in range(0, self.grid_width):
                if rnd.random() < self.percent_fill:
                    self.cells[row].append(1)
                else:
                    self.cells[row].append(0)

    def run_rules(self):
        temp = []
        for row in range(0, self.grid_height):
            temp.append([])
            for col in range(0, self.grid_width):
                cell_sum = sum([self.get_cell_value(row - 1, col),
                                self.get_cell_value(row - 1, col - 1),
                                self.get_cell_value(row,     col - 1),
                                self.get_cell_value(row + 1, col - 1),
                                self.get_cell_value(row + 1, col),
                                self.get_cell_value(row + 1, col + 1),
                                self.get_cell_value(row,     col + 1),
                                self.get_cell_value(row - 1, col + 1)])
                if self.cells[row][col] == 0 and cell_sum == 3:
                    temp[row].append(1)
                elif self.cells[row][col] == 1 and (cell_sum == 3 or cell_sum == 2):
                    temp[row].append(1)
                else:
                    temp[row].append(0)
        self.cells = temp

    def get_cell_value(self, row, col):
        if row >= 0 and row < self.grid_height and col >= 0 and col < self.grid_width:
            return self.cells[row][col]
        return 0

    def draw(self):
        for row in range(0, self.grid_height):
            for col in range(0, self.grid_width):
                if self.cells[row][col] == 1:
                    # (0, 0) (0, 20) (20, 0) (20, 20)
                    square_coords = (row * self.cell_size,                  col * self.cell_size,
                                     row * self.cell_size,                  col * self.cell_size + self.cell_size,
                                     row * self.cell_size + self.cell_size, col * self.cell_size,
                                     row * self.cell_size + self.cell_size, col * self.cell_size + self.cell_size)
                    pyglet.graphics.draw_indexed(4, pyglet.gl.GL_TRIANGLES,
                                                 [0, 1, 2, 1, 2, 3],
                                                 ('v2i', square_coords))
```

First, I could use NumPy by adding self.cells = np.asarray(self.cells) at the end of generate_cells and self.cells = np.asarray(temp) at the end of run_rules, since doing the conversion any earlier wouldn't give a speedup, as presented here. (In fact, switching to NumPy didn't give a noticeable speedup.)
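A likely reason the NumPy conversion changes little is that run_rules still visits every cell from Python, so the array is only a container. The sketch below shows one way to compute a whole generation with array operations instead. The function name life_step is hypothetical, and the zero padding is my assumption for reproducing get_cell_value, which returns 0 outside the grid.

```python
import numpy as np


def life_step(cells):
    """Return the next generation; cells outside the grid count as dead (0)."""
    height, width = cells.shape
    # A one-cell border of zeros plays the role of get_cell_value's "return 0".
    padded = np.pad(cells, 1, mode="constant")
    # Add up the eight shifted views of the padded grid (the centre is skipped).
    neighbours = sum(
        padded[1 + dr:1 + dr + height, 1 + dc:1 + dc + width]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    born = (cells == 0) & (neighbours == 3)
    survives = (cells == 1) & ((neighbours == 2) | (neighbours == 3))
    return (born | survives).astype(cells.dtype)
```

With this shape of code, run_rules would reduce to self.cells = life_step(self.cells), provided self.cells is created once as a 2-D array (for example with dtype uint8) rather than rebuilt from lists every frame.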

Regarding GPUs: for example, I added @jit before every function, and the program became very slow. I also tried @vectorize(['float32(float32, float32)'], target='cuda'), but this raised a question: how do you use @vectorize on functions that only have self as an input argument?
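As far as I understand it, Numba works best on plain functions that take arrays and scalars, because it cannot infer useful types for the self of an ordinary Python class, and @vectorize builds element-wise functions of scalar arguments, which does not fit a rule that reads eight neighbours. A common workaround is to move the hot loop into a module-level function and let the method delegate to it. The sketch below assumes that layout; life_step_loops and ArrayGameOfLife are hypothetical names, and the ImportError fallback exists only so the example still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit      # compiles the decorated function when Numba is available
except ImportError:             # assumption: fall back to plain Python so the sketch still runs
    def njit(func):
        return func


@njit
def life_step_loops(cells):
    """Explicit-loop Game of Life step on a 2-D integer array."""
    height, width = cells.shape
    out = np.zeros_like(cells)
    for row in range(height):
        for col in range(width):
            total = 0
            for dr in range(-1, 2):
                for dc in range(-1, 2):
                    if dr == 0 and dc == 0:
                        continue
                    r = row + dr
                    c = col + dc
                    # Out-of-range neighbours are treated as dead cells.
                    if 0 <= r < height and 0 <= c < width:
                        total += cells[r, c]
            if total == 3 or (total == 2 and cells[row, col] == 1):
                out[row, col] = 1
    return out


class ArrayGameOfLife:
    """Trimmed-down stand-in for GameOfLife that keeps the grid in an array."""

    def __init__(self, cells):
        self.cells = np.asarray(cells, dtype=np.uint8)

    def run_rules(self):
        # The method stays ordinary Python; only the array kernel is compiled.
        self.cells = life_step_loops(self.cells)
```

The first call pays the compilation cost, so timing a single frame can make the compiled version look slower than it is.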

I also tried substituting CuPy for NumPy, e.g. self.cells = cupy.asarray(self.cells), but the program also became very slow.

Following the initial idea of an extended example of GPU usage, what would be the proper approach to the problem? Where is the right place to put the modifications/vectorizations/parallelizations/Numba/CuPy, etc.? And most importantly, why?
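A plausible cause of the CuPy slowdown is that the per-cell Python loop stays in place: reading single elements of a device array from Python is expensive, and converting the list on every frame adds a host-to-device copy each time. The sketch below keeps the grid in one array for the whole run and uses only whole-array operations. It is written against an array module parameter xp, a hypothetical convention of this example, so that it runs with NumPy here; passing the cupy module instead is the assumed way to run the same code on a GPU. The names life_step_xp and simulate are also hypothetical.

```python
import numpy as np


def life_step_xp(cells, xp=np):
    """One generation using only whole-array operations from the module xp."""
    height, width = cells.shape
    # Zero border so that neighbours outside the grid count as dead.
    padded = xp.zeros((height + 2, width + 2), dtype=cells.dtype)
    padded[1:-1, 1:-1] = cells
    neighbours = xp.zeros((height, width), dtype=xp.int32)
    for dr in (0, 1, 2):
        for dc in (0, 1, 2):
            if (dr, dc) != (1, 1):          # skip the cell itself
                neighbours += padded[dr:dr + height, dc:dc + width]
    alive = (neighbours == 3) | ((neighbours == 2) & (cells == 1))
    return alive.astype(cells.dtype)


def simulate(initial, steps, xp=np):
    """Convert the grid once, then advance it without leaving the array module."""
    cells = xp.asarray(initial, dtype=xp.uint8)   # single conversion up front
    for _ in range(steps):
        cells = life_step_xp(cells, xp)
    return cells
```

With CuPy, the grid would be copied back to the host only when it has to be drawn, for example with cupy.asnumpy(cells) once per frame. For a grid as small as 53 x 53 cells the launch and transfer overhead may still outweigh any gain, so a GPU is more likely to pay off on the very large problem mentioned earlier.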

Additional info: besides the code provided, here's the main.py file:

```
import pyglet
from game_of_life import GameOfLife


class Window(pyglet.window.Window):
    def __init__(self):
        super().__init__(800, 800)
        self.gameOfLife = GameOfLife(self.get_size()[0],
                                     self.get_size()[1],
                                     15,  # the smaller this value, the more computationally intensive the run
                                     0.5)
        pyglet.clock.schedule_interval(self.update, 1.0/24.0)  # 24 frames per second

    def on_draw(self):
        self.clear()
        self.gameOfLife.draw()

    def update(self, dt):
        self.gameOfLife.run_rules()


if __name__ == '__main__':
    window = Window()
    pyglet.app.run()
```

Answer

I don't fully follow your example, but I also needed GPU computing. After a few days of pain I think I understand how to use it, so I'll show you what I ended up with in the hope that it helps. One thing I need to point out: in the call "binsort_kernel(cuts, cuts, ...)" I pass cuts twice. The first argument is declared with a concrete type (int32 I), so the kernel uses it as the array it traverses element by element, and it cannot be read by index. The second one is declared raw, so I use it whenever I need to read the data at an arbitrary index.
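As a reading aid for the kernel in this answer, the typed-versus-raw distinction can be mimicked on the CPU. The sketch below is a plain-Python transliteration of the kernel body as I read it, not the answerer's code. The function name binsort_reference is hypothetical, the arrays are flat Python lists, and I assume that bins holds the starting offset of each (row, bucket) pair in out and that num holds running counts starting at zero.

```python
def binsort_reference(cuts, ind, row, col, q, out, bins, num):
    """CPU transliteration of the kernel body; modifies out and num in place."""
    for i in range(len(cuts)):       # the kernel body runs once per element of the typed argument
        current = cuts[i]            # typed argument: only element i is visible (unused, as in the kernel)
        i_x, i_y = divmod(i, col)    # row and column of this element
        if i_y != 0:
            continue                 # only the first element of each row does the work
        b_f = i_x * col              # first flat index of the row
        n_x = i_x * q                # first bucket slot of the row
        inx = (i_x % row) * col      # matching row start inside ind
        for j in range(b_f, b_f + col):
            if cuts[j] < q:          # raw argument: free indexing into the whole array
                r_x = inx + j - b_f
                adb = n_x + cuts[j]
                adi = bins[adb] + num[adb]
                out[adi] = ind[r_x]
                num[adb] += 1
    return out, num


# Two rows of three values, two buckets per row; the value 2 is >= q and is skipped.
out, num = binsort_reference(
    cuts=[0, 1, 0, 1, 2, 1], ind=[10, 11, 12, 13, 14, 15],
    row=2, col=3, q=2,
    out=[0] * 5, bins=[0, 2, 3, 3], num=[0] * 4,
)
print(out, num)
```

Within each row the indices end up grouped by bucket, which is why one worker per row is enough and why the worker needs indexed (raw) access to cuts rather than just its own element.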

```
import cupy as cp

binsort_kernel = cp.ElementwiseKernel(
    # inputs: I is traversed element by element; raw arrays can be indexed freely
    'int32 I, raw T cut, raw T ind, int32 row, int32 col, int32 q',
    # outputs, all raw so the kernel can write at computed positions
    'raw T out, raw T bin, raw T num',
    '''
    int i_x = i / col;          // row of the current element
    int i_y = i % col;          // column of the current element
    int b_f = i_x * col;        // first flat index of this row
    int b_l = b_f + col;        // one past the last flat index of this row
    int n_x = i_x * q;          // first bucket slot of this row
    int inx = i_x % row * col;  // matching row start inside ind
    int r_x = 0; int adi = 0; int adb = 0;
    if (i_y == 0)               // one worker per row
    {
        for (size_t j = b_f; j < b_l; j++) {
            if (cut[j] < q) {
                r_x = inx + j - b_f;
                adb = n_x + cut[j];
                adi = bin[adb] + num[adb];
                out[adi] = ind[r_x];
                num[adb] += 1;
            }
        }
    }
    ''',
    'binsort')

# cuts, ind, row, col, q, iout, bins and bnum come from the answerer's own program
binsort_kernel(cuts, cuts, ind, row, col, q, iout, bins, bnum)
```
