Question 1

Mostly all examples of Numba, CuPy and etc available online are simple array additions, showing the speedup from going to cpu singles core/thread to a gpu. And commands documentations mostly lack good examples. This post is intended to provide a more comprehensive example.

The initial code is provided here. Its a simple model for the classic Cellular Automata. Originally, it doesn't even uses numpy, just plain python and the Pyglet module for visualization.

My goal is to extend this code to a specific problem (that will be very large), but first i thought its best to already optimize for GPU usage.

The game_of_life.py is this:

import random as rnd
import pyglet
#import numpy as np
#from numba import vectorize, cuda, jitclass GameOfLife: def __init__(self, window_width, window_height, cell_size, percent_fill):self.grid_width = int(window_width / cell_size) # cell_size self.grid_height = int(window_height / cell_size) # self.cell_size = cell_sizeself.percent_fill = percent_fillself.cells = []self.generate_cells()def generate_cells(self):for row in range(0, self.grid_height): self.cells.append([])for col in range(0, self.grid_width):if rnd.random() < self.percent_fill:self.cells[row].append(1)else:self.cells[row].append(0)def run_rules(self): temp = []for row in range(0, self.grid_height):temp.append([])for col in range(0, self.grid_width):cell_sum = sum([self.get_cell_value(row - 1, col),self.get_cell_value(row - 1, col - 1),self.get_cell_value(row,     col - 1),self.get_cell_value(row + 1, col - 1),self.get_cell_value(row + 1, col),self.get_cell_value(row + 1, col + 1),self.get_cell_value(row,     col + 1),self.get_cell_value(row - 1, col + 1)])if self.cells[row][col] == 0 and cell_sum == 3:temp[row].append(1)elif self.cells[row][col] == 1 and (cell_sum == 3 or cell_sum == 2):temp[row].append(1)else:                 temp[row].append(0)self.cells = tempdef get_cell_value(self, row, col): if row >= 0 and row < self.grid_height and col >= 0 and col < self.grid_width:return self.cells[row][col]return 0def draw(self): for row in range(0, self.grid_height):for col in range(0, self.grid_width):if self.cells[row][col] == 1:#(0, 0) (0, 20) (20, 0) (20, 20)square_coords = (row * self.cell_size,                  col * self.cell_size,row * self.cell_size,                  col * self.cell_size + self.cell_size,row * self.cell_size + self.cell_size, col * self.cell_size,row * self.cell_size + self.cell_size, col * self.cell_size + self.cell_size)pyglet.graphics.draw_indexed(4, pyglet.gl.GL_TRIANGLES,[0, 1, 2, 1, 2, 3],('v2i', square_coords))

Firstly, i could use numpy adding at the end of generate_cells this self.cells = np.asarray(self.cells) and at end of run_rules this self.cells = np.asarray(temp), since doing this before wouldn't present speedups, as presented here.(Actually changing to numpy didn't present a noticeable speedup)

Regarding gpu's, for example, i added @jit before every function, and became very slow. Also tried to use @vectorize(['float32(float32, float32)'], target='cuda'), but this raised a question: how to use @vectorize in functions that only have self as input argument?

I also tried substituting numpy for cupy, like self.cells = cupy.asarray(self.cells), but also became very slow.

Following the initial idea of a extended example of gpu usage, what would be the proper approach to the problem? Where is the right place to put the modifications/vectorizations/parallelizations/numba/cupy etc? And most important, why?

Additional info: besides the code provided, here's the main.py file:

import pyglet
from game_of_life import GameOfLife class Window(pyglet.window.Window):def __init__(self):super().__init__(800,800)self.gameOfLife = GameOfLife(self.get_size()[0],self.get_size()[1],15,  # the lesser this value, more computation intensive will be0.5) pyglet.clock.schedule_interval(self.update, 1.0/24.0) # 24 frames per seconddef on_draw(self):self.clear()self.gameOfLife.draw()def update(self, dt):self.gameOfLife.run_rules()if __name__ == '__main__':window = Window()pyglet.app.run()

Question 2

I don't quite understand your example, but I only need GPU computing. After a few days of pain, I may understand its usage, so I'll show it to you, hoping to help you. In addition, I need to point out that when using "...kernel(cuts, cuts", I will put two. Because the first one specifies the type when it is passed in, it will be used by the core as a traversal element and cannot be read by the index. So I use the second one to calculate free index data.

```
binsort_kernel = cp.ElementwiseKernel(
'int32 I,raw T cut,raw T ind,int32 row,int32 col,int32 q','raw T out,raw T bin,raw T num',    
'''
int i_x = i / col;                
int i_y = i % col;                
int b_f = i_x*col;                
int b_l = b_f+col;                
int n_x = i_x * q;                
int inx = i_x%row*col;            
////////////////////////////////////////////////////////////////////////////////////////
int r_x = 0; int adi = 0; int adb = 0;  
////////////////////////////////////////////////////////////////////////////////////////
if (i_y == 0)
{
for(size_t j=b_f; j<b_l; j++){if (cut[j]<q){                r_x = inx + j -b_f;       adb = n_x + cut[j];       adi = bin[adb] + num[adb];out[adi] = ind[r_x];      num[adb]+= 1;             }}
}
////////////////////////////////////////////////////////////////////////////////////////
''','binsort')binsort_kernel(cuts,cuts,ind,row,col,q,iout,bins,bnum)

Extended example to understand CUDA, Numba, Cupy, etc

Related Q&A

Python 2 newline tokens in tokenize module

Prevent encoding errors in Python

How do I get the operating system name in a friendly manner using Python 2.5?

Extend dataclass repr programmatically

find least common denominator for list of fractions in python

How to configure uwsgi to encode logging as json except app output

Testing aiohttp client with unittest.mock.patch

GridsearchCV: cant pickle function error when trying to pass lambda in parameter

How to insert a carriage return in a ReportLab paragraph?

How to get predictions and calculate accuracy for a given test set in fast ai?