Generate larger synthetic dataset based on a smaller dataset in Python

2024/10/7 22:22:59

I have a dataset with 21000 rows (data samples) and 102 columns (features). I would like to generate a larger synthetic dataset from it, say with 100000 rows, to use for machine learning.

I've been referring to the answer by @Prashant on this post https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data, but I can't get it to generate a larger synthetic dataset from my data.

import numpy as np
from random import randrange, choice
from sklearn.neighbors import NearestNeighbors
import pandas as pd

# referring to https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data
df = pd.read_pickle('df_saved.pkl')
# this gives me df, the final Dataframe which I would like to generate a larger dataset based on.
# This is the smaller Dataframe with 21000x102 dimensions.
df = df.iloc[:,:-1]

def SMOTE(T, N, k):
    # """
    # Returns (N/100) * n_minority_samples synthetic minority samples.
    #
    # Parameters
    # ----------
    # T : array-like, shape = [n_minority_samples, n_features]
    #     Holds the minority samples
    # N : percentage of new synthetic samples:
    #     n_synthetic_samples = N/100 * n_minority_samples. Can be < 100.
    # k : int. Number of nearest neighbours.
    #
    # Returns
    # -------
    # S : array, shape = [(N/100) * n_minority_samples, n_features]
    # """
    n_minority_samples, n_features = T.shape

    if N < 100:
        #create synthetic samples only for a subset of T.
        #TODO: select random minority samples
        N = 100
        pass

    if (N % 100) != 0:
        raise ValueError("N must be < 100 or multiple of 100")

    N = N/100
    n_synthetic_samples = N * n_minority_samples
    n_synthetic_samples = int(n_synthetic_samples)
    n_features = int(n_features)
    S = np.zeros(shape=(n_synthetic_samples, n_features))

    #Learn nearest neighbours
    neigh = NearestNeighbors(n_neighbors = k)
    neigh.fit(T)

    #Calculate synthetic samples
    for i in range(n_minority_samples):
        nn = neigh.kneighbors(T[i], return_distance=False)
        for n in range(N):
            nn_index = choice(nn[0])
            #NOTE: nn includes T[i], we don't want to select it
            while nn_index == i:
                nn_index = choice(nn[0])

            dif = T[nn_index] - T[i]
            gap = np.random.random()
            S[n + i * N, :] = T[i,:] + gap * dif[:]

    return S

df = df.to_numpy()
# this is where I call the function and expect new_data to be generated
# with larger number of samples than original df.
new_data = SMOTE(df,50,10)

The traceback of the error I get is below:

Traceback (most recent call last):
  File "MyScript.py", line 66, in <module>
    new_data = SMOTE(df,50,10)
  File "MyScript.py", line 52, in SMOTE
    nn = neigh.kneighbors(T[i], return_distance=False)
  File "/trinity/clustervision/CentOS/7/apps/anaconda/4.3.31/3.6-VE/lib/python3.5/site-packages/sklearn/neighbors/base.py", line 393, in kneighbors
    X = check_array(X, accept_sparse='csr')
  File "/trinity/clustervision/CentOS/7/apps/anaconda/4.3.31/3.6-VE/lib/python3.5/site-packages/sklearn/utils/validation.py", line 547, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:

I know that this error (Expected 2D array, got 1D array) occurs on the line nn = neigh.kneighbors(T[i], return_distance=False). When I call the function, T is a numpy array of shape (21000, 102): my data, converted from a Pandas DataFrame. I know this question may have similar duplicates, but none of them answer my question. Any help would be highly appreciated.

Answer

T[i] gives the function a 1D array with shape (102,).

What the function expects is an array with shape (1, 102).
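A minimal sketch of the mismatch, in case it helps to see it in isolation. The random array T below is only a stand-in for the real data (the 200-row size and the seed are assumptions for illustration); the point is that indexing a single row drops a dimension, and kneighbors rejects the 1D result.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-in data: 200 random rows with 102 features (assumed, not the real dataset).
rng = np.random.default_rng(0)
T = rng.random((200, 102))

neigh = NearestNeighbors(n_neighbors=10).fit(T)

row = T[0]
print(row.shape)  # a single row is 1D: (102,)

# Passing the 1D row reproduces the error from the question.
try:
    neigh.kneighbors(row, return_distance=False)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```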

You can get this by calling reshape on it:

nn = neigh.kneighbors(T[i].reshape(1, -1), return_distance=False)

In case you're not familiar with np.reshape: the 1 says that the first dimension should have size 1, and the -1 tells numpy to infer the size of the second dimension from the number of elements in the array; in this case the original 102.
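Putting the fix into the loop, here is a hedged sketch rather than a drop-in replacement. The name smote_sketch and the random stand-in data are assumptions for illustration. It also makes two changes the answer above does not mention: N // 100 instead of N / 100, because range() needs an integer on Python 3, and filtering the point itself out of the neighbour list instead of re-drawing in a while loop (this needs k >= 2).

```python
import numpy as np
from random import choice
from sklearn.neighbors import NearestNeighbors


def smote_sketch(T, N, k):
    """Return (N // 100) * len(T) synthetic rows interpolated between neighbours."""
    n_samples, n_features = T.shape
    if N < 100:
        N = 100  # as in the question: sampling a subset is not implemented
    if N % 100 != 0:
        raise ValueError("N must be < 100 or a multiple of 100")
    N = N // 100  # integer division so range(N) works on Python 3

    S = np.zeros((N * n_samples, n_features))
    neigh = NearestNeighbors(n_neighbors=k).fit(T)

    for i in range(n_samples):
        # reshape(1, -1): one row, column count inferred from the data
        nn = neigh.kneighbors(T[i].reshape(1, -1), return_distance=False)
        candidates = [j for j in nn[0] if j != i]  # drop the point itself
        for n in range(N):
            nn_index = choice(candidates)
            gap = np.random.random()
            # New point lies on the segment between T[i] and its neighbour.
            S[n + i * N, :] = T[i] + gap * (T[nn_index] - T[i])
    return S


# Stand-in data (assumed): 200 random rows with 102 features.
rng = np.random.default_rng(0)
T = rng.random((200, 102))
new_data = smote_sketch(T, 500, 10)

print(T[0].reshape(1, -1).shape)  # (1, 102)
print(new_data.shape)             # (1000, 102)
```

With this sketch, N=500 on the 21000-row array would give 5 * 21000 = 105000 synthetic rows, which is close to the 100000 asked for. The returned array holds only the synthetic rows, not the originals.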

