Question 1

I have a dataset with 21000 rows (data samples) and 102 columns (features). I would like to generate a larger synthetic dataset from it, say with 100000 rows, so that I can use it for machine learning.

I've been following the answer by @Prashant on this post https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data, but I am unable to get it to generate a larger synthetic dataset from my data.

import numpy as np
from random import randrange, choice
from sklearn.neighbors import NearestNeighbors
import pandas as pd

# referring to https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data
df = pd.read_pickle('df_saved.pkl')
df = df.iloc[:,:-1] # this gives me df, the final DataFrame from which I would like to generate a larger dataset. This is the smaller DataFrame with 21000x102 dimensions.

def SMOTE(T, N, k):
    # """
    # Returns (N/100) * n_minority_samples synthetic minority samples.
    #
    # Parameters
    # ----------
    # T : array-like, shape = [n_minority_samples, n_features]
    #     Holds the minority samples
    # N : percentage of new synthetic samples:
    #     n_synthetic_samples = N/100 * n_minority_samples. Can be < 100.
    # k : int. Number of nearest neighbours.
    #
    # Returns
    # -------
    # S : array, shape = [(N/100) * n_minority_samples, n_features]
    # """
    n_minority_samples, n_features = T.shape

    if N < 100:
        # create synthetic samples only for a subset of T.
        # TODO: select random minority samples
        N = 100
        pass

    if (N % 100) != 0:
        raise ValueError("N must be < 100 or multiple of 100")

    N = N/100
    n_synthetic_samples = N * n_minority_samples
    n_synthetic_samples = int(n_synthetic_samples)
    n_features = int(n_features)
    S = np.zeros(shape=(n_synthetic_samples, n_features))

    # Learn nearest neighbours
    neigh = NearestNeighbors(n_neighbors = k)
    neigh.fit(T)

    # Calculate synthetic samples
    for i in range(n_minority_samples):
        nn = neigh.kneighbors(T[i], return_distance=False)
        for n in range(N):
            nn_index = choice(nn[0])
            # NOTE: nn includes T[i], we don't want to select it
            while nn_index == i:
                nn_index = choice(nn[0])
            dif = T[nn_index] - T[i]
            gap = np.random.random()
            S[n + i * N, :] = T[i,:] + gap * dif[:]

    return S

df = df.to_numpy()
new_data = SMOTE(df,50,10) # this is where I call the function and expect new_data to be generated with a larger number of samples than the original df.

The traceback of the error I get is shown below:

Traceback (most recent call last):
  File "MyScript.py", line 66, in <module>
    new_data = SMOTE(df,50,10)
  File "MyScript.py", line 52, in SMOTE
    nn = neigh.kneighbors(T[i], return_distance=False)
  File "/trinity/clustervision/CentOS/7/apps/anaconda/4.3.31/3.6-VE/lib/python3.5/site-packages/sklearn/neighbors/base.py", line 393, in kneighbors
    X = check_array(X, accept_sparse='csr')
  File "/trinity/clustervision/CentOS/7/apps/anaconda/4.3.31/3.6-VE/lib/python3.5/site-packages/sklearn/utils/validation.py", line 547, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:

I know that this error (Expected 2D array, got 1D array) occurs on the line nn = neigh.kneighbors(T[i], return_distance=False). When I call the function, T is a NumPy array of shape (21000, 102): my data, converted from a pandas DataFrame. I know this question may look like a duplicate of similar ones, but none of them answers my question. Any help would be highly appreciated.

Answer

T[i] is a 1D array with shape (102,).

kneighbors expects a 2D array, one row per query sample, so here it needs shape (1, 102).

You can get this by calling reshape on it:

nn = neigh.kneighbors(T[i].reshape(1, -1), return_distance=False)

In case you're not familiar with np.reshape: the 1 says that the first dimension should have size 1, and the -1 tells NumPy to infer the size of the second dimension from the total number of elements; in this case the original 102.
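Once the reshape is in place, the function in the question is likely to fail next on range(N), because N = N/100 produces a float in Python 3. Below is a minimal sketch of the same interpolation idea with both points addressed. It is not the original author's code: the name smote_sketch, the seed parameter, the use of k + 1 neighbours (so that k remain after dropping the point itself) and the small random array standing in for the 21000x102 data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(T, N, k, seed=0):
    """Return (N // 100) * len(T) rows interpolated between nearest neighbours."""
    if N < 100 or N % 100 != 0:
        raise ValueError("this sketch only handles N as a multiple of 100")
    per_sample = N // 100  # integer division, so it can be passed to range()
    n_samples, n_features = T.shape
    rng = np.random.default_rng(seed)
    S = np.zeros((per_sample * n_samples, n_features))

    # Assumption: ask for k + 1 neighbours, since each point is normally
    # returned as its own nearest neighbour and is dropped below.
    neigh = NearestNeighbors(n_neighbors=k + 1).fit(T)

    for i in range(n_samples):
        # reshape(1, -1) turns the 1D row (n_features,) into 2D (1, n_features)
        nn = neigh.kneighbors(T[i].reshape(1, -1), return_distance=False)[0]
        candidates = nn[nn != i]  # exclude the point itself
        for n in range(per_sample):
            j = rng.choice(candidates)  # pick one neighbour at random
            gap = rng.random()          # interpolation factor in [0, 1)
            S[n + i * per_sample] = T[i] + gap * (T[j] - T[i])
    return S


rng = np.random.default_rng(42)
T = rng.normal(size=(30, 4))  # small stand-in for the real data
new_data = smote_sketch(T, N=200, k=5)
print(T[0].shape, T[0].reshape(1, -1).shape)  # (4,) (1, 4)
print(new_data.shape)                         # (60, 4)
```

With this scheme the output has (N // 100) times as many rows as the input, so under these assumptions N=500 on 21000 rows would give 105000 synthetic rows, which could then be trimmed to 100000.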

Generate larger synthetic dataset based on a smaller dataset in Python
