I'm trying to perform many iterations of SciPy's `curve_fit` at once in order to avoid loops and thereby increase speed.

This is very similar to this problem, which was solved. However, the fact that the functions are piecewise (discontinuous) means that solution isn't applicable here.
Consider this example:
```python
import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit

rng.seed(0)
N = 20
X = np.logspace(-1, 1, N)
Y = np.zeros((4, N))
for i in range(0, 4):
    b = i + 1
    a = b
    print(a, b)
    Y[i] = (X/b)**(-a)  # + 0.01 * rng.randn(6)
    Y[i, X > b] = 1
```
This yields arrays which, as you can see, are discontinuous at `X == b`. I can retrieve the original values of `a` and `b` by using `curve_fit` iteratively:
```python
def plaw(r, a, b):
    """ Theoretical power law for the shape of the normalized conditional density """
    return np.piecewise(r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])

coeffs = []
for ix in range(Y.shape[0]):
    print(ix)
    c0, pcov = curve_fit(plaw, X, Y[ix])
    coeffs.append(c0)
```
But this process can be very slow depending on the size of `X`, `Y`, and the loop, so I'm trying to speed things up by getting `coeffs` without the need for a loop. So far I haven't had any luck.
Things that might be important:

- `X` and `Y` only contain positive values
- `a` and `b` are always positive
- Although the data to fit in this example is smooth (for the sake of simplicity), the real data has noise
EDIT
This is as far as I've gotten:
```python
y = np.ma.masked_where(Y < 1.01, Y)
lX = np.log(X)
lY = np.log(y)
A = np.vstack([lX, np.ones(len(lX))]).T
m, c = np.linalg.lstsq(A, lY.T)[0]
print('a =', -m)
print('b =', np.exp(-c/m))
```
But even without any noise the output is:

```
a= [0.18978965578339158 1.1353633705997466 2.220234483915197 3.3324502660995714]
b= [339.4090881838179 7.95073481873057 6.296592007396107 6.402567167503574]
```

which is way worse than I was hoping to get.
Here are three approaches to speeding this up. You gave no desired speed-up or accuracy targets, or even vector sizes, so buyer beware.
TL;DR

Timings:

```
len           1       2       3       4
1000      0.045   0.033   0.025   0.022
10000     0.290   0.097   0.029   0.023
100000    3.429   0.767   0.083   0.030
1000000                   0.546   0.046
```

1) Original Method
2) Pre-estimate with Subset
3) M Newville [linear log-log estimate](https://stackoverflow.com/a/44975066/7311767)
4) Subset Estimate (Use Less Data)
Pre-estimate with Subset (Method 2):

A decent speedup can be achieved by simply running `curve_fit` twice, where the first pass uses a short subset of the data to get a quick estimate. That estimate is then used to seed a `curve_fit` over the entire dataset.
```python
x, y = current_data
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]
return curve_fit(power_law, x, y, p0=c0)[0]
```
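For illustration, here is a self-contained sketch of the same two-pass idea on synthetic data (the `power_law` helper mirrors the one in the Test Code below; the parameter values `a = b = 3` are made up for this demo):

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(r, a, b):
    # Piecewise power law: (x/b)**-a below the break at b, constant 1 above it
    return np.piecewise(r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1.0])

# Synthetic noisy data with a = b = 3 (hypothetical values for this demo)
x = np.logspace(-1, 1, 10000)
y = np.clip((x / 3.0) ** -3.0 + 0.01 * np.random.randn(x.size), 0.001, None)
y[x > 3.0] = 1.0

stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[::stride], y[::stride])[0]  # quick seed fit on ~200 points
a_fit, b_fit = curve_fit(power_law, x, y, p0=c0)[0]     # full fit, seeded with c0
print(a_fit, b_fit)  # should land near (3, 3)
```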
M Newville [linear log-log estimate](https://stackoverflow.com/a/44975066/7311767) (Method 3):

Using the log estimate proposed by M Newville is also considerably faster. Since the OP was concerned about the initial estimate method proposed by Newville, this variant uses `curve_fit` on a subset to estimate the break point in the curve.
```python
x, y = current_data
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]

index_max = np.where(x > c0[1])[0][0]
log_x = np.log(x[:index_max])
log_y = np.log(y[:index_max])
result = linregress(log_x, log_y)
return -result[0], np.exp(-result[1] / result[0])
```
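For reference, the back-transform in the last line is just the log-log line undone: for `x < b` we have `y = (x/b)**-a`, so `log y = -a*log x + a*log b`. The fitted slope is therefore `m = -a` and the intercept is `c = a*log b`, giving `a = -m` and `b = exp(c/a) = exp(-c/m)`, which is exactly `np.exp(-result[1] / result[0])`.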
Use Less Data (Method 4):

Finally, the seed mechanism used for the previous two methods provides pretty good estimates on the sample data all by itself. Of course it is sample data, so your mileage may vary.
```python
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]
```
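Note that the stride is chosen so that roughly 200 points survive no matter how large the input is, which is why the Method 4 column in the timing table stays nearly flat as `len` grows.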
Test Code:
```python
import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
from scipy.stats import linregress

fit_data = {}
current_data = None

def data_for_fit(a, b, n):
    key = a, b, n
    if key not in fit_data:
        rng.seed(0)
        x = np.logspace(-1, 1, n)
        y = np.clip((x / b) ** (-a) + 0.01 * rng.randn(n), 0.001, None)
        y[x > b] = 1
        fit_data[key] = x, y
    return fit_data[key]

def power_law(r, a, b):
    """ Power law for the shape of the normalized conditional density """
    return np.piecewise(r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])

def method1():
    x, y = current_data
    return curve_fit(power_law, x, y)[0]

def method2():
    x, y = current_data
    return curve_fit(power_law, x, y, p0=method4()[0])

def method3():
    x, y = current_data
    c0, pcov = method4()
    index_max = np.where(x > c0[1])[0][0]
    log_x = np.log(x[:index_max])
    log_y = np.log(y[:index_max])
    result = linregress(log_x, log_y)
    m, c = -result[0], np.exp(-result[1] / result[0])
    return (m, c), result

def method4():
    x, y = current_data
    stride = int(max(1, len(x) / 200))
    return curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])

from timeit import timeit

def runit(stmt):
    print("%s: %.3f %s" % (
        stmt, timeit(stmt + '()', number=10,
                     setup='from __main__ import ' + stmt),
        eval(stmt + '()')[0]))

def runit_size(size):
    print('Length: %d' % size)
    if size <= 100000:
        runit('method1')
        runit('method2')
    runit('method3')
    runit('method4')

for i in (1000, 10000, 100000, 1000000):
    current_data = data_for_fit(3, 3, i)
    runit_size(i)
```