Doing many iterations of curve_fit in one go for piecewise function


I'm trying to perform many iterations of Scipy's curve_fit at once in order to avoid loops and thereby increase speed.

This is very similar to this problem, which was solved. However, the functions here are piecewise (discontinuous), so that solution isn't applicable.

Consider this example:

import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
rng.seed(0)
N=20
X=np.logspace(-1,1,N)
Y = np.zeros((4, N))
for i in range(0, 4):
    b = i + 1
    a = b
    print(a, b)
    Y[i] = (X / b) ** (-a)  # + 0.01 * rng.randn(N) to add noise
    Y[i, X > b] = 1

This yields these arrays:

[Figure: the four rows of Y plotted against X; each follows a power law for X < b and is constant at 1 for X >= b.]

As you can see, these are discontinuous at X == b. I can retrieve the original values of a and b by using curve_fit iteratively:

def plaw(r, a, b):
    """ Theoretical power law for the shape of the normalized conditional density """
    return np.piecewise(r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])

coeffs = []
for ix in range(Y.shape[0]):
    print(ix)
    c0, pcov = curve_fit(plaw, X, Y[ix])
    coeffs.append(c0)

But this process can be very slow, depending on the size of X and Y and the number of iterations, so I'm trying to obtain coeffs without a loop. So far I haven't had any luck.

Things that might be important:

  • X and Y only contain positive values
  • a and b are always positive
  • Although the data to fit in this example is smooth (for the sake of simplicity), the real data has noise

EDIT

This is as far as I've gotten:

y = np.ma.masked_where(Y < 1.01, Y)  # mask the flat (Y == 1) region
lX = np.log(X)
lY = np.log(y)
A = np.vstack([lX, np.ones(len(lX))]).T
m, c = np.linalg.lstsq(A, lY.T)[0]
print('a=', -m)
print('b=', np.exp(-c/m))

But even without any noise the output is:

a= [0.18978965578339158 1.1353633705997466 2.220234483915197 3.3324502660995714]
b= [339.4090881838179 7.95073481873057 6.296592007396107 6.402567167503574]

Which is way worse than I was hoping to get.
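
A likely culprit (my assumption; nothing in this post verifies it) is that np.linalg.lstsq converts its inputs with np.asarray, which silently drops the mask of a masked array, so the flat Y == 1 samples still leak into every row's fit. A minimal sketch, continuing from the data-generation snippet above, that applies the mask explicitly per row and recovers a and b essentially exactly on this noiseless data:

# Assumption: lstsq ignores masked-array masks, so select the
# power-law segment of each row explicitly instead.
for row in Y:
    keep = row > 1.01                          # power-law segment only
    lX, lYr = np.log(X[keep]), np.log(row[keep])
    A = np.vstack([lX, np.ones(len(lX))]).T
    m, c = np.linalg.lstsq(A, lYr, rcond=None)[0]
    print('a =', -m, ' b =', np.exp(-c / m))

This still loops over the rows (each row keeps a different number of points, so the systems cannot share one matrix), but it suggests the mask handling, not the log-log model, is what goes wrong.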

Answer

Here are three approaches to speeding this up. You gave no target speed-up or accuracy, or even vector sizes, so buyer beware.

TL;DR

Timings:

len       1      2      3      4
1000    0.045  0.033  0.025  0.022
10000   0.290  0.097  0.029  0.023
100000  3.429  0.767  0.083  0.030
1000000               0.546  0.046

1) Original Method
2) Pre-estimate with Subset
3) M Newville [linear log-log estimate](https://stackoverflow.com/a/44975066/7311767)
4) Subset Estimate (Use Less Data)

Pre-estimate with Subset (Method 2):

A decent speedup can be achieved by simply running curve_fit twice: the first run uses a short subset of the data to get a quick estimate, which is then used to seed a curve_fit over the entire dataset.

x, y = current_data
stride = int(max(1, len(x) / 200))           # subsample to at most ~200 points
c0 = curve_fit(power_law, x[::stride], y[::stride])[0]  # quick seed estimate
return curve_fit(power_law, x, y, p0=c0)[0]             # full fit, seeded with c0
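
Applied to the multi-row setup from the question, the two-stage fit might be wrapped like this (a sketch: fit_row and n_seed are my own names, while plaw, X and Y are the question's objects):

def fit_row(x, y, n_seed=200):
    """Two-stage fit: seed a full curve_fit with a coarse subset estimate."""
    stride = max(1, len(x) // n_seed)
    c0 = curve_fit(plaw, x[::stride], y[::stride])[0]
    return curve_fit(plaw, x, y, p0=c0)[0]

coeffs = np.array([fit_row(X, Y[ix]) for ix in range(Y.shape[0])])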

M Newville linear log-log estimate (Method 3):

Using the linear log-log estimate proposed by M Newville is also considerably faster. Since the OP was concerned about Newville's initial-estimate step, this method uses curve_fit on a subset to provide the estimate of the break point in the curve.

x, y = current_data
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[::stride], y[::stride])[0]  # subset fit to find the break point
index_max = np.where(x > c0[1])[0][0]                   # first index past the break point b
log_x = np.log(x[:index_max])
log_y = np.log(y[:index_max])
result = linregress(log_x, log_y)
return -result[0], np.exp(-result[1] / result[0])
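
For reference, on the power-law segment log y = -a*log(x) + a*log(b), so the linregress slope and intercept give a = -slope and b = exp(-intercept / slope), which is exactly what the last line returns.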

Use Less Data (Method 4):

Finally, the seed mechanism used for the previous two methods already provides pretty good estimates on the sample data by itself. Of course it is sample data, so your mileage may vary.

stride = int(max(1, len(x) / 200))  # at most ~200 points regardless of input size
c0 = curve_fit(power_law, x[::stride], y[::stride])[0]
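
Note that the stride caps the subset at roughly 200 points regardless of the input length, which is presumably why method 4's cost in the timing table stays nearly flat as the vectors grow.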

Test Code:

import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
from scipy.stats import linregress

fit_data = {}
current_data = None

def data_for_fit(a, b, n):
    key = a, b, n
    if key not in fit_data:
        rng.seed(0)
        x = np.logspace(-1, 1, n)
        y = np.clip((x / b) ** (-a) + 0.01 * rng.randn(n), 0.001, None)
        y[x > b] = 1
        fit_data[key] = x, y
    return fit_data[key]

def power_law(r, a, b):
    """ Power law for the shape of the normalized conditional density """
    return np.piecewise(r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])

def method1():
    x, y = current_data
    return curve_fit(power_law, x, y)[0]

def method2():
    x, y = current_data
    return curve_fit(power_law, x, y, p0=method4()[0])

def method3():
    x, y = current_data
    c0, pcov = method4()
    index_max = np.where(x > c0[1])[0][0]
    log_x = np.log(x[:index_max])
    log_y = np.log(y[:index_max])
    result = linregress(log_x, log_y)
    m, c = -result[0], np.exp(-result[1] / result[0])
    return (m, c), result

def method4():
    x, y = current_data
    stride = int(max(1, len(x) / 200))
    return curve_fit(power_law, x[::stride], y[::stride])

from timeit import timeit

def runit(stmt):
    print("%s: %.3f  %s" % (stmt, timeit(stmt + '()', number=10,
                                         setup='from __main__ import ' + stmt),
                            eval(stmt + '()')[0]))

def runit_size(size):
    print('Length: %d' % size)
    if size <= 100000:
        runit('method1')
        runit('method2')
    runit('method3')
    runit('method4')

for i in (1000, 10000, 100000, 1000000):
    current_data = data_for_fit(3, 3, i)
    runit_size(i)