Optimization on piecewise linear regression

2024/10/4 7:28:29

I am trying to create a piecewise linear regression to minimize the MSE(minimum square errors) then using linear regression directly. The method should be using dynamic programming to calculate the different piecewise sizes and combinations of groups to achieve the overall MSE. I think the algorithm runtime is O(n²) and I wonder if there are ways to optimize it to O(nLogN)?

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
import pandas as pd
import matplotlib.pyplot as pltx = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8,3.8]
y = [26.2, 17.8, 31.3, 23.1, 27.5, 36.5, 14.1, 22.3, 19.6, 31.3, 24.0, 17.3, 43.2, 36.4, 26.1]
dataset = np.dstack((x,y))
dataset = dataset[0]
d_arg = np.argsort(dataset[:,0])
dataset = dataset[d_arg]def calc_error(dataset):lr_model = linear_model.LinearRegression()x = pd.DataFrame(dataset[:,0])y = pd.DataFrame(dataset[:,1])lr_model.fit(x,y)predictions = lr_model.predict(x)mse = mean_squared_error(y, predictions)return mse#n is the number of points , m is the number of groups, k is the minimum number of points in a group
#(15,5,3)returns 【【3,3,3,3,3】】
#(15,5,2) returns [[2,4,3,3,3],[3,2,4,2,4],[4,2,3,3,3]....]
def all_combination(n,m,k):result = []if n < k*m:print('There are not enough elements to split.')returncombination_bottom = [k for q in range(m)] #add greedy algorithm here?if n == k*m:result.append(combination_bottom.copy())else:combination_now = [combination_bottom.copy()]j = k*m+1while j < n+1:combination_last = combination_now.copy()combination_now = []for x in combination_last:for i in range (0, m):combination_new = x.copy()combination_new[i] = combination_new[i]+1combination_now.append(combination_new.copy())j += 1else:for x in combination_last:for i in range (0, m):combination_new = x.copy()combination_new[i] = combination_new[i]+1if combination_new not in result:result.append(combination_new.copy())return result #2-d listdef calc_sum_error(dataset,cb):#cb = combinationmse_sum = 0    for n in range(0,len(cb)):if n == 0:low = 0high = cb[0]else:low = 0for i in range(0,n):low += cb[i]high = low + cb[n]mse_sum += calc_error(dataset[low:high])return mse_sum#k is the number of points as a group
def best_piecewise(dataset,k):lenth = len(dataset)max_split = lenth // kmin_mse = calc_error(dataset)split_cb = []all_cb = []for i in range(2, max_split+1):split_result = all_combination(lenth, i, k)all_cb += split_resultfor cb in split_result:tmp_mse = calc_sum_error(dataset,cb)if tmp_mse < min_mse:min_mse = tmp_msesplit_cb = cbreturn min_mse, split_cb, all_cbmin_mse, split_cb, all_cb = best_piecewise(dataset, 2)print('The best split of the data is '+str(split_cb))
print('The minimum MSE value is '+str(min_mse))x = np.array(dataset[:,0])
y = np.array(dataset[:,1])plt.plot(x,y,"o")
for n in range(0,len(split_cb)):if n == 0:low = 0 high = split_cb[n]else:low = 0for i in range(0,n):low += split_cb[i]high = low + split_cb[n]x_tmp = pd.DataFrame(dataset[low:high,0])y_tmp = pd.DataFrame(dataset[low:high,1])lr_model = linear_model.LinearRegression()lr_model.fit(x_tmp,y_tmp)y_predict = lr_model.predict(x_tmp)plt.plot(x_tmp, y_predict, 'g-')plt.show()

Please let me know if I didn't make it clear in any part.

Answer

It took me some time to realize, that the problem you're describing is exactly what a decision tree regressor tries to solve.

Unfortunately, construction of an optimal decision tree is NP-hard, meaning that even with dynamic programming you can't bring the runtime down to anything like O(NlogN).

Good news is that you can directly use any well maintained decision tree implementation, DecisionTreeRegressor of sklearn.tree module for example, and can be certain about obtaining best possible performance in O(NlogN) time complexity. To enforce a minimum number of points per group, use min_samples_leaf parameter. You can also control several other properties like maximun of no. groups with max_leaf_nodes, optimization w.r.t different loss functions using criterion etc.

If you're curious how Scikit-learn's decision tree compare with the one learnt by your algorithm (i.e. split_cb in your code):

X = np.array(x).reshape(-1,1)
dt = DecisionTreeRegressor(min_samples_leaf=MIN_SIZE).fit(X,y)
split_cb = np.unique(dt.apply(X),return_counts=True)[1]

And then use the same plotting code you use. Do note that since your time complexity is considerably higher than O(NlogN)*, your implementation will often find better splits than the scikit-learn's greedy algorithm.


[1] Hyafil, L., & Rivest, R. L. (1976). Constructing optimal binary decision trees is np-complete. Information Processing Letters, 5(1), 15–17

*Although I'm not sure about the exact time complexity of your implementation, it's quite certainly worse than O(N^2), all_combination(21,4,2) took more than 5 mins.

https://en.xdnf.cn/q/70635.html

Related Q&A

Python: Check if list of named tuples contains particular attribute value

I have a list of named tuples:from collections import namedtupleT = namedtuple(T, [attr1, attr2, attr3, attr4]) t1 = T(T1, 1, 1234, XYZ) t2 = T(T2, 2, 1254, ABC) t3 = T(T2, 2, 1264, DEF) l = [t1, t2, t…

javascript error: arguments[0].scrollIntoView is not a function using selenium on python

Im using Selenium on python and I would like to scroll to an element to click on it. Everywhere I see that the rigth things to do to go directly to the element is to use :driver = webdriver.Chrome() dr…

Uploading a static project to google app engines

Disclaimer: I already asked here, but apparently off-topic. I want to set up a page using this bootstrap template and host it as a static website using the google appengine service. Inside the google_a…

Python cannot import DataFrame

I am trying to use Pandas in Python to import and manipulate some csv file.my code is like:import pandas as pd from pandas import dataframe data_df = pd.read_csv(highfrequency2.csv) print(data_df.col…

Sum of product of combinations in a list

What is the Pythonic way of summing the product of all combinations in a given list, such as:[1, 2, 3, 4] --> (1 * 2) + (1 * 3) + (1 * 4) + (2 * 3) + (2 * 4) + (3 * 4) = 35(For this example I have t…

discord.py: How to get the user who invited/added the bot to his server? [solution]

I want to send a DM to the user, who invited/added the bot to his server. I noticed that its displayed in the audit log. Can I fetch that and get the user or is there a easier way to achieve that? Ex…

How to reorder the keys of a dictionary?

I have multiple dictionaries inside the list. I want to sort the dictionary with the custom key. In my case, I want to sort it using Date key. By that, I mean to move the Date key to the first position…

How do bitwise operations work in Python?

I have been learning about Bitwise operations today and I learned that Not (~) inverses all bits, e.g.:01010 to 10101which means ~10 should be -5 but instead I have seen that it is -11 (per the python …

How to split large wikipedia dump .xml.bz2 files in Python?

I am trying to build a offline wiktionary using the wikimedia dump files (.xml.bz2) using Python. I started with this article as the guide. It involves a number of languages, I wanted to combine all th…

CherryPy interferes with Twisted shutting down on Windows

Ive got an application that runs Twisted by starting the reactor with reactor.run() in my main thread after starting some other threads, including the CherryPy web server. Heres a program that shuts d…