Slow loading SQL Server table into pandas DataFrame

2024/10/1 21:24:07

Pandas gets ridiculously slow when loading more than 10 million records from a SQL Server DB using pyodbc and mainly the function pandas.read_sql(query,pyodbc_conn). The following code takes up to 40-45 minutes to load 10-15 million records from SQL table: Table1

Is there a better and faster method to read SQL Table into pandas Dataframe?

import pyodbc
import pandasserver = <server_ip> 
database = <db_name> 
username = <db_user> 
password = <password> 
port='1443'
conn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';PORT='+port+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = conn.cursor()data = pandas.read_sql("select * from Table1", conn) #Takes about 40-45 minutes to complete
Answer

I had a same problem with even more number of rows, ~50 M Ended up writing a SQL query and stored them as .h5 files.

sql_reader = pd.read_sql("select * from table_a", con, chunksize=10**5)hdf_fn = '/path/to/result.h5'
hdf_key = 'my_huge_df'
store = pd.HDFStore(hdf_fn)
cols_to_index = [<LIST OF COLUMNS THAT WE WANT TO INDEX in HDF5 FILE>]for chunk in sql_reader:store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()

This way, we'll be able to read them faster than a Pandas.read_csv

https://en.xdnf.cn/q/70923.html

Related Q&A

compress a string in python 3?

I dont understand in 2.X it worked :import zlib zlib.compress(Hello, world)now i have a :zlib.compress("Hello world!") TypeError: must be bytes or buffer, not strHow can i compress my string …

How to set color of text using xlwt

I havent been able to find documentation on how to set the color of text. How would the following be done in xlwt?style = xlwt.XFStyle()# bold font = xlwt.Font() font.bold = True style.font = font# ba…

How to apply linregress in Pandas bygroup

I would like to apply a scipy.stats.linregress within Pandas ByGroup. I had looked through the documentation but all I could see was how to apply something to a single column like grouped.agg(np.sum)or…

Python Shared Memory Array, no attribute get_obj()

I am working on manipulating numpy arrays using the multiprocessing module and am running into an issue trying out some of the code I have run across here. Specifically, I am creating a ctypes array f…

What is a qualified/unqualified name in Python?

In Python: what is a "qualified name" or "unqualified name"?Ive seen it mentioned a couple of times, but no explanation as to what it is.

Python code explanation for stationary distribution of a Markov chain

I have got this code: import numpy as np from scipy.linalg import eig transition_mat = np.matrix([[.95, .05, 0., 0.],\[0., 0.9, 0.09, 0.01],\[0., 0.05, 0.9, 0.05],\[0.8, 0., 0.05, 0.15]])S, U = eig(tr…

Detecting insertion/removal of USB input devices on Windows 10

I already have some working Python code to detect the insertion of some USB device types (from here).import wmiraw_wql = "SELECT * FROM __InstanceCreationEvent WITHIN 2 WHERE TargetInstance ISA \W…

FastAPI as a Windows service

I am trying to run FastAPI as a windows service.Couldnt find any documentation or any article to run Uvicorn as a Windows service. I tried using NSSM as well but my windows service stops.

Why not use runserver for production at Django?

Everywhere i see that uWSGI and Gunicorn are recommended for production mode from everyone. However, there is a lot more suffering to operate with it, the python manage.py runserver is more faster, sim…

How to create decorator for lazy initialization of a property

I want to create a decorator that works like a property, only it calls the decorated function only once, and on subsequent calls always return the result of the first call. An example:def SomeClass(obj…