Adding an extra column to (big) SQLite database from Pandas dataframe

2024/10/4 5:34:57

I feel like I'm overlooking something really simple, but I can't make it work. I'm using SQLite now, but a solution in SQLAlchemy would also be very helpful.

Let's create our original dataset:

### This is just the setup part
import pandas as pd
import sqlite3
conn = sqlite3.connect('test.sqlite')orig = pd.DataFrame({'COLUPC': [100001, 100002, 100003, 100004],
'L5': ['ABC ALE', 'ABC MALT LIQUOR', 'ABITA AMBER', 'ABITA AMBER'],
'attr1': [0.25, 0.25, 0.041, 0.041]})orig.to_sql("UPCs", conn, if_exists='replace', index=False)#Create an index just in case it's needed
conn.execute("""CREATE INDEX upc_index
ON UPCs (COLUPC);""")

Now suppose I take that orig dataframe and add a column called 'L5_lower'. Then I create the column in the SQLite database:

# Create new variable
orig['L5_lower'] = orig.L5.str.lower()
conn.execute("alter table UPCs add column L5_lower TEXT;")

Now suppose I want to fill in this single column L5_lower to the SQLite table, without having to pass other columns (below I explain why I need this)

I tried passing the index and the new column as tuples:

query='''insert or replace into UPCs (COLUPC, L5_lower) values (?,?) '''
conn.executemany(query, orig[['COLUPC', 'L5_lower']].to_records(index=False))
conn.commit() # But then:
df = pd.read_sql("SELECT * FROM UPCs;", conn)
conn.close()

gives this messed up result.

    COLUPC                               L5                 attr1   L5_lower
0   100001                               ABC ALE            0.250   None
1   100002                               ABC MALT LIQUOR    0.250   None
2   100003                               ABITA AMBER        0.041   None
3   100004                               ABITA AMBER        0.041   None
4   b'\xa1\x86\x01\x00\x00\x00\x00\x00'     None            NaN     abc ale
5   b'\xa2\x86\x01\x00\x00\x00\x00\x00'     None            NaN     abc malt liquor
6   b'\xa3\x86\x01\x00\x00\x00\x00\x00'     None            NaN     abita amber
7   b'\xa4\x86\x01\x00\x00\x00\x00\x00'     None            NaN     abita amber

Instead, the expected output is:

    COLUPC                               L5                 attr1   L5_lower
0   100001                               ABC ALE            0.250   abc ale
1   100002                               ABC MALT LIQUOR    0.250   abc malt liquor
2   100003                               ABITA AMBER        0.041   abita amber
3   100004                               ABITA AMBER        0.041   abita amber

So, why am I trying to pass a single column? I have a very big dataset and I won't be able to have the whole dataframe in memory. My intended workflow is to construct one column at a time and then update or insert into the SQLite database.

Answer

AFAIK you can't add COLUMNS using Pandas to_sql - you can add ROWS. One solution would be to insert a new column into a temporary table (with the same index as the original table has) and then update the source table on the SQLite side.

Here is a working example:

SETUP:

assuming we have the following original DF:

In [79]: orig
Out[79]:COLUPC               L5  attr1
0  100001          ABC ALE  0.250
1  100002  ABC MALT LIQUOR  0.250
2  100003      ABITA AMBER  0.041
3  100004      ABITA AMBER  0.041In [80]: orig.set_index('COLUPC', inplace=True)In [81]: conn = sqlite3.connect('d:/temp/test.sqlite')In [82]: orig.to_sql('upcs', conn, if_exists='replace', index=True)In [83]: conn.close()

SOLUTION:

In [84]: conn = sqlite3.connect('d:/temp/test.sqlite')In [85]: df = pd.read_sql('select * from upcs', conn, index_col='COLUPC')In [86]: df
Out[86]:L5  attr1
COLUPC
100001          ABC ALE  0.250
100002  ABC MALT LIQUOR  0.250
100003      ABITA AMBER  0.041
100004      ABITA AMBER  0.041

create temporary table:

In [87]: tmp = orig.L5.str.lower().to_frame('L5_lower')In [88]: tmp
Out[88]:L5_lower
COLUPC
100001          abc ale
100002  abc malt liquor
100003      abita amber
100004      abita amberIn [89]: tmp.to_sql('tmp', conn, if_exists='replace', index=True)

add new column to SQLite table:

In [90]: conn.execute('alter table UPCs add column L5_lower varchar(50)')
Out[90]: <sqlite3.Cursor at 0xa558c00>In [91]: qry = 'update upcs set L5_lower = (select L5_lower from tmp where tmp.COLUPC = upcs.COLUPC) where L5_lower is NULL'In [92]: conn.execute(qry)
Out[92]: <sqlite3.Cursor at 0xa593570>In [93]: conn.commit()In [94]: conn.execute('drop table tmp')
Out[94]: <sqlite3.Cursor at 0xa5930a0>

Check:

In [95]: pd.read_sql('select * from upcs', conn, index_col='COLUPC')
Out[95]:L5  attr1         L5_lower
COLUPC
100001          ABC ALE  0.250          abc ale
100002  ABC MALT LIQUOR  0.250  abc malt liquor
100003      ABITA AMBER  0.041      abita amber
100004      ABITA AMBER  0.041      abita amberIn [96]: conn.close()
https://en.xdnf.cn/q/70642.html

Related Q&A

error inserting values to db with psycopg2 module [duplicate]

This question already has answers here:psycopg2: cant adapt type numpy.int64(4 answers)Inserting records into postgreSQL database in Python(3 answers)Closed 3 months ago.I am attempting to insert a dat…

NaN values in pivot_table index causes loss of data

Here is a simple DataFrame:> df = pd.DataFrame({a: [a1, a2, a3],b: [optional1, None, optional3],c: [c1, c2, c3],d: [1, 2, 3]}) > dfa b c d 0 a1 optional1 c1 1 1 a2 None c2…

ModuleNotFoundError in Docker

I have imported my entire project into docker, and I am getting a ModuleNotFoundErrorfrom one of the modules I have created.FROM python:3.8 WORKDIR /workspace/ COPY . /workspace/ RUN pip install pipenv…

Can I use md5 authentication with psycopg2?

After two hours of reading documentation, source code and help-threads, Im giving up. I cant get psycopg2 to authenticate with a md5-string. According to this thread I dont have to anything besides ena…

Python checking __init__ parameter

Ive been trying to figuring this out for the last few hours, and Im about to give up.How do you make sure that in python only a matching specific criteria will create the object?For example, lets say …

Minidom getElementById not working

Minidoms getElementById function is returning None for any entry I pass to it.For example, this code:l = minidom.parseString(<node id="node">Node</node>) print(l.getElementById(&q…

Optimization on piecewise linear regression

I am trying to create a piecewise linear regression to minimize the MSE(minimum square errors) then using linear regression directly. The method should be using dynamic programming to calculate the dif…

Python: Check if list of named tuples contains particular attribute value

I have a list of named tuples:from collections import namedtupleT = namedtuple(T, [attr1, attr2, attr3, attr4]) t1 = T(T1, 1, 1234, XYZ) t2 = T(T2, 2, 1254, ABC) t3 = T(T2, 2, 1264, DEF) l = [t1, t2, t…

javascript error: arguments[0].scrollIntoView is not a function using selenium on python

Im using Selenium on python and I would like to scroll to an element to click on it. Everywhere I see that the rigth things to do to go directly to the element is to use :driver = webdriver.Chrome() dr…

Uploading a static project to google app engines

Disclaimer: I already asked here, but apparently off-topic. I want to set up a page using this bootstrap template and host it as a static website using the google appengine service. Inside the google_a…