openpyxl please do not assume text as a number when importing

2024/9/8 11:13:19

There are numerous questions about how to stop Excel from interpreting text as a number, or how to output number formats with openpyxl, but I haven't seen any solutions to this problem:

I have an Excel spreadsheet given to me by someone else, so I did not create it. When I open the file with Excel, I have certain values like "5E12" (clone numbers, if anyone cares) that appear to display correctly, but there's a little green arrow next to each one warning me that "This appears to be a number stored as text". Excel then asks me if I would like to convert it to a number, and if I saw yes, I get 5000000000000, which then converts automatically to scientific notation and displays 5E12 again, only this time a text output would show the full number with zeroes. Note that before the conversion, this really is text, even to Excel, and I'm only being warned/offered to convert it.

So, when reading this file in with openpyxl (from openpyxl.reader.excel import load_workbook), the 5E12 is getting converted automatically to 5000000000000. I assume that openpyxl is making the same assumption that Excel made, only the conversion happens without a prompt or input on my part.

How can I prevent this from happening? I do not want text that look like "numbers stored as text" to convert to numbers. They are text unless I say so.

So far, the only solution I have found is to add single quotes to the front of each cell, but this is not an ideal solution, as it's manual labor rather than a programmatic solution. Also, the solution needs to be general, since I don't always know where this problem might occur (I'm reading millions of lines per day, so I don't want to have to do anything by hand).

I think this is a problem with openpyxl. There is a google group discussion from the beginning of 2011 that mentions this problem, but assumes it's too rare to matter. https://groups.google.com/forum/?fromgroups=#!topic/openpyxl-users/HZfpShMp8Tk

So, any suggestions?

Answer

If you want to use openpyxl again (for whatever reason), the following changes to the worksheet reader routine do the trick of keeping the strings as strings:

diff --git a/openpyxl/reader/worksheet.py b/openpyxl/reader/worksheet.py

--- a/openpyxl/reader/worksheet.py
+++ b/openpyxl/reader/worksheet.py
@@ -134,8 +134,10 @@data_type = element.get('t', 'n')if data_type == Cell.TYPE_STRING:value = string_table.get(int(value))
-
-            ws.cell(coordinate).value = value
+                ws.cell(coordinate).set_value_explicit(value=value,
+                                                data_type=Cell.TYPE_STRING)
+            else:
+                ws.cell(coordinate).value = value# to avoid memory exhaustion, clear the item after useelement.clear()

The Cell.value is a property and on assignment call Cell._set_value, which then does a Cell.bind_value which according to the method's doc: "Given a value, infer type and display options". As the types of the values are in the XML file those should be taken (here I only do that for strings) instead of doing something 'smart'.

As you can see from the code, the test whether it is a string was already there.

https://en.xdnf.cn/q/72904.html

Related Q&A

NLTK CoreNLPDependencyParser: Failed to establish connection

Im trying to use the Stanford Parser through NLTK, following the example here.I follow the first two lines of the example (with the necessary import)from nltk.parse.corenlp import CoreNLPDependencyPars…

How to convert hex string to color image in python?

im new in programming so i have some question about converting string to color image.i have one data , it consists of Hex String, like a fff2f3..... i want to convert this file to png like this.i can c…

How to add values to a new column in pandas dataframe?

I want to create a new named column in a Pandas dataframe, insert first value into it, and then add another values to the same column:Something like:import pandasdf = pandas.DataFrame() df[New column].…

value error happens when using GridSearchCV

I am using GridSearchCV to do classification and my codes are:parameter_grid_SVM = {dual:[True,False],loss:["squared_hinge","hinge"],penalty:["l1","l2"] } clf = …

How to remove english text from arabic string in python?

I have an Arabic string with English text and punctuations. I need to filter Arabic text and I tried removing punctuations and English words using sting. However, I lost the spacing between Arabic word…

python module pandas has no attribute plotting

I am a beginner of Python. I follow the machine learning course of Intel. And I encounter some troubles in coding. I run the code below in Jupyter and it raises an AttributeError.import pandas as pd st…

pandass resample with fill_method: Need to know data from which row was copied?

I am trying to use resample method to fill the gaps in timeseries data. But I also want to know which row was used to fill the missed data.This is my input series.In [28]: data Out[28]: Date 2002-09-0…

Inefficient multiprocessing of numpy-based calculations

Im trying to parallelize some calculations that use numpy with the help of Pythons multiprocessing module. Consider this simplified example:import time import numpyfrom multiprocessing import Pooldef t…

SQLite: return only top 2 results within each group

I checked other solutions to similar problems, but sqlite does not support row_number() and rank() functions or there are no examples which involve joining multiple tables, grouping them by multiple co…

Python list.append if not in list vs set.add performance [duplicate]

This question already has answers here:Which is faster and why? Set or List?(3 answers)Closed 6 years ago.Which is more performant, and what is asymptotic complexity (or are they equivalent) in Pytho…