As I was experimenting with pandas, I noticed some odd behavior of pandas.read_csv and was wondering if someone with more experience could explain what might be causing it.
To start, here is my basic class definition for creating a new pandas.DataFrame from a .csv file:
    import pandas as pd

    class dataMatrix:
        def __init__(self, filepath):
            self.path = filepath  # File path to the target .csv file.
            self.csvfile = open(filepath)  # Open file.
            self.csvdataframe = pd.read_csv(self.csvfile)
Now, this works pretty well, and calling the class in my __main__.py successfully creates a pandas DataFrame:
    from dataMatrix import dataMatrix

    testObject = dataMatrix('/path/to/csv/file')
But I noticed that this process automatically used the first row of the .csv as the DataFrame's column index (pandas.DataFrame.columns). I decided to number the columns instead. Since I didn't want to assume I knew the number of columns beforehand, I took the approach of opening the file, loading it into a DataFrame, counting the columns, and then reloading the DataFrame with the proper number of columns using range().
    import pandas as pd

    class dataMatrix:
        def __init__(self, filepath):
            self.path = filepath
            self.csvfile = open(filepath)

            # Load the .csv file to count the columns.
            self.csvdataframe = pd.read_csv(self.csvfile)
            # Count the columns.
            self.numcolumns = len(self.csvdataframe.columns)
            # Re-load the .csv file, manually setting the column names to their
            # number.
            self.csvdataframe = pd.read_csv(self.csvfile,
                                            names=range(self.numcolumns))
Keeping my processing in __main__.py the same, I got back a DataFrame with the correct number of columns (500 in this case) and the proper names (0...499), but it was otherwise empty (no row data).
Scratching my head, I decided to close self.csvfile and reload it like so:
    import pandas as pd

    class dataMatrix:
        def __init__(self, filepath):
            self.path = filepath
            self.csvfile = open(filepath)

            # Load the .csv file to count the columns.
            self.csvdataframe = pd.read_csv(self.csvfile)
            # Count the columns.
            self.numcolumns = len(self.csvdataframe.columns)

            # Close the .csv file.           #<---- +++++++
            self.csvfile.close()             #<---- Added
            # Re-open file.                  #<---- Block
            self.csvfile = open(filepath)    #<---- +++++++

            # Re-load the .csv file, manually setting the column names to their
            # number.
            self.csvdataframe = pd.read_csv(self.csvfile,
                                            names=range(self.numcolumns))
After closing the file and re-opening it, I correctly got back a pandas.DataFrame with columns numbered 0...499 and all 255 subsequent rows of data.
My question is: why does closing the file and re-opening it make a difference?
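In case it helps narrow things down, here is a small pandas-free sketch of something that may be related. It uses io.StringIO as a stand-in for the opened .csv file, with a made-up two-row sample, and shows that a plain file-like object only yields its contents once per pass unless it is rewound. Whether pd.read_csv leaves my file handle in the same state is an assumption on my part; I have not confirmed it.

```python
import io

# Stand-in for the opened .csv file (hypothetical sample data, not my real file).
handle = io.StringIO("a,b,c\n1,2,3\n")

first_pass = handle.read()   # reads everything up to end-of-file
second_pass = handle.read()  # position is already at the end, so nothing is left

handle.seek(0)               # rewind to the start without closing or reopening
third_pass = handle.read()   # the full contents are available again

print(repr(first_pass))
print(repr(second_pass))
print(repr(third_pass))
```

If the same thing applies to read_csv, then presumably calling self.csvfile.seek(0) between the two reads, or passing the file path instead of an open handle, would have the same effect as my close-and-reopen step.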