Tidy data from multilevel Excel file via pandas

2024/9/8 10:44:44

I want to produce tidy data from an Excel file which looks like this, with three levels of "merged" headers:

enter image description here

Pandas reads the file just fine, with multilevel headers:

# df = pandas.read_excel('test.xlsx', header=[0,1,2])

For repeatability, you can copy-paste this:

df = pandas.DataFrame({('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'a'): {1: 'aX', 2: 'aY'}, ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'b'): {1: 'bX', 2: 'bY'}, ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'c'): {1: 'cX', 2: 'cY'}, ('level1_1', 'level2_1', 'level3_1'): {1: 1, 2: 10}, ('level1_1', 'level2_1', 'level3_2'): {1: 2, 2: 20}, ('level1_1', 'level2_2', 'level3_1'): {1: 3, 2: 30}, ('level1_1', 'level2_2', 'level3_2'): {1: 4, 2: 40}, ('level1_2', 'level2_1', 'level3_1'): {1: 5, 2: 50}, ('level1_2', 'level2_1', 'level3_2'): {1: 6, 2: 60}, ('level1_2', 'level2_2', 'level3_1'): {1: 7, 2: 70}, ('level1_2', 'level2_2', 'level3_2'): {1: 8, 2: 80}})

I want to normalise this so that the level headings are in variable rows, but retain columns a, b and c as columns:

Desired output

Without the multi-level headers, I would do pandas.melt(df, id_vars=['a', 'b', 'c']) to get what I want. pandas.melt(df) gives me the three variable columns I want, but obviously doesn't retain the a, b, and c columns.

Answer

It should be as simple as:

wide_df = pandas.read_excel(xlfile, sheetname, header=[0, 1, 2], index_col=[0, 1, 2, 3])long_df = wide_df.stack().stack().stack()

Here's an example with a mock-up CSV file (note the 4th row to label the index and the first column to label the header levels):

from io import StringIO
from textwrap import dedentimport pandasmockcsv = StringIO(dedent("""\num,,,this1,this1,this1,this1,that1,that1,that1,that1let,,,thisA,thisA,thatA,thatA,thisB,thisB,thatB,thatBanimal,,,cat,dog,bird,lizard,cat,dog,bird,lizarda,b,c,,,,,,,,a1,b1,c1,x1,x2,x3,x4,x5,x6,x7,x8a1,b1,c2,y1,y2,y3,y4,y5,y6,y7,y8a1,b2,c1,z1,z2,z3,z4,z5,6z,zy,z8
"""))wide_df = pandas.read_csv(mockcsv, index_col=[0, 1, 2], header=[0, 1, 2])
long_df = wide_df.stack().stack().stack()

So wide_df looks like this:

num      this1                  that1                 
let      thisA     thatA        thisB     thatB       
animal     cat dog  bird lizard   cat dog  bird lizard
a  b  c                                               
a1 b1 c1    x1  x2    x3     x4    x5  x6    x7     x8c2    y1  y2    y3     y4    y5  y6    y7     y8b2 c1    z1  z2    z3     z4    z5  6z    zy     z8

And long_df

a   b   c   animal  let    num  
a1  b1  c1  bird    thatA  this1    x3thatB  that1    x7cat     thisA  this1    x1thisB  that1    x5dog     thisA  this1    x2thisB  that1    x6lizard  thatA  this1    x4thatB  that1    x8c2  bird    thatA  this1    y3thatB  that1    y7cat     thisA  this1    y1thisB  that1    y5dog     thisA  this1    y2thisB  that1    y6lizard  thatA  this1    y4thatB  that1    y8b2  c1  bird    thatA  this1    z3thatB  that1    zycat     thisA  this1    z1thisB  that1    z5dog     thisA  this1    z2thisB  that1    6zlizard  thatA  this1    z4thatB  that1    z8

With literal data shown in the OP, you can get at this w/o modifying anything by doing the following:

index_names = ['a', 'b', 'c']
col_names = ['Level1', 'Level2', 'Level3']
df = (pandas.read_excel('Book1.xlsx', header=[0, 1, 2], index_col=[0, 1, 2, 3]).reset_index(level=0, drop=True).rename_axis(index_names, axis='index').rename_axis(col_names, axis='columns').stack().stack().stack().to_frame()
)

I think the tricky part will be inspecting each of your files to figure out what index_names should be.

https://en.xdnf.cn/q/72676.html

Related Q&A

ttk tkinter multiple frames/windows

The following application I have created is used to demonstrate multiple windows in tkinter. The main problem is that none of the Entry controls, neither in the bmi-calculator or the converter, accept …

input() vs sys.stdin.read()

import sys s1 = input() s2 = sys.stdin.read(1)#type "s" for examples1 == "s" #False s2 == "s" #TrueWhy? How can I make input() to work properly? I tried to encode/decode…

Renormalize weight matrix using TensorFlow

Id like to add a max norm constraint to several of the weight matrices in my TensorFlow graph, ala Torchs renorm method.If the L2 norm of any neurons weight matrix exceeds max_norm, Id like to scale it…

Numpy: find the euclidean distance between two 3-D arrays

Given, two 3-D arrays of dimensions (2,2,2):A = [[[ 0, 0],[92, 92]],[[ 0, 92],[ 0, 92]]]B = [[[ 0, 0],[92, 0]],[[ 0, 92],[92, 92]]]How do you find the Euclidean distance for each vector in A and B e…

Is it possible to break from lambda when the expected result is found

I am Python newbie, and just become very interested in Lambda expression. The problem I have is to find one and only one target element from a list of elements with lambda filter. In theory, when the t…

Intersection of multiple pandas dataframes

I have a number of dataframes (100) in a list as:frameList = [df1,df2,..,df100]Each dataframe has the two columns DateTime, Temperature.I want to intersect all the dataframes on the common DateTime col…

docker with pycharm 5

I try to build a docker-based development box for our django app. Its running smoothly.None of my teammembers will care about that until there is a nice IDE integration, therefore I play the new and sh…

How to make a simple Python REST server and client?

Im attempting to make the simplest possible REST API server and client, with both the server and client being written in Python and running on the same computer.From this tutorial:https://blog.miguelgr…

Histogram fitting with python

Ive been surfing but havent found the correct method to do the following.I have a histogram done with matplotlib:hist, bins, patches = plt.hist(distance, bins=100, normed=True)From the plot, I can see …

Subtract each row of matrix A from every row of matrix B without loops

Given two arrays, A (shape: M X C) and B (shape: N X C), is there a way to subtract each row of A from each row of B without using loops? The final output would be of shape (M N X C).Example A = np.ar…