Use BeautifulSoup to scrape a table within a webpage?

2024/9/20 22:49:35

I am scraping a county website that posts emergency calls and their locations. I have found success webscraping basic elements, but am having trouble scraping the rows of the table.

(Here is an example of what I am working with codewise)

location = list.find('div', class_='listing-search-item__sub-title')

I'm not sure how to scrape the rows of the table specifically. Can anyone explain how to dig into the sublevels of the HTML to find these records? I don't know whether I need to target tr, table, tbody, td, etc. I could use some guidance on which element or class to use to get at the data.


Answer

For extracting specific nested elements, I often prefer .select, which takes CSS selectors (bs4 has no XPath support, although there are solutions that use the lxml library for that). For your case, you could use something like

soup.select_one('table[id="form1:tableEx1"]').select('tbody tr')

but the results might look a bit odd, since the text of the columns will not be separated. To get separate columns/cells, you could collect the rows as tuples instead with

# tHtml is the HTML string of the page; naming the parser avoids a bs4 warning
tableRows = [
    tuple(c.text.strip() for c in r.find_all(['th', 'td']))
    for r in BeautifulSoup(tHtml, 'html.parser')
        .select_one('table[id="form1:tableEx1"]').select('tbody tr')
]
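As a self-contained illustration of the tuple approach above, here is a minimal sketch. The sample HTML, the column names, and the helper name rows_as_tuples are made up for the example; only the table id comes from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the county page; only the table id is from the answer.
SAMPLE_HTML = """
<table id="form1:tableEx1">
  <thead><tr><th>Call Type</th><th>Location</th></tr></thead>
  <tbody>
    <tr><td> Medical </td><td>100 Main St</td></tr>
    <tr><td>Fire</td><td> 200 Oak Ave </td></tr>
  </tbody>
</table>
"""


def rows_as_tuples(html, table_id):
    """Return each tbody row of the table as a tuple of stripped cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    # Attribute selector, because the id contains a colon
    table = soup.select_one(f'table[id="{table_id}"]')
    if table is None:
        return []
    return [
        tuple(cell.get_text(strip=True) for cell in row.find_all(["th", "td"]))
        for row in table.select("tbody tr")
    ]


table_rows = rows_as_tuples(SAMPLE_HTML, "form1:tableEx1")
print(table_rows)
```

Returning an empty list when the table is missing is a design choice for the sketch; raising an error may suit a real scraper better.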

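(Note that the plain .select('#id') shorthand does not work as-is when the id contains a ":", because the colon is read as the start of a pseudo-class; use the attribute form shown above, or escape the colon.)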
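A short sketch of three ways to reach a table whose id contains a colon. The one-row HTML is invented for the example, and the escaped form assumes the selector engine bundled with bs4 (soupsieve) handles CSS escapes, which current versions do.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table; only the id is taken from the answer.
html = '<table id="form1:tableEx1"><tbody><tr><td>x</td></tr></tbody></table>'
soup = BeautifulSoup(html, "html.parser")

# 1. Attribute selector: the colon sits inside a quoted string, so it is safe.
by_attr = soup.select_one('table[id="form1:tableEx1"]')

# 2. Id shorthand with the colon escaped (raw string keeps the backslash).
by_escape = soup.select_one(r"#form1\:tableEx1")

# 3. Plain find with an id keyword, no CSS parsing involved.
by_find = soup.find("table", id="form1:tableEx1")

print(by_attr is by_escape, by_attr is by_find)
```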

As one of the comments mentioned, you can use pandas.read_html(htmlString) to get a list of tables in the html; if you want a specific table, use the attrs argument:

import pandas
# recent pandas versions expect a file-like object here, e.g. io.StringIO(htmlString)
pandas.read_html(htmlString, attrs={'id': 'form1:tableEx1'})[0]

but you will get the whole table, not just what's in tbody, and any tables nested inside it will be flattened (the linked example shows the result for a table that contains nested tables).
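A minimal sketch of the pandas route, assuming pandas is installed together with one of its HTML parser backends (lxml, or bs4 with html5lib). The two sample tables and their contents are invented for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; attrs picks out the one we want.
html = """
<table id="other"><tr><td>ignore me</td></tr></table>
<table id="form1:tableEx1">
  <thead><tr><th>Call Type</th><th>Location</th></tr></thead>
  <tbody>
    <tr><td>Medical</td><td>100 Main St</td></tr>
    <tr><td>Fire</td><td>200 Oak Ave</td></tr>
  </tbody>
</table>
"""

# Wrapping the string in StringIO avoids the deprecation of literal HTML input.
tables = pd.read_html(StringIO(html), attrs={"id": "form1:tableEx1"})
df = tables[0]
print(df)
```

read_html always returns a list, even when attrs narrows the match to one table, hence the [0].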

And the single-statement method I showed at first with select cannot be used at all with nested tables since the output will be scrambled. Instead, if you want to preserve any nested inner tables without flattening, and if you are likely to be scraping tables often, I have the following set of functions which can be used in general:

  • first, define two other functions that the main table extractor depends on:
# get a list of tagNames between a tag and its ancestor
def linkAncestor(t, a=None):
    aList = []
    while t.parent != a or a is None:
        t = t.parent
        if t is None:
            if a is not None: aList = None
            break
        aList.append(t.name)
    return aList
# if a == t.parent: return []
# if a is None, return tagNames of ALL ancestors
# if a not in t.parents: return None

def getStrings_table(xSoup): # not perfect, but enough for me so far
    tableTags = ['table', 'tr', 'th', 'td']
    return "\n".join([
        c.get_text(' ', strip=True) for c in xSoup.children
        if c.get_text(' ', strip=True) and (c.name is None or (
            c.name not in tableTags and not c.find(tableTags)))
    ])
  • then, you can define the function for extracting the tables as python dictionaries:
def tablesFromSoup(mSoup, mode='a', simpleOp=False):
    typeDict = {'t': 'table', 'r': 'row', 'c': 'cell'}
    finderDict = {'t': 'table', 'r': 'tr', 'c': ['th', 'td']}
    refDict = {
        'a': {'tables': 't', 'loose_rows': 'r', 'loose_cells': 'c'},
        't': {'inner_tables': 't', 'rows': 'r', 'loose_cells': 'c'},
        'r': {'inner_tables': 't', 'inner_rows': 'r', 'cells': 'c'},
        'c': {'inner_tables': 't', 'inner_rows': 'r', 'inner_cells': 'c'}
    }
    mode = mode if mode in refDict else 'a'

    # for when simpleOp = True
    nextModes = {'a': 't', 't': 'r', 'r': 'c', 'c': 'a'}
    mainCont = {'a': 'tables', 't': 'rows', 'r': 'cells', 'c': 'inner_tables'}

    innerContent = {}
    for k in refDict[mode]:
        if simpleOp and k != mainCont[mode]: continue
        fdKey = refDict[mode][k] # also the mode for recursive call
        innerSoups = [(s, linkAncestor(s, mSoup)) for s in mSoup.find_all(finderDict[fdKey])]
        # keep the (tag, ancestors) pairs: the header checks below index into them
        innerSoups = [(s, la) for s, la in innerSoups if not (
            'table' in la or 'tr' in la or 'td' in la or 'th' in la)]

        # recursive call
        kCont = [tablesFromSoup(s, fdKey, simpleOp) for s, la in innerSoups]
        if simpleOp:
            if kCont == [] and mode == 'c': break
            return tuple(kCont) if mode == 'r' else kCont

        # if not empty, check if header then add to output
        if kCont:
            if 'row' in k:
                for i in range(len(kCont)):
                    if 'isHeader' in kCont[i]: continue
                    kCont[i]['isHeader'] = 'thead' in innerSoups[i][1]
            if 'cell' in k:
                isH = [(c[0].name == 'th' or 'thead' in c[1]) for c in innerSoups]
                if sum(isH) > 0:
                    if mode == 'r':
                        innerContent['isHeader'] = True
                    else:
                        innerContent[f'isHeader_{k}'] = isH
            innerContent[k] = kCont

    if innerContent == {} and mode == 'c':
        innerContent = mSoup.get_text(' ', strip=True)
    elif mode in typeDict:
        if innerContent == {}:
            innerContent['innerText'] = mSoup.get_text(' ', strip=True)
        else:
            innerStrings = getStrings_table(mSoup)
            if innerStrings:
                innerContent['stringContent'] = innerStrings
        innerContent['type'] = typeDict[mode]
    return innerContent

With the same example as before, this function returns the tables as nested dictionaries. If the simpleOp argument is set to True, the output is simpler, but the headers are no longer differentiated and some other peripheral data is also excluded.
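If the full dictionary output is more than you need, a lighter way to avoid the scrambling caused by nested tables is to select only the outer table's own rows and cells. This is a sketch of a different, simpler technique than the functions above, not a replacement for them. The HTML and the "outer" id are invented, and the sample includes an explicit tbody because html.parser does not insert one.

```python
from bs4 import BeautifulSoup

# Hypothetical table with another table nested inside one of its cells.
html = """
<table id="outer">
  <tbody>
    <tr><td>A</td><td><table><tbody><tr><td>inner</td></tr></tbody></table></td></tr>
    <tr><td>B</td><td>C</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.select_one("table#outer")

# Descendant selector: also picks up the row of the nested table.
all_rows = outer.select("tbody tr")

# Child combinators anchored at :scope keep only the outer table's own rows.
own_rows = outer.select(":scope > tbody > tr")

# recursive=False limits the search to each row's direct cells.
own_cells = [
    tuple(c.get_text(" ", strip=True) for c in r.find_all(["th", "td"], recursive=False))
    for r in own_rows
]
print(len(all_rows), len(own_rows), own_cells)
```

The nested table's text still appears inside its parent cell; it is only kept from leaking into the outer table's row and column structure.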

https://en.xdnf.cn/q/119274.html
