I am scraping a county website that posts emergency calls and their locations. I have found success webscraping basic elements, but am having trouble scraping the rows of the table.
(Here is an example of what I am working with codewise)
location = list.find('div', class_='listing-search-item__sub-title')
I'm not sure how to scrape the rows of the table specifically. Can anyone explain how to dig into the sublevels of the HTML to find these records? I'm not sure whether I need to dig into tr, table, tbody, td, etc. I could use some guidance on which tag or class to target to get at the data.
For extracting specific nested elements, I often prefer to use .select, which uses CSS selectors (bs4 doesn't seem to have any support for XPath, but you can also check out these solutions using the lxml library), so for your case you could use something like
soup.select_one('table[id="form1:tableEx1"]').select('tbody tr')
but the results might look a bit weird since the columns might not be separated - to have separated columns/cells, you could get the list of rows as tuples instead with
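# tHtml is the page's HTML as a string; the parser is named explicitly to avoid bs4's "no parser specified" warning
tableRows = [
    tuple(c.text.strip() for c in r.find_all(['th', 'td']))
    for r in BeautifulSoup(tHtml, 'html.parser').select_one('table[id="form1:tableEx1"]').select('tbody tr')
]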
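(Note that you can't use the .select('#id') format as-is when the id contains a ":", because the colon is read as the start of a pseudo-class; the colon would have to be escaped, which is why I used the attribute form above.)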
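To make the selector options concrete, here is a minimal sketch on made-up HTML (the table contents are invented for illustration; only the id form1:tableEx1 comes from the answer above). It assumes a reasonably recent bs4, where .select is backed by soupsieve and CSS escapes are accepted.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the county page
html = """
<table id="form1:tableEx1">
  <thead><tr><th>Call</th><th>Location</th></tr></thead>
  <tbody>
    <tr><td>Fire</td><td>Main St</td></tr>
    <tr><td>Medical</td><td>Oak Ave</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Three ways to reach the same table despite the ":" in its id
by_attr = soup.select_one('table[id="form1:tableEx1"]')  # attribute selector
by_escaped = soup.select_one(r'#form1\:tableEx1')        # escaped colon
by_find = soup.find('table', id='form1:tableEx1')        # plain find, no CSS

# One tuple per body row, one string per cell
table_rows = [
    tuple(c.get_text(strip=True) for c in r.find_all(['th', 'td']))
    for r in by_attr.select('tbody tr')
]
print(table_rows)
```

The header row is skipped here only because the sample puts it in thead; if the real page has no thead/tbody, selecting 'tr' directly would be the thing to try.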
As one of the comments mentioned, you can use pandas.read_html(htmlString) to get a list of tables in the HTML; if you want a specific table, use the attrs argument:
import pandas
from io import StringIO  # newer pandas versions deprecate passing a literal HTML string
pandas.read_html(StringIO(htmlString), attrs={'id': 'form1:tableEx1'})[0]
but you will get the whole table, not just what's in tbody; and this will flatten any tables that are nested inside (see results with table used from this example).
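As a rough illustration of the "whole table" point, the sketch below runs read_html on invented sample HTML (only the id is taken from above). It assumes pandas has an HTML parser backend available (lxml, or bs4 with html5lib) and wraps the string in StringIO.

```python
from io import StringIO

import pandas

# Hypothetical sample standing in for the county page
html = """
<table id="form1:tableEx1">
  <thead><tr><th>Call</th><th>Location</th></tr></thead>
  <tbody>
    <tr><td>Fire</td><td>Main St</td></tr>
    <tr><td>Medical</td><td>Oak Ave</td></tr>
  </tbody>
</table>
"""

# attrs narrows the match to tables carrying exactly these attributes
df = pandas.read_html(StringIO(html), attrs={'id': 'form1:tableEx1'})[0]
print(df)
```

The thead row ends up as the column labels and the tbody rows as the data, so df.values.tolist() or df.to_dict('records') would give the rows in plain Python form.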
And the single-statement method I showed at first with select cannot be used at all with nested tables, since the output will be scrambled. Instead, if you want to preserve any nested inner tables without flattening, and if you are likely to be scraping tables often, I have the following set of functions which can be used in general:
- first define two other functions that the main table extractor depends on:
# get a list of tagNames between a tag and its ancestor
def linkAncestor(t, a=None):
    aList = []
    while t.parent != a or a is None:
        t = t.parent
        if t is None:
            if a is not None: aList = None
            break
        aList.append(t.name)
    return aList

# if a == t.parent: return []
# if a is None, return tagNames of ALL ancestors
# if a not in t.parents: return None

def getStrings_table(xSoup):
    # not perfect, but enough for me so far
    tableTags = ['table', 'tr', 'th', 'td']
    return "\n".join([
        c.get_text(' ', strip=True) for c in xSoup.children
        if c.get_text(' ', strip=True) and (c.name is None or (
            c.name not in tableTags and not c.find(tableTags)
        ))
    ])
- then, you can define the function for extracting the tables as python dictionaries:
def tablesFromSoup(mSoup, mode='a', simpleOp=False):
    typeDict = {'t': 'table', 'r': 'row', 'c': 'cell'}
    finderDict = {'t': 'table', 'r': 'tr', 'c': ['th', 'td']}
    refDict = {
        'a': {'tables': 't', 'loose_rows': 'r', 'loose_cells': 'c'},
        't': {'inner_tables': 't', 'rows': 'r', 'loose_cells': 'c'},
        'r': {'inner_tables': 't', 'inner_rows': 'r', 'cells': 'c'},
        'c': {'inner_tables': 't', 'inner_rows': 'r', 'inner_cells': 'c'}
    }
    mode = mode if mode in refDict else 'a'

    # for when simpleOp = True
    nextModes = {'a': 't', 't': 'r', 'r': 'c', 'c': 'a'}
    mainCont = {'a': 'tables', 't': 'rows', 'r': 'cells', 'c': 'inner_tables'}

    innerContent = {}
    for k in refDict[mode]:
        if simpleOp and k != mainCont[mode]: continue
        fdKey = refDict[mode][k]  # also the mode for recursive call
        innerSoups = [(s, linkAncestor(s, mSoup)) for s in mSoup.find_all(finderDict[fdKey])]
        # keep only matches not nested inside another table/row/cell;
        # the (tag, ancestors) pairs are kept because the header checks below index into them
        innerSoups = [(s, la) for s, la in innerSoups if not (
            'table' in la or 'tr' in la or 'td' in la or 'th' in la)]

        # recursive call
        kCont = [tablesFromSoup(s, fdKey, simpleOp) for s, _ in innerSoups]
        if simpleOp:
            if kCont == [] and mode == 'c': break
            return tuple(kCont) if mode == 'r' else kCont

        # if not empty, check if header then add to output
        if kCont:
            if 'row' in k:
                for i in range(len(kCont)):
                    if 'isHeader' in kCont[i]: continue
                    kCont[i]['isHeader'] = 'thead' in innerSoups[i][1]
            if 'cell' in k:
                isH = [(c[0].name == 'th' or 'thead' in c[1]) for c in innerSoups]
                if sum(isH) > 0:
                    if mode == 'r':
                        innerContent['isHeader'] = True
                    else:
                        innerContent[f'isHeader_{k}'] = isH
            innerContent[k] = kCont

    if innerContent == {} and mode == 'c':
        innerContent = mSoup.get_text(' ', strip=True)
    elif mode in typeDict:
        if innerContent == {}:
            innerContent['innerText'] = mSoup.get_text(' ', strip=True)
        else:
            innerStrings = getStrings_table(mSoup)
            if innerStrings:
                innerContent['stringContent'] = innerStrings
        innerContent['type'] = typeDict[mode]
    return innerContent
With the same example as before, this function gives this output; if the simpleOp argument is set to True, it results in a simpler output, but then the headers are no longer differentiated and some other peripheral data is also excluded.
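If the inner tables do not need to be preserved as structure and the goal is only to stop them from scrambling the outer rows, a lighter alternative might be to restrict the selector to direct children. This is a different technique from tablesFromSoup, sketched on made-up HTML (the id "outer" and the cell contents are hypothetical), and it assumes the markup has an explicit tbody, since html.parser does not add one.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: an outer table whose first row contains a nested table
html = (
    '<table id="outer"><tbody>'
    '<tr><td>A</td><td><table><tbody>'
    '<tr><td>inner1</td><td>inner2</td></tr>'
    '</tbody></table></td></tr>'
    '<tr><td>B</td><td>C</td></tr>'
    '</tbody></table>'
)
soup = BeautifulSoup(html, 'html.parser')
outer = soup.select_one('table[id="outer"]')

# Descendant selector: also picks up the row of the nested table
naive_rows = outer.select('tbody tr')

# Child combinators anchored at the outer table: only its own rows
own_rows = outer.select(':scope > tbody > tr')

# recursive=False keeps each row to its direct cells; a nested table's
# text is collapsed into the cell that contains it
table_rows = [
    tuple(c.get_text(' ', strip=True) for c in r.find_all(['th', 'td'], recursive=False))
    for r in own_rows
]
print(len(naive_rows), len(own_rows))
print(table_rows)
```

This still flattens the inner table into a single string, so it is only a fit when the nested content is incidental; the dictionary output above remains the option for keeping it intact.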