Create a dataframe from HTML table in Python [closed]

2024/10/7 22:25:53

I'm trying to extract information from multiple tables like the one below: the address, lot number, guide price and description. Should I simply use a regular expression match? There are 232 such tables, so presumably I should loop over them and put the results into pandas?

                            <table cellspacing="0" id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1" style="width:100%;border-collapse:collapse;">
<tr><td colspan="2"><table class="table-search-result"><tr><th>66D Charlwood Street, Pimlico, London, SW1V 4PQ</th><th style="text-align: right; white-space: nowrap;"><a href="http://www.englishhouseprices.com/results.aspx?postcode=SW1V 4PQ" id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_A2" class="icon" target="_blank"><img src="/content/images/icons/32/houseprices.png" alt="Compare with Property Prices" title="Compare with Property Prices in this Postcode" /></a><a id="" title="View Auction Details" class="icon" onclick="return o(this,900,650,1,1)" href="/clients/auctions/details.aspx?auctionid=17999" target="_blank"><img title="View Auction Details" src="/content/images/icons/32/auctiondetails.png" alt="" /></a><a id="" title="Trend Analysis" class="icon" onclick="return o(this,900,650,1,1)" href="/clients/lots/trend-analysis.aspx?lotid=756425" target="_blank"><img title="Trend Analysis" src="/content/images/icons/32/piechart.png" alt="" /></a><a href='http://maps.google.co.uk?q=SW1V 4PQ' target="_blank"><img id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_ImageLocationMap" title="Location Map" class="icon" src="/content/images/icons/32/compass.png" /></a><a href='http://www.multimap.com/map/photo.cgi?scale=5000&mapsize=big&pc=SW1V 4PQ' target="_blank"><img id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_ImageAerialPhoto" title="Aerial Photo" class="icon" src="/content/images/icons/32/camera.png" /></a><a href='/clients/search/search-results.aspx?searchtype=comparable&lotid=756425' title="Find similar properties like this one"><img src="/content/images/icons/32/find.png" alt="Find other properties matching this tenant" title="Find similar properties like this one" class="icon" /></a><a href='/clients/search/search-results.aspx?searchtype=history&lotid=756425'><img src="/content/images/icons/32/history.png" alt="Find history of property in this street" title="Find history of property in this street" class="icon" /></a><a id="" title="Add 
to one of my portfolios" class="icon" Title="Add to portfolio" onclick="return o(this,650,500,1,1)" href="/clients/portfolios/lot.aspx?lotid=756425" target="_blank"><img title="Add to one of my portfolios" src="/content/images/icons/32/briefcase.png" alt="" /></a><a href="https://www.eigroup.co.uk/files/55/17999/6ec339ec-d59e-4b8a-9136-dc6e9a583328.pdf" id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_A4" target="_blank"><img src="/content/images/icons/32/catalogue.png" alt="Catalogue Entry" class="icon" title="Full Catalogue Entry" /></a><a id="" title="Add to my shortlist" class="icon" Title="Add to shortlist" onclick="return o(this,900,650,1,1)" href="/clients/lots/shortlist.aspx?lotid=756425" target="shortlist"><img title="Add to my shortlist" src="/content/images/icons/32/shortlist.png" alt="" /></a></th></tr><tr><td colspan="2" style="background-color: #f5f5f5;"><table style="width: 100%"><tr><td style="background-color: #f1f1f1; width: 170px; text-align: center;"><a href='/clients/lots/details.aspx?lotid=756425&hb=1' target='756425' onclick="window.open(this.href,this.target,'width=900,height=650,resizable=yes,scrollbars=yes');return false" title="Auction property in Pimlico, London, SW1"><img id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_Image1" src="https://www.eigroup.co.uk/files/55/17999/de591a4f-7da1-4bcd-a42c-76731bd72a23.jpg" alt="Pimlico, London, SW1" style="border-color:Black;border-width:2px;border-style:Solid;width:150px;" /></a></td><td style="padding-left: 10px; width: 50%;"><p><b>Description</b><br />Leasehold 2nd Floor Studio Flat Unmodernised Vacant</p><p id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_P1"><b>Guide Price</b><br />£450,000 Plus</p><p><b>Lot Number</b><br />2</p><p><b> </b></p></td><td style="white-space: nowrap;"><p><b>Auctioneer</b><br /><a id="" onclick="return o(this,900,650,1,1)" href="/clients/auctioneers/details.aspx?auctioneerid=55" target="_blank">Savills (London - National)</a></p><p 
id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_P3"><b>Vendor</b><br />Housing Association</p></td><td style="white-space: nowrap;"><p><b>Auction Date</b><br /><a id="" onclick="return o(this,900,650,1,1)" href="/clients/auctions/details.aspx?auctionid=17999" target="_blank">28 October 2014</a></p><p id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1_P7"><b>Lease Details</b><br />125 Yr, commencing 01/01/2013 (GR.£250.PA)</p></td></tr></table></td></tr></table>
</td>
</tr>

Answer

Use BeautifulSoup to parse the HTML:

Using your posted HTML as an example:

    from bs4 import BeautifulSoup

    # html is a string holding the markup posted in the question
    soup = BeautifulSoup(html, "html.parser")
    s = soup.find_all("p")
    for ele in s:
        print(ele.text.strip())

Output:

    DescriptionLeasehold 2nd Floor Studio Flat Unmodernised Vacant
    Guide Price£450,000 Plus
    Lot Number2
    Auctioneer
    Savills (London - National)
    VendorHousing Association
    Auction Date
    28 October 2014
    Lease Details125 Yr, commencing 01/01/2013 (GR.£250.PA)
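
To get from the printed paragraphs to a dataframe, one possible approach is to split each paragraph into its bold label and the value that follows it, collect one dict per lot, and hand the list to pandas. The sketch below is only an illustration: the `html` string is a trimmed stand-in for the posted markup, `parse_lot` is a hypothetical helper name, and it assumes every lot sits inside its own `table-search-result` table with the address in the first `th`, as in the sample.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Trimmed stand-in for one lot from the question (assumption: all 232 lots
# share this layout; the real page would be loaded into `html` instead).
html = """
<table id="ListViewLots_ClientPropertyControl1_1_FormViewLot_1">
<tr><td colspan="2"><table class="table-search-result">
<tr><th>66D Charlwood Street, Pimlico, London, SW1V 4PQ</th><th><a href="#">icons</a></th></tr>
<tr><td colspan="2"><table><tr>
<td><p><b>Description</b><br />Leasehold 2nd Floor Studio Flat Unmodernised Vacant</p>
<p><b>Guide Price</b><br />£450,000 Plus</p>
<p><b>Lot Number</b><br />2</p><p><b> </b></p></td>
<td><p><b>Auctioneer</b><br /><a href="#">Savills (London - National)</a></p></td>
</tr></table></td></tr>
</table></td></tr>
</table>
"""


def parse_lot(result_table):
    """Build one row: the address plus every bold-label/value pair in the lot."""
    row = {"Address": result_table.find("th").get_text(strip=True)}
    for p in result_table.find_all("p"):
        label_tag = p.find("b")
        label = label_tag.get_text(strip=True) if label_tag else ""
        if not label:
            continue  # skip the empty <p><b> </b></p> spacer paragraph
        # The paragraph text starts with the label; the remainder is the value.
        text = p.get_text(" ", strip=True)
        row[label] = text[len(label):].strip()
    return row


soup = BeautifulSoup(html, "html.parser")
# One dict per lot table; on the full page this loop would cover every lot.
rows = [parse_lot(t) for t in soup.find_all("table", class_="table-search-result")]
df = pd.DataFrame(rows)
print(df[["Address", "Lot Number", "Guide Price", "Description"]])
```

Because the labels become column names, any lot that lacks a field (for example no lease details) simply gets a missing value in that column. If the real page nests things differently, the `find_all` selector is the part to adjust.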
