Extract HTML Tables With Similar Data from Different Sources with Different Formatting - Python

2024/10/5 18:24:15

I am trying to scrape HTML tables from two different HTML sources. The sources are very similar: each table contains the same data, but the tables may be structured differently, with different column names and so on. In one source all of the data may be in a single table, while in the other it may be split across two separate tables.

As an example, we can look at insider holders of both AAPL and MMM stocks.

Screenshots here - https://i.sstatic.net/dt6Pa.jpg

Let's say the end goal is to extract the total number of shares held by insiders, a single number. Each table may be structured differently, but keywords such as "Beneficially" or "Stock" should be common to both.
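
The keyword idea can be sketched as follows. This is a minimal illustration, not a tested solution for the SEC filings: the `tables_matching` helper, the keyword list, and the sample HTML with its placeholder names and numbers are all assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Assumed keyword list; matching is done on lower-cased table text.
KEYWORDS = ("beneficial", "stock")


def tables_matching(html, keywords=KEYWORDS):
    """Return the <table> elements whose text mentions any keyword."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for table in soup.find_all("table"):
        text = table.get_text(" ", strip=True).lower()
        if any(word in text for word in keywords):
            matches.append(table)
    return matches


# Placeholder HTML standing in for two differently structured filings.
sample_html = """
<table><tr><td>Fiscal year</td><td>Revenue</td></tr></table>
<table>
  <tr><th>Name of Beneficial Owner</th><th>Shares</th></tr>
  <tr><td>Director A</td><td>100</td></tr>
</table>
<table>
  <tr><th>Name and principal position</th><th>Common Stock Beneficially Owned</th></tr>
  <tr><td>Director B</td><td>200</td></tr>
</table>
"""

found = tables_matching(sample_html)
print(len(found))
```

Note that `find_all("table")` also returns nested tables, so a layout table that wraps a matching table would match as well; real filings may need an extra filter for that.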

Any help would be greatly appreciated. In a previous post I was able to extract some of the data, but that approach can't be looped or reused when the structure differs:

Extract HTML Table Based on Specific Column Headers - Python

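import pandas as pd

# Select the table by its inline style and by a header it must contain.
df = pd.read_html(
    "https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm",
    attrs={'style': 'border-collapse: collapse; width: 100%; font: 9pt Arial, Helvetica, Sans-Serif'},
    match="Name/address",
)
df = df[0]
# Drop every column that contains a missing value.
df = df.dropna(axis='columns')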

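I also attempted this with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
tables = soup.find_all('table')
# find_all() returns a list of tables, so collect the rows table by table.
rows = [row for table in tables for row in table.find_all('tr')]
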
Answer

That was really complicated but here we go :).

import re
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.sec.gov/Archives/edgar/data/320193/000119312520001450/d799303ddef14a.htm',
    'https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm',
]


def main(urls):
    # Note: sec.gov may reject requests that lack a descriptive User-Agent header.
    with requests.Session() as req:
        for url in urls:
            r = req.get(url)
            soup = BeautifulSoup(r.content, 'html.parser')
            # Table-of-contents links whose text starts with "Security".
            for item in soup.find_all("a", string=re.compile("^Security")):
                item = item.get("href")[1:]  # strip the leading '#'
                # Jump to the named anchor and take the first table after it.
                catch = soup.find("a", {'name': item}).find_next("table")
                df = pd.read_html(StringIO(str(catch)))
                print(df)
                df[0].to_csv(f"{item}.csv", index=False, header=None)


main(urls)

Output:

[                                                    0  ...    8
0                                                 NaN  ...  NaN
1                                                 NaN  ...  NaN
2                            Name of Beneficial Owner  ...  NaN
3                                                 NaN  ...  NaN
4                                  The Vanguard Group  ...    %
5                                                 NaN  ...  NaN
6                                     BlackRock, Inc.  ...    %
7                                                 NaN  ...  NaN
8         Berkshire Hathaway Inc. / Warren E. Buffett  ...    %
9                                                 NaN  ...  NaN
10                                         Kate Adams  ...  NaN
11                                                NaN  ...  NaN
12                                    Angela Ahrendts  ...  NaN
13                                                NaN  ...  NaN
14                                         James Bell  ...  NaN
15                                                NaN  ...  NaN
16                                           Tim Cook  ...  NaN
17                                                NaN  ...  NaN
18                                            Al Gore  ...  NaN
19                                                NaN  ...  NaN
20                                        Andrea Jung  ...  NaN
21                                                NaN  ...  NaN
22                                       Art Levinson  ...  NaN
23                                                NaN  ...  NaN
24                                       Luca Maestri  ...  NaN
25                                                NaN  ...  NaN
26                                    Deirdre O’Brien  ...  NaN
27                                                NaN  ...  NaN
28                                          Ron Sugar  ...  NaN
29                                                NaN  ...  NaN
30                                         Sue Wagner  ...  NaN
31                                                NaN  ...  NaN
32                                      Jeff Williams  ...  NaN
33                                                NaN  ...  NaN
34  All current executive officers and directors a...  ...  NaN

[35 rows x 9 columns]]
[                                                   0   1   ...                18  19 
0                        Name  and principal position NaN  ...  Percent of Class NaN  
1                    Thomas “Tony” K. Brown, Director NaN  ...               (5) NaN  
2                           Pamela J. Craig, Director NaN  ...               (5) NaN  
3                           David B. Dillon, Director NaN  ...               (5) NaN  
4                          Michael L. Eskew, Director NaN  ...               (5) NaN  
5                         Herbert L. Henkel, Director NaN  ...               (5) NaN  
6                               Amy E. Hood, Director NaN  ...               (5) NaN  
7                               Muhtar Kent, Director NaN  ...               (5) NaN  
8                           Edward M. Liddy, Director NaN  ...               (5) NaN  
9                           Dambisa F. Moyo, Director NaN  ...               (5) NaN  
10                          Gregory R. Page, Director NaN  ...               (5) NaN  
11                       Patricia A. Woertz, Director NaN  ...               (5) NaN  
12  Michael F. Roman, Chairman of the Board, Presi... NaN  ...               (5) NaN  
13  Inge G. Thulin, Former Executive Chairman of t... NaN  ...               (5) NaN  
14  Nicholas C. Gangestad, Senior Vice President a... NaN  ...               (5) NaN  
15  Ashish K. Khandpur, Executive Vice President, ... NaN  ...               (5) NaN  
16  Julie L. Bushman, Executive Vice President, In... NaN  ...               (5) NaN  
17  Joaquin Delgado, Former Executive Vice Preside... NaN  ...               (5) NaN  
18  Michael G. Vale, Executive Vice President, Saf... NaN  ...               (5) NaN  
19  All Directors and Executive Officers as a Grou... NaN  ...               (5) NaN

[20 rows x 20 columns]]
[                                                   0   1  ...                  6   7 
0                                       Name/address NaN  ...  Percent  of Class NaN  
1  The Vanguard Group(1) 100 Vanguard Blvd. Malve... NaN  ...               8.78 NaN  
2  State Street Corporation(2) State Street Finan... NaN  ...               7.36 NaN  
3  BlackRock, Inc.(3) 55 East 52nd Street New Yor... NaN  ...               7.30 NaN

[4 rows x 8 columns]]
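
To get from frames like these to the single number the question asks for, one option is to drop the empty spacer rows and columns and read the first numeric cell in the "All ... as a group" row. The sketch below is only that, a sketch: the `group_total` helper, its label pattern, and the sample frame with placeholder values are assumptions, and it presumes the share count appears to the left of any percentage column.

```python
import re

import pandas as pd


def group_total(df, label_pattern=r"^All\b.*(?:directors|officers)"):
    """Return the first integer in the 'All ... as a group' row, or None."""
    # Drop the all-NaN spacer rows and columns seen in the output above.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    labels = df.iloc[:, 0].astype(str)
    rows = df[labels.str.contains(label_pattern, case=False, regex=True)]
    if rows.empty:
        return None
    for cell in rows.iloc[0, 1:]:
        # Accept "1,234,567" or "1,234,567(4)"; skip cells like "%", "*", "(5)".
        m = re.match(r"\s*(\d[\d,]*)", str(cell))
        if m:
            return int(m.group(1).replace(",", ""))
    return None


# Placeholder frame shaped like a read_html() result; values are made up.
sample = pd.DataFrame(
    [
        ["Name of Beneficial Owner", None, "Shares", None],
        [None, None, None, None],
        ["Director A", None, "1,000", "*"],
        ["All current executive officers and directors as a group (3 persons)",
         None, "12,345(2)", "*"],
    ]
)
print(group_total(sample))
```

With the answer's code, this would be called on each `df[0]` instead of `sample`. When a filing splits the data across two tables, only one of them is expected to contain the group row, so the other call would return `None`.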