A (presumably basic) web scraping of http://www.ssa.gov/cgi-bin/popularnames.cgi with urllib

2024/10/13 17:14:18

I am very new to Python (and web scraping). Let me ask you a question.

Many websites do not show a page-specific URL in Firefox or other browsers. For example, the Social Security Administration shows popular baby names with ranks (since 1880), but the URL does not change when I change the year from 1880 to 1881. It is always:

http://www.ssa.gov/cgi-bin/popularnames.cgi

Because I don't know the specific URL, I could not download the webpage using urllib.

The page source includes:

<input type="text" name="year" id="yob" size="4" value="1880">

So presumably, if I can control this "year" value (like, "1881" or "1991"), I can deal with this problem. Am I right? I still don't know how to do it.

Can anybody tell me the solution for this please?

If you know some websites that may help my study, please let me know.

THANKS!

Answer

You can still use urllib. The button performs a POST to the current url. Using Firefox's Firebug I took a look at the network traffic and found they're sending 3 parameters: member, top, and year. You can send the same arguments:

import urllib

url = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'
year = 1881  # example value; the loop at the end supplies each year
post_params = {
    # member was blank, so I'm excluding it.
    'top'  : '25',
    'year' : year
}
post_args = urllib.urlencode(post_params)

Now, just send the url-encoded arguments:

urllib.urlopen(url, post_args)
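
On Python 3 the same calls live in urllib.parse and urllib.request. The following is a minimal sketch, assuming Python 3 and that the form still accepts the top and year fields described above. build_request is a hypothetical helper name, and nothing is sent over the network until the request is passed to urlopen.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'


def build_request(year, top=25):
    """Build (but do not send) a POST request for one year's name list.

    build_request is a hypothetical helper; the field names 'top' and
    'year' are the ones observed in the browser's network traffic.
    """
    post_params = {'top': str(top), 'year': str(year)}
    # In Python 3 the POST body must be bytes, not str.
    post_args = urlencode(post_params).encode('ascii')
    return Request(URL, data=post_args)


request = build_request(1881)
print(request.get_method())  # a Request that carries data defaults to POST
print(request.data)
```

Passing the result to urllib.request.urlopen(request) would then perform the actual download.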

If you need to send headers as well:

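import urllib2

headers = {
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' : 'en-US,en;q=0.5',
    'Connection' : 'keep-alive',
    'Host' : 'www.ssa.gov',
    'Referer' : 'http://www.ssa.gov/cgi-bin/popularnames.cgi',
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0'
}

# urllib.urlopen() takes no headers argument (its third parameter is proxies),
# so build a urllib2.Request and open that instead.
# With POST data:
request = urllib2.Request(url, post_args, headers)
response = urllib2.urlopen(request)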
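
For Python 3, headers can be attached the same way through urllib.request.Request. This is only a sketch: the trimmed header set is an assumption, and which headers the server actually requires would have to be checked against the site. The request is built here but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'
post_args = urlencode({'top': '25', 'year': '1881'}).encode('ascii')

# A trimmed-down header set; whether the server needs any of these
# is an assumption, not something confirmed by the site.
headers = {
    'Referer': url,
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) '
                  'Gecko/20100101 Firefox/21.0',
}

request = Request(url, data=post_args, headers=headers)

# Request stores header names capitalized ('User-Agent' -> 'User-agent').
for name, value in request.header_items():
    print(name, '=', value)
```

Calling urllib.request.urlopen(request) would send the POST with these headers.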

Execute the code in a loop:

for year in xrange(1880, 2014):
    # The above code...
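
Put together, the loop might look like the sketch below, assuming Python 3. fetch_year and scrape_years are hypothetical helper names, and the one-second pause between requests is an added courtesy rather than anything the site documents. The fetch function is passed in as a parameter so the loop can be tried without touching the network.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request, urlopen

URL = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'


def fetch_year(year, top=25):
    """Download the result page for one year (performs a network request)."""
    data = urlencode({'top': str(top), 'year': str(year)}).encode('ascii')
    with urlopen(Request(URL, data=data)) as response:
        return response.read()


def scrape_years(years, fetch=fetch_year, delay=1.0):
    """Return {year: page bytes}, pausing between requests.

    fetch is injectable so the loop can be exercised offline.
    """
    pages = {}
    for year in years:
        pages[year] = fetch(year)
        time.sleep(delay)  # be polite to the server
    return pages


# Offline dry run with a stand-in fetch function instead of fetch_year.
demo = scrape_years(range(1880, 1883),
                    fetch=lambda y: b'page for %d' % y,
                    delay=0)
print(sorted(demo))
```

For a real run, scrape_years(range(1880, 2014)) would cover the same years as the loop above.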