Python Selenium: get Developer Tools→Network→Media logs

2024/10/4 19:22:09

I am trying to programmatically do something that necessarily involves getting the "developer tools"→network→media logs.

I will spare you the details. Long story short, I need to visit thousands of pages like this: https://music.163.com/#/song?id=ID, where ID after the equals sign is a number.

If you open such a page, there will be a play button. The button triggers JavaScript that loads a music file that is not referenced anywhere in the page, and plays it. (Note: you may need a Chinese IP to listen to some songs, and a VIP account to listen to some others.)

For example, this page: https://music.163.com/#/song?id=32477986, it should look like this:

[Screenshot: the song page with its play button]

If you click the blue button, the JavaScript is triggered, and the music file is loaded and played. This music file is not an element in the webpage and therefore can't be scraped directly with the find_element* methods.

But I have found a way to find the address of the music file.

In Firefox, press F12 to bring up the inspector/"developer tools", click network, then click media. Click the blue button and several requests will be shown with the same file name. The file name will match ^[0-9a-f]+\.m4a, and the domains may differ.

Like this:

[Screenshot: the Network→Media tab listing the .m4a requests]

Click any of the records and you will find its address. Any of these addresses will work, like this:

[Screenshot: the address of one selected request]
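The file-name pattern mentioned above can also be checked in code. Here is a minimal sketch, assuming the address is an ordinary URL whose last path segment is the file name. The helper name is_media_url and the example URLs are made up for illustration, and the trailing $ is an addition that is not part of the pattern as quoted.

```python
import re
from urllib.parse import urlsplit

# Pattern quoted in the question; the trailing "$" is an added assumption so
# that nothing may follow the ".m4a" extension.
MEDIA_NAME = re.compile(r"^[0-9a-f]+\.m4a$")


def is_media_url(url: str) -> bool:
    """Return True if the last path segment looks like a hex-named .m4a file."""
    # urlsplit separates the path from any query string or fragment.
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    return MEDIA_NAME.match(name) is not None


# Hypothetical URLs, for illustration only.
print(is_media_url("https://m1.example.com/a/b/0f9e8d7c6b5a.m4a"))       # True
print(is_media_url("https://m2.example.com/a/b/0f9e8d7c6b5a.m4a?x=1"))   # True
print(is_media_url("https://example.com/static/logo.png"))               # False
```

Matching on the file name rather than the whole URL fits the observation that the domain can differ between requests.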

And I am currently trying to figure out how to programmatically simulate this process.

I Googled this: python selenium developer tools network tab, and didn't find what I was looking for, which is exactly what I expected. I posted the link to show my research effort, and how Google doesn't understand the meaning of what you are trying to search for.

Anyway I stumbled upon this: https://www.rkengler.com/how-to-capture-network-traffic-when-scraping-with-selenium-and-python/

And tested with these:

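import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Note: find_element_by_xpath and desired_capabilities are Selenium 3 era APIs;
# later Selenium 4 releases removed them.
# Ask chromedriver to record the performance log, which includes network events.
capabilities = DesiredCapabilities.CHROME.copy()
capabilities["goog:loggingPrefs"] = {'performance': "ALL"}
driver = webdriver.Chrome(desired_capabilities=capabilities)
wait = WebDriverWait(driver, 15)

driver.get('https://music.163.com/#/song?id=32477986')
# The play button is inside the g_iframe iframe, so switch into it first.
iframe = driver.find_element_by_xpath('//iframe[@id="g_iframe"]')
driver.switch_to.frame(iframe)
# Wait for the play button to become visible, then click it.
wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
play = driver.find_element_by_xpath('//div[2]/div/a[1]')
play.click()
# Give the page time to request the media file before reading the log.
time.sleep(10)
driver.get_log('performance')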

It worked, but the output is too broad, and I prefer using Firefox.

I then tried to find all valid loggingPrefs options by Googling: chrome all "loggingPrefs" options. Unfortunately but unsurprisingly, I could find nothing except browser:ALL and driver:ALL.

And I can't find any documentation that specifies all the possible switches.

But I thought I might have found a pattern: performance is a tab in the inspector/devtools, and network is another tab.

So I replaced the two occurrences of 'performance' with 'network' and ran the code again:

InvalidArgumentException: Message: invalid argument: log type 'network' not found
  (Session info: chrome=89.0.4389.90)

This is what I got.

Regardless, this is what I had put together:

import os
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.headless = True
# Raw string so the backslashes in the Windows path are taken literally.
path = (os.environ['APPDATA'] + r'\Mozilla\Firefox\Profiles\Selenium').replace('\\', '/')
profile = webdriver.FirefoxProfile(path)
profile.set_preference("media.volume_scale", "0.0")

capabilities = DesiredCapabilities.FIREFOX
capabilities["loggingPrefs"] = {'performance': 'ALL'}

Firefox = webdriver.Firefox(firefox_profile=profile, desired_capabilities=capabilities, options=options)
wait = WebDriverWait(Firefox, 15)
Firefox.get('https://music.163.com/#/song?id=32477986')
iframe = Firefox.find_element_by_xpath('//iframe[@id="g_iframe"]')
Firefox.switch_to.frame(iframe)
wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
play = Firefox.find_element_by_xpath('//div[2]/div/a[1]')
play.click()
time.sleep(10)
Firefox.get_log('performance')

And this is how it failed:

WebDriverException: Message: HTTP method not allowed

How on earth can I get the Network→Media logs using Python selenium? I can't even make the logging preferences work. Everything I have found uses the 'loggingPrefs' key, and as you can see it doesn't work. I seem to vaguely remember gecko:loggingPrefs, but I can't find anything by Googling "gecko:loggingPrefs".

And this comment on "Getting console.log output from Firefox with Selenium" mentions that driver.get_log('browser') will not work anymore. But it's unclear whether that applies only to browser or to all the logs.

How can I get the Firefox inspector logs and how can I narrow it down to network→media tab after that?

I am really sorry if I haven't shown enough research effort, but how the hell am I going to research something online without using Google? Don't you know from your own experience that Google never understands the meaning of your search terms? It only finds documents in which the keywords are scattered around at random, and the results don't even have to contain all the keywords!

Google really is a bad research tool, and I really don't have anything better. So if that's not enough research effort, then I don't know what would qualify.

So how can I get inspector→network→media logs in Firefox using Python 3.9.5 selenium?


And Google leads me here, and frankly the on-site search engine is even worse than Google. I can't find the answer I am looking for, which is precisely why I am asking here.


After some more research I have finally found something: https://stackoverflow.com/a/65538568/15290516

This answer takes me one step closer to my goal, but I don't know a thing about JavaScript, and my test returned:

JavascriptException: Message: Cyclic object value

But it does point in the right direction: the solution should involve .execute_script(), but I don't know exactly what the commands should be. I tried Googling this: javascript get "devtools" "network" "media" "logs"; see for yourself what it returns.


Hmm, I managed to get the performance log with Chrome and redirect it to a text file, which I uploaded to Google Drive.

I have found the address in the file (searching for .m4a in Notepad++), but I don't know how to programmatically filter the result down to the requests relevant to the music file.

I think, for now I will be stuck with Chrome and performance log.

But I really have no idea how to filter the requests to get only the relevant requests. How can that be done?

Answer

Finally I have done it, all by myself, without anybody's help.

The trick is simple: once you know what to do, it isn't so hard to achieve.

The log entries carry JSON, so we need the json module.

The structure of the JSON varies, but the first-level keys are fixed. There are always three keys: level, message, timestamp.

We need the message key. Its value is a JSON object packed in a string, so we need json.loads to unpack it.

The structure of these unpacked objects varies a lot, but there is always a message key, and a method key inside that message key.
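To make the two levels of "message" concrete, here is a small sketch of the unpacking step. The entry below is written by hand in the shape described above; its values are made up and are not real browser output.

```python
import json

# A hand-written entry shaped like the description above (hypothetical values).
entry = {
    "level": "INFO",
    "message": json.dumps({
        "message": {
            "method": "Network.responseReceived",
            "params": {
                "response": {
                    "mimeType": "audio/mp4",
                    "url": "https://example.com/0a1b2c.m4a",
                }
            },
        }
    }),
    "timestamp": 0,
}

# The outer "message" value is a string, so it has to be parsed first.
unpacked = json.loads(entry["message"])
# The parsed object has its own "message" key, which holds "method".
inner = unpacked["message"]

print(sorted(entry))      # ['level', 'message', 'timestamp']
print(inner["method"])    # Network.responseReceived
```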

Here we are trying to scrape the addresses of received media files, and long story short, the message→message→method key should equal 'Network.responseReceived'.

If the message→message→method key equals 'Network.responseReceived', then there will always be a message→message→params→response→mimeType key.

That key stores the MIME type of the resource. I will spare you the details; I know .mp4 comes from MPEG-4 (Moving Picture Experts Group) and is usually thought of as a video format, but here the MIME type should be 'audio/mp4'.

If all the above criteria are satisfied, then the address of the media file is the value of the message→message→params→response→url key.
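The criteria above can be collected into one function that works on any list of log entries, which also makes the filter testable without a browser. This is only a sketch: the names media_urls and fake_entry are hypothetical, the sample entries are invented, and the use of .get (so that an event lacking these keys is skipped rather than raising KeyError) is a defensive choice rather than something the log format is known to require.

```python
import json


def media_urls(entries, mime_type="audio/mp4"):
    """Return URLs of received responses whose MIME type equals mime_type."""
    urls = []
    for entry in entries:
        inner = json.loads(entry["message"])["message"]
        if inner.get("method") != "Network.responseReceived":
            continue
        # .get keeps an event without these keys from raising KeyError.
        response = inner.get("params", {}).get("response", {})
        if response.get("mimeType") == mime_type and "url" in response:
            urls.append(response["url"])
    return urls


def fake_entry(method, mime_type, url):
    """Build a made-up log entry in the shape described above (test data only)."""
    inner = {"method": method,
             "params": {"response": {"mimeType": mime_type, "url": url}}}
    return {"level": "INFO",
            "message": json.dumps({"message": inner}),
            "timestamp": 0}


sample = [
    fake_entry("Network.responseReceived", "audio/mp4", "https://a.example.com/0a1b.m4a"),
    fake_entry("Network.responseReceived", "text/html", "https://example.com/index.html"),
    fake_entry("Network.requestWillBeSent", "audio/mp4", "https://b.example.com/0a1b.m4a"),
]
print(media_urls(sample))   # ['https://a.example.com/0a1b.m4a']
```

With a live session, the list returned by get_log('performance') would be passed in place of sample.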

This is the final code:

import json
import os
import random
import sys
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

path = (os.environ['LOCALAPPDATA'] + '\\Google\\Chrome\\User Data')

options = webdriver.ChromeOptions()
options.add_argument('--disable-gpu')
options.add_argument('--headless')
options.add_argument('--log-level=3')
options.add_argument('--mute-audio')
options.add_argument(f'--user-data-dir={path}')

capabilities = DesiredCapabilities.CHROME
capabilities["goog:loggingPrefs"] = {'performance': 'ALL'}

Chrome = webdriver.Chrome(options=options, desired_capabilities=capabilities)
wait = WebDriverWait(Chrome, 5)

def getlink(addr):
    Chrome.get(addr)
    iframe = Chrome.find_element_by_xpath('//iframe[@id="g_iframe"]')
    Chrome.switch_to.frame(iframe)
    wait.until(EC.visibility_of_element_located((By.XPATH, '//div[2]/div/a[1]')))
    play = Chrome.find_element_by_xpath('//div[2]/div/a[1]')
    play.click()
    time.sleep(5)
    logs = Chrome.get_log('performance')
    addresses = []
    for i in logs:
        log = json.loads(i['message'])
        if log['message']['method'] == 'Network.responseReceived':
            if log['message']['params']['response']['mimeType'] == 'audio/mp4':
                addresses.append(log['message']['params']['response']['url'])
    # Only return a link when every address points to the same file name.
    check = set([i.split('/')[-1] for i in addresses])
    if len(check) == 1:
        return random.choice(addresses)

if __name__ == '__main__':
    print(getlink(sys.argv[1]))
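Note that getlink returns None when the collected addresses do not all share one file name, or when none were collected. To inspect what was collected in that case, the addresses can be grouped by file name. The sketch below is one possible way to do that: group_by_file_name is a hypothetical helper, the addresses are made up, and unlike the split('/') used in getlink it ignores any query string, which is an assumption about what should count as the same file.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_file_name(urls):
    """Map each file name (last path segment, query string ignored) to its URLs."""
    groups = defaultdict(list)
    for url in urls:
        name = urlsplit(url).path.rsplit("/", 1)[-1]
        groups[name].append(url)
    return dict(groups)


# Made-up addresses, for illustration only.
addresses = [
    "https://m7.example.com/x/0a1b2c.m4a",
    "https://m8.example.com/y/0a1b2c.m4a",
    "https://m8.example.com/y/ffee99.m4a",
]
for name, urls in group_by_file_name(addresses).items():
    print(name, len(urls))
```

A result with more than one key would explain why getlink printed None for that page.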