I am using the Scrapy shell on the page "Pittsburgh Steelers at New England Patriots - September 10th, 2015" to pull individual team stats. For example, I want to pull the total yards for the away team (464). Inspecting that element and copying its XPath yields
//*[@id="team_stats"]/tbody/tr[5]/td[1]
but when I run
response.xpath('//*[@id="team_stats"]/tbody/tr[5]/td[1]')
nothing is returned. I noticed that this table is in a separate div from the initial data, so I'm not sure whether I need to start higher up. Even just a search on the
//*[@id="team_stats"]
xpath returns nothing. Any help would be greatly appreciated.
The problem you are running into, as in most cases like this, is that the website uses JavaScript to render the complete game information. This means that Scrapy does not see the page the way you see it when you open it in your browser.
Because Scrapy does not run any JavaScript after loading the page, the table with the ID team_stats is never rendered. The contents of the "Team Stats" table are present in the downloaded HTML, but they are commented out.
One solution is to extract the comment that contains the team statistics, parse the comment text as HTML, and extract the data from there.
response.xpath('//div[@id="all_team_stats"]//comment()').extract()
The expression above extracts the comment that contains the table you need.
For future analysis I recommend using Chrome's Developer Tools, where you can disable JavaScript and then reload the site. This shows the page's content roughly as Scrapy sees it.
EDIT
After you extract the comment you can feed it into a new selector just like Markus mentioned in his comment:
new_selector = Selector(text=extracted_text)
On this new selector you can again use .xpath() as you would on the response object.
Removing the comment delimiters is easy: strip them from the beginning and the end of the extracted text, which is a string. Comments in HTML start with <!-- and end with -->. You need to feed the text between these delimiters to the new selector.
Extending the example from above:
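from scrapy.selector import Selector
# First comment node inside the wrapper div, as a string including <!-- and -->
extracted_text = response.xpath('//div[@id="all_team_stats"]//comment()').extract()[0]
# Drop the leading "<!--" (4 chars) and trailing "-->" (3 chars), then parse the rest as HTML
new_selector = Selector(text=extracted_text[4:-3].strip())
new_selector.xpath('//*[@id="team_stats"]/tbody/tr[5]/td[1]').extract()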
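The [4:-3] slice assumes the extracted string starts exactly with <!-- and ends exactly with -->. If there might be surrounding whitespace, a slightly more defensive variant could look like the sketch below. The helper name strip_comment_delimiters is hypothetical and not part of Scrapy; it uses only the standard library.

```python
import re


def strip_comment_delimiters(comment_text):
    """Return the text inside an HTML comment, or the stripped input
    unchanged if it is not wrapped in <!-- ... -->."""
    match = re.fullmatch(r'\s*<!--(.*?)-->\s*', comment_text, flags=re.DOTALL)
    return match.group(1).strip() if match else comment_text.strip()


# Made-up sample input, shaped like the string returned by comment().extract()
sample = '<!--\n<table id="team_stats"></table>\n-->'
inner = strip_comment_delimiters(sample)
print(inner)
```

The returned string can then be passed to Selector(text=...) in place of extracted_text[4:-3].strip().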