import requests
from bs4 import BeautifulSoup
import re

source_url = requests.get('http://www.nytimes.com/pages/business/index.html')
div_classes = {'class' :['ledeStory' , 'story']}
title_tags = ['h2','h3','h4','h5','h6']

source_text = source_url.text
soup = BeautifulSoup(source_text, 'html.parser')

stories = soup.find_all("div", div_classes)

h = []; h2 = []; h3 = []; h4 =[]

for x in range(len(stories)):
    for x2 in range(len(title_tags)):
        hold = []; hold2 = []
        hold = stories[x].find(title_tags[x2])
        if hold is not None:
            hold2 = hold.find('a')
            if hold2 is not None:
                hh = (((hold.text.strip('a'))).strip())
                h.append(hh)
                #h.append(re.sub(r'[^\x00-\x7f]',r'', ((hold.text.strip('a'))).strip()))
                #h2.append(hold2.get('href'))
    hold = []
    hold = stories[x].find('p')
    if hold is not None:
        h3.append(re.sub(r'[^\x00-\x7f]',r'',((hold.text.strip('p')).strip())))
    else:
        h3.append('None')

h4.append(h)
h4.append(h2)
h4.append(h3)
print(h4)

Hey everyone. I have been working on a scraper and had almost finished it when I noticed that the printed output replaces the apostrophe (') with (â\x80\x99). For example, a title containing "China's" comes out as "Chinaâ\x80\x99s". I did some research and tried decode/encode with utf-8, to no avail: Python just tells me that I cannot call decode on a str. I also tried re.sub(), which let me delete (â\x80\x99) but would not let me replace it with ('). Since I want to use natural language processing to interpret the data, I fear that losing the apostrophes will greatly change the meaning. Any help would be greatly appreciated; I feel like I have hit a wall with this one.
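
For anyone who wants to see the symptom without hitting the site, here is a minimal sketch that seems to reproduce it. It rests on an assumption I have not confirmed for this page: that the title uses a curly apostrophe (U+2019), the page is served as UTF-8, and the bytes get decoded as Latin-1 somewhere along the way. The variable names are made up for the illustration.

```python
# Assumption: the page contains a curly apostrophe (U+2019), not a plain '.
original = "China\u2019s"

# Assumption: the UTF-8 bytes of the page are decoded as Latin-1 somewhere.
garbled = original.encode("utf-8").decode("latin-1")

# repr() is what print() shows for strings inside a list, as in print(h4).
print(repr(garbled))

# In Python 3 a str has no decode() method, which matches the error I got.
print(hasattr(garbled, "decode"))

# Round-trip check of the assumption: undoing the wrong decode step
# gives back the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

If the assumption holds, the garbled characters are the three UTF-8 bytes of one apostrophe read as three separate Latin-1 characters, so the text is mis-decoded rather than lost. Whether reversing it like this is the right fix inside the scraper itself, I am not sure.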