Question 1

import re

import requests
from bs4 import BeautifulSoup

source_url = requests.get('http://www.nytimes.com/pages/business/index.html')
div_classes = {'class': ['ledeStory', 'story']}
title_tags = ['h2', 'h3', 'h4', 'h5', 'h6']

source_text = source_url.text
soup = BeautifulSoup(source_text, 'html.parser')
stories = soup.find_all("div", div_classes)

h = []; h2 = []; h3 = []; h4 = []

for x in range(len(stories)):
    # Look for a linked headline under each candidate heading tag.
    for x2 in range(len(title_tags)):
        hold = []; hold2 = []
        hold = stories[x].find(title_tags[x2])
        if hold is not None:
            hold2 = hold.find('a')
            if hold2 is not None:
                hh = (((hold.text.strip('a'))).strip())
                h.append(hh)
                #h.append(re.sub(r'[^\x00-\x7f]',r'', ((hold.text.strip('a'))).strip()))
                #h2.append(hold2.get('href'))
    # Take the first paragraph as the summary, stripping non-ASCII characters.
    hold = []
    hold = stories[x].find('p')
    if hold is not None:
        h3.append(re.sub(r'[^\x00-\x7f]',r'',((hold.text.strip('p')).strip())))
    else:
        h3.append('None')

h4.append(h)
h4.append(h2)
h4.append(h3)
print(h4)

Hey everyone. I have been wanting to scrape some data, and I had almost completed my scraper when I noticed the printed output was replacing (') with (â\x80\x99). For example, a title containing "China's" was coming out as "Chinaâ\x80\x99s". I did some research and tried to use decode/encode (utf-8), to no avail: it would just tell me that you cannot run decode on a str(). I tried using re.sub(), which would let me delete (â\x80\x99) but would not let me replace it with ('). Since I want to use natural language processing to interpret the data, I fear that not having apostrophes is going to greatly change the meaning. Help would be greatly appreciated; I feel like I have hit a block with this one.

Answer

In ISO 8859-1 and related code sets (there are many of them), â has code point 0xE2. When you interpret the three bytes 0xE2, 0x80, 0x99 as a UTF-8 encoding, the character is U+2019, RIGHT SINGLE QUOTATION MARK (which is ’, as distinct from the ASCII apostrophe '; you may or may not be able to spot the difference).
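
The byte-level mix-up can be reproduced in a few lines. This is a minimal sketch, assuming the text was UTF-8 on the wire and was decoded as ISO 8859-1 (Latin-1) somewhere along the way; the variable names are illustrative only.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, as it appears in the headline.
original = "China\u2019s"

# UTF-8 encodes U+2019 as the three bytes 0xE2 0x80 0x99.
raw = original.encode("utf-8")

# Decoding those bytes as Latin-1 maps each byte to one character:
# 0xE2 becomes 'â', while 0x80 and 0x99 become invisible C1 controls.
garbled = raw.decode("latin-1")

# ascii() escapes non-ASCII characters, so this prints safely on any terminal.
print(ascii(garbled))

# Assumption: nothing else altered the string, so the damage is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")
print(ascii(repaired))
```

If the round trip restores the original text, the data itself was fine and only the decoding step was wrong.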

I see a few possibilities, any one or more of which could be the source of your trouble:

Your terminal is not set up to interpret UTF-8.
Your source code should use ' (U+0027, APOSTROPHE).
You're using Python 2.x rather than Python 3.x and it is having issues because of the use of Unicode (UTF-8). Against this (as Cory Madden pointed out), the code ends with print(h4), which suggests Python 3, so this probably isn't the issue.
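
Two of the possibilities above (terminal encoding and Python version) can be checked directly from the interpreter. The helper below is a hypothetical sketch, not part of the original script; it only reports what Python itself sees.

```python
import sys


def describe_environment():
    """Report the interpreter version and the encoding used for stdout."""
    return {
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        # May be None when stdout is not a regular text stream.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


info = describe_environment()
print(info)
```

A stdout encoding other than UTF-8 would point toward the terminal rather than the scraper.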

It may be simplest to change the quotation mark into an ASCII apostrophe.
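
A plain string replacement is probably enough for that. The sketch below uses a hypothetical helper, to_ascii_apostrophe, and assumes the quote may arrive either as a clean U+2019 or in the garbled three-character form from the question.

```python
RIGHT_SINGLE_QUOTE = "\u2019"
# The same character after its UTF-8 bytes were misread as Latin-1.
GARBLED_QUOTE = "\u00e2\u0080\u0099"


def to_ascii_apostrophe(text):
    """Replace U+2019, in clean or garbled form, with U+0027."""
    return text.replace(GARBLED_QUOTE, "'").replace(RIGHT_SINGLE_QUOTE, "'")


print(to_ascii_apostrophe("China\u2019s"))
print(to_ascii_apostrophe("China\u00e2\u0080\u0099s"))
```

Unlike deleting every non-ASCII character with re.sub(), this keeps the apostrophe in place for later language processing.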

On the other hand, if you are analyzing HTML from elsewhere, you may have to consider how your script is going to handle UTF-8. Using quote marks from the Unicode U+20xx range is a very common choice; maybe your scraper needs to handle it?
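
One way to handle it is to decode the page bytes as UTF-8 explicitly rather than relying on a guessed encoding. The function below is a hypothetical, standard-library-only sketch. In the scraper from the question, the equivalent would presumably be passing source_url.content (bytes) to BeautifulSoup, or setting source_url.encoding = 'utf-8' before reading source_url.text; whether the page is actually served as UTF-8 is an assumption here.

```python
def decode_html(raw, fallback="latin-1"):
    """Decode HTML bytes as UTF-8, using the fallback only if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


# Stand-in for the bytes of a downloaded page.
page_bytes = "<h2>China\u2019s economy</h2>".encode("utf-8")
page_text = decode_html(page_bytes)
print(ascii(page_text))
```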

Apostrophes are printing out as \x80\x99
