Apostrophes are printing out as â\x80\x99

2024/9/19 17:46:46
import re

import requests
from bs4 import BeautifulSoup

source_url = requests.get('http://www.nytimes.com/pages/business/index.html')
div_classes = {'class': ['ledeStory', 'story']}
title_tags = ['h2', 'h3', 'h4', 'h5', 'h6']

source_text = source_url.text
soup = BeautifulSoup(source_text, 'html.parser')

stories = soup.find_all("div", div_classes)

# h: titles, h2: links (currently disabled), h3: summaries, h4: combined result
h = []; h2 = []; h3 = []; h4 = []

for x in range(len(stories)):
    # Look for a linked title under each heading level
    for x2 in range(len(title_tags)):
        hold = []; hold2 = []
        hold = stories[x].find(title_tags[x2])

        if hold is not None:
            hold2 = hold.find('a')

            if hold2 is not None:
                hh = (((hold.text.strip('a'))).strip())
                h.append(hh)
                #h.append(re.sub(r'[^\x00-\x7f]',r'', ((hold.text.strip('a'))).strip()))
                #h2.append(hold2.get('href'))

    # Grab the story summary, with non-ASCII characters stripped out
    hold = []
    hold = stories[x].find('p')

    if hold is not None:
        h3.append(re.sub(r'[^\x00-\x7f]', r'', ((hold.text.strip('p')).strip())))
    else:
        h3.append('None')

h4.append(h)
h4.append(h2)
h4.append(h3)
print(h4)

Hey everyone. I have been wanting to scrape some data, and I had almost completed my scraper when I noticed the printed output was replacing (') with (â\x80\x99). For example, a title containing "China's" was coming out as "Chinaâ\x80\x99s". I did some research and tried to use decode/encode (utf-8), to no avail: it would just tell me that you cannot run decode on a str(). I tried using re.sub(), which would let me delete (â\x80\x99) but would not let me replace it with ('). Since I want to use natural language processing to interpret the data, I fear that not having apostrophes is going to greatly change the meaning. Help would be greatly appreciated; I feel like I have hit a block with this one.

Answer

In ISO 8859-1 and related code sets (there are many of them), â has code point 0xE2. When you interpret the three bytes 0xE2, 0x80, 0x99 as a UTF-8 encoding, the character is U+2019, RIGHT SINGLE QUOTATION MARK (which is ’, as distinct from the ASCII apostrophe '; you may or may not be able to spot the difference). In other words, the output you see is what the UTF-8 bytes for ’ look like when each byte is read as a separate ISO 8859-1 character.
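The byte-level explanation above can be checked directly in Python 3. This is a minimal sketch: the sample bytes are built by hand from the title in the question rather than fetched from the site, and `ascii()` is used so the result is readable even on a terminal that cannot display U+2019.

```python
# The UTF-8 encoding of "China’s": the apostrophe is the three bytes E2 80 99.
raw = b"China\xe2\x80\x99s"

# Decoded as UTF-8, the three bytes become one character, U+2019.
as_utf8 = raw.decode("utf-8")

# Decoded as ISO 8859-1 (latin-1), each byte becomes its own character:
# 0xE2 is "â", while 0x80 and 0x99 are unprintable control characters.
as_latin1 = raw.decode("latin-1")

print(ascii(as_utf8))    # 'China\u2019s'
print(ascii(as_latin1))  # 'China\xe2\x80\x99s'
```

The second line of output matches the "Chinaâ\x80\x99s" seen in the question, which suggests the page bytes were UTF-8 but were decoded as a single-byte code set somewhere along the way.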

I see a few possible sources of your trouble, any one or more of which could apply:

  1. Your terminal is not set up to interpret UTF-8.
  2. Your source code should use ' (U+0027, APOSTROPHE).
  3. You're using Python 2.x rather than Python 3.x and it is having issues because of the use of Unicode (UTF-8). Against this (as Cory Madden pointed out), the code ends with print(h4), which is Python 3 syntax, so this probably isn't the issue.
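Possibilities 1 and 3 can be checked quickly from inside the script. The sketch below is only a diagnostic aid; the helper name `describe_environment` is made up for illustration and is not part of the original code.

```python
import sys


def describe_environment():
    """Report the interpreter version and the encoding used for printed output."""
    return {
        # 2 would point at possibility 3; 3 rules it out.
        "python_major": sys.version_info.major,
        # Something other than a UTF-8 variant would point at possibility 1.
        # This can be None if stdout has been replaced by an object
        # that does not define an encoding.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    print(describe_environment())
```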

It may be simplest to change the quotation mark into an ASCII apostrophe.
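One way to do that replacement, once the text has been decoded correctly, is a translation table. This is a sketch under the assumption that only the four common curly quotes matter; the names `ASCII_QUOTES` and `to_ascii_quotes` are hypothetical.

```python
# Map the common "smart" quotation marks to their ASCII equivalents.
ASCII_QUOTES = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK (the one in "China’s")
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def to_ascii_quotes(text):
    """Return text with curly quotes replaced by ASCII quotes."""
    return text.translate(ASCII_QUOTES)
```

Calling `to_ascii_quotes(hh)` before `h.append(hh)` would keep the apostrophes for later natural language processing instead of deleting them with re.sub().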

On the other hand, if you are analyzing HTML from elsewhere, you may have to consider how your script is going to handle UTF-8. Using quote marks from the Unicode U+20xx range is a very common choice; maybe your scraper needs to handle it?
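If the scraper is to handle UTF-8 itself, one approach is to decode the raw page bytes explicitly instead of trusting a guessed encoding. The sketch below assumes the page really is UTF-8, which the question does not confirm; both function names are hypothetical. With requests, the raw bytes are available as `source_url.content`, whereas `source_url.text` is decoded with whatever encoding requests inferred.

```python
def decode_utf8_first(raw, fallback="latin-1"):
    """Decode bytes as UTF-8, falling back to a single-byte code set."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode(fallback)


def repair_mojibake(text):
    """Undo a UTF-8 string that was mistakenly decoded as latin-1.

    Text that does not fit this pattern is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

`decode_utf8_first` is the preferred fix because it avoids the bad decode in the first place. `repair_mojibake` is a fallback for strings that have already been damaged, and it can misfire on the rare legitimate latin-1 text whose bytes happen to form valid UTF-8.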

