python check if utf-8 string is uppercase

2024/10/3 4:42:46

I am having trouble with .isupper() when I have a utf-8 encoded string. I have a lot of text files I am converting to xml. While the text is very variable, the format is static: words in all caps should be wrapped in <title> tags and everything else in <p>. It is considerably more complex than this, but this should be sufficient for my question.

My problem is that this is a utf-8 file. This is a must, as there will be many non-English characters in the final output. This may be the time to provide a brief example:

inputText.txt

RÉSUMÉ

Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage. Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip.

DesiredOutput

    <title>RÉSUMÉ</title><p>Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage. Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip.</p>

Sample Code

    #!/usr/local/bin/python2.7
    # yes this is an alt-install of python

    import codecs
    import sys
    import re
    from xml.dom.minidom import Document

    def main():
        fn = sys.argv[1]
        input = codecs.open(fn, 'r', 'utf-8')
        output = codecs.open('desiredOut.xml', 'w', 'utf-8')
        doc = Document()
        doc = parseInput(input, doc)
        print>>output, doc.toprettyxml(indent='  ', encoding='UTF-8')

    def parseInput(input, doc):
        tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n']  # remove blank lines

        for i in range(len(tokens)):
            # THIS IS MY PROBLEM. .isupper() is never true.
            if str(tokens[i]).isupper():
                title = doc.createElement('title')
                tText = str(tokens[i]).strip('[\']')
                titleText = doc.createTextNode(tText.title())
                doc.appendChild(title)
                title.appendChild(titleText)
            else:
                p = doc.createElement('p')
                pText = str(tokens[i]).strip('[\']')
                paraText = doc.createTextNode(pText)
                doc.appendChild(p)
                p.appendChild(paraText)

        return doc

    if __name__ == '__main__':
        main()

Ultimately it is pretty straightforward. I would accept critiques or suggestions on my code. Who wouldn't? In particular I am unhappy with str(tokens[i]); perhaps there is a better way to loop through a list of strings?

But the purpose of this question is to figure out the most efficient way to check if a utf-8 string is capitalized. Perhaps I should look into crafting a regex for this.

Do note, I did not run this code and it may not run just right. I hand-picked the parts from working code and may have mistyped something. Alert me and I will correct it. Lastly, note that I am not using lxml.

Answer

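The primary reason that your published code fails (even with only ascii characters!) is that re.split() will not split on a zero-width match, and r'\b' matches zero characters. (This is the behaviour of the Python 2.7 you are using; from Python 3.7 onward re.split() does split on empty matches.)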

>>> re.split(r'\b', 'foo-BAR_baz')
['foo-BAR_baz']
>>> re.split(r'\W+', 'foo-BAR_baz')
['foo', 'BAR_baz']
>>> re.split(r'[\W_]+', 'foo-BAR_baz')
['foo', 'BAR', 'baz']

Also, you need flags=re.UNICODE to ensure that Unicode definitions of \b and \W etc are used. And using str() where you did is at best unnecessary.
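To make those two points concrete, here is a minimal sketch. The sample string and the variable names are only illustrative, not taken from your files. It is written so that it should run unchanged on Python 3, where str patterns already use Unicode rules and the flag is redundant but harmless.

```python
import re

# Illustrative input only: u'RÉSUMÉ foo-BAR_baz', written with escapes
# so the source file itself stays pure ASCII.
line = u'R\xc9SUM\xc9 foo-BAR_baz'

# Split on runs of non-word characters and underscores. re.UNICODE makes
# \W use Unicode definitions, so the accented capitals stay inside their
# token. Empty strings are filtered out in case the line starts or ends
# with a separator.
tokens = [t for t in re.split(r'[\W_]+', line, flags=re.UNICODE) if t]

# Call .isupper() on each unicode token directly; no str() is needed.
upper_flags = [t.isupper() for t in tokens]

print(tokens)
print(upper_flags)
```

Iterating over the tokens themselves (for t in tokens) rather than over range(len(tokens)) also removes the need to index, which is the part of the loop you said you were unhappy with.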

So it wasn't really a Unicode problem per se at all. However some answerers tried to address it as a Unicode problem, with varying degrees of success ... here's my take on the Unicode problem:

The general solution to this kind of problem is to follow the standard bog-simple advice that applies to all text problems: Decode your input from bytestrings to unicode strings as early as possible. Do all processing in unicode. Encode your output unicode into byte strings as late as possible.
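As a rough sketch of that advice applied to the task in the question: the helper names classify and convert are made up for illustration, the tagging is done with plain string formatting rather than minidom, and no XML escaping is attempted. io.open is used because it decodes on read and encodes on write; the sketch assumes Python 3 syntax but should also run on 2.7.

```python
import io
import os
import tempfile


def classify(line):
    """Return 'title' for an all-uppercase line, otherwise 'p' (hypothetical helper)."""
    return 'title' if line.isupper() else 'p'


def convert(src_path, dst_path):
    """Hypothetical helper: tag every non-blank line of a UTF-8 file."""
    # Decode at the input boundary: io.open yields unicode strings.
    with io.open(src_path, 'r', encoding='utf-8') as src:
        lines = [ln.strip() for ln in src if ln.strip()]

    # All processing happens on unicode strings.
    tagged = []
    for ln in lines:
        tag = classify(ln)
        tagged.append(u'<%s>%s</%s>' % (tag, ln, tag))

    # Encode at the output boundary: io.open encodes on write.
    with io.open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write(u'\n'.join(tagged))
    return tagged


# Small demonstration using a temporary directory and made-up content.
tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, 'inputText.txt')
dst_path = os.path.join(tmpdir, 'desiredOut.xml')
with io.open(src_path, 'w', encoding='utf-8') as f:
    f.write(u'R\xc9SUM\xc9\n\nBacon ipsum dolor sit amet\n')

result = convert(src_path, dst_path)
print(result)
```

The point of the layout is that bytes only exist at the two io.open calls; everything in between, including the .isupper() test, sees unicode.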

So: byte_string.decode('utf8').isupper() is the way to go. Hacks like byte_string.decode('ascii', 'ignore').isupper() are to be avoided; they can be all of (complicated, unneeded, failure-prone) -- see below.

Some code:

# coding: ascii
import unicodedata

tests = (
    (u'\u041c\u041e\u0421\u041a\u0412\u0410', True), # capital of Russia, all uppercase
    (u'R\xc9SUM\xc9', True), # RESUME with accents
    (u'R\xe9sum\xe9', False), # Resume with accents
    (u'R\xe9SUM\xe9', False), # ReSUMe with accents
    )

for ucode, expected in tests:
    print
    print 'unicode', repr(ucode)
    for uc in ucode:
        print 'U+%04X %s' % (ord(uc), unicodedata.name(uc))
    u8 = ucode.encode('utf8')
    print 'utf8', repr(u8)
    actual1 = u8.decode('utf8').isupper() # the natural way of doing it
    actual2 = u8.decode('ascii', 'ignore').isupper() # @jathanism
    print expected, actual1, actual2

Output from Python 2.7.1:

unicode u'\u041c\u041e\u0421\u041a\u0412\u0410'
U+041C CYRILLIC CAPITAL LETTER EM
U+041E CYRILLIC CAPITAL LETTER O
U+0421 CYRILLIC CAPITAL LETTER ES
U+041A CYRILLIC CAPITAL LETTER KA
U+0412 CYRILLIC CAPITAL LETTER VE
U+0410 CYRILLIC CAPITAL LETTER A
utf8 '\xd0\x9c\xd0\x9e\xd0\xa1\xd0\x9a\xd0\x92\xd0\x90'
True True False

unicode u'R\xc9SUM\xc9'
U+0052 LATIN CAPITAL LETTER R
U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
U+0053 LATIN CAPITAL LETTER S
U+0055 LATIN CAPITAL LETTER U
U+004D LATIN CAPITAL LETTER M
U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
utf8 'R\xc3\x89SUM\xc3\x89'
True True True

unicode u'R\xe9sum\xe9'
U+0052 LATIN CAPITAL LETTER R
U+00E9 LATIN SMALL LETTER E WITH ACUTE
U+0073 LATIN SMALL LETTER S
U+0075 LATIN SMALL LETTER U
U+006D LATIN SMALL LETTER M
U+00E9 LATIN SMALL LETTER E WITH ACUTE
utf8 'R\xc3\xa9sum\xc3\xa9'
False False False

unicode u'R\xe9SUM\xe9'
U+0052 LATIN CAPITAL LETTER R
U+00E9 LATIN SMALL LETTER E WITH ACUTE
U+0053 LATIN CAPITAL LETTER S
U+0055 LATIN CAPITAL LETTER U
U+004D LATIN CAPITAL LETTER M
U+00E9 LATIN SMALL LETTER E WITH ACUTE
utf8 'R\xc3\xa9SUM\xc3\xa9'
False False True

The only differences with Python 3.x are syntactical -- the principle (do all processing in unicode) remains the same.
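For comparison, a Python 3 rendering of the same check might look like the sketch below. The function name is_upper_utf8 is invented here; in Python 3 the UTF-8 data is a bytes object and the decoded text is a str.

```python
def is_upper_utf8(data: bytes) -> bool:
    """Decode UTF-8 bytes to str, then test case on the decoded text."""
    return data.decode('utf-8').isupper()


samples = [
    ('\u041c\u041e\u0421\u041a\u0412\u0410', True),  # capital of Russia, all uppercase
    ('R\xc9SUM\xc9', True),    # RESUME with accents
    ('R\xe9sum\xe9', False),   # Resume with accents
    ('R\xe9SUM\xe9', False),   # ReSUMe with accents
]

results = []
for text, expected in samples:
    u8 = text.encode('utf-8')
    natural = is_upper_utf8(u8)                    # decode properly, then test
    hack = u8.decode('ascii', 'ignore').isupper()  # drops every non-ASCII byte first
    results.append((expected, natural, hack))
    print(expected, natural, hack)
```

The three columns should line up with the Python 2.7.1 output above: the properly decoded check agrees with the expected value in every case, while the ascii/ignore variant gets the Cyrillic and mixed-case samples wrong.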
