I have this helper function that gets rid of control characters in XML text:
def remove_control_characters(s):
    # Remove control characters in XML text
    t = ""
    for ch in s:
        if unicodedata.category(ch)[0] == "C":
            t += " "
        if ch == "," or ch == "\"":
            t += ""
        else:
            t += ch
    return "".join(ch for ch in t if unicodedata.category(ch)[0] != "C")
I would like to know whether there is a Unicode category that I can use to exclude quotation marks and commas as well.
In Unicode, the general category of control characters is 'Cc', even if they have no name. unicodedata.category() returns the general category, as you can test for yourself in the Python console:
>>> unicodedata.category(u'\x00')
'Cc'
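The comma and the ASCII quotation mark (") both have category Po (other punctuation), which they share with characters such as ! and ., so no category singles them out. Pi and Pf cover only typographic opening and closing quotation marks such as “ and ”.
Your example only tests the first character of the returned code, so test the full category and compare the comma and the ASCII quotation mark explicitly:
cat = unicodedata.category(ch)
if cat in ("Cc", "Pi", "Pf") or ch in (",", '"'):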
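Put together, the whole helper might look something like the sketch below. It assumes Python 3, and the names remove_unwanted_characters, DROPPED_CATEGORIES and DROPPED_CHARS are made up for illustration. Because the comma and the ASCII quotation mark are both in category Po, the sketch lists them explicitly instead of filtering on a category, and it drops unwanted characters rather than replacing them with a space.

```python
import unicodedata

# Hypothetical names, chosen for this sketch only.
# Categories to drop: control characters (Cc) and typographic
# opening/closing quotation marks (Pi, Pf).
DROPPED_CATEGORIES = {"Cc", "Pi", "Pf"}
# The comma and the ASCII quotation mark are both "Po" (other punctuation),
# a category shared with "!" and ".", so they are listed explicitly.
DROPPED_CHARS = {",", '"'}


def remove_unwanted_characters(s):
    """Return s without control characters, commas, or quotation marks."""
    return "".join(
        ch
        for ch in s
        if ch not in DROPPED_CHARS
        and unicodedata.category(ch) not in DROPPED_CATEGORIES
    )
```

Note that tab, carriage return and newline are also in category Cc, so this sketch removes them too. If the XML text should keep them, they would need to be exempted before the category test.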