I want to find abbreviations in a text and remove them. What I am currently doing is identifying runs of consecutive capital letters and removing them.
But I see that it does not remove abbreviations such as MOOCs, M.O.O.C, and M.O.O.Cs. Is there an easy way of doing this in Python? Or are there any libraries that I can use instead?
The re module, Python's standard regex library, is probably the tool for the job.
In order to remove every string of consecutive uppercase letters, the following code can be used:
import re
mytext = "hello, look an ACRONYM"
mytext = re.sub(r"\b[A-Z]{2,}\b", "", mytext)
Here, the regex "\b[A-Z]{2,}\b" searches for multiple consecutive (indicated by [...]{2,}) capital letters (A-Z), forming a complete word (\b...\b). It then replaces them with the second string, "".
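As a quick check of what this first pattern does and does not catch, here is a minimal sketch run against the forms from the question. The names pattern, samples, and results are only illustrative.

```python
import re

# Baseline pattern from above: a whole word made of two or more capitals.
pattern = re.compile(r"\b[A-Z]{2,}\b")

samples = ["ACRONYM", "MOOCs", "M.O.O.C", "M.O.O.Cs"]

# Map each sample to what is left after substitution.
results = {s: pattern.sub("", s) for s in samples}

for original, cleaned in results.items():
    print(repr(original), "->", repr(cleaned))
# 'ACRONYM' -> ''
# 'MOOCs' -> 'MOOCs'
# 'M.O.O.C' -> 'M.O.O.C'
# 'M.O.O.Cs' -> 'M.O.O.Cs'
```

Only the plain all-caps word is removed: the trailing s and the full stops each break the run of capitals, which is why the dotted and plural forms survive.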
The convenient thing about regex is how easily it can be modified for more complex cases. For example:
mytext = re.sub(r"\b[A-Z\.]{2,}\b", "", mytext)
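This will replace consecutive uppercase letters and full stops, removing acronyms like A.B.C.D as well as ABCD. Note that a trailing full stop, as in A.B.C.D., is left behind when it is followed by a space or the end of the text, because \b only matches between a word character and a non-word character. The \ before the . is optional here: inside square brackets . is already a literal full stop, whereas outside them an unescaped . is used by regex as a kind of wildcard.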
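To see how the full stop behaves inside the square brackets, the sketch below compares the pattern with and without the backslash on a made-up sentence. The names escaped, unescaped, and text are just for illustration.

```python
import re

# Same pattern, with and without the backslash before the full stop.
escaped = re.compile(r"\b[A-Z\.]{2,}\b")
unescaped = re.compile(r"\b[A-Z.]{2,}\b")

text = "The A.B.C.D. and ABCD groups met."

print(escaped.sub("", text))    # The . and  groups met.
print(unescaped.sub("", text))  # The . and  groups met.
```

Both versions give the same result, since . is literal inside a character class. The final full stop of A.B.C.D. stays in the text because \b cannot match between that full stop and the following space, and the removed words leave doubled spaces behind.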
The ? quantifier, which makes the preceding item optional, could also be used to remove acronyms that end in s, for example:
mytext = re.sub(r"\b[A-Z\.]{2,}s?\b", "", mytext)
This regex will remove acronyms like ABCD, A.B.C.D, and even A.B.C.Ds. If other forms of acronym need to be removed, the regex can easily be modified to accommodate them.
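Putting it together for the forms in the question, something like the following might work. This is only a sketch: remove_acronyms is a hypothetical helper name, and collapsing the leftover spaces is an extra step that was not part of the patterns above.

```python
import re

# Capitals and full stops, optionally followed by a plural "s".
ACRONYM = re.compile(r"\b[A-Z.]{2,}s?\b")


def remove_acronyms(text: str) -> str:
    """Remove acronym-like tokens, then tidy the spaces they leave behind."""
    stripped = ACRONYM.sub("", text)
    # Removing a word leaves two adjacent spaces; collapse them into one.
    return re.sub(r" {2,}", " ", stripped).strip()


sample = "I took two MOOCs and one M.O.O.C before the M.O.O.Cs ended"
print(remove_acronyms(sample))
# I took two and one before the ended
```

Note that the single capital I is kept, because the pattern requires at least two characters from the bracketed set.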
The re library also includes functions like findall and finditer, which allow programs to locate and process each acronym individually (the match function only checks the start of the string, so it is less suited to this). This might come in handy if you want to, for example, look at a list of the acronyms being removed and check there are no legitimate words there.
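For instance, a rough sketch of that review step could look like this. The KEEP set and the drop_unless_kept function are hypothetical names; the only library features used are findall and the fact that re.sub accepts a function as the replacement.

```python
import re

ACRONYM = re.compile(r"\b[A-Z.]{2,}s?\b")

text = "The MOOCs at MIT were OK, said the M.O.O.C team."

# List every candidate first, so it can be inspected before anything is removed.
found = ACRONYM.findall(text)
print(found)  # ['MOOCs', 'MIT', 'OK', 'M.O.O.C']

# Hypothetical allow-list of matches that should stay in the text.
KEEP = {"OK"}


def drop_unless_kept(match: re.Match) -> str:
    """Return the match unchanged if allow-listed, otherwise an empty string."""
    word = match.group(0)
    return word if word in KEEP else ""


cleaned = ACRONYM.sub(drop_unless_kept, text)
print(cleaned)
```

Passing a function to sub means each match can be decided individually instead of being removed unconditionally.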