Question 1

I want to find abbreviations in a text and remove them. What I am currently doing is identifying runs of consecutive capital letters and removing them.

But I see that this does not remove abbreviations such as MOOCs, M.O.O.C, or M.O.O.Cs. Is there an easy way of doing this in Python? Or are there any libraries that I can use instead?

Answer

Python's built-in re (regular expression) module is probably the right tool for the job.

In order to remove every string of consecutive uppercase letters, the following code can be used:

import re
mytext = "hello, look an ACRONYM"
mytext = re.sub(r"\b[A-Z]{2,}\b", "", mytext)

Here, the regex "\b[A-Z]{2,}\b" searches for two or more consecutive (indicated by [...]{2,}) capital letters (A-Z) that form a complete word (\b...\b). re.sub then replaces each match with its second argument, the empty string "".

The convenient thing about regex is how easily it can be modified for more complex cases. For example:

mytext = re.sub(r"\b[A-Z\.]{2,}\b", "", mytext)

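This replaces runs of uppercase letters and full stops, so it removes acronyms like A.B.C.D as well as ABCD. Because of the closing \b, a full stop at the very end of an acronym (as in A.B.C.D. followed by a space or the end of the text) is left behind. Outside square brackets, an unescaped . is a wildcard that matches any character; inside [...] it is already literal, so the \ before it here is harmless but optional.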
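To see how this pattern treats a trailing full stop, here is a small check. The sample sentence is made up for illustration; the pattern is the one shown above.

```python
import re

# Same pattern as above: runs of capitals and full stops, as a whole word.
DOTTED = re.compile(r"\b[A-Z\.]{2,}\b")

sample = "see ABCD and A.B.C.D. here"  # made-up sample text
cleaned = DOTTED.sub("", sample)

# The closing \b cannot sit between "." and " ", so the match stops at "D"
# and the final full stop stays in the text.
print(repr(cleaned))  # 'see  and . here'
```

If the leftover full stop matters for your data, you would need to handle it separately, for example in a follow-up cleanup step.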

The ? quantifier, which makes the preceding character optional, can also be used to remove acronyms that end in s, for example:

mytext = re.sub(r"\b[A-Z\.]{2,}s?\b", "", mytext)

This regex will remove acronyms like ABCD, A.B.C.D, and even A.B.C.Ds. If other forms of acronym need to be removed, the regex can easily be modified to accommodate them.
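As a rough sketch, this is how the last pattern behaves on the three forms from the question. The helper name strip_acronyms and the sample sentence are my own, and collapsing the leftover spaces is an extra step that was not asked for.

```python
import re

# Pattern from above: capitals and full stops, optional trailing "s".
ACRONYM = re.compile(r"\b[A-Z\.]{2,}s?\b")


def strip_acronyms(text):
    """Remove acronym-like tokens, then collapse the spaces left behind.

    Hypothetical helper name; the space cleanup is an added assumption.
    """
    without = ACRONYM.sub("", text)
    return re.sub(r" {2,}", " ", without).strip()


sample = "Many MOOCs and one M.O.O.C or M.O.O.Cs exist"  # made-up sample
cleaned = strip_acronyms(sample)
print(cleaned)  # Many and one or exist
```

Note that "Many" survives because a single capital followed by lowercase letters never reaches the two-character minimum.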

The re module also includes functions like findall and finditer, which let a program locate and process each acronym individually (match, by contrast, only tests the start of the string). This can come in handy if you want to, for example, look at a list of the acronyms being removed and check that there are no legitimate words among them.
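One possible way to do that check is sketched below. The sample text and the ALLOWED set are assumptions made for illustration; you would fill the set in after looking at what findall actually returns on your data.

```python
import re

ACRONYM = re.compile(r"\b[A-Z\.]{2,}s?\b")

text = "NASA and the EU fund MOOCs; OK?"  # made-up sample text

# Step 1: list everything the pattern would remove.
found = ACRONYM.findall(text)
print(found)  # ['NASA', 'EU', 'MOOCs', 'OK']

# Step 2: keep tokens you consider legitimate words (hypothetical allowlist).
ALLOWED = {"OK"}


def drop_unless_allowed(match):
    """Return the token unchanged if allowed, otherwise an empty string."""
    token = match.group(0)
    return token if token in ALLOWED else ""


cleaned = ACRONYM.sub(drop_unless_allowed, text)
print(repr(cleaned))  # ' and the  fund ; OK?'
```

Passing a function to re.sub lets you decide per match, so the pattern itself can stay simple.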

Detect abbreviations in the text in python
