For example, we have the following text:
"Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..."
I need all possible contiguous sections of this text: first one word at a time, then two by two, three by three, up to five by five.
like this:
ones : ['Spark', 'is', 'a', 'framework', 'for', 'writing', 'fast', 'distributed', 'programs', ...]
twos : ['Spark is', 'is a', 'a framework', 'framework for', 'for writing'...]
threes : ['Spark is a', 'is a framework', 'a framework for', 'framework for writing', 'for writing fast', ...]
. . .
fives : ['Spark is a framework for', 'is a framework for writing', 'a framework for writing fast', 'framework for writing fast distributed', ...]
Please note that the text to be processed is huge (about 100 GB).
I need the best solution for this process. Maybe it should be processed with multiple threads, in parallel.
I don't need the whole list at once; it can be streamed.
First of all, make sure that your file has lines; then you can read it line by line with no worries (discussed here):
with open('my100GBfile.txt') as corpus:
    for line in corpus:
        sequence = preprocess(line)
        extract_n_grams(sequence)
Let's assume that your corpus doesn't need any special treatment. I guess you can find a suitable treatment for your text; I only want it to be chunked into the desired tokens:
def preprocess(string):
    # do whatever preprocessing needs to be done,
    # e.g. convert to lowercase: string = string.lower()
    # return the sequence of tokens
    return string.split()
I don't know what you want to do with the n-grams. Let's assume that you want to count them as a language model which fits in your memory (it usually does, but I'm not sure about 4- and 5-grams). The easy way is to use the off-the-shelf nltk library:
from nltk.util import ngrams

lm = {n: dict() for n in range(1, 6)}
def extract_n_grams(sequence):
    for n in range(1, 6):
        ngram = ngrams(sequence, n)
        # now you have the n-grams; you can do whatever you want with them
        # yield ngram
        # or count them for your language model:
        for item in ngram:
            lm[n][item] = lm[n].get(item, 0) + 1
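Since you said you don't need the whole list at once, here is a minimal sketch of a streaming variant: the "yield ngram" idea above turned into a generator. It reuses preprocess and the ngrams import from above; stream_n_grams is just a name I made up:

def stream_n_grams(path, n_min=1, n_max=5):
    # lazily yield (n, ngram) pairs one line at a time,
    # so memory use stays flat no matter how big the file is
    with open(path) as corpus:
        for line in corpus:
            sequence = preprocess(line)
            for n in range(n_min, n_max + 1):
                for item in ngrams(sequence, n):
                    yield n, item

You can then consume it lazily wherever you want, e.g.:

for n, gram in stream_n_grams('my100GBfile.txt'):
    lm[n][gram] = lm[n].get(gram, 0) + 1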
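As for the multi-threading you mentioned: in CPython, threads won't speed up CPU-bound tokenizing, so processes are the usual route. Here's a rough sketch with the standard multiprocessing module (count_n_grams is a hypothetical helper; merging the per-line counters in the parent process is the bottleneck here, so in practice you'd hand each worker a whole chunk of lines rather than one line):

from collections import Counter
from multiprocessing import Pool

def count_n_grams(line):
    # count all 1- to 5-grams of one line; tuples of different
    # lengths can share one Counter, since the key is the n-gram itself
    counts = Counter()
    sequence = preprocess(line)
    for n in range(1, 6):
        counts.update(ngrams(sequence, n))
    return counts

if __name__ == '__main__':
    totals = Counter()
    with open('my100GBfile.txt') as corpus, Pool() as pool:
        for partial in pool.imap_unordered(count_n_grams, corpus, chunksize=10000):
            totals.update(partial)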