For example, we have the following text:
"Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..."
I need all possible contiguous sections of this text: first one word at a time, then two by two, three by three, up to five by five.
like this:
ones : ['Spark', 'is', 'a', 'framework', 'for', 'writing', 'fast', 'distributed', 'programs', ...]
twos : ['Spark is', 'is a', 'a framework', 'framework for', 'for writing'...]
threes : ['Spark is a', 'is a framework', 'a framework for', 'framework for writing', 'for writing fast', ...]
. . .
fives : ['Spark is a framework for', 'is a framework for writing', 'a framework for writing fast', 'framework for writing fast distributed', ...]
Please note that the text to be processed is huge (about 100 GB).
I need the best solution for this process. Maybe it should be processed with multiple threads, in parallel.
I don't need the whole list at once; it can be streamed.
First of all, make sure that your file has lines; then you can read it line by line with no worries (discussed here):
with open('my100GBfile.txt') as corpus:
    for line in corpus:
        sequence = preprocess(line)
        extract_n_grams(sequence)
Let's assume that your corpus doesn't need any special treatment. I guess you can find a suitable treatment for your text; I only want it to be chunked into the desired tokens:
def preprocess(string):
    # do whatever preprocessing needs to be done,
    # e.g. convert to lowercase: string = string.lower()
    # return the sequence of tokens
    return string.split()
I don't know what you want to do with the n-grams. Let's assume that you want to count them as a language model which fits in your memory (it usually does, but I'm not sure about 4- and 5-grams). The easy way is to use the off-the-shelf nltk library:
from nltk.util import ngrams

lm = {n: dict() for n in range(1, 6)}
def extract_n_grams(sequence):
    for n in range(1, 6):
        ngram = ngrams(sequence, n)
        # now you have the n-grams; you can do whatever you want with them
        # yield ngram
        # or count them for your language model:
        for item in ngram:
            lm[n][item] = lm[n].get(item, 0) + 1
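Since you said you don't need the whole list at once, here is a minimal sketch of a streaming variant: the "yield ngram" idea above turned into a generator. It reuses preprocess and the ngrams import from above; stream_n_grams is just a name I made up:

def stream_n_grams(path, n_min=1, n_max=5):
    # lazily yield (n, ngram) pairs one line at a time,
    # so memory use stays flat no matter how big the file is
    with open(path) as corpus:
        for line in corpus:
            sequence = preprocess(line)
            for n in range(n_min, n_max + 1):
                for item in ngrams(sequence, n):
                    yield n, item

You can then consume it lazily wherever you want, e.g.:

for n, gram in stream_n_grams('my100GBfile.txt'):
    lm[n][gram] = lm[n].get(gram, 0) + 1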
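As for the multi-threading you mentioned: in CPython, threads won't speed up CPU-bound tokenizing, so processes are the usual route. Here's a rough sketch with the standard multiprocessing module (count_n_grams is a hypothetical helper; merging the per-line counters in the parent process is the bottleneck here, so in practice you'd hand each worker a whole chunk of lines rather than one line):

from collections import Counter
from multiprocessing import Pool

def count_n_grams(line):
    # count all 1- to 5-grams of one line; tuples of different
    # lengths can share one Counter, since the key is the n-gram itself
    counts = Counter()
    sequence = preprocess(line)
    for n in range(1, 6):
        counts.update(ngrams(sequence, n))
    return counts

if __name__ == '__main__':
    totals = Counter()
    with open('my100GBfile.txt') as corpus, Pool() as pool:
        for partial in pool.imap_unordered(count_n_grams, corpus, chunksize=10000):
            totals.update(partial)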