Extracting n-grams from huge text

2024/7/7 6:02:53

For example, we have the following text:

"Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..."

I need all contiguous word sequences of this text, in order: one word at a time, then two at a time, three at a time, and so on up to five at a time, like this:

ones : ['Spark', 'is', 'a', 'framework', 'for', 'writing', 'fast', 'distributed', 'programs', ...]

twos : ['Spark is', 'is a', 'a framework', 'framework for', 'for writing', ...]

threes : ['Spark is a', 'is a framework', 'a framework for', 'framework for writing', 'for writing fast', ...]

. . .

fives : ['Spark is a framework for', 'is a framework for writing', 'a framework for writing fast', 'framework for writing fast distributed', ...]

Please note that the text to be processed is huge (about 100GB). I need the best solution for this process. Maybe it should be processed in parallel with multiple threads.

I don't need the whole list at once; it can be streamed.

Answer

First of all, make sure that your file is actually split into lines; then you can safely read it line by line (discussed here):

    with open('my100GBfile.txt') as corpus:
        for line in corpus:
            sequence = preprocess(line)
            extract_n_grams(sequence)
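If the file turns out to have no line breaks (or only a few very long lines), iterating line by line would pull far too much into memory at once. Here is a minimal sketch of one workaround, assuming plain whitespace tokenization: read fixed-size chunks and carry a possibly cut-off token over to the next chunk. The name `iter_tokens` and the `chunk_size` default are illustrative and not part of the original answer.

```python
import io


def iter_tokens(stream, chunk_size=1 << 20):
    """Yield whitespace-separated tokens from a text stream, one chunk at a time."""
    tail = ""  # possibly incomplete token carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunk = tail + chunk
        tokens = chunk.split()
        # If the chunk does not end in whitespace, its last token may be cut in half,
        # so hold it back until the next chunk arrives.
        if tokens and not chunk[-1].isspace():
            tail = tokens.pop()
        else:
            tail = ""
        yield from tokens
    if tail:
        yield tail


# Small demonstration with an in-memory stream and a tiny chunk size.
demo = io.StringIO("Spark is a framework for writing")
print(list(iter_tokens(demo, chunk_size=5)))
```

With a real file you would pass the object returned by `open(...)` instead of the `StringIO` used here. Only one chunk plus one partial token is held in memory at any time.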

Let's assume that your corpus doesn't need any special treatment. I guess you can find a suitable treatment for your text; I only want it to be chunked into the desired tokens:

    def preprocess(string):
        # do whatever preprocessing needs to be done
        # e.g. convert to lowercase: string = string.lower()
        # return the sequence of tokens
        return string.split()

I don't know what you want to do with the n-grams. Let's assume that you want to count them for a language model that fits in your memory (it usually does, but I'm not sure about 4- and 5-grams). The easy way is to use the off-the-shelf nltk library:

    from nltk.util import ngrams

    # one count table per n-gram order, for n = 1..5
    lm = {n: dict() for n in range(1, 6)}

    def extract_n_grams(sequence):
        for n in range(1, 6):
            # ngrams() lazily yields the n-grams of the token sequence as tuples
            ngram = ngrams(sequence, n)
            # now you have the n-grams and can do whatever you want with them
            # yield ngram
            # you can count them for your language model?
            for item in ngram:
                lm[n][item] = lm[n].get(item, 0) + 1
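Since the question says the result can be streamed rather than collected, here is a hedged sketch of a lazy alternative that does not depend on nltk. It keeps only the last few tokens in a sliding window and yields each n-gram as soon as it is complete. The name `stream_ngrams` and its `max_n` parameter are illustrative assumptions, not something from the original answer.

```python
from collections import deque


def stream_ngrams(tokens, max_n=5):
    """Lazily yield (n, ngram) pairs for n = 1..max_n from any iterable of tokens."""
    window = deque(maxlen=max_n)  # holds at most the last max_n tokens seen
    for token in tokens:
        window.append(token)
        recent = tuple(window)
        # Emit every n-gram that ends at the current token.
        for n in range(1, len(recent) + 1):
            yield n, recent[-n:]


# Small demonstration: print the bigrams of a short token sequence.
for n, gram in stream_ngrams("Spark is a framework".split(), max_n=3):
    if n == 2:
        print(" ".join(gram))
```

Here `tokens` can be any iterable, such as the result of `preprocess(line)` for each line, and memory use stays proportional to `max_n` rather than to the size of the text. Note that n-grams are produced as tuples of tokens, like those from `nltk.util.ngrams`; join them with a space if you need strings.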