Question 1

I have the following Python code, which collects data from standard input into a list and runs SyntaxNet on it. The data is in the form of JSON objects, from which I will extract the text field and feed it to SyntaxNet.

data = []
for line in sys.stdin:
    data.append(line)
run_syntaxnet(data)    # This is a function

I am doing this because I do not want SyntaxNet to run for every single tweet, since that would take a very long time and hurt performance.

Also, when I run this code on very large data, I do not want to keep collecting it forever and run out of memory. So I want to collect the data in chunks, maybe 10000 tweets at a time, and run SyntaxNet on each chunk. Can someone help me with how to do this?

I also want to understand what the maximum length of the list data can be so that I do not run out of memory.

EDIT:

I used the code:

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)    # This is a function
        data = []

which runs perfectly fine if the number of rows in the input data is a multiple of 10000. I am not sure what to do with the remainder of the rows.

For example, if the total number of rows is 12000, the first 10000 rows get processed as I want, but the remaining 2000 are left out because the condition len(data) == 10000 is never met for them.

I want to do something like:

if len(data) > 10000 or 'EOF of input file is reached':
    run_syntaxnet(data)

Can someone tell me how to check for the EOF of input file? Thanks in advance!

PS: All the data comes to the Python file from Pig Streaming. Also, I cannot afford to count the number of rows in the input data and pass it as a parameter, since I have millions of rows and the counting itself would take forever.

Answer

I think this is all you need:

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)    # This is a function
        data = []

Once the list reaches 10000 items, run the function and reset the data list. The maximum size of the list varies from machine to machine, depending on how much memory you have, so it is probably best to try different lengths and see which works best.
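
The loop above does not cover the leftover rows described in the question's edit. One possible way to handle them, sketched here under a few assumptions, relies on the fact that iterating over sys.stdin simply stops at end of input, so no explicit EOF check is needed: whatever is still in the list after the loop is the final, partial chunk. The helper name process_in_chunks and its parameters are hypothetical and not part of the original code; handler stands in for run_syntaxnet.

```python
def process_in_chunks(lines, handler, chunk_size=10000):
    """Pass `lines` to `handler` in lists of at most `chunk_size` items.

    `process_in_chunks` and `handler` are illustrative names; `handler`
    plays the role of run_syntaxnet from the question.
    Returns the number of chunks that were handed to `handler`.
    """
    chunks_done = 0
    data = []
    for line in lines:
        data.append(line)
        if len(data) == chunk_size:
            handler(data)
            chunks_done += 1
            data = []  # start a fresh list so memory use stays bounded
    # The for loop ends when the input is exhausted (EOF), so anything
    # still in `data` here is the last, partial chunk.
    if data:
        handler(data)
        chunks_done += 1
    return chunks_done
```

With the question's setup, this would be called as process_in_chunks(sys.stdin, run_syntaxnet) after import sys. At most chunk_size lines are held in memory at any time, so the chunk size can be tuned to the machine without changing the loop.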

Collect data in chunks from stdin: Python
