Collect data in chunks from stdin: Python

2024/10/9 12:27:26

I have the following Python code, which collects data from standard input into a list and runs SyntaxNet on it. The data consists of JSON objects, from which I extract the text field and feed it to SyntaxNet.

import sys

data = []
for line in sys.stdin:
    data.append(line)
run_syntaxnet(data)  # This is a function

I am doing this because I do not want SyntaxNet to run once for every single tweet, since that would take a very long time and hurt performance.

Also, when I run this code on very large data, I do not want to keep collecting lines forever and run out of memory. So I want to collect the data in chunks, maybe 10000 tweets at a time, and run SyntaxNet on each chunk. Can someone show me how to do this?

Also, I want to understand what can be the maximum length of the list data so that I do not run out of memory.

EDIT:

I used the code:

data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)  # This is a function
        data = []

which runs perfectly fine if the number of rows in the input data is a multiple of 10000. I am not sure what to do with the remainder of the rows.

For example, if the total number of rows is 12000, the first 10000 rows get processed as I want, but the next 2000 are left out because the condition len(data) == 10000 is never met for them.

I want to do something like:

if len(data) > 10000 or 'EOF of input file is reached':
    run_syntaxnet(data)

Can someone tell me how to check for the EOF of input file? Thanks in advance!

PS: All the data comes to the Python script from Pig Streaming. Also, I cannot afford to count the rows in the input data and pass the count as a parameter, since I have millions of rows and the counting itself would take forever.

Answer

I think this is all you need:

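data = []
for line in sys.stdin:
    data.append(line)
    if len(data) == 10000:
        run_syntaxnet(data)  # This is a function
        data = []
# The loop ends when stdin reaches EOF; process whatever is left over.
if data:
    run_syntaxnet(data)

Once the list reaches 10000 lines, run the function and reset the list. Iterating over sys.stdin stops at the end of the input, so there is no need to test for EOF explicitly: any rows still in data after the loop are the remainder (for example, the last 2000 of 12000 rows). The maximum size of the list varies from machine to machine, depending on how much memory is available, so it is probably best to try different chunk sizes and see which works best.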
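The same batching can also be written as a small generator, which keeps the chunking logic separate from the processing step. This is only a sketch: read_chunks and main are hypothetical names introduced here, and run_syntaxnet is a stand-in for the asker's own function, whose real behaviour is not shown in the question.

```python
import sys
from itertools import islice


def read_chunks(lines, chunk_size=10000):
    """Yield lists of at most chunk_size items; the last list may be shorter."""
    iterator = iter(lines)
    while True:
        # islice pulls up to chunk_size items and stops early at end of input.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            # Nothing left to read: the input is exhausted (EOF on stdin).
            return
        yield chunk


def run_syntaxnet(batch):
    # Placeholder (assumption): stands in for the asker's real function.
    pass


def main(stream=sys.stdin):
    # Only one chunk is held in memory at a time.
    for chunk in read_chunks(stream):
        run_syntaxnet(chunk)
```

Calling main() at the bottom of the script processes stdin in batches of 10000 lines, including a final shorter batch. Because each chunk is released before the next one is read, peak memory depends on chunk_size rather than on the total number of rows.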
