Randomly sampling lines from a file

2024/9/24 2:29:28

I have a CSV file which is ~40 GB and 1,800,000 lines.

I want to randomly sample 10,000 lines and print them to a new file.

Right now, my approach is to use sed as:

(sed -n "$vars" < input.txt) > output.txt

Where $vars is a randomly generated list of lines. (Eg: 1p;14p;1700p;...;10203p)

While this works, it takes about 5 minutes per execution. It's not a huge time, but I was wondering if anybody had ideas on how to make it quicker?

Answer

The biggest advantage to having lines of the same length is that you don't need to find newlines to know where each line starts. With a file size of ~40GB containing ~1.8M lines, you have a line length of ~20KB/line. If you want to sample 10K lines, you have ~4MB between sampled lines on average. This is almost certainly around three orders of magnitude larger than the size of a block on your disk. Therefore, seeking to the next read location is much, much more efficient than reading every byte in the file.
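The back-of-the-envelope figures above can be checked with a few lines of Python. This is only a sketch: the file size, line count, and sample count come from the question, while the 4096-byte block size is an assumption about a typical disk, not something the question states.

```python
file_size_bytes = 40e9      # ~40 GB, from the question
line_count = 1_800_000      # ~1.8M lines, from the question
sample_count = 10_000       # lines to sample, from the question
block_size_bytes = 4096     # ASSUMPTION: a typical filesystem block size

# Average line length, and average distance between two sampled lines
bytes_per_line = file_size_bytes / line_count
bytes_between_samples = file_size_bytes / sample_count

print(f'~{bytes_per_line / 1e3:.0f} KB per line')
print(f'~{bytes_between_samples / 1e6:.0f} MB between sampled lines')
print(f'~{bytes_between_samples / block_size_bytes:.0f} blocks skipped per sample')
```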

Seeking will work with files that have unequal line lengths (e.g., non-ASCII characters in UTF-8 encoding), but will require minor modifications to the method. If you have unequal lines, you can seek to an estimated location, then scan to the start of the next line. This is still quite efficient because you will be skipping ~4MB for every ~20KB you need to read. Your sampling uniformity will be compromised slightly since you will select byte locations instead of line locations, and you won't know for sure which line number you are reading.
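To see where the loss of uniformity comes from, here is a toy calculation. The line lengths are made up for illustration. A random byte offset lands inside line k with probability proportional to that line's length, and the method then returns line k + 1, so lines that follow long lines are picked more often. As a side effect, the very first line can never be picked unless you handle it separately.

```python
# HYPOTHETICAL line lengths in bytes (including the newline), for illustration
line_lengths = [10, 30, 20, 40]

# Offsets that land in the last line have no full line after them,
# so only the bytes of the earlier lines are usable as offsets.
usable_bytes = sum(line_lengths[:-1])

# An offset inside line k selects line k + 1, so line k + 1 is chosen
# with probability len(line k) / usable_bytes.
probabilities = {
    k + 1: line_lengths[k] / usable_bytes
    for k in range(len(line_lengths) - 1)
}

for line_number, probability in probabilities.items():
    print('Line #{}: {:.3f}'.format(line_number, probability))
```

With lines of nearly equal length, as in a file averaging ~20KB per line, these probabilities are close to each other, which is why the bias is small in practice.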

You can implement your solution directly with the Python code that generates your line numbers. Here is a sample of how to deal with lines that all have the same number of bytes (usually ascii encoding):

import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000

file_size = getsize(file_name)

with open(file_name) as file:
    # Read the first line to get the length
    file.readline()
    line_size = file.tell()
    # You don't have to seek(0) here: if line #0 is selected,
    # the seek will happen regardless later.

    # Assuming you are 100% sure all lines are equal, this might
    # discard the last line if it doesn't have a trailing newline.
    # If that bothers you, use `round(file_size / line_size)`
    line_count = file_size // line_size

    # This is just a trivial example of how to generate the line numbers.
    # If it doesn't work for you, just use the method you already have.
    # By the way, this will just error out (ValueError) if you try to
    # select more lines than there are in the file, which is ideal
    selection_indices = random.sample(range(line_count), selection_count)
    selection_indices.sort()

    # Now skip to each line before reading it:
    prev_index = 0
    for line_index in selection_indices:
        # Conveniently, the default seek offset is the start of the file,
        # not from current position
        if line_index != prev_index + 1:
            file.seek(line_index * line_size)
        print('Line #{}: {}'.format(line_index, file.readline()), end='')
        # Small optimization to avoid seeking consecutive lines.
        # Might be unnecessary since seek probably already does
        # something like that for you
        prev_index = line_index

If you are willing to sacrifice a (very) small amount of uniformity in the distribution of line numbers, you can easily apply a similar technique to files with unequal line lengths. You just generate random byte offsets, and skip to the next full line after the offset. In the following implementation, it is assumed that you know for a fact that no line is longer than 40KB in length. You would have to do something like this if your CSV had non-ascii unicode characters encoded in UTF-8, because even if the lines all contained the same number of characters, they would contain different numbers of bytes. In this case, you would have to open the file in binary mode, since otherwise you might run into decoding errors when you skip to a random byte, if that byte happens to be mid-character:

import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000
# An upper bound on the line size in bytes, not chars
# This serves two purposes:
#   1. It determines the margin to use from the end of the file
#   2. It determines the closest two offsets are allowed to be and
#      still be 100% guaranteed to be in different lines
max_line_bytes = 40000

file_size = getsize(file_name)

# make_offsets is a function that returns `selection_count` monotonically
# increasing unique samples, at least `max_line_bytes` apart from each
# other, in the range [0, file_size - margin). Implementation not provided.
selection_offsets = make_offsets(selection_count, file_size, max_line_bytes)

with open(file_name, 'rb') as file:
    for offset in selection_offsets:
        # Skip to each offset
        file.seek(offset)
        # Read out the rest of the current (partial) line
        file.readline()
        # Print the next line. You don't know the number.
        # You also have to decode it yourself.
        print(file.readline().decode('utf-8'), end='')
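The answer leaves make_offsets unimplemented. One possible sketch is below. It is not the original author's version, and it assumes the margin at the end of the file is max_line_bytes. The idea is to draw distinct values from a shrunken range, sort them, then push the i-th value forward by i * max_line_bytes, which guarantees the spacing without any retry loop.

```python
import random


def make_offsets(selection_count, file_size, max_line_bytes):
    """Hypothetical helper: sorted byte offsets at least max_line_bytes apart."""
    # ASSUMPTION: a margin of max_line_bytes at the end of the file is
    # enough for a full line to follow the partial one that gets skipped.
    usable = file_size - max_line_bytes
    # Shrink the range by the total spacing that is added back below.
    span = usable - (selection_count - 1) * max_line_bytes
    if span < selection_count:
        raise ValueError('file is too small for the requested sample')
    # Distinct sorted values in [0, span)
    base = sorted(random.sample(range(span), selection_count))
    # Shifting the i-th value by i * max_line_bytes keeps the order and
    # makes consecutive offsets differ by more than max_line_bytes.
    return [value + i * max_line_bytes for i, value in enumerate(base)]
```

For a ~40GB file with 10,000 samples and a 40KB bound, the spacing only uses up about 400MB of the range, so nearly the whole file remains available for sampling.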

All code here is Python 3.
