I have a CSV file which is ~40GB and has ~1,800,000 lines.
I want to randomly sample 10,000 lines and print them to a new file.
Right now, my approach is to use sed:
sed -n "$vars" input.txt > output.txt
where $vars is a randomly generated list of print commands, one per sampled line (e.g. 1p;14p;1700p;...;10203p).
While this works, it takes about 5 minutes per execution. It's not a huge amount of time, but I was wondering if anybody had ideas on how to make it quicker?
The biggest advantage of having lines of the same length is that you don't need to find newlines to know where each line starts. With a file size of ~40GB containing ~1.8M lines, you have a line length of ~22KB. If you sample 10K lines, the average gap between samples is ~4MB. That is roughly three orders of magnitude larger than the size of a block on your disk, so seeking to the next read location is much, much more efficient than reading every byte in the file.
Seeking will also work with files that have unequal line lengths (e.g., non-ASCII characters in UTF-8 encoding), but it requires minor modifications to the method. If the lines are unequal, you can seek to an estimated location and then scan to the start of the next line. This is still quite efficient, because you skip ~4MB for every ~20KB you need to read. Your sampling uniformity is compromised slightly, since you select byte locations instead of line locations, and you won't know for sure which line number you are reading.
You can implement this directly alongside the Python code that generates your line numbers. Here is a sample of how to deal with lines that all have the same number of bytes (usually ASCII encoding):
import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000

file_size = getsize(file_name)

with open(file_name) as file:
    # Read the first line to get the length
    file.readline()
    line_size = file.tell()
    # You don't have to seek(0) here: if line #0 is selected,
    # the seek will happen regardless later.

    # Assuming you are 100% sure all lines are equal, this might
    # discard the last line if it doesn't have a trailing newline.
    # If that bothers you, use `round(file_size / line_size)`
    line_count = file_size // line_size

    # This is just a trivial example of how to generate the line numbers.
    # If it doesn't work for you, just use the method you already have.
    # By the way, this will just error out (ValueError) if you try to
    # select more lines than there are in the file, which is ideal
    selection_indices = random.sample(range(line_count), selection_count)
    selection_indices.sort()

    # Now skip to each line before reading it:
    prev_index = 0
    for line_index in selection_indices:
        # Small optimization to avoid seeking between consecutive lines.
        # Might be unnecessary since seek probably already does
        # something like that for you.
        # Conveniently, the default seek reference point is the start
        # of the file, not the current position.
        if line_index != prev_index + 1:
            file.seek(line_index * line_size)
        print('Line #{}: {}'.format(line_index, file.readline()), end='')
        prev_index = line_index
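This whole method hinges on the first line's length being representative of every line, so before trusting the fixed-stride arithmetic it may be worth a quick sanity check. This is my addition rather than part of the method above, and check_fixed_length is a hypothetical name:

def check_fixed_length(file_name, probe=100):
    # Read the first `probe` lines in binary mode and collect their
    # byte lengths (newline included); the seek arithmetic above is
    # only valid if there is exactly one distinct length.
    with open(file_name, 'rb') as f:
        lengths = {len(f.readline()) for _ in range(probe)}
    return len(lengths) == 1

If it returns False, use the unequal-length variant described next instead.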
If you are willing to sacrifice a (very) small amount of uniformity in the distribution of line numbers, you can easily apply a similar technique to files with unequal line lengths: generate random byte offsets, then skip to the next full line after each offset. The following implementation assumes you know for a fact that no line is longer than 40KB. You would need something like this if your CSV contained non-ASCII Unicode characters encoded in UTF-8, because even if every line contained the same number of characters, the lines would contain different numbers of bytes. In that case you have to open the file in binary mode, since otherwise you might run into decoding errors when you seek to a random byte that happens to land mid-character:
import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000
# An upper bound on the line size in bytes, not chars.
# This serves two purposes:
# 1. It determines the margin to leave at the end of the file
# 2. It determines the closest two offsets are allowed to be and
#    still be 100% guaranteed to land in different lines
max_line_bytes = 40000

file_size = getsize(file_name)

# make_offsets is a function that returns `selection_count` monotonically
# increasing unique samples, at least `max_line_bytes` apart from each
# other, in the range [0, file_size - margin). Implementation not
# provided here; see the sketch below.
selection_offsets = make_offsets(selection_count, file_size, max_line_bytes)

with open(file_name, 'rb') as file:
    for offset in selection_offsets:
        # Skip to each offset
        file.seek(offset)
        # Read out the remainder of the current (partial) line
        file.readline()
        # Print the next full line. You don't know its number.
        # You also have to decode it yourself.
        print(file.readline().decode('utf-8'), end='')
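Since make_offsets is left unimplemented above, here is one possible sketch that satisfies the stated contract. The gap-reservation trick (shrink the range, sample, then re-inflate) is my own choice, not something from the method above:

def make_offsets(selection_count, file_size, max_line_bytes):
    # Reserve a margin of one maximum-length line at the end of the
    # file, plus one mandatory gap in front of each offset after the
    # first, then sample from what remains.
    usable = file_size - max_line_bytes * selection_count
    if usable <= 0:
        raise ValueError('File too small for this many samples')
    # Unique, sorted draws from the shrunk range...
    points = sorted(random.sample(range(usable), selection_count))
    # ...re-inflated so consecutive offsets end up at least
    # `max_line_bytes` apart and all fall in [0, file_size - margin).
    return [p + i * max_line_bytes for i, p in enumerate(points)]

Any scheme works here, as long as the minimum spacing guarantees that two offsets can never land in the same line.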
All code here is Python 3.