Pyarrow read/write from s3

2024/9/8 10:30:49

Is it possible to read and write parquet files from one folder to another folder in s3 without converting into pandas using pyarrow.

Here is my code:

import pyarrow.parquet as pq
import pyarrow as pa
import s3fss3 = s3fs.S3FileSystem()bucket = 'demo-s3'pd = pq.ParquetDataset('s3://{0}/old'.format(bucket), filesystem=s3).read(nthreads=4).to_pandas()
table = pa.Table.from_pandas(pd)
pq.write_to_dataset(table, 's3://{0}/new'.format(bucket), filesystem=s3, use_dictionary=True, compression='snappy')
Answer

If you do not wish to copy the files directly, it appears you can indeed avoid pandas thus:

table = pq.ParquetDataset('s3://{0}/old'.format(bucket),filesystem=s3).read(nthreads=4)
pq.write_to_dataset(table, 's3://{0}/new'.format(bucket), filesystem=s3, use_dictionary=True, compression='snappy')
https://en.xdnf.cn/q/73124.html

Related Q&A

matplotlib draw showing nothing

Im using pythons matplotlib to do some contours using contour and contourf functions. They all work fine when using show, but when I try to use draw() inside a method, I get the matplotlib window but n…

Measure the height of a string in Tkinter Python?

I need the height in pixel of a string in a Tkiner widget. It is the text in a row of a Listbox.I know I can measure the width of a string with tkinter.font.Font.measure. But how can I get the height?…

pandas filter by multiple columns NULL

I have a pandas dataframe like:df = pd.DataFrame({Last_Name: [Smith, None, Brown], First_Name: [John, None, Bill],Age: [35, 45, None]})And could manually filter it using:df[df.Last_Name.isnull() & …

Closed IPython Notebook that was running code

How it works? I got some code running in an IPython Notebook. Some iterative work. Accidentally I closed the browser with the running Notebook, but going back to the IPython Dashboard I see that this …

os.popen().read() - charmap decoding error

I have already read UnicodeDecodeError: charmap codec cant decode byte X in position Y: character maps to <undefined>. While the error message is similar, the code is completely different, becau…

multiprocessing numpy not defined error

I am using the following test code:from pathos.multiprocessing import ProcessingPool as Pool import numpydef foo(obj1, obj2):a = obj1**2b = numpy.asarray(range(1,5))return obj1, bif __name__ == __main_…

Convert mp4 sound to text in python

I want to convert a sound recording from Facebook Messenger to text. Here is an example of an .mp4 file send using Facebooks API: https://cdn.fbsbx.com/v/t59.3654-21/15720510_10211855778255994_5430581…

pandas map one column to the combination of two columns

I am working with a DataFrame which looks like thisList Numb Name 1 1 one 1 2 two 2 3 three 4 4 four 3 5 fiveand I am trying to compute…

Pyspark command not recognised

I have anaconda installed and also I have downloaded Spark 1.6.2. I am using the following instructions from this answer to configure spark for Jupyter enter link description hereI have downloaded and …

Sliding window in Python for GLCM calculation

I am trying to do texture analysis in a satellite imagery using GLCM algorithm. The scikit-image documentation is very helpful on that but for GLCM calculation we need a window size looping over the im…