Question 1

Is it possible to use a timestamp field in a pyarrow table to partition the output by "YYYY/MM/DD/HH" when writing Parquet files to S3 through s3fs?

Answer

I was able to achieve this with pyarrow's write_to_dataset function, which lets you specify partition columns; each partition column becomes a level of subdirectories.

Example:

import s3fs
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholders: substitute your own credentials and bucket name
access_key = "<access_key>"
secret_key = "<secret_key>"
bucket_name = "<bucket_name>"

fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)
bucket_uri = 's3://{0}/{1}'.format(bucket_name, "data")

data = {'date': ['2018-03-04T14:12:15.653Z', '2018-03-03T14:12:15.653Z', '2018-03-02T14:12:15.653Z', '2018-03-05T14:12:15.653Z'],
        'battles': [34, 25, 26, 57],
        'citys': ['london', 'newyork', 'boston', 'boston']}
df = pd.DataFrame(data, columns=['date', 'battles', 'citys'])

# Parse the ISO 8601 strings into timestamps
df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%dT%H:%M:%S.%fZ")

# Derive one column per partition level
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

table = pa.Table.from_pandas(df)

# Each column in partition_cols becomes a directory level under bucket_uri
pq.write_to_dataset(table, bucket_uri, filesystem=fs,
                    partition_cols=['year', 'month', 'day'],
                    use_dictionary=True, compression='snappy',
                    use_deprecated_int96_timestamps=True)

Pyarrow s3fs partition by timestamp
