I have a MapReduce job that I'm trying to migrate to PySpark. Is there any way of defining the name of the output file, rather than getting part-xxxxx?
In MR, I was using the org.apache.hadoop.mapred.lib.MultipleTextOutputFormat class to achieve this.
PS: I did try the saveAsTextFile() method. For example:

import re

lines = sc.textFile(filesToProcessStr)
# split each line on whitespace or '&'
counts = lines.flatMap(lambda x: re.split(r'[\s&]', x.strip()))
counts.saveAsTextFile("/user/itsjeevs/mymr-output")
This still creates the same part-00000 files:
[13:46:25] [spark] $ hadoop fs -ls /user/itsjeevs/mymr-output/
Found 3 items
-rw-r----- 2 itsjeevs itsjeevs 0 2014-08-13 13:46 /user/itsjeevs/mymr-output/_SUCCESS
-rw-r--r-- 2 itsjeevs itsjeevs 101819636 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00000
-rw-r--r-- 2 itsjeevs itsjeevs 17682682 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00001
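
The only workaround I can think of is to rename the part file after the job finishes, through the Hadoop FileSystem API that PySpark exposes via the JVM gateway. This is just a sketch: it writes with coalesce(1) so there is exactly one part file to move, and the target name words.txt is made up:

# Write a single part file, then rename it via the Hadoop FileSystem API.
counts.coalesce(1).saveAsTextFile("/user/itsjeevs/mymr-output")

Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
# rename() returns False on failure (e.g. if the destination already exists)
fs.rename(Path("/user/itsjeevs/mymr-output/part-00000"),
          Path("/user/itsjeevs/mymr-output/words.txt"))

Renaming after the fact works, but coalescing to one partition gives up parallel writes, which is why something like MultipleTextOutputFormat on the Spark side would be much nicer.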
EDIT

I recently read an article that would make life much easier for Spark users.