How to specify the partition for mapPartitions in Spark

2024/10/15 9:20:15

What I would like to do is process each list separately. For example, if I have 5 lists ([1,2,3,4,5,6], [2,3,4,5,6], [3,4,5,6], [4,5,6], [5,6]) and I would like to get the 5 lists without the 6, I would do something like:

data = [1,2,3,4,5,6] + [2,3,4,5,6,7] + [3,4,5,6,7,8] + [4,5,6,7,8,9] + [5,6,7,8,9,10]

def function_1(iter_listoflist):
    final_iterator = []
    for sublist in iter_listoflist:
        final_iterator.append([x for x in sublist if x != 6])
    return iter(final_iterator)

sc.parallelize(data, 5).glom().mapPartitions(function_1).collect()

Then I cut the result to get the original lists back. Is there a way to simply keep the computations separate? I don't want the lists to mix, and they might be of different sizes.

thank you

Philippe

Answer

As far as I understand your intentions, all you need here is to keep the individual lists separate when you parallelize your data:

data = [[1,2,3,4,5,6], [2,3,4,5,6,7], [3,4,5,6,7,8], [4,5,6,7,8,9], [5,6,7,8,9,10]]
rdd = sc.parallelize(data)
rdd.take(1)  # A single element of an RDD is a whole list
## [[1, 2, 3, 4, 5, 6]]

Now you can simply map using a function of your choice:

def drop_six(xs):
    return [x for x in xs if x != 6]

rdd.map(drop_six).take(3)
## [[1, 2, 3, 4, 5], [2, 3, 4, 5, 7], [3, 4, 5, 7, 8]]
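If you still prefer mapPartitions, the same idea carries over. Here is a minimal sketch using the unequal-length lists from the question. The helper name drop_six_in_partition is made up for illustration, and the Spark call is left as a comment because it assumes a live SparkContext named sc. Since every RDD element is a whole list, the lists should stay intact however they end up distributed across partitions, and the partition function can be checked locally by feeding it a plain iterator:

```python
def drop_six(xs):
    # Same per-list function as in the answer above.
    return [x for x in xs if x != 6]


def drop_six_in_partition(partition):
    # Hypothetical helper: `partition` is an iterator over the elements of
    # one partition, and each element is a whole list, so lists never mix.
    for xs in partition:
        yield drop_six(xs)


# Lists of different sizes, taken from the question text.
data = [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6], [3, 4, 5, 6], [4, 5, 6], [5, 6]]

# Local check without Spark: treat the whole dataset as a single partition.
result = list(drop_six_in_partition(iter(data)))
print(result)

# With a live SparkContext `sc` (an assumption, not created here), the same
# function could be passed to mapPartitions:
# sc.parallelize(data).mapPartitions(drop_six_in_partition).collect()
```

For a purely per-list transformation like this one, map is the simpler choice. mapPartitions tends to pay off mainly when there is some per-partition setup cost to share across elements.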