deep copy nested iterable (or improved itertools.tee for iterable of iterables)

2024/10/4 19:30:40

Preface

I have a test where I'm working with nested iterables (by nested iterable I mean iterable with only iterables as elements).

As a test cascade consider

from itertools import tee
from typing import (Any,Iterable)def foo(nested_iterable: Iterable[Iterable[Any]]) -> Any:...def test_foo(nested_iterable: Iterable[Iterable[Any]]) -> None:original, target = tee(nested_iterable)  # this doesn't copy iterators elementsresult = foo(target)assert is_contract_satisfied(result, original)def is_contract_satisfied(result: Any,original: Iterable[Iterable[Any]]) -> bool:...

E.g. foo may be simple identity function

def foo(nested_iterable: Iterable[Iterable[Any]]) -> Iterable[Iterable[Any]]:return nested_iterable

and contract is simply checks that flattened iterables have same elements

from itertools import (chain,starmap,zip_longest)
from operator import eq
...
flatten = chain.from_iterabledef is_contract_satisfied(result: Iterable[Iterable[Any]],original: Iterable[Iterable[Any]]) -> bool:return all(starmap(eq,zip_longest(flatten(result), flatten(original),# we're assuming that ``object()``# will create some unique object# not presented in any of argumentsfillvalue=object())))

But if some of nested_iterable elements is an iterator, it may be exhausted since tee is making shallow copies, not deep ones, i.e. for given foo and is_contract_satisfied next statement

>>> test_foo([iter(range(10))])

leads to predictable

Traceback (most recent call last):...test_foo([iter(range(10))])File "...", line 19, in test_fooassert is_contract_satisfied(result, original)
AssertionError

Problem

How to deep copy an arbitrary nested iterable?

Note

I'm aware of copy.deepcopy function, but it won't work for file objects.

Answer

Naive solution

Straightforward algorithm would be

  1. Perform elementwise copying of original nested iterable.
  2. Make n copies of elementwise copy.
  3. Obtain coordinates related to each independent copy.

which may be implemented like

from itertools import tee
from operator import itemgetter
from typing import (Any,Iterable,Tuple,TypeVar)Domain = TypeVar('Domain')def copy_nested_iterable(nested_iterable: Iterable[Iterable[Domain]],*,count: int = 2) -> Tuple[Iterable[Iterable[Domain]], ...]:def shallow_copy(iterable: Iterable[Domain]) -> Tuple[Iterable[Domain], ...]:return tee(iterable, count)copies = shallow_copy(map(shallow_copy, nested_iterable))return tuple(map(itemgetter(index), iterables)for index, iterables in enumerate(copies))

Pros:

  • quite easy to read & explain.

Cons:

  • if we wanted to extend our approach for iterables with greater nesting level (like iterable of nested iterables and so on) this approach doesn't look helpful.

We can do better.

Improved solution

If we look at itertools.tee function documentation, it contains Python recipe, which with help of functools.singledispatch decorator can be rewritten like

from collections import (abc,deque)
from functools import singledispatch
from itertools import repeat
from typing import (Iterable,Tuple,TypeVar)Domain = TypeVar('Domain')@functools.singledispatch
def copy(object_: Domain,*,count: int) -> Iterable[Domain]:raise TypeError('Unsupported object type: {type}.'.format(type=type(object_)))# handle general case
@copy.register(object)
# immutable strings represent a special kind of iterables
# that can be copied by simply repeating
@copy.register(bytes)
@copy.register(str)
# mappings cannot be copied as other iterables
# since they are iterable only by key
@copy.register(abc.Mapping)
def copy_object(object_: Domain,*,count: int) -> Iterable[Domain]:return itertools.repeat(object_, count)@copy.register(abc.Iterable)
def copy_iterable(object_: Iterable[Domain],*,count: int = 2) -> Tuple[Iterable[Domain], ...]:iterator = iter(object_)# we are using `itertools.repeat` instead of `range` here# due to efficiency of the former# more info at# https://stackoverflow.com/questions/9059173/what-is-the-purpose-in-pythons-itertools-repeat/9098860#9098860queues = [deque() for _ in repeat(None, count)]def replica(queue: deque) -> Iterable[Domain]:while True:if not queue:try:element = next(iterator)except StopIteration:returnelement_copies = copy(element,count=count)for sub_queue, element_copy in zip(queues, element_copies):sub_queue.append(element_copy)yield queue.popleft()return tuple(replica(queue) for queue in queues)

Pros:

  • handles nesting on deeper levels or even mixed elements like both iterables and non-iterables on the same level,
  • may be extended for user-defined structures (e.g. for making independent deep copies of them).

Cons:

  • less readable (but as we know "practicality beats purity"),
  • provides some overhead related to dispatching (but it's ok since it is based on dictionary lookup which has O(1) complexity).

Test

Preparation

Let's define our nested iterable as follows

nested_iterable = [range(10 ** index) for index in range(1, 7)]

Since iterators creation says nothing about underlying copies performance, let's define function for iterators exhausting (described here)

exhaust_iterable = deque(maxlen=0).extend

Time

Using timeit package

import timeitdef naive(): exhaust_iterable(copy_nested_iterable(nested_iterable))def improved(): exhaust_iterable(copy_iterable(nested_iterable))print('naive approach:', min(timeit.repeat(naive)))
print('improved approach:', min(timeit.repeat(improved)))

I have on my laptop with Windows 10 x64 in Python 3.5.4

naive approach: 5.1863865
improved approach: 3.5602296000000013

Memory

Using memory_profiler package

Line #    Mem usage    Increment   Line Contents
================================================78     17.2 MiB     17.2 MiB   @profile79                             def profile_memory(nested_iterable: Iterable[Iterable[Any]]) -> None:80     68.6 MiB     51.4 MiB       result = list(flatten(flatten(copy_nested_iterable(nested_iterable))))

for "naive" approach and

Line #    Mem usage    Increment   Line Contents
================================================78     17.2 MiB     17.2 MiB   @profile79                             def profile_memory(nested_iterable: Iterable[Iterable[Any]]) -> None:80     68.7 MiB     51.4 MiB       result = list(flatten(flatten(copy_iterable(nested_iterable))))

for "improved" one.

Note: I've made different runs of script because making them at once won't be representative since second statement will reuse previously created under-the-hood int objects.


Conclusion

As we can see both functions have similar performance, but the last one supports deeper levels of nesting and looks pretty extensible.

Advertisement

I've added "improved" solution to lz package from 0.4.0 version which can be used like

>>> from lz.replication import replicate
>>> iterable = iter(range(5))
>>> list(map(list, replicate(iterable,count=3)))
[[0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]

It is property-based tested using hypothesis framework, so we may be sure that it works as expected.

https://en.xdnf.cn/q/70576.html

Related Q&A

ImportError: No module named cv2.cv

python 3.5 and windows 10I installed open cv using this command :pip install opencv_python-3.1.0-cp35-cp35m-win_amd64.whlThis command in python works fine :import cv2But when i want to import cv2.cv :i…

Why arent persistent connections supported by URLLib2?

After scanning the urllib2 source, it seems that connections are automatically closed even if you do specify keep-alive. Why is this?As it is now I just use httplib for my persistent connections... bu…

How to find accented characters in a string in Python?

I have a file with sentences, some of which are in Spanish and contain accented letters (e.g. ) or special characters (e.g. ). I have to be able to search for these characters in the sentence so I can…

Import error with PyQt5 : undefined symbol:_ZNSt12out_of_rangeC1EPKc,versionQt_5

I had watched a video on YouTube about making your own web browser using PyQt5. Link to video: https://youtu.be/z-5bZ8EoKu4, I found it interesting and decided to try it out on my system. Please note t…

Python MySQL module

Im developing a web application that needs to interface with a MySQL database, and I cant seem to find any really good modules out there for Python.Im specifically looking for fast module, capable of h…

How to calculate correlation coefficients using sklearn CCA module?

I need to measure similarity between feature vectors using CCA module. I saw sklearn has a good CCA module available: https://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.CCA.h…

cant open shape file with GeoPandas

I dont seem to be able to open the zip3.zip shape file I download from (http://www.vdstech.com/usa-data.aspx)Here is my code:import geopandas as gpd data = gpd.read_file("data/zip3.shp")this …

How to fix the error :range object is not callable in python3.6 [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.This question was caused by a typo or a problem that can no longer be reproduced. While similar q…

pyspark: Save schemaRDD as json file

I am looking for a way to export data from Apache Spark to various other tools in JSON format. I presume there must be a really straightforward way to do it.Example: I have the following JSON file jfil…

How to start a background thread when django server is up?

TL;DR In my django project, where do I put my "start-a-thread" code to make a thread run as soon as the django server is up?First of all, Happy New Year Everyone! This is a question from a n…