Add new column to a HuggingFace dataset

2024/10/9 4:24:49

My dataset has 5,000,000 rows, and I would like to add a column called 'embeddings' to it.

dataset = dataset.add_column('embeddings', embeddings)

The variable embeddings is a numpy memmap array of size (5000000, 512).

But I get this error:

ArrowInvalid                              Traceback (most recent call last)
----> 1 dataset = dataset.add_column('embeddings', embeddings)

/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
    486         }
    487         # apply actual function
--> 488         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    489         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    490         # re-apply format to the output

/opt/conda/lib/python3.8/site-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
    404             # Call actual function
    405
--> 406             out = func(self, *args, **kwargs)
    407
    408             # Update fingerprint of in-place transforms + update in-place history of transforms

/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in add_column(self, name, column, new_fingerprint)
   3346             :class:Dataset
   3347         """
-> 3348         column_table = InMemoryTable.from_pydict({name: column})
   3349         # Concatenate tables horizontally
   3350         table = ConcatenationTable.from_tables([self._data, column_table], axis=1)

/opt/conda/lib/python3.8/site-packages/datasets/table.py in from_pydict(cls, *args, **kwargs)
    367     @classmethod
    368     def from_pydict(cls, *args, **kwargs):
--> 369         return cls(pa.Table.from_pydict(*args, **kwargs))
    370
    371     @inject_arrow_table_documentation(pa.Table.from_batches)

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib._from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: only handle 1-dimensional arrays

How can I solve this, ideally in a memory-efficient way, since the embeddings array does not fit in RAM?

Answer
from datasets import load_dataset

ds = load_dataset("cosmos_qa", split="train")
new_column = ["foo"] * len(ds)
ds = ds.add_column("new_column", new_column)

This gives you a dataset with the new column:

Dataset({
    features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label', 'new_column'],
    num_rows: 25262
})