Hugging Face Course Notes: Chapter 5

Hugging Face
Pytorch
Deep Learning
NLP
LLM
HuggingFaceHub
Author

Ismail TG

Published

February 27, 2024

!pip install datasets
Collecting datasets
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.7/536.7 kB 8.0 MB/s eta 0:00:00
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.13.1)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from datasets) (1.25.2)
Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (14.0.2)
Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets) (0.6)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 14.0 MB/s eta 0:00:00
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (1.5.3)
Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.31.0)
Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (4.66.2)
Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.4.1)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.8/134.8 kB 16.7 MB/s eta 0:00:00
Requirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2023.6.0)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.9.3)
Requirement already satisfied: huggingface-hub>=0.19.4 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.20.3)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets) (23.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.1)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (23.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.4.1)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.5)
Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.4)
Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.19.4->datasets) (4.9.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2024.2.2)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2023.4)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->datasets) (1.16.0)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.17.1 dill-0.3.8 multiprocess-0.70.16
!pip install transformers
Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.37.2)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.13.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.20.3)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.25.2)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (23.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.1)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2023.12.25)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.31.0)
Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.15.2)
Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.2)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.2)
Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.19.3->transformers) (2023.6.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.19.3->transformers) (4.9.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.2.2)
from datasets import load_dataset
from transformers import AutoTokenizer

Introduction:

  • In Chapter 3 we learned how to use the Datasets library by loading a dataset from the Hub, building a compute_metrics() function, and using the Dataset.map() method; however, these functionalities don’t represent the whole picture of the library.
  • In this chapter we will go deeper and try to understand:
    • How to load a dataset when it’s not available on the Hub
    • How to slice and dice a dataset
    • What to do when the dataset is large
    • What are “memory mapping” and Apache Arrow?
    • How to create our own dataset and push it to the Hub?

What if my dataset isn’t on the Hub?

  • Often the dataset we want to work with isn’t on the Hub. In this section we’ll show how Hugging Face Datasets can be used to load datasets that aren’t available on the Hugging Face Hub.

Working with local and remote datasets:

  • Hugging Face Datasets provides loading scripts to handle loading local and remote datasets. It supports several common data formats, such as:
load_dataset("csv", data_files="my_file.csv")
load_dataset("text", data_files="my_file.txt")
load_dataset("json", data_files="my_file.jsonl")
load_dataset("pandas", data_files="my_dataframe.pkl")



  • As shown, for each data format we just need to specify the type of loading script in the load_dataset() function, along with a data_files argument that specifies the path to one or more files.
  • First we will deal with loading a dataset from local files; later we’ll see how to do the same with remote files.
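  • Note that data_files is quite flexible: it also accepts a list of files or a glob pattern. A minimal sketch, using hypothetical file names:
load_dataset("csv", data_files=["reviews_part1.csv", "reviews_part2.csv"])  # list of files -> combined into one train split
load_dataset("csv", data_files="data/*.csv")  # glob pattern matching every CSV under data/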



Loading a local dataset:

  • Here we’ll use the SQuAD-it dataset (https://github.com/crux82/squad-it/), which is a large-scale dataset for question answering in Italian.

    • The training and test splits are hosted on GitHub, so we can download them with a simple wget command:


!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz
--2024-02-27 13:44:06--  https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/crux82/squad-it/master/SQuAD_it-train.json.gz [following]
--2024-02-27 13:44:06--  https://raw.githubusercontent.com/crux82/squad-it/master/SQuAD_it-train.json.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7725286 (7.4M) [application/octet-stream]
Saving to: ‘SQuAD_it-train.json.gz’

SQuAD_it-train.json 100%[===================>]   7.37M  --.-KB/s    in 0.1s    

2024-02-27 13:44:06 (68.4 MB/s) - ‘SQuAD_it-train.json.gz’ saved [7725286/7725286]

--2024-02-27 13:44:06--  https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/crux82/squad-it/master/SQuAD_it-test.json.gz [following]
--2024-02-27 13:44:07--  https://raw.githubusercontent.com/crux82/squad-it/master/SQuAD_it-test.json.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1051245 (1.0M) [application/octet-stream]
Saving to: ‘SQuAD_it-test.json.gz’

SQuAD_it-test.json. 100%[===================>]   1.00M  --.-KB/s    in 0.07s   

2024-02-27 13:44:07 (15.1 MB/s) - ‘SQuAD_it-test.json.gz’ saved [1051245/1051245]


  • We need to decompress them first:
!gzip -dkv SQuAD_it-*.json.gz
SQuAD_it-test.json.gz:   87.5% -- created SQuAD_it-test.json
SQuAD_it-train.json.gz:  82.3% -- created SQuAD_it-train.json
!rm -rf SQuAD_it-*.json.gz
  • Now we can load our dataset from the local file as if it were on the Hub:
dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")
  • By default, loading local files creates a DatasetDict with a single train split:
dataset
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})
  • We can see that we have 442 rows; let’s look at one of them:
dataset['train'][0]
  • Great, we’ve loaded our first local dataset! But while this worked for the training set, what we really want is to include both the train and test splits in a single DatasetDict object so we can apply Dataset.map() functions across both splits at once.
  • To do this, we can provide a dictionary to the data_files argument that maps each split name to a file associated with that split:
# the proper way to load a dataset from local files:
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
  • This is exactly what we wanted. Now we can apply various preprocessing techniques to clean up the data, tokenize the text, and so on.
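  • For instance, a minimal (purely illustrative) sketch of using Dataset.map() to add a column counting how many paragraphs each article contains:
# illustrative only: adds a num_paragraphs column to every split
squad_it_dataset = squad_it_dataset.map(lambda x: {"num_paragraphs": len(x["paragraphs"])})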

  • The loading scripts in Datasets actually support automatic decompression of the input files, so we could have skipped gzip entirely by pointing the data_files argument directly at the compressed files:

data_files = {'train': 'SQuAD_it-train.json.gz', 'test': 'SQuAD_it-test.json.gz'}
dataset = load_dataset('json', data_files=data_files, field= 'data')
dataset
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
  • The automatic decompression also applies to other common formats like ZIP and TAR, so you just need to point data_files to the compressed files and you’re good to go!
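  • For example, a minimal sketch (with a hypothetical archive name) of loading CSV files straight from a ZIP archive:
# hypothetical archive; Datasets decompresses it on the fly
csv_dataset = load_dataset("csv", data_files="my_reviews.zip")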

Loading a remote dataset:

  • Loading remote files is very similar to loading local ones: we just point data_files to the URL where the data is stored instead of providing a path to local files.
  • For example, the SQuAD-it dataset is hosted on GitHub, so we can build our dataset directly from those URLs:
data_files = {'train':'https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz',
              'test': 'https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz'}
dataset = load_dataset('json', data_files=data_files, field='data')

Time to slice and dice:

  • In this section we will explore various features the Datasets library provides to clean and prepare a dataset for the next steps.

Slicing and dicing our data:

  • Like Pandas, Datasets provides several functions to manipulate the contents of Dataset and DatasetDict objects; we have already used the .map() method.

  • Here we will use the Drug Review Dataset from the UC Irvine Machine Learning Repository, which contains patient reviews on various drugs, along with the condition being treated and a 10-star rating of the patient’s satisfaction.

  • First we need to download it and unzip it:

!wget https://archive.ics.uci.edu/static/public/462/drug+review+dataset+drugs+com.zip
--2024-02-27 14:22:46--  https://archive.ics.uci.edu/static/public/462/drug+review+dataset+drugs+com.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘drug+review+dataset+drugs+com.zip’

drug+review+dataset     [       <=>          ]  41.00M  30.7MB/s    in 1.3s    

2024-02-27 14:22:48 (30.7 MB/s) - ‘drug+review+dataset+drugs+com.zip’ saved [42989872]
!unzip drug+review+dataset+drugs+com.zip
Archive:  drug+review+dataset+drugs+com.zip
  inflating: drugsComTest_raw.tsv    
  inflating: drugsComTrain_raw.tsv   
  • We will use the csv loading script here even though we have TSV files; we just need to specify the delimiter argument in the load_dataset() function as follows:
data_files = {'train': 'drugsComTrain_raw.tsv',
              'test': 'drugsComTest_raw.tsv'}
dataset = load_dataset('csv', data_files=data_files, delimiter= '\t')
  • A good practice when doing any sort of data analysis is to grab a small random sample to get a quick feel for the type of data you’re working with. In Datasets, we can create a random sample by chaining the Dataset.shuffle() and Dataset.select() functions together:
drug_sample = dataset['train'].shuffle(seed= 42).select(range(1000))
drug_sample[:2]
{'Unnamed: 0': [87571, 178045],
 'drugName': ['Naproxen', 'Duloxetine'],
 'condition': ['Gout, Acute', 'ibromyalgia'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."'],
 'rating': [9.0, 3.0],
 'date': ['September 2, 2015', 'November 7, 2011'],
 'usefulCount': [36, 13]}
  • Here we fixed the seed for reproducibility.

  • The .select() method expects an iterable of indices, so we pass it range(1000).

  • Then we slice that sample the usual Python way with [:2].

  • From this sample we can already see a few quirks in our dataset:

    • The Unnamed: 0 column looks suspiciously like an anonymized ID for each patient.
    • The condition column includes a mix of uppercase and lowercase labels.
    • The reviews are of varying length and contain a mix of Python line separators (\r\n) as well as HTML character codes like &#039;.
  • We will address each issue using the Datasets library:

  • First, the Unnamed: 0 column looks like an ID for each patient, so let’s check whether each of those IDs is unique:

for split in dataset.keys():
    assert len(dataset[split]) == len(dataset[split].unique("Unnamed: 0"))
  • It’s better to rename that column to something more meaningful:
dataset = dataset.rename_column('Unnamed: 0', 'patient_id')
dataset
DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})
  • Now we want to lowercase all the condition values. This can easily be achieved with the .map() method:
def lower_case(example):
  return {'condition': example['condition'].lower()}
dataset.map(lower_case)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-27-b34dfc422493> in <cell line: 1>()
----> 1 dataset.map(lower_case)

/usr/local/lib/python3.10/dist-packages/datasets/dataset_dict.py in map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_names, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, desc)
    866             cache_file_names = {k: None for k in self}
    867         return DatasetDict(
--> 868             {
    869                 k: dataset.map(
    870                     function=function,

/usr/local/lib/python3.10/dist-packages/datasets/dataset_dict.py in <dictcomp>(.0)
    867         return DatasetDict(
    868             {
--> 869                 k: dataset.map(
    870                     function=function,
    871                     with_indices=with_indices,

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
    591             self: "Dataset" = kwargs.pop("self")
    592         # apply actual function
--> 593         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    594         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    595         for dataset in datasets:

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
    556         }
    557         # apply actual function
--> 558         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    559         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    560         # re-apply format to the output

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   3103                     desc=desc or "Map",
   3104                 ) as pbar:
-> 3105                     for rank, done, content in Dataset._map_single(**dataset_kwargs):
   3106                         if done:
   3107                             shards_done += 1

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in _map_single(shard, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset)
   3456                     _time = time.time()
   3457                     for i, example in shard_iterable:
-> 3458                         example = apply_function_on_filtered_inputs(example, i, offset=offset)
   3459                         if update_data:
   3460                             if i == 0:

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in apply_function_on_filtered_inputs(pa_inputs, indices, check_same_num_examples, offset)
   3359             if with_rank:
   3360                 additional_args += (rank,)
-> 3361             processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
   3362             if isinstance(processed_inputs, LazyDict):
   3363                 processed_inputs = {

<ipython-input-26-98b6e79f37ab> in lower_case(example)
      1 def lower_case(example):
----> 2   return {'condition': example['condition'].lower()}

AttributeError: 'NoneType' object has no attribute 'lower'
  • It seems that some of the condition values are None, which our mapping function can’t handle.
  • Let’s drop these rows using Dataset.filter(), which works in a similar way to Dataset.map() and expects a function that receives a single example of the dataset.
  • We will use a lambda function:
dataset = dataset.filter(lambda x: x['condition'] is not None)
  • Now let’s apply the lower_case() function with the .map() method:
dataset = dataset.map(lower_case)

Creating new columns:

  • When dealing with review datasets, it’s good practice to count the number of words in each review, so let’s create a function that achieves that goal:
def compute_review_length(example):
  return {'review_length': len(example['review'].split())}
  • compute_review_length() returns a dictionary whose key does not correspond to one of the column names in the dataset. In this case, when compute_review_length() is passed to Dataset.map(), it will be applied to all the rows in the dataset to create a new review_length column:
dataset = dataset.map(compute_review_length)
dataset['train'][2]
{'patient_id': 92703,
 'drugName': 'Lybrel',
 'condition': 'birth control',
 'review': '"I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\r\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas."',
 'rating': 5.0,
 'date': 'December 14, 2009',
 'usefulCount': 17,
 'review_length': 134}
  • As expected, we can see a review_length column has been added to our training set. We can sort this new column with Dataset.sort() to see what the extreme values look like:
dataset['train'].sort('review_length')[:3]
{'patient_id': [111469, 13653, 53602],
 'drugName': ['Ledipasvir / sofosbuvir',
  'Amphetamine / dextroamphetamine',
  'Alesse'],
 'condition': ['hepatitis c', 'adhd', 'birth control'],
 'review': ['"Headache"', '"Great"', '"Awesome"'],
 'rating': [10.0, 10.0, 10.0],
 'date': ['February 3, 2015', 'October 20, 2009', 'November 23, 2015'],
 'usefulCount': [41, 3, 0],
 'review_length': [1, 1, 1]}
  • We can delete the rows whose reviews contain fewer than 30 words by combining a lambda function with .filter():
dataset = dataset.filter(lambda x: x['review_length'] > 30)
  • The last problem we have to deal with is HTML character codes in the reviews. We can use Python’s html module to unescape them, like so:
import html
  • We’ll use Dataset.map() to unescape all the HTML characters in our corpus:
dataset = dataset.map(lambda x: {'review': html.unescape(x['review'])})

The map() method’s superpowers:

  • The last mapping we applied to the dataset took almost 40 seconds to execute. This can be reduced by passing batched=True to the .map() method, which feeds the function batches of examples (dictionaries of lists) so it can process many elements at once; inside the function, a list comprehension is typically faster than a regular for loop, which speeds things up further:
new_dataset = dataset.map(lambda x: {'review': [html.unescape(o) for o in x['review']]}, batched= True)
  • This time our map() call executes in about 1 second, roughly 40x faster!

  • Using .map() with batched=True is a very powerful tool that will help us later when working with the tokenizer.

  • Now let’s tokenize our dataset, since we’re done with the EDA part:

mdl = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(mdl)
def tokenizer_function(example):
  return tokenizer(example['review'], truncation= True)
tokenized_ds = dataset.map(tokenizer_function, batched=True)
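  • As an aside (not part of the original run), Dataset.map() also accepts a num_proc argument to spread the work across several processes; a minimal sketch (num_proc=4 is an arbitrary choice, and since fast tokenizers already parallelize when batched=True, the extra gain may be modest):
# same tokenization as above, but split across 4 worker processes
tokenized_ds = dataset.map(tokenizer_function, batched=True, num_proc=4)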
  • All of this functionality condensed into a single method is already pretty amazing, but there’s more! With Dataset.map() and batched=True you can change the number of elements in your dataset. This is super useful in many situations where you want to create several training features from one example, and we will need to do this as part of the preprocessing for several of the NLP tasks we’ll undertake in Chapter 7.

  • In machine learning, an example is usually defined as the set of features that we feed to the model. In some contexts, these features will be the set of columns in a Dataset, but in others (like here and for question answering), multiple features can be extracted from a single example and belong to a single column.

  • Let’s have a look at how it works! Here we will tokenize our examples and truncate them to a maximum length of 128, but we will ask the tokenizer to return all the chunks of the texts instead of just the first one. This can be done with return_overflowing_tokens=True:

def tokenize_split(example):
  return tokenizer(
      example['review'],
      truncation= True,
      max_length= 128,
      return_overflowing_tokens= True
  )
  • Let’s test this before we pass it to the map() function:
res = tokenize_split(dataset['train'][0])
[len(inp) for inp in res['input_ids']]
[128, 45]
  • So, our first example in the training set became two features because it was tokenized to more than the maximum number of tokens we specified: the first one of length 128 and the second one of length 45. Now let’s do this for all elements of the dataset!
tokenized_ds = dataset.map(tokenize_split, batched= True)
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-47-4c142a3791ef> in <cell line: 1>()
----> 1 tokenized_ds = dataset.map(tokenize_split, batched= True)

/usr/local/lib/python3.10/dist-packages/datasets/dataset_dict.py in map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_names, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, desc)
    866             cache_file_names = {k: None for k in self}
    867         return DatasetDict(
--> 868             {
    869                 k: dataset.map(
    870                     function=function,

/usr/local/lib/python3.10/dist-packages/datasets/dataset_dict.py in <dictcomp>(.0)
    867         return DatasetDict(
    868             {
--> 869                 k: dataset.map(
    870                     function=function,
    871                     with_indices=with_indices,

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
    591             self: "Dataset" = kwargs.pop("self")
    592         # apply actual function
--> 593         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    594         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    595         for dataset in datasets:

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
    556         }
    557         # apply actual function
--> 558         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    559         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    560         # re-apply format to the output

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   3103                     desc=desc or "Map",
   3104                 ) as pbar:
-> 3105                     for rank, done, content in Dataset._map_single(**dataset_kwargs):
   3106                         if done:
   3107                             shards_done += 1

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in _map_single(shard, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset)
   3499                                 writer.write_table(pa.Table.from_pandas(batch))
   3500                             else:
-> 3501                                 writer.write_batch(batch)
   3502                         num_examples_progress_update += num_examples_in_batch
   3503                         if time.time() > _time + config.PBAR_REFRESH_TIME_INTERVAL:

/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py in write_batch(self, batch_examples, writer_batch_size)
    567                 inferred_features[col] = typed_sequence.get_inferred_type()
    568         schema = inferred_features.arrow_schema if self.pa_writer is None else self.schema
--> 569         pa_table = pa.Table.from_arrays(arrays, schema=schema)
    570         self.write_table(pa_table, writer_batch_size)
    571 

/usr/local/lib/python3.10/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_arrays()

/usr/local/lib/python3.10/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.validate()

/usr/local/lib/python3.10/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Column 8 named input_ids expected length 1000 but got length 1447
  • Looking at the error message gives us a clue: there is a mismatch in the lengths of one of the columns, one being of length 1,447 and the other of length 1,000. Those 1,000 examples produced 1,447 new features, resulting in a shape error.
  • The problem is that we’re trying to mix two datasets of different sizes: the original dataset columns have a certain number of examples (the 1,000 in our error), but the tokenized dataset we are building has more (the 1,447 in the error message; it is more than 1,000 because we are tokenizing long reviews into more than one example by using return_overflowing_tokens=True). That doesn’t work for a Dataset, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset. We can do the former with the remove_columns argument:
tokenized_ds = dataset.map(tokenize_split, batched= True, remove_columns= dataset['train'].column_names)
len(tokenized_ds['train'])
204198
len(dataset['train'])
138514
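  • As expected, the tokenized dataset has many more rows than the original one. The other option mentioned above, keeping the old columns but making them the same size as the new ones, can be sketched using the overflow_to_sample_mapping field that the tokenizer returns when return_overflowing_tokens=True (a minimal sketch, not part of the original run):
def tokenize_and_split(example):
  result = tokenizer(
      example['review'],
      truncation=True,
      max_length=128,
      return_overflowing_tokens=True
  )
  # map each new chunk back to the original example it came from
  sample_map = result.pop('overflow_to_sample_mapping')
  for key, values in example.items():
    result[key] = [values[i] for i in sample_map]
  return result

tokenized_ds = dataset.map(tokenize_and_split, batched=True)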