Formatting common types of text datasets

This tutorial demonstrates how to convert text datasets from several popular Python dataset formats into a data file for Cleanlab Studio. Each section covers one data format and outlines the steps to create a data file that Cleanlab Studio can process directly. We focus on how to produce a properly formatted data file, not on how to run Cleanlab Studio on it; for that, refer to our text data quickstart tutorial.

Recall that Cleanlab Studio can be run directly on text datasets stored in one of the following formats: CSV, JSON, Excel, Pandas DataFrame. The application natively supports many other data formats, which are listed in this guide; refer to it if your data are not in one of the formats presented in this tutorial.

This tutorial covers the following Python data formats:

Huggingface Datasets
Tensorflow Datasets
Scikit-learn Datasets

The example below shows how a data file will look after formatting. After formatting, you can directly load the dataset into Cleanlab Studio.

,text,label
0,To make her feel threatened,fear
1,It might be linked to the trust factor of your friend.,neutral
2,"Super, thanks",gratitude
3,What does FPTP have to do with the referendum?,confusion
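
As a minimal sketch of this target layout, the snippet below builds the same four rows in a pandas DataFrame and writes them out as CSV text. The in-memory buffer and the variable names are assumptions chosen for illustration; they are not part of Cleanlab Studio.

```python
import io

import pandas as pd

# Four example rows, copied from the sample file above
example_df = pd.DataFrame(
    {
        "text": [
            "To make her feel threatened",
            "It might be linked to the trust factor of your friend.",
            "Super, thanks",
            "What does FPTP have to do with the referendum?",
        ],
        "label": ["fear", "neutral", "gratitude", "confusion"],
    }
)

# Write to an in-memory buffer instead of a file, just to inspect the CSV text
buffer = io.StringIO()
example_df.to_csv(buffer)  # keeps the default integer index as the unnamed first column
csv_text = buffer.getvalue()
print(csv_text)
```

Text containing a comma (such as "Super, thanks") is quoted automatically by pandas. Passing index=False to to_csv, as the formatting functions later in this tutorial do, omits the unnamed index column.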

Install and import required dependencies

You can use pip to install all packages required for this tutorial as follows:

%pip install pandas tensorflow-datasets tensorflow datasets scikit-learn

import pandas as pd

1. Huggingface Datasets

Here, we load the Go Emotions dataset, which consists of Reddit comments (text) labeled with 27 emotion categories plus Neutral (28 labels in total). This is a multi-label classification dataset, where more than one label can apply to a single text example.

For multi-class classification text datasets, you can still use the same workflow outlined below to get a data file for Cleanlab Studio.

# Load dataset from the Hub
from datasets import load_dataset, concatenate_datasets

emotion_dict = load_dataset("go_emotions")
emotion_dict

To find issues across all splits, we concatenate the splits into a single dataset.

# concatenate_datasets expects a list of datasets, so convert the dict values to a list
emotion_ds = concatenate_datasets(list(emotion_dict.values()))
emotion_ds

    Dataset({
        features: ['text', 'labels', 'id'],
        num_rows: 54263
    })

View a few examples from the dataset.

emotion_ds[:5]

    {'text': ["My favourite food is anything I didn't have to cook myself.",
      'Now if he does off himself, everyone will think hes having a laugh screwing with people instead of actually dead',
      'WHY THE FUCK IS BAYLESS ISOING',
      'To make her feel threatened',
      'Dirty Southern Wankers'],
     'labels': [[27], [27], [2], [14], [3]],
     'id': ['eebbqej', 'ed00q6i', 'eezlygj', 'ed7ypvh', 'ed0bdzj']}

Method for formatting Huggingface text dataset

def format_huggingface_text_dataset(
    dataset, text_key, label_key, output_csvpath, label_mapping
):
    """Convert a Huggingface text dataset to a Cleanlab Studio file format.

    dataset: datasets.Dataset
        HuggingFace text dataset
    text_key: str
        column name for text in dataset
    label_key: str
        column name for label in dataset
    output_csvpath: str
        filepath to save csv file
    label_mapping: Dict[int, str]
        mapping from integer label id to label str
        If labels are already strings, set label_mapping to None

    """
    df = dataset.to_pandas()
    df = df.rename(columns={text_key: "text", label_key: "label"})

    # Map integer labels to label strings, for example, 0 -> positive, 1 -> negative
    if label_mapping:
        if isinstance(dataset[0][label_key], list):
            df["label"] = [
                ",".join([label_mapping[label_id] for label_id in label_id_list])
                for label_id_list in df["label"]
            ]
        elif isinstance(dataset[0][label_key], int):
            df["label"] = [label_mapping[label_id] for label_id in df["label"]]

    # Save to csv
    df.to_csv(output_csvpath, index=False)
    print(f"Saved data file to {output_csvpath}")

    return

# construct mapping from id to label str
label_str_list = emotion_ds.features["labels"].feature.names
label_mapping = {i: name for i, name in enumerate(label_str_list)}
print(label_mapping)

    {0: 'admiration', 1: 'amusement', 2: 'anger', 3: 'annoyance', 4: 'approval', 5: 'caring', 6: 'confusion', 7: 'curiosity', 8: 'desire', 9: 'disappointment', 10: 'disapproval', 11: 'disgust', 12: 'embarrassment', 13: 'excitement', 14: 'fear', 15: 'gratitude', 16: 'grief', 17: 'joy', 18: 'love', 19: 'nervousness', 20: 'optimism', 21: 'pride', 22: 'realization', 23: 'relief', 24: 'remorse', 25: 'sadness', 26: 'surprise', 27: 'neutral'}
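
To make the mapping step concrete, here is a small sketch of how integer label ids become the strings written to the label column. It uses a hand-picked subset of the mapping printed above, and the names `example_mapping` and `ids_to_label_str` are hypothetical. Under those assumptions, it mirrors the comma-joined format that format_huggingface_text_dataset produces for multi-label rows.

```python
# Subset of the Go Emotions id -> name mapping printed above
example_mapping = {2: "anger", 3: "annoyance", 14: "fear", 27: "neutral"}


def ids_to_label_str(label_ids, mapping):
    """Map one example's label id(s) to a single label string.

    Hypothetical helper for illustration: a list of ids (multi-label) becomes
    a comma-separated string, a single int (multi-class) becomes one name.
    """
    if isinstance(label_ids, int):
        return mapping[label_ids]
    return ",".join(mapping[label_id] for label_id in label_ids)


print(ids_to_label_str([27], example_mapping))    # neutral
print(ids_to_label_str([2, 3], example_mapping))  # anger,annoyance
print(ids_to_label_str(14, example_mapping))      # fear
```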

Format the dataset and save to a csv file.

format_huggingface_text_dataset(
    dataset=emotion_ds,
    text_key="text",
    label_key="labels",
    label_mapping=label_mapping,
    output_csvpath="./emotion.csv",
)

Let’s view the data file created.

pd.read_csv("./emotion.csv").head(5)

	text	label	id
0	My favourite food is anything I didn't have to...	neutral	eebbqej
1	Now if he does off himself, everyone will thin...	neutral	ed00q6i
2	WHY THE FUCK IS BAYLESS ISOING	anger	eezlygj
3	To make her feel threatened	fear	ed7ypvh
4	Dirty Southern Wankers	annoyance	ed0bdzj

Now you can load the file ./emotion.csv into Cleanlab Studio, either using the Web Interface or Python API (see Load Dataset for more details on the latter).

2. Tensorflow Datasets

import tensorflow_datasets as tfds
import tensorflow as tf

Here, we load the IMDB Reviews dataset, which contains reviews classified as either positive or negative (binary classification data).

imdb_reviews, metadata = tfds.load(
    "imdb_reviews", split="train", with_info=True, as_supervised=True
)

View a few examples from the dataset.

tfds.as_dataframe(imdb_reviews.take(5), metadata)

	label	text
0	0 (neg)	This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.
1	0 (neg)	I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.
2	0 (neg)	Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Mountie telling the people of Dawson City, Yukon to elect themselves a marshal (yes a marshal!) and to enforce the law themselves, then gunfighters battling it out on the streets for control of the town? <br /><br />Nothing even remotely resembling that happened on the Canadian side of the border during the Klondike gold rush. Mr. Mann and company appear to have mistaken Dawson City for Deadwood, the Canadian North for the American Wild West.<br /><br />Canadian viewers be prepared for a Reefer Madness type of enjoyable howl with this ludicrous plot, or, to shake your head in disgust.
3	1 (pos)	This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and Nicolas Cage (as always) gently row the plot along. There are no rapids to cross, no dangerous waters, just a warm and witty paddle through New York life at its best. A family film in every sense and one that deserves the praise it received.
4	1 (pos)	As others have mentioned, all the women that go nude in this film are mostly absolutely gorgeous. The plot very ably shows the hypocrisy of the female libido. When men are around they want to be pursued, but when no "men" are around, they become the pursuers of a 14 year old boy. And the boy becomes a man really fast (we should all be so lucky at this age!). He then gets up the courage to pursue his true love.

Method for formatting Tensorflow text dataset

def format_tensorflow_text_dataset(
    dataset, metadata, text_key, label_key, output_csvpath, label_mapping
):
    """Convert a Tensorflow text dataset to a Studio file format.

    dataset: tf.data.Dataset
        Tensorflow text dataset
    metadata: tfds.core.DatasetInfo
        Info associated with dataset
    text_key: str
        column name for text in dataset
    label_key: str
        column name for label in dataset
    output_csvpath: str
        filepath to save csv file
    label_mapping: Dict[int, str]
        mapping from integer label id to label str
        If labels are already strings, set label_mapping to None

    """
    df = tfds.as_dataframe(dataset, metadata)

    # Map integer labels to label strings, for example, 0 -> positive, 1 -> negative
    if label_mapping:
        df[label_key] = [label_mapping[label_id] for label_id in df[label_key]]

    df = df.rename(columns={text_key: "text", label_key: "label"})

    # Save to csv
    df.to_csv(output_csvpath, index=False)
    print(f"Saved data file to {output_csvpath}")

    return

# construct mapping from id to label str
label_str_list = metadata.features["label"].names
label_mapping = {i: label_str for i, label_str in enumerate(label_str_list)}

Format the dataset and save to a csv file.

format_tensorflow_text_dataset(
    dataset=imdb_reviews,
    metadata=metadata,
    text_key="text",
    label_key="label",
    label_mapping=label_mapping,
    output_csvpath="./imdb_reviews.csv",
)

Let’s view the data file created.

pd.read_csv("./imdb_reviews.csv").head(5)

	label	text
0	neg	b"This was an absolutely terrible movie. Don't...
1	neg	b'I have been known to fall asleep during film...
2	neg	b'Mann photographs the Alberta Rocky Mountains...
3	pos	b'This is the kind of film for a snowy Sunday ...
4	pos	b'As others have mentioned, all the women that...
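
The b'...' prefixes in the text column above indicate that each review was saved as the string representation of a Python bytes object rather than as plain text. If you would rather store plain text, one option is to decode the column before calling to_csv. The sketch below shows the idea on a tiny hand-made DataFrame standing in for the real data; it assumes the text is UTF-8 encoded, and the variable names and sample rows are hypothetical.

```python
import pandas as pd

# Hand-made stand-in for a DataFrame whose text column holds bytes objects
raw_df = pd.DataFrame(
    {
        "label": [0, 1],
        "text": [b"This was an absolutely terrible movie.", b"A warm and witty film."],
    }
)

decoded_df = raw_df.copy()
# Decode bytes -> str (assumes UTF-8 encoding)
decoded_df["text"] = decoded_df["text"].str.decode("utf-8")

print(decoded_df["text"].tolist())
```

Inside format_tensorflow_text_dataset, the equivalent step would be a line such as df[text_key] = df[text_key].str.decode("utf-8") placed right after the DataFrame is created.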

Now you can load the file ./imdb_reviews.csv into Cleanlab Studio, either using the Web Interface or Python API (see Load Dataset for more details on the latter).

3. Scikit-learn Datasets

Install the scikit-learn library

%pip install -U scikit-learn

Here, we load the 20 newsgroups dataset, which consists of around 18,000 newsgroup posts on 20 topics, split into train and test sets (a multi-class text classification dataset).

from sklearn.datasets import fetch_20newsgroups

# Load the dataset
newsgroups_train = fetch_20newsgroups(subset="train")

The text and labels are stored in the data and target attributes of the dataset, respectively. Let’s view an example from the dataset and its corresponding label.

print(f"Label: {newsgroups_train.target[0]}")
print(newsgroups_train.data[0])

    Label: 7
    From: lerxst@wam.umd.edu (where's my thing)
    Subject: WHAT car is this!?
    Nntp-Posting-Host: rac3.wam.umd.edu
    Organization: University of Maryland, College Park
    Lines: 15
    
     I was wondering if anyone out there could enlighten me on this car I saw
    the other day. It was a 2-door sports car, looked to be from the late 60s/
    early 70s. It was called a Bricklin. The doors were really small. In addition,
    the front bumper was separate from the rest of the body. This is 
    all I know. If anyone can tellme a model name, engine specs, years
    of production, where this car is made, history, or whatever info you
    have on this funky looking car, please e-mail.
    
    Thanks,
    - IL
       ---- brought to you by your neighborhood Lerxst ----
    
    
    
    
    

In order to view the label names corresponding to integers, we can use the target_names attribute of the dataset object.

# View first 5 categories
newsgroups_train.target_names[:5]

    ['alt.atheism',
     'comp.graphics',
     'comp.os.ms-windows.misc',
     'comp.sys.ibm.pc.hardware',
     'comp.sys.mac.hardware']

Method for formatting sklearn text dataset

def format_sklearn_text_dataset(dataset, output_csvpath, label_mapping):
    """Convert a sklearn text dataset to a Studio file format.

    dataset: sklearn.utils.Bunch
        sklearn text dataset
    label_mapping: Dict[str, int]
        id to label str mapping
        If labels are already strings, set label_mapping to None
    output_csvpath: str
        filepath to save csv file

    """

    # Map integer labels to label strings, for example, 0 -> positive, 1 -> negative
    if label_mapping:
        label_col = [label_mapping[label_id] for label_id in dataset.target]
    else:
        label_col = dataset.target

    df = pd.DataFrame({"text": dataset.data, "label": label_col})

    # Save to csv
    df.to_csv(output_csvpath, index=False)
    print(f"Saved data file to {output_csvpath}")
    return

# construct mapping from id to label str
label_mapping = {
    i: label_str for i, label_str in enumerate(newsgroups_train.target_names)
}

Format the dataset and save to a csv file.

format_sklearn_text_dataset(newsgroups_train, "./newsgroups_train.csv", label_mapping)

Let’s view the data file created.

pd.read_csv("./newsgroups_train.csv").head(5)

	text	label
0	From: lerxst@wam.umd.edu (where's my thing)\nS...	rec.autos
1	From: guykuo@carson.u.washington.edu (Guy Kuo)...	comp.sys.mac.hardware
2	From: twillis@ec.ecn.purdue.edu (Thomas E Will...	comp.sys.mac.hardware
3	From: jgreen@amber (Joe Green)\nSubject: Re: W...	comp.graphics
4	From: jcm@head-cfa.harvard.edu (Jonathan McDow...	sci.space

Now you can load the file ./newsgroups_train.csv into Cleanlab Studio, either using the Web Interface or Python API (see Load Dataset for more details on the latter).
