Formatting common types of Python text datasets
This tutorial demonstrates how to format text data in various popular Python formats before uploading to Cleanlab Studio. Each section of the tutorial covers one specific data format and outlines the steps to create a data file that Cleanlab Studio can seamlessly process. In this tutorial, we focus on how to produce a properly formatted data file, not how to run Cleanlab Studio on it – for that refer to our text data quickstart tutorial.
Recall that Cleanlab Studio can be run directly on text datasets stored in one of the following formats: CSV, JSON, Excel, Pandas DataFrame. The application natively supports many other data formats, which are listed in this guide; refer to it instead if your data are not in one of the formats presented in this tutorial.
This tutorial covers the following Python data formats: Huggingface Datasets, Tensorflow datasets, and Sklearn datasets.
The example below shows how a data file looks after formatting. A file in this form can be loaded directly into Cleanlab Studio.
,text,label
0,To make her feel threatened,fear
1,It might be linked to the trust factor of your friend.,neutral
2,"Super, thanks",gratitude
3,What does FPTP have to do with the referendum?,confusion
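As a quick sanity check, a file in this layout can be inspected before upload. The following is a minimal sketch using only Python's standard library. The helper name `check_text_label_csv` and the specific checks (a `text` column, a `label` column, and no empty text) are illustrative assumptions, not requirements stated by Cleanlab Studio.

```python
import csv
import io


def check_text_label_csv(csv_file):
    """Return a list of problems found in a text/label CSV (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    problems = []
    columns = reader.fieldnames or []
    for required in ("text", "label"):
        if required not in columns:
            problems.append(f"missing column: {required}")
    if problems:
        return problems
    # Number rows from 2 because the header occupies the first row
    for row_number, row in enumerate(reader, start=2):
        if not (row["text"] or "").strip():
            problems.append(f"empty text on row {row_number}")
    return problems


# A few rows of the example file shown above, held in memory instead of on disk
example = io.StringIO(
    ",text,label\n"
    "0,To make her feel threatened,fear\n"
    '2,"Super, thanks",gratitude\n'
)
print(check_text_label_csv(example))
```

An empty list means none of the assumed checks failed. To check a file on disk, the same function could be called on a file object opened with `open(path, newline="")`.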
Install and import required dependencies
You can use pip to install all packages required for this tutorial as follows:
%pip install pandas tensorflow-datasets tensorflow datasets scikit-learn
import pandas as pd
1. Huggingface Datasets
Here, we load the Go Emotions dataset, which consists of Reddit comments (text) labeled with 27 emotion categories plus Neutral. This is a multi-label classification dataset, where more than one label can apply to a single text example.
For multi-class classification text datasets, you can still use the same workflow outlined below to get a data file for Cleanlab Studio.
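To illustrate the difference between the two label layouts, here is a minimal pure-Python sketch of how integer label ids might be turned into label strings. The three-entry `label_mapping` and the helper `labels_to_string` are made up for illustration (the real Go Emotions mapping is built later in this section). The comma-joined form for multi-label examples mirrors what the formatting function in this section produces.

```python
def labels_to_string(label_ids, label_mapping):
    """Convert one example's label id(s) to a label string (hypothetical helper)."""
    if isinstance(label_ids, int):
        # Multi-class: exactly one label per example
        return label_mapping[label_ids]
    # Multi-label: join all applicable label names with commas
    return ",".join(label_mapping[label_id] for label_id in label_ids)


# Toy mapping for illustration only, not the real Go Emotions label set
label_mapping = {0: "anger", 1: "fear", 2: "neutral"}

print(labels_to_string(2, label_mapping))       # neutral
print(labels_to_string([0, 1], label_mapping))  # anger,fear
```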
# Load dataset from the Hub
from datasets import load_dataset, concatenate_datasets
emotion_dict = load_dataset("go_emotions")
emotion_dict
To find issues across splits, we concatenate the splits into a single dataset.
# concatenate_datasets expects a list, so convert the dict values explicitly
emotion_ds = concatenate_datasets(list(emotion_dict.values()))
emotion_ds
View a few examples from the dataset.
emotion_ds[:5]
Method for formatting Huggingface text dataset
def format_huggingface_text_dataset(
    dataset, text_key, label_key, output_csvpath, label_mapping
):
    """Convert a Huggingface text dataset to a Cleanlab Studio file format.

    dataset: datasets.Dataset
        HuggingFace text dataset
    text_key: str
        column name for text in dataset
    label_key: str
        column name for label in dataset
    output_csvpath: str
        filepath to save csv file
    label_mapping: Dict[int, str]
        id to label str mapping
        If labels are already strings, set label_mapping to None
    """
    df = dataset.to_pandas()
    df = df.rename(columns={text_key: "text", label_key: "label"})

    # Map integer labels to label strings, for example, 0 -> positive, 1 -> negative
    if label_mapping:
        if isinstance(dataset[0][label_key], list):
            # Multi-label: join the label strings of each example with commas
            df["label"] = [
                ",".join([label_mapping[label_id] for label_id in label_id_list])
                for label_id_list in df["label"]
            ]
        elif isinstance(dataset[0][label_key], int):
            # Multi-class: one label string per example
            df["label"] = [label_mapping[label_id] for label_id in df["label"]]

    # Save to csv
    df.to_csv(output_csvpath, index=False)
    print(f"Saved data file to {output_csvpath}")
# construct mapping from id to label str
label_str_list = emotion_ds.features["labels"].feature.names
label_mapping = {i: name for i, name in enumerate(label_str_list)}
print(label_mapping)
Format the dataset and save to a csv file.
format_huggingface_text_dataset(
    dataset=emotion_ds,
    text_key="text",
    label_key="labels",
    label_mapping=label_mapping,
    output_csvpath="./emotion.csv",
)
Let’s view the data file created.
pd.read_csv("./emotion.csv").head(5)
 | text | label | id |
---|---|---|---|
0 | My favourite food is anything I didn't have to... | neutral | eebbqej |
1 | Now if he does off himself, everyone will thin... | neutral | ed00q6i |
2 | WHY THE FUCK IS BAYLESS ISOING | anger | eezlygj |
3 | To make her feel threatened | fear | ed7ypvh |
4 | Dirty Southern Wankers | annoyance | ed0bdzj |
Now you can load the file ./emotion.csv into Cleanlab Studio, either using the Web Interface or Python API (see Load Dataset for more details on the latter).
2. Tensorflow datasets
import tensorflow_datasets as tfds
import tensorflow as tf
Here, we load the IMDB Reviews dataset, which contains reviews classified as either positive or negative (binary classification data).
imdb_reviews, metadata = tfds.load(
    "imdb_reviews", split="train", with_info=True, as_supervised=True
)
View a few examples from the dataset.
tfds.as_dataframe(imdb_reviews.take(5), metadata)
 | label | text |
---|---|---|
0 | 0 (neg) | This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it. |
1 | 0 (neg) | I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all. |
2 | 0 (neg) | Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Mountie telling the people of Dawson City, Yukon to elect themselves a marshal (yes a marshal!) and to enforce the law themselves, then gunfighters battling it out on the streets for control of the town? <br /><br />Nothing even remotely resembling that happened on the Canadian side of the border during the Klondike gold rush. Mr. Mann and company appear to have mistaken Dawson City for Deadwood, the Canadian North for the American Wild West.<br /><br />Canadian viewers be prepared for a Reefer Madness type of enjoyable howl with this ludicrous plot, or, to shake your head in disgust. |
3 | 1 (pos) | This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and Nicolas Cage (as always) gently row the plot along. There are no rapids to cross, no dangerous waters, just a warm and witty paddle through New York life at its best. A family film in every sense and one that deserves the praise it received. |
4 | 1 (pos) | As others have mentioned, all the women that go nude in this film are mostly absolutely gorgeous. The plot very ably shows the hypocrisy of the female libido. When men are around they want to be pursued, but when no "men" are around, they become the pursuers of a 14 year old boy. And the boy becomes a man really fast (we should all be so lucky at this age!). He then gets up the courage to pursue his true love. |
Method for formatting Tensorflow text dataset
def format_tensorflow_text_dataset(
    dataset, metadata, text_key, label_key, output_csvpath, label_mapping
):
    """Convert a Tensorflow text dataset to a Cleanlab Studio file format.

    dataset: tf.data.Dataset
        Tensorflow text dataset
    metadata: tfds.core.DatasetInfo
        Info associated with dataset
    text_key: str
        column name for text in dataset
    label_key: str
        column name for label in dataset
    output_csvpath: str
        filepath to save csv file
    label_mapping: Dict[int, str]
        id to label str mapping
        If labels are already strings, set label_mapping to None
    """
    df = tfds.as_dataframe(dataset, metadata)

    # Map integer labels to label strings, for example, 0 -> positive, 1 -> negative
    if label_mapping:
        df[label_key] = [label_mapping[label_id] for label_id in df[label_key]]

    df = df.rename(columns={text_key: "text", label_key: "label"})

    # Save to csv
    df.to_csv(output_csvpath, index=False)
    print(f"Saved data file to {output_csvpath}")
# construct mapping from id to label str
label_str_list = metadata.features["label"].names
label_mapping = {i: label_str for i, label_str in enumerate(label_str_list)}
Format the dataset and save to a csv file.
format_tensorflow_text_dataset(
    dataset=imdb_reviews,
    metadata=metadata,
    text_key="text",
    label_key="label",
    label_mapping=label_mapping,
    output_csvpath="./imdb_reviews.csv",
)
Let’s view the data file created.
pd.read_csv("./imdb_reviews.csv").head(5)
 | label | text |
---|---|---|
0 | neg | b"This was an absolutely terrible movie. Don't... |
1 | neg | b'I have been known to fall asleep during film... |
2 | neg | b'Mann photographs the Alberta Rocky Mountains... |
3 | pos | b'This is the kind of film for a snowy Sunday ... |
4 | pos | b'As others have mentioned, all the women that... |
Now you can load the file ./imdb_reviews.csv into Cleanlab Studio, either using the Web Interface or Python API (see Load Dataset for more details on the latter).
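Note that the text values in the preview above carry a `b'...'` prefix, most likely because the dataset yields text as byte strings and they were written to the CSV in that form. If plain text is preferred, one option is to decode the values before saving. The sketch below is a minimal illustration that assumes UTF-8 encoded text; `decode_if_bytes` is a hypothetical helper, not part of TensorFlow Datasets or Cleanlab Studio.

```python
def decode_if_bytes(value, encoding="utf-8"):
    """Return str for bytes input; leave other values unchanged (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw_texts = [b"This was an absolutely terrible movie.", "already a str"]
clean_texts = [decode_if_bytes(text) for text in raw_texts]
print(clean_texts)

# With a pandas DataFrame, the same helper could be applied to a column
# before calling to_csv, e.g. df["text"] = df["text"].map(decode_if_bytes)
```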
3. Sklearn datasets
Install the scikit-learn library
pip install -U scikit-learn
Here, we load the 20 newsgroups dataset, which consists of around 18,000 newsgroup posts on 20 topics, split into train and test sets (a multi-class text classification dataset).
from sklearn.datasets import fetch_20newsgroups
# Load the dataset
newsgroups_train = fetch_20newsgroups(subset="train")
We can view the text and label from the data and target attributes of the dataset. Let’s view an example from the dataset and its corresponding label.
print(f"Label: {newsgroups_train.target[0]}")
print(newsgroups_train.data[0])
To view the label names corresponding to the integer labels, we can use the target_names attribute of the dataset object.
# View first 5 categories
newsgroups_train.target_names[:5]
Method for formatting sklearn text dataset
def format_sklearn_text_dataset(dataset, output_csvpath, label_mapping):
    """Convert a sklearn text dataset to a Cleanlab Studio file format.

    dataset: sklearn.utils.Bunch
        sklearn text dataset
    output_csvpath: str
        filepath to save csv file
    label_mapping: Dict[int, str]
        id to label str mapping
        If labels are already strings, set label_mapping to None
    """
    # Map integer labels to label strings, for example, 0 -> positive, 1 -> negative
    if label_mapping:
        label_col = [label_mapping[label_id] for label_id in dataset.target]
    else:
        label_col = dataset.target

    df = pd.DataFrame({"text": dataset.data, "label": label_col})

    # Save to csv
    df.to_csv(output_csvpath, index=False)
    print(f"Saved data file to {output_csvpath}")
# construct mapping from id to label str
label_mapping = {
    i: label_str for i, label_str in enumerate(newsgroups_train.target_names)
}
Format the dataset and save to a csv file.
format_sklearn_text_dataset(newsgroups_train, "./newsgroups_train.csv", label_mapping)
Let’s view the data file created.
pd.read_csv("./newsgroups_train.csv").head(5)
 | text | label |
---|---|---|
0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | rec.autos |
1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | comp.sys.mac.hardware |
2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | comp.sys.mac.hardware |
3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | comp.graphics |
4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | sci.space |
Now you can load the file ./newsgroups_train.csv into Cleanlab Studio, either using the Web Interface or Python API (see Load Dataset for more details on the latter).