Existing Datasets¶
| Name | Nb classes | Image Size | Auto Download | Type |
|---|---|---|---|---|
| MNIST | 10 | 28x28x1 | YES | Images |
| Fashion MNIST | 10 | 28x28x1 | YES | Images |
| KMNIST | 10 | 28x28x1 | YES | Images |
| EMNIST | 10 | 28x28x1 | YES | Images |
| QMNIST | 10 | 28x28x1 | YES | Images |
| MNIST Fellowship | 30 | 28x28x1 | YES | Images |
| Rainbow MNIST | 10 | 28x28x3 | YES | Images |
| Colored MNIST | 2 | 28x28x3 | YES | Images |
| SVHN | 10 | 32x32x3 | YES | Images |
| Synbols | 50 | 32x32x3 | YES | Images |
| CIFAR10 | 10 | 32x32x3 | YES | Images |
| CIFAR100 | 100 | 32x32x3 | YES | Images |
| CIFAR Fellowship | 110 | 32x32x3 | YES | Images |
| STL10 | 10 | 96x96x3 | YES | Images |
| Omniglot | 964 | 105x105x1 | YES | Images |
| CTRL minus | 87 | 32x32x3 | YES | Images |
| CTRL plus | 87 | 32x32x3 | YES | Images |
| CTRL in | 87 | 32x32x3 | YES | Images |
| CTRL out | 97 | 32x32x3 | YES | Images |
| CTRL plastic | 97 | 32x32x3 | YES | Images |
| FER2013 | 7 | 48x48x1 | NO | Images |
| TinyImageNet200 | 200 | 64x64x3 | YES | Images |
| GTSRB | 43 | ~64x64x3 | YES | Images |
| EuroSAT | 10 | 64x64x3 | YES | Images |
| ImageNet100 | 100 | 224x224x3 | NO | Images |
| ImageNet1000 | 1000 | 224x224x3 | NO | Images |
| CORe50 | 50 | 224x224x3 | YES | Images |
| CORe50-v2-79 | 50 | 224x224x3 | YES | Images |
| CORe50-v2-196* | 50 | 224x224x3 | YES | Images |
| CORe50-v2-391 | 50 | 224x224x3 | YES | Images |
| Stanford Car196 | 196 | 224x224x3 | YES | Images |
| Stream-51 | 51 | 224x224x3 | YES | Images |
| Caltech101 | 101 (102) | 224x224x3 | YES | Images |
| Caltech256 | 257 | 224x224x3 | YES | Images |
| FGVC Aircraft | 30/70/100 | 224x224x3 | YES | Images |
| Food101 | 101 | 224x224x3 | YES | Images |
| DTD | 47 | 224x224x3 | YES | Images |
| AwA2 | 50 | 224x224x3 | YES | Images |
| CUB200 | 200 | 224x224x3 | YES | Images |
| Terra Incognita | 10 | 224x224x3 | YES | Images |
| PACS | 7 | 224x224x3 | NO | Images |
| VLCS | 5 | 224x224x3 | YES | Images |
| OfficeHome | 65 | 224x224x3 | NO | Images |
| DomainNet | 345 | 224x224x3 | YES | Images |
| Pascal-VOC 2007 | 20 | 224x224x3 | YES | Images |
| Pascal-VOC 2012 | 20 | 512x512x3 | YES | Segmentation Maps |
| MultiNLI | 5 | | YES | Text |
| OxfordFlower102 | 102 | ?x?x3 | YES | Images |
| OxfordPet | 37 | ?x?x3 | YES | Images |
| SUN397 | 397 | ?x?x3 | YES | Images |
| Birdsnap | 500 | ?x?x3 | YES | Images |
| MetaShift | 410 | ?x?x3 | YES | Images |
| Fluent Speech | 31 | ? | YES | Audio |
All datasets take the arguments train and download, like a torchvision dataset. These datasets are then modified to create continuum scenarios.
Once a dataset is created, it is fed to a scenario that will split it into multiple tasks.
Continuum supports many datasets implemented in torchvision, such as MNIST or CIFAR100:
from continuum import ClassIncremental
from continuum.datasets import MNIST
clloader = ClassIncremental(
    MNIST("/my/data/folder", download=True, train=True),
    increment=1,
    initial_increment=5
)
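With increment=1 and initial_increment=5, the first task holds 5 classes and each following task adds 1. The splitting logic can be sketched in plain NumPy (an illustration of the idea, not Continuum's actual implementation; the function name is hypothetical):

```python
import numpy as np

def class_incremental_split(y, initial_increment, increment):
    """Toy class-incremental split: returns, for each task, the indices
    of the samples whose class belongs to that task's increment."""
    classes = np.unique(y)
    boundaries = [initial_increment]
    while boundaries[-1] < len(classes):
        boundaries.append(boundaries[-1] + increment)
    tasks, start = [], 0
    for end in boundaries:
        task_classes = classes[start:end]
        tasks.append(np.where(np.isin(y, task_classes))[0])
        start = end
    return tasks

y = np.array([0, 1, 2, 3, 4, 5, 0, 5])
tasks = class_incremental_split(y, initial_increment=2, increment=2)
print([y[t].tolist() for t in tasks])  # [[0, 1, 0], [2, 3], [4, 5, 5]]
```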
The data of these small datasets can be downloaded automatically with the option download.
Larger datasets such as ImageNet or CORe50 are also available, although their initialization differs:
from continuum import ClassIncremental
from continuum.datasets import ImageNet1000
dataset_train = ImageNet1000("/my/data/folder/imagenet/train/", train=True)
dataset_test = ImageNet1000("/my/data/folder/imagenet/val/", train=False)
Note that Continuum cannot download ImageNet’s data; that’s on you! We also provide ImageNet100, a subset of 100 classes of ImageNet. The subset metadata is downloaded automatically, or you can provide it with the option data_subset.
Multiple versions of CORe50 are proposed. For all, the data can automatically be downloaded:
from continuum.datasets import Core50, Core50v2_79, Core50v2_196, Core50v2_391
dataset = Core50("/my/data/folder/", train=True, download=True)
dataset_79 = Core50v2_79("/my/data/folder/", train=True, download=True)
dataset_196 = Core50v2_196("/my/data/folder/", train=True, download=True)
dataset_391 = Core50v2_391("/my/data/folder/", train=True, download=True)
If you wish to learn CORe50 in the class-incremental scenario (NC), Core50 suffices. However, for the instance-incremental scenarios (NI and NIC), you need to use Core50v2_79, Core50v2_196, or Core50v2_391 (see our doc about it). Refer to the dataset’s official webpage for more information about the different versions.
In addition to Computer Vision datasets, Continuum also provides one NLP dataset:
from continuum.datasets import MultiNLI
dataset = MultiNLI("/my/data/folder", train=True, download=True)
The MultiNLI dataset provides text written in different styles and categories. This dataset can be used in Continual Learning in a New Instances (NI) setting where all categories are known from the start, but with styles being incrementally added.
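The NI idea can be sketched with a toy example (plain NumPy, hypothetical data, not Continuum's API): all categories appear in every task, and each task only introduces instances of a new style:

```python
import numpy as np

# Toy data: each sample has a category label and a style (domain) id.
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
styles = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# New Instances (NI): one task per style; every task contains the full
# set of categories, only the instances (styles) change across tasks.
tasks = [np.where(styles == s)[0] for s in np.unique(styles)]
for t, idx in enumerate(tasks):
    print(t, sorted(labels[idx].tolist()))  # each task sees classes {0, 1, 2}
```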
Adding Your Own Datasets¶
The goal of Continuum is to propose the most used benchmark scenarios of continual learning, but also to make the creation of new scenarios easy through an adaptable framework.
For example, the scenario types are easy to use with other datasets:
InMemoryDataset, for in-memory numpy array:
from continuum.datasets import InMemoryDataset
x_train, y_train = gen_numpy_array()
dataset = InMemoryDataset(x_train, y_train)
PyTorchDataset, for datasets defined in torchvision:
from torchvision.datasets import CIFAR10
from continuum.datasets import PyTorchDataset
dataset = PyTorchDataset("/my/data/folder/", dataset_type=CIFAR10, train=True, download=True)
ImageFolderDataset, for datasets having a tree-like structure, with one folder per class:
from continuum.datasets import ImageFolderDataset
dataset_train = ImageFolderDataset("/my/data/folder/train/")
dataset_test = ImageFolderDataset("/my/data/folder/test/")
Fellowship, to combine several continual datasets:
from continuum.datasets import CIFAR10, CIFAR100, Fellowship
dataset = Fellowship(
    datasets=[
        CIFAR10(data_path="/my/data/folder1/", train=True),
        CIFAR100(data_path="/my/data/folder1/", train=True)
    ],
    update_labels=True
)
The update_labels parameter determines whether the labels of the different datasets should be remapped so they do not overlap, or left as-is. The default value of update_labels is True. Note that Continuum already provides pre-made Fellowships:
from continuum.datasets import MNISTFellowship, CIFARFellowship
dataset_MNIST = MNISTFellowship("/my/data/folder", train=True)
dataset_CIFAR = CIFARFellowship("/my/data/folder", train=True)
You may want datasets that have a different transformation for each new task, e.g. MNIST with different rotations or pixel permutations. Continuum also handles this! However, it is scenario-specific, not dataset-specific, so look over the Scenario doc.
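Permuted MNIST, for instance, applies a different fixed pixel permutation per task. The idea can be sketched standalone in NumPy (an illustration of the mechanism, independent of Continuum's scenario classes; the helper name is made up):

```python
import numpy as np

def permute_task(images, seed):
    """Apply a fixed, task-specific pixel permutation to a batch of
    flattened images; a given seed always yields the same permutation."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(images.shape[1])
    return images[:, perm]

images = np.arange(12, dtype=float).reshape(3, 4)  # 3 images of 4 pixels
task0 = permute_task(images, seed=0)  # what the model sees in task 0
task1 = permute_task(images, seed=1)  # what the model sees in task 1
```

The same underlying data is presented to the model under a different, deterministic transformation in each task.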
Supervised setting without Continual¶
Continuum is awesome, but you don’t want to do continual learning? You simply want to train a model in a single run on the whole dataset? No problem.
All Continuum datasets can be directly converted to tasksets, which implement the PyTorch Dataset interface and can thus be given directly to a DataLoader.
Here is an example with MNIST, but all datasets work the same:
from torch.utils.data import DataLoader
from continuum.datasets import MNIST
dataset = MNIST("/my/data/folder", train=True, download=True)
taskset = dataset.to_taskset(
    trsf=None,  # Put your transformations here if you want some
    target_trsf=None  # Put your target transformations here if you want some
)
loader = DataLoader(taskset, batch_size=32, shuffle=True)
for x, y, t in loader:
    pass  # Your model here
Multi-target datasets¶
Continuum also allows a dataset to return multiple targets per data point. In class-incremental scenarios, only the first target is taken into account for designing the increments.
import numpy as np
from torch.utils.data import DataLoader

from continuum import ClassIncremental
from continuum.datasets import FluentSpeech

dataset = FluentSpeech("/my/data/folder", train=True)
scenario = ClassIncremental(dataset, increment=1)

for taskset in scenario:
    loader = DataLoader(taskset, batch_size=1)
    for x, y, t in loader:
        print(x.shape, y.shape, t.shape, np.unique(y[:, 0]))
        break
In this situation, the FluentSpeech dataset has 4 targets, but only the first one is used to determine the tasks.
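The first-target convention can be sketched with plain NumPy (hypothetical labels, not FluentSpeech's real targets): given a (N, num_targets) target array, only column 0 drives the split.

```python
import numpy as np

# Hypothetical multi-target labels: column 0 is the class used for the
# incremental split, the other columns are auxiliary targets.
y = np.array([
    [0, 7, 1],
    [1, 3, 0],
    [0, 2, 1],
    [1, 9, 9],
])

# Only the first target column defines task membership; the auxiliary
# columns travel along with their samples but do not affect the split.
tasks = {c: np.where(y[:, 0] == c)[0] for c in np.unique(y[:, 0])}
```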
The Different Data Types¶
Each dataset has a data type. Here are the ones already implemented:
IMAGE_ARRAY: The image is stored directly as an array. Example: MNIST dataset.
IMAGE_PATH: The image is stored as a path to the actual image file. Example: ImageNet dataset.
TEXT: For HuggingFace datasets.
TENSOR: Any kind of tensor.
SEGMENTATION: For the Continual Semantic Segmentation datasets.
OBJ_DETECTION: Still WIP.
H5: Tensors stored in an H5 dataset. For the end user, the usage is quite similar to TENSOR.
AUDIO: A string path to an audio file. Requires the SoundFile library.