Existing Datasets

| Name             | Nb classes | Image Size | Auto Download | Type |
|------------------|------------|------------|---------------|------|
| MNIST            | 10         | 28x28x1    | YES | Images |
| Fashion MNIST    | 10         | 28x28x1    | YES | Images |
| KMNIST           | 10         | 28x28x1    | YES | Images |
| EMNIST           | 10         | 28x28x1    | YES | Images |
| QMNIST           | 10         | 28x28x1    | YES | Images |
| SVHN             | 10         | 32x32x3    | YES | Images |
| MNIST Fellowship | 30         | 28x28x1    | YES | Images |
| Rainbow MNIST    | 10         | 28x28x3    | YES | Images |
| Colored MNIST    | 2          | 28x28x3    | YES | Images |
| Synbols          | 50         | 32x32x3    | YES | Images |
| CIFAR10          | 10         | 32x32x3    | YES | Images |
| CIFAR100         | 100        | 32x32x3    | YES | Images |
| CIFAR Fellowship | 110        | 32x32x3    | YES | Images |
| STL10            | 10         | 96x96x3    | YES | Images |
| Omniglot         | 964        | 105x105x1  | YES | Images |
| CTRL minus       | 87         | 32x32x3    | YES | Images |
| CTRL plus        | 87         | 32x32x3    | YES | Images |
| CTRL in          | 87         | 32x32x3    | YES | Images |
| CTRL out         | 97         | 32x32x3    | YES | Images |
| CTRL plastic     | 97         | 32x32x3    | YES | Images |
| FER2013          | 7          | 48x48x1    | NO  | Images |
| TinyImageNet200  | 200        | 64x64x3    | YES | Images |
| GTSRB            | 43         | ~64x64x3   | YES | Images |
| EuroSAT          | 10         | 64x64x3    | YES | Images |
| ImageNet100      | 100        | 224x224x3  | NO  | Images |
| ImageNet1000     | 1000       | 224x224x3  | NO  | Images |
| CORe50           | 50         | 224x224x3  | YES | Images |
| CORe50-v2-79     | 50         | 224x224x3  | YES | Images |
| CORe50-v2-196*   | 50         | 224x224x3  | YES | Images |
| CORe50-v2-391    | 50         | 224x224x3  | YES | Images |
| Stanford Car196  | 196        | 224x224x3  | YES | Images |
| Stream-51        | 51         | 224x224x3  | YES | Images |
| Caltech101       | 101 (102)  | 224x224x3  | YES | Images |
| Caltech256       | 257        | 224x224x3  | YES | Images |
| FGVC Aircraft    | 30/70/100  | 224x224x3  | YES | Images |
| Food101          | 101        | 224x224x3  | YES | Images |
| DTD              | 47         | 224x224x3  | YES | Images |
| AwA2             | 50         | 224x224x3  | YES | Images |
| CUB200           | 200        | 224x224x3  | YES | Images |
| Terra Incognita  | 10         | 224x224x3  | YES | Images |
| PACS             | 7          | 224x224x3  | NO  | Images |
| VLCS             | 5          | 224x224x3  | YES | Images |
| OfficeHome       | 65         | 224x224x3  | NO  | Images |
| DomainNet        | 345        | 224x224x3  | YES | Images |
| Pascal-VOC 2007  | 20         | 224x224x3  | YES | Images |
| Pascal-VOC 2012  | 20         | 512x512x3  | YES | Segmentation Maps |
| MultiNLI         | 5          | -          | YES | Text |
| OxfordFlower102  | 102        | ?x?x3      | YES | Images |
| OxfordPet        | 37         | ?x?x3      | YES | Images |
| SUN397           | 397        | ?x?x3      | YES | Images |
| Birdsnap         | 500        | ?x?x3      | YES | Images |
| MetaShift       | 410        | ?x?x3      | YES | Images |
| Fluent Speech    | 31         | ?          | YES | Audio |

All datasets take train and download arguments, like a torchvision dataset. These datasets are then modified to create continuum scenarios.

Once a dataset is created, it is fed to a scenario that will split it into multiple tasks.

Continuum supports many datasets implemented in torchvision, such as MNIST or CIFAR100:

from continuum import ClassIncremental
from continuum.datasets import MNIST

clloader = ClassIncremental(
    MNIST("/my/data/folder", download=True, train=True),
    increment=1,
    initial_increment=5
)

The data of these small datasets can be downloaded automatically with the download option.

Larger datasets such as ImageNet or CORe50 are also available, although their initialization differs:

from continuum import ClassIncremental
from continuum.datasets import ImageNet1000

dataset_train = ImageNet1000("/my/data/folder/imagenet/train/", train=True)
dataset_test = ImageNet1000("/my/data/folder/imagenet/val/", train=False)

Note that Continuum cannot download ImageNet’s data; that’s on you! We also provide ImageNet100, a subset of 100 ImageNet classes. The subset metadata is downloaded automatically, or you can provide it with the data_subset option.

Multiple versions of CORe50 are available. For all of them, the data can be downloaded automatically:

from continuum.datasets import Core50, Core50v2_79, Core50v2_196, Core50v2_391

dataset = Core50("/my/data/folder/", train=True, download=True)
dataset_79 = Core50v2_79("/my/data/folder/", train=True, download=True)
dataset_196 = Core50v2_196("/my/data/folder/", train=True, download=True)
dataset_391 = Core50v2_391("/my/data/folder/", train=True, download=True)

If you wish to learn CORe50 in the class-incremental scenario (NC), Core50 suffices. However, for the instance-incremental scenarios (NI and NIC), you need to use Core50v2_79, Core50v2_196, or Core50v2_391 (see our doc about it). Refer to the dataset’s official webpage for more information about the different versions.

In addition to computer vision datasets, Continuum also provides one NLP dataset:

from continuum.datasets import MultiNLI

dataset = MultiNLI("/my/data/folder", train=True, download=True)

The MultiNLI dataset provides text written in different styles and categories. It can be used for continual learning in a New Instances (NI) setting, where all categories are known from the start but styles are added incrementally.

Adding Your Own Datasets

The goal of Continuum is to provide the most common continual learning benchmarks, but also to make it easy to create new scenarios through an adaptable framework.

For example, the scenario types are easy to use with other datasets:

InMemoryDataset, for in-memory numpy arrays:

from continuum.datasets import InMemoryDataset

x_train, y_train = gen_numpy_array()  # any numpy arrays of samples and labels
dataset = InMemoryDataset(x_train, y_train)

PyTorchDataset, for datasets defined in torchvision:

from torchvision.datasets import CIFAR10
from continuum.datasets import PyTorchDataset
dataset = PyTorchDataset("/my/data/folder/", dataset_type=CIFAR10, train=True, download=True)

ImageFolderDataset, for datasets having a tree-like structure, with one folder per class:

from continuum.datasets import ImageFolderDataset

dataset_train = ImageFolderDataset("/my/data/folder/train/")
dataset_test = ImageFolderDataset("/my/data/folder/test/")
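For reference, ImageFolderDataset expects the standard one-folder-per-class layout (the class and file names below are hypothetical):

```text
/my/data/folder/train/
├── cat/
│   ├── 001.png
│   └── 002.png
└── dog/
    ├── 001.png
    └── 002.png
```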

Fellowship, to combine several continual datasets:

from continuum.datasets import CIFAR10, CIFAR100, Fellowship

dataset = Fellowship(datasets=[
        CIFAR10(data_path="/my/data/folder1/", train=True),
        CIFAR100(data_path="/my/data/folder1/", train=True)
    ],
    update_labels=True
)

The update_labels parameter determines whether the different datasets should get distinct labels (e.g. CIFAR100’s classes are shifted so they do not collide with CIFAR10’s) or whether the original labels are kept as-is. The default value of update_labels is True. Note that Continuum already provides pre-made Fellowships:

from continuum.datasets import MNISTFellowship, CIFARFellowship

dataset_MNIST = MNISTFellowship("/my/data/folder", train=True)
dataset_CIFAR = CIFARFellowship("/my/data/folder", train=True)
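As a rough illustration of what update_labels=True does (a toy sketch, not Continuum’s actual implementation), labels from each successive dataset are shifted so they do not collide with earlier ones:

```python
def remap_labels(label_sets):
    """Shift each dataset's labels by the number of classes seen so far."""
    offset = 0
    remapped = []
    for labels in label_sets:
        remapped.append([label + offset for label in labels])
        offset += len(set(labels))  # distinct classes in this dataset
    return remapped

# Two toy "datasets": the second one's labels are shifted by 10.
print(remap_labels([list(range(10)), [0, 1, 2]]))  # second becomes [10, 11, 12]
```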

You may want datasets with a different transformation for each new task, e.g. MNIST with different rotations or pixel permutations. Continuum handles that too! However, this is specific to the scenario, not the dataset, so have a look at the Scenario doc.

Supervised setting without Continual

Continuum is awesome, but you don’t want to do continual learning? You simply want to train a model in a single pass on the whole dataset? No problem.

All Continuum datasets can be directly converted to tasksets, which implement the PyTorch Dataset interface and can thus be given directly to a DataLoader.

Here is an example with MNIST, but all datasets work the same way:

from torch.utils.data import DataLoader
from continuum.datasets import MNIST

dataset = MNIST("/my/data/folder", train=True, download=True)
taskset = dataset.to_taskset(
    trsf=None,  # Put your transformations here if you want some
    target_trsf=None  # Put your target transformations here if you want some
)

loader = DataLoader(taskset, batch_size=32, shuffle=True)

for x, y, t in loader:
    pass  # Your model here

Multi-target datasets

Continuum also allows a dataset to return multiple targets per data point. In class-incremental scenarios, only the first target is taken into account when designing the increments.

from continuum import ClassIncremental
from continuum.datasets import FluentSpeech
from torch.utils.data import DataLoader
import numpy as np

dataset = FluentSpeech("/my/data/folder", train=True)

scenario = ClassIncremental(dataset, increment=1)

for taskset in scenario:
    loader = DataLoader(taskset, batch_size=1)

    for x, y, t in loader:
        print(x.shape, y.shape, t.shape, np.unique(y[:, 0]))
        break

In this situation, the FluentSpeech dataset has 4 targets per sample, but only the first one is used to determine the tasks.
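As a toy illustration of that rule, in plain Python and independent of Continuum (the label values are made up), grouping samples into tasks by their first target only would look like:

```python
# Hypothetical multi-target labels: 3 samples, 4 targets each.
y = [[0, 3, 1, 2], [1, 3, 0, 2], [0, 2, 2, 1]]

# Group sample indices by their *first* target only; the remaining
# targets travel along with the sample but do not define the tasks.
tasks = {}
for i, targets in enumerate(y):
    tasks.setdefault(targets[0], []).append(i)

print(tasks)  # {0: [0, 2], 1: [1]}
```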

The Different Data Types

Each dataset has a data type. Here are the ones already implemented:

  • IMAGE_ARRAY: The image is stored directly as an array. Example: the MNIST dataset.

  • IMAGE_PATH: The image is stored as a path to the actual image file. Example: the ImageNet dataset.

  • TEXT: For HuggingFace datasets.

  • TENSOR: Any kind of tensor.

  • SEGMENTATION: For the Continual Semantic Segmentation datasets.

  • OBJ_DETECTION: Still WIP.

  • H5: Tensors stored in an HDF5 file. For the end user, the usage is quite similar to TENSOR.

  • AUDIO: Stores a path to an audio file. Requires the SoundFile library to be installed.