Existing Datasets¶
| Name | Nb classes | Image Size | Auto Download | Type |
|---|---|---|---|---|
| MNIST | 10 | 28x28x1 | YES | Images |
| Fashion MNIST | 10 | 28x28x1 | YES | Images |
| KMNIST | 10 | 28x28x1 | YES | Images |
| EMNIST | 10 | 28x28x1 | YES | Images |
| QMNIST | 10 | 28x28x1 | YES | Images |
| MNIST Fellowship | 30 | 28x28x1 | YES | Images |
| Rainbow MNIST | 10 | 28x28x3 | YES | Images |
| Colored MNIST | 2 | 28x28x3 | YES | Images |
| SVHN | 10 | 32x32x3 | YES | Images |
| Synbols | 50 | 32x32x3 | YES | Images |
| CIFAR10 | 10 | 32x32x3 | YES | Images |
| CIFAR100 | 100 | 32x32x3 | YES | Images |
| CIFAR Fellowship | 110 | 32x32x3 | YES | Images |
| STL10 | 10 | 96x96x3 | YES | Images |
| Omniglot | 964 | 105x105x1 | YES | Images |
| CTRL minus | 87 | 32x32x3 | YES | Images |
| CTRL plus | 87 | 32x32x3 | YES | Images |
| CTRL in | 87 | 32x32x3 | YES | Images |
| CTRL out | 97 | 32x32x3 | YES | Images |
| CTRL plastic | 97 | 32x32x3 | YES | Images |
| FER2013 | 7 | 48x48x1 | NO | Images |
| TinyImageNet200 | 200 | 64x64x3 | YES | Images |
| GTSRB | 43 | ~64x64x3 | YES | Images |
| EuroSAT | 10 | 64x64x3 | YES | Images |
| ImageNet100 | 100 | 224x224x3 | NO | Images |
| ImageNet1000 | 1000 | 224x224x3 | NO | Images |
| CORe50 | 50 | 224x224x3 | YES | Images |
| CORe50-v2-79 | 50 | 224x224x3 | YES | Images |
| CORe50-v2-196* | 50 | 224x224x3 | YES | Images |
| CORe50-v2-391 | 50 | 224x224x3 | YES | Images |
| Stanford Car196 | 196 | 224x224x3 | YES | Images |
| Stream-51 | 51 | 224x224x3 | YES | Images |
| Caltech101 | 101 (102) | 224x224x3 | YES | Images |
| Caltech256 | 257 | 224x224x3 | YES | Images |
| FGVC Aircraft | 30/70/100 | 224x224x3 | YES | Images |
| Food101 | 101 | 224x224x3 | YES | Images |
| DTD | 47 | 224x224x3 | YES | Images |
| AwA2 | 50 | 224x224x3 | YES | Images |
| CUB200 | 200 | 224x224x3 | YES | Images |
| Terra Incognita | 10 | 224x224x3 | YES | Images |
| PACS | 7 | 224x224x3 | NO | Images |
| VLCS | 5 | 224x224x3 | YES | Images |
| OfficeHome | 65 | 224x224x3 | NO | Images |
| DomainNet | 345 | 224x224x3 | YES | Images |
| Pascal-VOC 2007 | 20 | 224x224x3 | YES | Images |
| Pascal-VOC 2012 | 20 | 512x512x3 | YES | Segmentation Maps |
| MultiNLI | 5 | | YES | Text |
| OxfordFlower102 | 102 | ?x?x3 | YES | Images |
| OxfordPet | 37 | ?x?x3 | YES | Images |
| SUN397 | 397 | ?x?x3 | YES | Images |
| Birdsnap | 500 | ?x?x3 | YES | Images |
| MetaShift | 410 | ?x?x3 | YES | Images |
| Fluent Speech | 31 | ? | YES | Audio |
All datasets take the arguments train and download, like a torchvision dataset. These datasets are then modified to create continuum scenarios.
Once a dataset is created, it is fed to a scenario that will split it into multiple tasks.
Continuum supports many datasets implemented in torchvision, such as MNIST or CIFAR100:
from continuum import ClassIncremental
from continuum.datasets import MNIST
clloader = ClassIncremental(
    MNIST("/my/data/folder", download=True, train=True),
    increment=1,
    initial_increment=5
)
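With increment=1 and initial_increment=5, the first task holds 5 classes and each following task adds 1. The splitting logic can be sketched in plain NumPy (an illustration of the idea, not Continuum's actual implementation; the function name is hypothetical):

```python
import numpy as np

def class_incremental_split(y, initial_increment, increment):
    """Toy class-incremental split: returns, for each task, the indices
    of the samples whose class belongs to that task's increment."""
    classes = np.unique(y)
    boundaries = [initial_increment]
    while boundaries[-1] < len(classes):
        boundaries.append(boundaries[-1] + increment)
    tasks, start = [], 0
    for end in boundaries:
        task_classes = classes[start:end]
        tasks.append(np.where(np.isin(y, task_classes))[0])
        start = end
    return tasks

y = np.array([0, 1, 2, 3, 4, 5, 0, 5])
tasks = class_incremental_split(y, initial_increment=2, increment=2)
print([y[t].tolist() for t in tasks])  # [[0, 1, 0], [2, 3], [4, 5, 5]]
```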
The data of these small datasets can be downloaded automatically with the option download.
Larger datasets such as ImageNet or CORe50 are also available, although their initialization differs:
from continuum import ClassIncremental
from continuum.datasets import ImageNet1000
dataset_train = ImageNet1000("/my/data/folder/imagenet/train/", train=True)
dataset_test = ImageNet1000("/my/data/folder/imagenet/val/", train=False)
Note that Continuum cannot download ImageNet’s data; that’s on you! We also provide ImageNet100, a subset of 100 classes of ImageNet. The subset metadata is downloaded automatically, or you can provide it with the option data_subset.
Multiple versions of CORe50 are proposed. For all, the data can automatically be downloaded:
from continuum.datasets import Core50, Core50v2_79, Core50v2_196, Core50v2_391
dataset = Core50("/my/data/folder/", train=True, download=True)
dataset_79 = Core50v2_79("/my/data/folder/", train=True, download=True)
dataset_196 = Core50v2_196("/my/data/folder/", train=True, download=True)
dataset_391 = Core50v2_391("/my/data/folder/", train=True, download=True)
If you wish to learn CORe50 in the class-incremental scenario (NC), Core50 suffices. However, for the instance-incremental scenarios (NI and NIC), you need to use Core50v2_79, Core50v2_196, or Core50v2_391 (see our doc about it). Refer to the dataset’s official webpage for more information about the different versions.
In addition to Computer Vision datasets, Continuum also provides one NLP dataset:
from continuum.datasets import MultiNLI
dataset = MultiNLI("/my/data/folder", train=True, download=True)
The MultiNLI dataset provides text written in different styles and categories. This dataset can be used in Continual Learning in a New Instances (NI) setting where all categories are known from the start, but with styles being incrementally added.
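The NI idea can be sketched with a toy example (plain NumPy, hypothetical data, not Continuum's API): all categories appear in every task, and each task only introduces instances of a new style:

```python
import numpy as np

# Toy data: each sample has a category label and a style (domain) id.
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
styles = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# New Instances (NI): one task per style; every task contains the full
# set of categories, only the instances (styles) change across tasks.
tasks = [np.where(styles == s)[0] for s in np.unique(styles)]
for t, idx in enumerate(tasks):
    print(t, sorted(labels[idx].tolist()))  # each task sees classes {0, 1, 2}
```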
Adding Your Own Datasets¶
The goal of Continuum is to propose the most used benchmark scenarios of continual learning, but also to make the creation of new scenarios easy through an adaptable framework.
For example, the scenario types are easy to use with other datasets:
InMemoryDataset, for in-memory numpy array:
from continuum.datasets import InMemoryDataset
x_train, y_train = gen_numpy_array()
dataset = InMemoryDataset(x_train, y_train)
PyTorchDataset, for datasets defined in torchvision:
from torchvision.datasets import CIFAR10
from continuum.datasets import PyTorchDataset
dataset = PyTorchDataset("/my/data/folder/", dataset_type=CIFAR10, train=True, download=True)
ImageFolderDataset, for datasets having a tree-like structure, with one folder per class:
from continuum.datasets import ImageFolderDataset
dataset_train = ImageFolderDataset("/my/data/folder/train/")
dataset_test = ImageFolderDataset("/my/data/folder/test/")
Fellowship, to combine several continual datasets:
from continuum.datasets import CIFAR10, CIFAR100, Fellowship
dataset = Fellowship(
    datasets=[
        CIFAR10(data_path="/my/data/folder1/", train=True),
        CIFAR100(data_path="/my/data/folder1/", train=True)
    ],
    update_labels=True
)
The update_labels parameter determines whether the labels of the different datasets should be remapped so they do not overlap, or left as-is. The default value of update_labels is True. Note that Continuum already provides pre-made Fellowships:
from continuum.datasets import MNISTFellowship, CIFARFellowship
dataset_MNIST = MNISTFellowship("/my/data/folder", train=True)
dataset_CIFAR = CIFARFellowship("/my/data/folder", train=True)
You may want datasets that have a different transformation for each new task, e.g. MNIST with different rotations or pixel permutations. Continuum also handles this! However, it is scenario-specific, not dataset-specific, so look over the Scenario doc.
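Permuted MNIST, for instance, applies a different fixed pixel permutation per task. The idea can be sketched standalone in NumPy (an illustration of the mechanism, independent of Continuum's scenario classes; the helper name is made up):

```python
import numpy as np

def permute_task(images, seed):
    """Apply a fixed, task-specific pixel permutation to a batch of
    flattened images; a given seed always yields the same permutation."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(images.shape[1])
    return images[:, perm]

images = np.arange(12, dtype=float).reshape(3, 4)  # 3 images of 4 pixels
task0 = permute_task(images, seed=0)  # what the model sees in task 0
task1 = permute_task(images, seed=1)  # what the model sees in task 1
```

The same underlying data is presented to the model under a different, deterministic transformation in each task.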
Supervised setting without Continual¶
Continuum is awesome, but you don’t want to do continual learning? You simply want to train a model in a single run on the whole dataset? No problem.
All Continuum datasets can be directly converted to tasksets, which implement the PyTorch Dataset interface and can thus be given directly to a DataLoader.
Here is an example with MNIST, but all datasets work the same:
from torch.utils.data import DataLoader
from continuum.datasets import MNIST
dataset = MNIST("/my/data/folder", train=True, download=True)
taskset = dataset.to_taskset(
    trsf=None,  # Put your transformations here if you want some
    target_trsf=None  # Put your target transformations here if you want some
)
loader = DataLoader(taskset, batch_size=32, shuffle=True)
for x, y, t in loader:
    pass  # Your model here
Multi-target datasets¶
Continuum also allows a dataset to return multiple targets per data point. In class-incremental scenarios, only the first target is taken into account for designing the increments.
import numpy as np
from torch.utils.data import DataLoader

from continuum import ClassIncremental
from continuum.datasets import FluentSpeech

dataset = FluentSpeech("/my/data/folder", train=True)
scenario = ClassIncremental(dataset, increment=1)

for taskset in scenario:
    loader = DataLoader(taskset, batch_size=1)
    for x, y, t in loader:
        print(x.shape, y.shape, t.shape, np.unique(y[:, 0]))
        break
In this situation, the FluentSpeech dataset has 4 targets, but only the first one is used to determine the tasks.
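The first-target convention can be sketched with plain NumPy (hypothetical labels, not FluentSpeech's real targets): given a (N, num_targets) target array, only column 0 drives the split.

```python
import numpy as np

# Hypothetical multi-target labels: column 0 is the class used for the
# incremental split, the other columns are auxiliary targets.
y = np.array([
    [0, 7, 1],
    [1, 3, 0],
    [0, 2, 1],
    [1, 9, 9],
])

# Only the first target column defines task membership; the auxiliary
# columns travel along with their samples but do not affect the split.
tasks = {c: np.where(y[:, 0] == c)[0] for c in np.unique(y[:, 0])}
```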
The Different Data Types¶
Each dataset has a data type. Here are the ones already implemented:
IMAGE_ARRAY: The image is stored directly as an array. Example: MNIST dataset.
IMAGE_PATH: The image is stored as a path to the actual image file. Example: ImageNet dataset.
TEXT: For HuggingFace datasets.
TENSOR: Any kind of tensor.
SEGMENTATION: For the Continual Semantic Segmentation datasets.
OBJ_DETECTION: Still WIP.
H5: Tensors stored in an H5 dataset. For the end user, the usage is quite similar to TENSOR.
AUDIO: A string path to an audio file. Requires the SoundFile library.