Dataset Guide

Published on 30th November, 2022. Updated on 24th January, 2025.

Introduction

A dataset refers to data stored under /datasets on CREATE HPC.

The purpose of a dataset is to share commonly used data among users.

The benefits of using a dataset include:

  • preventing duplication
  • restricting user access to read and execute only (write access can be granted to the principal investigator, administrators and maintainers)
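Because dataset directories are read- and execute-only for most users, it is worth having scripts confirm they cannot (and should not) write there. A minimal sketch using Python's standard library; the directory below is created locally purely for illustration, not a real dataset path:

```python
import os
import stat
import tempfile

# Create a local directory with dataset-style permissions
# (read + execute for the owner, no write) for illustration.
demo = tempfile.mkdtemp()
os.chmod(demo, stat.S_IRUSR | stat.S_IXUSR)

# Inspect the mode bits directly rather than attempting a write.
mode = os.stat(demo).st_mode
print("readable:", bool(mode & stat.S_IRUSR))  # readable: True
print("writable:", bool(mode & stat.S_IWUSR))  # writable: False

os.chmod(demo, stat.S_IRWXU)  # restore the write bit so cleanup succeeds
os.rmdir(demo)
```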

Datasets that are available on CREATE HPC

The following datasets are available to request within CREATE:

The following datasets are publicly available to all users within CREATE:

If you need access to any of the datasets listed above, please apply on the e-research portal groups page or, if there is no "Apply" button visible for the group, email one of the named contacts to request access.
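Dataset access is granted through Unix group membership, so once a request is approved the corresponding group appears in your account's group list. A quick way to check from a terminal; the group name `image_net` here is illustrative and may differ on CREATE:

```shell
# List all groups the current account belongs to.
id -nG

# Check for a specific (hypothetical) dataset group.
if id -nG | grep -qw image_net; then
    echo "access granted"
else
    echo "no access yet"
fi
```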

AnswerALS

Role Email Name
Principal Investigator alfredo.iacoangeli@kcl.ac.uk Alfredo Iacoangeli

Bioresource

Role Email Name
Principal Investigator gerome.breen@kcl.ac.uk Gerome Breen
Administrator jonathan.coleman@kcl.ac.uk Jonathan Coleman
Administrator sang_hyuck.lee@kcl.ac.uk Sang-Hyuck Lee
Administrator rujia.1.wang@kcl.ac.uk Rujia Wang

BSTOP

Role Email Name
Principal Investigator michael.simpson@kcl.ac.uk Michael Simpson
Administrator nick.dand@kcl.ac.uk Nick Dand

commonmind

Role Email Name
Principal Investigator TBA

image_net

Role Email Name
Principal Investigator TBA

NYGC and Target ALS

Role Email Name
Principal Investigator alfredo.iacoangeli@kcl.ac.uk Alfredo Iacoangeli

ProjectMineDF2

Role Email Name
Principal Investigator alfredo.iacoangeli@kcl.ac.uk Alfredo Iacoangeli
Administrator aminah.2.ali@kcl.ac.uk Aminah Ali

ROSMAP

Role Email Name
Principal Investigator jernej.ule@kcl.ac.uk Jernej Ule
Administrator charlotte.capitanchik@kcl.ac.uk Charlotte Capitanchik
Administrator silvia.hnatova@kcl.ac.uk Silvia Hnatova

SEA-AD

Role Email Name
Principal Investigator jernej.ule@kcl.ac.uk Jernej Ule
Administrator charlotte.capitanchik@kcl.ac.uk Charlotte Capitanchik
Administrator silvia.hnatova@kcl.ac.uk Silvia Hnatova

TCGA

Role Email Name
Principal Investigator francesca.ciccarelli@kcl.ac.uk Francesca Ciccarelli

TREM2

Role Email Name
Principal Investigator alfredo.iacoangeli@kcl.ac.uk Alfredo Iacoangeli
Principal Investigator angela.k.hodges@kcl.ac.uk Angela Hodges
Principal Investigator richard.j.dobson@kcl.ac.uk Richard Dobson

UK Biobank

Role Email Name
Principal Investigator gerome.breen@kcl.ac.uk Gerome Breen
Administrator jonathan.coleman@kcl.ac.uk Jonathan Coleman
Administrator alexandra.a.gillett@kcl.ac.uk Alexandra Gillett

UK DRI MAP

Role Email Name
Principal Investigator jernej.ule@kcl.ac.uk Jernej Ule
Administrator charlotte.capitanchik@kcl.ac.uk Charlotte Capitanchik
Administrator silvia.hnatova@kcl.ac.uk Silvia Hnatova

Working with Datasets: An ImageNet Example

ImageNet is an image database organised according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds and thousands of images.

Loading the data set

Tip

The ImageNet dataset available via /datasets/image_net is very large.
If you wish to work with a smaller dataset, you can download a subset, Tiny ImageNet; otherwise, you can use the whole dataset and skip this step.

cd /scratch/users/k1234567
mkdir imagenet
cd imagenet
wget https://cs231n.stanford.edu/tiny-imagenet-200.zip
unzip tiny-imagenet-200.zip

Running a Python script

Note

Before proceeding, ensure you have PyTorch installed. Please refer to our PyTorch documentation for installation instructions.

Either run your own Python script or use this basic script, which trains a ResNet model on the dataset using PyTorch (adjust accordingly):

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Subset

# Transforms
transform = transforms.Compose([
    transforms.Resize(64),
    transforms.ToTensor()
])

# Load dataset
train_data_full = torchvision.datasets.ImageFolder(
    root='/scratch/users/k1234567/imagenet/tiny-imagenet-200/train',
    transform=transform
)

small_indices = list(range(2000)) # for the first 2000 images
train_data = Subset(train_data_full, small_indices)
train_loader = DataLoader(train_data, batch_size=32, shuffle=True, num_workers=4)

# Simple model
model = torchvision.models.resnet18(num_classes=200)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(5):
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {running_loss/len(train_loader)}")
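As a sanity check on the configuration above: with the 2,000-image Subset and a batch size of 32, the DataLoader yields ceil(2000 / 32) = 63 batches per epoch, which is the `len(train_loader)` used to average the loss. A quick check of that arithmetic:

```python
import math

num_images = 2000   # size of the Subset above
batch_size = 32

# With drop_last=False (the DataLoader default), the final
# partial batch is kept, so the count rounds up.
batches_per_epoch = math.ceil(num_images / batch_size)
print(batches_per_epoch)  # 63
```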

Submitting the training job

To run the training on the cluster, use a SLURM batch script (saved here as imagenet.sh):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --partition=interruptible_gpu
#SBATCH --time=02:00:00

module load cuda

source pytorch-venv/bin/activate

python train_tiny_imagenet.py --data /scratch/users/k1234567/imagenet/tiny-imagenet-200

Now submit the batch script:

Tip

Ensure you have activated your virtual environment and are on a compute node.

sbatch imagenet.sh