Action Recognition Using LSTM
This is the fifth assignment for the Computer Vision (CSE-527) course from Fall 2019 at Stony Brook University. In this homework, we accomplish action recognition using a Recurrent Neural Network (RNN), a Long Short-Term Memory (LSTM) network in particular, on the UCF101 dataset, which consists of 101 different actions/classes with 145 samples per action. Each sample is tagged as either training or testing. Each sample would ideally be a short video, but to reduce the amount of data we get 25 frames from each video. Consequently, a training sample is an image tuple forming a 3D volume, with one dimension encoding the temporal correlation between frames, together with a label indicating which action it is.
To tackle this problem, we aim to build a neural network that captures not only the spatial information of each frame but also the temporal information between frames. Fortunately, we don't have to design such a network from scratch: an RNN, a type of neural network built to deal with time-series data, is exactly the right tool. In particular, we will be using an LSTM for this task.
Instead of training an end-to-end neural network from scratch, whose computation would be prohibitively expensive, we divide the work into two steps, feature extraction and modelling. Below are the things we implement:
- Feature extraction. Use any pre-trained model to extract features from each frame. Specifically, we don't use the activations of the last layer, as features tend to become task-specific towards the end of the network.
Hints:
- A good starting point would be to use the first fully connected layer of the pre-trained VGG16 network, torchvision.models.vgg16 (4096-dim), as the feature of each video frame. This will result in a 4096x25 matrix for each video.
- Normalize your images using torchvision.transforms:

  normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                   std=[0.229, 0.224, 0.225])
  prep = transforms.Compose([transforms.ToTensor(), normalize])
  prep(img)

  The mean and std above are specific to ImageNet data.
- More details on image preprocessing in PyTorch can be found at http://pytorch.org/tutorials/beginner/data_loading_tutorial.html
- Modelling. With the extracted features, we build an LSTM network which takes a dx25 sample as input (where d is the dimension of the extracted feature for each frame) and outputs the action label of that sample.
- Evaluation. After training the network, we evaluate the model on the testing data by computing the prediction accuracy. The baseline test accuracy for this data is 75%. We also compare the result of our network with that of a support vector machine (SVM), obtained by stacking the dx25 feature matrix into a long vector and training an SVM on it.
Notice that the size of the raw images is 256x340, whereas our pre-trained model may expect nxn inputs. Instead of resizing the images, which unfavorably changes the aspect ratio, we take a better approach: crop five nxn images, one at the image center and four at the corners, compute the d-dim feature for each of them, and average these five d-dim features to get the final feature representation of the raw image.
For example, VGG takes 224x224 images as inputs, so we take the five 224x224 crops of an image, compute the 4096-dim VGG feature for each of them, and then take the mean of these five 4096-dim vectors as the representation of the image.
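As a minimal sketch of this recipe, here is the same idea using torchvision's built-in FiveCrop transform rather than the manual crops implemented later in this notebook (frame.jpg is a hypothetical path):

```python
from PIL import Image
import torch
from torchvision import models, transforms

# Truncate VGG16 after its first fully connected layer (keeps the ReLU too).
model = models.vgg16(pretrained=True)
model.classifier = model.classifier[:2]
model.eval()

prep = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open('frame.jpg')                   # hypothetical 256x340 frame
crops = transforms.FiveCrop(224)(img)           # four corners + center
batch = torch.stack([prep(c) for c in crops])   # (5, 3, 224, 224)
with torch.no_grad():
    feature = model(batch).mean(dim=0)          # averaged 4096-dim feature
```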
To save computation time, one can perform the classification task on only the first 25 classes of the whole dataset.
Dataset
Download the dataset at UCF101 (image data for each video); the annos folder, which holds the video labels and the label-to-class-name mapping, is included in the uploaded assignment folder.
UCF101 dataset contains 101 actions and 13,320 videos in total.
annos/actions.txt
- lists all the actions (ApplyEyeMakeup, ..., YoYo)
annos/videos_labels_subsets.txt
- lists all the videos (v_000001, ..., v_013320), their labels (1, ..., 101), and their subsets (1 for train, 2 for test)
images/
- each folder represents a video
- the video/folder name to class mapping can be found using annos/videos_labels_subsets.txt; e.g. v_000001 belongs to class 1, i.e. ApplyEyeMakeup
- each video folder contains 25 frames
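Since the dataset class below reads this file with pandas, a quick way to sanity-check the annotation format (a sketch, assuming the file is tab-separated with the video/label/subset columns described above):

```python
import pandas as pd

# Peek at the annotation file: video name, class label (1-101), subset (1=train, 2=test)
annos = pd.read_csv('annos/videos_labels_subsets.txt', delimiter='\t', header=None,
                    names=['video', 'label', 'subset'])
print(annos.head())
print(annos['subset'].value_counts())  # rough train/test split sizes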
Some Tutorials
- Good materials for understanding RNN and LSTM
- http://blog.echen.me
- http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Implementing RNN and LSTM with PyTorch
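Before diving in, here is a minimal nn.LSTM call as a warm-up, a sketch using the same sequence-first layout and dimensions this notebook adopts later:

```python
import torch
import torch.nn as nn

# nn.LSTM defaults to sequence-first input: (seq_len, batch, input_dim).
lstm = nn.LSTM(input_size=4096, hidden_size=512, num_layers=1)
x = torch.randn(25, 1, 4096)   # 25 frames, batch of 1, 4096-dim features
h0 = torch.zeros(1, 1, 512)    # (num_layers, batch, hidden_dim)
c0 = torch.zeros(1, 1, 512)
out, (hn, cn) = lstm(x, (h0, c0))
print(out.shape)               # torch.Size([25, 1, 512])
```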
import os
os.environ['CUDA_LAUNCH_BLOCKING']='1'
import random
import time
import sys
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from skimage import io, transform
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils, models
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
plt.ion() # interactive mode
# Random seed
torch.manual_seed(42)
np.random.seed(42)
# HELPER METHODS BELOW
def print_gpu_stats():
    # Report total device memory, reserved (cached) and allocated bytes, and the
    # free share of the cache; used to watch for leaks during feature extraction
    total_memory = torch.cuda.get_device_properties(torch.cuda.current_device()).total_memory
    cached = torch.cuda.memory_cached()
    allocated = torch.cuda.memory_allocated()
    free = cached - allocated
    free_pc = free / cached
    print('GPU Stats:', total_memory, cached, allocated, free, free_pc)
Problem 1. Feature extraction
# ==========================================
# Class to manage interaction with dataset
# ==========================================
class UCF101Dataset(Dataset):
def __init__(self, annotations_file, images_dir, is_train,
images_per_sample=25, num_classes=(1, 102), transform=None):
self.annotations_frame = pd.read_csv(annotations_file, delimiter='\t', header=None)
self.annotations_frame = self.annotations_frame[(self.annotations_frame[1] >= num_classes[0]) & (self.annotations_frame[1] < num_classes[1])]
if is_train:
self.annotations_frame = self.annotations_frame[self.annotations_frame[2] == 1]
else:
self.annotations_frame = self.annotations_frame[self.annotations_frame[2] == 2]
self.images_dir = images_dir
self.images_per_sample = images_per_sample
self.transform = transform
def __len__(self):
return len(self.annotations_frame)
def __getitem__(self, idx):
if torch.is_tensor(idx):
idx = idx.tolist()
images_path = os.path.join(self.images_dir, self.annotations_frame.iloc[idx, 0])
images = []
for i in range(1, self.images_per_sample + 1):
img_name = os.path.join(images_path, 'i_{:04d}.jpg'.format(i))
image = io.imread(img_name)
if self.transform:
image = self.transform(image)
images.append(image)
label = self.annotations_frame.iloc[idx, 1]
sample = {'images': images, 'label': int(label)}
return sample
ucf_25_train = UCF101Dataset(annotations_file='annos/videos_labels_subsets.txt',
images_dir='images/',
is_train=True,
images_per_sample=25,
num_classes=(1, 26))
ucf_25_test = UCF101Dataset(annotations_file='annos/videos_labels_subsets.txt',
images_dir='images/',
is_train=False,
images_per_sample=25,
num_classes=(1, 26))
print('Training samples for 25 classes:', len(ucf_25_train))
print('Test samples for 25 classes:', len(ucf_25_test))
Training samples for 25 classes: 2409
Test samples for 25 classes: 951
# ==========================================
# Sanity check sample data
# ==========================================
fig = plt.figure()
for i in range(4):
    sample = ucf_25_train[10 * i]  # every 10th video
    ax = plt.subplot(1, 4, i + 1)
    plt.tight_layout()
    ax.set_title('Sample #{}'.format(i))
    ax.axis('off')
    plt.imshow(sample['images'][10])  # the 11th frame of each video
plt.show()
# Cleanup
del ucf_25_train, ucf_25_test, fig, sample
# ==========================================
# Tweak Pretrained Network for Feature Ext
# ==========================================
vgg16 = models.vgg16(pretrained=True)
# print(vgg16)
vgg16.classifier = vgg16.classifier[:2]
# Disable gradient calculations and freeze weights
for param in vgg16.parameters():
param.requires_grad = False
print(vgg16)
Downloading: "https://download.pytorch.org/models/vgg16-397923af.pth" to /root/.cache/torch/checkpoints/vgg16-397923af.pth
100%|██████████| 528M/528M [00:05<00:00, 96.9MB/s]
VGG(
(features): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU(inplace=True)
(4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU(inplace=True)
(7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): ReLU(inplace=True)
(9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): ReLU(inplace=True)
(12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(13): ReLU(inplace=True)
(14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(15): ReLU(inplace=True)
(16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(18): ReLU(inplace=True)
(19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(20): ReLU(inplace=True)
(21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(22): ReLU(inplace=True)
(23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(25): ReLU(inplace=True)
(26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(27): ReLU(inplace=True)
(28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(29): ReLU(inplace=True)
(30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
(classifier): Sequential(
(0): Linear(in_features=25088, out_features=4096, bias=True)
(1): ReLU(inplace=True)
)
)
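A quick sanity check (a sketch on a random tensor) that the truncated network now emits 4096-dim features:

```python
# The truncated VGG16 should map a normalized 224x224 image to a 4096-dim vector.
with torch.no_grad():
    dummy = torch.randn(1, 3, 224, 224)
    print(vgg16(dummy).shape)  # torch.Size([1, 4096])
```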
# ==========================================
# Custom Transforms
# ==========================================
class UCFNormFiveCrop(object):
def __init__(self):
self.toTensor = transforms.ToTensor()
self.normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
self.prep = transforms.Compose([self.toTensor, self.normalize])
def __call__(self, image):
image = self.prep(image)
        five_crops = [
            image[:, 16:240, 58:282].unsqueeze(0),   # center
            image[:, 0:224, 0:224].unsqueeze(0),     # top-left
            image[:, 32:256, 0:224].unsqueeze(0),    # bottom-left
            image[:, 0:224, 116:340].unsqueeze(0),   # top-right
            image[:, 32:256, 116:340].unsqueeze(0),  # bottom-right
        ]
        return torch.cat(five_crops)  # (5, 3, 224, 224)
def get_data_loader(num_classes, images_per_sample, batch_size):
train_transformed = UCF101Dataset(
annotations_file='annos/videos_labels_subsets.txt',
images_dir='images/',
is_train=True,
images_per_sample=images_per_sample,
num_classes=num_classes,
transform=UCFNormFiveCrop())
test_transformed = UCF101Dataset(
annotations_file='annos/videos_labels_subsets.txt',
images_dir='images/',
is_train=False,
images_per_sample=images_per_sample,
num_classes=num_classes,
transform=UCFNormFiveCrop())
train_dataloader = DataLoader(train_transformed, batch_size=batch_size,
shuffle=False, num_workers=4)
print('Total number of train batches =', len(train_dataloader),
'with batch size =', batch_size,
'for total entries =', len(train_transformed))
test_dataloader = DataLoader(test_transformed, batch_size=batch_size,
shuffle=False, num_workers=4)
print('Total number of test batches =', len(test_dataloader),
'with batch size =', batch_size,
'for total entries =', len(test_transformed))
return train_dataloader, test_dataloader
# ==========================================
# Feature Extraction
# ==========================================
def extract_features_and_dump(model, dataloader, is_train, num_classes):
    model.eval()  # Put VGG16 in eval mode (gradients were already disabled when freezing weights)
samples = []
epoch_time = time.time()
if torch.cuda.is_available():
model = model.cuda()
for i_batch, sample_batched in enumerate(dataloader):
start_time = time.time()
        # Stack the 25 frame tensors along a new time dimension so rows stay grouped
        # per sample: (B, 25, 5, 3, 224, 224) -> (B*25*5, 3, 224, 224). A plain
        # torch.cat along dim 0 would interleave frames of different samples within
        # a batch, mismatching the per-sample split below.
        images = torch.stack(sample_batched['images'], dim=1)
        labels = sample_batched['label'].type(torch.FloatTensor)
        images = images.view(-1, *images.shape[3:])
if torch.cuda.is_available():
images = images.cuda()
labels = labels.cuda()
# Extract features!
vgg_time = time.time()
features = model(images)
vgg_time_end = time.time()
        # Split back into per-sample chunks; images_per_sample (25) comes from the notebook's global scope
        batch_features = list(features.split(images_per_sample * 5, dim=0))
for j in range(0, len(batch_features)):
image_features = batch_features[j].split(5, dim=0) # 5 because of 5 crop
vgg_features = []
for k in range(0, len(image_features)):
vgg_features.append(torch.cat((torch.mean(image_features[k], dim=0), labels[j].reshape(1)), dim=0))
samples.extend(vgg_features)
# CLEANUP VARIABLES FROM GPU!
del images, batch_features, image_features, vgg_features
# if i_batch == 5:
# break
if (i_batch) % 50 == 0:
now = time.time()
print('For Batch {:04d} Time: {:06.2f}s with VGG Time: {:06.2f}s | Elapse Time: {:08.1f}s | Samples Size {:08d}'.format(
i_batch, now - start_time, vgg_time_end - vgg_time, now - epoch_time, len(samples)))
print_gpu_stats()
# Dump to file
file_name = 'ucf_{}_{}-{}.pt'.format('train' if is_train else 'test', num_classes[0], num_classes[1] - 1)
print('Dumping {} entries to file name {}'.format(len(samples), file_name))
print(torch.stack(samples).size())
torch.save(torch.stack(samples), file_name)
# CLEANUP!
del samples
# Split the 25 classes into 2 jobs (1-15 and 16-25); Google Drive randomly times out on long runs!
num_classes=(1, 16)
images_per_sample = 25
batch_size = 2
train_set_1_dataloader, test_set_1_dataloader = get_data_loader(num_classes, images_per_sample, batch_size)
Total number of train batches = 721 with batch size = 2 for total entries = 1442
Total number of test batches = 284 with batch size = 2 for total entries = 568
print('TRAINING DATA BATCH METRICS')
extract_features_and_dump(vgg16, train_set_1_dataloader, True, num_classes)
TRAINING DATA BATCH METRICS
For Batch 0000 Time: 000.84s with VGG Time: 000.64s | Elapse Time: 000002.1s | Samples Size 00000050
GPU Stats: 17071734784 11274289152 1486029824 9788259328 0.8681930360339937
For Batch 0050 Time: 000.73s with VGG Time: 000.61s | Elapse Time: 000039.7s | Samples Size 00002550
GPU Stats: 17071734784 11316232192 1528269824 9787962368 0.8649488806812917
For Batch 0100 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000077.2s | Samples Size 00005050
GPU Stats: 17071734784 11358175232 1570509824 9787665408 0.8617286851170144
For Batch 0150 Time: 000.73s with VGG Time: 000.61s | Elapse Time: 000114.8s | Samples Size 00007550
GPU Stats: 17071734784 11400118272 1612749824 9787368448 0.8585321848843359
For Batch 0200 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000152.4s | Samples Size 00010050
GPU Stats: 17071734784 11442061312 1654989824 9787071488 0.8553591194040964
For Batch 0250 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000189.9s | Samples Size 00012550
GPU Stats: 17071734784 11484004352 1697229824 9786774528 0.8522092319039901
For Batch 0300 Time: 000.73s with VGG Time: 000.61s | Elapse Time: 000227.4s | Samples Size 00015050
GPU Stats: 17071734784 11528044544 1739469824 9788574720 0.8491097239119065
For Batch 0350 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000264.9s | Samples Size 00017550
GPU Stats: 17071734784 11569987584 1781709824 9788277760 0.8460059000872304
For Batch 0400 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000302.4s | Samples Size 00020050
GPU Stats: 17071734784 11611930624 1823949824 9787980800 0.8429244986849829
For Batch 0450 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000340.0s | Samples Size 00022550
GPU Stats: 17071734784 11653873664 1866189824 9787683840 0.8398652776059474
For Batch 0500 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000377.7s | Samples Size 00025050
GPU Stats: 17071734784 11695816704 1908429824 9787386880 0.8368279982237314
For Batch 0550 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000415.1s | Samples Size 00027550
GPU Stats: 17071734784 11737759744 1950669824 9787089920 0.8338124253227175
For Batch 0600 Time: 000.73s with VGG Time: 000.61s | Elapse Time: 000452.6s | Samples Size 00030050
GPU Stats: 17071734784 11781799936 1992909824 9788890112 0.8308484412546725
For Batch 0650 Time: 000.73s with VGG Time: 000.62s | Elapse Time: 000490.2s | Samples Size 00032550
GPU Stats: 17071734784 11823742976 2035149824 9788593152 0.8278760094725524
For Batch 0700 Time: 000.73s with VGG Time: 000.61s | Elapse Time: 000527.8s | Samples Size 00035050
GPU Stats: 17071734784 11865686016 2077389824 9788296192 0.8249245917009103
Dumping 36050 entries to file name ucf_train_1-15.pt
torch.Size([36050, 4097])
num_classes=(16, 26)
images_per_sample = 25
batch_size = 2
train_set_1_dataloader, test_set_1_dataloader = get_data_loader(num_classes, images_per_sample, batch_size)
print('TRAINING DATA BATCH METRICS')
extract_features_and_dump(vgg16, train_set_1_dataloader, True, num_classes)
Total number of train batches = 484 with batch size = 2 for total entries = 967
Total number of test batches = 192 with batch size = 2 for total entries = 383
TRAINING DATA BATCH METRICS
For Batch 0000 Time: 000.83s with VGG Time: 000.64s | Elapse Time: 000002.1s | Samples Size 00000050
GPU Stats: 17071734784 12067012608 876929024 11190083584 0.927328407412235
For Batch 0050 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000039.8s | Samples Size 00002550
GPU Stats: 17071734784 12067012608 919169024 11147843584 0.9238279552811088
For Batch 0100 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000077.4s | Samples Size 00005050
GPU Stats: 17071734784 12067012608 961409024 11105603584 0.9203275031499826
For Batch 0150 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000114.9s | Samples Size 00007550
GPU Stats: 17071734784 12067012608 1003649024 11063363584 0.9168270510188564
For Batch 0200 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000152.5s | Samples Size 00010050
GPU Stats: 17071734784 12067012608 1045889024 11021123584 0.9133265988877303
For Batch 0250 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000190.0s | Samples Size 00012550
GPU Stats: 17071734784 12067012608 1088129024 10978883584 0.9098261467566041
For Batch 0300 Time: 000.73s with VGG Time: 000.61s | Elapse Time: 000227.6s | Samples Size 00015050
GPU Stats: 17071734784 12067012608 1130369024 10936643584 0.906325694625478
For Batch 0350 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000265.2s | Samples Size 00017550
GPU Stats: 17071734784 12067012608 1172609024 10894403584 0.9028252424943518
For Batch 0400 Time: 000.73s with VGG Time: 000.61s | Elapse Time: 000302.8s | Samples Size 00020050
GPU Stats: 17071734784 12067012608 1214849024 10852163584 0.8993247903632255
For Batch 0450 Time: 000.73s with VGG Time: 000.61s | Elapse Time: 000340.4s | Samples Size 00022550
GPU Stats: 17071734784 12067012608 1257089024 10809923584 0.8958243382320994
Dumping 24175 entries to file name ucf_train_16-25.pt
torch.Size([24175, 4097])
# Sanity check!
# file_name = 'ucf_train_1-25.pkl'
# print('Reading entries from file name {}'.format(file_name))
# ucf_25_train_samples = torch.load(file_name)
# print(ucf_25_train_samples[0][0].size())
# print(ucf_25_train_samples[0][1])
# print(ucf_25_train_samples[299][0].size())
# print(ucf_25_train_samples[299][1])
num_classes=(1, 16)
print('TEST DATA BATCH METRICS')
train_set_1_dataloader, test_set_1_dataloader = get_data_loader(num_classes, images_per_sample, batch_size)
extract_features_and_dump(vgg16, test_set_1_dataloader, False, num_classes)
TEST DATA BATCH METRICS
Total number of train batches = 721 with batch size = 2 for total entries = 1442
Total number of test batches = 284 with batch size = 2 for total entries = 568
For Batch 0000 Time: 000.76s with VGG Time: 000.62s | Elapse Time: 000023.1s | Samples Size 00000050
GPU Stats: 17071734784 12067012608 876929024 11190083584 0.927328407412235
For Batch 0050 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000298.4s | Samples Size 00002550
GPU Stats: 17071734784 12067012608 919169024 11147843584 0.9238279552811088
For Batch 0100 Time: 000.74s with VGG Time: 000.63s | Elapse Time: 000565.3s | Samples Size 00005050
GPU Stats: 17071734784 12067012608 961409024 11105603584 0.9203275031499826
For Batch 0150 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000813.3s | Samples Size 00007550
GPU Stats: 17071734784 12067012608 1003649024 11063363584 0.9168270510188564
For Batch 0200 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 001141.5s | Samples Size 00010050
GPU Stats: 17071734784 12067012608 1045889024 11021123584 0.9133265988877303
For Batch 0250 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 001449.8s | Samples Size 00012550
GPU Stats: 17071734784 12067012608 1088129024 10978883584 0.9098261467566041
Dumping 14200 entries to file name ucf_test_1-15.pt
torch.Size([14200, 4097])
num_classes=(16, 26)
print('TEST DATA BATCH METRICS')
train_set_1_dataloader, test_set_1_dataloader = get_data_loader(num_classes, images_per_sample, batch_size)
extract_features_and_dump(vgg16, test_set_1_dataloader, False, num_classes)
TEST DATA BATCH METRICS
Total number of train batches = 484 with batch size = 2 for total entries = 967
Total number of test batches = 192 with batch size = 2 for total entries = 383
For Batch 0000 Time: 000.82s with VGG Time: 000.64s | Elapse Time: 000002.3s | Samples Size 00000050
GPU Stats: 17071734784 15279849472 4070541312 11209308160 0.7336006928956217
For Batch 0050 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000040.1s | Samples Size 00002550
GPU Stats: 17071734784 15279849472 2138846208 13141003264 0.8600217749579674
For Batch 0100 Time: 000.72s with VGG Time: 000.61s | Elapse Time: 000077.8s | Samples Size 00005050
GPU Stats: 17071734784 15279849472 2181086208 13098763264 0.8572573498190021
For Batch 0150 Time: 000.73s with VGG Time: 000.61s | Elapse Time: 000115.5s | Samples Size 00007550
GPU Stats: 17071734784 15279849472 2223326208 13056523264 0.854492924680037
Dumping 9575 entries to file name ucf_test_16-25.pt
torch.Size([9575, 4097])
Problem 2. Modelling
- Print the size of your training and test data
def merge_batches(batches, is_train):
result = None
for batch in batches:
file_name = 'ucf_{}_{}-{}.pt'.format('train' if is_train else 'test', batch[0], batch[1] - 1)
samples = torch.load(file_name)
if result is None:
result = samples
else:
result = torch.cat((result, samples), dim=0)
# print(result)
return result
batches = [(1, 16), (16, 26)]
ucf_25_train_samples = merge_batches(batches, True)
print(ucf_25_train_samples.size())
ucf_25_test_samples = merge_batches(batches, False)
print(ucf_25_test_samples.size())
torch.Size([60225, 4097])
torch.Size([23775, 4097])
def samples_to_tensor(samples):
    # Features: regroup the flat (num_videos*25, 4096) matrix into (num_videos, 25, 4096)
    features = samples[:, :-1]
    features_tensor = torch.stack(torch.split(features, 25, dim=0))
    # Labels: all 25 frames of a video carry the same label, so the mean recovers it
    labels = samples[:, -1]
    labels_tensor = torch.stack(torch.split(labels, 25, dim=0)).mean(dim=1)
    return features_tensor, labels_tensor
train_features, train_labels = samples_to_tensor(ucf_25_train_samples)
test_features, test_labels = samples_to_tensor(ucf_25_test_samples)
# Don't hardcode the shape of train and test data
print('Shape of training data is :', train_features.size())
print('Shape of training label data is :', train_labels.size())
print('Shape of test/validation data is :', test_features.size())
print('Shape of test/validation label data is :', test_labels.size())
Shape of training data is : torch.Size([2409, 25, 4096])
Shape of training label data is : torch.Size([2409])
Shape of test/validation data is : torch.Size([951, 25, 4096])
Shape of test/validation label data is : torch.Size([951])
class LSTMClassifier(nn.Module):
def __init__(self, input_dim, hidden_dim, num_layers, batch_size,
target_size):
super(LSTMClassifier, self).__init__()
self.input_dim = input_dim
self.hidden_dim = hidden_dim
self.num_layers = num_layers
self.batch_size = batch_size
self.target_size = target_size
self.lstm = nn.LSTM(input_dim, self.hidden_dim, self.num_layers)
self.linear = nn.Linear(hidden_dim, target_size)
self.hidden = self.init_hidden()
    def init_hidden(self):
        # Fresh (h0, c0) on the same device as the model parameters
        device = next(self.parameters()).device
        return (torch.zeros(self.num_layers, self.batch_size, self.hidden_dim, device=device),
                torch.zeros(self.num_layers, self.batch_size, self.hidden_dim, device=device))
def forward(self, video_frames):
lstm_in = video_frames.view(len(video_frames), self.batch_size, -1)
lstm_out, self.hidden = self.lstm(lstm_in, self.hidden)
output = self.linear(lstm_out[-1].view(self.batch_size, -1))
return output
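A quick shape sanity check on random input (a sketch with hypothetical values, matching the dimensions used in the training run below):

```python
# One video is a (25, 4096) feature sequence; the classifier should emit
# a (1, 25) row of class scores.
m = LSTMClassifier(input_dim=4096, hidden_dim=512, num_layers=1,
                   batch_size=1, target_size=25)
x = torch.randn(25, 4096)
m.hidden = m.init_hidden()
print(m(x).shape)  # torch.Size([1, 25])
```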
def train_model(model, train_features, train_labels, num_epochs, learning_rate,
optimizer=None):
since = time.time()
    if optimizer is None:
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
if torch.cuda.is_available():
model = model.cuda()
loss_function = nn.CrossEntropyLoss()
total_step = len(train_features)
model.train()
for epoch in range(num_epochs):
data = list(zip(train_features, train_labels))
random.shuffle(data)
train_features, train_labels = zip(*data)
correct = 0
total = 0
for i, (video, label) in enumerate(zip(train_features, train_labels)):
            # Shift labels from 1..num_classes to 0..num_classes-1;
            # CrossEntropyLoss expects targets in [0, num_classes)
            label = label.type(torch.LongTensor).view(-1) - 1
if torch.cuda.is_available():
video, label = video.cuda(), label.cuda()
model.zero_grad()
model.hidden = model.init_hidden()
predictions = model(video)
loss = loss_function(predictions, label)
loss.backward()
optimizer.step()
# Track the accuracy
_, predicted = torch.max(predictions.data, 1)
total += label.size(0)
correct += (predicted == label).sum().item()
# if i != 0 and i % (50) == 0:
# print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Accuracy: {:.2f}%'
# .format(epoch + 1, num_epochs, i + 1, total_step, loss.item(),
# (correct / total) * 100))
print('Training Accuracy for epoch {}: {:.3f}%'.format(epoch + 1, (correct / total) * 100))
elapsed = time.time() - since
print('Train time elapsed in seconds: ', elapsed)
return (correct / total) * 100
num_epochs = 5
model = LSTMClassifier(input_dim=4096, hidden_dim=512, num_layers=1,
batch_size=1, target_size=25)
training_accuracy = train_model(model, train_features, train_labels, num_epochs, 0.05)
Training Accuracy for epoch 1: 71.067%
Training Accuracy for epoch 2: 96.472%
Training Accuracy for epoch 3: 99.419%
Training Accuracy for epoch 4: 99.875%
Training Accuracy for epoch 5: 100.000%
Train time elapsed in seconds: 68.63698673248291
Problem 3. Evaluation
def test_model(model, test_features, test_labels):
since = time.time()
model.eval()
with torch.no_grad():
correct = 0
total = 0
for video, label in zip(test_features, test_labels):
            # Shift labels to [0, num_classes), matching how the model was trained
            label = label.type(torch.LongTensor).view(-1) - 1
if torch.cuda.is_available():
# Move to GPU
video, label = video.cuda(), label.cuda()
outputs = model(video)
_, predicted = torch.max(outputs.data, 1)
total += label.size(0)
correct += (predicted == label).sum().item()
print('Test Accuracy of the model on test images: {} %'.format((correct / total) * 100))
elapsed = time.time() - since
print('Test time elapsed in seconds: ', elapsed)
return (correct / total) * 100
test_accuracy = test_model(model, test_features, test_labels)
Test Accuracy of the model on test images: 85.17350157728707 %
Test time elapsed in seconds: 1.9488310813903809
# torch.save(model.state_dict(), 'model_25_4ep_512hid_009lr_{}.pt'.format(time.time()))
- Print the train and test accuracy of your model
# Don't hardcode the train and test accuracy
print('Training accuracy is %2.3f :' %(training_accuracy) )
print('Test accuracy is %2.3f :' %(test_accuracy) )
Training accuracy is 100.000 :
Test accuracy is 85.174 :
SVM
def get_np_features_from_tensor(train_features, train_labels, test_features, test_labels):
    # Flatten each video's 25x4096 feature matrix into one 102400-dim vector
    train_features_svc = train_features.view(len(train_features), -1).cpu().numpy()
    train_labels_svc = train_labels.cpu().numpy()
    test_features_svc = test_features.view(len(test_features), -1).cpu().numpy()
    test_labels_svc = test_labels.cpu().numpy()
    return train_features_svc, train_labels_svc, test_features_svc, test_labels_svc
train_features_svc, train_labels_svc, test_features_svc, test_labels_svc = get_np_features_from_tensor(train_features, train_labels, test_features, test_labels)
print(train_features_svc.shape)
print(train_labels_svc.shape)
print(test_features_svc.shape)
print(test_labels_svc.shape)
(2409, 102400)
(2409,)
(951, 102400)
(951,)
def onevsrest_simple_svc(train_features, train_labels, test_features, test_labels, train_lambda=0.008):
start_time = time.time()
classif = SVC(C=train_lambda, kernel='linear')
print('Classifier:', classif)
print('Started fitting!')
classif.fit(train_features, train_labels)
print('Elapsed time: {:.2f}s'.format(time.time() - start_time))
print('Started predicting!')
predictions = classif.predict(test_features)
print('Elapsed time: {:.2f}s'.format(time.time() - start_time))
print('Calculating scores!')
score = classif.score(train_features, train_labels) * 100
accuracy = accuracy_score(test_labels, predictions) * 100
print('Done!')
print('Elapsed time: {:.2f}s'.format(time.time() - start_time))
return score, accuracy, predictions
train_lambda = 0.11  # passed to SVC as C, the inverse regularization strength
print('Regularization Factor: {}'.format(train_lambda))
svm_training_accuracy, svm_test_accuracy, predictions = onevsrest_simple_svc(train_features_svc,
train_labels_svc,
test_features_svc,
test_labels_svc,
train_lambda)
print("The accuracy of SVC classifier is {:.2f}% for train and {:.2f}% for test".format(svm_training_accuracy, svm_test_accuracy))
Regularization Factor: 0.11
Classifier: SVC(C=0.11, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
Started fitting!
Elapsed time: 553.98s
Started predicting!
Elapsed time: 824.36s
Calculating scores!
Done!
Elapsed time: 1498.29s
The accuracy of SVC classifier is 100.00% for train and 83.49% for test
# cm = confusion_matrix(test_labels_svc, predictions)
# print(cm)
- Print the train and test accuracy of the SVM
# Don't hardcode the train and test accuracy
print('Training accuracy is %2.3f :' %(svm_training_accuracy) )
print('Test accuracy is %2.3f :' %(svm_test_accuracy) )
Training accuracy is 100.000 :
Test accuracy is 83.491 :
Bonus
- Get features for classes 26-101. We do this by batching 15 classes at a time and merging the batches later when we want to feed them to our model. I skip most of the code for this part, since it reuses the same split-and-merge feature-extraction methods as before: call the batch_dump method below once per batch (see the sketch after it), then merge.
batches = [(26, 41), (41, 56), (56, 71), (71, 86), (86, 102)]
images_per_sample = 25
batch_size = 2
def batch_dump(num_classes, images_per_sample, batch_size):
train_set_small, test_set_small = get_data_loader(num_classes, images_per_sample, batch_size)
print('TRAINING DATA BATCH METRICS')
extract_features_and_dump(vgg16, train_set_small, True, num_classes)
print('TEST DATA BATCH METRICS')
extract_features_and_dump(vgg16, test_set_small, False, num_classes)
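For completeness, the omitted driver is just a loop over the class ranges defined above (a sketch):

```python
# Extract and dump features for each 15-class batch in turn.
for num_classes in batches:
    batch_dump(num_classes, images_per_sample, batch_size)
```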
- Merge all batches!
batches_to_merge = [(1, 16), (16, 26), (26, 41), (41, 56), (56, 71), (71, 86), (86, 102)]
images_per_sample = 25
batch_size = 2
ucf_101_train_samples = merge_batches(batches_to_merge, True)
print(ucf_101_train_samples.size())
ucf_101_test_samples = merge_batches(batches_to_merge, False)
print(ucf_101_test_samples.size())
torch.Size([238425, 4097])
torch.Size([94575, 4097])
train_features_101, train_labels_101 = samples_to_tensor(ucf_101_train_samples)
test_features_101, test_labels_101 = samples_to_tensor(ucf_101_test_samples)
- Print the size of your training and test data
# Don't hardcode the shape of train and test data
print('Shape of training data is :', train_features_101.size())
print('Shape of training label data is :', train_labels_101.size())
print('Shape of test/validation data is :', test_features_101.size())
print('Shape of test/validation label data is :', test_labels_101.size())
Shape of training data is : torch.Size([9537, 25, 4096])
Shape of training label data is : torch.Size([9537])
Shape of test/validation data is : torch.Size([3783, 25, 4096])
Shape of test/validation label data is : torch.Size([3783])
- Modelling and evaluation
num_epochs = 5
model = LSTMClassifier(input_dim=4096, hidden_dim=512, num_layers=1,
batch_size=1, target_size=101)
training_accuracy_101 = train_model(model, train_features_101, train_labels_101, num_epochs, 0.05)
Training Accuracy for epoch 1: 57.691%
Training Accuracy for epoch 2: 89.106%
Training Accuracy for epoch 3: 97.096%
Training Accuracy for epoch 4: 99.591%
Training Accuracy for epoch 5: 100.000%
Train time elapsed in seconds: 275.56703662872314
test_accuracy_101 = test_model(model, test_features_101, test_labels_101)
Test Accuracy of the model on test images: 68.01480306634946 %
Test time elapsed in seconds: 7.775494575500488
# Don't hardcode the train and test accuracy
print('Training accuracy is %2.3f :' %(training_accuracy_101) )
print('Test accuracy is %2.3f :' %(test_accuracy_101) )
Training accuracy is 100.000 :
Test accuracy is 68.015 :
This wraps up the fifth assignment that I completed as part of the Computer Vision course. It took a lot of time, since feature extraction requires a lot of compute and is not something that can simply be kicked off and left to run, especially on Google Colab, which kills long-running processes. That is why I had to improvise: split the data into batches for feature extraction and merge them later to gather features for all the classes. That is how I was able to complete the task for all 101 classes when we were only required to do it for 25.