Datasets @ CVPR 2021 : Problems that shouldn’t be missed

Ever since I started working on real-world applications of machine learning, I have become obsessed with dataset quality and have spent hours making sure nothing goes wrong in training datasets. While working with data, one interesting thing I realised is that an organisation's datasets offer a window into the complexity of the problems it is trying to solve. Newly released datasets in the public domain can likewise be a great proxy for understanding developments in computer vision, as well as new avenues of problems to be solved.

Like every year, I religiously scanned through the heaps of papers at CVPR, but this time I decided to do it a bit differently: I intentionally looked for papers specifically on datasets or dataset quality management. As expected, I was not disappointed. This tool [LINK] by Joshua Preston at Georgia Tech was very useful for exploring the wild web of papers this year. I came across a lot of interesting datasets and ingenious ways in which people are working to solve the problems behind them.

In this blog, I have briefly summarised a few dataset papers that I found fascinating, having read through all of them to pull out the details worth sharing:

1. The Multi-Temporal Urban Development SpaceNet Dataset

This is the most interesting dataset paper at CVPR because of the very hard global problem it tackles: quantifying urbanisation in a region using satellite imagery analytics, which can be a huge help to nations that don't have the infrastructure and financial resources to set up an effective civil registration system. The dataset primarily tracks construction at 101 locations around the world using satellite images captured over a span of 18 to 26 months. There are over 11 million annotations, with unique pixel-level labelling of individual buildings and construction sites over time.

All of this might make it sound like a slightly more challenging object segmentation and tracking problem. But wait for it. Segmentation is tough, but this one is segmentation on steroids. To put it in perspective, there are often more than 30 objects per frame. Also, unlike normal video data, there is little consistency between frames because of weather, illumination, seasonal effects on the ground and so on. This makes it far tougher than our favourite video tracking datasets like MOT17 and the Stanford Drone dataset.
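To make the tracking challenge concrete, here is a minimal sketch of how building footprints might be linked across two time steps using mask IoU. The function names and threshold are illustrative, not from the paper, and a real SpaceNet pipeline would be far more robust:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean building masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def match_buildings(prev_masks, curr_masks, iou_thresh=0.5):
    """Greedily link buildings in consecutive frames by highest IoU.

    Returns a list of (prev_idx, curr_idx) pairs; current masks left
    unmatched would be treated as new construction.
    """
    matches, used = [], set()
    for i, pm in enumerate(prev_masks):
        best_j, best_iou = None, iou_thresh
        for j, cm in enumerate(curr_masks):
            if j in used:
                continue
            iou = mask_iou(pm, cm)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            matches.append((i, best_j))
            used.add(best_j)
    return matches
```

The low frame-to-frame consistency mentioned above is exactly what breaks this kind of naive IoU linking: seasonal and illumination changes shift appearance, and new construction means masks appear and grow over time.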

While this is a hard problem, solving it could yield enormous benefits in terms of global welfare.

[Paper Link Here] [Dataset link]

2. Towards Semantic Segmentation of Urban-Scale 3D Point Clouds: A Dataset, Benchmarks and Challenges

This year's conference was heavy on 3D image processing and its corresponding methods, so this dataset, called SensatUrban, doesn't come as a surprise, except that this dataset of photogrammetric 3D point clouds is far bigger than any open-source dataset available so far. It covers over 7.6 km² of city landscape across York, Cambridge and Birmingham. Each point is labelled with one of 13 semantic classes. The dataset has the potential to advance research in a lot of promising areas such as automated aerial surveying, smart cities, and large-scale infrastructure planning and management.

In this paper, the authors also experimented with the colour information in point clouds and demonstrated that neural networks trained on colour-rich point clouds generalise better on the test set. This provides an important direction for the development of future applications in this area.
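The two input variants being compared can be sketched as simple per-point feature construction: geometry only versus geometry plus colour. The shapes and helper below are illustrative assumptions on my part; SensatUrban's actual pipeline feeds these features into dedicated point-cloud networks:

```python
import numpy as np

def make_point_features(xyz, rgb=None):
    """Stack per-point inputs for a point-cloud network.

    xyz: (N, 3) coordinates; rgb: optional (N, 3) colours in [0, 255].
    With colour, each point carries a 6-D feature; without, 3-D geometry only.
    """
    xyz = xyz - xyz.mean(axis=0)                  # centre the cloud
    if rgb is None:
        return xyz.astype(np.float32)
    return np.hstack([xyz, rgb / 255.0]).astype(np.float32)
```

The paper's finding is essentially that the 6-D variant leads to better test-set generalisation than the 3-D one.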

[Paper Link here] [Github repo]

3. Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

This is another favourite dataset from this year, as it takes a slightly different approach to the image captioning and video summarisation problems. For such tasks we generally have datasets like COCO, which pair each image with an accompanying textual caption. While this approach has proven promising, we often forget that a lot of the rich summarisation of our visual experience happens in spoken language. Building on this intuition, the dataset assembles a corpus of 500k audio descriptions of short videos depicting a broad range of events. And the authors don't stop at presenting an awesome dataset: they also provide an elegant solution to video/caption retrieval using an Adaptive Mean Margin (AMM) approach.
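To give a feel for the idea, here is a toy margin-based retrieval loss where each row's margin adapts to the mean similarity of its negatives. This is my illustrative reading of the adaptive-margin intuition, not the paper's exact AMM formulation:

```python
import numpy as np

def adaptive_margin_loss(sim, base_margin=0.2):
    """Toy retrieval loss over a video-caption similarity matrix.

    sim[i, j] is the similarity between video i and caption j, with
    matched pairs on the diagonal. Each row's margin adapts to the mean
    similarity of its negatives, so rows with harder negatives demand a
    larger gap between the positive and the negatives.
    """
    n = sim.shape[0]
    losses = []
    for i in range(n):
        pos = sim[i, i]
        negs = np.delete(sim[i], i)
        margin = negs.mean() + base_margin   # adaptive per-row margin
        losses.append(max(0.0, margin - pos))
    return float(np.mean(losses))
```

With a fixed margin, easy batches contribute as much as hard ones; adapting the margin to the negatives focuses learning where retrieval is actually ambiguous.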

[Paper Link here]

4. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts

Recently, model pre-training has gained huge popularity due to the performance gains from pre-trained transformers and CNN architectures. The usual recipe is to train a model on a similar dataset with a huge number of labels and then use transfer learning to apply the model to downstream tasks. So far, the only large-scale dataset available for pre-training on vision-and-language tasks was the CC-3M dataset, with 3 million captions. Now, the Google research team has extended this dataset to 12 million image-caption pairs by relaxing the constraints used during data scraping. What is more interesting is the approach by which the dataset was generated: the use of the Google Cloud Natural Language API and Google Cloud Vision API for filtering during dataset curation is a good lesson for any future dataset curation effort. Using the 12M dataset, the image captioning model was able to learn long-tail concepts, i.e. concepts that are quite specific and rare in datasets. The results of this training approach were quite impressive and are visualised below.
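The pre-train-then-transfer recipe described above can be sketched in a framework-agnostic way: freeze a pre-trained encoder and fit only a small task head on the downstream data. Everything here (the stand-in encoder, the toy task) is an assumption for illustration; real pipelines would fine-tune a deep network:

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrained_encoder(x):
    """Stand-in for a frozen pre-trained backbone (weights fixed)."""
    W = np.array([[1.0, -1.0], [0.5, 2.0]])   # pretend these were learned upstream
    return np.tanh(x @ W)

def fit_linear_head(x, y, lr=0.5, steps=200):
    """Fit a logistic-regression head on frozen encoder features."""
    feats = pretrained_encoder(x)
    w, b = np.zeros(feats.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))   # sigmoid
        grad = p - y                                  # dL/dlogits
        w -= lr * feats.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# toy downstream task: two clusters, one per class
y = rng.integers(0, 2, 200)
x = rng.normal(scale=0.3, size=(200, 2)) + np.where(y[:, None] == 1, 1.0, -1.0)
w, b = fit_linear_head(x, y)
acc = ((pretrained_encoder(x) @ w + b > 0) == y).mean()
```

The point of scaling the pre-training corpus from 3M to 12M pairs is that the frozen representation becomes rich enough to cover rare, long-tail concepts the downstream head would otherwise never see.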

[Paper Link here]

5. Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers

While there is much talk about fully autonomous self-driving systems, the fact remains that autonomous driving is a collection of very hard problems that must be solved simultaneously in real-time. One crucial part is making these autonomous systems understand the behaviour of pedestrians in response to their presence. Intuitively, we do this all the time, but it can be quite a challenge for machines, so predicting pedestrian trajectories in dense environments is a challenging task. The Euro-PVI dataset was curated to address this by providing a labelled dataset of pedestrian and bicyclist trajectories to train on. Earlier datasets like Stanford Drone, nuScenes and Lyft L5 focused on the trajectories of nearby vehicles, but that is only one part of the complete picture for autonomous systems. Euro-PVI provides a comprehensive picture of interactions, with information such as the visual scene at the time of interaction, velocities and accelerations during the interaction, and the overall coordinate trajectory over the whole interaction.

All of this information has to be mapped to a relevant latent space by the trained model. To tackle the problem of jointly representing trajectory and visual information in latent space, the same paper also proposes a generative architecture, Joint-B-VAE: a variational autoencoder trained to encode the actors' trajectories and decode them into future trajectories.
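The encode-into-latent, decode-into-future structure can be sketched as a forward pass. The class below uses untrained random weights and purely illustrative dimensions; it shows the data flow of a trajectory VAE in general, not the paper's Joint-B-VAE:

```python
import numpy as np

rng = np.random.default_rng(42)

class TrajectoryVAE:
    """Minimal VAE-style forward pass for trajectory prediction.

    Encodes an observed (t_obs, 2) trajectory into a latent Gaussian and
    decodes a sample into a future (t_fut, 2) trajectory. Weights are
    random: this shows the data flow only, not a trained model.
    """
    def __init__(self, t_obs=8, t_fut=12, latent_dim=4):
        self.t_fut = t_fut
        self.latent_dim = latent_dim
        self.W_enc = rng.normal(size=(t_obs * 2, 2 * latent_dim))  # -> mu, logvar
        self.W_dec = rng.normal(size=(latent_dim, t_fut * 2))

    def encode(self, traj):
        h = traj.reshape(-1) @ self.W_enc
        return h[:self.latent_dim], h[self.latent_dim:]   # mu, logvar

    def reparameterise(self, mu, logvar):
        eps = rng.normal(size=mu.shape)
        return mu + np.exp(0.5 * logvar) * eps

    def decode(self, z):
        return (z @ self.W_dec).reshape(self.t_fut, 2)

vae = TrajectoryVAE()
obs = rng.normal(size=(8, 2))            # 8 observed (x, y) positions
mu, logvar = vae.encode(obs)
future = vae.decode(vae.reparameterise(mu, logvar))
```

Sampling different latents yields different plausible futures, which is why a VAE-style model suits the inherently multi-modal problem of predicting where a pedestrian will go next.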

[Paper Link here]


Overall, it was an amazing experience getting to know the interesting problems that ML researchers are trying to solve with labelled datasets, and seeing how the field of machine learning and AI has started to emphasise datasets as much as algorithms. It will be interesting to follow the datasets released at upcoming conferences like ICCV as well.