
Deep learning is one of the major enablers of analytics and learning in the IoT domain. A good roundup of deep learning advances for big data and IoT is given in the paper Deep Learning for IoT Big Data and Streaming Analytics: A Survey by Mehdi Mohammadi, Ala Al-Fuqaha, Sameh Sorour, and Mohsen Guizani. In this article, we draw on that paper to establish the importance of IoT datasets for deep learning applications. The paper also provides a handy list of commonly used datasets suitable for building deep learning applications in IoT, which we have added at the end of the article.

IoT and Big Data: The relationship

IoT and big data have a two-way relationship. IoT is a major producer of big data, and as such it is an important target for big data analytics that improve IoT processes and services. However, IoT data differs from conventional big data in several ways:

  • Large-scale streaming data: IoT data is large-scale streaming data, because a large number of IoT devices generate data streams continuously. Conventional big data, on the other hand, is often processed without real-time requirements.
  • Heterogeneity: IoT data is heterogeneous because different data acquisition devices gather different kinds of information. Conventional big data sources are generally more homogeneous.
  • Time and space correlation: IoT sensor devices are attached to a specific location, so each data item carries a location and a time-stamp. Conventional big data often lacks this fine-grained time-stamp resolution.
  • High noise: IoT data is highly noisy because it arrives as tiny pieces of data that are prone to errors and noise during acquisition and transmission. Conventional big data, in contrast, is generally less noisy.
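To make the streaming, time-stamped, and noisy nature of IoT data concrete, here is a minimal Python sketch that smooths a stream of sensor readings one item at a time with a rolling median. The `Reading` schema, the `median_smooth` helper, and the sample values are all hypothetical and chosen only for illustration; they do not come from the survey paper.

```python
from collections import deque
from statistics import median
from typing import Iterable, Iterator, NamedTuple


class Reading(NamedTuple):
    """One time-stamped, location-tagged sensor sample (hypothetical schema)."""
    timestamp: float  # seconds since the start of the recording
    sensor_id: str    # stands in for the device location
    value: float


def median_smooth(stream: Iterable[Reading], window: int = 3) -> Iterator[Reading]:
    """Replace each value with the median of the last `window` values
    seen from the same sensor, consuming the stream one item at a time."""
    history = {}  # sensor_id -> deque holding that sensor's most recent values
    for reading in stream:
        recent = history.setdefault(reading.sensor_id, deque(maxlen=window))
        recent.append(reading.value)
        yield reading._replace(value=median(recent))


# Made-up temperature readings with one spike, e.g. from a transmission error.
raw = [
    Reading(0.0, "kitchen", 21.0),
    Reading(1.0, "kitchen", 21.2),
    Reading(2.0, "kitchen", 95.0),
    Reading(3.0, "kitchen", 21.1),
    Reading(4.0, "kitchen", 21.3),
]
smoothed = list(median_smooth(raw))
```

A rolling median is only one of many possible noise filters. It is used here because it needs a small fixed amount of memory per sensor, which suits data that never stops arriving.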

Big data is conventionally characterized by the 3V’s: Volume, Velocity, and Variety. The techniques used for big data analytics are therefore not sufficient to analyze the kind of data that IoT devices generate. For instance, autonomous cars need to make fast decisions on driving actions such as lane or speed changes. These decisions should be supported by fast analytics on data streaming from multiple sources (e.g., cameras, radars, left/right signals, and traffic lights). This extends the characterization of IoT big data to 6V’s.

  • Volume: IoT devices generate far more data than earlier sources did, so IoT data clearly fits this feature.
  • Velocity: Advanced tools and technologies for analytics are needed to keep up with the high rate of data production.
  • Variety: Big data may be structured, semi-structured, or unstructured. The data types produced by IoT include text, audio, video, sensory data, and so on.
  • Veracity: Veracity refers to the quality, consistency, and trustworthiness of the data, which in turn leads to accurate analytics.
  • Variability: This property refers to the different rates of data flow.
  • Value: Value is the transformation of big data to useful information and insights that bring competitive advantage to organizations.
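Velocity and variability can be estimated directly from arrival time-stamps. The following Python sketch counts events per fixed time window and then summarizes the average rate and its spread. The `rate_per_window` helper, the one-second window, and the arrival times are illustrative assumptions, not part of the survey.

```python
from collections import Counter
from statistics import mean, pstdev


def rate_per_window(timestamps, window_seconds=1.0):
    """Count how many events arrive in each fixed-length time window,
    including empty windows between the first and last event."""
    counts = Counter(int(t // window_seconds) for t in timestamps)
    if not counts:
        return []
    first, last = min(counts), max(counts)
    return [counts.get(i, 0) for i in range(first, last + 1)]


# Made-up arrival times (in seconds) of messages from one device.
arrivals = [0.1, 0.2, 0.9, 1.5, 3.0, 3.1, 3.2, 3.3]
rates = rate_per_window(arrivals)

velocity = mean(rates)       # average events per window
variability = pstdev(rates)  # how much the rate fluctuates between windows
```

A large spread relative to the mean indicates bursty traffic, which an analytics pipeline has to absorb without dropping data.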

Despite recent advances in deep learning (DL) for big data, significant challenges still need to be addressed before the technology matures. Each of the six characteristics of IoT big data poses a challenge for DL techniques. One common denominator is the lack of available IoT big data datasets.

IoT datasets and why they are needed

Deep learning methods have shown promise, with state-of-the-art results in several areas such as signal processing, natural language processing, and image recognition. The trend is picking up in IoT verticals as well. IoT datasets play a major role in improving IoT analytics: larger real-world datasets give DL algorithms more data to learn from, which in turn improves their accuracy. However, the lack of large real-world datasets for IoT applications is a major hurdle for incorporating DL models in IoT. This shortage is a barrier to the deployment and acceptance of DL-based IoT analytics, since empirical validation and evaluation need to show that a system works well in the real world. The lack of availability is mainly because:

  • Most IoT datasets are held by large organizations that are unwilling to share them readily.
  • Access is restricted by copyright or privacy considerations, which are more common in domains with human data, such as healthcare and education.

While there is a lot of ground to be covered in terms of making datasets for IoT available, here is a list of commonly used datasets suitable for building deep learning applications in IoT.

| Dataset Name | Domain | Provider | Notes | Address/Link |
|---|---|---|---|---|
| CGIAR dataset | Agriculture, Climate | CCAFS | High-resolution climate datasets for a variety of fields including agricultural | http://www.ccafs-climate.org/ |
| Educational Process Mining | Education | University of Genova | Recordings of 115 subjects’ activities through a logging application while learning with an educational simulator | http://archive.ics.uci.edu/ml/datasets/Educational+Process+Mining+%28EPM%29%3A+A+Learning+Analytics+Data+Set |
| Commercial Building Energy Dataset | Energy, Smart Building | IIITD | Energy related data set from a commercial building where data is sampled more than once a minute. | http://combed.github.io/ |
| Individual household electric power consumption | Energy, Smart home | EDF R&D, Clamart, France | One-minute sampling rate over a period of almost 4 years | http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption |
| AMPds dataset | Energy, Smart home | S. Makonin | AMPds contains electricity, water, and natural gas measurements at one minute intervals for 2 years of monitoring | http://ampds.org/ |
| UK Domestic Appliance-Level Electricity | Energy, Smart Home | Kelly and Knottenbelt | Power demand from five houses. In each house both the whole-house mains power demand as well as power demand from individual appliances are recorded. | http://www.doc.ic.ac.uk/~dk3810/data/ |
| PhysioBank databases | Healthcare | PhysioNet | Archive of over 80 physiological datasets. | https://physionet.org/physiobank/database/ |
| Saarbruecken Voice Database | Healthcare | Universität des Saarlandes | A collection of voice recordings from more than 2000 persons for pathological voice detection. | http://www.stimmdatebank.coli.uni-saarland.de/help_en.php4 |
| T-LESS | Industry | CMP at Czech Technical University | An RGB-D dataset and evaluation methodology for detection and 6D pose estimation of texture-less objects | http://cmp.felk.cvut.cz/t-less/ |
| CityPulse Dataset Collection | Smart City | CityPulse EU FP7 project | Road Traffic Data, Pollution Data, Weather, Parking | http://iot.ee.surrey.ac.uk:8080/datasets.html |
| Open Data Institute – node Trento | Smart City | Telecom Italia | Weather, Air quality, Electricity, Telecommunication | http://theodi.fbk.eu/openbigdata/ |
| Malaga datasets | Smart City | City of Malaga | A broad range of categories such as energy, ITS, weather, Industry, Sport, etc. | http://datosabiertos.malaga.eu/dataset |
| Gas sensors for home activity monitoring | Smart home | Univ. of California San Diego | Recordings of 8 gas sensors under three conditions including background, wine and banana presentations. | http://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring |
| CASAS datasets for activities of daily living | Smart home | Washington State University | Several public datasets related to Activities of Daily Living (ADL) performance in a two story home, an apartment, and an office settings. | http://ailab.wsu.edu/casas/datasets.html |
| ARAS Human Activity Dataset | Smart home | Bogazici University | Human activity recognition datasets collected from two real houses with multiple residents during two months. | https://www.cmpe.boun.edu.tr/aras/ |
| MERLSense Data | Smart home, building | Mitsubishi Electric Research Labs | Motion sensor data of residual traces from a network of over 200 sensors for two years, containing over 50 million records. | http://www.merl.com/wmd |
| SportVU | Sport | Stats LLC | Video of basketball and soccer games captured from 6 cameras. | http://go.stats.com/sportvu |
| RealDisp | Sport | O. Banos | Includes a wide range of physical activities (warm up, cool down and fitness exercises). | http://orestibanos.com/datasets.htm |
| Taxi Service Trajectory | Transportation | Prediction Challenge, ECML PKDD 2015 | Trajectories performed by all the 442 taxis running in the city of Porto, in Portugal. | http://www.geolink.pt/ecmlpkdd2015-challenge/dataset.html |
| GeoLife GPS Trajectories | Transportation | Microsoft | A GPS trajectory by a sequence of time-stamped points | https://www.microsoft.com/en-us/download/details.aspx?id=52367 |
| T-Drive trajectory data | Transportation | Microsoft | Contains a one-week trajectories of 10,357 taxis | https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample/ |
| Chicago Bus Traces data | Transportation | M. Doering | Bus traces from the Chicago Transport Authority for 18 days with a rate between 20 and 40 seconds. | http://www.ibr.cs.tu-bs.de/users/mdoering/bustraces/ |
| Uber trip data | Transportation | FiveThirtyEight | About 20 million Uber pickups in New York City during 12 months. | https://github.com/fivethirtyeight/uber-tlc-foil-response |
| Traffic Sign Recognition | Transportation | K. Lim | Three datasets: Korean daytime, Korean nighttime, and German daytime traffic signs based on Vienna traffic rules. | https://figshare.com/articles/Traffic_Sign_Recognition_Testsets/4597795 |
| DDD17 | Transportation | J. Binas | End-To-End DAVIS Driving Dataset. | http://sensors.ini.uzh.ch/databases.html |
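As an illustration of how one of the listed datasets might be prepared for a deep learning model, the Python sketch below parses a few rows in the layout of the Individual household electric power consumption file and turns the readings into sliding-window training pairs. It assumes the semicolon-separated layout with "?" marking missing values described on the dataset's UCI page. The sample rows are made up, and the `load_power` and `sliding_windows` helpers are hypothetical names used only for this example.

```python
import csv
import io
from datetime import datetime

# Made-up rows in the assumed layout of the UCI file: semicolon-separated
# columns, day/month/year dates, and "?" where a reading is missing.
SAMPLE = """Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
01/01/2007;00:00:00;1.0;0.1;240.0;4.0;0.0;0.0;0.0
01/01/2007;00:01:00;?;?;?;?;?;?;?
01/01/2007;00:02:00;1.5;0.1;240.0;6.0;0.0;0.0;0.0
01/01/2007;00:03:00;2.0;0.1;240.0;8.0;0.0;0.0;0.0
01/01/2007;00:04:00;2.5;0.1;240.0;10.0;0.0;0.0;0.0
"""


def load_power(text):
    """Return (timestamp, kilowatts) pairs, skipping rows with missing readings."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text), delimiter=";"):
        if rec["Global_active_power"] in ("?", ""):
            continue  # missing measurement
        ts = datetime.strptime(rec["Date"] + " " + rec["Time"], "%d/%m/%Y %H:%M:%S")
        rows.append((ts, float(rec["Global_active_power"])))
    return rows


def sliding_windows(values, size):
    """Split a series into (input window, next value) training pairs."""
    return [(values[i:i + size], values[i + size]) for i in range(len(values) - size)]


series = [kw for _, kw in load_power(SAMPLE)]
pairs = sliding_windows(series, size=2)
```

Skipping rows with missing values breaks the uniform one-minute spacing, so a real pipeline would more likely interpolate or mask the gaps. The skip is used here only to keep the sketch short.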

 

 

 

