Deep learning is one of the main enablers of analytics and learning in the IoT domain. The paper Deep Learning for IoT Big Data and Streaming Analytics: A Survey, by Mehdi Mohammadi, Ala Al-Fuqaha, Sameh Sorour, and Mohsen Guizani, gives a thorough roundup of deep learning advances for big data and IoT. In this article, we draw on that paper to establish the importance of IoT datasets for deep learning applications. The paper also provides a handy list of commonly used datasets suitable for building deep learning applications in IoT, which we have included at the end of the article.
IoT and Big Data: The relationship
IoT and big data have a two-way relationship. IoT is a main producer of big data, and as such it is an important target for big data analytics that improve the processes and services of IoT. However, IoT data differs from conventional big data in several ways:
- Large-scale streaming data: A large number of IoT devices generate streams of data continuously, so IoT data arrives as large-scale streams. Conventional big data analytics, on the other hand, typically lacks real-time processing.
- Heterogeneity: IoT data is heterogeneous because different IoT data acquisition devices gather different kinds of information. Conventional big data sources are generally more homogeneous.
- Time and space correlation: IoT sensor devices are attached to a specific location, so each data item carries a location and a time-stamp. Conventional big data generally lacks this per-item time-stamp resolution.
- High-noise data: IoT data is highly noisy because it consists of tiny pieces of data that are prone to errors and noise during acquisition and transmission. Conventional big data, in contrast, is generally less noisy.
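To make the streaming, time-stamped, and noisy nature of IoT data concrete, here is a minimal sketch in Python. The `Reading` schema, the `median_smooth` helper, and the sample values are all hypothetical and are not taken from the survey; a sliding median is just one simple way to damp isolated glitches, not a method the paper prescribes.

```python
from collections import deque
from statistics import median
from typing import Iterable, Iterator, NamedTuple


class Reading(NamedTuple):
    """One time-stamped, location-tagged sensor sample (hypothetical schema)."""
    timestamp: float  # seconds since the start of the recording
    sensor_id: str    # stands in for the device's location
    value: float


def median_smooth(stream: Iterable[Reading], window: int = 3) -> Iterator[Reading]:
    """Process readings one at a time, as they would arrive from a stream.

    Each value is replaced by the median of the last `window` values
    seen from the same sensor, which suppresses isolated spikes.
    """
    history = {}  # sensor_id -> deque of that sensor's most recent values
    for reading in stream:
        buf = history.setdefault(reading.sensor_id, deque(maxlen=window))
        buf.append(reading.value)
        yield reading._replace(value=median(buf))


# Illustrative data only: a temperature sensor with one transmission glitch.
readings = [
    Reading(0.0, "kitchen", 21.0),
    Reading(1.0, "kitchen", 21.2),
    Reading(2.0, "kitchen", 85.0),  # spurious spike
    Reading(3.0, "kitchen", 21.1),
    Reading(4.0, "kitchen", 21.3),
]
smoothed = [r.value for r in median_smooth(readings)]
print(smoothed)
```

With a window of three, the single 85.0 spike never reaches the output, while each reading keeps its original time-stamp and sensor identifier. A real deployment would need a strategy chosen for its own sensors and noise profile.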
Big data is conventionally characterized by the 3V's: Volume, Velocity, and Variety. Techniques built for big data analytics on that basis are not sufficient to analyze the kind of data that IoT devices generate. For instance, autonomous cars need to make fast decisions on driving actions such as lane or speed changes. These decisions should be supported by fast analytics over data streaming from multiple sources (e.g., cameras, radars, left/right signals, traffic lights). This extends the characterization of IoT big data to 6V's:
- Volume: The quantity of data generated by IoT devices is far greater than before and clearly fits this feature.
- Velocity: Advanced tools and technologies for analytics are needed to handle the high rate of data production efficiently.
- Variety: Big data may be structured, semi-structured, or unstructured. The data types produced by IoT include text, audio, video, sensory data, and so on.
- Veracity: Veracity refers to the quality, consistency, and trustworthiness of the data, which in turn leads to accurate analytics.
- Variability: This property refers to the different rates of data flow.
- Value: Value is the transformation of big data to useful information and insights that bring competitive advantage to organizations.
Despite recent advances in deep learning (DL) for big data, significant challenges still need to be addressed before the technology matures. Each of the six characteristics of IoT big data poses a challenge for DL techniques. One common denominator across all of them is the lack of available IoT big data datasets.
IoT datasets and why they are needed
Deep learning methods have shown promise, with state-of-the-art results in several areas such as signal processing, natural language processing, and image recognition. The trend is picking up in IoT verticals as well. IoT datasets play a major role in improving IoT analytics: larger real-world datasets give DL algorithms more data to learn from, which in turn improves their accuracy. However, the lack of large real-world datasets for IoT applications is a major hurdle to incorporating DL models in IoT. This shortage is a barrier to the deployment and acceptance of DL-based IoT analytics, since a system has to be empirically validated and shown to perform well in the real world. The lack of availability is mainly because:
- Most IoT datasets are held by large organizations that are unwilling to share them easily.
- Access is restricted by copyright or privacy considerations. These are more common in domains with human data, such as healthcare and education.
While there is a lot of ground to be covered in terms of making datasets for IoT available, here is a list of commonly used datasets suitable for building deep learning applications in IoT.
Dataset Name | Domain | Provider | Notes | Address/Link |
CGIAR dataset | Agriculture, Climate | CCAFS | High-resolution climate datasets for a variety of fields including agricultural | http://www.ccafs-climate.org/ |
Educational Process Mining | Education | University of Genova | Recordings of 115 subjects’ activities through a logging application while learning with an educational simulator | http://archive.ics.uci.edu/ml/datasets/Educational+Process+Mining+%28EPM%29%3A+A+Learning+Analytics+Data+Set |
Commercial Building Energy Dataset | Energy, Smart Building | IIITD | Energy related data set from a commercial building where data is sampled more than once a minute. | http://combed.github.io/ |
Individual household electric power consumption | Energy, Smart home | EDF R&D, Clamart, France | One-minute sampling rate over a period of almost 4 years | http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption |
AMPds dataset | Energy, Smart home | S. Makonin | AMPds contains electricity, water, and natural gas measurements at one minute intervals for 2 years of monitoring | http://ampds.org/ |
UK Domestic Appliance-Level Electricity | Energy, Smart Home | Kelly and Knottenbelt | Power demand from five houses. In each house both the whole-house mains power demand as well as power demand from individual appliances are recorded. | http://www.doc.ic.ac.uk/~dk3810/data/ |
PhysioBank databases | Healthcare | PhysioNet | Archive of over 80 physiological datasets. | https://physionet.org/physiobank/database/ |
Saarbruecken Voice Database | Healthcare | Universität des Saarlandes | A collection of voice recordings from more than 2000 persons for pathological voice detection. | http://www.stimmdatebank.coli.uni-saarland.de/help_en.php4 |
T-LESS | Industry | CMP at Czech Technical University | An RGB-D dataset and evaluation methodology for detection and 6D pose estimation of texture-less objects | http://cmp.felk.cvut.cz/t-less/ |
CityPulse Dataset Collection | Smart City | CityPulse EU FP7 project | Road Traffic Data, Pollution Data, Weather, Parking | http://iot.ee.surrey.ac.uk:8080/datasets.html |
Open Data Institute – node Trento | Smart City | Telecom Italia | Weather, Air quality, Electricity, Telecommunication | http://theodi.fbk.eu/openbigdata/ |
Malaga datasets | Smart City | City of Malaga | A broad range of categories such as energy, ITS, weather, Industry, Sport, etc. | http://datosabiertos.malaga.eu/dataset |
Gas sensors for home activity monitoring | Smart home | Univ. of California San Diego | Recordings of 8 gas sensors under three conditions including background, wine and banana presentations. | http://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring |
CASAS datasets for activities of daily living | Smart home | Washington State University | Several public datasets related to Activities of Daily Living (ADL) performance in a two story home, an apartment, and an office settings. | http://ailab.wsu.edu/casas/datasets.html |
ARAS Human Activity Dataset | Smart home | Bogazici University | Human activity recognition datasets collected from two real houses with multiple residents during two months. | https://www.cmpe.boun.edu.tr/aras/ |
MERLSense Data | Smart home, building | Mitsubishi Electric Research Labs | Motion sensor data of residual traces from a network of over 200 sensors for two years, containing over 50 million records. | http://www.merl.com/wmd |
SportVU | Sport | Stats LLC | Video of basketball and soccer games captured from 6 cameras. | http://go.stats.com/sportvu |
RealDisp | Sport | O. Banos | Includes a wide range of physical activities (warm up, cool down and fitness exercises). | http://orestibanos.com/datasets.htm |
Taxi Service Trajectory | Transportation | Prediction Challenge, ECML PKDD 2015 | Trajectories performed by all the 442 taxis running in the city of Porto, in Portugal. | http://www.geolink.pt/ecmlpkdd2015-challenge/dataset.html |
GeoLife GPS Trajectories | Transportation | Microsoft | A GPS trajectory by a sequence of time-stamped points | https://www.microsoft.com/en-us/download/details.aspx?id=52367 |
T-Drive trajectory data | Transportation | Microsoft | Contains a one-week trajectories of 10,357 taxis | https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample/ |
Chicago Bus Traces data | Transportation | M. Doering | Bus traces from the Chicago Transport Authority for 18 days with a rate between 20 and 40 seconds. | http://www.ibr.cs.tu-bs.de/users/mdoering/bustraces/ |
Uber trip data | Transportation | FiveThirtyEight | About 20 million Uber pickups in New York City during 12 months. | https://github.com/fivethirtyeight/uber-tlc-foil-response |
Traffic Sign Recognition | Transportation | K. Lim | Three datasets: Korean daytime, Korean nighttime, and German daytime traffic signs based on Vienna traffic rules. | https://figshare.com/articles/Traffic_Sign_Recognition_Testsets/4597795 |
DDD17 | Transportation | J. Binas | End-To-End DAVIS Driving Dataset. | http://sensors.ini.uzh.ch/databases.html |
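As a starting point for working with one of the listed datasets, the sketch below parses records in the layout used by the Individual household electric power consumption dataset. It assumes a semicolon-separated text file with `Date` and `Time` columns and `?` marking missing values; verify this against the file you actually download. The `load_power_rows` helper and the inline `SAMPLE` rows are hypothetical stand-ins so the example runs without any download, and only two of the measurement columns are shown.

```python
import csv
import io
from datetime import datetime

# Illustrative rows only, written in the assumed layout of the dataset file.
SAMPLE = """Date;Time;Global_active_power;Voltage
16/12/2006;17:24:00;4.216;234.840
16/12/2006;17:25:00;?;?
16/12/2006;17:26:00;5.374;233.290
"""


def load_power_rows(fileobj):
    """Return (timestamp, kilowatts) pairs, skipping rows with missing power values."""
    rows = []
    for record in csv.DictReader(fileobj, delimiter=";"):
        raw = record["Global_active_power"]
        if raw in ("", "?"):
            continue  # a missing measurement, as is common in sensor data
        timestamp = datetime.strptime(
            record["Date"] + " " + record["Time"], "%d/%m/%Y %H:%M:%S"
        )
        rows.append((timestamp, float(raw)))
    return rows


rows = load_power_rows(io.StringIO(SAMPLE))
for timestamp, kilowatts in rows:
    print(timestamp.isoformat(), kilowatts)
```

To read the real file, pass an open file handle instead of `io.StringIO(SAMPLE)`. The resulting time-stamped pairs can then be windowed and fed to whichever deep learning framework you use.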