Machine Generated Data: TempDuino II
This is the second article on MGD (machine generated data); the first is here. In that article I set up a simple sensor to capture temperature and recorded the value every minute into a file. We left off with the sensor running. Now that we have some data, let's dig into it a bit and see what we can learn.
$ wc -l raw_temp_data.csv
43948 raw_temp_data.csv
Nice, almost 44,000 observations. Keep in mind that the majority of time in analysis is spent on data preparation and cleaning, especially if you have data from different sources in different formats. In this simplified example we begin to see what that preparation and cleaning looks like, using some basic Linux shell commands.
We know our data should all be of the form date,reading. Here is a sample:
2012/01/17 07:07:59,56.26
The following is a regular expression representation of the form above.
'^[0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{2}\.[0-9]{2}$'
(If you are unfamiliar with regular expressions and are interested in this subject, that's a good area to invest your time.) The ^ and $ anchor the pattern to the whole line, and \. matches a literal decimal point. The command "grep -v pattern" returns all lines which do not match the given pattern. That is perfect for seeing how “clean” our data is.
$ grep -v -E '^[0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{2}\.[0-9]{2}$' raw_temp_data.csv
2012/01/16 22:03:16,60.660.65
2012/01/16 22:03:25,60.660.65
2012/01/17 04:45:22012/01/17 07:07:59,56.26
2012/01/17
Interesting. Four “dirty” entries out of 44k. The first two are double reads from the sensor. Then comes the interesting bit in the third line, where we lost a few hours: the time goes from 4:45:22 to 07:07:59. It turns out my computer kernel panicked at 4:45am, and I didn’t get to it until 7:07am. This is a classic missing data problem, but that’s for another post. For now, we will simply drop the offending lines (in this case, remove the -v from grep and redirect the output to another file) and move on.
$ grep -E '^[0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{2}\.[0-9]{2}$' raw_temp_data.csv > temp_data.csv
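As a quick sanity check on a split like this, the clean and dirty counts should add up to the total number of lines. The following is a minimal sketch rather than part of the original session: the function name count_report and the temporary sample file are made up for illustration, and the pattern is written with a leading ^ anchor and an escaped decimal point. The sample lines are taken from the output shown earlier.

```shell
# Pattern for "YYYY/MM/DD HH:MM:SS,NN.NN", anchored at both ends of the line.
PATTERN='^[0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{2}\.[0-9]{2}$'

# count_report FILE: print total, clean and dirty line counts for FILE.
# (Hypothetical helper, not from the original post.)
count_report() {
    file=$1
    total=$(wc -l < "$file" | tr -d ' ')            # strip padding some wc versions add
    clean=$(grep -c -E "$PATTERN" "$file")          # lines matching the expected form
    dirty=$(grep -c -v -E "$PATTERN" "$file")       # everything else
    echo "total=$total clean=$clean dirty=$dirty"
}

# Small sample file: one clean line and three dirty ones.
sample=$(mktemp)
printf '%s\n' \
    '2012/01/17 07:07:59,56.26' \
    '2012/01/16 22:03:16,60.660.65' \
    '2012/01/17 04:45:22012/01/17 07:07:59,56.26' \
    '2012/01/17' > "$sample"

count_report "$sample"    # prints: total=4 clean=1 dirty=3
```

Run against raw_temp_data.csv, the same function should report a total equal to the wc -l figure, with clean plus dirty adding up to it. Note that without the ^ anchor the third sample line would count as clean, because a complete record is embedded at its end.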
Good enough for this experiment. What does this look like? Time series data lends itself naturally to plots, and plots present a nice visual playground for understanding. Time to launch R.
Once R is running, I load the data we have just cleaned into a data frame using R’s built-in read.csv function. Because the file has no header row, header=FALSE keeps the first observation from being read as column names.
> data <- read.csv("./temp_data.csv", header=FALSE)
> plot(data[,2])
That has some oddities in it, most likely bad sensor reads. Let's dig around a bit for low temps.
$ awk -F"," '{if ($2 < 45.00) print $2}' temp_data.csv
21.98
33.41
19.34
29.01
Since the outdoor temp overnight was 28, these values are probably garbage.
$ awk -F"," '{if ($2 > 45.00) print $1","$2}' temp_data.csv > cleaned_temp_data.csv
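The cutoff of 45 is a judgment call, so it can help to pass it in as a parameter and to look at summary numbers before and after filtering. The sketch below is only an illustration under stated assumptions: filter_above and summarize are hypothetical helper names, and two of the three sample rows are made up (only the 56.26 row and the 21.98 reading appear in the data above).

```shell
# filter_above FILE MIN: keep rows whose reading (field 2) is greater than MIN.
# (Hypothetical helper; mirrors the "$2 > 45.00" test used above.)
filter_above() {
    awk -F, -v min="$2" '$2 + 0 > min + 0 { print $1 "," $2 }' "$1"
}

# summarize [FILE]: print count, min, max and mean of the readings.
# Reads standard input when no file is given.
summarize() {
    awk -F, '
        NR == 1 { lo = hi = $2 + 0 }
        { v = $2 + 0; sum += v; if (v < lo) lo = v; if (v > hi) hi = v }
        END { if (NR > 0) printf "n=%d min=%.2f max=%.2f mean=%.2f\n", NR, lo, hi, sum / NR }
    ' "$@"
}

# Illustrative rows, not real data: the timestamps of rows 2 and 3 are invented.
sample=$(mktemp)
printf '%s\n' \
    '2012/01/17 07:07:59,56.26' \
    '2012/01/17 07:08:59,21.98' \
    '2012/01/17 07:09:59,57.00' > "$sample"

summarize "$sample"                        # prints: n=3 min=21.98 max=57.00 mean=45.08
filter_above "$sample" 45.00 | summarize   # prints: n=2 min=56.26 max=57.00 mean=56.63
```

Comparing the two summaries shows how much a single bad read drags down the minimum and the mean, which is a rough way to judge whether a chosen cutoff is doing what you expect.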
Add some color, labels, legend and a spline:

Conclusions:
The first thing that pops into my mind is that my temperature sensor is not very good. I mean, look at all those outliers! The most interesting thing, however, is that even the very simplest sensor system captures enough data that really interesting insights can be derived. Extrapolate that across industries and the different data available, be it from log files on a web server or sensors in a factory, and it's interesting to think about what can happen. That's the big deal about machine generated data. There is real value here that is currently not being leveraged.
Thanks to @statsinthewild for help on the finer points of R’s plot command.
