Slow progress. I've spent the last couple of days looking around at neural networks (NNs) and OpenCV's machine learning (ML) code. My aim was to find some relatively simple-to-implement examples of NNs and to form a basic impression of how applicable the techniques are to the task of skin and marker recognition in a video stream.
Now I should point out that my understanding of most ML techniques and NNs is rudimentary at best. At this point that's fine, since I am looking for practical techniques that can be employed at a rapid prototyping stage, where the cost of experimenting with a proposed solution to a given problem should be measured in hours rather than days. So I am not going to go into too much depth on the AI part of things here.
What I do need to do, though, is define the methodology and results of the experiments I've been running so far, since a write-up at this stage will form a useful appendix to my thesis and demonstrate the design process and rationale behind the "vision" system of the WGI.
Briefly, though, my method has been to construct a small application that generates "samples" from a video: each mouse click on a frame of the video stream creates an N*N image centred on the click, which builds up a set of images representing objects of a given type (e.g. skin, not skin (!), blue marker, red marker etc). A separate application is then used to create a dataset suitable for training the ML technique: each image is transformed into an array of either BGR or HSV values and saved to a file. I then concatenate and randomise the files into one single dataset, with a label to identify each sample, so I end up with something along the lines of:
SN, 1.0, 2.0, 6.0, 1.0, 2.0, 5.0, 1.0, 5.0, 3.0
NS, 6.0, 6.0, 6.0, 1.0, 6.0, 5.0, 1.0, 6.0, 3.0
BM, 2.0, 2.0, 6.0, 1.0, 6.0, 5.0, 4.0, 6.0, 3.0
for a 3x3 sample. I should note that I still need to add a step to the above to verify the dataset (via visual inspection), to ensure that "mis-clicks" are not adding erroneous samples.
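The sampling and dataset steps above can be sketched roughly as follows. This is a minimal illustration and not the actual applications: the frame is a plain list of rows of (B, G, R) tuples standing in for an image array, N is assumed to be odd, and the names `crop_patch`, `patch_to_row` and `build_dataset` are hypothetical. Note that with three channels per pixel a 3x3 patch yields 27 values per row, more than the nine shown in the illustrative rows.

```python
import random


def crop_patch(frame, cx, cy, n):
    """Return the n x n patch centred on the click (cx, cy).

    Returns None when the patch would extend past the frame edge, so a
    click too close to the border produces no sample. Assumes n is odd.
    """
    half = n // 2
    height, width = len(frame), len(frame[0])
    if cx - half < 0 or cy - half < 0 or cx + half >= width or cy + half >= height:
        return None
    return [row[cx - half:cx + half + 1] for row in frame[cy - half:cy + half + 1]]


def patch_to_row(label, patch):
    """Flatten a patch into one 'LABEL, v1, v2, ...' line of floats."""
    values = [float(channel) for row in patch for pixel in row for channel in pixel]
    return ", ".join([label] + [str(value) for value in values])


def build_dataset(labelled_patches, seed=0):
    """Turn (label, patch) pairs into rows and shuffle them reproducibly."""
    rows = [patch_to_row(label, patch) for label, patch in labelled_patches]
    random.Random(seed).shuffle(rows)
    return rows
```

Fixing the shuffle seed is a design choice made here so that two runs over the same clicks give the same dataset, which makes later comparisons between classifiers easier to repeat.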
I then run this through a modified version of the OpenCV code for letter_recog (both the C and Python versions, which is interesting in its own right since the same techniques give differing results [implementation differences?]) in order to obtain some basic data on the validity of the technique. Yesterday I finally managed to write some code to visualise the results on a live or recorded stream.
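As a rough idea of what "basic data on the validity of the technique" might look like, the sketch below trains on the first part of the labelled rows and scores the rest. It is not the letter_recog code: a tiny pure-Python k-nearest-neighbour vote stands in for the OpenCV classifiers, and the 80/20 split, `k=3`, the `"SN"` positive label and all function names are assumptions made for illustration.

```python
from collections import Counter


def parse_row(line):
    """Split 'LABEL, v1, v2, ...' into (label, [floats])."""
    label, *values = [field.strip() for field in line.split(",")]
    return label, [float(value) for value in values]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train, features, k=3):
    """Majority label among the k training samples closest to features."""
    nearest = sorted(train, key=lambda item: squared_distance(item[1], features))[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]


def evaluate(rows, train_fraction=0.8, k=3, positive="SN"):
    """Train on the first rows, test on the rest, and tally the outcome.

    False positives and negatives are counted for the positive class only.
    """
    samples = [parse_row(row) for row in rows]
    split = int(len(samples) * train_fraction)
    train, test = samples[:split], samples[split:]
    counts = {"total": len(test), "correct": 0, "false_pos": 0, "false_neg": 0}
    for label, features in test:
        predicted = knn_predict(train, features, k)
        if predicted == label:
            counts["correct"] += 1
        elif predicted == positive:
            counts["false_pos"] += 1
        elif label == positive:
            counts["false_neg"] += 1
    return counts
```

Because the rows are assumed to be shuffled already, taking the first 80% as training data keeps the split simple; an unshuffled file would need shuffling first.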
The above needs formalising, and I still have some work to do: the Boost code is set up for too many classes and falls over if I reduce the number below 10, there's no Python version of mlp or nbayes, and I also need to actually look at NNs properly.
And of course I need to formalise the results. My early impression is that the ML techniques are promising but too slow; however, this is quite possibly an implementation issue (Python Boost takes minutes vs seconds for the C++ version). I also need to look in more detail at differing sample dimensions and sizes, and at the effects of differing colour spaces, since I've only tested with HSV so far and would like to see the results with grey, BGR, HS only etc. Currently there are too many FP and NFPs, e.g.
[Images: rtrees 5x5 | knearest 5x5]
The above is a typical "noisy" frame (PS3eye this time); the chair and table in the image are coloured very similarly to areas of skin, hence all the noise. The larger areas of skin that have not been correctly classified are overexposed, and I have made a deliberate choice not to classify these areas. That, of course, is primarily a hardware issue, and it shows the importance of the camera system's colour accuracy in variable lighting conditions.
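Two of the points above, trying colour spaces other than HSV and leaving overexposed pixels unclassified, could be prototyped along these lines. This is only a sketch under stated assumptions: it uses Python's standard `colorsys` module, which scales every channel to 0..1 (OpenCV's 8-bit HSV instead uses H in 0..179 and S, V in 0..255), the grey value uses the common 0.299/0.587/0.114 luma weights, and the overexposure thresholds and function names are hypothetical placeholders rather than the values used for the frames shown.

```python
import colorsys


def bgr_to_hsv(b, g, r):
    """Convert an 8-bit BGR pixel to (h, s, v), each scaled to 0..1."""
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)


def pixel_features(bgr, space="hsv"):
    """Return the feature list for one pixel in the chosen colour space."""
    b, g, r = bgr
    h, s, v = bgr_to_hsv(b, g, r)
    if space == "bgr":
        return [b / 255.0, g / 255.0, r / 255.0]
    if space == "hsv":
        return [h, s, v]
    if space == "hs":
        return [h, s]  # drop brightness to reduce sensitivity to lighting
    if space == "grey":
        return [(0.299 * r + 0.587 * g + 0.114 * b) / 255.0]
    raise ValueError("unknown colour space: " + space)


def is_overexposed(bgr, value_limit=0.98, saturation_limit=0.1):
    """Flag near-white pixels (very bright, almost no colour) to be skipped.

    Both limits are assumed values that would need tuning per camera.
    """
    _, s, v = bgr_to_hsv(*bgr)
    return v >= value_limit and s <= saturation_limit
```

Keeping the colour space as a parameter means the same dataset-building and evaluation code can be rerun per space, so the comparison only changes one variable at a time.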
Oh and of course, all of this is a side project and NOT the main focus of my thesis...all very interesting though :)