Monday, 14 November 2011

Getting the design right vs getting the right design

So today I've been looking at ways to make the text input interaction more fluid and investigating if I can enable some level of error correction (i.e. a spell checker).

1) 2 hours of use off and on and my arm aches, my back aches, my wrist aches. Even with the minimal arm movement that is required, my forearm still needs supporting, and to spell a 5 letter word I am still moving the focal point about 6 inches across, which is enough to induce fatigue over time. If I had more time I would do an about turn and look at just using a single finger (although that reintroduces the segmentation issue which the pinch technique solves)....

2) Writing without visual feedback is more taxing than I had thought. I originally did an experiment which involved people walking and writing at the same time as a "proof of concept" exploration. I suspect I had the task wrong though - I should have had them do it blindfolded! Luckily I've still time to repeat that experiment.

3) The limitations of the hacked together prototype are rapidly becoming apparent. Converting points to words introduces a significant lag into the "glue" application, and the poor performance of the fiducial tracker, both in terms of "jitter" and frequent loss of acquisition, is making this painful.

4) On the positive side, I am both impressed and disappointed by just how well the air-writing can work. It's best with visual feedback, but on single letters performance is excellent. For cursive handwriting performance is....variable. With visual feedback I'd estimate 80% of words are in the alternate spelling list. Without feedback that drops to maybe 50%. I've only a small word list, so obviously I need to benchmark.

5) Mobility. The need for compensation due to movement is very apparent. Discreet gestures are just not registered, recognition rate suffers immensely, reliability is terrible - very very unimpressive performance.

6) I had intended to explore the pico-projector more over the next 2 weeks. Sadly the MHL link between my Samsung SII and the projector is unstable meaning if I do anything I need to hook up the laptop...except I have heat related issues with the laptop causing it to crash if I put it in my bag for more than 5 minutes :/ Added to this, battery life on the projector is about 20 minutes on a good day.
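As a rough illustration of the error correction idea (a spell checker working from a small word list), here is a minimal sketch using Python's standard `difflib` module. The word list, the `suggest` helper and the cutoff value are all assumptions made for the example; this is not the code used in the prototype.

```python
import difflib

# Hypothetical small vocabulary; the prototype's real word list is not shown.
WORD_LIST = ["hello", "world", "music", "email", "write"]

def suggest(recognised, vocabulary=WORD_LIST, n=3, cutoff=0.6):
    """Return up to n vocabulary words closest to the recogniser's output."""
    return difflib.get_close_matches(recognised.lower(), vocabulary, n=n, cutoff=cutoff)

print(suggest("helo"))    # a dropped letter
print(suggest("musjc"))   # one misrecognised letter
```

A list of candidates like this could serve as the "alternate spelling list", with the top match taken when no visual feedback is available.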

All in all, I know this is a prototype but it's far less impressive than I had hoped for unless I manipulate conditions extensively. Still, that's about par for the course for a V0.1. A good learning experience so far, probably not shaping up to be the greatest Masters thesis ever...but then I knew prototyping was risky.

In other words I have to wonder if this is the right design for this type of interaction. It's hard to tell how many of the issues I'm experiencing are to do with the technology versus issues with the approach I've taken overall. That's a different kettle of fish though!

Wednesday, 9 November 2011


I've been busy whipping my literature review and methodology sections together for the last few weeks (with the occasional diversion to tout the surveys, still a very low response so far *sadface*) and I'm heading towards crunch time now where I'm going to have to bring everything together for a draft version early next month.

Since I'm now more in a documenting than development phase I've done little work on the prototype apart from adding recording/playback capabilities so that a session can be "recorded" and then I can explore if changes to the gestural interface improve recognition (although that isn't a major aim at this point).

Again, a quick plea to anyone reading, just a few more responses to the gesture and display surveys and I'll be able to start my analysis for that data so if you have 5 minutes it would be greatly appreciated.

Monday, 24 October 2011

Final survey now up

Now the hard part, I have to find some people to participate! I've vastly overstated the amount of time needed to take the surveys since I know everyone is different in how they approach these. If there's any incentive, you get to see me performing gestures...more likely to be a deterrent given I didn't even shave beforehand!

If you wander across this blog before December 1st, if you could please take 20 minutes to participate in one of the surveys it would be a huge help.

Added some annotations to the video...

that is all :)

Camshift Tracker v0.1 up

I thought I'd upload my tracker, watch the video from yesterday for an example of the sort of performance to expect under optimal conditions! Optimal conditions means stable lighting, and removing elements of a similar colour to that which you wish to track. Performance is probably a little worse than (and at best similar to) the touchless SDK.

Under suboptimal conditions...well it's useless, but then so are most trackers, which is a real source of complaint about most of the computer vision research out there.....not that they perform poorly but rather that there is far too little honesty in just how poorly various algorithms perform under non-laboratory conditions.

I've a few revisions to make to improve performance and stability and I'm not proud of the code. It's been...8 years since I last did anything with C++ and to be frank I'd describe this more as a hack. Once this masters is out of the way I plan to look at this again and try out some of my ideas but I really see any approach which relies on colour based segmentation and standard webcams as having limited applicability.

So it's taken a while but I hope this proves of use to someone some day.

Survey 1

Well with my current need to avoid people as much as possible I've had to make a last minute change to my methodology for data gathering. Hopefully I'll be able to mingle with the general populace again next week and do a user study but this week at least I'm in exile! Hence I have put together 3 surveys of which the first is online. The first one's quite lengthy but it would be a huge help if anyone who wanders across this would take 20 minutes to participate.

Gesture Survey 1

Sunday, 23 October 2011

Video 2!

Took me forever to get around to that one but I've been trying to solve lots of little problems. There's no sound so please read my comments on the video for an explanation of what you're seeing.

The main issue I'm now having is with the fiducial tracking. The distance between the centroids of the two fiducials is important in recognising when a pinch gesture is made. However, the apparent area of each fiducial varies with distance from the camera and, at the same time, with the often poor quality of its bounding area, so I can't get the pinch point to the level where it provides "natural feedback" to the user, i.e. the obvious feedback point where the system's perception and the user's perception should agree is when the user can feel that they have touched their fingers together.

As it stands, due to the computer vision problems my system can be off by as much as 1cm :(

I should actually say that it IS possible to reduce this, however then tracking suffers and the system's state (which is really limited to engaged/unengaged) varies wildly, meaning that dynamic gestures are poorly recognised.
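One common way to stop a two-state signal like engaged/unengaged from flapping on noisy input is hysteresis: use a tighter threshold to enter the state than to leave it. The sketch below is an illustration only, not the prototype's code; the `PinchState` class, the normalised `gap` input and both threshold values are assumptions.

```python
class PinchState:
    """Engaged/unengaged state with two thresholds (hysteresis)."""

    def __init__(self, engage_below=0.8, release_above=1.2):
        # Thresholds are illustrative guesses, not measured values.
        self.engage_below = engage_below
        self.release_above = release_above
        self.engaged = False

    def update(self, gap):
        """gap: centroid distance divided by a marker-size estimate."""
        if not self.engaged and gap < self.engage_below:
            self.engaged = True
        elif self.engaged and gap > self.release_above:
            self.engaged = False
        return self.engaged

state = PinchState()
noisy_gaps = [1.5, 0.9, 0.75, 0.85, 1.1, 0.95, 1.3]
print([state.update(g) for g in noisy_gaps])
```

With a single threshold at 0.8 the same sequence would drop out of the engaged state at 0.85; the gap between the two thresholds absorbs that jitter at the cost of a slightly later release.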


I could go back to the beginning and do another iteration of the basic marker tracking code - I've mentioned one option (Laplacian) with my current hardware that I think would enhance the performance (and allow me to get rid of the markers!) and I could also do some basic contour detection within the current code which might enhance things...but this is NOT a computer science thesis I'm working on and I feel I've trekked further along that road than I had intended already.

Hence any additional code is going to focus specifically on making the interaction with the air-writing interface as fluid as possible. Before that though - SURVEY time!

Friday, 21 October 2011

More Observations

After this post I AM going to make videos ;)

I spent some time doing some basic tests last night under non optimal (but good) conditions:

1) Double click/single click/long tap/short tap
These can all be supported using in-air interactions and pinch gestures. I'd estimate I had 90%+ accuracy in detection rate for everything apart from single click. Single click is harder to do since it can only be flagged after the delay for detecting a double click has expired, and this leads to some lag in the responsiveness of the application.

2) The predator/planetary cursor design.
In order to increase the stability of my primary marker when only looking at a single point, e.g. when air drawing, I decided to modify my cursor design. I feel that both fiducial points should be visible to the user, but it didn't quite "feel" right to me using either the upper or lower fiducial when concentrating on a single point, hence I've introduced a mid-point cursor that is always 1/2 way between the 2 fiducials. The "feel" when interacting is now much better since the "pinch point" is where we would normally naturally expect a pen to be.

3) Pinch movement
In relation to the above though, the fact that pinching/unpinching moves the points is causing me some issues with accuracy and extraneous points being added to any drawing. I'm hoping to overcome this by better accuracy of pinch/unpinch events, however THAT is tied back to accuracy on the fiducial positioning/area detection.

4) Kalman filtering
I'm not too sure how happy I am with the Kalman filtering on the input. While it increases stability it creates a more "fluid" movement of the marker which isn't good for tight changes in direction. That said it makes the air-writing feel very smooth - I wish I could increase the FPS...which I may attempt to do by making the markermonitor use sockets rather than pipes. However I feel like I've spent enough time on the technical details of the prototype and am loath to spend more at this point.

5) Breathing.
I was surprised at how much impact breathing makes when sitting down. Depending on the distance between the fiducial and camera this can be massive during a deep breath and is enough to cause any gestures to be poorly recognised. A more advanced system would have to compensate for this and also any movement involved in walking, so gyros/accelerometers are a must in the longer term. This was already a known requirement for any projection system and has been looked at in some papers (see Murata & Fujinami 2011), hence I expect that any "real world" system would have access to this data. Not too much I can do about this at this point.

6) Right arm/lower right quadrant block when sitting down
This one surprised me. When sitting down and using the right arm for movement, the lower right hand quadrant nearest the body is essentially "blocked" for use in my system since it's difficult to move the arm back to this position. Not an issue when standing.
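The single click lag noted in point 1 falls straight out of the timing logic. The sketch below is a minimal illustration, not the prototype's implementation: `classify_clicks` and the 0.5 second window are assumptions, and a "tap" here means one completed pinch/unpinch.

```python
def classify_clicks(tap_times, window=0.5):
    """Turn tap timestamps (seconds, ascending) into (report_time, kind) events.

    A single click can only be reported once the double-click window has
    expired, which is the source of the lag described in point 1.
    """
    events = []
    i = 0
    while i < len(tap_times):
        t = tap_times[i]
        if i + 1 < len(tap_times) and tap_times[i + 1] - t <= window:
            # Second tap arrived in time: report a double click immediately.
            events.append((tap_times[i + 1], "double"))
            i += 2
        else:
            # No second tap: the single click is only known after the window.
            events.append((t + window, "single"))
            i += 1
    return events

print(classify_clicks([1.0, 1.25, 3.0]))
```

Shortening the window reduces the lag on single clicks but makes double clicks harder to perform, so the value would need tuning with real users.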

I plan on making some tweaks in terms of the pinch/unpinch detection to see if I can improve the accuracy of that, and some UI changes to support it, but the next step with the prototype is to take some empirical measurements on the system's performance.

Right, time to make some videos.

Murata, S. and Fujinami, K. (2011) Stabilization of Projected Image for Wearable Walking Support System Using Pico-projector.

Thursday, 20 October 2011

So Where's the Survey/Video?

I've had a very unexpected event happen in that my little one has come down with mumps (and has already mostly recovered from it). It's something I've never had or been immunised against, hence I've had to cancel the study I had organised for this weekend (it obviously would not be ethical for me to be in close contact with people while I might have a serious communicable illness...I just wish others would take a similar attitude when ill). I may have to avoid contact with people for up to 3 weeks since the contagious period is 5 days before developing symptoms and 9 days afterwards, which rather puts the dampers on my plans for a user study...3 weeks from now I had planned to be writing up my analysis, NOT still analysing my results. PANIC!

Hence, I've adapted my research plan - I'm going to be putting up a survey this weekend which I'll run for 3 weeks, run a limited (5 users! lol) user study of the prototype just after that and have to base my results/discussion/conclusion on that. Hence video up tomorrow (promise) with survey up on Saturday/Sunday. Best laid plans and all that :)


Well there's obviously going to be a flurry of interest in WGIs given the publishing of the Omnitouch paper. Brilliant stuff, anyone want to fund me to buy a PrimeSense camera? Seriously though, depth cameras solve a lot of the computer vision problems I have been experiencing and I was very tempted to work with a Kinect, the problem being that the Kinect's depth perception doesn't work below 50cm and that would have led to an interaction style similar to Mistry's, one which I have discounted due to various ergonomic and social acceptance factors.

If I had access to this technology I would be VERY interested in applying it to the non-touch gestural interaction style I've been working on since I see the near term potential of the combined projection/WGI in enabling efficient micro-interactions (interactions which take less time to perform than it does to take a mobile phone from your pocket/bag).

Anyways, good stuff and it's nice to see an implementation demonstrating some of the potential of the technology without the song-and-dance that accompanied SixthSense.

(Harrison et al, 2011)

Monday, 17 October 2011

Have we got a video (2)?

Yes, but I'm not posting it yet *grin*

A very frustrating bug cropped up when I tried tying the camshift based detector into the marker tracking service - only 1/3 of the marker co-ordinate updates were being processed! Sure, my code is ugly, inefficient, leaking memory left right and centre BUT that's no reason to just silently discard the data I'm generating (and yes, I am generating what I think I'm generating). I strongly suspect the culprit is asyncproc - I've had some experience before with trying to parse data via pipes and hence know it's....not the preferred way to do things, however proof of concept wise I hoped it would save the hassle of having to get processes talking to each other. *sigh* "If it's worth doing once, it's worth doing right."

Anyways, I've worked around it, and have the basics up and running. What are the basics?

- Basic gmail reader. Main purpose here is to look at pinch scrolling.

- Basic notifier. Shows new mails as they arrive. Purpose is to examine if the pico-projector provides efficient support for micro-interactions.

- Basic music player. Main purpose here is to test simple gesture control.

- Basic text input area. Main purpose is to test the air-writing concept.

The basic test run I've done so far suggests that there's work to be done, but the basic concept is sound and the hardware/software is sufficient for a small scale study.  Some thoughts :

- dwell regions need to be active rather than passive. By this I mean that it's too easy to enter a region and unintentionally execute the associated behaviour. Requiring that the markers are engaged when within the dwell region will satisfy this.

- Engaged distance needs to be a function of the area of the fiducials otherwise the engaged state is entered when the markers are at different widths depending on the distance from the camera. 

- if the application supports any form of position based input (whether it be dwell regions, hover buttons, clicks etc) use positions substantially away from the edge of the "screen".

- Marker trails; I'm actually finding these confusing, in part since there are two of them. I had thought a while ago about drawing a "pointer" that is at the mean distance between the two fiducials when in the "engaged" state. I think I need to experiment with that.

- I need to work with this concept of "command mode" a bit more. The basic concept I've been working on is that the user draws a circle to signal a command, and then either a word or a symbol (e.g. music note) to select which command to execute. "Commands" in this context are things such as switching between "applications". While it works reasonably well, I have the main module interpreting gestures and then passing a text string to the application. Due to the way that the gesture recogniser works, performance would be improved if the unistroke recogniser was using a reduced set of gestures applicable to each module. This would improve gesture recognition substantially I think, but I need to get some metrics to test that.

- Performance: Performance is DIRE! I've managed to do some rather nifty things with Python that I had thought would require C++ but the slapdash approach that I've taken to this prototype framework has things running at about 10% of the speed that it should.
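Two of the thoughts above (an engaged distance that scales with fiducial area, and a pointer midway between the two fiducials) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the prototype's code: `midpoint`, `is_engaged` and the `ratio` value are hypothetical, and marker width is approximated as the square root of blob area in pixels.

```python
import math

def midpoint(a, b):
    """Cursor position halfway between two fiducial centroids (x, y)."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)

def is_engaged(a, b, area_a, area_b, ratio=1.5):
    """True when the centroids are closer than `ratio` marker widths.

    Marker width is approximated as the square root of the mean blob area,
    so the threshold shrinks as the hand moves away from the camera.
    `ratio` is an illustrative guess that would need tuning.
    """
    width = math.sqrt((area_a + area_b) / 2.0)
    distance = math.hypot(a[0] - b[0], a[1] - b[1])
    return distance < ratio * width

near = is_engaged((100, 100), (100, 140), 900, 900)   # 40 px apart, ~30 px markers
far = is_engaged((100, 100), (100, 140), 100, 100)    # same gap, ~10 px markers
print(midpoint((100, 100), (100, 140)), near, far)
```

The same 40 pixel gap counts as engaged when the markers look large (close to the camera) and as unengaged when they look small, which is the behaviour a fixed pixel threshold cannot give.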

Another long night ahead of me *sigh*

UPDATE: Frustratingly I've found that the camshift tracker is also very lighting dependent. What was a reasonably positive experience under daytime lighting conditions deteriorated rapidly once night fell and I attempted to use the code under artificial lighting. No amount of tweaking of parameters or changing marker colour would rectify things :
- the yellow marker was reasonably stable, as good as the light green marker I had been using for the hsv tracker. That said, under my home's lighting white takes on a yellowish hue and hence the tracker would get confused on areas of white.
- red/orange/pink were all terrible, frequently becoming confused with either skin tones, areas of my carpet or lights on my laptop.
- dark green was not detected at all.
- dark blue frequently became confused with my laptop screen, laptop keyboard or areas of my t-shirt (black).

So the biggest problem that the project has is that the implementation of a robust system is VERY difficult which is going to make user testing potentially tricky.

THANKFULLY this has all framed my user study very tightly - within group study, look at natural gesture drawing, compared to use of the system, command gestures, limited number of letters, words and sentences; some basic interface command and control tasks; and videos on public/private perception of WGI usage, one set using an "exaggerated" UI (e.g. SixthSense) compared to the discreet "OS6Sense" style (do people even perceive it as being discreet?).

Only 4 weeks later than I had initially intended doing it *sigh*

Oh video? Hmmmm not today, there's still a few tweaks I need to do before putting it online (e.g. putting something other than Jakalope in my media dir...I like Jakalope but it's their latest album and I can't help but feel that I'm listening to Britney Spears! The shame!)

Sunday, 16 October 2011

New Detector Done

Much better but I'm still not happy with it - camshift + backproj + Kalman means that the marker coordinates are a lot smoother with far less noise (obviously) but the nature of detecting markers in segmented video still leads to a less than robust implementation. There's room for improvement and I still need to add in some form of input dialog for naming markers (and I must confess I am CLUELESS on the C++ side for that.....wxWidgets? Qt?) but I'm that little bit happier.

As per usual I had hoped for a video, but the lack of a dialog makes configuring things a manual process (I've got basic save/load support working but given how sensitive this still is to lighting it's a lot of messing around), hence I'm delaying yet again. Given my page views though I don't think I will be disappointing many people.

What is frustrating is the amount of time I've had to spend on basic work with computer vision rather than looking at the actual interactions for this technology. While I may NOT be the greatest coder ever, or even 1/2 as clever as I once thought I was (about 25 years ago), with the number of truly great coders and minds who have worked on computer vision I'm still somewhat disappointed that there's nothing really magnitudes more robust out there than what I'm doing. That said of course, if I was able to work with a Kinect or similar tech I would expect something far more impressive, but the 50cm limit for the depth sensing renders that point moot (by about 25cm). And I still think that some of the AI techniques could pay dividends....and there are still a number of basic tricks I could apply (e.g. Laplace to build a feature module for SIFT/SURF/HMM detection of hands and finger pose detection - I can't help but think that would work really well) but I have to get away from the computer vision research sadly.

Anyways - I do think I'm at the last fence; video definitely over the next few days.

Saturday, 15 October 2011

Rewrite of fiducial detector

It's the last thing I want to do - I've roughed out code for most of the UI elements, the plumbing for the back-end works (although you can hear it rattle in places and there is considerable scope for improvement) but the marker detection code just isn't up to the job and is getting a rewrite to use camshift and a Kalman filter. I tried the Kalman on the current code and it's effective in smoothing the jitter caused by variations in centroid position, but the continual loss of marker and the extreme numbers I am having to use to sense when the markers are engaged/unengaged are making it a frustrating experience.
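To show what Kalman smoothing of a jittery centroid does, here is a minimal one-dimensional sketch in plain Python with a random-walk motion model, applied to one coordinate at a time. It is an illustration only and not the filter used in the detector (which may well use a library implementation with a velocity term); the `ScalarKalman` class and both variance values are assumptions.

```python
class ScalarKalman:
    """One-dimensional Kalman filter with a random-walk motion model."""

    def __init__(self, process_var=1.0, measurement_var=25.0):
        # Both variances are illustrative; they trade smoothness against lag.
        self.q = process_var
        self.r = measurement_var
        self.x = None   # state estimate
        self.p = 1.0    # estimate variance

    def update(self, z):
        if self.x is None:                  # initialise on the first measurement
            self.x = float(z)
            self.p = self.r
            return self.x
        self.p += self.q                    # predict: uncertainty grows
        k = self.p / (self.p + self.r)      # Kalman gain
        self.x += k * (z - self.x)          # correct towards the measurement
        self.p *= (1.0 - k)
        return self.x

raw_x = [100, 104, 97, 102, 99, 103, 98, 101]   # jittery centroid x positions
kf = ScalarKalman()
smooth_x = [kf.update(z) for z in raw_x]
print([round(v, 1) for v in smooth_x])
```

Raising `measurement_var` relative to `process_var` gives smoother output but makes the estimate trail behind real movement, which matters for tight changes in direction.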

I MUST come up with something working by Monday so that I can do something with this and was hoping to be tweaking various parameters of the interaction today, but I'm going right back to stage one. Very frustrating, but I ran a few experiments with the camshift algorithm and feel it's required to make the air-writing implementation flow smoothly.

All nighter it looks like then :(

Friday, 14 October 2011

Drag Drop - a gesture design dilemma!

So I've run into an interesting interaction design problem. I've implemented some very basic list interface elements and initially supported the scrolling interaction via dwell regions. I'm unhappy with this for a number of reasons :

1) Dwell regions are not obvious to a user since there is no visual feedback to the user as to their presence. While I can provide feedback, there are times where I may choose not to do so (e.g. where the dwell region overlaps with the list).
2) Dwell regions when combined with other UI elements can hinder a user's interaction - e.g. if a user wishes to select an item that is within the dwell region and the dwell region initiates the scrolling behaviour, causing the user's selected item to move.
3) Interaction is very basic and I don't really want to implement any more support for these.

The obvious alternative to a dwell region though is drag and drop (or in the case of OS6Sense, pinch and unpinch) however since these are gestures, there's a possibility that the gestures will be interpreted as a command interaction.

But I want to be able to support pinch and unpinch...I think this is one that I will just have to try and see how it works out, but I think I have found a flaw in the interaction style.

Sunday, 9 October 2011

Another couple of observations

Schwaller in his reflections noted that developing an easy way to calibrate the marker tracking was important. I've observed that for application development, providing alternate input methods is equally important...quite a general usability principle of course, and it all harkens back to providing multiple redundant modalities but....

My framework is about 50% of the way there, I'm becoming VERY tempted to look at a native android client but on x86 since I have the horsepower to drive things. If I had more time I'd go for it but...

Friday, 7 October 2011

Some observations/questions

Study delayed since I think I can make progress with the prototype and answer some of my questions while opening up new ones :/ I'm glad I know this sort of last minute thing is quite common in research or I might be panicking (3 months to go, omg!).

I'm still having problems with marker tracking due to varying lighting conditions.  At home, my "reliable" green marker doesn't like my bedroom but is great downstairs and in my office. Blue/red/yellow - all tend to suffer from background noise. I may have to try pink! Basically I know that colour based segmentation and blob tracking is a quick and easy way of prototyping this, but real world? Terrible!

If using dynamic gestures what are the best symbols to use? In fact is any semiotic system useful for gesture interaction? One could also ask are symbolic gestures really that useful for a wearable system....

Where should a camera point? i.e. where should its focus be?
I've found myself starting gestures slightly left of any central line, so primarily using the right side of my body - how true does this hold for other people? Is there a specific camera angle that is useful? These ones are for the study.

I was trying out a mockup of the interface on my palm and while I know the "natural" interaction style would be to make any projected UI elements into touch elements, my tracking just can't support it (in fact I have to wonder how anyone has done that since it sends histogram based trackers e.g. camshift insane). Hence there is a gulf between the input modality's location and that of the visual output modality which I don't believe represents an effective interaction paradigm. I've not tried it on other surfaces, I want to get a little further with the UI first.

Still hopeful of putting together another demo video by Monday.

Wednesday, 5 October 2011

Another Quick Update

I've been very busy putting together a framework to support a number of small applications for implementation - the apps are intended to be nothing more than proof of concept and to explore some of the interaction issues e.g. are dwell regions a better option than selectable areas (we're in eye tracking territory now)?; can these be applied to navigation?; How do we implement a mobile projected UI (terra incognita I believe)?

The framework is largely event/message driven since it affords us loose coupling and dynamic run-time binding for both messages and classes ~ if I wasn't farting around with abstraction of the services (useful in the longer term, as they become both sinks and producers of events) it would probably come in at < 200 lines of code...
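For readers unfamiliar with the pattern, an event/message driven core can be very small. The sketch below is a generic publish/subscribe hub in Python, written as an illustration of loose coupling and run-time binding; it is not the framework's actual code, and `MessageBus` and the topic names are made up for the example.

```python
from collections import defaultdict

class MessageBus:
    """Minimal publish/subscribe hub; handlers are bound at run time."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload=None):
        # Publishers never reference subscribers directly (loose coupling).
        handlers = list(self._handlers[topic])
        for handler in handlers:
            handler(payload)
        return len(handlers)   # number of handlers that received the message

bus = MessageBus()
received = []
bus.subscribe("gesture.recognised", received.append)
bus.publish("gesture.recognised", "circle")
bus.publish("marker.lost")          # no subscribers: nothing happens
print(received)
```

A service that both subscribes to one topic and publishes on another acts as a sink and a producer at the same time, without either side knowing about the other.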

The point being while I'm not supposed to be writing code at this stage I am and hope to have at least a video by the end of the weekend (yes a week late).

Saturday, 1 October 2011

VERY excited!

Back to the research today and today is the day where I had set myself the serious goal of knuckling down and rewriting my introduction and literature review because I am VERY unhappy with both. I'd finished up my study definition document for my exploratory study next week and was doing some research into social acceptability and gestures...when it hit me. Most of the research suggests that only discreet gestures are socially acceptable (thus SixthSense/Minority Report style interaction is unlikely to be accepted by users in many social situations) so I asked myself :

1) Why look at how users naturally perform the gestures? Good question....and to be honest, because I honestly don't KNOW what I will find out. I *think* I know, but there's a huge gulf there!
2) How do I make a discreet gesture based system?

and I had also been asking myself :

3) How do I expand the number of states that I can represent using my current implementation?

And it hit me like an express train.

If I'm designing an air-writing gesture based interaction, design a system that recognises users' gestures as they would naturally write!

Most technical and interaction issues solved in one fell swoop! Hopefully demo with another video tomorrow!

Sunday, 25 September 2011

Do we have a video?

Yes, we have a video!

I wasn't intending to work on any code this weekend but I felt compelled to try out the recognition server and run another set of tests but with the Logitech C900 in place. Results were an improvement on the PS3 eye, in part due to the better low light capabilities, in part due to the camera placement, and in part due to the wider angle.

Some anecdotal notes :

The recognition server provided seems to perform better than the unistroke implementation - I still need to sit down and do the numbers but I wouldn't be surprised if it was significantly better.

I suspect recall for all but the most basic figures/shapes provided via the default unistroke implementation will be poor amongst users. On the flip side, most of us know the alphabet!

Big problem with the use of fiducials on the end of the fingers - they become obscured during natural hand movements! I ended up cupping the marker in my hand and squeezing it to cover it so that I had control of the marker's visibility. Keeping the fiducial visible requires holding the hand in a position that is simply not ergonomic.

After a few hours usage my wrist aches (but then I do suffer from PA).

I had the advantage of visual and audible feedback during this test - I suspect the performance will deteriorate with that removed. 

Another big problem is drawing letters that require multiple strokes - i, k, f, t, x, 4 etc all cause problems - have yet to test capitals.

Obviously no support for correction or refinement - while this could be supported I can't see it being possible without visual feedback...hence it reduces the impact of the system on improved situational awareness.

Ramifications - The original sixth sense system had very poor ergonomics as well as suffering from a range of technical issues. Choice of the unistroke recognition engine likely non-optimal (may be implementation dependent though), will need to revisit.

Where's the code then you ask? I may just throw stuff up over the next few days, but my god is it tatty but I'm not going to allow code shame to stop me. I'd like to have something which performs somewhat better than the current version in terms of the interaction support before I do so though.....

Thursday, 22 September 2011


Just an update - I had originally approached this with the intention of releasing the code as open source. However, my findings regarding various aspects of this project (and, relevant to this aim, the code itself) mean that I'm putting any software development on the back burner for the next few weeks while I perform a study into how people naturally perform gestures. I'm also looking at some options to improve certain show stopping issues with the system (primarily the limited FOV of the webcam).

Any code that does emerge for the project, at least for version 0.1, is unlikely to be very robust but I think that can be overcome : I'm currently thinking that broad colour segmentation followed by some form of object matching technique (e.g. SIFT/SURF) should make quite a robust and reasonably fast algorithm for marker detection. However, if the FOV problem can't be solved, I actually think that ANY vision based system is inappropriate for this sort of interaction style.

Yes, that's a bit damning, however I am doing HCI research here, not computer vision....and that doesn't mean that I don't have other tricks (literally) up my sleeve :)

Saturday, 10 September 2011


Children back at school and I'm back off my hols (a rather interesting time in Estonia if you're interested).

I've spent most of the last week becoming increasingly frustrated with my attempts at image segmentation. I've moved to a C++ implementation for speed and, while the VERY simplistic HSV segmentation technique I am using works, the problem is that I cannot get it to work robustly and doubt that it ever will.
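For anyone who hasn't met it, simple HSV segmentation boils down to a per-pixel range test. The sketch below shows that test in plain Python using the standard `colorsys` module, purely as an illustration; it is not the C++ implementation, and `in_hsv_range`, the `GREEN` window and the sample pixel values are all made up for the example.

```python
import colorsys

def in_hsv_range(rgb, h_range, s_min, v_min):
    """True if an 8-bit RGB pixel falls inside the hue window and is
    saturated and bright enough. Hue is in degrees (0-360)."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return h_range[0] <= h * 360.0 <= h_range[1] and s >= s_min and v >= v_min

# Illustrative window for a light green marker; real values need calibrating.
GREEN = dict(h_range=(80, 160), s_min=0.3, v_min=0.3)

pixels = {
    "marker in daylight": (120, 220, 110),
    "marker in dim light": (40, 70, 38),
    "white wall": (240, 240, 235),
}
for name, rgb in pixels.items():
    print(name, in_hsv_range(rgb, **GREEN))
```

The dim-light pixel has almost the same hue as the daylight one but fails the brightness floor, which hints at why fixed thresholds need recalibrating whenever the lighting changes.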

I've now covered the range of available techniques and even tried to plumb the depths of just emerging ones and it seems that every computer vision based object tracking implementation or algorithm suffers from the same issue with robustness (openTLD, camshift, touchless, hsv segmentation and cvBlob etc etc). YES, it can be made to work, but issues include (depending on the algorithm) :

- Object drift : over time the target marker will cease to be recognised and other objects will become the target focus.
- Multiple objects : During segments where the camera is moving new objects will appear, some of which cannot be differentiated from the target.
- Target object loss : Due to changes in size, lighting, speed etc the target will be totally lost.
- Target jitter : The centroid of the target cannot be accurately determined.

I'll expand that list as I think of more.
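As an illustration of the jitter item only: a minimal sketch, assuming the tracker hands back one (x, y) centroid per frame (or None when the target is lost), of damping the jitter with an exponential moving average. The function name and the alpha value are hypothetical and this was not part of the experiments described here; smoothing also adds lag, so it trades one problem for another.

```python
def smooth_centroids(points, alpha=0.3):
    """Exponentially smooth a sequence of (x, y) centroids.

    points may contain None for frames where the target was lost;
    the last smoothed position is carried forward for those frames.
    alpha near 1 follows the raw data closely, alpha near 0 smooths hard.
    """
    smoothed = []
    last = None
    for p in points:
        if p is None:
            smoothed.append(last)          # nothing new, repeat last estimate
            continue
        if last is None:
            last = (float(p[0]), float(p[1]))   # first valid observation
        else:
            last = (alpha * p[0] + (1 - alpha) * last[0],
                    alpha * p[1] + (1 - alpha) * last[1])
        smoothed.append(last)
    return smoothed


if __name__ == "__main__":
    raw = [(100, 100), (104, 97), None, (99, 103), (102, 100)]
    for before, after in zip(raw, smooth_centroids(raw)):
        print(before, "->", after)
```

Note that this does nothing for drift or outright loss of the target; it only hides frame-to-frame wobble of the centroid.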

So basically, given a semi-static camera, a semi-uniform "background" and uniform lighting, an object can be tracked with some degree of reliability.

It's also worth noting that 2 variables, fiducial colour and lighting uniformity, have the largest impact on reliability of tracking. I was incredibly optimistic this week when I tried to segment a light green pen top and found it to be highly accurately tracked during one experiment; but then I returned to the same code and object later in the day under different lighting conditions and reliability fell massively until I recalibrated.

I am unsure of how to proceed next I must admit; while I didn't expect things to be 100% reliable I had expected one of the available techniques to produce better results than I have had so far. If I had more raw power to throw at things (and more raw time) I'd return to looking at some of the AI techniques (Deep Convolutional Networks) as well as somewhat simpler SIFT/SURF implementations, but sadly I am out of time for this portion of my research (and in many ways it's THE most crucial aspect)....

Sunday, 21 August 2011

Enabling External Screen

Another slow week ~ parenting, employment, school holidays and academia do not mix well.

I have the hardware up and running how I want now. My initial build with the projector on the right and camera on the left had 2 problems. Firstly, as previously noted, it's preferable to have the camera on the side of the user's dominant hand since this will be used the most. Secondly...the PK301 has a small fan for cooling and having this on the right meant that the exhaust heat was blowing onto my neck and was hot enough to be very uncomfortable. So today I've switched things around, tied things off, added extra washers (oh, the technology!) for stability and, apart from the need for a more stable joint to hold the projector in a position so that it is level with the user's view, I am reasonably satisfied with my budget implementation. Not $350...about £350, but then I could have bought a cheaper projector (and may well pick up a ShowWX for testing) and come in under budget. I still need to add a reflector to allow for quick repositioning of the output, but I'm getting close.

However I am rambling; I really wanted to just make a note of what I had to do to enable my laptop's lid to be closed and still have the external screen function :

gconftool-2 --type string --set /apps/gnome-power-manager/buttons/lid_battery "nothing"

gconftool-2 --type string --set /apps/gnome-power-manager/buttons/lid_ac "nothing"

This is rather hackish since the laptop screen stays on when closed, but I couldn't find any other way to do this - I'm reasonably sure I don't have this issue in Windows - so if anyone can suggest an alternative.....

Friday, 12 August 2011

Fixing "Corrupt JPEG data: n extraneous bytes before marker" error

jdmarker.c lives in the OpenCV 3rdparty/libjpeg directory, edit out the appropriate line (search for "EXTRANEOUS"), make and install....happy days.

Edit: Hmmmm, actually the above doesn't work since OpenCV is linking the installed library rather than the 3rdparty version and I can't seem to convince CMake to use the local one...ahhh well, one for another day since redirecting standard error to /dev/null works just as well.
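From a shell the redirect is just a matter of appending 2>/dev/null to whatever command starts the recorder. If the noise needs silencing from inside a Python script instead, a minimal sketch (assuming a POSIX system; silence_stderr is a hypothetical helper name, not part of OpenCV) is to swap file descriptor 2 for the duration of the noisy calls. Rebinding sys.stderr alone is not enough, since libjpeg writes to the C-level descriptor.

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def silence_stderr():
    """Temporarily point file descriptor 2 at the null device."""
    sys.stderr.flush()                       # don't lose anything already buffered
    saved_fd = os.dup(2)                     # keep a handle on the real stderr
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull_fd, 2)               # C libraries now write into the void
        yield
    finally:
        os.dup2(saved_fd, 2)                 # put the real stderr back
        os.close(devnull_fd)
        os.close(saved_fd)


if __name__ == "__main__":
    with silence_stderr():
        os.write(2, b"Corrupt JPEG data: this line is swallowed\n")
    os.write(2, b"stderr is back\n")
```

The obvious downside is that genuine errors raised inside the block are swallowed as well, so it is best wrapped around only the frame-grabbing call.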

Sunday, 7 August 2011

Hardware Design

I finally took the plunge and did the epoxy welds on the prototype rig using some Bondloc titanium epoxy and it works really well, setting hard in just a few minutes and rock solid after 15. While the rig is incredibly primitive it does allow me to shoulder mount both the projector and camera in a way that keeps them stable. I just need to make a final decision re placement (left vs right for camera/projector) ~ my initial take was to place the projector on the right so as to be in line with my dominant eye, but I think of more importance/potential is to have a better correlation between the forward facing camera and the dominant hand for pointing...which should also reduce occlusion of the projection.

I'm slowly uncovering papers in this area and found another one today, "Designing a Miniature Wearable Visual Robot", which details the design rationale behind a robotised wearable camera. Mayol et al (2002) use a 3D human model to examine different frames of reference and requirements for the device, identifying 3 frames (the wearer's body and active task, alignment to static surroundings, the wearer's position relative to independent objects). They also identify 2 requirements: decoupling of the wearer's motion from the motion of the sensor and the provision of a wide Field of View. Since we are dealing with a static rather than motorised sensor, it is only the first frame that is of particular relevance, however it is interesting to note how a robotised system would enable these different frames.

They also note, given the proximity of the device to other humans, that :

"a sensor able to indicate where it is looking (and hence where it is not looking) is more socially acceptable than using or wearing wholly passive sensors" (P1)

This is a very interesting point since the social acceptance of a wearable system is a major factor influencing the usability of "always on" wearable systems.

They go on to examine the 3 factors used in their analysis of the optimal location to wear the robot, detailing FOV, user motion and view of the "handling space", which they define and stress the importance of via the following statement:

"The area immediately in front of the chest is the region in which the majority of manipulation occurs, based on data from biomechanical analysis" (P2 cites [2])

Of final relevance to us is their discussion of their results from fusing these criteria. The forehead is identified as the optimal position but discounted due to the "importance of decoupling the sensors attention from the user's attention" and alternate positions are considered. Their analysis concludes that if maximal FOV and minimal motion are the most important factors then the shoulder is the optimal alternative.

Phew. And I want one.

Mayol's Robot [1]

 Along with the papers I've read on projector positioning it seems that shoulder mounting wins for both projector and camera ~ happy happy joy joy!

[1]W. Mayol, B. Tordoff, and D. Murray. Designing a miniature wearable visual robot. In IEEE Int. Conf. on Robotics and Automation, Washington DC, USA, 2002.

[2] W.S. Marras, in G. Salvendy, Handbook of Human Factors and Ergonomics Sec. Ed., chapter Biomechanics of The Human Body, John Wiley, 1997.

Machine Learning & OpenCV

Slow progress. I've spent the last couple of days generally looking around at Neural Networks and OpenCV's machine learning code. My aim was to find some relatively simple to implement examples of NNs and form a basic impression of the applicability of the techniques to the task of skin and marker recognition in a video stream.

Now I should point out that my understanding of most ML techniques and NNs is rudimentary at best and, at this point, that's fine since I am looking for practical techniques that can be employed at a rapid prototyping stage (hence the cost of experimenting with a proposed solution to a given problem should measure in the hours rather than days); so I am not going to go into too much depth on the AI part of things here.

What I do need to do though is define the methodology and results of the experiments I've been running so far since a write-up at this stage will form a useful appendix to my thesis and demonstrate the design process and rationale behind the "vision" system of the WGI.

Briefly though, my method has been to construct a small application that allows the generation of "samples" from a video: a mouse click on a frame of the video stream creates an N*N sized image centred on the click, which allows the generation of a set of images representing objects of a given type (e.g. skin, not skin (!), blue marker, red marker etc). A separate application is then used to create a dataset suitable for training the ML technique: each image is transformed into an array of either BGR or HSV values and saved to a file. I then concatenate and randomise the files into one single dataset, with a label to identify each sample, so end up with something along the lines of :

SN, 1.0, 2.0, 6.0, 1.0, 2.0, 5.0, 1.0, 5.0, 3.0 
NS, 6.0, 6.0, 6.0, 1.0, 6.0, 5.0, 1.0, 6.0, 3.0 
BM, 2.0, 2.0, 6.0, 1.0, 6.0, 5.0, 4.0, 6.0, 3.0 

for a 3x3 sample. I should note that I need to add an additional step into the above to verify the data set (via visual inspection) to ensure that "mis-clicks" are not adding in erroneous samples.
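The sampling and dataset steps above can be sketched roughly as follows. This is a minimal illustration rather than the actual applications: the frame is assumed to be a plain list of rows of pixel tuples (in the real thing it would come from OpenCV), and extract_patch, patch_to_line and build_dataset are hypothetical names. Every channel of every pixel becomes one float in the output row.

```python
import random


def extract_patch(frame, cx, cy, n):
    """Return the n x n block of pixels centred on (cx, cy).

    frame is a list of rows, each row a list of pixel tuples.
    Returns None if the block would fall outside the frame, which is
    one cheap way of discarding clicks too close to the border.
    """
    half = n // 2
    top, left = cy - half, cx - half
    if top < 0 or left < 0 or top + n > len(frame) or left + n > len(frame[0]):
        return None
    return [row[left:left + n] for row in frame[top:top + n]]


def patch_to_line(label, patch):
    """Flatten a patch into 'LABEL, v1, v2, ...' with one float per channel."""
    values = [float(c) for row in patch for pixel in row for c in pixel]
    return ", ".join([label] + [str(v) for v in values])


def build_dataset(labelled_patches, seed=None):
    """Turn (label, patch) pairs into text lines in randomised order."""
    lines = [patch_to_line(label, patch) for label, patch in labelled_patches]
    random.Random(seed).shuffle(lines)
    return lines


if __name__ == "__main__":
    # Tiny fake 5x5 "frame" of HSV-like triples, just to exercise the code.
    frame = [[(x, y, x + y) for x in range(5)] for y in range(5)]
    skin = extract_patch(frame, 2, 2, 3)
    not_skin = extract_patch(frame, 1, 1, 3)
    for line in build_dataset([("SN", skin), ("NS", not_skin)], seed=1):
        print(line)
```

Passing a seed makes the shuffle repeatable, which helps when comparing classifiers on exactly the same train/test split.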

I then run this through a modified version of the OpenCV code for letter_recog (both the C and Python versions - which is interesting in its own right since the same techniques give differing results [implementation differences?]) in order to obtain some basic data on the validity of the technique, and yesterday finally managed to write some code to visualise the results on a live or recorded stream.

The above needs formalising, and I still have to do some work in that the Boost code is set for too many classes and falls over if I reduce the number below 10; there's no Python version of mlp or nbayes and I also need to actually look at NNs properly.

And of course I need to formalise the results. My early impression is that the ML techniques are promising but too slow, however this is quite possibly an implementation issue (python boost takes minutes vs seconds for the C++ version). I also need to look in more detail at differing sample dimensions and sizes, the effects of differing colour spaces since I've only tested with HSV so far and would like to see the results with grey, BGR, HS only etc....currently too many FP and NFPs e.g.
rtrees 5x5

knearest 5x5

The above is a typical "noisy" frame (PS3eye this time) and the chair and table in the image are coloured very similarly to areas of skin, hence all the noise. The larger areas of skin that have not been correctly classified are overexposed and I have made a deliberate choice not to classify these areas. That of course is primarily a hardware issue, displaying the importance of the camera system's colour accuracy in variable lighting conditions.
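One way of putting numbers on "too many false positives" when formalising the results: a minimal sketch, assuming a hand-labelled list of true classes and the classifier's predictions for the same samples. The function name is hypothetical and the labels reuse the SN/NS convention from the dataset example; none of the figures below are real results.

```python
def confusion_counts(truth, predicted, positive):
    """Count true/false positives and negatives for one class of interest."""
    tp = fp = fn = tn = 0
    for t, p in zip(truth, predicted):
        if p == positive and t == positive:
            tp += 1          # said skin, was skin
        elif p == positive:
            fp += 1          # said skin, was something else
        elif t == positive:
            fn += 1          # missed a skin sample
        else:
            tn += 1          # correctly ignored
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


if __name__ == "__main__":
    truth = ["SN", "NS", "NS", "SN", "NS"]       # made-up labels for illustration
    predicted = ["SN", "SN", "NS", "NS", "NS"]
    counts = confusion_counts(truth, predicted, positive="SN")
    print(counts)
    if counts["tp"] + counts["fp"]:
        print("precision:", counts["tp"] / (counts["tp"] + counts["fp"]))
    if counts["tp"] + counts["fn"]:
        print("recall:", counts["tp"] / (counts["tp"] + counts["fn"]))
```

Running the same counts for each sample size and colour space would give a like-for-like comparison between rtrees, knearest and the rest.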

Oh and of course, all of this is a side project and NOT the main focus of my thesis...all very interesting though :)

Wednesday, 3 August 2011

Handwriting Recognition

More of a note post (I've been busy with work/children this week). I noted that one omission in the demonstrations of SixthSense was any form of text input. I want to address this by the provision of an "AirWriting" interface but had been struggling to find any form of decent handwriting recognition software for Linux. However, Bret Comstock Waldow (now there's a name for the 21st Century!) has come to the rescue with ink2text and SHIP, which, as far as I can tell, runs the MS tablet handwriting recognition DLLs under Wine and provides access to the routines via a service. Brilliant!

I am getting quite excited at what may be possible by gluing these bits together!

As a side note I also acquired a PS3eye and hacked it apart last night and was pleased to find support in Ubuntu 10.10 for it out of the box (so to speak). Colours seemed somewhat subdued in comparison to the c910, however the effects of the superior frame rate are obvious.

Roll on weekend!

Monday, 1 August 2011


Playing with markers today yielded some interesting results ~ I'm using a rather brute force approach in OpenCV by using InRangeS against an HSV image, having pulled the ranges via an app that samples representative pixels. Quick and dirty, but I'm encouraged by the results. False positives are high and it's not robust in terms of working across lighting conditions, but in part the problem is the choice of material for the markers (electrical masking tape), which suggests :

CVFB-R5: Markers must be composed of a material which is minimally reflective.

I need to define "minimally" with more precision obviously but I think the above would improve the detection in varied lighting conditions and possibly reduce false positives if I resample. And yes, I know this should have been obvious but I have to harken back to my comments about a lack of design rationale...
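The "sample some pixels, take the range, threshold" idea can be sketched without OpenCV at all. This is a minimal stand-in using only the standard library, not the code used here: hsv_range and in_range are hypothetical names, the margin is an arbitrary assumption, and colorsys works on a 0 to 1 scale whereas 8-bit OpenCV images use H 0-179 and S/V 0-255. It also ignores hue wrap-around, so it would need more care for red markers.

```python
import colorsys


def hsv_range(samples_rgb, margin=0.05):
    """Derive (lower, upper) HSV bounds from sampled RGB pixels (0-255 each).

    margin widens the box a little on every channel so that pixels
    slightly outside the sampled set still match.
    """
    hsv = [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
           for r, g, b in samples_rgb]
    lower = tuple(max(0.0, min(p[i] for p in hsv) - margin) for i in range(3))
    upper = tuple(min(1.0, max(p[i] for p in hsv) + margin) for i in range(3))
    return lower, upper


def in_range(pixel_rgb, lower, upper):
    """True if the pixel's HSV value lies inside the box on all 3 channels."""
    r, g, b = pixel_rgb
    hsv = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return all(lo <= x <= hi for x, lo, hi in zip(hsv, lower, upper))


if __name__ == "__main__":
    # Made-up samples standing in for clicks on a green marker.
    marker_samples = [(0, 200, 0), (10, 220, 10), (5, 190, 20)]
    lower, upper = hsv_range(marker_samples)
    print("range:", lower, upper)
    print("similar green matches:", in_range((5, 210, 5), lower, upper))
    print("red matches:", in_range((200, 0, 0), lower, upper))
```

A reflective marker shows up here as a much wider spread in S and V across the samples, which in turn widens the box and lets in more false positives.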

I'd be reasonably confident that, with a few improvements, this would roughly match the functionality of the SixthSense system, so anything over and above this is an improvement.

Next steps are to look at cvblob and then move on to Neural Networks.

Some notes : Popović wang ~ ~ MIT data glove.

Kakumanu et al. A survey of skin-color modeling and detection methods

Sunday, 31 July 2011

git repository, google project

First code commit! I've not used git before for source control (my programming over the last 5 years has been a solo effort so no need for anything beyond tarball source control) hence I just wanted to make some notes for myself on how to use it.

Firstly, code is hosted on google at To check out a copy:

git clone os6sense

There's very little there yet unless you are interested in just how badly python can be mangled.

And for my own notes, to add a file, commit changes and update the repository :

git add filename.ext
git commit -m "Insert Comment"
git push origin master

I'll put together a download at some point in the future

Hand tracking

My initial attempt at bolting all the bits together has run into a delay since I've decided to try and epoxy weld some parts together and don't have anything suitable, so yesterday my focus was on marker-less hand tracking.

There are a lot of impressive videos on YouTube showing hand tracking, but when you throw in a moving background, varying lighting and a video stream that 99.999% of the time won't contain the object you want to track (but may contain 100000s of similar objects), well, things don't tend to work as well as we would like.

The last time I looked at hand tracking was probably 5 years ago and I wasn't too impressed by the reliability or robustness of the various techniques on offer, and I can't say I'm any more impressed today. I recently had a quick look at various rapid approaches - touchless works but is marker based, TLD works but loses the target in long video and ends up learning a different target (but might be applicable if some form of re-initialisation was employed) and HandVu (which I had high hopes for) was too lighting sensitive. As said, these were quick looks and I will revisit TLD at least in the near future.
MIT budget "data-glove"

While I don't want to use fiducial markers, when MIT are proposing the use of gloves that even my wife would be ashamed to wear (and she loves colourful things) in order to improve recognition accuracy, well, one has to realise that we just haven't solved this problem yet.

Just how bad is it though? Well there have been multiple studies [cite] investigating skin detection and proposing values for HSV skin segmentation, and the theory behind a lot of this work looks solid (e.g. HS values for skin in a narrow range due to skin pigmentation being the result of blood colour and the amount of melanin [cite]), but throwing my training samples at them (incredibly cluttered background, variable lighting conditions) produces far too many false positives and false negatives to be of any practical value. Looking at the underlying RGB and HSV values also suggests that this approach is going to be of little practical application in "everyday" scenarios, hence I'll be moving on to fiducial markers for today.

Thursday, 28 July 2011

Prototype Hardware Rig

It's late here so this is a brief post. As I've said, hardware wise there are a lot of options as to where to place the various components for a WGI and a few studies out there have looked at the various components in some depth (TODO refs). However an obvious constraint for this project is available resources (hardware, time, money etc) and when it comes to the act of physically assembling things, and doing so quickly, for my first build it is almost a matter of just making the best out of things.

Hence I present the rigging for O6Sense FG v0.01 which has been inspired by observing the trend towards the wearing of large closed-cup headphones and the opportunities this opens in having a "socially acceptable" wearable prototype.

What is it? It's a pair of cheap headphones with the cups cut off and 2 Rode pop-shield mounts fitted into the rear wire loops. I can't help but feel some providence at work here since not only is the rear arm a snug fit, the screw holes are the exact right size for the projector mount. Obviously this design will change. The weight of the projector being off centre is no doubt going to be a huge hassle, however I need to try the shoulder mounted position since a couple of studies have identified a number of advantages to the use of this location. If the camera is off centre as well though this will balance things out (almost).

Prototype (un-assembled):
Screws, tape & Dremel on tomorrow's shopping list.

FOV - camera options

So continuing looking at cameras: firstly, let me be clear that I have a VERY limited budget for this project, having already pushed the boat out to buy an Optoma PK301 (I'll cover pico-projectors at a later date), hence commercial options such as this HQ lens and pre-modded IR cameras are just out of my price bracket. Hence the PS3 camera is looking very tempting given they can be picked up on eBay for less than £15 and a large range of hacks have already been done for them.

I wanted to document my comparison of the various options I have considered though :

Name          FOV (degrees)           fps320   fps640   fps1280   Cost
PS3 Eye       75/56                   120      60       NA        £15
C910          83H                     160      60       30        £70
Kinect        58H (IR)/63H (RGB)      30       30       NA        £100
Samsung SII   75?                     ???      ???      30        NA

The above table is incomplete obviously ~ I've thrown in the SII since I have one available but I can't find any specifications for the camera, even in the datasheet, hence the numbers are a guesstimate based on a comparison with the C910.

Doing the above research confirmed that I will have to rule out depth based systems such as Kinect and Asus's Xtion since the minimal operating distance for the IR camera is 0.5m in the case of the Kinect and 0.8m for the Xtion. I believe the Kinect's FOV can be improved to 90 deg via an add on lens but that obviously increases the expense. Pity, but a conscious design decision that I am now making is to focus on the "natural gesture position" that I illustrated earlier, based on the advantages of it being eyes-free. I am aiming to incorporate a forward aiming camera as well though so, yes, we're talking about a 2 camera system now (possibly with very simple homebrew IR modifications).

I think the main modification that is going to be needed is to increase the FOV of the camera and do so cheaply - some interesting ideas I uncovered for this:

Commercial camera wide angle lens
CCTV wide angle lens
Adapt a lens from a door peep hole

I like the idea of the door peep hole - a nice hack and within my budget.


Wednesday, 27 July 2011

First captures done

I made my first set of captures today - I'm not sure they are usable though since the act of role-playing the system use brought up a number of interesting issues :

Worn Camera Position
Mistry using Sixth Sense[1]
An option in regards to the camera position is to simply place it in the same position as used by Mistry. However my concerns here are 2 fold:

1) Design rationale. There is no design rationale as to the placement of the camera. I don't mean an extended examination of the rationale, just a simple overview of the thinking behind the placement.

2) Ergonomics. I can't help but think how uncomfortable that position is going to be to use for protracted periods of time (gorilla arm effect) or after a long day. Also in that position what about people with a high BMI - isn't the camera angle going to be greater than 90 degrees?

EDIT: Another couple of concerns in the back of my mind seem pertinent:

3) Social constraints. Having one's arms so far forwards is going to draw attention to system usage and, quite frankly, looks peculiar. We don't tend to naturally gesture in an exaggerated fashion unless we are agitated or upset and I would suspect that people would feel uncomfortable using such a gestural style in public spaces.

4) Situational Awareness. One of the advantages of an "eyes-free" gestural system is the ability to improve situational awareness but concentrating on the performance of a gesture in front of the space requires attention to be placed on the gesture.

So I'm not convinced that a single forward facing camera is the best option for a WGI...or even that a monocular camera system is viable for the range of applications that have been demonstrated with Sixth Sense. While the position might be pragmatic, giving an evenly distributed view of both the user's hands, the usability looks to be less than optimal in that position if we consider a wider range of users than your skinny MIT students and failed bodybuilders (moi)!

Field Of View
A "Natural" gesture pose?
I've thought for a while now though that a serious drawback to the use of computer vision for WGIs is the limited FOV for gestural input - if I could do some blue-sky research alongside this project it would probably be to use 2 cameras on motorised heads, each dedicated to locating and tracking 1 hand, but I just don't have the resources for that. But I DO have the resources to look at the use of 2 cameras, 1 focused down to capture gestures in a more natural position (hands at the sides) and one in the forward position. Of course a fish-eye lens is another option...(*sigh* oh for funding).

This of course led me to thinking about camera angles. As said, the lack of design rationale means we don't really know if there were particular benefits in the selection of the camera Mistry used ~ personally I'm working (initially) with the c910 for the widescreen angle and HD capture @ 30fps but I have a growing inclination to look at IR..but I digress. To help consider FOV and camera angles I put together a couple of illustrations.

Quite interesting to have a visual representation of just what angle (160 deg) is necessary to capture the hands in a natural gesture position and also observe the limitations of the forward facing camera in capturing anything more than the extremities of the hand. Another observation is that the hand seldom has more than the fingers over the 90 degree line, which might be useful....
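For a rough feel of what these angles mean in practice, the width a camera covers at a given distance follows from basic trigonometry: width = 2 * distance * tan(FOV / 2). A minimal sketch, assuming an idealised pinhole camera looking square-on at a flat plane (real lenses distort, especially wide ones), with hypothetical function names and an arbitrary working distance picked purely for illustration:

```python
import math


def coverage_width(fov_deg, distance):
    """Width of the plane seen at `distance` by a camera with horizontal
    field of view `fov_deg` (same units as distance)."""
    return 2.0 * distance * math.tan(math.radians(fov_deg) / 2.0)


def fov_needed(width, distance):
    """Horizontal field of view in degrees needed to cover `width` at `distance`."""
    return math.degrees(2.0 * math.atan(width / (2.0 * distance)))


if __name__ == "__main__":
    distance = 0.4  # metres from camera to hands; an assumed figure, not measured
    for fov in (58, 75, 83, 90):
        print("FOV %3d deg covers %.2f m at %.1f m"
              % (fov, coverage_width(fov, distance), distance))
    print("FOV needed for 1.0 m at %.1f m: %.0f deg"
          % (distance, fov_needed(1.0, distance)))
```

The tangent blows up as the angle approaches 180 degrees, which is the arithmetic behind why capturing hands held close to the body needs such an extreme lens.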

Frame Rate
Going back to the difficulties I've been having with OpenCV, frame rate is yet another issue. I know a lot of computer vision projects have been using the PS3Eye due to its ability to capture at 120FPS ~ obviously this provides a far clearer image of the region of interest on any given frame but the trade off is, of course, that the resolution drops to 320x240. Still, it's one to be considered.

Anyways, quite a lot of food for thought....

1. image source :
2. Thanks to fuzzimo for the protractor.

Monday, 25 July 2011

Some notes - opencv

Just quickly throwing a recording app together yesterday, I found that the video size wasn't being changed - a little digging suggests that the reliance on icv_SetPropertyCAM_V4L is always going to fail if the change in resolution between the width and height calls results in an unsupported resolution on the first call. Why isn't a simple call to set the video size with both height and width parameters supported, by exposing icvSetVideoSize?

It's not my aim to patch OpenCV though, so for my purposes I've updated the values for DEFAULT_V4L_WIDTH and DEFAULT_V4L_HEIGHT in highgui/src/cap_v4l.cpp to 1280x720 and rebuilt. Yes it's a fudge, and if I remember I'll have to file a bug.

But with that fixed I have a little recorder application ready to go, with the only issue left to solve being, well, the usual fun and games and open source politics. I get the following error when recording :
Corrupt JPEG data: 1 extraneous bytes before marker 0xd0
Reading this thread suggests that ffmpeg's MJPEG support is broken. There's a patch to fix it, but it will need manually applying. I want to avoid too many custom changes to 3rd party libs, hence I'm going to ignore that for now and try and get OpenCV to use YUYV rather than MJPEG, but my initial attempt at that failed. Hoping to get some video recorded while there's some sunshine about, hence that's one for later.

Oh, an extra note - calls to :

cv.SetCaptureProperty( capture, cv.CV_CAP_PROP_FPS, fps )

appear to work but if I try :

cv.GetCaptureProperty( capture, cv.CV_CAP_PROP_FPS )

I get -1 returned. In addition, although the initial set returns without error, asking for a rate higher than the camera actually delivers causes trouble: because I use that same FPS value when setting the FPS of the recorded video, this can result in some funky high speed video recordings.
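Since the capture property can't be read back, one workaround is to time the frames as they arrive and use the measured rate for the output file instead of the requested one. A minimal sketch using only the standard library: measured_fps and playback_speedup are hypothetical names, and the timestamps would in practice come from calling time.time() after each successful frame grab.

```python
def measured_fps(timestamps):
    """Average frames per second from a list of capture times in seconds."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    elapsed = timestamps[-1] - timestamps[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    # n timestamps span n - 1 frame intervals
    return (len(timestamps) - 1) / elapsed


def playback_speedup(writer_fps, capture_fps):
    """How much faster than real time a recording plays back when the file
    claims writer_fps but frames really arrived at capture_fps."""
    return writer_fps / capture_fps


if __name__ == "__main__":
    # Pretend frames arrived every 100 ms even though 30 fps was requested.
    stamps = [i * 0.1 for i in range(31)]
    actual = measured_fps(stamps)
    print("measured fps: %.1f" % actual)
    print("speed-up if written at 30 fps: %.1fx" % playback_speedup(30.0, actual))
```

Timing a second or so of warm-up frames before opening the writer should give a usable figure, at the cost of a short delay at start-up.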


Of course I could have left well enough alone and just used the defaults which did work fine.

UPDATE: A little more playing reveals that setting WITH_JPEG=OFF in CMakeCache.txt switches off JPEG support (command line switch changed or broken?) and finally we can access the YUYV stream. Sadly though the performance is about 1/2 of that of the MJPEG stream :/

Saturday, 23 July 2011

The Patent

I was somewhat surprised to come across a patent for Sixth Sense given the initial declaration that the code would be open sourced. I was even more surprised, reading the contents of the patent, at how general it is...oh hum, let's not go there apart from to say I'm not a fan of broad patents, but I wanted to bring it up since 1) it is the only "detailed" source of information on the implementation of Sixth Sense and 2) it's useful to acknowledge and recognise it since I don't particularly want to be "trolled" in this research.

So yes, it's out there and it's worth a quick skim through (or not, since a "clean room" implementation might be advisable, but too late for me!) since it tells us that the source for Sixth Sense is largely based on several open source projects (see 0120). Touchless is used for fiducial recognition, the $1 Unistroke Recogniser algorithm for gesture commands, and ARToolkit for the output. OpenCV is mentioned and possibly does some of the heavy lifting for object recognition (possibly HMM?). I also just realised that the microphone is working in tandem with the camera when it is used on paper, probably using some aspect of the sound on the paper to indicate contact with the destination surface since a single camera by itself is insufficient to determine when contact occurs.

So what do we do with this knowledge?

I've played with OpenCV and HandVu in the past and found them (for hand tracking at least) not that great, since neither really solves the problem of reliable background segmentation in complex environments. Hence I can see the logic in using fiducials, although a brief play (with touchless) suggests that even a fiducial based recognition system is unlikely to be perfect (at least in the case of a single unmodified webcam). This does lead to an important point for me in terms of requirements :

CVFB-R1: The computer vision system must be able to reliably determine fiducial positions in complex background images.
CVFB-R2: The computer vision system must be able to reliably determine fiducial positions in varied background images.
CVFB-R3: The computer vision system must be able to reliably determine fiducial positions with varying lighting conditions.
CVFB-R4: CVFB-R1 - CVFB-R3 must be met for 4 fiducial markers, each of a distinct colour.

and should it be possible to work without fiducial markers :

CVSB-R1: The computer vision system must be able to reliably determine hand shape in complex background images.
CVSB-R2: The computer vision system must be able to reliably determine hand shape in varied background images.

CVSB-R3: The computer vision system must be able to reliably determine hand shape with varying lighting conditions.
CVSB-R4:  The computer vision system must be able to reliably discriminate between left and right hands.

I rather suspect that I'm going to have to be flexible with tests/thresholds to determine if these requirements are met, and it should also be noted that no single computer vision based technique has been found to work for all applications or environments (Wachs et al, 2011, p60), hence there may be some opportunity to improve on the generic libraries/algorithms which it would seem natural to apply (e.g. touchless, cvBlob).

Moving on, for those who haven't played with the $1 Unistroke Recogniser (Wobbrock, 2007), it's impressive. Based on the results of the tests for this algorithm I'd be reasonably confident in its reliability and robustness, IF the above requirements can be met.
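For anyone who hasn't looked inside it, the first stage of the $1 recogniser resamples each stroke to a fixed number of evenly spaced points so that a fast scribble and a slow careful one can be compared point for point. A minimal sketch of just that resampling step follows; the later stages (rotation, scaling, translation and template matching) are not shown. The function names are mine, and n=64 is simply the default chosen for this sketch.

```python
import math


def path_length(points):
    """Total length of the polyline through points."""
    return sum(math.hypot(b[0] - a[0], b[1] - a[1])
               for a, b in zip(points, points[1:]))


def resample(points, n=64):
    """Resample a stroke to n points spaced evenly along its length."""
    pts = [(float(x), float(y)) for x, y in points]
    interval = path_length(pts) / (n - 1)
    if interval == 0:
        return [pts[0]] * n            # degenerate stroke: a single dot
    resampled = [pts[0]]
    travelled = 0.0                    # distance covered since the last output point
    i = 1
    while i < len(pts):
        (x0, y0), (x1, y1) = pts[i - 1], pts[i]
        d = math.hypot(x1 - x0, y1 - y0)
        if travelled + d >= interval:
            t = (interval - travelled) / d
            q = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
            resampled.append(q)
            pts.insert(i, q)           # q starts the next segment
            travelled = 0.0
        else:
            travelled += d
        i += 1
    if len(resampled) == n - 1:        # rounding can leave the list one short
        resampled.append(pts[-1])
    return resampled


if __name__ == "__main__":
    stroke = [(0, 0), (3, 0), (3, 4), (10, 4)]   # made-up tracked fiducial path
    for p in resample(stroke, n=8):
        print("%.2f, %.2f" % p)
```

This step is also where tracking quality bites: dropped or jittery centroids change the path length, and with it where every resampled point lands.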

Keeping to the KISS principle I'm going to use this as the basis of my first experiments (and code woo-hoo!) which are going to be :

1) Capture short (<5 minute) segments of video with a worn webcam (in my case I have a Logitech C910 handy, not the most discreet of cameras but sadly my Microsoft LifeCam Show broke grrrrr) in a variety of environments while wearing fiducial markers on 4 fingers.

2) Capture short (<5 minute) segments of video with a worn webcam in a variety of environments without markers.

3) Based on these exemplar videos, test various recognition techniques from OpenCV to determine the optimal technique which meets the above requirements.

4) Apply and test sample gestures against $1 Unistroke Recogniser (Python implementation)
4.1) optional Determine if there are any differences in the performance/reliability of the Python versions.

Okay that's my week planned then, comments?


Wachs, J, Kölsch, M, Stern, H & Edan, Y 2011, ‘Vision-based hand-gesture applications’ in Communications of the ACM vol. 54, no. 2 p. 60-71

Wobbrock  et al 2007

Welcome to the Open Source Sixth Sense WGI Project!

It's been over two years since Pranav Mistry announced that his Sixth Sense project would be made open source. Since then we've heard little about this technology or the project and, like many AR point technology research examples, this appears to have become abandonware.

So when it came around for me to pick a topic for my Masters thesis I couldn't help but think it would be a great opportunity to do a design and build project for a similar system, investigating the HCI aspects of this novel technology as the focus...and that's what I'm doing.

First though I need to build one and along the way I couldn't help but think 2 things :

1) This is also a good opportunity to build my first open source project so that, if other researchers want to explore this technology, an artefact exists allowing for rapid development of a research system.

2) There's also an opportunity to examine an interesting concept of "The Inventor as the User" as a UCD perspective on the development of novel technologies.

To briefly expand on the above then, it is my intention to expand this project with a forum, wiki and source code which will allow anyone to create a comparable WGI; I'll be keeping this blog updated with my progress and thoughts and I will be actively encouraging input on this work to expand and moderate my perspective so that I end up neither blinkered nor missing the "obvious".

Please, wish me luck ;)