Sunday, 31 July 2011

git repository, google project

First code commit! I've not used git before for source control (my programming over the last five years has been a solo effort, so there was no need for anything beyond tarball source control), hence I just wanted to make some notes for myself on how to use it.

Firstly, code is hosted on google at https://code.google.com/p/os6sense/. To check out a copy:

git clone https://morphemass@code.google.com/p/os6sense/ os6sense


There's very little there yet unless you are interested in just how badly Python can be mangled.

And for my own notes, to add a file, commit changes and update the repository:

git add filename.ext            # stage the file
git commit -m "Insert Comment"  # commit the staged changes locally
git push origin master          # push the commits up to the Google Code repository


I'll put together a download at some point in the future.

Hand tracking

My initial attempt at bolting all the bits together has run into a delay since I've decided to try to epoxy-weld some parts together and don't have anything suitable, so yesterday my focus was on marker-less hand tracking.

There are a lot of impressive videos on YouTube showing hand-tracking, but when you throw in a moving background, varying lighting and a video stream that 99.999% of the time won't contain the object you want to track (but may contain 100,000s of similar objects), things don't tend to work as well as we would like.

The last time I looked at hand tracking was probably 5 years ago and I wasn't too impressed by the reliability or robustness of the various techniques on offer, and I can't say I'm any more impressed today. I recently had a quick look at various rapid approaches: Touchless works but is marker-based; TLD works but loses the target in long videos and ends up learning a different target (though it might be applicable if some form of re-initialisation was employed); and HandVu (which I had high hopes for) was too sensitive to lighting. As said, these were quick looks and I will revisit TLD at least in the near future.
MIT budget "data-glove"

While I don't want to use fiducial markers, when MIT are proposing the use of gloves that even my wife would be ashamed to wear (and she loves colourful things) in order to improve recognition accuracy, well, one has to realise that we just haven't solved this problem yet.

Just how bad is it though? Well, there have been multiple studies [cite] investigating skin detection and proposing values for HSV skin segmentation, and the theory behind a lot of this work looks solid (e.g. HS values for skin fall in a narrow range because skin pigmentation is the result of blood colour and the amount of melanin [cite]), but throwing my training samples at them (incredibly cluttered background, variable lighting conditions) produces far too many false positives and false negatives to be of any practical value. Looking at the underlying RGB and HSV values also suggests that this approach is going to be of little practical application in "everyday" scenarios, hence I'll be moving on to fiducial markers for today.
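For anyone wanting to try this themselves, the sort of minimal HSV thresholding test I mean looks roughly like the sketch below. The threshold values are illustrative guesses rather than figures from any particular study, the function names are my own, and everything would need tuning per camera and lighting setup:

import cv2
import numpy as np

# Rough "skin-coloured" bounds in OpenCV's HSV space (H runs 0-179 here).
# Placeholder values for illustration only, not recommendations.
LOWER_SKIN = np.array([0, 40, 60], dtype=np.uint8)
UPPER_SKIN = np.array([25, 180, 255], dtype=np.uint8)

def skin_mask(bgr_frame):
    """Return a binary mask of pixels falling inside the skin thresholds."""
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, LOWER_SKIN, UPPER_SKIN)
    # A little opening to knock out isolated false positives.
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    if ok:
        cv2.imwrite("skin_mask.png", skin_mask(frame))
    cap.release()

Even with generous bounds like these, it's easy to see how a wooden desk or a beige wall ends up classified as "skin" - exactly the false-positive problem described above.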

Thursday, 28 July 2011

Prototype Hardware Rig

It's late here so this is a brief post. As I've said, hardware-wise there are a lot of options as to where to place the various components for a WGI, and a few studies out there have looked at the various components in some depth (TODO refs). However, an obvious constraint for this project is available resources (hardware, time, money etc.), and when it comes to the act of physically assembling things, and doing so quickly, for my first build it is almost a matter of just making the best out of things.

Hence I present the rigging for O6Sense FG v0.01, which has been inspired by observing the trend towards the wearing of large closed-cup headphones and the opportunities this opens up for a "socially acceptable" wearable prototype.


What is it? It's a pair of cheap headphones with the cups cut off and 2 Rode pop-shield mounts fitted into the rear wire loops. I can't help but feel some providence at work here since not only is the rear arm a snug fit, the screw holes are exactly the right size for the projector mount. Obviously this design will change. The weight of the projector being off-centre is no doubt going to be a huge hassle; however, I need to try the shoulder-mounted position for this since a couple of studies have identified a number of advantages to the use of this location. If the camera is off-centre as well, though, this will balance things out (almost)...

Prototype (un-assembled):
Screws, tape & Dremel on tomorrow's shopping list.

FOV - camera options

So, continuing to look at cameras: firstly let me be clear that I have a VERY limited budget for this project, having already pushed the boat out to buy an Optoma PK301 (I'll cover pico-projectors at a later date), so commercial options such as this HQ lens and pre-modded IR cameras are just out of my price bracket. Hence the PS3 camera is looking very tempting, given they can be picked up on eBay for less than £15 and a large range of hacks have already been done for them.

I wanted to document my comparison of the various options I have considered though:

Name        | FOV (degrees)      | fps@320 | fps@640 | fps@1280 | Cost
PS3 Eye     | 75/56              | 120     | 60      | NA       | £15
C910        | 83H                | 160     | 60      | 30       | £70
Kinect      | 58H (IR)/63H (RGB) | 30      | 30      | NA       | £100
Xtion       | 58H                |         |         |          | £120
Samsung SII | 75?                | ??      | ??      | 30       | NA

The above table is obviously incomplete ~ I've thrown in the SII since I have one available, but I can't find any specifications for its camera, even in the datasheet, hence the numbers are a guesstimate based on a comparison with the C910.

Doing the above research confirmed that I will have to rule out depth-based systems such as the Kinect and Asus's Xtion, since the minimum operating distance for the IR camera is 0.5m in the case of the Kinect and 0.8m for the Xtion. I believe the Kinect's FOV can be improved to 90 degrees via an add-on lens, but that obviously increases the expense. A pity, but a conscious design decision that I am now making is to focus on the "natural gesture position" that I illustrated earlier, based on the advantages of it being eyes-free. I am aiming to incorporate a forward-aiming camera as well though, so, yes, we're talking about a two-camera system now (possibly with very simple homebrew IR modifications).

I think the main modification that is going to be needed is to increase the FOV of the camera and do so cheaply - some interesting ideas I uncovered for this:

Commercial camera wide angle lens
CCTV wide angle lens
Adapt a lens from a door peep hole

I like the idea of the door peep hole - a nice hack and within my budget.

[1] http://forums.logitech.com/t5/Webcams/Question-on-HD-Pro-c910/td-p/568268

Wednesday, 27 July 2011

First captures done

I made my first set of captures today - I'm not sure they are usable though, since the act of role-playing use of the system brought up a number of interesting issues:

Worn Camera Position
Mistry using Sixth Sense[1]
One option in regard to the camera position is simply to place it in the same position as used by Mistry. However, my concerns here are twofold:

1) Design rationale. There is no design rationale as to the placement of the camera. I don't mean an extended examination of the rationale, just a simple overview of the thinking behind the placement.

2) Ergonomics. I can't help but think how uncomfortable that position is going to be to use for protracted periods of time (the gorilla-arm effect) or after a long day. Also, in that position, what about people with a high BMI - isn't the camera angle going to be greater than 90 degrees?

EDIT: Another couple of concerns in the back of my mind seem pertinent:

3) Social constraints. Having one's arms so far forwards is going to draw attention to system usage and, quite frankly, looks peculiar. We don't tend to naturally gesture in an exaggerated fashion unless we are agitated or upset, and I would suspect that people would feel uncomfortable using such a gestural style in public spaces.

4) Situational awareness. One of the advantages of an "eyes-free" gestural system is the ability to improve situational awareness, but concentrating on the performance of a gesture in that forward space requires attention to be placed on the gesture.

So I'm not convinced that a single forward-facing camera is the best option for a WGI... or even that a monocular camera system is viable for the range of applications that have been demonstrated with Sixth Sense. While the position might be pragmatic, giving an evenly distributed view of both of the user's hands, the usability looks to be less than optimal in that position if we consider a wider range of users than your skinny MIT students and failed bodybuilders (moi)!

Field Of View
A "Natural" gesture pose?
I've thought for a while now that a serious drawback to the use of computer vision for WGIs is the limited FOV for gestural input. If I could do some blue-sky research alongside this project it would probably be to use two cameras on motorised heads, each dedicated to locating and tracking one hand, but I just don't have the resources for that. I DO have the resources to look at the use of two cameras though: one angled down to capture gestures in a more natural position (hands at the sides) and one in the forward position. Of course a fish-eye lens is another option... (*sigh* oh for funding).

This of course got me thinking about camera angles. As said, the lack of design rationale means we don't really know if there were particular benefits in the selection of the camera Mistry used ~ personally I'm working (initially) with the C910 for the widescreen angle and HD capture @ 30fps, but I have a growing inclination to look at IR... but I digress. To help consider FOV and camera angles I put together a couple of illustrations.
 


Quite interesting to have a visual representation of just what angle (160 deg) is necessary to capture the hands in a natural gesture position, and also to observe the limitations of the forward-facing camera in capturing anything more than the extremities of the hand. Another observation is that the hand seldom has more than the fingers over the 90-degree line, which might be useful...
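To put a rough number on the angle needed for the hands-at-the-sides case, the little calculation below is a sanity check rather than a measurement - the camera drop and hand offset are pure guesses about my own build, not anything I've measured:

import math

# Illustrative guesses only: the chest-mounted camera sitting roughly 0.45m
# above the hands when they hang naturally, with each hand offset about
# 0.35m horizontally from the camera's optical axis.
drop_to_hands_m = 0.45
hand_offset_m = 0.35

half_angle = math.degrees(math.atan2(hand_offset_m, drop_to_hands_m))
print("Required horizontal FOV: %.0f degrees" % (2 * half_angle))

With those guesses it comes out at around 75-80 degrees; move the hands further out or the camera closer and it rapidly climbs towards the ~160 degrees suggested by the protractor sketch.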

Frame Rate
Going back to the difficulties I've been having with OpenCV, frame rate is yet another issue. I know a lot of computer vision projects have been using the PS3 Eye due to its ability to capture at 120 FPS ~ obviously this provides a far clearer image of the region of interest on any given frame, but the trade-off is, of course, that the resolution drops to 320x240. Still, it's one to be considered.

Anyways, quite a lot of food for thought....

1. Image source: http://www.technologyreview.com/files/32476/mistry_portrait_x220.jpg
2. Thanks to fuzzimo for the protractor.

Monday, 25 July 2011

Some notes - opencv

Just quickly throwing a recording app together yesterday, I found that the video size wasn't being changed - a little digging suggests that the reliance on icv_SetPropertyCAM_V4L is always going to fail if the change in resolution between the width and height calls results in an unsupported resolution on the first call. Why isn't a simple call to set the video size with both height and width parameters supported, by exposing icvSetVideoSize?

It's not my aim to patch OpenCV though, so for my purposes I've updated the values of DEFAULT_V4L_WIDTH and DEFAULT_V4L_HEIGHT in highgui/src/cap_v4l.cpp to 1280x720 and rebuilt. Yes, it's a fudge, and if I remember I'll have to file a bug for it.
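For reference, the Python side of the request looks something like the sketch below (the property constants are the standard ones; whether the V4L backend honours them is exactly the problem described above):

import cv

capture = cv.CaptureFromCAM(0)

# Request 1280x720. With the stock cap_v4l backend the width call can
# silently fail if 1280 x <current height> isn't a mode the camera
# supports, which is why I've resorted to patching the defaults instead.
cv.SetCaptureProperty(capture, cv.CV_CAP_PROP_FRAME_WIDTH, 1280)
cv.SetCaptureProperty(capture, cv.CV_CAP_PROP_FRAME_HEIGHT, 720)

frame = cv.QueryFrame(capture)
if frame is not None:
    print("Capturing at %dx%d" % (frame.width, frame.height))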

But with that fixed I have a little recorder application ready to go, with the only issue left to solve being, well, the usual fun, games and open source politics. I get the following error when recording:
Corrupt JPEG data: 1 extraneous bytes before marker 0xd0
Reading this http://forums.quickcamteam.net/showthread.php?tid=1439 thread suggests that ffmpeg's MJPEG support is broken. There's a patch to fix it, but it will need applying manually. I want to avoid too many custom changes to third-party libs, hence I'm going to ignore that for now and try to get OpenCV to use YUYV rather than MJPEG, but my initial attempt at that failed. Hoping to get some video recorded while there's some sunshine about, hence that's one for later.

Oh, an extra note - calls to:

cv.SetCaptureProperty( capture, cv.CV_CAP_PROP_FPS, fps )

appear to work, but if I try:

cv.GetCaptureProperty( capture, cv.CV_CAP_PROP_FPS )

I get -1 returned. In addition, although the initial set returns without error, if I ask for a rate that's too high and then reuse that same FPS value when setting up the video writer, this can result in some funky high-speed video recordings.
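The workaround I'm leaning towards is simply to treat the reported FPS with suspicion - something along the lines of the sketch below (timing QueryFrame calls would be the more honest fix):

import cv

capture = cv.CaptureFromCAM(0)
requested_fps = 30
cv.SetCaptureProperty(capture, cv.CV_CAP_PROP_FPS, requested_fps)

reported_fps = cv.GetCaptureProperty(capture, cv.CV_CAP_PROP_FPS)
if reported_fps <= 0:
    # The V4L backend won't tell us the real rate, so fall back to the value
    # we asked for - if the camera can't actually deliver it, the recording
    # plays back too fast, hence the funky high-speed videos.
    reported_fps = requested_fps

writer = cv.CreateVideoWriter("capture.avi", cv.CV_FOURCC('M', 'J', 'P', 'G'),
                              reported_fps, (1280, 720), 1)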

*sigh*

Of course, I could have left well enough alone and just used the defaults, which did work fine.

UPDATE: A little more playing reveals that setting WITH_JPEG=OFF in CMakeCache.txt switches off JPEG support (command-line switch changed or broken?) and finally we can access the YUYV stream. Sadly, though, the performance is about half that of the MJPEG stream :/

Saturday, 23 July 2011

The Patent

http://www.freepatentsonline.com/20100199232.pdf

I was somewhat surprised to come across a patent for Sixth Sense, given the initial declaration that the code would be open-sourced. I was even more surprised, reading the contents of the patent, at how general it is... oh hum, let's not go there except to say I'm not a fan of broad patents. But I wanted to bring it up since 1) it is the only "detailed" source of information on the implementation of Sixth Sense and 2) it's useful to acknowledge and recognise it, since I don't particularly want to be "trolled" in this research.

So yes, it's out there and it's worth a quick skim through (or not, since a "clean room" implementation might be advisable - too late for me!), since it tells us that the source for Sixth Sense is largely based on several open source projects (see 0120). Touchless is used for fiducial recognition, the $1 Unistroke Recogniser algorithm for gesture commands, and ARToolkit for the output. OpenCV is mentioned and possibly does some of the heavy lifting for object recognition (possibly HMM?). I also just realised that the microphone works in tandem with the camera when it is used on paper, probably using some aspect of the sound on the paper to indicate contact with the destination surface, since a single camera by itself is insufficient to determine when contact occurs.

So what do we do with this knowledge?

I've played with OpenCV and HandVu in the past and found them (for hand tracking at least) not that great, since neither really solves the problem of reliable background segmentation in complex environments. Hence I can see the logic in using fiducials, although a brief play (with Touchless) suggests that even a fiducial-based recognition system is unlikely to be perfect (at least in the case of a single unmodified webcam). This does lead to an important point for me in terms of requirements:

CVFB-R1: The computer vision system must be able to reliably determine fiducial positions in complex background images.
CVFB-R2: The computer vision system must be able to reliably determine fiducial positions in varied background images.
CVFB-R3: The computer vision system must be able to reliably determine fiducial positions with varying lighting conditions.
CVFB-R4: CVFB-R1 - CVFB-R3 must be met for 4 fiducial markers, each of a distinct colour.

and should it be possible to work without fiducial markers :

CVSB-R1: The computer vision system must be able to reliably determine hand shape in complex background images.
CVSB-R2: The computer vision system must be able to reliably determine hand shape in varied background images.
CVSB-R3: The computer vision system must be able to reliably determine hand shape with varying lighting conditions.
CVSB-R4: The computer vision system must be able to reliably discriminate between left and right hands.


I rather suspect that I'm going to have to be flexible with the tests/thresholds used to determine whether these requirements are met. It should also be noted that no single computer vision technique has been found to work for all applications or environments (Wachs et al., 2011, p. 60), hence there may be some opportunity to improve on the generic libraries/algorithms which it would seem natural to apply (e.g. Touchless, cvBlob).

Moving on, for those who haven't played with the $1 Unistroke Recogniser (Wobbrock et al., 2007): it's impressive. I'd be reasonably confident, based on the results of the tests for this algorithm, in its reliability and robustness, IF the above requirements can be met.
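For anyone who hasn't seen how it works, the core of the algorithm is surprisingly small: resample the stroke to a fixed number of points, normalise for scale and position, then compare against stored templates by average point-to-point distance. The sketch below is my own much-simplified take (it skips the rotation-to-indicative-angle step and the golden-section search of the real recogniser, and all the names are mine), not the reference implementation:

import math

N_POINTS = 64          # the paper resamples every stroke to a fixed point count
SQUARE_SIZE = 250.0    # reference square used for scale normalisation

def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def _path_length(points):
    return sum(_dist(points[i - 1], points[i]) for i in range(1, len(points)))

def resample(points, n=N_POINTS):
    """Resample a stroke to n evenly spaced points."""
    interval = _path_length(points) / (n - 1)
    pts = list(points)
    resampled = [pts[0]]
    accumulated = 0.0
    i = 1
    while i < len(pts):
        d = _dist(pts[i - 1], pts[i])
        if accumulated + d >= interval and d > 0:
            t = (interval - accumulated) / d
            q = (pts[i - 1][0] + t * (pts[i][0] - pts[i - 1][0]),
                 pts[i - 1][1] + t * (pts[i][1] - pts[i - 1][1]))
            resampled.append(q)
            pts.insert(i, q)   # q becomes the start of the next segment
            accumulated = 0.0
        else:
            accumulated += d
        i += 1
    while len(resampled) < n:  # rounding can leave us a point short
        resampled.append(pts[-1])
    return resampled

def normalise(points):
    """Scale to the reference square, then translate the centroid to the origin."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    w = max(max(xs) - min(xs), 1e-6)
    h = max(max(ys) - min(ys), 1e-6)
    scaled = [(x * SQUARE_SIZE / w, y * SQUARE_SIZE / h) for x, y in points]
    cx = sum(p[0] for p in scaled) / len(scaled)
    cy = sum(p[1] for p in scaled) / len(scaled)
    return [(x - cx, y - cy) for x, y in scaled]

def recognise(stroke, templates):
    """Return (name, average point distance) of the closest stored template."""
    candidate = normalise(resample(stroke))
    best = None
    for name, template_stroke in templates.items():
        t = normalise(resample(template_stroke))
        score = sum(_dist(p, q) for p, q in zip(candidate, t)) / len(candidate)
        if best is None or score < best[1]:
            best = (name, score)
    return best

With a handful of templates recorded per gesture, recognise() hands back the best-matching name plus a distance that can be thresholded to reject garbage strokes - which is exactly the behaviour I want to sit on top of the fiducial tracking.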

Keeping to the KISS principle, I'm going to use this as the basis of my first experiments (and code, woo-hoo!), which are going to be:

1) Capture short (<5 minute) segments of video with a worn webcam (in my case I have a Logitech C910 handy - not the most discreet of cameras, but sadly my Microsoft LifeCam Show broke, grrrrr) in a variety of environments while wearing fiducial markers on 4 fingers.

2) Capture short (<5 minute) segments of video with a worn webcam in a variety of environments without markers.


3) Based on these exemplary videos, test various recognition techniques from OpenCV to determine the optimal technique which meets the above requirements.

4) Apply and test sample gestures against $1 Unistroke Recogniser (Python implementation)
4.1) (Optional) Determine if there are any differences in the performance/reliability of the Python versions.

Okay that's my week planned then, comments?

REFERENCES


Wachs, J., Kölsch, M., Stern, H. & Edan, Y. 2011, 'Vision-based hand-gesture applications', Communications of the ACM, vol. 54, no. 2, pp. 60-71.

Wobbrock, J.O., Wilson, A.D. & Li, Y. 2007, 'Gestures without libraries, toolkits or training: A $1 recognizer for user interface prototypes', Proceedings of UIST 2007. http://depts.washington.edu/aimgroup/proj/dollar/

Welcome to the Open Source Sixth Sense WGI Project!

It's been over two years since Pranav Mistry announced that his Sixth Sense project would be made open source. Since then we've heard little about this technology or the project, and like many AR point-technology research examples, it appears to have become abandonware.

So when it came around for me to pick a topic for my Master's thesis, I couldn't help but think it would be a great opportunity to do a design-and-build project for a similar system, investigating the HCI aspects of this novel technology as the focus... and that's what I'm doing.

First though I need to build one, and along the way I couldn't help but think two things:

1) This is also a good opportunity to build my first open source project so that, if other researchers want to explore this technology, an artefact exists allowing for rapid development of a research system.

2) There's also an opportunity to examine an interesting concept of "The Inventor as the User" as a UCD perspective on the development of novel technologies.

To briefly expand on the above then: it is my intention to expand this project with a forum, wiki and source code which will allow anyone to create a comparable WGI. I'll be keeping this blog updated with my progress and thoughts, and I will be actively encouraging input on this work to expand and moderate my perspective, so that I end up neither blinkered nor missing the "obvious".

Please, wish me luck ;)