Weekly QuEST Discussion Topics and News, 12 Feb

QuEST 12 Feb 2016:

We want to discuss the application of compositional captioning and its relationship to the unexpected query.  The article is Deep Compositional Captioning:  Describing Novel Object Categories without Paired Training Data, arXiv:1511.05284v1 [cs.CV], 17 Nov 2015.

•      While recent deep neural network models have achieved promising results on the image captioning task, they rely largely on the availability of corpora with paired image and sentence captions to describe objects in context.

•      In this work, we propose the Deep Compositional Captioner (DCC) to address the task of generating descriptions of novel objects which are not present in paired image sentence datasets.

•      The goal of the discussion is to understand the approach AND to use it to discuss the idea of the unexpected query. I had great hope that it would give me insight into our fundamental challenge, the unexpected query, and it could be that it did (I intend to spend more time digging through it). As far as I can currently tell, they point out that current caption systems just spit back out previously learned associations (image-caption pairs). When nothing in your training set of image-caption pairs can account for the meaning of a test image or video snippet, you lose, because the system will give you its best previously experienced linguistic expression from the image-caption pair data!

•      That is brilliant and makes me really appreciate the capability and limitations of the approach. As far as I can tell, what they do in this DCC paper is bring to bear on the challenge a language model that is trained, as most RNN language models are, on some text data. Their really cool innovation (we’ve seen this in other articles also) is that they embed it in a common space with the CNN trained on object/image databases. By combining these with an image-caption model network and its data, they are able to associate previously experienced linguistic expressions that did NOT come from the image-caption training. They are able to combine the experiences of the language model, the image model and some image-caption data. Really cool. BUT as far as I can tell they can still only use as a caption previously experienced linguistic expressions (although in their case some of those expressions came from the language model rather than from the object image model or the image-caption data). That makes perfect sense to me.

•      The unexpected query problem is still an open question: how can I generate a caption using linguistic expressions I’ve never used before, possibly because my image/video snippet is completely novel compared to anything I’ve seen? I think Scott’s example of a yellow frog is spectacular, and I think the answer still lies in a QuEST wrapper, one that facilitates the imagining of the feature data both in the RNN and the CNN, generating a simulation that uses those imagined features in an attempt to find a high-confidence answer when combined with the bottom-up evoked features.

•      So think of an approach that generates these confabulated feature combinations. Suppose I run such an imagined simulation for part of the representation space. For example, I ignore the color factor that accounts for that aspect of the stimulus (I’m not saying that there is a color-only feature; that is just a metaphor for the discussion, standing in for the feature-level part of the CNN before the output layer), and the rest of the features match very strongly with a frog on a lily pad. If I track which part of the CNN I imagined and use the RNN to capture the best words for that part of the original stimulus, maybe I can generate a yellow frog caption?

•      I don’t think this has to be hand tweaked. I think I could envision a systematic approach in which, for every input, the ‘conscious’ wrapper generates these ‘plausible narratives’ whenever there appears to be an inconsistency in the combinations in the representation but each part alone is very similar to prior experiences. Sometimes the answer is exactly what the stimulus evoked, but sometimes there is an inferred representation that generates a really good answer and has to be considered.

•      I’ve been thinking a lot about the unexpected query and QuEST. We have said continually, but without much detailed backup argument, that we seek to design systems that can respond acceptably to a range of stimuli that were not included in the pool of expected stimuli when the system was designed.

•      Bottom line up front:  using concepts from transfer learning, we can engineer changes from the outside to existing machine learning solutions so that they adapt to new tasks or domains. What we seek in QuEST is the ability for the system to immediately respond to a range of stimuli that we wouldn’t have thought it could handle, and it does this using a complementary representation that is situated/simulated.

•      We’ve brought the ideas from transfer learning into the discussion. That added some specificity to how we might define UQs relevant to learning systems that use conventional statistical approaches for representation (like deep learning …), and it led us to define domains and tasks. See for example the Pan/Yang IEEE article for clean definitions, or our deck of slides on the unexpected query. The domain captures the feature space and the pdfs; the task definition captures the labels and the predictive function being learned from the data.

•      The UQ occurs when something changes in the domain or the task. A new set of labels, for example, would be a change in task. In our content curation problem, imagine that I add a category of document that previously wasn’t in my label set: a pile of documents for ‘Seth’. I could imagine that the features / pdfs / transfer function might not have to be changed, but I need to be able to find from those tools which documents Seth would want to see. That is an unexpected query to the original system.

•      Another example was when the AC3 team took the ImageNet-trained CNN and used those weights to provide a solution for the narratives-for-video-snippets problem. Retraining the RNN system with some video snippet truth data is a means of giving a system that would not be expected to respond acceptably to video snippet queries a shot at responding acceptably. So the original system, as trained with ImageNet and labels, would face a UQ for the category of inputs of video snippets. Using multi-frame features is a clear example of changing the domain (the feature space), so our engineering wrapper is a means to change a previous solution to handle this category of UQs.

•      Where does QuEST help?  It is clear that putting people in the loop to adapt solutions, changing the design / predictive functions …, is a relatively straightforward means to adapt to categories of inputs that the system is not expected to respond acceptably to. But if the system has to respond NOW (while the system is being redesigned, which is a requirement for autonomy), how can we make such a system have some expectation of responding acceptably? That is where a sys2/conscious system has an enormous advantage. I would also like to address the DeepMind perspective on reinforcement learning. Imagine reinforcement learning as one of the means to adapt to new queries; on the surface, though, it can’t provide an immediate response that has any hope of being acceptable.

•      A situated, simulation-based representation might just be different enough from the sensor-space representation to facilitate an immediate acceptable response. Think of the color constancy challenge: generate a representation in which a snowball inside still looks white and a piece of coal outside still looks black, whatever the amount of light coming off each.

•      Wiki: Color constancy is an example of subjective constancy and a feature of the human color perception system which ensures that the perceived color of objects remains relatively constant under varying illumination conditions. A green apple for instance looks green to us at midday, when the main illumination is white sunlight, and also at sunset, when the main illumination is red. This helps us identify objects.

•      We see a similar representation in music: independent of the base key, the tune is the key ‘meaning’.

•      A deep learning solution to this problem would never be able to generate all the data necessary for all the variations of illuminant. By having a combination of representations, we reduce the need for all the original training data.

•      Lastly I want to remind all of the recent Toyota proclamation:  Toyota Wants Its Cars to Expect the Unexpected — http://www.technologyreview.com/news/545186/toyota-wants-its-cars-to-expect-the-unexpected/

•      Japanese carmaker Toyota reveals details of an ambitious $1 billion effort to advance AI and robotics.

•      By Will Knight on January 4, 2016

•      Fundamental advances are needed in order for computers and robots to be much smarter and more useful.

•      Gill Pratt, CEO of the Toyota Research Institute, speaks at CES in Las Vegas.

Toyota revealed more details of an ambitious plan to invest in artificial intelligence and robots today during a keynote speech by Gill Pratt, CEO of the new $1 billion Toyota Research Institute (TRI), at CES in Las Vegas.


So in conclusion, we want to hit the details of the DCC article from these perspectives.
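
To give the discussion something concrete to poke at, here is a minimal toy sketch of the yellow frog idea from the bullets above: if the stimulus as a whole does not match prior experience, imagine away one part of the feature space, check whether the rest matches strongly, and then find the best word for the part that was imagined. Everything in it is an assumption made for illustration and not something taken from the DCC paper: the six-number feature vectors, the named feature groups (the "color" group is the same metaphor used above, not a claim that a CNN has a color-only feature), the stored captions, the attribute words, and the 0.95 threshold. Cosine similarity stands in for whatever confidence measure a real CNN/RNN pair would provide.

```python
import math

# Hypothetical feature layout: which vector indices belong to each named group.
GROUPS = {"shape": [0, 1], "texture": [2, 3], "color": [4, 5]}

# Hypothetical prior experience: caption -> feature vector (toy numbers).
EXPERIENCED = {
    "frog on a lily pad": [0.9, 0.1, 0.8, 0.2, 0.1, 0.9],  # color part is "green"
    "banana on a table": [0.1, 0.9, 0.2, 0.8, 0.9, 0.1],   # color part is "yellow"
}

# Hypothetical words for a single feature group, standing in for the language model.
ATTRIBUTE_WORDS = {"color": {"green": [0.1, 0.9], "yellow": [0.9, 0.1]}}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(features, keep):
    """Best experienced caption using only the feature indices in `keep`."""
    scored = [
        (cosine([features[i] for i in keep], [proto[i] for i in keep]), caption)
        for caption, proto in EXPERIENCED.items()
    ]
    similarity, caption = max(scored)
    return caption, similarity


def caption_with_imagination(features, threshold=0.95):
    """Return the evoked caption, or one composed after imagining away a group."""
    all_indices = list(range(len(features)))
    evoked, similarity = best_match(features, all_indices)
    if similarity >= threshold:
        return evoked  # the bottom-up answer is already confident

    best = None
    for group, indices in GROUPS.items():
        keep = [i for i in all_indices if i not in indices]
        candidate, cand_sim = best_match(features, keep)
        if cand_sim >= threshold and (best is None or cand_sim > best[0]):
            best = (cand_sim, candidate, group, indices)
    if best is None:
        return evoked  # no plausible narrative found; fall back to the evoked answer

    _, candidate, group, indices = best
    words = ATTRIBUTE_WORDS.get(group)
    if not words:
        return candidate
    # Describe the imagined-away part of the original stimulus on its own.
    part = [features[i] for i in indices]
    word = max(words, key=lambda w: cosine(part, words[w]))
    return f"{word} {candidate}"


if __name__ == "__main__":
    yellow_frog = [0.9, 0.1, 0.8, 0.2, 0.9, 0.1]  # frog shape/texture, banana color
    print(caption_with_imagination(yellow_frog))  # yellow frog on a lily pad
```

Note that the composed caption is still built from previously experienced pieces (a stored caption plus a stored word), which is exactly the limitation raised above. The sketch only shows how tracking which part was imagined lets the pieces be recombined.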

news summary (42)
