πŸ‡«πŸ‡· About PAISS πŸ”—

5-day summer school from 2-6 July 2018, held at Inria Grenoble, co-organized by Inria/Univ. Grenoble Alpes and NAVER LABS Europe. All the slides for the sessions have been uploaded on the official website. What follows are just my personal notes which I will be updating and plugging holes in as I go through the official slides. Feel free to comment!

Some statistics πŸ”—

44 nationalities, 25% women, 60% students, 15% academics, 25% professionals

Legend for the notes πŸ”—

Yann LeCun: πŸŒ‡ Deep learning: past, present, and future πŸ”—

VP & Chief AI Scientist @ FB, Professor @ NYU

Supervised learning πŸŽ“ is good for… πŸ”—

| input | output |
| --- | --- |
| speech | words |
| image | categories |
| portrait | name |
| photo | caption |
| text | topic |

Traditional ML πŸ”—

hand-engineered feature extractor -> trainable classifier

Deep learning πŸ”—

end-to-end learning, from low- to high-level features, to be fed into a trainable classifier, all with adaptable parameters

Timeline (Past and Present) πŸ”—

AI winter πŸ”—

πŸ—¨ “It’s alchemy, it never works… Only Yann can make it work.”

Early Work with DARPA πŸ”—

“Semantic segmentation with CNN”: submitted to CVPR, rejected by all 3 reviewers

CNNs πŸ”—

image segmentation πŸ”—

Other applications of CNNs πŸ”—

Reinforcement learning πŸ”—

Incredible success in games, like Go and Atari.

FB open source πŸ”—

πŸŒ… What’s missing? (The future) πŸ”—

Marry DL with reasoning via differentiable programming πŸ”—

Augmenting neural nets with memory modules, as a kind of hippocampus

πŸ•πŸ‘Ά Enormous knowledge/understanding from little observations… πŸ”—

Like how animals and humans learn a good world model

Senior scientist @ NAVER LABS

πŸ’¬ “It was a bit of a gamble, with the due date coming very soon” πŸ‘Ά

From a query image, represent it as a descriptor and find images with similar descriptors.

Lots of applications πŸ”—

Shopping via reverse image search, ambient intelligence (robots expected to interact with humans, which requires person re-identification)…

Inherent ambiguity πŸ”—

What does the user mean with a single query image? -> It’s application-dependent

The most studied task in the field, with the largest set of mature applications

Families of representations πŸ”—

β€œLegacy” methods πŸ”—

Deep representations πŸ”—

What worked for the landmark recognition challenge @ CVPR18 πŸ”—

CNN-based global features win πŸ”—

All have query expansion and/or diffusion πŸ”—

If we are confident that image A is very similar to query image Q, we can also return the images retrieved for A to augment the results for Q.
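A minimal sketch of the simplest variant, average query expansion (the cosine-similarity setup and the function name are my own assumptions, not from the talk):

```python
import numpy as np

def average_query_expansion(query, database, k=5):
    """Hypothetical sketch: expand a query descriptor with its top-k neighbours.

    query:    (d,) L2-normalized global descriptor
    database: (n, d) L2-normalized descriptors of the indexed images
    """
    sims = database @ query                      # cosine similarities to the query
    topk = np.argsort(-sims)[:k]                 # the k most similar images (our "A"s)
    expanded = query + database[topk].sum(axis=0)
    return expanded / np.linalg.norm(expanded)   # re-normalize, then search again with this
```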

Multi-resolution, ensembling were used πŸ”—

🚢🚷 Person re-identification πŸ”—

Identify the same person across different images of that person. This is challenging because the bounding boxes of the people are often not aligned, and may be cropped or occluded. Many recent approaches propose task-specific representations that aim to fix the alignment issue:

We can also use a global representation:

πŸ“‘ “In defense of global representations for person re-identification” πŸ”—

πŸ“‘ “Re-ID done right” πŸ”—

Uses Grad-CAM to figure out which areas of the image matter, as well as which dimensions.

Semantic retrieval πŸ”—

Instance-level retrieval works on single objects only, whereas semantic retrieval is about retrieving the full scene.

Summary πŸ”—

Still missing… πŸ”—

Questions πŸ”—

Contrastive loss (two streams) vs. triplet loss? πŸ”—

Which is better depends on the system; empirically there’s no clear winner.

A random open-source project that looks similar to their approach πŸ”—

Siamese network/triplet loss: O(N^3) for exhaustive search. How to improve this? πŸ”—

It’s in the implementation details, namely hard triplet mining. Refer to this open-source PyTorch implementation for guidance.
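As a rough illustration of batch-hard mining (my own simplification, not the linked repo’s exact code): instead of enumerating all \(O(N^3)\) triplets, mine the hardest positive and hardest negative for each anchor inside the mini-batch.

```python
import torch

def batch_hard_triplet_loss(emb, labels, margin=0.3):
    """emb: (B, d) embeddings, labels: (B,) integer identities."""
    dist = torch.cdist(emb, emb)                       # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-identity mask
    pos = dist.clone()
    pos[~same] = 0.0                                   # hardest positive: farthest same-id
    hardest_pos = pos.max(dim=1).values
    neg = dist.clone()
    neg[same] = float("inf")                           # hardest negative: closest other-id
    hardest_neg = neg.min(dim=1).values
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```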

Remaining questions πŸ”—

Andrew Zisserman: πŸŽ“ Self-supervised Learning πŸ”—

Computer vision, recovering 3D structure from image etc.

Why self-supervision? πŸ”—

What is self-supervision? πŸ”—

The task is to define a proxy loss and implicitly learn something of semantic significance. It’s often complementary to supervised learning, and combining the two improves performance.

πŸ“‘ Unsupervised visual representation learning by context prediction (Carl Doersch) πŸ”—

Proxy task: predict relative position of 2 subregions within a larger image.
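A sketch of how the training pairs might be generated (the patch size and gap are my guesses; the paper also adds jitter and color tricks to block trivial shortcuts):

```python
import random

# the 8 neighbour positions (row, col) around the central patch: this is the label
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_pair(image, patch=96, gap=48):
    """Return (center_patch, neighbour_patch, label in 0..7) from an HxWxC array.

    Assumes the image is at least 3 * (patch + gap) pixels on each side.
    The gap discourages the net from just matching patch boundaries.
    """
    h, w = image.shape[:2]
    step = patch + gap
    r = random.randint(step, h - 2 * step)
    c = random.randint(step, w - 2 * step)
    label = random.randrange(8)
    dr, dc = OFFSETS[label]
    center = image[r:r + patch, c:c + patch]
    neighbour = image[r + dr * step:r + dr * step + patch,
                      c + dc * step:c + dc * step + patch]
    return center, neighbour, label
```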

Using this proxy task (relative positioning), the network learns weights similar to those of an AlexNet trained with ImageNet supervision.

❓So the convolutional layers were trained in a self-supervised manner… but how did they train the final softmax classifier layer? Was it supervised?

Self-supervised learning from images πŸ”—

tricky details πŸ”—

You always need to check (in some creative way, I guess) whether the network is truly solving the problem *the way you want*.

Methods πŸ”—

proxy tasks πŸ”—

Self-supervised learning from videos πŸ”—

πŸ“‘ “Shuffle and Learn” (Misra, 2016) πŸ”—

❓Human pose estimation πŸ”—

Summary: lessons so far πŸ”—

Important to select/sample informative data in training πŸ”—

SSL using the arrow of time 🏹 (Donglai Wei, CVPR) πŸ”—

Coldplay’s β€œThe Scientist” music video πŸ”—

Chris Martin was mouthing all the words backwards!

Interestingly, this paper used the Flickr dataset πŸ”—

Task πŸ”—

Proxy task: predict if a video is playing forwards or backwards. What cues does the network use to make the prediction?

Some gotchas/trivial shortcuts πŸ”—

So we need to remove all black bars, and all zoom/tilt scenes, from the training samples.

πŸ‘€ SSL using temporal coherence of color πŸ³οΈβ€πŸŒˆ (Self-Supervised Tracking via Video Colorization) πŸ”—

Color is mostly temporally coherent. Proxy task: colorize all frames of a grayscale version of the video using a single colorized reference frame.
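The mechanism is a pointer: each target pixel attends to the reference frame and soft-copies its colors. A minimal sketch (my own notation; the f_* embeddings would come from a CNN over the grayscale frames):

```python
import torch
import torch.nn.functional as F

def colorize_from_reference(f_ref, c_ref, f_tgt, temperature=1.0):
    """f_ref: (N, d) reference-pixel embeddings, c_ref: (N, 3) their colors,
    f_tgt: (M, d) target-pixel embeddings. Returns (M, 3) predicted colors."""
    att = F.softmax(f_tgt @ f_ref.t() / temperature, dim=1)  # (M, N) pointer weights
    return att @ c_ref   # each target pixel is a weighted copy of reference colors
```

At test time the same attention map can point at labels or masks instead of colors, which is how tracking emerges, per the paper’s title.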

Cool visualization πŸ”—

Project embedding to 3 dimensions and plot as RGB

Self-supervised learning from videos with sound (audio-visual co-supervision) πŸ”—

πŸ“‘ Audio-visual embedding (AVE-Net)

There are two properties to tease out via proxy tasks:

Semantic consistency between sound and image πŸ”—

What looks like a drum will sound like a drum… so what does a drum sound like? Task: Does this single video frame match this sound snippet?

Synchronization between sound and image πŸ”—

When you hit the drum, a sound will play.

We end up learning… πŸ”—

πŸ“‘ “Objects that Sound”, ICCV 2017

Other examples of audio-visual co-supervision πŸ”—

Summary πŸ”—

Enables learning without supervision πŸ”—

Classification performance is on par with ImageNet-trained models.

SSL from videos with sound πŸ”—

Applicable to other domains with paired signals πŸ”—

❓Teething problems πŸ”—

❓Humans have replay buffers, when they’re sleeping, etc. πŸ”—

Questions πŸ”—

How do you come up with proxy tasks? πŸ”—

Do you go bottom-up…

Or do you go backwards…

πŸ›  Practical: πŸ–Ό Image retrieval πŸ”—

Problems with repurposed classification network πŸ”—

Models πŸ”—

Benchmark πŸ”—

Cordelia Schmid: πŸ“Ή Action Recognition πŸ”—

Inria Paris; IJCV editor-in-chief; program chair for CVPR; awarded a prize for fundamental contributions to CV

Automatic video understanding πŸ”—

Data πŸ”—

Evaluation πŸ”—

Possible outputs πŸ”—

Challenges πŸ”—

Why video? πŸ”—

Is the woman opening or closing the door? We need motion information; still images are not sufficient.

History πŸ”—

Johansson (1973) motion perception

Optical flow πŸ”—

Deriving a motion field from the apparent motion of brightness patterns in the image. Definition: the displacement of a point from frame \(t-1\) to frame \(t\).
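The usual starting point (not spelled out in my notes): assume a point keeps its brightness between frames and the motion \((u, v)\) is small; a first-order expansion gives

\[I(x + u, y + v, t + 1) \approx I(x, y, t) \quad \Rightarrow \quad I_x u + I_y v + I_t = 0\]

That is one equation per pixel with two unknowns, hence the aperture problem and the extra assumptions below.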

Data πŸ”—

MPI Sintel flow dataset

Assumptions πŸ”—

Approaches πŸ”—

Applications πŸ”—

Conclusion πŸ”—

Action Recognition Task 1: Action Localization πŸ”—

Approaches πŸ”—

Action Localization Datasets πŸ”—

Action Recognition Task 2: Action Classification πŸ”—

Nice AVA dataset πŸ”—

Atomic actions composed in a hierarchy

Human-action co-occurrence πŸ”—

Failure modes (“trivial shortcuts”) πŸ”—

| actual action | predicted action |
| --- | --- |
| reaching out arm | β€œhand shake” |
| covering mouth | β€œsmoking” |
| looking down | β€œwriting” |

It seems like the network is making its prediction solely based on the global pose, which is not good enough.

πŸ“‘ A flexible model for training action localization with varying levels of supervision (Cordelia Schmid) πŸ”—

Action-detection model πŸ”—

Evaluation πŸ”—

50% Intersection over Union (IoU) threshold
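For reference, a small IoU helper under the usual (x1, y1, x2, y2) box convention (my own snippet):

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union  # a detection counts as correct if this >= 0.5
```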

Basically faster R-CNN πŸ”—

An RGB ResNet-50 computes embeddings of frames, which are fed into a region proposal network.

Conclusion πŸ”—

Julien Mairal: πŸ“Š Large-Scale Optimization for Machine Learning πŸ”—

Research Scientist @ Inria Grenoble, multiple research awards.

We spend a lot of our time minimizing a cost function. How do we do this at large scale, e.g. for large recommender systems with hundreds of terabytes of data?

Supervised learning πŸ”—

Labels are in… πŸ”—

| labels in… | task |
| --- | --- |
| \(\{-1, 1\}\) | binary classification |
| \(\{1, \dots, k\}\) | multiclass |
| \(\mathbb{R}\) | regression |
| \(\mathbb{R}^n\) | multivariate regression |
| any general set | structured prediction |

With linear models, logreg, SVM… πŸ”—

With neural networks… πŸ”—

Finding optimal \(w\) involves minimizing a non-convex function.

Not just supervised learning πŸ”—

\[\min_f \; \frac{1}{n} \sum_{i=1}^{n} L(\hat y_i, y_i) + \omega(f)\] where \(\hat y_i = f(x_i)\) and \(\omega(f)\) is the regularization term.

Intro to statistical learning and gradient-based optimization πŸ”—

Setting πŸ”—

\[\min_h \; R(h) = \mathbb{E}_{(x, y) \sim P} \big[ L(y, h(x)) \big]\]

Nesterov’s acceleration (momentum) πŸ”—

On non-convex problems, it can get stuck at saddle points (flat regions)

Stochastic optimization πŸ”—

Variance-reduced stochastic gradient descent πŸ”—

Acceleration works in practice, but is poorly understood.
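A numpy sketch of one variance-reduction scheme, SVRG; the gradient oracle grad(w, i) is an assumption of this sketch:

```python
import numpy as np

def svrg(w, grad, n, lr=0.1, epochs=10):
    """grad(w, i) = gradient of example i's loss at w; grad(w, None) = full gradient."""
    for _ in range(epochs):
        w_snap = w.copy()
        full = grad(w_snap, None)        # one full gradient per epoch, at the snapshot
        for _ in range(n):
            i = np.random.randint(n)
            # still unbiased, but the variance shrinks as w approaches w_snap
            w = w - lr * (grad(w, i) - grad(w_snap, i) + full)
    return w
```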

Questions πŸ”—

Is there some theoretically optimal learning rate for neural networks, like \(\frac{1}{L}\) (with \(L\) the Lipschitz constant of the gradient)? πŸ”—

In practice we usually just try some learning rate, divide by 10, try again, divide by 10… lol.

Jean Ponce: 🏫 Weakly Supervised and Unsupervised Methods for Image and Video Interpretation πŸ”—

Inria and NYU; set up research labs at MIT, Stanford, UIUC; β€œpillar of the PRAIRIE Institute”

Problem with human labels πŸ”—

Expensive πŸ”—

MS COCO: 2.5M labelled instances in 328K images of 91 object categories, all painstakingly outlined.

Subjective labelling πŸ”—

Why not delineate individual grass blades? Or the man’s hat?

Overview πŸ”—

Weaker forms of supervision πŸ”—

Semi-supervised πŸ”—

Unsupervised πŸ”—

❓learning can be done by predicting things about the world πŸ”—

Cosegmentation of images πŸ”—

TV series come with metadata πŸ”—

πŸ“‘ β€œHello! My name is… Buffy” – Automatic Naming of Characters in TV Video

Temporal localization as classification πŸ”—

temporal action localization

(Cho, Kwak, Schmid, Ponce, 2015) πŸ”—

objects often co-occur with each other, e.g. the union of horse and fence… πŸ”—

matching model: probabilistic Hough matching πŸ”—

simple iterative algo πŸ”—

  1. Retrieve 10 nearest neighbours using some global descriptor (Oliva and Torralba, 2006)
  2. Match

fed movie scripts from The Bourne Identity; learns what a car is over… πŸ”—

“the dog looks like the pig looks like meeee… I’ll skip on that one” πŸ”—

yann πŸ”—

graph transformer networks; the second part of the paper, which nobody read; complicated…

we need to remember the world is 3-dimensional πŸ”—

‼️ architectures need to go beyond pattern recognition ⭐️

Martial Hebert: πŸ€– Robotics for Vision πŸ”—

progress in CV has not translated into progress for autonomous systems/robotics. why?

real world problems πŸ”—

πŸ“Ή monocular drone flight through a dense forest; any failure on a single frame could have catastrophic results

vastly reductive overview of typical system πŸ”—

input -> perception box -> interpretation -> reasoning & decision

However, even if your perception and interpretation are 95% accurate, that is still unacceptable in the real world. We need to be able to shortcut this pipeline.

introspection πŸ”—

Know when you don’t know. Evaluate whether the input (e.g. the visibility of the surroundings) is good enough to make a decision with.

input -> perception box -> interpretation -> reasoning & decision
 |                                                  ^
 |__________________________________________________|

best approach: πŸ”—

deliberative approach with

poor metrics πŸ”—

do you care if your point cloud is 3mm or 1mm accurate? or if every pixel in a semantic segmentation is correctly labeled? what matters more is how good the planned trajectory is, or the high-level shape of the segmentation.

multiple hypotheses πŸ”—

aka explainable predictions, structured uncertainty, …

perception box -> interpretation

output multiple likely and diverse hypotheses.

(figure sketch: likelihood plotted over the perception output, with several distinct peaks; each peak is one plausible hypothesis)

multiple decision strategies πŸ”—

aka adaptive planning, dynamic control strategy, …

reasoning & decision

anytime prediction πŸ”—

output a crude early result, and continuously improve it until interruption

| concept | explanation |
| --- | --- |
| Interruptibility | can give an answer at any time |
| Monotonicity | answers do not get worse over time |
| Diminishing returns | initial improvements are more useful |

conclusion πŸ”—

Nicolas Mansard: πŸƒ Humanoid Motion Planning πŸ”—

move like a human

problem formulation for end-to-end control πŸ”—

minimize the preview cost of the control sequence, subject to the given constraints

how do we integrate knowledge of the subproblems in robotics into AI? πŸ”—

Geometry πŸ”—

solve inverse geometry with…

Dynamics πŸ”—

Optimal control πŸ”—

Agile robots πŸ”—

Julien Perez: πŸ“– Machine Reading πŸ”—

multi documents answering, on Wikipedia πŸ”—

β€œWho invented neural networks?” β€œLeCun et al.”

motivations πŸ”—

applications πŸ”—

information extraction approaches πŸ”—

machine reading tasks πŸ”—

The bAbI project: ALL THE TESTS!

Models of reading πŸ”—

Retrieval models πŸ”—

applications πŸ”—

open questions πŸ”—


Emmanuel Dupoux: πŸ‘„ Speech and Natural Language Processing πŸ”—

currently at DeepMind, cognitive machine learning team; also Inria

duplex πŸ”—

levels of meaning πŸ”—

-insert photo here-

4 challenges πŸ”—

possible solutions πŸ”—

putting this in practice: deconstructing HAL πŸ”—

speech recognition (ASR) -> Acoustic modeling ->

summing up πŸ”—

Kyunghyun Cho: 🌏 Machine Translation πŸ”—

NYU and Research Scientist at FAIR; PhD at Aalto, postdoc at Montreal. Best known for work on neural machine translation.

πŸ’¬ “Two-thirds of what I wanted to cover has been covered already… So I decided, let’s revamp all the slides, make it fun, and also try to cover the latest research.”

History πŸ”—

Hierarchical structure by Borr πŸ”—

Borr, Hovy, Levin; morphological analysis, world structure

Rosenblatt (1962) πŸ”—

Allen (1987) πŸ”—

rudimentary CNN

Chrisman (1992) πŸ”—

Modern neural translation πŸ”—

Schwenk (2004), …: replace the whole statistical machine translation (SMT) pipeline with just a neural network

how do we represent a token? πŸ”—

How do we use a neural net? We apply an πŸ‘€ affine transform and a nonlinearity, stack them, and hope that more layers result in better representations, right? β€œWe don’t need to know anything about language to do NLP.”

Language modelling πŸ”—

Captures the distribution over all possible sentences.

\[p(X) = p(x_1, x_2, \dots, x_T) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_T \mid x_1, x_2, \dots, x_{T-1})\]

autoregressive language modelling πŸ”—

each conditional is a classifier: the distribution over the next token is based on all the previous tokens

count-based n-gram language models πŸ”—

estimate each conditional with MLE: just counting, like (# heads)/(# coin tosses)
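A bigram version as a sketch; the MLE really is just counting (my toy code):

```python
from collections import Counter

def bigram_lm(sentences):
    """MLE bigram model: p(w | prev) = count(prev, w) / count(prev)."""
    pairs, prevs = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            pairs[(prev, w)] += 1
            prevs[prev] += 1
    return lambda prev, w: pairs[(prev, w)] / prevs[prev] if prevs[prev] else 0.0

p = bigram_lm(["the cat sat", "the dog sat"])
print(p("the", "cat"))  # 0.5: "the" was followed by "cat" in 1 of 2 cases
```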

neural n-gram language model πŸ”—

infinite context recurrent language models (Mikolov, 2010) πŸ”—

conclusion πŸ”—

NMT πŸ”—

use an autoregressive language model as the decoder

RNN NMT πŸ”—

This model is super general and decoupled from language; it can be used for image captioning, speech recognition… In production, you need to do much more that is engineered towards translation itself, e.g. choosing the correct gender pronoun (done manually as post-processing or cleanup). Checking whether a sentence is grammatical or not is an unsolved problem, so the analyses are limited so far (but a spellchecker is often used as a post-processing step).

πŸ’¬ β€œI have 18 minutes and like 80 slides” πŸ˜‚

practice πŸ”—

check out Nematus, OpenNMT, FairSeq… πŸ”—

the term β€œNMT” was coined by β€œyours truly” at a poster in 2014

decoding from a recurrent language model πŸ”—

beam search is the de facto standard πŸ”—
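A minimal beam-search sketch over an autoregressive decoder; the step(prefix) interface returning next-token log-probabilities is an assumption of mine:

```python
def beam_search(step, bos, eos, beam=5, max_len=50):
    """step(prefix) -> {token: logprob}. Keeps the `beam` best prefixes per length
    instead of the single greedy choice."""
    alive = [([bos], 0.0)]
    done = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in alive:
            for tok, lp in step(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        alive = []
        for prefix, score in candidates[:beam]:
            (done if prefix[-1] == eos else alive).append((prefix, score))
        if not alive:
            break
    return max(done + alive, key=lambda c: c[1])[0]
```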

⭐️ “learning to decode” πŸ”—

neural network = forgetting machine

multilingual translation for low-resource languages πŸ”—

with universal encoders/decoders, the models became able to handle mid-sentence code-switching

limitations πŸ”—

model-agnostic meta-learning, MAML (Finn, 2017) πŸ”—

multi-task learning starts to overfit to the source task πŸ”—

with meta-learning it never overfits to the source task πŸ”—

meta-learning has lots of benefits in NMT and RL πŸ”—

LΓ©on Bottou: Artificial Intelligence, Unsupervised Learning, and Causation πŸ”—

what works? πŸ”—

Image recognition, NMT, RL in games

what doesn’t work? πŸ”—

training demands too much data πŸ”—

playing Atari games… after playing more games than any teenager could endure; playing Go… after playing more games than any grandmaster has played (actually, more than all of mankind πŸ˜‚)

the statistical problem is only a proxy πŸ”—

e.g. β€œgiving a phone call”, according to photos on the Internet, is something that just happens when a πŸ•Ί person is near a ☎️ phone. So our statistical models are completely missing the concept of β€œgiving a phone call”

structure does not help πŸ”—

two studies from Stanford in 2016, one with bigrams and one with recursive parse trees; replacing the parse tree with a random structure worked just as well

causation πŸ”—

relation between causation and unsupervised learning

causation and statistics πŸ”—

what is causation? πŸ”—

manipulations πŸ”—

causation can be inferred from the outcome of manipulations, e.g. β˜”οΈ if we ban all umbrellas, will the rain stop?

Reichenbach’s principle πŸ”—

Policy-based πŸ”—

Pontryagin’s maximum principle (1956); optimize the expected discounted sum of future rewards via gradient ascent
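In the REINFORCE form (standard notation, my addition; the talk’s exact formulation may differ), the ascent direction is

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid x_t)\, R_t\Big], \qquad R_t = \sum_{k \geq t} \gamma^{k-t} r_k\]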

Off-policy RL πŸ”—

behaviour policy \(\mu(a|x)\), target policy \(\pi(a|x)\)

credit assignment problem πŸ”—

Use importance sampling to re-weight the TD (temporal difference) errors of the target policy along trajectories collected under the behaviour policy. πŸ‘ unbiased estimate of \(Q^\pi\); ❌ large (possibly infinite) variance
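Concretely (standard notation, not copied from the slides): the per-step correction is the likelihood ratio, and along a trajectory the ratios multiply, which is exactly where the variance explodes:

\[\rho_t = \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}, \qquad w_{0:t} = \prod_{s=0}^{t} \rho_s\]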

comparison in terms of trace coefficients \(c_s\) πŸ”—

caveat πŸ”—

still, all of these algorithms fail on games with little feedback/reward, and cannot encode long-term dependencies

V-trace πŸ”—

Similar to Retrace, but learns a state-value function \(V\) instead of a state-action value function \(Q\). Introduced in πŸ“‘ IMPALA (2018)
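The V-trace target, as I recall it from the IMPALA paper, with truncated importance weights \(\rho_t = \min(\bar\rho, \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)})\) and \(c_i = \min(\bar c, \frac{\pi(a_i|x_i)}{\mu(a_i|x_i)})\):

\[v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \Big(\prod_{i=s}^{t-1} c_i\Big)\, \rho_t \big(r_t + \gamma V(x_{t+1}) - V(x_t)\big)\]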

πŸ›  Practical: πŸ›’ Reinforcement Learning by Criteo πŸ”—

Hugo Larochelle: Generalizing From Few Examples with Meta-Learning πŸ”—

Google Brain, few-shot learning

Exploit alternative data that is imperfect but plentiful πŸ”—

| data | type of learning |
| --- | --- |
| unlabeled data | unsupervised learning |
| multimodal data | multimodal learning |
| multidomain data | transfer learning, domain adaptation |

Unseen classes in multidomain data πŸ”—

Machines are getting better at it πŸ”—

πŸ“‘ Human-level concept learning through probabilistic program induction

Generating new (unseen) characters

Meta-learning πŸ”—

A pretty old topic dating back to Schmidhuber’s “Learning to learn” and neural networks that learn to modify their own weights.

What is meta-learning? πŸ”—

Learning to generalize; evaluated on never-before-seen problems/datasets

Components πŸ”—

- sequentially fit a Gaussian process
- multi-task Bayesian algorithms; transfer model selection between different problems
- Larochelle: coming up with a fixed learning algorithm
- ⭐️ a meta-learner that picks different modules depending on the dataset

πŸ“‘ Matching networks for one shot learning (2016) πŸ”—

πŸ“‘ Prototypical Networks for Few-shot learning (2017) πŸ”—

Use the same embedding function \(f\) for the test and training sets. The prototype of class \(k\) is the mean of its embedded support examples. πŸ‘€ If you’re using a Gaussian it will be Euclidean distance? Equivalent to training an embedding such that a Gaussian classifier would work well.
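In equations (from the paper, modulo my notation): the prototype of class \(k\) is the mean embedding of its support set \(S_k\), and classification is a softmax over negative distances:

\[c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f(x_i), \qquad p(y = k \mid x) = \frac{\exp(-d(f(x), c_k))}{\sum_{k'} \exp(-d(f(x), c_{k'}))}\]

With squared Euclidean \(d\) this is exactly the Gaussian-classifier view above.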

⭐️ suggests there’s room for improvement in generating episodes

πŸ“‘ Optimization as a Model for Few-shot learning (2017) πŸ”—

Using an LSTM-esque meta-learner to update parameters

πŸ“‘ MAML for Fast Adaptation of Deep Networks (Finn, 2017) πŸ”—

Can start with this rather than meta-learner LSTM

Bias transformation πŸ”—

⭐️ there’s a hierarchical Bayes interpretation of MAML’s one-step gradient update; from Berkeley as well

πŸ“‘ SNAIL (Mishra) πŸ”—

Cold-start item recommendation (a meta-learning perspective) πŸ”—

Low-shot learning from imaginary data (2018) πŸ”—

areas of improvement πŸ”—

Lourdes Agapito: πŸ’€ Deep learning for 3D human pose estimation πŸ”—

Professor @ UCL

Capturing 3D dynamic scenes πŸ”—

problems:

  1. unsupervised non-rigid surface reconstruction
  2. 3D human pose estimation from an image
  3. dynamic and semantic segmentation

challenges πŸ”—

Approaches for 3D human pose estimation from image πŸ”—

direct regression with CNN πŸ”—

pipeline πŸ”—

πŸ“‘ Convolutional Pose Machines (2016) πŸ”—

πŸ“‘ PPCA by Lourdes Agapito πŸ”—

image -> heatmap -> joint identification -> predicted belief maps -> 3D lifting -> project back to 2D belief maps. Loss on the fused predicted belief maps and the projected belief maps.

πŸ“‘ Direct, Dense, and Deformable: Template-Based Non-Rigid 3D Reconstruction from RGB Video πŸ”—

Synthesia πŸ”—

tracking faces with 3D models instead of vertices; you can modify the parameters and re-synthesize the faces

voice dubbing! πŸ”—

3D scene understanding πŸ”—

Mask R-CNN is performing really well on 80 categories from the COCO dataset

πŸ“‘ Mask Fusion (2018) πŸ”—

Conclusions πŸ”—

πŸ›  Practical: πŸ”₯ PyTorch Practical Session by FAIR πŸ”—

Conducted by Francisco Massa

Tensor (ndarray) library πŸ”—

np.ndarray <=> torch.Tensor

Automatic differentiation engine πŸ”—

computation done on-the-fly πŸ”—

gradients by automatic backpropagation through the graph πŸ”—
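A minimal define-by-run example (mine, not from the session): the graph is recorded as the operations execute, then backpropagated through.

```python
import torch

x = torch.randn(3, requires_grad=True)   # leaf tensors tracked by autograd
w = torch.randn(3, requires_grad=True)
loss = ((w * x).sum() - 1.0) ** 2        # the graph is built as these ops run
loss.backward()                          # backpropagate through the recorded graph
print(x.grad)                            # d(loss)/dx, a tensor with the same shape as x
```

Because the graph only exists at run time, you can set a breakpoint or print() anywhere in the forward pass, which leads into the next point.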

debugging πŸ”—