Long Term Dependencies in Sequence Learning
Johannes Kepler, the renowned astronomer behind the famous three laws of planetary motion, took years to arrive at his principles. A Neural Network (NN) could have predicted the planetary trajectories recorded by Tycho Brahe without requiring Kepler’s flash of insight to fit the data to an ellipse. Sir Isaac Newton, however, would have had a much harder time deriving his law of gravitation from such a generic model. But what gives NNs so much power that they learn from data at unprecedented speed while remaining so generic? It is optimization.
Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function f(x) by altering x. We usually phrase most optimization problems in terms of minimizing f(x). Maximization may be accomplished via a minimization algorithm by minimizing −f(x).
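As a minimal sketch of what "minimizing f(x) by altering x" means in practice, the snippet below runs plain gradient descent on a simple quadratic; the function, starting point, and step size are illustrative choices of mine, not from the article.

```python
# Minimize f(x) by repeatedly altering x in the direction that decreases f.

def f(x):
    return (x - 3.0) ** 2          # convex, unique minimum at x = 3

def grad_f(x):
    return 2.0 * (x - 3.0)         # derivative of f

x = 0.0                            # initial guess (arbitrary)
lr = 0.1                           # learning rate (step size)
for _ in range(100):
    x -= lr * grad_f(x)            # step against the gradient to decrease f

print(f"argmin of f is approximately {x:.4f}")   # ~3.0

# Maximizing g(x) = -(x - 3)^2 is the same problem as minimizing -g(x) = f(x),
# which is how maximization is handled with a minimization algorithm.
```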
Optimization in general is an extremely difficult task. Traditionally, machine learning has avoided the difficulty of general optimization by carefully designing the objective function and constraints to ensure that the optimization problem is convex. When training neural networks, we must confront the general non-convex case. Even convex optimization is not without its complications.
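To make the convex/non-convex distinction concrete, here is a small illustration (my own example, not from the article): on a non-convex function, the point that gradient descent converges to depends on where it starts, whereas a convex objective has a single basin.

```python
# Gradient descent on a non-convex function: the outcome depends on initialization.

def h(x):
    return x ** 4 - 3 * x ** 2 + x     # non-convex: two distinct local minima

def grad_h(x):
    return 4 * x ** 3 - 6 * x + 1

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad_h(x)
    return x

for x0 in (-2.0, 2.0):
    x_final = descend(x0)
    print(f"start {x0:+.1f} -> x = {x_final:+.3f}, h(x) = {h(x_final):.3f}")

# The two starting points settle into different minima with different objective
# values; a convex objective would not show this initialization dependence.
```

Neural network losses are non-convex in the weights for essentially this reason, only in millions of dimensions rather than one.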