Initializing the recurrent weight matrix of an RNN with the identity matrix, and using ReLU as the activation
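A minimal sketch of this setup, assuming a plain NumPy RNN cell (the function and variable names here are illustrative, not from the original work):

```python
import numpy as np

def init_irnn(hidden_size, input_size, rng):
    # The key idea: recurrent weights start as the identity matrix.
    # Input weights get a small random init; bias starts at zero.
    W_hh = np.eye(hidden_size)
    W_xh = rng.normal(0.0, 0.001, size=(hidden_size, input_size))
    b_h = np.zeros(hidden_size)
    return W_hh, W_xh, b_h

def irnn_step(h, x, W_hh, W_xh, b_h):
    # ReLU replaces the usual tanh. With W_hh = I and zero input
    # contribution, a nonnegative hidden state is copied forward
    # unchanged from step to step.
    return np.maximum(0.0, W_hh @ h + W_xh @ x + b_h)
```

At initialization this behaves like a long-memory cell: with zero inputs the hidden state is carried forward exactly, so gradients through the recurrence neither vanish nor explode at the start of training.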
The empirical results can be reproduced, but not easily: the training process is highly unstable.
On the toy problem, the optimizer first settles into a local optimum: if the network fails to learn how to compute the sum, always predicting 1 (roughly the mean of the targets) is still a good guess. It may be better to design the task so that its targets do not have a constant mean that can be exploited this way.
At some point, the optimizer hits an extremely steep region of the loss surface that produces huge gradients. Gradient clipping helps, but it is not a complete solution.
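Global-norm clipping, a common form of the mitigation mentioned above, can be sketched as follows (a NumPy illustration with hypothetical names, not the exact procedure used in the experiments):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients jointly so that their combined L2 norm
    # is at most max_norm; gradients below the threshold pass through.
    total_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]
```

This caps the size of any single update, but it does not remove the steep region itself, which is why clipping alone does not stabilize training.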