# Reference

## AutoGrad

Knet.AutoGrad - Module

Usage:

x = Param([1,2,3])          # user declares parameters with Param
x => P([1,2,3])             # Param is just a struct wrapping a value
value(x) => [1,2,3]         # value returns the thing wrapped
sum(x .* x) => 14           # Params act like regular values
y = @diff sum(x .* x)       # Except when we differentiate using @diff
y => T(14)                  # you get another struct
value(y) => 14              # which carries the same result
params(y) => [x]            # and the Params that it depends on
grad(y,x) => [2,4,6]        # and the gradients for all Params

Param(x) returns a struct that acts like x but marks it as a parameter you want to compute gradients with respect to.

@diff expr evaluates an expression and returns a struct that contains the result (which should be a scalar) and gradient information.

grad(y, x) returns the gradient of y (output by @diff) with respect to any parameter x::Param, or nothing if the gradient is 0.

value(x) returns the value associated with x if x is a Param or the output of @diff, otherwise returns x.

params(x) returns an iterator of Params found by a recursive search of object x.

Alternative usage:

x = [1 2 3]
f(x) = sum(x .* x)
f(x) => 14
gradloss(f)(x) => ([2 4 6], 14)

Given a scalar valued function f, grad(f,argnum=1) returns another function g which takes the same inputs as f and returns the gradient of the output with respect to the argnum'th argument. gradloss is similar except the resulting function also returns f's output.
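The gradient values quoted above can be sanity-checked without AutoGrad. The following is a minimal sketch in plain Julia that approximates the gradient of f by central differences; `numgrad` is a hypothetical helper introduced here for illustration and is not part of Knet or AutoGrad.

```julia
# Central-difference approximation of the gradient of a scalar function f at x.
# `numgrad` is an illustrative helper, not a Knet or AutoGrad function.
function numgrad(f, x::AbstractArray{<:Real}; h=1e-6)
    xf = float.(x)                      # work on a floating point copy
    g = zeros(Float64, size(xf))
    for i in eachindex(xf)
        old = xf[i]
        xf[i] = old + h
        fplus = f(xf)
        xf[i] = old - h
        fminus = f(xf)
        xf[i] = old                     # restore the entry
        g[i] = (fplus - fminus) / (2h)
    end
    return g
end

f(x) = sum(x .* x)
x = [1, 2, 3]
g = numgrad(f, x)    # close to 2x, the analytic gradient of sum(x .* x)
```

For this f the analytic gradient is 2x, so the approximation should agree with the [2,4,6] reported by grad above up to rounding error.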

## KnetArray

Knet.KnetArrays.KnetArray - Type
KnetArray{T}(undef,dims)
KnetArray(a::AbstractArray)
Array(k::KnetArray)

Container for GPU arrays that supports most of the AbstractArray interface. The constructor allocates a KnetArray in the currently active device, as specified by CUDA.device(). KnetArrays and Arrays can be converted to each other as shown above, which involves copying to and from the GPU memory. Only Float32/64 KnetArrays are fully supported.

KnetArrays use the CUDA.jl package for allocation and some operations. Currently some of the custom CUDA kernels that implement elementwise, broadcasting, and reduction operations for KnetArrays are faster than their CUDA.jl counterparts. Once these are improved in CUDA.jl, KnetArrays will be retired.

Supported functions:

• Indexing: getindex, setindex! with the following index types:

  • 1-D: Real, Colon, OrdinalRange, AbstractArray{Real}, AbstractArray{Bool}, CartesianIndex, AbstractArray{CartesianIndex}, EmptyArray, KnetArray{Int32} (low level), KnetArray{0/1} (using float for BitArray) (1-D includes linear indexing of multidimensional arrays)
  • 2-D: (Colon,Union{Real,Colon,OrdinalRange,AbstractVector{Real},AbstractVector{Bool},KnetVector{Int32}}), (Union{Real,AbstractUnitRange,Colon}...) (in any order)
  • N-D: (Real...)

• Array operations: ==, !=, adjoint, argmax, argmin, cat, convert, copy, copyto!, deepcopy, display, eachindex, eltype, endof, fill!, findmax, findmin, first, hcat, isapprox, isempty, length, ndims, one, ones, permutedims, pointer, rand!, randn!, reshape, similar, size, stride, strides, summary, transpose, vcat, vec, zero. (Boolean operators generate outputs with same type as inputs; no support for KnetArray{Bool}.)

• Unary functions with broadcasting: -, abs, abs2, acos, acosh, asin, asinh, atan, atanh, cbrt, ceil, cos, cosh, cospi, digamma, erf, erfc, erfcinv, erfcx, erfinv, exp, exp10, exp2, expm1, floor, gamma, lgamma, log, log10, log1p, log2, loggamma, one, round, sign, sin, sinh, sinpi, sqrt, tan, tanh, trigamma, trunc, zero

• Binary functions with broadcasting: !=, *, +, -, /, <, <=, ==, >, >=, ^, max, min

• Reduction operators: maximum, minimum, prod, sum

• Statistics: mean, std, stdm, var, varm

• Linear algebra: (*), axpy!, lmul!, norm, rmul!

• Knet extras: batchnorm, bce, bmm, cat1d, conv4, cpucopy, deconv4, dropout, elu, gpucopy, logistic, logp, logsoftmax, logsumexp, mat, nll, pool, relu, RNN, selu, sigm, softmax, unpool (Only 4D/5D, Float32/64 KnetArrays support conv4, pool, deconv4, unpool)


## File I/O

Knet.FileIO_gpu.save - Function
Knet.save(filename, args...; kwargs...)

Call FileIO.save after serializing Knet specific args.

File format is determined by the filename extension. JLD and JLD2 are supported. Other formats may work if supported by FileIO; please refer to the documentation of FileIO and the specific format. Example:

Knet.save("foo.jld2", "name1", value1, "name2", value2)

Knet.FileIO_gpu.load - Function
Knet.load(filename, args...; kwargs...)

Call FileIO.load then deserialize Knet specific values.

File format is determined by FileIO. JLD and JLD2 are supported. Other formats may work if supported by FileIO; please refer to the documentation of FileIO and the specific format. Example:

Knet.load("foo.jld2")           # returns a ("name"=>value) dictionary
Knet.load("foo.jld2", "name1")  # returns the value of "name1" in "foo.jld2"
Knet.load("foo.jld2", "name1", "name2")   # returns tuple (value1, value2)

Knet.FileIO_gpu.@save - Macro
Knet.@save "filename" variable1 variable2...

Save the values of the specified variables to filename in JLD2 format.

When called with no variable arguments, write all variables in the global scope of the current module to filename. See JLD2.

Knet.FileIO_gpu.@load - Macro
Knet.@load "filename" variable1 variable2...

Load the values of the specified variables from filename in JLD2 format.

When called with no variable arguments, load all variables in filename. See JLD2.


## Parameter initialization

Knet.Train20.param - Function
param(array; atype)
param(dims...; init, atype)
param0(dims...; atype)

The first form returns Param(atype(array)).

The second form returns a randomly initialized Param(atype(init(dims...))).

The third form param0 is an alias for param(dims...; init=zeros).

By default, init is xavier_uniform and atype is Knet.atype().

Knet.Train20.xavier - Function
xavier_uniform(a...; gain=1)
xavier(a...; gain=1)

Return uniform random weights in the range ± gain * sqrt(6 / (fanin + fanout)). The a arguments are passed to rand to specify type and dimensions. See (Glorot and Bengio 2010) or the PyTorch docs for a description. The function implements equation (16) of the referenced paper. Also known as Glorot initialization. The function xavier is an alias for xavier_uniform. See also xavier_normal.
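As a rough illustration of the range formula above, here is a minimal sketch in plain Julia for the 2-D case. It does not call Knet; `xavier_uniform_sketch` is a hypothetical name, and treating the rows of the matrix as fanout and the columns as fanin is an assumption made for this example (the bound is symmetric in the two, so the choice does not affect the result for matrices).

```julia
# Sketch of Xavier/Glorot uniform initialization for a weight matrix.
# Assumption: for a matrix of size (rows, cols), fanout = rows and fanin = cols.
function xavier_uniform_sketch(rows::Int, cols::Int; gain=1)
    fanout, fanin = rows, cols
    s = gain * sqrt(6 / (fanin + fanout))       # half-width of the sampling interval
    return (2 .* rand(rows, cols) .- 1) .* s    # uniform samples in [-s, s)
end

w = xavier_uniform_sketch(100, 400)
bound = sqrt(6 / 500)                           # expected half-width for this shape
```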

Knet.Train20.xavier_normal - Function
xavier_normal(a...; gain=1)

Return normally distributed random weights with mean 0 and std gain * sqrt(2 / (fanin + fanout)). The a arguments are passed to randn. See (Glorot and Bengio 2010) and PyTorch docs for a description. Also known as Glorot initialization. See also xavier_uniform.

Knet.Train20.gaussian - Function
gaussian(a...; mean=0.0, std=0.01)

Return a Gaussian array with a given mean and standard deviation. The a arguments are passed to randn.

Knet.Train20.bilinear - Function

Bilinear interpolation filter weights; used for initializing deconvolution layers.

Arguments:

T : Data Type

fw: Width upscale factor

fh: Height upscale factor

IN: Number of input filters

ON: Number of output filters

Example usage:

w = bilinear(Float32,2,2,128,128)


## Activation functions

Knet.Ops20.elu - Function
elu(x)

Return (x > 0 ? x : exp(x)-1).

Reference: Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs) (https://arxiv.org/abs/1511.07289).

Knet.Ops20.selu - Function
selu(x)

Return λ01 * (x > 0 ? x : α01 * (exp(x)-1)) where λ01=1.0507009873554805 and α01=1.6732632423543778.

Reference: Self-Normalizing Neural Networks (https://arxiv.org/abs/1706.02515).

Knet.Ops20.sigm - Function
sigm(x)

Return 1/(1+exp(-x)).

Reference: Numerically stable sigm implementation from http://timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick.
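The three formulas in this section can be written out directly for scalars. The sketch below is plain Julia with hypothetical `*_sketch` names; it is not the Knet implementation, and the branch used in `sigm_sketch` is only one common way to avoid exponentiating large positive numbers.

```julia
# Plain-Julia scalar versions of the activations, for illustration only.
elu_sketch(x) = x > 0 ? x : exp(x) - 1

function selu_sketch(x)
    λ01 = 1.0507009873554805
    α01 = 1.6732632423543778
    return λ01 * (x > 0 ? x : α01 * (exp(x) - 1))
end

# Sigmoid that only ever calls exp on a non-positive argument.
function sigm_sketch(x)
    if x >= 0
        return 1 / (1 + exp(-x))
    else
        z = exp(x)
        return z / (1 + z)
    end
end
```

Each function applies to arrays through broadcasting, e.g. `sigm_sketch.(randn(3))`.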


## Loss functions

Knet.Ops20.accuracy - Function
accuracy(scores, labels; dims=1, average=true)

Given an unnormalized scores matrix and an Integer array of correct labels, return the ratio of instances where the correct label has the maximum score. dims=1 means instances are in columns, dims=2 means instances are in rows. Use average=false to return the pair (ncorrect,count) instead of the ratio (ncorrect/count). Valid labels are integers in the range 1:numclasses; if labels[i] == 0, instance i is skipped.
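The counting rule can be sketched in plain Julia for the dims=1 layout. `accuracy_sketch` is a hypothetical helper written for this illustration and covers only the behaviour described in the paragraph above.

```julia
# Sketch of accuracy for dims=1: scores has size (classes, instances).
# Instances whose label is 0 are skipped.
function accuracy_sketch(scores::AbstractMatrix, labels::AbstractVector{<:Integer}; average=true)
    ncorrect = 0
    n = 0
    for (i, label) in enumerate(labels)
        label == 0 && continue
        n += 1
        ncorrect += (argmax(view(scores, :, i)) == label)
    end
    return average ? ncorrect / n : (ncorrect, n)
end

scores = [0.9 0.1 0.3;
          0.1 0.8 0.3;
          0.0 0.1 0.4]
labels = [1, 2, 0]                                    # the third instance is skipped
acc = accuracy_sketch(scores, labels)                 # ratio over the two counted instances
pair = accuracy_sketch(scores, labels; average=false) # (ncorrect, count)
```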

accuracy(model; data, dims=1, average=true, o...)

Compute the number of correct predictions of a model over a dataset:

accuracy(model(inputs; kwargs...), labels; dims) for (inputs,labels) in data

and return (ncorrect/count) if average=true or (ncorrect,count) if average=false where count is the number of instances not skipped (instances with label==0 are skipped) and ncorrect is the number of them correctly labeled by the model.

The model should be a function returning scores given inputs, and data should be an iterable of (inputs,labels) pairs. The valid labels should be integers in the range 1:numclasses, if labels[i] == 0, instance i is skipped.

Knet.Ops20.bce - Function
bce(scores, labels; average=true)

Computes binary cross entropy loss given predicted unnormalized scores and answer labels for a binary prediction task. Label values should be in {0,1}. Scores are unrestricted and will be converted to probabilities using

probs = 1 ./ (1 .+ exp.(-scores))

The loss calculated is

-(labels .* log.(probs) .+ (1 .- labels) .* log.(1 .- probs))

The return value is (total/count) if average=true and (total,count) if average=false where count is the number of instances and total is their total loss.

See also logistic which computes the same loss with {-1,1} labels.

Reference: https://towardsdatascience.com/nothing-but-numpy-understanding-creating-binary-classification-neural-networks-with-e746423c8d5c

Knet.Ops20.logistic - Function
logistic(scores, labels; average=true)

Computes logistic loss given predicted unnormalized scores and answer labels for a binary prediction task.

log.(1 .+ exp.(-labels .* scores))

Label values should be {-1,1}. Scores are unrestricted. The return value is (total/count) if average=true and (total,count) if average=false where count is the number of instances and total is their total loss.

See also bce which computes the same loss with {0,1} labels.
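That bce and logistic describe the same loss under a relabeling can be checked numerically. The sketch below evaluates the two per-instance formulas from this section in plain Julia; the function names `bce_terms` and `logistic_terms` are hypothetical and the averaging step is left out.

```julia
# Per-instance losses written directly from the two formulas in this section.
function bce_terms(scores, labels01)
    probs = 1 ./ (1 .+ exp.(-scores))
    return -(labels01 .* log.(probs) .+ (1 .- labels01) .* log.(1 .- probs))
end

logistic_terms(scores, labelspm) = log.(1 .+ exp.(-labelspm .* scores))

scores   = [2.0, -1.0, 0.5, -3.0]
labels01 = [1, 0, 0, 1]
labelspm = 2 .* labels01 .- 1          # map {0,1} labels to {-1,1}

a = bce_terms(scores, labels01)
b = logistic_terms(scores, labelspm)   # agrees with a up to rounding
```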

Reference: https://towardsdatascience.com/nothing-but-numpy-understanding-creating-binary-classification-neural-networks-with-e746423c8d5c

Knet.Ops20.logp - Function
softmax(x; dims=:)
logsoftmax(x; dims=:)

Treat entries in x as unnormalized log probabilities and return normalized (log) probabilities, i.e.

softmax(x; dims) = exp.(x) ./ sum(exp.(x); dims=dims)
logsoftmax(x; dims) = x .- log.(sum(exp.(x); dims=dims))

For numerical stability x = x .- maximum(x,dims=dims) is performed before exponentiation.

dims is an optional argument, if not specified the normalization is over the whole x, otherwise the normalization is performed over the given dimensions. In particular, if x is a matrix, dims=1 normalizes columns of x and dims=2 normalizes rows of x.
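To make the role of dims and of the maximum subtraction concrete, here is a minimal plain-Julia sketch. The `*_sketch` functions are hypothetical stand-ins written from the formulas above, not the Knet implementations.

```julia
# Softmax and log-softmax with the max-subtraction trick, for any dims.
function softmax_sketch(x; dims=:)
    z = x .- maximum(x; dims=dims)     # largest entry becomes 0, so exp cannot overflow
    e = exp.(z)
    return e ./ sum(e; dims=dims)
end

function logsoftmax_sketch(x; dims=:)
    z = x .- maximum(x; dims=dims)
    return z .- log.(sum(exp.(z); dims=dims))
end

x = [1.0 2.0; 3.0 1000.0]
p1 = softmax_sketch(x; dims=1)   # each column sums to 1
p2 = softmax_sketch(x; dims=2)   # each row sums to 1
pall = softmax_sketch(x)         # all entries together sum to 1
```

The entry 1000.0 would overflow in a direct `exp.(x)`, which is why the shift by the maximum matters.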

Knet.Ops20.logsumexp - Function
logsumexp(x;dims=:)

Compute log(sum(exp(x);dims)) in a numerically stable manner.

dims is an optional argument, if not specified the summation is over the whole x, otherwise the summation is performed over the given dimensions. In particular if x is a matrix, dims=1 sums columns of x and dims=2 sums rows of x.
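The need for a numerically stable formulation shows up as soon as the inputs are large. The following plain-Julia sketch (hypothetical name `logsumexp_sketch`, not the Knet code) uses the standard shift by the maximum and compares it with the direct formula.

```julia
# log(sum(exp(x))) computed after shifting by the maximum.
function logsumexp_sketch(x; dims=:)
    m = maximum(x; dims=dims)
    return m .+ log.(sum(exp.(x .- m); dims=dims))
end

x = [1000.0, 1000.0]
naive = log(sum(exp.(x)))       # exp(1000.0) overflows, so this is Inf
stable = logsumexp_sketch(x)    # 1000 + log(2)

cols = logsumexp_sketch([0.0 1.0; 0.0 1.0]; dims=1)   # one value per column
```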

Knet.Ops20.nll - Function
nll(scores, labels; dims=1, average=true)

Return the negative log likelihood for a single batch of data given an unnormalized scores matrix and an Integer array of correct labels. The scores matrix should have size (classes,instances) if dims=1 or (instances,classes) if dims=2. labels[i] should be in 1:classes to indicate the correct class for instance i, or 0 to skip instance i.

The return value is (total/count) if average=true and (total,count) if average=false where count is the number of instances not skipped (i.e. label != 0) and total is their total negative log likelihood.

Example

Let's assume that there are three classes (cat, dog, ostrich) and just 2 instances with the unnormalized scores scores[:,1] and scores[:,2] respectively. The first instance is actually a cat and the second instance a dog:

scores = [12.2    0.3;
           2.0   21.5;
           0.0  -21.0]
labels = [1, 2]
nll(scores,labels)
# returns 2.1657e-5

The probabilities are derived from the scores and the negative log-probabilities corresponding to the labels are averaged:

probabilities = exp.(scores) ./ sum(exp.(scores),dims=1)
-(log(probabilities[labels[1],1]) + log(probabilities[labels[2],2]))/2
# returns 2.1657e-5

nll(model; data, dims=1, average=true, o...)

Compute the negative log likelihood for a model over a dataset:

nll(model(inputs; kwargs...), labels; dims) for (inputs,labels) in data

and return (total/count) if average=true or (total,count) if average=false where count is the number of instances not skipped (instances with label==0 are skipped) and total is their total negative log likelihood.

The model should be a function returning scores given inputs, and data should be an iterable of (inputs,labels) pairs. The valid labels should be integers in the range 1:numclasses, if labels[i] == 0, instance i is skipped.


## Convolution and Pooling

Knet.Ops20.conv4 - Function
conv4(w, x; kwargs...)

Execute convolutions or cross-correlations using filters specified with w over tensor x.

If w has dimensions (W1,W2,...,Cx,Cy) and x has dimensions (X1,X2,...,Cx,N), the result y will have dimensions (Y1,Y2,...,Cy,N) where Cx is the number of input channels, Cy is the number of output channels, N is the number of instances, and Wi,Xi,Yi are spatial dimensions with Yi determined by:

Yi = 1 + floor((Xi + 2*padding[i] - ((Wi-1)*dilation[i] + 1)) / stride[i])

padding, stride and dilation are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.
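The size formula is easy to evaluate by hand for one spatial dimension. The sketch below does so in plain Julia; `conv_outsize` is a hypothetical helper introduced for this illustration and is not a Knet function.

```julia
# Spatial output size of a convolution along one dimension, following the Yi formula.
function conv_outsize(X::Int, W::Int; padding::Int=0, stride::Int=1, dilation::Int=1)
    effective = (W - 1) * dilation + 1            # filter extent after dilation
    return 1 + fld(X + 2 * padding - effective, stride)
end

a = conv_outsize(28, 5)                 # 24
b = conv_outsize(28, 5; padding=2)      # 28, the input size is preserved
c = conv_outsize(28, 3; dilation=2)     # 24, effective filter width is 5
d = conv_outsize(28, 5; stride=2)       # 12
```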

Keywords

• padding=0: the number of extra zeros implicitly concatenated at the start and end of each dimension.
• stride=1: the number of elements to slide to reach the next filtering window.
• dilation=1: dilation factor for each dimension.
• mode=0: 0 for convolution and 1 for cross-correlation (the two differ in whether the filter is flipped).
• alpha=1: can be used to scale the result.
• group=1: can be used to perform grouped convolutions.

Knet.Ops20.deconv4 - Function
deconv4(w, x; kwargs...)

Simulate 4-D deconvolution using the transposed convolution operation. Its forward pass is equivalent to the backward pass of a convolution (gradients with respect to the input tensor). Likewise, its backward pass (gradients with respect to the input tensor) is equivalent to the forward pass of a convolution. Since it swaps the forward and backward passes of the convolution operation, the padding and stride options apply to the output tensor. See this report for further explanation.

If w has dimensions (W1,W2,...,Cy,Cx) and x has dimensions (X1,X2,...,Cx,N), the result y=deconv4(w,x) will have dimensions (Y1,Y2,...,Cy,N) where

Yi = (Xi - 1)*stride[i] + ((Wi-1)*dilation[i] + 1) - 2*padding[i]

Here Cx is the number of x channels, Cy is the number of y channels, N is the number of instances, and Wi,Xi,Yi are spatial dimensions. Padding and stride are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.
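As a quick check of the formula, the sketch below evaluates it in plain Julia for sizes that a convolution with the same settings would have produced. `deconv_outsize` is a hypothetical helper, not part of Knet.

```julia
# Spatial output size of a transposed convolution along one dimension.
deconv_outsize(X::Int, W::Int; padding::Int=0, stride::Int=1, dilation::Int=1) =
    (X - 1) * stride + ((W - 1) * dilation + 1) - 2 * padding

# A 5-wide convolution maps 28 to 24; the transposed operation maps 24 back to 28.
a = deconv_outsize(24, 5)
# A 4-wide convolution with stride 2 and padding 1 maps 28 to 14; this maps 14 back to 28.
b = deconv_outsize(14, 4; stride=2, padding=1)
```

The round trip recovers the original size only when the convolution's floor division discarded no remainder.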

Keywords

• padding=0: the number of extra zeros implicitly concatenated at the start and at the end of each dimension.
• stride=1: the number of elements to slide to reach the next filtering window.
• mode=0: 0 for convolution and 1 for cross-correlation.
• alpha=1: can be used to scale the result.
• handle: handle to a previously created cuDNN context. Defaults to a Knet allocated handle.
• group=1: can be used to perform grouped convolutions.

Knet.Ops20.pool - Function
pool(x; kwargs...)

Compute pooling of input values (i.e., the maximum or average of several adjacent values) to produce an output with smaller height and/or width.

If x has dimensions (X1,X2,...,Cx,N), the result y will have dimensions (Y1,Y2,...,Cx,N) where

Yi=1+floor((Xi+2*padding[i]-window[i])/stride[i])

Here Cx is the number of input channels, N is the number of instances, and Xi,Yi are spatial dimensions. window, padding and stride are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.

Keywords:

• window=2: the pooling window size for each dimension.
• padding=0: the number of extra zeros implicitly concatenated at the start and at the end of each dimension.
• stride=window: the number of elements to slide to reach the next pooling window.
• mode=0: 0 for max, 1 for average including padded values, 2 for average excluding padded values, 3 for deterministic max.
• maxpoolingNanOpt=1: Nan numbers are not propagated if 0, they are propagated if 1.
• alpha=1: can be used to scale the result.

## Recurrent neural networks

Knet.Ops20.RNN - Type
rnn = RNN(inputSize, hiddenSize; opts...)
rnn(x; batchSizes) => y
rnn.h, rnn.c  # hidden and cell states

RNN returns a callable RNN object rnn. Given a minibatch of sequences x, rnn(x) returns y, the hidden states of the final layer for each time step. rnn.h and rnn.c fields can be used to set the initial hidden states and read the final hidden states of all layers. Note that the final time step of y always contains the final hidden state of the last layer, equivalent to rnn.h for a single layer network.

Dimensions: The input x can be 1, 2, or 3 dimensional and y will have the same number of dimensions as x. size(x)=(X,[B,T]) and size(y)=(H/2H,[B,T]) where X is inputSize, B is batchSize, T is seqLength, H is hiddenSize, 2H is for bidirectional RNNs. By default a 1-D x represents a single instance for a single time step, a 2-D x represents a single minibatch for a single time step, and a 3-D x represents a sequence of identically sized minibatches for multiple time steps. The output y gives the hidden state (of the final layer for multi-layer RNNs) for each time step. The fields rnn.h and rnn.c represent the hidden states of all layers in a single time step and have size (H,B,L/2L) where L is numLayers and 2L is for bidirectional RNNs.

batchSizes: If batchSizes=nothing (default), all sequences in a minibatch are assumed to be the same length. If batchSizes is an array of (non-increasing) integers, it gives us the batch size for each time step (allowing different sequences in the minibatch to have different lengths). In this case x will typically be 2-D with the second dimension representing variable size batches for time steps. If batchSizes is used, sum(batchSizes) should equal length(x) ÷ size(x,1). When the batch size is different in every time step, hidden states will have size (H,B,L/2L) where B is always the size of the first (largest) minibatch.
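One way to picture batchSizes is to derive it from the lengths of the sequences in a minibatch. The sketch below is plain Julia; `batchsizes_from_lengths` is a hypothetical helper (not a Knet function) and assumes the sequences are already sorted from longest to shortest.

```julia
# Number of sequences still active at each time step, given sorted sequence lengths.
function batchsizes_from_lengths(lengths::Vector{Int})
    issorted(lengths; rev=true) || error("lengths must be non-increasing")
    return [count(l -> l >= t, lengths) for t in 1:first(lengths)]
end

lengths = [5, 3, 2]                      # three sequences in one minibatch
bs = batchsizes_from_lengths(lengths)    # [3, 3, 2, 1, 1]
total = sum(bs)                          # total number of time-step columns in x
```

With these values a 2-D input x would have sum(bs) columns, matching the requirement that sum(batchSizes) equals length(x) ÷ size(x,1).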

Hidden states: The hidden and cell states are kept in rnn.h and rnn.c fields (the cell state is only used by LSTM). They can be initialized during construction using the h and c keyword arguments, or modified later by direct assignment. Valid values are nothing (default), 0, or an array of the right type and size possibly wrapped in a Param. If the value is nothing the initial state is assumed to be zero and the final state is discarded keeping the value nothing. If the value is 0 the initial state is assumed to be zero and 0 is replaced by the final state on return. If the value is a valid state, it is used as the initial state and is replaced by the final state on return.

In a differentiation context the returned final hidden states will be wrapped in Result types. This is necessary if the same RNN object is to be called multiple times in a single iteration. Between iterations (i.e. after diff/update) the hidden states need to be unboxed with e.g. rnn.h = value(rnn.h) to prevent spurious dependencies. This happens automatically during the backward pass for GPU RNNs but needs to be done manually for CPU RNNs. See the CharLM Tutorial for an example.

Keyword arguments for RNN:

• h=nothing: Initial hidden state.
• c=nothing: Initial cell state.
• rnnType=:lstm: Type of RNN, one of :relu, :tanh, :lstm, :gru.
• numLayers=1: Number of RNN layers.
• bidirectional=false: Create a bidirectional RNN if true.
• dropout=0: Dropout probability. Applied to input and between layers.
• skipInput=false: Do not multiply the input with a matrix if true.
• algo=0: Algorithm to use, see CUDNN docs for details.
• seed=0: Random number seed for dropout. Uses time() if 0.
• winit=xavier: Weight initialization method for matrices.
• binit=zeros: Weight initialization method for bias vectors.
• finit=ones: Weight initialization method for the bias of forget gates.
• atype=Knet.atype(): array type for model weights.

Formulas: RNNs compute the output h[t] for a given iteration from the recurrent input h[t-1] and the previous layer input x[t] given matrices W, R and biases bW, bR from the following equations:

:relu and :tanh: Single gate RNN with activation function f:

h[t] = f(W * x[t] .+ R * h[t-1] .+ bW .+ bR)
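The single gate recurrence can be unrolled over a 3-D input in a few lines. The following plain-Julia sketch uses arbitrary sizes and random weights chosen for illustration; `rnn_tanh_sketch` is a hypothetical function and does not reproduce Knet's weight layout or GPU code path.

```julia
# Unroll a single-layer tanh RNN over all time steps of x with size (X, B, T).
function rnn_tanh_sketch(W, R, bW, bR, x, h)
    H, B, T = size(W, 1), size(x, 2), size(x, 3)
    y = zeros(H, B, T)
    for t in 1:T
        h = tanh.(W * x[:, :, t] .+ R * h .+ bW .+ bR)
        y[:, :, t] = h
    end
    return y, h                          # all hidden states and the final one
end

X, H, B, T = 4, 3, 2, 5                  # input size, hidden size, batch size, time steps
W, R = randn(H, X), randn(H, H)
bW, bR = zeros(H), zeros(H)
x = randn(X, B, T)
y, hfinal = rnn_tanh_sketch(W, R, bW, bR, x, zeros(H, B))
```

Here size(y) is (H,B,T) and the last time slice of y equals the final hidden state, mirroring the relationship between y and rnn.h described for single layer networks.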

:gru: Gated recurrent unit:

i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
r[t] = sigm(Wr * x[t] .+ Rr * h[t-1] .+ bWr .+ bRr) # reset gate
n[t] = tanh(Wn * x[t] .+ r[t] .* (Rn * h[t-1] .+ bRn) .+ bWn) # new gate
h[t] = (1 .- i[t]) .* n[t] .+ i[t] .* h[t-1]

:lstm: Long short term memory unit with no peephole connections:

i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
f[t] = sigm(Wf * x[t] .+ Rf * h[t-1] .+ bWf .+ bRf) # forget gate
o[t] = sigm(Wo * x[t] .+ Ro * h[t-1] .+ bWo .+ bRo) # output gate
n[t] = tanh(Wn * x[t] .+ Rn * h[t-1] .+ bWn .+ bRn) # new gate
c[t] = f[t] .* c[t-1] .+ i[t] .* n[t]               # cell output
h[t] = o[t] .* tanh(c[t])

Knet.Ops20.rnnparam - Function
rnnparam(r::RNN, layer, id, param)

Return a single weight matrix or bias vector as a slice of RNN weights.

Valid layer values:

• For unidirectional RNNs 1:numLayers
• For bidirectional RNNs 1:2*numLayers, forw and back layers alternate.

Valid id values:

• For RELU and TANH RNNs, input = 1, hidden = 2.
• For GRU reset = 1,4; update = 2,5; newmem = 3,6; 1:3 for input, 4:6 for hidden
• For LSTM inputgate = 1,5; forget = 2,6; newmem = 3,7; output = 4,8; 1:4 for input, 5:8 for hidden

Valid param values:

• Return the weight matrix (transposed!) if param==1.
• Return the bias vector if param==2.

The effect of skipInput: Let I=1 for RELU/TANH, 1:3 for GRU, 1:4 for LSTM

• For skipInput=false (default), rnnparam(r,1,I,1) is a (inputSize,hiddenSize) matrix.
• For skipInput=true, rnnparam(r,1,I,1) is nothing.
• For bidirectional, the same applies to rnnparam(r,2,I,1): the first back layer.
• The input biases (param=2) are returned even if skipInput=true.

Knet.Ops20.rnnparams - Function
rnnparams(r::RNN)

Return the RNN parameters as an Array{Any}.

The order of params returned (subject to change):

• All weight matrices come before all bias vectors.
• Matrices and biases are sorted lexically based on (layer,id).
• See @doc rnnparam for valid layer and id values.
• Input multiplying matrices are nothing if r.inputMode = 1.

## Batch Normalization

Knet.Ops20.batchnorm - Function
batchnorm(x[, moments, params]; kwargs...)

Perform batch normalization on x with optional mean and variance in moments and scaling factor and bias in params. See https://arxiv.org/abs/1502.03167 for reference.

2d, 4d and 5d inputs are supported. Mean and variance are computed over dimensions (2,), (1,2,4) and (1,2,3,5) for 2d, 4d and 5d arrays, respectively.
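The reduction dimensions for the 4d case can be illustrated in plain Julia with the Statistics standard library. `batchnorm4_sketch` is a hypothetical function for this example; it omits the affine parameters and running moments, and the use of the uncorrected (biased) variance is an assumption of the sketch rather than a statement about Knet's implementation.

```julia
using Statistics

# Normalize a 4-D array of size (H, W, C, N) per channel: statistics over dims (1,2,4).
function batchnorm4_sketch(x::AbstractArray{<:Real,4}; eps=1e-5)
    mu = mean(x; dims=(1, 2, 4))                               # size (1, 1, C, 1)
    sigma2 = var(x; dims=(1, 2, 4), corrected=false, mean=mu)  # assumed biased variance
    return (x .- mu) ./ sqrt.(sigma2 .+ eps)
end

x = randn(5, 5, 3, 8) .* 4 .+ 10
y = batchnorm4_sketch(x)
channel_means = mean(y; dims=(1, 2, 4))    # close to 0 for every channel
```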

moments stores running mean and variance to be used at inference time. It is optional in training mode, but mandatory in test mode. Training and test modes can be controlled by the training keyword argument which defaults to Knet.training().

params stores the optional affine parameters gamma and beta. bnparams function can be used to initialize params.

Example

# Initialization, C is an integer
moments = bnmoments()
params = bnparams(C)
...
# size(x) -> (H, W, C, N)
y = batchnorm(x, moments, params)
# size(y) -> (H, W, C, N)

Keywords

eps=1e-5: The epsilon parameter added to the variance to avoid division by 0.

training=Knet.training(): When training is true, the mean and variance of x are used and moments argument is modified if it is provided. When training is false, mean and variance stored in the moments argument are used.

Knet.Ops20.bnmoments - Function
bnmoments(;momentum=0.1, mean=nothing, var=nothing, meaninit=zeros, varinit=ones)

Return a BNMoments object, a data structure used to store running mean and running variance of batch normalization with the following fields:

• momentum=0.1: A real number between 0 and 1 to be used as the scale of the last mean and variance. The existing running mean or variance is multiplied by (1-momentum).

• mean=nothing: The running mean.

• var=nothing: The running variance.

• meaninit=zeros: The function used to initialize the running mean. Should either be nothing or of the form ([eltype], dims...)->data. zeros is a good option.

• varinit=ones: The function used to initialize the running variance. Should either be nothing or of the form ([eltype], dims...)->data. ones is a good option.

This constructor can be used directly to load moments from data. meaninit and varinit are called if mean and var are nothing. The type and size of mean and var are determined automatically from the inputs in the batchnorm calls.

Knet.Ops20.bnparams - Function
bnparams(etype, channels::Integer)

Return a single 1d array that contains both scale and bias of batchnorm, where the first half is scale and the second half is bias.

bnparams(channels) calls bnparams(Float64, channels), following Julia convention.


## Model optimization

Knet.Train20.minimize - Function
minimize(func, data, optimizer=Adam(); params)
sgd     (func, data; lr=0.1,  gclip, params)
momentum(func, data; lr=0.05, gamma=0.95, gclip, params)
nesterov(func, data; lr=0.05, gamma=0.95, gclip, params)
rmsprop (func, data; lr=0.01, rho=0.9, eps=1e-6, gclip, params)
adam    (func, data; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, gclip, params)

Return an iterator which applies func to arguments in data, i.e. (func(args...) for args in data), and updates the parameters every iteration to minimize func. func should return a scalar value.

The common keyword argument params can be used to list the Params to be optimized. If not specified, any Param that takes part in the computation of func(args...) will be updated.

The common keyword argument gclip can be used to implement per-parameter gradient clipping. For a parameter gradient g, if norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If not specified no gradient clipping is performed.
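The clipping rule amounts to a rescaling of the gradient. Below is a minimal plain-Julia sketch of that rule; `clip_gradient` is a hypothetical helper written from the description above and is not the function Knet uses internally.

```julia
using LinearAlgebra

# Rescale g to norm gclip when its norm exceeds gclip; otherwise return it unchanged.
function clip_gradient(g::AbstractArray, gclip::Real)
    gnorm = norm(g)
    if gnorm > gclip > 0
        return g .* (gclip / gnorm)
    end
    return g
end

g = [3.0, 4.0]                        # norm 5
clipped = clip_gradient(g, 1.0)       # rescaled to norm 1
unchanged = clip_gradient(g, 10.0)    # norm is below the threshold
```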

These functions do not perform optimization, but return an iterator that can. Any function that produces values from an iterator can be used with such an object, e.g. progress!(sgd(f,d)) iterates the sgd optimizer and displays a progress bar. For convenience, appending ! to the name of the function iterates and returns nothing, i.e. sgd!(...) is equivalent to (for x in sgd(...) end).

We define optimizers as lazy iterators to have explicit control over them:

• To report progress use progress(sgd(f,d)).
• To run until convergence use converge(sgd(f,cycle(d))).
• To run multiple epochs use sgd(f,repeat(d,n)).
• To run a given number of iterations use sgd(f,take(cycle(d),n)).
• To do a task every n iterations use (task() for (i,j) in enumerate(sgd(f,d)) if i%n == 1).
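The iterator combinations in this list rely on standard lazy iterators. The sketch below shows the same patterns on a stand-in dataset using only Base.Iterators; no optimizer is involved, and `Iterators.flatten(Iterators.repeated(d, n))` is used here as a standard-library way to express n passes over d, which is an assumption of this example rather than how Knet implements repeat(d,n).

```julia
# Stand-in for a dataset of three minibatches.
d = [:batch1, :batch2, :batch3]

# Exactly 7 items, cycling through d as often as needed.
seven = collect(Iterators.take(Iterators.cycle(d), 7))

# Two full passes ("epochs") over d.
two_epochs = collect(Iterators.flatten(Iterators.repeated(d, 2)))

# Select every 3rd iteration, mirroring the `i % n == 1` pattern in the list.
every_third = [i for (i, item) in enumerate(seven) if i % 3 == 1]
```

Nothing is computed until the iterator is consumed, which is what makes it possible to wrap an optimizer in progress, converge, or take before running it.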

These functions apply the same algorithm with the same configuration to every parameter by default. minimize takes an explicit optimizer argument, all others call minimize with an appropriate optimizer argument (see @doc update! for a list of possible optimizers). Before calling update! on a Param, minimize sets its opt field to a copy of this default optimizer if it is not already set. The opt field is used by the update! function to determine the type of update performed on that parameter. If you need finer grained control, you can set the optimizer of an individual Param by setting its opt field before calling one of these functions. They will not override the opt field if it is already set, e.g. sgd(model,data) will perform an Adam update for a parameter whose opt field is an Adam object. This also means you can stop and start the training without losing optimization state, the first call will set the opt fields and the subsequent calls will not override them.

Given a parameter w and its gradient g here are the updates applied by each optimizer:

# sgd (http://en.wikipedia.org/wiki/Stochastic_gradient_descent)
w .= w - lr * g

# momentum (http://jlmelville.github.io/mize/nesterov.html)
v .= gamma * v - lr * g
w .= w + v

# nesterov (http://jlmelville.github.io/mize/nesterov.html)
w .= w - gamma * v
v .= gamma * v - lr * g
w .= w + (1 + gamma) * v

# adagrad
G .= G + g .^ 2
w .= w - lr * g ./ sqrt(G + eps)

# rmsprop (http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
G .= rho * G + (1-rho) * g .^ 2
w .= w - lr * g ./ sqrt(G + eps)

# adadelta
G .= rho * G + (1-rho) * g .^ 2
update = sqrt(delta + eps) .* g ./ sqrt(G + eps)
w = w - lr * update
delta = rho * delta + (1-rho) * update .^ 2

# adam
v = beta1 * v + (1 - beta1) * g
G = beta2 * G + (1 - beta2) * g .^ 2
vhat = v ./ (1 - beta1 ^ t)
Ghat = G ./ (1 - beta2 ^ t)
w = w - (lr / (sqrt(Ghat) + eps)) * vhat

Knet.Train20.converge - Function
converge(itr; alpha=0.1)

Return an iterator which acts exactly like itr, but quits when values from itr stop decreasing. itr should produce numeric values.

It can be used to train a model with the data cycled:

progress!(converge(minimize(model,cycle(data))))

alpha controls the exponential average of values to detect convergence. Here is how convergence is decided:

p = x - avgx
avgx = c.alpha * x + (1-c.alpha) * avgx
avgp = c.alpha * p + (1-c.alpha) * avgp
avgp > 0.0 && return nothing

converge!(...) is equivalent to (for x in converge(...) end), i.e. iterates over the object created by converge(...) and returns nothing.
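The stopping rule can be traced on a plain sequence of numbers. In the sketch below, `converge_sketch` is a hypothetical function that applies the four update lines shown above; starting avgx at the first value and avgp at 0 is an assumption of this example, since the initial values are not stated in the text.

```julia
# Return the values that are passed through before the rule stops the iteration.
# Assumption: avgx starts at the first value and avgp starts at 0.
function converge_sketch(values; alpha=0.1)
    out = Float64[]
    avgx, avgp = Inf, 0.0
    for x in values
        avgx == Inf && (avgx = x)
        p = x - avgx
        avgx = alpha * x + (1 - alpha) * avgx
        avgp = alpha * p + (1 - alpha) * avgp
        avgp > 0.0 && break              # smoothed change turned positive: stop
        push!(out, x)
    end
    return out
end

losses = [5.0, 4.0, 3.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
kept = converge_sketch(losses; alpha=0.5)
```

Because avgp is an exponential average, the iteration does not stop at the first increase; it continues until the smoothed change becomes positive.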

Knet.Train20.minibatch - Function
minibatch(x, [y], batchsize; shuffle, partial, xtype, ytype, xsize, ysize)

Return an iterator of minibatches [(xi,yi)...] given data tensors x, y and batchsize.

The last dimensions of x and y give the number of instances and should be equal. y is optional; if omitted, a sequence of xi will be generated rather than (xi,yi) tuples. Use repeat(d,n) for multiple epochs, Iterators.take(d,n) for a partial epoch, and Iterators.cycle(d) to cycle through the data forever (this can be used with converge). If you need the iterator to continue from its last position when stopped early (e.g. by a break in a for loop), use Iterators.Stateful(d); by default the iterator restarts from the beginning.

Keyword arguments:

• shuffle=false: Shuffle the instances every epoch.
• partial=false: If true include the last partial minibatch < batchsize.
• xtype=typeof(x): Convert xi in minibatches to this type.
• ytype=typeof(y): Convert yi in minibatches to this type.
• xsize=size(x): Convert xi in minibatches to this shape (with last dimension adjusted for batchsize).
• ysize=size(y): Convert yi in minibatches to this shape (with last dimension adjusted for batchsize).
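To make the slicing along the last dimension and the effect of partial concrete, here is a minimal sketch in plain Julia. `simple_minibatch` is a hypothetical stand-in that builds the batches eagerly; it does not reproduce Knet's iterator, shuffling, or type and shape conversion.

```julia
# Hypothetical stand-in for the slicing behaviour described above; not Knet's code.
function simple_minibatch(x, y, batchsize; partial=false)
    n = size(x)[end]                         # instances live in the last dimension
    @assert n == size(y)[end]
    batches = []
    for lo in 1:batchsize:n
        hi = min(lo + batchsize - 1, n)
        (hi - lo + 1 < batchsize && !partial) && break   # drop the short last batch
        xi = selectdim(x, ndims(x), lo:hi)   # view of instances lo:hi
        yi = selectdim(y, ndims(y), lo:hi)
        push!(batches, (copy(xi), copy(yi)))
    end
    return batches
end

x = reshape(collect(1.0:20.0), 2, 10)   # 10 instances with 2 features each
y = collect(1:10)                        # one label per instance
full = simple_minibatch(x, y, 4)
withpartial = simple_minibatch(x, y, 4; partial=true)
```

With 10 instances and a batch size of 4, the last two instances only appear when partial=true.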

Knet.Train20.progress - Function
progress(msg, itr; steps, seconds, io)
progress(itr; o...) do p; [body of the msg function]; end
progress(itr; o...)
progress!(...)

Return a Progress iterator which acts exactly like itr, but prints a progressbar:

┣█████████████████▎  ┫ [86.83%, 903/1040, 01:36/01:50, 9.42i/s] 3.87835

Here 86.83% is the percentage completed, 903 is the number of iterations completed, 1040 is the total number of iterations. 01:36 is elapsed time, 01:50 is the estimated total time, 9.42i/s is the average number of iterations completed per second. If the speed is less than 1, the average number of seconds per iteration (s/i) is reported instead. The bar, percent, total iterations, and estimated total time are omitted for iterators whose size is unknown.

The 3.87835 at the end is the output of the msg function applied to the Progress iterator. The message can be customized by the first two forms above; if it is not specified (the third form), nothing gets printed at the end of the line. The message function can use the following fields of its p::Progress argument: p.currval is the current iterator value and p.curriter is the current iteration count.

The progress bar is updated and msg is called with the Progress iterator every steps iterations or every seconds seconds in addition to the first and the last iteration. If neither steps nor seconds is specified the default is to update every second. The keyword argument io determines where the progress bar is printed, the default is stderr.

The last form, progress!(...), is equivalent to (for x in progress(...) end), i.e. iterates over the object created by progress(...) and returns nothing.

## Hyperparameter optimization

Knet.Train20.goldensection - Function
goldensection(f,n;kwargs) => (fmin,xmin)

Find the minimum of f using concurrent golden section search in n dimensions. See Knet.goldensection_demo() for an example.

f is a function from a Vector{Float64} of length n to a Number. It can return NaN for out of range inputs. Goldensection will always start with a zero vector as the initial input to f, and the initial step size will be 1 in each dimension. The user should define f to scale and shift this input range into a vector meaningful for their application. For positive inputs like learning rate or hidden size, you can use a transformation such as x0*exp(x) where x is a value goldensection passes to f and x0 is your initial guess for this value. This will effectively start the search at x0, then move with multiplicative steps.

I designed this algorithm combining ideas from Golden Section Search and Hill Climbing Search. It essentially runs golden section search concurrently in each dimension, picking the next step based on estimated gain.

Keyword arguments

• dxmin=0.1: smallest step size.
• accel=φ: acceleration rate. Golden ratio φ=1.618... is best.
• verbose=false: use true to print individual steps.
• history=[]: cache of [(x,f(x)),...] function evaluations.
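The x0*exp(x) transformation mentioned above can be written as a small wrapper around the objective. This is only a sketch: `loss_for_lr` is a made-up objective whose minimum is at a learning rate of 0.01, and `lr0` is an assumed initial guess.

```julia
# Hypothetical objective: pretend the best learning rate is 0.01.
loss_for_lr(lr) = (log10(lr) + 2)^2

lr0 = 0.1                                  # initial guess (assumed)

# The search passes a Vector{Float64}; x = [0.0] corresponds to lr0 itself,
# and each unit step in x multiplies the learning rate by e.
objective(x) = loss_for_lr(lr0 * exp(x[1]))

at_start = objective([0.0])            # evaluated at lr0
at_best  = objective([log(0.1)])       # evaluated at lr0 * 0.1 = 0.01
```

A function shaped like `objective` is what would be passed as f, with n=1 in this case.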

Knet.Train20.hyperband - Function
hyperband(getconfig, getloss, maxresource=27, reduction=3)

Hyperparameter optimization using the hyperband algorithm from (Li et al. 2016). You can try a simple MNIST example using Knet.hyperband_demo().

Arguments

• getconfig() returns random configurations with a user defined type and distribution.
• getloss(c,n) returns loss for configuration c and number of resources (e.g. epochs) n.
• maxresource is the maximum number of resources any one configuration should be given.
• reduction is an algorithm parameter (see paper), 3 is a good value.

## Utilities

Knet.Ops20.bmm - Function
bmm(A, B ; transA=false, transB=false)

Perform a batch matrix-matrix product of matrices stored in A and B. size(A,2) == size(B,1) is required, and size(A)[3:end] and size(B)[3:end] must match. If A is a (m,n,b...) tensor and B is a (n,k,b...) tensor, the output is a (m,k,b...) tensor.
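The shape rule can be checked against a loop over the batch index. The sketch below handles only 3-D inputs without transposition; `bmm_reference` is a hypothetical name and this is not how Knet computes the product.

```julia
# Hypothetical reference implementation for 3-D inputs; loops over the batch index.
function bmm_reference(A, B)
    m, n, b = size(A)
    n2, k, b2 = size(B)
    @assert n == n2 && b == b2
    C = zeros(promote_type(eltype(A), eltype(B)), m, k, b)
    for i in 1:b
        C[:, :, i] = A[:, :, i] * B[:, :, i]   # ordinary matrix product per batch
    end
    return C
end

A = reshape(collect(1.0:12.0), 2, 3, 2)   # (m,n,b) = (2,3,2)
B = ones(3, 4, 2)                         # (n,k,b) = (3,4,2)
C = bmm_reference(A, B)                   # (m,k,b) = (2,4,2)
```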

AutoGrad.cat1d - Function
cat1d(args...)

Return vcat(vec.(args)...) but possibly more efficiently. Can be used to concatenate the contents of arrays with different shapes and sizes.
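For reference, the equivalent Base expression behaves as follows on arrays of different shapes. The sketch uses only vcat and vec, not cat1d itself.

```julia
a = [1 2; 3 4]        # 2x2 matrix
b = [5, 6]            # vector
c = fill(7, 1, 1, 1)  # 1x1x1 array

# vec flattens each array in column-major order before concatenation
flat = vcat(vec.((a, b, c))...)
```

The matrix contributes its elements column by column, so `flat` starts with 1, 3, 2, 4.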

Knet.Ops20.dropout - Function
dropout(x, p; drop, seed)

Given an array x and probability 0<=p<=1 return an array y in which each element is 0 with probability p or x[i]/(1-p) with probability 1-p. Just return x if p==0, or drop=false. By default drop=true in a @diff context, drop=false otherwise. Specify a non-zero seed::Number to set the random number seed for reproducible results. See (Srivastava et al. 2014) for a reference.
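The rule above (zero with probability p, otherwise scale by 1/(1-p)) can be sketched in plain Julia. `dropout_reference` is a hypothetical helper with an explicit rng argument; it ignores the drop and seed keywords and is not Knet's implementation.

```julia
using Random

# Hypothetical stand-in for the rule described above; not Knet's kernel.
function dropout_reference(x, p; rng=Random.default_rng())
    p == 0 && return x                     # nothing to drop
    p == 1 && return zero(x)               # everything dropped; avoids dividing by 0
    keep = rand(rng, size(x)...) .>= p     # each element is kept with probability 1-p
    return x .* keep ./ (1 - p)            # survivors are rescaled by 1/(1-p)
end

x = ones(1000)
y = dropout_reference(x, 0.5; rng=MersenneTwister(1))
```

Rescaling the survivors keeps the expected value of each element equal to x[i], which is why no correction is needed when dropout is turned off.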

Knet.KnetArrays.gc - Function
Knet.gc(dev=CUDA.device().handle)

cudaFree all pointers allocated on device dev that were previously allocated and garbage collected. Normally Knet holds on to all garbage collected pointers for reuse. Try this if you run out of GPU memory.

Knet.Ops20.mat - Function
mat(x; dims = ndims(x) - 1)

Reshape x into a two-dimensional matrix by joining the first dims dimensions, i.e. reshape(x, prod(size(x,i) for i in 1:dims), :)

dims=ndims(x)-1 (default) is typically used when turning the output of a 4-D convolution result into a 2-D input for a fully connected layer.

dims=1 is typically used when turning the 3-D output of an RNN layer into a 2-D input for a fully connected layer.

dims=0 will turn the input into a row vector, dims=ndims(x) will turn it into a column vector.
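The cases above can be reproduced with Base reshape. `mat_reference` is a hypothetical re-implementation of the formula given in this entry, and the array shapes are made-up examples.

```julia
# Hypothetical re-implementation of the reshape formula given above.
mat_reference(x; dims=ndims(x) - 1) =
    reshape(x, prod(Int[size(x, i) for i in 1:dims]), :)

conv_out = zeros(4, 4, 8, 16)   # e.g. a 4-D convolution output with 16 instances
rnn_out  = zeros(32, 16, 10)    # e.g. a 3-D RNN output

size(mat_reference(conv_out))            # first three dimensions joined
size(mat_reference(rnn_out; dims=1))     # last two dimensions joined
size(mat_reference(conv_out; dims=0))    # row vector
size(mat_reference(conv_out; dims=4))    # column vector
```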

AutoGrad.@gcheck - Macro
gcheck(f, x...; kw, o...)
@gcheck f(x...; kw...) (opt1=val1,opt2=val2,...)

Numerically check the gradient of f(x...; kw...) and return a boolean result.

Example call: gcheck(nll,model,x,y) or @gcheck nll(model,x,y). The parameters should be marked as Param arrays in f, x, and/or kw. Only 10 random entries in each large numeric array are checked by default. If the output of f is not a number, we check the gradient of sum(f(x...; kw...)). Keyword arguments:

• kw=(): keyword arguments to be passed to f, i.e. f(x...; kw...)
• nsample=10: number of random entries from each param to check
• atol=0.01,rtol=0.05: tolerance parameters. See isapprox for their meaning.
• delta=0.0001: step size for numerical gradient calculation.
• verbose=1: 0 prints nothing, 1 shows failing tests, 2 shows all tests.
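The idea of a numerical gradient check can be sketched without AutoGrad. The helper below is hypothetical: it assumes central differences and checks every entry, whereas gcheck samples nsample entries and its exact difference scheme is not stated here. The defaults for delta, atol and rtol are copied from the list above.

```julia
# Hypothetical checker: compares an analytic gradient with central differences.
function numeric_gradcheck(f, gradf, x; delta=1e-4, atol=0.01, rtol=0.05)
    g = gradf(x)
    for i in eachindex(x)
        xp = copy(x); xp[i] += delta
        xm = copy(x); xm[i] -= delta
        numeric = (f(xp) - f(xm)) / (2 * delta)    # central difference estimate
        isapprox(g[i], numeric; atol=atol, rtol=rtol) || return false
    end
    return true
end

sumsq(x) = sum(x .* x)
x0 = [1.0, 2.0, 3.0]
ok  = numeric_gradcheck(sumsq, x -> 2 .* x, x0)   # correct gradient
bad = numeric_gradcheck(sumsq, x -> 3 .* x, x0)   # deliberately wrong gradient
```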

AutoGrad.@primitive - Macro
@primitive  fx g1 g2...

Define a new primitive operation for AutoGrad and (optionally) specify its gradients. Non-differentiable functions such as sign, and non-numeric functions such as size should be defined using the @zerograd macro instead.

Examples

@primitive sin(x::Number)
@primitive hypot(x1,x2),dy,y

@primitive sin(x::Number),dy  (dy.*cos(x))
@primitive hypot(x1,x2),dy,y  (dy.*x1./y)  (dy.*x2./y)

The first example shows that fx is a typed method declaration. Julia supports multiple dispatch, i.e. a single function can have multiple methods with different arg types. AutoGrad takes advantage of this and supports multiple dispatch for primitives and gradients.

The second example specifies variable names for the output gradient dy and the output y after the method declaration which can be used in gradient expressions. Untyped, ellipsis and keyword arguments are ok as in f(a::Int,b,c...;d=1). Parametric methods such as f(x::T) where {T<:Number} cannot be used.

The method declaration can optionally be followed by gradient expressions. The third and fourth examples show how gradients can be specified. Note that the parameters, the return variable and the output gradient of the original function can be used in the gradient expressions.

Under the hood

The @primitive macro turns the first example into:

sin(x::Value{T}) where {T<:Number} = forw(sin, x)

This will cause calls to sin with a boxed argument (Value{T<:Number}) to be recorded. The recorded operations are used by AutoGrad to construct a dynamic computational graph. With multiple arguments things are a bit more complicated. Here is what happens with the second example:

hypot(x1::Value{S}, x2::Value{T}) where {S,T} = forw(hypot, x1, x2)
hypot(x1::S, x2::Value{T})        where {S,T} = forw(hypot, x1, x2)
hypot(x1::Value{S}, x2::T)        where {S,T} = forw(hypot, x1, x2)

We want the forw method to be called if any one of the arguments is a boxed Value. There is no easy way to specify this in Julia, so the macro generates all 2^N-1 boxed/unboxed argument combinations.

Gradients are defined by back methods that follow this pattern:

back(f,Arg{i},dy,y,x...) => dx[i]

For the third example here is the generated gradient method:

back(::typeof(sin), ::Type{Arg{1}}, dy, y, x::Value{T}) where {T<:Number} = dy .* cos(x)

For the last example a different gradient method is generated for each argument:

back(::typeof(hypot), ::Type{Arg{1}}, dy, y, x1::Value{S}, x2::Value{T}) where {S,T} = (dy .* x1) ./ y
back(::typeof(hypot), ::Type{Arg{2}}, dy, y, x1::Value{S}, x2::Value{T}) where {S,T} = (dy .* x2) ./ y

In fact @primitive generates four more definitions for the other boxed/unboxed argument combinations.

Broadcasting is handled by extra forw and back methods. @primitive defines the following so that broadcasting of a primitive function with a boxed value triggers forw and back.

broadcasted(::typeof(sin), x::Value{T}) where {T<:Number} = forw(broadcasted,sin,x)
back(::typeof(broadcasted), ::Type{Arg{2}}, dy, y, ::typeof(sin), x::Value{T}) where {T<:Number} = dy .* cos(x)

If you do not want the broadcasting methods, you can use the @primitive1 macro. If you only want the broadcasting methods use @primitive2. As a motivating example, here is how * is defined for non-scalars:

@primitive1 *(x1,x2),dy  (dy*x2')  (x1'*dy)
@primitive2 *(x1,x2),dy  unbroadcast(x1,dy.*x2)  unbroadcast(x2,x1.*dy)

Regular * is matrix multiplication, broadcasted * is elementwise multiplication and the two have different gradients as defined above. unbroadcast(a,b) reduces b to the same shape as a by performing the necessary summations.
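What "reduces b to the same shape as a" means can be illustrated with a small array-only sketch. `unbroadcast_reference` is a hypothetical re-implementation for illustration; AutoGrad's own unbroadcast may handle more cases (for example scalars).

```julia
# Hypothetical re-implementation for arrays only; AutoGrad's version may differ.
function unbroadcast_reference(a, b)
    size(a) == size(b) && return b
    # sum over every dimension where a has size 1 but b does not
    dims = Tuple(d for d in 1:ndims(b) if size(a, d) == 1 && size(b, d) > 1)
    return reshape(sum(b; dims=dims), size(a))
end

a  = [1.0, 2.0, 3.0]          # shape (3,)
dy = ones(3, 4)               # gradient with the broadcasted shape (3,4)
da = unbroadcast_reference(a, dy)

row  = ones(1, 4)             # shape (1,4)
drow = unbroadcast_reference(row, ones(3, 4))
```

Each entry of a was used once per column of the broadcasted result, so its gradient is the sum over those columns.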

AutoGrad.@zerograd - Macro
@zerograd f(args...; kwargs...)

Define f as an AutoGrad primitive operation with zero gradient.

Example:

@zerograd  floor(x::Float32)

@zerograd allows f to handle boxed Value inputs by unboxing them like a @primitive, but unlike @primitive it does not record its actions or return a boxed Value result. Some functions, like sign(), have zero gradient. Others, like length() have discrete or constant outputs. These need to handle Value inputs, but do not need to record anything and can return regular values. Their output can be treated like a constant in the program. Use the @zerograd macro for those. Use the @zerograd1 variant if you don't want to define the broadcasting version and @zerograd2 if you only want to define the broadcasting version. Note that kwargs are NOT unboxed.

The model optimization methods apply the same algorithm with the same configuration to every parameter. If you need finer grained control, you can set the optimization algorithm and configuration of an individual Param by setting its opt field to one of the optimization objects like Adam listed below. The opt field is used as an argument to update! and controls the type of update performed on that parameter. Model optimization methods like sgd will not override the opt field if it is already set, e.g. sgd(model,data) will perform an Adam update for a parameter whose opt field is an Adam object. This also means you can stop and start the training without losing optimization state, the first call will set the opt fields and the subsequent calls will not override them.

Knet.Train20.update! - Function
update!(weights::Param, gradients)
update!(weights, gradients; lr=0.1, gclip=0)
update!(weights, gradients, optimizers)

Update the weights using their gradients and the optimization algorithms specified using (1) the opt field of a Param, (2) keyword arguments, (3) the third argument.

weights can be an individual Param, numeric array, or a collection of arrays/Params represented by an iterator or dictionary. gradients should be a matching individual array or collection. In the first form, the optimizer should be specified in weights.opt. In the second form the optimizer defaults to SGD with learning rate lr and gradient clip gclip. In the third form optimizers should be a matching individual optimizer or collection of optimizers. The weights and possibly gradients and optimizers are modified in-place.

Individual optimizers can be one of the following types: SGD, Momentum, Nesterov, Adagrad, Rmsprop, Adadelta, Adam. The keyword arguments for each constructor and their default values are listed in the entry for each type.

Example:

w = Param(rand(d), Adam())  # a Param with a specified optimizer
g = lossgradient0(w)        # gradient g has the same shape as w
update!(w, g)               # update w in-place with Adam()

w = rand(d)                 # an individual weight array
g = lossgradient1(w)        # gradient g has the same shape as w
update!(w, g)               # update w in-place with SGD()
update!(w, g; lr=0.1)       # update w in-place with SGD(lr=0.1)
update!(w, g, SGD(lr=0.1))  # update w in-place with SGD(lr=0.1)

w = (rand(d1), rand(d2))    # a tuple of weight arrays
g = lossgradient2(w)        # g will also be a tuple
p = (Adam(), SGD())         # p has optimizers for each w[i]
update!(w, g, p)            # update each w[i] in-place with g[i],p[i]

w = Any[rand(d1), rand(d2)] # any iterator can be used
g = lossgradient3(w)        # g will be similar to w
p = Any[Adam(), SGD()]      # p should be an iterator of same length
update!(w, g, p)            # update each w[i] in-place with g[i],p[i]

w = Dict(:a => rand(d1), :b => rand(d2)) # dictionaries can be used
g = lossgradient4(w)        # g should be a dictionary with the same keys
p = Dict(:a => Adam(), :b => SGD())
update!(w, g, p)

Knet.Train20.SGD - Type
SGD(;lr=0.1,gclip=0)
update!(w,g,p::SGD)
update!(w,g;lr=0.1)

Container for parameters of the Stochastic gradient descent (SGD) optimization algorithm used by update!.

SGD is an optimization technique to minimize an objective function by updating its weights in the opposite direction of their gradient. The learning rate (lr) determines the size of the step. SGD updates the weights with the following formula:

w = w - lr * g

where w is a weight array, g is the gradient of the loss function w.r.t w and lr is the learning rate.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.
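The clipping rule and the SGD step can be combined in a few lines of plain Julia. `sgd_step!` is a hypothetical helper mirroring the description above; unlike an in-place clip, it leaves the caller's g untouched.

```julia
using LinearAlgebra

# Hypothetical helper mirroring the rule described above; not Knet's implementation.
function sgd_step!(w, g; lr=0.1, gclip=0)
    gnorm = norm(g)
    if gnorm > gclip > 0
        g = g .* (gclip / gnorm)     # rescaled copy with norm equal to gclip
    end
    w .-= lr .* g                    # w = w - lr * g, in place
    return w
end

w = [1.0, 1.0]
g = [3.0, 4.0]                       # norm(g) is 5
sgd_step!(w, g; lr=0.1, gclip=1.0)   # g is scaled to [0.6, 0.8] before the step

w_unclipped = sgd_step!([1.0, 1.0], [3.0, 4.0]; lr=0.1)   # gclip=0: no scaling
```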

SGD is used by default if no algorithm is specified in the two argument version of update!.

Knet.Train20.Momentum - Type
Momentum(;lr=0.05, gclip=0, gamma=0.95)
update!(w,g,p::Momentum)

Container for parameters of the Momentum optimization algorithm used by update!.

The Momentum method tries to accelerate SGD by adding a velocity term to the update. This also decreases the oscillation between successive steps. It updates the weights with the following formulas:

velocity = gamma * velocity + lr * g
w = w - velocity

where w is a weight array, g is the gradient of the objective function w.r.t w, lr is the learning rate, gamma is the momentum parameter, velocity is an array with the same size and type of w and holds the accelerated gradients.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.
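The formulas in this entry subtract a velocity that accumulates +lr*g, while the summary of update rules earlier in this section adds a v that accumulates -lr*g. The sketch below, with hypothetical helper names and a toy objective, runs both conventions side by side to show that they describe the same update.

```julia
# Convention of this entry: velocity accumulates +lr*g and is subtracted.
function momentum_a(w, g, vel; lr=0.05, gamma=0.95)
    vel = gamma .* vel .+ lr .* g
    return w .- vel, vel
end

# Convention of the earlier summary: v accumulates -lr*g and is added.
function momentum_b(w, g, v; lr=0.05, gamma=0.95)
    v = gamma .* v .- lr .* g
    return w .+ v, v
end

toygrad(w) = 2 .* w                  # gradient of the toy objective sum(w .^ 2)

function run_momentum(step, w; steps=20)
    v = zero(w)                      # velocity starts at zero (assumed)
    for _ in 1:steps
        w, v = step(w, toygrad(w), v)
    end
    return w
end

wa = run_momentum(momentum_a, [1.0, -2.0])
wb = run_momentum(momentum_b, [1.0, -2.0])
```

The two velocities differ only in sign, so the weight trajectories coincide.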

Reference: Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks : The Official Journal of the International Neural Network Society, 12(1), 145–151.

Knet.Train20.Nesterov - Type
Nesterov(; lr=0.05, gclip=0, gamma=0.95)
update!(w,g,p::Nesterov)

Container for parameters of Nesterov's momentum optimization algorithm used by update!.

It is similar to standard Momentum but with a slightly different update rule:

velocity = gamma * velocity_old - lr * g
w = w_old - gamma * velocity_old + (1+gamma) * velocity

where w is a weight array, g is the gradient of the objective function w.r.t w, lr is the learning rate, gamma is the momentum parameter, velocity is an array with the same size and type of w and holds the accelerated gradients.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip == 0 no scaling takes place.

Reference Implementation: Yoshua Bengio, Nicolas Boulanger-Lewandowski and Razvan Pascanu

Knet.Train20.Adagrad - Type
Adagrad(;lr=0.05, gclip=0, eps=1e-6)
update!(w,g,p::Adagrad)

Container for parameters of the Adagrad optimization algorithm used by update!.

Adagrad is one of the methods that adapts the learning rate to each of the weights. It stores the sum of the squares of the gradients and uses it to scale the learning rate: the effective learning rate of each weight is lr divided by the square root of its accumulated squared gradients. Hence, the learning rate is larger for parameters whose accumulated gradients are small and smaller for parameters whose accumulated gradients are large. It updates the weights with the following formulas:

G = G + g .^ 2
w = w - g .* lr ./ sqrt(G + eps)

where w is the weight, g is the gradient of the objective function w.r.t w, lr is the learning rate, G is an array with the same size and type of w and holds the sum of the squares of the gradients. eps is a small constant to prevent a zero value in the denominator.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.

Reference: Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.

Knet.Train20.Rmsprop - Type
Rmsprop(;lr=0.01, gclip=0, rho=0.9, eps=1e-6)
update!(w,g,p::Rmsprop)

Container for parameters of the Rmsprop optimization algorithm used by update!.

Rmsprop scales the learning rates by dividing by the root mean square of the gradients. It updates the weights with the following formulas:

G = (1-rho) * g .^ 2 + rho * G
w = w - lr * g ./ sqrt(G + eps)

where w is the weight, g is the gradient of the objective function w.r.t w, lr is the learning rate, G is an array with the same size and type of w and holds the running average of the squared gradients. eps is a small constant to prevent a zero value in the denominator. rho is the decay rate of the running average.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.

Reference: Tijmen Tieleman and Geoffrey Hinton (2012). "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural Networks for Machine Learning 4.2.

Knet.Train20.Adadelta - Type
Adadelta(;lr=1.0, gclip=0, rho=0.9, eps=1e-6)
update!(w,g,p::Adadelta)

Container for parameters of the Adadelta optimization algorithm used by update!.

Adadelta is an extension of Adagrad that tries to prevent the decrease of the learning rates to zero as training progresses. It scales the learning rate based on the accumulated gradients like Adagrad and holds the acceleration term like Momentum. It updates the weights with the following formulas:

G = (1-rho) * g .^ 2 + rho * G
update = g .* sqrt(delta + eps) ./ sqrt(G + eps)
w = w - lr * update
delta = rho * delta + (1-rho) * update .^ 2

where w is the weight, g is the gradient of the objective function w.r.t w, lr is the learning rate, G is an array with the same size and type of w and holds the running average of the squared gradients. eps is a small constant to prevent a zero value in the denominator. rho is the decay rate of the running averages and delta is an array with the same size and type of w and holds the running average of the squared updates.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.

Knet.Train20.Adam - Type
Adam(;lr=0.001, gclip=0, beta1=0.9, beta2=0.999, eps=1e-8)
update!(w,g,p::Adam)

Container for parameters of the Adam optimization algorithm used by update!.

Adam is one of the methods that compute an adaptive learning rate. It stores a running average of the gradients (first moment) and a running average of the squared gradients (second moment), and corrects both for initialization bias as a function of the update count. Here are the update formulas:

m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g .* g
mhat = m ./ (1 - beta1 ^ t)
vhat = v ./ (1 - beta2 ^ t)
w = w - (lr / (sqrt(vhat) + eps)) * mhat

where w is the weight, g is the gradient of the objective function w.r.t w, lr is the learning rate, m is an array with the same size and type of w and holds the accumulated gradients. v is an array with the same size and type of w and holds the sum of the squares of the gradients. eps is a small constant to prevent a zero denominator. beta1 and beta2 are the parameters to calculate bias corrected first and second moments. t is the update count.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.
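A single step of the formulas above can be traced in plain Julia. `adam_step` is a hypothetical helper that passes the state (m, v, t) explicitly and omits gradient clipping; the numbers are made up for illustration.

```julia
# Hypothetical sketch of the formulas above; state is passed and returned explicitly.
function adam_step(w, g, m, v, t; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8)
    m = beta1 .* m .+ (1 - beta1) .* g
    v = beta2 .* v .+ (1 - beta2) .* g .* g
    mhat = m ./ (1 - beta1^t)            # bias-corrected first moment
    vhat = v ./ (1 - beta2^t)            # bias-corrected second moment
    w = w .- lr .* mhat ./ (sqrt.(vhat) .+ eps)
    return w, m, v
end

w, m, v = [1.0, -2.0], zeros(2), zeros(2)
g = [0.5, -4.0]
w1, m1, v1 = adam_step(w, g, m, v, 1)    # first update, t = 1
```

On the first update the bias correction gives mhat = g and vhat = g .* g, so each weight moves by about lr in the direction opposite to its gradient, regardless of the gradient's scale.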

Reference: Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.
