# Reference

Usage:

``````x = Param([1,2,3])          # user declares parameters with `Param`
x => P([1,2,3])             # `Param` is just a struct wrapping a value
value(x) => [1,2,3]         # `value` returns the thing wrapped
sum(x .* x) => 14           # Params act like regular values
y = @diff sum(x .* x)       # Except when we differentiate using `@diff`
y => T(14)                  # you get another struct
value(y) => 14              # which carries the same result
params(y) => [x]            # and the Params that it depends on
grad(y,x) => [2,4,6]        # and the gradients for all Params``````

`Param(x)` returns a struct that acts like `x` but marks it as a parameter you want to compute gradients with respect to.

`@diff expr` evaluates an expression and returns a struct that contains the result (which should be a scalar) and gradient information.

`grad(y, x)` returns the gradient of `y` (output by @diff) with respect to any parameter `x::Param`, or `nothing` if the gradient is 0.

`value(x)` returns the value associated with `x` if `x` is a `Param` or the output of `@diff`, otherwise returns `x`.

`params(x)` returns an iterator of Params found by a recursive search of object `x`.

Alternative usage:

``````x = [1 2 3]
f(x) = sum(x .* x)
f(x) => 14
gradloss(f)(x) => ([2 4 6], 14)``````

Given a scalar valued function `f`, `grad(f,argnum=1)` returns another function `g` which takes the same inputs as `f` and returns the gradient of the output with respect to the argnum'th argument. `gradloss` is similar except the resulting function also returns f's output.
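A gradient returned by `grad` or `gradloss` can be sanity-checked against a finite-difference estimate. The following is a minimal plain-Julia sketch that does not use AutoGrad; `numgrad` is a hypothetical helper written for this illustration, not part of the library.

```julia
# Loss from the usage example above: f(x) = sum(x .* x), whose gradient is 2x.
f(x) = sum(x .* x)

# Central finite-difference approximation of the gradient of a scalar function.
# `numgrad` is a hypothetical helper, not an AutoGrad or Knet function.
function numgrad(f, x::AbstractVector{<:Real}; h=1e-6)
    g = zeros(Float64, length(x))
    for i in eachindex(x)
        xp = Float64.(x); xp[i] += h    # nudge coordinate i up
        xm = Float64.(x); xm[i] -= h    # nudge coordinate i down
        g[i] = (f(xp) - f(xm)) / (2h)
    end
    return g
end

x = [1, 2, 3]
g = numgrad(f, x)    # close to [2, 4, 6], the gradient shown in the example above
```

The estimate is only approximate, so comparisons should use a tolerance rather than exact equality.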

## KnetArray

``````KnetArray{T}(undef,dims)
KnetArray(a::AbstractArray)
Array(k::KnetArray)``````

Container for GPU arrays that supports most of the AbstractArray interface. The constructor allocates a KnetArray on the currently active device, as specified by `gpu()`. KnetArrays and Arrays can be converted to each other as shown above, which involves copying to and from GPU memory. Only Float32/64 KnetArrays are fully supported.

KnetArrays use the CuArrays package for allocation and some operations. Currently some of the custom CUDA kernels that implement elementwise, broadcasting, and reduction operations for KnetArrays are faster than their CuArrays counterparts. Once these are improved in CuArrays, KnetArrays will be retired.

Supported functions:

• Indexing: getindex, setindex! with the following index types:
    • 1-D: Real, Colon, OrdinalRange, AbstractArray{Real}, AbstractArray{Bool}, CartesianIndex, AbstractArray{CartesianIndex}, EmptyArray, KnetArray{Int32} (low level), KnetArray{0/1} (using float for BitArray) (1-D includes linear indexing of multidimensional arrays)
    • 2-D: (Colon,Union{Real,Colon,OrdinalRange,AbstractVector{Real},AbstractVector{Bool},KnetVector{Int32}}), (Union{Real,AbstractUnitRange,Colon}...) (in any order)
    • N-D: (Real...)
• Array operations: ==, !=, adjoint, argmax, argmin, cat, convert, copy, copyto!, deepcopy, display, eachindex, eltype, endof, fill!, findmax, findmin, first, hcat, isapprox, isempty, length, ndims, one, ones, permutedims, pointer, rand!, randn!, reshape, similar, size, stride, strides, summary, transpose, vcat, vec, zero. (Boolean operators generate outputs with same type as inputs; no support for KnetArray{Bool}.)

• Unary functions with broadcasting: -, abs, abs2, acos, acosh, asin, asinh, atan, atanh, cbrt, ceil, cos, cosh, cospi, digamma, erf, erfc, erfcinv, erfcx, erfinv, exp, exp10, exp2, expm1, floor, gamma, lgamma, log, log10, log1p, log2, loggamma, one, round, sign, sin, sinh, sinpi, sqrt, tan, tanh, trigamma, trunc, zero

• Binary functions with broadcasting: !=, *, +, -, /, <, <=, ==, >, >=, ^, max, min

• Reduction operators: maximum, minimum, prod, sum

• Statistics: mean, std, stdm, var, varm

• Linear algebra: (*), axpy!, lmul!, norm, rmul!

• Knet extras: batchnorm, bce, bmm, cat1d, conv4, cpucopy, deconv4, dropout, elu, gpucopy, invx, logistic, logp, logsoftmax, logsumexp, mat, nll, pool, relu, RNN, selu, sigm, softmax, unpool (Only 4D/5D, Float32/64 KnetArrays support conv4, pool, deconv4, unpool)


## File I/O

``Knet.save(filename, args...; kwargs...)``

Call `FileIO.save` after serializing Knet specific args.

File format is determined by the filename extension. JLD and JLD2 are supported. Other formats may work if supported by FileIO; please refer to the documentation of FileIO and the specific format. Example:

``Knet.save("foo.jld2", "name1", value1, "name2", value2)``

``Knet.load(filename, args...; kwargs...)``

Call `FileIO.load` then deserialize Knet specific values.

File format is determined by FileIO. JLD and JLD2 are supported. Other formats may work if supported by FileIO; please refer to the documentation of FileIO and the specific format. Example:

``````Knet.load("foo.jld2")           # returns a ("name"=>value) dictionary
Knet.load("foo.jld2", "name1")  # returns the value of "name1" in "foo.jld2"
Knet.load("foo.jld2", "name1", "name2")   # returns tuple (value1, value2)``````

``Knet.@save "filename" variable1 variable2...``

Save the values of the specified variables to filename in JLD2 format.

When called with no variable arguments, write all variables in the global scope of the current module to filename. See JLD2.

``Knet.@load "filename" variable1 variable2...``

Load the values of the specified variables from filename in JLD2 format.

When called with no variable arguments, load all variables in filename. See JLD2.


## Parameter initialization

``````param(array; atype)
param(dims...; init, atype)
param0(dims...; atype)``````

The first form returns `Param(atype(array))` where `atype=identity` is the default.

The second form returns a randomly initialized `Param(atype(init(dims...)))`. By default, `init` is `xavier_uniform` and `atype` is `KnetArray{Float32}` if `gpu() >= 0`, `Array{Float32}` otherwise.

The third form `param0` is an alias for `param(dims...; init=zeros)`.

``````xavier_uniform(a...; gain=1)
xavier(a...; gain=1)``````

Return uniform random weights in the range `± gain * sqrt(6 / (fanin + fanout))`. The `a` arguments are passed to `rand` to specify type and dimensions. See (Glorot and Bengio 2010) or the PyTorch docs for a description. The function implements equation (16) of the referenced paper. Also known as Glorot initialization. The function `xavier` is an alias for `xavier_uniform`. See also `xavier_normal`.
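The sampling range can be illustrated with a short plain-Julia sketch. It is not the library's implementation: `xavier_uniform_sketch` is a hypothetical name, and it assumes that for a 2-D weight matrix `fanout` is the number of rows and `fanin` is the number of columns.

```julia
# Hypothetical sketch of Xavier/Glorot uniform initialization for a weight matrix.
# Assumption: for a (rows, cols) matrix, fanout = rows and fanin = cols.
function xavier_uniform_sketch(rows::Int, cols::Int; gain=1)
    fanout, fanin = rows, cols
    s = gain * sqrt(6 / (fanin + fanout))      # half-width of the sampling interval
    return (2 .* rand(rows, cols) .- 1) .* s   # uniform samples in [-s, s)
end

w = xavier_uniform_sketch(100, 50)
bound = sqrt(6 / 150)    # 0.2 for a 100x50 matrix with gain=1
```

Every entry of `w` falls within `±bound`, so larger layers start with proportionally smaller weights.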

``xavier_normal(a...; gain=1)``

Return normally distributed random weights with mean 0 and std `gain * sqrt(2 / (fanin + fanout))`. The `a` arguments are passed to `randn`. See (Glorot and Bengio 2010) and PyTorch docs for a description. Also known as Glorot initialization. See also `xavier_uniform`.

``gaussian(a...; mean=0.0, std=0.01)``

Return a Gaussian array with a given mean and standard deviation. The `a` arguments are passed to `randn`.


``bilinear(T, fw, fh, IN, ON)``

Bilinear interpolation filter weights; used for initializing deconvolution layers.

Arguments:

• `T`: Data type
• `fw`: Width upscale factor
• `fh`: Height upscale factor
• `IN`: Number of input filters
• `ON`: Number of output filters

Example usage:

``w = bilinear(Float32,2,2,128,128)``


## Activation functions

``elu(x)``

Return `(x > 0 ? x : exp(x)-1)`.

Reference: Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs) (https://arxiv.org/abs/1511.07289).

``relu(x)``

Return `max(0,x)`.

``selu(x)``

Return `λ01 * (x > 0 ? x : α01 * (exp(x)-1))` where `λ01=1.0507009873554805` and `α01=1.6732632423543778`.

Reference: Self-Normalizing Neural Networks (https://arxiv.org/abs/1706.02515).


`sigm(x) = 1/(1+exp(-x))`
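The four formulas in this section can be written as scalar functions in plain Julia. This is a sketch for illustration only; the `_sketch` names are hypothetical and the library's own versions also handle arrays on the GPU.

```julia
# Scalar versions of the activation formulas given above.
relu_sketch(x) = max(zero(x), x)
elu_sketch(x)  = x > 0 ? x : exp(x) - 1
sigm_sketch(x) = 1 / (1 + exp(-x))

# The constants are the λ01 and α01 values quoted in the selu description.
selu_sketch(x; λ=1.0507009873554805, α=1.6732632423543778) =
    λ * (x > 0 ? x : α * (exp(x) - 1))

xs = [-2.0, 0.0, 3.0]
relu_sketch.(xs)     # [0.0, 0.0, 3.0]
sigm_sketch(0.0)     # 0.5
```

Broadcasting with a dot, as in `relu_sketch.(xs)`, applies a scalar function elementwise.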


## Loss functions

``accuracy(scores, answers; dims=1, average=true)``

Given an unnormalized `scores` matrix and an `Integer` array of correct `answers`, return the ratio of instances where the correct answer has the maximum score. `dims=1` means instances are in columns, `dims=2` means instances are in rows. Use `average=false` to return the pair (ncorrect,count) instead of the ratio (ncorrect/count). If `answers[i] == 0`, instance i is skipped.
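The counting rule, including the skipping of zero answers, can be sketched in plain Julia for the `dims=1` layout. `accuracy_sketch` is a hypothetical name used only for this illustration.

```julia
# Hypothetical plain-Julia sketch of the accuracy computation for dims=1
# (classes in rows, instances in columns).
function accuracy_sketch(scores::AbstractMatrix, answers::AbstractVector{<:Integer}; average=true)
    ncorrect = 0
    count = 0
    for (i, a) in enumerate(answers)
        a == 0 && continue                              # answers[i] == 0 skips instance i
        count += 1
        ncorrect += (argmax(view(scores, :, i)) == a)   # correct if the top score matches
    end
    return average ? ncorrect / count : (ncorrect, count)
end

scores = [0.9 0.1 0.3;
          0.1 0.8 0.7]
answers = [1, 2, 0]
accuracy_sketch(scores, answers)    # 1.0: both counted instances are correct
```

With `average=false` the same function returns the raw counts, which is convenient when accumulating over several minibatches.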

``accuracy(model, data; dims=1, average=true, o...)``

Compute `accuracy(model(x; o...), y; dims)` for `(x,y)` in `data` and return (correct/total) if average=true or (correct,total) if average=false.

``bce(scores,answers;average=true)``

Compute binary cross entropy given scores (predicted values) and answer labels. Answer values should be in {0,1}. Returns the negative of `mean|sum(answers * log(p) + (1-answers)*log(1-p))` where `p` is equal to `1/(1 + exp.(-scores))`. See also `logistic`.

``logistic(scores, answers; average=true)``

Compute logistic loss given scores (predicted values) and answer labels. Answer values should be in {-1,1}. Returns `mean|sum(log(1 + exp(-answers*scores)))`. See also `bce`.
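The two losses describe the same quantity under different label encodings. The sketch below is plain Julia with hypothetical `_sketch` names, and it assumes that `p` in the binary cross entropy is the logistic sigmoid of the score, `1/(1 + exp(-score))`. Under that assumption, mapping labels from {0,1} to {-1,1} makes the two losses agree.

```julia
sigm_sketch(x) = 1 / (1 + exp(-x))
mean_sketch(v) = sum(v) / length(v)

# Binary cross entropy with labels in {0,1}; p is the predicted probability of label 1.
function bce_sketch(scores, answers)
    p = sigm_sketch.(scores)
    return -mean_sketch(answers .* log.(p) .+ (1 .- answers) .* log.(1 .- p))
end

# Logistic loss with labels in {-1,1}.
logistic_sketch(scores, answers) = mean_sketch(log.(1 .+ exp.(-answers .* scores)))

scores   = [2.0, -1.0, 0.5]
labels01 = [1, 0, 1]
labelspm = 2 .* labels01 .- 1      # map {0,1} to {-1,1}
bce_sketch(scores, labels01) ≈ logistic_sketch(scores, labelspm)    # true
```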

``logp(x; dims=:)``

Treat entries in `x` as unnormalized log probabilities and return normalized log probabilities.

`dims` is an optional argument, if not specified the normalization is over the whole `x`, otherwise the normalization is performed over the given dimensions. In particular, if `x` is a matrix, `dims=1` normalizes columns of `x` and `dims=2` normalizes rows of `x`.

``logsoftmax(x; dims=:)``

Equivalent to `logp(x; dims=:)`. See also `softmax`.

``logsumexp(x;dims=:)``

Compute `log(sum(exp(x);dims))` in a numerically stable manner.

`dims` is an optional argument, if not specified the summation is over the whole `x`, otherwise the summation is performed over the given dimensions. In particular if `x` is a matrix, `dims=1` sums columns of `x` and `dims=2` sums rows of `x`.
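Numerical stability matters here because `exp` overflows for large inputs. A common approach, sketched below in plain Julia for a vector and the default `dims=:` case, subtracts the maximum before exponentiating. The `_sketch` names are hypothetical, and the library's actual algorithm may differ.

```julia
# Hypothetical stable log-sum-exp over a whole vector: subtract the maximum first
# so that the largest exponent is exp(0) = 1.
function logsumexp_sketch(x::AbstractVector{<:Real})
    m = maximum(x)
    return m + log(sum(exp.(x .- m)))
end

# Normalized log probabilities, in the spirit of logp(x).
logp_sketch(x) = x .- logsumexp_sketch(x)

x = [1000.0, 1000.0]
log(sum(exp.(x)))       # Inf, because exp(1000.0) overflows Float64
logsumexp_sketch(x)     # finite, equal to 1000 + log(2)
```

Exponentiating the output of `logp_sketch` gives probabilities that sum to one.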

``nll(scores, answers; dims=1, average=true)``

Given an unnormalized `scores` matrix and an `Integer` array of correct `answers`, return the negative log likelihood. The `scores` matrix should have size (classes,instances) if `dims=1` or (instances,classes) if `dims=2`. `answers[i]` should be in `1:classes` to indicate the correct class for instance i, or 0 to skip instance i. The return value is `(total/count)` if `average=true` and `(total,count)` if `average=false` where `count` is the number of instances not skipped and `total` is their total negative log likelihood.

Example

Let's assume that there are three classes (cat, dog, ostrich) and just 2 instances with the unnormalized scores `scores[:,1]` and `scores[:,2]` respectively. The first instance is actually a cat and the second instance a dog:

``````scores = [12.2    0.3;
           2.0   21.5;
           0.0  -21.0]
answers = [1, 2]
nll(scores, answers)
# returns 2.1657e-5``````

The probabilities are derived from the scores and the log-probabilities corresponding to the answers are averaged:

``````probabilities = exp.(scores) ./ sum(exp.(scores),dims=1)
-(log(probabilities[answers[1],1]) + log(probabilities[answers[2],2]))/2
# returns 2.1657e-5``````

``nll(model, data; dims=1, average=true, o...)``

Compute negative log likelihood `nll(model(x; o...), y; dims)` for `(x,y)` in `data` and return `(total/count)` if `average=true` or `(total,count)` if `average=false`.

``softmax(x; dims=1, algo=1)``

The softmax function typically used in classification. Gives the same result as `exp.(logp(x; dims=dims))`.

If `algo=1` computation is more accurate, if `algo=0` it is faster.

See also `logsoftmax`.


The `zeroone` loss is equal to `1 - accuracy`.


## Convolution and Pooling

``conv4(w, x; kwargs...)``

Execute convolutions or cross-correlations using filters specified with `w` over tensor `x`.

Currently KnetArray{Float32/64,4/5} and Array{Float32/64,4} are supported as `w` and `x`. If `w` has dimensions `(W1,W2,...,I,O)` and `x` has dimensions `(X1,X2,...,I,N)`, the result `y` will have dimensions `(Y1,Y2,...,O,N)` where

``Yi=1+floor((Xi+2*padding[i]-Wi)/stride[i])``

Here `I` is the number of input channels, `O` is the number of output channels, `N` is the number of instances, and `Wi,Xi,Yi` are spatial dimensions. `padding` and `stride` are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.
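The output size formula can be evaluated for one spatial dimension with a one-line helper. `conv_outsize` is a hypothetical name for this sketch, and dilation is not taken into account.

```julia
# Spatial output size along one dimension, following Yi = 1 + floor((Xi + 2*padding - Wi) / stride).
# `fld` is Julia's integer floor division.
conv_outsize(x, w; padding=0, stride=1) = 1 + fld(x + 2padding - w, stride)

conv_outsize(28, 5)                # 24
conv_outsize(28, 5; padding=2)     # 28, the output keeps the input size
conv_outsize(28, 5; stride=2)      # 12
```

For an odd filter width `w` and `stride=1`, choosing `padding = (w - 1) ÷ 2` keeps the spatial size unchanged.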

Keywords

• `padding=0`: the number of extra zeros implicitly concatenated at the start and at the end of each dimension.
• `stride=1`: the number of elements to slide to reach the next filtering window.
• `dilation=1`: dilation factor for each dimension.
• `mode=0`: 0 for convolution and 1 for cross-correlation.
• `alpha=1`: can be used to scale the result.
• `handle`: handle to a previously created cuDNN context. Defaults to a Knet allocated handle.

``y = deconv4(w, x; kwargs...)``

Simulate 4-D deconvolution by using transposed convolution operation. Its forward pass is equivalent to backward pass of a convolution (gradients with respect to input tensor). Likewise, its backward pass (gradients with respect to input tensor) is equivalent to forward pass of a convolution. Since it swaps forward and backward passes of convolution operation, padding and stride options belong to output tensor. See this report for further explanation.

Currently KnetArray{Float32/64,4} and Array{Float32/64,4} are supported as `w` and `x`. If `w` has dimensions `(W1,W2,...,O,I)` and `x` has dimensions `(X1,X2,...,I,N)`, the result `y` will have dimensions `(Y1,Y2,...,O,N)` where

``Yi=Wi+stride[i]*(Xi-1)-2*padding[i]``

Here `I` is the number of input channels, `O` is the number of output channels, `N` is the number of instances, and `Wi,Xi,Yi` are spatial dimensions. `padding` and `stride` are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.

Keywords

• `padding=0`: the number of extra zeros implicitly concatenated at the start and at the end of each dimension.
• `stride=1`: the number of elements to slide to reach the next filtering window.
• `mode=0`: 0 for convolution and 1 for cross-correlation.
• `alpha=1`: can be used to scale the result.
• `handle`: handle to a previously created cuDNN context. Defaults to a Knet allocated handle.

``pool(x; kwargs...)``

Compute pooling of input values (i.e., the maximum or average of several adjacent values) to produce an output with smaller height and/or width.

Currently 4 or 5 dimensional KnetArrays with `Float32` or `Float64` entries are supported. If `x` has dimensions `(X1,X2,...,I,N)`, the result `y` will have dimensions `(Y1,Y2,...,I,N)` where

``Yi=1+floor((Xi+2*padding[i]-window[i])/stride[i])``

Here `I` is the number of input channels, `N` is the number of instances, and `Xi,Yi` are spatial dimensions. `window`, `padding` and `stride` are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.

Keywords:

• `window=2`: the pooling window size for each dimension.
• `padding=0`: the number of extra zeros implicitly concatenated at the start and at the end of each dimension.
• `stride=window`: the number of elements to slide to reach the next pooling window.
• `mode=0`: 0 for max, 1 for average including padded values, 2 for average excluding padded values.
• `maxpoolingNanOpt=0`: NaN values are not propagated if 0, they are propagated if 1.
• `alpha=1`: can be used to scale the result.
• `handle`: Handle to a previously created cuDNN context. Defaults to a Knet allocated handle.
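To make the default behaviour concrete (`window=2`, `stride=window`, `padding=0`, max mode), here is a plain-Julia sketch of max pooling on a single 2-D matrix. `maxpool2d_sketch` is a hypothetical name; the library operates on 4-D or 5-D GPU arrays through cuDNN.

```julia
# Hypothetical 2-D max pooling with no padding and stride equal to the window.
function maxpool2d_sketch(x::AbstractMatrix; window=2)
    h, w = size(x) .÷ window                 # output size along each dimension
    y = similar(x, h, w)
    for j in 1:w, i in 1:h
        rows = (i-1)*window+1 : i*window     # rows covered by output cell (i, j)
        cols = (j-1)*window+1 : j*window     # columns covered by output cell (i, j)
        y[i, j] = maximum(view(x, rows, cols))
    end
    return y
end

x = collect(reshape(1:16, 4, 4))
maxpool2d_sketch(x)     # [6 14; 8 16]
```

The 4x4 input shrinks to 2x2, matching `1 + floor((4 - 2) / 2) = 2` from the size formula.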

Unpooling; `reverse` of pooling.

``x == pool(unpool(x;o...); o...)``

## Recurrent neural networks

``````rnn = RNN(inputSize, hiddenSize; opts...)
rnn(x; batchSizes) => y
rnn.h, rnn.c  # hidden and cell states``````

`RNN` returns a callable RNN object `rnn`. Given a minibatch of sequences `x`, `rnn(x)` returns `y`, the hidden states of the final layer for each time step. `rnn.h` and `rnn.c` fields can be used to set the initial hidden states and read the final hidden states of all layers. Note that the final time step of `y` always contains the final hidden state of the last layer, equivalent to `rnn.h` for a single layer network.

Dimensions: The input `x` can be 1, 2, or 3 dimensional and `y` will have the same number of dimensions as `x`. size(x)=(X,[B,T]) and size(y)=(H/2H,[B,T]) where X is inputSize, B is batchSize, T is seqLength, H is hiddenSize, 2H is for bidirectional RNNs. By default a 1-D `x` represents a single instance for a single time step, a 2-D `x` represents a single minibatch for a single time step, and a 3-D `x` represents a sequence of identically sized minibatches for multiple time steps. The output `y` gives the hidden state (of the final layer for multi-layer RNNs) for each time step. The fields `rnn.h` and `rnn.c` represent the hidden states of all layers in a single time step and have size (H,B,L/2L) where L is numLayers and 2L is for bidirectional RNNs.

batchSizes: If `batchSizes=nothing` (default), all sequences in a minibatch are assumed to be the same length. If `batchSizes` is an array of (non-increasing) integers, it gives us the batch size for each time step (allowing different sequences in the minibatch to have different lengths). In this case `x` will typically be 2-D with the second dimension representing variable size batches for time steps. If `batchSizes` is used, `sum(batchSizes)` should equal `length(x) ÷ size(x,1)`. When the batch size is different in every time step, hidden states will have size (H,B,L/2L) where B is always the size of the first (largest) minibatch.

Hidden states: The hidden and cell states are kept in `rnn.h` and `rnn.c` fields (the cell state is only used by LSTM). They can be initialized during construction using the `h` and `c` keyword arguments, or modified later by direct assignment. Valid values are `nothing` (default), `0`, or an array of the right type and size possibly wrapped in a `Param`. If the value is `nothing` the initial state is assumed to be zero and the final state is discarded keeping the value `nothing`. If the value is `0` the initial state is assumed to be zero and `0` is replaced by the final state on return. If the value is a valid state, it is used as the initial state and is replaced by the final state on return.

In a differentiation context the returned final hidden states will be wrapped in `Result` types. This is necessary if the same RNN object is to be called multiple times in a single iteration. Between iterations (i.e. after diff/update) the hidden states need to be unboxed with e.g. `rnn.h = value(rnn.h)` to prevent spurious dependencies. This happens automatically during the backward pass for GPU RNNs but needs to be done manually for CPU RNNs. See the CharLM Tutorial for an example.

Keyword arguments for RNN:

• `h=nothing`: Initial hidden state.
• `c=nothing`: Initial cell state.
• `rnnType=:lstm`: Type of RNN: One of :relu, :tanh, :lstm, :gru.
• `numLayers=1`: Number of RNN layers.
• `bidirectional=false`: Create a bidirectional RNN if `true`.
• `dropout=0`: Dropout probability. Applied to input and between layers.
• `skipInput=false`: Do not multiply the input with a matrix if `true`.
• `dataType=Float32`: Data type to use for weights.
• `algo=0`: Algorithm to use, see CUDNN docs for details.
• `seed=0`: Random number seed for dropout. Uses `time()` if 0.
• `winit=xavier_uniform`: Weight initialization method for matrices.
• `binit=zeros`: Weight initialization method for bias vectors.
• `finit=ones`: Weight initialization method for the bias of forget gates.
• `usegpu=(gpu()>=0)`: GPU used by default if one exists.

Formulas: RNNs compute the output h[t] for a given iteration from the recurrent input h[t-1] and the previous layer input x[t] given matrices W, R and biases bW, bR from the following equations:

`:relu` and `:tanh`: Single gate RNN with activation function f:

``h[t] = f(W * x[t] .+ R * h[t-1] .+ bW .+ bR)``

`:gru`: Gated recurrent unit:

``````i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
r[t] = sigm(Wr * x[t] .+ Rr * h[t-1] .+ bWr .+ bRr) # reset gate
n[t] = tanh(Wn * x[t] .+ r[t] .* (Rn * h[t-1] .+ bRn) .+ bWn) # new gate
h[t] = (1 - i[t]) .* n[t] .+ i[t] .* h[t-1]``````

`:lstm`: Long short term memory unit with no peephole connections:

``````i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
f[t] = sigm(Wf * x[t] .+ Rf * h[t-1] .+ bWf .+ bRf) # forget gate
o[t] = sigm(Wo * x[t] .+ Ro * h[t-1] .+ bWo .+ bRo) # output gate
n[t] = tanh(Wn * x[t] .+ Rn * h[t-1] .+ bWn .+ bRn) # new gate
c[t] = f[t] .* c[t-1] .+ i[t] .* n[t]               # cell output
h[t] = o[t] .* tanh(c[t])``````
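The LSTM equations translate almost line by line into plain Julia. The following sketch computes a single time step on the CPU. The names `lstm_step` and `lstm_params` and the NamedTuple layout are hypothetical and chosen for this illustration; the library stores all weights in a single array and calls cuDNN on the GPU.

```julia
sigm_sketch(x) = 1 / (1 + exp(-x))

# One LSTM time step following the equations above. `p` is a hypothetical
# NamedTuple holding the weight matrices and bias vectors.
function lstm_step(p, x, h, c)
    i = sigm_sketch.(p.Wi * x .+ p.Ri * h .+ p.bWi .+ p.bRi)   # input gate
    f = sigm_sketch.(p.Wf * x .+ p.Rf * h .+ p.bWf .+ p.bRf)   # forget gate
    o = sigm_sketch.(p.Wo * x .+ p.Ro * h .+ p.bWo .+ p.bRo)   # output gate
    n = tanh.(p.Wn * x .+ p.Rn * h .+ p.bWn .+ p.bRn)          # new gate
    c = f .* c .+ i .* n                                       # cell output
    h = o .* tanh.(c)
    return h, c
end

# Build parameters for input size X and hidden size H with a user-supplied initializer.
function lstm_params(X, H; init=(dims...) -> 0.1 .* randn(dims...))
    W() = init(H, X); R() = init(H, H); b() = zeros(H)
    return (Wi=W(), Ri=R(), bWi=b(), bRi=b(),
            Wf=W(), Rf=R(), bWf=b(), bRf=b(),
            Wo=W(), Ro=R(), bWo=b(), bRo=b(),
            Wn=W(), Rn=R(), bWn=b(), bRn=b())
end

p = lstm_params(3, 2)
h, c = lstm_step(p, randn(3), zeros(2), zeros(2))
size(h), size(c)     # ((2,), (2,))
```

Calling `lstm_step` repeatedly while feeding `h` and `c` back in processes a sequence one time step at a time.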

``rnnparam(r::RNN, layer, id, param)``

Return a single weight matrix or bias vector as a slice of RNN weights.

Valid `layer` values:

• For unidirectional RNNs 1:numLayers
• For bidirectional RNNs 1:2*numLayers, forw and back layers alternate.

Valid `id` values:

• For RELU and TANH RNNs, input = 1, hidden = 2.
• For GRU reset = 1,4; update = 2,5; newmem = 3,6; 1:3 for input, 4:6 for hidden
• For LSTM inputgate = 1,5; forget = 2,6; newmem = 3,7; output = 4,8; 1:4 for input, 5:8 for hidden

Valid `param` values:

• Return the weight matrix (transposed!) if `param==1`.
• Return the bias vector if `param==2`.

The effect of skipInput: Let I=1 for RELU/TANH, 1:3 for GRU, 1:4 for LSTM

• For skipInput=false (default), rnnparam(r,1,I,1) is a (inputSize,hiddenSize) matrix.
• For skipInput=true, rnnparam(r,1,I,1) is `nothing`.
• For bidirectional, the same applies to rnnparam(r,2,I,1): the first back layer.
• The input biases (`param=2`) are returned even if skipInput=true.

``rnnparams(r::RNN)``

Return the RNN parameters as an Array{Any}.

The order of params returned (subject to change):

• All weight matrices come before all bias vectors.
• Matrices and biases are sorted lexically based on (layer,id).
• See @doc rnnparam for valid layer and id values.
• Input multiplying matrices are `nothing` if r.inputMode = 1.

## Batch Normalization

`batchnorm(x[, moments, params]; kwargs...)` performs batch normalization on `x` with optional scaling factor and bias stored in `params`.

2d, 4d and 5d inputs are supported. Mean and variance are computed over dimensions (2,), (1,2,4) and (1,2,3,5) for 2d, 4d and 5d arrays, respectively.

`moments` stores running mean and variance to be used in testing. It is optional in the training mode, but mandatory in the test mode. Training and test modes are controlled by the `training` keyword argument.

`params` stores the optional affine parameters gamma and beta. `bnparams` function can be used to initialize `params`.

Example

``````# Initialization, C is an integer
moments = bnmoments()
params = bnparams(C)
...
# size(x) -> (H, W, C, N)
y = batchnorm(x, moments, params)
# size(y) -> (H, W, C, N)``````

Keywords

`eps=1e-5`: The epsilon parameter added to the variance to avoid division by 0.

`training`: When `training` is true, the mean and variance of `x` are used and `moments` argument is modified if it is provided. When `training` is false, mean and variance stored in the `moments` argument are used. Default value is `true` when at least one of `x` and `params` is `AutoGrad.Value`, `false` otherwise.
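The training-mode computation for the 2d case can be sketched in plain Julia. `batchnorm2d_sketch` is a hypothetical name. The sketch assumes a `(features, instances)` layout and a biased variance estimate, and it leaves out the affine `params` and the running `moments`.

```julia
# Hypothetical training-mode batch normalization for a 2-D input of size
# (features, instances): statistics are taken over dimension 2.
function batchnorm2d_sketch(x::AbstractMatrix; eps=1e-5)
    n = size(x, 2)
    m = sum(x; dims=2) ./ n                    # per-feature mean
    v = sum(abs2, x .- m; dims=2) ./ n         # per-feature biased variance (assumption)
    return (x .- m) ./ sqrt.(v .+ eps)
end

x = [1.0  2.0  3.0;
     10.0 20.0 30.0]
y = batchnorm2d_sketch(x)     # each row now has mean ≈ 0 and variance ≈ 1
```

Both rows map to nearly the same normalized values even though their original scales differ by a factor of ten.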


`bnmoments(;momentum=0.1, mean=nothing, var=nothing, meaninit=zeros, varinit=ones)` can be used to directly load moments from data. `meaninit` and `varinit` are called if `mean` and `var` are nothing. Type and size of the `mean` and `var` are determined automatically from the inputs in the `batchnorm` calls. A `BNMoments` object is returned.

BNMoments

A high-level data structure used to store running mean and running variance of batch normalization with the following fields:

`momentum::AbstractFloat`: A real number between 0 and 1 to be used as the scale of last mean and variance. The existing running mean or variance is multiplied by (1-momentum).

`mean`: The running mean.

`var`: The running variance.

`meaninit`: The function used to initialize the running mean. Should either be `nothing` or of the form `(eltype, dims...)->data`. `zeros` is a good option.

`varinit`: The function used to initialize the running variance. Should either be `nothing` or of the form `(eltype, dims...)->data`. `ones` is a good option.


`bnparams(etype, channels)` creates a single 1d array that contains both scale and bias of batchnorm, where the first half is scale and the second half is bias.

`bnparams(channels)` calls `bnparams` with `etype=Float64`, following Julia convention.


## Model optimization

``````minimize(func, data, optimizer=Adam(); params)
sgd     (func, data; lr=0.1,  gclip, params)
momentum(func, data; lr=0.05, gamma=0.95, gclip, params)
nesterov(func, data; lr=0.05, gamma=0.95, gclip, params)
rmsprop (func, data; lr=0.01, rho=0.9, eps=1e-6, gclip, params)
adam    (func, data; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, gclip, params)``````

Return an iterator which applies `func` to arguments in `data`, i.e. `(func(args...) for args in data)`, and updates the parameters every iteration to minimize `func`. `func` should return a scalar value.

The common keyword argument `params` can be used to list the `Param`s to be optimized. If not specified, any `Param` that takes part in the computation of `func(args...)` will be updated.

The common keyword argument `gclip` can be used to implement per-parameter gradient clipping. For a parameter gradient `g`, if `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`. If not specified no gradient clipping is performed.
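The clipping rule can be written in a few lines of plain Julia. `clip_gradient` is a hypothetical helper for this sketch and not a library function.

```julia
# Per-parameter gradient clipping as described above: if norm(g) > gclip > 0,
# rescale g so that its norm equals gclip; otherwise return g unchanged.
function clip_gradient(g::AbstractArray, gclip::Real)
    gnorm = sqrt(sum(abs2, g))        # Euclidean norm of the gradient
    return (gclip > 0 && gnorm > gclip) ? g .* (gclip / gnorm) : g
end

g = [3.0, 4.0]             # norm 5
clip_gradient(g, 1.0)      # approximately [0.6, 0.8], norm 1
clip_gradient(g, 10.0)     # unchanged, the norm is already below gclip
```

Clipping changes only the length of the gradient, not its direction.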

These functions do not perform optimization, but return an iterator that can. Any function that produces values from an iterator can be used with such an object, e.g. `progress!(sgd(f,d))` iterates the sgd optimizer and displays a progress bar. For convenience, appending `!` to the name of the function iterates and returns `nothing`, i.e. `sgd!(...)` is equivalent to `(for x in sgd(...) end)`.

We define optimizers as lazy iterators to have explicit control over them:

• To report progress use `progress(sgd(f,d))`.
• To run until convergence use `converge(sgd(f,cycle(d)))`.
• To run multiple epochs use `sgd(f,repeat(d,n))`.
• To run a given number of iterations use `sgd(f,take(cycle(d),n))`.
• To do a task every n iterations use `(task() for (i,j) in enumerate(sgd(f,d)) if i%n == 1)`.
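The iterator patterns in this list are ordinary Julia and can be tried without any model. In the sketch below `f` is a hypothetical stand-in for the loss computed at each optimizer step, and a generator plays the role of the lazy optimizer iterator.

```julia
data = [1.0, 2.0, 3.0]
f(x) = x^2      # hypothetical stand-in for the loss computed at each step

# A generator is lazy: f is not called until the generator is iterated.
# take(cycle(data), 7) yields exactly 7 items, wrapping around the data.
steps = (f(x) for x in Iterators.take(Iterators.cycle(data), 7))
losses = collect(steps)      # [1.0, 4.0, 9.0, 1.0, 4.0, 9.0, 1.0]

# Select every n-th iteration, starting with the first, as in the last bullet above.
n = 3
reported = [i for (i, loss) in enumerate(losses) if i % n == 1]    # [1, 4, 7]
```

Because nothing runs until the iterator is consumed, wrappers such as `take` or `enumerate` can be stacked freely before training starts.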

These functions apply the same algorithm with the same configuration to every parameter by default. `minimize` takes an explicit optimizer argument, all others call `minimize` with an appropriate optimizer argument (see `@doc update!` for a list of possible optimizers). Before calling `update!` on a `Param`, `minimize` sets its `opt` field to a copy of this default optimizer if it is not already set. The `opt` field is used by the `update!` function to determine the type of update performed on that parameter. If you need finer grained control, you can set the optimizer of an individual `Param` by setting its `opt` field before calling one of these functions. They will not override the `opt` field if it is already set, e.g. `sgd(model,data)` will perform an `Adam` update for a parameter whose `opt` field is an `Adam` object. This also means you can stop and start the training without losing optimization state, the first call will set the `opt` fields and the subsequent calls will not override them.

Given a parameter `w` and its gradient `g` here are the updates applied by each optimizer:

``````# sgd (http://en.wikipedia.org/wiki/Stochastic_gradient_descent)
w .= w - lr * g

# momentum (http://jlmelville.github.io/mize/nesterov.html)
v .= gamma * v - lr * g
w .= w + v

# nesterov (http://jlmelville.github.io/mize/nesterov.html)
w .= w - gamma * v
v .= gamma * v - lr * g
w .= w + (1 + gamma) * v

# adagrad
G .= G + g .^ 2
w .= w - lr * g ./ sqrt(G + eps)

# rmsprop (http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
G .= rho * G + (1-rho) * g .^ 2
w .= w - lr * g ./ sqrt(G + eps)

# adadelta
G .= rho * G + (1-rho) * g .^ 2
update = sqrt(delta + eps) .* g ./ sqrt(G + eps)
w = w - lr * update
delta = rho * delta + (1-rho) * update .^ 2

# adam
v = beta1 * v + (1 - beta1) * g
G = beta2 * G + (1 - beta2) * g .^ 2
vhat = v ./ (1 - beta1 ^ t)
Ghat = G ./ (1 - beta2 ^ t)
w = w - (lr / (sqrt(Ghat) + eps)) * vhat``````
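Two of the update rules are shown below as scalar functions in plain Julia, applied to the toy loss `w^2`. The names `sgd_step`, `adam_step`, `loss_grad` and `run_sgd` are hypothetical; this is a sketch of the formulas, not the library's `update!` implementation.

```julia
# Hypothetical scalar versions of the sgd and adam updates listed above.
sgd_step(w, g; lr=0.1) = w - lr * g

function adam_step(w, g, v, G, t; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8)
    v = beta1 * v + (1 - beta1) * g          # running average of gradients
    G = beta2 * G + (1 - beta2) * g^2        # running average of squared gradients
    vhat = v / (1 - beta1^t)                 # bias correction for step t
    Ghat = G / (1 - beta2^t)
    w = w - (lr / (sqrt(Ghat) + eps)) * vhat
    return w, v, G
end

loss_grad(w) = 2w       # gradient of the toy loss w^2

function run_sgd(w, nsteps)
    for _ in 1:nsteps
        w = sgd_step(w, loss_grad(w))
    end
    return w
end

w_sgd = run_sgd(1.0, 100)                                      # shrinks by a factor 0.8 per step
w_adam, v, G = adam_step(1.0, loss_grad(1.0), 0.0, 0.0, 1)     # first step moves by about lr
```

On the first Adam step the bias-corrected ratio `vhat / sqrt(Ghat)` is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's magnitude.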

``converge(itr; alpha=0.1)``

Return an iterator which acts exactly like `itr`, but quits when values from `itr` stop decreasing. `itr` should produce numeric values.

It can be used to train a model with the data cycled:

``progress!(converge(minimize(model,cycle(data))))``

`alpha` controls the exponential average of values to detect convergence. Here is how convergence is decided:

``````p = x - avgx
avgx = c.alpha * x + (1-c.alpha) * avgx
avgp = c.alpha * p + (1-c.alpha) * avgp
avgp > 0.0 && return nothing``````

`converge!(...)` is equivalent to `(for x in converge(...) end)`, i.e. iterates over the object created by `converge(...)` and returns `nothing`.
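The stopping rule can be tried on a plain sequence of numbers. The sketch below follows the four update lines shown above; `take_until_converged` is a hypothetical name, and the initial values (`avgx` starting at the first item, `avgp` at zero) are assumptions made for this illustration.

```julia
# Hypothetical stopping rule based on the update lines above.
# Assumptions: avgx starts at the first value and avgp starts at 0.
function take_until_converged(values; alpha=0.1)
    kept = Float64[]
    avgx = nothing
    avgp = 0.0
    for x in values
        avgx === nothing && (avgx = float(x))    # initialize with the first value
        p = x - avgx                             # change relative to the running average
        avgx = alpha * x + (1 - alpha) * avgx
        avgp = alpha * p + (1 - alpha) * avgp
        avgp > 0.0 && break                      # values have stopped decreasing
        push!(kept, x)
    end
    return kept
end

take_until_converged([3.0, 2.0, 1.0])          # [3.0, 2.0, 1.0], still decreasing
take_until_converged([1.0, 1.0, 2.0, 0.5])     # [1.0, 1.0], stops at the increase
```

A smaller `alpha` averages over a longer history, which makes the rule less sensitive to noise in individual values.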

``minibatch(x, [y], batchsize; shuffle, partial, xtype, ytype, xsize, ysize)``

Return an iterator of minibatches [(xi,yi)...] given data tensors x, y and batchsize.

The last dimension of x and y give the number of instances and should be equal. `y` is optional, if omitted a sequence of `xi` will be generated rather than `(xi,yi)` tuples. Use `repeat(d,n)` for multiple epochs, `Iterators.take(d,n)` for a partial epoch, and `Iterators.cycle(d)` to cycle through the data forever (this can be used with `converge`). If you need the iterator to continue from its last position when stopped early (e.g. by a break in a for loop), use `Iterators.Stateful(d)` (by default the iterator would restart from the beginning).
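The splitting along the last dimension can be sketched in plain Julia for a matrix `x` and a vector `y`. `minibatch_sketch` is a hypothetical name; it returns a plain vector of tuples rather than a lazy iterator and does not shuffle or convert types.

```julia
# Hypothetical sketch: split x of size (features, N) and y of length N into
# batches along the last dimension.
function minibatch_sketch(x::AbstractMatrix, y::AbstractVector, batchsize::Int; partial=false)
    n = size(x, 2)
    @assert n == length(y)                      # instance counts must match
    batches = []
    for lo in 1:batchsize:n
        hi = min(lo + batchsize - 1, n)
        (hi - lo + 1 < batchsize && !partial) && break   # drop the short last batch
        push!(batches, (x[:, lo:hi], y[lo:hi]))
    end
    return batches
end

x = reshape(collect(1:20), 2, 10)
y = collect(1:10)
length(minibatch_sketch(x, y, 3))                  # 3 full batches
length(minibatch_sketch(x, y, 3; partial=true))    # 4, the last one has a single instance
```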

Keyword arguments:

• `shuffle=false`: Shuffle the instances every epoch.
• `partial=false`: If true include the last partial minibatch < batchsize.
• `xtype=typeof(x)`: Convert xi in minibatches to this type.
• `ytype=typeof(y)`: Convert yi in minibatches to this type.
• `xsize=size(x)`: Convert xi in minibatches to this shape.
• `ysize=size(y)`: Convert yi in minibatches to this shape.

``````progress(msg, itr; steps, seconds, io)
progress(itr; o...) do p; [body of the msg function]; end
progress(itr; o...)
progress!(...)``````

Return a `Progress` iterator which acts exactly like `itr`, but prints a progress bar:

``┣█████████████████▎  ┫ [86.83%, 903/1040, 01:36/01:50, 9.42i/s] 3.87835``

Here `86.83%` is the percentage completed, `903` is the number of iterations completed, `1040` is the total number of iterations. `01:36` is elapsed time, `01:50` is the estimated total time, `9.42i/s` is the average number of iterations completed per second. If the speed is less than 1, the average number of seconds per iteration (s/i) is reported instead. The bar, percent, total iterations, and estimated total time are omitted for iterators whose size is unknown.

The `3.87835` at the end is the output of the `msg` function applied to the Progress iterator. The message can be customized by the first two forms above, if not specified (the third form) nothing gets printed at the end of the line. The message function can use the following fields of its `p::Progress` argument: `p.currval` is the current iterator value and `p.curriter` is the current iteration count.

The progress bar is updated and `msg` is called with the Progress iterator every `steps` iterations or every `seconds` seconds in addition to the first and the last iteration. If neither `steps` nor `seconds` is specified the default is to update every second. The keyword argument `io` determines where the progress bar is printed, the default is `stderr`.

The last form, `progress!(...)`, is equivalent to `(for x in progress(...) end)`, i.e. iterates over the object created by `progress(...)` and returns `nothing`.


`training()` returns `true` only inside a `@diff` context, e.g. during a training iteration of a model.


## Hyperparameter optimization

``goldensection(f,n;kwargs) => (fmin,xmin)``

Find the minimum of `f` using concurrent golden section search in `n` dimensions. See `Knet.goldensection_demo()` for an example.

`f` is a function from a `Vector{Float64}` of length `n` to a `Number`. It can return `NaN` for out of range inputs. Goldensection will always start with a zero vector as the initial input to `f`, and the initial step size will be 1 in each dimension. The user should define `f` to scale and shift this input range into a vector meaningful for their application. For positive inputs like learning rate or hidden size, you can use a transformation such as `x0*exp(x)` where `x` is a value `goldensection` passes to `f` and `x0` is your initial guess for this value. This will effectively start the search at `x0`, then move with multiplicative steps.
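As an illustration of the `x0*exp(x)` transformation, here is a minimal sketch of an objective wrapper for two positive hyperparameters. The names `toy_loss`, `lr0` and `hidden0` are hypothetical stand-ins for a real training loss and real initial guesses.

```julia
# Sketch of the `x0*exp(x)` reparameterization described above.
# `toy_loss`, `lr0` and `hidden0` are hypothetical stand-ins.
toy_loss(lr, hidden) = (log10(lr) + 2)^2 + (hidden - 128)^2 / 1e4

lr0, hidden0 = 0.1, 64        # initial guesses for the two hyperparameters

function f(x::Vector{Float64})
    lr = lr0 * exp(x[1])                          # always positive
    hidden = round(Int, hidden0 * exp(x[2]))      # integer hidden size
    hidden < 1 && return NaN                      # signal an out-of-range input
    return toy_loss(lr, hidden)
end

f(zeros(2))        # evaluated at the initial guesses (lr0, hidden0)
f([1.0, 0.0])      # a unit step in dimension 1 multiplies lr by e
```

A function of this shape would then be passed as `goldensection(f, 2)`.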

I designed this algorithm combining ideas from Golden Section Search and Hill Climbing Search. It essentially runs golden section search concurrently in each dimension, picking the next step based on estimated gain.

Keyword arguments

• `dxmin=0.1`: smallest step size.
• `accel=φ`: acceleration rate. Golden ratio `φ=1.618...` is best.
• `verbose=false`: use `true` to print individual steps.
• `history=[]`: cache of `[(x,f(x)),...]` function evaluations.
``hyperband(getconfig, getloss, maxresource=27, reduction=3)``

Hyperparameter optimization using the hyperband algorithm from (Li et al. 2016). You can try a simple MNIST example using `Knet.hyperband_demo()`.

Arguments

• `getconfig()` returns random configurations with a user defined type and distribution.
• `getloss(c,n)` returns loss for configuration `c` and number of resources (e.g. epochs) `n`.
• `maxresource` is the maximum number of resources any one configuration should be given.
• `reduction` is an algorithm parameter (see paper), 3 is a good value.

## Utilities

``````Knet.bmm
Knet.cpucopy
Knet.dir
Knet.dropout
Knet.gc
Knet.gpu
Knet.gpucopy
Knet.invx
Knet.mat
Knet.seed!``````

``````gcheck(f, x...; kw, o...)
@gcheck f(x...; kw...) (opt1=val1,opt2=val2,...)``````

Numerically check the gradient of `f(x...; kw...)` and return a boolean result.

Example call: `gcheck(nll,model,x,y)` or `@gcheck nll(model,x,y)`. The parameters should be marked as `Param` arrays in `f`, `x`, and/or `kw`. Only 10 random entries in each large numeric array are checked by default. If the output of `f` is not a number, we check the gradient of `sum(f(x...; kw...))`. Keyword arguments:

• `kw=()`: keyword arguments to be passed to `f`, i.e. `f(x...; kw...)`
• `nsample=10`: number of random entries from each param to check
• `atol=0.01,rtol=0.05`: tolerance parameters. See `isapprox` for their meaning.
• `delta=0.0001`: step size for numerical gradient calculation.
• `verbose=1`: 0 prints nothing, 1 shows failing tests, 2 shows all tests.
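The idea behind a numerical gradient check can be sketched in plain Julia. This is not Knet's implementation: `numeric_grad` and `check_grad` are hypothetical names, and the use of central differences is an assumption, since the text above does not say which difference scheme `gcheck` uses. The keyword names mirror the ones listed above.

```julia
# Minimal sketch of a numerical gradient check (central differences);
# `numeric_grad` and `check_grad` are hypothetical names, not Knet functions.
function numeric_grad(f, x::Vector{Float64}, i::Int; delta=1e-4)
    xp = copy(x); xp[i] += delta
    xm = copy(x); xm[i] -= delta
    return (f(xp) - f(xm)) / (2delta)
end

function check_grad(f, g, x; nsample=10, atol=0.01, rtol=0.05, delta=1e-4)
    idx = rand(1:length(x), min(nsample, length(x)))   # random entries to test
    analytic = g(x)
    return all(isapprox(analytic[i], numeric_grad(f, x, i; delta=delta);
                        atol=atol, rtol=rtol) for i in idx)
end

f(x) = sum(x .* x)            # scalar-valued function
g(x) = 2 .* x                 # its analytic gradient
x = collect(1.0:20.0)
check_grad(f, g, x)                # passes: analytic and numeric gradients agree
check_grad(f, x -> 3 .* x, x)      # fails: the supplied gradient is wrong
```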
``@primitive  fx g1 g2...``

Define a new primitive operation for AutoGrad and (optionally) specify its gradients. Non-differentiable functions such as `sign`, and non-numeric functions such as `size` should be defined using the @zerograd macro instead.

Examples

``````@primitive sin(x::Number)
@primitive hypot(x1,x2),dy,y

@primitive sin(x::Number),dy  (dy.*cos(x))
@primitive hypot(x1,x2),dy,y  (dy.*x1./y)  (dy.*x2./y)``````

The first example shows that `fx` is a typed method declaration. Julia supports multiple dispatch, i.e. a single function can have multiple methods with different arg types. AutoGrad takes advantage of this and supports multiple dispatch for primitives and gradients.

The second example specifies variable names for the output gradient `dy` and the output `y` after the method declaration which can be used in gradient expressions. Untyped, ellipsis and keyword arguments are ok as in `f(a::Int,b,c...;d=1)`. Parametric methods such as `f(x::T) where {T<:Number}` cannot be used.

The method declaration can optionally be followed by gradient expressions. The third and fourth examples show how gradients can be specified. Note that the parameters, the return variable and the output gradient of the original function can be used in the gradient expressions.

Under the hood

The @primitive macro turns the first example into:

``sin(x::Value{T}) where {T<:Number} = forw(sin, x)``

This will cause calls to `sin` with a boxed argument (`Value{T<:Number}`) to be recorded. The recorded operations are used by AutoGrad to construct a dynamic computational graph. With multiple arguments things are a bit more complicated. Here is what happens with the second example:

``````hypot(x1::Value{S}, x2::Value{T}) where {S,T} = forw(hypot, x1, x2)
hypot(x1::S, x2::Value{T})        where {S,T} = forw(hypot, x1, x2)
hypot(x1::Value{S}, x2::T)        where {S,T} = forw(hypot, x1, x2)``````

We want the forw method to be called if any one of the arguments is a boxed `Value`. There is no easy way to specify this in Julia, so the macro generates all 2^N-1 boxed/unboxed argument combinations.
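To make the count concrete, the following sketch enumerates the boxed/unboxed combinations for an `N`-argument primitive. `boxed_combinations` is a hypothetical helper written for illustration; it is not part of AutoGrad.

```julia
# Enumerate which arguments are boxed for an N-argument primitive.
# `boxed_combinations` is a hypothetical helper for illustration only.
function boxed_combinations(n)
    combos = vec(collect(Iterators.product(fill((false, true), n)...)))
    return filter(any, combos)      # drop the case where nothing is boxed
end

for c in boxed_combinations(2)
    println(join((b ? "Value" : "plain" for b in c), ", "))
end
length(boxed_combinations(3))       # 2^3 - 1 combinations
```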

Gradients are defined by `back` methods that follow this pattern:

``back(f,Arg{i},dy,y,x...) => dx[i]``

For the third example here is the generated gradient method:

``back(::typeof(sin), ::Type{Arg{1}}, dy, y, x::Value{T}) where {T<:Number} = dy .* cos(x)``

For the last example a different gradient method is generated for each argument:

``````back(::typeof(hypot), ::Type{Arg{1}}, dy, y, x1::Value{S}, x2::Value{T}) where {S,T} = (dy .* x1) ./ y
back(::typeof(hypot), ::Type{Arg{2}}, dy, y, x1::Value{S}, x2::Value{T}) where {S,T} = (dy .* x2) ./ y``````

In fact @primitive generates four more definitions for the other boxed/unboxed argument combinations.

Broadcasting is handled by extra `forw` and `back` methods. `@primitive` defines the following so that broadcasting of a primitive function with a boxed value triggers `forw` and `back`.

``````broadcasted(::typeof(sin), x::Value{T}) where {T<:Number} = forw(broadcasted,sin,x)
back(::typeof(broadcasted), ::Type{Arg{2}}, dy, y, ::typeof(sin), x::Value{T}) where {T<:Number} = dy .* cos(x)``````

If you do not want the broadcasting methods, you can use the `@primitive1` macro. If you only want the broadcasting methods use `@primitive2`. As a motivating example, here is how `*` is defined for non-scalars:

``````@primitive1 *(x1,x2),dy  (dy*x2')  (x1'*dy)
@primitive2 *(x1,x2),dy  unbroadcast(x1,(dy.*x2))  unbroadcast(x2,(x1.*dy))``````

Regular `*` is matrix multiplication, broadcasted `*` is elementwise multiplication and the two have different gradients as defined above. `unbroadcast(a,b)` reduces `b` to the same shape as `a` by performing the necessary summations.
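The shape reduction attributed to `unbroadcast(a,b)` can be sketched as follows. This is an illustrative stand-in under the assumption that both arguments are arrays; `unbroadcast_sketch` is a hypothetical name and not AutoGrad's implementation.

```julia
# Sketch of the shape reduction that `unbroadcast(a,b)` is described as doing;
# `unbroadcast_sketch` is a hypothetical stand-in, not AutoGrad's implementation.
function unbroadcast_sketch(a::AbstractArray, b::AbstractArray)
    size(a) == size(b) && return b
    # dimensions along which `a` was expanded by broadcasting
    dims = Tuple(d for d in 1:ndims(b) if size(a, d) == 1 && size(b, d) > 1)
    return reshape(sum(b; dims=dims), size(a))
end

x1 = rand(3, 1)                  # broadcast along the second dimension
x2 = rand(3, 4)
dy = ones(3, 4)                  # output gradient for sum(x1 .* x2)
dx1 = unbroadcast_sketch(x1, dy .* x2)
size(dx1)                        # same shape as x1
```

The gradient with respect to `x1` has to be summed over the broadcast dimension because each entry of `x1` contributes to a whole row of the output.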

``@zerograd f(args...; kwargs...)``

Define `f` as an AutoGrad primitive operation with zero gradient.

Example:

``@zerograd  floor(x::Float32)``

`@zerograd` allows `f` to handle boxed `Value` inputs by unboxing them like a `@primitive`, but unlike `@primitive` it does not record its actions or return a boxed `Value` result. Some functions, like `sign()`, have zero gradient. Others, like `length()` have discrete or constant outputs. These need to handle `Value` inputs, but do not need to record anything and can return regular values. Their output can be treated like a constant in the program. Use the `@zerograd` macro for those. Use the `@zerograd1` variant if you don't want to define the broadcasting version and `@zerograd2` if you only want to define the broadcasting version. Note that `kwargs` are NOT unboxed.

The model optimization methods apply the same algorithm with the same configuration to every parameter. If you need finer grained control, you can set the optimization algorithm and configuration of an individual `Param` by setting its `opt` field to one of the optimization objects like `Adam` listed below. The `opt` field is used as an argument to `update!` and controls the type of update performed on that parameter. Model optimization methods like `sgd` will not override the `opt` field if it is already set, e.g. `sgd(model,data)` will perform an `Adam` update for a parameter whose `opt` field is an `Adam` object. This also means you can stop and start the training without losing optimization state, the first call will set the `opt` fields and the subsequent calls will not override them.

``````update!(weights::Param, gradients)
update!(weights, gradients; lr=0.1, gclip=0)
update!(weights, gradients, optimizers)``````

Update the `weights` using their `gradients` and the optimization algorithms specified using (1) the `opt` field of a `Param`, (2) keyword arguments, (3) the third argument.

`weights` can be an individual `Param`, numeric array, or a collection of arrays/Params represented by an iterator or dictionary. `gradients` should be a matching individual array or collection. In the first form, the optimizer should be specified in `weights.opt`. In the second form the optimizer defaults to `SGD` with learning rate `lr` and gradient clip `gclip`. In the third form `optimizers` should be a matching individual optimizer or collection of optimizers. The `weights` and possibly `gradients` and `optimizers` are modified in-place.

Individual optimizers can be any of the types documented below (`SGD`, `Momentum`, `Nesterov`, `Adagrad`, `Rmsprop`, `Adadelta`, `Adam`). The keyword arguments of each constructor and their default values are given in the corresponding entry.

Example:

``````w = Param(rand(d), Adam())  # a Param with a specified optimizer
g = lossgradient0(w)        # gradient g has the same shape as w
update!(w, g)               # update w in-place with Adam()

w = rand(d)                 # an individual weight array
g = lossgradient1(w)        # gradient g has the same shape as w
update!(w, g)               # update w in-place with SGD()
update!(w, g; lr=0.1)       # update w in-place with SGD(lr=0.1)
update!(w, g, SGD(lr=0.1))  # update w in-place with SGD(lr=0.1)

w = (rand(d1), rand(d2))    # a tuple of weight arrays
g = lossgradient2(w)        # g will also be a tuple
p = (Adam(), SGD())         # p has optimizers for each w[i]
update!(w, g, p)            # update each w[i] in-place with g[i],p[i]

w = Any[rand(d1), rand(d2)] # any iterator can be used
g = lossgradient3(w)        # g will be similar to w
p = Any[Adam(), SGD()]      # p should be an iterator of same length
update!(w, g, p)            # update each w[i] in-place with g[i],p[i]

w = Dict(:a => rand(d1), :b => rand(d2)) # dictionaries can be used
g = lossgradient4(w)        # g will be a matching dictionary
p = Dict(:a => Adam(), :b => SGD())
update!(w, g, p)``````
``````SGD(;lr=0.1,gclip=0)
update!(w,g,p::SGD)
update!(w,g;lr=0.1)``````

Container for parameters of the Stochastic gradient descent (SGD) optimization algorithm used by `update!`.

SGD is an optimization technique to minimize an objective function by updating its weights in the opposite direction of their gradient. The learning rate (lr) determines the size of the step. SGD updates the weights with the following formula:

``w = w - lr * g``

where `w` is a weight array, `g` is the gradient of the loss function w.r.t `w` and `lr` is the learning rate.

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`. If `gclip==0` no scaling takes place.
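A minimal sketch of the SGD step with optional gradient clipping, written in plain Julia. `sgd_step!` is a hypothetical name, not Knet's `update!`, and unlike the library it leaves the caller's `g` untouched.

```julia
using LinearAlgebra   # for norm

# Sketch of an SGD step with optional gradient clipping;
# `sgd_step!` is a hypothetical name, not Knet's `update!`.
function sgd_step!(w, g; lr=0.1, gclip=0)
    gnorm = norm(g)
    if gnorm > gclip > 0
        g = g .* (gclip / gnorm)    # rescaled copy so that norm(g) == gclip
    end
    w .-= lr .* g                   # w = w - lr * g, in place
    return w
end

w = [1.0, 2.0]
g = [3.0, 4.0]                           # norm(g) == 5
w1 = sgd_step!(copy(w), g)               # no clipping
w2 = sgd_step!(copy(w), g; gclip=1)      # g is scaled to [0.6, 0.8] first
```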

SGD is used by default if no algorithm is specified in the two argument version of `update!`.

``````Momentum(;lr=0.05, gclip=0, gamma=0.95)
update!(w,g,p::Momentum)``````

Container for parameters of the Momentum optimization algorithm used by `update!`.

The Momentum method tries to accelerate SGD by adding a velocity term to the update. This also decreases the oscillation between successive steps. It updates the weights with the following formulas:

``````velocity = gamma * velocity + lr * g
w = w - velocity``````

where `w` is a weight array, `g` is the gradient of the objective function w.r.t `w`, `lr` is the learning rate, `gamma` is the momentum parameter, `velocity` is an array with the same size and type of `w` and holds the accelerated gradients.

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`. If `gclip==0` no scaling takes place.
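The two formulas can be sketched in plain Julia as follows. `momentum_step!` is a hypothetical name, the caller is assumed to keep the `velocity` array between calls, and gradient clipping is left out for brevity.

```julia
# Sketch of the Momentum update; `momentum_step!` is a hypothetical name.
function momentum_step!(w, g, velocity; lr=0.05, gamma=0.95)
    velocity .= gamma .* velocity .+ lr .* g
    w .-= velocity
    return w
end

w = [1.0]; v = zero(w)
for _ in 1:3
    momentum_step!(w, [1.0], v; lr=0.1, gamma=0.5)
    println((w[1], v[1]))    # the step grows while the gradient keeps its sign
end
```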

Reference: Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks: The Official Journal of the International Neural Network Society, 12(1), 145–151.

``````Nesterov(; lr=0.05, gclip=0, gamma=0.95)
update!(w,g,p::Nesterov)``````

Container for parameters of Nesterov's momentum optimization algorithm used by `update!`.

It is similar to standard `Momentum` but with a slightly different update rule:

``````velocity = gamma * velocity_old - lr * g
w = w_old - velocity_old + (1+gamma) * velocity``````

where `w` is a weight array, `g` is the gradient of the objective function w.r.t `w`, `lr` is the learning rate, `gamma` is the momentum parameter, `velocity` is an array with the same size and type of `w` and holds the accelerated gradients.

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`. If `gclip == 0` no scaling takes place.

Reference Implementation: Yoshua Bengio, Nicolas Boulanger-Lewandowski and Razvan Pascanu

``````Adagrad(;lr=0.05, gclip=0, eps=1e-6)
update!(w,g,p::Adagrad)``````

Container for parameters of the Adagrad optimization algorithm used by `update!`.

Adagrad is one of the methods that adapts the learning rate to each of the weights. It stores the sum of the squares of the gradients to scale the learning rate. The learning rate is adapted for each weight by the value of current gradient divided by the accumulated gradients. Hence, the learning rate is greater for the parameters where the accumulated gradients are small and the learning rate is small if the accumulated gradients are large. It updates the weights with the following formulas:

``````G = G + g .^ 2
w = w - g .* lr ./ sqrt(G + eps)``````

where `w` is the weight, `g` is the gradient of the objective function w.r.t `w`, `lr` is the learning rate, `G` is an array with the same size and type of `w` and holds the sum of the squares of the gradients. `eps` is a small constant to prevent a zero value in the denominator.

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`. If `gclip==0` no scaling takes place.
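A plain Julia sketch of the Adagrad formulas, with explicit broadcasting. `adagrad_step!` is a hypothetical name, the caller is assumed to keep the accumulator `G` between calls, and gradient clipping is omitted.

```julia
# Sketch of the Adagrad update; `adagrad_step!` is a hypothetical name.
function adagrad_step!(w, g, G; lr=0.05, eps=1e-6)
    G .+= g .^ 2                            # accumulate squared gradients
    w .-= g .* lr ./ sqrt.(G .+ eps)        # per-weight scaled step
    return w
end

w = [0.0, 0.0]; G = zero(w)
g = [10.0, 0.1]                  # very different gradient scales
adagrad_step!(w, g, G)
w                                # both entries move by about lr on the first step
```

On the first step the accumulated value equals the squared gradient, so the step size is close to `lr` for every weight regardless of the gradient's magnitude.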

Reference: Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.

``````Rmsprop(;lr=0.01, gclip=0, rho=0.9, eps=1e-6)
update!(w,g,p::Rmsprop)``````

Container for parameters of the Rmsprop optimization algorithm used by `update!`.

Rmsprop scales the learning rates by dividing by the root mean square of the gradients. It updates the weights with the following formulas:

``````G = (1-rho) * g .^ 2 + rho * G
w = w - lr * g ./ sqrt(G + eps)``````

where `w` is the weight, `g` is the gradient of the objective function w.r.t `w`, `lr` is the learning rate, `G` is an array with the same size and type as `w` that holds a running average of the squared gradients, `eps` is a small constant to prevent a zero value in the denominator, and `rho` is the decay rate of the running average.

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`. If `gclip==0` no scaling takes place.
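A plain Julia sketch of the Rmsprop formulas. `rmsprop_step!` is a hypothetical name, the caller is assumed to keep `G` between calls, and gradient clipping is omitted.

```julia
# Sketch of the Rmsprop update; `rmsprop_step!` is a hypothetical name.
function rmsprop_step!(w, g, G; lr=0.01, rho=0.9, eps=1e-6)
    G .= (1 - rho) .* g .^ 2 .+ rho .* G    # running average of squared gradients
    w .-= lr .* g ./ sqrt.(G .+ eps)
    return w
end

w = [1.0]; G = zero(w)
for _ in 1:200
    rmsprop_step!(w, [2.0], G)
end
G          # approaches g^2 = 4 for a constant gradient, so steps approach lr
```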

Reference: Tijmen Tieleman and Geoffrey Hinton (2012). "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural Networks for Machine Learning 4.2.

``````Adadelta(;lr=1.0, gclip=0, rho=0.9, eps=1e-6)
update!(w,g,p::Adadelta)``````

Container for parameters of the Adadelta optimization algorithm used by `update!`.

Adadelta is an extension of Adagrad that tries to prevent the decrease of the learning rates to zero as training progresses. It scales the learning rate based on the accumulated gradients like Adagrad and holds the acceleration term like Momentum. It updates the weights with the following formulas:

``````G = (1-rho) * g .^ 2 + rho * G
update = g .* sqrt(delta + eps) ./ sqrt(G + eps)
w = w - lr * update
delta = rho * delta + (1-rho) * update .^ 2``````

where `w` is the weight, `g` is the gradient of the objective function w.r.t `w`, `lr` is the learning rate, `G` is an array with the same size and type as `w` that holds a running average of the squared gradients. `eps` is a small constant to prevent a zero value in the denominator. `rho` is the decay rate of both running averages and `delta` is an array with the same size and type as `w` that holds a running average of the squared updates.

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`. If `gclip==0` no scaling takes place.
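A plain Julia sketch of the four Adadelta formulas. `adadelta_step!` is a hypothetical name, the caller is assumed to keep `G` and `delta` between calls, and gradient clipping is omitted.

```julia
# Sketch of the Adadelta update; `adadelta_step!` is a hypothetical name.
function adadelta_step!(w, g, G, delta; lr=1.0, rho=0.9, eps=1e-6)
    G .= (1 - rho) .* g .^ 2 .+ rho .* G
    update = g .* sqrt.(delta .+ eps) ./ sqrt.(G .+ eps)
    w .-= lr .* update
    delta .= rho .* delta .+ (1 - rho) .* update .^ 2
    return w
end

w = [1.0]; G = zero(w); delta = zero(w)
adadelta_step!(w, [2.0], G, delta)
w          # the first step is small because `delta` starts at zero
```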

``````Adam(;lr=0.001, gclip=0, beta1=0.9, beta2=0.999, eps=1e-8)
update!(w,g,p::Adam)``````

Container for parameters of the Adam optimization algorithm used by `update!`.

Adam is one of the methods that compute adaptive learning rates. It stores the accumulated gradients (first moment) and the sum of the squares of the gradients (second moment). It rescales the first and second moments as a function of time. Here are the update formulas:

``````m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g .* g
mhat = m ./ (1 - beta1 ^ t)
vhat = v ./ (1 - beta2 ^ t)
w = w - (lr / (sqrt(vhat) + eps)) * mhat``````

where `w` is the weight, `g` is the gradient of the objective function w.r.t `w`, `lr` is the learning rate, `m` is an array with the same size and type of `w` and holds the accumulated gradients. `v` is an array with the same size and type of `w` and holds the sum of the squares of the gradients. `eps` is a small constant to prevent a zero denominator. `beta1` and `beta2` are the parameters to calculate bias corrected first and second moments. `t` is the update count.

If `norm(g) > gclip > 0`, `g` is scaled so that its norm is equal to `gclip`. If `gclip==0` no scaling takes place.
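A plain Julia sketch of the Adam formulas. `adam_step!` is a hypothetical name, the caller is assumed to keep `m` and `v` between calls and to pass the update count `t`, and gradient clipping is omitted.

```julia
# Sketch of the Adam update; `adam_step!` is a hypothetical name and
# the caller is assumed to track the update count `t`.
function adam_step!(w, g, m, v, t; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8)
    m .= beta1 .* m .+ (1 - beta1) .* g
    v .= beta2 .* v .+ (1 - beta2) .* g .* g
    mhat = m ./ (1 - beta1^t)       # bias-corrected first moment
    vhat = v ./ (1 - beta2^t)       # bias-corrected second moment
    w .-= (lr ./ (sqrt.(vhat) .+ eps)) .* mhat
    return w
end

w = [1.0, 1.0]; m = zero(w); v = zero(w)
adam_step!(w, [0.5, -20.0], m, v, 1)
w       # the first step is close to lr in magnitude for every entry
```

With `t = 1` the bias correction gives `mhat == g` and `vhat == g .* g`, so the first step has magnitude close to `lr` whatever the scale of the gradient.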

Reference: Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.
