Reference
AutoGrad
Knet.AutoGrad — Module
Usage:
x = Param([1,2,3]) # The user declares parameters with `Param`
y = @diff sum(x .* x) # computes gradients using `@diff`
grad(y,x) => [2,4,6] # looks up the gradient of a parameter with `grad`
Param(x) returns a struct that acts like x but marks it as a parameter you want to compute gradients with respect to.
@diff expr evaluates an expression and returns a struct that contains its value (which should be a scalar) and gradients with respect to the Params used in the computation.
grad(y, x) returns the gradient of a @diff result y with respect to any parameter x::Param (nothing may be returned if the gradient is 0).
value(x) returns the value associated with x if x is a Param or the output of @diff, otherwise returns x.
params(x) returns an iterator of Params found by a recursive search of object x, which is typically a model or a @diff result.
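As a brief illustration, here is a minimal training-step sketch that combines these primitives with update! from the Model optimization section below; the model, loss and data names are hypothetical and chosen only for the example:
using Knet
w = Param(randn(3,5), SGD(lr=0.1))    # Params with their optimizer attached, as in the update! examples
b = Param(zeros(3), SGD(lr=0.1))
predict(x) = w * x .+ b               # a toy linear model
loss(x, y) = sum(abs2.(predict(x) .- y))
x, y = randn(5,10), randn(3,10)       # fake data for the example
J = @diff loss(x, y)                  # record the computation and return its result
for p in params(J)                    # Params discovered in the recorded computation
    g = grad(J, p)                    # gradient of the loss with respect to p
    g === nothing || update!(p, g)    # apply each Param's own optimizer; skip absent gradients
end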
Alternative usage:
x = [1 2 3]
f(x) = sum(x .* x)
f(x) => 14
grad(f)(x) => [2 4 6]
gradloss(f)(x) => ([2 4 6], 14)
Given a scalar-valued function f, grad(f,argnum=1) returns another function g which takes the same inputs as f and returns the gradient of the output with respect to the argnum'th argument. gradloss is similar, except the resulting function also returns f's output.
KnetArray
Knet.KnetArrays.KnetArray — Type
KnetArray{T}(undef,dims)
KnetArray(a::AbstractArray)
Array(k::KnetArray)
Container for GPU arrays that supports most of the AbstractArray interface. The constructor allocates a KnetArray in the currently active device, as specified by CUDA.device(). KnetArrays and Arrays can be converted to each other as shown above, which involves copying to and from GPU memory. Only Float32/64 KnetArrays are fully supported.
KnetArrays use the CUDA.jl package for allocation and some operations. Currently the custom CUDA kernels that implement elementwise, broadcasting, and reduction operations for KnetArrays are faster than their CUDA.jl counterparts. Once these are improved in CUDA.jl, KnetArrays will be retired.
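For example, moving data between the CPU and the GPU looks like this (a small sketch; it requires a functional CUDA device):
using Knet
a = rand(Float32, 3, 4)               # ordinary Array in CPU memory
k = KnetArray(a)                      # copy to GPU memory on the active device
b = Array(k)                          # copy back to the CPU
a == b                                # => true
k2 = KnetArray{Float32}(undef, 3, 4)  # uninitialized GPU allocation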
Supported functions:
Indexing: getindex, setindex! with the following index types:
- 1-D: Real, Colon, OrdinalRange, AbstractArray{Real}, AbstractArray{Bool}, CartesianIndex, AbstractArray{CartesianIndex}, EmptyArray, KnetArray{Int32} (low level), KnetArray{0/1} (using float for BitArray) (1-D includes linear indexing of multidimensional arrays)
- 2-D: (Colon,Union{Real,Colon,OrdinalRange,AbstractVector{Real},AbstractVector{Bool},KnetVector{Int32}}), (Union{Real,AbstractUnitRange,Colon}...) (in any order)
- N-D: (Real...)
Array operations: ==, !=, adjoint, argmax, argmin, cat, convert, copy, copyto!, deepcopy, display, eachindex, eltype, endof, fill!, findmax, findmin, first, hcat, isapprox, isempty, length, ndims, one, ones, permutedims, pointer, rand!, randn!, reshape, similar, size, stride, strides, summary, transpose, vcat, vec, zero. (Boolean operators generate outputs with same type as inputs; no support for KnetArray{Bool}.)
Unary functions with broadcasting: -, abs, abs2, acos, acosh, asin, asinh, atan, atanh, cbrt, ceil, cos, cosh, cospi, digamma, erf, erfc, erfcinv, erfcx, erfinv, exp, exp10, exp2, expm1, floor, gamma, lgamma, log, log10, log1p, log2, loggamma, one, round, sign, sin, sinh, sinpi, sqrt, tan, tanh, trigamma, trunc, zero
Binary functions with broadcasting: !=, *, +, -, /, <, <=, ==, >, >=, ^, max, min
Reduction operators: maximum, minimum, prod, sum
Statistics: mean, std, stdm, var, varm
Linear algebra: (*), axpy!, lmul!, norm, rmul!
Knet extras: batchnorm, bce, bmm, cat1d, conv4, cpucopy, deconv4, dropout, elu, gpucopy, logistic, logp, logsoftmax, logsumexp, mat, nll, pool, relu, RNN, selu, sigm, softmax, unpool (Only 4D/5D, Float32/64 KnetArrays support conv4, pool, deconv4, unpool)
File I/O
Missing docstrings for Knet.save, Knet.load, Knet.@save and Knet.@load.
Parameter initialization
Knet.Train20.param — Function
param(array; atype)
param(dims...; init, atype)
param0(dims...; atype)
The first form returns Param(atype(array)).
The second form returns a randomly initialized Param(atype(init(dims...))).
The third form, param0, is an alias for param(dims...; init=zeros).
By default, init is xavier_uniform and atype is Knet.atype().
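For example (a small sketch; the shapes are arbitrary):
using Knet
w = param(10, 5)        # 10×5 weights with xavier_uniform init and the default atype
b = param0(10)          # 10-element bias initialized to zeros
v = param([1.0 2.0; 3.0 4.0]; atype=Array)   # wrap an existing array as a Param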
Knet.Train20.xavier — Function
xavier_uniform(a...; gain=1)
xavier(a...; gain=1)
Return uniform random weights in the range ± gain * sqrt(6 / (fanin + fanout)). The a arguments are passed to rand to specify type and dimensions. See (Glorot and Bengio 2010) or the PyTorch docs for a description; the function implements equation (16) of the referenced paper. Also known as Glorot initialization. The function xavier is an alias for xavier_uniform. See also xavier_normal.
Knet.Train20.xavier_uniform — Function
xavier_uniform(a...; gain=1)
xavier(a...; gain=1)
Return uniform random weights in the range ± gain * sqrt(6 / (fanin + fanout)). The a arguments are passed to rand to specify type and dimensions. See (Glorot and Bengio 2010) or the PyTorch docs for a description; the function implements equation (16) of the referenced paper. Also known as Glorot initialization. The function xavier is an alias for xavier_uniform. See also xavier_normal.
Knet.Train20.xavier_normal — Function
xavier_normal(a...; gain=1)
Return normally distributed random weights with mean 0 and std gain * sqrt(2 / (fanin + fanout)). The a arguments are passed to rand. See (Glorot and Bengio 2010) and the PyTorch docs for a description. Also known as Glorot initialization. See also xavier_uniform.
Knet.Train20.gaussian — Function
gaussian(a...; mean=0.0, std=0.01)
Return a Gaussian array with the given mean and standard deviation. The a arguments are passed to randn.
Knet.Train20.bilinear — Function
Bilinear interpolation filter weights; used for initializing deconvolution layers. Adapted from https://github.com/shelhamer/fcn.berkeleyvision.org/blob/master/surgery.py#L33
Arguments:
- T: data type
- fw: width upscale factor
- fh: height upscale factor
- IN: number of input filters
- ON: number of output filters
Example usage:
w = bilinear(Float32,2,2,128,128)
Activation functions
Knet.Ops20.elu — Function
elu(x)
Return (x > 0 ? x : exp(x)-1).
Reference: Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs) (https://arxiv.org/abs/1511.07289).
Knet.Ops20.relu — Function
relu(x)
Return max(0,x).
References:
- Nair and Hinton, 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. ICML.
- Glorot, Bordes and Bengio, 2011. Deep Sparse Rectifier Neural Networks. AISTATS.
Knet.Ops20.selu — Function
selu(x)
Return λ01 * (x > 0 ? x : α01 * (exp(x)-1)) where λ01=1.0507009873554805 and α01=1.6732632423543778.
Reference: Self-Normalizing Neural Networks (https://arxiv.org/abs/1706.02515).
Knet.Ops20.sigm — Function
sigm(x)
Return 1/(1+exp(-x)).
Reference: Numerically stable sigm implementation from http://timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick.
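These are scalar functions; on arrays they are applied elementwise via broadcasting, e.g. (a small sketch):
using Knet
x = randn(Float32, 5)
relu.(x)                # elementwise max(0,x)
sigm.(x)                # elementwise 1/(1+exp(-x))
elu.(x); selu.(x)       # elementwise ELU and SELU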
Loss functions
Knet.Ops20.accuracy — Function
accuracy(scores, labels; dims=1, average=true)
Given an unnormalized scores matrix and an Integer array of correct labels, return the ratio of instances where the correct label has the maximum score. dims=1 means instances are in columns, dims=2 means instances are in rows. Use average=false to return the pair (ncorrect,count) instead of the ratio (ncorrect/count). The valid labels should be integers in the range 1:numclasses; if labels[i] == 0, instance i is skipped.
accuracy(model; data, dims=1, average=true, o...)
Compute the number of correct predictions of a model over a dataset:
accuracy(model(inputs; kwargs...), labels; dims) for (inputs,labels) in data
and return (ncorrect/count) if average=true or (ncorrect,count) if average=false, where count is the number of instances not skipped (instances with label==0 are skipped) and ncorrect is the number of them correctly labeled by the model.
The model should be a function returning scores given inputs, and data should be an iterable of (inputs,labels) pairs. The valid labels should be integers in the range 1:numclasses; if labels[i] == 0, instance i is skipped.
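For instance, with three classes and two instances stored in columns (a toy example):
using Knet
scores = [ 3.0  0.1;
           1.0  2.5;
          -1.0  0.2]                      # (classes, instances) with dims=1
labels = [1, 2]                           # correct class for each instance
accuracy(scores, labels)                  # => 1.0, both maxima match the labels
accuracy(scores, labels; average=false)   # => (2, 2)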
Knet.Ops20.bce — Function
bce(scores, labels; average=true)
Computes binary cross entropy loss given predicted unnormalized scores and answer labels for a binary prediction task. Label values should be in {0,1}. Scores are unrestricted and will be converted to probabilities using
probs = 1 ./ (1 .+ exp.(-scores))
The loss calculated is
-(labels .* log.(probs) .+ (1 .- labels) .* log.(1 .- probs))
The return value is (total/count) if average=true and (total,count) if average=false, where count is the number of instances and total is their total loss.
See also logistic which computes the same loss with {-1,1} labels.
Reference: https://towardsdatascience.com/nothing-but-numpy-understanding-creating-binary-classification-neural-networks-with-e746423c8d5c
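A small numeric sketch of the definition above (the scores and labels are made up):
using Knet
scores = [2.0, -1.0, 0.5]          # unnormalized scores for 3 instances
labels = [1, 0, 1]                 # binary answers in {0,1}
bce(scores, labels)                # mean binary cross entropy
# the same value computed from the formulas above:
probs = 1 ./ (1 .+ exp.(-scores))
sum(-(labels .* log.(probs) .+ (1 .- labels) .* log.(1 .- probs))) / length(scores)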
Knet.Ops20.logistic — Function
logistic(scores, labels; average=true)
Computes logistic loss given predicted unnormalized scores and answer labels for a binary prediction task.
log.(1 .+ exp.(-labels .* scores))
Label values should be {-1,1}. Scores are unrestricted. The return value is (total/count) if average=true and (total,count) if average=false, where count is the number of instances and total is their total loss.
See also bce which computes the same loss with {0,1} labels.
Reference: https://towardsdatascience.com/nothing-but-numpy-understanding-creating-binary-classification-neural-networks-with-e746423c8d5c
Knet.Ops20.logp — Function
softmax(x; dims=:)
logsoftmax(x; dims=:)
Treat entries in x as unnormalized log probabilities and return normalized (log) probabilities, i.e.
softmax(x; dims) = exp.(x) ./ sum(exp.(x); dims=dims)
logsoftmax(x; dims) = x .- log.(sum(exp.(x); dims=dims))
For numerical stability x = x .- maximum(x, dims=dims) is performed before exponentiation.
dims is an optional argument; if not specified the normalization is over the whole x, otherwise the normalization is performed over the given dimensions. In particular, if x is a matrix, dims=1 normalizes columns of x and dims=2 normalizes rows of x.
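For example, normalizing the columns of a matrix (a small sketch):
using Knet
x = [1.0 2.0; 3.0 4.0]
softmax(x; dims=1)               # each column sums to 1
sum(softmax(x; dims=1), dims=1)  # => [1.0 1.0]
logsoftmax(x; dims=1)            # log of the above, computed stably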
Knet.Ops20.logsoftmax — Function
softmax(x; dims=:)
logsoftmax(x; dims=:)
Treat entries in x as unnormalized log probabilities and return normalized (log) probabilities, i.e.
softmax(x; dims) = exp.(x) ./ sum(exp.(x); dims=dims)
logsoftmax(x; dims) = x .- log.(sum(exp.(x); dims=dims))
For numerical stability x = x .- maximum(x, dims=dims) is performed before exponentiation.
dims is an optional argument; if not specified the normalization is over the whole x, otherwise the normalization is performed over the given dimensions. In particular, if x is a matrix, dims=1 normalizes columns of x and dims=2 normalizes rows of x.
Knet.Ops20.logsumexp — Function
logsumexp(x;dims=:)
Compute log(sum(exp(x);dims)) in a numerically stable manner.
dims is an optional argument; if not specified the summation is over the whole x, otherwise the summation is performed over the given dimensions. In particular, if x is a matrix, dims=1 sums columns of x and dims=2 sums rows of x.
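A quick illustration of the numerical stability (made-up values):
using Knet
x = [1000.0, 1000.0]
log(sum(exp.(x)))    # => Inf, the naive computation overflows
logsumexp(x)         # => 1000.6931..., computed stably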
Knet.Ops20.nll — Function
nll(scores, labels; dims=1, average=true)
Return the negative log likelihood for a single batch of data given an unnormalized scores matrix and an Integer array of correct labels. The scores matrix should have size (classes,instances) if dims=1 or (instances,classes) if dims=2. labels[i] should be in 1:classes to indicate the correct class for instance i, or 0 to skip instance i.
The return value is (total/count) if average=true and (total,count) if average=false, where count is the number of instances not skipped (i.e. label != 0) and total is their total negative log likelihood.
Example
Let's assume that there are three classes (cat, dog, ostrich) and just two instances with the unnormalized scores scores[:,1] and scores[:,2], respectively. The first instance is actually a cat and the second instance a dog:
scores = [12.2   0.3;
           2.0  21.5;
           0.0 -21.0]
labels = [1, 2]
nll(scores,labels)
# returns 2.1657e-5
The probabilities are derived from the scores and the negative log-probabilities corresponding to the labels are averaged:
probabilities = exp.(scores) ./ sum(exp.(scores), dims=1)
-(log(probabilities[labels[1],1]) + log(probabilities[labels[2],2]))/2
# returns 2.1657e-5
nll(model; data, dims=1, average=true, o...)
Compute the negative log likelihood for a model over a dataset:
nll(model(inputs; kwargs...), labels; dims) for (inputs,labels) in data
and return (total/count) if average=true or (total,count) if average=false, where count is the number of instances not skipped (instances with label==0 are skipped) and total is their total negative log likelihood.
The model should be a function returning scores given inputs, and data should be an iterable of (inputs,labels) pairs. The valid labels should be integers in the range 1:numclasses; if labels[i] == 0, instance i is skipped.
Knet.Ops20.softmax — Function
softmax(x; dims=:)
logsoftmax(x; dims=:)
Treat entries in x as unnormalized log probabilities and return normalized (log) probabilities, i.e.
softmax(x; dims) = exp.(x) ./ sum(exp.(x); dims=dims)
logsoftmax(x; dims) = x .- log.(sum(exp.(x); dims=dims))
For numerical stability x = x .- maximum(x, dims=dims) is performed before exponentiation.
dims is an optional argument; if not specified the normalization is over the whole x, otherwise the normalization is performed over the given dimensions. In particular, if x is a matrix, dims=1 normalizes columns of x and dims=2 normalizes rows of x.
Knet.Ops20.zeroone — Function
zeroone loss is equal to 1 - accuracy
Convolution and Pooling
Knet.Ops20.conv4 — Function
conv4(w, x; kwargs...)
Execute convolutions or cross-correlations using filters specified with w over tensor x.
If w has dimensions (W1,W2,...,Cx,Cy) and x has dimensions (X1,X2,...,Cx,N), the result y will have dimensions (Y1,Y2,...,Cy,N) where Cx is the number of input channels, Cy is the number of output channels, N is the number of instances, and Wi,Xi,Yi are spatial dimensions with Yi determined by:
Yi = 1 + floor((Xi + 2*padding[i] - ((Wi-1)*dilation[i] + 1)) / stride[i])
padding, stride and dilation are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.
Keywords
- padding=0: the number of extra zeros implicitly concatenated at the start and end of each dimension.
- stride=1: the number of elements to slide to reach the next filtering window.
- dilation=1: dilation factor for each dimension.
- mode=0: 0 for convolution (which flips the filter) and 1 for cross-correlation.
- alpha=1: can be used to scale the result.
- group=1: can be used to perform grouped convolutions.
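For example, the output shapes for a small 2-D convolution (a CPU sketch; on a GPU use KnetArrays):
using Knet
w = randn(Float32, 3, 3, 1, 8)       # 3×3 filters, 1 input channel, 8 output channels
x = randn(Float32, 28, 28, 1, 16)    # 28×28 inputs, 1 channel, batch of 16
y  = conv4(w, x)                     # size(y)  == (26, 26, 8, 16)
yp = conv4(w, x; padding=1)          # size(yp) == (28, 28, 8, 16), "same" padding
ys = conv4(w, x; stride=2)           # size(ys) == (13, 13, 8, 16)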
Knet.Ops20.deconv4 — Function
deconv4(w, x; kwargs...)
Simulate 4-D deconvolution by using the transposed convolution operation. Its forward pass is equivalent to the backward pass of a convolution (gradients with respect to the input tensor). Likewise, its backward pass (gradients with respect to the input tensor) is equivalent to the forward pass of a convolution. Since it swaps the forward and backward passes of the convolution operation, the padding and stride options belong to the output tensor. See this report for further explanation.
If w has dimensions (W1,W2,...,Cy,Cx) and x has dimensions (X1,X2,...,Cx,N), the result y=deconv4(w,x) will have dimensions (Y1,Y2,...,Cy,N) where
Yi = (Xi - 1)*stride[i] + ((Wi-1)*dilation[i] + 1) - 2*padding[i]
Here Cx is the number of x channels, Cy is the number of y channels, N is the number of instances, and Wi,Xi,Yi are spatial dimensions. Padding and stride are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.
Keywords
- padding=0: the number of extra zeros implicitly concatenated at the start and at the end of each dimension.
- stride=1: the number of elements to slide to reach the next filtering window.
- mode=0: 0 for convolution and 1 for cross-correlation.
- alpha=1: can be used to scale the result.
- handle: handle to a previously created cuDNN context. Defaults to a Knet allocated handle.
- group=1: can be used to perform grouped convolutions.
Knet.Ops20.pool — Function
pool(x; kwargs...)
Compute pooling of input values (i.e., the maximum or average of several adjacent values) to produce an output with smaller height and/or width.
If x has dimensions (X1,X2,...,Cx,N), the result y will have dimensions (Y1,Y2,...,Cx,N) where
Yi = 1 + floor((Xi + 2*padding[i] - window[i]) / stride[i])
Here Cx is the number of input channels, N is the number of instances, and Xi,Yi are spatial dimensions. window, padding and stride are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.
Keywords:
- window=2: the pooling window size for each dimension.
- padding=0: the number of extra zeros implicitly concatenated at the start and at the end of each dimension.
- stride=window: the number of elements to slide to reach the next pooling window.
- mode=0: 0 for max, 1 for average including padded values, 2 for average excluding padded values, 3 for deterministic max.
- maxpoolingNanOpt=1: NaN values are not propagated if 0, they are propagated if 1.
- alpha=1: can be used to scale the result.
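For example (a CPU sketch; on a GPU use KnetArrays):
using Knet
x = randn(Float32, 28, 28, 8, 16)
y  = pool(x)                      # 2×2 max pooling: size(y)  == (14, 14, 8, 16)
ya = pool(x; mode=1)              # average pooling with the same window
yw = pool(x; window=4, stride=4)  # size(yw) == (7, 7, 8, 16)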
Knet.Ops20.unpool — Function
unpool(x; o...)
Perform the reverse of pooling: x == pool(unpool(x;o...); o...)
Recurrent neural networks
Knet.Ops20.RNN — Type
rnn = RNN(inputSize, hiddenSize; opts...)
rnn(x; batchSizes) => y
rnn.h, rnn.c # hidden and cell states
RNN returns a callable RNN object rnn. Given a minibatch of sequences x, rnn(x) returns y, the hidden states of the final layer for each time step. The rnn.h and rnn.c fields can be used to set the initial hidden states and read the final hidden states of all layers. Note that the final time step of y always contains the final hidden state of the last layer, equivalent to rnn.h for a single layer network.
Dimensions: The input x can be 1, 2, or 3 dimensional and y will have the same number of dimensions as x. size(x)=(X,[B,T]) and size(y)=(H/2H,[B,T]) where X is inputSize, B is batchSize, T is seqLength, H is hiddenSize, and 2H is for bidirectional RNNs. By default a 1-D x represents a single instance for a single time step, a 2-D x represents a single minibatch for a single time step, and a 3-D x represents a sequence of identically sized minibatches for multiple time steps. The output y gives the hidden state (of the final layer for multi-layer RNNs) for each time step. The fields rnn.h and rnn.c represent the hidden states of all layers in a single time step and have size (H,B,L/2L) where L is numLayers and 2L is for bidirectional RNNs.
batchSizes: If batchSizes=nothing (default), all sequences in a minibatch are assumed to be the same length. If batchSizes is an array of (non-increasing) integers, it gives the batch size for each time step (allowing different sequences in the minibatch to have different lengths). In this case x will typically be 2-D with the second dimension representing variable size batches for time steps. If batchSizes is used, sum(batchSizes) should equal length(x) ÷ size(x,1). When the batch size is different in every time step, hidden states will have size (H,B,L/2L) where B is always the size of the first (largest) minibatch.
Hidden states: The hidden and cell states are kept in the rnn.h and rnn.c fields (the cell state is only used by LSTM). They can be initialized during construction using the h and c keyword arguments, or modified later by direct assignment. Valid values are nothing (default), 0, or an array of the right type and size, possibly wrapped in a Param. If the value is nothing the initial state is assumed to be zero and the final state is discarded, keeping the value nothing. If the value is 0 the initial state is assumed to be zero and 0 is replaced by the final state on return. If the value is a valid state, it is used as the initial state and is replaced by the final state on return.
In a differentiation context the returned final hidden states will be wrapped in Result types. This is necessary if the same RNN object is to be called multiple times in a single iteration. Between iterations (i.e. after diff/update) the hidden states need to be unboxed, e.g. with rnn.h = value(rnn.h), to prevent spurious dependencies. This happens automatically during the backward pass for GPU RNNs but needs to be done manually for CPU RNNs. See the CharLM Tutorial for an example.
Keyword arguments for RNN:
- h=nothing: Initial hidden state.
- c=nothing: Initial cell state.
- rnnType=:lstm: Type of RNN: one of :relu, :tanh, :lstm, :gru.
- numLayers=1: Number of RNN layers.
- bidirectional=false: Create a bidirectional RNN if true.
- dropout=0: Dropout probability. Applied to input and between layers.
- skipInput=false: Do not multiply the input with a matrix if true.
- algo=0: Algorithm to use, see CUDNN docs for details.
- seed=0: Random number seed for dropout. Uses time() if 0.
- winit=xavier: Weight initialization method for matrices.
- binit=zeros: Weight initialization method for bias vectors.
- finit=ones: Weight initialization method for the bias of forget gates.
- atype=Knet.atype(): Array type for model weights.
Formulas: RNNs compute the output h[t] for a given iteration from the recurrent input h[t-1] and the previous layer input x[t] given matrices W, R and biases bW, bR from the following equations:
:relu and :tanh: Single gate RNN with activation function f:
h[t] = f(W * x[t] .+ R * h[t-1] .+ bW .+ bR)
:gru: Gated recurrent unit:
i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
r[t] = sigm(Wr * x[t] .+ Rr * h[t-1] .+ bWr .+ bRr) # reset gate
n[t] = tanh(Wn * x[t] .+ r[t] .* (Rn * h[t-1] .+ bRn) .+ bWn) # new gate
h[t] = (1 - i[t]) .* n[t] .+ i[t] .* h[t-1]
:lstm: Long short term memory unit with no peephole connections:
i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
f[t] = sigm(Wf * x[t] .+ Rf * h[t-1] .+ bWf .+ bRf) # forget gate
o[t] = sigm(Wo * x[t] .+ Ro * h[t-1] .+ bWo .+ bRo) # output gate
n[t] = tanh(Wn * x[t] .+ Rn * h[t-1] .+ bWn .+ bRn) # new gate
c[t] = f[t] .* c[t-1] .+ i[t] .* n[t] # cell output
h[t] = o[t] .* tanh(c[t])
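For example, a minimal forward pass might look like this (a CPU-oriented sketch; sizes are arbitrary, and on a GPU the input should be a KnetArray matching atype):
using Knet
rnn = RNN(100, 64)                 # LSTM with inputSize=100, hiddenSize=64
rnn.h, rnn.c = 0, 0                # ask for the final hidden/cell states to be kept
x = randn(Float32, 100, 32, 20)    # (inputSize, batchSize, seqLength)
y = rnn(x)                         # size(y) == (64, 32, 20): top-layer hidden state per step
size(rnn.h)                        # => (64, 32, 1): final hidden state with size (H,B,L)
rnn.h, rnn.c = value(rnn.h), value(rnn.c)   # unbox between iterations as described above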
Knet.Ops20.rnnparam — Function
rnnparam(r::RNN, layer, id, param)
Return a single weight matrix or bias vector as a slice of RNN weights.
Valid layer values:
- For unidirectional RNNs 1:numLayers
- For bidirectional RNNs 1:2*numLayers, forw and back layers alternate.
Valid id values:
- For RELU and TANH RNNs, input = 1, hidden = 2.
- For GRU reset = 1,4; update = 2,5; newmem = 3,6; 1:3 for input, 4:6 for hidden
- For LSTM inputgate = 1,5; forget = 2,6; newmem = 3,7; output = 4,8; 1:4 for input, 5:8 for hidden
Valid param values:
- Return the weight matrix (transposed!) if param==1.
- Return the bias vector if param==2.
The effect of skipInput: Let I=1 for RELU/TANH, 1:3 for GRU, 1:4 for LSTM
- For skipInput=false (default), rnnparam(r,1,I,1) is a (inputSize,hiddenSize) matrix.
- For skipInput=true, rnnparam(r,1,I,1) is nothing.
- For bidirectional, the same applies to rnnparam(r,2,I,1): the first back layer.
- The input biases (par=2) are returned even if skipInput=true.
Knet.Ops20.rnnparams — Function
rnnparams(r::RNN)
Return the RNN parameters as an Array{Any}.
The order of params returned (subject to change):
- All weight matrices come before all bias vectors.
- Matrices and biases are sorted lexically based on (layer,id).
- See @doc rnnparam for valid layer and id values.
- Input multiplying matrices are nothing if r.inputMode = 1.
Batch Normalization
Knet.Ops20.batchnorm — Function
batchnorm(x[, moments, params]; kwargs...)
Perform batch normalization on x with optional mean and variance in moments and scaling factor and bias in params. See https://arxiv.org/abs/1502.03167 for reference.
2d, 4d and 5d inputs are supported. Mean and variance are computed over dimensions (2,), (1,2,4) and (1,2,3,5) for 2d, 4d and 5d arrays, respectively.
moments stores the running mean and variance to be used at inference time. It is optional in training mode, but mandatory in test mode. Training and test modes can be controlled by the training keyword argument, which defaults to Knet.training().
params stores the optional affine parameters gamma and beta. The bnparams function can be used to initialize params.
Example
# Initialization, C is an integer
moments = bnmoments()
params = bnparams(C)
...
# size(x) -> (H, W, C, N)
y = batchnorm(x, moments, params)
# size(y) -> (H, W, C, N)
Keywords
- eps=1e-5: The epsilon parameter added to the variance to avoid division by 0.
- training=Knet.training(): When training is true, the mean and variance of x are used and the moments argument is modified if it is provided. When training is false, the mean and variance stored in the moments argument are used.
Knet.Ops20.bnmoments — Function
bnmoments(;momentum=0.1, mean=nothing, var=nothing, meaninit=zeros, varinit=ones)
Return a BNMoments object, a data structure used to store the running mean and running variance of batch normalization, with the following fields:
- momentum=0.1: A real number between 0 and 1 to be used as the scale of the last mean and variance. The existing running mean or variance is multiplied by (1-momentum).
- mean=nothing: The running mean.
- var=nothing: The running variance.
- meaninit=zeros: The function used to initialize the running mean. Should either be nothing or of the form ([eltype], dims...)->data. zeros is a good option.
- varinit=ones: The function used to initialize the running variance. Should either be nothing or ([eltype], dims...)->data. ones is a good option.
This constructor can be used to directly load moments from data. meaninit and varinit are called if mean and var are nothing. The type and size of mean and var are determined automatically from the inputs in the batchnorm calls.
Knet.Ops20.bnparams — Function
bnparams(etype, channels::Integer)
Return a single 1d array that contains both scale and bias of batchnorm, where the first half is scale and the second half is bias.
bnparams(channels) calls bnparams(Float64, channels), following Julia convention.
Model optimization
Knet.Train20.minimize — Function
minimize(func, data, optimizer=Adam(); params)
sgd (func, data; lr=0.1, gclip, params)
momentum(func, data; lr=0.05, gamma=0.95, gclip, params)
nesterov(func, data; lr=0.05, gamma=0.95, gclip, params)
adagrad (func, data; lr=0.05, eps=1e-6, gclip, params)
rmsprop (func, data; lr=0.01, rho=0.9, eps=1e-6, gclip, params)
adadelta(func, data; lr=1.0, rho=0.9, eps=1e-6, gclip, params)
adam (func, data; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, gclip, params)
Return an iterator which applies func to arguments in data, i.e. (func(args...) for args in data), and updates the parameters every iteration to minimize func. func should return a scalar value.
The common keyword argument params can be used to list the Params to be optimized. If not specified, any Param that takes part in the computation of func(args...) will be updated.
The common keyword argument gclip can be used to implement per-parameter gradient clipping. For a parameter gradient g, if norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If not specified no gradient clipping is performed.
These functions do not perform optimization, but return an iterator that can. Any function that produces values from an iterator can be used with such an object, e.g. progress!(sgd(f,d)) iterates the sgd optimizer and displays a progress bar. For convenience, appending ! to the name of the function iterates and returns nothing, i.e. sgd!(...) is equivalent to (for x in sgd(...) end).
We define optimizers as lazy iterators to have explicit control over them:
- To report progress use progress(sgd(f,d)).
- To run until convergence use converge(sgd(f,cycle(d))).
- To run multiple epochs use sgd(f,repeat(d,n)).
- To run a given number of iterations use sgd(f,take(cycle(d),n)).
- To do a task every n iterations use (task() for (i,j) in enumerate(sgd(f,d)) if i%n == 1).
These functions apply the same algorithm with the same configuration to every parameter by default. minimize takes an explicit optimizer argument; all others call minimize with an appropriate optimizer argument (see @doc update! for a list of possible optimizers). Before calling update! on a Param, minimize sets its opt field to a copy of this default optimizer if it is not already set. The opt field is used by the update! function to determine the type of update performed on that parameter. If you need finer grained control, you can set the optimizer of an individual Param by setting its opt field before calling one of these functions. They will not override the opt field if it is already set, e.g. sgd(model,data) will perform an Adam update for a parameter whose opt field is an Adam object. This also means you can stop and start the training without losing optimization state: the first call will set the opt fields and subsequent calls will not override them.
Given a parameter w and its gradient g, here are the updates applied by each optimizer:
# sgd (http://en.wikipedia.org/wiki/Stochastic_gradient_descent)
w .= w - lr * g
# momentum (http://jlmelville.github.io/mize/nesterov.html)
v .= gamma * v - lr * g
w .= w + v
# nesterov (http://jlmelville.github.io/mize/nesterov.html)
w .= w - gamma * v
v .= gamma * v - lr * g
w .= w + (1 + gamma) * v
# adagrad (http://www.jmlr.org/papers/v12/duchi11a.html)
G .= G + g .^ 2
w .= w - lr * g ./ sqrt(G + eps)
# rmsprop (http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
G .= rho * G + (1-rho) * g .^ 2
w .= w - lr * g ./ sqrt(G + eps)
# adadelta (http://arxiv.org/abs/1212.5701)
G .= rho * G + (1-rho) * g .^ 2
update = sqrt(delta + eps) .* g ./ sqrt(G + eps)
w = w - lr * update
delta = rho * delta + (1-rho) * update .^ 2
# adam (http://arxiv.org/abs/1412.6980)
v = beta1 * v + (1 - beta1) * g
G = beta2 * G + (1 - beta2) * g .^ 2
vhat = v ./ (1 - beta1 ^ t)
Ghat = G ./ (1 - beta2 ^ t)
w = w - (lr / (sqrt(Ghat) + eps)) * vhat
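Putting the pieces together, a small self-contained training sketch using the sgd iterator with minibatch and progress! might look like this (the linear model and random data are made up for the example; on a GPU, convert the data to KnetArrays or use the xtype/ytype keywords of minibatch):
using Knet
w, b = param(1, 10), param0(1)                      # parameters of a toy linear model
pred(x) = w * x .+ b
loss(x, y) = sum(abs2.(pred(x) .- y)) / size(y, 2)  # mean squared error per minibatch
xtrn = randn(Float32, 10, 1000)                     # 1000 random instances
ytrn = randn(Float32, 1, 1000)
data = minibatch(xtrn, ytrn, 100; shuffle=true)
progress!(sgd(loss, repeat(data, 10)))              # 10 epochs of SGD with a progress bar
sgd!(loss, Iterators.take(Iterators.cycle(data), 500))   # exactly 500 more iterations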
Knet.Train20.converge — Function
converge(itr; alpha=0.1)
Return an iterator which acts exactly like itr, but quits when values from itr stop decreasing. itr should produce numeric values.
It can be used to train a model with the data cycled:
progress!(converge(minimize(model,cycle(data))))
alpha controls the exponential average of values used to detect convergence. Here is how convergence is decided:
p = x - avgx
avgx = c.alpha * x + (1-c.alpha) * avgx
avgp = c.alpha * p + (1-c.alpha) * avgp
avgp > 0.0 && return nothing
converge!(...) is equivalent to (for x in converge(...) end), i.e. it iterates over the object created by converge(...) and returns nothing.
Knet.Train20.minibatch — Function
minibatch(x, [y], batchsize; shuffle, partial, xtype, ytype, xsize, ysize)
Return an iterator of minibatches [(xi,yi)...] given data tensors x, y and batchsize.
The last dimension of x and y gives the number of instances and should be equal. y is optional; if omitted a sequence of xi will be generated rather than (xi,yi) tuples. Use repeat(d,n) for multiple epochs, Iterators.take(d,n) for a partial epoch, and Iterators.cycle(d) to cycle through the data forever (this can be used with converge). If you need the iterator to continue from its last position when stopped early (e.g. by a break in a for loop), use Iterators.Stateful(d) (by default the iterator would restart from the beginning).
Keyword arguments:
- shuffle=false: Shuffle the instances every epoch.
- partial=false: If true include the last partial minibatch < batchsize.
- xtype=typeof(x): Convert xi in minibatches to this type.
- ytype=typeof(y): Convert yi in minibatches to this type.
- xsize=size(x): Convert xi in minibatches to this shape (with the last dimension adjusted for batchsize).
- ysize=size(y): Convert yi in minibatches to this shape (with the last dimension adjusted for batchsize).
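For example (a small sketch with random data):
using Knet
x = randn(Float32, 28, 28, 1, 1000)   # 1000 instances along the last dimension
y = rand(1:10, 1000)                  # integer class labels
d = minibatch(x, y, 100; shuffle=true)
length(d)                             # => 10 minibatches per epoch
for (xi, yi) in d
    # size(xi) == (28, 28, 1, 100), length(yi) == 100
end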
Knet.Train20.progress — Function
progress(msg, itr; steps, seconds, io)
progress(itr; o...) do p; [body of the msg function]; end
progress(itr; o...)
progress!(...)
Return a Progress iterator which acts exactly like itr, but prints a progress bar:
┣█████████████████▎ ┫ [86.83%, 903/1040, 01:36/01:50, 9.42i/s] 3.87835
Here 86.83% is the percentage completed, 903 is the number of iterations completed, and 1040 is the total number of iterations. 01:36 is the elapsed time, 01:50 is the estimated total time, and 9.42i/s is the average number of iterations completed per second. If the speed is less than 1, the average number of seconds per iteration (s/i) is reported instead. The bar, percent, total iterations, and estimated total time are omitted for iterators whose size is unknown.
The 3.87835 at the end is the output of the msg function applied to the Progress iterator. The message can be customized by the first two forms above; if not specified (the third form) nothing gets printed at the end of the line. The message function can use the following fields of its p::Progress argument: p.currval is the current iterator value and p.curriter is the current iteration count.
The progress bar is updated and msg is called with the Progress iterator every steps iterations or every seconds seconds, in addition to the first and the last iteration. If neither steps nor seconds is specified the default is to update every second. The keyword argument io determines where the progress bar is printed; the default is stderr.
The last form, progress!(...), is equivalent to (for x in progress(...) end), i.e. it iterates over the object created by progress(...) and returns nothing.
Missing docstring for Knet.training.
Hyperparameter optimization
Knet.Train20.goldensection — Function
goldensection(f,n;kwargs) => (fmin,xmin)
Find the minimum of f using concurrent golden section search in n dimensions. See Knet.goldensection_demo() for an example.
f is a function from a Vector{Float64} of length n to a Number. It can return NaN for out of range inputs. Goldensection will always start with a zero vector as the initial input to f, and the initial step size will be 1 in each dimension. The user should define f to scale and shift this input range into a vector meaningful for their application. For positive inputs like learning rate or hidden size, you can use a transformation such as x0*exp(x) where x is a value goldensection passes to f and x0 is your initial guess for this value. This will effectively start the search at x0, then move with multiplicative steps.
I designed this algorithm combining ideas from Golden Section Search and Hill Climbing Search. It essentially runs golden section search concurrently in each dimension, picking the next step based on estimated gain.
Keyword arguments
- dxmin=0.1: smallest step size.
- accel=φ: acceleration rate. The golden ratio φ=1.618... is best.
- verbose=false: use true to print individual steps.
- history=[]: cache of [(x,f(x)),...] function evaluations.
Knet.Train20.hyperband — Function
hyperband(getconfig, getloss, maxresource=27, reduction=3)
Hyperparameter optimization using the hyperband algorithm from (Lisha et al. 2016). You can try a simple MNIST example using Knet.hyperband_demo().
Arguments
- getconfig() returns random configurations with a user defined type and distribution.
- getloss(c,n) returns the loss for configuration c and number of resources (e.g. epochs) n.
- maxresource is the maximum number of resources any one configuration should be given.
- reduction is an algorithm parameter (see paper), 3 is a good value.
Utilities
Knet.Ops20.bmm — Function
bmm(A, B; transA=false, transB=false)
Perform a batch matrix-matrix product of matrices stored in A and B. size(A,2) == size(B,1) must hold, and size(A)[3:end] and size(B)[3:end] must match. If A is a (m,n,b...) tensor and B is a (n,k,b...) tensor, the output is a (m,k,b...) tensor.
AutoGrad.cat1d — Function
cat1d(args...)
Return vcat(vec.(args)...) but possibly more efficiently. Can be used to concatenate the contents of arrays with different shapes and sizes.
Missing docstring for Knet.cpucopy.
Knet.dir — Function
Construct a path relative to the Knet root, e.g. Knet.dir("examples") => "~/.julia/dev/Knet/examples"
Knet.Ops20.dropout — Function
dropout(x, p; drop, seed)
Given an array x and probability 0<=p<=1, return an array y in which each element is 0 with probability p or x[i]/(1-p) with probability 1-p. Just return x if p==0 or drop=false. By default drop=true in a @diff context and drop=false otherwise. Specify a non-zero seed::Number to set the random number seed for reproducible results. See (Srivastava et al. 2014) for a reference.
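For example (a small sketch):
using Knet
x = ones(Float32, 5, 4)
dropout(x, 0.5)              # returns x unchanged outside a @diff context (drop=false)
dropout(x, 0.5; drop=true)   # roughly half the entries zeroed, the rest scaled by 1/(1-p) = 2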
Knet.KnetArrays.gc — Function
Knet.gc(dev=CUDA.device().handle)
cudaFree all pointers allocated on device dev that were previously allocated and garbage collected. Normally Knet holds on to all garbage collected pointers for reuse. Try this if you run out of GPU memory.
Missing docstrings for Knet.gpu, Knet.gpucopy and Knet.invx.
Knet.Ops20.mat — Function
mat(x; dims = ndims(x) - 1)
Reshape x into a two-dimensional matrix by joining the first dims dimensions, i.e. reshape(x, prod(size(x,i) for i in 1:dims), :)
dims=ndims(x)-1 (default) is typically used when turning the output of a 4-D convolution into a 2-D input for a fully connected layer.
dims=1 is typically used when turning the 3-D output of an RNN layer into a 2-D input for a fully connected layer.
dims=0 will turn the input into a row vector, dims=ndims(x) will turn it into a column vector.
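For example (a small sketch):
using Knet
x = randn(Float32, 7, 7, 32, 16)   # e.g. the output of a convolution layer
size(mat(x))                       # => (1568, 16): joins the first ndims(x)-1 dimensions
size(mat(x; dims=1))               # => (7, 3584)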
Knet.KnetArrays.seed! — Function
Call both CUDA.seed! (if available) and Random.seed!
AutoGrad (advanced)
AutoGrad.@gcheck — Macro
gcheck(f, x...; kw, o...)
@gcheck f(x...; kw...) (opt1=val1,opt2=val2,...)
Numerically check the gradient of f(x...; kw...) and return a boolean result.
Example call: gcheck(nll,model,x,y) or @gcheck nll(model,x,y). The parameters should be marked as Param arrays in f, x, and/or kw. Only 10 random entries in each large numeric array are checked by default. If the output of f is not a number, we check the gradient of sum(f(x...; kw...)). Keyword arguments:
- kw=(): keyword arguments to be passed to f, i.e. f(x...; kw...)
- nsample=10: number of random entries from each param to check
- atol=0.01, rtol=0.05: tolerance parameters. See isapprox for their meaning.
- delta=0.0001: step size for numerical gradient calculation.
- verbose=1: 0 prints nothing, 1 shows failing tests, 2 shows all tests.
AutoGrad.@primitive — Macro
@primitive fx g1 g2...
Define a new primitive operation for AutoGrad and (optionally) specify its gradients. Non-differentiable functions such as sign, and non-numeric functions such as size, should be defined using the @zerograd macro instead.
Examples
@primitive sin(x::Number)
@primitive hypot(x1,x2),dy,y
@primitive sin(x::Number),dy (dy.*cos(x))
@primitive hypot(x1,x2),dy,y (dy.*x1./y) (dy.*x2./y)
The first example shows that fx is a typed method declaration. Julia supports multiple dispatch, i.e. a single function can have multiple methods with different arg types. AutoGrad takes advantage of this and supports multiple dispatch for primitives and gradients.
The second example specifies variable names for the output gradient dy and the output y after the method declaration, which can be used in gradient expressions. Untyped, ellipsis and keyword arguments are ok as in f(a::Int,b,c...;d=1). Parametric methods such as f(x::T) where {T<:Number} cannot be used.
The method declaration can optionally be followed by gradient expressions. The third and fourth examples show how gradients can be specified. Note that the parameters, the return variable and the output gradient of the original function can be used in the gradient expressions.
Under the hood
The @primitive macro turns the first example into:
sin(x::Value{T}) where {T<:Number} = forw(sin, x)
This will cause calls to sin with a boxed argument (Value{T<:Number}) to be recorded. The recorded operations are used by AutoGrad to construct a dynamic computational graph. With multiple arguments things are a bit more complicated. Here is what happens with the second example:
hypot(x1::Value{S}, x2::Value{T}) where {S,T} = forw(hypot, x1, x2)
hypot(x1::S, x2::Value{T}) where {S,T} = forw(hypot, x1, x2)
hypot(x1::Value{S}, x2::T) where {S,T} = forw(hypot, x1, x2)
We want the forw method to be called if any one of the arguments is a boxed Value. There is no easy way to specify this in Julia, so the macro generates all 2^N-1 boxed/unboxed argument combinations.
In AutoGrad, gradients are defined using gradient methods that have the following pattern:
back(f,Arg{i},dy,y,x...) => dx[i]
For the third example here is the generated gradient method:
back(::typeof(sin), ::Type{Arg{1}}, dy, y, x::Value{T}) where {T<:Number} = dy .* cos(x)
For the last example a different gradient method is generated for each argument:
back(::typeof(hypot), ::Type{Arg{1}}, dy, y, x1::Value{S}, x2::Value{T}) where {S,T} = (dy .* x1) ./ y
back(::typeof(hypot), ::Type{Arg{2}}, dy, y, x1::Value{S}, x2::Value{T}) where {S,T} = (dy .* x2) ./ y
In fact @primitive generates four more definitions for the other boxed/unboxed argument combinations.
Broadcasting
Broadcasting is handled by extra forw and back methods. @primitive defines the following so that broadcasting of a primitive function with a boxed value triggers forw and back.
broadcasted(::typeof(sin), x::Value{T}) where {T<:Number} = forw(broadcasted,sin,x)
back(::typeof(broadcasted), ::Type{Arg{2}}, dy, y, ::typeof(sin), x::Value{T}) where {T<:Number} = dy .* cos(x)
If you do not want the broadcasting methods, you can use the @primitive1 macro. If you only want the broadcasting methods use @primitive2. As a motivating example, here is how * is defined for non-scalars:
@primitive1 *(x1,x2),dy (dy*x2') (x1'*dy)
@primitive2 *(x1,x2),dy unbroadcast(x1,dy.*x2) unbroadcast(x2,x1.*dy)
Regular * is matrix multiplication, broadcasted * is elementwise multiplication, and the two have different gradients as defined above. unbroadcast(a,b) reduces b to the same shape as a by performing the necessary summations.
AutoGrad.@zerograd — Macro
@zerograd f(args...; kwargs...)
Define f as an AutoGrad primitive operation with zero gradient.
Example:
@zerograd floor(x::Float32)
@zerograd allows f to handle boxed Value inputs by unboxing them like a @primitive, but unlike @primitive it does not record its actions or return a boxed Value result. Some functions, like sign(), have zero gradient. Others, like length(), have discrete or constant outputs. These need to handle Value inputs, but do not need to record anything and can return regular values. Their output can be treated like a constant in the program. Use the @zerograd macro for those. Use the @zerograd1 variant if you don't want to define the broadcasting version, and @zerograd2 if you only want to define the broadcasting version. Note that kwargs are NOT unboxed.
Per-parameter optimization (advanced)
The model optimization methods apply the same algorithm with the same configuration to every parameter. If you need finer grained control, you can set the optimization algorithm and configuration of an individual Param by setting its opt field to one of the optimization objects like Adam listed below. The opt field is used as an argument to update! and controls the type of update performed on that parameter. Model optimization methods like sgd will not override the opt field if it is already set, e.g. sgd(model,data) will perform an Adam update for a parameter whose opt field is an Adam object. This also means you can stop and start the training without losing optimization state: the first call will set the opt fields and subsequent calls will not override them.
Knet.Train20.update! — Function
update!(weights::Param, gradients)
update!(weights, gradients; lr=0.1, gclip=0)
update!(weights, gradients, optimizers)
Update the weights using their gradients and the optimization algorithms specified using (1) the opt field of a Param, (2) keyword arguments, (3) the third argument.
weights can be an individual Param, numeric array, or a collection of arrays/Params represented by an iterator or dictionary. gradients should be a matching individual array or collection. In the first form, the optimizer should be specified in weights.opt. In the second form the optimizer defaults to SGD with learning rate lr and gradient clip gclip. In the third form optimizers should be a matching individual optimizer or collection of optimizers. The weights and possibly gradients and optimizers are modified in-place.
Individual optimization parameters can be one of the following types. The keyword arguments for each constructor and their default values are listed as well.
- SGD(;lr=0.1, gclip=0)
- Momentum(;lr=0.05, gamma=0.95, gclip=0)
- Nesterov(;lr=0.05, gamma=0.95, gclip=0)
- Adagrad(;lr=0.05, eps=1e-6, gclip=0)
- Rmsprop(;lr=0.01, rho=0.9, eps=1e-6, gclip=0)
- Adadelta(;lr=1.0, rho=0.9, eps=1e-6, gclip=0)
- Adam(;lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, gclip=0)
Example:
w = Param(rand(d), Adam()) # a Param with a specified optimizer
g = lossgradient0(w) # gradient g has the same shape as w
update!(w, g) # update w in-place with Adam()
w = rand(d) # an individual weight array
g = lossgradient1(w) # gradient g has the same shape as w
update!(w, g) # update w in-place with SGD()
update!(w, g; lr=0.1) # update w in-place with SGD(lr=0.1)
update!(w, g, SGD(lr=0.1)) # update w in-place with SGD(lr=0.1)
w = (rand(d1), rand(d2)) # a tuple of weight arrays
g = lossgradient2(w) # g will also be a tuple
p = (Adam(), SGD()) # p has optimizers for each w[i]
update!(w, g, p) # update each w[i] in-place with g[i],p[i]
w = Any[rand(d1), rand(d2)] # any iterator can be used
g = lossgradient3(w) # g will be similar to w
p = Any[Adam(), SGD()] # p should be an iterator of same length
update!(w, g, p) # update each w[i] in-place with g[i],p[i]
w = Dict(:a => rand(d1), :b => rand(d2)) # dictionaries can be used
g = lossgradient4(w)
p = Dict(:a => Adam(), :b => SGD())
update!(w, g, p)
Knet.Train20.SGD — Type
SGD(;lr=0.1,gclip=0)
update!(w,g,p::SGD)
update!(w,g;lr=0.1)
Container for parameters of the Stochastic gradient descent (SGD) optimization algorithm used by update!.
SGD is an optimization technique to minimize an objective function by updating its weights in the opposite direction of their gradient. The learning rate (lr) determines the size of the step. SGD updates the weights with the following formula:
w = w - lr * g
where w is a weight array, g is the gradient of the loss function w.r.t w and lr is the learning rate.
If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.
SGD is used by default if no algorithm is specified in the two argument version of update!.
Knet.Train20.Momentum — Type
Momentum(;lr=0.05, gclip=0, gamma=0.95)
update!(w,g,p::Momentum)
Container for parameters of the Momentum optimization algorithm used by update!.
The Momentum method tries to accelerate SGD by adding a velocity term to the update. This also decreases the oscillation between successive steps. It updates the weights with the following formulas:
velocity = gamma * velocity + lr * g
w = w - velocity
where w is a weight array, g is the gradient of the objective function w.r.t w, lr is the learning rate, gamma is the momentum parameter, and velocity is an array with the same size and type as w that holds the accelerated gradients.
If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.
Reference: Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks: The Official Journal of the International Neural Network Society, 12(1), 145–151.
Knet.Train20.Nesterov — Type
Nesterov(; lr=0.05, gclip=0, gamma=0.95)
update!(w,g,p::Nesterov)
Container for parameters of Nesterov's momentum optimization algorithm used by update!.
It is similar to standard Momentum but with a slightly different update rule:
velocity = gamma * velocity_old - lr * g
w = w_old - velocity_old + (1+gamma) * velocity
where w is a weight array, g is the gradient of the objective function w.r.t w, lr is the learning rate, gamma is the momentum parameter, and velocity is an array with the same size and type as w that holds the accelerated gradients.
If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip == 0 no scaling takes place.
Reference implementation: Yoshua Bengio, Nicolas Boulanger-Lewandowski and Razvan Pascanu.
Knet.Train20.Adagrad — Type
Adagrad(;lr=0.05, gclip=0, eps=1e-6)
update!(w,g,p::Adagrad)
Container for parameters of the Adagrad optimization algorithm used by update!.
Adagrad is one of the methods that adapts the learning rate to each of the weights. It stores the sum of the squares of the gradients to scale the learning rate. The learning rate is adapted for each weight by the value of the current gradient divided by the accumulated gradients. Hence, the learning rate is greater for parameters where the accumulated gradients are small and smaller where the accumulated gradients are large. It updates the weights with the following formulas:
G = G + g .^ 2
w = w - g .* lr ./ sqrt(G + eps)
where w is the weight, g is the gradient of the objective function w.r.t w, lr is the learning rate, G is an array with the same size and type as w that holds the sum of the squares of the gradients, and eps is a small constant to prevent a zero value in the denominator.
If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.
Reference: Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.
Knet.Train20.Rmsprop — Type
Rmsprop(;lr=0.01, gclip=0, rho=0.9, eps=1e-6)
update!(w,g,p::Rmsprop)
Container for parameters of the Rmsprop optimization algorithm used by update!.
Rmsprop scales the learning rates by dividing by the root mean square of the gradients. It updates the weights with the following formulas:
G = (1-rho) * g .^ 2 + rho * G
w = w - lr * g ./ sqrt(G + eps)
where w is the weight, g is the gradient of the objective function w.r.t w, lr is the learning rate, G is an array with the same size and type as w that holds a running average of the squares of the gradients, rho is the momentum parameter, and eps is a small constant to prevent a zero value in the denominator.
If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.
Reference: Tijmen Tieleman and Geoffrey Hinton (2012). "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural Networks for Machine Learning 4.2.
Knet.Train20.Adadelta — Type
Adadelta(;lr=1.0, gclip=0, rho=0.9, eps=1e-6)
update!(w,g,p::Adadelta)
Container for parameters of the Adadelta optimization algorithm used by update!.
Adadelta is an extension of Adagrad that tries to prevent the decrease of the learning rates to zero as training progresses. It scales the learning rate based on the accumulated gradients like Adagrad and holds an acceleration term like Momentum. It updates the weights with the following formulas:
G = (1-rho) * g .^ 2 + rho * G
update = g .* sqrt(delta + eps) ./ sqrt(G + eps)
w = w - lr * update
delta = rho * delta + (1-rho) * update .^ 2
where w is the weight, g is the gradient of the objective function w.r.t w, lr is the learning rate, G is an array with the same size and type as w that holds the accumulated squares of the gradients, eps is a small constant to prevent a zero value in the denominator, rho is the momentum parameter, and delta is an array with the same size and type as w that holds the accumulated squares of the updates.
If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.
Reference: Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method.
Knet.Train20.Adam — Type
Adam(;lr=0.001, gclip=0, beta1=0.9, beta2=0.999, eps=1e-8)
update!(w,g,p::Adam)
Container for parameters of the Adam optimization algorithm used by update!.
Adam is one of the methods that compute an adaptive learning rate. It stores accumulated gradients (first moment) and the sum of the squares of the gradients (second moment), and scales both as a function of time. Here are the update formulas:
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g .* g
mhat = m ./ (1 - beta1 ^ t)
vhat = v ./ (1 - beta2 ^ t)
w = w - (lr / (sqrt(vhat) + eps)) * mhat
where w is the weight, g is the gradient of the objective function w.r.t w, lr is the learning rate, m is an array with the same size and type as w that holds the accumulated gradients, v is an array with the same size and type as w that holds the sum of the squares of the gradients, eps is a small constant to prevent a zero denominator, beta1 and beta2 are the parameters used to calculate the bias-corrected first and second moments, and t is the update count.
If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.
Reference: Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.
Function Index
Knet.AutoGrad
Knet.KnetArrays.KnetArray
Knet.Ops20.RNN
Knet.Train20.Adadelta
Knet.Train20.Adagrad
Knet.Train20.Adam
Knet.Train20.Momentum
Knet.Train20.Nesterov
Knet.Train20.Rmsprop
Knet.Train20.SGD
AutoGrad.cat1d
Knet.KnetArrays.gc
Knet.KnetArrays.seed!
Knet.Ops20.accuracy
Knet.Ops20.batchnorm
Knet.Ops20.bce
Knet.Ops20.bmm
Knet.Ops20.bnmoments
Knet.Ops20.bnparams
Knet.Ops20.conv4
Knet.Ops20.deconv4
Knet.Ops20.dropout
Knet.Ops20.elu
Knet.Ops20.logistic
Knet.Ops20.logp
Knet.Ops20.logsoftmax
Knet.Ops20.logsumexp
Knet.Ops20.mat
Knet.Ops20.nll
Knet.Ops20.pool
Knet.Ops20.relu
Knet.Ops20.rnnparam
Knet.Ops20.rnnparams
Knet.Ops20.selu
Knet.Ops20.sigm
Knet.Ops20.softmax
Knet.Ops20.unpool
Knet.Ops20.zeroone
Knet.Train20.bilinear
Knet.Train20.converge
Knet.Train20.gaussian
Knet.Train20.goldensection
Knet.Train20.hyperband
Knet.Train20.minibatch
Knet.Train20.minimize
Knet.Train20.param
Knet.Train20.progress
Knet.Train20.update!
Knet.Train20.xavier
Knet.Train20.xavier_normal
Knet.Train20.xavier_uniform
Knet.dir
AutoGrad.@gcheck
AutoGrad.@primitive
AutoGrad.@zerograd