Reference
AutoGrad
AutoGrad.AutoGrad
— Module
Usage:
x = Param([1,2,3]) # user declares parameters with `Param`
x => P([1,2,3]) # `Param` is just a struct wrapping a value
value(x) => [1,2,3] # `value` returns the thing wrapped
sum(x .* x) => 14 # Params act like regular values
y = @diff sum(x .* x) # Except when we differentiate using `@diff`
y => T(14) # you get another struct
value(y) => 14 # which carries the same result
params(y) => [x] # and the Params that it depends on
grad(y,x) => [2,4,6] # and the gradients for all Params
Param(x)
returns a struct that acts like x
but marks it as a parameter you want to compute gradients with respect to.
@diff expr
evaluates an expression and returns a struct that contains the result (which should be a scalar) and gradient information.
grad(y, x)
returns the gradient of y
(output by @diff) with respect to any parameter x::Param
, or nothing
if the gradient is 0.
value(x)
returns the value associated with x
if x
is a Param
or the output of @diff
, otherwise returns x
.
params(x)
returns an iterator of Params found by a recursive search of object x
.
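These primitives are enough for a hand-written training step. The sketch below is illustrative only; the quadratic loss, learning rate and step count are arbitrary, not part of the API:
using Knet              # Param, @diff, grad, value and params are re-exported from AutoGrad
w = Param(randn(3))     # a parameter we want gradients for
loss(w) = sum(w .* w)   # toy scalar-valued loss
for step in 1:10
    y = @diff loss(w)         # run the computation, recording operations on Params
    g = grad(y, w)            # gradient of the loss with respect to w
    value(w) .-= 0.1 .* g     # gradient-descent step on the underlying array
end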
Alternative usage:
x = [1 2 3]
f(x) = sum(x .* x)
f(x) => 14
grad(f)(x) => [2 4 6]
gradloss(f)(x) => ([2 4 6], 14)
Given a scalar valued function f
, grad(f,argnum=1)
returns another function g
which takes the same inputs as f
and returns the gradient of the output with respect to the argnum'th argument. gradloss
is similar except the resulting function also returns f's output.
KnetArray
Knet.KnetArray
— TypeKnetArray{T}(undef,dims)
KnetArray(a::AbstractArray)
Array(k::KnetArray)
Container for GPU arrays that supports most of the AbstractArray interface. The constructor allocates a KnetArray in the currently active device, as specified by gpu()
. KnetArrays and Arrays can be converted to each other as shown above, which involves copying to and from the GPU memory. Only Float32/64 KnetArrays are fully supported.
KnetArrays use the CUDA.jl package for allocation and some operations. Currently, some of the custom CUDA kernels that implement elementwise, broadcasting, and reduction operations for KnetArrays are faster than their CUDA.jl counterparts. Once these are improved in CUDA.jl, KnetArrays will be retired.
Supported functions:
Indexing: getindex, setindex! with the following index types:
- 1-D: Real, Colon, OrdinalRange, AbstractArray{Real}, AbstractArray{Bool}, CartesianIndex, AbstractArray{CartesianIndex}, EmptyArray, KnetArray{Int32} (low level), KnetArray{0/1} (using float for BitArray) (1-D includes linear indexing of multidimensional arrays)
- 2-D: (Colon,Union{Real,Colon,OrdinalRange,AbstractVector{Real},AbstractVector{Bool},KnetVector{Int32}}), (Union{Real,AbstractUnitRange,Colon}...) (in any order)
- N-D: (Real...)
Array operations: ==, !=, adjoint, argmax, argmin, cat, convert, copy, copyto!, deepcopy, display, eachindex, eltype, endof, fill!, findmax, findmin, first, hcat, isapprox, isempty, length, ndims, one, ones, permutedims, pointer, rand!, randn!, reshape, similar, size, stride, strides, summary, transpose, vcat, vec, zero. (Boolean operators generate outputs with same type as inputs; no support for KnetArray{Bool}.)
Unary functions with broadcasting: -, abs, abs2, acos, acosh, asin, asinh, atan, atanh, cbrt, ceil, cos, cosh, cospi, digamma, erf, erfc, erfcinv, erfcx, erfinv, exp, exp10, exp2, expm1, floor, gamma, lgamma, log, log10, log1p, log2, loggamma, one, round, sign, sin, sinh, sinpi, sqrt, tan, tanh, trigamma, trunc, zero
Binary functions with broadcasting: !=, *, +, -, /, <, <=, ==, >, >=, ^, max, min
Reduction operators: maximum, minimum, prod, sum
Statistics: mean, std, stdm, var, varm
Linear algebra: (*), axpy!, lmul!, norm, rmul!
Knet extras: batchnorm, bce, bmm, cat1d, conv4, cpucopy, deconv4, dropout, elu, gpucopy, invx, logistic, logp, logsoftmax, logsumexp, mat, nll, pool, relu, RNN, selu, sigm, softmax, unpool (Only 4D/5D, Float32/64 KnetArrays support conv4, pool, deconv4, unpool)
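A minimal round-trip sketch, assuming a CUDA-capable GPU is active (gpu() >= 0); the array sizes are arbitrary:
using Knet
if gpu() >= 0
    a = rand(Float32, 3, 4)    # ordinary CPU array
    k = KnetArray(a)           # copy to GPU memory on the active device
    s = sum(k .* k)            # elementwise ops and reductions run on the GPU
    b = Array(k)               # copy back to the CPU
    @assert b ≈ a
end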
File I/O
Knet.save
— FunctionKnet.save(filename, args...; kwargs...)
Call FileIO.save
after serializing Knet specific args.
File format is determined by the filename extension. JLD and JLD2 are supported. Other formats may work if supported by FileIO, please refer to the documentation of FileIO and the specific format. Example:
Knet.save("foo.jld2", "name1", value1, "name2", value2)
Knet.load
— FunctionKnet.load(filename, args...; kwargs...)
Call FileIO.load
then deserialize Knet specific values.
File format is determined by FileIO. JLD and JLD2 are supported. Other formats may work if supported by FileIO, please refer to the documentation of FileIO and the specific format. Example:
Knet.load("foo.jld2") # returns a ("name"=>value) dictionary
Knet.load("foo.jld2", "name1") # returns the value of "name1" in "foo.jld2"
Knet.load("foo.jld2", "name1", "name2") # returns tuple (value1, value2)
Knet.@save
— MacroKnet.@save "filename" variable1 variable2...
Save the values of the specified variables to filename in JLD2 format.
When called with no variable arguments, write all variables in the global scope of the current module to filename. See JLD2.
Knet.@load
— MacroKnet.@load "filename" variable1 variable2...
Load the values of the specified variables from filename in JLD2 format.
When called with no variable arguments, load all variables in filename. See JLD2.
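A small sketch of the save/load round trip; the filename and variable names are illustrative:
using Knet
w = rand(3, 3)
b = zeros(3)
Knet.@save "model.jld2" w b    # write w and b to model.jld2 in JLD2 format
# ... later, possibly in a fresh session ...
Knet.@load "model.jld2" w b    # recreate the variables w and b from the file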
Parameter initialization
Knet.param
— Functionparam(array; atype)
param(dims...; init, atype)
param0(dims...; atype)
The first form returns Param(atype(array))
where atype=identity
is the default.
The second form returns a randomly initialized Param(atype(init(dims...)))
. By default, init
is xavier_uniform
and atype
is KnetArray{Float32}
if gpu() >= 0
, Array{Float32}
otherwise.
The third form param0
is an alias for param(dims...; init=zeros)
.
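A short sketch of the three forms; the dimensions are arbitrary:
using Knet
w = param(256, 784)          # xavier_uniform init; KnetArray{Float32} on GPU, Array{Float32} otherwise
b = param0(256)              # zero-initialized bias
p = param(rand(784, 100))    # wrap an existing array as-is (atype=identity by default)
size(value(w)), size(value(b))   # => ((256, 784), (256,))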
Knet.xavier
— Functionxavier_uniform(a...; gain=1)
xavier(a...; gain=1)
Return uniform random weights in the range ± gain * sqrt(6 / (fanin + fanout))
. The a
arguments are passed to rand
to specify type and dimensions. See (Glorot and Bengio 2010) or the PyTorch docs for a description. The function implements equation (16) of the referenced paper. Also known as Glorot initialization. The function xavier
is an alias for xavier_uniform
. See also xavier_normal
.
Knet.xavier_uniform
— Functionxavier_uniform(a...; gain=1)
xavier(a...; gain=1)
Return uniform random weights in the range ± gain * sqrt(6 / (fanin + fanout))
. The a
arguments are passed to rand
to specify type and dimensions. See (Glorot and Bengio 2010) or the PyTorch docs for a description. The function implements equation (16) of the referenced paper. Also known as Glorot initialization. The function xavier
is an alias for xavier_uniform
. See also xavier_normal
.
Knet.xavier_normal
— Functionxavier_normal(a...; gain=1)
Return normal distributed random weights with mean 0 and std gain * sqrt(2 / (fanin + fanout))
. The a
arguments are passed to rand
. See (Glorot and Bengio 2010) and PyTorch docs for a description. Also known as Glorot initialization. See also xavier_uniform
.
Knet.gaussian
— Functiongaussian(a...; mean=0.0, std=0.01)
Return a Gaussian array with a given mean and standard deviation. The a
arguments are passed to randn
.
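The initializers can be called directly or passed to param through its init keyword; a sketch with arbitrary sizes:
using Knet
w1 = xavier_uniform(Float32, 100, 50)        # uniform in ±gain*sqrt(6/(fanin+fanout))
w2 = xavier_normal(Float32, 100, 50)         # normal with std gain*sqrt(2/(fanin+fanout))
w3 = gaussian(100, 50; mean=0.0, std=0.01)   # plain Gaussian initialization
p  = param(100, 50; init=gaussian)           # use an initializer via param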
Knet.bilinear
— Function
Bilinear interpolation filter weights; used for initializing deconvolution layers.
Adapted from https://github.com/shelhamer/fcn.berkeleyvision.org/blob/master/surgery.py#L33
Arguments:
T
: Data Type
fw
: Width upscale factor
fh
: Height upscale factor
IN
: Number of input filters
ON
: Number of output filters
Example usage:
w = bilinear(Float32,2,2,128,128)
Activation functions
NNlib.elu
— Functionelu(x)
Return (x > 0 ? x : exp(x)-1)
.
Reference: Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs) (https://arxiv.org/abs/1511.07289).
NNlib.relu
— Functionrelu(x)
Return max(0,x)
.
References:
- Nair and Hinton, 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. ICML.
- Glorot, Bordes and Bengio, 2011. Deep Sparse Rectifier Neural Networks. AISTATS.
NNlib.selu
— Functionselu(x)
Return λ01 * (x > 0 ? x : α01 * (exp(x)-1))
where λ01=1.0507009873554805
and α01=1.6732632423543778
.
Reference: Self-Normalizing Neural Networks (https://arxiv.org/abs/1706.02515).
Knet.sigm
— Functionsigm(x) = 1/(1+exp(-x))
Loss functions
Knet.accuracy
— Functionaccuracy(scores, answers; dims=1, average=true)
Given an unnormalized scores
matrix and an Integer
array of correct answers
, return the ratio of instances where the correct answer has the maximum score. dims=1
means instances are in columns, dims=2
means instances are in rows. Use average=false
to return the pair (ncorrect,count) instead of the ratio (ncorrect/count). If answers[i] == 0
, instance i is skipped.
accuracy(model, data; dims=1, average=true, o...)
Compute accuracy(model(x; o...), y; dims)
for (x,y)
in data
and return (correct/total) if average=true or (correct,total) if average=false.
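A small worked example with three classes and three instances in columns (dims=1):
using Knet
scores  = [0.3 2.0 1.0;
           2.5 0.1 0.0;
           0.0 1.0 3.0]          # (classes, instances)
answers = [2, 1, 3]              # correct class per instance
accuracy(scores, answers)                  # => 1.0: the argmax of every column matches
accuracy(scores, answers; average=false)   # => (3, 3), i.e. (ncorrect, count)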
Knet.bce
— Function
bce(scores, answers; average=true)
Computes binary cross entropy given scores (predicted values) and answer labels. Answer values should be in {0,1}; it returns the negative of mean|sum(answers * log(p) + (1-answers)*log(1-p)) where p is equal to 1 ./ (1 .+ exp.(-scores))
. See also logistic
.
Knet.logistic
— Functionlogistic(scores, answers; average=true)
Computes logistic loss given scores (predicted values) and answer labels. Answer values should be in {-1,1}; it returns mean|sum(log(1 + exp(-answers*scores)))
. See also bce
.
Knet.logp
— Functionlogp(x; dims=:)
Treat entries in x
as unnormalized log probabilities and return normalized log probabilities.
dims
is an optional argument, if not specified the normalization is over the whole x
, otherwise the normalization is performed over the given dimensions. In particular, if x
is a matrix, dims=1
normalizes columns of x
and dims=2
normalizes rows of x
.
Knet.logsoftmax
— Function logsoftmax(x; dims=:)
Equivalent to logp(x; dims=:)
. See also softmax
.
Knet.logsumexp
— Functionlogsumexp(x;dims=:)
Compute log(sum(exp(x);dims))
in a numerically stable manner.
dims
is an optional argument, if not specified the summation is over the whole x
, otherwise the summation is performed over the given dimensions. In particular if x
is a matrix, dims=1
sums columns of x
and dims=2
sums rows of x
.
Knet.nll
— Functionnll(scores, answers::Array{<:Integer}; dims=1, average=true)
nll(model, data; dims=1, average=true, o...)
The first form calculates the negative log likelihood for a single batch given an unnormalized scores
matrix and an Integer
array of correct answers
. The scores
matrix should have size (classes,instances) if dims=1
or (instances,classes) if dims=2
. answers[i]
should be in 1:classes
to indicate the correct class for instance i, or 0 to skip instance i.
The second form calculates negative log likelihood for a model and dataset iterating over nll(model(inputs; o...), answers; dims)
for (inputs,answers)
in data
. The model
should be a function returning scores given inputs, and data should be an iterable of (inputs,answers)
pairs.
In both forms, the return value is (total/count)
if average=true
and (total,count)
if average=false
where count
is the number of instances not skipped and total
is their total negative log likelihood.
Example
Let's assume that there are three classes (cat, dog, ostrich) and just 2 instances with the unnormalized score scores[:,1]
and scores[:,2]
respectively. The first instance is actually a cat and the second instance a dog:
scores = [12.2 0.3;
2.0 21.5;
0.0 -21.0]
answers = [1, 2]
Knet.nll(scores,answers)
# returns 2.1657e-5
The probabilities are derived from the scores and the log-probabilities corresponding to the answers are averaged:
probabilities = exp.(scores) ./ sum(exp.(scores), dims=1)
-(log(probabilities[answers[1],1]) + log(probabilities[answers[2],2]))/2
# returns 2.1657e-5
Knet.softmax
— Functionsoftmax(x; dims=1, algo=1)
The softmax function typically used in classification. Gives the same results as exp.(logp(x; dims=dims))
.
If algo=1
computation is more accurate, if algo=0
it is faster.
See also logsoftmax
.
Knet.zeroone
— Function
The zeroone loss is equal to 1 - accuracy.
Convolution and Pooling
Knet.conv4
— Functionconv4(w, x; kwargs...)
Execute convolutions or cross-correlations using filters specified with w
over tensor x
.
Currently KnetArray{Float32/64,4/5} and Array{Float32/64,4} are supported as w
and x
. If w
has dimensions (W1,W2,...,I,O)
and x
has dimensions (X1,X2,...,I,N)
, the result y
will have dimensions (Y1,Y2,...,O,N)
where
Yi=1+floor((Xi+2*padding[i]-Wi)/stride[i])
Here I
is the number of input channels, O
is the number of output channels, N
is the number of instances, and Wi,Xi,Yi
are spatial dimensions. padding
and stride
are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.
Keywords
- padding=0: the number of extra zeros implicitly concatenated at the start and at the end of each dimension.
- stride=1: the number of elements to slide to reach the next filtering window.
- dilation=1: dilation factor for each dimension.
- mode=0: 0 for convolution and 1 for cross-correlation.
- alpha=1: can be used to scale the result.
- handle: handle to a previously created cuDNN context. Defaults to a Knet allocated handle.
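A shape-level sketch of conv4 (the sizes are arbitrary; on a GPU the arrays would typically be KnetArrays):
using Knet
x = randn(Float32, 28, 28, 1, 8)    # (X1,X2,I,N): 8 single-channel 28x28 images
w = randn(Float32, 5, 5, 1, 16)     # (W1,W2,I,O): 16 filters of size 5x5
y = conv4(w, x; padding=2, stride=1)
size(y)                             # => (28, 28, 16, 8) since Yi = 1 + (28 + 2*2 - 5) ÷ 1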
Knet.deconv4
— Function
y = deconv4(w, x; kwargs...)
Simulate 4-D deconvolution by using transposed convolution operation. Its forward pass is equivalent to backward pass of a convolution (gradients with respect to input tensor). Likewise, its backward pass (gradients with respect to input tensor) is equivalent to forward pass of a convolution. Since it swaps forward and backward passes of convolution operation, padding and stride options belong to output tensor. See this report for further explanation.
Currently KnetArray{Float32/64,4} and Array{Float32/64,4} are supported as w
and x
. If w
has dimensions (W1,W2,...,O,I)
and x
has dimensions (X1,X2,...,I,N)
, the result y
will have dimensions (Y1,Y2,...,O,N)
where
Yi = Wi+stride[i](Xi-1)-2padding[i]
Here I is the number of input channels, O is the number of output channels, N is the number of instances, and Wi,Xi,Yi are spatial dimensions. padding and stride are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.
Keywords
- padding=0: the number of extra zeros implicitly concatenated at the start and at the end of each dimension.
- stride=1: the number of elements to slide to reach the next filtering window.
- mode=0: 0 for convolution and 1 for cross-correlation.
- alpha=1: can be used to scale the result.
- handle: handle to a previously created cuDNN context. Defaults to a Knet allocated handle.
Knet.pool
— Functionpool(x; kwargs...)
Compute pooling of input values (i.e., the maximum or average of several adjacent values) to produce an output with smaller height and/or width.
Currently 4 or 5 dimensional KnetArrays with Float32
or Float64
entries are supported. If x
has dimensions (X1,X2,...,I,N)
, the result y
will have dimensions (Y1,Y2,...,I,N)
where
Yi=1+floor((Xi+2*padding[i]-window[i])/stride[i])
Here I
is the number of input channels, N
is the number of instances, and Xi,Yi
are spatial dimensions. window
, padding
and stride
are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.
Keywords:
- window=2: the pooling window size for each dimension.
- padding=0: the number of extra zeros implicitly concatenated at the start and at the end of each dimension.
- stride=window: the number of elements to slide to reach the next pooling window.
- mode=0: 0 for max, 1 for average including padded values, 2 for average excluding padded values.
- maxpoolingNanOpt=0: Nan numbers are not propagated if 0, they are propagated if 1.
- alpha=1: can be used to scale the result.
- handle: Handle to a previously created cuDNN context. Defaults to a Knet allocated handle.
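A shape-level sketch of pool (arbitrary sizes; use a KnetArray on the GPU, per the support note above):
using Knet
x = randn(Float32, 28, 28, 16, 8)   # (X1,X2,I,N)
y = pool(x)                         # default window=2, stride=window, max pooling
size(y)                             # => (14, 14, 16, 8) since Yi = 1 + (28 + 0 - 2) ÷ 2
z = pool(x; window=2, mode=1)       # average pooling including padded values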
Knet.unpool
— Function
Unpooling; the reverse of pooling.
TODO: Does not work correctly for every window, padding, mode combination. Test before use.
x == pool(unpool(x;o...); o...)
Recurrent neural networks
Knet.RNN
— Type
rnn = RNN(inputSize, hiddenSize; opts...)
rnn(x; batchSizes) => y
rnn.h, rnn.c # hidden and cell states
RNN
returns a callable RNN object rnn
. Given a minibatch of sequences x
, rnn(x)
returns y
, the hidden states of the final layer for each time step. rnn.h
and rnn.c
fields can be used to set the initial hidden states and read the final hidden states of all layers. Note that the final time step of y
always contains the final hidden state of the last layer, equivalent to rnn.h
for a single layer network.
Dimensions: The input x
can be 1, 2, or 3 dimensional and y
will have the same number of dimensions as x
. size(x)=(X,[B,T]) and size(y)=(H/2H,[B,T]) where X is inputSize, B is batchSize, T is seqLength, H is hiddenSize, 2H is for bidirectional RNNs. By default a 1-D x
represents a single instance for a single time step, a 2-D x
represents a single minibatch for a single time step, and a 3-D x
represents a sequence of identically sized minibatches for multiple time steps. The output y
gives the hidden state (of the final layer for multi-layer RNNs) for each time step. The fields rnn.h
and rnn.c
represent the hidden states of all layers in a single time step and have size (H,B,L/2L) where L is numLayers and 2L is for bidirectional RNNs.
batchSizes: If batchSizes=nothing
(default), all sequences in a minibatch are assumed to be the same length. If batchSizes
is an array of (non-increasing) integers, it gives us the batch size for each time step (allowing different sequences in the minibatch to have different lengths). In this case x
will typically be 2-D with the second dimension representing variable size batches for time steps. If batchSizes
is used, sum(batchSizes)
should equal length(x) ÷ size(x,1)
. When the batch size is different in every time step, hidden states will have size (H,B,L/2L) where B is always the size of the first (largest) minibatch.
Hidden states: The hidden and cell states are kept in rnn.h
and rnn.c
fields (the cell state is only used by LSTM). They can be initialized during construction using the h
and c
keyword arguments, or modified later by direct assignment. Valid values are nothing
(default), 0
, or an array of the right type and size possibly wrapped in a Param
. If the value is nothing
the initial state is assumed to be zero and the final state is discarded keeping the value nothing
. If the value is 0
the initial state is assumed to be zero and 0
is replaced by the final state on return. If the value is a valid state, it is used as the initial state and is replaced by the final state on return.
In a differentiation context the returned final hidden states will be wrapped in Result
types. This is necessary if the same RNN object is to be called multiple times in a single iteration. Between iterations (i.e. after diff/update) the hidden states need to be unboxed with e.g. rnn.h = value(rnn.h)
to prevent spurious dependencies. This happens automatically during the backward pass for GPU RNNs but needs to be done manually for CPU RNNs. See the CharLM Tutorial for an example.
Keyword arguments for RNN:
- h=nothing: Initial hidden state.
- c=nothing: Initial cell state.
- rnnType=:lstm: Type of RNN: one of :relu, :tanh, :lstm, :gru.
- numLayers=1: Number of RNN layers.
- bidirectional=false: Create a bidirectional RNN if true.
- dropout=0: Dropout probability. Applied to input and between layers.
- skipInput=false: Do not multiply the input with a matrix if true.
- dataType=Float32: Data type to use for weights.
- algo=0: Algorithm to use, see CUDNN docs for details.
- seed=0: Random number seed for dropout. Uses time() if 0.
- winit=xavier_uniform: Weight initialization method for matrices.
- binit=zeros: Weight initialization method for bias vectors.
- finit=ones: Weight initialization method for the bias of forget gates.
- usegpu=(gpu()>=0): GPU used by default if one exists.
Formulas: RNNs compute the output h[t] for a given iteration from the recurrent input h[t-1] and the previous layer input x[t] given matrices W, R and biases bW, bR from the following equations:
:relu
and :tanh
: Single gate RNN with activation function f:
h[t] = f(W * x[t] .+ R * h[t-1] .+ bW .+ bR)
:gru
: Gated recurrent unit:
i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
r[t] = sigm(Wr * x[t] .+ Rr * h[t-1] .+ bWr .+ bRr) # reset gate
n[t] = tanh(Wn * x[t] .+ r[t] .* (Rn * h[t-1] .+ bRn) .+ bWn) # new gate
h[t] = (1 - i[t]) .* n[t] .+ i[t] .* h[t-1]
:lstm
: Long short term memory unit with no peephole connections:
i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
f[t] = sigm(Wf * x[t] .+ Rf * h[t-1] .+ bWf .+ bRf) # forget gate
o[t] = sigm(Wo * x[t] .+ Ro * h[t-1] .+ bWo .+ bRo) # output gate
n[t] = tanh(Wn * x[t] .+ Rn * h[t-1] .+ bWn .+ bRn) # new gate
c[t] = f[t] .* c[t-1] .+ i[t] .* n[t] # cell output
h[t] = o[t] .* tanh(c[t])
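A minimal usage sketch with arbitrary sizes; the atype line is an assumption that keeps the example runnable both with and without a GPU:
using Knet
X, H, B, T = 32, 64, 16, 10                  # inputSize, hiddenSize, batchSize, seqLength
rnn = RNN(X, H; rnnType=:lstm, numLayers=2)  # callable RNN object
rnn.h = 0; rnn.c = 0                         # request that the final states be stored
atype = gpu() >= 0 ? KnetArray{Float32} : Array{Float32}
x = atype(randn(Float32, X, B, T))           # a minibatch of T-step sequences
y = rnn(x)                                   # (H,B,T): top-layer hidden state per step
size(rnn.h)                                  # => (64, 16, 2): final hidden state of both layers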
Knet.rnnparam
— Functionrnnparam(r::RNN, layer, id, param)
Return a single weight matrix or bias vector as a slice of RNN weights.
Valid layer
values:
- For unidirectional RNNs 1:numLayers
- For bidirectional RNNs 1:2*numLayers, forw and back layers alternate.
Valid id
values:
- For RELU and TANH RNNs, input = 1, hidden = 2.
- For GRU reset = 1,4; update = 2,5; newmem = 3,6; 1:3 for input, 4:6 for hidden
- For LSTM inputgate = 1,5; forget = 2,6; newmem = 3,7; output = 4,8; 1:4 for input, 5:8 for hidden
Valid param values:
- Return the weight matrix (transposed!) if param==1.
- Return the bias vector if param==2.
The effect of skipInput: Let I=1 for RELU/TANH, 1:3 for GRU, 1:4 for LSTM.
- For skipInput=false (default), rnnparam(r,1,I,1) is an (inputSize,hiddenSize) matrix.
- For skipInput=true, rnnparam(r,1,I,1) is nothing.
- For bidirectional RNNs, the same applies to rnnparam(r,2,I,1): the first back layer.
- The input biases (param==2) are returned even if skipInput=true.
Knet.rnnparams
— Functionrnnparams(r::RNN)
Return the RNN parameters as an Array{Any}.
The order of params returned (subject to change):
- All weight matrices come before all bias vectors.
- Matrices and biases are sorted lexically based on (layer,id).
- See @doc rnnparam for valid layer and id values.
- Input multiplying matrices are nothing if r.inputMode == 1.
Batch Normalization
Knet.batchnorm
— Functionbatchnorm(x[, moments, params]; kwargs...)
performs batch normalization to x
with optional scaling factor and bias stored in params
.
2d, 4d and 5d inputs are supported. Mean and variance are computed over dimensions (2,), (1,2,4) and (1,2,3,5) for 2d, 4d and 5d arrays, respectively.
moments
stores running mean and variance to be used in testing. It is optional in the training mode, but mandatory in the test mode. Training and test modes are controlled by the training
keyword argument.
params
stores the optional affine parameters gamma and beta. bnparams
function can be used to initialize params
.
Example
# Initialization, C is an integer
moments = bnmoments()
params = bnparams(C)
...
# size(x) -> (H, W, C, N)
y = batchnorm(x, moments, params)
# size(y) -> (H, W, C, N)
Keywords
- eps=1e-5: The epsilon parameter added to the variance to avoid division by 0.
- training: When training is true, the mean and variance of x are used and the moments argument is modified if it is provided. When training is false, the mean and variance stored in the moments argument are used. Default value is true when at least one of x and params is an AutoGrad.Value, false otherwise.
Knet.bnmoments
— Functionbnmoments(;momentum=0.1, mean=nothing, var=nothing, meaninit=zeros, varinit=ones)
can be used to directly load moments from data. meaninit
and varinit
are called if mean
and var
are nothing. Type and size of the mean
and var
are determined automatically from the inputs in the batchnorm
calls. A BNMoments
object is returned.
BNMoments
A high-level data structure used to store running mean and running variance of batch normalization with the following fields:
- momentum::AbstractFloat: A real number between 0 and 1 to be used as the scale of the last mean and variance. The existing running mean or variance is multiplied by (1-momentum).
- mean: The running mean.
- var: The running variance.
- meaninit: The function used to initialize the running mean. Should be either nothing or of the form (eltype, dims...)->data. zeros is a good option.
- varinit: The function used to initialize the running variance. Should be either nothing or of the form (eltype, dims...)->data. ones is a good option.
Knet.bnparams
— Functionbnparams(etype, channels)
creates a single 1d array that contains both scale and bias of batchnorm, where the first half is scale and the second half is bias.
bnparams(channels)
calls bnparams
with etype=Float64
, following the Julia convention.
Model optimization
Knet.minimize
— Functionminimize(func, data, optimizer=Adam(); params)
sgd (func, data; lr=0.1, gclip, params)
momentum(func, data; lr=0.05, gamma=0.95, gclip, params)
nesterov(func, data; lr=0.05, gamma=0.95, gclip, params)
adagrad (func, data; lr=0.05, eps=1e-6, gclip, params)
rmsprop (func, data; lr=0.01, rho=0.9, eps=1e-6, gclip, params)
adadelta(func, data; lr=1.0, rho=0.9, eps=1e-6, gclip, params)
adam (func, data; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, gclip, params)
Return an iterator which applies func
to arguments in data
, i.e. (func(args...) for args in data)
, and updates the parameters every iteration to minimize func
. func
should return a scalar value.
The common keyword argument params
can be used to list the Param
s to be optimized. If not specified, any Param
that takes part in the computation of func(args...)
will be updated.
The common keyword argument gclip
can be used to implement per-parameter gradient clipping. For a parameter gradient g
, if norm(g) > gclip > 0
, g
is scaled so that its norm is equal to gclip
. If not specified no gradient clipping is performed.
These functions do not perform optimization, but return an iterator that can. Any function that produces values from an iterator can be used with such an object, e.g. progress!(sgd(f,d))
iterates the sgd optimizer and displays a progress bar. For convenience, appending !
to the name of the function iterates and returns nothing
, i.e. sgd!(...)
is equivalent to (for x in sgd(...) end)
.
We define optimizers as lazy iterators to have explicit control over them:
- To report progress use progress(sgd(f,d)).
- To run until convergence use converge(sgd(f,cycle(d))).
- To run multiple epochs use sgd(f,repeat(d,n)).
- To run a given number of iterations use sgd(f,take(cycle(d),n)).
- To do a task every n iterations use (task() for (i,j) in enumerate(sgd(f,d)) if i%n == 1).
These functions apply the same algorithm with the same configuration to every parameter by default. minimize
takes an explicit optimizer argument, all others call minimize
with an appropriate optimizer argument (see @doc update!
for a list of possible optimizers). Before calling update!
on a Param
, minimize
sets its opt
field to a copy of this default optimizer if it is not already set. The opt
field is used by the update!
function to determine the type of update performed on that parameter. If you need finer grained control, you can set the optimizer of an individual Param
by setting its opt
field before calling one of these functions. They will not override the opt
field if it is already set, e.g. sgd(model,data)
will perform an Adam
update for a parameter whose opt
field is an Adam
object. This also means you can stop and start the training without losing optimization state, the first call will set the opt
fields and the subsequent calls will not override them.
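A self-contained sketch of the iterator pattern on a toy linear-regression problem; the model, loss and data are illustrative, and atype is pinned to CPU arrays so the sketch runs with or without a GPU:
using Knet
w = param(1, 10; atype=Array{Float32})      # toy linear model
b = param0(1; atype=Array{Float32})
pred(x) = w * x .+ b
loss(x, y) = sum((pred(x) .- y) .^ 2) / size(y, 2)
data = [(randn(Float32, 10, 20), randn(Float32, 1, 20)) for _ in 1:100]
progress!(adam(loss, data))                      # one pass over data with Adam and a progress bar
progress!(sgd(loss, repeat(data, 5); lr=0.01))   # five epochs of SGD at lr=0.01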
Given a parameter w
and its gradient g
here are the updates applied by each optimizer:
# sgd (http://en.wikipedia.org/wiki/Stochastic_gradient_descent)
w .= w - lr * g
# momentum (http://jlmelville.github.io/mize/nesterov.html)
v .= gamma * v - lr * g
w .= w + v
# nesterov (http://jlmelville.github.io/mize/nesterov.html)
w .= w - gamma * v
v .= gamma * v - lr * g
w .= w + (1 + gamma) * v
# adagrad (http://www.jmlr.org/papers/v12/duchi11a.html)
G .= G + g .^ 2
w .= w - lr * g ./ sqrt(G + eps)
# rmsprop (http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
G .= rho * G + (1-rho) * g .^ 2
w .= w - lr * g ./ sqrt(G + eps)
# adadelta (http://arxiv.org/abs/1212.5701)
G .= rho * G + (1-rho) * g .^ 2
update = sqrt(delta + eps) .* g ./ sqrt(G + eps)
w = w - lr * update
delta = rho * delta + (1-rho) * update .^ 2
# adam (http://arxiv.org/abs/1412.6980)
v = beta1 * v + (1 - beta1) * g
G = beta2 * G + (1 - beta2) * g .^ 2
vhat = v ./ (1 - beta1 ^ t)
Ghat = G ./ (1 - beta2 ^ t)
w = w - (lr / (sqrt(Ghat) + eps)) * vhat
Knet.converge
— Functionconverge(itr; alpha=0.1)
Return an iterator which acts exactly like itr
, but quits when values from itr
stop decreasing. itr
should produce numeric values.
It can be used to train a model with the data cycled:
progress!(converge(minimize(model,cycle(data))))
alpha
controls the exponential average of values to detect convergence. Here is how convergence is decided:
p = x - avgx
avgx = c.alpha * x + (1-c.alpha) * avgx
avgp = c.alpha * p + (1-c.alpha) * avgp
avgp > 0.0 && return nothing
converge!(...)
is equivalent to (for x in converge(...) end)
, i.e. iterates over the object created by converge(...)
and returns nothing
.
Knet.minibatch
— Functionminibatch(x, [y], batchsize; shuffle, partial, xtype, ytype, xsize, ysize)
Return an iterator of minibatches [(xi,yi)...] given data tensors x, y and batchsize.
The last dimension of x and y gives the number of instances and should be equal. y
is optional, if omitted a sequence of xi
will be generated rather than (xi,yi)
tuples. Use repeat(d,n)
for multiple epochs, Iterators.take(d,n)
for a partial epoch, and Iterators.cycle(d)
to cycle through the data forever (this can be used with converge
). If you need the iterator to continue from its last position when stopped early (e.g. by a break in a for loop), use Iterators.Stateful(d)
(by default the iterator would restart from the beginning).
Keyword arguments:
- shuffle=false: Shuffle the instances every epoch.
- partial=false: If true, include the last partial minibatch < batchsize.
- xtype=typeof(x): Convert xi in minibatches to this type.
- ytype=typeof(y): Convert yi in minibatches to this type.
- xsize=size(x): Convert xi in minibatches to this shape.
- ysize=size(y): Convert yi in minibatches to this shape.
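A quick sketch with illustrative sizes:
using Knet
x = rand(Float32, 784, 1000)               # 1000 instances in columns
y = rand(1:10, 1000)                       # integer class labels
dtrn = minibatch(x, y, 100; shuffle=true)  # 10 minibatches of 100 instances each
length(dtrn)                               # => 10
(x1, y1) = first(dtrn)
size(x1), size(y1)                         # => ((784, 100), (100,))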
Knet.progress
— Functionprogress(msg, itr; steps, seconds, io)
progress(itr; o...) do p; [body of the msg function]; end
progress(itr; o...)
progress!(...)
Return a Progress
iterator which acts exactly like itr
, but prints a progressbar:
┣█████████████████▎ ┫ [86.83%, 903/1040, 01:36/01:50, 9.42i/s] 3.87835
Here 86.83%
is the percentage completed, 903
is the number of iterations completed, 1040
is the total number of iterations. 01:36
is elapsed time, 01:50
is the estimated total time, 9.42i/s
is the average number of iterations completed per second. If the speed is less than 1, the average number of seconds per iteration (s/i) is reported instead. The bar, percent, total iterations, and estimated total time are omitted for iterators whose size is unknown.
The 3.87835
at the end is the output of the msg
function applied to the Progress iterator. The message can be customized by the first two forms above, if not specified (the third form) nothing gets printed at the end of the line. The message function can use the following fields of its p::Progress
argument: p.currval
is the current iterator value and p.curriter
is the current iteration count.
The progress bar is updated and msg
is called with the Progress iterator every steps
iterations or every seconds
seconds in addition to the first and the last iteration. If neither steps
nor seconds
is specified the default is to update every second. The keyword argument io
determines where the progress bar is printed, the default is stderr
.
The last form, progress!(...)
, is equivalent to (for x in progress(...) end)
, i.e. iterates over the object created by progress(...)
and returns nothing
.
Knet.training
— Functiontraining()
returns true
only inside a @diff
context, e.g. during a training iteration of a model.
Hyperparameter optimization
Knet.goldensection
— Functiongoldensection(f,n;kwargs) => (fmin,xmin)
Find the minimum of f
using concurrent golden section search in n
dimensions. See Knet.goldensection_demo()
for an example.
f
is a function from a Vector{Float64}
of length n
to a Number
. It can return NaN
for out of range inputs. Goldensection will always start with a zero vector as the initial input to f
, and the initial step size will be 1 in each dimension. The user should define f
to scale and shift this input range into a vector meaningful for their application. For positive inputs like learning rate or hidden size, you can use a transformation such as x0*exp(x)
where x
is a value goldensection
passes to f
and x0
is your initial guess for this value. This will effectively start the search at x0
, then move with multiplicative steps.
I designed this algorithm combining ideas from Golden Section Search and Hill Climbing Search. It essentially runs golden section search concurrently in each dimension, picking the next step based on estimated gain.
Keyword arguments
- dxmin=0.1: smallest step size.
- accel=φ: acceleration rate. The golden ratio φ=1.618... is best.
- verbose=false: use true to print individual steps.
- history=[]: cache of [(x,f(x)),...] function evaluations.
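A sketch of the suggested x0*exp(x) transformation for two hypothetical hyperparameters (learning rate around 0.1, hidden size around 64); trainandvalidate is a stand-in for the user's own training and validation routine:
using Knet
trainandvalidate(lr, hidden) = (lr - 0.01)^2 + (hidden - 100)^2 / 1e4   # stand-in objective
function f(x::Vector{Float64})
    lr     = 0.1 * exp(x[1])              # positive, multiplicative steps around 0.1
    hidden = round(Int, 64 * exp(x[2]))   # positive integer around 64
    hidden < 1 && return NaN              # out-of-range inputs may return NaN
    return trainandvalidate(lr, hidden)
end
fmin, xmin = goldensection(f, 2)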
Knet.hyperband
— Functionhyperband(getconfig, getloss, maxresource=27, reduction=3)
Hyperparameter optimization using the hyperband algorithm from (Lisha et al. 2016). You can try a simple MNIST example using Knet.hyperband_demo()
.
Arguments
- getconfig() returns random configurations with a user defined type and distribution.
- getloss(c,n) returns the loss for configuration c and number of resources (e.g. epochs) n.
- maxresource is the maximum number of resources any one configuration should be given.
- reduction is an algorithm parameter (see the paper), 3 is a good value.
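A sketch of the two callbacks; trainfor and the configuration fields are hypothetical stand-ins for training a model with a given configuration for n epochs and returning a validation loss:
using Knet
trainfor(lr, hidden, n) = (lr - 0.01)^2 + (hidden - 100)^2 / 1e4 + 1 / n   # stand-in objective
getconfig() = (lr = 10.0^(-4 + 3 * rand()), hidden = rand(32:256))
getloss(c, n) = trainfor(c.lr, c.hidden, n)
hyperband(getconfig, getloss, 27, 3)   # runs the search; see Knet.hyperband_demo() for a full example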
Utilities
Knet.bmm
— Functionbmm(A, B ; transA=false, transB=false)
Perform a batch matrix-matrix product of matrices stored in A
and B
. size(A,2) == size(B,1) must hold, and size(A)[3:end] must match size(B)[3:end]. If A is an (m,n,b...) tensor and B is an (n,k,b...) tensor, the output is an (m,k,b...) tensor.
AutoGrad.cat1d
— Functioncat1d(args...)
Return vcat(vec.(args)...)
but possibly more efficiently. Can be used to concatenate the contents of arrays with different shapes and sizes.
Knet.dir
— FunctionKnet.dir(path...)
Construct a path relative to Knet root.
Example
julia> Knet.dir("examples","mnist.jl")
"/home/dyuret/.julia/dev/Knet/examples/mnist.jl"
Knet.dropout
— Functiondropout(x, p; drop, seed)
Given an array x
and probability 0<=p<=1
return an array y
in which each element is 0 with probability p
or x[i]/(1-p)
with probability 1-p
. Just return x
if p==0
, or drop=false
. By default drop=true
in a @diff
context, drop=false
otherwise. Specify a non-zero seed::Number
to set the random number seed for reproducible results. See (Srivastava et al. 2014) for a reference.
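A short sketch (the array and probability are arbitrary):
using Knet
x = ones(Float32, 4, 5)
dropout(x, 0.5)                      # outside @diff: drop defaults to false, x is returned unchanged
dropout(x, 0.5; drop=true)           # roughly half the entries become 0, the rest are scaled by 1/(1-0.5)
dropout(x, 0.5; drop=true, seed=1)   # reproducible mask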
Knet.gc
— FunctionKnet.gc(dev=gpu())
cudaFree all pointers allocated on device dev
that were previously allocated and garbage collected. Normally Knet holds on to all garbage collected pointers for reuse. Try this if you run out of GPU memory.
Knet.gpu
— Functiongpu()
returns the id of the active GPU device or -1 if none are active.
gpu(true)
resets all GPU devices and activates the one with the most available memory.
gpu(false)
resets and deactivates all GPU devices.
gpu(d::Int)
activates the GPU device d
if 0 <= d < gpuCount()
, otherwise deactivates devices.
gpu(true/false)
resets all devices. If there are any allocated KnetArrays their pointers will be left dangling. Thus gpu(true/false)
should only be used during startup. If you want to suspend GPU use temporarily, use gpu(-1)
.
gpu(d::Int)
does not reset the devices. You can select a previous device and find allocated memory preserved. However, trying to operate on arrays of an inactive device will result in an error.
Knet.invx
— Functioninvx(x) = 1/x
Knet.mat
— Functionmat(x; dims = ndims(x) - 1)
Reshape x
into a two-dimensional matrix by joining the first dims dimensions, i.e. reshape(x, prod(size(x,i) for i in 1:dims), :)
dims=ndims(x)-1
(default) is typically used when turning the output of a 4-D convolution result into a 2-D input for a fully connected layer.
dims=1
is typically used when turning the 3-D output of an RNN layer into a 2-D input for a fully connected layer.
dims=0
will turn the input into a row vector, dims=ndims(x)
will turn it into a column vector.
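Two typical shape examples (sizes are arbitrary):
using Knet
c = rand(Float32, 7, 7, 32, 16)   # e.g. the output of a convolution layer
size(mat(c))                      # => (1568, 16): joins the first ndims(c)-1 = 3 dims
h = rand(Float32, 64, 16, 10)     # e.g. the (H,B,T) output of an RNN layer
size(mat(h; dims=1))              # => (64, 160): keeps dim 1, joins the rest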
AutoGrad (advanced)
AutoGrad.@gcheck
— Macrogcheck(f, x...; kw, o...)
@gcheck f(x...; kw...) (opt1=val1,opt2=val2,...)
Numerically check the gradient of f(x...; kw...)
and return a boolean result.
Example call: gcheck(nll,model,x,y)
or @gcheck nll(model,x,y)
. The parameters should be marked as Param
arrays in f
, x
, and/or kw
. Only 10 random entries in each large numeric array are checked by default. If the output of f
is not a number, we check the gradient of sum(f(x...; kw...))
. Keyword arguments:
- kw=(): keyword arguments to be passed to f, i.e. f(x...; kw...).
- nsample=10: number of random entries from each param to check.
- atol=0.01, rtol=0.05: tolerance parameters. See isapprox for their meaning.
- delta=0.0001: step size for numerical gradient calculation.
- verbose=1: 0 prints nothing, 1 shows failing tests, 2 shows all tests.
AutoGrad.@primitive
— Macro@primitive fx g1 g2...
Define a new primitive operation for AutoGrad and (optionally) specify its gradients. Non-differentiable functions such as sign
, and non-numeric functions such as size
should be defined using the @zerograd macro instead.
Examples
@primitive sin(x::Number)
@primitive hypot(x1,x2),dy,y
@primitive sin(x::Number),dy (dy.*cos(x))
@primitive hypot(x1,x2),dy,y (dy.*x1./y) (dy.*x2./y)
The first example shows that fx
is a typed method declaration. Julia supports multiple dispatch, i.e. a single function can have multiple methods with different arg types. AutoGrad takes advantage of this and supports multiple dispatch for primitives and gradients.
The second example specifies variable names for the output gradient dy
and the output y
after the method declaration which can be used in gradient expressions. Untyped, ellipsis and keyword arguments are ok as in f(a::Int,b,c...;d=1)
. Parametric methods such as f(x::T) where {T<:Number}
cannot be used.
The method declaration can optionally be followed by gradient expressions. The third and fourth examples show how gradients can be specified. Note that the parameters, the return variable and the output gradient of the original function can be used in the gradient expressions.
Under the hood
The @primitive macro turns the first example into:
sin(x::Value{T}) where {T<:Number} = forw(sin, x)
This will cause calls to sin
with a boxed argument (Value{T<:Number}
) to be recorded. The recorded operations are used by AutoGrad to construct a dynamic computational graph. With multiple arguments things are a bit more complicated. Here is what happens with the second example:
hypot(x1::Value{S}, x2::Value{T}) where {S,T} = forw(hypot, x1, x2)
hypot(x1::S, x2::Value{T}) where {S,T} = forw(hypot, x1, x2)
hypot(x1::Value{S}, x2::T) where {S,T} = forw(hypot, x1, x2)
We want the forw method to be called if any one of the arguments is a boxed Value
. There is no easy way to specify this in Julia, so the macro generates all 2^N-1 boxed/unboxed argument combinations.
In AutoGrad, gradients are defined using gradient methods that have the following pattern:
back(f,Arg{i},dy,y,x...) => dx[i]
For the third example here is the generated gradient method:
back(::typeof(sin), ::Type{Arg{1}}, dy, y, x::Value{T}) where {T<:Number} = dy .* cos(x)
For the last example a different gradient method is generated for each argument:
back(::typeof(hypot), ::Type{Arg{1}}, dy, y, x1::Value{S}, x2::Value{T}) where {S,T} = (dy .* x1) ./ y
back(::typeof(hypot), ::Type{Arg{2}}, dy, y, x1::Value{S}, x2::Value{T}) where {S,T} = (dy .* x2) ./ y
In fact @primitive generates four more definitions for the other boxed/unboxed argument combinations.
Broadcasting
Broadcasting is handled by extra forw
and back
methods. @primitive
defines the following so that broadcasting of a primitive function with a boxed value triggers forw
and back
.
broadcasted(::typeof(sin), x::Value{T}) where {T<:Number} = forw(broadcasted,sin,x)
back(::typeof(broadcasted), ::Type{Arg{2}}, dy, y, ::typeof(sin), x::Value{T}) where {T<:Number} = dy .* cos(x)
If you do not want the broadcasting methods, you can use the @primitive1
macro. If you only want the broadcasting methods use @primitive2
. As a motivating example, here is how *
is defined for non-scalars:
@primitive1 *(x1,x2),dy (dy*x2') (x1'*dy)
@primitive2 *(x1,x2),dy unbroadcast(x1,dy.*x2) unbroadcast(x2,x1.*dy)
Regular *
is matrix multiplication, broadcasted *
is elementwise multiplication and the two have different gradients as defined above. unbroadcast(a,b)
reduces b
to the same shape as a
by performing the necessary summations.
AutoGrad.@zerograd
— Macro@zerograd f(args...; kwargs...)
Define f
as an AutoGrad primitive operation with zero gradient.
Example:
@zerograd floor(x::Float32)
@zerograd
allows f
to handle boxed Value
inputs by unboxing them like a @primitive
, but unlike @primitive
it does not record its actions or return a boxed Value
result. Some functions, like sign()
, have zero gradient. Others, like length()
have discrete or constant outputs. These need to handle Value
inputs, but do not need to record anything and can return regular values. Their output can be treated like a constant in the program. Use the @zerograd
macro for those. Use the @zerograd1
variant if you don't want to define the broadcasting version and @zerograd2
if you only want to define the broadcasting version. Note that kwargs
are NOT unboxed.
Per-parameter optimization (advanced)
The model optimization methods apply the same algorithm with the same configuration to every parameter. If you need finer grained control, you can set the optimization algorithm and configuration of an individual Param
by setting its opt
field to one of the optimization objects like Adam
listed below. The opt
field is used as an argument to update!
and controls the type of update performed on that parameter. Model optimization methods like sgd
will not override the opt
field if it is already set, e.g. sgd(model,data)
will perform an Adam
update for a parameter whose opt
field is an Adam
object. This also means you can stop and start the training without losing optimization state, the first call will set the opt
fields and the subsequent calls will not override them.
Knet.update!
— Functionupdate!(weights::Param, gradients)
update!(weights, gradients; lr=0.1, gclip=0)
update!(weights, gradients, optimizers)
Update the weights
using their gradients
and the optimization algorithms specified using (1) the opt
field of a Param
, (2) keyword arguments, (3) the third argument.
weights
can be an individual Param
, numeric array, or a collection of arrays/Params represented by an iterator or dictionary. gradients
should be a matching individual array or collection. In the first form, the optimizer should be specified in weights.opt
. In the second form the optimizer defaults to SGD
with learning rate lr
and gradient clip gclip
. In the third form optimizers
should be a matching individual optimizer or collection of optimizers. The weights
and possibly gradients
and optimizers
are modified in-place.
Individual optimization parameters can be one of the following types. The keyword arguments for each constructor and their default values are listed as well.
- SGD(;lr=0.1, gclip=0)
- Momentum(;lr=0.05, gamma=0.95, gclip=0)
- Nesterov(;lr=0.05, gamma=0.95, gclip=0)
- Adagrad(;lr=0.05, eps=1e-6, gclip=0)
- Rmsprop(;lr=0.01, rho=0.9, eps=1e-6, gclip=0)
- Adadelta(;lr=1.0, rho=0.9, eps=1e-6, gclip=0)
- Adam(;lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, gclip=0)
Example:
w = Param(rand(d), Adam()) # a Param with a specified optimizer
g = lossgradient0(w) # gradient g has the same shape as w
update!(w, g) # update w in-place with Adam()
w = rand(d) # an individual weight array
g = lossgradient1(w) # gradient g has the same shape as w
update!(w, g) # update w in-place with SGD()
update!(w, g; lr=0.1) # update w in-place with SGD(lr=0.1)
update!(w, g, SGD(lr=0.1)) # update w in-place with SGD(lr=0.1)
w = (rand(d1), rand(d2)) # a tuple of weight arrays
g = lossgradient2(w) # g will also be a tuple
p = (Adam(), SGD()) # p has optimizers for each w[i]
update!(w, g, p) # update each w[i] in-place with g[i],p[i]
w = Any[rand(d1), rand(d2)] # any iterator can be used
g = lossgradient3(w) # g will be similar to w
p = Any[Adam(), SGD()] # p should be an iterator of same length
update!(w, g, p) # update each w[i] in-place with g[i],p[i]
w = Dict(:a => rand(d1), :b => rand(d2)) # dictionaries can be used
g = lossgradient4(w)
p = Dict(:a => Adam(), :b => SGD())
update!(w, g, p)
Knet.SGD
— TypeSGD(;lr=0.1,gclip=0)
update!(w,g,p::SGD)
update!(w,g;lr=0.1)
Container for parameters of the Stochastic gradient descent (SGD) optimization algorithm used by update!
.
SGD is an optimization technique to minimize an objective function by updating its weights in the opposite direction of their gradient. The learning rate (lr) determines the size of the step. SGD updates the weights with the following formula:
w = w - lr * g
where w
is a weight array, g
is the gradient of the loss function w.r.t w
and lr
is the learning rate.
If norm(g) > gclip > 0
, g
is scaled so that its norm is equal to gclip
. If gclip==0
no scaling takes place.
SGD is used by default if no algorithm is specified in the two-argument version of update!.
Knet.Momentum
— TypeMomentum(;lr=0.05, gclip=0, gamma=0.95)
update!(w,g,p::Momentum)
Container for parameters of the Momentum optimization algorithm used by update!
.
The Momentum method tries to accelerate SGD by adding a velocity term to the update. This also decreases the oscillation between successive steps. It updates the weights with the following formulas:
velocity = gamma * velocity + lr * g
w = w - velocity
where w
is a weight array, g
is the gradient of the objective function w.r.t w
, lr
is the learning rate, gamma
is the momentum parameter, velocity
is an array with the same size and type of w
and holds the accelerated gradients.
If norm(g) > gclip > 0
, g
is scaled so that its norm is equal to gclip
. If gclip==0
no scaling takes place.
Reference: Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks : The Official Journal of the International Neural Network Society, 12(1), 145–151.
Knet.Nesterov
— TypeNesterov(; lr=0.05, gclip=0, gamma=0.95)
update!(w,g,p::Nesterov)
Container for parameters of Nesterov's momentum optimization algorithm used by update!
.
It is similar to standard Momentum
but with a slightly different update rule:
velocity = gamma * velocity_old - lr * g
w = w_old - velocity_old + (1+gamma) * velocity
where w
is a weight array, g
is the gradient of the objective function w.r.t w
, lr
is the learning rate, gamma
is the momentum parameter, velocity
is an array with the same size and type of w
and holds the accelerated gradients.
If norm(g) > gclip > 0
, g
is scaled so that its norm is equal to gclip
. If gclip == 0
no scaling takes place.
Reference implementation: Yoshua Bengio, Nicolas Boulanger-Lewandowski and Razvan Pascanu.
Knet.Adagrad
— TypeAdagrad(;lr=0.05, gclip=0, eps=1e-6)
update!(w,g,p::Adagrad)
Container for parameters of the Adagrad optimization algorithm used by update!
.
Adagrad is one of the methods that adapt the learning rate to each of the weights. It stores the sum of the squares of the gradients and uses it to scale the learning rate: each weight's update is the current gradient divided by the square root of its accumulated squared gradients. Hence the effective learning rate is larger for parameters with small accumulated gradients and smaller for parameters with large accumulated gradients. It updates the weights with the following formulas:
G = G + g .^ 2
w = w - g .* lr ./ sqrt(G + eps)
where w
is the weight, g
is the gradient of the objective function w.r.t w
, lr
is the learning rate, G
is an array with the same size and type of w
and holds the sum of the squares of the gradients. eps
is a small constant to prevent a zero value in the denominator.
If norm(g) > gclip > 0
, g
is scaled so that its norm is equal to gclip
. If gclip==0
no scaling takes place.
Reference: Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.
Knet.Rmsprop
— TypeRmsprop(;lr=0.01, gclip=0, rho=0.9, eps=1e-6)
update!(w,g,p::Rmsprop)
Container for parameters of the Rmsprop optimization algorithm used by update!
.
Rmsprop scales the learning rate by dividing it by a running root mean square of the gradients. It updates the weights with the following formula:
G = (1-rho) * g .^ 2 + rho * G
w = w - lr * g ./ sqrt(G + eps)
where w
is the weight, g
is the gradient of the objective function w.r.t w
, lr
is the learning rate, G
is an array with the same size and type of w
and holds an exponential moving average of the squared gradients. eps
is a small constant to prevent a zero value in the denominator, and rho
is the decay (momentum) parameter.
If norm(g) > gclip > 0
, g
is scaled so that its norm is equal to gclip
. If gclip==0
no scaling takes place.
Reference: Tijmen Tieleman and Geoffrey Hinton (2012). "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural Networks for Machine Learning 4.2.
Knet.Adadelta
— TypeAdadelta(;lr=1.0, gclip=0, rho=0.9, eps=1e-6)
update!(w,g,p::Adadelta)
Container for parameters of the Adadelta optimization algorithm used by update!
.
Adadelta is an extension of Adagrad that tries to prevent the decrease of the learning rates to zero as training progresses. It scales the learning rate based on the accumulated gradients like Adagrad and holds the acceleration term like Momentum. It updates the weights with the following formulas:
G = (1-rho) * g .^ 2 + rho * G
update = g .* sqrt(delta + eps) ./ sqrt(G + eps)
w = w - lr * update
delta = rho * delta + (1-rho) * update .^ 2
where w
is the weight, g
is the gradient of the objective function w.r.t w
, lr
is the learning rate, G
is an array with the same size and type of w
and holds the sum of the squares of the gradients. eps
is a small constant to prevent a zero value in the denominator. rho
is the momentum parameter and delta
is an array with the same size and type of w
and holds the sum of the squared updates.
If norm(g) > gclip > 0
, g
is scaled so that its norm is equal to gclip
. If gclip==0
no scaling takes place.
Reference: Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method.
Knet.Adam
— TypeAdam(;lr=0.001, gclip=0, beta1=0.9, beta2=0.999, eps=1e-8)
update!(w,g,p::Adam)
Container for parameters of the Adam optimization algorithm used by update!
.
Adam is one of the methods that compute an adaptive learning rate. It stores a running average of the gradients (first moment) and of the squared gradients (second moment), and applies a bias correction to both as a function of the update count. Here are the update formulas:
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g .* g
mhat = m ./ (1 - beta1 ^ t)
vhat = v ./ (1 - beta2 ^ t)
w = w - (lr / (sqrt(vhat) + eps)) * mhat
where w
is the weight, g
is the gradient of the objective function w.r.t w
, lr
is the learning rate, m
is an array with the same size and type of w
and holds the accumulated gradients. v
is an array with the same size and type of w
and holds the sum of the squares of the gradients. eps
is a small constant to prevent a zero denominator. beta1
and beta2
are the parameters to calculate bias corrected first and second moments. t
is the update count.
If norm(g) > gclip > 0
, g
is scaled so that its norm is equal to gclip
. If gclip==0
no scaling takes place.
Reference: Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.
Function Index
AutoGrad.AutoGrad
Knet.Adadelta
Knet.Adagrad
Knet.Adam
Knet.KnetArray
Knet.Momentum
Knet.Nesterov
Knet.RNN
Knet.Rmsprop
Knet.SGD
AutoGrad.cat1d
Knet.accuracy
Knet.batchnorm
Knet.bce
Knet.bilinear
Knet.bmm
Knet.bnmoments
Knet.bnparams
Knet.conv4
Knet.converge
Knet.deconv4
Knet.dir
Knet.dropout
Knet.gaussian
Knet.gc
Knet.goldensection
Knet.gpu
Knet.hyperband
Knet.invx
Knet.load
Knet.logistic
Knet.logp
Knet.logsoftmax
Knet.logsumexp
Knet.mat
Knet.minibatch
Knet.minimize
Knet.nll
Knet.param
Knet.pool
Knet.progress
Knet.rnnparam
Knet.rnnparams
Knet.save
Knet.sigm
Knet.softmax
Knet.training
Knet.unpool
Knet.update!
Knet.xavier
Knet.xavier_normal
Knet.xavier_uniform
Knet.zeroone
NNlib.elu
NNlib.relu
NNlib.selu
AutoGrad.@gcheck
AutoGrad.@primitive
AutoGrad.@zerograd
Knet.@load
Knet.@save