Reference

AutoGrad

AutoGrad.AutoGrad (Module)

Usage:

x = Param([1,2,3])          # user declares parameters with `Param`
x => P([1,2,3])             # `Param` is just a struct wrapping a value
value(x) => [1,2,3]         # `value` returns the thing wrapped
sum(x .* x) => 14           # Params act like regular values
y = @diff sum(x .* x)       # Except when we differentiate using `@diff`
y => T(14)                  # you get another struct
value(y) => 14              # which carries the same result
params(y) => [x]            # and the Params that it depends on 
grad(y,x) => [2,4,6]        # and the gradients for all Params

Param(x) returns a struct that acts like x but marks it as a parameter you want to compute gradients with respect to.

@diff expr evaluates an expression and returns a struct that contains the result (which should be a scalar) and gradient information.

grad(y, x) returns the gradient of y (output by @diff) with respect to any parameter x::Param, or nothing if the gradient is 0.

value(x) returns the value associated with x if x is a Param or the output of @diff, otherwise returns x.

params(x) returns an iterator of Params found by a recursive search of object x.

Alternative usage:

x = [1 2 3]
f(x) = sum(x .* x)
f(x) => 14
grad(f)(x) => [2 4 6]
gradloss(f)(x) => ([2 4 6], 14)

Given a scalar valued function f, grad(f,argnum=1) returns another function g which takes the same inputs as f and returns the gradient of the output with respect to the argnum'th argument. gradloss is similar except the resulting function also returns f's output.

KnetArray

Knet.KnetArray (Type)
KnetArray{T}(undef,dims)
KnetArray(a::AbstractArray)
Array(k::KnetArray)

Container for GPU arrays that supports most of the AbstractArray interface. The constructor allocates a KnetArray in the currently active device, as specified by gpu(). KnetArrays and Arrays can be converted to each other as shown above, which involves copying to and from the GPU memory. Only Float32/64 KnetArrays are fully supported.

Important differences from the alternative CudaArray are: (1) a custom memory manager that minimizes the number of calls to the slow cudaMalloc by reusing already allocated but garbage collected GPU pointers; (2) a custom getindex that handles ranges such as a[5:10] as shared-memory views instead of copies; (3) custom CUDA kernels that implement elementwise, broadcasting, and reduction operations.
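
For example, a minimal host/GPU round trip (an illustrative sketch; assumes a working GPU and Float32 data):

a = rand(Float32, 3, 4)   # host array
k = KnetArray(a)          # copy to the active GPU device
y = k .* 2f0 .+ 1f0       # elementwise kernels run on the GPU
v = k[:, 2:3]             # contiguous ranges share memory with k instead of copying
b = Array(y)              # copy the result back to the host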

Supported functions:

  • Indexing: getindex, setindex! with the following index types:

    • 1-D: Real, Colon, OrdinalRange, AbstractArray{Real}, AbstractArray{Bool}, CartesianIndex, AbstractArray{CartesianIndex}, EmptyArray, KnetArray{Int32} (low level), KnetArray{0/1} (using float for BitArray) (1-D includes linear indexing of multidimensional arrays)
    • 2-D: (Colon,Union{Real,Colon,OrdinalRange,AbstractVector{Real},AbstractVector{Bool},KnetVector{Int32}}), (Union{Real,AbstractUnitRange,Colon}...) (in any order)
    • N-D: (Real...)
  • Array operations: ==, !=, cat, convert, copy, copyto!, deepcopy, display, eachindex, eltype, endof, fill!, first, hcat, isapprox, isempty, length, ndims, one, ones, pointer, rand!, randn!, reshape, similar, size, stride, strides, summary, vcat, vec, zero. (cat(x,y,dims=i) supported for i=1,2.)

  • Math operators: (-), abs, abs2, acos, acosh, asin, asinh, atan, atanh, cbrt, ceil, cos, cosh, cospi, erf, erfc, erfcinv, erfcx, erfinv, exp, exp10, exp2, expm1, floor, log, log10, log1p, log2, round, sign, sin, sinh, sinpi, sqrt, tan, tanh, trunc

  • Broadcasting operators: (.*), (.+), (.-), (./), (.<), (.<=), (.!=), (.==), (.>), (.>=), (.^), max, min. (Boolean operators generate outputs with same type as inputs; no support for KnetArray{Bool}.)

  • Reduction operators: countnz, maximum, mean, minimum, prod, sum, sumabs, sumabs2, norm.

  • Linear algebra: (*), axpy!, permutedims (up to 5D), transpose

  • Knet extras: relu, sigm, invx, logp, logsumexp, conv4, pool, deconv4, unpool, mat, update! (Only 4D/5D, Float32/64 KnetArrays support conv4, pool, deconv4, unpool)

Memory management

Knet models do not overwrite arrays that need to be preserved for gradient calculation. This leads to a lot of allocation, and regular GPU memory allocation is prohibitively slow. Fortunately, most models use identically sized arrays over and over again, so we can minimize the number of actual allocations by reusing previously allocated but garbage collected pointers.

When Julia gc reclaims a KnetArray, a special finalizer keeps its pointer in a table instead of releasing the memory. If an array with the same size in bytes is later requested, the same pointer is reused. The exact algorithm for allocation is:

  1. Try to find a previously allocated and garbage collected pointer in the current device. (0.5 μs)

  2. If not available, try to allocate a new array using cudaMalloc. (10 μs)

  3. If not successful, try running gc() and see if we get a pointer of the right size. (75 ms, but this should be amortized over all reusable pointers that become available due to the gc)

  4. Finally if all else fails, clean up all saved pointers in the current device using cudaFree and try allocation one last time. (25-70 ms, however this causes the elimination of all reusable pointers)
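
This strategy can be pictured with the following sketch (illustrative only; FREE, trymalloc and freeall! are hypothetical names, not Knet's internal API):

const FREE = Dict{Int,Vector{Ptr{Nothing}}}()  # bytes => recycled device pointers

function knetalloc(bytes::Int)
    stack = get!(FREE, bytes, Ptr{Nothing}[])
    isempty(stack) || return pop!(stack)  # 1. reuse a garbage collected pointer
    ptr = trymalloc(bytes)                # 2. hypothetical cudaMalloc wrapper, nothing on failure
    ptr === nothing || return ptr
    GC.gc()                               # 3. finalizers push dead pointers onto FREE
    isempty(stack) || return pop!(stack)
    freeall!(FREE)                        # 4. hypothetical: cudaFree all saved pointers
    return trymalloc(bytes)
end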

source

File I/O

Knet.save (Function)
Knet.save(filename, args...; kwargs...)

Call FileIO.save after serializing Knet-specific args.

File format is determined by the filename extension. JLD and JLD2 are supported. Other formats may work if supported by FileIO, please refer to the documentation of FileIO and the specific format. Example:

Knet.save("foo.jld2", "name1", value1, "name2", value2)
source
Knet.load (Function)
Knet.load(filename, args...; kwargs...)

Call FileIO.load, then deserialize Knet-specific values.

File format is determined by FileIO. JLD and JLD2 are supported. Other formats may work if supported by FileIO, please refer to the documentation of FileIO and the specific format. Example:

Knet.load("foo.jld2")           # returns a ("name"=>value) dictionary
Knet.load("foo.jld2", "name1")  # returns the value of "name1" in "foo.jld2"
Knet.load("foo.jld2", "name1", "name2")   # returns tuple (value1, value2)
source
Knet.@save (Macro)
Knet.@save "filename" variable1 variable2...

Save the values of the specified variables to filename in JLD2 format.

When called with no variable arguments, save all variables in the global scope of the current module to filename. See JLD2.

source
Knet.@load (Macro)
Knet.@load "filename" variable1 variable2...

Load the values of the specified variables from filename in JLD2 format.

When called with no variable arguments, load all variables in filename. See JLD2.
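
A typical round trip (a sketch; weights is a hypothetical variable name):

weights = rand(3, 3)
Knet.@save "checkpoint.jld2" weights   # write weights to checkpoint.jld2
Knet.@load "checkpoint.jld2" weights   # restore weights into the current scope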

source

Parameter initialization

Knet.param (Function)
param(array; atype)
param(dims...; init, atype)
param0(dims...; atype)

The first form returns Param(atype(array)) where atype=identity is the default.

The second form returns a randomly initialized Param(atype(init(dims...))). By default, init is xavier and atype is KnetArray{Float32} if gpu() >= 0, Array{Float32} otherwise.

The third form param0 is an alias for param(dims...; init=zeros).
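
For example (a sketch; shapes are arbitrary):

w = param(256, 784)                    # xavier-initialized weights
b = param0(256)                        # zero-initialized bias
v = param(4, 4; atype=Array{Float64})  # force a specific array type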

source
Knet.xavier (Function)
xavier(a...)

Xavier initialization returns uniform random weights in the range ±sqrt(2 / (fanin + fanout)). The a arguments are passed to rand. See (Glorot and Bengio 2010) for a description. Caffe implements this slightly differently. Lasagne calls it GlorotUniform.

source
Knet.gaussian (Function)
gaussian(a...; mean=0.0, std=0.01)

Return a Gaussian array with a given mean and standard deviation. The a arguments are passed to randn.

source
Knet.bilinear (Function)
bilinear(T,fw,fh,IN,ON)

Bilinear interpolation filter weights; used for initializing deconvolution layers.

Adapted from https://github.com/shelhamer/fcn.berkeleyvision.org/blob/master/surgery.py#L33

Arguments:

  • T: data type
  • fw: width upscale factor
  • fh: height upscale factor
  • IN: number of input filters
  • ON: number of output filters

Example usage:

w = bilinear(Float32,2,2,128,128)

source

Activation functions

Knet.elu (Function)
elu(x)

Return (x > 0 ? x : exp(x)-1).

Reference: Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs) (https://arxiv.org/abs/1511.07289).

source
Knet.relu (Function)
relu(x)

Return max(0,x).

source
Knet.selu (Function)
selu(x)

Return λ01 * (x > 0 ? x : α01 * (exp(x)-1)) where λ01=1.0507009873554805 and α01=1.6732632423543778.

Reference: Self-Normalizing Neural Networks (https://arxiv.org/abs/1706.02515).

source
Knet.sigm (Function)

sigm(x) = 1/(1+exp(-x))

source

Loss functions

Knet.accuracy (Function)
accuracy(scores, answers; dims=1, average=true)

Given an unnormalized scores matrix and an Integer array of correct answers, return the ratio of instances where the correct answer has the maximum score. dims=1 means instances are in columns, dims=2 means instances are in rows. Use average=false to return the number of correct answers instead of the ratio.
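
For example (illustrative):

scores = [0.3 2.0; 1.2 0.1]              # 2 classes (rows) x 2 instances (columns)
answers = [2, 1]                         # correct class for each instance
accuracy(scores, answers)                # => 1.0: both column maxima match
accuracy(scores, answers; average=false) # => 2: count instead of ratio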

source
accuracy(model, data; dims=1, average=true, o...)

Compute accuracy(model(x; o...), y; dims) for (x,y) in data and return the ratio (if average=true) or the count (if average=false) of correct answers.

source
Knet.bce (Function)
bce(scores,answers;average=true)

Computes binary cross entropy given scores (predicted values) and answer labels. answer values should be {0,1}; it returns the negative of mean|sum(answers * log(p) + (1-answers)*log(1-p)) where p is equal to 1/(1 + exp.(-scores)). See also logistic.

source
Knet.logistic (Function)
logistic(scores, answers; average=true)

Computes logistic loss given scores (predicted values) and answer labels. answer values should be {-1,1}; it returns mean|sum(log(1 + exp(-answers*scores))). See also bce.

source
Knet.logp (Function)
logp(x; dims=:)

Treat entries in x as unnormalized log probabilities and return normalized log probabilities.

dims is an optional argument, if not specified the normalization is over the whole x, otherwise the normalization is performed over the given dimensions. In particular, if x is a matrix, dims=1 normalizes columns of x and dims=2 normalizes rows of x.
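
For example (illustrative):

x = randn(3, 5)
lp = logp(x; dims=1)   # normalize each column
sum(exp.(lp); dims=1)  # ≈ ones(1, 5): each column now sums to 1 in probability space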

source
Knet.logsoftmax (Function)
logsoftmax(x; dims=:)

Equivalent to logp(x; dims). See also softmax.

source
Knet.logsumexp (Function)
logsumexp(x;dims=:)

Compute log(sum(exp(x);dims)) in a numerically stable manner.

dims is an optional argument, if not specified the summation is over the whole x, otherwise the summation is performed over the given dimensions. In particular if x is a matrix, dims=1 sums columns of x and dims=2 sums rows of x.
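
For example (illustrative), naive evaluation overflows where logsumexp does not:

x = [1000.0, 1000.0]
log(sum(exp.(x)))  # Inf: exp overflows
logsumexp(x)       # ≈ 1000.6931 == 1000 + log(2)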

source
Knet.nll (Function)
nll(scores, answers; dims=1, average=true)

Given an unnormalized scores matrix and an Integer array of correct answers, return the per-instance negative log likelihood. dims=1 means instances are in columns, dims=2 means instances are in rows. Use average=false to return the sum instead of per-instance average.
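
For example (illustrative), nll picks out the normalized log probability of each correct answer:

scores = randn(10, 4)   # 10 classes, 4 instances in columns
answers = rand(1:10, 4)
lp = logp(scores; dims=1)
nll(scores, answers) ≈ -sum(lp[answers[j], j] for j in 1:4) / 4   # => true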

source
nll(model, data; dims=1, average=true, o...)

Compute nll(model(x; o...), y; dims) for (x,y) in data and return the per-instance average (if average=true) or total (if average=false) negative log likelihood.

source
Knet.softmax (Function)
softmax(x; dims=1, algo=1)

The softmax function typically used in classification. Gives the same results as exp.(logp(x; dims)).

If algo=1 the computation is more accurate; if algo=0 it is faster.

See also logsoftmax.

source
Knet.zeroone (Function)

zeroone loss is equal to 1 - accuracy

source

Convolution and Pooling

Knet.conv4 (Function)
conv4(w, x; kwargs...)

Execute convolutions or cross-correlations using filters specified with w over tensor x.

Currently KnetArray{Float32/64,4/5} and Array{Float32/64,4} are supported as w and x. If w has dimensions (W1,W2,...,I,O) and x has dimensions (X1,X2,...,I,N), the result y will have dimensions (Y1,Y2,...,O,N) where

Yi=1+floor((Xi+2*padding[i]-Wi)/stride[i])

Here I is the number of input channels, O is the number of output channels, N is the number of instances, and Wi,Xi,Yi are spatial dimensions. padding and stride are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.
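
For example (illustrative), padding=2 keeps the spatial size with a 5x5 filter:

x = KnetArray(randn(Float32, 28, 28, 3, 16))  # 28x28 RGB images, 16 instances
w = KnetArray(randn(Float32, 5, 5, 3, 10))    # 5x5 filters mapping 3 -> 10 channels
y = conv4(w, x; padding=2)                    # size(y) == (28, 28, 10, 16)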

Keywords

  • padding=0: the number of extra zeros implicitly concatenated at the start and at the end of each dimension.
  • stride=1: the number of elements to slide to reach the next filtering window.
  • upscale=1: upscale factor for each dimension.
  • mode=0: 0 for convolution and 1 for cross-correlation.
  • alpha=1: can be used to scale the result.
  • handle: handle to a previously created cuDNN context. Defaults to a Knet allocated handle.
source
Knet.deconv4 (Function)
y = deconv4(w, x; kwargs...)

Simulate 4-D deconvolution by using transposed convolution operation. Its forward pass is equivalent to backward pass of a convolution (gradients with respect to input tensor). Likewise, its backward pass (gradients with respect to input tensor) is equivalent to forward pass of a convolution. Since it swaps forward and backward passes of convolution operation, padding and stride options belong to output tensor. See this report for further explanation.

Currently KnetArray{Float32/64,4} and Array{Float32/64,4} are supported as w and x. If w has dimensions (W1,W2,...,O,I) and x has dimensions (X1,X2,...,I,N), the result y will have dimensions (Y1,Y2,...,O,N) where

Yi = Wi + stride[i]*(Xi-1) - 2*padding[i]

Here I is the number of input channels, O is the number of output channels, N is the number of instances, and Wi,Xi,Yi are spatial dimensions. padding and stride are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.
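
For example (illustrative), stride=2 doubles the spatial dimensions:

x = KnetArray(randn(Float32, 14, 14, 8, 16))  # 8 input channels
w = KnetArray(randn(Float32, 2, 2, 3, 8))     # (W1,W2,O,I): 8 -> 3 channels
y = deconv4(w, x; stride=2)                   # size(y) == (28, 28, 3, 16)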

Keywords

  • padding=0: the number of extra zeros implicitly concatenated at the start and at the end of each dimension.
  • stride=1: the number of elements to slide to reach the next filtering window.
  • mode=0: 0 for convolution and 1 for cross-correlation.
  • alpha=1: can be used to scale the result.
  • handle: handle to a previously created cuDNN context. Defaults to a Knet allocated handle.
source
Knet.pool (Function)
pool(x; kwargs...)

Compute pooling of input values (i.e., the maximum or average of several adjacent values) to produce an output with smaller height and/or width.

Currently 4 or 5 dimensional KnetArrays with Float32 or Float64 entries are supported. If x has dimensions (X1,X2,...,I,N), the result y will have dimensions (Y1,Y2,...,I,N) where

Yi=1+floor((Xi+2*padding[i]-window[i])/stride[i])

Here I is the number of input channels, N is the number of instances, and Xi,Yi are spatial dimensions. window, padding and stride are keyword arguments that can be specified as a single number (in which case they apply to all dimensions), or an array/tuple with entries for each spatial dimension.
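
For example (illustrative), the default 2x2 max pooling halves each spatial dimension:

x = KnetArray(randn(Float32, 28, 28, 3, 16))
y = pool(x)   # size(y) == (14, 14, 3, 16)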

Keywords:

  • window=2: the pooling window size for each dimension.
  • padding=0: the number of extra zeros implicitly concatenated at the start and at the end of each dimension.
  • stride=window: the number of elements to slide to reach the next pooling window.
  • mode=0: 0 for max, 1 for average including padded values, 2 for average excluding padded values.
  • maxpoolingNanOpt=0: NaN values are not propagated if 0, they are propagated if 1.
  • alpha=1: can be used to scale the result.
  • handle: Handle to a previously created cuDNN context. Defaults to a Knet allocated handle.
source
Knet.unpool (Function)

Unpooling; reverse of pooling.

x == pool(unpool(x;o...); o...)
source

Recurrent neural networks

Knet.RNN (Type)
rnn = RNN(inputSize, hiddenSize; opts...)
rnn(x; batchSizes) => y
rnn.h, rnn.c  # hidden and cell states

RNN returns a callable RNN object rnn. Given a minibatch of sequences x, rnn(x) returns y, the hidden states of the final layer for each time step. rnn.h and rnn.c fields can be used to set the initial hidden states and read the final hidden states of all layers. Note that the final time step of y always contains the final hidden state of the last layer, equivalent to rnn.h for a single layer network.

Dimensions: The input x can be 1, 2, or 3 dimensional and y will have the same number of dimensions as x. size(x)=(X,[B,T]) and size(y)=(H/2H,[B,T]) where X is inputSize, B is batchSize, T is seqLength, H is hiddenSize, 2H is for bidirectional RNNs. By default a 1-D x represents a single instance for a single time step, a 2-D x represents a single minibatch for a single time step, and a 3-D x represents a sequence of identically sized minibatches for multiple time steps. The output y gives the hidden state (of the final layer for multi-layer RNNs) for each time step. The fields rnn.h and rnn.c represent the hidden states of all layers in a single time step and have size (H,B,L/2L) where L is numLayers and 2L is for bidirectional RNNs.

batchSizes: If batchSizes=nothing (default), all sequences in a minibatch are assumed to be the same length. If batchSizes is an array of (non-increasing) integers, it gives us the batch size for each time step (allowing different sequences in the minibatch to have different lengths). In this case x will typically be 2-D with the second dimension representing variable size batches for time steps. If batchSizes is used, sum(batchSizes) should equal length(x) ÷ size(x,1). When the batch size is different in every time step, hidden states will have size (H,B,L/2L) where B is always the size of the first (largest) minibatch.

Hidden states: The hidden and cell states are kept in rnn.h and rnn.c fields (the cell state is only used by LSTM). They can be initialized during construction using the h and c keyword arguments, or modified later by direct assignment. Valid values are nothing (default), 0, or an array of the right type and size possibly wrapped in a Param. If the value is nothing the initial state is assumed to be zero and the final state is discarded keeping the value nothing. If the value is 0 the initial state is assumed to be zero and 0 is replaced by the final state on return. If the value is a valid state, it is used as the initial state and is replaced by the final state on return.

In a differentiation context the returned final hidden states will be wrapped in Result types. This is necessary if the same RNN object is to be called multiple times in a single iteration. Between iterations (i.e. after diff/update) the hidden states need to be unboxed with e.g. rnn.h = value(rnn.h) to prevent spurious dependencies. This happens automatically during the backward pass for GPU RNNs but needs to be done manually for CPU RNNs. See the CharLM Tutorial for an example.
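
For example (a minimal sketch; sizes are arbitrary, and on a GPU x would be a KnetArray):

rnn = RNN(100, 200; rnnType=:lstm, numLayers=2)
x = randn(Float32, 100, 32, 17)  # (X,B,T): 17 steps of 32-instance minibatches
y = rnn(x)                       # size(y) == (200, 32, 17)
rnn.h = 0                        # request the final hidden state on the next call
y = rnn(x)                       # now size(rnn.h) == (200, 32, 2)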

Keyword arguments for RNN:

  • h=nothing: Initial hidden state.
  • c=nothing: Initial cell state.
  • rnnType=:lstm: Type of RNN: one of :relu, :tanh, :lstm, :gru.
  • numLayers=1: Number of RNN layers.
  • bidirectional=false: Create a bidirectional RNN if true.
  • dropout=0: Dropout probability. Applied to input and between layers.
  • skipInput=false: Do not multiply the input with a matrix if true.
  • dataType=Float32: Data type to use for weights.
  • algo=0: Algorithm to use, see CUDNN docs for details.
  • seed=0: Random number seed for dropout. Uses time() if 0.
  • winit=xavier: Weight initialization method for matrices.
  • binit=zeros: Weight initialization method for bias vectors.
  • usegpu=(gpu()>=0): GPU used by default if one exists.

Formulas: RNNs compute the output h[t] for a given iteration from the recurrent input h[t-1] and the previous layer input x[t] given matrices W, R and biases bW, bR from the following equations:

:relu and :tanh: Single gate RNN with activation function f:

h[t] = f(W * x[t] .+ R * h[t-1] .+ bW .+ bR)

:gru: Gated recurrent unit:

i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
r[t] = sigm(Wr * x[t] .+ Rr * h[t-1] .+ bWr .+ bRr) # reset gate
n[t] = tanh(Wn * x[t] .+ r[t] .* (Rn * h[t-1] .+ bRn) .+ bWn) # new gate
h[t] = (1 - i[t]) .* n[t] .+ i[t] .* h[t-1]

:lstm: Long short term memory unit with no peephole connections:

i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
f[t] = sigm(Wf * x[t] .+ Rf * h[t-1] .+ bWf .+ bRf) # forget gate
o[t] = sigm(Wo * x[t] .+ Ro * h[t-1] .+ bWo .+ bRo) # output gate
n[t] = tanh(Wn * x[t] .+ Rn * h[t-1] .+ bWn .+ bRn) # new gate
c[t] = f[t] .* c[t-1] .+ i[t] .* n[t]               # cell output
h[t] = o[t] .* tanh(c[t])
source
Knet.rnnparam (Function)
rnnparam(r::RNN, layer, id, param)

Return a single weight matrix or bias vector as a slice of RNN weights.

Valid layer values:

  • For unidirectional RNNs 1:numLayers
  • For bidirectional RNNs 1:2*numLayers, where forward and backward layers alternate.

Valid id values:

  • For RELU and TANH RNNs, input = 1, hidden = 2.
  • For GRU reset = 1,4; update = 2,5; newmem = 3,6; 1:3 for input, 4:6 for hidden
  • For LSTM inputgate = 1,5; forget = 2,6; newmem = 3,7; output = 4,8; 1:4 for input, 5:8 for hidden

Valid param values:

  • Return the weight matrix (transposed!) if param==1.
  • Return the bias vector if param==2.

The effect of skipInput: Let I=1 for RELU/TANH, 1:3 for GRU, 1:4 for LSTM

  • For skipInput=false (default), rnnparam(r,1,I,1) is an (inputSize,hiddenSize) matrix.
  • For skipInput=true, rnnparam(r,1,I,1) is nothing.
  • For bidirectional, the same applies to rnnparam(r,2,I,1): the first back layer.
source
Knet.rnnparams (Function)
rnnparams(r::RNN)

Return the RNN parameters as an Array{Any}.

The order of params returned (subject to change):

  • All weight matrices come before all bias vectors.
  • Matrices and biases are sorted lexically based on (layer,id).
  • See @doc rnnparam for valid layer and id values.
  • Input multiplying matrices are nothing if r.inputMode = 1.
source

Batch Normalization

Knet.batchnorm (Function)

batchnorm(x[, moments, params]; kwargs...) applies batch normalization to x with an optional scaling factor and bias stored in params.

2d, 4d and 5d inputs are supported. Mean and variance are computed over dimensions (2,), (1,2,4) and (1,2,3,5) for 2d, 4d and 5d arrays, respectively.

moments stores running mean and variance to be used in testing. It is optional in the training mode, but mandatory in the test mode. Training and test modes are controlled by the training keyword argument.

params stores the optional affine parameters gamma and beta. bnparams function can be used to initialize params.

Example

# Initialization, C is an integer (the number of channels)
moments = bnmoments()
params = bnparams(C)
...
# size(x) -> (H, W, C, N)
y = batchnorm(x, moments, params)
# size(y) -> (H, W, C, N)

Keywords

eps=1e-5: The epsilon parameter added to the variance to avoid division by 0.

training: When training is true, the mean and variance of x are used, and the moments argument is modified if it is provided. When training is false, the mean and variance stored in the moments argument are used. Default value is true when at least one of x and params is an AutoGrad.Value, false otherwise.

source
Knet.bnmoments (Function)

bnmoments(;momentum=0.1, mean=nothing, var=nothing, meaninit=zeros, varinit=ones) can be used to directly load moments from data. meaninit and varinit are called if mean and var are nothing. The type and size of the mean and var are determined automatically from the inputs in the batchnorm calls. A BNMoments object is returned.

BNMoments

A high-level data structure used to store running mean and running variance of batch normalization with the following fields:

momentum::AbstractFloat: A real number between 0 and 1 to be used as the scale of last mean and variance. The existing running mean or variance is multiplied by (1-momentum).

mean: The running mean.

var: The running variance.

meaninit: The function used to initialize the running mean. Should be either nothing or of the form (eltype, dims...)->data. zeros is a good option.

varinit: The function used to initialize the running variance. Should be either nothing or of the form (eltype, dims...)->data. ones is a good option.

source
Knet.bnparams (Function)

bnparams(etype, channels) creates a single 1d array that contains both scale and bias of batchnorm, where the first half is scale and the second half is bias.

bnparams(channels) calls bnparams with etype=Float64, following the Julia convention.

source

Model optimization

Knet.minimize (Function)
minimize(func, data, optimizer=Adam(); params)
sgd     (func, data; lr=0.1,  gclip, params)
momentum(func, data; lr=0.05, gamma=0.95, gclip, params)
nesterov(func, data; lr=0.05, gamma=0.95, gclip, params)
adagrad (func, data; lr=0.05, eps=1e-6, gclip, params)
rmsprop (func, data; lr=0.01, rho=0.9, eps=1e-6, gclip, params)
adadelta(func, data; lr=1.0,  rho=0.9, eps=1e-6, gclip, params)
adam    (func, data; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, gclip, params)

Return an iterator which applies func to arguments in data, i.e. (func(args...) for args in data), and updates the parameters every iteration to minimize func. func should return a scalar value.

The common keyword argument params can be used to list the Params to be optimized. If not specified, any Param that takes part in the computation of func(args...) will be updated.

The common keyword argument gclip can be used to implement per-parameter gradient clipping. For a parameter gradient g, if norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If not specified no gradient clipping is performed.

These functions do not perform optimization, but return an iterator that can. Any function that produces values from an iterator can be used with such an object, e.g. progress!(sgd(f,d)) iterates the sgd optimizer and displays a progress bar. For convenience, appending ! to the name of the function iterates and returns nothing, i.e. sgd!(...) is equivalent to (for x in sgd(...) end).
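
For example (a sketch; model and dtrn are hypothetical user code):

loss(x, y) = nll(model(x), y)            # scalar loss for one minibatch
sgd!(loss, repeat(dtrn, 10); lr=0.1)     # run 10 epochs of SGD
progress!(adam(loss, repeat(dtrn, 10)))  # same with Adam and a progress bar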

We define optimizers as lazy iterators to have explicit control over them:

  • To report progress use progress(sgd(f,d)).
  • To run until convergence use converge(sgd(f,cycle(d))).
  • To run multiple epochs use sgd(f,repeat(d,n)).
  • To run a given number of iterations use sgd(f,take(cycle(d),n)).
  • To do a task every n iterations use (task() for (i,j) in enumerate(sgd(f,d)) if i%n == 1).

These functions apply the same algorithm with the same configuration to every parameter by default. minimize takes an explicit optimizer argument; all others call minimize with an appropriate optimizer argument (see @doc update! for a list of possible optimizers). Before calling update! on a Param, minimize sets its opt field to a copy of this default optimizer if it is not already set. The opt field is used by the update! function to determine the type of update performed on that parameter.

If you need finer-grained control, you can set the optimizer of an individual Param by setting its opt field before calling one of these functions. They will not override the opt field if it is already set; e.g. sgd(model,data) will perform an Adam update for a parameter whose opt field is an Adam object. This also means you can stop and restart training without losing optimization state: the first call will set the opt fields and subsequent calls will not override them.

Given a parameter w and its gradient g here are the updates applied by each optimizer:

# sgd (http://en.wikipedia.org/wiki/Stochastic_gradient_descent)
w .= w - lr * g

# momentum (http://jlmelville.github.io/mize/nesterov.html)
v .= gamma * v - lr * g
w .= w + v

# nesterov (http://jlmelville.github.io/mize/nesterov.html)
w .= w - gamma * v
v .= gamma * v - lr * g
w .= w + (1 + gamma) * v

# adagrad (http://www.jmlr.org/papers/v12/duchi11a.html)
G .= G + g .^ 2
w .= w - lr * g ./ sqrt(G + eps)

# rmsprop (http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
G .= rho * G + (1-rho) * g .^ 2 
w .= w - lr * g ./ sqrt(G + eps)

# adadelta (http://arxiv.org/abs/1212.5701)
G .= rho * G + (1-rho) * g .^ 2
update = sqrt(delta + eps) .* g ./ sqrt(G + eps)
w = w - lr * update
delta = rho * delta + (1-rho) * update .^ 2

# adam (http://arxiv.org/abs/1412.6980)
v = beta1 * v + (1 - beta1) * g
G = beta2 * G + (1 - beta2) * g .^ 2
vhat = v ./ (1 - beta1 ^ t)
Ghat = G ./ (1 - beta2 ^ t)
w = w - (lr / (sqrt(Ghat) + eps)) * vhat
source
Knet.converge (Function)
converge(itr; alpha=0.1)

Return an iterator which acts exactly like itr, but quits when values from itr stop decreasing. itr should produce numeric values.

It can be used to train a model with the data cycled:

progress!(converge(minimize(model,cycle(data))))

alpha controls the exponential average of values to detect convergence. Here is how convergence is decided:

p = x - avgx
avgx = c.alpha * x + (1-c.alpha) * avgx
avgp = c.alpha * p + (1-c.alpha) * avgp
avgp > 0.0 && return nothing

converge!(...) is equivalent to (for x in converge(...) end), i.e. iterates over the object created by converge(...) and returns nothing.

source
Knet.minibatch (Function)
minibatch(x, [y], batchsize; shuffle, partial, xtype, ytype, xsize, ysize)

Return an iterator of minibatches [(xi,yi)...] given data tensors x, y and batchsize.

The last dimension of x and y give the number of instances and should be equal. y is optional, if omitted a sequence of xi will be generated rather than (xi,yi) tuples. Use repeat(d,n) for multiple epochs, Iterators.take(d,n) for a partial epoch, and Iterators.cycle(d) to cycle through the data forever (this can be used with converge). If you need the iterator to continue from its last position when stopped early (e.g. by a break in a for loop), use Iterators.Stateful(d) (by default the iterator would restart from the beginning).
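
For example (a sketch; assume hypothetical MNIST-style tensors xtrn of size (28,28,1,60000) and ytrn of length 60000):

dtrn = minibatch(xtrn, ytrn, 100; shuffle=true)  # 600 minibatches of 100 instances
for (x, y) in dtrn
    # size(x) == (28, 28, 1, 100), length(y) == 100
end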

Keyword arguments:

  • shuffle=false: Shuffle the instances every epoch.
  • partial=false: If true include the last partial minibatch < batchsize.
  • xtype=typeof(x): Convert xi in minibatches to this type.
  • ytype=typeof(y): Convert yi in minibatches to this type.
  • xsize=size(x): Convert xi in minibatches to this shape.
  • ysize=size(y): Convert yi in minibatches to this shape.
source
Knet.progress (Function)
progress(itr; width, alpha, interval)

Return an iterator which acts exactly like itr, but prints a progressbar as new values are requested:

2.70e-01  21.83%┣███▉              ┫ 13101/60000 [00:12/00:53, 1137.05i/s]

Here 2.70e-01 is the exponential average of values generated by itr (only displayed for iterators with numeric values). 21.83% is the percentage, 13101 is the number of iterations completed, 60000 is the total number of iterations. 00:12 is elapsed seconds, 00:53 is the estimated total seconds, 1137.05i/s is the average number of iterations completed per second. If the speed is less than 1, the average number of seconds per iteration (s/i) is reported instead. The percent, total iterations, and completion time are omitted for iterators whose size is unknown.

progress!(...) is equivalent to (for x in progress(...) end), i.e. iterates over the object created by progress(...) and returns nothing.

An integer itr is treated as 1:itr, i.e. progress(n::Integer) is equivalent to progress(1:n).

Keyword arguments:

  • width=max(64,displaysize()[2]): controls display width. The default width can be controlled using ENV["COLUMNS"].
  • interval=1.0: minimum time interval in seconds between progressbar updates.
  • alpha=1.0: controls the exponential average displayed for numeric iterators: avg = alpha * val + (1-alpha) * avg
source
Knet.training (Function)

training() returns true only inside a @diff context, e.g. during a training iteration of a model.

source

Hyperparameter optimization

Knet.goldensection (Function)
goldensection(f,n;kwargs) => (fmin,xmin)

Find the minimum of f using concurrent golden section search in n dimensions. See Knet.goldensection_demo() for an example.

f is a function from a Vector{Float64} of length n to a Number. It can return NaN for out of range inputs. Goldensection will always start with a zero vector as the initial input to f, and the initial step size will be 1 in each dimension. The user should define f to scale and shift this input range into a vector meaningful for their application. For positive inputs like learning rate or hidden size, you can use a transformation such as x0*exp(x) where x is a value goldensection passes to f and x0 is your initial guess for this value. This will effectively start the search at x0, then move with multiplicative steps.
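
For example, to tune a learning rate multiplicatively around an initial guess of 0.1 (a sketch; trainloss is hypothetical user code returning a validation loss):

f(v) = trainloss(lr = 0.1 * exp(v[1]))
fmin, xmin = goldensection(f, 1)
bestlr = 0.1 * exp(xmin[1])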

I designed this algorithm combining ideas from Golden Section Search and Hill Climbing Search. It essentially runs golden section search concurrently in each dimension, picking the next step based on estimated gain.

Keyword arguments

  • dxmin=0.1: smallest step size.
  • accel=φ: acceleration rate. Golden ratio φ=1.618... is best.
  • verbose=false: use true to print individual steps.
  • history=[]: cache of [(x,f(x)),...] function evaluations.
source
Knet.hyperband (Function)
hyperband(getconfig, getloss, maxresource=27, reduction=3)

Hyperparameter optimization using the hyperband algorithm from (Lisha et al. 2016). You can try a simple MNIST example using Knet.hyperband_demo().

Arguments

  • getconfig() returns random configurations with a user defined type and distribution.
  • getloss(c,n) returns loss for configuration c and number of resources (e.g. epochs) n.
  • maxresource is the maximum number of resources any one configuration should be given.
  • reduction is an algorithm parameter (see paper), 3 is a good value.
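
A minimal sketch (getconfig and getloss stubs are hypothetical; a configuration here is a named tuple):

getconfig() = (lr = 10.0^(-3*rand()), hidden = rand(64:512))
getloss(c, n) = trainforepochs(c, n)  # hypothetical: train for n epochs, return dev loss
hyperband(getconfig, getloss)         # searches configurations within the resource budget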
source

Utilities

Knet.bmm
AutoGrad.cat1d
Knet.cpucopy
Knet.dir
Knet.dropout
Knet.gc
Knet.gpu
Knet.gpucopy
Knet.invx
Knet.mat
Knet.seed!

AutoGrad (advanced)

AutoGrad.@gcheck (Macro)
gcheck(f, x...; kw, o...)
@gcheck f(x...; kw...) (opt1=val1,opt2=val2,...)

Numerically check the gradient of f(x...; kw...) and return a boolean result.

Example call: gcheck(nll,model,x,y) or @gcheck nll(model,x,y). The parameters should be marked as Param arrays in f, x, and/or kw. Only 10 random entries in each large numeric array are checked by default. If the output of f is not a number, we check the gradient of sum(f(x...; kw...)). Keyword arguments:

  • kw=(): keyword arguments to be passed to f, i.e. f(x...; kw...)
  • nsample=10: number of random entries from each param to check
  • atol=0.01,rtol=0.05: tolerance parameters. See isapprox for their meaning.
  • delta=0.0001: step size for numerical gradient calculation.
  • verbose=1: 0 prints nothing, 1 shows failing tests, 2 shows all tests.
AutoGrad.@primitive (Macro)
@primitive fx g1 g2...

Define a new primitive operation for AutoGrad and (optionally) specify its gradients. Non-differentiable functions such as sign, and non-numeric functions such as size should be defined using the @zerograd macro instead.

Examples

@primitive sin(x::Number)
@primitive hypot(x1,x2),dy,y

@primitive sin(x::Number),dy  (dy.*cos(x))
@primitive hypot(x1,x2),dy,y  (dy.*x1./y)  (dy.*x2./y)

The first example shows that fx is a typed method declaration. Julia supports multiple dispatch, i.e. a single function can have multiple methods with different arg types. AutoGrad takes advantage of this and supports multiple dispatch for primitives and gradients.

The second example specifies variable names for the output gradient dy and the output y after the method declaration which can be used in gradient expressions. Untyped, ellipsis and keyword arguments are ok as in f(a::Int,b,c...;d=1). Parametric methods such as f(x::T) where {T<:Number} cannot be used.

The method declaration can optionally be followed by gradient expressions. The third and fourth examples show how gradients can be specified. Note that the parameters, the return variable and the output gradient of the original function can be used in the gradient expressions.

Under the hood

The @primitive macro turns the first example into:

sin(x::Value{T}) where {T<:Number} = forw(sin, x)

This will cause calls to sin with a boxed argument (Value{T<:Number}) to be recorded. The recorded operations are used by AutoGrad to construct a dynamic computational graph. With multiple arguments things are a bit more complicated. Here is what happens with the second example:

hypot(x1::Value{S}, x2::Value{T}) where {S,T} = forw(hypot, x1, x2)
hypot(x1::S, x2::Value{T})        where {S,T} = forw(hypot, x1, x2)
hypot(x1::Value{S}, x2::T)        where {S,T} = forw(hypot, x1, x2)

We want the forw method to be called if any one of the arguments is a boxed Value. There is no easy way to specify this in Julia, so the macro generates all 2^N-1 boxed/unboxed argument combinations.

In AutoGrad, gradients are defined using gradient methods that have the following pattern:

back(f,Arg{i},dy,y,x...) => dx[i]

For the third example here is the generated gradient method:

back(::typeof(sin), ::Type{Arg{1}}, dy, y, x::Value{T}) where {T<:Number} = dy .* cos(x)

For the last example a different gradient method is generated for each argument:

back(::typeof(hypot), ::Type{Arg{1}}, dy, y, x1::Value{S}, x2::Value{T}) where {S,T} = (dy .* x1) ./ y
back(::typeof(hypot), ::Type{Arg{2}}, dy, y, x1::Value{S}, x2::Value{T}) where {S,T} = (dy .* x2) ./ y

In fact @primitive generates four more definitions for the other boxed/unboxed argument combinations.

Broadcasting

Broadcasting is handled by extra forw and back methods. @primitive defines the following so that broadcasting of a primitive function with a boxed value triggers forw and back.

broadcasted(::typeof(sin), x::Value{T}) where {T<:Number} = forw(broadcasted,sin,x)
back(::typeof(broadcasted), ::Type{Arg{2}}, dy, y, ::typeof(sin), x::Value{T}) where {T<:Number} = dy .* cos(x)

If you do not want the broadcasting methods, you can use the @primitive1 macro. If you only want the broadcasting methods, use @primitive2. As a motivating example, here is how * is defined for non-scalars:

@primitive1 *(x1,x2),dy  (dy*x2')  (x1'*dy)
@primitive2 *(x1,x2),dy  unbroadcast(x1,dy.*x2)  unbroadcast(x2,x1.*dy)

Regular * is matrix multiplication, broadcasted * is elementwise multiplication and the two have different gradients as defined above. unbroadcast(a,b) reduces b to the same shape as a by performing the necessary summations.

AutoGrad.@zerograd (Macro)
@zerograd f(args...; kwargs...)

Define f as an AutoGrad primitive operation with zero gradient.

Example:

@zerograd  floor(x::Float32)

@zerograd allows f to handle boxed Value inputs by unboxing them like a @primitive, but unlike @primitive it does not record its actions or return a boxed Value result. Some functions, like sign(), have zero gradient. Others, like length(), have discrete or constant outputs. These need to handle Value inputs but do not need to record anything and can return regular values. Their output can be treated like a constant in the program. Use the @zerograd macro for those. Use the @zerograd1 variant if you don't want to define the broadcasting version, and @zerograd2 if you only want to define the broadcasting version. Note that kwargs are NOT unboxed.

Per-parameter optimization (advanced)

The model optimization methods apply the same algorithm with the same configuration to every parameter. If you need finer-grained control, you can set the optimization algorithm and configuration of an individual Param by setting its opt field to one of the optimization objects like Adam listed below. The opt field is used as an argument to update! and controls the type of update performed on that parameter. Model optimization methods like sgd will not override the opt field if it is already set; e.g. sgd(model,data) will perform an Adam update for a parameter whose opt field is an Adam object. This also means you can stop and restart training without losing optimization state: the first call will set the opt fields and subsequent calls will not override them.

Knet.update! (Function)
update!(weights::Param, gradients)
update!(weights, gradients; lr=0.1, gclip=0)
update!(weights, gradients, optimizers)

Update the weights using their gradients and the optimization algorithms specified using (1) the opt field of a Param, (2) keyword arguments, (3) the third argument.

weights can be an individual Param, numeric array, or a collection of arrays/Params represented by an iterator or dictionary. gradients should be a matching individual array or collection. In the first form, the optimizer should be specified in weights.opt. In the second form the optimizer defaults to SGD with learning rate lr and gradient clip gclip. In the third form optimizers should be a matching individual optimizer or collection of optimizers. The weights and possibly gradients and optimizers are modified in-place.

Individual optimization parameters can be one of the following types. The keyword arguments for each constructor and their default values are listed as well.

  • SGD(;lr=0.1, gclip=0)
  • Momentum(;lr=0.05, gamma=0.95, gclip=0)
  • Nesterov(;lr=0.05, gamma=0.95, gclip=0)
  • Adagrad(;lr=0.05, eps=1e-6, gclip=0)
  • Rmsprop(;lr=0.01, rho=0.9, eps=1e-6, gclip=0)
  • Adadelta(;lr=1.0, rho=0.9, eps=1e-6, gclip=0)
  • Adam(;lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, gclip=0)

Example:

w = Param(rand(d), Adam())  # a Param with a specified optimizer
g = lossgradient0(w)        # gradient g has the same shape as w
update!(w, g)               # update w in-place with Adam()

w = rand(d)                 # an individual weight array
g = lossgradient1(w)        # gradient g has the same shape as w
update!(w, g)               # update w in-place with SGD()
update!(w, g; lr=0.1)       # update w in-place with SGD(lr=0.1)
update!(w, g, SGD(lr=0.1))  # update w in-place with SGD(lr=0.1)

w = (rand(d1), rand(d2))    # a tuple of weight arrays
g = lossgradient2(w)        # g will also be a tuple
p = (Adam(), SGD())         # p has optimizers for each w[i]
update!(w, g, p)            # update each w[i] in-place with g[i],p[i]

w = Any[rand(d1), rand(d2)] # any iterator can be used
g = lossgradient3(w)        # g will be similar to w
p = Any[Adam(), SGD()]      # p should be an iterator of same length
update!(w, g, p)            # update each w[i] in-place with g[i],p[i]

w = Dict(:a => rand(d1), :b => rand(d2)) # dictionaries can be used
g = lossgradient4(w)
p = Dict(:a => Adam(), :b => SGD())
update!(w, g, p)
source
Knet.SGD (Type)
SGD(;lr=0.1,gclip=0)
update!(w,g,p::SGD)
update!(w,g;lr=0.1)

Container for parameters of the Stochastic gradient descent (SGD) optimization algorithm used by update!.

SGD is an optimization technique to minimize an objective function by updating its weights in the opposite direction of their gradient. The learning rate (lr) determines the size of the step. SGD updates the weights with the following formula:

w = w - lr * g

where w is a weight array, g is the gradient of the loss function w.r.t w and lr is the learning rate.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.

SGD is used by default if no algorithm is specified in the two-argument version of update!.

source
Knet.Momentum (Type)
Momentum(;lr=0.05, gclip=0, gamma=0.95)
update!(w,g,p::Momentum)

Container for parameters of the Momentum optimization algorithm used by update!.

The Momentum method tries to accelerate SGD by adding a velocity term to the update. This also decreases the oscillation between successive steps. It updates the weights with the following formulas:

velocity = gamma * velocity + lr * g
w = w - velocity

where w is a weight array, g is the gradient of the objective function w.r.t w, lr is the learning rate, gamma is the momentum parameter, and velocity is an array with the same size and type as w that holds the accelerated gradients.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.

Reference: Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks : The Official Journal of the International Neural Network Society, 12(1), 145–151.

source
Knet.Nesterov (Type)
Nesterov(; lr=0.05, gclip=0, gamma=0.95)
update!(w,g,p::Nesterov)

Container for parameters of Nesterov's momentum optimization algorithm used by update!.

It is similar to standard Momentum but with a slightly different update rule:

velocity = gamma * velocity_old - lr * g
w = w_old - velocity_old + (1+gamma) * velocity

where w is a weight array, g is the gradient of the objective function w.r.t w, lr is the learning rate, gamma is the momentum parameter, and velocity is an array with the same size and type as w that holds the accelerated gradients.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip == 0 no scaling takes place.

Reference implementation: Yoshua Bengio, Nicolas Boulanger-Lewandowski and Razvan Pascanu.

source
Knet.Adagrad (Type)
Adagrad(;lr=0.05, gclip=0, eps=1e-6)
update!(w,g,p::Adagrad)

Container for parameters of the Adagrad optimization algorithm used by update!.

Adagrad is one of the methods that adapts the learning rate to each of the weights. It stores the sum of the squares of the gradients to scale the learning rate. The learning rate is adapted for each weight by the value of current gradient divided by the accumulated gradients. Hence, the learning rate is greater for the parameters where the accumulated gradients are small and the learning rate is small if the accumulated gradients are large. It updates the weights with the following formulas:

G = G + g .^ 2
w = w - g .* lr ./ sqrt(G + eps)

where w is the weight, g is the gradient of the objective function w.r.t w, lr is the learning rate, and G is an array with the same size and type as w that holds the sum of the squares of the gradients. eps is a small constant to prevent a zero value in the denominator.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.

Reference: Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.

source
Knet.Rmsprop (Type)
Rmsprop(;lr=0.01, gclip=0, rho=0.9, eps=1e-6)
update!(w,g,p::Rmsprop)

Container for parameters of the Rmsprop optimization algorithm used by update!.

Rmsprop scales the learning rates by dividing the root mean squared of the gradients. It updates the weights with the following formula:

G = (1-rho) * g .^ 2 + rho * G
w = w - lr * g ./ sqrt(G + eps)

where w is the weight, g is the gradient of the objective function w.r.t w, lr is the learning rate, and G is an array with the same size and type as w that holds an exponentially decaying average of the squares of the gradients. eps is a small constant to prevent a zero value in the denominator and rho is the decay parameter.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.

Reference: Tijmen Tieleman and Geoffrey Hinton (2012). "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural Networks for Machine Learning 4.2.

source
Knet.Adadelta (Type)
Adadelta(;lr=1.0, gclip=0, rho=0.9, eps=1e-6)
update!(w,g,p::Adadelta)

Container for parameters of the Adadelta optimization algorithm used by update!.

Adadelta is an extension of Adagrad that tries to prevent the decrease of the learning rates to zero as training progresses. It scales the learning rate based on the accumulated gradients like Adagrad and holds the acceleration term like Momentum. It updates the weights with the following formulas:

G = (1-rho) * g .^ 2 + rho * G
update = g .* sqrt(delta + eps) ./ sqrt(G + eps)
w = w - lr * update
delta = rho * delta + (1-rho) * update .^ 2

where w is the weight, g is the gradient of the objective function w.r.t w, lr is the learning rate, and G is an array with the same size and type as w that holds the accumulated squares of the gradients. eps is a small constant to prevent a zero value in the denominator. rho is the momentum parameter and delta is an array with the same size and type as w that holds the accumulated squared updates.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.

Reference: Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method.

source
Knet.Adam (Type)
Adam(;lr=0.001, gclip=0, beta1=0.9, beta2=0.999, eps=1e-8)
update!(w,g,p::Adam)

Container for parameters of the Adam optimization algorithm used by update!.

Adam is one of the methods that compute an adaptive learning rate. It stores the accumulated gradients (first moment) and the accumulated squares of the gradients (second moment), and scales both as a function of time. Here are the update formulas:

m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g .* g
mhat = m ./ (1 - beta1 ^ t)
vhat = v ./ (1 - beta2 ^ t)
w = w - (lr / (sqrt(vhat) + eps)) * mhat

where w is the weight, g is the gradient of the objective function w.r.t w, lr is the learning rate, m is an array with the same size and type as w that holds the accumulated gradients (first moment), and v is an array with the same size and type as w that holds the accumulated squares of the gradients (second moment). eps is a small constant to prevent a zero denominator. beta1 and beta2 are the parameters used to calculate bias-corrected first and second moments. t is the update count.

If norm(g) > gclip > 0, g is scaled so that its norm is equal to gclip. If gclip==0 no scaling takes place.

Reference: Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.

source

Function Index