1 Introduction
The goal of dictionary learning is to decompose a data matrix $Y \in \mathbb{R}^{d \times N}$, where usually $N \gg d$, into a dictionary matrix $\Phi \in \mathbb{R}^{d \times K}$, where each column $\phi_k$, also referred to as an atom, is normalised, $\|\phi_k\|_2 = 1$, and a sparse coefficient matrix $X \in \mathbb{R}^{K \times N}$,
(1) $Y \approx \Phi X.$
The compact data representation provided by a dictionary can be used for data restoration, such as denoising or reconstruction from incomplete information, [12, 26, 28], and for data analysis, such as blind source separation, [15, 25, 21, 22].
Due to these applications, dictionary learning is of interest both to the signal processing community, where it is also known as sparse coding, and to the independent component analysis (ICA) and blind source separation (BSS) community, where it is also known as sparse component analysis. This also means that there are not only many algorithms to choose from,
[15, 3, 14, 22, 25, 27, 38, 28, 33, 30], but also that theoretical results have started to accumulate, [17, 39, 4, 1, 34, 35, 19, 6, 5, 37, 40, 41, 9]. As our reference list grows more incomplete every day, we point to the surveys [32, 36] as starting points for digging into algorithms and theory, respectively. One way to concretise the abstract formulation of the dictionary learning problem in (1) is to formulate it as an optimisation programme. For instance, choosing a sparsity level $S$ and a dictionary size $K$, we define $\mathcal{X}_S$ to be the set of all columnwise $S$-sparse coefficient matrices, $\mathcal{D}_K$ to be the set of all dictionaries with $K$ atoms and for some $p$ try to find
(2) $\min_{\Psi \in \mathcal{D}_K, \, X \in \mathcal{X}_S} \|Y - \Psi X\|_p.$
Unfortunately this problem is highly nonconvex and as such difficult to solve even in the simplest and most commonly used case $p = 2$. However, randomly initialised alternating projection algorithms, which alternate between (trying to) find the best dictionary $\Psi$, based on the coefficients $X$, and the best coefficients $X$, based on a dictionary $\Psi$, such as KSVD (K Singular Value Decompositions), [3], and ITKrM (Iterative Thresholding and K residual Means), [37], tend to be very successful on synthetic data, usually recovering 90 to 100% of all atoms, and to provide useful dictionaries on image data. Apart from needing both the sparsity level and the dictionary size as input, the main drawback of these algorithms is that, assuming the data is synthesised from a generating dictionary and randomly drawn sparse coefficients, they have almost no (KSVD) or very weak (ITKrM) theoretical dictionary recovery guarantees. This is in sharp contrast to more involved algorithms, which, given the correct sparsity level and dictionary size, have global recovery guarantees but due to their computational complexity can at best be used in small toy examples, [4, 2, 6].
One interesting exception is an algorithm developed by Sun, Qu and Wright, [40, 41]. The algorithm is based on gradient descent combined with a Newton trust region method to escape saddle points, and it is proven to recover the generating dictionary if it is a basis. This result, together with several results in machine learning which prove that nonconvex problems can be well behaved, meaning all local minima are global minima, gives rise to the hope that a similar result can be proven for learning overcomplete dictionaries.
Contribution: In this paper we will first destroy all hope of proving global convergence of alternating projection algorithms by sketching the existence of stable fixed points that are not equivalent to the generating dictionary. Conversely, we will prove that ITKrM is a contraction towards the generating dictionary under much more relaxed conditions than those in [37].
However, based on a characterisation of the fixed points, showing them to be very coherent, and an analysis of the residuals at these fixed points, we consider a replacement procedure for coherent atoms and develop a strategy for finding good replacement candidates. With the help of these replacement candidates we will then tackle one of the most challenging problems in dictionary learning: the automatic choice of the sparsity level $S$ and the dictionary size $K$. This will lead to a version of ITKrM that adapts both the sparsity level and the dictionary size in each iteration. To the best of our knowledge, this is the first dictionary learning algorithm that can recover a generating dictionary without prescribing its size in the presence of noise and outliers.
Organisation: In the next section we will summarise our notational conventions and introduce the sparse signal model on which all our theoretical results are based. In Section 3 we will familiarise the reader with the ITKrM algorithm and analyse situations where the algorithm does or does not converge to a generating dictionary. In particular, we first conduct a small experiment to identify dictionaries that are fixed points of ITKrM but not equivalent to the generating dictionary through reordering and sign flips. We also provide a proof sketch showing that these dictionaries are stable fixed points. Inspired by the properties of the spurious fixed points, we then prove that ITKrM is a contraction towards the generating dictionary on an area much bigger than indicated by the convergence radius proved in [37].
The insights from Section 3 will be used in Section 4 to develop a nonrandom replacement strategy. Using the information at a spurious fixed point dictionary, we will construct replacement candidates that allow ITKrM to escape towards the generating dictionary, and test the resulting algorithm on synthetic data.
In Section 5 we will then address the big problem of how to learn dictionaries without being given the generating sparsity level and dictionary size. This will be done by slowly increasing the sparsity level and by decoupling the replacement strategy into separate pruning of the dictionary and adding of promising replacement candidates. Numerical experiments will show that the resulting algorithm can indeed recover the generating dictionary from initialisations with various sizes on synthetic data and learn meaningful dictionaries on image data.
In the last section we will sketch how the concepts leading to adaptive ITKrM can be extended to other algorithms such as (approximative) KSVD. Finally, based on a discussion of our results, we will map out future directions of research.
2 Notations and Sparse Signal Model
Before we hit the strings, we will fine-tune our notation and introduce some definitions; usually subscripted letters will denote vectors, with the exception of indexed scalars such as coefficients, where they are numbers; however, it should always be clear from the context what we are dealing with. For a matrix $A$ we denote its (conjugate) transpose by $A^\star$ and its Moore-Penrose pseudo-inverse by $A^\dagger$. We denote its operator norm by $\|A\|_{2,2}$ and its Frobenius norm by $\|A\|_F$; remember that we have $\|A\|_{2,2} \le \|A\|_F$.
We consider a dictionary $\Phi$ a collection of $K$ unit-norm vectors $\phi_k \in \mathbb{R}^d$, $\|\phi_k\|_2 = 1$. By abuse of notation we will also refer to the matrix collecting the atoms as its columns as the dictionary, that is, $\Phi = (\phi_1, \ldots, \phi_K)$. The maximal absolute inner product between two different atoms is called the coherence of a dictionary, $\mu(\Phi) = \max_{j \neq k} |\langle \phi_j, \phi_k \rangle|$.
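As a concrete illustration, the coherence can be computed directly from the Gram matrix of the atoms; a minimal numpy sketch (the function name and setup are ours, not from the paper's toolbox):

```python
import numpy as np

def coherence(Phi):
    """Maximal absolute inner product between two different atoms.

    Phi: d x K array whose columns are unit-norm atoms.
    """
    G = Phi.T @ Phi            # Gram matrix, unit diagonal for unit-norm atoms
    np.fill_diagonal(G, 0.0)   # ignore the diagonal <phi_k, phi_k> = 1
    return np.abs(G).max()

# an orthonormal basis has coherence 0
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
print(coherence(Q))  # numerically zero
```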
By $\Phi_I$ we denote the restriction of the dictionary to the atoms indexed by $I$, that is, $\Phi_I = (\phi_{i_1}, \ldots, \phi_{i_S})$ for $I = \{i_1, \ldots, i_S\}$, and by $P(\Phi_I)$ the orthogonal projection onto the span of the atoms indexed by $I$, that is, $P(\Phi_I) = \Phi_I \Phi_I^\dagger$. Note that in case the atoms indexed by $I$ are linearly independent we have $\Phi_I^\dagger = (\Phi_I^\star \Phi_I)^{-1} \Phi_I^\star$. We also define $Q(\Phi_I)$ to be the orthogonal projection onto the orthogonal complement of the span of $\Phi_I$, that is, $Q(\Phi_I) = \mathbb{I}_d - P(\Phi_I)$, where $\mathbb{I}_d$ is the identity operator (matrix) in $\mathbb{R}^d$.
(Ab)using the language of compressed sensing we define $\delta_I(\Phi)$ as the smallest number such that all eigenvalues of $\Phi_I^\star \Phi_I$ are included in $[1 - \delta_I(\Phi), 1 + \delta_I(\Phi)]$ and the isometry constant of the dictionary as $\delta_S(\Phi) = \max_{|I| \le S} \delta_I(\Phi)$. When clear from the context we will usually omit the reference to the dictionary. For more details on isometry constants see for instance [8]. For a (sparse) signal $y$ we will refer to the indices of the $S$ coefficients with largest absolute magnitude as the $S$-support of $y$. Again, we will omit the reference to the sparsity level if clear from the context.
To keep the sub(sub)scripts under control we denote the indicator function of a set $V$ by $\chi(V, \cdot)$, that is, $\chi(V, k)$ is one if $k \in V$ and zero else. The set of the first $n$ integers we abbreviate by $[n] = \{1, \ldots, n\}$.
We define the distance of a dictionary $\Psi$ to a dictionary $\Phi$ as
(3) $d(\Phi, \Psi) := \max_k \min_l \|\phi_k \pm \psi_l\|_2 = \max_k \min_l \sqrt{2 - 2 |\langle \phi_k, \psi_l \rangle|}.$
Note that this distance is not a metric, since it is not symmetric. For example, if $\Phi$ is the canonical basis, $\phi_k = e_k$, and $\Psi$ is defined by $\psi_k = e_k$ for $k \ge 2$ and $\psi_1 = e_2$, then we have $d(\Psi, \Phi) = 0$ while $d(\Phi, \Psi) = \sqrt{2}$. The advantage is that this distance is well defined also for dictionaries of different sizes. A symmetric distance between two dictionaries of the same size could be defined as the maximal distance between two corresponding atoms, that is,
(4) $\tilde{d}(\Phi, \Psi) := \min_{p \in \mathcal{P}} \max_k \|\phi_k \pm \psi_{p(k)}\|_2,$
where $\mathcal{P}$ is the set of permutations of $\{1, \ldots, K\}$. The distances are equivalent whenever there exists a permutation such that, after rearrangement, the cross-Gram matrix $\Phi^\star \Psi$ is diagonally dominant, that is, $|\langle \phi_k, \psi_k \rangle| > \max_{j \neq k} |\langle \phi_j, \psi_k \rangle|$ for all $k$. Since the main assumption for our results will be such a diagonal dominance, we will state them in terms of the easier to calculate asymmetric distance and assume that $\Psi$ is already signed and rearranged in a way that $d(\Phi, \Psi) = \max_k \|\phi_k - \psi_k\|_2$. We then use the abbreviations $\varepsilon := d(\Phi, \Psi)$ and $\varepsilon_k := \|\phi_k - \psi_k\|_2$.
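Both distances are straightforward to evaluate numerically; a sketch of the asymmetric distance (3) in numpy (function name is ours):

```python
import numpy as np

def dict_distance(Phi, Psi):
    """Asymmetric distance d(Phi, Psi): for every atom of Phi find the best
    match among the atoms of Psi up to sign, then take the worst case.
    Also works for dictionaries of different sizes."""
    G = np.abs(Phi.T @ Psi)            # |<phi_k, psi_l>| for all pairs
    best = G.max(axis=1)               # best match for each phi_k
    # for unit atoms: min over signs of ||phi_k -/+ psi_l||^2 = 2 - 2|<phi_k, psi_l>|
    return np.sqrt(np.clip(2 - 2 * best, 0.0, None)).max()
```

Note the asymmetry: if Psi duplicates one atom of Phi and misses another, dict_distance(Psi, Phi) is zero while dict_distance(Phi, Psi) equals sqrt(2) for orthonormal atoms.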
The maximal absolute inner product between two noncorresponding atoms will be called the cross-coherence of the two dictionaries, $\mu(\Phi, \Psi) = \max_{j \neq k} |\langle \phi_j, \psi_k \rangle|$.
We will also use the following decomposition of a dictionary $\Psi$ into a given dictionary $\Phi$ and a perturbation dictionary $Z = (z_1, \ldots, z_K)$. If $d(\Phi, \Psi) = \varepsilon$ we set $\varepsilon_k := \|\phi_k - \psi_k\|_2$, where by definition $\varepsilon_k \le \varepsilon$. We can then find unit vectors $z_k$ with $\langle \phi_k, z_k \rangle = 0$ such that
(5) $\psi_k = \alpha_k \phi_k + \omega_k z_k, \qquad \alpha_k := 1 - \varepsilon_k^2/2, \qquad \omega_k := (\varepsilon_k^2 - \varepsilon_k^4/4)^{1/2}.$
Note that if the cross-Gram matrix $\Phi^\star \Psi$ is diagonally dominant, we have $\alpha_k = \langle \phi_k, \psi_k \rangle \ge 0$ and $\omega_k \le \varepsilon_k \le \varepsilon$.
2.1 Sparse signal model
As basis for our results we use the following signal model, already used in [34, 35, 37]. Given a dictionary $\Phi$, we assume that the signals are generated as
(6) $y = \Phi x + r,$
where $x$ is a sparse coefficient sequence and $r$ is some noise. We assume that $r$ is a centered subgaussian vector with parameter $\rho$, that is, $\mathbb{E}(r) = 0$ and for all vectors $v$ the marginals $\langle v, r \rangle$ are subgaussian with parameter $\rho$, meaning they satisfy $\mathbb{E}(e^{t \langle v, r \rangle}) \le e^{t^2 \rho^2 \|v\|_2^2 / 2}$ for all $t$.
To model the coefficient sequences we first assume that there is a measure $\nu_c$ on a subset $\mathcal{C}$ of the positive, non-increasing sequences with unit norm, meaning for $c \in \mathcal{C}$ we have $c(1) \ge c(2) \ge \ldots \ge c(K) \ge 0$ and $\|c\|_2 = 1$. A coefficient sequence $x$ is created by drawing a sequence $c$ according to $\nu_c$, and both a permutation $p$ and a sign sequence $\sigma$ uniformly at random, and setting $x = x_{c,p,\sigma}$ where $x_{c,p,\sigma}(k) = \sigma(k) c(p(k))$. The signal model then takes the form
(7) $y = \Phi x_{c,p,\sigma} + r = \sum_k \sigma(k) c(p(k)) \phi_k + r.$
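For synthetic experiments the signal model (7) translates directly into code; a sketch with Gaussian noise as the subgaussian example (function and variable names are ours):

```python
import numpy as np

def draw_signal(Phi, c, rho, rng):
    """Draw one signal y = Phi x + r from the sparse signal model.

    Phi : d x K dictionary with unit-norm columns
    c   : non-negative, non-increasing sequence with ||c||_2 = 1
    rho : noise parameter (here: standard deviation of Gaussian noise)
    """
    d, K = Phi.shape
    p = rng.permutation(K)                # random permutation
    sigma = rng.choice([-1.0, 1.0], K)    # random sign sequence
    x = sigma * c[p]                      # x(k) = sigma(k) * c(p(k))
    r = rho * rng.standard_normal(d)      # centred Gaussian noise
    return Phi @ x + r
```

For an orthonormal Phi and rho = 0 the signals have unit norm, since the permutation and signs leave the norm of c unchanged.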
Using this model it is quite simple to incorporate sparsity via the measure $\nu_c$. To model approximately $S$-sparse signals we require that the $S$ largest absolute coefficients, meaning those inside the support $I = p^{-1}(\{1, \ldots, S\})$, are well balanced and much larger than the remaining ones outside the support. Further, we need that the expected energy of the coefficients outside the support is relatively small and that the sparse coefficients are well separated from the noise. Concretely, we require that almost surely we have
(8) 
We will refer to the worst case ratio between coefficients inside the support, $c(1)/c(S)$, as the dynamic (sparse) range and to the worst case ratio between coefficients outside the support and those inside the support, $c(S+1)/c(S)$, as the (sparse) gap. Since for a noise-free signal the expected squared sparse approximation error equals the expected energy of the coefficients outside the support, we will call this quantity the relative (sparse) approximation error. Finally, the worst case ratio between the noise parameter and the smallest coefficient inside the support is called the noise-to-(sparse)-coefficient ratio.
Apart from these worst case bounds we will also use three other signal statistics,
(9) 
The first constant helps to characterise the average size of a sparse coefficient, while the second characterises the average sparse approximation quality. The noise constant can be bounded by
(10) 
and for large values approaches the signal-to-noise ratio; see [35] for details.
To get a better feeling for all the involved constants, we will calculate them for the case of perfectly sparse signals, where $c(k) = 1/\sqrt{S}$ for $k \le S$ and $c(k) = 0$ else. We then have a dynamic range of 1 as well as a gap and a relative approximation error of 0. In the case of noiseless signals we additionally have a noise-to-coefficient ratio of 0. In the case of Gaussian noise with variance $\rho^2$ per coordinate, the noise-to-coefficient ratio is determined by the signal-to-noise ratio via $\mathrm{SNR} = 1/(d\rho^2)$.
3 Global behaviour patterns of ITKrM
The iterative thresholding and K residual means algorithm (ITKrM) was introduced in [37] as a modification of its much simpler predecessor ITKsM, which uses signal means instead of residual means. As can be seen from the summary in Algorithm 1, the signals can be processed sequentially, thus making the algorithm suitable for an online version and for parallelisation.
The determining factors for the computational complexity are the matrix-vector products between the current estimate of the dictionary $\Psi$ and the signals, $\Psi^\star y_n$, and the projections $P(\Psi_{I_n}) y_n$. If computed with maximal numerical stability these would have an overall cost of order $O(N(dK + dS^2))$, corresponding to the QR decompositions of $\Psi_{I_n}$. However, since usually the achievable precision in the learning is limited by the number of available training signals rather than the numerical precision, it is computationally more efficient to precompute the Gram matrix $\Psi^\star \Psi$ and to calculate the projections less stably via the eigenvalue decompositions of $\Psi_{I_n}^\star \Psi_{I_n}$, corresponding to an overall cost of order $O(dK^2 + N(dK + S^3))$. Another good property of the ITKrM algorithm is that it is proven to converge locally to a generating dictionary. Concretely this means that if the data is homogeneously sparse in a dictionary $\Phi$ and we initialise with a dictionary $\Psi$ within radius $O(1/\sqrt{S})$ of $\Phi$, then ITKrM using a sufficiently large number of samples in each iteration will converge to the generating dictionary, [37]. In simulations on synthetic data ITKrM shows even better convergence behaviour. Concretely, if the atoms of the generating dictionary are perturbed with vectors chosen uniformly at random from the sphere, ITKrM converges also for perturbation sizes close to the maximally possible one. For completely random initialisations it finds between 90% and 100% of the atoms, depending on the noise and sparsity level. Last but not least, ITKrM is not just a pretty toy with theoretical guarantees, but on image data it produces dictionaries of the same quality as KSVD in a fraction of the time, [29].
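To make the procedure concrete, here is a compact numpy sketch of one ITKrM iteration, following the verbal description above (thresholding, residuals, K residual means, normalisation); it is a simplified illustration under our own naming, not the paper's reference implementation, and it ignores details such as empty or degenerate supports:

```python
import numpy as np

def itkrm_step(Psi, Y, S):
    """One (simplified) ITKrM iteration.

    Psi : d x K current dictionary estimate with unit-norm columns
    Y   : d x N matrix of training signals
    S   : sparsity level used for thresholding
    """
    d, K = Psi.shape
    N = Y.shape[1]
    Psi_new = np.zeros((d, K))
    IP = Psi.T @ Y                                 # all inner products at once
    for n in range(N):
        I = np.argsort(-np.abs(IP[:, n]))[:S]      # thresholding: S largest
        Q, _ = np.linalg.qr(Psi[:, I])             # stable projection via QR
        res = Y[:, n] - Q @ (Q.T @ Y[:, n])        # residual of signal n
        for k in I:
            # residual mean: residual plus the projection onto atom k
            upd = res + Psi[:, k] * IP[k, n]
            Psi_new[:, k] += np.sign(IP[k, n]) * upd
    norms = np.linalg.norm(Psi_new, axis=0)
    norms[norms == 0] = 1.0                        # guard atoms never selected
    return Psi_new / norms
```

A generating dictionary together with signals sparse in it is (approximately) a fixed point of this map, which is the local convergence property discussed above.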
Considering the good practical performance of ITKrM, it is especially frustrating that we can only prove a convergence radius of size $O(1/\sqrt{S})$, while for its simpler cousin ITKsM, which when initialised randomly performs much worse both on synthetic and on image data, we can prove a convergence radius of size $O(1/\sqrt{\log K})$. To get a better understanding of why ITKrM has the better practical performance, we will first have a closer look at what happens when things go wrong and when things go right. Based on the intuition gained there, we will then prove that ITKrM behaves well under much more relaxed conditions.
3.1 When things go wrong
We start with a closer inspection of what happens when ITKrM does not find all atoms using a random initialisation. From [37] we know that this is most likely to happen when the signals are very sparse ($S$ small) and the noise level is low. For better visualisation we only run a small experiment in $\mathbb{R}^{32}$, where we try to recover a very incoherent dictionary from 2-sparse vectors. (All experiments and resulting figures can be reproduced using the matlab toolbox available at https://www.uibk.ac.at/mathematik/personal/schnass/code/adl.zip.) The dictionary, containing 48 atoms, consists of the Dirac basis and the first half of the vectors from the Hadamard basis, and as such has coherence $\mu = 1/\sqrt{32}$. The signals follow the model in (7), where the coefficient sequences are constructed by choosing $c(1)$ uniformly at random in an admissible range and setting $c(2) = (1 - c(1)^2)^{1/2}$ and $c(k) = 0$ for $k > 2$. The noise is chosen to be Gaussian. Running ITKrM with 20000 new signals per iteration for 25 iterations and 10 different random initialisations, we recover 46 atoms in 4 cases and 44 atoms in 6 cases. An immediate observation is that we always miss an even number of atoms. Taking a look at the recovered dictionaries (examples for recovery of 44 and 46 atoms are shown in Figure 1) we see that this is due to their special structure: whenever atoms are missing, we always observe atoms that are recovered twice and atoms that are 1:1 linear combinations of 2 missing atoms, respectively. To get a better understanding of why these configurations are stable, we will sketch that, for noiseless signals that are perfectly sparse in a dictionary $\Phi$ (to simplify the proof sketch, we assume that the nonzero coefficients are equal to 1 in magnitude rather than $1/\sqrt{2}$, so that the signals have average energy 2 instead of 1), one iteration of ITKrM will stay near a dictionary $\Psi$ with $\psi_1 = \psi_2 = \phi_1$, $\psi_3 = (\phi_2 + \phi_3)/\sqrt{2}$ and $\psi_k = \phi_k$ for $k \ge 4$.
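The incoherent test dictionary from this experiment is easy to reproduce; a sketch assuming signal dimension d = 32, so that the Dirac basis plus the first half of the Hadamard basis yields the 48 atoms (our construction, to be checked against the paper's toolbox):

```python
import numpy as np
from scipy.linalg import hadamard

d = 32
H = hadamard(d) / np.sqrt(d)                   # orthonormal Hadamard basis
Phi = np.hstack([np.eye(d), H[:, :d // 2]])    # Dirac basis + half Hadamard
# coherence: every Dirac-Hadamard pair has inner product of size 1/sqrt(d),
# while atoms within each orthonormal basis are mutually orthogonal
G = np.abs(Phi.T @ Phi)
np.fill_diagonal(G, 0.0)
print(Phi.shape, G.max())  # (32, 48) and 1/sqrt(32) ~ 0.177
```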
Proof sketch: We start by analysing which support thresholding will recover, depending on whether the generating support contains the (indices of the) double or the missing atoms, and what the residual, $y - P(\Psi_I) y$, will look like. Without loss of generality we assume that and therefore . To better deal with the recovered support sets we use the following notation for an index set $I$ in which the index $j$ has been replaced by an index $k$: $I_{j \to k} := (I \setminus \{j\}) \cup \{k\}$. Note that we have , for and else, as well as for and .

[noitemsep]

In the most common case we have since if and if . In consequence we have .

We have for all and for all . Therefore, the recovered support depends on the order of the inner products
. Let’s assume that each inner product is equally likely to be the smallest. Then with probability
we take both copies of but miss the generating atom , meaning for , leading to . Further with probability we take one copy of and all the other generating atoms, or , leading to . 
As in the first case we have for all . Further, we have for all and , leading to . For the residual we have . The case , is analogous, meaning we get and .

We have for all , as well as for all . For we have , meaning we will recover and the residual will have the shape . In the analogous case , we will recover with residual .

We have for all as well as for and for . For we have to distinguish whether the signal coefficients have the same sign or not. If they have the same sign, we have , thus and . To have the correct size the recovered support will also contain an index , or . However, since the residual is zero, this will hardly affect the update of the respective atom . If conversely the coefficients have different signs, then the contribution of to the signal is orthogonal to and we have . It is thus highly improbable that the thresholded support will contain . Instead it will contain two atoms , likely those most correlated with the residual .

This quite rare case works analogously to the one before, taking into account that is sure to be selected in the thresholding.
Based on the knowledge of the outcome of thresholding and the shape of the residuals, we can now estimate the updated atoms $\bar{\psi}_k / \|\bar{\psi}_k\|_2$, where
(11) $\bar{\psi}_k = \sum_n \chi(I_n, k) \cdot \operatorname{sign}(\langle \psi_k, y_n \rangle) \cdot \big( y_n - P(\Psi_{I_n}) y_n + P(\psi_k) y_n \big).$
We have to distinguish between three types of atoms: the correct singles $\psi_k$ for $k \ge 4$, where $\psi_k = \phi_k$, the double atoms $\psi_1, \psi_2$, where $\psi_1 = \psi_2 = \phi_1$, and the combined atom $\psi_3$, where $\psi_3 = (\phi_2 + \phi_3)/\sqrt{2}$.

We start with the correct singles, where . To calculate the sum we first have a look at the cases where was correctly recovered, meaning and .
From the analysis above we know that this occurs whenever , so with probability . We then have and since the recovery is independent of the sign patterns the sum over these signals should be close to its expectation, .
Given that the next case which guarantees recovery of is that either but not , which has probability . Then the residual has the form for . Since the product of the signs for is equally likely to be positive or negative, the contribution of will average out and we again get that the sum over these signals is close to its expectation .
The only situation where we might not recover is when , that is the second case described above, occurring with probability . Under the assumption that all atoms in the support are equally likely to have the smallest inner product, this happens with probability . Most often we recover but fail to recover some other , so the residual has the form . In the worst case the signs of and the missing atom are very correlated, so that they never cancel. Then we get that the sum over these signals is of the form
(12)
Note that if we are less pessimistic about the correlation of the signs, for instance because we have decaying coefficients, the contributions of the missing atoms will often cancel, meaning , and the sum will be closer to .
Finally, we have to check how often will be falsely recovered, meaning but . From the analysis of thresholding we see that this is possible whenever both and (probability ). With probability we have , so that the residual is zero and adding will not affect the sum negatively. However, also with probability , the residual has the shape and at least for the atom , which is most correlated with and therefore picked with the highest probability , we have to add to the sum a term .
Putting it all together we see that
(13)
This means that after normalisation we have and that for the updated atoms will stay (critically) close to the generating atoms.

Next we have a look how the double atom evolves; the case of is exactly the same. We again estimate the sum starting with the cases where and . For we most commonly have that , probability . Because of our assumption that all inner products are equally likely to be the smallest, we recover uniquely with probability and have . With probability we recover both copies of and the residual has the shape , where is the atom with smallest inner product. As before the question is how correlated the signs of the missing atoms are with the signs of . If they are very correlated and never cancel, the sum over these signals has the form
(14) for . We will additionally assume that they are quite uncorrelated, meaning that . For balanced coefficients this is likely to happen whenever is orthogonal (or comparatively incoherent) to all other atoms in . In this case the sign does not influence the order of the inner products for (as much as the signs of the more coherent atoms) and we get the same residual for both choices (most of the time).
The second case that we have to take into account is when or , happening with probability . In this case we always recover both copies of and the residuals have the shape for resp. . Since is equally likely to be positive or negative, the contribution of the residuals will average out and the sum over these signals will be .
Lastly, we have to estimate the effects of wrongly recovering , that is but . As for with this can only happen when both , and so the worst contribution to the sum is the term .
Combining the three estimates and normalising, we get that , so both updated copies of stay near . 
Finally we will verify that $\psi_3$, the atom doing the job of two generating atoms, is stable. The analysis of thresholding tells us that is never falsely recovered, meaning but , and that whenever it is correctly recovered the residual is approximately a linear combination of and . In particular if but we have so that , and vice versa for but . On the other hand if both then either is not recovered or and . Adding the terms scaled with their respective probabilities we get
(15)
showing that the updated atom is again a 1:1 combination of the missing atoms.
After sketching that there exist bad dictionaries from which ITKrM, or for that matter any alternating optimisation algorithm, is not likely to escape in reasonably many iterations, the next interesting question is how we end up near these bad dictionaries in the first place. The intuition from the sketch is the following: if we have two estimated atoms pointing to the same generating atom but nowhere else, meaning they are relatively close to one generating atom but very incoherent with all the other atoms, they will quickly get contracted to this generating atom. At the same time we have two possible scenarios: a) one estimated atom is pointing to two generating atoms and is drawn equally to both of them, or b) once all estimated atoms have been drawn to a generating atom, the missing generating atom will start to attract the estimator of the generating atom to which it is most coherent, and the two generating atoms will start to efficiently share the same estimator.
Observe also that the two estimated atoms pointing to the same generating atom can be very incoherent even if they are both already quite close to it. For instance, if , where is a balanced sum of the other atoms , we have , meaning approximate orthogonality at . Using these ideas we can construct well-conditioned and incoherent initial dictionaries, with arbitrary distances to the generating dictionary, for which things will go maximally wrong, meaning we end up with a lot of double and 1:1 atoms.
Figure shows an example of a bad initialisation with coherence , leading to 16 missing atoms. The accompanying matlab toolbox provides more examples of these bad initialisations to observe convergence, to play around with, and to inspire more evil constructions.
Here we will follow a less destructive route and ask ourselves, whether excluding the bad situation of two estimated atoms pointing to the same generating atom is sufficient to guarantee good behaviour of ITKrM.
3.2 When things go right
Going back to theory we can prove a refined theorem. To keep the flow of the paper we will state it in an informal version and refer the reader to the appendix for the exact statement and its proof.
Assume that the sparsity level of the training signals scales as and that the number of training signals scales as . Further, assume that the coherence and operator norm of the current dictionary estimate satisfy,
(16) 
If the distance of $\Psi$ to the generating dictionary $\Phi$ satisfies either

or

but the cross-Gram matrix is diagonally dominant in the sense that
(17)
then one iteration of ITKrM will reduce the distance by at least a factor $\kappa < 1$, meaning $d(\bar{\Psi}, \Phi) \le \kappa \cdot d(\Psi, \Phi)$.
The first part of the theorem simply says that, excluding dictionaries that are coherent or have a large operator norm, ITKrM is a contraction on a ball of radius $O(1/\sqrt{\log K})$ around the generating dictionary $\Phi$. To better understand the second part of the theorem, we have a closer look at the conditions on the cross-Gram matrix in (17). The fact that the diagonal entries have to be sufficiently large puts a constraint on the admissible distance via the relation $|\langle \phi_k, \psi_k \rangle| = 1 - \varepsilon_k^2/2$. For an incoherent dictionary with small coherence and operator norm and a moderate sparsity level this means that
(18) 
Considering that the maximal distance between two dictionaries is $\sqrt{2}$, this is a large improvement over the admissible distance in a). However, the additional price to pay is that also the intrinsic condition on the cross-Gram matrix needs to be satisfied,
(19) 
This condition captures our intuition that no two estimated atoms should point to the same generating atom and provides a bound for sufficient separation.
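In synthetic experiments, where the generating dictionary is available, this separation can be checked directly; a sketch with a generic margin standing in for the constants in (17) (function name and threshold handling are ours):

```python
import numpy as np

def diagonally_dominant(Phi, Psi, margin=0.0):
    """Check whether the cross-Gram matrix Phi.T @ Psi is diagonally dominant,
    i.e. every estimated atom psi_k is clearly assigned to its own generating
    atom phi_k. Assumes Psi is already signed and rearranged to match Phi."""
    G = np.abs(Phi.T @ Psi)
    diag = np.diag(G).copy()            # |<phi_k, psi_k>|
    np.fill_diagonal(G, -np.inf)        # exclude the diagonal entries
    off = G.max(axis=0)                 # largest off-diagonal per column
    return bool(np.all(diag > off + margin))
```

For example, an estimate that duplicates one generating atom and misses another fails the check, matching the intuition that no two estimated atoms should point to the same generating atom.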
One thing that has to be noted about the result above is that it does not guarantee convergence of ITKrM, since it is only valid for one iteration. To prove convergence of ITKrM, we would additionally need to show that the output dictionary $\bar{\Psi}$ inherits from $\Psi$ the properties that are required for being a contraction, which is part of our future goals. Still, the result goes a long way towards explaining the good convergence behaviour of ITKrM.
For example, it allows us to briefly sketch why the algorithm always converges in experiments where the initial dictionary is a large but random perturbation of a well-behaved generating dictionary with small coherence and operator norm. If $\psi_k = \alpha_k \phi_k + \omega_k z_k$, where the perturbation vectors $z_k$ are drawn uniformly at random from the unit sphere orthogonal to $\phi_k$, then with high probability for all $k$ we have
(20) 
and consequently for all possible
(21) 
Also with high probability the operator norm of the perturbation matrix $Z = (z_1, \ldots, z_K)$ is bounded by $O(\sqrt{K/d})$, [42], so that for we get , again independent of . Comparing these estimates with the requirements of the theorem we see that for moderate sparsity levels we get a contraction whenever
(22) 
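Such perturbed initialisations are easy to generate; a sketch following the decomposition in (5), using the same perturbation size eps for every atom (our helper, not from the paper's toolbox):

```python
import numpy as np

def perturb_dictionary(Phi, eps, rng):
    """Perturb every atom phi_k in a random direction orthogonal to it,
    so that ||phi_k - psi_k||_2 = eps and ||psi_k||_2 = 1 for every k."""
    d, K = Phi.shape
    alpha = 1 - eps**2 / 2                  # scaling of the original atom
    omega = np.sqrt(eps**2 - eps**4 / 4)    # scaling of the perturbation
    Psi = np.zeros_like(Phi)
    for k in range(K):
        z = rng.standard_normal(d)
        z -= Phi[:, k] * (Phi[:, k] @ z)    # project out phi_k
        z /= np.linalg.norm(z)              # unit vector orthogonal to phi_k
        Psi[:, k] = alpha * Phi[:, k] + omega * z
    return Psi
```

Since alpha^2 + omega^2 = 1, the perturbed atoms remain unit norm, and the distance of every perturbed atom to its generating atom is exactly eps.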
Summarising the two last subsections, we see that ITKrM has convergence problems if the current dictionary estimate is too coherent, has a large operator norm, or if two estimated atoms are close to one generating atom. Both the coherence and the operator norm of the estimate could be calculated after each iteration to check whether ITKrM is going in a good direction. Unfortunately, the diagonal dominance of the cross-Gram matrix, which prevents two estimated atoms from being close to the same generating atom, cannot be verified, since the generating dictionary is unknown. However, the most likely outcome of this situation is that both these estimated atoms converge to the same generating atom, meaning that eventually the estimated dictionary will be coherent. This suggests that in order to improve the global convergence behaviour of ITKrM, we should control the coherence of the estimated dictionaries. One strategy to incorporate incoherence into ITKrM would be to add a penalty for coherent dictionaries. The main disadvantages of this strategy, apart from the fact that ITKrM is not straightforwardly associated to an optimisation programme, are the computational cost and the fact that penalties tend to complicate the high-dimensional landscape of basins of attraction, which slows down convergence. Therefore, we will use a different strategy, which allows us to keep the high percentage of correctly recovered atoms and even use the information they provide for identifying the missing ones: replacement.
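The two quantities that can be verified, the coherence and the operator norm of the current estimate, are cheap to monitor after each iteration; a minimal sketch (function name is ours):

```python
import numpy as np

def estimate_health(Psi):
    """Return coherence and operator norm of the current dictionary estimate,
    the two per-iteration sanity checks discussed above."""
    G = np.abs(Psi.T @ Psi)
    np.fill_diagonal(G, 0.0)
    return G.max(), np.linalg.norm(Psi, 2)   # (coherence, operator norm)

mu, op = estimate_health(np.eye(8))   # an orthonormal basis: mu = 0, op = 1
```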
4 Replacement
Replacement of coherent atoms with new, randomly drawn atoms is a simple clean-up step that most dictionary learning algorithms based on alternating minimisation, e.g. KSVD [3], additionally employ in each iteration. While randomly drawing a replacement is cost-efficient and democratic, the drawback is that the new atom converges only very slowly to the missing generating atom. Thinking back to our example of a dictionary with one double atom and one 1:1 atom,