Linear Coding, Applications and Supremus Typicality

SHENG HUANG

Doctoral Thesis in Electrical Engineering
Stockholm, Sweden 2015

TRITA-EE 2015:008
ISSN 1653-5146
ISBN 978-91-7595-462-2

KTH, School of Electrical Engineering
Communication Theory Department
SE-100 44 Stockholm
SWEDEN

Academic dissertation which, with the permission of KTH Royal Institute of Technology (Kungl Tekniska högskolan), is submitted for public examination for the degree of Doctor of Technology (teknologie doktorsexamen) in Electrical and Systems Engineering on Friday, 20 March 2015, at 13:15 in lecture hall Q2, Osquldas väg 10, Stockholm.

© 2014 Sheng Huang, unless otherwise noted.

Printed by: Universitetsservice US AB

Sammanfattning

This work begins by presenting a coding theorem on linear coding over finite rings for encoding correlated discrete memoryless sources. This theorem includes, as special cases, the corresponding achievability theorems of Elias and Csiszár on linear coding over finite fields. In addition, it is shown that, for every set of finite correlated discrete memoryless sources, there always exists a sequence of linear encoders over certain finite non-field rings that achieves the data compression limit determined by the Slepian–Wolf region. We thereby close the problem of linear coding over finite non-field rings for i.i.d. data compression with a positive confirmation regarding existence.

We also study the encoding of functions, where the decoder is interested in recovering a discrete mapping of data generated by several correlated i.i.d. sources and encoded individually. We propose linear coding over finite rings as an alternative solution to this problem. We show that linear coding over finite rings performs better than its finite-field counterpart, as well as Slepian–Wolf coding, in terms of achieving better coding rates for encoding several discrete functions.

To generalise the above achievability theorems, for both data compression and the function encoding problem, to Markov sources (homogeneous irreducible Markov sources), we introduce a new concept for classifying typical sequences, termed Supremus typical sequences. The asymptotic equipartition property and a generalised version of the typicality lemma for Supremus typical sequences are proved. Compared with traditional (strong and weak) typicality, Supremus typicality allows us to derive more accessible tools and results, which let us prove that linear coding over rings is superior to other methods. In contrast, arguments based on the traditional version either fail to reach similar results, or the derived results are hard to analyse because evaluating the entropy rate is challenging.

To further investigate the fundamental difference between traditional typicality and Supremus typicality, and to make our results more generally applicable, we also consider asymptotically mean stationary ergodic sources. Our results show that an induced transformation with respect to a set of finite measure of a recurrent asymptotically mean stationary dynamical system with a sigma-finite measure is asymptotically mean stationary. Consequently, the Shannon–McMillan–Breiman Theorem, as well as the Shannon–McMillan Theorem, holds for all reduced processes derived from recurrent asymptotically mean stationary random processes. Thus we see that the traditional typicality concept realises the Shannon–McMillan–Breiman Theorem only in a global sense, whereas Supremus typicality makes the result hold simultaneously for all derived reduced sequences as well.

Abstract

This work first presents a coding theorem on linear coding over finite rings for encoding correlated discrete memoryless sources.
This theorem covers corresponding achievability theorems from Elias and Csiszár on linear coding over finite fields as special cases. In addition, it is shown that, for any set of finite correlated discrete memoryless sources, there always exists a sequence of linear encoders over some finite non-field rings which achieves the data compression limit, the Slepian–Wolf region. Hence, the optimality problem regarding linear coding over finite non-field rings for i.i.d. data compression is closed with positive confirmation with respect to existence.

We also address the function encoding problem, where the decoder is interested in recovering a discrete function of the data generated and independently encoded by several correlated i.i.d. sources. We propose linear coding over finite rings as an alternative solution to this problem. It is demonstrated that linear coding over finite rings strictly outperforms its field counterpart, as well as the Slepian–Wolf scheme, in terms of achieving better coding rates for encoding many discrete functions.

In order to generalise the above achievability theorems, on both the data compression and the function encoding problems, to the Markovian settings (homogeneous irreducible Markov sources), a new concept of typicality for sequences, termed Supremus typical sequences, is introduced. The Asymptotic Equipartition Property and a generalised typicality lemma for Supremus typical sequences are proved. Compared to traditional (strong and weak) typicality, Supremus typicality allows us to derive more accessible tools and results, based on which it is once again proved that the linear coding technique over rings is superior to others. In contrast, corresponding arguments based on the traditional versions either fail to draw similar conclusions, or the derived results are often hard to analyse because entropy rates are complicated to evaluate.
To further investigate the fundamental difference between traditional typicality and Supremus typicality and to bring our results to a more universal setting, asymptotically mean stationary ergodic sources, we look into the ergodic properties featured in these two concepts. Our studies prove that an induced transformation with respect to a finite measure set of a recurrent asymptotically mean stationary dynamical system with a sigma-finite measure is asymptotically mean stationary. Consequently, the Shannon–McMillan–Breiman Theorem, as well as the Shannon–McMillan Theorem, holds simultaneously for all reduced processes of any finite-state recurrent asymptotically mean stationary random process. From this, we see that the traditional typicality concept only realises the Shannon–McMillan–Breiman Theorem in the global sequence, while Supremus typicality engraves the simultaneous effects claimed in the previous statement into all reduced sequences as well.

Acknowledgments

I want to express my deepest gratitude to my supervisor, Professor Mikael Skoglund, for accepting me into his research group. This not only led to this thesis but also had a great impact on my future career. Mikael has been extremely kind in allowing me to pursue my research interests. His helpful comments and suggestions have influenced my work in many ways. I will always remember the time working with him.

I wish to thank all my friends and colleagues from the Communication Theory Department for creating such a wonderful working environment. They are always very supportive. The last few years of academic life would certainly have been less enjoyable without them.

I am also very grateful to Farshad Naghibi and Hady Ghaouch for proofreading this thesis.

Sheng Huang
Stockholm, February 2015

Contents

Sammanfattning
Abstract
Acknowledgments
Contents
0 Introduction
  0.1 Motivations
  0.2 Outline and Contributions
  0.3 Copyright Notice
  0.4 Notation
1 Preliminaries: Finite Rings and Polynomial Functions
  1.1 Finite Rings
  1.2 Polynomial Functions
2 Linear Coding
  2.1 Linear Coding over Finite Rings
  2.2 Proof of the Achievability Theorems
  2.3 Optimality
  2.A Appendix
3 Encoding Functions of Correlated Sources
  3.1 A Polynomial Approach
  3.2 Source Coding for Computing
  3.3 Non-field Rings versus Fields I
  3.A Appendix
4 Stochastic Complements and Supremus Typicality
  4.1 Markov Chains and Stochastic Complements
  4.2 Supremus Typical Sequences
  4.A Appendix
5 Irreducible Markov Sources
  5.1 Linear Coding over Finite Rings for Irreducible Markov Sources
  5.2 Source Coding for Computing Markovian Functions
  5.3 Non-field Rings versus Fields II
  5.A Appendix
6 Extended Shannon–McMillan–Breiman Theorem
  6.1 Asymptotically Mean Stationary Dynamical Systems and Random Processes
  6.2 Induced Transformations of A.M.S. Systems
  6.3 Extended Shannon–McMillan–Breiman Theorem
  6.A Appendix
7 Asymptotically Mean Stationary Ergodic Sources
  7.1 Supremus Typicality in the Weak Sense
  7.2 Hyper Supremus Typicality in the Weak Sense
  7.3 Linear Coding over Finite Rings for A.M.S. Sources
8 Conclusion
  8.1 Summary
  8.2 Future Research Directions
Bibliography

Chapter 0 Introduction

0.1 Motivations

This thesis resulted from attempting to prove the conjecture:

Linear encoders over finite rings are optimal for Slepian–Wolf data compression.

This problem is interesting for several reasons.

Reason one: it is “intrinsically interesting.” In 1955, Elias [Eli55] (cf. [Gal68]) introduced a binary linear coding scheme which compresses binary sources up to their Shannon limits, also known as the Slepian–Wolf limits [SW73]. Csiszár [Csi82] then showed that linear encoders over finite fields are optimal for all Slepian–Wolf data compression scenarios. This settles the previous problem for the special case when all rings considered are fields. Unfortunately, the general case of linear coding over finite rings is left open. In fact, neither Elias' nor Csiszár's argument yields an optimal conclusion when applied to the non-field ring scenario. In this work, we will show that linear encoders over some classes of finite rings can be equally optimal for Slepian–Wolf data compression. In addition, it is proved that:

For any Slepian–Wolf data compression scenario, there always exist linear encoders over some finite non-field rings that achieve the optimal data compression limit.

The conjecture is thus settled as far as existence is concerned. As a matter of fact, our general conclusion also includes the corresponding field scenarios from Elias and Csiszár as special cases.
Reason two: linear coding over non-field rings appears superior to others in some source network problems. As a generalisation of the Slepian–Wolf problem, the function encoding problem concerns recovering a function of the source messages, instead of the original messages, from the encoder outputs. It has the following applications.

1. The function encoding problem is actually a special case of the “many-help-one” source network problem. For example, if, in Figure 0.1, $Z_i = X_i \oplus_2 Y_i$ for all feasible $i$ and $f_3$ is a constant function (namely, the encoding rate of $f_3$ is 0), then this two-help-one problem reduces to the problem of encoding the modulo-two sum.

[Figure 0.1: Two-Help-One Source Network. Encoders $f_1$, $f_2$ and $f_3$ observe $X^n = X_1, X_2, \cdots, X_n$, $Y^n = Y_1, Y_2, \cdots, Y_n$ and $Z^n = Z_1, Z_2, \cdots, Z_n$, respectively; the decoder receives $f_1(X^n)$, $f_2(Y^n)$ and $f_3(Z^n)$ and outputs $\hat{Z}^n$.]

2. In network coding, an intermediate node is only interested in a function of the output messages from the sources (or its preceding nodes), instead of the original messages. As shown in Figure 0.2, node D is only required to relay the message $a + b$, which is a function of the outputs from B and C, respectively. If the function encoding scheme is implemented, then the required capacities of link BD and link CD can be reduced, while it is still guaranteed that $a + b$ is decoded correctly at node D.

[Figure 0.2: Network Coding. Nodes A to G, with links labelled $a$, $b$ and $a + b$.]

[Figure 0.3: Partial Data Backup. Data $A = [A_1, A_2, \cdots, A_n]$, $B = [B_1, B_2, \cdots, B_n]$ and $C = [C_1, C_2, \cdots, C_n]$, with links labelled $R_1$, $R_2$ and $R_3$ leading to a data backup that stores $f(A_1, B_1, C_1), f(A_2, B_2, C_2), \cdots, f(A_n, B_n, C_n)$.]

3. “Partial data backup”. Consider that a vast amount of correlated data A, B and C are stored in three data centres (see Figure 0.3).
The “partial backup” reliability requirement demands that, if one of the data centres fails, we must be able to recover the original data stored in this centre from the other two data centres and the backup. Since facilities are highly reliable nowadays, failures are rare. It is even more unlikely that more than two data centres malfunction at the same time. Therefore, the maintenance cost comes mainly from the network traffic required to perform frequent backups. The worst solution is to transfer all the data A, B and C and store duplicate copies in the backup. A better one is to store a function, say $f$, of the data in the backup. This reduces not only the size of the backup data but also the required network traffic, because the sum rate for encoding the function $f$ is usually significantly smaller than the one for encoding the original sources A, B and C. One specific method is to view data A, B and C as sequences of elements from the set $\{0, 1\}$ and let

f(0, 0, 0) = 0; f(0, 0, 1) = 3; f(0, 1, 0) = 2; f(1, 0, 0) = 1;
f(0, 1, 1) = 1; f(1, 0, 1) = 0; f(1, 1, 0) = 3; f(1, 1, 1) = 2.

It is easy to see that, with the backup storing this function of the data, when one data centre fails we can recover the data with the aid of the backup and the two available data centres.

However, the achievable coding rate region for encoding an arbitrary function of sources is unknown. Making use of the binary linear coding scheme, [KM79] showed that the Slepian–Wolf limit is often sub-optimal for encoding the modulo-two sum of two memoryless binary sources. For the special case of symmetric binary sources, [KM79] in addition proved that the optimal coding rate limit is a symmetric region, known as the Körner–Marton region. Yet the case of asymmetric binary sources is left open, although [AH83] proved that the Körner–Marton region is unfortunately sub-optimal.
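As a quick sanity check of the recovery claim for the partial-backup function $f$ tabulated above, the following minimal Python sketch enumerates all eight inputs and confirms that any single lost symbol is determined by the backup value and the two surviving symbols. The names `F_TABLE`, `recover` and `single_failure_recoverable` are hypothetical helpers introduced only for illustration. The remark that the table coincides with $a + 2b + 3c$ modulo 4 is our own observation, checked by the code rather than stated in the text.

```python
from itertools import product

# The function f from the partial-backup example, copied from the table above.
F_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 3, (0, 1, 0): 2, (1, 0, 0): 1,
    (0, 1, 1): 1, (1, 0, 1): 0, (1, 1, 0): 3, (1, 1, 1): 2,
}


def recover(lost_index, surviving, backup_symbol):
    """Return every bit value for the lost position that is consistent
    with the two surviving bits (given in order) and the backup symbol."""
    candidates = []
    for guess in (0, 1):
        triple = list(surviving)
        triple.insert(lost_index, guess)  # put the guess back in its slot
        if F_TABLE[tuple(triple)] == backup_symbol:
            candidates.append(guess)
    return candidates


def single_failure_recoverable():
    """Check that each lost symbol is uniquely determined in all cases."""
    for triple in product((0, 1), repeat=3):
        for lost_index in range(3):
            surviving = tuple(v for i, v in enumerate(triple) if i != lost_index)
            if recover(lost_index, surviving, F_TABLE[triple]) != [triple[lost_index]]:
                return False
    return True


# Illustrative observation (ours): the table equals a + 2b + 3c modulo 4.
MATCHES_LINEAR_FORM = all(
    F_TABLE[(a, b, c)] == (a + 2 * b + 3 * c) % 4
    for a, b, c in product((0, 1), repeat=3)
)
```

Under this reading, $f$ is the restriction of a linear function over the ring of integers modulo 4 to binary inputs, which is one concrete instance of the ring-based viewpoint pursued in the thesis.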
From [KM79, AH83], it is seen that the linear coding technique (over the binary field) is the key element that allows for achieving better coding rates outside the Slepian–Wolf region. It is easy to generalise their results to encoding functions over other finite fields with the help of the linear coding technique (over finite fields) from Csiszár¹. In this thesis, we propose to use linear encoders over finite rings. We will show that:

There exist (infinitely) many function encoding scenarios, in which the non-field ring linear coding scheme strictly outperforms its field counterpart, as well as the Slepian–Wolf scheme, in terms of achieving better coding rates.

Notice that the function encoding problem is a sub-problem in the above applications. Hence, it is plausible that the linear coding technique over finite rings can also provide an alternative, possibly better, solution to these applications, as it does to the function encoding problem.

Reason three: the classical typicality idea does not work beyond the independent and identically distributed (i.i.d.) source scenarios. One can call upon the Shannon–McMillan–Breiman (SMB) Theorem [Sha48, McM53, Bre57] to generalise the Slepian–Wolf data compression theorem from i.i.d. source scenarios to stationary ergodic source scenarios [Cov75] (and to asymptotically mean stationary (a.m.s.) ergodic source scenarios if the SMB Theorem for a.m.s. ergodic processes from [GK80] is applied). Similarly, the generalisation can also be done for the result on linear encoders over fields from Csiszár. Unfortunately, that is not the case when trying to generalise our results on linear coding over rings beyond the i.i.d. case (e.g. irreducible Markov sources). One of the technical obstacles is that Shannon's argument on (strong or weak) typical sequences no longer works as expected.
As a result, a new concept of typical sequence, called Supremus typical sequence, and its corresponding asymptotic equipartition property (AEP) and conditional typicality lemmata are introduced instead. Built on these new tools, corresponding results on linear coding over finite rings are established for both irreducible Markov and a.m.s. ergodic source scenarios.

The major differences between the mechanisms of classical typicality and Supremus typicality are seen by investigating the dynamical systems describing the random processes (sources). It is proved that all induced systems of an a.m.s. system are a.m.s. As a consequence:

The SMB Theorem simultaneously holds for all reduced processes of an a.m.s. ergodic random process.

From this we see that the classical typical sequences and the SMB Theorem do not represent and characterise, respectively, the corresponding a.m.s. ergodic random process well enough. To be more precise, the property that the SMB Theorem holds simultaneously for all reduced processes is not featured in the classical typicality concept. On the contrary, Supremus typicality takes the effect of all reduced processes into account. Its AEP further states that all non-typical sequences in the classical sense, together with all classical typical sequences that are not Supremus typical, are “negligible in probability.”

¹ This observation is part of the motivation of Csiszár's studies on linear codes over finite fields. In [Csi82], it reads “in some source network problems linear codes appear superior to others (cf. Körner and Marton [KM79]).”

Reason four: algorithms designed for rings are easier to implement compared to the ones for fields. This is because a finite field is normally given by its polynomial representation. Corresponding field operations are carried out based on the polynomial operations (addition and multiplication) followed by the polynomial long division algorithm.
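To make the implementation remark concrete, the following minimal Python sketch multiplies two elements of the four-element field in its polynomial representation, that is, a polynomial product over $\mathbb{F}_2$ followed by reduction modulo an irreducible polynomial. The choice of modulus $x^2 + x + 1$, the encoding of elements as 2-bit integers, and the function names `gf4_mul` and `zq_mul` are illustrative assumptions, not taken from the thesis. A one-line modular multiplication in the integers modulo $q$ is included for comparison.

```python
def gf4_mul(a, b):
    """Multiply in F_4 = F_2[x] / (x^2 + x + 1).

    Elements are 2-bit integers: bit 0 is the constant coefficient and
    bit 1 is the coefficient of x (an illustrative encoding).
    """
    # Step 1: polynomial (carry-less) multiplication over F_2.
    prod = 0
    for i in range(2):
        if (b >> i) & 1:
            prod ^= a << i
    # Step 2: polynomial division by x^2 + x + 1 (0b111). The product
    # has degree at most 2, so a single reduction step suffices.
    if (prod >> 2) & 1:
        prod ^= 0b111
    return prod


def zq_mul(a, b, q):
    """Multiply in the ring of integers modulo q."""
    return (a * b) % q


# Full multiplication tables, for inspection.
GF4_TABLE = [[gf4_mul(a, b) for b in range(4)] for a in range(4)]
Z4_TABLE = [[zq_mul(a, b, 4) for b in range(4)] for a in range(4)]
```

Both structures have four elements, but they differ algebraically: in the field every non-zero element has an inverse, whereas in the integers modulo 4 the element 2 is a zero divisor.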
In contrast, implementing the arithmetic of many finite rings is rather straightforward. For instance, the arithmetic of the ring $\mathbb{Z}_q$ of integers modulo $q$, for any positive integer $q$, is simply integer arithmetic modulo $q$, and the arithmetic of matrix rings consists of matrix additions and multiplications.

At the time of writing, we can settle the conjecture only as far as existence is concerned. Nevertheless, several interesting discoveries have already been made along the way. Hopefully, more will be unveiled as the studies continue.

0.2 Outline and Contributions

The remainder of the thesis is divided into several chapters. We summarise the contents of each of them, along with the contributions, below.

Chapter 1 introduces some fundamental algebraic concepts and some related properties that will be used in succeeding chapters.

Chapter 2 establishes an achievability theorem on linear coding over finite rings. This theorem includes corresponding results from [Eli55] and [Csi82] as special cases. In addition, we will also prove the optimality part (the converse) of this theorem in various cases. In particular, it is shown that for some finite non-field rings optimality always holds. This implies that for any Slepian–Wolf data compression scenario, there always exist linear encoders over some non-field rings that achieve the same optimal coding rate limits as their field counterparts.

Chapter 3 addresses the function encoding problem. In this problem, the first issue raised is how to handle an arbitrary function whose algebraic structure is unclear. We suggest a polynomial approach based on the fact that any discrete function defined on a finite domain is equivalent to a restriction of some polynomial function over some finite ring. Namely, we can assume that the function considered is presented as a polynomial function over some finite ring.
This allows us to use the linear coding technique over the corresponding ring to construct encoders and achieve better coding rates by exploring the polynomial structure. As a demonstration, we prove that linear coding over non-field rings strictly outperforms all of its field counterparts in terms of achieving better coding rates for encoding many functions.

Chapter 4 provides some theoretical background used to generalise results from Chapter 2 and Chapter 3 to the Markovian settings. This chapter investigates a new type of typicality for sequences, termed Supremus typical sequences, for irreducible Markov sources. It is seen that Supremus typicality is a condition stronger than classical typicality from Shannon. Even though Supremus typical sequences form a (often strictly smaller) subset of classical typical sequences, the AEP is still valid. Furthermore, Supremus typicality possesses properties that are more accessible and easier to analyse than its classical counterpart.

Chapter 5 generalises results from Chapter 2 and Chapter 3 to the Markovian settings. Seemingly, this can be easily done based on the SMB Theorem and the argument built on Shannon's typical sequences. Unfortunately, the end results so obtained are often difficult to analyse. This is because they involve evaluating entropy rates of functions of a Markov process. Since a function of a Markov process is usually not Markov, the results cannot provide much insight into the achievable coding rates (optimal or not). To overcome this, we replace the argument based on classical typicality with the one built on Supremus typicality introduced in Chapter 4. By exploring the properties of Supremus typicality, we obtain results that do not involve any analysis of entropy rates. In fact, calculations of the end results are simple and straightforward. Moreover, they are optimal, as shown in many examples.
Chapter 6 is dedicated to proving that an induced transformation with respect to a finite measure set of a recurrent a.m.s. dynamical system with a σ-finite measure is a.m.s. Since the SMB Theorem and the Shannon–McMillan Theorem hold for any finite-state a.m.s. ergodic process [GK80], it is concluded that the SMB Theorem, as well as the Shannon–McMillan Theorem, holds simultaneously for all reduced processes of any finite-state recurrent a.m.s. ergodic random process. We term this recursive property the Extended SMB Theorem. This theorem is important because it provides the theoretical background to further generalise some of our results from the Markovian settings to a more general case, a.m.s. ergodic sources. It is seen from Chapter 4 and Chapter 5 that the idea of Supremus typicality is important for our analysis to work. Generalising this concept from Markov sources to a.m.s. sources can also be straightforward. However, one needs to justify a corresponding AEP of Supremus typicality defined for a.m.s. sources in order to prove corresponding coding theorems. The Extended SMB Theorem is a key element for proving such an AEP, as seen in Chapter 7.

Chapter 7 establishes the AEP of Supremus typicality defined for recurrent a.m.s. ergodic sources based on the Extended SMB Theorem given in Chapter 6. The achievability theorem of linear coding over finite rings is then generalised to the recurrent a.m.s. ergodic source settings.

Chapter 8 summarises the thesis and provides some suggestions on future research directions.

0.3 Copyright Notice

Parts of the material presented in this thesis are based on the author's joint works which are previously published or submitted to conferences [HS12d, HS12a, HS12b, HS13b, HS13c, HS14b] and journals [HS, HS12c, HS13a, HS14a] held by or sponsored by the Institute of Electrical and Electronics Engineers (IEEE) or World Scientific or Royal Institute of Technology (KTH).
IEEE or World Scientific or KTH holds the copyright of the published papers and will hold the copyright of the submitted papers if they are accepted. Materials (e.g., figure, graph, table, or textual material) are reused in this thesis with permission.

0.4 Notation

We denote random variables, their corresponding realisations or deterministic values, and their alphabets by upper case, lower case, and script letters, respectively. For a positive integer $n$, $X^n$ is designated for the array
\[ \left\{ X^{(1)}, X^{(2)}, \cdots, X^{(n)} \right\}. \]
Suppose that
\[ X^{(i)} = \prod_{j=1}^{m} X_j^{(i)} := \begin{bmatrix} X_1^{(i)} \\ X_2^{(i)} \\ \vdots \\ X_m^{(i)} \end{bmatrix}, \]
then $X_T^{(i)}$ ($T \subseteq \{1, 2, \cdots, m\}$) is defined to be $\prod_{j \in T} X_j^{(i)}$ and $X_T^n$ stands for
\[ \left\{ X_T^{(1)}, X_T^{(2)}, \cdots, X_T^{(n)} \right\}. \]
Similarly, $x^n$, $x_T^{(i)}$, $x_T^n$, $\mathcal{X}^n$, $\mathcal{X}_T^{(i)}$ and $\mathcal{X}_T^n$ resemble corresponding definitions. In addition, the cardinality of a set $\mathcal{X}$ is denoted by $|\mathcal{X}|$, and all logarithms in the thesis are of base 2, unless stated otherwise.

Other notation used in this thesis is listed in the following:

$\mathbb{N}$    The set of non-negative integers
$\mathbb{N}^+$    The set of positive integers
$\mathbb{R}$    The set of real numbers
$\mathrm{cov}(A)$    The convex hull of set $A \subseteq \mathbb{R}^n$
$\mathrm{supp}(p)$    The support of the (probability mass) function $p$
$P_X$, $p(x)$    Probability distribution of discrete random variable $X$
$X \sim P_X$    Random variable $X$ is distributed according to $P_X$
$H(X)$    Entropy of random variable $X$
$H(X, Y)$    Joint entropy of random variables $X$ and $Y$
$H(X|Y)$    Conditional entropy of random variable $X$ given $Y$
$I(X; Y)$    Mutual information between random variables $X$ and $Y$
$I(X; Y|Z)$    Conditional mutual information between $X$ and $Y$ given $Z$
$\Pr\{E\}$    Probability of the event $E$
$\Pr\{E_1 | E_2\}$    Probability of the event $E_1$ conditional on $E_2$
$T(n, X)$    The set of all ε-typical sequences of length $n$ with respect to $X \sim P_X$
$T(n, P)$    The set of all Markov ε-typical sequences of length $n$ with respect to an irreducible Markov process with transition matrix $P$
$T_H(n, P)$    The set of all modified weak ε-typical sequences of length $n$ with respect to an irreducible Markov process with transition matrix $P$
$S(n, P)$    The set of all Supremus ε-typical sequences of length $n$ with respect to an irreducible Markov process with transition matrix $P$
$S(n, X^{(n)})$    The set of all Supremus ε-typical sequences of length $n$ with respect to the random process $X^{(n)}$
$H(n, X^{(n)})$    The set of all Hyper Supremus ε-typical sequences of length $n$ with respect to the random process $X^{(n)}$

Chapter 1 Preliminaries: Finite Rings and Polynomial Functions

This chapter provides some fundamental algebraic concepts and related properties. Readers who are already familiar with this material may still choose to go through it quickly to identify our notation.

1.1 Finite Rings

Definition 1.1.1. The tuple $[R, +, \cdot]$ is called a ring if the following criteria are met:

1. $[R, +]$ is an Abelian group;
2. There exists a multiplicative identity¹ $1 \in R$, namely, $1 \cdot a = a \cdot 1 = a$, $\forall\, a \in R$;
3. $\forall\, a, b, c \in R$, $a \cdot b \in R$ and $(a \cdot b) \cdot c = a \cdot (b \cdot c)$;
4. $\forall\, a, b, c \in R$, $a \cdot (b + c) = (a \cdot b) + (a \cdot c)$ and $(b + c) \cdot a = (b \cdot a) + (c \cdot a)$.

We often write $R$ for $[R, +, \cdot]$ when the operations considered are known from the context. The operation “$\cdot$” is usually written by juxtaposition, $ab$ for $a \cdot b$, for all $a, b \in R$.

A ring $[R, +, \cdot]$ is said to be commutative if $\forall\, a, b \in R$, $a \cdot b = b \cdot a$. In Definition 1.1.1, the identity of the group $[R, +]$, denoted by 0, is called the zero. A ring $[R, +, \cdot]$ is said to be finite if the cardinality $|R|$ is finite, and $|R|$ is called the order of $R$. The set $\mathbb{Z}_q$ of integers modulo $q$ is a commutative finite ring with respect to the modular arithmetic. For any ring $R$, the set of all polynomials of $s$ indeterminates over $R$ is an infinite ring.

¹ Sometimes a ring without a multiplicative identity is considered. Such a structure has been called a rng. We consider rings with multiplicative identities in this thesis. However, similar results remain valid when considering rngs instead.
Although we will occasionally comment on such results, they are not fully considered in the present work.

Proposition 1.1.1. Given $s$ rings $R_1, R_2, \cdots, R_s$, for any non-empty set $T \subseteq \{1, 2, \cdots, s\}$, the Cartesian product (see [Rot10]) $R_T = \prod_{i \in T} R_i$ forms a new ring $[R_T, +, \cdot]$ with respect to the component-wise operations defined as follows:
\[ a' + a'' = \left( a'_1 + a''_1, a'_2 + a''_2, \cdots, a'_{|T|} + a''_{|T|} \right), \]
\[ a' \cdot a'' = \left( a'_1 a''_1, a'_2 a''_2, \cdots, a'_{|T|} a''_{|T|} \right), \]
$\forall\, a' = \left( a'_1, a'_2, \cdots, a'_{|T|} \right), a'' = \left( a''_1, a''_2, \cdots, a''_{|T|} \right) \in R_T$.

Remark 1.1. In Proposition 1.1.1, $[R_T, +, \cdot]$ is called the direct product of $\{R_i \,|\, i \in T\}$. It can be easily seen that $(0, 0, \cdots, 0)$ and $(1, 1, \cdots, 1)$ are the zero and the multiplicative identity of $[R_T, +, \cdot]$, respectively.

Definition 1.1.2. A non-zero element $a$ of a ring $R$ is said to be invertible if and only if there exists $b \in R$ such that $ab = ba = 1$. $b$ is called the inverse of $a$, denoted by $a^{-1}$. An invertible element of a ring is called a unit.

Remark 1.2. It can be proved that the inverse of a unit is unique. By definition, the multiplicative identity is the inverse of itself.

Let $R^* = R \setminus \{0\}$. The ring $[R, +, \cdot]$ is a field if and only if $[R^*, \cdot]$ is an Abelian group. In other words, all non-zero elements of $R$ are invertible. All fields are commutative rings. $\mathbb{Z}_q$ is a field if and only if $q$ is a prime. All finite fields of the same order are isomorphic to each other [DF03, p. 549]. This “unique” field of order $q$ is denoted by $\mathbb{F}_q$. It is necessary that $q$ is a power of a prime. More details regarding finite fields can be found in [DF03, Ch. 14.3].

Theorem 1.1.1 (Wedderburn's little theorem, cf. [Rot10, Theorem 7.13]). Let $R$ be a finite ring. $R$ is a field if and only if all non-zero elements of $R$ are invertible.

Remark 1.3. Wedderburn's little theorem guarantees commutativity for a finite ring if all of its non-zero elements are invertible.
Hence, a finite ring is either a field or at least one of its elements has no inverse. However, a finite commutative ring is not necessarily a field, e.g. $\mathbb{Z}_q$ is not a field if $q$ is not a prime.

Definition 1.1.3 (cf. [DF03]). The characteristic of a finite ring $R$ is defined to be the smallest positive integer $m$ such that $\sum_{j=1}^{m} 1 = 0$, where 0 and 1 are the zero and the multiplicative identity of $R$, respectively. The characteristic of $R$ is often denoted by $\mathrm{Char}(R)$.

Remark 1.4. Clearly, $\mathrm{Char}(\mathbb{Z}_q) = q$. For a finite field $\mathbb{F}_q$, $\mathrm{Char}(\mathbb{F}_q)$ is always the prime $q_0$ such that $q = q_0^n$ for some integer $n$ [Rot10, Proposition 2.137].

Proposition 1.1.2. Let $\mathbb{F}_q$ be a finite field. For any $0 \neq a \in \mathbb{F}_q$, $m = \mathrm{Char}(\mathbb{F}_q)$ if and only if $m$ is the smallest positive integer such that $\sum_{j=1}^{m} a = 0$.

Proof. Since $a \neq 0$,
\[ \sum_{j=1}^{m} a = 0 \Rightarrow a^{-1} \sum_{j=1}^{m} a = a^{-1} \cdot 0 \Rightarrow \sum_{j=1}^{m} 1 = 0 \Rightarrow \sum_{j=1}^{m} a = 0. \]
The statement is proved.

Definition 1.1.4. A subset $I$ of a ring $[R, +, \cdot]$ is said to be a left ideal of $R$, denoted by $I \leq_l R$, if and only if

1. $[I, +]$ is a subgroup of $[R, +]$;
2. $\forall\, x \in I$ and $\forall\, a \in R$, $a \cdot x \in I$.

If condition 2 is replaced by

3. $\forall\, x \in I$ and $\forall\, a \in R$, $x \cdot a \in I$,

then $I$ is called a right ideal of $R$, denoted by $I \leq_r R$. $\{0\}$ is a trivial left (right) ideal, usually denoted by 0. The cardinality $|I|$ is called the order of a finite left (right) ideal $I$.

Remark 1.5. Let $\{a_1, a_2, \cdots, a_n\}$ be a non-empty set of elements of some ring $R$. It is easy to verify that $\langle a_1, a_2, \cdots, a_n \rangle_r := \left\{ \sum_{i=1}^{n} a_i b_i \,\middle|\, b_i \in R, \forall\, 1 \leq i \leq n \right\}$ is a right ideal and $\langle a_1, a_2, \cdots, a_n \rangle_l := \left\{ \sum_{i=1}^{n} b_i a_i \,\middle|\, b_i \in R, \forall\, 1 \leq i \leq n \right\}$ is a left ideal. Furthermore, $\langle a_1, a_2, \cdots, a_n \rangle_r = \langle a_1, a_2, \cdots, a_n \rangle_l = R$ if some $a_i$ is a unit.

It is well-known that if $I \leq_l R$, then $R$ is divided into disjoint cosets which are of equal size (cardinality). For any coset $J$, $J = x + I = \{x + y \,|\, y \in I\}$, $\forall\, x \in J$. The set of all cosets forms a left module over $R$, denoted by $R/I$.
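The statements of Remark 1.5 and the coset decomposition can be illustrated for the commutative ring $\mathbb{Z}_q$, where left and right ideals coincide. The following Python sketch is a minimal brute-force illustration, with `generated_ideal` and `cosets` being hypothetical helper names and $q = 12$ an arbitrary choice made for the example.

```python
from itertools import product


def generated_ideal(q, gens):
    """Ideal <a_1, ..., a_n> of Z_q: all sums a_1*b_1 + ... + a_n*b_n
    with every b_i ranging over Z_q."""
    return frozenset(
        sum(a * b for a, b in zip(gens, bs)) % q
        for bs in product(range(q), repeat=len(gens))
    )


def cosets(q, ideal):
    """Set of all cosets x + I of the ideal I in Z_q."""
    return {frozenset((x + y) % q for y in ideal) for x in range(q)}


# <4, 6> in Z_12 consists of the even residues.
EVEN_IDEAL = generated_ideal(12, (4, 6))
# The cosets are pairwise disjoint, of equal size, and cover Z_12.
EVEN_COSETS = cosets(12, EVEN_IDEAL)
# 5 is a unit of Z_12, so any generating set containing it yields Z_12.
FULL_IDEAL = generated_ideal(12, (5, 6))
```

In this example the ideal has order 6, so $\mathbb{Z}_{12}$ splits into two cosets of six elements each.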
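Similarly, $\mathfrak{R}/\mathfrak{I}$ becomes a right module over $\mathfrak{R}$ if $\mathfrak{I} \leq_r \mathfrak{R}$ [AF92]. Of course, $\mathfrak{R}/\mathfrak{I}$ can also be considered as a quotient group [Rot10, Ch. 1.6 and Ch. 2.9]. However, its structure is far richer than simply being a quotient group.

Proposition 1.1.3. Let $\mathfrak{R}_i$ ($1 \leq i \leq s$) be a ring and $\mathfrak{R} = \prod_{i=1}^{s} \mathfrak{R}_i$. For any $\mathfrak{A} \subseteq \mathfrak{R}$, $\mathfrak{A} \leq_l \mathfrak{R}$ (or $\mathfrak{A} \leq_r \mathfrak{R}$) if and only if $\mathfrak{A} = \prod_{i=1}^{s} \mathfrak{A}_i$ and $\mathfrak{A}_i \leq_l \mathfrak{R}_i$ (or $\mathfrak{A}_i \leq_r \mathfrak{R}_i$), $\forall\, 1 \leq i \leq s$.

Proof. We prove for the $\leq_l$ case only, and the $\leq_r$ case follows from a similar argument. Let $\pi_i$ ($1 \leq i \leq s$) be the coordinate function assigning every element in $\mathfrak{R}$ its $i$th component. Then $\mathfrak{A} \subseteq \prod_{i=1}^{s} \mathfrak{A}_i$, where $\mathfrak{A}_i = \pi_i(\mathfrak{A})$. Moreover, for any
\[ x = \left( \pi_1(x_1), \pi_2(x_2), \cdots, \pi_s(x_s) \right) \in \prod_{i=1}^{s} \mathfrak{A}_i, \]
where $x_i \in \mathfrak{A}$ for all feasible $i$, we have that
\[ x = \sum_{i=1}^{s} e_i x_i, \]
where $e_i \in \mathfrak{R}$ has the $i$th coordinate being 1 and others being 0. If $\mathfrak{A} \leq_l \mathfrak{R}$, then $x \in \mathfrak{A}$ by definition. Therefore, $\prod_{i=1}^{s} \mathfrak{A}_i \subseteq \mathfrak{A}$. Consequently, $\mathfrak{A} = \prod_{i=1}^{s} \mathfrak{A}_i$. Since $\pi_i$ is a homomorphism, we also have that $\mathfrak{A}_i \leq_l \mathfrak{R}_i$ for all feasible $i$. The other direction is easily verified by definition.

Remark 1.6. It is worthwhile to point out that Proposition 1.1.3 does not hold for infinite index set, namely, $\mathfrak{R} = \prod_{i \in I} \mathfrak{R}_i$, where $I$ is not finite.

For any $\emptyset \neq T \subseteq S$, Proposition 1.1.3 states that any left (right) ideal of $\mathfrak{R}_T$ is a Cartesian product of some left (right) ideals of $\mathfrak{R}_i$, $i \in T$. Let $\mathfrak{I}_i$ be a left (right) ideal of ring $\mathfrak{R}_i$. We define $\mathfrak{I}_T$ to be the left (right) ideal $\prod_{i \in T} \mathfrak{I}_i$ of $\mathfrak{R}_T$.

Let $x^{tr}$ be the transpose of a vector (or matrix) $x$.

Definition 1.1.5. A mapping $f : \mathfrak{R}^n \to \mathfrak{R}^m$ given as:
\[ f(x_1, x_2, \cdots, x_n) = \left( \sum_{j=1}^{n} a_{1,j} x_j, \cdots, \sum_{j=1}^{n} a_{m,j} x_j \right)^{tr}, \quad \forall\, (x_1, \cdots, x_n) \in \mathfrak{R}^n, \tag{1.1.1} \]
where $a_{i,j} \in \mathfrak{R}$ for all feasible $i$ and $j$, is called a left linear mapping over ring $\mathfrak{R}$. Similarly,
\[ f(x_1, x_2, \cdots, x_n) = \left( \sum_{j=1}^{n} x_j a_{1,j}, \cdots, \sum_{j=1}^{n} x_j a_{m,j} \right)^{tr}, \quad \forall\, (x_1, \cdots, x_n) \in \mathfrak{R}^n, \]
defines a right linear mapping over ring $\mathfrak{R}$.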
If $m = 1$, then $f$ is called a left (right) linear function over $\mathfrak{R}$.

From now on, left linear mapping (function) or right linear mapping (function) are simply called linear mapping (function). This will not lead to any confusion since the intended use can usually be clearly distinguished from the context.

Remark 1.7. The mapping $f$ in Definition 1.1.5 is called linear in accordance with the definition of linear mapping (function) over field. In fact, the two structures have several similar properties. Moreover, (1.1.1) is equivalent to
\[ f(x_1, x_2, \cdots, x_n) = \mathbf{A} (x_1, x_2, \cdots, x_n)^{tr}, \quad \forall\, (x_1, x_2, \cdots, x_n) \in \mathfrak{R}^n, \tag{1.1.2} \]
where $\mathbf{A}$ is an $m \times n$ matrix over $\mathfrak{R}$ and $[\mathbf{A}]_{i,j} = a_{i,j}$ for all feasible $i$ and $j$. $\mathbf{A}$ is named the coefficient matrix. It is easy to prove that a linear mapping is uniquely determined by its coefficient matrix, and vice versa. The linear mapping $f$ is said to be trivial, denoted by $0$, if $\mathbf{A}$ is the zero matrix, i.e. $[\mathbf{A}]_{i,j} = 0$ for all feasible $i$ and $j$.

Let $\mathbf{A}$ be an $m \times n$ matrix over ring $\mathfrak{R}$ and $f(\mathbf{x}) = \mathbf{A}\mathbf{x}$, $\forall\, \mathbf{x} \in \mathfrak{R}^n$. For the system of linear equations
\[ f(\mathbf{x}) = \mathbf{A}\mathbf{x} = \mathbf{0}, \quad \text{where } \mathbf{0} = (0, 0, \cdots, 0)^{tr} \in \mathfrak{R}^m, \]
let $S(f)$ be the set of all solutions, namely $S(f) = \{\mathbf{x} \in \mathfrak{R}^n \mid f(\mathbf{x}) = \mathbf{0}\}$. It is obvious that $S(f) = \mathfrak{R}^n$ if $f$ is trivial, i.e. $\mathbf{A}$ is a zero matrix. If $\mathfrak{R}$ is a field, then $S(f)$ is a subspace of $\mathfrak{R}^n$. We conclude this section with a lemma regarding the cardinalities of $\mathfrak{R}^n$ and $S(f)$ in the following.

Lemma 1.1.1. For a finite ring $\mathfrak{R}$ and a linear function
\[ f : \mathbf{x} \mapsto (a_1, a_2, \cdots, a_n)\mathbf{x} \quad \text{or} \quad f : \mathbf{x} \mapsto \mathbf{x}^{tr} (a_1, a_2, \cdots, a_n)^{tr}, \quad \forall\, \mathbf{x} \in \mathfrak{R}^n, \]
we have
\[ \frac{|S(f)|}{|\mathfrak{R}|^n} = \frac{1}{|\mathfrak{I}|}, \]
where $\mathfrak{I} = \langle a_1, a_2, \cdots, a_n \rangle_r$ (or $\mathfrak{I} = \langle a_1, a_2, \cdots, a_n \rangle_l$). In particular, if $a_i$ is invertible for some $1 \leq i \leq n$, then $|S(f)| = |\mathfrak{R}|^{n-1}$.

Proof. It is obvious that the image $f(\mathfrak{R}^n) = \mathfrak{I}$ by definition. Moreover, $\forall\, x \neq y \in \mathfrak{I}$, the pre-images $f^{-1}(x)$ and $f^{-1}(y)$ satisfy $f^{-1}(x) \cap f^{-1}(y) = \emptyset$ and $\left| f^{-1}(x) \right| = \left| f^{-1}(y) \right| = |S(f)|$.
Therefore, $|\mathfrak{I}|\,|S(f)| = |\mathfrak{R}|^n$, i.e. $\dfrac{|S(f)|}{|\mathfrak{R}|^n} = \dfrac{1}{|\mathfrak{I}|}$. Moreover, if $a_i$ is a unit, then $\mathfrak{I} = \mathfrak{R}$, thus, $|S(f)| = |\mathfrak{R}|^n / |\mathfrak{R}| = |\mathfrak{R}|^{n-1}$.

1.2 Polynomial Functions

Definition 1.2.1. A polynomial function$^2$ of $k$ variables over a finite ring $\mathfrak{R}$ is a function $g : \mathfrak{R}^k \to \mathfrak{R}$ of the form
\[ g(x_1, x_2, \cdots, x_k) = \sum_{j=0}^{m} a_j x_1^{m_{1j}} x_2^{m_{2j}} \cdots x_k^{m_{kj}}, \tag{1.2.1} \]
where $a_j \in \mathfrak{R}$ and $m$ and $m_{ij}$'s are non-negative integers. The set of all the polynomial functions of $k$ variables over ring $\mathfrak{R}$ is designated by $\mathfrak{R}[k]$.

$^2$ Polynomial and polynomial function are distinct concepts.

Remark 1.8. Polynomial and polynomial function are sometimes only defined over a commutative ring [Rot10, MS84]. It is a very delicate matter to define them over a non-commutative ring [Hun80, Lam01], due to the fact that $x_1 x_2$ and $x_2 x_1$ can become different objects. We choose to define "polynomial functions" with formula (1.2.1) because those functions are within the scope of this work's interest.

Lemma 1.2.1 (cf. [LN97, Lemma 7.40]). For any polynomial function $g \in \mathbb{F}_q[k]$, where $q$ is a power of a prime and $k \in \mathbb{N}^+$, there exists a unique polynomial function $h \in \mathbb{F}_q[k]$ of degree less than $q$ in each variable with $h = g$.

Lemma 1.2.2. Let $q$ be a power of a prime. The number of polynomial functions in $\mathbb{F}_q[k]$ ($k \in \mathbb{N}^+$) is $q^{q^k}$. Moreover, any function $g : \mathbb{F}_q^k \to \mathbb{F}_q$ is a polynomial function in $\mathbb{F}_q[k]$.

Proof. By Lemma 1.2.1, we have that
\[ |\mathbb{F}_q[k]| \leq |\text{polynomial functions of degree less than } q \text{ in each variable}| = q^{q^k}. \]
On the other hand, it is obvious that two distinct polynomial functions in $\mathbb{F}_q[k]$ of degree less than $q$ in each variable are never equal. Thus,
\[ |\mathbb{F}_q[k]| \geq |\text{polynomial functions of degree less than } q \text{ in each variable}| = q^{q^k}. \]
Consequently, $|\mathbb{F}_q[k]| = q^{q^k}$.
In addition, let $\mathscr{A}$ be the set of all functions with domain $\mathbb{F}_q^k$ and codomain $\mathbb{F}_q$. Obviously, $\mathbb{F}_q[k] \subseteq \mathscr{A}$. Meanwhile, $|\mathscr{A}| = q^{q^k}$. Therefore, $\mathbb{F}_q[k] = \mathscr{A}$.

Remark 1.9.
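The special case of Lemma 1.2.2 with $q$ being a prime and $k = 1$ can be easily verified with Fermat's Little Theorem.

Theorem 1.2.1 (Fermat's Little Theorem). $p$ divides $a^{p-1} - 1$ whenever $p$ is prime and $a$ is coprime to $p$, i.e. $a^{p-1} \equiv 1 \pmod{p}$; consequently, $a^p \equiv a \pmod{p}$.

Definition 1.2.2. Let $g_1 : \prod_{i=1}^{s} \mathcal{X}_i \to \Omega_1$ and $g_2 : \prod_{i=1}^{s} \mathcal{Y}_i \to \Omega_2$ be two functions. If there exist bijections $\mu_i : \mathcal{X}_i \to \mathcal{Y}_i$, $\forall\, 1 \leq i \leq s$, and $\nu : \Omega_1 \to \Omega_2$, such that
\[ g_1(x_1, x_2, \cdots, x_s) = \nu^{-1}\left( g_2(\mu_1(x_1), \mu_2(x_2), \cdots, \mu_s(x_s)) \right), \]
then $g_1$ and $g_2$ are said to be equivalent (via $\mu_1, \mu_2, \cdots, \mu_s$ and $\nu$).

Definition 1.2.3. Given function $g : \mathcal{D} \to \Omega$, and let $\emptyset \neq \mathcal{S} \subseteq \mathcal{D}$. The restriction of $g$ on $\mathcal{S}$ is defined to be the function $g|_{\mathcal{S}} : \mathcal{S} \to \Omega$ such that $g|_{\mathcal{S}} : x \mapsto g(x)$, $\forall\, x \in \mathcal{S}$.

Lemma 1.2.3. For any discrete function $g : \prod_{i=1}^{k} \mathcal{X}_i \to \Omega$ with $\mathcal{X}_i$'s and $\Omega$ being finite, there always exist a finite ring (field) $\mathfrak{R}$ and a polynomial function $\hat{g} \in \mathfrak{R}[k]$ such that
\[ \nu\left( g(x_1, x_2, \cdots, x_k) \right) = \hat{g}\left( \mu_1(x_1), \mu_2(x_2), \cdots, \mu_k(x_k) \right) \]
for some injections $\mu_i : \mathcal{X}_i \to \mathfrak{R}$ ($1 \leq i \leq k$) and $\nu : \Omega \to \mathfrak{R}$.

Proof. Let $\mathfrak{R}$ be a finite field of order at least $\max\{|\Omega|, |\mathcal{X}_1|, \cdots, |\mathcal{X}_k|\}$, so that the required injections exist. For any injections $\mu_i : \mathcal{X}_i \to \mathfrak{R}$ ($1 \leq i \leq k$) and $\nu : \Omega \to \mathfrak{R}$, any function $\hat{g} : \mathfrak{R}^k \to \mathfrak{R}$ that coincides with $\nu \circ g(\mu'_1, \mu'_2, \cdots, \mu'_k)$ on $\prod_{i=1}^{k} \mu_i(\mathcal{X}_i)$, where $\mu'_i$ is the inverse mapping of $\mu_i : \mathcal{X}_i \to \mu_i(\mathcal{X}_i)$, must be a polynomial function by Lemma 1.2.2. The statement is established.

Remark 1.10. Up to equivalence, a function can be presented in many different formats. For example, the function $\min\{x, y\}$ defined on $\{0, 1\} \times \{0, 1\}$ (with ordering $0 \leq 1$) can either be seen as $F_1(x, y) = xy$ on $\mathbb{Z}_2^2$ or be treated as the restriction of $F_2(x, y) = x + y - (x + y)^2$ defined on $\mathbb{Z}_3^2$ to the domain $\{0, 1\} \times \{0, 1\} \subsetneq \mathbb{Z}_3^2$.

Lemma 1.2.3 states that any discrete function defined on a finite domain is equivalent to a restriction of some polynomial function over some finite ring (field).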
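The two presentations of $\min\{x, y\}$ mentioned in Remark 1.10 can be confirmed by exhaustive evaluation. The short Python sketch below is only an illustration; the function names `F1` and `F2` mirror the symbols of the remark, and the list `differs` is a hypothetical helper introduced here to show that $F_2$ agrees with $\min$ only on the restricted domain.

```python
from itertools import product


def F1(x, y):
    """F1(x, y) = x*y evaluated in Z_2."""
    return (x * y) % 2


def F2(x, y):
    """F2(x, y) = x + y - (x + y)^2 evaluated in Z_3."""
    return (x + y - (x + y) ** 2) % 3


# On {0,1} x {0,1} both polynomial functions coincide with min{x, y}.
agree_on_restriction = all(
    F1(x, y) == min(x, y) == F2(x, y) for x, y in product((0, 1), repeat=2)
)

# Outside the restricted domain F2 need not equal min (integer ordering 0 < 1 < 2 assumed).
differs = [
    (x, y) for x, y in product(range(3), repeat=2) if F2(x, y) != min(x, y)
]

print(agree_on_restriction)
print(differs)
```

The check illustrates why the restriction matters: the equivalence is claimed only on $\{0, 1\} \times \{0, 1\}$, not on all of $\mathbb{Z}_3^2$.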
As a consequence, we can restrict a problem considering an arbitrary function with a finite domain to the problem considering only polynomial functions and their restrictions that are equivalent to this arbitrary function. This polynomial approach offers valuable insight into the general problem, because the algebraic structure of a polynomial function is clearer than that of an arbitrary function. We often call $\hat{g}$ in Lemma 1.2.3 a polynomial presentation of $g$. In addition, if $\hat{g}$ admits that
\[ \hat{g} = h \circ k, \quad \text{where } k(x_1, x_2, \cdots, x_s) = \sum_{i=1}^{s} k_i(x_i), \]
and $h$, $k_i$'s are functions mapping $\mathfrak{R}$ to $\mathfrak{R}$, then it is named a nomographic function over $\mathfrak{R}$ (by terminology borrowed from [Buc82]), and it is said to be a nomographic presentation of $g$ if $g$ is equivalent to a restriction of $\hat{g}$.

Lemma 1.2.4. Let $\mathcal{X}_1, \mathcal{X}_2, \cdots, \mathcal{X}_s$ and $\Omega$ be some finite sets. For any discrete function $g : \prod_{i=1}^{s} \mathcal{X}_i \to \Omega$, there exists a nomographic function $\hat{g}$ over some finite ring (field) $\mathfrak{R}$ such that
\[ \nu\left( g(x_1, x_2, \cdots, x_s) \right) = \hat{g}\left( \mu_1(x_1), \mu_2(x_2), \cdots, \mu_s(x_s) \right) \]
for some injections $\mu_i : \mathcal{X}_i \to \mathfrak{R}$ ($1 \leq i \leq s$) and $\nu : \Omega \to \mathfrak{R}$.

Proof. Let $\mathbb{F}$ be a finite field such that $|\mathbb{F}| \geq |\mathcal{X}_i|$ for all $1 \leq i \leq s$ and $|\mathbb{F}|^s \geq |\Omega|$, and let $\mathfrak{R}$ be the extension field of $\mathbb{F}$ of order $|\mathbb{F}|^s$ (one example of the pair $\mathbb{F}$ and $\mathfrak{R}$ is $\mathbb{Z}_p$, where $p$ is some prime, and its Galois extension of degree $s$). It is easily seen that $\mathfrak{R}$ is an $s$-dimensional vector space over $\mathbb{F}$. Hence, there exist $s$ vectors $v_1, v_2, \cdots, v_s \in \mathfrak{R}$ that are linearly independent. Let $\mu_i$ be an injection from $\mathcal{X}_i$ to the subspace generated by vector $v_i$. It is easy to verify that $k = \sum_{i=1}^{s} \mu_i$ is injective since $v_1, v_2, \cdots, v_s$ are linearly independent. Let $k'$ be the inverse mapping of $k : \prod_{i=1}^{s} \mathcal{X}_i \to k\left( \prod_{i=1}^{s} \mathcal{X}_i \right)$ and $\nu : \Omega \to \mathfrak{R}$ be any injection. By the second half of Lemma 1.2.2, there exists a polynomial function $h \in \mathfrak{R}[1]$ such that $h = \nu \circ g \circ k'$ on $k\left( \prod_{i=1}^{s} \mathcal{X}_i \right)$. Let $\hat{g}(x_1, x_2, \cdots, x_s) = h\left( \sum_{i=1}^{s} x_i \right)$. The statement is proved.

Remark 1.11.
In the above proof, $k$ is chosen to be injective because the proof includes the case that $g$ is an identity function. In general, $k$ is not necessarily injective.

Chapter 2

Linear Coding

This chapter is dedicated to establishing an achievability theorem regarding linear coding over finite rings (LCoR). It will be seen that this includes corresponding results from [Eli55] and [Csi82] as special cases. In addition, we will also prove the optimality part (the converse) of this theorem in various cases. In particular, it is shown that for some finite non-field rings optimality is always claimed. This implies that for any Slepian–Wolf data compression scenario, there always exist linear encoders over some non-field rings that achieve the same optimal coding rate limits as their field counterparts.

From Chapter 0, we learnt that the Slepian–Wolf problem is a special case of the function encoding problem. To define the latter in rigorous terms:

Problem 2.1 (Source Coding for Computing). Let $i \in S = \{1, 2, \cdots, s\}$ be a discrete memoryless source (DMS) that randomly generates i.i.d. discrete data $X_i^{(1)}, X_i^{(2)}, \cdots, X_i^{(n)}, \cdots$, where $X_i^{(n)}$ has a finite sample space $\mathcal{X}_i$ and $X_S^{(n)} \sim p$, $\forall\, n \in \mathbb{N}^+$. For a discrete function $g : \mathcal{X}_S \to \Omega$, what is the largest region $\mathcal{R}[g] \subset \mathbb{R}^s$, such that, $\forall\, (R_1, R_2, \cdots, R_s) \in \mathcal{R}[g]$ and $\forall\, \epsilon > 0$, there exists an $N_0 \in \mathbb{N}^+$, such that for all $n > N_0$, there exist $s$ encoders $\phi_i : \mathcal{X}_i^n \to \left[ 1, 2^{nR_i} \right]$, $i \in S$, and one decoder $\psi : \prod_{i \in S} \left[ 1, 2^{nR_i} \right] \to \Omega^n$, with
\[ \Pr\left\{ \vec{g}(X_S^n) \neq \psi\left[ \phi_1(X_1^n), \phi_2(X_2^n), \cdots, \phi_s(X_s^n) \right] \right\} < \epsilon, \]
where
\[ \vec{g}(X_S^n) = \begin{bmatrix} g\left( X_S^{(1)} \right) \\ \vdots \\ g\left( X_S^{(n)} \right) \end{bmatrix} \in \Omega^n? \]

The region $\mathcal{R}[g]$ is called the achievable coding rate region for computing $g$. A rate tuple $\mathbf{R} \in \mathbb{R}^s$ is said to be achievable for computing $g$ (or simply achievable) if and only if $\mathbf{R} \in \mathcal{R}[g]$. A region $\mathcal{R} \subset \mathbb{R}^s$ is said to be achievable for computing $g$ (or simply achievable) if and only if $\mathcal{R} \subseteq \mathcal{R}[g]$.
Obviously, in the problem of source coding for computing, Problem 2.1, the decoder is only interested in recovering a function of the message(s), rather than the original message(s), that is (are) i.i.d. generated and independently encoded by the source(s). If $g$ is an identity function, the computing problem is exactly the Slepian–Wolf source coding problem. $\mathcal{R}[g]$ is then the Slepian–Wolf region [SW73],
\[ \mathcal{R}[X_1, X_2, \cdots, X_s] = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, \sum_{j \in T} R_j > H(X_T | X_{T^c}), \forall\, \emptyset \neq T \subseteq S \right\}, \]
where $T^c$ is the complement of $T$ in $S$. However, from [SW73] it is hard to draw conclusions regarding the structure of the optimal encoders, as the corresponding mappings are chosen randomly among all feasible mappings. This limits the scope of their potential applications. As a complement, linear coding over finite fields (LCoF), namely $\mathcal{X}_i$'s are injectively mapped into some subsets of some finite fields and the $\phi_i$'s are chosen as linear mappings over these fields, is considered. It is shown that LCoF achieves the same encoding limit, the Slepian–Wolf region [Eli55, Csi82]. Although it seems straightforward to study linear mappings over rings (non-field rings in particular), it has not been proved (nor disproved) that linear encoding over non-field rings can be equally optimal.

This chapter will concentrate on addressing this problem. We will prove that linear encoding over non-field rings can be equally optimal.

2.1 Linear Coding over Finite Rings

In this section, we will present a coding rate region achieved with LCoR for the Slepian–Wolf source coding problem, i.e. $g$ is an identity function in Problem 2.1. This region is exactly the Slepian–Wolf region if all the rings considered are fields. However, being a field is not necessary as seen in Section 2.3, where the issue of optimality is addressed.

Before proceeding, a subtlety needs to be cleared up.
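It is assumed that a source, say $i$, generates data taking values from a finite sample space $\mathcal{X}_i$, while $\mathcal{X}_i$ does not necessarily admit any algebraic structure. We have to either assume that $\mathcal{X}_i$ is with a certain algebraic structure, for instance $\mathcal{X}_i$ is a ring, or injectively map elements of $\mathcal{X}_i$ into some algebraic structure. In our subsequent discussions, we assume that $\mathcal{X}_i$ is mapped into a finite ring $\mathfrak{R}_i$ of order at least $|\mathcal{X}_i|$ by some injection $\Phi_i$. Hence, $\mathcal{X}_i$ can simply be treated as a subset $\Phi_i(\mathcal{X}_i) \subseteq \mathfrak{R}_i$ for a fixed $\Phi_i$. When required, $\Phi_i$ can also be selected to obtain desired outcomes.

To facilitate our discussion, the following notation is used. For $X_S \sim p$, we denote the marginal of $p$ with respect to $X_T$ ($\emptyset \neq T \subseteq S$) by $p_{X_T}$, i.e. $X_T \sim p_{X_T}$, and define $H(p_{X_T})$ to be $H(X_T)$. In addition,
\[ \mathscr{M}(\mathcal{X}_S, \mathfrak{R}_S) := \left\{ (\Phi_1, \Phi_2, \cdots, \Phi_s) \,\middle|\, \Phi_i : \mathcal{X}_i \to \mathfrak{R}_i \text{ is injective}, \forall\, i \in S \right\} \]
($|\mathfrak{R}_i| \geq |\mathcal{X}_i|$ is implicitly assumed), and $\Phi(x_T) := \prod_{i \in T} \Phi_i(x_i)$ for any $\Phi \in \mathscr{M}(\mathcal{X}_S, \mathfrak{R}_S)$ and $x_T \in \mathcal{X}_T$.

For any $\Phi \in \mathscr{M}(\mathcal{X}_S, \mathfrak{R}_S)$, let
\[ \mathcal{R}_\Phi = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, \sum_{i \in T} \frac{R_i \log |\mathfrak{I}_i|}{\log |\mathfrak{R}_i|} > r(T, \mathfrak{I}_T), \forall\, \emptyset \neq T \subseteq S, \forall\, 0 \neq \mathfrak{I}_i \leq_l \mathfrak{R}_i \right\}, \tag{2.1.1} \]
where $r(T, \mathfrak{I}_T) = H(X_T | X_{T^c}) - H(Y_{\mathfrak{R}_T/\mathfrak{I}_T} | X_{T^c})$ and $Y_{\mathfrak{R}_T/\mathfrak{I}_T} = \Phi(X_T) + \mathfrak{I}_T$ is a random variable with sample space $\mathfrak{R}_T/\mathfrak{I}_T$ (a left module).

Theorem 2.1.1. $\mathcal{R}_\Phi$ is achievable with linear coding over the finite rings $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$. In exact terms, $\forall\, \epsilon > 0$, there exists $N_0 \in \mathbb{N}^+$, for all $n > N_0$, there exist linear encoders (left linear mappings to be more precise) $\phi_i : \Phi(\mathcal{X}_i)^n \to \mathfrak{R}_i^{k_i}$ ($i \in S$) and a decoder $\psi$, such that
\[ \Pr\left\{ \psi\left( \phi_1(\mathbf{X}_1), \phi_2(\mathbf{X}_2), \cdots, \phi_s(\mathbf{X}_s) \right) \neq (\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_s) \right\} < \epsilon, \]
where $\mathbf{X}_i = \left[ \Phi\left( X_i^{(1)} \right), \Phi\left( X_i^{(2)} \right), \cdots, \Phi\left( X_i^{(n)} \right) \right]^{tr}$, as long as
\[ \left( \frac{k_1 \log |\mathfrak{R}_1|}{n}, \frac{k_2 \log |\mathfrak{R}_2|}{n}, \cdots, \frac{k_s \log |\mathfrak{R}_s|}{n} \right) \in \mathcal{R}_\Phi. \]

Proof. The proof is given in Section 2.2.

The following is a concrete example helping to interpret this theorem.

Example 2.1.1.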
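Consider the single source scenario, where $X_1 \sim p$ and $\mathcal{X}_1 = \mathbb{Z}_6$, specified as follows.
\[ \begin{array}{c|cccccc} X_1 & 0 & 1 & 2 & 3 & 4 & 5 \\ \hline p(X_1) & 0.05 & 0.1 & 0.15 & 0.2 & 0.2 & 0.3 \end{array} \]
Obviously, $\mathbb{Z}_6$ contains 3 non-trivial ideals $\mathfrak{I}_1 = \{0, 3\}$, $\mathfrak{I}_2 = \{0, 2, 4\}$ and $\mathbb{Z}_6$. Meanwhile, $Y_{\mathbb{Z}_6/\mathfrak{I}_1}$ and $Y_{\mathbb{Z}_6/\mathfrak{I}_2}$ admit the distributions
\[ \begin{array}{c|ccc} Y_{\mathbb{Z}_6/\mathfrak{I}_1} & \mathfrak{I}_1 & 1 + \mathfrak{I}_1 & 2 + \mathfrak{I}_1 \\ \hline p(Y_{\mathbb{Z}_6/\mathfrak{I}_1}) & 0.25 & 0.3 & 0.45 \end{array} \quad \text{and} \quad \begin{array}{c|cc} Y_{\mathbb{Z}_6/\mathfrak{I}_2} & \mathfrak{I}_2 & 1 + \mathfrak{I}_2 \\ \hline p(Y_{\mathbb{Z}_6/\mathfrak{I}_2}) & 0.4 & 0.6 \end{array}, \]
respectively. In addition, $Y_{\mathbb{Z}_6/\mathbb{Z}_6} = \mathbb{Z}_6$ is a constant. Thus, by Theorem 2.1.1, rate $R_1$ is achievable if
\[ \frac{R_1 \log |\mathfrak{I}_1|}{\log |\mathbb{Z}_6|} = \frac{R_1 \log 2}{\log 6} > H(X_1) - H(Y_{\mathbb{Z}_6/\mathfrak{I}_1}) = 2.40869 - 1.53949 = 0.86920, \]
\[ \frac{R_1 \log |\mathfrak{I}_2|}{\log |\mathbb{Z}_6|} = \frac{R_1 \log 3}{\log 6} > H(X_1) - H(Y_{\mathbb{Z}_6/\mathfrak{I}_2}) = 2.40869 - 0.97095 = 1.43774 \]
and
\[ \frac{R_1 \log |\mathbb{Z}_6|}{\log |\mathbb{Z}_6|} = R_1 > H(X_1) - H(Y_{\mathbb{Z}_6/\mathbb{Z}_6}) = H(X_1) = 2.40869. \]
In other words,
\[ \mathcal{R} = \{R_1 \in \mathbb{R} \mid R_1 > \max\{2.24685, 2.34485, 2.40869\}\} = \{R_1 \in \mathbb{R} \mid R_1 > 2.40869 = H(X_1)\} \]
is achievable with linear coding over ring $\mathbb{Z}_6$. Obviously, $\mathcal{R}$ is just the Slepian–Wolf region $\mathcal{R}[X_1]$. Optimality is claimed.

Besides, we would like to point out that some of the inequalities defining (2.1.1) are not active for specific scenarios. Two classes of these scenarios are discussed in the following theorems.

Theorem 2.1.2. Suppose $\mathfrak{R}_i$ ($1 \leq i \leq s$) is a (finite) product ring $\prod_{l=1}^{k_i} \mathfrak{R}_{l,i}$ of finite rings $\mathfrak{R}_{l,i}$'s, and the sample space $\mathcal{X}_i$ satisfies $|\mathcal{X}_i| \leq |\mathfrak{R}_{l,i}|$ for all feasible $i$ and $l$. Given injections $\Phi_{l,i} : \mathcal{X}_i \to \mathfrak{R}_{l,i}$ and let $\Phi = (\Phi_1, \Phi_2, \cdots, \Phi_s)$, where $\Phi_i = \prod_{l=1}^{k_i} \Phi_{l,i}$ is defined as
\[ \Phi_i : x_i \mapsto \left( \Phi_{1,i}(x_i), \Phi_{2,i}(x_i), \cdots, \Phi_{k_i,i}(x_i) \right) \in \mathfrak{R}_i, \quad \forall\, x_i \in \mathcal{X}_i. \]
We have that
\[ \mathcal{R}_{\Phi,\mathrm{prod}} = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, \sum_{i \in T} \frac{R_i \log |\mathfrak{I}_i|}{\log |\mathfrak{R}_i|} > H(X_T | Y_{\mathfrak{R}_T/\mathfrak{I}_T}, X_{T^c}), \forall\, \emptyset \neq T \subseteq S, \forall\, \mathfrak{I}_i = \prod_{l=1}^{k_i} \mathfrak{I}_{l,i} \text{ with } 0 \neq \mathfrak{I}_{l,i} \leq_l \mathfrak{R}_{l,i} \right\}, \tag{2.1.2} \]
where $Y_{\mathfrak{R}_T/\mathfrak{I}_T} = \Phi(X_T) + \mathfrak{I}_T$, is achievable with linear coding over $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$. Moreover, $\mathcal{R}_\Phi \subseteq \mathcal{R}_{\Phi,\mathrm{prod}}$.

Proof. The proof is found in Section 2.2.

Let $\mathfrak{R}$ be a finite ring and
\[ \mathbb{M}_{L,\mathfrak{R},m} = \left\{ \begin{bmatrix} a_1 & 0 & \cdots & 0 \\ a_2 & a_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_m & a_{m-1} & \cdots & a_1 \end{bmatrix} \,\middle|\, a_1, a_2, \cdots, a_m \in \mathfrak{R} \right\}, \]
where $m$ is a positive integer.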
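As a quick sanity check on this definition, the sketch below enumerates the lower-triangular Toeplitz matrices over a small ring and verifies by brute force that the set is closed under matrix addition and multiplication and contains the identity. The choice $\mathfrak{R} = \mathbb{Z}_4$ and $m = 3$ is an assumption made only to keep the enumeration small, and the helper names `toeplitz`, `matmul` and `matadd` are hypothetical.

```python
from itertools import product


def toeplitz(a, q):
    """Lower-triangular Toeplitz matrix over Z_q with first column a = (a_1, ..., a_m)."""
    m = len(a)
    return tuple(
        tuple(a[i - j] % q if i >= j else 0 for j in range(m)) for i in range(m)
    )


def matmul(A, B, q):
    """Matrix product over Z_q."""
    m = len(A)
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(m)) % q for j in range(m))
        for i in range(m)
    )


def matadd(A, B, q):
    """Matrix sum over Z_q."""
    return tuple(
        tuple((x + y) % q for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(A, B)
    )


q, m = 4, 3  # illustrative choice: R = Z_4, 3 x 3 matrices
M = {toeplitz(a, q) for a in product(range(q), repeat=m)}

# Closure under both operations, checked over all pairs.
closed = all(
    matmul(A, B, q) in M and matadd(A, B, q) in M for A in M for B in M
)
identity = toeplitz((1,) + (0,) * (m - 1), q)

print(len(M), closed, identity in M)
```

A brute-force check of this kind covers only the chosen instance; it is meant as an illustration of the structure, not as a proof for general $\mathfrak{R}$ and $m$.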
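It is easy to verify that $\mathbb{M}_{L,\mathfrak{R},m}$ is a ring with respect to matrix operations. Moreover, $\mathfrak{I}$ is a left ideal of $\mathbb{M}_{L,\mathfrak{R},m}$ if and only if
\[ \mathfrak{I} = \left\{ \begin{bmatrix} a_1 & 0 & \cdots & 0 \\ a_2 & a_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_m & a_{m-1} & \cdots & a_1 \end{bmatrix} \,\middle|\, \begin{array}{l} a_j \in \mathfrak{I}_j \leq_l \mathfrak{R}, \forall\, 1 \leq j \leq m; \\ \mathfrak{I}_j \subseteq \mathfrak{I}_{j+1}, \forall\, 1 \leq j < m \end{array} \right\}. \]
Define $\mathcal{O}(\mathbb{M}_{L,\mathfrak{R},m})$ to be the set of all left ideals of the form
\[ \left\{ \begin{bmatrix} a_1 & 0 & \cdots & 0 \\ a_2 & a_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_m & a_{m-1} & \cdots & a_1 \end{bmatrix} \,\middle|\, \begin{array}{l} a_j \in \mathfrak{I}_j \leq_l \mathfrak{R}, \forall\, 1 \leq j \leq m; \\ \mathfrak{I}_j \subseteq \mathfrak{I}_{j+1}, \forall\, 1 \leq j < m; \\ \mathfrak{I}_i = 0 \text{ for some } 1 \leq i \leq m \end{array} \right\}. \]

Theorem 2.1.3. Let $\mathfrak{R}_i$ ($1 \leq i \leq s$) be a finite ring such that $|\mathcal{X}_i| \leq |\mathfrak{R}_i|$. For any injections $\Phi'_i : \mathcal{X}_i \to \mathfrak{R}_i$, let $\Phi = (\Phi_1, \Phi_2, \cdots, \Phi_s)$, where $\Phi_i : \mathcal{X}_i \to \mathbb{M}_{L,\mathfrak{R}_i,m_i}$ is defined as
\[ \Phi_i : x_i \mapsto \begin{bmatrix} \Phi'_i(x_i) & 0 & \cdots & 0 \\ \Phi'_i(x_i) & \Phi'_i(x_i) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \Phi'_i(x_i) & \Phi'_i(x_i) & \cdots & \Phi'_i(x_i) \end{bmatrix}, \quad \forall\, x_i \in \mathcal{X}_i. \]
We have that
\[ \mathcal{R}_{\Phi,m} = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, \sum_{i \in T} \frac{R_i \log |\mathfrak{I}_i|}{\log |\mathfrak{R}_i|} > H(X_T | Y_{\mathfrak{R}_T/\mathfrak{I}_T}, X_{T^c}), \forall\, \emptyset \neq T \subseteq S, \forall\, \mathfrak{I}_i \leq_l \mathbb{M}_{L,\mathfrak{R}_i,m_i} \text{ and } \mathfrak{I}_i \notin \mathcal{O}(\mathbb{M}_{L,\mathfrak{R}_i,m_i}) \right\}, \tag{2.1.3} \]
where $Y_{\mathfrak{R}_T/\mathfrak{I}_T} = \Phi(X_T) + \mathfrak{I}_T$, is achievable with linear coding over $\mathbb{M}_{L,\mathfrak{R}_1,m_1}, \mathbb{M}_{L,\mathfrak{R}_2,m_2}, \cdots, \mathbb{M}_{L,\mathfrak{R}_s,m_s}$. Moreover, $\mathcal{R}_\Phi \subseteq \mathcal{R}_{\Phi,m}$.

Proof. The proof is found in Section 2.2.

Remark 2.1. The difference between (2.1.1), (2.1.2) and (2.1.3) lies in their restrictions defining $\mathfrak{I}_i$'s, respectively, as highlighted in the proofs given in Section 2.2.

Remark 2.2. Without much effort, one can see that $\mathcal{R}_\Phi$ ($\mathcal{R}_{\Phi,\mathrm{prod}}$ and $\mathcal{R}_{\Phi,m}$, resp.) in Theorem 2.1.1 (Theorem 2.1.2 and Theorem 2.1.3, resp.) depends on $\Phi$ via random variables $Y_{\mathfrak{R}_T/\mathfrak{I}_T}$'s whose distributions are determined by $\Phi$. For each $i \in S$, there exist $\dfrac{|\mathfrak{R}_i|!}{(|\mathfrak{R}_i| - |\mathcal{X}_i|)!}$ distinct injections from $\mathcal{X}_i$ to a ring $\mathfrak{R}_i$ of order at least $|\mathcal{X}_i|$. Let $\mathrm{cov}(A)$ be the convex hull of a set $A \subseteq \mathbb{R}^s$. By a straightforward time sharing argument, we have that
\[ \mathcal{R}_l = \mathrm{cov}\left( \bigcup_{\Phi \in \mathscr{M}(\mathcal{X}_S, \mathfrak{R}_S)} \mathcal{R}_\Phi \right) \tag{2.1.4} \]
is achievable with linear coding over $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$.

Remark 2.3. From Theorem 2.3.1, one will see that (2.1.1) and (2.1.4) are the same when all the rings are fields. Actually, both are identical to the Slepian–Wolf region.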
However, (2.1.4) can be strictly larger than (2.1.1) (see Section 2.3), when not all the rings are fields. This implies that, in order to achieve the desired rate, a suitable injection is required. Nevertheless, be reminded that taking the convex hull in (2.1.4) is not always needed for optimality as shown in Example 2.1.1. A more sophisticated elaboration on this issue is found in Section 2.3.

The rest of this section provides key supporting lemmata and concepts used to prove Theorem 2.1.1, Theorem 2.1.2 and Theorem 2.1.3. The final proofs are presented in Section 2.2.

Lemma 2.1.1. Let $\mathbf{x}, \mathbf{y} \in \mathfrak{R}^n$ be two distinct sequences, where $\mathfrak{R}$ is a finite ring, and assume that $\mathbf{y} - \mathbf{x} = (a_1, a_2, \cdots, a_n)^{tr}$. If $f : \mathfrak{R}^n \to \mathfrak{R}^k$ is a random linear mapping chosen uniformly at random, i.e. generate the $k \times n$ coefficient matrix $\mathbf{A}$ of $f$ by independently choosing each entry of $\mathbf{A}$ from $\mathfrak{R}$ uniformly at random, then
\[ \Pr\{f(\mathbf{x}) = f(\mathbf{y})\} = |\mathfrak{I}|^{-k}, \]
where $\mathfrak{I} = \langle a_1, a_2, \cdots, a_n \rangle_l$.

Proof. Let $f = (f_1, f_2, \cdots, f_k)^{tr}$, where $f_i : \mathfrak{R}^n \to \mathfrak{R}$ is a random linear function. Then
\[ \Pr\{f(\mathbf{x}) = f(\mathbf{y})\} = \Pr\left\{ \bigcap_{i=1}^{k} \{f_i(\mathbf{x}) = f_i(\mathbf{y})\} \right\} = \prod_{i=1}^{k} \Pr\{f_i(\mathbf{x} - \mathbf{y}) = 0\}, \]
since the $f_i$'s are independent from each other. The statement follows from Lemma 1.1.1 which assures that $\Pr\{f_i(\mathbf{x} - \mathbf{y}) = 0\} = |\mathfrak{I}|^{-1}$.

Remark 2.4. In Lemma 2.1.1, if $\mathfrak{R}$ is a field and $\mathbf{x} \neq \mathbf{y}$, then $\mathfrak{I} = \mathfrak{R}$ because every non-zero $a_i$ is a unit. Thus, $\Pr\{f(\mathbf{x}) = f(\mathbf{y})\} = |\mathfrak{R}|^{-k}$.

Definition 2.1.1 (cf. [Yeu08]). Let $X \sim p_X$ be a discrete random variable with sample space $\mathcal{X}$. The set $\mathcal{T}_\epsilon(n, X)$ of strongly $\epsilon$-typical sequences of length $n$ with respect to $X$ is defined to be
\[ \left\{ \mathbf{x} \in \mathcal{X}^n \,\middle|\, \left| \frac{N(x; \mathbf{x})}{n} - p_X(x) \right| \leq \epsilon, \forall\, x \in \mathcal{X} \right\}, \]
where $N(x; \mathbf{x})$ is the number of occurrences of $x$ in the sequence $\mathbf{x}$.

The notation $\mathcal{T}_\epsilon(n, X)$ is sometimes replaced by $\mathcal{T}_\epsilon$ when the length $n$ and the random variable $X$ referred to are clear from the context.

Now we conclude this section with the following lemma. It is a crucial part for our proofs of the achievability theorems.
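It generalizes the classic conditional typicality lemma [CT06, Theorem 15.2.2], yet at the same time distinguishes our argument from the one for the field version.

Lemma 2.1.2. Let $(X_1, X_2) \sim p$ be a jointly random variable whose sample space is a finite ring $\mathfrak{R} = \mathfrak{R}_1 \times \mathfrak{R}_2$. For any $\eta > 0$, there exists $\epsilon > 0$, such that, $\forall\, (\mathbf{x}_1, \mathbf{x}_2)^{tr} \in \mathcal{T}_\epsilon(n, (X_1, X_2))$ and $\forall\, \mathfrak{I} \leq_l \mathfrak{R}_1$,
\[ |D_\epsilon(\mathbf{x}_1, \mathfrak{I} | \mathbf{x}_2)| < 2^{n\left[ H(X_1 | Y_{\mathfrak{R}_1/\mathfrak{I}}, X_2) + \eta \right]}, \tag{2.1.5} \]
where
\[ D_\epsilon(\mathbf{x}_1, \mathfrak{I} | \mathbf{x}_2) = \left\{ (\mathbf{y}, \mathbf{x}_2)^{tr} \in \mathcal{T}_\epsilon \,\middle|\, \mathbf{y} - \mathbf{x}_1 \in \mathfrak{I}^n \right\} \]
and $Y_{\mathfrak{R}_1/\mathfrak{I}} = X_1 + \mathfrak{I}$ is a random variable with sample space $\mathfrak{R}_1/\mathfrak{I}$.

First Proof. Let $\mathfrak{R}_1/\mathfrak{I} = \{a_1 + \mathfrak{I}, a_2 + \mathfrak{I}, \cdots, a_m + \mathfrak{I}\}$, where $m = |\mathfrak{R}_1|/|\mathfrak{I}|$. For arbitrary $\epsilon > 0$ and integer $n$, without loss of generality, we can assume that
\[ \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix} = \begin{bmatrix} x_1^{(1)}, x_1^{(2)}, \cdots, x_1^{(n)} \\ x_2^{(1)}, x_2^{(2)}, \cdots, x_2^{(n)} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_{1,1}, \mathbf{x}_{1,2}, \cdots, \mathbf{x}_{1,m} \\ \mathbf{x}_{2,1}, \mathbf{x}_{2,2}, \cdots, \mathbf{x}_{2,m} \end{bmatrix} \]
admits a structure satisfying
\[ \begin{bmatrix} \mathbf{x}_{1,j} \\ \mathbf{x}_{2,j} \end{bmatrix} = \begin{bmatrix} x_1^{\left( \sum_{k=0}^{j-1} c_k + 1 \right)}, x_1^{\left( \sum_{k=0}^{j-1} c_k + 2 \right)}, \cdots, x_1^{\left( \sum_{k=0}^{j} c_k \right)} \\ x_2^{\left( \sum_{k=0}^{j-1} c_k + 1 \right)}, x_2^{\left( \sum_{k=0}^{j-1} c_k + 2 \right)}, \cdots, x_2^{\left( \sum_{k=0}^{j} c_k \right)} \end{bmatrix} \in \begin{bmatrix} a_j + \mathfrak{I} \\ \mathfrak{R}_2 \end{bmatrix}^{c_j}, \]
where $c_0 = 0$ and $c_j = \sum_{r \in (a_j + \mathfrak{I}) \times \mathfrak{R}_2} N\left( r, (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)$, $1 \leq j \leq m$. For any $\mathbf{y} = \left( y^{(1)}, y^{(2)}, \cdots, y^{(n)} \right)$ with $(\mathbf{y}, \mathbf{x}_2)^{tr} \in D_\epsilon(\mathbf{x}_1, \mathfrak{I} | \mathbf{x}_2)$, we have $y^{(i)} - x_1^{(i)} \in \mathfrak{I}$, $\forall\, 1 \leq i \leq n$, by definition. Thus, $y^{(i)}$ and $x_1^{(i)}$ belong to the same coset, i.e.
\[ y^{\left( \sum_{k=0}^{j-1} c_k + 1 \right)}, y^{\left( \sum_{k=0}^{j-1} c_k + 2 \right)}, \cdots, y^{\left( \sum_{k=0}^{j} c_k \right)} \in a_j + \mathfrak{I}, \quad \forall\, 1 \leq j \leq m. \]
Furthermore, $\forall\, r \in \mathfrak{R}$,
\[ \left\{ \begin{array}{l} \left| N\left( r, (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)/n - p(r) \right| \leq \epsilon \\ \left| N\left( r, (\mathbf{y}, \mathbf{x}_2)^{tr} \right)/n - p(r) \right| \leq \epsilon \end{array} \right. \Longrightarrow \left| \frac{N\left( r, (\mathbf{y}, \mathbf{x}_2)^{tr} \right)}{n} - \frac{N\left( r, (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)}{n} \right| \leq 2\epsilon. \]
Let
\[ \mathbf{z}_j = \begin{bmatrix} \mathbf{x}_{1,j} \\ \mathbf{x}_{2,j} \end{bmatrix} \quad \text{and} \quad \mathbf{z}'_j = \begin{bmatrix} y^{\left( \sum_{k=0}^{j-1} c_k + 1 \right)}, y^{\left( \sum_{k=0}^{j-1} c_k + 2 \right)}, \cdots, y^{\left( \sum_{k=0}^{j} c_k \right)} \\ x_2^{\left( \sum_{k=0}^{j-1} c_k + 1 \right)}, x_2^{\left( \sum_{k=0}^{j-1} c_k + 2 \right)}, \cdots, x_2^{\left( \sum_{k=0}^{j} c_k \right)} \end{bmatrix} \in \begin{bmatrix} a_j + \mathfrak{I} \\ \mathfrak{R}_2 \end{bmatrix}^{c_j}. \]
We have that $\mathbf{z}'_j$ is a strongly $2\epsilon$-typical sequence of length $c_j$ with respect to the random variable $Z_j \sim p_j = \mathrm{emp}(\mathbf{z}_j)$ (the empirical distribution of $\mathbf{z}_j$). The sample space of $Z_j$ is $(a_j + \mathfrak{I}) \times \mathfrak{R}_2$.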
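Therefore, the number of all possible $\mathbf{z}'_j$'s (namely, all elements $\begin{bmatrix} \mathbf{w}_1 \\ \mathbf{w}_2 \end{bmatrix} \in \mathcal{T}_{2\epsilon}(c_j, Z_j)$ such that $\mathbf{w}_2 = \mathbf{x}_{2,j}$) is upper bounded by $2^{c_j\left[ H(p_j) - H(p_{j,2}) + 2\epsilon \right]}$, where $p_{j,2}$ is the marginal of $p_j$ with respect to the second coordinate, by [Yeu08, Theorem 6.10]. Consequently,
\[ |D_\epsilon(\mathbf{x}_1, \mathfrak{I} | \mathbf{x}_2)| \leq 2^{\sum_{j=1}^{m} c_j\left[ H(p_j) - H(p_{j,2}) + 2\epsilon \right]}. \tag{2.1.6} \]
Direct computation yields
\[ \frac{1}{n} \sum_{j=1}^{m} c_j H(p_j) = \sum_{j=1}^{m} \frac{c_j}{n} \sum_{r \in (a_j + \mathfrak{I}) \times \mathfrak{R}_2} \frac{N\left( r, (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)}{c_j} \log \frac{c_j}{N\left( r, (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)} \]
\[ = \sum_{r \in \mathfrak{R}} \frac{N\left( r, (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)}{n} \log \frac{n}{N\left( r, (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)} - \sum_{j=1}^{m} \frac{c_j}{n} \log \frac{n}{c_j} \]
and
\[ \frac{1}{n} \sum_{j=1}^{m} c_j H(p_{j,2}) = \sum_{j=1}^{m} \frac{c_j}{n} \sum_{r_2 \in \mathfrak{R}_2} \left[ \frac{\sum_{r_1 \in a_j + \mathfrak{I}} N\left( (r_1, r_2), (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)}{c_j} \log \frac{c_j}{\sum_{r_1 \in a_j + \mathfrak{I}} N\left( (r_1, r_2), (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)} \right] \]
\[ = \sum_{j=1}^{m} \sum_{r_2 \in \mathfrak{R}_2} \frac{\sum_{r_1 \in a_j + \mathfrak{I}} N\left( (r_1, r_2), (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)}{n} \log \frac{n}{\sum_{r_1 \in a_j + \mathfrak{I}} N\left( (r_1, r_2), (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)} - \sum_{j=1}^{m} \frac{c_j}{n} \log \frac{n}{c_j}. \]
Since the entropy $H$ is a continuous function, there exists some small $0 < \epsilon < \eta/4$, such that
\[ \left| \sum_{r \in \mathfrak{R}} \frac{N\left( r, (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)}{n} \log \frac{n}{N\left( r, (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)} - H(X_1, X_2) \right| < \eta/8, \]
\[ \left| \sum_{j=1}^{m} \frac{c_j}{n} \log \frac{n}{c_j} - H(Y_{\mathfrak{R}_1/\mathfrak{I}}) \right| < \eta/8 \quad \text{and} \]
\[ \left| \sum_{j=1}^{m} \sum_{r_2 \in \mathfrak{R}_2} \frac{\sum_{r_1 \in a_j + \mathfrak{I}} N\left( (r_1, r_2), (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)}{n} \log \frac{n}{\sum_{r_1 \in a_j + \mathfrak{I}} N\left( (r_1, r_2), (\mathbf{x}_1, \mathbf{x}_2)^{tr} \right)} - H(X_2, Y_{\mathfrak{R}_1/\mathfrak{I}}) \right| < \eta/8. \]
Therefore,
\[ \frac{1}{n} \sum_{j=1}^{m} c_j H(p_j) < H(X_1, X_2) - H(Y_{\mathfrak{R}_1/\mathfrak{I}}) + \eta/4, \tag{2.1.7} \]
\[ \frac{1}{n} \sum_{j=1}^{m} c_j H(p_{j,2}) > H(X_2, Y_{\mathfrak{R}_1/\mathfrak{I}}) - H(Y_{\mathfrak{R}_1/\mathfrak{I}}) - \eta/4, \tag{2.1.8} \]
where (2.1.7) and (2.1.8) are guaranteed for small $0 < \epsilon < \eta/4$. Substituting (2.1.7) and (2.1.8) into (2.1.6), (2.1.5) follows.

Second Proof. Define the mapping $\Gamma : \mathfrak{R}_1 \to \mathfrak{R}_1/\mathfrak{I}$ by $\Gamma : x_1 \mapsto x_1 + \mathfrak{I}$, $\forall\, x_1 \in \mathfrak{R}_1$. Assume that $\mathbf{x}_1 = \left( x_1^{(1)}, x_1^{(2)}, \cdots, x_1^{(n)} \right)$, and let
\[ \overline{\mathbf{y}} = \left( \Gamma\left( x_1^{(1)} \right), \Gamma\left( x_1^{(2)} \right), \cdots, \Gamma\left( x_1^{(n)} \right) \right). \]
By definition, $\forall\, (\mathbf{y}, \mathbf{x}_2)^{tr} \in D_\epsilon(\mathbf{x}_1, \mathfrak{I} | \mathbf{x}_2)$, where $\mathbf{y} = \left( y^{(1)}, y^{(2)}, \cdots, y^{(n)} \right)$,
\[ \left( \Gamma\left( y^{(1)} \right), \Gamma\left( y^{(2)} \right), \cdots, \Gamma\left( y^{(n)} \right) \right) = \overline{\mathbf{y}}. \]
Obviously, $(\mathbf{y}, \overline{\mathbf{y}}, \mathbf{x}_2)^{tr}$ is a function of $(\mathbf{y}, \mathbf{x}_2)^{tr}$. Thus, $(\mathbf{y}, \overline{\mathbf{y}}, \mathbf{x}_2)^{tr} \in \mathcal{T}_\epsilon(n, (X_1, Y_{\mathfrak{R}_1/\mathfrak{I}}, X_2))$ by [Yeu08, Theorem 6.8].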
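Therefore, for a fixed $(\overline{\mathbf{y}}, \mathbf{x}_2)^{tr} \in \mathcal{T}_\epsilon$, the number of strongly $\epsilon$-typical sequences $\mathbf{y}$ such that $(\mathbf{y}, \overline{\mathbf{y}}, \mathbf{x}_2)^{tr}$ is strongly $\epsilon$-typical is strictly upper bounded by $2^{n\left[ H(X_1 | Y_{\mathfrak{R}_1/\mathfrak{I}}, X_2) + \eta \right]}$ if $n$ is large enough and $\epsilon$ is small. Since
\[ |D_\epsilon(\mathbf{x}_1, \mathfrak{I} | \mathbf{x}_2)| = \left| \left\{ (\mathbf{y}, \overline{\mathbf{y}}, \mathbf{x}_2)^{tr} \in \mathcal{T}_\epsilon \,\middle|\, \mathbf{y} - \mathbf{x}_1 \in \mathfrak{I}^n \right\} \right|, \]
we conclude that $|D_\epsilon(\mathbf{x}_1, \mathfrak{I} | \mathbf{x}_2)| < 2^{n\left[ H(X_1 | Y_{\mathfrak{R}_1/\mathfrak{I}}, X_2) + \eta \right]}$.

Remark 2.5. The second proof was suggested by an anonymous reviewer for our paper [HS13b]. The mechanisms behind the first proof and the second one are in fact very different. However, this is not quite clear for i.i.d. scenarios. For non-i.i.d. scenarios, the results proved by these two approaches diverge. Although the technique from the first proof is more complicated, it provides results with its own advantages. More details on the differences are given in Chapter 5.

2.2 Proof of the Achievability Theorems

2.2.1 Proof of Theorem 2.1.1

As mentioned, $\mathcal{X}_i$ can be seen as a subset of $\mathfrak{R}_i$ for a fixed $\Phi = (\Phi_1, \cdots, \Phi_s)$. In this section, we assume that $X_i$ has sample space $\mathfrak{R}_i$, which makes sense since $\Phi_i$ is injective.

Let $\mathbf{R} = (R_1, R_2, \cdots, R_s)$ and $k_i = \left\lfloor \dfrac{nR_i}{\log |\mathfrak{R}_i|} \right\rfloor$, $\forall\, i \in S$, where $n$ is the length of the data sequences. If $\mathbf{R} \in \mathcal{R}_\Phi$, then $\sum_{i \in T} \dfrac{R_i \log |\mathfrak{I}_i|}{\log |\mathfrak{R}_i|} > r(T, \mathfrak{I}_T)$ (this implies that $\dfrac{1}{n} \sum_{i \in T} k_i \log |\mathfrak{I}_i| - r(T, \mathfrak{I}_T) > 2\eta$ for some small constant $\eta > 0$ and large enough $n$), $\forall\, \emptyset \neq T \subseteq S$, $\forall\, 0 \neq \mathfrak{I}_i \leq_l \mathfrak{R}_i$. We claim that $\mathbf{R}$ is achievable by linear coding over $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$.

Encoding: For every $i \in S$, randomly generate a $k_i \times n$ matrix $\mathbf{A}_i$ based on a uniform distribution, i.e. independently choose each entry of $\mathbf{A}_i$ uniformly at random from $\mathfrak{R}_i$. Define a linear encoder $\phi_i : \mathfrak{R}_i^n \to \mathfrak{R}_i^{k_i}$ such that
\[ \phi_i : \mathbf{x} \mapsto \mathbf{A}_i \mathbf{x}, \quad \forall\, \mathbf{x} \in \mathfrak{R}_i^n. \]
Obviously the coding rate of this encoder is
\[ \frac{1}{n} \log |\phi_i(\mathfrak{R}_i^n)| \leq \frac{k_i}{n} \log |\mathfrak{R}_i| = \frac{\log |\mathfrak{R}_i|}{n} \left\lfloor \frac{nR_i}{\log |\mathfrak{R}_i|} \right\rfloor \leq R_i. \]

Decoding: Subject to observing $\mathbf{y}_i \in \mathfrak{R}_i^{k_i}$ ($i \in S$) from the $i$th encoder, the decoder claims that $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_s)^{tr} \in \prod_{i=1}^{s} \mathfrak{R}_i^n$ is the array of the encoded data sequences, if and only if:
1. $\mathbf{x} \in \mathcal{T}_\epsilon$; and
2. $\forall\, \mathbf{x}' = (\mathbf{x}'_1, \mathbf{x}'_2, \cdots, \mathbf{x}'_s)^{tr} \in \mathcal{T}_\epsilon$, if $\mathbf{x}' \neq \mathbf{x}$, then $\phi_j(\mathbf{x}'_j) \neq \mathbf{y}_j$, for some $j$.

Error: Assume that $\mathbf{X}_i \in \mathfrak{R}_i^n$ ($i \in S$) is the original data sequence generated by the $i$th source. It is readily seen that an error occurs if and only if one of the following events occurs:
$E_1$: $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_s)^{tr} \notin \mathcal{T}_\epsilon$;
$E_2$: There exists $\mathbf{X} \neq (\mathbf{x}'_1, \mathbf{x}'_2, \cdots, \mathbf{x}'_s)^{tr} \in \mathcal{T}_\epsilon$, such that $\phi_i(\mathbf{x}'_i) = \phi_i(\mathbf{X}_i)$, $\forall\, i \in S$.

Error Probability: By the joint asymptotic equipartition principle (AEP) [Yeu08, Theorem 6.9], $\Pr\{E_1\} \to 0$, $n \to \infty$.

Additionally, for $\emptyset \neq T \subseteq S$, let
\[ D_\epsilon(\mathbf{X}; T) = \left\{ (\mathbf{x}'_1, \mathbf{x}'_2, \cdots, \mathbf{x}'_s)^{tr} \in \mathcal{T}_\epsilon \,\middle|\, \mathbf{x}'_i \neq \mathbf{X}_i, \forall\, i \in T \text{ and } \mathbf{x}'_i = \mathbf{X}_i, \forall\, i \in T^c \right\}. \]
We have
\[ D_\epsilon(\mathbf{X}; T) = \bigcup_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}_T} \left[ D_\epsilon(\mathbf{X}_T, \mathfrak{I} | \mathbf{X}_{T^c}) \setminus \{\mathbf{X}\} \right], \tag{2.2.1} \]
where $\mathbf{X}_T = \prod_{i \in T} \mathbf{X}_i$ and $\mathbf{X}_{T^c} = \prod_{i \in T^c} \mathbf{X}_i$, since $\mathfrak{I}$ goes over all possible non-trivial left ideals. Consequently,
\[ \Pr\{E_2 | E_1^c\} = \sum_{(\mathbf{x}'_1, \cdots, \mathbf{x}'_s)^{tr} \in \mathcal{T}_\epsilon \setminus \{\mathbf{X}\}} \prod_{i \in S} \Pr\{\phi_i(\mathbf{x}'_i) = \phi_i(\mathbf{X}_i) | E_1^c\} \]
\[ = \sum_{\emptyset \neq T \subseteq S} \sum_{(\mathbf{x}'_1, \cdots, \mathbf{x}'_s)^{tr} \in D_\epsilon(\mathbf{X}; T)} \prod_{i \in T} \Pr\{\phi_i(\mathbf{x}'_i) = \phi_i(\mathbf{X}_i) | E_1^c\} \tag{2.2.2} \]
\[ \leq \sum_{\emptyset \neq T \subseteq S} \sum_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}_T} \sum_{(\mathbf{x}'_1, \cdots, \mathbf{x}'_s)^{tr} \in D_\epsilon(\mathbf{X}_T, \mathfrak{I} | \mathbf{X}_{T^c}) \setminus \{\mathbf{X}\}} \prod_{i \in T} \Pr\{\phi_i(\mathbf{x}'_i) = \phi_i(\mathbf{X}_i) | E_1^c\} \tag{2.2.3} \]
\[ < \sum_{\emptyset \neq T \subseteq S} \sum_{0 \neq \prod_{i \in T} \mathfrak{I}_i \leq_l \mathfrak{R}_T} \left[ 2^{n[r(T, \mathfrak{I}_T) + \eta]} - 1 \right] \prod_{i \in T} |\mathfrak{I}_i|^{-k_i} \tag{2.2.4} \]
\[ < (2^s - 1)\left( 2^{|\mathfrak{R}_S|} - 2 \right) \times \max_{\emptyset \neq T \subseteq S,\ 0 \neq \prod_{i \in T} \mathfrak{I}_i \leq_l \mathfrak{R}_T} 2^{-n\left[ \frac{1}{n} \sum_{i \in T} k_i \log |\mathfrak{I}_i| - [r(T, \mathfrak{I}_T) + \eta] \right]}, \tag{2.2.5} \]
where
(2.2.2) is from the fact that $\mathcal{T}_\epsilon \setminus \{\mathbf{X}\} = \coprod_{\emptyset \neq T \subseteq S} D_\epsilon(\mathbf{X}; T)$ (disjoint union);
(2.2.3) follows from (2.2.1) by Boole's inequality [Boo10, Fré35];
(2.2.4) is from Lemma 2.1.1 and Lemma 2.1.2, as well as the fact that every left ideal of $\mathfrak{R}_T$ is a Cartesian product of some left ideals $\mathfrak{I}_i$ of $\mathfrak{R}_i$, $i \in T$ (see Proposition 1.1.3).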
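At the same time, $\epsilon$ is required to be sufficiently small;
(2.2.5) is due to the facts that the number of non-empty subsets of $S$ is $2^s - 1$ and the number of non-trivial left ideals of the finite ring $\mathfrak{R}_T$ is less than $2^{|\mathfrak{R}_S|} - 1$, which is the number of non-empty subsets of $\mathfrak{R}_S$.

Thus, $\Pr\{E_2 | E_1^c\} \to 0$, when $n \to \infty$, from (2.2.5), since for sufficiently large $n$ and small $\epsilon$, $\dfrac{1}{n} \sum_{i \in T} k_i \log |\mathfrak{I}_i| - [r(T, \mathfrak{I}_T) + \eta] > \eta > 0$.

Therefore, $\Pr\{E_1 \cup E_2\} = \Pr\{E_1\} + \Pr\{E_1^c\} \Pr\{E_2 | E_1^c\} \to 0$ as $\epsilon \to 0$ and $n \to \infty$.

2.2.2 Proof of Theorem 2.1.2

The proof follows from almost the same argument as in proving Theorem 2.1.1, except that the performance analysis only focuses on sequences $(a_{i,1}, a_{i,2}, \cdots, a_{i,n}) \in \mathfrak{R}_i^n$ ($1 \leq i \leq s$) such that
\[ a_{i,j} = \left( \Phi_{1,i}\left( x_i^{(j)} \right), \Phi_{2,i}\left( x_i^{(j)} \right), \cdots, \Phi_{k_i,i}\left( x_i^{(j)} \right) \right) \in \prod_{l=1}^{k_i} \mathfrak{R}_{l,i} \]
for some $x_i^{(j)} \in \mathcal{X}_i$. Let $\mathbf{X}_i$, $\mathbf{Y}_i$ be any two such sequences satisfying $\mathbf{X}_i - \mathbf{Y}_i \in \mathfrak{I}_i^n$ for some $\mathfrak{I}_i \leq_l \mathfrak{R}_i$. Based on the special structure of $\mathbf{X}_i$ and $\mathbf{Y}_i$, it is easy to verify that
\[ \mathfrak{I}_i \neq 0 \Leftrightarrow \mathfrak{I}_i = \prod_{l=1}^{k_i} \mathfrak{I}_{l,i} \text{ and } 0 \neq \mathfrak{I}_{l,i} \leq_l \mathfrak{R}_{l,i}, \text{ for all } 1 \leq l \leq k_i \]
(this causes the difference between (2.1.1) and (2.1.2)). In addition, it is obvious that $\mathcal{R}_\Phi \subseteq \mathcal{R}_{\Phi,\mathrm{prod}}$ by their definitions.

2.2.3 Proof of Theorem 2.1.3

The proof is similar to that for Theorem 2.1.1, except that it only focuses on sequences $(a_{i,1}, a_{i,2}, \cdots, a_{i,n}) \in \mathbb{M}_{L,\mathfrak{R}_i,m_i}^n$ ($1 \leq i \leq s$) such that $a_{i,j} \in \mathbb{M}_{L,\mathfrak{R}_i,m_i}$ satisfies
\[ [a_{i,j}]_{u,v} = \begin{cases} a, & u \geq v; \\ 0, & \text{otherwise}, \end{cases} \]
for some $a \in \mathfrak{R}_i$. Let $\mathbf{X}_i$, $\mathbf{Y}_i$ be any two such sequences such that $\mathbf{X}_i - \mathbf{Y}_i \in \mathfrak{I}_i^n$ for some $\mathfrak{I}_i \leq_l \mathbb{M}_{L,\mathfrak{R}_i,m_i}$. It is easily seen that $\mathfrak{I}_i \neq 0$ if and only if $\mathfrak{I}_i \notin \mathcal{O}(\mathbb{M}_{L,\mathfrak{R}_i,m_i})$ (this causes the difference between (2.1.1) and (2.1.3)). In addition, it is obvious that $\mathcal{R}_\Phi \subseteq \mathcal{R}_{\Phi,m}$ by their definitions.

2.3 Optimality

Obviously, Theorem 2.1.1 specializes to its field counterpart if all rings considered are fields, as summarized in the following theorem.

Theorem 2.3.1.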
Region (2.1.1) is the Slepian–Wolf region if $\mathfrak{R}_i$ contains no proper non-trivial left ideal, equivalently$^1$, $\mathfrak{R}_i$ is a field, for all $i \in S$. As a consequence, region (2.1.4) is the Slepian–Wolf region.

$^1$ Equivalency does not necessarily hold for rngs.

Proof. In Theorem 2.1.1, random variable $Y_{\mathfrak{R}_T/\mathfrak{I}_T}$ admits a sample space of cardinality 1 for all $\emptyset \neq T \subseteq S$, since the only non-trivial left ideal of $\mathfrak{R}_i$ is itself for all feasible $i$. Thus, $0 = H(Y_{\mathfrak{R}_T/\mathfrak{I}_T}) \geq H(Y_{\mathfrak{R}_T/\mathfrak{I}_T} | X_{T^c}) \geq 0$. Consequently,
\[ \mathcal{R}_\Phi = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, \sum_{i \in T} R_i > H(X_T | X_{T^c}), \forall\, \emptyset \neq T \subseteq S \right\}, \]
which is the Slepian–Wolf region $\mathcal{R}[X_1, X_2, \cdots, X_s]$. Therefore, region (2.1.4) is also the Slepian–Wolf region.

If $\mathfrak{R}_i$ is a field, then obviously it has no proper non-trivial left (right) ideal. Conversely, $\forall\, 0 \neq a \in \mathfrak{R}_i$, $\langle a \rangle_l = \mathfrak{R}_i$ implies that $\exists\, 0 \neq b \in \mathfrak{R}_i$, such that $ba = 1$. Similarly, $\exists\, 0 \neq c \in \mathfrak{R}_i$, such that $cb = 1$. Moreover, $c = c \cdot 1 = cba = 1 \cdot a = a$. Hence, $ab = cb = 1$. $b$ is the inverse of $a$. By Wedderburn's little theorem, Theorem 1.1.1, $\mathfrak{R}_i$ is a field.

One important question to address is whether linear coding over finite non-field rings can be equally optimal for data compression. Hereby, we claim that, for any Slepian–Wolf scenario, there always exist linear encoders over some finite non-field rings which achieve the data compression limit. Therefore, optimality of linear coding over finite non-field rings for data compression is established in the sense of existence.

2.3.1 Existence Theorem I: Single Source

For any single source scenario, the assertion that there always exists a finite ring $\mathfrak{R}_1$, such that $\mathcal{R}_l$ is in fact the Slepian–Wolf region
\[ \mathcal{R}[X_1] = \{R_1 \in \mathbb{R} \mid R_1 > H(X_1)\}, \]
is equivalent to the existence of a finite ring $\mathfrak{R}_1$ and an injection $\Phi_1 : \mathcal{X}_1 \to \mathfrak{R}_1$, such that
\[ \max_{0 \neq \mathfrak{I}_1 \leq_l \mathfrak{R}_1} \frac{\log |\mathfrak{R}_1|}{\log |\mathfrak{I}_1|} \left[ H(X_1) - H(Y_{\mathfrak{R}_1/\mathfrak{I}_1}) \right] = H(X_1), \tag{2.3.1} \]
where $Y_{\mathfrak{R}_1/\mathfrak{I}_1} = \Phi_1(X_1) + \mathfrak{I}_1$.

Theorem 2.3.2. Let $\mathfrak{R}_1$ be a finite ring of order $|\mathfrak{R}_1| \geq |\mathcal{X}_1|$.
If $\mathfrak{R}_1$ contains one and only one proper non-trivial left ideal $\mathfrak{I}_0$ and $|\mathfrak{I}_0| = \sqrt{|\mathfrak{R}_1|}$, then region (2.1.4) coincides with the Slepian–Wolf region, i.e. there exists an injection $\Phi_1 : \mathscr{X}_1 \to \mathfrak{R}_1$, such that (2.3.1) holds.

Remark 2.6. Examples of such a non-field ring $\mathfrak{R}_1$ in the above theorem include
\[ M_{L,p} = \left\{ \begin{bmatrix} x & 0 \\ y & x \end{bmatrix} \,\middle|\, x, y \in \mathbb{Z}_p \right\} \]
($M_{L,p}$ is a ring with respect to matrix addition and multiplication) and $\mathbb{Z}_{p^2}$, where $p$ is any prime. For any single source scenario, one can always choose $\mathfrak{R}_1$ to be either $M_{L,p}$ or $\mathbb{Z}_{p^2}$. Consequently, optimality is attained.

Proof of Theorem 2.3.2. Notice that the random variable $Y_{\mathfrak{R}_1/\mathfrak{I}_0}$ depends on the injection $\Phi_1$, so does its entropy $H(Y_{\mathfrak{R}_1/\mathfrak{I}_0})$. Obviously $H(Y_{\mathfrak{R}_1/\mathfrak{R}_1}) = 0$, since the sample space of the random variable $Y_{\mathfrak{R}_1/\mathfrak{R}_1}$ contains only one element. Therefore,
\[ \frac{\log|\mathfrak{R}_1|}{\log|\mathfrak{R}_1|} \left[ H(X_1) - H(Y_{\mathfrak{R}_1/\mathfrak{R}_1}) \right] = H(X_1). \]
Consequently, (2.3.1) is equivalent to
\[ \frac{\log|\mathfrak{R}_1|}{\log|\mathfrak{I}_0|} \left[ H(X_1) - H(Y_{\mathfrak{R}_1/\mathfrak{I}_0}) \right] \leq H(X_1) \ \Leftrightarrow\ H(X_1) \leq 2 H(Y_{\mathfrak{R}_1/\mathfrak{I}_0}), \tag{2.3.2} \]
since $|\mathfrak{I}_0| = \sqrt{|\mathfrak{R}_1|}$. By Lemma 2.A.1, there exists injection $\tilde{\Phi}_1 : \mathscr{X}_1 \to \mathfrak{R}_1$ such that (2.3.2) holds if $\Phi_1 = \tilde{\Phi}_1$. The statement follows.

Up to isomorphism, there are exactly 4 distinct rings of order $p^2$ for a given prime $p$. They include 3 non-field rings, $\mathbb{Z}_p \times \mathbb{Z}_p$, $M_{L,p}$ and $\mathbb{Z}_{p^2}$, in addition to the field $\mathbb{F}_{p^2}$. It has been proved that, using linear encoders over the last three, optimality can always be achieved in the single source scenario. Actually, the same holds true for all multiple sources scenarios.

2.3.2 Existence Theorem II: Multiple Sources

Theorem 2.3.3. Let $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$ be $s$ finite rings with $|\mathfrak{R}_i| \geq |\mathscr{X}_i|$. If $\mathfrak{R}_i$ is isomorphic to either
1. a field, i.e. $\mathfrak{R}_i$ contains no proper non-trivial left (right) ideal; or
2. a ring containing one and only one proper non-trivial left ideal $\mathfrak{I}_{0i}$ and $|\mathfrak{I}_{0i}| = \sqrt{|\mathfrak{R}_i|}$,
for all feasible $i$, then (2.1.4) coincides with the Slepian–Wolf region $\mathcal{R}[X_1, X_2, \cdots, X_s]$.

Remark 2.7. It is obvious that Theorem 2.3.3 includes Theorem 2.3.2 as a special case.
In fact, its proof resembles the one of Theorem 2.3.2. Examples of $\mathfrak{R}_i$'s include all finite fields, $M_{L,p}$ and $\mathbb{Z}_{p^2}$, where $p$ is a prime. However, Theorem 2.3.3 does not guarantee that all rates, except the vertices, in the polytope of the Slepian–Wolf region are "directly" achievable for the multiple sources case. A time sharing scheme is required in our current proof. Nevertheless, all rates are "directly" achievable if all $\mathfrak{R}_i$'s are fields or if $s = 1$. This is partially the reason that the two theorems are stated separately.

Remark 2.8. Theorem 2.3.3 also includes Theorem 2.3.1 as a special case. However, Theorem 2.3.1 admits a simpler proof compared to the one for Theorem 2.3.3.

Proof of Theorem 2.3.3. It suffices to prove that, for any $\mathbf{R} = (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s$ satisfying
\[ R_i > H(X_i | X_{i-1}, X_{i-2}, \cdots, X_1), \quad \forall\, 1 \leq i \leq s, \]
$\mathbf{R} \in \mathcal{R}_\Phi$ for some set of injections $\Phi = (\Phi_1, \Phi_2, \cdots, \Phi_s)$, where $\Phi_i : \mathscr{X}_i \to \mathfrak{R}_i$. Let $\tilde{\Phi} = (\tilde{\Phi}_1, \tilde{\Phi}_2, \cdots, \tilde{\Phi}_s)$ be the set of injections, where, if
(i) $\mathfrak{R}_i$ is a field, $\tilde{\Phi}_i$ is any injection;
(ii) $\mathfrak{R}_i$ satisfies 2, $\tilde{\Phi}_i$ is the injection such that
\[ H(X_i | X_{i-1}, X_{i-2}, \cdots, X_1) \leq 2 H(Y_{\mathfrak{R}_i/\mathfrak{I}_{0i}} | X_{i-1}, X_{i-2}, \cdots, X_1), \]
when $\Phi_i = \tilde{\Phi}_i$. The existence of $\tilde{\Phi}_i$ is guaranteed by Lemma 2.A.1.

If $\Phi = \tilde{\Phi}$, then
\[ \frac{\log|\mathfrak{I}_i|}{\log|\mathfrak{R}_i|} H(X_i | X_{i-1}, \cdots, X_1) \geq H(X_i | X_{i-1}, \cdots, X_1) - H(Y_{\mathfrak{R}_i/\mathfrak{I}_i} | X_{i-1}, \cdots, X_1) = H(X_i | Y_{\mathfrak{R}_i/\mathfrak{I}_i}, X_{i-1}, \cdots, X_1), \]
for all $1 \leq i \leq s$ and $0 \neq \mathfrak{I}_i \leq_l \mathfrak{R}_i$. As a consequence,
\begin{align*}
\sum_{i \in T} \frac{R_i \log|\mathfrak{I}_i|}{\log|\mathfrak{R}_i|}
&> \sum_{i \in T} \frac{\log|\mathfrak{I}_i|}{\log|\mathfrak{R}_i|} H(X_i | X_{i-1}, X_{i-2}, \cdots, X_1) \\
&\geq \sum_{i \in T} H(X_i | Y_{\mathfrak{R}_i/\mathfrak{I}_i}, X_{i-1}, X_{i-2}, \cdots, X_1) \\
&\geq \sum_{i \in T} H(X_i | Y_{\mathfrak{R}_T/\mathfrak{I}_T}, X_{T^c}, X_{i-1}, X_{i-2}, \cdots, X_1) \\
&\geq H(X_T | Y_{\mathfrak{R}_T/\mathfrak{I}_T}, X_{T^c}) \\
&= H(X_T | X_{T^c}) - H(Y_{\mathfrak{R}_T/\mathfrak{I}_T} | X_{T^c}),
\end{align*}
for all $\emptyset \neq T \subseteq \{1, 2, \cdots, s\}$. Thus, $\mathbf{R} \in \mathcal{R}_{\tilde{\Phi}}$.

By Theorem 2.3.1, Theorem 2.3.2 and Theorem 2.3.3, we draw the conclusion that

Corollary 2.3.1.
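For any Slepian–Wolf scenario, there always exists a sequence of linear encoders over some finite rings (fields or non-field rings) which achieves the data compression limit, the Slepian–Wolf region.

In fact, LCoR can be optimal even for rings beyond those stated in the above theorems (see Example 2.1.1). We classify some of these scenarios in the remaining parts of this section.

2.3.3 Product Rings

Theorem 2.3.4. Let $\mathfrak{R}_{l,1}, \mathfrak{R}_{l,2}, \cdots, \mathfrak{R}_{l,s}$ ($l = 1, 2$) be a set of finite rings of equal size, and $\mathfrak{R}_i = \mathfrak{R}_{1,i} \times \mathfrak{R}_{2,i}$ for all feasible $i$. If the coding rate $\mathbf{R} \in \mathbb{R}^s$ is achievable with linear encoders over $\mathfrak{R}_{l,1}, \mathfrak{R}_{l,2}, \cdots, \mathfrak{R}_{l,s}$ ($l = 1, 2$), then $\mathbf{R}$ is achievable with linear encoders over $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$.

Proof. By definition, $\mathbf{R}$ is a convex combination of coding rates which are achieved by different linear encoding schemes over $\mathfrak{R}_{l,1}, \mathfrak{R}_{l,2}, \cdots, \mathfrak{R}_{l,s}$ ($l = 1, 2$), respectively. To be more precise, there exist $\mathbf{R}_1, \mathbf{R}_2, \cdots, \mathbf{R}_m \in \mathbb{R}^s$ and positive numbers $w_1, w_2, \cdots, w_m$ with $\sum_{j=1}^m w_j = 1$, such that $\mathbf{R} = \sum_{j=1}^m w_j \mathbf{R}_j$. Moreover, there exist injections $\Phi_l = (\Phi_{l,1}, \Phi_{l,2}, \cdots, \Phi_{l,s})$ ($l = 1, 2$), where $\Phi_{l,i} : \mathscr{X}_i \to \mathfrak{R}_{l,i}$, such that
\[ \mathbf{R}_j \in \mathcal{R}_{\Phi_l} = \Big\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\Big|\, \sum_{i \in T} \frac{R_i \log|\mathfrak{I}_{l,i}|}{\log|\mathfrak{R}_{l,i}|} > H(X_T | X_{T^c}) - H(Y_{\mathfrak{R}_{l,T}/\mathfrak{I}_{l,T}} | X_{T^c}), \ \forall\, \emptyset \neq T \subseteq S, \ \forall\, 0 \neq \mathfrak{I}_{l,i} \leq_l \mathfrak{R}_{l,i} \Big\}, \tag{2.3.3} \]
where $\mathfrak{R}_{l,T} = \prod_{i \in T} \mathfrak{R}_{l,i}$, $\mathfrak{I}_{l,T} = \prod_{i \in T} \mathfrak{I}_{l,i}$ and $Y_{\mathfrak{R}_{l,T}/\mathfrak{I}_{l,T}} = \Phi_l(X_T) + \mathfrak{I}_{l,T}$ is a random variable with sample space $\mathfrak{R}_{l,T}/\mathfrak{I}_{l,T}$. To show that $\mathbf{R}$ is achievable with linear encoders over $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$, it suffices to prove that $\mathbf{R}_j$ is achievable with linear encoders over $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$ for all feasible $j$. Let $\mathbf{R}_j = (R_{j,1}, R_{j,2}, \cdots, R_{j,s})$. For all $\emptyset \neq T \subseteq S$ and $0 \neq \mathfrak{I}_i = \mathfrak{I}_{1,i} \times \mathfrak{I}_{2,i} \leq_l \mathfrak{R}_i$ with $0 \neq \mathfrak{I}_{l,i} \leq_l \mathfrak{R}_{l,i}$ ($l = 1, 2$), we have
\[ \sum_{i \in T} \frac{R_{j,i} \log|\mathfrak{I}_i|}{\log|\mathfrak{R}_i|} = \frac{c_1}{c_1 + c_2} \sum_{i \in T} \frac{R_{j,i} \log|\mathfrak{I}_{1,i}|}{\log|\mathfrak{R}_{1,i}|} + \frac{c_2}{c_1 + c_2} \sum_{i \in T} \frac{R_{j,i} \log|\mathfrak{I}_{2,i}|}{\log|\mathfrak{R}_{2,i}|}, \]
where $c_l = \log|\mathfrak{R}_{l,1}|$.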
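By (2.3.3), it can be easily seen that
\[ \sum_{i \in T} \frac{R_{j,i} \log|\mathfrak{I}_i|}{\log|\mathfrak{R}_i|} > H(X_T | X_{T^c}) - \frac{1}{c_1 + c_2} \sum_{l=1}^{2} c_l H(Y_{\mathfrak{R}_{l,T}/\mathfrak{I}_{l,T}} | X_{T^c}). \]
Meanwhile, let $\mathfrak{R}_T = \prod_{i \in T} \mathfrak{R}_i$, $\mathfrak{I}_T = \prod_{i \in T} \mathfrak{I}_i$, $\Phi = (\Phi_{1,1} \times \Phi_{2,1}, \Phi_{1,2} \times \Phi_{2,2}, \cdots, \Phi_{1,s} \times \Phi_{2,s})$ (note: $\Phi_{1,i} \times \Phi_{2,i} : x_i \mapsto (\Phi_{1,i}(x_i), \Phi_{2,i}(x_i)) \in \mathfrak{R}_i$ for all $x_i \in \mathscr{X}_i$) and $Y_{\mathfrak{R}_T/\mathfrak{I}_T} = \Phi(X_T) + \mathfrak{I}_T$. It can be verified that $Y_{\mathfrak{R}_{l,T}/\mathfrak{I}_{l,T}}$ ($l = 1, 2$) is a function of $Y_{\mathfrak{R}_T/\mathfrak{I}_T}$, hence, $H(Y_{\mathfrak{R}_T/\mathfrak{I}_T} | X_{T^c}) \geq H(Y_{\mathfrak{R}_{l,T}/\mathfrak{I}_{l,T}} | X_{T^c})$. Consequently,
\[ \sum_{i \in T} \frac{R_{j,i} \log|\mathfrak{I}_i|}{\log|\mathfrak{R}_i|} > H(X_T | X_{T^c}) - H(Y_{\mathfrak{R}_T/\mathfrak{I}_T} | X_{T^c}), \]
which implies that $\mathbf{R}_j \in \mathcal{R}_{\Phi,\mathrm{prod}}$ by Theorem 2.1.2. We therefore conclude that $\mathbf{R}_j$ is achievable with linear encoders over $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$ for all feasible $j$, so is $\mathbf{R}$.

Obviously, $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$ in Theorem 2.3.4 are of the same size. Inductively, one can verify the following without any difficulty.

Theorem 2.3.5. Let $\mathscr{L}$ be any finite index set, $\mathfrak{R}_{l,1}, \mathfrak{R}_{l,2}, \cdots, \mathfrak{R}_{l,s}$ ($l \in \mathscr{L}$) be a set of finite rings of equal size, and $\mathfrak{R}_i = \prod_{l \in \mathscr{L}} \mathfrak{R}_{l,i}$ for all feasible $i$. If the coding rate $\mathbf{R} \in \mathbb{R}^s$ is achievable with linear encoders over $\mathfrak{R}_{l,1}, \mathfrak{R}_{l,2}, \cdots, \mathfrak{R}_{l,s}$ ($l \in \mathscr{L}$), then $\mathbf{R}$ is achievable with linear encoders over $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$.

Remark 2.9. There are delicate issues to the situation Theorem 2.3.5 (Theorem 2.3.4) illustrates. Let $\mathscr{X}_i$ ($1 \leq i \leq s$) be the set of all symbols generated by the $i$th source. The hypothesis of Theorem 2.3.5 (Theorem 2.3.4) implicitly implies the alphabet constraint $|\mathscr{X}_i| \leq |\mathfrak{R}_{l,i}|$ for all feasible $i$ and $l$.

Let $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$ be $s$ finite rings each of which is isomorphic to either
1. a ring $\mathfrak{R}$ containing one and only one proper non-trivial left ideal whose order is $\sqrt{|\mathfrak{R}|}$, e.g. $M_{L,p}$ and $\mathbb{Z}_{p^2}$ ($p$ is a prime); or
2. a ring of a finite product of finite field(s) and/or ring(s) satisfying 1, e.g. $M_{L,p} \times \prod_{j=1}^{m} \mathbb{Z}_{p_j}$ ($p$ and $p_j$'s are prime) and $\prod_{i=1}^{m'} M_{L,p_i} \times \prod_{j=1}^{m''} \mathbb{F}_{q_j}$ ($m'$ and $m''$ are non-negative, $p_i$'s are prime and $q_j$'s are powers of primes).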
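The structural property in item 1 (exactly one proper non-trivial left ideal, of order $\sqrt{|\mathfrak{R}|}$) can be checked by brute force for small primes. The following is a minimal sketch, not part of the original development: the helper names (`left_ideals`, `z_mod`, `m_l`, `z_times_z`) are hypothetical, $M_{L,p}$ is encoded by the pair $(x, y)$ standing for the matrix with rows $(x, 0)$ and $(y, x)$, and the enumeration relies on the fact that, in a finite ring with identity, every left ideal is a sum of principal left ideals.

```python
from itertools import product

def left_ideals(elements, add, mul):
    """All left ideals of a finite ring with identity.

    Every left ideal is a finite sum of principal left ideals R*a,
    so we start from those and close the family under sums.
    """
    ideals = {frozenset(mul(r, a) for r in elements) for a in elements}
    grew = True
    while grew:
        grew = False
        for i, j in product(list(ideals), repeat=2):
            s = frozenset(add(a, b) for a in i for b in j)
            if s not in ideals:
                ideals.add(s)
                grew = True
    return ideals

def proper_nontrivial(elements, add, mul):
    """Left ideals that are neither {0} nor the whole ring."""
    n = len(elements)
    return [i for i in left_ideals(elements, add, mul) if 1 < len(i) < n]

def z_mod(n):
    """The ring Z_n."""
    return (list(range(n)),
            lambda a, b: (a + b) % n,
            lambda a, b: (a * b) % n)

def m_l(p):
    """M_{L,p}: the pair (x, y) encodes the matrix [[x, 0], [y, x]] over Z_p."""
    elems = list(product(range(p), repeat=2))
    add = lambda a, b: ((a[0] + b[0]) % p, (a[1] + b[1]) % p)
    # [[x,0],[y,x]] * [[u,0],[v,u]] = [[xu, 0], [yu + xv, xu]]
    mul = lambda a, b: ((a[0] * b[0]) % p, (a[1] * b[0] + a[0] * b[1]) % p)
    return elems, add, mul

def z_times_z(p):
    """The product ring Z_p x Z_p, included only as a contrast."""
    elems = list(product(range(p), repeat=2))
    add = lambda a, b: ((a[0] + b[0]) % p, (a[1] + b[1]) % p)
    mul = lambda a, b: ((a[0] * b[0]) % p, (a[1] * b[1]) % p)
    return elems, add, mul

for name, ring in [("Z_4", z_mod(4)), ("Z_9", z_mod(9)),
                   ("M_L,2", m_l(2)), ("M_L,3", m_l(3)),
                   ("Z_2 x Z_2", z_times_z(2))]:
    sizes = sorted(len(i) for i in proper_nontrivial(*ring))
    print(name, "ring order:", len(ring[0]), "ideal orders:", sizes)
```

For the small cases listed, $\mathbb{Z}_{p^2}$ and $M_{L,p}$ each report a single proper non-trivial left ideal of order $p$, whereas $\mathbb{Z}_2 \times \mathbb{Z}_2$ reports two, which is why the product ring falls under item 2 rather than item 1.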
Theorem 2.3.3 and Theorem 2.3.5 ensure that linear encoders over rings $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$ are always optimal in any applicable (subject to the condition specified in the corresponding theorem) Slepian–Wolf coding scenario. As a very special case, $\mathbb{Z}_p \times \mathbb{Z}_p$, where $p$ is a prime, is always optimal in any (single source or multiple sources) scenario with alphabet size less than or equal to $p$. However, using a field or product rings is not necessary. As shown in Theorem 2.3.2, neither $M_{L,p}$ nor $\mathbb{Z}_{p^2}$ is (isomorphic to) a product of rings or a field. It is also not required to have a restriction on the alphabet size (see Theorem 2.3.3), even for product rings (see Example 2.1.1 for a case of $\mathbb{Z}_2 \times \mathbb{Z}_3$).

2.3.4 Trivial Case: Uniform Distributions

The following theorem is trivial; however, we include it for completeness.

Theorem 2.3.6. Regardless of which set of rings $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$ is chosen, as long as $|\mathfrak{R}_i| = |\mathscr{X}_i|$ for all feasible $i$, region (2.1.1) is the Slepian–Wolf region if $(X_1, X_2, \cdots, X_s) \sim p$ is a uniform distribution.

Proof. If $p$ is uniform, then, for any $\emptyset \neq T \subseteq S$ and $0 \neq \mathfrak{I}_T \leq_l \mathfrak{R}_T$, $Y_{\mathfrak{R}_T/\mathfrak{I}_T}$ is uniformly distributed on $\mathfrak{R}_T/\mathfrak{I}_T$. Moreover, $X_T$ and $X_{T^c}$ are independent, so are $Y_{\mathfrak{R}_T/\mathfrak{I}_T}$ and $X_{T^c}$. Therefore, $H(X_T | X_{T^c}) = H(X_T) = \log|\mathfrak{R}_T|$ and
\[ H(Y_{\mathfrak{R}_T/\mathfrak{I}_T} | X_{T^c}) = H(Y_{\mathfrak{R}_T/\mathfrak{I}_T}) = \log\frac{|\mathfrak{R}_T|}{|\mathfrak{I}_T|}. \]
Consequently,
\[ r(T, \mathfrak{I}_T) = H(X_T | X_{T^c}) - H(Y_{\mathfrak{R}_T/\mathfrak{I}_T} | X_{T^c}) = \log|\mathfrak{I}_T|. \]
Region (2.1.1) is the Slepian–Wolf region.

Remark 2.10. When $p$ is uniform, it is obvious that the uncoded strategy (all encoders are one-to-one mappings) is optimal in the Slepian–Wolf source coding problem. However, optimality stated in Theorem 2.3.6 does not come from deliberately fixing the linear encoding mappings, but from generating them randomly.

So far, we have only shown that there exist linear encoders over finite non-field rings that are as good as their field counterparts. In Chapter 3, Problem 2.1 is considered with an arbitrary $g$.
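It will be demonstrated that linear coding over finite non-field rings can strictly outperform its field counterpart for encoding some discrete functions, and there are infinitely many such functions.

2.A Appendix

2.A.1 A Supporting Lemma

Lemma 2.A.1. Let $\mathfrak{R}$ be a finite ring, $X$ and $Y$ be two correlated discrete random variables, and $\mathscr{X}$ be the sample space of $X$ with $|\mathscr{X}| \leq |\mathfrak{R}|$. If $\mathfrak{R}$ contains one and only one proper non-trivial left ideal $\mathfrak{I}$ and $|\mathfrak{I}| = \sqrt{|\mathfrak{R}|}$, then there exists injection $\tilde{\Phi} : \mathscr{X} \to \mathfrak{R}$ such that
\[ H(X | Y) \leq 2 H(\tilde{\Phi}(X) + \mathfrak{I} \,|\, Y). \tag{2.A.1} \]

Proof. Let
\[ \tilde{\Phi} \in \arg\max_{\Phi \in \mathscr{M}} H(\Phi(X) + \mathfrak{I} \,|\, Y), \]
where $\mathscr{M}$ is the set of all possible $\Phi$'s (the maximum can always be reached because $|\mathscr{M}| = \frac{|\mathfrak{R}|!}{(|\mathfrak{R}| - |\mathscr{X}|)!}$ is finite, but it is not uniquely attained by $\tilde{\Phi}$ in general). Assume that $\mathscr{Y}$ is the sample space (not necessarily finite) of $Y$. Let $q = |\mathfrak{I}|$, $\mathfrak{I} = \{r_1, r_2, \cdots, r_q\}$ and $\mathfrak{R}/\mathfrak{I} = \{a_1 + \mathfrak{I}, a_2 + \mathfrak{I}, \cdots, a_q + \mathfrak{I}\}$. We have that
\[ H(X | Y) = -\sum_{y \in \mathscr{Y}} \sum_{i,j=1}^{q} p_{i,j,y} \log\frac{p_{i,j,y}}{p_y} \quad\text{and}\quad H(\tilde{\Phi}(X) + \mathfrak{I} \,|\, Y) = -\sum_{y \in \mathscr{Y}} \sum_{i=1}^{q} p_{i,y} \log\frac{p_{i,y}}{p_y}, \]
where
\[ p_{i,j,y} = \Pr\{\tilde{\Phi}(X) = a_i + r_j, Y = y\}, \quad p_y = \sum_{i,j=1}^{q} p_{i,j,y}, \quad p_{i,y} = \sum_{j=1}^{q} p_{i,j,y}. \]
(Note: $\Pr\{\tilde{\Phi}(X) = r\} = 0$ if $r \in \mathfrak{R} \setminus \tilde{\Phi}(\mathscr{X})$. In addition, every element in $\mathfrak{R}$ can be uniquely expressed as $a_i + r_j$.) Therefore, (2.A.1) is equivalent to
\begin{align*}
& -\sum_{y \in \mathscr{Y}} \sum_{i,j=1}^{q} p_{i,j,y} \log\frac{p_{i,j,y}}{p_y} \leq -2 \sum_{y \in \mathscr{Y}} \sum_{i=1}^{q} p_{i,y} \log\frac{p_{i,y}}{p_y} \\
\Leftrightarrow\ & \sum_{y \in \mathscr{Y}} p_y \sum_{i=1}^{q} \frac{p_{i,y}}{p_y} H\!\left( \frac{p_{i,1,y}}{p_{i,y}}, \frac{p_{i,2,y}}{p_{i,y}}, \cdots, \frac{p_{i,q,y}}{p_{i,y}} \right) \leq \sum_{y \in \mathscr{Y}} p_y H\!\left( \frac{p_{1,y}}{p_y}, \frac{p_{2,y}}{p_y}, \cdots, \frac{p_{q,y}}{p_y} \right), \tag{2.A.2}
\end{align*}
where $H(v_1, v_2, \cdots, v_q) = -\sum_{j=1}^{q} v_j \log v_j$, by the grouping rule for entropy [CT06, pp. 49]. Let
\[ A = \sum_{y \in \mathscr{Y}} p_y H\!\left( \sum_{i=1}^{q} \frac{p_{i,1,y}}{p_y}, \sum_{i=1}^{q} \frac{p_{i,2,y}}{p_y}, \cdots, \sum_{i=1}^{q} \frac{p_{i,q,y}}{p_y} \right). \]
The concavity of the function $H$ implies that
\[ \sum_{y \in \mathscr{Y}} p_y \sum_{i=1}^{q} \frac{p_{i,y}}{p_y} H\!\left( \frac{p_{i,1,y}}{p_{i,y}}, \frac{p_{i,2,y}}{p_{i,y}}, \cdots, \frac{p_{i,q,y}}{p_{i,y}} \right) \leq A. \tag{2.A.3} \]
At the same time,
\[ \sum_{y \in \mathscr{Y}} p_y H\!\left( \frac{p_{1,y}}{p_y}, \frac{p_{2,y}}{p_y}, \cdots, \frac{p_{q,y}}{p_y} \right) = \max_{\Phi \in \mathscr{M}} H(\Phi(X) + \mathfrak{I} \,|\, Y) \]
by the definition of $\tilde{\Phi}$. We now claim that
\[ A \leq \max_{\Phi \in \mathscr{M}} H(\Phi(X) + \mathfrak{I} \,|\, Y). \tag{2.A.4} \]
Suppose otherwise, i.e. $A > \sum_{y \in \mathscr{Y}} p_y H\!\left( \frac{p_{1,y}}{p_y}, \frac{p_{2,y}}{p_y}, \cdots, \frac{p_{q,y}}{p_y} \right)$. Let $\Phi' : \mathscr{X} \to \mathfrak{R}$ be defined as $\Phi' : x \mapsto a_j + r_i \Leftrightarrow \tilde{\Phi}(x) = a_i + r_j$. We have that
\begin{align*}
H(\Phi'(X) + \mathfrak{I} \,|\, Y) &= \sum_{y \in \mathscr{Y}} p_y H\!\left( \sum_{i=1}^{q} \frac{p_{i,1,y}}{p_y}, \sum_{i=1}^{q} \frac{p_{i,2,y}}{p_y}, \cdots, \sum_{i=1}^{q} \frac{p_{i,q,y}}{p_y} \right) = A \\
&> \sum_{y \in \mathscr{Y}} p_y H\!\left( \frac{p_{1,y}}{p_y}, \frac{p_{2,y}}{p_y}, \cdots, \frac{p_{q,y}}{p_y} \right) = \max_{\Phi \in \mathscr{M}} H(\Phi(X) + \mathfrak{I} \,|\, Y).
\end{align*}
It is absurd that $H(\Phi'(X) + \mathfrak{I} \,|\, Y) > \max_{\Phi \in \mathscr{M}} H(\Phi(X) + \mathfrak{I} \,|\, Y)$. Therefore, (2.A.2) is valid by (2.A.3) and (2.A.4), so is (2.A.1).

Chapter 3

Encoding Functions of Correlated Sources

For an arbitrary discrete function $g$, Problem 2.1 remains open in general, and $\mathcal{R}[X_1, X_2, \cdots, X_s] \subseteq \mathcal{R}[g]$ obviously. Making use of Elias' theorem on binary linear codes [Eli55], Körner–Marton [KM79] shows that $\mathcal{R}[\oplus_2]$ ("$\oplus_2$" is the modulo-two sum) contains the region
\[ \mathcal{R}_{\oplus_2} = \{ (R_1, R_2) \in \mathbb{R}^2 \mid R_1, R_2 > H(X_1 \oplus_2 X_2) \}. \]
This region is not contained in the Slepian–Wolf region for certain distributions. In other words, $\mathcal{R}[\oplus_2] \supsetneq \mathcal{R}[X_1, X_2]$. Combining the standard random coding technique and Elias' result, [AH83] shows that $\mathcal{R}[\oplus_2]$ can be strictly larger than the convex hull of the union $\mathcal{R}[X_1, X_2] \cup \mathcal{R}_{\oplus_2}$.

However, the functions considered in these works are relatively simple. At their most general, these works only consider functions defined on some finite field. This is because (part of) the encoding technique used is the linear coding technique over finite fields from [Eli55, Csi82]. Unfortunately, we will see later that this is a suboptimal solution. Instead, we will propose replacing the linear encoders over finite fields with the more general version, linear encoders over finite rings. We will see that in many examples, the latter strictly outperform the former in various aspects.

3.1 A Polynomial Approach

The first question arising here is how to handle the arbitrary function $g$ in Problem 2.1. This brings us back to Lemma 1.2.3 and Lemma 1.2.4.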
As commented in Section 1.2, any function defined on a finite domain is equivalent to a restriction of some polynomial function and some nomographic function. Conceptually, this is a nice observation. At least, there is a well-defined polynomial structure associated with the function considered. Moreover, if Problem 2.1 were concluded for all the polynomial functions or all the nomographic functions, then it is concluded for all functions defined on finite domains. Thus, from now on we will consider Problem 2.1 with this polynomial approach, namely, only polynomial functions are considered. We will prove the claim that LCoR dominates LCoF in terms of achieving better coding rates based on this approach.

3.2 Source Coding for Computing

We begin with establishing the following theorem which can be recognized as a generalization of Körner–Marton [KM79].

Theorem 3.2.1. Let $\mathfrak{R}$ be a finite ring, and $\hat{g} = h \circ k$, where
\[ k(x_1, x_2, \cdots, x_s) = \sum_{i=1}^{s} k_i(x_i) \tag{3.2.1} \]
and $h$, $k_i$'s are functions mapping $\mathfrak{R}$ to $\mathfrak{R}$. Then
\[ \mathcal{R}_{\hat{g}} = \Big\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\Big|\, R_i > \max_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} \frac{\log|\mathfrak{R}|}{\log|\mathfrak{I}|} \left[ H(X) - H(Y_{\mathfrak{R}/\mathfrak{I}}) \right] \Big\} \subseteq \mathcal{R}[\hat{g}], \tag{3.2.2} \]
where $X = k(X_1, X_2, \cdots, X_s)$ and $Y_{\mathfrak{R}/\mathfrak{I}} = X + \mathfrak{I}$.

Proof. By Theorem 2.1.1, $\forall\, \epsilon > 0$, there exists a large enough $n$, an $m \times n$ matrix $A \in \mathfrak{R}^{m \times n}$ and a decoder $\psi$, such that $\Pr\{X^n \neq \psi(A X^n)\} < \epsilon$, if
\[ m > \max_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} \frac{n \left( H(X) - H(Y_{\mathfrak{R}/\mathfrak{I}}) \right)}{\log|\mathfrak{I}|}. \]
Let $\phi_i = A \circ \vec{k}_i$ ($1 \leq i \leq s$) be the encoder of the $i$th source. Upon receiving $\phi_i(X_i^n)$ from the $i$th source, the decoder claims that $\vec{h}(\hat{X}^n)$, where $\hat{X}^n = \psi\left[ \sum_{i=1}^{s} \phi_i(X_i^n) \right]$, is the function, namely $\hat{g}$, subject to computation. The probability of decoding error is
\begin{align*}
\Pr\left\{ \vec{h}\left[ \vec{k}(X_1^n, X_2^n, \cdots, X_s^n) \right] \neq \vec{h}(\hat{X}^n) \right\}
&\leq \Pr\left\{ X^n \neq \hat{X}^n \right\} \\
&= \Pr\left\{ X^n \neq \psi\left[ \sum_{i=1}^{s} \phi_i(X_i^n) \right] \right\} \\
&= \Pr\left\{ X^n \neq \psi\left[ \sum_{i=1}^{s} A \vec{k}_i(X_i^n) \right] \right\} \\
&= \Pr\left\{ X^n \neq \psi\left[ A \sum_{i=1}^{s} \vec{k}_i(X_i^n) \right] \right\} \\
&= \Pr\left\{ X^n \neq \psi\left[ A \vec{k}(X_1^n, X_2^n, \cdots, X_s^n) \right] \right\} \\
&= \Pr\left\{ X^n \neq \psi(A X^n) \right\} < \epsilon.
\end{align*}
Therefore, all $(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s$, where
\[ R_i = \frac{m \log|\mathfrak{R}|}{n} > \max_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} \frac{\log|\mathfrak{R}|}{\log|\mathfrak{I}|} \left[ H(X) - H(Y_{\mathfrak{R}/\mathfrak{I}}) \right], \]
are achievable, i.e. $\mathcal{R}_{\hat{g}} \subseteq \mathcal{R}[\hat{g}]$.

Corollary 3.2.1. In Theorem 3.2.1, let $X = k(X_1, X_2, \cdots, X_s) \sim p_X$. We have
\[ \mathcal{R}_{\hat{g}} = \{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \mid R_i > H(X) \} \subseteq \mathcal{R}[\hat{g}], \]
if either of the following conditions holds:
1. $\mathfrak{R}$ is isomorphic to a finite field;
2. $\mathfrak{R}$ is isomorphic to a ring containing one and only one proper non-trivial left ideal $\mathfrak{I}_0$ with $|\mathfrak{I}_0| = \sqrt{|\mathfrak{R}|}$, and $H(X) \leq 2 H(X + \mathfrak{I}_0)$.

Proof. If either 1 or 2 holds, then it is guaranteed that
\[ \max_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} \frac{\log|\mathfrak{R}|}{\log|\mathfrak{I}|} \left[ H(X) - H(Y_{\mathfrak{R}/\mathfrak{I}}) \right] = H(X) \]
in Theorem 3.2.1. The statement follows.

Remark 3.1. By Lemma 3.A.1, examples of non-field rings satisfying 2 in Corollary 3.2.1 include
(1) $\mathbb{Z}_4$ with $p_X(0) = p_1$, $p_X(1) = p_2$, $p_X(3) = p_3$ and $p_X(2) = p_4$ satisfying
\[ \begin{cases} 0 \leq \max\{p_2, p_3\} \nless \min\{p_1, p_4\} \leq 1 \\ 0 \leq \max\{p_1, p_4\} \nless \min\{p_2, p_3\} \leq 1; \end{cases} \tag{3.2.3} \]
(2) $M_{L,2}$ with
\[ p_X\!\left( \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} \right) = p_1, \quad p_X\!\left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) = p_2, \quad p_X\!\left( \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \right) = p_3 \quad\text{and}\quad p_X\!\left( \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix} \right) = p_4 \]
satisfying (3.2.3).
Interested readers can figure out even more explicit examples deduced from Lemma 2.A.1. Besides, if $\mathfrak{R}$ is isomorphic to $\mathbb{Z}_2$ and $\hat{g}$ is the modulo-two sum, then Corollary 3.2.1 recovers the theorem of Körner–Marton [KM79].

However, $\mathcal{R}_{\hat{g}}$ given by (3.2.2) is sometimes strictly smaller than $\mathcal{R}[g]$. This was first shown by Ahlswede–Han [AH83] for the case of $g$ being the modulo-two sum. Their approach combines the linear coding technique over the binary field with the standard random coding technique. In the following, we generalize the result of Ahlswede–Han [AH83, Theorem 10] to the setting where $g$ is arbitrary and, at the same time, LCoF is replaced by its generalized version, LCoR.

Consider function $\hat{g}$ admitting
\[ \hat{g}(x_1, x_2, \cdots, x_s) = h\Big( k_0(x_1, x_2, \cdots, x_{s_0}), \sum_{j=s_0+1}^{s} k_j(x_j) \Big), \quad 0 \leq s_0 < s, \tag{3.2.4} \]
where $k_0 : \mathfrak{R}^{s_0} \to \mathfrak{R}$ and $h$, $k_j$'s are functions mapping $\mathfrak{R}$ to $\mathfrak{R}$.
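By Lemma 1.2.4, a discrete function with a finite domain is always equivalent to a restriction of some function of format (3.2.4). We call $\hat{g}$ from (3.2.4) a pseudo nomographic function over ring $\mathfrak{R}$.

Theorem 3.2.2. Let $S_0 = \{1, 2, \cdots, s_0\} \subseteq S = \{1, 2, \cdots, s\}$. If $\hat{g}$ is of format (3.2.4), and $\mathbf{R} = (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s$ satisfies
\[ \sum_{j \in T} R_j > |T \setminus S_0| \max_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} \frac{\log|\mathfrak{R}|}{\log|\mathfrak{I}|} \left[ H(X | V_S) - H(Y_{\mathfrak{R}/\mathfrak{I}} | V_S) \right] + I(Y_T; V_T | V_{T^c}), \quad \forall\, \emptyset \neq T \subseteq S, \tag{3.2.5} \]
where $\forall\, j \in S_0$, $V_j = Y_j = X_j$; $\forall\, j \in S \setminus S_0$, $Y_j = k_j(X_j)$, $V_j$'s are discrete random variables such that
\[ p(y_1, y_2, \cdots, y_s, v_1, v_2, \cdots, v_s) = p(y_1, y_2, \cdots, y_s) \prod_{j=s_0+1}^{s} p(v_j | y_j), \tag{3.2.6} \]
and $X = \sum_{j=s_0+1}^{s} Y_j$, $Y_{\mathfrak{R}/\mathfrak{I}} = X + \mathfrak{I}$, then $\mathbf{R} \in \mathcal{R}[\hat{g}]$.

Proof. Choose $\delta > 6\epsilon > 0$, such that $R_j = R'_j + R''_j$, $\forall\, j \in S$, $\sum_{j \in T} R'_j > I(Y_T; V_T | V_{T^c}) + 2|T|\delta$, $\forall\, \emptyset \neq T \subseteq S$, and $R''_j > r + 2\delta$, $\forall\, j \in S \setminus S_0$, where
\[ r = \max_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} \frac{\log|\mathfrak{R}|}{\log|\mathfrak{I}|} \left[ H(X | V_S) - H(Y_{\mathfrak{R}/\mathfrak{I}} | V_S) \right]. \]

Encoding: Fix the joint distribution $p$ which satisfies (3.2.6). For all $j \in S_0$, let $\mathscr{V}_{j,\epsilon} = \mathcal{T}_\epsilon(n, X_j)$. For all $j \in S \setminus S_0$, generate randomly $2^{n[I(Y_j; V_j) + \delta]}$ strongly $\epsilon$-typical sequences according to distribution $p_{V_j^n}$ and let $\mathscr{V}_{j,\epsilon}$ be the set of these generated sequences. Define mapping $\phi'_j : \mathfrak{R}^n \to \mathscr{V}_{j,\epsilon}$ as follows:
1. If $j \in S_0$, then, $\forall\, x \in \mathfrak{R}^n$,
\[ \phi'_j(x) = \begin{cases} x, & \text{if } x \in \mathcal{T}_\epsilon; \\ x_0, & \text{otherwise,} \end{cases} \]
where $x_0 \in \mathscr{V}_{j,\epsilon}$ is fixed.
2. If $j \in S \setminus S_0$, then for every $x \in \mathfrak{R}^n$, let $L_x = \{ v \in \mathscr{V}_{j,\epsilon} \mid (\vec{k}_j(x), v) \in \mathcal{T}_\epsilon \}$. If $x \in \mathcal{T}_\epsilon$ and $L_x \neq \emptyset$, then $\phi'_j(x)$ is set to be some element in $L_x$; otherwise $\phi'_j(x)$ is some fixed $v_0 \in \mathscr{V}_{j,\epsilon}$.
Define mapping $\eta_j : \mathscr{V}_{j,\epsilon} \to [1, 2^{nR'_j}]$ by randomly choosing the value for each $v \in \mathscr{V}_{j,\epsilon}$ according to a uniform distribution.
Let $k = \min_{j \in S \setminus S_0} \left\lfloor \frac{n R''_j}{\log|\mathfrak{R}|} \right\rfloor$. When $n$ is big enough, we have $k > \frac{n[r + \delta]}{\log|\mathfrak{R}|}$. Randomly generate a $k \times n$ matrix $M \in \mathfrak{R}^{k \times n}$, and let $\theta_j : \mathfrak{R}^n \to \mathfrak{R}^k$ ($j \in S \setminus S_0$) be the function $\theta_j : x \mapsto M \vec{k}_j(x)$, $\forall\, x \in \mathfrak{R}^n$.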
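Define the encoder $\phi_j$ as follows:
\[ \phi_j = \begin{cases} \eta_j \circ \phi'_j, & j \in S_0; \\ (\eta_j \circ \phi'_j, \theta_j), & \text{otherwise.} \end{cases} \]

Decoding: Upon observing $(a_1, a_2, \cdots, a_{s_0}, (a_{s_0+1}, b_{s_0+1}), \cdots, (a_s, b_s))$ at the decoder, the decoder claims that
\[ \vec{h}\left[ \vec{k}_0\left( \hat{V}_1^n, \hat{V}_2^n, \cdots, \hat{V}_{s_0}^n \right), \hat{X}^n \right] \]
is the function of the generated data, if and only if there exists one and only one
\[ \hat{\mathbf{V}} = \left( \hat{V}_1^n, \hat{V}_2^n, \cdots, \hat{V}_s^n \right) \in \prod_{j=1}^{s} \mathscr{V}_{j,\epsilon}, \]
such that $a_j = \eta_j(\hat{V}_j^n)$, $\forall\, j \in S$, and $\hat{X}^n$ is the only element in the set
\[ L_{\hat{\mathbf{V}}} = \Big\{ x \in \mathfrak{R}^n \,\Big|\, (x, \hat{\mathbf{V}}) \in \mathcal{T}_\epsilon, \ M x = \sum_{j=s_0+1}^{s} b_j \Big\}. \]

Error: Assume that $X_j^n$ is the data generated by the $j$th source and let $X^n = \sum_{j=s_0+1}^{s} \vec{k}_j(X_j^n)$. An error happens if and only if one of the following events happens.
$E_1$: $(X_1^n, X_2^n, \cdots, X_s^n, Y_1^n, Y_2^n, \cdots, Y_s^n, X^n) \notin \mathcal{T}_\epsilon$;
$E_2$: There exists some $j_0 \in S \setminus S_0$, such that $L_{X_{j_0}^n} = \emptyset$;
$E_3$: $(Y_1^n, Y_2^n, \cdots, Y_s^n, X^n, \mathbf{V}) \notin \mathcal{T}_\epsilon$, where $\mathbf{V} = (V_1^n, V_2^n, \cdots, V_s^n)$ and $V_j^n = \phi'_j(X_j^n)$, $\forall\, j \in S$;
$E_4$: There exists $\mathbf{V}' = (v'_1, v'_2, \cdots, v'_s) \in \mathcal{T}_\epsilon \cap \prod_{j=1}^{s} \mathscr{V}_{j,\epsilon}$, $\mathbf{V}' \neq \mathbf{V}$, such that $\eta_j(v'_j) = \eta_j(V_j^n)$, $\forall\, j \in S$;
$E_5$: $X^n \notin L_{\mathbf{V}}$ or $|L_{\mathbf{V}}| > 1$, i.e. there exists $X_0^n \in \mathfrak{R}^n$, $X_0^n \neq X^n$, such that $M X_0^n = M X^n$ and $(X_0^n, \mathbf{V}) \in \mathcal{T}_\epsilon$.

Error Probability: Let $\gamma = \Pr\left\{ \bigcup_{l=1}^{5} E_l \right\} \leq \sum_{l=1}^{5} \Pr\{E_l | E_{l,c}\}$, where $E_{1,c} = \emptyset$ and $E_{l,c} = \bigcap_{\tau=1}^{l-1} E_\tau^c$ for $1 < l \leq 5$. In the following, we show that $\gamma \to 0$, $n \to \infty$.

(a). By the joint AEP [Yeu08, Theorem 6.9], $\Pr\{E_1\} \to 0$, $n \to \infty$.

(b). Let $E_{2,j} = \left\{ L_{X_j^n} = \emptyset \right\}$, $\forall\, j \in S \setminus S_0$. Then
\[ \Pr\{E_2 | E_{2,c}\} \leq \sum_{j \in S \setminus S_0} \Pr\{E_{2,j} | E_{2,c}\}. \tag{3.2.7} \]
For any $j \in S \setminus S_0$, because the sequence $v \in \mathscr{V}_{j,\epsilon}$ and $Y_j^n = \vec{k}_j(X_j^n)$ are drawn independently, we have
\begin{align*}
\Pr\{(Y_j^n, v) \in \mathcal{T}_\epsilon\} &\geq (1 - \epsilon) 2^{-n[I(Y_j; V_j) + 3\epsilon]} \\
&= (1 - \epsilon) 2^{-n[I(Y_j; V_j) + \delta/2] + n(\delta/2 - 3\epsilon)} \\
&> 2^{-n[I(Y_j; V_j) + \delta/2]}
\end{align*}
when $n$ is big enough. Thus,
\begin{align*}
\Pr\{E_{2,j} | E_{2,c}\} &= \Pr\left\{ L_{X_j^n} = \emptyset \,\middle|\, E_{2,c} \right\} \\
&= \prod_{v \in \mathscr{V}_{j,\epsilon}} \Pr\left\{ \left( \vec{k}_j(X_j^n), v \right) \notin \mathcal{T}_\epsilon \right\} \\
&< \left\{ 1 - 2^{-n[I(Y_j; V_j) + \delta/2]} \right\}^{2^{n[I(Y_j; V_j) + \delta]}} \tag{3.2.8} \\
&\to 0, \quad n \to \infty,
\end{align*}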
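where (3.2.8) holds true for all big enough $n$ and the limit follows from the fact that $(1 - 1/a)^a \to e^{-1}$, $a \to \infty$. Therefore, $\Pr\{E_2 | E_{2,c}\} \to 0$, $n \to \infty$ by (3.2.7).

(c). By (3.2.6), it is obvious that $V_{J_1} - Y_{J_1} - Y_{J_2} - V_{J_2}$ forms a Markov chain for any two disjoint nonempty sets $J_1, J_2 \subsetneq S$. Thus, if $(Y_j^n, V_j^n) \in \mathcal{T}_\epsilon$ for all $j \in S$ and $(Y_1^n, Y_2^n, \cdots, Y_s^n) \in \mathcal{T}_\epsilon$, then $(Y_1^n, Y_2^n, \cdots, Y_s^n, \mathbf{V}) \in \mathcal{T}_\epsilon$. In the meantime, $X - (Y_1, Y_2, \cdots, Y_s) - (V_1, V_2, \cdots, V_s)$ is also a Markov chain. Hence, $(Y_1^n, Y_2^n, \cdots, Y_s^n, X^n, \mathbf{V}) \in \mathcal{T}_\epsilon$ if $(Y_1^n, Y_2^n, \cdots, Y_s^n, X^n) \in \mathcal{T}_\epsilon$. Therefore, $\Pr\{E_3 | E_{3,c}\} = 0$.

(d). For all $\emptyset \neq J \subseteq S$, let $J = \{j_1, j_2, \cdots, j_{|J|}\}$ and
\[ \Gamma_J = \Big\{ \mathbf{V}' = (v'_1, v'_2, \cdots, v'_s) \in \prod_{j=1}^{s} \mathscr{V}_{j,\epsilon} \,\Big|\, v'_j = V_j^n \text{ if and only if } j \in S \setminus J \Big\}. \]
By definition, $|\Gamma_J| = \prod_{j \in J} |\mathscr{V}_{j,\epsilon}| - 1 = 2^{n\left[ \sum_{j \in J} I(Y_j; V_j) + |J|\delta \right]} - 1$ and
\begin{align*}
\Pr\{E_4 | E_{4,c}\} &= \sum_{\emptyset \neq J \subseteq S} \sum_{\mathbf{V}' \in \Gamma_J} \Pr\left\{ \eta_j(v'_j) = \eta_j(V_j^n), \forall\, j \in J, \ \mathbf{V}' \in \mathcal{T}_\epsilon \,\middle|\, E_{4,c} \right\} \\
&= \sum_{\emptyset \neq J \subseteq S} \sum_{\mathbf{V}' \in \Gamma_J} \Pr\left\{ \eta_j(v'_j) = \eta_j(V_j^n), \forall\, j \in J \right\} \times \Pr\left\{ \mathbf{V}' \in \mathcal{T}_\epsilon \,\middle|\, E_{4,c} \right\} \tag{3.2.9} \\
&< \sum_{\emptyset \neq J \subseteq S} \sum_{\mathbf{V}' \in \Gamma_J} 2^{-n \sum_{j \in J} R'_j} \times 2^{-n\left[ \sum_{i=1}^{|J|} I(V_{j_i}; V_{J^c}, V_{j_1}, \cdots, V_{j_{i-1}}) - |J|\delta \right]} \tag{3.2.10} \\
&< \sum_{\emptyset \neq J \subseteq S} 2^{n\left[ \sum_{j \in J} I(Y_j; V_j) + |J|\delta \right]} \times 2^{-n \sum_{j \in J} R'_j} \times 2^{-n\left[ \sum_{i=1}^{|J|} I(V_{j_i}; V_{J^c}, V_{j_1}, \cdots, V_{j_{i-1}}) - |J|\delta \right]} \\
&\leq C \max_{\emptyset \neq J \subseteq S} 2^{-n\left[ \sum_{j \in J} R'_j - I(Y_J; V_J | V_{J^c}) - 2|J|\delta \right]} \tag{3.2.11} \\
&\to 0, \quad n \to \infty,
\end{align*}
where $C = 2^s - 1$. Equality (3.2.9) holds because the processes of choosing $\eta_j$'s and generating $\mathbf{V}'$ are done independently. (3.2.10) follows from Lemma 3.A.3 and the definitions of $\eta_j$'s. (3.2.11) is from Lemma 3.A.4.

(e). Let $E_{5,1} = \{L_{\mathbf{V}} = \emptyset\}$ and $E_{5,2} = \{|L_{\mathbf{V}}| > 1\}$. We have $\Pr\{E_{5,1} | E_{5,c}\} = 0$, because $E_{5,c}$ contains the event that $X^n \in L_{\mathbf{V}}$ and $\mathbf{V}$ is unique. Therefore,
\begin{align*}
\Pr\{E_5 | E_{5,c}\} &= \Pr\{E_{5,2} | E_{5,c}\} \\
&= \sum_{(X_0^n, \mathbf{V}) \in \mathcal{T}_\epsilon \setminus (X^n, \mathbf{V})} \Pr\{M X_0^n = M X^n\} \\
&< \sum_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} \ \sum_{D_\epsilon(X^n, \mathfrak{I} | \mathbf{V}) \setminus (X^n, \mathbf{V})} \Pr\{M X_0^n = M X^n\}.
\end{align*}
Choose a small $\eta > 0$ such that $\eta < \frac{\delta}{2 \log|\mathfrak{R}|}$. Then
\begin{align*}
\Pr\{E_5 | E_{5,c}\} &< \left( 2^{|\mathfrak{R}|} - 2 \right) \max_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} 2^{n[H(X | V_S) - H(Y_{\mathfrak{R}/\mathfrak{I}} | V_S) + \eta]} \times 2^{-k \log|\mathfrak{I}|} \tag{3.2.12} \\
&= \left( 2^{|\mathfrak{R}|} - 2 \right) \max_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} 2^{-n[k \log|\mathfrak{I}|/n - H(X | V_S) + H(Y_{\mathfrak{R}/\mathfrak{I}} | V_S) - \eta]} \\
&< \left( 2^{|\mathfrak{R}|} - 2 \right) \max_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} 2^{-n[\delta \log|\mathfrak{I}|/\log|\mathfrak{R}| - \eta]} \\
&< \left( 2^{|\mathfrak{R}|} - 2 \right) 2^{-n\delta/(2 \log|\mathfrak{R}|)} \tag{3.2.13} \\
&\to 0, \quad n \to \infty,
\end{align*}
where (3.2.12) is from Lemma 2.1.1 and Lemma 2.1.2 (for all large enough $n$ and small enough $\epsilon$) and (3.2.13) is because $|\mathfrak{I}| \geq 2$ for all $\mathfrak{I} \neq 0$.

To summarize, by (a)–(e), we have $\gamma \to 0$, $n \to \infty$. The theorem is established.

Remark 3.2. The achievable region given by (3.2.5) always contains the Slepian–Wolf region. Furthermore, it is in general larger than the $\mathcal{R}_{\hat{g}}$ from (3.2.2). If $\hat{g}$ is the modulo-two sum, namely $s_0 = 0$ and $h$, $k_j$'s are identity functions for all $s_0 < j \leq s$, then (3.2.5) recovers the region of Ahlswede–Han [AH83, Theorem 10].

3.3 Non-field Rings versus Fields I

Given some finite ring $\mathfrak{R}$, let $\hat{g}$ be of format (3.2.1), a nomographic presentation of $g$. We say that the region $\mathcal{R}_{\hat{g}}$ given by (3.2.2) is achievable for computing $g$ in the sense of Körner–Marton. From Theorem 3.2.2, we know that $\mathcal{R}_{\hat{g}}$ might not be the largest achievable region one can obtain for computing $g$. However, $\mathcal{R}_{\hat{g}}$ still captures the ability of linear coding over $\mathfrak{R}$ when used for computing $g$. In other words, $\mathcal{R}_{\hat{g}}$ is the region purely achieved with linear coding over $\mathfrak{R}$ for computing $g$. On the other hand, regions from Theorem 3.2.2 are achieved by combining the linear coding and the standard random coding techniques. Therefore, it is reasonable to compare LCoR with LCoF in the sense of Körner–Marton.

We are now to show that linear coding over finite rings, non-field rings in particular, strictly outperforms its field counterpart, LCoF, in the following example.

Example 3.3.1.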
Let $g : \{\alpha_0, \alpha_1\}^3 \to \{\beta_0, \beta_1, \beta_2, \beta_3\}$ (Figure 3.1) be a function such that
\begin{align*}
& g : (\alpha_0, \alpha_0, \alpha_0) \mapsto \beta_0; && g : (\alpha_0, \alpha_0, \alpha_1) \mapsto \beta_3; \\
& g : (\alpha_0, \alpha_1, \alpha_0) \mapsto \beta_2; && g : (\alpha_0, \alpha_1, \alpha_1) \mapsto \beta_1; \\
& g : (\alpha_1, \alpha_0, \alpha_0) \mapsto \beta_1; && g : (\alpha_1, \alpha_0, \alpha_1) \mapsto \beta_0; \\
& g : (\alpha_1, \alpha_1, \alpha_0) \mapsto \beta_3; && g : (\alpha_1, \alpha_1, \alpha_1) \mapsto \beta_2. \tag{3.3.1}
\end{align*}
Define $\mu : \{\alpha_0, \alpha_1\} \to \mathbb{Z}_4$ and $\nu : \{\beta_0, \beta_1, \beta_2, \beta_3\} \to \mathbb{Z}_4$ by
\[ \mu : \alpha_j \mapsto j, \ \forall\, j \in \{0, 1\}, \quad\text{and}\quad \nu : \beta_j \mapsto j, \ \forall\, j \in \{0, 1, 2, 3\}, \tag{3.3.2} \]
respectively. Obviously, $g$ is equivalent to $x + 2y + 3z \in \mathbb{Z}_4[3]$ (Figure 3.2) via $\mu_1 = \mu_2 = \mu_3 = \mu$ and $\nu$. However, by Proposition 3.3.1, there exists no $\hat{g} \in \mathbb{F}_4[3]$ of format (3.2.1) so that $g$ is equivalent to any restriction of $\hat{g}$. Although Lemma 1.2.4 ensures that there always exists a bigger field $\mathbb{F}_q$ such that $g$ admits a presentation $\hat{g} \in \mathbb{F}_q[3]$ of format (3.2.1), the size $q$ must be strictly bigger than 4. For instance, let
\[ \hat{h}(x) = \sum_{a \in \mathbb{Z}_5} a \left[ 1 - (x - a)^4 \right] - \left[ 1 - (x - 4)^4 \right] \in \mathbb{Z}_5[1]. \]
Then, $g$ has presentation $\hat{h}(x + 2y + 4z) \in \mathbb{Z}_5[3]$ (Figure 3.3) via $\mu_1 = \mu_2 = \mu_3 = \mu : \{\alpha_0, \alpha_1\} \to \mathbb{Z}_5$ and $\nu : \{\beta_0, \beta_1, \beta_2, \beta_3\} \to \mathbb{Z}_5$ defined (symbolic-wise) by (3.3.2).

[Figure 3.1: $g : \{\alpha_0, \alpha_1\}^3 \to \{\beta_0, \beta_1, \beta_2, \beta_3\}$]

[Figure 3.2: $x + 2y + 3z \in \mathbb{Z}_4[3]$]

[Figure 3.3: $\hat{h}(x + 2y + 4z) \in \mathbb{Z}_5[3]$]

Proposition 3.3.1. There exists no polynomial function $\hat{g} \in \mathbb{F}_4[3]$ of format (3.2.1), such that a restriction of $\hat{g}$ is equivalent to the function $g$ defined by (3.3.1).

Proof. Suppose $\nu \circ g = \hat{g} \circ (\mu_1, \mu_2, \mu_3)$, where $\mu_1, \mu_2, \mu_3 : \{\alpha_0, \alpha_1\} \to \mathbb{F}_4$, $\nu : \{\beta_0, \cdots, \beta_3\} \to \mathbb{F}_4$ are injections, and $\hat{g} = h \circ (k_1 + k_2 + k_3)$ with $h, k_i \in \mathbb{F}_4[1]$ for all feasible $i$. We claim that $\hat{g}$ and $h$ are both surjective, since
\[ \left| g\left( \{\alpha_0, \alpha_1\}^3 \right) \right| = |\{\beta_0, \beta_1, \beta_2, \beta_3\}| = 4 = |\mathbb{F}_4|. \]
In particular, $h$ is bijective. Therefore, $h^{-1} \circ \nu \circ g = k_1 \circ \mu_1 + k_2 \circ \mu_2 + k_3 \circ \mu_3$, i.e. $g$ admits a presentation $k_1(x) + k_2(y) + k_3(z) \in \mathbb{F}_4[3]$. This contradicts Lemma 3.A.2.
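The two presentations of $g$ given in Example 3.3.1 can be confirmed by enumerating the eight points of the domain. The sketch below is illustrative only: the function and variable names are hypothetical, and the index $j$ is used in place of $\alpha_j$ and $\beta_j$, which is exactly the identification made by $\mu$ and $\nu$ in (3.3.2).

```python
from itertools import product

# g from (3.3.1): the key (i, j, k) stands for (alpha_i, alpha_j, alpha_k)
# and the value m stands for beta_m.
G = {
    (0, 0, 0): 0, (0, 0, 1): 3, (0, 1, 0): 2, (0, 1, 1): 1,
    (1, 0, 0): 1, (1, 0, 1): 0, (1, 1, 0): 3, (1, 1, 1): 2,
}

def over_z4(x, y, z):
    """The linear presentation x + 2y + 3z, evaluated in Z_4."""
    return (x + 2 * y + 3 * z) % 4

def h_hat(x):
    """h-hat from Example 3.3.1, evaluated in Z_5.

    By Fermat's little theorem (x - a)^4 is 1 unless x = a (mod 5),
    so 1 - (x - a)^4 acts as an indicator of x = a.
    """
    total = sum(a * (1 - pow(x - a, 4, 5)) for a in range(5))
    total -= 1 - pow(x - 4, 4, 5)
    return total % 5

def over_z5(x, y, z):
    """The presentation h-hat(x + 2y + 4z), evaluated in Z_5."""
    return h_hat((x + 2 * y + 4 * z) % 5)

domain = list(product((0, 1), repeat=3))
z4_ok = all(over_z4(*t) == G[t] for t in domain)
z5_ok = all(over_z5(*t) == G[t] for t in domain)
print("Z_4 presentation matches g:", z4_ok)
print("Z_5 presentation matches g:", z5_ok)
print("h-hat on Z_5:", [h_hat(x) for x in range(5)])
```

The table of $\hat{h}$ makes the role of the outer function visible: it is the identity on $\{0, 1, 2, 3\}$ and sends $4$ to $3$, which is what merges the two inputs that $x + 2y + 4z$ separates but $g$ does not.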
As a consequence of Proposition 3.3.1, in the sense of Körner–Marton, in order to use LCoF to encode function $g$, the alphabet sizes of the three encoders need to be at least 5. However, LCoR offers a solution in which the alphabet sizes are 4, strictly smaller than using LCoF. Most importantly, the region achieved with linear coding over any finite field $\mathbb{F}_q$ is always a subset of the one achieved with linear coding over $\mathbb{Z}_4$. This is proved in the following proposition.

Proposition 3.3.2. Let $g$ be the function defined by (3.3.1), $\{\alpha_0, \alpha_1\}^3$ be the sample space of $(X_1, X_2, X_3) \sim p$ and $p_X$ be the distribution of $X = g(X_1, X_2, X_3)$. If $p_X(\beta_0) = p_1$, $p_X(\beta_1) = p_2$, $p_X(\beta_3) = p_3$ and $p_X(\beta_2) = p_4$ satisfy (3.2.3), then, in the sense of Körner–Marton, the region $\mathcal{R}_1$ achieved with linear coding over $\mathbb{Z}_4$ contains the one, that is $\mathcal{R}_2$, obtained with linear coding over any finite field $\mathbb{F}_q$ for computing $g$. Moreover, if $\mathrm{supp}(p)$ is the whole domain of $g$, then $\mathcal{R}_1 \supsetneq \mathcal{R}_2$.

Proof. Let $\hat{g} = h \circ k \in \mathbb{F}_q[3]$ be a polynomial presentation of $g$ with format (3.2.1). By Corollary 3.2.1 and Remark 3.1, we have
\begin{align*}
\mathcal{R}_1 &= \{ (R_1, R_2, R_3) \in \mathbb{R}^3 \mid R_i > H(X_1 + 2X_2 + 3X_3) \}, \\
\mathcal{R}_2 &= \{ (R_1, R_2, R_3) \in \mathbb{R}^3 \mid R_i > H(k(X_1, X_2, X_3)) \}.
\end{align*}
Assume that $\nu \circ g = h \circ k \circ (\mu_1, \mu_2, \mu_3)$, where $\mu_1, \mu_2, \mu_3 : \{\alpha_0, \alpha_1\} \to \mathbb{F}_q$ and $\nu : \{\beta_0, \cdots, \beta_3\} \to \mathbb{F}_q$ are injections. Obviously, $g(X_1, X_2, X_3)$ is a function of $k(X_1, X_2, X_3)$. Hence,
\[ H(k(X_1, X_2, X_3)) \geq H(g(X_1, X_2, X_3)). \tag{3.3.3} \]
On the other hand, $H(X_1 + 2X_2 + 3X_3) = H(g(X_1, X_2, X_3))$. Therefore,
\[ H(k(X_1, X_2, X_3)) \geq H(X_1 + 2X_2 + 3X_3), \tag{3.3.4} \]
and $\mathcal{R}_1 \supseteq \mathcal{R}_2$. In addition, we claim that $h|_{\mathscr{S}}$, where $\mathscr{S} = k\left( \prod_{j=1}^{3} \mu_j(\{\alpha_0, \alpha_1\}) \right)$, is not injective. Otherwise, $h : \mathscr{S} \to \mathscr{S}'$, where $\mathscr{S}' = h(\mathscr{S})$, is bijective, hence, $(h|_{\mathscr{S}'})^{-1} \circ \nu \circ g = k \circ (\mu_1, \mu_2, \mu_3) = k_1 \circ \mu_1 + k_2 \circ \mu_2 + k_3 \circ \mu_3$. This contradicts Lemma 3.A.2. Consequently, $|\mathscr{S}| > |\mathscr{S}'| = |\nu(\{\beta_0, \cdots, \beta_3\})| = 4$.
If $\mathrm{supp}(p) = \{\alpha_0, \alpha_1\}^3$, then (3.3.3) as well as (3.3.4) hold strictly, thus, $\mathcal{R}_1 \supsetneq \mathcal{R}_2$.

A more intuitive comparison (which is not as conclusive as Proposition 3.3.2) can be identified from the presentations of $g$ given in Figure 3.2 and Figure 3.3. According to Corollary 3.2.1, linear encoders over field $\mathbb{Z}_5$ achieve
\[ \mathcal{R}_{\mathbb{Z}_5} = \{ (R_1, R_2, R_3) \in \mathbb{R}^3 \mid R_i > H(X_1 + 2X_2 + 4X_3) \}. \]
The one achieved by linear encoders over ring $\mathbb{Z}_4$ is
\[ \mathcal{R}_{\mathbb{Z}_4} = \{ (R_1, R_2, R_3) \in \mathbb{R}^3 \mid R_i > H(X_1 + 2X_2 + 3X_3) \}. \]
Clearly, $H(X_1 + 2X_2 + 3X_3) \leq H(X_1 + 2X_2 + 4X_3)$, thus, $\mathcal{R}_{\mathbb{Z}_4}$ contains $\mathcal{R}_{\mathbb{Z}_5}$. Furthermore, as long as
\[ 0 < \Pr\{(\alpha_0, \alpha_0, \alpha_1)\}, \ \Pr\{(\alpha_1, \alpha_1, \alpha_0)\} < 1, \]
$\mathcal{R}_{\mathbb{Z}_4}$ is strictly larger than $\mathcal{R}_{\mathbb{Z}_5}$, since $H(X_1 + 2X_2 + 3X_3) < H(X_1 + 2X_2 + 4X_3)$. To be specific, assume that $(X_1, X_2, X_3) \sim p$ satisfies Table 1; then we have
\[ \mathcal{R}[X_1, X_2, X_3] \subsetneq \mathcal{R}_{\mathbb{Z}_5} = \{ (R_1, R_2, R_3) \in \mathbb{R}^3 \mid R_i > 0.4812 \} \subsetneq \mathcal{R}_{\mathbb{Z}_4} = \{ (R_1, R_2, R_3) \in \mathbb{R}^3 \mid R_i > 0.4590 \}. \]

Table 1

    (X1, X2, X3)      p        (X1, X2, X3)      p
    (α0, α0, α0)     1/90      (α0, α1, α0)     1/90
    (α1, α0, α1)     1/90      (α1, α1, α1)     1/90
    (α1, α0, α0)    42/90      (α0, α0, α1)     1/90
    (α0, α1, α1)    42/90      (α1, α1, α0)     1/90

Based on Proposition 3.3.1 and Proposition 3.3.2, we conclude that LCoR dominates LCoF, in terms of achieving better coding rates with smaller alphabet sizes of the encoders for computing $g$. As a direct conclusion, we have:

Theorem 3.3.1. In the sense of Körner–Marton, LCoF is not optimal.

Remark 3.3. The key property underlying the proof of Proposition 3.3.2 is that the characteristic of a finite field must be a prime while the characteristic of a finite ring can be any positive integer larger than or equal to 2. This implies that it is possible to construct infinitely many discrete functions for which using LCoF always leads to a suboptimal achievable region compared to linear coding over finite non-field rings. Examples include $\sum_{i=1}^{s} x_i \in \mathbb{Z}_{2p}[s]$ for $s \geq 2$ and prime $p > 2$ (note: the characteristic of $\mathbb{Z}_{2p}$ is $2p$ which is not a prime).
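The two rate thresholds quoted for the distribution in Table 1 can be reproduced numerically. The sketch below is an illustration under stated assumptions rather than part of the original argument: entropies are taken in bits (base-2 logarithm), the index $j$ stands for $\alpha_j$ as in (3.3.2), and the names `P` and `entropy_of` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import log2

# Table 1: the key (i, j, k) stands for (alpha_i, alpha_j, alpha_k).
P = {
    (0, 0, 0): Fraction(1, 90), (1, 0, 1): Fraction(1, 90),
    (1, 0, 0): Fraction(42, 90), (0, 1, 1): Fraction(42, 90),
    (0, 1, 0): Fraction(1, 90), (1, 1, 1): Fraction(1, 90),
    (0, 0, 1): Fraction(1, 90), (1, 1, 0): Fraction(1, 90),
}

def entropy_of(f):
    """Entropy in bits of f(X1, X2, X3) when (X1, X2, X3) follows P."""
    dist = Counter()
    for point, prob in P.items():
        dist[f(*point)] += prob
    return -sum(float(q) * log2(float(q)) for q in dist.values() if q > 0)

# Inner sums of the two presentations of g from Example 3.3.1.
h_z4 = entropy_of(lambda x, y, z: (x + 2 * y + 3 * z) % 4)
h_z5 = entropy_of(lambda x, y, z: (x + 2 * y + 4 * z) % 5)

print("H(X1 + 2 X2 + 3 X3) over Z_4: %.4f" % h_z4)
print("H(X1 + 2 X2 + 4 X3) over Z_5: %.4f" % h_z5)
```

The gap between the two values comes entirely from the points $(\alpha_0, \alpha_0, \alpha_1)$ and $(\alpha_1, \alpha_1, \alpha_0)$: they collapse to the same symbol over $\mathbb{Z}_4$ but remain distinct over $\mathbb{Z}_5$, so the $\mathbb{Z}_5$ sum carries strictly more entropy than $g$ itself.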
One can always find an explicit distribution of sources for which linear coding over $\mathbb{Z}_{2p}$ strictly dominates linear coding over each and every finite field.

3.A Appendix

3.A.1 Supporting Lemmata

Lemma 3.A.1. If
\[ \begin{cases} 0 \leq \max\{p_2, p_3\} \nless \min\{p_1, p_4\} \leq 1 \\ 0 \leq \max\{p_1, p_4\} \nless \min\{p_2, p_3\} \leq 1 \end{cases} \]
and $\sum_{j=1}^{4} p_j = 1$, then
\[ -\sum_{j=1}^{4} p_j \log p_j \leq -2 \left[ (p_2 + p_3) \log(p_2 + p_3) + (p_1 + p_4) \log(p_1 + p_4) \right]. \tag{3.A.1} \]

Proof [DA12]. Without loss of generality, we assume that $0 \leq \max\{p_4, p_3\} \leq \min\{p_2, p_1\} \leq 1$ which implies that $p_1 + p_2 - 1/2 \geq |p_1 + p_4 - 1/2|$. Let $H_2(c) = -c \log c - (1 - c) \log(1 - c)$, $0 \leq c \leq 1$, be the binary entropy function. By the grouping rule for entropy [CT06, pp. 49], (3.A.1) is equivalent to
\begin{align*}
& (p_1 + p_4) \left[ \frac{p_1}{p_1 + p_4} \log\frac{p_1 + p_4}{p_1} + \frac{p_4}{p_1 + p_4} \log\frac{p_1 + p_4}{p_4} \right] + (p_2 + p_3) \left[ \frac{p_2}{p_2 + p_3} \log\frac{p_2 + p_3}{p_2} + \frac{p_3}{p_2 + p_3} \log\frac{p_2 + p_3}{p_3} \right] \\
& \qquad \leq -(p_2 + p_3) \log(p_2 + p_3) - (p_1 + p_4) \log(p_1 + p_4) \\
\Leftrightarrow\ & A := (p_1 + p_4) H_2\!\left( \frac{p_1}{p_1 + p_4} \right) + (p_2 + p_3) H_2\!\left( \frac{p_2}{p_2 + p_3} \right) \leq H_2(p_1 + p_4).
\end{align*}
Since $H_2$ is a concave function and $\sum_{j=1}^{4} p_j = 1$, we have $A \leq H_2(p_1 + p_2)$. Moreover, $p_1 + p_2 - 1/2 \geq |p_1 + p_4 - 1/2|$ guarantees that $H_2(p_1 + p_2) \leq H_2(p_1 + p_4)$, because $H_2(c) = H_2(1 - c)$, $\forall\, 0 \leq c \leq 1$, and $H_2(c') \leq H_2(c'')$ if $0 \leq c' \leq c'' \leq 1/2$. Therefore, $A \leq H_2(p_1 + p_4)$ and (3.A.1) holds.

Lemma 3.A.2. No matter which finite field $\mathbb{F}_q$ is chosen, $g$ given by (3.3.1) admits no presentation $k_1(x) + k_2(y) + k_3(z)$, where $k_i \in \mathbb{F}_q[1]$ for all feasible $i$.

Proof. Suppose otherwise, i.e. $k_1 \circ \mu_1 + k_2 \circ \mu_2 + k_3 \circ \mu_3 = \nu \circ g$ for some injections $\mu_1, \mu_2, \mu_3 : \{\alpha_0, \alpha_1\} \to \mathbb{F}_q$ and $\nu : \{\beta_0, \cdots, \beta_3\} \to \mathbb{F}_q$. By (3.3.1), we have
\begin{align*}
\nu(\beta_1) &= (k_1 \circ \mu_1)(\alpha_1) + (k_2 \circ \mu_2)(\alpha_0) + (k_3 \circ \mu_3)(\alpha_0) \\
&= (k_1 \circ \mu_1)(\alpha_0) + (k_2 \circ \mu_2)(\alpha_1) + (k_3 \circ \mu_3)(\alpha_1), \\
\nu(\beta_3) &= (k_1 \circ \mu_1)(\alpha_1) + (k_2 \circ \mu_2)(\alpha_1) + (k_3 \circ \mu_3)(\alpha_0) \\
&= (k_1 \circ \mu_1)(\alpha_0) + (k_2 \circ \mu_2)(\alpha_0) + (k_3 \circ \mu_3)(\alpha_1) \\
\Longrightarrow\ & \nu(\beta_1) - \nu(\beta_3) = \tau = -\tau \\
\Longrightarrow\ & \tau + \tau = 0, \tag{3.A.2}
\end{align*}
where $\tau = k_2(\mu_2(\alpha_0)) - k_2(\mu_2(\alpha_1))$.
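Since $\mu_2$ is injective, (3.A.2) implies that either $\tau = 0$ or $\operatorname{Char}(\mathbb{F}_q) = 2$ by Proposition 1.1.2. Note that $k_2(\mu_2(\alpha_0)) \neq k_2(\mu_2(\alpha_1))$, i.e. $\tau \neq 0$; otherwise, $\nu(\beta_1) = \nu(\beta_3)$, which contradicts the assumption that $\nu$ is injective. Thus, $\operatorname{Char}(\mathbb{F}_q) = 2$. Let $\rho = (k_3 \circ \mu_3)(\alpha_0) - (k_3 \circ \mu_3)(\alpha_1)$. Obviously, $\rho \neq 0$ for the same reason that $\tau \neq 0$, and $\rho + \rho = 0$ since $\operatorname{Char}(\mathbb{F}_q) = 2$. Therefore,
\[ \begin{aligned} \nu(\beta_0) &= (k_1 \circ \mu_1)(\alpha_0) + (k_2 \circ \mu_2)(\alpha_0) + (k_3 \circ \mu_3)(\alpha_0) \\ &= (k_1 \circ \mu_1)(\alpha_0) + (k_2 \circ \mu_2)(\alpha_0) + (k_3 \circ \mu_3)(\alpha_1) + \rho \\ &= \nu(\beta_3) + \rho \\ &= (k_1 \circ \mu_1)(\alpha_1) + (k_2 \circ \mu_2)(\alpha_1) + (k_3 \circ \mu_3)(\alpha_0) + \rho \\ &= (k_1 \circ \mu_1)(\alpha_1) + (k_2 \circ \mu_2)(\alpha_1) + (k_3 \circ \mu_3)(\alpha_1) + \rho + \rho \\ &= \nu(\beta_2) + 0 = \nu(\beta_2). \end{aligned} \]
This contradicts the assumption that $\nu$ is injective.

Remark 3.4. As a special case, this lemma implies that no matter which finite field $\mathbb{F}_q$ is chosen, $g$ defined by (3.3.1) has no polynomial presentation that is linear over $\mathbb{F}_q$. In contrast, $g$ admits the presentation $x + 2y + 3z \in \mathbb{Z}_4[3]$, which is a linear function over $\mathbb{Z}_4$.

Lemma 3.A.3. Let $(X_1, X_2, \cdots, X_l, Y) \sim q$. For any $\epsilon > 0$ and positive integer $n$, choose a sequence $\tilde{X}_j^n$ ($1 \le j \le l$) randomly from $\mathcal{T}_\epsilon(n, X_j)$ based on a uniform distribution. If $\mathbf{y} \in \mathcal{Y}^n$ is an $\epsilon$-typical sequence with respect to $Y$, then
\[ \Pr\left\{ (\tilde{X}_1^n, \tilde{X}_2^n, \cdots, \tilde{X}_l^n, Y^n) \in \mathcal{T}_\epsilon \,\middle|\, Y^n = \mathbf{y} \right\} \le 2^{-n \left[ \sum_{j=1}^{l} I(X_j; Y, X_1, X_2, \cdots, X_{j-1}) - 3l\epsilon \right]}. \]

Proof. Let $F_j$ be the event $\{ (\tilde{X}_1^n, \tilde{X}_2^n, \cdots, \tilde{X}_j^n, Y^n) \in \mathcal{T}_\epsilon \}$, $1 \le j \le l$, and $F_0 = \emptyset$. We have
\[ \begin{aligned} \Pr\left\{ (\tilde{X}_1^n, \tilde{X}_2^n, \cdots, \tilde{X}_l^n, Y^n) \in \mathcal{T}_\epsilon \,\middle|\, Y^n = \mathbf{y} \right\} &= \prod_{j=1}^{l} \Pr\left\{ F_j \mid Y^n = \mathbf{y}, F_{j-1} \right\} \\ &\le \prod_{j=1}^{l} 2^{-n \left[ I(X_j; Y, X_1, X_2, \cdots, X_{j-1}) - 3\epsilon \right]} \\ &= 2^{-n \left[ \sum_{j=1}^{l} I(X_j; Y, X_1, X_2, \cdots, X_{j-1}) - 3l\epsilon \right]}, \end{aligned} \]
since $\tilde{X}_1^n, \tilde{X}_2^n, \cdots, \tilde{X}_l^n, \mathbf{y}$ are generated independently.

Lemma 3.A.4.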
If $(Y_1, V_1, Y_2, V_2, \cdots, Y_s, V_s) \sim q$, and
\[ q(y_1, v_1, y_2, v_2, \cdots, y_s, v_s) = q(y_1, y_2, \cdots, y_s) \prod_{i=1}^{s} q(v_i | y_i), \]
then, $\forall\, J = \{j_1, j_2, \cdots, j_{|J|}\} \subseteq \{1, 2, \cdots, s\}$,
\[ I(Y_J; V_J | V_{J^c}) = \sum_{i=1}^{|J|} \left[ I(Y_{j_i}; V_{j_i}) - I(V_{j_i}; V_{J^c}, V_{j_1}, \cdots, V_{j_{i-1}}) \right]. \]

Chapter 4

Stochastic Complements and Supremus Typicality

As seen in the last two chapters, Lemma 2.1.2 is a very important foundation for most of the conclusions drawn. Tracing along the arguments, we see that Lemma 2.1.2 is the base for proving the achievability theorems regarding LCoR, Theorem 2.1.1 and its variations. In turn, LCoR is applied to Problem 2.1, Source Coding for Computing, and it is demonstrated that LCoR outperforms LCoF in various aspects. Therefore, if we want to re-establish some of the results drawn previously, say Theorem 2.1.1 and Theorem 3.2.1, for non-i.i.d. sources, then naturally a counterpart of Lemma 2.1.2 ought to be proved first.

Usually, generalising achievability results from the i.i.d. case to other stationary (or a.m.s.) ergodic scenarios is easy. That can be done by extending the typicality argument (from Shannon [SW49]) to the generalised scenario. Unfortunately, for our particular problem this process is not as straightforward as it usually is. To be more precise, it is possible to obtain expressions characterizing the achievable coding rates. However, these expressions are often hard to analyse or evaluate.

To overcome this drawback, we will introduce a new typicality concept, called Supremus typicality. Built on this, some results on LCoR from the previous two chapters are re-established, and they become easier to analyse and evaluate. In addition, we will see that the classical definition of typicality does not characterize the stochastic properties of (non-i.i.d.) sources well enough. This is essentially the reason for the insufficiency of the classical typical sequence argument mentioned before.
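In order to present the idea of Supremus typicality clearly in a simpler setting, we will only focus on the Markov source scenarios in this and the next chapters. The more universal settings (e.g. a.m.s. sources) are discussed in Chapter 6, after some mathematical tools from ergodic theory have been introduced.

4.1 Markov Chains and Stochastic Complements

4.1.1 Index Oriented Matrix Operations

Let $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ be three countable sets with or without orders defined, e.g. $\mathcal{X} = \{(0, 0), (0, 1), (1, 1), (1, 0)\}$ and $\mathcal{Y} = \{\alpha, \beta\} \times \mathbb{N}^+$. In many places hereafter, we write $[p_{i,j}]_{i \in \mathcal{X}, j \in \mathcal{Y}}$ ($[p_i]_{i \in \mathcal{X}}$) for a "matrix" ("vector") whose "$(i, j)$th" ("$i$th") entry is $p_{i,j}$ ($p_i$) $\in \mathbb{R}$. Matrices $[p'_{i,j}]_{i \in \mathcal{X}, j \in \mathcal{Y}}$ and $[q_{j,k}]_{j \in \mathcal{Y}, k \in \mathcal{Z}}$ are similarly defined. Let $\mathbf{P} = [p_{i,j}]_{i \in \mathcal{X}, j \in \mathcal{Y}}$. For subsets $A \subseteq \mathcal{X}$ and $B \subseteq \mathcal{Y}$, $\mathbf{P}_{A,B}$ designates the "submatrix" $[p_{i,j}]_{i \in A, j \in B}$. We will use "index oriented" operations, namely
\[ \begin{aligned} [p_i]_{i \in \mathcal{X}} \, [p_{i,j}]_{i \in \mathcal{X}, j \in \mathcal{Y}} &= \left[ \sum_{i \in \mathcal{X}} p_i p_{i,j} \right]_{j \in \mathcal{Y}}; \\ [p_{i,j}]_{i \in \mathcal{X}, j \in \mathcal{Y}} + [p'_{i,j}]_{i \in \mathcal{X}, j \in \mathcal{Y}} &= \left[ p_{i,j} + p'_{i,j} \right]_{i \in \mathcal{X}, j \in \mathcal{Y}}; \\ [p_{i,j}]_{i \in \mathcal{X}, j \in \mathcal{Y}} \, [q_{j,k}]_{j \in \mathcal{Y}, k \in \mathcal{Z}} &= \left[ \sum_{j \in \mathcal{Y}} p_{i,j} q_{j,k} \right]_{i \in \mathcal{X}, k \in \mathcal{Z}}. \end{aligned} \]
In addition, a matrix $\mathbf{P}_{A,A} = [p_{i,j}]_{i,j \in A}$ is said to be an identity matrix if and only if $p_{i,j} = \delta_{i,j}$ (Kronecker delta), $\forall\, i, j \in A$. We often indicate an identity matrix with $\mathbf{1}$, whose size is known from the context, while designating $\mathbf{0}$ as a zero matrix (all of whose entries are 0) of size known from the context. For any matrix $\mathbf{P}_{A,A}$, its inverse (if it exists) is some matrix $\mathbf{Q}_{A,A}$ such that $\mathbf{Q}_{A,A} \mathbf{P}_{A,A} = \mathbf{P}_{A,A} \mathbf{Q}_{A,A} = \mathbf{1}$.

Let $[p_i]_{i \in \mathcal{X}}$ be non-negative and unitary, i.e. $\sum_{i \in \mathcal{X}} p_i = 1$, and let $[p_{i,j}]_{i \in \mathcal{X}, j \in \mathcal{Y}}$ be non-negative with $\sum_{j \in \mathcal{Y}} p_{i,j} = 1$ for every $i \in \mathcal{X}$ (such a matrix is termed a stochastic matrix). For discrete random variables $X$ and $Y$ with sample spaces $\mathcal{X}$ and $\mathcal{Y}$, respectively, $X \sim [p_i]_{i \in \mathcal{X}}$ and $(X, Y) \sim [p_i]_{i \in \mathcal{X}} [p_{i,j}]_{i \in \mathcal{X}, j \in \mathcal{Y}}$ stand for $\Pr\{X = i\} = p_i$ and $\Pr\{X = i, Y = j\} = p_i p_{i,j}$, for all $i \in \mathcal{X}$ and $j \in \mathcal{Y}$, respectively.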
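The index oriented operations above can be sketched in a few lines of Python. This is only an illustration under our own conventions, not notation from the thesis: a "vector" is a dict keyed by index, a "matrix" is a dict of dicts, and the helper names `vec_mat`, `mat_mul` and `submatrix` are hypothetical.

```python
from fractions import Fraction

def vec_mat(p, P):
    """[p_i][p_{i,j}] = [sum_i p_i p_{i,j}]_j, with arbitrary hashable indices."""
    cols = {j for row in P.values() for j in row}
    return {j: sum(p[i] * P[i].get(j, 0) for i in p) for j in cols}

def mat_mul(P, Q):
    """[p_{i,j}][q_{j,k}] = [sum_j p_{i,j} q_{j,k}]_{i,k}."""
    cols = {k for row in Q.values() for k in row}
    return {i: {k: sum(P[i].get(j, 0) * Q[j].get(k, 0) for j in Q) for k in cols}
            for i in P}

def submatrix(P, A, B):
    """P_{A,B}: restrict rows to A and columns to B."""
    return {i: {j: P[i][j] for j in B} for i in A}

# Illustrative data: X = {(0,0), (0,1)} and Y = {"a", "b"} (unordered index sets).
p = {(0, 0): Fraction(1, 4), (0, 1): Fraction(3, 4)}
P = {(0, 0): {"a": Fraction(1, 2), "b": Fraction(1, 2)},
     (0, 1): {"a": Fraction(1, 3), "b": Fraction(2, 3)}}
Q = {"a": {0: 1, 1: 0}, "b": {0: 0, 1: 1}}  # relabels columns "a" -> 0, "b" -> 1

marginal = vec_mat(p, P)   # distribution of Y when (X, Y) ~ [p_i][p_{i,j}]
relabeled = mat_mul(P, Q)
```

Keying entries by the indices themselves, rather than by integer positions, mirrors the point of the section: no ordering of the index sets is needed for any of the operations.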
4.1.2 Markov Chains and Strong Markov Typical Sequences

Definition 4.1.1. A (discrete) Markov chain is defined to be a discrete stochastic process $M = \{X^{(n)}\}$ with state space $\mathcal{X}$ such that, $\forall\, n \in \mathbb{N}^+$,
\[ \Pr\left\{ X^{(n+1)} \,\middle|\, X^{(n)}, X^{(n-1)}, \cdots, X^{(1)} \right\} = \Pr\left\{ X^{(n+1)} \,\middle|\, X^{(n)} \right\}. \]
$M$ is said to be finite-state if $\mathcal{X}$ is finite.

Definition 4.1.2. A Markov chain $M = \{X^{(n)}\}$ is said to be homogeneous (time homogeneous) if and only if
\[ \Pr\left\{ X^{(n+1)} \,\middle|\, X^{(n)} \right\} = \Pr\left\{ X^{(2)} \,\middle|\, X^{(1)} \right\}, \quad \forall\, n \in \mathbb{N}^+. \]

If not specified, we assume that all Markov chains considered throughout this and the next chapters are finite-state and homogeneous. However, they are not necessarily stationary [CT06, p. 71], and their initial distributions may be unknown.

Definition 4.1.3. Given a Markov chain $M = \{X^{(n)}\}$ with state space $\mathcal{X}$, the transition matrix of $M$ is defined to be the stochastic matrix $\mathbf{P} = [p_{i,j}]_{i,j \in \mathcal{X}}$, where $p_{i,j} = \Pr\{ X^{(2)} = j \mid X^{(1)} = i \}$. Moreover, $M$ is said to be irreducible if and only if $\mathbf{P}$ is irreducible, namely, there exists no $\emptyset \neq A \subsetneq \mathcal{X}$ such that $\mathbf{P}_{A,A^c} = \mathbf{0}$.

Definition 4.1.4. A state $j$ of a Markov chain $M = \{X^{(n)}\}$ is said to be recurrent if $\Pr\{ T < \infty \mid X^{(0)} = j \} = 1$, where $T = \inf\{ n > 0 \mid X^{(n)} = j \}$. If in addition the conditional expectation $E\{ T \mid X^{(0)} = j \} < \infty$, then $j$ is said to be positive recurrent. $M$ is said to be positive recurrent if all states are positive recurrent.

Definition 4.1.5. A Markov chain (not necessarily finite-state) is said to be ergodic if and only if it is irreducible, positive recurrent and aperiodic.

Theorem 4.1.1 (Theorem 1.7.7 of [Nor98]). An irreducible Markov chain $M$ with state space $\mathcal{X}$ is positive recurrent if and only if it admits a non-negative unitary vector $\pi = [p_j]_{j \in \mathcal{X}}$ such that $\pi \mathbf{P} = \pi$, where $\mathbf{P}$ is the transition matrix of $M$. Moreover, $\pi$ is unique and is called the invariant (stationary) distribution.

Theorem 4.1.2 (Theorem 2.31 of [BB05]). A finite-state irreducible Markov chain is positive recurrent.
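As a small illustration of Definition 4.1.3 and Theorem 4.1.1, the following Python sketch tests irreducibility by brute force over all proper non-empty subsets $A$ and solves $\pi \mathbf{P} = \pi$, $\sum_i p_i = 1$ exactly. The three-state chain and the function names are our own assumptions for the example, and the subset enumeration is only sensible for small state spaces.

```python
from fractions import Fraction
from itertools import combinations

def is_irreducible(P):
    """Definition 4.1.3: no proper non-empty A with P_{A,A^c} = 0 (brute force)."""
    n = len(P)
    for size in range(1, n):
        for A in combinations(range(n), size):
            Ac = [j for j in range(n) if j not in A]
            if all(P[i][j] == 0 for i in A for j in Ac):
                return False
    return True

def solve(M, b):
    """Solve M z = b by Gauss-Jordan elimination over exact fractions."""
    n = len(M)
    aug = [list(M[r]) + [b[r]] for r in range(n)]
    for c in range(n):
        piv = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[piv] = aug[piv], aug[c]
        inv = Fraction(1) / aug[c][c]
        aug[c] = [v * inv for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [aug[r][n] for r in range(n)]

def invariant_distribution(P):
    """Solve pi P = pi with sum(pi) = 1; one balance equation is redundant."""
    n = len(P)
    M = [[P[i][j] - (1 if i == j else 0) for i in range(n)] for j in range(n - 1)]
    M.append([Fraction(1)] * n)
    b = [Fraction(0)] * (n - 1) + [Fraction(1)]
    return solve(M, b)

# Hypothetical three-state chain, chosen only for illustration.
half, quarter = Fraction(1, 2), Fraction(1, 4)
P = [[half, half, Fraction(0)],
     [quarter, half, quarter],
     [Fraction(0), half, half]]
pi = invariant_distribution(P)
```

Exact rational arithmetic is used so that the invariance $\pi \mathbf{P} = \pi$ can be checked with equality rather than up to rounding.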
Clearly, every irreducible Markov chain considered in this and the next chapters admits a unique invariant distribution$^{1}$ (which is not necessarily the initial distribution), since it is assumed to be simultaneously finite-state and homogeneous (unless otherwise specified).

$^{1}$ This can also be proved with the Perron–Frobenius Theorem [Per07, Fro12].

Definition 4.1.6 (Strong Markov Typicality (cf. [DLS81, Csi98])). Let $M = \{X^{(n)}\}$ be an irreducible Markov chain with state space $\mathcal{X}$, and let $\mathbf{P} = [p_{i,j}]_{i,j \in \mathcal{X}}$ and $\pi = [p_j]_{j \in \mathcal{X}}$ be its transition matrix and invariant distribution, respectively. For any $\epsilon > 0$, a sequence $\mathbf{x} \in \mathcal{X}^n$ of length $n$ ($\ge 2$) is said to be strong Markov $\epsilon$-typical with respect to $\mathbf{P}$ if
\[ \left| \frac{N(i, j; \mathbf{x})}{N(i; \mathbf{x})} - p_{i,j} \right| < \epsilon \quad \text{and} \quad \left| \frac{N(i; \mathbf{x})}{n} - p_i \right| < \epsilon, \quad \forall\, i, j \in \mathcal{X}, \]
where $N(i, j; \mathbf{x})$ is the number of occurrences of the sub-sequence $[i, j]$ in $\mathbf{x}$ and
\[ N(i; \mathbf{x}) = \sum_{j \in \mathcal{X}} N(i, j; \mathbf{x}). \]
The set of all strong Markov $\epsilon$-typical sequences with respect to $\mathbf{P}$ in $\mathcal{X}^n$ is denoted by $\mathcal{T}_\epsilon(n, \mathbf{P})$, or $\mathcal{T}_\epsilon$ for simplicity.

Let $\mathbf{P}$ and $\pi$ be some stochastic matrix and non-negative unitary vector. We define $H(\pi)$ and $H(\mathbf{P}|\pi)$ to be $H(X)$ and $H(Y|X)$, respectively, for jointly discrete random variables $(X, Y)$ such that $X \sim \pi$ and $(X, Y) \sim \pi \mathbf{P}$.

Proposition 4.1.1 (AEP of Strong Markov Typicality$^{2}$). Let $M = \{X^{(n)}\}$ be an irreducible Markov chain with state space $\mathcal{X}$, and let $\mathbf{P} = [p_{i,j}]_{i,j \in \mathcal{X}}$ and $\pi = [p_j]_{j \in \mathcal{X}}$ be its transition matrix and invariant distribution, respectively. For any $\eta > 0$, there exist $\epsilon_0 > 0$ and $N_0 \in \mathbb{N}^+$, such that, $\forall\, \epsilon_0 > \epsilon > 0$, $\forall\, n > N_0$ and $\forall\, \mathbf{x} = [x^{(1)}, x^{(2)}, \cdots, x^{(n)}] \in \mathcal{T}_\epsilon(n, \mathbf{P})$,

1. $\exp_2[-n(H(\mathbf{P}|\pi) + \eta)] < \Pr\{ [X^{(1)}, X^{(2)}, \cdots, X^{(n)}] = \mathbf{x} \} < \exp_2[-n(H(\mathbf{P}|\pi) - \eta)]$;

2. $\Pr\{ \mathbf{X} \notin \mathcal{T}_\epsilon(n, \mathbf{P}) \} < \eta$, where $\mathbf{X} = [X^{(1)}, X^{(2)}, \cdots, X^{(n)}]$; and

3. $|\mathcal{T}_\epsilon(n, \mathbf{P})| < \exp_2[n(H(\mathbf{P}|\pi) + \eta)]$.
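The two conditions of Definition 4.1.6 are straightforward to check mechanically. The sketch below is one possible reading of the definition, under stated assumptions: the transition matrix and invariant distribution are passed as dicts, the function name is hypothetical, and a state with $N(i; \mathbf{x}) = 0$ (for which the ratio in the definition is undefined) is treated as a violation.

```python
from collections import Counter

def is_strong_markov_typical(x, P, pi, eps):
    """Check Definition 4.1.6 for a sequence x (tuple or list of states).

    P[i][j] is the transition probability, pi[i] the invariant probability.
    Assumption: if N(i; x) = 0 the sequence is declared atypical.
    """
    n = len(x)
    pair_count = Counter(zip(x, x[1:]))   # N(i, j; x)
    state_count = Counter(x[:-1])         # N(i; x) = sum_j N(i, j; x)
    for i in P:
        if abs(state_count[i] / n - pi[i]) >= eps:
            return False
        if state_count[i] == 0:
            return False
        for j in P[i]:
            if abs(pair_count[(i, j)] / state_count[i] - P[i][j]) >= eps:
                return False
    return True

# Hypothetical two-state chain with uniform transitions.
P = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
pi = {0: 0.5, 1: 0.5}
balanced = (0, 0, 1, 1, 0, 0, 1, 1, 0)      # every transition occurs twice
alternating = (0, 1, 0, 1, 0, 1, 0, 1, 0)   # never stays in the same state
```

With $\epsilon = 0.1$, the first sequence satisfies both conditions, while the second has the right state frequencies but the wrong transition frequencies, which is exactly what the pair counts $N(i, j; \mathbf{x})$ are meant to detect.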
4.1.3 Stochastic Complements

Given a Markov chain $M = \{X^{(n)}\}$ with state space $\mathcal{X}$ and a non-empty subset $A$ of $\mathcal{X}$, let
\[ T_{A,l} = \begin{cases} \inf\left\{ n > 0 \mid X^{(n)} \in A \right\}; & l = 1, \\ \inf\left\{ n > T_{A,l-1} \mid X^{(n)} \in A \right\}; & l > 1, \\ \sup\left\{ n < T_{A,l+1} \mid X^{(n)} \in A \right\}; & l < 1. \end{cases} \]
It is well-known that $M_A = \{X^{(T_{A,l})}\}$ is Markov by the strong Markov property [Nor98, Theorem 1.4.2]. In particular, if $M$ is irreducible, so is $M_A$. To be more precise, if $M$ is irreducible, and we write its invariant distribution and transition matrix as $\pi = [p_i]_{i \in \mathcal{X}}$ and
\[ \mathbf{P} = \begin{bmatrix} \mathbf{P}_{A,A} & \mathbf{P}_{A,A^c} \\ \mathbf{P}_{A^c,A} & \mathbf{P}_{A^c,A^c} \end{bmatrix}, \]
respectively, then
\[ \mathbf{S}_A = \mathbf{P}_{A,A} + \mathbf{P}_{A,A^c} \left( \mathbf{1} - \mathbf{P}_{A^c,A^c} \right)^{-1} \mathbf{P}_{A^c,A} \]
is the transition matrix of $M_A$ [Mey89, Theorem 2.1 and Section 3]. Moreover,
\[ \pi_A = \left[ \frac{p_i}{\sum_{j \in A} p_j} \right]_{i \in A} \]
is an invariant distribution of $\mathbf{S}_A$, i.e. $\pi_A \mathbf{S}_A = \pi_A$ [Mey89, Theorem 2.2]. Since $M_A$ inherits irreducibility from $M$ [Mey89, Theorem 2.3], $\pi_A$ is unique. The matrix $\mathbf{S}_A$ is termed the stochastic complement of $\mathbf{P}_{A,A}$ in $\mathbf{P}$, while $M_A$ is named a reduced Markov chain (or reduced process) of $M$. Its state space is obviously $A$.

$^{2}$ Similar statements in the literature (cf. [DLS81, Csi98]) assume that the Markov chain is stationary ergodic. The result is easy to generalise to irreducible Markov chains. To be rigorous, we include a proof of the irreducible case in Section 4.A.1.

4.2 Supremus Typical Sequences

We define Supremus typical sequences in this section. This new concept is stronger in the sense of characterizing the stochastic behaviours of random processes/sources. Although this concept is only defined for and applied to Markov processes/sources in this chapter, the idea can be generalised to other random processes/sources, e.g. a.m.s. processes/sources [GK80]. Nevertheless, some background on ergodic theory is required. Thus, we leave the investigation of the more universal settings to Chapter 6.

Definition 4.2.1 (Supremus Typicality).
Following the notation defined in Section 4.1.3, given $\epsilon > 0$ and a sequence $\mathbf{x} = [x^{(1)}, x^{(2)}, \cdots, x^{(n)}] \in \mathcal{X}^n$ of length $n$ ($\ge 2|\mathcal{X}|$), let $\mathbf{x}_A$ be the subsequence of $\mathbf{x}$ formed by all those $x^{(l)}$'s that belong to $A$, in the original ordering. $\mathbf{x}$ is said to be Supremus $\epsilon$-typical with respect to $\mathbf{P}$ if and only if $\mathbf{x}_A$ is strong Markov $\epsilon$-typical with respect to $\mathbf{S}_A$ for any feasible non-empty subset $A$ of $\mathcal{X}$.

In Definition 4.2.1, the set of all Supremus $\epsilon$-typical sequences with respect to $\mathbf{P}$ in $\mathcal{X}^n$ is denoted by $\mathcal{S}_\epsilon(n, \mathbf{P})$, or $\mathcal{S}_\epsilon$ for simplicity. $\mathbf{x}_A$ is called a reduced subsequence (with respect to $A$) of $\mathbf{x}$. It follows immediately from the definition that

Proposition 4.2.1. Every reduced subsequence of a Supremus $\epsilon$-typical sequence is Supremus $\epsilon$-typical.

However, the above proposition does not hold for strong Markov $\epsilon$-typical sequences. Namely, a reduced subsequence of a strong Markov $\epsilon$-typical sequence is not necessarily strong Markov $\epsilon$-typical.

Example 4.2.1. Let $\{\alpha, \beta, \gamma\}$ be the state space of an i.i.d. process with a uniform distribution, i.e.
\[ \mathbf{P} = \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix}, \]
and $\mathbf{x} = (\alpha, \beta, \gamma, \alpha, \beta, \gamma, \alpha, \beta, \gamma)$. It is easy to verify that $\mathbf{x}$ is a strong Markov $5/12$-typical sequence. However, the reduced subsequence $\mathbf{x}_{\{\alpha,\gamma\}} = (\alpha, \gamma, \alpha, \gamma, \alpha, \gamma)$ is no longer a strong Markov $5/12$-typical sequence, because
\[ \mathbf{S}_{\{\alpha,\gamma\}} = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix} \]
and
\[ \left| \frac{\text{the number of subsequences } (\alpha, \alpha) \text{ in } \mathbf{x}_{\{\alpha,\gamma\}}}{6} - 0.5 \right| = |0 - 0.5| > \frac{5}{12}. \]

Proposition 4.2.2 (AEP of Supremus Typicality). Let $M = \{X^{(n)}\}$ be an irreducible Markov chain with state space $\mathcal{X}$, and let $\mathbf{P} = [p_{i,j}]_{i,j \in \mathcal{X}}$ and $\pi = [p_j]_{j \in \mathcal{X}}$ be its transition matrix and invariant distribution, respectively. For any $\eta > 0$, there exist $\epsilon_0 > 0$ and $N_0 \in \mathbb{N}^+$, such that, $\forall\, \epsilon_0 > \epsilon > 0$, $\forall\, n > N_0$ and $\forall\, \mathbf{x} = [x^{(1)}, x^{(2)}, \cdots, x^{(n)}] \in \mathcal{S}_\epsilon(n, \mathbf{P})$,

1. $\exp_2[-n(H(\mathbf{P}|\pi) + \eta)] < \Pr\{ [X^{(1)}, X^{(2)}, \cdots, X^{(n)}] = \mathbf{x} \} < \exp_2[-n(H(\mathbf{P}|\pi) - \eta)]$;

2. $\Pr\{ \mathbf{X} \notin \mathcal{S}_\epsilon(n, \mathbf{P}) \} < \eta$, where $\mathbf{X} = [X^{(1)}, X^{(2)}, \cdots, X^{(n)}]$; and

3.
$|\mathcal{S}_\epsilon(n, \mathbf{P})| < \exp_2[n(H(\mathbf{P}|\pi) + \eta)]$.

Proof. Note that $\mathcal{T}_\epsilon(n, \mathbf{P}) \supseteq \mathcal{S}_\epsilon(n, \mathbf{P})$. Thus, 1 and 3 are inherited from the AEP of strong Markov typicality. In addition, 2 can be proved without any difficulty, since any reduced Markov chain of $M$ is irreducible and the number of reduced Markov chains of $M$, namely $2^{|\mathcal{X}|} - 1$, is finite.

Remark 4.1. It is known that Shannon's (weak/strong) typical sequences [SW49] are defined to be those sequences "representing" the stochastic behaviour of the whole random process. To be more precise, a non (weak/strong) typical sequence is unlikely to be produced by the random procedure (Proposition 4.1.1). However, the study of induced transformations$^{3}$ in ergodic theory suggests that (weak/strong) typical sequences that are not Supremus typical also form a low probability set (see Theorem 6.3.4). As the random process evolves, it is highly likely that all reduced subsequences of the generated sequence also admit empirical distributions "close enough" to the genuine distributions of the corresponding reduced processes, as proved by Proposition 4.2.2. Therefore, Supremus typical sequences "represent" the random process better. This difference has been seen in Proposition 4.2.1 and Example 4.2.1, and will be seen again when comparing the two typicality lemmata, Lemma 4.2.1 and Lemma 4.2.2, given later.

$^{3}$ See Chapter 6 for the correspondence between an induced transformation and a reduced process of a random process (a dynamical system).

The following two typicality lemmata are the ring versions, tailored for our discussions, of the two given in Section 4.A.2, respectively. From these two lemmata, we will start to see the impact that the differences between classical typicality and Supremus typicality have on the analytic results.

Lemma 4.2.1.
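Let $\mathfrak{R}$ be a finite ring, and let $M = \{X^{(n)}\}$ be an irreducible Markov chain whose state space, transition matrix and invariant distribution are $\mathfrak{R}$, $\mathbf{P}$ and $\pi = [p_j]_{j \in \mathfrak{R}}$, respectively. For any $\eta > 0$, there exist $\epsilon_0 > 0$ and $N_0 \in \mathbb{N}^+$, such that, $\forall\, \epsilon_0 > \epsilon > 0$, $\forall\, n > N_0$, $\forall\, \mathbf{x} \in \mathcal{S}_\epsilon(n, \mathbf{P})$ and $\forall\, \mathfrak{I} \le_l \mathfrak{R}$,
\[ \begin{aligned} |\mathcal{S}_\epsilon(\mathbf{x}, \mathfrak{I})| &< \exp_2\left\{ n \left[ \sum_{A \in \mathfrak{R}/\mathfrak{I}} \sum_{j \in A} p_j H(\mathbf{S}_A | \pi_A) + \eta \right] \right\} && (4.2.1) \\ &= \exp_2\left\{ n \left[ H(\mathbf{S}_{\mathfrak{R}/\mathfrak{I}} | \pi) + \eta \right] \right\}, && (4.2.2) \end{aligned} \]
where
\[ \mathcal{S}_\epsilon(\mathbf{x}, \mathfrak{I}) = \left\{ \mathbf{y} \in \mathcal{S}_\epsilon(n, \mathbf{P}) \mid \mathbf{y} - \mathbf{x} \in \mathfrak{I}^n \right\}, \]
$\mathbf{S}_A$ is the stochastic complement of $\mathbf{P}_{A,A}$ in $\mathbf{P}$, $\pi_A = \left[ \frac{p_i}{\sum_{j \in A} p_j} \right]_{i \in A}$ is the invariant distribution of $\mathbf{S}_A$ and $\mathbf{S}_{\mathfrak{R}/\mathfrak{I}} = \operatorname{diag}\left\{ \{\mathbf{S}_A\}_{A \in \mathfrak{R}/\mathfrak{I}} \right\}$.

Remark 4.2. By definition, for any $\mathbf{y} \in \mathcal{S}_\epsilon(\mathbf{x}, \mathfrak{I})$ in Lemma 4.2.1, $\mathbf{y}$ and $\mathbf{x}$ follow the same sequential pattern, i.e. the $i$th coordinates of both sequences are from the same coset of $\mathfrak{I}$. If $\mathfrak{I} = \mathfrak{R}$, then $\mathcal{S}_\epsilon(\mathbf{x}, \mathfrak{I})$ is the whole set of Supremus $\epsilon$-typical sequences. It is well-known that evaluating the cardinality of the set of all (weak/strong) typical sequences is of great importance to the achievability part of the source coding theorem [SW73]. We will see in the next section that determining the number of (weak/strong/Supremus) typical sequences of a certain sequential pattern is also very important to the achievability result for linear coding over finite rings.

Proof of Lemma 4.2.1. Assume that $\mathbf{x} = [x^{(1)}, x^{(2)}, \cdots, x^{(n)}]$ and let $\mathbf{x}_A$ be the subsequence of $\mathbf{x}$ formed by all those $x^{(l)}$'s that belong to $A \in \mathfrak{R}/\mathfrak{I}$, in the original ordering. For any $\mathbf{y} = [y^{(1)}, y^{(2)}, \cdots, y^{(n)}] \in \mathcal{S}_\epsilon(\mathbf{x}, \mathfrak{I})$, obviously $y^{(l)} \in A$ if and only if $x^{(l)} \in A$ for all $A \in \mathfrak{R}/\mathfrak{I}$ and $1 \le l \le n$. Let $\mathbf{x}_A = [x^{(n_1)}, x^{(n_2)}, \cdots, x^{(n_{m_A})}]$ (note: $\sum_{A \in \mathfrak{R}/\mathfrak{I}} m_A = n$ and $\left| \frac{m_A}{n} - \sum_{j \in A} p_j \right| < |A| \epsilon + \frac{1}{n}$). By Proposition 4.2.1, $\mathbf{y}_A = [y^{(n_1)}, y^{(n_2)}, \cdots, y^{(n_{m_A})}] \in A^{m_A}$ is a Supremus $\epsilon$-typical sequence of length $m_A$ with respect to $\mathbf{S}_A$, since $\mathbf{y}$ is Supremus $\epsilon$-typical.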
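Additionally, by Proposition 4.2.2, there exist $\epsilon_A > 0$ and a positive integer $M_A$ such that the number of Supremus $\epsilon$-typical sequences of length $m_A$ is upper bounded by $\exp_2\{ m_A [H(\mathbf{S}_A | \pi_A) + \eta/2] \}$ if $0 < \epsilon < \epsilon_A$ and $m_A > M_A$. Therefore, if $0 < \epsilon < \min_{A \in \mathfrak{R}/\mathfrak{I}} \epsilon_A$ and
\[ n > M = \max_{A \in \mathfrak{R}/\mathfrak{I}} \left\{ \frac{1 + M_A}{\sum_{j \in A} p_j - |A| \epsilon} \right\} \]
(this guarantees that $m_A > M_A$ for all $A \in \mathfrak{R}/\mathfrak{I}$), then
\[ \begin{aligned} |\mathcal{S}_\epsilon(\mathbf{x}, \mathfrak{I})| &\le \exp_2\left\{ \sum_{A \in \mathfrak{R}/\mathfrak{I}} m_A \left[ H(\mathbf{S}_A | \pi_A) + \eta/2 \right] \right\} \\ &= \exp_2\left\{ n \left[ \sum_{A \in \mathfrak{R}/\mathfrak{I}} \frac{m_A}{n} H(\mathbf{S}_A | \pi_A) + \eta/2 \right] \right\}. \end{aligned} \]
Furthermore, choose $0 < \epsilon_0 \le \min_{A \in \mathfrak{R}/\mathfrak{I}} \epsilon_A$ and $N_0 \ge M$ such that
\[ \frac{m_A}{n} < \sum_{j \in A} p_j + \frac{\eta}{2 \sum_{A \in \mathfrak{R}/\mathfrak{I}} H(\mathbf{S}_A | \pi_A)} \]
for all $0 < \epsilon < \epsilon_0$, $n > N_0$ and $A \in \mathfrak{R}/\mathfrak{I}$. We then have
\[ |\mathcal{S}_\epsilon(\mathbf{x}, \mathfrak{I})| < \exp_2\left\{ n \left[ \sum_{A \in \mathfrak{R}/\mathfrak{I}} \sum_{j \in A} p_j H(\mathbf{S}_A | \pi_A) + \eta \right] \right\}, \]
and (4.2.1) is established. Direct calculation yields (4.2.2).

At this point, one might argue for replacing $\mathcal{S}_\epsilon(\mathbf{x}, \mathfrak{I})$ in Lemma 4.2.1 with $\mathcal{T}_\epsilon(\mathbf{x}, \mathfrak{I}) = \{ \mathbf{y} \in \mathcal{T}_\epsilon(n, \mathbf{P}) \mid \mathbf{y} - \mathbf{x} \in \mathfrak{I}^n \}$, the set of strong Markov $\epsilon$-typical sequences having the same sequential pattern as those from $\mathcal{S}_\epsilon(\mathbf{x}, \mathfrak{I})$, to keep the argument inside the classical typicality framework. Indeed, this change would make Lemma 4.2.1 a Markovian generalisation of Lemma 2.1.2. Unfortunately, a reduced subsequence of a sequence from $\mathcal{T}_\epsilon(\mathbf{x}, \mathfrak{I})$ is not necessarily strong Markov $\epsilon$-typical anymore (the counterpart of Proposition 4.2.1 fails). Thus, the same proof built on classical typicality does not go through. Another alternative is to consider weak typicality (defined below). However, even though a corresponding bound (see Lemma 4.2.2 below) can be obtained, this bound is often very hard to evaluate, as seen later.

Definition 4.2.2 (Modified Weak Typicality). Let $\{X^{(n)}\}$ be an irreducible Markov chain with transition matrix $\mathbf{P}$ and a finite state space $\mathcal{X}$. For any $\epsilon > 0$, a sequence $[x^{(1)}, x^{(2)}, \cdots, x^{(n)}] \in \mathcal{X}^n$ of length $n$ is said to be weak $\epsilon$-typical with respect to $\mathbf{P}$ if and only if
\[ \left| -\frac{1}{n} \log \Pr\left\{ \Gamma\left(X^{(l)}\right) = \Gamma\left(x^{(l)}\right), \forall\, 1 \le l \le n \right\} - H_{\Gamma, \mathcal{X}} \right| < \epsilon \]
for all feasible functions $\Gamma$, where $H_{\Gamma, \mathcal{X}}$ is the entropy rate of $\left\{ \Gamma\left(X^{(n)}\right) \right\}$.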
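In Definition 4.2.2, the set of all weak $\epsilon$-typical sequences with respect to $\mathbf{P}$ in $\mathcal{X}^n$ is denoted by $\mathcal{T}^H_\epsilon(n, \mathbf{P})$, or $\mathcal{T}^H_\epsilon$ for simplicity.

Proposition 4.2.3 (AEP of Modified Weak Typicality$^{4}$). In Definition 4.2.2, for any $\eta > 0$, there exists $N_0 > 0$ such that $\Pr\left\{ [X^{(1)}, X^{(2)}, \cdots, X^{(n)}] \notin \mathcal{T}^H_\epsilon(n, \mathbf{P}) \right\} < \eta$ for all $n > N_0$.

$^{4}$ The proof of this proposition is presented in Section 6.A.1, once the required background has been given.

Lemma 4.2.2. In Lemma 4.2.1, let $\mathcal{T}^H_\epsilon(\mathbf{x}, \mathfrak{I}) = \left\{ \mathbf{y} \in \mathcal{T}^H_\epsilon(n, \mathbf{P}) \mid \mathbf{y} - \mathbf{x} \in \mathfrak{I}^n \right\}$. It follows that
\[ \left| \mathcal{T}^H_\epsilon(\mathbf{x}, \mathfrak{I}) \right| < \exp_2\left\{ n \left[ H(\mathbf{P}|\pi) - \lim_{m \to \infty} \frac{1}{m} H\left( Y^{(m)}_{\mathfrak{R}/\mathfrak{I}}, Y^{(m-1)}_{\mathfrak{R}/\mathfrak{I}}, \cdots, Y^{(1)}_{\mathfrak{R}/\mathfrak{I}} \right) + 2\epsilon \right] \right\}, \quad (4.2.3) \]
where $Y^{(m)}_{\mathfrak{R}/\mathfrak{I}} = X^{(m)} + \mathfrak{I}$ is a random variable with sample space $\mathfrak{R}/\mathfrak{I}$.

Proof. Assume that $\mathbf{x} = [x^{(1)}, x^{(2)}, \cdots, x^{(n)}]$ and let $\mathbf{y} = [x^{(1)} + \mathfrak{I}, x^{(2)} + \mathfrak{I}, \cdots, x^{(n)} + \mathfrak{I}]$. For any $[y^{(1)}, y^{(2)}, \cdots, y^{(n)}] \in \mathcal{T}^H_\epsilon(\mathbf{x}, \mathfrak{I})$, obviously $y^{(l)} \in A$ if and only if $x^{(l)} \in A$ for all $A \in \mathfrak{R}/\mathfrak{I}$ and $1 \le l \le n$. Thus, $\mathbf{y} = [y^{(1)} + \mathfrak{I}, y^{(2)} + \mathfrak{I}, \cdots, y^{(n)} + \mathfrak{I}]$. As a consequence,
\[ \begin{aligned} \Pr\left\{ X^{(l)} + \mathfrak{I} = x^{(l)} + \mathfrak{I}, \forall\, 1 \le l \le n \right\} &\ge \sum_{[y^{(1)}, y^{(2)}, \cdots, y^{(n)}] \in \mathcal{T}^H_\epsilon(\mathbf{x}, \mathfrak{I})} \Pr\left\{ X^{(l)} = y^{(l)}, \forall\, 1 \le l \le n \right\} \\ &> \sum_{[y^{(1)}, y^{(2)}, \cdots, y^{(n)}] \in \mathcal{T}^H_\epsilon(\mathbf{x}, \mathfrak{I})} \exp_2\{ -n [H(\mathbf{P}|\pi) + \epsilon] \} \\ &= \left| \mathcal{T}^H_\epsilon(\mathbf{x}, \mathfrak{I}) \right| \exp_2\{ -n [H(\mathbf{P}|\pi) + \epsilon] \}, \end{aligned} \quad (4.2.4) \]
where the strict inequality holds since $[y^{(1)}, y^{(2)}, \cdots, y^{(n)}] \in \mathcal{T}^H_\epsilon$ (note: $\lim_{m \to \infty} \frac{1}{m} H\left( X^{(m)}, X^{(m-1)}, \cdots, X^{(1)} \right) = H(\mathbf{P}|\pi)$ since $M$ is irreducible Markov). On the other hand,
\[ \Pr\left\{ X^{(l)} + \mathfrak{I} = x^{(l)} + \mathfrak{I}, \forall\, 1 \le l \le n \right\} < \exp_2\left\{ -n \left[ \lim_{m \to \infty} \frac{1}{m} H\left( Y^{(m)}_{\mathfrak{R}/\mathfrak{I}}, Y^{(m-1)}_{\mathfrak{R}/\mathfrak{I}}, \cdots, Y^{(1)}_{\mathfrak{R}/\mathfrak{I}} \right) - \epsilon \right] \right\} \quad (4.2.5) \]
by Definition 4.2.2. Therefore,
\[ \left| \mathcal{T}^H_\epsilon(\mathbf{x}, \mathfrak{I}) \right| < \exp_2\left\{ n \left[ H(\mathbf{P}|\pi) - \lim_{m \to \infty} \frac{1}{m} H\left( Y^{(m)}_{\mathfrak{R}/\mathfrak{I}}, Y^{(m-1)}_{\mathfrak{R}/\mathfrak{I}}, \cdots, Y^{(1)}_{\mathfrak{R}/\mathfrak{I}} \right) + 2\epsilon \right] \right\} \]
by (4.2.4) and (4.2.5).

Remark 4.3. If $\mathfrak{R}$ in Lemma 4.2.1 is a field, then both (4.2.2) and (4.2.3) are equivalent to
\[ \exp_2[n(H(\mathbf{P}|\pi) + \eta)]. \quad (4.2.6) \]
Or, if $M$ in Lemma 4.2.1 is i.i.d., then both (4.2.2) and (4.2.3) are equivalent to
\[ \exp_2\left\{ n \left[ H\left(X^{(1)}\right) - H\left(Y^{(1)}_{\mathfrak{R}/\mathfrak{I}}\right) + \eta \right] \right\}, \quad (4.2.7) \]
which is a special case of the generalised conditional typicality lemma, Lemma 2.1.2.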
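The i.i.d. case of Remark 4.3 can be checked numerically. The Python sketch below computes the stochastic complement $\mathbf{S}_A = \mathbf{P}_{A,A} + \mathbf{P}_{A,A^c}(\mathbf{1} - \mathbf{P}_{A^c,A^c})^{-1}\mathbf{P}_{A^c,A}$ with a plain Gauss-Jordan inverse, evaluates the exponent $\sum_{A} \sum_{j \in A} p_j H(\mathbf{S}_A | \pi_A)$ of (4.2.1), and compares it with $H(X^{(1)}) - H(Y^{(1)}_{\mathfrak{R}/\mathfrak{I}})$ from (4.2.7). The distribution on $\mathbb{Z}_4$, the choice $\mathfrak{I} = \{0, 2\}$ and all function names are assumptions made for this illustration only.

```python
from math import log2, isclose

def inverse(M):
    """Gauss-Jordan inverse with partial pivoting (floats)."""
    n = len(M)
    aug = [[float(v) for v in M[r]] + [1.0 if r == c else 0.0 for c in range(n)]
           for r in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def stochastic_complement(P, A):
    """S_A = P_AA + P_AAc (1 - P_AcAc)^{-1} P_AcA for a list of states A."""
    Ac = [j for j in range(len(P)) if j not in A]
    sub = lambda rows, cols: [[P[i][j] for j in cols] for i in rows]
    PAA = sub(A, A)
    if not Ac:
        return PAA
    one_minus = [[(1.0 if r == c else 0.0) - P[i][j] for c, j in enumerate(Ac)]
                 for r, i in enumerate(Ac)]
    corr = matmul(matmul(sub(A, Ac), inverse(one_minus)), sub(Ac, A))
    return [[PAA[r][c] + corr[r][c] for c in range(len(A))] for r in range(len(A))]

def entropy(p):
    return -sum(v * log2(v) for v in p if v > 0)

def cond_entropy(S, pi):
    """H(S | pi) = -sum_i pi_i sum_j S_ij log2 S_ij."""
    return sum(pi[i] * entropy(S[i]) for i in range(len(S)))

def reduced_exponent(P, pi, cosets):
    """sum_A sum_{j in A} p_j H(S_A | pi_A), the exponent in (4.2.1)."""
    total = 0.0
    for A in cosets:
        mass = sum(pi[j] for j in A)
        pi_A = [pi[j] / mass for j in A]
        total += mass * cond_entropy(stochastic_complement(P, A), pi_A)
    return total

# Assumed i.i.d. source on Z_4: every row of P equals the marginal p.
p = [0.1, 0.2, 0.3, 0.4]
P = [p[:] for _ in range(4)]
cosets = [[0, 2], [1, 3]]                      # Z_4 / {0, 2}
lhs = reduced_exponent(P, p, cosets)
rhs = entropy(p) - entropy([sum(p[j] for j in A) for A in cosets])
S_even = stochastic_complement(P, [0, 2])
```

In the i.i.d. case every row of $\mathbf{S}_A$ equals $\pi_A$, so the agreement of `lhs` and `rhs` is just the grouping rule for entropy; for a general transition matrix only `reduced_exponent` applies.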
Remark 4.4. In Lemma 4.2.2, if $\mathbf{P} = c_1 \mathbf{U} + (1 - c_1) \mathbf{1}$ with all rows of $\mathbf{U}$ being identical and $0 \le c_1 \le 1$, then $M' = \left\{ Y^{(n)}_{\mathfrak{R}/\mathfrak{I}} \right\}$ is Markov by Lemma 4.A.3. As a conclusion,
\[ \begin{aligned} \left| \mathcal{T}^H_\epsilon(\mathbf{x}, \mathfrak{I}) \right| &< \exp_2\left\{ n \left[ H(\mathbf{P}|\pi) - \lim_{m \to \infty} H\left( Y^{(m)}_{\mathfrak{R}/\mathfrak{I}} \,\middle|\, Y^{(m-1)}_{\mathfrak{R}/\mathfrak{I}} \right) + 2\epsilon \right] \right\} \\ &= \exp_2\left\{ n \left[ H(\mathbf{P}|\pi) - H(\mathbf{P}'|\pi') + 2\epsilon \right] \right\}, \end{aligned} \]
where $\mathbf{P}'$ and $\pi'$ are the transition matrix and the invariant distribution of $M'$, which can be easily calculated from $\mathbf{P}$.

From Remark 4.3 and Remark 4.4, we have seen that the two bounds (4.2.2) and (4.2.3) coincide, and both can be easily calculated, for some special scenarios. Unfortunately, for general settings (when the initial distribution of $M$ is not known or $\mathbf{P} \neq c_1 \mathbf{U} + (1 - c_1) \mathbf{1}$ for any $\mathbf{U}$ of identical rows and any $c_1$), (4.2.3) becomes almost inaccessible because there is no efficient way to evaluate the entropy rate of $\left\{ Y^{(n)}_{\mathfrak{R}/\mathfrak{I}} \right\}$. On the other hand, (4.2.2) is always as straightforward as calculating a conditional entropy.

Example 4.2.2. Let $M$ be an irreducible Markov chain with state space $\mathbb{Z}_4 = \{0, 1, 2, 3\}$. Its transition matrix $\mathbf{P} = [p_{i,j}]_{i,j \in \mathbb{Z}_4}$ is given as follows.

          0        1        2        3
  0     .2597    .2093    .2713    .2597
  1     .1208    .0872    .6711    .1208
  2     .0184    .2627    .4101    .3088
  3     .0985    .1823    .2315    .4877
                                            (4.2.8)

Let $\mathfrak{I} = \{0, 2\}$. Notice that the initial distribution is unknown, and that $\mathbf{P} \neq c_1 \mathbf{U} + (1 - c_1) \mathbf{1}$ for any $\mathbf{U}$ of identical rows and any $c_1$. Thus, the upper bound on $\left| \mathcal{T}^H_\epsilon(\mathbf{x}, \mathfrak{I}) \right|$ from (4.2.3) is not very meaningful for calculation, since the entropy rate is not explicitly known. In contrast, we have that $|\mathcal{S}_\epsilon(\mathbf{x}, \mathfrak{I})| < 2^{n(0.8791 + \eta)}$ by (4.2.2).

The above is partially the reason why we forsake the traditional (weak/strong) typical sequence argument of Shannon [SW49], and introduce an argument based on Supremus typicality.

4.A Appendix

4.A.1 Proof of the AEP of Strong Markov Typicality

1. Let $\Pr\left\{ X^{(1)} = x^{(1)} \right\} = c$.
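By definition,
\[ \begin{aligned} \Pr\left\{ [X^{(1)}, X^{(2)}, \cdots, X^{(n)}] = \mathbf{x} \right\} &= \Pr\left\{ X^{(1)} = x^{(1)} \right\} \prod_{i,j \in \mathcal{X}} p_{i,j}^{N(i,j;\mathbf{x})} \\ &= c \exp_2\left[ \sum_{i,j \in \mathcal{X}} N(i, j; \mathbf{x}) \log p_{i,j} \right] \\ &= c \exp_2\left[ -n \sum_{i,j \in \mathcal{X}} -\frac{N(i; \mathbf{x})}{n} \frac{N(i, j; \mathbf{x})}{N(i; \mathbf{x})} \log p_{i,j} \right] \\ &= c \exp_2\left[ -n \sum_{i,j \in \mathcal{X}} \left( \left( p_i p_{i,j} - \frac{N(i; \mathbf{x})}{n} \frac{N(i, j; \mathbf{x})}{N(i; \mathbf{x})} \right) \log p_{i,j} - p_i p_{i,j} \log p_{i,j} \right) \right]. \end{aligned} \]
In addition, there exist a small enough $\epsilon_0 > 0$ and an $N_0 \in \mathbb{N}^+$ such that
\[ \left| \frac{N(i; \mathbf{x})}{n} \frac{N(i, j; \mathbf{x})}{N(i; \mathbf{x})} - p_i p_{i,j} \right| < \frac{-\eta}{2 |\mathcal{X}|^2 \min_{p_{i,j} \neq 0} \log p_{i,j}} \quad \text{and} \quad -\frac{\log c}{n} < \eta/2 \]
for all $\epsilon_0 > \epsilon > 0$ and $n > N_0$. Consequently,
\[ \begin{aligned} \Pr\left\{ [X^{(1)}, X^{(2)}, \cdots, X^{(n)}] = \mathbf{x} \right\} &> c \exp_2\left[ -n \sum_{i,j \in \mathcal{X}} \left( \frac{\eta \log p_{i,j}}{2 |\mathcal{X}|^2 \min_{p_{i,j} \neq 0} \log p_{i,j}} - p_i p_{i,j} \log p_{i,j} \right) \right] \\ &\ge c \exp_2\left[ -n \left( \frac{\eta}{2} - \sum_{i,j \in \mathcal{X}} p_i p_{i,j} \log p_{i,j} \right) \right] \\ &= \exp_2\left[ -n \left( -\frac{\log c}{n} + \frac{\eta}{2} + H(\mathbf{P}|\pi) \right) \right] \\ &> \exp_2[-n(\eta + H(\mathbf{P}|\pi))]. \end{aligned} \]
Similarly,
\[ \begin{aligned} \Pr\left\{ [X^{(1)}, X^{(2)}, \cdots, X^{(n)}] = \mathbf{x} \right\} &< c \exp_2\left[ -n \sum_{i,j \in \mathcal{X}} \left( \frac{-\eta \log p_{i,j}}{2 |\mathcal{X}|^2 \min_{p_{i,j} \neq 0} \log p_{i,j}} - p_i p_{i,j} \log p_{i,j} \right) \right] \\ &\le c \exp_2\left[ -n \left( -\frac{\eta}{2} - \sum_{i,j \in \mathcal{X}} p_i p_{i,j} \log p_{i,j} \right) \right] \\ &\le \exp_2\left[ -n \left( -\frac{\eta}{2} + H(\mathbf{P}|\pi) \right) \right] \\ &< \exp_2[-n(-\eta + H(\mathbf{P}|\pi))]. \end{aligned} \]

2. By Boole's inequality [Boo10, Fré35],
\[ \begin{aligned} \Pr\{ \mathbf{X} \notin \mathcal{T}_\epsilon(n, \mathbf{P}) \} &= \Pr\left( \bigcup_{i,j \in \mathcal{X}} \left\{ \left| \frac{N(i, j; \mathbf{X})}{N(i; \mathbf{X})} - p_{i,j} \right| \ge \epsilon \right\} \cup \bigcup_{i \in \mathcal{X}} \left\{ \left| \frac{N(i; \mathbf{X})}{n} - p_i \right| \ge \epsilon \right\} \right) \\ &\le \sum_{i,j \in \mathcal{X}} \Pr\left\{ \left| \frac{N(i, j; \mathbf{X})}{N(i; \mathbf{X})} - p_{i,j} \right| \ge \epsilon \,\middle|\, E \right\} + \sum_{i \in \mathcal{X}} \Pr\left\{ \left| \frac{N(i; \mathbf{X})}{n} - p_i \right| \ge \epsilon \right\}, \end{aligned} \]
where $E = \bigcap_{i \in \mathcal{X}} \left\{ \left| \frac{N(i; \mathbf{X})}{n} - p_i \right| < \epsilon \right\}$ for all feasible $i$. By the Ergodic Theorem of Markov chains [Nor98, Theorem 1.10.2],
\[ \Pr\left\{ \left| \frac{N(i; \mathbf{X})}{n} - p_i \right| \ge \epsilon \right\} \to 0 \text{ as } n \to \infty \]
for any $\epsilon > 0$. Thus, there is an integer $N_0'$ such that, for all $n > N_0'$,
\[ \Pr\left\{ \left| \frac{N(i; \mathbf{X})}{n} - p_i \right| \ge \epsilon \right\} < \frac{\eta}{2 |\mathcal{X}|}. \]
On the other hand, for $\min_{i \in \mathcal{X}} p_i / 2 > \epsilon > 0$ (note: $p_i > 0$, $\forall\, i \in \mathcal{X}$, because $\mathbf{P}$ is irreducible), $N(i; \mathbf{x}) \to \infty$ as $n \to \infty$, conditional on $E$. Therefore, by the Strong Law of Large Numbers [Nor98, Theorem 1.10.1],
\[ \Pr\left\{ \left| \frac{N(i, j; \mathbf{X})}{N(i; \mathbf{X})} - p_{i,j} \right| \ge \epsilon \,\middle|\, E \right\} \to 0 \text{ as } n \to \infty. \]
Hence, there exists $N_0''$ such that, for all $n > N_0''$,
\[ \Pr\left\{ \left| \frac{N(i, j; \mathbf{X})}{N(i; \mathbf{X})} - p_{i,j} \right| \ge \epsilon \,\middle|\, E \right\} < \frac{\eta}{2 |\mathcal{X}|^2}. \]
Let $N_0 = \max\{N_0', N_0''\}$ and $\epsilon_0 = \min_{i \in \mathcal{X}} p_i / 2 > 0$. We have $\Pr\{ \mathbf{X} \notin \mathcal{T}_\epsilon(n, \mathbf{P}) \} < \eta$ for all $\epsilon_0 > \epsilon > 0$ and $n > N_0$.

3.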
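Finally, let $\epsilon_0$ and $N_0$ be defined as in 1. $|\mathcal{T}_\epsilon(n, \mathbf{P})| < \exp_2[n(H(\mathbf{P}|\pi) + \eta)]$ follows since
\[ 1 \ge \sum_{\mathbf{x} \in \mathcal{T}_\epsilon(n, \mathbf{P})} \Pr\{ \mathbf{X} = \mathbf{x} \} > |\mathcal{T}_\epsilon(n, \mathbf{P})| \exp_2[-n(H(\mathbf{P}|\pi) + \eta)], \]
if $\epsilon_0 > \epsilon > 0$ and $n > N_0$.

Let $\epsilon_0$ be the smallest one chosen above and $N_0$ be the biggest one chosen. The statement is proved.

4.A.2 Typicality Lemmata of Supremus Typical Sequences

Given a set $\mathcal{X}$, a partition $\coprod_{k \in \mathcal{K}} A_k$ of $\mathcal{X}$ is a disjoint union of subsets of $\mathcal{X}$, i.e. $A_{k'} \cap A_{k''} \neq \emptyset \Leftrightarrow k' = k''$, $\bigcup_{k \in \mathcal{K}} A_k = \mathcal{X}$ and the $A_k$'s are not empty. Obviously, $\coprod_{A \in \mathfrak{R}/\mathfrak{I}} A$ is a partition of a ring $\mathfrak{R}$ given the left (right) ideal $\mathfrak{I}$.

Lemma 4.A.1. Let $M = \{X^{(n)}\}$ be an irreducible Markov chain with finite state space $\mathcal{X}$, transition matrix $\mathbf{P}$ and invariant distribution $\pi = [p_j]_{j \in \mathcal{X}}$. Let $\coprod_{k=1}^{m} A_k$ be any partition of $\mathcal{X}$. For any $\eta > 0$, there exist $\epsilon_0 > 0$ and $N_0 \in \mathbb{N}^+$, such that, $\forall\, \epsilon_0 > \epsilon > 0$, $\forall\, n > N_0$ and $\forall\, \mathbf{x} = [x^{(1)}, x^{(2)}, \cdots, x^{(n)}] \in \mathcal{S}_\epsilon(n, \mathbf{P})$,
\[ \begin{aligned} |\mathcal{S}_\epsilon(\mathbf{x})| &< \exp_2\left\{ n \left[ \sum_{k=1}^{m} \sum_{j \in A_k} p_j H(\mathbf{S}_k | \pi_k) + \eta \right] \right\} && (4.A.1) \\ &= \exp_2\{ n [H(\mathbf{S}|\pi) + \eta] \}, && (4.A.2) \end{aligned} \]
where
\[ \mathcal{S}_\epsilon(\mathbf{x}) = \left\{ [y^{(1)}, y^{(2)}, \cdots, y^{(n)}] \in \mathcal{S}_\epsilon(n, \mathbf{P}) \,\middle|\, y^{(l)} \in A_k \Leftrightarrow x^{(l)} \in A_k, \forall\, 1 \le l \le n, \forall\, 1 \le k \le m \right\}, \]
$\mathbf{S}_k$ is the stochastic complement of $\mathbf{P}_{A_k,A_k}$ in $\mathbf{P}$, $\pi_k = \frac{[p_i]_{i \in A_k}}{\sum_{j \in A_k} p_j}$ is the invariant distribution of $\mathbf{S}_k$ and $\mathbf{S} = \operatorname{diag}\left\{ \{\mathbf{S}_k\}_{1 \le k \le m} \right\}$.

Proof. Let $\mathbf{x}_{A_k} = [x^{(n_1)}, x^{(n_2)}, \cdots, x^{(n_{m_k})}]$ be the subsequence of $\mathbf{x}$ formed by all those $x^{(l)}$'s that belong to $A_k$, in the original ordering. Obviously, $\sum_{k=1}^{m} m_k = n$ and $\left| \frac{m_k}{n} - \sum_{j \in A_k} p_j \right| < |A_k| \epsilon + \frac{1}{n}$. For any $\mathbf{y} = [y^{(1)}, y^{(2)}, \cdots, y^{(n)}] \in \mathcal{S}_\epsilon(\mathbf{x})$,
\[ \mathbf{y}_{A_k} = [y^{(n_1)}, y^{(n_2)}, \cdots, y^{(n_{m_k})}] \in A_k^{m_k} \]
is a Supremus $\epsilon$-typical sequence of length $m_k$ with respect to $\mathbf{S}_k$ by Proposition 4.2.1, since $\mathbf{y}$ is Supremus $\epsilon$-typical. Additionally, by Proposition 4.2.2, there exist $\epsilon_k > 0$ and a positive integer $M_k$ such that the number of Supremus $\epsilon$-typical sequences of length $m_k$ is upper bounded by $\exp_2\{ m_k [H(\mathbf{S}_k | \pi_k) + \eta/2] \}$ if $0 < \epsilon < \epsilon_k$ and $m_k > M_k$.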
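Therefore, if $0 < \epsilon < \min_{1 \le k \le m} \epsilon_k$ and
\[ n > M = \max_{1 \le k \le m} \left\{ \frac{1 + M_k}{\sum_{j \in A_k} p_j - |A_k| \epsilon} \right\} \]
(this guarantees that $m_k > M_k$ for all $1 \le k \le m$), then
\[ \begin{aligned} |\mathcal{S}_\epsilon(\mathbf{x})| &\le \exp_2\left\{ \sum_{k=1}^{m} m_k \left[ H(\mathbf{S}_k | \pi_k) + \eta/2 \right] \right\} \\ &= \exp_2\left\{ n \left[ \sum_{k=1}^{m} \frac{m_k}{n} H(\mathbf{S}_k | \pi_k) + \eta/2 \right] \right\}. \end{aligned} \]
Furthermore, choose $0 < \epsilon_0 \le \min_{1 \le k \le m} \epsilon_k$ and $N_0 \ge M$ such that
\[ \frac{m_k}{n} < \sum_{j \in A_k} p_j + \frac{\eta}{2 \sum_{k=1}^{m} H(\mathbf{S}_k | \pi_k)} \]
for all $0 < \epsilon < \epsilon_0$, $n > N_0$ and $1 \le k \le m$. We then have
\[ |\mathcal{S}_\epsilon(\mathbf{x})| < \exp_2\left\{ n \left[ \sum_{k=1}^{m} \sum_{j \in A_k} p_j H(\mathbf{S}_k | \pi_k) + \eta \right] \right\}, \]
and (4.A.1) is established. Direct calculation yields (4.A.2).

By definition, $\mathcal{S}_\epsilon(\mathbf{x})$ in Lemma 4.A.1 contains the Supremus $\epsilon$-typical sequences that have the same sequential pattern as $\mathbf{x}$ regarding the partition $\coprod_{k=1}^{m} A_k$. Similarly, let $\mathcal{T}^H_\epsilon(\mathbf{x})$ be the set of weak $\epsilon$-typical sequences with the same sequential pattern as $\mathbf{x}$ regarding the partition $\coprod_{k=1}^{m} A_k$, namely
\[ \mathcal{T}^H_\epsilon(\mathbf{x}) = \left\{ [y^{(1)}, y^{(2)}, \cdots, y^{(n)}] \in \mathcal{T}^H_\epsilon(n, \mathbf{P}) \,\middle|\, y^{(l)} \in A_k \Leftrightarrow x^{(l)} \in A_k, \forall\, 1 \le l \le n, \forall\, 1 \le k \le m \right\}. \]
We have that

Lemma 4.A.2. In Lemma 4.A.1, define $\Gamma(x) = l \Leftrightarrow x \in A_l$. We have that
\[ \left| \mathcal{T}^H_\epsilon(\mathbf{x}) \right| < \exp_2\left\{ n \left[ H(\mathbf{P}|\pi) - \lim_{w \to \infty} \frac{1}{w} H\left( Y^{(w)}, Y^{(w-1)}, \cdots, Y^{(1)} \right) + 2\epsilon \right] \right\}, \]
where $Y^{(w)} = \Gamma\left( X^{(w)} \right)$.

Proof. Let $\mathbf{y} = \left[ \Gamma\left(x^{(1)}\right), \Gamma\left(x^{(2)}\right), \cdots, \Gamma\left(x^{(n)}\right) \right]$. By definition,
\[ \left[ \Gamma\left(y^{(1)}\right), \Gamma\left(y^{(2)}\right), \cdots, \Gamma\left(y^{(n)}\right) \right] = \mathbf{y} \]
for any $[y^{(1)}, y^{(2)}, \cdots, y^{(n)}] \in \mathcal{T}^H_\epsilon(\mathbf{x})$. As a consequence,
\[ \begin{aligned} \Pr\left\{ \Gamma\left(X^{(l)}\right) = \Gamma\left(x^{(l)}\right), \forall\, 1 \le l \le n \right\} &\ge \sum_{[y^{(1)}, y^{(2)}, \cdots, y^{(n)}] \in \mathcal{T}^H_\epsilon(\mathbf{x})} \Pr\left\{ X^{(l)} = y^{(l)}, \forall\, 1 \le l \le n \right\} \\ &> \sum_{[y^{(1)}, y^{(2)}, \cdots, y^{(n)}] \in \mathcal{T}^H_\epsilon(\mathbf{x})} \exp_2\{ -n [H(\mathbf{P}|\pi) + \epsilon] \} \\ &= \left| \mathcal{T}^H_\epsilon(\mathbf{x}) \right| \exp_2\{ -n [H(\mathbf{P}|\pi) + \epsilon] \}, \end{aligned} \quad (4.A.3) \]
where the strict inequality holds since $[y^{(1)}, y^{(2)}, \cdots, y^{(n)}] \in \mathcal{T}^H_\epsilon$ (note: $\lim_{m \to \infty} \frac{1}{m} H\left( X^{(m)}, X^{(m-1)}, \cdots, X^{(1)} \right) = H(\mathbf{P}|\pi)$ since $M$ is irreducible Markov). On the other hand,
\[ \Pr\left\{ \Gamma\left(X^{(l)}\right) = \Gamma\left(x^{(l)}\right), \forall\, 1 \le l \le n \right\} < \exp_2\left\{ -n \left[ \lim_{w \to \infty} \frac{1}{w} H\left( Y^{(w)}, Y^{(w-1)}, \cdots, Y^{(1)} \right) - \epsilon \right] \right\} \quad (4.A.4) \]
by Definition 4.2.2. Therefore,
\[ \left| \mathcal{T}^H_\epsilon(\mathbf{x}) \right| < \exp_2\left\{ n \left[ H(\mathbf{P}|\pi) - \lim_{w \to \infty} \frac{1}{w} H\left( Y^{(w)}, Y^{(w-1)}, \cdots, Y^{(1)} \right) + 2\epsilon \right] \right\} \]
by (4.A.3) and (4.A.4).

Remark 4.5. Given a left ideal $\mathfrak{I}$ of a finite ring $\mathfrak{R}$, $\mathfrak{R}/\mathfrak{I}$ gives rise to a partition of $\mathfrak{R}$.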
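Let $\mathcal{X} = \mathfrak{R}$, $m = |\mathfrak{R}/\mathfrak{I}|$ and $A_k$ ($1 \le k \le m$) be an element (which is a set) of $\mathfrak{R}/\mathfrak{I}$. Lemma 4.2.1 and Lemma 4.2.2 then follow immediately. In fact, Lemma 4.A.1 and Lemma 4.A.2 can be easily tailored to corresponding versions regarding other algebraic structures, e.g. groups, rngs, vector spaces, modules and algebras, in a similar fashion.

4.A.3 A Supporting Lemma

Lemma 4.A.3. Let $\{X^{(n)}\}$ be a Markov chain with countable state space $\mathcal{X}$ and transition matrix $\mathbf{P}_0$. If $\mathbf{P}_0 = c_1 \mathbf{U} + (1 - c_1) \mathbf{1}$, where $\mathbf{U}$ is a matrix all of whose rows are identical to some countably infinite unitary vector and $0 \le c_1 \le 1$, then $\left\{ \Gamma\left(X^{(n)}\right) \right\}$ is Markov for all feasible functions $\Gamma$.

Proof. Let $Y^{(n)} = \Gamma\left(X^{(n)}\right)$, and assume that $[u_x]_{x \in \mathcal{X}}$ is the first row of $\mathbf{U}$. For any $a, b \in \Gamma(\mathcal{X})$,
\[ \begin{aligned} & \Pr\left\{ Y^{(n+1)} = b \,\middle|\, Y^{(n)} = a \right\} \\ &= \sum_{x \in \Gamma^{-1}(a)} \Pr\left\{ X^{(n)} = x, Y^{(n+1)} = b \,\middle|\, Y^{(n)} = a \right\} \\ &= \sum_{x \in \Gamma^{-1}(a)} \Pr\left\{ Y^{(n+1)} = b \,\middle|\, X^{(n)} = x, Y^{(n)} = a \right\} \Pr\left\{ X^{(n)} = x \,\middle|\, Y^{(n)} = a \right\} \\ &= \sum_{x \in \Gamma^{-1}(a)} \Pr\left\{ Y^{(n+1)} = b \,\middle|\, X^{(n)} = x \right\} \Pr\left\{ X^{(n)} = x \,\middle|\, Y^{(n)} = a \right\} \\ &= \begin{cases} \sum_{x \in \Gamma^{-1}(a)} \sum_{x' \in \Gamma^{-1}(b)} c_1 u_{x'} \Pr\left\{ X^{(n)} = x \,\middle|\, Y^{(n)} = a \right\}; & a \neq b \\ \sum_{x \in \Gamma^{-1}(a)} \left[ 1 - c_1 + \sum_{x' \in \Gamma^{-1}(b)} c_1 u_{x'} \right] \Pr\left\{ X^{(n)} = x \,\middle|\, Y^{(n)} = a \right\}; & a = b \end{cases} \\ &= \begin{cases} c_1 \sum_{x' \in \Gamma^{-1}(b)} u_{x'} \sum_{x \in \Gamma^{-1}(a)} \Pr\left\{ X^{(n)} = x \,\middle|\, Y^{(n)} = a \right\}; & a \neq b \\ \left[ 1 - c_1 + c_1 \sum_{x' \in \Gamma^{-1}(b)} u_{x'} \right] \sum_{x \in \Gamma^{-1}(a)} \Pr\left\{ X^{(n)} = x \,\middle|\, Y^{(n)} = a \right\}; & a = b \end{cases} \\ &= \begin{cases} c_1 \sum_{x' \in \Gamma^{-1}(b)} u_{x'}; & a \neq b \\ 1 - c_1 + c_1 \sum_{x' \in \Gamma^{-1}(b)} u_{x'}; & a = b \end{cases} \\ &= \sum_{x' \in \Gamma^{-1}(b)} \Pr\left\{ X^{(n+1)} = x' \,\middle|\, X^{(n)} = x \right\} \quad \left( \forall\, x \in \Gamma^{-1}(a) \right) \\ &= \sum_{x' \in \Gamma^{-1}(b)} \Pr\left\{ X^{(n+1)} = x' \,\middle|\, X^{(n)} = x \right\} \Pr\left\{ Y^{(n)} = a \,\middle|\, Y^{(n)} = a, Y^{(n-1)}, \cdots \right\} \quad \left( \forall\, x \in \Gamma^{-1}(a) \right) \\ &= \sum_{x \in \Gamma^{-1}(a)} \sum_{x' \in \Gamma^{-1}(b)} \Pr\left\{ X^{(n+1)} = x' \,\middle|\, X^{(n)} = x, Y^{(n)} = a, Y^{(n-1)}, \cdots \right\} \Pr\left\{ X^{(n)} = x \,\middle|\, Y^{(n)} = a, Y^{(n-1)}, \cdots \right\} \\ &= \sum_{x \in \Gamma^{-1}(a)} \sum_{x' \in \Gamma^{-1}(b)} \Pr\left\{ X^{(n+1)} = x', X^{(n)} = x \,\middle|\, Y^{(n)} = a, Y^{(n-1)}, \cdots \right\} \\ &= \Pr\left\{ Y^{(n+1)} = b \,\middle|\, Y^{(n)} = a, Y^{(n-1)}, \cdots \right\}. \end{aligned} \]
Therefore, $\left\{ \Gamma\left(X^{(n)}\right) \right\}$ is Markov.

Remark 4.6. Lemma 4.A.3 is inspired by [BR58, Theorem 3]. However, $\{X^{(n)}\}$ in this lemma is not necessarily stationary or finite-state.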
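The key step in the proof of Lemma 4.A.3 is that $\sum_{x' \in \Gamma^{-1}(b)} \Pr\{X^{(n+1)} = x' \mid X^{(n)} = x\}$ takes the same value for every $x \in \Gamma^{-1}(a)$. The Python sketch below checks this row-sum condition for a finite example; it illustrates only that step, not the whole proof, and the vector `u`, the constant `c1`, the map `gamma` and the function names are hypothetical choices.

```python
from fractions import Fraction

def mixture_chain(u, c1):
    """Transition matrix c1 * U + (1 - c1) * 1, where every row of U equals u."""
    n = len(u)
    return [[c1 * u[j] + ((1 - c1) if i == j else 0) for j in range(n)]
            for i in range(n)]

def lumped_rows(P, gamma):
    """For each state x, the probabilities of moving into each class of gamma."""
    labels = sorted(set(gamma))
    return {x: tuple(sum(row[xp] for xp in range(len(P)) if gamma[xp] == b)
                     for b in labels)
            for x, row in enumerate(P)}

def has_constant_class_sums(P, gamma):
    """True if states with the same gamma-value have identical lumped rows."""
    rows = lumped_rows(P, gamma)
    return all(rows[x] == rows[y]
               for x in rows for y in rows if gamma[x] == gamma[y])

# States 0..3 read as Z_4; gamma maps each state to its coset of {0, 2}.
gamma = [0, 1, 0, 1]
u = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
P_mix = mixture_chain(u, Fraction(1, 3))

# A generic chain (assumed for contrast) that violates the condition.
P_gen = [[Fraction(1, 2), Fraction(1, 2), Fraction(0), Fraction(0)],
         [Fraction(0), Fraction(0), Fraction(1), Fraction(0)],
         [Fraction(0), Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)],
         [Fraction(1), Fraction(0), Fraction(0), Fraction(0)]]
```

For `P_mix` the condition holds for any choice of `gamma`, in line with the lemma. A generic chain such as `P_gen` typically fails it, which is why the entropy rate of the lumped process in (4.2.3) is in general hard to evaluate.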
Chapter 5

Irreducible Markov Sources

Equipped with the foundation laid down by Proposition 4.2.2, Lemma 4.2.1 and Lemma 4.2.2, we resume our discussion of the Markov source network problem. First, recall Problem 2.1 and redefine it for the more universal settings as follows.

Problem 5.1 (Source Coding for Computing a Function of Sources with or without Memory). Let $t \in \mathcal{S} = \{1, 2, \cdots, s\}$ be a source that randomly generates discrete data
\[ \cdots, X_t^{(1)}, X_t^{(2)}, \cdots, X_t^{(n)}, \cdots, \]
where $X_t^{(n)}$ has a finite sample space $\mathcal{X}_t$ for all $n \in \mathbb{N}^+$. Given a discrete function $g : \mathcal{X}_\mathcal{S} \to \mathcal{Y}$, what is the biggest region $\mathcal{R}[g] \subset \mathbb{R}^s$ satisfying, $\forall\, (R_1, R_2, \cdots, R_s) \in \mathcal{R}[g]$ and $\forall\, \epsilon > 0$, $\exists\, N_0 \in \mathbb{N}^+$, such that, $\forall\, n > N_0$, there exist $s$ encoders $\phi_t : \mathcal{X}_t^n \to \left[ 1, 2^{n R_t} \right]$, $t \in \mathcal{S}$, and one decoder $\psi : \prod_{t \in \mathcal{S}} \left[ 1, 2^{n R_t} \right] \to \mathcal{Y}^n$ with
\[ \Pr\left\{ Y^n \neq \psi\left[ \phi_1(X_1^n), \cdots, \phi_s(X_s^n) \right] \right\} < \epsilon, \]
where $Y^{(j)} = g\left( X_\mathcal{S}^{(j)} \right)$ for all $1 \le j \le n$.

The difference between Problem 2.1 and Problem 5.1 lies in the restriction on the sources. Problem 2.1 is obviously a special case where the sources are i.i.d. A more general case where $\left\{ X_\mathcal{S}^{(n)} \right\}$ is asymptotically mean stationary is discussed in Chapter 7. In this chapter, we will investigate the situation where $g$ is irreducible Markovian$^{1}$. We will extend results on LCoR from the previous chapters to the Markovian settings based on the Supremus typicality argument. Once again, it is shown that LCoR dominates its field counterpart in various aspects. Moreover, it is seen that our approach even provides solutions to some particular situations with $\left\{ X_\mathcal{S}^{(n)} \right\}$ being a non-ergodic stationary source.

$^{1}$ A Markovian function is defined to be a Markov process that is a function of another arbitrary process [BR58].

5.1 Linear Coding over Finite Rings for Irreducible Markov Sources

As a special case, Problem 5.1 with $s = 1$, $g$ being an identity function and $M = \left\{ X_1^{(n)} \right\} = \left\{ Y^{(n)} \right\}$ being irreducible Markov reduces to the Markov source compression problem.
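It is known from [Cov75] that the achievable coding rate region for compressing $M$ is $\{R \in \mathbb{R} \mid R > H(\mathbf{P}|\pi)\}$, where $\mathbf{P}$ and $\pi$ are the transition matrix and invariant distribution of $M$, respectively. Unfortunately, the structures of the encoders used in [Cov75] are unclear (as are those of their Slepian–Wolf counterparts), which limits their application (to Problem 5.1), as we will see in later sections. On the other hand, we have seen from Chapter 3 that the linear coding technique is of great use when applied to this problem. Thus, it is important to first reproduce the achievability theorem, Theorem 2.1.1, in the Markovian setting.

Theorem 5.1.1. Assume that $s = 1$, $\mathcal{X}_1 = \mathcal{Y}$ is some finite ring $\mathfrak{R}$ and $g$ is an identity function in Problem 5.1, and additionally $\{X_1^{(n)}\} = \{Y^{(n)}\}$ is irreducible Markov with transition matrix $\mathbf{P}$ and invariant distribution $\pi$. We have that
\[
R > \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} \min\left\{ H(\mathbf{S}_{\mathfrak{R}/\mathfrak{I}}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m \to \infty} \frac{1}{m} H\left(Y_{\mathfrak{R}/\mathfrak{I}}^{(m)}, Y_{\mathfrak{R}/\mathfrak{I}}^{(m-1)}, \cdots, Y_{\mathfrak{R}/\mathfrak{I}}^{(1)}\right) \right\}, \tag{5.1.1}
\]
where
\[
\mathbf{S}_{\mathfrak{R}/\mathfrak{I}} = \mathrm{diag}\left\{ \{\mathbf{S}_A\}_{A \in \mathfrak{R}/\mathfrak{I}} \right\}
\]
with $\mathbf{S}_A$ being the stochastic complement of $\mathbf{P}_{A,A}$ in $\mathbf{P}$ and $Y_{\mathfrak{R}/\mathfrak{I}}^{(i)} = X_1^{(i)} + \mathfrak{I}$, is achievable with linear coding over $\mathfrak{R}$. To be more precise, for any $\epsilon > 0$, there is an $N_0 \in \mathbb{N}^+$ such that there exist a linear encoder $\phi : \mathfrak{R}^n \to \mathfrak{R}^k$ and a decoder $\psi : \mathfrak{R}^k \to \mathfrak{R}^n$ for all $n > N_0$ with
\[
\Pr\{\psi(\phi(Y^n)) \neq Y^n\} < \epsilon,
\]
provided that
\[
k > \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{n}{\log |\mathfrak{I}|} \min\left\{ H(\mathbf{S}_{\mathfrak{R}/\mathfrak{I}}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m \to \infty} \frac{1}{m} H\left(Y_{\mathfrak{R}/\mathfrak{I}}^{(m)}, Y_{\mathfrak{R}/\mathfrak{I}}^{(m-1)}, \cdots, Y_{\mathfrak{R}/\mathfrak{I}}^{(1)}\right) \right\}.
\]

Proof. Part One: Let $r_{\mathfrak{R}/\mathfrak{I}} = H(\mathbf{S}_{\mathfrak{R}/\mathfrak{I}}|\pi)$ and $R_0 = \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} r_{\mathfrak{R}/\mathfrak{I}}$. For any $R > R_0$ and $n \in \mathbb{N}^+$, let $k = \left\lfloor \frac{nR}{\log |\mathfrak{R}|} \right\rfloor$. Obviously, for any $0 < \eta < \min_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{I}|}{\log |\mathfrak{R}|} \cdot \frac{R - R_0}{2}$, if $n > \frac{2 \log |\mathfrak{I}|}{\eta}$, then
\[
\frac{\log |\mathfrak{I}|}{\log |\mathfrak{R}|} R_0 - \frac{k}{n} \log |\mathfrak{I}| < \frac{\log |\mathfrak{I}|}{\log |\mathfrak{R}|} R - 2\eta - \frac{k}{n} \log |\mathfrak{I}| \le \frac{\log |\mathfrak{I}|}{n} - 2\eta < -3\eta/2.
\]
Let $N_0' = \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{2 \log |\mathfrak{I}|}{\eta}$. We have that, for all $n > N_0'$,
\[
r_{\mathfrak{R}/\mathfrak{I}} + \eta - \frac{k}{n} \log |\mathfrak{I}| \le \frac{\log |\mathfrak{I}|}{\log |\mathfrak{R}|} R_0 + \eta - \frac{k}{n} \log |\mathfrak{I}| < -\eta/2. \tag{5.1.2}
\]
The following proves that $R$ is achievable with linear coding over $\mathfrak{R}$.

Encoding: Choose some $n \in \mathbb{N}^+$ and generate a $k \times n$ matrix $\mathbf{A}$ over $\mathfrak{R}$ uniformly at random (independently choose each entry of $\mathbf{A}$ from $\mathfrak{R}$ uniformly at random). Let the encoder be the linear mapping $\phi : \mathbf{x} \mapsto \mathbf{A}\mathbf{x}$, $\forall\, \mathbf{x} \in \mathfrak{R}^n$. Notice that the coding rate is
\[
\frac{1}{n} \log |\phi(\mathfrak{R}^n)| \le \frac{1}{n} \log |\mathfrak{R}|^k = \frac{\log |\mathfrak{R}|}{n} \left\lfloor \frac{nR}{\log |\mathfrak{R}|} \right\rfloor \le R.
\]

Decoding: Choose an $\epsilon > 0$. Assume that $\mathbf{z} \in \mathfrak{R}^k$ is the output of the encoder. The decoder claims that $\mathbf{x} \in \mathfrak{R}^n$ is the original data sequence if and only if
1. $\mathbf{x} \in \mathcal{S}_\epsilon(n, \mathbf{P})$; and
2. $\forall\, \mathbf{x}' \in \mathcal{S}_\epsilon(n, \mathbf{P})$, if $\mathbf{x}' \neq \mathbf{x}$, then $\phi(\mathbf{x}') \neq \mathbf{z}$.
In other words, the decoder $\psi$ maps $\mathbf{z}$ to $\mathbf{x}$.

Error: Assume that $\mathbf{X} \in \mathfrak{R}^n$ is the original data sequence generated. An error occurs if and only if
$E_1$: $\mathbf{X} \notin \mathcal{S}_\epsilon(n, \mathbf{P})$; or
$E_2$: there exists $\mathbf{x}' \in \mathcal{S}_\epsilon(n, \mathbf{P})$, $\mathbf{x}' \neq \mathbf{X}$, such that $\phi(\mathbf{x}') = \phi(\mathbf{X})$.

Error Probability: We claim that there exist $N_0 \in \mathbb{N}^+$ and $\epsilon_0 > 0$ such that, if $n > N_0$ and $\epsilon_0 > \epsilon > 0$, then $\Pr\{\psi(\phi(\mathbf{X})) \neq \mathbf{X}\} = \Pr\{E_1 \cup E_2\} < \eta$. First of all, by the AEP of Supremus typicality (Proposition 4.2.2), there exist $N_0'' \in \mathbb{N}^+$ and $\epsilon_0'' > 0$ such that $\Pr\{E_1\} < \eta/2$ if $n > N_0''$ and $\epsilon_0'' > \epsilon > 0$. Secondly, let $E_1^c$ be the complement of $E_1$. We have
\begin{align}
\Pr\{E_2 \mid E_1^c\} &\le \sum_{\mathbf{x}' \in \mathcal{S}_\epsilon \setminus \{\mathbf{X}\}} \Pr\{\phi(\mathbf{x}') = \phi(\mathbf{X}) \mid E_1^c\} \notag \\
&\le \sum_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \sum_{\mathbf{x}' \in \mathcal{S}_\epsilon(\mathbf{X}, \mathfrak{I}) \setminus \{\mathbf{X}\}} \Pr\{\phi(\mathbf{x}') = \phi(\mathbf{X}) \mid E_1^c\} \tag{5.1.3} \\
&< \sum_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \exp_2\left[n(r_{\mathfrak{R}/\mathfrak{I}} + \eta)\right] |\mathfrak{I}|^{-k} \tag{5.1.4} \\
&\le \left(2^{|\mathfrak{R}|} - 2\right) \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \exp_2\left[n\left(r_{\mathfrak{R}/\mathfrak{I}} + \eta - \frac{k}{n} \log |\mathfrak{I}|\right)\right] \tag{5.1.5} \\
&< \left(2^{|\mathfrak{R}|} - 2\right) \exp_2(-n\eta/2), \tag{5.1.6}
\end{align}
where (5.1.3) follows from the fact that $\mathcal{S}_\epsilon(n, \mathbf{P}) = \bigcup_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \mathcal{S}_\epsilon(\mathbf{X}, \mathfrak{I})$; (5.1.4) is from Lemma 4.2.1 and Lemma 2.1.1, and it is required that $\epsilon$ is smaller than some $\epsilon_0''' > 0$ and $n$ is larger than some $N_0''' \in \mathbb{N}^+$; (5.1.5) is due to the fact that the number of non-trivial left ideals of $\mathfrak{R}$ is bounded by $2^{|\mathfrak{R}|} - 2$; (5.1.6) is from (5.1.2), and it is required that $n > N_0'$.

Let $N_0 = \max\left\{ N_0', N_0'', N_0''', \frac{2}{\eta} \log\left[\frac{2}{\eta}\left(2^{|\mathfrak{R}|} - 2\right)\right] \right\}$ and $\epsilon_0 = \min\{\epsilon_0'', \epsilon_0'''\}$. We have that
\[
\Pr\{E_2 \mid E_1^c\} < \eta/2 \quad \text{and} \quad \Pr\{E_1\} < \eta/2
\]
if $n > N_0$ and $\epsilon_0 > \epsilon > 0$. Hence, $\Pr\{E_1 \cup E_2\} \le \Pr\{E_2 \mid E_1^c\} + \Pr\{E_1\} < \eta$. This says that $R$ is achievable with linear coding over $\mathfrak{R}$.

Part Two: If we define $r_{\mathfrak{R}/\mathfrak{I}}$ to be
\[
H(\mathbf{P}|\pi) - \lim_{m \to \infty} \frac{1}{m} H\left(Y_{\mathfrak{R}/\mathfrak{I}}^{(m)}, Y_{\mathfrak{R}/\mathfrak{I}}^{(m-1)}, \cdots, Y_{\mathfrak{R}/\mathfrak{I}}^{(1)}\right)
\]
and replace $\mathcal{S}_\epsilon(n, \mathbf{P})$ with $\mathcal{T}_H(n, \mathbf{P})$, then the conclusion that
\[
R > \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} r_{\mathfrak{R}/\mathfrak{I}}
\]
is achievable with linear coding over $\mathfrak{R}$ follows from a similar proof based on the AEP of modified weak typicality (Proposition 4.2.3) and Lemma 4.2.2.

Finally, the theorem is established by a time sharing argument.

Remark 5.1. In Part One of the proof of Theorem 5.1.1, we use the Supremus typicality encoding-decoding technique, in contrast to the classical (weak) typical sequence argument. Technically speaking, if one uses a classical (weak) typical sequence argument, Lemma 4.2.1 will not apply. Consequently, the classical argument will only achieve the inner bound
\[
R > \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} \left[ H(\mathbf{P}|\pi) - \lim_{m \to \infty} \frac{1}{m} H\left(Y_{\mathfrak{R}/\mathfrak{I}}^{(m)}, Y_{\mathfrak{R}/\mathfrak{I}}^{(m-1)}, \cdots, Y_{\mathfrak{R}/\mathfrak{I}}^{(1)}\right) \right] \tag{5.1.7}
\]
of (5.1.1). Similarly, the inner bound
\[
R > \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} H(\mathbf{S}_{\mathfrak{R}/\mathfrak{I}}|\pi) \tag{5.1.8}
\]
is achieved if only Lemma 4.2.1 (but not Lemma 4.2.2) is applied. Obviously, (5.1.1) is the union of these two inner bounds. However, as we have mentioned before, (5.1.7) is hard to access in general because it involves the entropy rate. Thus, based on (5.1.7), it is often hard to draw an optimality conclusion regarding compressing a Markov source, as seen below.

Example 5.1.1. Let $M$ be an irreducible Markov chain with state space $\mathbb{Z}_4 = \{0, 1, 2, 3\}$ and transition matrix $\mathbf{P} = [p_{i,j}]_{i,j \in \mathbb{Z}_4}$ defined by (4.2.8). With simple calculation, (5.1.8) says that
\[
R > \max\{1.8629, 1.7582\} = H(\mathbf{P}|\pi), \tag{5.1.9}
\]
where $\pi$ is the invariant distribution of $M$, is achievable with linear coding over $\mathbb{Z}_4$. Optimality is attained, i.e.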
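(5.1.1) and (5.1.8) coincide with the optimal achievable region (cf. [Cov75]). On the contrary, the achievable rate (5.1.7) drawn from the classical typicality argument does not lead to the same optimality conclusion. This is because there is no efficient method to evaluate the entropy rate in (5.1.7), since neither is the initial distribution known, nor is $\mathbf{P} = c_1 \mathbf{U} + (1 - c_1)\mathbf{1}$ for any $\mathbf{U}$ with identical rows and any $c_1$ (see Remark 4.4).

Generally speaking, $\mathcal{X}$ or $\mathcal{Y}$ is not necessarily associated with any algebraic structure. In order to apply the linear encoder, we usually assume that $\mathcal{Y}$ in Problem 5.1 is mapped into a finite ring $\mathfrak{R}$ of order at least $|\mathcal{Y}|$ by some injection $\Phi : \mathcal{Y} \to \mathfrak{R}$, and denote the set of all possible injections by $\mathcal{I}(\mathcal{Y}, \mathfrak{R})$.

Theorem 5.1.2. Assume that $s = 1$, $g$ is an identity function and $\{X_1^{(n)}\} = \{Y^{(n)}\}$ is irreducible Markov with transition matrix $\mathbf{P}$ and invariant distribution $\pi$ in Problem 5.1. For a finite ring $\mathfrak{R}$ of order at least $|\mathcal{Y}|$ and $\forall\, \Phi \in \mathcal{I}(\mathcal{Y}, \mathfrak{R})$, let
\[
r_\Phi = \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} \min\left\{ H(\mathbf{S}_{\Phi,\mathfrak{I}}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m \to \infty} \frac{1}{m} H\left(Y_{\mathfrak{R}/\mathfrak{I}}^{(m)}, Y_{\mathfrak{R}/\mathfrak{I}}^{(m-1)}, \cdots, Y_{\mathfrak{R}/\mathfrak{I}}^{(1)}\right) \right\},
\]
where
\[
\mathbf{S}_{\Phi,\mathfrak{I}} = \mathrm{diag}\left\{ \left\{\mathbf{S}_{\Phi^{-1}(A)}\right\}_{A \in \mathfrak{R}/\mathfrak{I}} \right\}
\]
with $\mathbf{S}_{\Phi^{-1}(A)}$ being the stochastic complement of $\mathbf{P}_{\Phi^{-1}(A),\Phi^{-1}(A)}$ in $\mathbf{P}$ and $Y_{\mathfrak{R}/\mathfrak{I}}^{(m)} = \Phi\left(X_1^{(m)}\right) + \mathfrak{I}$, and define
\[
\mathcal{R}_\Phi = \{R \in \mathbb{R} \mid R > r_\Phi\}.
\]
We have that
\[
\bigcup_{\Phi \in \mathcal{I}(\mathcal{Y}, \mathfrak{R})} \mathcal{R}_\Phi \tag{5.1.10}
\]
is achievable with linear coding over $\mathfrak{R}$.

Proof. The result follows immediately from Theorem 5.1.1 by a time sharing argument.

Remark 5.2. In Theorem 5.1.2, assume that $\mathcal{Y}$ is some finite ring itself, and let $\tau$ be the identity mapping in $\mathcal{I}(\mathcal{Y}, \mathcal{Y})$. It could happen that $\mathcal{R}_\tau \subsetneq \mathcal{R}_\Phi$ for some $\Phi \in \mathcal{I}(\mathcal{Y}, \mathcal{Y})$. This implies that the region given by (5.1.1) could be strictly smaller than (5.1.10). Therefore, a "reordering" of the elements of the ring $\mathcal{Y}$ is required when seeking better linear encoders.

Remark 5.3.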
By Lemma 4.A.3, if, in Theorem 5.1.1, $\mathbf{P} = c_1 \mathbf{U} + (1 - c_1)\mathbf{1}$ with $\mathbf{U}$ of identical rows and $0 \le c_1 \le 1$, then
\[
R > \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} \min\left\{ H(\mathbf{S}_{\mathfrak{R}/\mathfrak{I}}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m \to \infty} H\left(Y_{\mathfrak{R}/\mathfrak{I}}^{(m)} \,\middle|\, Y_{\mathfrak{R}/\mathfrak{I}}^{(m-1)}\right) \right\}
\]
is achievable with linear coding over $\mathfrak{R}$. Similarly, if $\mathbf{P} = c_1 \mathbf{U} + (1 - c_1)\mathbf{1}$ in Theorem 5.1.2, then, for all $\Phi \in \mathcal{I}(\mathcal{Y}, \mathfrak{R})$,
\[
\mathcal{R}_\Phi = \left\{ R \in \mathbb{R} \,\middle|\, R > \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} \min\left\{ H(\mathbf{S}_{\Phi,\mathfrak{I}}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m \to \infty} H\left(Y_{\mathfrak{R}/\mathfrak{I}}^{(m)} \,\middle|\, Y_{\mathfrak{R}/\mathfrak{I}}^{(m-1)}\right) \right\} \right\}.
\]

Although the achievable regions presented in the above theorems look involved, they depict the optimal one in many situations, i.e. (5.1.10) (or (5.1.1)) coincides with the optimal bound $R > H(\mathbf{P}|\pi)$. This has been demonstrated in Example 5.1.1 above, and more is shown in the following.

Corollary 5.1.1. In Theorem 5.1.1 (Theorem 5.1.2), if $\mathfrak{R}$ is a finite field, then
\[
R > H(\mathbf{P}|\pi) \quad \left( \mathcal{R}_\Phi = \{R \in \mathbb{R} \mid R > H(\mathbf{P}|\pi)\},\ \forall\, \Phi \in \mathcal{I}(\mathcal{Y}, \mathfrak{R}) \right)
\]
is achievable with linear coding over $\mathfrak{R}$.

Proof. If $\mathfrak{R}$ is a finite field, then $\mathfrak{R}$ is the only non-trivial left ideal of itself. The statement follows, since $\mathbf{S}_{\mathfrak{R}/\mathfrak{R}} = \mathbf{P}$ ($\mathbf{S}_{\Phi,\mathfrak{R}} = \mathbf{P}$) and $H\left(Y_{\mathfrak{R}/\mathfrak{R}}^{(m)}\right) = 0$ for all feasible $m$.

Corollary 5.1.2. In Theorem 5.1.2, if $\mathbf{P}$ describes an i.i.d. process, i.e. the row vectors of $\mathbf{P}$ are identical to $\pi = [p_j]_{j \in \mathcal{Y}}$, then
\[
\mathcal{R}_\Phi = \left\{ R \in \mathbb{R} \,\middle|\, R > \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} \left[H(\pi) - H(\pi_{\Phi,\mathfrak{I}})\right] \right\}, \quad \forall\, \Phi \in \mathcal{I}(\mathcal{Y}, \mathfrak{R}),
\]
where $\pi_{\Phi,\mathfrak{I}} = \left[\sum_{j \in \Phi^{-1}(A)} p_j\right]_{A \in \mathfrak{R}/\mathfrak{I}}$, is achievable with linear coding over $\mathfrak{R}$. In particular, if
1. $\mathfrak{R}$ is a field with $|\mathfrak{R}| \ge |\mathcal{Y}|$; or
2. $\mathfrak{R}$, with $|\mathfrak{R}| \ge |\mathcal{Y}|$, contains one and only one proper non-trivial left ideal $\mathfrak{I}_0$ and $|\mathfrak{I}_0| = \sqrt{|\mathfrak{R}|}$; or
3. $\mathfrak{R}$ is a product ring of several rings satisfying condition 1 or 2,
then $\bigcup_{\Phi \in \mathcal{I}(\mathcal{Y}, \mathfrak{R})} \mathcal{R}_\Phi = \{R \in \mathbb{R} \mid R > H(\pi)\}$.

Proof. The first half of the statement follows from Theorem 5.1.2 by direct calculation. The second half is from Theorem 2.3.3 and Theorem 2.3.5.

Remark 5.4. Concrete examples of the finite ring from Corollary 5.1.2 include, but are not limited to:
1. $\mathbb{Z}_p$, where $p \ge |\mathcal{Y}|$ is a prime, as a finite field;
2. $\mathbb{Z}_{p^2}$ and $M_{L,p} = \left\{ \begin{bmatrix} x & 0 \\ y & x \end{bmatrix} \,\middle|\, x, y \in \mathbb{Z}_p \right\}$, where $p \ge |\mathcal{Y}|$ is a prime;
3. $M_{L,p_1} \times \mathbb{Z}_{p_2}$, where $p_1 \ge |\mathcal{Y}|$ and $p_2 \ge |\mathcal{Y}|$ are primes.

Since there always exists a prime $p$ with $p^2 \ge |\mathcal{Y}|$ in Theorem 5.1.2, Corollary 5.1.2 guarantees that there always exist optimal linear encoders over some non-field ring, say $\mathbb{Z}_{p^2}$ or $M_{L,p}$, if the source is i.i.d.

Corollary 5.1.2 can be generalized to the multiple sources scenario in a memoryless setting (see Theorem 2.3.3 and Theorem 2.3.5). More precisely, the Slepian–Wolf region is always achieved with linear coding over some non-field ring. Unfortunately, it is neither proved nor disproved that a corresponding existence conclusion holds for the (single or multivariate [FCC+02]) Markov source(s) scenario. Nevertheless, Example 5.1.1, Corollary 5.1.2 and the conclusions from Chapter 2 do support such an assertion, each to its own extent. Even though it is unproved that linear coding over a non-field ring is optimal for the special case of Problem 5.1 considered in this section, it will be seen in later sections that linear coding over a non-field ring strictly outperforms its field counterpart in other settings of this problem.

5.2 Source Coding for Computing Markovian Functions

We now move on to a more general setting of Problem 5.1, where both $s$ and $g$ are arbitrary. Generally speaking, $\mathcal{R}[g]$ is unknown when $g$ is not an identity function (e.g. the binary sum), and it is larger (strictly in many cases) than the Slepian–Wolf region. However, not much is known for the case of sources with memory. Let
\[
\mathcal{R}_s = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, \sum_{t \in T} R_t > \lim_{n \to \infty} \frac{1}{n} \left[ H\left(X_{\mathcal{S}}^{(n)}, X_{\mathcal{S}}^{(n-1)}, \cdots, X_{\mathcal{S}}^{(1)}\right) - H\left(X_{T^c}^{(n)}, X_{T^c}^{(n-1)}, \cdots, X_{T^c}^{(1)}\right) \right],\ \emptyset \neq T \subseteq \mathcal{S} \right\}\text{²}, \tag{5.2.1}
\]
where $T^c = \mathcal{S} \setminus T$. By [Cov75], if the process $\cdots, X_{\mathcal{S}}^{(1)}, X_{\mathcal{S}}^{(2)}, \cdots, X_{\mathcal{S}}^{(n)}, \cdots$ is jointly ergodic³ (stationary ergodic), then $\mathcal{R}_s = \mathcal{R}[g]$ for an identity function $g$. Naturally, $\mathcal{R}_s$ is an inner bound for $\mathcal{R}[g]$ in the case of an arbitrary $g$.
But $\mathcal{R}_s$ is not always tight (optimal), i.e. $\mathcal{R}_s \subsetneq \mathcal{R}[g]$, as we will demonstrate in Example 5.2.1 below. Even for the special scenario of correlated i.i.d. sources, i.e. $\cdots, X_{\mathcal{S}}^{(1)}, X_{\mathcal{S}}^{(2)}, \cdots, X_{\mathcal{S}}^{(n)}, \cdots$ is i.i.d., $\mathcal{R}_s$, which is then the Slepian–Wolf region, is not tight (optimal) in general. Unfortunately, little is mentioned in the existing literature regarding the situation that $\cdots, X_{\mathcal{S}}^{(1)}, X_{\mathcal{S}}^{(2)}, \cdots, X_{\mathcal{S}}^{(n)}, \cdots$ is not memoryless, nor for the case that $\cdots, Y^{(1)}, Y^{(2)}, \cdots, Y^{(n)}, \cdots$ is homogeneous Markov (which does not necessarily imply that $\cdots, X_{\mathcal{S}}^{(1)}, X_{\mathcal{S}}^{(2)}, \cdots, X_{\mathcal{S}}^{(n)}, \cdots$ is jointly ergodic or homogeneous Markov (see Example 5.2.2)).

² Assume the limits exist.
³ Jointly ergodic, as defined by Cover [Cov75], is equivalent to stationary ergodic, a condition supporting the Shannon–McMillan–Breiman Theorem. Stationary ergodic is a special case of a.m.s. ergodic [GK80]. The latter is a sufficient and necessary condition for the Point-wise Ergodic Theorem to hold [GK80, Theorem 1]. The Shannon–McMillan–Breiman Theorem holds under this universal condition as well [GK80].

In this section, we will address Problem 5.1 by assuming that $g$ admits a Markovian polynomial (nomographic) presentation. We will show that the linear coding approach is strictly better than the one from [Cov75], and that it even offers solutions when [Cov75] does not apply. Furthermore, in Section 5.3, we will once again demonstrate that LCoR has a strict upper hand over its field counterpart in terms of achieving better coding rates, even in non-i.i.d. settings.

Example 5.2.1. Consider three sources 1, 2 and 3 generating random data $X_1^{(i)}$, $X_2^{(i)}$ and $X_3^{(i)}$ (at time $i \in \mathbb{N}^+$) whose sample spaces are $\mathcal{X}_1 = \mathcal{X}_2 = \mathcal{X}_3 = \{0, 1\} \subsetneq \mathbb{Z}_4$, respectively. Let $g : \mathcal{X}_1 \times \mathcal{X}_2 \times \mathcal{X}_3 \to \mathbb{Z}_4$ be defined as
\[
g : (x_1, x_2, x_3) \mapsto x_1 + 2x_2 + 3x_3, \tag{5.2.2}
\]
and assume that $\{X^{(n)}\}$, where $X^{(i)} = \left(X_1^{(i)}, X_2^{(i)}, X_3^{(i)}\right)$, forms a Markov chain with the transition matrix below (rows index the current state and columns the next state, so that each row sums to one up to rounding):
\[
\begin{array}{c|cccccccc}
 & (0,0,0) & (0,0,1) & (0,1,0) & (0,1,1) & (1,0,0) & (1,0,1) & (1,1,0) & (1,1,1) \\ \hline
(0,0,0) & .1397 & .4060 & .0097 & .0097 & .0097 & .0097 & .4060 & .0097 \\
(0,0,1) & .0097 & .5360 & .0097 & .0097 & .0097 & .0097 & .4060 & .0097 \\
(0,1,0) & .0097 & .4060 & .1397 & .0097 & .0097 & .0097 & .4060 & .0097 \\
(0,1,1) & .0097 & .4060 & .0097 & .1397 & .0097 & .0097 & .4060 & .0097 \\
(1,0,0) & .0097 & .4060 & .0097 & .0097 & .1397 & .0097 & .4060 & .0097 \\
(1,0,1) & .0097 & .4060 & .0097 & .0097 & .0097 & .1397 & .4060 & .0097 \\
(1,1,0) & .0097 & .4060 & .0097 & .0097 & .0097 & .0097 & .5360 & .0097 \\
(1,1,1) & .0097 & .4060 & .0097 & .0097 & .0097 & .0097 & .4060 & .1397
\end{array}
\]
In order to recover $g$ at the decoder, one solution is to apply Cover's method [Cov75] to first decode the original data and then compute $g$. This results in an achievable region
\[
\mathcal{R}_3 = \left\{ (R_1, R_2, R_3) \in \mathbb{R}^3 \,\middle|\, \sum_{t \in T} R_t > \lim_{m \to \infty} \left[ H\left(X^{(m)} \,\middle|\, X^{(m-1)}\right) - H\left(X_{T^c}^{(m)} \,\middle|\, X_{T^c}^{(m-1)}\right) \right],\ \emptyset \neq T \subseteq \{1, 2, 3\} \right\}.
\]
However, $\mathcal{R}_3$ is not optimal, i.e. coding rates beyond this region can be achieved. Observe that $\{Y^{(n)}\}$, where $Y^{(i)} = g\left(X^{(i)}\right)$, is an irreducible Markov chain with transition matrix (rows again index the current state)
\[
\begin{array}{c|cccc}
 & 0 & 3 & 2 & 1 \\ \hline
0 & .1493 & .8120 & .0193 & .0193 \\
3 & .0193 & .9420 & .0193 & .0193 \\
2 & .0193 & .8120 & .1493 & .0193 \\
1 & .0193 & .8120 & .0193 & .1493
\end{array} \tag{5.2.3}
\]
By Theorem 5.1.1, for any $\epsilon > 0$, there is an $N_0 \in \mathbb{N}^+$, such that for all $n > N_0$ there exist a linear encoder $\phi : \mathbb{Z}_4^n \to \mathbb{Z}_4^k$ and a decoder $\psi : \mathbb{Z}_4^k \to \mathbb{Z}_4^n$, such that $\Pr\{\psi(\phi(Y^n)) \neq Y^n\} < \epsilon$, as long as
\[
k > \frac{n}{2} \times \max\{0.3664, 0.3226\} = 0.1832n.
\]
Further notice that
\[
\phi(Y^n) = \vec{g}\left(Z_1^k, Z_2^k, Z_3^k\right),
\]
where $Z_t^k = \phi(X_t^n)$ ($t = 1, 2, 3$) and
\[
\vec{g}\left(Z_1^k, Z_2^k, Z_3^k\right) = \begin{bmatrix} g\left(Z_1^{(1)}, Z_2^{(1)}, Z_3^{(1)}\right) \\ g\left(Z_1^{(2)}, Z_2^{(2)}, Z_3^{(2)}\right) \\ \vdots \\ g\left(Z_1^{(k)}, Z_2^{(k)}, Z_3^{(k)}\right) \end{bmatrix},
\]
since $g$ is linear. Thus, by the approach of Körner–Marton [KM79], we can use $\phi$ as the encoder for each source. Upon observing $Z_1^k$, $Z_2^k$ and $Z_3^k$, the decoder claims that $\psi\left(\vec{g}\left(Z_1^k, Z_2^k, Z_3^k\right)\right)$ is the desired data $\vec{g}(X_1^n, X_2^n, X_3^n)$.
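The identity $\phi(Y^n) = \vec{g}(Z_1^k, Z_2^k, Z_3^k)$ used in this Körner–Marton style argument can be checked numerically. The following is a minimal sketch, assuming arbitrary illustrative block lengths `n` and `k`, a fixed random seed and uniformly drawn binary source sequences (none of which come from the example itself); it only verifies the algebraic identity, not the decoder or the error probability.

```python
import random

MOD = 4  # all arithmetic is carried out in the ring Z_4


def encode(A, x):
    """Linear encoder phi: x -> A x over Z_4."""
    return [sum(a * xi for a, xi in zip(row, x)) % MOD for row in A]


def g(x1, x2, x3):
    """The function (5.2.2), evaluated in Z_4."""
    return (x1 + 2 * x2 + 3 * x3) % MOD


rng = random.Random(0)   # hypothetical seed, for reproducibility only
n, k = 12, 5             # hypothetical block lengths
A = [[rng.randrange(MOD) for _ in range(n)] for _ in range(k)]

# Binary source sequences, viewed as elements of Z_4^n.
X1, X2, X3 = ([rng.randrange(2) for _ in range(n)] for _ in range(3))
Y = [g(a, b, c) for a, b, c in zip(X1, X2, X3)]

# Each source applies the same encoder; g is then applied symbol-wise.
Z1, Z2, Z3 = encode(A, X1), encode(A, X2), encode(A, X3)
combined = [g(a, b, c) for a, b, c in zip(Z1, Z2, Z3)]

print(combined == encode(A, Y))  # True, because g is linear over Z_4
```

The check succeeds for every choice of `A` and of the source sequences, since it relies only on the linearity of $g$ and $\phi$ over $\mathbb{Z}_4$.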
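Obviously,
\[
\Pr\{\psi(\vec{g}[\phi(X_1^n), \phi(X_2^n), \phi(X_3^n)]) \neq Y^n\} = \Pr\{\psi(\phi(Y^n)) \neq Y^n\} < \epsilon,
\]
as long as $k > 0.1832n$. As a consequence, the region
\[
\mathcal{R}_{\mathbb{Z}_4} = \left\{ (R_1, R_2, R_3) \in \mathbb{R}^3 \,\middle|\, R_i > \frac{2k}{n} = 0.4422 \right\} \tag{5.2.4}
\]
is achieved. Since
\[
0.4422 + 0.4422 + 0.4422 < \lim_{m \to \infty} H\left(X^{(m)} \,\middle|\, X^{(m-1)}\right) = 1.4236,
\]
we have that $\mathcal{R}_{\mathbb{Z}_4} \not\subseteq \mathcal{R}_3$. In conclusion, $\mathcal{R}_3$ is suboptimal for computing $g$.

Theorem 5.2.1. In Problem 5.1, assume that $g$ admits the presentation $\hat{g} = h \circ k$, where
\[
k(x_1, x_2, \cdots, x_s) = \sum_{i=1}^{s} k_i(x_i), \tag{5.2.5}
\]
$h$ and the $k_i$'s are functions mapping $\mathfrak{R}$ to $\mathfrak{R}$, and $k$ is irreducible Markovian. Let $\mathbf{P}$ and $\pi$ be the transition matrix and invariant distribution of $\left\{Z^{(n)} = \sum_{t \in \mathcal{S}} k_t\left(X_t^{(n)}\right)\right\}$, respectively. We have
\[
\mathcal{R} = \{(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \mid R_i > R_0\} \subseteq \mathcal{R}[g],
\]
where
\[
R_0 = \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} \min\left\{ H(\mathbf{S}_{\mathfrak{R}/\mathfrak{I}}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m \to \infty} \frac{1}{m} H\left(Y_{\mathfrak{R}/\mathfrak{I}}^{(m)}, Y_{\mathfrak{R}/\mathfrak{I}}^{(m-1)}, \cdots, Y_{\mathfrak{R}/\mathfrak{I}}^{(1)}\right) \right\},
\]
$\mathbf{S}_{\mathfrak{R}/\mathfrak{I}} = \mathrm{diag}\left\{ \{\mathbf{S}_A\}_{A \in \mathfrak{R}/\mathfrak{I}} \right\}$ with $\mathbf{S}_A$ being the stochastic complement of $\mathbf{P}_{A,A}$ in $\mathbf{P}$ and $Y_{\mathfrak{R}/\mathfrak{I}}^{(m)} = Z^{(m)} + \mathfrak{I}$. Moreover, if $\mathfrak{R}$ is a field, then
\[
\mathcal{R} = \{(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \mid R_i > H(\mathbf{P}|\pi)\}. \tag{5.2.6}
\]

Proof. By Theorem 5.1.1, for any $\epsilon > 0$, there exists an $N_0 \in \mathbb{N}^+$ such that, for all $n > N_0$, there exist a linear encoder $\phi_0 : \mathfrak{R}^n \to \mathfrak{R}^k$ and a decoder $\psi_0 : \mathfrak{R}^k \to \mathfrak{R}^n$ such that
\[
\Pr\{\psi_0(\phi_0(Z^n)) \neq Z^n\} < \epsilon,
\]
provided that $k > \frac{nR_0}{\log |\mathfrak{R}|}$. Choose $\phi_t = \phi_0 \circ \vec{k}_t$ ($t \in \mathcal{S}$) as the encoder for the $t$th source and $\psi = \psi_0 \circ \gamma$, where $\gamma : \mathfrak{R}^s \to \mathfrak{R}$ is defined as $\gamma(x_1, x_2, \cdots, x_s) = \sum_{t \in \mathcal{S}} x_t$, as the decoder. We have that
\begin{align*}
\Pr\{\psi(\phi_1(X_1^n), \phi_2(X_2^n), \cdots, \phi_s(X_s^n)) \neq Z^n\}
&= \Pr\left\{\psi_0\left(\gamma\left(\phi_0\left(\vec{k}_t(X_t^n)\right)\right)\right) \neq Z^n\right\} \\
&= \Pr\left\{\psi_0\left(\phi_0\left(\gamma\left(\vec{k}_t(X_t^n)\right)\right)\right) \neq Z^n\right\} \\
&= \Pr\{\psi_0(\phi_0(Z^n)) \neq Z^n\} < \epsilon.
\end{align*}
Therefore, $(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s$, where $R_i = \frac{k \log |\mathfrak{R}|}{n} > R_0$, is achievable for computing $g$. As a conclusion, $\mathcal{R} \subseteq \mathcal{R}[g]$. If furthermore $\mathfrak{R}$ is a field, then $\mathfrak{R}$ is the only non-trivial left ideal of itself, and (5.2.6) follows.

Example 5.2.2.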
Define $\mathbf{P}_\alpha$ and $\mathbf{P}_\beta$ to be the following two matrices, respectively. In both, rows index the current state and columns the next state (so that each row sums to one up to rounding), with the states ordered as $(0,0,0)$, $(0,0,1)$, $(0,1,0)$, $(0,1,1)$, $(1,0,1)$, $(1,1,0)$, $(1,1,1)$, $(1,0,0)$.
\[
\mathbf{P}_\alpha:\quad
\begin{array}{c|cccccccc}
 & (0,0,0) & (0,0,1) & (0,1,0) & (0,1,1) & (1,0,1) & (1,1,0) & (1,1,1) & (1,0,0) \\ \hline
(0,0,0) & .2597 & .2093 & .2713 & .2597 & 0 & 0 & 0 & 0 \\
(0,0,1) & .1208 & .0872 & .6711 & .1208 & 0 & 0 & 0 & 0 \\
(0,1,0) & .0184 & .2627 & .4101 & .3088 & 0 & 0 & 0 & 0 \\
(0,1,1) & .0985 & .1823 & .2315 & .4877 & 0 & 0 & 0 & 0 \\
(1,0,1) & .12985 & .10465 & .13565 & .12985 & .12985 & .10465 & .13565 & .12985 \\
(1,1,0) & .0604 & .0436 & .33555 & .0604 & .0604 & .0436 & .33555 & .0604 \\
(1,1,1) & .0092 & .13135 & .20505 & .1544 & .0092 & .13135 & .20505 & .1544 \\
(1,0,0) & .04925 & .09115 & .11575 & .24385 & .04925 & .09115 & .11575 & .24385
\end{array}
\]
\[
\mathbf{P}_\beta:\quad
\begin{array}{c|cccccccc}
 & (0,0,0) & (0,0,1) & (0,1,0) & (0,1,1) & (1,0,1) & (1,1,0) & (1,1,1) & (1,0,0) \\ \hline
(0,0,0) & 0 & 0 & 0 & 0 & .2597 & .2093 & .2713 & .2597 \\
(0,0,1) & 0 & 0 & 0 & 0 & .1208 & .0872 & .6711 & .1208 \\
(0,1,0) & 0 & 0 & 0 & 0 & .0184 & .2627 & .4101 & .3088 \\
(0,1,1) & 0 & 0 & 0 & 0 & .0985 & .1823 & .2315 & .4877 \\
(1,0,1) & .2597 & .2093 & .2713 & .2597 & 0 & 0 & 0 & 0 \\
(1,1,0) & .1208 & .0872 & .6711 & .1208 & 0 & 0 & 0 & 0 \\
(1,1,1) & .0184 & .2627 & .4101 & .3088 & 0 & 0 & 0 & 0 \\
(1,0,0) & .0985 & .1823 & .2315 & .4877 & 0 & 0 & 0 & 0
\end{array}
\]
Let $M = \{X^{(n)}\}$ be a non-homogeneous Markov chain whose transition matrix from time $n$ to time $n + 1$ is
\[
\mathbf{P}^{(n)} = \begin{cases} \mathbf{P}_\alpha; & n \text{ is even}, \\ \mathbf{P}_\beta; & \text{otherwise}. \end{cases}
\]
Consider Example 5.2.1 with the original homogeneous Markov chain $\{X^{(n)}\}$ replaced by $M$ defined above. It is easy to verify that there exists no invariant distribution $\pi'$ such that $\pi' \mathbf{P}^{(n)} = \pi'$ for all feasible $n$. This implies that $M$ is not jointly ergodic (stationary ergodic), nor a.m.s. ergodic. Otherwise, $M$ would always possess an invariant distribution induced from the stationary mean measure of the a.m.s. dynamical system describing $M$ [Gra09, Theorem 7.1 and Theorem 8.1]. As a consequence, [Cov75] does not apply. However, $g$ is Markovian although $M$ is not even homogeneous. In exact terms, $\{g(X^{(n)})\}$ is homogeneous irreducible Markov with transition matrix $\mathbf{P}$ given by (4.2.8). Consequently, Theorem 5.2.1 offers a solution which achieves
\[
\mathcal{R} = \{(R_1, R_2, R_3) \mid R_i > H(\mathbf{P}|\pi) = 1.8629\},
\]
where $\pi$ is the unique eigenvector satisfying $\pi \mathbf{P} = \pi$.
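Quantities of the form $H(\mathbf{P}|\pi)$, with $\pi$ the invariant distribution satisfying $\pi \mathbf{P} = \pi$, are straightforward to evaluate numerically. The sketch below is only an illustration: since (4.2.8) is not reproduced in this chapter, it uses the four-state matrix (5.2.3) of Example 5.2.1 instead, read as row-stochastic in the state order 0, 3, 2, 1. The rows are renormalised because the printed entries are rounded, and the power iteration with a fixed number of steps is an assumed, not prescribed, way of finding $\pi$.

```python
from math import log2


def stationary(P, iters=200):
    """Approximate the invariant distribution pi (pi P = pi) by power iteration."""
    size = len(P)
    pi = [1.0 / size] * size
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(size)) for j in range(size)]
        total = sum(pi)
        pi = [p / total for p in pi]
    return pi


def conditional_entropy(P, pi):
    """H(P|pi) = -sum_i pi_i sum_j P_ij log2 P_ij, in bits."""
    size = len(P)
    return -sum(
        pi[i] * P[i][j] * log2(P[i][j])
        for i in range(size)
        for j in range(size)
        if P[i][j] > 0
    )


# Matrix (5.2.3); rows and columns in the order 0, 3, 2, 1.
raw = [
    [0.1493, 0.8120, 0.0193, 0.0193],
    [0.0193, 0.9420, 0.0193, 0.0193],
    [0.0193, 0.8120, 0.1493, 0.0193],
    [0.0193, 0.8120, 0.0193, 0.1493],
]
P = [[p / sum(row) for p in row] for row in raw]  # undo rounding in the printed rows

pi = stationary(P)
H = conditional_entropy(P, pi)
print(pi)
print(H)  # roughly 0.44 bits, in line with the figure quoted in (5.2.4)
```

The same two functions apply unchanged to any finite irreducible transition matrix, for instance to the matrix referred to as (4.2.8).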
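Once again, the optimal coding rate $H(\mathbf{P}|\pi)$ for compressing $\{g(X^{(n)})\}$ is derived from the Supremus typicality argument, rather than the classical typicality argument.

For an arbitrary $g$, Lemma 1.2.4 promises that there always exist some finite ring $\mathfrak{R}$ and functions $k_t : \mathcal{X}_t \to \mathfrak{R}$ ($t \in \mathcal{S}$) and $h : \mathfrak{R} \to \mathcal{Y}$ such that
\[
g = h\left(\sum_{t \in \mathcal{S}} k_t\right).
\]
However, $k = \sum_{t \in \mathcal{S}} k_t$ is not necessarily Markovian, unless the process $M = \{X^{(n)}\}$ is Markov with transition matrix $c_1 \mathbf{U} + (1 - c_1)\mathbf{1}$, where the stochastic matrix $\mathbf{U}$ has identical rows. In that case, $k$ is always Markovian, as claimed by Lemma 4.A.3.

Corollary 5.2.1. In Problem 5.1, assume that $\{X_{\mathcal{S}}^{(n)}\}$ forms an irreducible Markov chain with transition matrix $\mathbf{P}_0 = c_1 \mathbf{U} + (1 - c_1)\mathbf{1}$, where all rows of $\mathbf{U}$ are identical to some unitary vector and $0 \le c_1 \le 1$. Then there exist some finite ring $\mathfrak{R}$ and functions $k_t : \mathcal{X}_t \to \mathfrak{R}$ ($t \in \mathcal{S}$) and $h : \mathfrak{R} \to \mathcal{Y}$ such that
\[
g(x_1, x_2, \cdots, x_s) = h\left(\sum_{t=1}^{s} k_t(x_t)\right) \tag{5.2.7}
\]
and $M = \left\{Z^{(n)} = \sum_{t=1}^{s} k_t\left(X_t^{(n)}\right)\right\}$ is irreducible Markov. Furthermore, let $\pi$ and $\mathbf{P}$ be the invariant distribution and the transition matrix of $M$, respectively, and
\[
R_0 = \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} \min\left\{ H(\mathbf{S}_{\mathfrak{R}/\mathfrak{I}}|\pi),\; H(\mathbf{P}|\pi) - \lim_{m \to \infty} H\left(Y_{\mathfrak{R}/\mathfrak{I}}^{(m)} \,\middle|\, Y_{\mathfrak{R}/\mathfrak{I}}^{(m-1)}\right) \right\},
\]
where $\mathbf{S}_{\mathfrak{R}/\mathfrak{I}} = \mathrm{diag}\left\{ \{\mathbf{S}_A\}_{A \in \mathfrak{R}/\mathfrak{I}} \right\}$ with $\mathbf{S}_A$ being the stochastic complement of $\mathbf{P}_{A,A}$ in $\mathbf{P}$ and $Y_{\mathfrak{R}/\mathfrak{I}}^{(m)} = Z^{(m)} + \mathfrak{I}$. We have that
\[
\mathcal{R}_{\mathfrak{R}} = \{(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \mid R_i > R_0\} \subseteq \mathcal{R}[g].
\]

Proof. The existence of the $k_t$'s and $h$ is from Lemma 1.2.4, and Lemma 4.A.3 ensures that $M$ is Markov. In addition, $\{X_{\mathcal{S}}^{(n)}\}$ is irreducible, and so is $M$. Finally,
\[
\lim_{m \to \infty} \frac{1}{m} H\left(Y_{\mathfrak{R}/\mathfrak{I}}^{(m)}, Y_{\mathfrak{R}/\mathfrak{I}}^{(m-1)}, \cdots, Y_{\mathfrak{R}/\mathfrak{I}}^{(1)}\right) = \lim_{m \to \infty} H\left(Y_{\mathfrak{R}/\mathfrak{I}}^{(m)} \,\middle|\, Y_{\mathfrak{R}/\mathfrak{I}}^{(m-1)}\right),
\]
since $\left\{Y_{\mathfrak{R}/\mathfrak{I}}^{(n)}\right\}$ is Markov by Lemma 4.A.3. Hence, $\mathcal{R}_{\mathfrak{R}} \subseteq \mathcal{R}[g]$ by Theorem 5.2.1.

Remark 5.5. For the function $g$ in Corollary 5.2.1, it is often the case that there exists more than one finite ring $\mathfrak{R}$, or more than one set of functions $k_t$'s and $h$, satisfying the corresponding requirements.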
For example, the polynomial function $x + 2y + 3z \in \mathbb{Z}_4[3]$ also admits the polynomial presentation $\hat{h}(x + 2y + 4z) \in \mathbb{Z}_5[3]$, where $\hat{h}(u) = \sum_{a \in \mathbb{Z}_5} a\left[1 - (u - a)^4\right] - \left[1 - (u - 4)^4\right] \in \mathbb{Z}_5[1]$. As a conclusion, a possibly better inner bound of $\mathcal{R}[g]$ is
\[
\mathcal{R}_s \cup \bigcup_{\mathfrak{R}} \bigcup_{\mathscr{P}_{\mathfrak{R}}(g)} \mathcal{R}_{\mathfrak{R}}, \tag{5.2.8}
\]
where $\mathscr{P}_{\mathfrak{R}}(g)$ denotes all the polynomial presentations of format (5.2.7) of $g$ over the ring $\mathfrak{R}$.

Corollary 5.2.2. In Corollary 5.2.1, let $\pi = [p_j]_{j \in \mathfrak{R}}$. If $c_1 = 1$, namely, $\{X_{\mathcal{S}}^{(n)}\}$ and $M$ are i.i.d., then
\[
\mathcal{R}_{\mathfrak{R}} = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, R_i > \max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} \left[H(\pi) - H(\pi_{\mathfrak{I}})\right] \right\} \subseteq \mathcal{R}[g],
\]
where $\pi_{\mathfrak{I}} = \left[\sum_{j \in A} p_j\right]_{A \in \mathfrak{R}/\mathfrak{I}}$.

Remark 5.6. In Corollary 5.2.2, under many circumstances it may hold that $\max_{0 \neq \mathfrak{I} \le_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} \left[H(\pi) - H(\pi_{\mathfrak{I}})\right] = H(\pi)$, i.e.
\[
\mathcal{R}_{\mathfrak{R}} = \{(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \mid R_i > H(\pi)\}. \tag{5.2.9}
\]
This is the case, for example, when $\mathfrak{R}$ is a field. However, $\mathfrak{R}$ being a field is definitely not necessary. For more details, please refer to Section 2.3.

Corollary 5.2.3. In Corollary 5.2.1, $\mathfrak{R}$ can always be chosen as a field. Consequently,
\[
\mathcal{R}_{\mathfrak{R}} = \{(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \mid R_i > H(\mathbf{P}|\pi)\} \subseteq \mathcal{R}[g]. \tag{5.2.10}
\]

Remark 5.7. Although $\mathfrak{R}$ in Corollary 5.2.1 can always be chosen to be a field, the resulting region $\mathcal{R}_{\mathfrak{R}}$ is not necessarily larger than when $\mathfrak{R}$ is chosen as a non-field ring. On the contrary, in many cases $\mathcal{R}_{\mathfrak{R}}$ is strictly larger when $\mathfrak{R}$ is a non-field ring than when it is chosen as a field. This is because the induced $\mathbf{P}$, as well as $\pi$, varies.

As mentioned, in Theorem 5.2.1, Corollary 5.2.1 and Corollary 5.2.2, there may be more than one choice of a finite ring $\mathfrak{R}$ satisfying the corresponding requirements. Among those choices, $\mathfrak{R}$ can be either a field or a non-field ring. Surprisingly, it is seen in (infinitely) many examples that using a non-field ring outperforms using a field. In many cases, it is proved that the achievable region obtained with linear coding over some non-field ring is strictly larger than any that is achieved with its field counterpart, regardless of which field is considered.
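The $\mathbb{Z}_5$ presentation quoted in Remark 5.5 can be verified by exhaustive evaluation over the binary inputs. The short Python sketch below does so. It relies on the fact that, by Fermat's little theorem, $1 - (u - a)^4$ equals 1 in $\mathbb{Z}_5$ when $u = a$ and 0 otherwise; the function names `g` and `h_hat` are illustrative only.

```python
from itertools import product


def g(x, y, z):
    """x + 2y + 3z evaluated in Z_4."""
    return (x + 2 * y + 3 * z) % 4


def indicator(u, a):
    """1 - (u - a)^4 in Z_5: equals 1 if u == a (mod 5) and 0 otherwise."""
    return (1 - pow(u - a, 4, 5)) % 5


def h_hat(u):
    """h_hat(u) = sum_a a[1 - (u - a)^4] - [1 - (u - 4)^4], evaluated in Z_5."""
    return (sum(a * indicator(u, a) for a in range(5)) - indicator(u, 4)) % 5


for x, y, z in product((0, 1), repeat=3):
    inner = (x + 2 * y + 4 * z) % 5        # the sum is taken in Z_5
    print((x, y, z), g(x, y, z), h_hat(inner))
```

For each of the eight binary inputs the last two printed values agree, with the value of $\hat{h}$ (an element of $\{0, 1, 2, 3\} \subset \mathbb{Z}_5$) identified with the corresponding element of $\mathbb{Z}_4$.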
Section 3.3 has demonstrated this in the setting of correlated i.i.d. sources. In the next section, this will be demonstrated once again in the setting of sources with memory.

5.3 Non-field Rings versus Fields II

Clearly, our previous discussion regarding linear coding is mainly based on general finite rings, which can be either fields or non-field rings, each bringing its own advantages. In the setting where $g$ is the identity function in Problem 5.1, linear coding over a finite field is always optimal in the sense of achieving $\mathcal{R}[g]$ if the sources are jointly ergodic (stationary ergodic) [Cov75]. An equivalent conclusive result is not yet proved for linear coding over non-field rings. Nevertheless, it is proved that there always exists more than one (up to isomorphism) non-field ring over which linear coding achieves the Slepian–Wolf region if the sources considered are i.i.d. (Section 2.3). Furthermore, many examples, say Example 5.1.1, show that a non-field ring can be equally optimal when considering irreducible Markov sources. All in all, there is still no conclusive support that linear coding over a field is preferable in terms of achieving the optimal region $\mathcal{R}[g]$ when $g$ is an identity function.

On the contrary, there are many drawbacks of using finite fields compared to using non-field rings (e.g. modulo integer rings):
1. Finite field arithmetic is complicated to implement, since it usually involves the polynomial long division algorithm; and
2. The alphabet size(s) of the encoder(s) is (are) usually larger than required (Section 3.3); and
3. In many specific circumstances of Problem 5.1, linear coding over any finite field is proved to be inferior to its non-field ring counterpart, in that it achieves a smaller region (see Section 5.3 and Example 5.3.1); and
4. The characteristic of a finite field has to be a prime.
This constraint creates shortages in the polynomial presentations of discrete functions over fields (see Lemma 5.A.2). These shortages confine the performance of the polynomial approach (if restricted to fields) and lead to results like Proposition 5.3.1. On the other hand, the characteristic can be any positive integer for a finite non-field ring; and
5. A field (finite or not) contains no zero divisor. This also impairs the performance of the polynomial approach (if restricted to fields).

Example 5.3.1. Consider the situation illustrated in Example 5.2.1. One alternative is to treat $\mathcal{X}_1 = \mathcal{X}_2 = \mathcal{X}_3 = \{0, 1\}$ as a subset of the finite field $\mathbb{Z}_5$; the function $g$ can then be presented as
\[
g(x_1, x_2, x_3) = \hat{h}(x_1 + 2x_2 + 4x_3),
\]
where $\hat{h} : \mathbb{Z}_5 \to \mathbb{Z}_4$ is given by $\hat{h}(z) = \begin{cases} z; & z \neq 4, \\ 3; & z = 4, \end{cases}$ (symbol-wise). By Corollary 5.2.3, linear coding over $\mathbb{Z}_5$ achieves the region
\[
\mathcal{R}_{\mathbb{Z}_5} = \left\{ (r_1, r_2, r_3) \in \mathbb{R}^3 \mid r_i > H(\mathbf{P}_{\mathbb{Z}_5}|\pi_{\mathbb{Z}_5}) = 0.4623 \right\}.
\]
Obviously, $\mathcal{R}_{\mathbb{Z}_5} \subsetneq \mathcal{R}_{\mathbb{Z}_4} \subseteq \mathcal{R}[g]$. In conclusion, linear coding over the field $\mathbb{Z}_5$ is inferior to linear coding over the non-field ring $\mathbb{Z}_4$. In fact, the region $\mathcal{R}_{\mathbb{F}}$ achieved by linear coding over any finite field $\mathbb{F}$ is always strictly smaller than $\mathcal{R}_{\mathbb{Z}_4}$.

Proposition 5.3.1. In Example 5.2.1, $\mathcal{R}_{\mathbb{F}}$, the region achieved with linear coding over any finite field $\mathbb{F}$ in the sense of Corollary 5.2.1, is properly contained in $\mathcal{R}_{\mathbb{Z}_4}$, i.e. $\mathcal{R}_{\mathbb{F}} \subsetneq \mathcal{R}_{\mathbb{Z}_4}$.

Proof. Assume that
\[
g(x_1, x_2, x_3) = h(k_1(x_1) + k_2(x_2) + k_3(x_3))
\]
with $k_t : \{0, 1\} \to \mathbb{F}$ ($1 \le t \le 3$) and $h : \mathbb{F} \to \mathbb{Z}_4$. Let
\begin{align*}
M_1 &= \left\{Y^{(n)}\right\} \text{ with } Y^{(n)} = g\left(X_1^{(n)}, X_2^{(n)}, X_3^{(n)}\right), \\
M_2 &= \left\{Z^{(n)}\right\} \text{ with } Z^{(n)} = k_1\left(X_1^{(n)}\right) + k_2\left(X_2^{(n)}\right) + k_3\left(X_3^{(n)}\right),
\end{align*}
and $\mathbf{P}_l$ and $\pi_l$ be the transition matrix and the invariant distribution of $M_l$, respectively, for $l = 1, 2$. By Corollary 5.2.1 (also Corollary 5.2.3), linear coding over $\mathbb{F}$ achieves the region
\[
\mathcal{R}_{\mathbb{F}} = \{(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \mid R_i > H(\mathbf{P}_2|\pi_2)\},
\]
while linear coding over $\mathbb{Z}_4$ achieves
\[
\mathcal{R}_{\mathbb{Z}_4} = \left\{ (R_1, R_2, \cdots, R_s) \in \mathbb{R}^s \,\middle|\, R_i > \max_{0 \neq \mathfrak{I} \le_l \mathbb{Z}_4} \frac{\log |\mathbb{Z}_4|}{\log |\mathfrak{I}|} H(\mathbf{S}_{\mathbb{Z}_4/\mathfrak{I}}|\pi_1) = H(\mathbf{P}_1|\pi_1) \right\}.
\]
Moreover, $H(\mathbf{P}_1|\pi_1) < H(\mathbf{P}_2|\pi_2)$ by Lemma 5.A.1, because Lemma 5.A.2 claims that $h|_{\mathscr{S}}$, where $\mathscr{S} = k_1(\{0, 1\}) + k_2(\{0, 1\}) + k_3(\{0, 1\})$, can never be injective. Therefore, $\mathcal{R}_{\mathbb{F}} \subsetneq \mathcal{R}_{\mathbb{Z}_4}$.

Remark 5.8. There are infinitely many functions like $g$ defined in Example 5.2.1 such that the achievable region obtained with linear coding over any finite field in the sense of Corollary 5.2.1 is strictly suboptimal compared to the one achieved with linear coding over some non-field ring. These functions include $\sum_{t=1}^{s} x_t \in \mathbb{Z}_{2p}[s]$ for any $s \ge 2$ and any prime $p > 2$. One can always find a concrete example in which linear coding over $\mathbb{Z}_{2p}$ dominates. The reason for this is partially that these functions are defined on rings (e.g. $\mathbb{Z}_{2p}$) of non-prime characteristic. However, a finite field must be of prime characteristic, resulting in conclusions like Proposition 5.3.1.

As a direct consequence of Proposition 5.3.1, we have

Theorem 5.3.1. In the sense of (5.2.8), linear coding over finite field is not optimal.

5.A Appendix

5.A.1 Supporting Lemmata

Lemma 5.A.1. Let $\mathscr{Z}$ be a countable set, and let $\pi = [p(z)]_{z \in \mathscr{Z}}$ and $\mathbf{P} = [p(z_1, z_2)]_{z_1, z_2 \in \mathscr{Z}}$ be a non-negative unitary vector and a stochastic matrix, respectively. For any function $h : \mathscr{Z} \to \mathcal{Y}$, if for all $y_1, y_2 \in \mathcal{Y}$
\[
\frac{p(z_1, y_2)}{p(z_1)} = c_{y_1, y_2}, \quad \forall\, z_1 \in h^{-1}(y_1), \tag{5.A.1}
\]
where $c_{y_1, y_2}$ is a constant, then
\[
H\left(h\left(Z^{(2)}\right) \,\middle|\, h\left(Z^{(1)}\right)\right) \le H(\mathbf{P}|\pi), \tag{5.A.2}
\]
where $\left(Z^{(1)}, Z^{(2)}\right) \sim \pi \mathbf{P}$. Moreover, (5.A.2) holds with equality if and only if
\[
p(z_1, h(z_2)) = p(z_1, z_2), \quad \forall\, z_1, z_2 \in \mathscr{Z} \text{ with } p(z_1, z_2) > 0. \tag{5.A.3}
\]

Proof.
By definition,
\begin{align*}
H\left(h\left(Z^{(2)}\right) \,\middle|\, h\left(Z^{(1)}\right)\right)
&= -\sum_{y_1, y_2 \in \mathcal{Y}} p(y_1, y_2) \log \frac{p(y_1, y_2)}{p(y_1)} \\
&= -\sum_{y_1, y_2 \in \mathcal{Y}} \sum_{z_1 \in h^{-1}(y_1)} p(z_1, y_2) \log \frac{\sum_{z_1' \in h^{-1}(y_1)} p(z_1', y_2)}{\sum_{z_1'' \in h^{-1}(y_1)} p(z_1'')} \\
&\overset{(a)}{=} -\sum_{y_1, y_2 \in \mathcal{Y}} \sum_{z_1 \in h^{-1}(y_1)} p(z_1, y_2) \log \frac{p(z_1, y_2)}{p(z_1)} \\
&= -\sum_{y_1, y_2 \in \mathcal{Y}} \sum_{\substack{z_2 \in h^{-1}(y_2), \\ z_1 \in h^{-1}(y_1)}} p(z_1, z_2) \log \frac{\sum_{z_2' \in h^{-1}(y_2)} p(z_1, z_2')}{p(z_1)} \\
&\overset{(b)}{\le} -\sum_{y_1, y_2 \in \mathcal{Y}} \sum_{\substack{z_2 \in h^{-1}(y_2), \\ z_1 \in h^{-1}(y_1)}} p(z_1, z_2) \log \frac{p(z_1, z_2)}{p(z_1)} \\
&= -\sum_{z_1, z_2 \in \mathscr{Z}} p(z_1, z_2) \log \frac{p(z_1, z_2)}{p(z_1)} \\
&= H(\mathbf{P}|\pi),
\end{align*}
where (a) is from (5.A.1). In addition, equality holds, i.e. (b) holds with equality, if and only if (5.A.3) is satisfied.

Remark 5.9. $\mathbf{P}$ in the above lemma can be interpreted as the transition matrix of some Markov process. However, $\pi$ is not necessarily the corresponding invariant distribution. It is also not necessary that such a Markov process is irreducible. In the meantime, (5.A.2) can be seen as a "data processing inequality". In addition, (5.A.1) is sufficient but not necessary for (5.A.2), even though it is sufficient and necessary for (a) in the above proof.

Lemma 5.A.2. For $g$ given by (5.2.2) and any finite field $\mathbb{F}$, if there exist functions $k_t : \{0, 1\} \to \mathbb{F}$ and $h : \mathbb{F} \to \mathbb{Z}_4$, such that
\[
g(x_1, x_2, \cdots, x_s) = h\left(\sum_{t=1}^{s} k_t(x_t)\right),
\]
then $h|_{\mathscr{S}}$, where $\mathscr{S} = k_1(\{0, 1\}) + k_2(\{0, 1\}) + k_3(\{0, 1\})$, is not injective.

Proof. Suppose otherwise, i.e. $h|_{\mathscr{S}}$ is injective. Let $h' : h(\mathscr{S}) \to \mathscr{S}$ be the inverse mapping of $h : \mathscr{S} \to h(\mathscr{S})$. Obviously, $h'$ is bijective. By (5.2.2), we have
\begin{align*}
h'[g(1, 0, 0)] &= k_1(1) + k_2(0) + k_3(0) \\
= h'[g(0, 1, 1)] &= k_1(0) + k_2(1) + k_3(1) \\
\neq h'[g(1, 1, 0)] &= k_1(1) + k_2(1) + k_3(0) \\
= h'[g(0, 0, 1)] &= k_1(0) + k_2(0) + k_3(1).
\end{align*}
Let $\tau = h'[g(1, 0, 0)] - h'[g(1, 1, 0)] = h'[g(0, 1, 1)] - h'[g(0, 0, 1)] \in \mathbb{F}$. We have that
\[
\tau = k_2(0) - k_2(1) = k_2(1) - k_2(0) = -\tau \implies \tau + \tau = 0. \tag{5.A.4}
\]
(5.A.4) implies that either $\tau = 0$ or $\mathrm{Char}(\mathbb{F}) = 2$ by Proposition 1.1.2. Notice that $k_2(0) \neq k_2(1)$, i.e.
$\tau \neq 0$, by the definition of $g$. Thus, $\mathrm{Char}(\mathbb{F}) = 2$. Let $\rho = k_3(0) - k_3(1)$. Obviously, $\rho \neq 0$ by the definition of $g$, and $\rho + \rho = 0$ since $\mathrm{Char}(\mathbb{F}) = 2$. Consequently,
\begin{align*}
h'[g(0, 0, 0)] &= k_1(0) + k_2(0) + k_3(0) \\
&= k_1(0) + k_2(0) + k_3(1) + \rho \\
&= h'[g(0, 0, 1)] + \rho \\
&= h'[g(1, 1, 0)] + \rho \\
&= k_1(1) + k_2(1) + k_3(0) + \rho \\
&= k_1(1) + k_2(1) + k_3(1) + \rho + \rho \\
&= h'[g(1, 1, 1)].
\end{align*}
Therefore, $g(0, 0, 0) = g(1, 1, 1)$ since $h'$ is bijective. This is absurd!

Chapter 6 Extended Shannon–McMillan–Breiman Theorem

By Proposition 4.2.1, we have seen that Supremus typicality is a recursive property, while classical typicality is not, by Example 4.2.1. This is because we have integrated a recursive feature, namely that a reduced process of an irreducible Markov process is irreducible Markov [Mey89], into the definition of the corresponding Supremus typical sequences. Although this makes the set of Supremus typical sequences a smaller (more restricted) set compared to the set of classical typical sequences, the AEP, Proposition 4.2.2, still holds. Consequently, we conclude that non-Supremus typical sequences, which might or might not be classical typical, are negligible in a stochastic sense. This is the spirit behind the arguments of the achievability theorems from Chapter 5. Moreover, this enriched structure of Supremus typical sequences provides us with refined properties (see the comparison between Lemma 4.2.1 and Lemma 4.2.2, etc.) to obtain more accessible conclusions. However, we have postponed the consideration of the more universal case, the asymptotically mean stationary (a.m.s.) process, because it requires some ergodic theoretical background that is introduced in this chapter.

This chapter proves that an induced transformation with respect to a finite measure set of a recurrent a.m.s. dynamical system with a σ-finite measure is a.m.s. This is the counterpart of the recursive feature of irreducible Markov processes that we are looking for.
Since the Shannon–McMillan–Breiman (SMB) Theorem and the Shannon–McMillan Theorem hold for any finite-state a.m.s. ergodic process [GK80], we conclude that the SMB Theorem, as well as the Shannon–McMillan Theorem, holds simultaneously for all reduced processes of any finite-state recurrent a.m.s. ergodic random process. We term this recursive property the Extended SMB Theorem.

6.1 Asymptotically Mean Stationary Dynamical Systems and Random Processes

6.1.1 Asymptotically Mean Stationary Dynamical Systems

A dynamical system $(\Omega, \mathscr{F}, \mu, T)$ with a finite measure, e.g. a probability measure, is said to be asymptotically mean stationary¹ (a.m.s.) [GK80] if the limit
\[
\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\left(T^{-i} B\right)
\]
exists for all $B \in \mathscr{F}$. As proved in [GK80], a system being a.m.s. is a necessary and sufficient condition to ensure that
\[
\frac{1}{n} \sum_{i=0}^{n-1} f T^i
\]
converges $\mu$-almost everywhere ($\mu$-a.e.) on $\Omega$ for every bounded $\mathscr{F}$-measurable real-valued function, say $f$. Let
\[
\overline{\mu}(B) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\left(T^{-i} B\right), \ \forall\ B \in \mathscr{F},
\quad \text{and} \quad
\overline{f}(x) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} f T^i(x), \ \forall\ x \in \Omega.
\]
Then, by the Vitali–Hahn–Saks Theorem, it is easily seen that $\overline{\mu}$ is a finite measure on $(\Omega, \mathscr{F})$, and $\overline{f}$ is $\mathscr{F}$-measurable. Moreover, $(\Omega, \mathscr{F}, \overline{\mu}, T)$ is invariant, in other words, $T$ is a measure preserving transformation on $(\Omega, \mathscr{F}, \overline{\mu})$, i.e.
\[
\overline{\mu}(B) = \overline{\mu}\left(T^{-1} B\right), \ \forall\ B \in \mathscr{F},
\]
and $\overline{f}$ is $T$-invariant a.e., i.e. $\overline{f} = \overline{f} T$ a.e., with respect to both $\mu$ and $\overline{\mu}$. In fact, $\overline{f}$ is simply the conditional expectation $E_{\overline{\mu}}(f|\mathscr{I})$, where $\mathscr{I} \subseteq \mathscr{F}$ is the σ-algebra of $T$-invariant sets ($B \in \mathscr{F}$ is said to be $T$-invariant if $B = T^{-1} B$). Therefore, if $(\Omega, \mathscr{F}, \mu, T)$ is ergodic, i.e.
\[
T^{-1} B = B \implies \mu(B) = 0 \text{ or } \mu(\Omega - B) = 0, \ \forall\ B \in \mathscr{F},
\]
then $\overline{f} = E_{\overline{\mu}}(f|\mathscr{I})$ equals a constant a.e. with respect to both $\mu$ and $\overline{\mu}$.

¹ Perhaps it is better to replace “stationary” with “invariant,” because a stationary measure defined in [GK80] is usually called an invariant measure in the language of ergodic theory. However, in order to be consistent, we follow the existing literature and use the terminology “asymptotically mean stationary,” while the reader can read it as “asymptotically mean invariant” if preferred.

We emphasize that the definition (cited from [GK80]) of the a.m.s. property given above is only valid for finite measures. In order to address dynamical systems with non-finite measures, in particular those with σ-finite measures, we generalise the definition as follows.

Definition 6.1.1. A dynamical system $(\Omega, \mathscr{F}, \mu, T)$ is said to be asymptotically mean stationary (a.m.s.) if there exists a measure $\overline{\mu}$ on $(\Omega, \mathscr{F})$ satisfying:
1. For any $B \in \mathscr{F}$ of finite measure, i.e. $\mu(B) < \infty$,
\[
\overline{\mu}(B) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\left(T^{-i} B\right);
\]
2. For any $T$-invariant set $B \in \mathscr{F}$, $\overline{\mu}(B) = \mu(B)$.
Such a measure $\overline{\mu}$ is named the invariant mean² of $\mu$.

The following proposition explains why the terms “asymptotically mean stationary” and “invariant mean” are suggested.

Proposition 6.1.1. Let $(\Omega, \mathscr{F}, \mu, T)$ be a.m.s. and $\overline{\mu}$ be an invariant mean of $\mu$. If $\mu$ is σ-finite, then $(\Omega, \mathscr{F}, \overline{\mu}, T)$ is invariant.

Proof. For any $B \in \mathscr{F}$, if $\mu(B) < \infty$, obviously $\overline{\mu}(B) = \overline{\mu}\left(T^{-n} B\right)$ for any positive integer $n$. If $\mu(B) = \infty$, then there exists a countable partition $\{B_i : i \in \mathbb{N}^+\}$, with $\mu(B_i) < \infty$, of $B$ since $\mu$ is σ-finite. Moreover, $\left\{T^{-1} B_i\right\}$ is a countable partition of $T^{-1} B$, and $\overline{\mu}(B_i) = \overline{\mu}\left(T^{-1} B_i\right) < \infty$ for all feasible $i$. As a consequence,
\[
\overline{\mu}\left(T^{-1} B\right) = \sum_{i=1}^{\infty} \overline{\mu}\left(T^{-1} B_i\right) = \sum_{i=1}^{\infty} \overline{\mu}(B_i) = \overline{\mu}(B).
\]
Hence, $(\Omega, \mathscr{F}, \overline{\mu}, T)$ is invariant.

Remark 6.1. Obviously, an invariant system $(\Omega, \mathscr{F}, m, T)$ is a.m.s. with $m$ being the invariant mean of itself. Actually, if $\mu$ in Definition 6.1.1 is finite, then the second requirement in the definition is redundant, because the fact that $\overline{\mu}(B) = \mu(B)$ for any $T$-invariant set $B$ can be deduced from the first requirement.
Therefore, Definition 6.1.1 covers the original definition from [GK80] as a special case. However, for a non-finite measure, the second condition is crucial.

² In [GK80], the term “stationary mean” is used instead of “invariant mean” for a finite measure $\mu$.

Example 6.1.1. Let $\mathbb{R}^+ = (0, +\infty)$, $\mathscr{B}$ be the Borel σ-algebra on $\mathbb{R}^+$, $\mu$ be the Lebesgue measure on $(\mathbb{R}^+, \mathscr{B})$, and $T(x) = x^2$, $\forall\ x \in \mathbb{R}^+$. Consider the set function $\lambda : \mathscr{B} \to \mathbb{R}$ given by:
1. For all $B \in \mathscr{B}$ with $\mu(B) < \infty$, $\lambda(B) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\left(T^{-i} B\right)$;
2. For all $B \in \mathscr{B}$ with $\mu(B) = \infty$, $\lambda(B) = \sum_{i=1}^{\infty} \lambda(B_i)$, where $\mu(B_i) < \infty$ and the $B_i$’s form a countable partition of $B$.
It is easy to verify that $\lambda$ is well-defined. In exact terms, for any measurable set $B$ with $\mu(B) = \infty$ and any two countable partitions $\{B_i'\}$ and $\{B_i''\}$, where $\mu(B_i') < \infty$ and $\mu(B_i'') < \infty$, of $B$,
\[
\sum_{i=1}^{\infty} \lambda(B_i') = \sum_{i=1}^{\infty} \lambda(B_i'').
\]
In addition, one can also prove that $\lambda$ is a finite, hence σ-finite, measure over $(\mathbb{R}^+, \mathscr{B})$, since $\lambda(\mathbb{R}^+) = 1$. However, $\lambda$ is not an invariant mean of $\mu$, because $[1, +\infty)$ is a $T$-invariant set while
\[
\mu([1, +\infty)) = \infty \neq 0 = \lambda([1, +\infty)).
\]
From this one sees that $(\mathbb{R}^+, \mathscr{B}, \mu, T)$ is not a.m.s. To prove this by contradiction, suppose $\overline{\mu}$ is an invariant mean of $\mu$, then
\[
\begin{aligned}
\overline{\mu}([1, +\infty)) &= \sum_{j=1}^{\infty} \overline{\mu}([j, j+1)) \\
&= \sum_{j=1}^{\infty} \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\left(T^{-i} [j, j+1)\right) = \sum_{j=1}^{\infty} 0 = 0 \\
&\neq \infty = \mu([1, +\infty)).
\end{aligned}
\]

Definition 6.1.2. Given two dynamical systems $(\Omega_1, \mathscr{F}_1, \mu_1, T_1)$ and $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$, a mapping $\phi : \Omega_1 \to \Omega_2$ is said to be a homomorphism if
1. $\phi$ is measurable;
2. $\mu_1\left(\phi^{-1}(B_2)\right) = \mu_2(B_2)$, $\forall\ B_2 \in \mathscr{F}_2$;
3. $\phi T_1 = T_2 \phi$ $\mu_1$-a.e.
$(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ is then called a factor of $(\Omega_1, \mathscr{F}_1, \mu_1, T_1)$. Furthermore, $\phi$ is said to be an isomorphism, if there exists a homomorphism $\psi : \Omega_2 \to \Omega_1$ such that
\[
\omega_1 = \psi(\phi(\omega_1)) \ \mu_1\text{-a.e. and } \omega_2 = \phi(\psi(\omega_2)) \ \mu_2\text{-a.e.}
\]

Proposition 6.1.2. In Definition 6.1.2, if $B_2$ is $T_2$-invariant, then $\phi^{-1}(B_2)$ is $T_1$-invariant.

Proof. $\phi^{-1}(B_2) = \phi^{-1}\left(T_2^{-1} B_2\right) = T_1^{-1} \phi^{-1}(B_2)$.
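As a rough numerical companion to Example 6.1.1, the following Python sketch evaluates the Cesàro averages $\frac{1}{n}\sum_{i=0}^{n-1}\mu\left(T^{-i}B\right)$ for intervals $B = [a, b)$. It relies on the closed form $\mu\left(T^{-i}[a, b)\right) = b^{2^{-i}} - a^{2^{-i}}$, which holds for $T(x) = x^2$ on $\mathbb{R}^+$. The function names `preimage_measure` and `cesaro_mean` are illustrative choices and not part of the thesis, and floating-point arithmetic only approximates the limits.

```python
def preimage_measure(a, b, i):
    """Lebesgue measure of T^{-i}[a, b) for T(x) = x**2 on (0, inf).

    T^{-i}[a, b) = [a**(2**-i), b**(2**-i)), so its length is the
    difference of the two end points. Requires 0 <= a < b.
    """
    e = 0.5 ** i
    lower = a ** e if a > 0 else 0.0  # for a = 0 the pre-image is (0, b**e)
    return b ** e - lower


def cesaro_mean(a, b, n):
    """(1/n) * sum_{i<n} mu(T^{-i}[a, b)), the average used to define lambda."""
    return sum(preimage_measure(a, b, i) for i in range(n)) / n


# Intervals [j, j+1) inside the invariant set [1, inf): the averages shrink
# towards 0, while the interval (0, 1) keeps full mass for every i.
for j in (1, 2, 5):
    print(j, cesaro_mean(j, j + 1, 1000))
print("(0,1):", cesaro_mean(0, 1, 1000))
```

The averages over $[j, j+1)$ decay like $1/n$, which is the numerical counterpart of the terms $\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\left(T^{-i}[j, j+1)\right) = 0$ used in the contradiction argument.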
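Theorem 6.1.1. If a dynamical system is a.m.s. (invariant or ergodic), then all of its factors are a.m.s. (invariant or ergodic).

Proof. Let $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ be a factor of $(\Omega_1, \mathscr{F}_1, \mu_1, T_1)$ and $\phi : \Omega_1 \to \Omega_2$ be a homomorphism.

a.m.s.: If $(\Omega_1, \mathscr{F}_1, \mu_1, T_1)$ is a.m.s. with invariant mean $\overline{\mu}_1$, then, for any $B_2 \in \mathscr{F}_2$ of finite measure,
\[
\begin{aligned}
\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu_2\left(T_2^{-i} B_2\right)
&= \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu_1\left(\phi^{-1}\left(T_2^{-i} B_2\right)\right) \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu_1\left(T_1^{-i} \phi^{-1}(B_2)\right) \\
&= \overline{\mu}_1\left(\phi^{-1}(B_2)\right) \\
&= \overline{\mu}_2(B_2),
\end{aligned}
\]
where $\overline{\mu}_2 = \overline{\mu}_1 \phi^{-1}$. Moreover, if $B_2 \in \mathscr{F}_2$ is $T_2$-invariant, then $\phi^{-1}(B_2)$ is $T_1$-invariant by Proposition 6.1.2. Thus,
\[
\overline{\mu}_2(B_2) = \overline{\mu}_1\left(\phi^{-1}(B_2)\right) = \mu_1\left(\phi^{-1}(B_2)\right) = \mu_2(B_2).
\]
Therefore, $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ is a.m.s.

invariant: If $(\Omega_1, \mathscr{F}_1, \mu_1, T_1)$ is invariant, then, $\forall\ B_2 \in \mathscr{F}_2$,
\[
\mu_2(B_2) = \mu_1\left(\phi^{-1}(B_2)\right) = \mu_1\left(T_1^{-1} \phi^{-1}(B_2)\right) = \mu_1\left(\phi^{-1}\left(T_2^{-1} B_2\right)\right) = \mu_2\left(T_2^{-1} B_2\right).
\]
Therefore, $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ is invariant.

ergodic: For any $B_2 \in \mathscr{F}_2$ that is $T_2$-invariant, $\phi^{-1}(B_2)$ is $T_1$-invariant by Proposition 6.1.2. If $(\Omega_1, \mathscr{F}_1, \mu_1, T_1)$ is ergodic, then either
\[
\mu_2(B_2) = \mu_1\left(\phi^{-1}(B_2)\right) = 0
\]
or
\[
\mu_2(\Omega_2 - B_2) = \mu_1\left(\phi^{-1}(\Omega_2 - B_2)\right) = \mu_1\left(\Omega_1 - \phi^{-1}(B_2)\right) = 0.
\]
Hence, $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ is ergodic.

6.1.2 Random Processes

Given a dynamical system $(\Omega_1, \mathscr{F}_1, \mu_1, T_1)$ with a probability measure $\mu_1$ and a measurable function $f_1 : \Omega_1 \to \mathscr{X}_1$ (for simplicity, assume that $\mathscr{X}_1$ is countable with σ-algebra $\mathscr{P}(\mathscr{X}_1)$, the power set of $\mathscr{X}_1$), we can define a random process
\[
\mathscr{M}_1 = \left\{X_i = f_1\left(T_1^i\right)\right\}_{i=0}^{\infty},
\]
such that
\[
\Pr\{X_j = x_j, \forall\ j \in I\} = \mu_1\left( \bigcap_{j \in I} T_1^{-j} f_1^{-1}(x_j) \right), \ \forall\ x_j \in \mathscr{X}_1 \text{ and } \forall\ I \subseteq \mathbb{N}.
\]
Actually, a random process can always be defined by a dynamical system according to the Kolmogorov Extension Theorem [Gra09, Theorem 3.3]. Let $\mathscr{M}_2 = \{X_i\}_{i=0}^{\infty}$ be a random process with a countable state space $\mathscr{X}_2$ and $\Omega_2 = \prod_{i=0}^{\infty} \mathscr{X}_2$.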
Define $\mathscr{F}_2$ to be the σ-algebra generated by
\[
\mathscr{G}_2 = \bigcup_{n=0}^{\infty} \left\{ \{x_0\} \times \{x_1\} \times \cdots \times \{x_n\} \times \prod_{i=n+1}^{\infty} \mathscr{X}_2 \,\middle|\, x_i \in \mathscr{X}_2, \forall\ 0 \leq i \leq n \right\}.
\]
The Kolmogorov Extension Theorem [Gra09, Theorem 3.3] states that there exists a unique probability measure $\mu_2$ defining the measure space $(\Omega_2, \mathscr{F}_2, \mu_2)$, such that
\[
\Pr\{X_j = x_j, \forall\ j \in I\} = \mu_2\left( \bigcap_{j \in I} \left( \prod_{i=0}^{j-1} \mathscr{X}_2 \times \{x_j\} \times \prod_{i=j+1}^{\infty} \mathscr{X}_2 \right) \right),
\]
for any $x_j \in \mathscr{X}_2$ and index set $I \subseteq \mathbb{N}$. Let $T_2 : \Omega_2 \to \Omega_2$ be the left shift, namely
\[
T_2((x_0, x_1, \cdots, x_n, \cdots)) = (x_1, x_2, \cdots, x_{n+1}, \cdots),
\]
and define $f_2 : \Omega_2 \to \mathscr{X}_2$ as $f_2 : (x_0, x_1, \cdots, x_n, \cdots) \mapsto x_0$. Then
\[
\mathscr{M}_2 = \left\{f_2\left(T_2^i\right)\right\}_{i=0}^{\infty}.
\]
In other words, the process $\mathscr{M}_2$ defines a dynamical system $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ “consistent” with itself.

Definition 6.1.3. Following the notation defined above, the random process $\mathscr{M}_1$ is said to be a.m.s. (stationary or ergodic), if $(\Omega_1, \mathscr{F}_1, \mu_1, T_1)$ is a.m.s. (invariant or ergodic).

Proposition 6.1.3. A function of an a.m.s. (stationary or ergodic) random process is a.m.s. (stationary or ergodic).

Proposition 6.1.4. Assuming that $\mathscr{M}_2$ is finite-state Markov, if $\mathscr{M}_2$ is irreducible, then $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ is a.m.s. and ergodic.

Proof. If $\mathscr{M}_2$ is irreducible, then its invariant distribution gives rise to the invariant mean of $\mu_2$. Thus, $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ is a.m.s. The ergodicity part follows from [Gra09, Lemma 7.15] (or [Gra09, Corollary 8.4]).

Remark 6.2. The converse of Proposition 6.1.4 is not true, even when $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ is also ergodic. For example, let $\mathscr{M}_2$ be a Markov process with transition matrix
\[
\begin{bmatrix} 1 & 0 \\ 0.5 & 0.5 \end{bmatrix}.
\]
One can verify that the corresponding system $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ defined as above is a.m.s. and ergodic, although $\mathscr{M}_2$ is obviously not irreducible. However, $\mathscr{M}_2$ always admits a unique invariant distribution.

Proposition 6.1.5. Assuming that $\mathscr{M}_1$ is finite-state Markov, if $(\Omega_1, \mathscr{F}_1, \mu_1, T_1)$ is a.m.s. and ergodic, then $\mathscr{M}_1$ admits a unique invariant distribution.

Proof.
Obviously, the invariant distribution is induced from the invariant mean of $\mu_1$. It is unique because the invariant mean of $\mu_1$ is unique.

When $\mathscr{M}_1 = \mathscr{M}_2$, it is easy to show that $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ becomes a factor of $(\Omega_1, \mathscr{F}_1, \mu_1, T_1)$, although $(\Omega_1, \mathscr{F}_1, \mu_1, T_1)$ and $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ are not necessarily isomorphic. Consequently,

Proposition 6.1.6. If $\mathscr{M}_1 = \mathscr{M}_2$ is a.m.s. (or stationary), then $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ is a.m.s. (or invariant).

Proof. The conclusion follows from Theorem 6.1.1, since $(\Omega_2, \mathscr{F}_2, \mu_2, T_2)$ is a factor of the a.m.s. (or invariant) dynamical system, say $(\Omega_1, \mathscr{F}_1, \mu_1, T_1)$, defining $\mathscr{M}_1$.

In fact, many properties of (not necessarily discrete) random processes are better described and easier to analyse through the underlying dynamical systems. Interested readers are referred to [Aar97, Gra09] for a systematic development of the theory.

6.2 Induced Transformations of A.M.S. Systems

6.2.1 Induced Transformations

For an invariant system $(\Omega, \mathscr{F}, m, T)$ with a finite measure $m$, Poincaré’s Recurrence Theorem guarantees that
\[
m\left( B - \bigcap_{i=0}^{\infty} \bigcup_{j=i}^{\infty} T^{-j} B \right) = 0, \ \forall\ B \in \mathscr{F}. \tag{6.2.1}
\]
As a consequence, for any $A \in \mathscr{F}$ ($m(A) > 0$), one can define a new transformation $T_A$ on $(A_0, \mathscr{A}, m|_{\mathscr{A}})$, where $A_0 = A \cap \bigcap_{i=0}^{\infty} \bigcup_{j=i}^{\infty} T^{-j} A$ and $\mathscr{A} = \{A_0 \cap B \mid B \in \mathscr{F}\}$, such that
\[
T_A(x) = T^{\psi_A^{(1)}(x)}(x), \ \forall\ x \in A_0,
\]
where
\[
\psi_A^{(1)}(x) = \min\left\{ i \in \mathbb{N}^+ \mid T^i(x) \in A_0 \right\}
\]
is the first return time function. Consequently, $(A_0, \mathscr{A}, m|_{\mathscr{A}}, T_A)$ forms a new dynamical system. Such a transformation $T_A$ is called an induced transformation of $(\Omega, \mathscr{F}, m, T)$ (with respect to $A$) [Kak43].

On the other hand, for an arbitrary a.m.s. dynamical system $(\Omega, \mathscr{F}, \mu, T)$, defining the induced transformation becomes delicate, because (6.2.1) is not necessarily valid even for a finite measure $\mu$, unless $\mu \ll \overline{\mu}$ [Gra09, Theorem 7.4]. Thus, there could be some $A \in \mathscr{F}$ of positive measure, such that $T_A$ is not defined on any non-empty subset of $A$.
To avoid a situation of this sort, we shall focus on dynamical systems for which (6.2.1) holds.

Definition 6.2.1. A dynamical system $(\Omega, \mathscr{F}, \mu, T)$ is said to be recurrent (conservative) if
\[
\mu\left( B - \bigcap_{i=0}^{\infty} \bigcup_{j=i}^{\infty} T^{-j} B \right) = 0, \ \forall\ B \in \mathscr{F}.
\]

Definition 6.2.2. In the setting of Definition 6.1.3, the random process $\mathscr{M}_1$ is said to be recurrent, if $(\Omega_1, \mathscr{F}_1, \mu_1, T_1)$ is recurrent.

Proposition 6.2.1. A function of a recurrent random process is recurrent.

By Poincaré’s Recurrence Theorem,

Proposition 6.2.2. All stationary random processes are recurrent.

Remark 6.3. There are several equivalent definitions of recurrence (conservativeness). Please refer to [Gra09, Chapter 7.4] and [Aar97] for more details. The physical interpretation of recurrence (conservativeness) is that an event of positive probability is expected to repeat itself infinitely often during the lifetime of the dynamical system. Because of this physical meaning, recurrence is often assumed for ergodic systems in the literature [Aar97].

It is well-known that, for a recurrent invariant system $(\Omega, \mathscr{F}, m, T)$ with $m$ being σ-finite, $(A_0, \mathscr{A}, m|_{\mathscr{A}}, T_A)$ with $0 < m(A) < \infty$ is invariant. Unfortunately, the available proof of this result relies heavily on the invariance assumption. In other words, for more general systems, e.g. a.m.s. systems, the case is not yet settled. Thus, the sole purpose of this section is to prove that, if $(\Omega, \mathscr{F}, \mu, T)$ is a recurrent a.m.s. dynamical system with $\mu$ being σ-finite, then $(A_0, \mathscr{A}, \mu|_{\mathscr{A}}, T_A)$ is also a.m.s. for all $0 < \mu(A) < \infty$. At the same time, a connection between the invariant mean $\overline{\mu|_{\mathscr{A}}}$ of $\mu|_{\mathscr{A}}$ and the invariant mean $\overline{\mu}$ of $\mu$ is established (see Theorem 6.2.1 and Theorem 6.2.4).

As a direct consequence of this assertion, the Shannon–McMillan–Breiman (SMB) Theorem, as well as the Shannon–McMillan Theorem, holds simultaneously for all reduced processes of any finite-state recurrent a.m.s. random process (see Section 6.3).
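To make the first return time $\psi_A^{(1)}$ and the induced transformation $T_A$ of Section 6.2.1 concrete, here is a minimal Python sketch on a finite toy system. The toy system (a cyclic shift on six points with counting measure, which is invariant and recurrent) and the helper names `first_return_time` and `induced_map` are assumptions made purely for illustration; they do not appear in the thesis.

```python
def first_return_time(T, A, x, max_steps=10_000):
    """Smallest i >= 1 with T^i(x) in A, where T is a list encoding a map
    on {0, ..., len(T) - 1}. Raises if x does not return within max_steps."""
    y = T[x]
    for i in range(1, max_steps + 1):
        if y in A:
            return i
        y = T[y]
    raise ValueError("point does not return to A within max_steps")


def induced_map(T, A):
    """Induced transformation T_A(x) = T^{psi(x)}(x) for every x in A."""
    result = {}
    for x in A:
        y = x
        for _ in range(first_return_time(T, A, x)):
            y = T[y]
        result[x] = y
    return result


# Hypothetical example: cyclic shift on six points, A = {0, 2, 3}.
T = [(x + 1) % 6 for x in range(6)]
A = {0, 2, 3}
T_A = induced_map(T, A)
return_times = {x: first_return_time(T, A, x) for x in A}
print(T_A, return_times)
```

In this toy case every point of $A$ returns to $A$, so $A_0 = A$, and $T_A$ is a bijection of $A$. The counting measure restricted to $A$ is therefore invariant under $T_A$, mirroring the known result for invariant systems that the section sets out to generalise.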
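6.2.2 Finite Measure µ

We first prove the assertion for dynamical systems equipped with finite measures. To facilitate our discussion, we designate $1_A$ as the indicator function of a set $A \subseteq \Omega$. To be precise,
\[
1_A(x) = \begin{cases} 1; & \text{if } x \in A; \\ 0; & \text{if } x \in \Omega - A. \end{cases}
\]

Theorem 6.2.1. For a recurrent a.m.s. dynamical system $(\Omega, \mathscr{F}, \mu, T)$ with a finite measure $\mu$ and any $A \in \mathscr{F}$ with $\mu(A) > 0$, $(A_0, \mathscr{A}, \mu|_{\mathscr{A}}, T_A)$ is a.m.s. Moreover, the invariant mean $\overline{\mu|_{\mathscr{A}}}$ of $\mu|_{\mathscr{A}}$ admits
\[
\overline{\mu|_{\mathscr{A}}}(B) = \int_A \frac{\overline{1}_B}{\overline{1}_A} d\mu, \ \forall\ B \in \mathscr{A}.
\]

Remark 6.4. The integral in Theorem 6.2.1 implicitly requires that $\overline{1}_A \neq 0$ $\mu$-a.e. on $A$, as we will prove later (see Lemma 6.2.2). Besides, as mentioned, $\overline{1}_A = E_{\overline{\mu}}(1_A|\mathscr{I})$ and $\overline{1}_B = E_{\overline{\mu}}(1_B|\mathscr{I})$, where $\mathscr{I}$ is the σ-algebra of $T$-invariant sets, $\mu$-a.e. and $\overline{\mu}$-a.e. on $\Omega$. Therefore,
\[
\overline{\mu|_{\mathscr{A}}}(B) = \int_A \frac{E_{\overline{\mu}}(1_B|\mathscr{I})}{E_{\overline{\mu}}(1_A|\mathscr{I})} d\mu, \ \forall\ B \in \mathscr{A}.
\]

To prove Theorem 6.2.1, a couple of supporting lemmas are required.

Lemma 6.2.1. Let $(\Omega, \mathscr{F}, \mu, T)$ ($\mu$ is not necessarily finite) be an arbitrary dynamical system. For any $A \subseteq \Omega$ and $x \in \Omega$ for which the limit $\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} 1_A T^i(x)$ exists, let $O = \left\{ \omega \in \Omega \mid \overline{1}_A(\omega) = 0 \right\}$. We have that the limit $\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} 1_{A-O} T^i(x)$ exists and $\overline{1}_{A-O}(x) = \overline{1}_A(x)$.

Proof. By definition,
\[
\frac{1}{n} \sum_{i=0}^{n-1} 1_A T^i(x) = \frac{1}{n} \sum_{i=0}^{n-1} 1_{A-O} T^i(x) + \frac{1}{n} \sum_{i=0}^{n-1} 1_{A \cap O} T^i(x).
\]
If $T^i(x) \notin A \cap O$ for all $i \in \mathbb{N}$, then $\frac{1}{n} \sum_{i=0}^{n-1} 1_{A \cap O} T^i(x)$ constantly equals $0$. Otherwise, $T^{i_0}(x) \in A \cap O$ for some $i_0$. Let $k = \min\{i \in \mathbb{N} \mid T^i(x) \in A \cap O\}$ and $y = T^k(x)$. Then, for all $n > k$, we have
\[
\begin{aligned}
\frac{1}{n} \sum_{i=0}^{n-1} 1_{A \cap O} T^i(x) &= \frac{1}{n} \sum_{i=k}^{n-1} 1_{A \cap O} T^i(x) \\
&= \frac{1}{n} \sum_{i=0}^{n-k-1} 1_{A \cap O} T^i(y) \\
&\leq \frac{1}{n} \sum_{i=0}^{n-k-1} 1_A T^i(y) \\
&\to 0, \ n \to \infty,
\end{aligned}
\]
since $y \in O$. Therefore, $\frac{1}{n} \sum_{i=0}^{n-1} 1_{A-O} T^i(x) \to \overline{1}_A(x)$, $n \to \infty$.

Lemma 6.2.2. In Theorem 6.2.1, we have that $\overline{1}_A \neq 0$ a.e. on $A$, with respect to both $\mu$ and its invariant mean $\overline{\mu}$.

Proof. Let $O = \left\{ \omega \in \Omega \mid \overline{1}_A(\omega) = 0 \right\}$. We get
\[
\begin{aligned}
\overline{\mu}(A) &= \int_{\Omega} \overline{1}_A d\mu && (6.2.2) \\
&= \int_{\Omega} \overline{1}_{A-O} d\mu && (6.2.3) \\
&= \overline{\mu}(A - O), && (6.2.4)
\end{aligned}
\]
where (6.2.2) and (6.2.4) are due to the fact that $(\Omega, \mathscr{F}, \mu, T)$ is a.m.s. [Gra09, Corollary 7.9], and (6.2.3) follows from Lemma 6.2.1. Consequently, $\overline{\mu}(A \cap O) = 0$. Since $(\Omega, \mathscr{F}, \mu, T)$ is a.m.s. and recurrent, we have that $\mu \ll \overline{\mu}$ by [Gra09, Theorem 7.4]. Therefore, $\mu(A \cap O) = 0$.

Proof of Theorem 6.2.1. For any $x \in A_0$ and positive integer $n$, let
\[
\psi_A^{(n)}(x) = \sum_{i=0}^{n-1} \psi_A^{(1)}\left(T_A^i(x)\right).
\]
It is easy to see that $\psi_A^{(n)}$ (the $n$th return time function) is well-defined since the system is recurrent. For any $B \in \mathscr{A}$, we have that
\[
\begin{aligned}
\frac{1}{n} \sum_{i=0}^{n-1} \mu\left(T_A^{-i} B\right) &= \frac{1}{n} \int_{A_0} \sum_{i=0}^{n-1} 1_B T_A^i(\omega) d\mu(\omega) \\
&= \int_A \frac{1}{n} \sum_{i=0}^{n-1} 1_B T_A^i(\omega) d\mu(\omega) && (6.2.5) \\
&= \int_A \frac{1}{n} \sum_{i=0}^{\psi_A^{(n)}(\omega)-1} 1_B T^i(\omega) d\mu(\omega) \\
&= \int_A \frac{\psi_A^{(n)}(\omega)}{n} \frac{1}{\psi_A^{(n)}(\omega)} \sum_{i=0}^{\psi_A^{(n)}(\omega)-1} 1_B T^i(\omega) d\mu(\omega),
\end{aligned}
\]
where (6.2.5) follows because $\mu(A - A_0) = 0$ since the system is recurrent. Due to the fact that $(\Omega, \mathscr{F}, \mu, T)$ is a.m.s., it follows that
\[
\frac{n}{\psi_A^{(n)}(\omega)} = \frac{1}{\psi_A^{(n)}(\omega)} \sum_{i=0}^{\psi_A^{(n)}(\omega)-1} 1_A T^i(\omega) \to \overline{1}_A(\omega) \ \mu\text{-a.e.}
\]
and
\[
\frac{1}{\psi_A^{(n)}(\omega)} \sum_{i=0}^{\psi_A^{(n)}(\omega)-1} 1_B T^i(\omega) \to \overline{1}_B(\omega) \ \mu\text{-a.e.}
\]
as $n \to \infty$. Let $O = \{\omega \in \Omega \mid \overline{1}_A(\omega) = 0\}$. We conclude that
\[
\begin{aligned}
\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\left(T_A^{-i} B\right)
&= \lim_{n \to \infty} \int_{A-O} \frac{\psi_A^{(n)}(\omega)}{n} \frac{1}{\psi_A^{(n)}(\omega)} \sum_{i=0}^{\psi_A^{(n)}(\omega)-1} 1_B T^i(\omega) d\mu(\omega) && (6.2.6) \\
&= \int_{A-O} \frac{\overline{1}_B}{\overline{1}_A} d\mu = \int_A \frac{\overline{1}_B}{\overline{1}_A} d\mu, && (6.2.7)
\end{aligned}
\]
where (6.2.6) is due to the fact that $\mu(A \cap O) = 0$ by Lemma 6.2.2 and (6.2.7) follows from the Dominated Convergence Theorem [Rud86]. The theorem is established.

Corollary 6.2.1. If $(\Omega, \mathscr{F}, \mu, T)$ in Theorem 6.2.1 is ergodic, then
\[
\overline{\mu|_{\mathscr{A}}}(B) = \frac{\mu(A) \overline{\mu}(B)}{\overline{\mu}(A)}, \ \forall\ B \in \mathscr{A}.
\]

Proof. If $(\Omega, \mathscr{F}, \mu, T)$ is ergodic, then $\overline{1}_A = \overline{\mu}(A)$ and $\overline{1}_B = \overline{\mu}(B)$ a.e. with respect to both $\mu$ and $\overline{\mu}$. The statement follows.

Remark 6.5. By Corollary 6.2.1, the system $\left(A_0, \mathscr{A}, \frac{1}{\mu(A)} \mu|_{\mathscr{A}}, T_A\right)$ is a.m.s. and ergodic, and $\frac{1}{\mu(A)} \mu|_{\mathscr{A}}$ is a probability measure on $(A_0, \mathscr{A})$ with invariant mean
\[
\overline{\frac{1}{\mu(A)} \mu|_{\mathscr{A}}} = \frac{1}{\overline{\mu}(A)} \overline{\mu}|_{\mathscr{A}}.
\]

For dynamical systems with finite measures, it is indeed quite natural to believe that an induced transformation of a recurrent a.m.s.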
system is also a.m.s., hinted by the fact that an induced transformation of an invariant system is invariant. However, as seen from the above, the proof for the case of a.m.s. systems does not follow naturally from the one for the invariant case [Aar97]. After all, the system is no longer invariant.

6.2.3 σ-finite Measure µ

In the previous section, the assumption that $\mu$ is finite is important; it comes into play in many places in our argument. This assumption supports the use of the Dominated Convergence Theorem in the proof of Theorem 6.2.1, and it is also required to guarantee convergence ($\mu$-a.e.) of the sample mean of a bounded measurable real-valued function. Consequently, if $\mu$ is not finite, our method of proving Theorem 6.2.1 is not applicable. In this section, we therefore prove our assertion for the case of a σ-finite measure with a different approach, which involves the ratio ergodic theorem of [Hop70].

For convenience, we define $S_n(f)$ to be the finite sum $\sum_{i=0}^{n-1} f T^i$, for some given transformation $T$, non-negative integer $n$ and real-valued function $f$.

Theorem 6.2.2 (Ratio Ergodic Theorem for Invariant Systems³). Let $(\Omega, \mathscr{F}, m, T)$ be an invariant dynamical system with $m$ being σ-finite. For any $f, g \in L^1(m)$ such that $g \geq 0$ and $\int_{\Omega} g\, dm > 0$, there exists a function $h(f, g) : \Omega \to \mathbb{R}$, such that
\[
\lim_{n \to \infty} \frac{S_n(f)}{S_n(g)} = h(f, g) \ m\text{-a.e. on } D = \left\{ \omega \in \Omega \,\middle|\, \sup_n S_n(g)(\omega) = \infty \right\}.
\]
Moreover, $h(f, g)$ is $T$-invariant $m$-a.e. on $D$, it is $\mathscr{I}$-measurable, where $\mathscr{I} \subseteq D \cap \mathscr{F}$ is the σ-algebra of $T$-invariant sets, and
\[
\int_I f\, dm = \int_I h(f, g) g\, dm, \ \forall\ I \in \mathscr{I}.
\]

To our knowledge, the first⁴ general ergodic theorem for a.m.s. systems is the generalisation of Birkhoff’s ergodic theorem [Bir31] presented in [GK80]. Similarly, there is a version of Hopf’s ratio ergodic theorem for a.m.s. systems.

³ Hopf’s ratio ergodic theorem for invariant systems is often presented differently in the literature, in each instance with different and delicate details. Readers are kindly referred to the related literature ([Hop70, Ste36, KK97, Zwe04], etc.) for more information.

⁴ An earlier ergodic theorem from [Hur44] works for systems that are not necessarily invariant. However, that result relies on some additional constraints which, to the best of our knowledge, hinder an extension to a.m.s. systems.

Theorem 6.2.3 (Ratio Ergodic Theorem for A.M.S. Systems). Given an a.m.s. dynamical system $(\Omega, \mathscr{F}, \mu, T)$ with $\mu$ being σ-finite, let $\overline{\mu}$ be the invariant mean of $\mu$. For any $f, g \in L^1(\overline{\mu})$ such that $g \geq 0$ and $\int_{\Omega} g\, d\overline{\mu} > 0$, there exists a function $h(f, g) : \Omega \to \mathbb{R}$, such that
\[
\lim_{n \to \infty} \frac{S_n(f)}{S_n(g)} = h(f, g); \quad h(f, g) = h(f, g) T \quad \text{a.e. on } D = \left\{ \omega \in \Omega \,\middle|\, \sup_n S_n(g)(\omega) = \infty \right\}
\]
with respect to both $\mu$ and $\overline{\mu}$. Moreover, if $(\Omega, \mathscr{F}, \mu, T)$ is ergodic, then either $\mu(D) = \overline{\mu}(D) = 0$ or
\[
h(f, g) = \frac{\int_{\Omega} f\, d\overline{\mu}}{\int_{\Omega} g\, d\overline{\mu}} \ \mu\text{-a.e. and } \overline{\mu}\text{-a.e. on } \Omega.
\]

Proof. By Theorem 6.2.2,
\[
\lim_{n \to \infty} \frac{S_n(f)}{S_n(g)} = h(f, g) \ \overline{\mu}\text{-a.e. on } D,
\]
for some function $h(f, g) : \Omega \to \mathbb{R}$. Let
\[
h^* = \limsup_{n \to \infty} \frac{S_n(f)}{S_n(g)}, \quad h_* = \liminf_{n \to \infty} \frac{S_n(f)}{S_n(g)},
\]
and define $D_l^u = \{x \in D \mid h^*(x) \geq u, h_*(x) \leq l\}$ for all $l, u \in \mathbb{Q}$. Obviously, $D_l^u$ is $T$-invariant. Thus, $\mu(D_l^u) = \overline{\mu}(D_l^u) = 0$, $\forall\ l < u$, because $h^* = h_*$ $\overline{\mu}$-a.e. on $D$ by Theorem 6.2.2. Consequently,
\[
\mu\left(\{x \in D \mid h^*(x) > h_*(x)\}\right) = \mu\left( \bigcup_{l < u} D_l^u \right) \leq \sum_{l < u} \mu(D_l^u) = 0.
\]
At every point $x \in D$ where the limit $\lim_{n \to \infty} \frac{S_n(f)(x)}{S_n(g)(x)}$ exists, it is obvious that
\[
\lim_{n \to \infty} \frac{S_n(f)(x)}{S_n(g)(x)} = \lim_{n \to \infty} \frac{S_n(f)(T x)}{S_n(g)(T x)}.
\]
Therefore, $h(f, g) = h(f, g) T$ a.e. on $D$ with respect to both $\mu$ and $\overline{\mu}$. The last statement is valid due to ergodicity.

Remark 6.6. In Theorem 6.2.3, $\int_{\Omega} g\, d\overline{\mu} > 0$ can be replaced by $\int_{\Omega} g\, d\mu > 0$ if the system is recurrent. This is because $\int_{\Omega} g\, d\mu > 0 \implies \int_{\Omega} g\, d\overline{\mu} > 0$ by Lemma 6.2.3.

Lemma 6.2.3. Given an a.m.s. dynamical system $(\Omega, \mathscr{F}, \mu, T)$ with $\mu$ being σ-finite, let $\overline{\mu}$ be the invariant mean of $\mu$.
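If $(\Omega, \mathscr{F}, \mu, T)$ is recurrent, then $\mu \ll \overline{\mu}$.

Proof. For any $B \in \mathscr{F}$ such that $\overline{\mu}(B) = 0$, let $B_{\infty} = \bigcap_{i=0}^{\infty} \bigcup_{j=i}^{\infty} T^{-j} B$. We have that
\[
0 = \sum_{j=0}^{\infty} \overline{\mu}(B) = \sum_{j=0}^{\infty} \overline{\mu}\left(T^{-j} B\right) \geq \overline{\mu}\left( \bigcup_{j=0}^{\infty} T^{-j} B \right) \geq \overline{\mu}(B_{\infty}) \geq 0.
\]
Therefore, $\mu(B_{\infty}) = \overline{\mu}(B_{\infty}) = 0$ since $B_{\infty}$ is $T$-invariant. Thus, $\mu(B) = \mu(B - B_{\infty})$. Moreover, $\mu(B - B_{\infty}) = 0$ by the definition of recurrence. In conclusion, $\mu(B) = 0$.

Remark 6.7. Whenever $\mu$ is finite, the converse of Lemma 6.2.3 is also valid [Gra09, Theorem 7.4]. However, it is not necessarily true for a non-finite measure $\mu$.

Theorem 6.2.4. For a recurrent a.m.s. dynamical system $(\Omega, \mathscr{F}, \mu, T)$ with $\mu$ being σ-finite and any $A \in \mathscr{F}$ with $0 < \mu(A) < \infty$, $(A_0, \mathscr{A}, \mu|_{\mathscr{A}}, T_A)$ is a.m.s. In particular, the invariant mean $\overline{\mu|_{\mathscr{A}}}$ of $\mu|_{\mathscr{A}}$ satisfies
\[
\overline{\mu|_{\mathscr{A}}}(B) = \int_A h(1_B, 1_A) d\mu, \ \forall\ B \in \mathscr{A},
\]
where $h(1_B, 1_A) : \Omega \to \mathbb{R}$ satisfies
\[
h(1_B, 1_A) = \lim_{n \to \infty} \frac{S_n(1_B)}{S_n(1_A)} \ \text{a.e. on } D = \left\{ \omega \in \Omega \,\middle|\, \sup_n S_n(1_A)(\omega) = \infty \right\} \tag{6.2.8}
\]
with respect to both $\mu$ and $\overline{\mu}$.

Proof. First of all,
\[
\int_{\Omega} 1_A d\mu = \mu(A) > 0 \implies \int_{\Omega} 1_A d\overline{\mu} > 0
\]
by Lemma 6.2.3. Furthermore, since $\mu(B) \leq \mu(A) < \infty$ for any $B \in \mathscr{A}$, we have that
\[
\int_{\Omega} 1_B d\overline{\mu} = \overline{\mu}(B) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\left(T^{-i} B\right) < \infty
\]
and
\[
\int_{\Omega} 1_A d\overline{\mu} = \overline{\mu}(A) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\left(T^{-i} A\right) < \infty
\]
by definition. Therefore, there exists a function $h(1_B, 1_A) : \Omega \to \mathbb{R}$ satisfying (6.2.8) based on Theorem 6.2.3. Moreover, we have that
\[
\begin{aligned}
\frac{1}{n} \sum_{i=0}^{n-1} \mu\left(T_A^{-i} B\right) &= \frac{1}{n} \int_{A_0} \sum_{i=0}^{n-1} 1_B T_A^i(\omega) d\mu(\omega) \\
&= \int_{A_0} \frac{S_{k_n}(1_B)}{S_{k_n}(1_A)} d\mu(\omega), \quad \text{where } k_n(\omega) = \psi_A^{(n)}(\omega).
\end{aligned}
\]
Obviously, $0 \leq h(1_B, 1_A) \leq 1$ $\mu$-a.e. and $\overline{\mu}$-a.e. on $D$ because $1_B \leq 1_A$, and $A_0 \subseteq D$ by the definitions of $A_0$ and $D$. Since $\mu(A_0) = \mu(A) < \infty$, the Dominated Convergence Theorem [Rud86] ensures that
\[
\overline{\mu|_{\mathscr{A}}}(B) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu\left(T_A^{-i} B\right) = \int_{A_0} h(1_B, 1_A) d\mu = \int_A h(1_B, 1_A) d\mu.
\]
The statement is proved.

Remark 6.8. In the proof of Theorem 6.2.4, the condition $\mu(A) < \infty$ cannot be dropped, since it ensures that $1_A \in L^1(\overline{\mu})$, i.e. $\overline{\mu}(A) < \infty$.

Corollary 6.2.2.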
In Theorem 6.2.4, if $(\Omega, \mathscr{F}, \mu, T)$ is ergodic, then
\[
\overline{\mu|_{\mathscr{A}}}(B) = \frac{\mu(A) \overline{\mu}(B)}{\overline{\mu}(A)}, \ \forall\ B \in \mathscr{A}.
\]

Proof. Since $\mu(D) \geq \mu(A_0) = \mu(A) > 0$ and $(\Omega, \mathscr{F}, \mu, T)$ is ergodic, we have that $\mu(\Omega - D) = 0$ and
\[
h(1_B, 1_A) = \frac{\int_{\Omega} 1_B d\overline{\mu}}{\int_{\Omega} 1_A d\overline{\mu}} = \frac{\overline{\mu}(B)}{\overline{\mu}(A)} \ \mu\text{-a.e.}, \ \forall\ B \in \mathscr{A},
\]
by Theorem 6.2.3. The conclusion follows.

6.3 Extended Shannon–McMillan–Breiman Theorem

Let $(\Omega, \mathscr{F}, \mu, T)$ be a dynamical system with $\mu$ being a probability measure, and $X$ be a random variable (a measurable function) with a finite sample space $\mathscr{X}$ defined on $(\Omega, \mathscr{F}, \mu)$. From Section 6.1.2, the random process
\[
\{X_i\}_{i=0}^{\infty} = \left\{X\left(T^i\right)\right\}_{i=0}^{\infty}
\]
has distribution
\[
p(x_0, x_1, \cdots, x_n) = \mu\left( \bigcap_{j=0}^{n} T^{-j} X^{-1}(x_j) \right), \ \forall\ n \in \mathbb{N}.
\]

Theorem 6.3.1 (Shannon–McMillan Theorem [Sha48, McM53]). If $\{X_i\}_{i=0}^{\infty}$ is stationary ergodic, i.e. $(\Omega, \mathscr{F}, \mu, T)$ is invariant ergodic, then
\[
-\frac{1}{n} \log p(X_0, X_1, \cdots, X_{n-1}) \to h \ \text{in } L^1(\mu),
\]
where
\[
h = \lim_{n \to \infty} \frac{1}{n} E\left(-\log p(X_0, X_1, \cdots, X_{n-1})\right) = \lim_{n \to \infty} \frac{1}{n} H(X_0, X_1, \cdots, X_{n-1}).
\]

Theorem 6.3.2 (Shannon–McMillan–Breiman (SMB) Theorem [Bre57]). If $\{X_i\}_{i=0}^{\infty}$ is stationary ergodic, i.e. $(\Omega, \mathscr{F}, \mu, T)$ is invariant ergodic, then
\[
-\frac{1}{n} \log p(X_0, X_1, \cdots, X_{n-1}) \to h \ \text{a.e.},
\]
where
\[
h = \lim_{n \to \infty} \frac{1}{n} E\left(-\log p(X_0, X_1, \cdots, X_{n-1})\right) = \lim_{n \to \infty} \frac{1}{n} H(X_0, X_1, \cdots, X_{n-1}).
\]

Remark 6.9. The constant $h$ in the above theorems is called the entropy rate of the process $\{X_i\}_{i=0}^{\infty}$. It can be proved that
\[
h = \lim_{n \to \infty} H(X_n | X_0, X_1, \cdots, X_{n-1}).
\]

In fact, the stationary (invariant) condition can be further relaxed. Being a.m.s. is already sufficient.

Theorem 6.3.3 ([GK80, Corollary 4]). The SMB Theorem and the Shannon–McMillan Theorem hold for any a.m.s. process with finite state space.

In addition to being a.m.s., assume that $(\Omega, \mathscr{F}, \mu, T)$ is also recurrent. Given a subset $\mathscr{Y} \subseteq \mathscr{X}$ of positive probability, i.e.
$\Pr\{X \in \mathscr{Y}\} > 0$, the reduced process $\{Y_j\}_{j=0}^{\infty}$ with sub-state space $\mathscr{Y}$ is defined to be
\[
\{Y_j\}_{j=0}^{\infty} = \left\{X_{i_j}\right\}_{j=0}^{\infty},
\quad \text{where } i_j = \begin{cases} \min\{i \geq 0 \mid X_i \in \mathscr{Y}\}; & j = 0; \\ \min\{i > i_{j-1} \mid X_i \in \mathscr{Y}\}; & j > 0. \end{cases}
\]
It is of interest to know whether the SMB Theorem (the Shannon–McMillan Theorem) also holds for $\{Y_j\}_{j=0}^{\infty}$. Let $A = X^{-1}(\mathscr{Y})$ and $A_0 = A \cap \bigcap_{i=0}^{\infty} \bigcup_{j=i}^{\infty} T^{-j} A$. It is easily seen that
\[
\{Y_j\}_{j=0}^{\infty} = \left\{X\left(T_A^j\right)\right\}_{j=0}^{\infty}
\]
is essentially a random process defined on $\left(A_0, A_0 \cap \mathscr{F}, \frac{1}{\mu(A)} \mu|_{A_0 \cap \mathscr{F}}, T_A\right)$, which is a.m.s. by Theorem 6.2.1 (by Theorem 6.2.4 as well) and ergodic by [Aar97, Proposition 1.5.2]. In conclusion, the SMB Theorem (the Shannon–McMillan Theorem) holds for the reduced process $\{Y_j\}_{j=0}^{\infty}$.

Theorem 6.3.4 (Extended SMB Theorem). Given a recurrent a.m.s. ergodic dynamical system $(\Omega, \mathscr{F}, \mu, T)$ with probability measure $\mu$, $\{B_1, B_2, \cdots, B_n, \cdots\} \subseteq \mathscr{F}$ and a measurable function $X : \Omega \to \mathscr{X}$ ($\mathscr{X}$ is finite), the SMB Theorem, as well as the Shannon–McMillan Theorem, holds simultaneously for all the processes
\[
\left\{X\left(T_{B_i}^j\right)\right\}_{j=0}^{\infty}, \ i \in \mathbb{N}^+.
\]

Proof. The statement follows from Theorem 6.2.1 (Theorem 6.2.4 as well) and Theorem 6.3.3.

Corollary 6.3.1. The SMB Theorem, as well as the Shannon–McMillan Theorem, holds simultaneously for all reduced processes of any recurrent a.m.s. ergodic random process of finite states.

Proof. The statement follows from Theorem 6.3.4 by letting $\mathscr{Y}_1, \mathscr{Y}_2, \cdots, \mathscr{Y}_{2^{|\mathscr{X}|}-1}$ be all the non-empty subsets of $\mathscr{X}$, $B_i = X^{-1}(\mathscr{Y}_i)$, $\forall\ 1 \leq i \leq 2^{|\mathscr{X}|} - 1$ and $B_i = X^{-1}(\mathscr{X}) = \Omega$, $\forall\ i > 2^{|\mathscr{X}|} - 1$.

6.A Appendix

6.A.1 Proof of Proposition 4.2.3

By Proposition 6.1.4, we have that $\left\{X^{(n)}\right\}$ is a.m.s. ergodic, since it is irreducible. Moreover, for any function $\Gamma$, $\left\{\Gamma\left(X^{(n)}\right)\right\}$ is a.m.s. and ergodic by Proposition 6.1.3. Thus, the SMB Theorem holds for $\left\{\Gamma\left(X^{(n)}\right)\right\}$ by Theorem 6.3.3, i.e.
\[
-\frac{1}{n} \log \Pr\left\{ \left[ \Gamma\left(X^{(1)}\right), \Gamma\left(X^{(2)}\right), \cdots, \Gamma\left(X^{(n)}\right) \right] \right\} \to H_{\Gamma, \mathscr{X}}
\]
with probability 1.
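In addition, we say that two functions $\Gamma' : \mathscr{X} \to \mathscr{D}'$ and $\Gamma'' : \mathscr{X} \to \mathscr{D}''$ belong to the same class, if $\Gamma' = \pi \Gamma''$ for some bijection $\pi : \Gamma''(\mathscr{X}) \to \Gamma'(\mathscr{X})$. Obviously, there are $P$ classes of functions defined on $\mathscr{X}$, where $P$ is the number of all partitions⁵ of $\mathscr{X}$. Meanwhile, given any two functions $\Gamma'$ and $\Gamma''$ from the same class, it is obvious that $H_{\Gamma', \mathscr{X}} = H_{\Gamma'', \mathscr{X}}$ and
\[
\Pr\left\{ \Gamma'\left(X^{(l)}\right) = \Gamma'\left(x^{(l)}\right), \forall\ 1 \leq l \leq n \right\}
= \Pr\left\{ \Gamma''\left(X^{(l)}\right) = \Gamma''\left(x^{(l)}\right), \forall\ 1 \leq l \leq n \right\}
\]
for any $\left[x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right] \in \mathscr{X}^n$. Let $\mathscr{F}$ be a set containing exactly one function from each of the $P$ classes of functions defined on $\mathscr{X}$. By definition, a sequence $\left[x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right]$ is contained in $\mathcal{T}_H(n, \mathbf{P})$ if and only if
\[
\left| -\frac{1}{n} \log \Pr\left\{ \Gamma\left(X^{(l)}\right) = \Gamma\left(x^{(l)}\right), \forall\ 1 \leq l \leq n \right\} - H_{\Gamma, \mathscr{X}} \right| < \epsilon,
\]
for all the $P$ functions $\Gamma \in \mathscr{F}$. Therefore,
\[
\begin{aligned}
&\Pr\left\{ \left[X^{(1)}, X^{(2)}, \cdots, X^{(n)}\right] \notin \mathcal{T}_H(n, \mathbf{P}) \right\} \\
&= \Pr\left\{ \bigcup_{\Gamma} \left\{ \left| -\frac{1}{n} \log \Pr\left\{ \left[ \Gamma\left(X^{(1)}\right), \Gamma\left(X^{(2)}\right), \cdots, \Gamma\left(X^{(n)}\right) \right] \right\} - H_{\Gamma, \mathscr{X}} \right| \geq \epsilon \right\} \right\} \\
&= \Pr\left\{ \bigcup_{\Gamma \in \mathscr{F}} \left\{ \left| -\frac{1}{n} \log \Pr\left\{ \left[ \Gamma\left(X^{(1)}\right), \Gamma\left(X^{(2)}\right), \cdots, \Gamma\left(X^{(n)}\right) \right] \right\} - H_{\Gamma, \mathscr{X}} \right| \geq \epsilon \right\} \right\} \\
&\leq \sum_{\Gamma \in \mathscr{F}} \Pr\left\{ \left| -\frac{1}{n} \log \Pr\left\{ \left[ \Gamma\left(X^{(1)}\right), \Gamma\left(X^{(2)}\right), \cdots, \Gamma\left(X^{(n)}\right) \right] \right\} - H_{\Gamma, \mathscr{X}} \right| \geq \epsilon \right\} \\
&\to P \times 0 = 0, \ \text{as } n \to \infty.
\end{aligned}
\]

⁵ A partition of a set is a disjoint union of non-empty subsets of this set.

Chapter 7
Asymptotically Mean Stationary Ergodic Sources

Consider a finite-state Markov process $\mathscr{M}$. If it is a.m.s. ergodic¹ (while not necessarily irreducible), then it admits a unique invariant distribution by Proposition 6.1.5. This invariant distribution is induced from the invariant mean of the a.m.s. ergodic dynamical system defining $\mathscr{M}$. Moreover, all reduced (Markov) processes of $\mathscr{M}$ inherit the a.m.s. property (by Theorem 6.2.1) and ergodicity (by [Aar97, Proposition 1.5.2]) from $\mathscr{M}$. Thus, every reduced (Markov) process of $\mathscr{M}$ admits a unique invariant distribution by Proposition 6.1.5.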
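The way an invariant distribution is induced from the invariant mean can be illustrated numerically. The Python sketch below takes the non-irreducible transition matrix from Remark 6.2 and computes the Cesàro average of the time-$i$ state distributions, which is the finite-state analogue of the limit defining the invariant mean. The helper names `step` and `cesaro_distribution`, the initial distribution and the number of steps are illustrative assumptions, not taken from the thesis.

```python
def step(dist, P):
    """One step of the chain: row vector dist times transition matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def cesaro_distribution(dist, P, n_steps):
    """Average of the state distributions at times 0, 1, ..., n_steps - 1."""
    total = [0.0] * len(dist)
    current = list(dist)
    for _ in range(n_steps):
        total = [t + c for t, c in zip(total, current)]
        current = step(current, P)
    return [t / n_steps for t in total]


# Transition matrix from Remark 6.2: state 0 is absorbing, so the chain is
# not irreducible, yet the time-averaged distribution still settles down.
P = [[1.0, 0.0],
     [0.5, 0.5]]
start = [0.0, 1.0]  # assumed initial distribution: start in state 1
pi = cesaro_distribution(start, P, 10_000)
print(pi)
```

The averaged distribution approaches $(1, 0)$, which is fixed by `step`, so it is an invariant distribution of the chain even though the chain is not irreducible.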
On the other hand, recall from Chapter 4 and Chapter 5 that the irreducibility condition is important because it guarantees recursively that each and every reduced process of an irreducible Markov process admits an invariant distribution. This provides the theoretical support to establish the AEP of Supremus typical sequences and all related results. However, the a.m.s. ergodic condition is already sufficient to make such a recursive claim on the invariant distributions of all the reduced processes. Therefore, results from these two chapters can be easily extended to the a.m.s. ergodic case with the same arguments. As a matter of fact, irreducibility is only a special realization of the recursive phenomenon characterized by the a.m.s. ergodic concept (Proposition 6.1.4). Hence, it is natural to carry what we have established over to the a.m.s. ergodic setting.

¹ We follow Definition 6.1.3 in this chapter in defining the term “ergodic”. Be reminded that, for a Markov chain, the term “ergodic” has another definition given by Definition 4.1.5. They are not equivalent.

7.1 Supremus Typicality in the Weak Sense

Let $\left\{X^{(n)}\right\}$ be a random process with a finite state space $\mathscr{X}$. Define $\left\{X_{\mathscr{Y}}^{(k)}\right\}$ to be the reduced process $\left\{X^{(n_k)}\right\}$, where
\[
n_k = \begin{cases} \min\left\{n \geq 0 \mid X^{(n)} \in \mathscr{Y}\right\}; & k = 0, \\ \min\left\{n > n_{k-1} \mid X^{(n)} \in \mathscr{Y}\right\}; & k > 0. \end{cases}
\]
By the Kolmogorov Extension Theorem [Gra09, Theorem 3.3], $\left\{X^{(n)}\right\} = \{X(T^n)\}$ for some dynamical system $(\Omega, \mathscr{F}, \mu, T)$ and measurable function $X : \Omega \to \mathscr{X}$ (see Section 6.1.2). Assume that $(\Omega, \mathscr{F}, \mu, T)$ is recurrent a.m.s. ergodic, and let $A = X^{-1}(\mathscr{Y})$ and $A_0 = A \cap \bigcap_{i=0}^{\infty} \bigcup_{j=i}^{\infty} T^{-j} A$. It is easily seen that $\left\{X_{\mathscr{Y}}^{(k)}\right\}$ is essentially the random process $\left\{X\left(T_A^k\right)\right\}$ defined on the system
\[
\left(A_0, A_0 \cap \mathscr{F}, \frac{1}{\mu(A)} \mu|_{A_0 \cap \mathscr{F}}, T_A\right),
\]
which is also recurrent (by Definition 6.2.1), a.m.s. (by Theorem 6.2.1) and ergodic (by [Aar97, Proposition 1.5.2]). Meanwhile, the SMB Theorem holds simultaneously for all the reduced processes $\left\{X_{\mathscr{Y}}^{(k)}\right\}$ by Corollary 6.3.1. In addition, let $\overline{\mu}$ be the invariant mean of $\mu$ and
\[
I_{\mathscr{Y}}(x) = \begin{cases} 1; & x \in \mathscr{Y}; \\ 0; & x \notin \mathscr{Y}, \end{cases}
\]
be the indicator function with respect to $\mathscr{Y}$. The Point-wise Ergodic Theorem [Gra09, Theorem 8.1] states that, with probability 1,
\[
\lim_{|\mathbf{X}| \to \infty} \frac{|\mathbf{X}_{\mathscr{Y}}|}{|\mathbf{X}|} = \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} I_{\mathscr{Y}} X\left(T^i\right) = E_{\overline{\mu}}(I_{\mathscr{Y}} X) = \overline{\mu}\left(X^{-1}(\mathscr{Y})\right) = \overline{\mu}(A).
\]
This says that, given a sequence $\mathbf{X}$ of length $n$ generated by $\left\{X^{(n)}\right\}$, with high probability the reduced subsequence $\mathbf{X}_{\mathscr{Y}}$ with respect to $\mathscr{Y}$ has probability close to
\[
\exp_2\left(-|\mathbf{X}_{\mathscr{Y}}| H_{\mathscr{Y}}\right) \approx \exp_2\left[-n p(\mathscr{Y}) H_{\mathscr{Y}}\right],
\]
where $p(\mathscr{Y}) = E_{\overline{\mu}}(I_{\mathscr{Y}} X)$ and $H_{\mathscr{Y}}$ is the entropy rate of $\left\{X_{\mathscr{Y}}^{(k)}\right\}$, when the length of $\mathbf{X}$ is big enough. This motivates the following definition of Supremus typicality.

Definition 7.1.1 (Supremus Typicality in the Weak Sense). Let $\left\{X^{(n)}\right\}$ be a recurrent a.m.s. ergodic process with a finite state space $\mathscr{X}$. A sequence $\mathbf{x} \in \mathscr{X}^n$ is said to be Supremus $\epsilon$-typical with respect to $\left\{X^{(n)}\right\}$ for some $\epsilon > 0$, if $\forall\ \emptyset \neq \mathscr{Y} \subseteq \mathscr{X}$,
\[
\begin{aligned}
& p(\mathscr{Y}) - \epsilon < \frac{|\mathbf{x}_{\mathscr{Y}}|}{n} < p(\mathscr{Y}) + \epsilon; \\
& |\mathbf{x}_{\mathscr{Y}}| (H_{\mathscr{Y}} - \epsilon) < -\log p_{\mathscr{Y}}(\mathbf{x}_{\mathscr{Y}}) < |\mathbf{x}_{\mathscr{Y}}| (H_{\mathscr{Y}} + \epsilon),
\end{aligned}
\]
where $p_{\mathscr{Y}}$ and $H_{\mathscr{Y}}$ are the joint distribution and entropy rate of the reduced process $\left\{X_{\mathscr{Y}}^{(k)}\right\}$ of $\left\{X^{(n)}\right\}$ with sub-state space $\mathscr{Y}$, respectively. The set of all Supremus $\epsilon$-typical sequences with respect to $\left\{X^{(n)}\right\}$ in $\mathscr{X}^n$ is denoted by $\mathcal{S}_{\epsilon}\left(n, \left\{X^{(n)}\right\}\right)$.

Obviously, Supremus typical sequences form a subset of the classical typical sequences defined as follows.

Definition 7.1.2 (Typicality in the Weak Sense). Let $\left\{X^{(n)}\right\}$ be an a.m.s. ergodic process with a finite state space $\mathscr{X}$. A sequence $\mathbf{x} \in \mathscr{X}^n$ is said to be $\epsilon$-typical with respect to $\left\{X^{(n)}\right\}$ for some $\epsilon > 0$, if
\[
n(H_{\mathscr{X}} - \epsilon) < -\log p_{\mathscr{X}}(\mathbf{x}) < n(H_{\mathscr{X}} + \epsilon),
\]
where $p_{\mathscr{X}}$ and $H_{\mathscr{X}}$ are the joint distribution and entropy rate of the process $\left\{X^{(n)}\right\}$. The set of all $\epsilon$-typical sequences with respect to $\left\{X^{(n)}\right\}$ in $\mathscr{X}^n$ is denoted by $\mathcal{T}_{\epsilon}\left(n, \left\{X^{(n)}\right\}\right)$.

From the definitions, it is seen that Supremus typicality is a more restricted concept.
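For an i.i.d. source the two definitions can be checked directly, because the reduced process over a subset $\mathscr{Y}$ is again i.i.d. with the conditional distribution $p(\cdot)/p(\mathscr{Y})$, so that $H_{\mathscr{Y}}$ is simply the entropy of that distribution. The Python sketch below is a minimal illustration under this i.i.d. assumption; the function names, the three-letter distribution and the test sequence are chosen for illustration and logarithms are taken to base 2.

```python
from itertools import combinations
from math import log2


def entropy(p):
    """Entropy in bits of a distribution given as {symbol: probability}."""
    return -sum(q * log2(q) for q in p.values() if q > 0)


def neg_log_prob(x, p):
    """-log2 of the probability of sequence x under i.i.d. marginal p."""
    return -sum(log2(p[s]) for s in x)


def is_typical(x, p, eps):
    """Definition 7.1.2 specialised to an i.i.d. source with marginal p."""
    n, h = len(x), entropy(p)
    return n * (h - eps) < neg_log_prob(x, p) < n * (h + eps)


def is_supremus_typical(x, p, eps):
    """Definition 7.1.1 for an i.i.d. source: both conditions are checked
    for every non-empty subset Y of the alphabet."""
    n, symbols = len(x), sorted(p)
    for r in range(1, len(symbols) + 1):
        for Y in combinations(symbols, r):
            p_Y = sum(p[s] for s in Y)
            cond = {s: p[s] / p_Y for s in Y}  # law of the reduced i.i.d. process
            x_Y = [s for s in x if s in Y]     # reduced subsequence
            if not (p_Y - eps < len(x_Y) / n < p_Y + eps):
                return False
            if not is_typical(x_Y, cond, eps):
                return False
    return True


p = {"a": 0.997, "b": 0.002, "c": 0.001}
x = ["a"] * 998 + ["b", "c"]
print(is_typical(x, p, 0.1), is_supremus_typical(x, p, 0.1))
```

With these illustrative numbers the sequence passes the classical test but fails the Supremus test on the subset $\{b, c\}$, whose reduced subsequence is too short to look typical for the conditional distribution.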
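Supremus typicality thus captures more characteristics of the original random process than classical typicality does. For example, Proposition 4.2.1 is also valid.

Proposition 7.1.1. Every reduced subsequence of a Supremus $\epsilon$-typical sequence in the weak sense is Supremus $\epsilon$-typical in the weak sense.

Unfortunately, the following example says otherwise for classical ones.

Example 7.1.1. Let $\{X^{(n)}\}$ be an i.i.d. process with state space $\mathscr{X} = \{\alpha, \beta, \gamma\}$ and distribution
\[
p_{\mathscr{X}}(\alpha) = 997/1000; \quad p_{\mathscr{X}}(\beta) = 2/1000; \quad p_{\mathscr{X}}(\gamma) = 1/1000.
\]
It is easy to verify that $\mathbf{x} = [\alpha, \alpha, \cdots, \alpha, \beta, \gamma] \in \mathscr{X}^{1000}$ is 0.1-typical in the weak sense, i.e. $\mathbf{x} \in \mathcal{T}_{0.1}(1000, \{X^{(n)}\})$, because
\[
\left| -\frac{1}{1000} \log p_{\mathscr{X}}(\mathbf{x}) - H_{\mathscr{X}} \right| = \left| -\frac{1}{1000} \log \frac{997}{1000} + \frac{1}{1000} \log \frac{2}{1000} \right| < 0.01 < 0.1.
\]
However, for the reduced subsequence $\mathbf{x}_{\mathscr{Y}} = [\beta, \gamma] \in \mathscr{Y}^2$ ($\mathscr{Y} = \{\beta, \gamma\}$),
\[
\left| -\frac{1}{2} \log p_{\mathscr{Y}}(\mathbf{x}_{\mathscr{Y}}) - H_{\mathscr{Y}} \right| = \left| \frac{1}{6} \log \frac{2}{3} - \frac{1}{6} \log \frac{1}{3} \right| = \frac{1}{6} > 0.15 > 0.1.
\]
Thus, $\mathbf{x}_{\mathscr{Y}}$ is not even 0.1-typical, and hence not Supremus 0.1-typical, with respect to $\{X_{\mathscr{Y}}^{(k)}\}$ in the weak sense.

In the following, we present more properties, both classic and new, of the new concept.

Proposition 7.1.2 (AEP of Weak Supremus Typicality). In Definition 7.1.1,
1. $\left|\mathcal{S}_\epsilon(n, \{X^{(n)}\})\right| < \exp_2\left[n(H_{\mathscr{X}} + \epsilon)\right]$; and
2. $\forall\ \eta > 0$, there exists some positive integer $N_0$ such that
\[
\Pr\left\{ \left[X^{(1)}, X^{(2)}, \cdots, X^{(n)}\right] \notin \mathcal{S}_\epsilon(n, \{X^{(n)}\}) \right\} < \eta
\]
and $\left|\mathcal{S}_\epsilon(n, \{X^{(n)}\})\right| > (1 - \eta) \exp_2\left[n(H_{\mathscr{X}} - \epsilon)\right]$, for all $n > N_0$.

Proof. 1. First of all,
\[
1 \geq \sum_{\mathbf{x} \in \mathcal{S}_\epsilon(n, \{X^{(n)}\})} p_{\mathscr{X}}(\mathbf{x}) > \sum_{\mathbf{x} \in \mathcal{S}_\epsilon(n, \{X^{(n)}\})} \exp_2\left[-n(H_{\mathscr{X}} + \epsilon)\right] = \left|\mathcal{S}_\epsilon(n, \{X^{(n)}\})\right| \exp_2\left[-n(H_{\mathscr{X}} + \epsilon)\right].
\]
Therefore, $\left|\mathcal{S}_\epsilon(n, \{X^{(n)}\})\right| < \exp_2\left[n(H_{\mathscr{X}} + \epsilon)\right]$.

2. Let $\mathbf{X} = \left[X^{(1)}, X^{(2)}, \cdots, X^{(n)}\right]$. We have that
\[
\left\{ \mathbf{X} \notin \mathcal{S}_\epsilon(n, \{X^{(n)}\}) \right\} = \bigcup_{\emptyset \neq \mathscr{Y} \subseteq \mathscr{X}} \left\{ \mathbf{X}_{\mathscr{Y}} \notin \mathcal{S}_\epsilon\left(|\mathbf{X}_{\mathscr{Y}}|, \{X_{\mathscr{Y}}^{(k)}\}\right) \right\}.
\]
In the meantime, the Point-wise Ergodic Theorem [Gra09, Theorem 8.1] and Corollary 6.3.1 guarantee that, with probability 1,
\[
\frac{|\mathbf{X}_{\mathscr{Y}}|}{n} \to p(\mathscr{Y}); \qquad -\frac{1}{|\mathbf{X}_{\mathscr{Y}}|} \log p_{\mathscr{Y}}(\mathbf{X}_{\mathscr{Y}}) \to H_{\mathscr{Y}}
\]
simultaneously for all $\emptyset \neq \mathscr{Y} \subseteq \mathscr{X}$.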
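This implies that, for some positive integer $N_0$,
\[
\Pr\left\{ \mathbf{X} \notin \mathcal{S}_\epsilon(n, \{X^{(n)}\}) \right\} = \Pr\left\{ \bigcup_{\emptyset \neq \mathscr{Y} \subseteq \mathscr{X}} \left\{ \mathbf{X}_{\mathscr{Y}} \notin \mathcal{S}_\epsilon\left(|\mathbf{X}_{\mathscr{Y}}|, \{X_{\mathscr{Y}}^{(k)}\}\right) \right\} \right\} < \eta, \quad \forall\ n > N_0.
\]
Furthermore,
\[
1 - \eta < \Pr\left\{ \mathbf{X} \in \mathcal{S}_\epsilon(n, \{X^{(n)}\}) \right\} < \sum_{\mathbf{x} \in \mathcal{S}_\epsilon(n, \{X^{(n)}\})} \exp_2\left[-n(H_{\mathscr{X}} - \epsilon)\right] = \left|\mathcal{S}_\epsilon(n, \{X^{(n)}\})\right| \exp_2\left[-n(H_{\mathscr{X}} - \epsilon)\right].
\]
Consequently, $\left|\mathcal{S}_\epsilon(n, \{X^{(n)}\})\right| > (1 - \eta) \exp_2\left[n(H_{\mathscr{X}} - \epsilon)\right]$. The statement is proved.

Remark 7.1. According to the AEP of Weak Supremus Typicality, stochastically speaking, the classical typical sequences that are not Supremus typical are as negligible as the non-typical sequences.

Lemma 7.1.1. In Definition 7.1.1, for every partition (see Footnote 2) $\coprod_{j=1}^{m} \mathscr{Y}_j$ of $\mathscr{X}$ and $\mathbf{x} = \left[x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right] \in \mathcal{S}_\epsilon(n, \{X^{(n)}\})$, the size of
\[
\mathcal{S}_\epsilon\Bigl(\mathbf{x}, \coprod_{j=1}^{m} \mathscr{Y}_j\Bigr) = \Bigl\{ \left[y^{(1)}, y^{(2)}, \cdots, y^{(n)}\right] \in \mathcal{S}_\epsilon(n, \{X^{(n)}\}) \Bigm| y^{(l)} \in \mathscr{Y}_j \Leftrightarrow x^{(l)} \in \mathscr{Y}_j,\ \forall\ 1 \leq l \leq n,\ \forall\ 1 \leq j \leq m \Bigr\}
\]
is strictly smaller than
\[
\exp_2\Bigl[ \sum_{j=1}^{m} |\mathbf{x}_{\mathscr{Y}_j}| \left(H_{\mathscr{Y}_j} + \epsilon\right) \Bigr] < \exp_2\Bigl[ n \Bigl( \sum_{j=1}^{m} p(\mathscr{Y}_j) H_{\mathscr{Y}_j} + (|\mathscr{X}| + 1)\epsilon \Bigr) \Bigr].
\]

Proof. By Proposition 7.1.1, for any $\mathbf{y} \in \mathcal{S}_\epsilon\bigl(\mathbf{x}, \coprod_{j=1}^{m} \mathscr{Y}_j\bigr)$, the reduced subsequence $\mathbf{y}_{\mathscr{Y}_j}$ ($1 \leq j \leq m$) resides in $\mathcal{S}_\epsilon\bigl(|\mathbf{x}_{\mathscr{Y}_j}|, \{X_{\mathscr{Y}_j}^{(k)}\}\bigr)$. The number of all possible $\mathbf{y}_{\mathscr{Y}_j}$'s is upper bounded by
\[
\left| \mathcal{S}_\epsilon\bigl(|\mathbf{x}_{\mathscr{Y}_j}|, \{X_{\mathscr{Y}_j}^{(k)}\}\bigr) \right| < \exp_2\left[ |\mathbf{x}_{\mathscr{Y}_j}| \left(H_{\mathscr{Y}_j} + \epsilon\right) \right]
\]
according to Proposition 7.1.2. Therefore,
\begin{align*}
\Bigl| \mathcal{S}_\epsilon\Bigl(\mathbf{x}, \coprod_{j=1}^{m} \mathscr{Y}_j\Bigr) \Bigr|
&\leq \prod_{j=1}^{m} \left| \mathcal{S}_\epsilon\bigl(|\mathbf{x}_{\mathscr{Y}_j}|, \{X_{\mathscr{Y}_j}^{(k)}\}\bigr) \right| \\
&< \prod_{j=1}^{m} \exp_2\left[ |\mathbf{x}_{\mathscr{Y}_j}| \left(H_{\mathscr{Y}_j} + \epsilon\right) \right] \\
&= \exp_2\Bigl[ \sum_{j=1}^{m} |\mathbf{x}_{\mathscr{Y}_j}| \left(H_{\mathscr{Y}_j} + \epsilon\right) \Bigr] \\
&= \exp_2\Bigl[ n \Bigl( \sum_{j=1}^{m} \frac{|\mathbf{x}_{\mathscr{Y}_j}|}{n} H_{\mathscr{Y}_j} + \epsilon \Bigr) \Bigr] \\
&< \exp_2\Bigl[ n \Bigl( \sum_{j=1}^{m} \left(p(\mathscr{Y}_j) + \epsilon\right) H_{\mathscr{Y}_j} + \epsilon \Bigr) \Bigr] \\
&\leq \exp_2\Bigl[ n \Bigl( \sum_{j=1}^{m} p(\mathscr{Y}_j) H_{\mathscr{Y}_j} + (|\mathscr{X}| + 1)\epsilon \Bigr) \Bigr].
\end{align*}
The statement is established.

Footnote 2. A partition of a set is a disjoint union of non-empty subsets of this set.

7.2 Hyper Supremus Typicality in the Weak Sense

Definition 7.2.1 (Hyper Supremus Typicality in the Weak Sense). Let $\{X^{(n)}\}$ be a recurrent a.m.s. ergodic process with a finite state space $\mathscr{X}$. A sequence $\left[x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right] \in \mathscr{X}^n$ is said to be Hyper Supremus $\epsilon$-typical with respect to $\{X^{(n)}\}$ for some $\epsilon > 0$, if $\left[\Gamma(x^{(1)}), \Gamma(x^{(2)}), \cdots, \Gamma(x^{(n)})\right]$ is Supremus $\epsilon$-typical with respect to $\{\Gamma(X^{(n)})\}$ for all feasible functions $\Gamma$. The set of all Hyper Supremus $\epsilon$-typical sequences with respect to $\{X^{(n)}\}$ in $\mathscr{X}^n$ is denoted by $\mathcal{H}_\epsilon(n, \{X^{(n)}\})$.

One motivation for Definition 7.2.1 is to extend the definition of Supremus typicality from the "single process" case to the "joint processes" case, as is done for joint typicality [Cov75] in the classical sense. Consider two processes $\{X_1^{(n)}\}$ and $\{X_2^{(n)}\}$ with state spaces $\mathscr{X}_1$ and $\mathscr{X}_2$, respectively. Two sequences $\mathbf{x}_1 = \left[x_1^{(1)}, x_1^{(2)}, \cdots, x_1^{(n)}\right] \in \mathscr{X}_1^n$ and $\mathbf{x}_2 = \left[x_2^{(1)}, x_2^{(2)}, \cdots, x_2^{(n)}\right] \in \mathscr{X}_2^n$ are said to be jointly typical in the classical sense [Cov75], if both of them are classical typical and $\mathbf{x} = \left[x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right]$, with $x^{(i)} = \left[x_1^{(i)}, x_2^{(i)}\right]^{tr}$, is classical typical. Here, $\mathbf{x}_1$, as well as $\mathbf{x}_2$, is just a function of $\mathbf{x}$. Therefore, if $\mathbf{x}$ is Hyper Supremus typical, then both $\mathbf{x}_1$ and $\mathbf{x}_2$ are necessarily Supremus typical, and hence they are jointly typical in the classical sense.

Proposition 7.2.1. A function of a Hyper Supremus $\epsilon$-typical sequence is Hyper Supremus $\epsilon$-typical. A reduced subsequence of a Hyper Supremus $\epsilon$-typical sequence is Hyper Supremus $\epsilon$-typical.

It is well known that a function of a classical typical sequence in the weak sense is not necessarily classical typical (unless typicality is defined in the strong sense for i.i.d. settings [Yeu08, Chapter 6.3]). Nevertheless, Proposition 7.2.1 states differently for Hyper Supremus typical sequences. In this regard, we see that Definition 7.2.1 embraces more features beyond characterizing the "joint effects."

Proposition 7.2.2 (AEP of Weak Hyper Supremus Typicality). In Definition 7.2.1,
1. $\left|\mathcal{H}_\epsilon(n, \{X^{(n)}\})\right| < \exp_2\left[n(H_{\mathscr{X}} + \epsilon)\right]$; and
2. $\forall\ \eta > 0$, there exists some positive integer $N_0$ such that
\[
\Pr\left\{ \left[X^{(1)}, X^{(2)}, \cdots, X^{(n)}\right] \notin \mathcal{H}_\epsilon(n, \{X^{(n)}\}) \right\} < \eta
\]
and $\left|\mathcal{H}_\epsilon(n, \{X^{(n)}\})\right| > (1 - \eta) \exp_2\left[n(H_{\mathscr{X}} - \epsilon)\right]$, for all $n > N_0$.

Proof. 1. $\left|\mathcal{H}_\epsilon(n, \{X^{(n)}\})\right| \leq \left|\mathcal{S}_\epsilon(n, \{X^{(n)}\})\right| < \exp_2\left[n(H_{\mathscr{X}} + \epsilon)\right]$.

2. We say that two functions $\Gamma' : \mathscr{X} \to \mathscr{D}'$ and $\Gamma'' : \mathscr{X} \to \mathscr{D}''$ belong to the same class, if $\Gamma' = \pi \Gamma''$ for some bijection $\pi : \Gamma''(\mathscr{X}) \to \Gamma'(\mathscr{X})$ between their images. Obviously, there are $P$ classes of functions defined on $\mathscr{X}$, where $P$ is the number of all partitions of $\mathscr{X}$. For any two functions $\Gamma'$ and $\Gamma''$ from the same class, $\left[\Gamma'(x^{(1)}), \Gamma'(x^{(2)}), \cdots, \Gamma'(x^{(n)})\right]$ is Supremus $\epsilon$-typical if and only if $\left[\Gamma''(x^{(1)}), \Gamma''(x^{(2)}), \cdots, \Gamma''(x^{(n)})\right]$ is Supremus $\epsilon$-typical. On the other hand, for a fixed function $\Gamma$, $\{\Gamma(X^{(n)})\}$ is recurrent a.m.s. ergodic by Proposition 6.1.3 and Proposition 6.2.1. Therefore, there exists some $N_\Gamma > 0$ such that
\[
\Pr\left\{ \left[\Gamma(X^{(1)}), \Gamma(X^{(2)}), \cdots, \Gamma(X^{(n)})\right] \notin \mathcal{S}_\epsilon\left(n, \{\Gamma(X^{(n)})\}\right) \right\} < \frac{\eta}{P},
\]
for all $n > N_\Gamma$, as claimed by Proposition 7.1.2. Let $\mathscr{F}$ be the set containing exactly one function from each of the $P$ classes of functions defined on $\mathscr{X}$. We have that
\begin{align*}
&\Pr\left\{ \left[X^{(1)}, X^{(2)}, \cdots, X^{(n)}\right] \notin \mathcal{H}_\epsilon(n, \{X^{(n)}\}) \right\} \\
&= \Pr\Bigl\{ \bigcup_{\Gamma} \left\{ \left[\Gamma(X^{(1)}), \Gamma(X^{(2)}), \cdots, \Gamma(X^{(n)})\right] \notin \mathcal{S}_\epsilon\left(n, \{\Gamma(X^{(n)})\}\right) \right\} \Bigr\} \\
&= \Pr\Bigl\{ \bigcup_{\Gamma \in \mathscr{F}} \left\{ \left[\Gamma(X^{(1)}), \Gamma(X^{(2)}), \cdots, \Gamma(X^{(n)})\right] \notin \mathcal{S}_\epsilon\left(n, \{\Gamma(X^{(n)})\}\right) \right\} \Bigr\} \\
&\leq \sum_{\Gamma \in \mathscr{F}} \Pr\left\{ \left[\Gamma(X^{(1)}), \Gamma(X^{(2)}), \cdots, \Gamma(X^{(n)})\right] \notin \mathcal{S}_\epsilon\left(n, \{\Gamma(X^{(n)})\}\right) \right\} \\
&< \frac{\eta}{P} \times |\mathscr{F}| = \eta
\end{align*}
for all $n > N_0 = \max_{\Gamma \in \mathscr{F}} \{N_\Gamma\}$. In addition,
\[
1 - \eta < \Pr\left\{ \left[X^{(1)}, X^{(2)}, \cdots, X^{(n)}\right] \in \mathcal{H}_\epsilon(n, \{X^{(n)}\}) \right\} < \sum_{\mathbf{x} \in \mathcal{H}_\epsilon(n, \{X^{(n)}\})} \exp_2\left[-n(H_{\mathscr{X}} - \epsilon)\right] = \left|\mathcal{H}_\epsilon(n, \{X^{(n)}\})\right| \exp_2\left[-n(H_{\mathscr{X}} - \epsilon)\right].
\]
Consequently, $\left|\mathcal{H}_\epsilon(n, \{X^{(n)}\})\right| > (1 - \eta) \exp_2\left[n(H_{\mathscr{X}} - \epsilon)\right]$. The statement is proved.

Lemma 7.2.1. In Definition 7.2.1, for every partition $\coprod_{j=1}^{m} \mathscr{Y}_j$ of $\mathscr{X}$ and $\mathbf{x} = \left[x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right] \in \mathcal{H}_\epsilon(n, \{X^{(n)}\})$, the size of
\[
\mathcal{H}_\epsilon\Bigl(\mathbf{x}, \coprod_{j=1}^{m} \mathscr{Y}_j\Bigr) = \Bigl\{ \left[y^{(1)}, y^{(2)}, \cdots, y^{(n)}\right] \in \mathcal{H}_\epsilon(n, \{X^{(n)}\}) \Bigm| y^{(l)} \in \mathscr{Y}_j \Leftrightarrow x^{(l)} \in \mathscr{Y}_j,\ \forall\ 1 \leq l \leq n,\ \forall\ 1 \leq j \leq m \Bigr\}
\]
is strictly smaller than
\[
\exp_2\Bigl[ \sum_{j=1}^{m} |\mathbf{x}_{\mathscr{Y}_j}| \left(H_{\mathscr{Y}_j} + \epsilon\right) \Bigr] < \exp_2\Bigl[ n \Bigl( \sum_{j=1}^{m} p(\mathscr{Y}_j) H_{\mathscr{Y}_j} + (|\mathscr{X}| + 1)\epsilon \Bigr) \Bigr].
\]

Proof. Notice that $\mathcal{H}_\epsilon\bigl(\mathbf{x}, \coprod_{j=1}^{m} \mathscr{Y}_j\bigr) \subseteq \mathcal{S}_\epsilon\bigl(\mathbf{x}, \coprod_{j=1}^{m} \mathscr{Y}_j\bigr)$. The proof follows from Lemma 7.1.1.

NOTATION: We have defined $H_{\mathscr{X}}$ and $H_{\mathscr{Y}}$ to be the entropy rates of the random process $\{X^{(n)}\}$ and its reduced process $\{X_{\mathscr{Y}}^{(k)}\}$, respectively. However, given an arbitrary function $\Gamma : \mathscr{X} \to \mathscr{Y}$, even though $\{\Gamma(X^{(n)})\}$ has state space $\mathscr{Y}$ as well, the entropy rate of $\{\Gamma(X^{(n)})\}$ is not necessarily equal to $H_{\mathscr{Y}}$. This is because the distributions and the underlying dynamical systems defining $\{X_{\mathscr{Y}}^{(k)}\}$ and $\{\Gamma(X^{(n)})\}$ are different. To avoid confusion, we denote by $H_{\Gamma, \mathscr{X}}$ the entropy rate of $\{\Gamma(X^{(n)})\}$.

Lemma 7.2.2. In Lemma 7.2.1, define $\Gamma(x) = l \Leftrightarrow x \in \mathscr{Y}_l$. We have that
\[
\Bigl| \mathcal{H}_\epsilon\Bigl(\mathbf{x}, \coprod_{j=1}^{m} \mathscr{Y}_j\Bigr) \Bigr| < \exp_2\left[ n \left(H_{\mathscr{X}} - H_{\Gamma, \mathscr{X}} + 2\epsilon\right) \right].
\]

Proof. Let $Z_1^{(n)} = \Gamma(X^{(n)})$ and $Z_2^{(n)} = \begin{bmatrix} X^{(n)} \\ \Gamma(X^{(n)}) \end{bmatrix}$ ($n \in \mathbb{N}$). By Proposition 6.1.3, $\{Z_1^{(n)}\}$ and $\{Z_2^{(n)}\}$ are recurrent a.m.s. ergodic since they are functions of $\{X^{(n)}\}$. Define $\mathbf{z}_1$ to be $\left[z_1^{(1)}, z_1^{(2)}, \cdots, z_1^{(n)}\right]$, where $z_1^{(i)} = \Gamma(x^{(i)})$. For any $\mathbf{y} = \left[y^{(1)}, y^{(2)}, \cdots, y^{(n)}\right] \in \mathcal{H}_\epsilon\bigl(\mathbf{x}, \coprod_{j=1}^{m} \mathscr{Y}_j\bigr)$, we have that
\[
\left[\Gamma(y^{(1)}), \Gamma(y^{(2)}), \cdots, \Gamma(y^{(n)})\right] = \mathbf{z}_1
\]
by definition. Therefore,
\[
\frac{p_{Z_2}(\mathbf{z}_2)}{p_{Z_1}(\mathbf{z}_1)} = \frac{p_{\mathscr{X}}(\mathbf{y})}{p_{Z_1}(\mathbf{z}_1)},
\]
where $\mathbf{z}_2 = \left[z_2^{(1)}, z_2^{(2)}, \cdots, z_2^{(n)}\right]$ and $z_2^{(i)} = \begin{bmatrix} y^{(i)} \\ z_1^{(i)} \end{bmatrix}$. On the other hand, both $\mathbf{z}_1$ and $\mathbf{z}_2$ are Hyper Supremus $\epsilon$-typical since they are functions of the Hyper Supremus $\epsilon$-typical sequence $\mathbf{y}$. Hence,
\begin{align*}
\Pr\{\mathbf{X} = \mathbf{y} \mid \mathbf{Z}_1 = \mathbf{z}_1\} = \frac{p_{\mathscr{X}}(\mathbf{y})}{p_{Z_1}(\mathbf{z}_1)}
&> \frac{\exp_2\left[-n(H_{\mathscr{X}} + \epsilon)\right]}{\exp_2\left[-n(H_{Z_1} - \epsilon)\right]} \quad \text{(by Proposition 7.2.2)} \\
&= \frac{\exp_2\left[-n(H_{\mathscr{X}} + \epsilon)\right]}{\exp_2\Bigl[-n\Bigl(\lim_{l \to \infty} \frac{1}{l} H\bigl(Z_1^{(1)}, Z_1^{(2)}, \cdots, Z_1^{(l)}\bigr) - \epsilon\Bigr)\Bigr]} \\
&= \exp_2\left[-n\left(H_{\mathscr{X}} - H_{\Gamma, \mathscr{X}} + 2\epsilon\right)\right].
\end{align*}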
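Consequently,
\[
1 \geq \sum_{\mathbf{y} \in \mathcal{H}_\epsilon\left(\mathbf{x}, \coprod_{j=1}^{m} \mathscr{Y}_j\right)} \Pr\{\mathbf{X} = \mathbf{y} \mid \mathbf{Z}_1 = \mathbf{z}_1\} > \Bigl| \mathcal{H}_\epsilon\Bigl(\mathbf{x}, \coprod_{j=1}^{m} \mathscr{Y}_j\Bigr) \Bigr| \exp_2\left[-n\left(H_{\mathscr{X}} - H_{\Gamma, \mathscr{X}} + 2\epsilon\right)\right].
\]
The statement follows.

Remark 7.2. At this point, we possess the right background to unveil the essential differences between the mechanisms of the First Proof and the Second Proof of Lemma 2.1.2. The First Proof resembles the argument given to prove Lemma 7.2.1. It comes from the property of the reduced subsequences that are modelled by the corresponding reduced processes. On the other hand, the Second Proof, like the proof of Lemma 7.2.2, is based on the property characterized by the "joint effect" of the original process and one of its functions. It is a coincidence that both proofs lead to the same conclusion in Lemma 2.1.2. For the more universal settings given in Lemma 7.2.1 and Lemma 7.2.2, the results diverge. The effect of the differences is reflected by the quantities $H_{\mathscr{Y}_j}$ and $H_{\Gamma, \mathscr{X}}$ in the respective upper bounds. Actually, the same analytical differences can also be found by inspecting Lemma 4.2.1 and Lemma 4.2.2 (as well as Lemma 4.A.1 and Lemma 4.A.2).

Given an index set $S = \{1, 2, \cdots, s\}$, define the projective function $\pi_T$ ($\emptyset \neq T \subseteq S$) to be the mapping that maps an $S$-indexed array, say $[a_1, a_2, \cdots, a_s]^{tr}$, to the $T$-indexed array $\left[a_{i_1}, a_{i_2}, \cdots, a_{i_{|T|}}\right]^{tr}$, where $i_j \in T$.

Lemma 7.2.3. In Definition 7.2.1, assume that $X^{(n)} = \begin{bmatrix} X_1^{(n)} \\ X_2^{(n)} \end{bmatrix}$ ($n \in \mathbb{N}$) and $\mathscr{X} = \begin{bmatrix} \mathscr{X}_1 \\ \mathscr{X}_2 \end{bmatrix}$. For any partition $\coprod_{j=1}^{m} \mathscr{Y}_j$ of $\mathscr{X}_1$ and
\[
\begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix} = \begin{bmatrix} x_1^{(1)}, x_1^{(2)}, \cdots, x_1^{(n)} \\ x_2^{(1)}, x_2^{(2)}, \cdots, x_2^{(n)} \end{bmatrix} \in \mathcal{H}_\epsilon(n, \{X^{(n)}\}),
\]
the size of
\[
\mathcal{H}_\epsilon\Bigl(\mathbf{x}_1, \coprod_{j=1}^{m} \mathscr{Y}_j \Bigm| \mathbf{x}_2\Bigr) = \Bigl\{ \begin{bmatrix} y^{(1)}, y^{(2)}, \cdots, y^{(n)} \\ x_2^{(1)}, x_2^{(2)}, \cdots, x_2^{(n)} \end{bmatrix} \in \mathcal{H}_\epsilon(n, \{X^{(n)}\}) \Bigm| y^{(l)} \in \mathscr{Y}_j \Leftrightarrow x_1^{(l)} \in \mathscr{Y}_j,\ \forall\ 1 \leq l \leq n,\ \forall\ 1 \leq j \leq m \Bigr\}
\]
is strictly smaller than
\[
\exp_2\Bigl[ \sum_{j=1}^{m} |\mathbf{x}_{1, \mathscr{Y}_j}| \left(H_{\mathscr{Z}_j} - H_{\pi_2, \mathscr{Z}_j} + 2\epsilon\right) \Bigr] < \exp_2\Bigl[ n \Bigl( \sum_{j=1}^{m} p(\mathscr{Z}_j) \left(H_{\mathscr{Z}_j} - H_{\pi_2, \mathscr{Z}_j}\right) + (|\mathscr{X}| + 2)\epsilon \Bigr) \Bigr],
\]
where $\mathscr{Z}_j = [\mathscr{Y}_j, \mathscr{X}_2]^{tr}$.

Remark 7.3. Assume that $\mathscr{X} := \mathscr{X}_1 \times \mathscr{X}_2 = \mathfrak{R}_1 \times \mathfrak{R}_2$ ($\mathfrak{R}_1$ and $\mathfrak{R}_2$ are finite rings). For any left ideal $\mathfrak{I} \leq_l \mathfrak{R}_1$, we have that $\mathfrak{R}_1 / \mathfrak{I} = \{\mathfrak{J}_1, \mathfrak{J}_2, \cdots, \mathfrak{J}_m\}$, where $m = \frac{|\mathfrak{R}_1|}{|\mathfrak{I}|}$ and the $\mathfrak{J}_j$'s are disjoint cosets. Thus, $\mathfrak{R}_1 / \mathfrak{I}$ defines a partition of $\mathscr{X}_1 = \mathfrak{R}_1$. From this, we can define $\mathcal{H}_\epsilon(\mathbf{x}_1, \mathfrak{R}_1 / \mathfrak{I} \mid \mathbf{x}_2)$ to be $\mathcal{H}_\epsilon\bigl(\mathbf{x}_1, \coprod_{j=1}^{m} \mathfrak{J}_j \bigm| \mathbf{x}_2\bigr)$ for all $\mathbf{x}_1 \in \mathfrak{R}_1^n$ and $\mathbf{x}_2 \in \mathfrak{R}_2^n$. To be specific, if $\mathbf{x}_1 = \left[x_1^{(1)}, x_1^{(2)}, \cdots, x_1^{(n)}\right]$ and $\mathbf{x}_2 = \left[x_2^{(1)}, x_2^{(2)}, \cdots, x_2^{(n)}\right]$, then $\mathcal{H}_\epsilon(\mathbf{x}_1, \mathfrak{R}_1 / \mathfrak{I} \mid \mathbf{x}_2)$ contains all Hyper Supremus $\epsilon$-typical sequences, say $\begin{bmatrix} y_1^{(1)}, y_1^{(2)}, \cdots, y_1^{(n)} \\ x_2^{(1)}, x_2^{(2)}, \cdots, x_2^{(n)} \end{bmatrix}$, such that $y_1^{(i)} \in x_1^{(i)} + \mathfrak{I}$ for all $1 \leq i \leq n$.

Proof of Lemma 7.2.3. Let $\mathbf{x} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix}$. Then $\mathbf{x}_{\mathscr{Z}_j}$ is Hyper Supremus $\epsilon$-typical of length $|\mathbf{x}_{1, \mathscr{Y}_j}|$ by Proposition 7.2.1. Consider the partition $\coprod_{x_2 \in \mathscr{X}_2} [\mathscr{Y}_j, \{x_2\}]^{tr}$ of $\mathscr{Z}_j$. We have that
\[
\Bigl\{ \mathbf{z}_{\mathscr{Z}_j} \Bigm| \mathbf{z} \in \mathcal{H}_\epsilon\Bigl(\mathbf{x}_1, \coprod_{j=1}^{m} \mathscr{Y}_j \Bigm| \mathbf{x}_2\Bigr) \Bigr\} \subseteq \mathcal{H}_\epsilon\Bigl(\mathbf{x}_{\mathscr{Z}_j}, \coprod_{x_2 \in \mathscr{X}_2} [\mathscr{Y}_j, \{x_2\}]^{tr}\Bigr).
\]
Thus, by Lemma 7.2.2,
\[
\Bigl| \Bigl\{ \mathbf{z}_{\mathscr{Z}_j} \Bigm| \mathbf{z} \in \mathcal{H}_\epsilon\Bigl(\mathbf{x}_1, \coprod_{j=1}^{m} \mathscr{Y}_j \Bigm| \mathbf{x}_2\Bigr) \Bigr\} \Bigr| \leq \Bigl| \mathcal{H}_\epsilon\Bigl(\mathbf{x}_{\mathscr{Z}_j}, \coprod_{x_2 \in \mathscr{X}_2} [\mathscr{Y}_j, \{x_2\}]^{tr}\Bigr) \Bigr| < \exp_2\left[ |\mathbf{x}_{1, \mathscr{Y}_j}| \left(H_{\mathscr{Z}_j} - H_{\pi_2, \mathscr{Z}_j} + 2\epsilon\right) \right],
\]
since the reduced process $\{X_{\mathscr{Z}_j}^{(k)}\}$ is recurrent a.m.s. ergodic. Consequently,
\begin{align*}
\Bigl| \mathcal{H}_\epsilon\Bigl(\mathbf{x}_1, \coprod_{j=1}^{m} \mathscr{Y}_j \Bigm| \mathbf{x}_2\Bigr) \Bigr|
&\leq \prod_{j=1}^{m} \Bigl| \Bigl\{ \mathbf{z}_{\mathscr{Z}_j} \Bigm| \mathbf{z} \in \mathcal{H}_\epsilon\Bigl(\mathbf{x}_1, \coprod_{j=1}^{m} \mathscr{Y}_j \Bigm| \mathbf{x}_2\Bigr) \Bigr\} \Bigr| \\
&< \prod_{j=1}^{m} \exp_2\left[ |\mathbf{x}_{1, \mathscr{Y}_j}| \left(H_{\mathscr{Z}_j} - H_{\pi_2, \mathscr{Z}_j} + 2\epsilon\right) \right] \\
&= \exp_2\Bigl[ \sum_{j=1}^{m} |\mathbf{x}_{1, \mathscr{Y}_j}| \left(H_{\mathscr{Z}_j} - H_{\pi_2, \mathscr{Z}_j} + 2\epsilon\right) \Bigr] \\
&= \exp_2\Bigl[ n \Bigl( \sum_{j=1}^{m} \frac{|\mathbf{x}_{\mathscr{Z}_j}|}{n} \left(H_{\mathscr{Z}_j} - H_{\pi_2, \mathscr{Z}_j}\right) + 2\epsilon \Bigr) \Bigr] \\
&< \exp_2\Bigl[ n \Bigl( \sum_{j=1}^{m} \left(p(\mathscr{Z}_j) + \epsilon\right) \left(H_{\mathscr{Z}_j} - H_{\pi_2, \mathscr{Z}_j}\right) + 2\epsilon \Bigr) \Bigr] \\
&< \exp_2\Bigl[ n \Bigl( \sum_{j=1}^{m} p(\mathscr{Z}_j) \left(H_{\mathscr{Z}_j} - H_{\pi_2, \mathscr{Z}_j}\right) + (|\mathscr{X}| + 2)\epsilon \Bigr) \Bigr].
\end{align*}
The statement is proved.

Lemma 7.2.4. In Lemma 7.2.3, let
\[
\Gamma : \mathscr{X} \to \Bigl\{ \begin{bmatrix} \mathscr{Y}_j \\ \{x_2\} \end{bmatrix} \Bigm| 1 \leq j \leq m,\ x_2 \in \mathscr{X}_2 \Bigr\}
\]
be given as $\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \begin{bmatrix} \mathscr{Y}_j \\ \{x_2\} \end{bmatrix}$ if $x_1 \in \mathscr{Y}_j$. We have that
\[
\Bigl| \mathcal{H}_\epsilon\Bigl(\mathbf{x}_1, \coprod_{j=1}^{m} \mathscr{Y}_j \Bigm| \mathbf{x}_2\Bigr) \Bigr| < \exp_2\left[ n \left(H_{\mathscr{X}} - H_{\Gamma, \mathscr{X}} + 2\epsilon\right) \right].
\]

Proof. Obviously, $\coprod_{j=1}^{m} \coprod_{x_2 \in \mathscr{X}_2} \begin{bmatrix} \mathscr{Y}_j \\ \{x_2\} \end{bmatrix}$ is a partition of $\mathscr{X}$. In addition,
\[
\mathcal{H}_\epsilon\Bigl(\mathbf{x}_1, \coprod_{j=1}^{m} \mathscr{Y}_j \Bigm| \mathbf{x}_2\Bigr) = \mathcal{H}_\epsilon\Bigl( \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix}, \coprod_{j=1}^{m} \coprod_{x_2 \in \mathscr{X}_2} \begin{bmatrix} \mathscr{Y}_j \\ \{x_2\} \end{bmatrix} \Bigr)
\]
by definition. Thus, the statement follows from Lemma 7.2.2.

For special cases, e.g. $m = 1$ or $\{X^{(n)}\}$ is i.i.d., the exponents of the two bounds given in Lemma 7.2.1 and Lemma 7.2.2, as well as the two given in Lemma 7.2.3 and Lemma 7.2.4, are equal (up to a difference of several $\epsilon$). In general, we cannot determine which one is tighter at this point. However, the first one is often more accessible and easier to evaluate. For instance, if $\{X^{(n)}\}$ is Markov, then the reduced process $\{X_{\mathscr{Y}}^{(k)}\}$ is also Markov for all $\emptyset \neq \mathscr{Y} \subseteq \mathscr{X}$. As a consequence, the upper bound given by Lemma 7.2.1 is easily evaluated because $p$ is simply the invariant distribution and the entropy rates $H_{\mathscr{Y}_j}$ can be obtained easily from the transition matrix and $p$ (see Chapter 4). In contrast, the bound given by Lemma 7.2.2 is significantly more complicated, mainly because $\{\Gamma(X^{(n)})\}$ is not necessarily Markov, which makes it significantly harder to evaluate the entropy rate $H_{\Gamma, \mathscr{X}}$.

7.3 Linear Coding over Finite Rings for A.M.S. Sources

Let $\mathfrak{R}_i$ ($i \in S = \{1, 2, \cdots, s\}$) be a finite ring and $\{X^{(n)}\}$, where $X^{(n)} = \left[X_1^{(n)}, X_2^{(n)}, \cdots, X_s^{(n)}\right]^{tr}$, be a random process with state space $\mathfrak{R} = \prod_{i=1}^{s} \mathfrak{R}_i$. In this section, we establish the following achievability theorems of LCoR.

Theorem 7.3.1. $(R_1, R_2, \cdots, R_s)$ satisfying, $\forall\ \emptyset \neq T \subseteq S$ and $\forall\ 0 \neq \mathfrak{I}_i \leq_l \mathfrak{R}_i$,
\[
\sum_{i \in T} \frac{R_i \log |\mathfrak{I}_i|}{\log |\mathfrak{R}_i|} > \min\Bigl\{ \sum_{\mathscr{Z} \in (\mathfrak{R}_T / \mathfrak{I}_T) \times \mathfrak{R}_{T^c}} p(\mathscr{Z}) \left(H_{\mathscr{Z}} - H_{\pi_{T^c}, \mathscr{Z}}\right),\ H_{\mathfrak{R}} - H_{\Gamma_{\mathfrak{I}_T}, \mathfrak{R}} \Bigr\},
\]
where $\Gamma_{\mathfrak{I}_T} : (r_T, r_{T^c})^{tr} \mapsto (r_T + \mathfrak{I}_T, r_{T^c})^{tr}$, is achievable by linear coding over $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$.

Corollary 7.3.1. Let
\[
r = \max_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} \min\Bigl\{ \sum_{\mathscr{Z} \in \mathfrak{R} / \mathfrak{I}} p(\mathscr{Z}) H_{\mathscr{Z}},\ H_{\mathfrak{R}} - h_{\mathfrak{I}} \Bigr\},
\]
where $h_{\mathfrak{I}}$ is the entropy rate of $\{X^{(n)} + \mathfrak{I}\}$. $R > r$ is achievable by linear coding over the ring $\mathfrak{R}$.

Proof of Theorem 7.3.1. Designate
\[
r(T, \mathfrak{I}_T) = \min\Bigl\{ \sum_{\mathscr{Z} \in (\mathfrak{R}_T / \mathfrak{I}_T) \times \mathfrak{R}_{T^c}} p(\mathscr{Z}) \left(H_{\mathscr{Z}} - H_{\pi_{T^c}, \mathscr{Z}}\right),\ H_{\mathfrak{R}} - H_{\Gamma_{\mathfrak{I}_T}, \mathfrak{R}} \Bigr\},
\]
and let $k_i = \left\lfloor \frac{n R_i}{\log |\mathfrak{R}_i|} \right\rfloor$, where $n$ is the length of the data sequences. By definition,
\[
\frac{1}{n} \sum_{i \in T} k_i \log |\mathfrak{I}_i| - r(T, \mathfrak{I}_T) > 2\eta \tag{7.3.1}
\]
for some small constant $\eta > 0$ and large enough $n$, $\forall\ \emptyset \neq T \subseteq S$, $\forall\ 0 \neq \mathfrak{I}_i \leq_l \mathfrak{R}_i$. We claim that $(R_1, R_2, \cdots, R_s)$ is achievable by linear coding over $\mathfrak{R}_1, \mathfrak{R}_2, \cdots, \mathfrak{R}_s$, based on the following argument.

Encoding: For every $i \in S$, randomly generate a $k_i \times n$ matrix $\mathbf{A}_i$ based on a uniform distribution, i.e. independently choose each entry of $\mathbf{A}_i$ uniformly at random from $\mathfrak{R}_i$. Define a linear encoder $\phi_i : \mathfrak{R}_i^n \to \mathfrak{R}_i^{k_i}$ such that
\[
\phi_i : \mathbf{x} \mapsto \mathbf{A}_i \mathbf{x}, \quad \forall\ \mathbf{x} \in \mathfrak{R}_i^n.
\]
Obviously, the coding rate of this encoder is
\[
\frac{1}{n} \log |\phi_i(\mathfrak{R}_i^n)| \leq \frac{k_i}{n} \log |\mathfrak{R}_i| = \frac{\log |\mathfrak{R}_i|}{n} \left\lfloor \frac{n R_i}{\log |\mathfrak{R}_i|} \right\rfloor \leq R_i.
\]

Decoding: Subject to observing $\mathbf{y}_i \in \mathfrak{R}_i^{k_i}$ ($i \in S$) from the $i$th encoder, the decoder claims that $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_s)^{tr} \in \prod_{i=1}^{s} \mathfrak{R}_i^n$ is the array of the encoded data sequences, if and only if:
1. $\mathbf{x} \in \mathcal{H}_\epsilon(n, \{X^{(n)}\})$; and
2. $\forall\ \mathbf{x}' = (\mathbf{x}_1', \mathbf{x}_2', \cdots, \mathbf{x}_s')^{tr} \in \mathcal{H}_\epsilon(n, \{X^{(n)}\})$, if $\mathbf{x}' \neq \mathbf{x}$, then $\phi_j(\mathbf{x}_j') \neq \mathbf{y}_j$, for some $j$.

Error: Assume that $\mathbf{X}_i \in \mathfrak{R}_i^n$ ($i \in S$) is the original data sequence generated by the $i$th source. It is readily seen that an error occurs if and only if one of the following events occurs:
$E_1$: $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_s)^{tr} \notin \mathcal{H}_\epsilon(n, \{X^{(n)}\})$;
$E_2$: There exists $\mathbf{X} \neq (\mathbf{x}_1', \mathbf{x}_2', \cdots, \mathbf{x}_s')^{tr} \in \mathcal{H}_\epsilon(n, \{X^{(n)}\})$, such that $\phi_i(\mathbf{x}_i') = \phi_i(\mathbf{X}_i)$, $\forall\ i \in S$.

Error Probability: By the AEP of Weak Hyper Supremus Typicality (Proposition 7.2.2), $\Pr\{E_1\} \to 0$, $n \to \infty$. Meanwhile, for $\emptyset \neq T \subseteq S$ and $0 \neq \mathfrak{I} \leq_l \mathfrak{R}_T$, let
\[
\mathcal{D}_\epsilon(\mathbf{X}; T) = \Bigl\{ (\mathbf{x}_1', \mathbf{x}_2', \cdots, \mathbf{x}_s')^{tr} \in \mathcal{H}_\epsilon(n, \{X^{(n)}\}) \Bigm| \mathbf{x}_i' \neq \mathbf{X}_i,\ \forall\ i \in T \text{ and } \mathbf{x}_i' = \mathbf{X}_i,\ \forall\ i \in T^c \Bigr\}
\]
and $\mathcal{D}_\epsilon(\mathbf{X}_T, \mathfrak{I} \mid \mathbf{X}_{T^c}) = \mathcal{H}_\epsilon(\mathbf{X}_T, \mathfrak{R}_T / \mathfrak{I} \mid \mathbf{X}_{T^c}) \setminus \{\mathbf{X}\}$, where $\mathbf{X}_T = \prod_{i \in T} \mathbf{X}_i$ and $\mathbf{X}_{T^c} = \prod_{i \in T^c} \mathbf{X}_i$. We have
\[
\mathcal{D}_\epsilon(\mathbf{X}; T) = \bigcup_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}_T} \mathcal{D}_\epsilon(\mathbf{X}_T, \mathfrak{I} \mid \mathbf{X}_{T^c}), \tag{7.3.2}
\]
since $\mathfrak{I}$ goes over all possible non-trivial left ideals.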
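The encoding step of this proof can be sketched in a few lines of Python for the special case where the ring is $\mathbb{Z}_q$, the integers modulo $q$. That choice is an assumption made only for concreteness, since the proof covers arbitrary finite rings, and the helper names `code_length`, `random_encoder` and `encode` are hypothetical. The sketch draws a uniformly random $k \times n$ matrix, applies $\mathbf{x} \mapsto \mathbf{A}\mathbf{x}$, and illustrates why left ideals enter the error analysis: when two sequences differ only by entries of an ideal (here $\{0, 2\}$ in $\mathbb{Z}_4$), the difference of their codewords also has all entries in that ideal.

```python
import random
from math import floor, log2


def code_length(n, rate, q):
    """k = floor(n * rate / log2(q)), mirroring the choice of k_i in the proof."""
    return floor(n * rate / log2(q))


def random_encoder(k, n, q, rng):
    """k x n matrix with entries drawn independently and uniformly from Z_q."""
    return [[rng.randrange(q) for _ in range(n)] for _ in range(k)]


def encode(a, x, q):
    """Linear encoder phi: x -> A x, computed over Z_q."""
    return [sum(a_rc * x_c for a_rc, x_c in zip(row, x)) % q for row in a]


if __name__ == "__main__":
    q, n, rate = 4, 12, 1.0
    rng = random.Random(0)
    k = code_length(n, rate, q)  # floor(12 * 1.0 / 2) = 6
    a = random_encoder(k, n, q, rng)
    x = [rng.randrange(q) for _ in range(n)]
    d = [rng.choice([0, 2]) for _ in range(n)]  # difference with entries in {0, 2}
    y = [(xi + di) % q for xi, di in zip(x, d)]
    diff = [(u - v) % q for u, v in zip(encode(a, y, q), encode(a, x, q))]
    # The codeword difference A d stays inside the ideal {0, 2} entrywise.
    print(k, all(t in (0, 2) for t in diff))
```

Because the codeword difference is confined to the ideal, a collision between two such sequences is governed by the size of the ideal rather than by the size of the whole ring, which is the kind of structure the factor $\prod_{i \in T} |\mathfrak{I}_i|^{-k_i}$ in (7.3.6) accounts for. A pure-Python list of lists is used instead of a numerical library only to keep the sketch free of dependencies.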
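The size of each $\mathcal{D}_\epsilon(\mathbf{X}_T, \mathfrak{I} \mid \mathbf{X}_{T^c})$ is bounded as
\begin{align*}
\left| \mathcal{D}_\epsilon(\mathbf{X}_T, \mathfrak{I} \mid \mathbf{X}_{T^c}) \right|
&= \left| \mathcal{H}_\epsilon(\mathbf{X}_T, \mathfrak{R}_T / \mathfrak{I} \mid \mathbf{X}_{T^c}) \right| - 1 \\
&< \begin{cases}
\exp_2\Bigl[ n \Bigl( \sum_{\mathscr{Z} \in (\mathfrak{R}_T / \mathfrak{I}_T) \times \mathfrak{R}_{T^c}} p(\mathscr{Z}) \left(H_{\mathscr{Z}} - H_{\pi_{T^c}, \mathscr{Z}}\right) + (|\mathfrak{R}| + 2)\epsilon \Bigr) \Bigr] - 1 \\
\exp_2\left[ n \left( H_{\mathfrak{R}} - H_{\Gamma_{\mathfrak{I}_T}, \mathfrak{R}} + 2\epsilon \right) \right] - 1
\end{cases} \\
&\leq \exp_2\left[ n \left( r(T, \mathfrak{I}_T) + (|\mathfrak{R}| + 2)\epsilon \right) \right] - 1 \tag{7.3.3}
\end{align*}
by Lemma 7.2.3 and Lemma 7.2.4. Consequently,
\begin{align}
\Pr\{E_2 \mid E_1^c\}
&= \sum_{(\mathbf{x}_1', \cdots, \mathbf{x}_s')^{tr} \in \mathcal{H}_\epsilon(n, \{X^{(n)}\}) \setminus \{\mathbf{X}\}} \ \prod_{i \in S} \Pr\{\phi_i(\mathbf{x}_i') = \phi_i(\mathbf{X}_i) \mid E_1^c\} \notag \\
&= \sum_{\emptyset \neq T \subseteq S} \ \sum_{(\mathbf{x}_1', \cdots, \mathbf{x}_s')^{tr} \in \mathcal{D}_\epsilon(\mathbf{X}; T)} \ \prod_{i \in T} \Pr\{\phi_i(\mathbf{x}_i') = \phi_i(\mathbf{X}_i) \mid E_1^c\} \tag{7.3.4} \\
&\leq \sum_{\emptyset \neq T \subseteq S} \ \sum_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}_T} \ \sum_{(\mathbf{x}_1', \cdots, \mathbf{x}_s')^{tr} \in \mathcal{D}_\epsilon(\mathbf{X}_T, \mathfrak{I} \mid \mathbf{X}_{T^c})} \ \prod_{i \in T} \Pr\{\phi_i(\mathbf{x}_i') = \phi_i(\mathbf{X}_i) \mid E_1^c\} \tag{7.3.5} \\
&< \sum_{\emptyset \neq T \subseteq S} \ \sum_{0 \neq \prod_{i \in T} \mathfrak{I}_i \leq_l \mathfrak{R}_T} \left[ \exp_2\left[ n \left( r(T, \mathfrak{I}_T) + \eta \right) \right] - 1 \right] \prod_{i \in T} |\mathfrak{I}_i|^{-k_i} \tag{7.3.6} \\
&< (2^s - 1)\left(2^{|\mathfrak{R}|} - 2\right) \times \max_{\substack{\emptyset \neq T \subseteq S, \\ 0 \neq \prod_{i \in T} \mathfrak{I}_i \leq_l \mathfrak{R}_T}} \exp_2\Bigl[ -n \Bigl( \frac{1}{n} \sum_{i \in T} k_i \log |\mathfrak{I}_i| - \left( r(T, \mathfrak{I}_T) + \eta \right) \Bigr) \Bigr] \tag{7.3.7} \\
&< (2^s - 1)\left(2^{|\mathfrak{R}|} - 2\right) \times \exp_2[-n\eta], \tag{7.3.8}
\end{align}
where

(7.3.4) is from the fact that $\mathcal{H}_\epsilon(n, \{X^{(n)}\}) \setminus \{\mathbf{X}\} = \coprod_{\emptyset \neq T \subseteq S} \mathcal{D}_\epsilon(\mathbf{X}; T)$ (disjoint union);

(7.3.5) follows from (7.3.2) by Boole's inequality [Boo10, Fré35];

(7.3.6) is from (7.3.3) and Lemma 2.1.1, as well as the fact that every left ideal of $\mathfrak{R}_T$ is a Cartesian product of some left ideals $\mathfrak{I}_i$ of $\mathfrak{R}_i$, $i \in T$ (see Proposition 1.1.3). At the same time, $\epsilon$ is required to be smaller than $\frac{\eta}{|\mathfrak{R}| + 2}$;

(7.3.7) is due to the facts that the number of non-empty subsets of $S$ is $2^s - 1$ and the number of non-trivial left ideals of the finite ring $\mathfrak{R}_T$ is less than $2^{|\mathfrak{R}|} - 1$, which is the number of non-empty subsets of $\mathfrak{R}$;

(7.3.8) is from (7.3.1).

Thus, for all $\epsilon \leq \frac{\eta}{|\mathfrak{R}| + 2}$, $\Pr\{E_2 \mid E_1^c\} \to 0$, when $n \to \infty$, from (7.3.8), since for sufficiently large $n$, $\frac{1}{n} \sum_{i \in T} k_i \log |\mathfrak{I}_i| - \left[ r(T, \mathfrak{I}_T) + \eta \right] > \eta > 0$. Therefore, $\Pr\{E_1 \cup E_2\} = \Pr\{E_1\} + \Pr\{E_1^c\} \Pr\{E_2 \mid E_1^c\} \to 0$ as $\epsilon \to 0$ and $n \to \infty$.

Theorem 7.3.2. In Problem 5.1, let $\hat{g} = h\left(\sum_{i=1}^{s} k_i\right)$ be a polynomial presentation of $g$ over the ring $\mathfrak{R}$, and $X^{(n)} = \sum_{i=1}^{s} k_i\left(X_i^{(n)}\right)$ (note: this defines a random process $\{X^{(n)}\}$ with state space $\mathfrak{R}$). If $\{X_S^{(n)}\}$ is recurrent a.m.s. ergodic, then $(R_1, R_2, \cdots, R_s)$ satisfying
\[
R_i > \max_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} \min\Bigl\{ \sum_{\mathscr{Z} \in \mathfrak{R} / \mathfrak{I}} p(\mathscr{Z}) H_{\mathscr{Z}},\ H_{\mathfrak{R}} - h_{\mathfrak{I}} \Bigr\},
\]
where $h_{\mathfrak{I}}$ is the entropy rate of $\{X^{(n)} + \mathfrak{I}\}$, is achievable for encoding $g$.

Proof. By Proposition 6.1.3 and Proposition 6.2.1, $\{X^{(n)}\}$ is recurrent a.m.s. ergodic since $\{X_S^{(n)}\}$ is recurrent a.m.s. ergodic. Therefore, $\forall\ \epsilon > 0$, there exists a large enough $n$, an $m \times n$ matrix $\mathbf{A} \in \mathfrak{R}^{m \times n}$ and a decoder $\psi$, such that $\Pr\{X^n \neq \psi(\mathbf{A} X^n)\} < \epsilon$, provided that
\[
m > \max_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} \frac{n}{\log |\mathfrak{I}|} \min\Bigl\{ \sum_{\mathscr{Z} \in \mathfrak{R} / \mathfrak{I}} p(\mathscr{Z}) H_{\mathscr{Z}},\ H_{\mathfrak{R}} - h_{\mathfrak{I}} \Bigr\},
\]
by Corollary 7.3.1. Let $\phi_i = \mathbf{A} \circ \vec{k}_i$ ($1 \leq i \leq s$) be the encoder of the $i$th source. Upon receiving $\phi_i(X_i^n)$ from the $i$th source, the decoder claims that $\vec{h}\bigl(\hat{X}^n\bigr)$, where $\hat{X}^n = \psi\left[\sum_{i=1}^{s} \phi_i(X_i^n)\right]$, is the function, namely $\hat{g}$, subject to computation. The probability of decoding error is
\begin{align*}
\Pr\Bigl\{ \vec{h}\bigl[\vec{k}(X_1^n, X_2^n, \cdots, X_s^n)\bigr] \neq \vec{h}\bigl(\hat{X}^n\bigr) \Bigr\}
&\leq \Pr\bigl\{ X^n \neq \hat{X}^n \bigr\} \\
&= \Pr\Bigl\{ X^n \neq \psi\Bigl[ \sum_{i=1}^{s} \phi_i(X_i^n) \Bigr] \Bigr\} \\
&= \Pr\Bigl\{ X^n \neq \psi\Bigl[ \sum_{i=1}^{s} \mathbf{A} \vec{k}_i(X_i^n) \Bigr] \Bigr\} \\
&= \Pr\Bigl\{ X^n \neq \psi\Bigl[ \mathbf{A} \sum_{i=1}^{s} \vec{k}_i(X_i^n) \Bigr] \Bigr\} \\
&= \Pr\Bigl\{ X^n \neq \psi\bigl[ \mathbf{A} \vec{k}(X_1^n, X_2^n, \cdots, X_s^n) \bigr] \Bigr\} \\
&= \Pr\{ X^n \neq \psi(\mathbf{A} X^n) \} < \epsilon.
\end{align*}
Therefore, all $(R_1, R_2, \cdots, R_s) \in \mathbb{R}^s$ with
\[
R_i = \frac{m \log |\mathfrak{R}|}{n} > \max_{0 \neq \mathfrak{I} \leq_l \mathfrak{R}} \frac{\log |\mathfrak{R}|}{\log |\mathfrak{I}|} \min\Bigl\{ \sum_{\mathscr{Z} \in \mathfrak{R} / \mathfrak{I}} p(\mathscr{Z}) H_{\mathscr{Z}},\ H_{\mathfrak{R}} - h_{\mathfrak{I}} \Bigr\}
\]
are achievable, i.e. $(R_1, R_2, \cdots, R_s) \in \mathcal{R}[\hat{g}] \subseteq \mathcal{R}[g]$.

Chapter 8 Conclusion

8.1 Summary

This thesis first presented a coding theorem of linear coding over finite rings (LCoR) for correlated i.i.d. data compression. This theorem covers the corresponding achievability theorems of Elias [Eli55] and Csiszár [Csi82] for linear coding over finite fields as special cases. In addition, it was shown that, for any set of finite correlated discrete memoryless sources, there always exists a sequence of linear encoders over some finite non-field rings which achieves the data compression limit, the Slepian–Wolf region.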
Hence, the optimality problem regarding linear coding over finite non-field rings for data compression is closed with positive confirmation with respect to existence.

As an application, we addressed the problem of encoding functions of sources, where the decoder is interested in recovering a discrete function of the data generated and independently encoded by several correlated i.i.d. random sources. We proposed linear coding over finite rings as an alternative solution to this problem. Results in Körner–Marton [KM79] and Ahlswede–Han [AH83, Theorem 10] on encoding the binary sum were generalised to cases for encoding certain polynomial functions over rings. Since a discrete function with a finite domain always admits such a polynomial presentation, we concluded that both generalisations universally apply to encoding all discrete functions of finite domains. Based on these, we demonstrated that linear coding over finite rings strictly outperforms its field counterpart in terms of achieving better coding rates and reducing the required alphabet sizes of the encoders for encoding many discrete functions.

In order to generalise the above results to Markov source and a.m.s. source settings, we introduced the concept of Supremus typicality. It was shown that Supremus typicality is stronger than the classical Shannon typicality in terms of characterising the ergodic behaviours of random sequences. Moreover, it possesses better properties that give rise to results that are more accessible and easier to analyse compared to the corresponding ones derived from its classical counterpart. Built on the properties established for Supremus typicality (e.g. AEP and Extended SMB Theorem), we generalised our results on LCoR to non-i.i.d. (Markov and a.m.s.) settings. It was seen that linear coding over non-field rings is equally optimal as its field counterpart for compressing irreducible Markov sources in many examples (not a complete proof as for the i.i.d. case).
In addition, it was once again proved that linear encoders over non-field rings strictly outperform their field counterparts for encoding many functions. To be more precise, it was proved that the set of coding rates achieved by linear encoders over certain non-field rings is strictly larger than the one achieved by all the field versions.

As mentioned, the idea of Supremus typical sequences is a very important element in the establishment of our results on non-i.i.d. sources. Its advantages were seen by comparing the corresponding results derived from the classical Shannon typical sequence and the Supremus typical sequence arguments, respectively. Yet, fundamentally, their differences come from the SMB Theorem and the Extended SMB Theorem. Empirically speaking, classical Shannon typical sequences feature the SMB Theorem, and Supremus typical sequences are characterised by the Extended SMB Theorem. The Extended SMB Theorem specifies not only the ergodic behaviours of the "global" random process, as the SMB Theorem does, but also the behaviours of all the reduced ("local") processes. From this viewpoint, we can see that Supremus typicality describes the "typical behaviours" of a randomly generated sequence better. It refines the idea of classical typicality.

8.2 Future Research Directions

1. We have proved that, for some classes of non-field rings, linear coding is optimal in achieving the best coding rates for correlated i.i.d. data compression. However, the statement is not yet proved to hold for all non-field rings. It could be that the achievability theorem we obtained does not present an optimal achievable coding rate region in general. Given that LCoR brings many advantages in applications, it is interesting to have a conclusive answer to this problem.

2. An efficient method to construct an optimal or asymptotically optimal linear coding scheme over a finite ring is required and very important in practical applications.
Even though our analysis for the ring scenarios is more complicated than that for the field cases, linear encoders working over some finite rings are in general considerably easier to implement in practice. This is because the implementation of finite field arithmetic can be quite demanding. Normally, a finite field is given by its polynomial representation, and operations are carried out as polynomial operations (addition and multiplication) followed by the polynomial long division algorithm. In contrast, implementing the arithmetic of many finite rings is a straightforward task. For instance, the arithmetic of the ring of integers modulo $q$, $\mathbb{Z}_q$, for any positive integer $q$, is simply integer arithmetic modulo $q$, and the arithmetic of matrix rings is matrix addition and multiplication.

3. It is also very interesting to consider coding schemes based on other algebraic structures, e.g. groups [CF09], rngs, modules, and algebras. Actually, for linear coding over finite rngs, many of our results on LCoR hold for their rng counterparts, since essentially a rng is "a ring without the multiplicative identity." It will be intriguing should it turn out that the rng version outperforms the ring version in the function encoding problem or other problems, in the same manner that the ring version outperforms its field counterpart. It will also be interesting to see whether the idea of using rngs, as well as other algebraic structures, provides more understanding of related problems.

4. Although we have seen that linear encoders over non-field rings outperform their field counterparts in various aspects in the function encoding problem, the problem of characterising the achievable coding rate region for encoding a function of sources is generally open. This problem is linked to other unsolved network information theory problems as well.
Hence, an approach to tackle the function encoding problem could potentially provide significant insight into other problems.

5. For decades, Shannon's argument on typicality of sequences has not changed much. It was successfully applied to prove most of the information theory results. Yet, changes can be helpful sometimes. From careful investigation of the analysis used to establish the achievability theorems of LCoR, one can see that the ring linear encoder is still chosen randomly, as is done for the field linear encoder. Compared to the analysis used to prove the achievability theorem of LCoF, the major difference lies in the analysis of the stochastic properties of the random source data. For non-i.i.d. scenarios, such a difference is obviously seen from the introduction of the concept of Supremus typicality. The AEP of Supremus typicality states that the Shannon typical sequences which are not Supremus typical are negligible. Such a "refinement" of the concept of typical sequence works in our particular problems. Thus, apart from searching for different coding schemes, we propose to look deeper into the stochastic behaviours of the systems (sources and channels) as well when dealing with other problems.

Bibliography

[Aar97] J. Aaronson, An Introduction to Infinite Ergodic Theory. Providence, R.I.: American Mathematical Society, 1997.
[AF92] F. W. Anderson and K. R. Fuller, Rings and Categories of Modules, 2nd ed. Springer-Verlag, 1992.
[AH83] R. Ahlswede and T. S. Han, "On source coding with side information via a multiple-access channel and related problems in multi-user information theory," IEEE Transactions on Information Theory, vol. 29, no. 3, pp. 396–411, May 1983.
[BB05] L. Breuer and D. Baum, An Introduction to Queueing Theory: and Matrix-Analytic Methods, 2005th ed. Springer, Dec. 2005.
[Bir31] G. D. Birkhoff, "Proof of the ergodic theorem," Proceedings of the National Academy of Sciences of the United States of America, vol. 17, no. 12, pp. 656–660, Dec. 1931.
[Boo10] G. Boole, An investigation of the laws of thought, on which are founded the mathematical theories of logic and probabilities. [S.l.]: Watchmaker, 2010.
[BR58] C. J. Burke and M. Rosenblatt, "A markovian function of a markov chain," The Annals of Mathematical Statistics, vol. 29, no. 4, pp. 1112–1122, Dec. 1958.
[Bre57] L. Breiman, "The individual ergodic theorem of information theory," The Annals of Mathematical Statistics, vol. 28, no. 3, pp. 809–811, Sep. 1957.
[Buc82] R. C. Buck, "Nomographic functions are nowhere dense," Proceedings of the American Mathematical Society, vol. 85, no. 2, pp. 195–199, Jun. 1982.
[CF09] G. Como and F. Fagnani, "The capacity of finite abelian group codes over symmetric memoryless channels," IEEE Transactions on Information Theory, vol. 55, no. 5, pp. 2037–2054, May 2009.
[Cov75] T. M. Cover, "A proof of the data compression theorem of slepian and wolf for ergodic sources," IEEE Transactions on Information Theory, vol. 21, no. 2, pp. 226–228, Mar. 1975.
[Csi82] I. Csiszár, "Linear codes for sources and source networks: Error exponents, universal coding," IEEE Transactions on Information Theory, vol. 28, no. 4, pp. 585–592, Jul. 1982.
[Csi98] I. Csiszár, "The method of types," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2505–2523, 1998.
[CT06] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley-Interscience, Jul. 2006.
[DA12] J. Du and M. Andersson, Private Communication, May 2012.
[DF03] D. S. Dummit and R. M. Foote, Abstract Algebra, 3rd ed. Wiley, 2003.
[DLS81] L. D. Davisson, G. Longo, and A. Sgarro, "The error exponent for the noiseless encoding of finite ergodic markov sources," IEEE Transactions on Information Theory, vol. 27, no. 4, pp. 431–438, Jul. 1981.
[Eli55] P. Elias, "Coding for noisy channels," IRE Convention Record, vol. 3, pp. 37–46, Mar. 1955.
[FCC+02] E. Fung, W. K. Ching, S. Chu, M. Ng, and W. Zang, "Multivariate markov chain models," in 2002 IEEE International Conference on Systems, Man and Cybernetics, vol. 3, Oct. 2002.
[Fré35] M. Fréchet, "Généralisation du théorème des probabilités totales," Fundamenta Mathematicae, vol. 25, no. 1, pp. 379–387, 1935.
[Fro12] G. Frobenius, "Über matrizen aus nicht negativen elementen," Sitzungsberichte Königlich Preussichen Akademie der Wissenschaft, pp. 456–477, 1912.
[Gal68] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[GK80] R. M. Gray and J. C. Kieffer, "Asymptotically mean stationary measures," The Annals of Probability, vol. 8, no. 5, pp. 962–973, Oct. 1980.
[Gra09] R. M. Gray, Probability, Random Processes, and Ergodic Properties, 2nd ed. Springer, Aug. 2009.
[Hop70] E. Hopf, Ergodentheorie (Ergebnisse der Mathematik und Ihrer Grenzgebiete / Zweiter Band) (German Edition), reprint of the first edition, Berlin 1937. Springer, Jan. 1970.
[HS] S. Huang and M. Skoglund, "On linear coding over finite rings and applications to computing," IEEE Transactions on Information Theory, conditionally accepted for publication (submitted October 2012). [Online]. Available: http://people.kth.se/~sheng11
[HS12a] S. Huang and M. Skoglund, "Computing polynomial functions of correlated sources: Inner bounds," in International Symposium on Information Theory and its Applications, Oct. 2012, pp. 160–164.
[HS12b] S. Huang and M. Skoglund, "Linear source coding over rings and applications," in IEEE Swedish Communication Technologies Workshop, Oct. 2012, pp. 1–6.
[HS12c] S. Huang and M. Skoglund, On Existence of Optimal Linear Encoders over Non-field Rings for Data Compression, KTH Royal Institute of Technology, December 2012. [Online]. Available: http://people.kth.se/~sheng11
[HS12d] S. Huang and M. Skoglund, "Polynomials and computing functions of correlated sources," in IEEE International Symposium on Information Theory, Jul. 2012, pp. 771–775.
[HS13a] S. Huang and M. Skoglund, "Encoding irreducible markovian functions of sources: An application of supremus typicality," IEEE Transactions on Information Theory, May 2013, submitted. [Online]. Available: http://people.kth.se/~sheng11
[HS13b] S. Huang and M. Skoglund, "On achievability of linear source coding over finite rings," in 2013 IEEE International Symposium on Information Theory Proceedings (ISIT), 2013, pp. 1984–1988.
[HS13c] S. Huang and M. Skoglund, "On existence of optimal linear encoders over non-field rings for data compression with application to computing," in 2013 IEEE Information Theory Workshop (ITW), 2013.
[HS14a] S. Huang and M. Skoglund, "Induced transformations of recurrent a.m.s. dynamical systems," Stochastics and Dynamics, 2014.
[HS14b] S. Huang and M. Skoglund, "Supremus typicality," in 2014 IEEE International Symposium on Information Theory Proceedings (ISIT), 2014, pp. 2644–2648.
[Hun80] T. W. Hungerford, Algebra (Graduate Texts in Mathematics). Springer, Dec. 1980.
[Hur44] W. Hurewicz, "Ergodic theorem without invariant measure," Annals of Mathematics, vol. 45, no. 1, pp. 192–206, Jan. 1944.
[Kak43] S. Kakutani, "Induced measure preserving transformations," Proceedings of the Imperial Academy, vol. 19, no. 10, pp. 635–641, 1943.
[KK97] T. Kamae and M. Keane, "A simple proof of the ratio ergodic theorem," Osaka Journal of Mathematics, vol. 34, no. 3, pp. 653–657, 1997.
[KM79] J. Körner and K. Marton, "How to encode the modulo-two sum of binary sources," IEEE Transactions on Information Theory, vol. 25, no. 2, pp. 219–221, Mar. 1979.
[Lam01] T.-Y. Lam, A First Course in Noncommutative Rings, 2nd ed. Springer, Jun. 2001.
[LN97] R. Lidl and H. Niederreiter, Finite Fields, 2nd ed. New York: Cambridge University Press, 1997.
[McM53] B. McMillan, "The basic theorems of information theory," The Annals of Mathematical Statistics, vol. 24, no. 2, pp. 196–219, Jun. 1953.
[Mey89] C. D. Meyer, "Stochastic complementation, uncoupling markov chains, and the theory of nearly reducible systems," SIAM Rev., vol. 31, no. 2, pp. 240–272, Jun. 1989.
[MS84] G. Mullen and H. Stevens, "Polynomial functions (mod m)," Acta Mathematica Hungarica, vol. 44, no. 3-4, pp. 237–241, Sep. 1984.
[Nor98] J. R. Norris, Markov Chains. Cambridge University Press, Jul. 1998.
[Per07] O. Perron, "Zur theorie der matrices," Mathematische Annalen, vol. 64, no. 2, pp. 248–263, Jun. 1907.
[Rot10] J. J. Rotman, Advanced Modern Algebra, 2nd ed. American Mathematical Society, Aug. 2010.
[Rud86] W. Rudin, Real and Complex Analysis, 3rd ed. McGraw-Hill Science/Engineering/Math, May 1986.
[Sha48] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[Ste36] W. Stepanoff, "Sur une extension du théorème ergodique," Compositio Mathematica, vol. 3, pp. 239–253, 1936.
[SW49] C. E. Shannon and W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
[SW73] D. Slepian and J. K. Wolf, "Noiseless coding of correlated information sources," IEEE Transactions on Information Theory, vol. 19, no. 4, pp. 471–480, Jul. 1973.
[Yeu08] R. W. Yeung, Information Theory and Network Coding, 1st ed. Springer Publishing Company, Incorporated, Sep. 2008.
[Zwe04] R. Zweimüller, "Hopf's ratio ergodic theorem by inducing," Colloquium Mathematicum, vol. 101, no. 2, pp. 289–292, 2004.