This "blog" is a running summary of what we are doing in class, Math 307-101, Fall (i.e., Winter Term I) 2011. Begin Wednesday, September 7: Chapter 4 of "Google's PageRank and Beyond." Note that Chapter 15 is a useful summary of all the mathematics used in the book, although sometimes it is written in a "heavy-handed" fashion. - Explain the idea behind the hyperlink matrix, H. Note that we will allow repeated links and self-links, at least in certain situations. - Review the idea of a stationary distribution of a matrix - KEY EXAMPLE: Two state Markov chains, such as PageRank with two sites or, equivalently, people moving in and out of California, or, equivalently, people switching from left to right lanes, etc. We get a 2 x 2 matrix [ a 1-a ] [ ] [ 1-b b ] to describe the transition probabilities. Applying the chain "repeatedly" amounts to taking powers of the matrix, and the limit is (usually) [ p q ] [ ] [ p q ] where p+q=1 and q(1-b) = p(1-a), i.e., p = (1-b)/(2-a-b) and q=(1-a)/(2-a-b). The steady state distribution is [ p q ] - Try an example: Initial population [ 12 12 ] Chain: [ 3/4 1/4 ] [ ] [ 1/2 1/2 ] Get sequence [ 12 12 ] -> [ 15 9 ] -> [ 15.75 8.25 ] -> [15.9375 8.0625 ] -> can you guess the pattern? Can illustrate most of basic linear algebra using this example: solving equations (for the limit), eigenvectors/values, powers of matrices, power method, etc. - e.g. assume there is a limit [ V P ] to the above, starting from [ 12 12 ]. Solve for V,P: V+P = 24, V = (3/4)V+(1/2)P, P = (1/4)V+(1/2)P. Get V+P = 24 and V = 2P . So V=16, P=8. In general, if V+P is whatever, we still have V=16, P=8. - [ 16 8 ] is an eigenvector in the above, with eigenvalue 1. [ 1 -1] is an eigenvector with eigenvalue 1/4 . This explains the class formula for the iterates: V = 8 + 16 (1/4)^x, where x is the iteration number ( [12 12] is iteration 1, [15 9] is iteration 2, etc. ) - Give the general procedure for matrices with simple eigenvalues: Solve for eigenvalues of M with characteristic equation: det(M - I lambda ) = 0; lambda are the eigenvalues. Then find eigenvectors. Then write initial condition in terms of eigenvectors. This gives the formula for iterates of a vector under repeated multiplication my M. - General facts about Markov matrices: Definition: non-negative entries, all row sums = 1 Def: "Irreducible" if can get from any state to any other. Def: "Aperiodic" if it is of period 1 (G.C.D. of cycle lengths), i.e., some power of the matrix has all positive entries. (See IV.5.2 of Notes, conditions (3') and (4')) Remark: Notes IV.5 sometimes confuses row and columns on page 161. Thm: If a Markov matrix is irreducible, there is a unique (up to scalar multiplication) eigenvector of eigenvalue 1 with all non-negative entries. When normalized to sum of entries one, it is a stochastic vector called the "stationary distribution." - See also Markov Chain entry in Wikipedia. - Similar example: Notes IV.3.1 Fibonacci Numbers More theory: Notes IV.5 Markov Chains (Attention: Notes IV.5 has column vectors instead of row vectors; the text and in class we use row vectors and matrices multiply on the right!) - Explain how to use 2 x 2 Markov chains to reason about two populations of websites, or Americans (in various states) and Canadians (in various provinces), etc. This is the idea of refining Markov chains, not covered in the text (as far as I know). - Explain how the power method works in the 2 x 2 case. 
------------------------------

Around Sept 19: Have yet to describe the power method and eigenvalues.

- On Sept 16 we explained the notion of refining a Markov chain: say we have a Markov chain on people from Vancouver, Toronto, Montreal. We can refine the Vancouver state into "UBC" and "non-UBC" people from Vancouver, and get a related Markov chain. More precisely, assume that, from each state, the sum of the transition probabilities to the UBC and non-UBC states is the same as the original probability to Vancouver, and that the transition probabilities from UBC and from non-UBC are both identical to those from Vancouver in the original chain; then the sum of the stationary probabilities of UBC and non-UBC equals the original stationary probability of Vancouver, and all other stationary probabilities are the same. In class we did something like: [ 0,1/2,1/2 ; 1,0,0 ; 1,0,0 ] (in Matlab/Octave notation) can be refined to [ 0,0,1/2,1/2 ; 0,0,1/2,1/2 ; .7,.3,0,0 ; .6,.4,0,0 ] (states ordered UBC, non-UBC, Toronto, Montreal).

- Another refinement example: there are n websites, and each website has k_1 links to the first website, k_2 links to the second, etc. If k = k_1 + ... + k_n, then we see that the i-th website has PageRank k_i/k. Note that we could refine the i-th website into k_i individual websites with the same result.

- Another refinement example: n websites all point to one another (including themselves). One website, called A, decides to create m new websites such that the new websites only point to themselves and A (and A now points to them as well). By how much does A's PageRank go up? Use three mega-states: (1) new, (2) A, and (3) original excepting A. We get the matrix

    [ m/(m+1), 1/(m+1), 0           ;
      m/(m+n), 1/(m+n), (n-1)/(m+n) ;
      0,       1/n,     (n-1)/n     ]

  We get: pi1 = pi2 (m+1)m/(m+n), pi3 = pi2 n(n-1)/(m+n). Hence pi = [ (m+1)m , m+n , n(n-1) ] / ( (m+1)m + m+n + n(n-1) ).

- Study the power method on bad Markov chains: not irreducible, or periodic. Then study the power method on a two by two chain, say symmetric for simplicity: [ a, 1-a ; 1-a, a ]. What happens when a is near 1? Near 0?

-------------------------------

Around October 2: The Power Method for PageRank.

Power method interpretation: We have G = alpha S + (1-alpha) E, where E = (1/n) e e^T is the matrix whose entries are all 1/n, and S is a Markov matrix. The power method starts with a stochastic vector, v, and considers v^T, v^T G, v^T G^2, v^T G^3, ... and hopes that this converges to the stationary distribution pi^T. For any stochastic w we have

    w^T G = w^T [ alpha S + (1-alpha) E ] = alpha w^T S + (1-alpha) e^T/n,

where e is the all 1's vector, since w^T E = e^T/n for any stochastic w. So

    w^T G^2 = alpha^2 w^T S^2 + alpha (1-alpha) e^T/n S + (1-alpha) e^T/n,

and more generally

    w^T G^k = alpha^k w^T S^k + (1-alpha) e^T/n ( I + alpha S + alpha^2 S^2 + ... + alpha^(k-1) S^(k-1) ).

Hence the limit as k -> infinity (assuming that 0 <= alpha < 1) is

    (1-alpha) e^T/n ( I + alpha S + alpha^2 S^2 + ... ) = (1-alpha) ( sum from k=0 to infinity of e^T/n alpha^k S^k ).

In other words, the k-th step of the Markov process when we start from a random website is (e^T/n) S^k; these vectors are summed with weights (1-alpha) alpha^k. The sum of the weights on step k and beyond is alpha^k. Hence if alpha = .85, then the sum of the first 20 terms accounts for "all but 4 percent" of the PageRank.
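Here is a small Octave/Matlab sketch of this truncation; the 4-site Markov matrix S is made up for illustration, and alpha = .85 as above:

    % truncated power-method series for PageRank, with uniform teleportation
    alpha = 0.85; n = 4;
    S = [0 1/2 1/2 0; 1 0 0 0; 1/3 1/3 0 1/3; 0 0 1 0];   % some Markov matrix (rows sum to 1)
    w = ones(1,n)/n;                       % e^T / n
    pr = zeros(1,n);
    for k = 0:19                           % the first 20 terms
      pr = pr + (1-alpha) * alpha^k * w;   % weight (1-alpha) alpha^k on step k
      w = w * S;                           % advance the random surfer one step
    end
    exact = (1-alpha) * (ones(1,n)/n) / (eye(n) - alpha*S);   % (1-alpha) e^T/n (I - alpha S)^(-1)
    disp(sum(exact - pr))                  % the missing mass is alpha^20, about 4 percent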
One may also understand the weights of e^T/n S^k for k = 0,1,2,..., i.e., the weights (1-alpha) alpha^k, as the weights of the geometric distribution; in other words, the weight (1-alpha) alpha^k represents the probability of seeing exactly k tails if we repeatedly flip a coin until we get a heads, provided that the probability of heads is (1-alpha) (and therefore the probability of tails is alpha).

Another way to understand the effect of damping is on the eigenvalues. We claim that if S has eigenvalues lambda_1 = 1, lambda_2, ..., lambda_n, then G has eigenvalues 1, alpha lambda_2, alpha lambda_3, ..., alpha lambda_n (a numerical check appears in the sketch further below). The first eigenvalue of 1 comes with any Markov chain; to see the other eigenvalues, assume that S has distinct eigenvalues. If v_i^T S = lambda_i v_i^T, i.e., v_i is the left eigenvector corresponding to lambda_i, and if i > 1, then v_i^T e = 0, i.e., the sum of v_i's components is zero; to see this, note that (lambda_i v_i^T) e = (v_i^T S) e = v_i^T (S e) = v_i^T e, and hence (lambda_i - 1) v_i^T e = 0, so v_i^T e = 0 since lambda_i is not 1. But then v_i^T E = 0, and so v_i^T G = alpha v_i^T S + (1-alpha) v_i^T E = alpha lambda_i v_i^T. Hence replacing S with G has the effect of multiplying all the higher eigenvalues by alpha. In particular, even if S has many eigenvalues of absolute value 1 (i.e., S is periodic or reducible), G won't (if 0 <= alpha < 1): all of G's eigenvalues except 1 will be bounded in absolute value by alpha.

-------------------------------

Teleportation (see Section 5.3 of the text): We have set G = alpha S + (1-alpha) E with E = e e^T / n. Now we generalize this: we set E = e v^T, where v is a stochastic vector which we will specify. It represents how a "random jump" is taken. All our formulas generalize easily to this case; for example, our formula for the PageRank power method becomes

    w^T G^k = alpha^k w^T S^k + (1-alpha) v^T ( I + alpha S + alpha^2 S^2 + ... + alpha^(k-1) S^(k-1) ),

and PageRank converges to

    PageRank(S,v,alpha) = (1-alpha) v^T ( I + alpha S + alpha^2 S^2 + ... ) = (1-alpha) v^T (I - alpha S)^(-1).

The text calls I - alpha S the fundamental matrix. The text shows that if S and v are fixed, then (see Theorem 6.1.3)

    PageRank'(alpha) = d PageRank(alpha) / d alpha = - v^T (I - S) (I - alpha S)^(-2),

where ' denotes the derivative with respect to alpha. Of course, this could be expressed as a power series, and this will be more useful to us:

    PageRank'(alpha) = -v^T + (1 - 2 alpha) v^T S + (2 alpha - 3 alpha^2) v^T S^2 + (3 alpha^2 - 4 alpha^3) v^T S^3 + ...

------------------------------------------

Around Oct 12: Let us give a nice consequence of the above formulas in terms of the L^1 norm.

THE L^1 NORM: || pi ||_1 = | pi_1 | + | pi_2 | + ... + | pi_n |. The L^1 norm of a vector is the sum of the absolute values of its components. E.g., if u is any stochastic vector, then || u ||_1 = 1. It satisfies the properties that any "norm" is supposed to have:
(1) || u || = 0 iff u = 0;
(2) for any vector, u, and scalar, beta, we have || beta u || = | beta | || u ||;
(3) for vectors u and w we have || u + w || <= || u || + || w ||.

There are two other popular norms:
THE MAX NORM: || pi ||_max = max( |pi_1|, |pi_2|, ..., |pi_n| ).
THE L^2 NORM: || pi ||_2 = sqrt( |pi_1|^2 + ... + |pi_n|^2 ).

Claim: If A is a Markov matrix, then for any v we have || v^T A || <= || v^T ||, provided that the norm || . || being used is the L^1 norm. Norms are useful for measuring what happens when you truncate power series. Notice that || PageRank ||_1 = 1 since PageRank is stochastic.
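Going back to the claim about the eigenvalues of G, here is a quick Octave/Matlab check; the 3 x 3 periodic matrix S is just a made-up example:

    % damping multiplies all eigenvalues other than 1 by alpha
    alpha = 0.85; n = 3;
    S = [0 1 0; 0 0 1; 1 0 0];              % a periodic Markov matrix: its eigenvalues are the cube roots of 1
    v = ones(n,1)/n;                        % uniform teleportation vector
    G = alpha*S + (1-alpha)*ones(n,1)*v';   % G = alpha S + (1-alpha) e v^T
    disp(eig(S))                            % three eigenvalues of absolute value 1
    disp(eig(G))                            % 1, and the other two multiplied by alpha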
Now we claim that || PageRank'(alpha) ||_1 <= 2/(1-alpha). This follows since

    || PageRank'(alpha) ||_1 <= || v^T || + | 1 - 2 alpha | || v^T S || + | 2 alpha - 3 alpha^2 | || v^T S^2 || + ...

Since v^T, v^T S, etc. are stochastic, this is

    <= 1 + | 1 - 2 alpha | + | 2 alpha - 3 alpha^2 | + | 3 alpha^2 - 4 alpha^3 | + ...

Call this infinite sum f(alpha). One can show that f(alpha) <= 2/(1-alpha). This is not a sharp estimate for alpha close to 1, where f(alpha) is roughly c/(1-alpha), with c = 1 - 1/2.718281828... = .63212055... (the 2.718281828... is the "e" of calculus).

Now let || . || be the L^1 norm. Then we have

    || v^T G - w^T G || = || alpha (v^T - w^T) S || <= alpha || v^T - w^T ||,

provided that v and w are stochastic vectors. In particular, for w = pi, the stationary distribution, we have

    || v^T G - pi^T || <= alpha || v^T - pi^T ||.

Hence

    || v^T G^k - pi^T || <= alpha^k || v^T - pi^T ||.

-------------------------------

Around November 7: Start Least Squares and Curve Fitting. See the notes, Chapter III.1 (with background from Chapters II.1 and II.2 to be able to speak of "column spaces").

Given a matrix, A, and a vector, b, we speak of a "best solution to Ax=b" as any x that minimizes || Ax - b ||_2 (the 2-norm of Ax-b). All such x satisfy the "normal equations" A^T A x = A^T b. The value of Ax is called "the projection of b onto the column space of A," since it is the orthogonal projection of b onto the subspace spanned by the columns of A. We claim that for every b and A there is at least one solution, x. This will require some notions from Chapter II of the notes--dimension, bases, and the column space (see below).

We can apply this to linear regression: given data (x_1,y_1), ..., (x_n,y_n), we can find the "best" alpha and beta to model this data with the line y = alpha + beta x, meaning that we find the alpha and beta that minimize

    sum from i = 1 to n of ( y_i - alpha - beta x_i )^2.

Using least squares gives the equations

    [ n           sum_i x_i   ] [ alpha ]   [ sum_i y_i     ]
    [                         ] [       ] = [               ]
    [ sum_i x_i   sum_i x_i^2 ] [ beta  ]   [ sum_i y_i x_i ]

which are the usual equations for linear regression (a worked Octave/Matlab sketch appears at the very end of these notes). Note that it is not obvious that these equations have a solution in alpha and beta, nor is it obvious when these solutions are unique. Similar curve fitting can be done with any variable modelled by functions of other ("explanatory") variables.

If A^T A is invertible, then the normal equations A^T A x = A^T b have the unique solution x = inverse(A^T A) A^T b, and the projection of b onto the column space of A (the space of all vectors of the form Ay over all y) is A x = A inverse(A^T A) A^T b.

The "projection matrix of A", A inverse(A^T A) A^T, can be understood in terms of Gram-Schmidt orthonormalization: given vectors v_1, v_2, ... we define a sequence of orthonormal vectors u_1, u_2, ... (orthonormal means that the dot product u_i . u_j is 1 if i=j and 0 if i is not j) via

    u_1  = v_1 / || v_1 ||,
    u'_2 = v_2 - (u_1 . v_2) u_1 = v_2 - projection onto u_1 of v_2,
    u_2  = u'_2 / || u'_2 ||,
    u'_3 = v_3 - (u_1 . v_3) u_1 - (u_2 . v_3) u_2 = v_3 - proj onto u_1 of v_3 - proj onto u_2 of v_3,
    u_3  = u'_3 / || u'_3 ||,
    etc.

Then u_1, u_2, ... are orthonormal vectors with the same linear span as v_1, v_2, ... Note that if the v_i are linearly dependent, then some of the u'_i will be zero, and those vectors are omitted from the Gram-Schmidt construction. This means that A has the same column space as Q, the matrix whose columns are u_1, u_2, ...
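Here is a minimal Octave/Matlab sketch of the Gram-Schmidt construction just described, applied to the columns of a made-up matrix A (the cutoff for discarding a zero u'_j is an arbitrary tolerance):

    % Gram-Schmidt on the columns of A; the columns of Q are the orthonormal u_i
    A = [1 1 0; 1 2 1; 1 3 1; 1 4 2];
    [m, n] = size(A);
    Q = zeros(m, 0);
    for j = 1:n
      u = A(:, j);
      for i = 1:size(Q, 2)
        u = u - (Q(:, i)' * A(:, j)) * Q(:, i);   % subtract the projection onto u_i
      end
      if norm(u) > 1e-12                          % skip u'_j = 0 (dependent columns)
        Q = [Q, u / norm(u)];
      end
    end
    disp(Q' * Q)                                  % should be the identity matrix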
But Q^T Q = I, and so the projection matrix for Q (which is the same as for A) is simply Q Q^T; that is, the projection of b is

    Q Q^T b = u_1 (u_1 . b) + u_2 (u_2 . b) + ...

This formula for a projection is extremely simple, and useful in practice, although it requires you to apply Gram-Schmidt first.

The Gram-Schmidt process gives a decomposition A = Q R, with A and Q as before and R an upper triangular matrix. This is known as the QR-decomposition or QR-factorization (see the Wikipedia page on the QR-decomposition).

The Gram-Schmidt process also shows that if U and W are subspaces of n-dimensional space (real or complex) and U is a proper subspace of W, then there is a nonzero vector in W that is orthogonal to all the vectors in U.

The normal equations always have a solution; this is equivalent to saying that the column space of A^T A is that of A^T, and we can see that this is true: otherwise there would be an element of the column space of A^T that is orthogonal to the column space of A^T A; i.e., there would be a y with A^T y nonzero but (A^T A x) . (A^T y) = 0 for all x, or, in other words, (A x) . (A A^T y) = 0 for all x. Taking x = A^T y would show that (A A^T y) . (A A^T y) = 0, so A A^T y = 0; but then y . (A A^T y) = 0, so (A^T y) . (A^T y) = 0, so A^T y = 0, a contradiction.

------------------------------------------------------------------

Here are some random notes:

First Day of Class: Wednesday, September 7, 2011
Last Day of Class: Friday, December 2, 2011
Thanksgiving: Monday, October 10, 2011 (2nd Monday in October)
Midterm: Wednesday, November 2, 2011
Remembrance Day: Friday, November 11, 2011
Final Exam Period: December 6 to December 20
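Going back to the least squares discussion above, here is a small Octave/Matlab sketch (with made-up data points) that fits a line via the normal equations and then computes the projection of y onto the column space of A in two ways:

    % least-squares line fit y = alpha + beta x via the normal equations
    x = [1; 2; 3; 4; 5];
    y = [2.1; 3.9; 6.2; 7.8; 10.1];
    A = [ones(5,1), x];                % columns 1 and x, so A*[alpha; beta] is alpha + beta*x
    coeffs = (A' * A) \ (A' * y);      % solves A^T A [alpha; beta] = A^T y
    disp(coeffs')                      % roughly alpha = 0, beta = 2 for this data
    % the projection of y onto the column space of A, computed two ways
    p1 = A * coeffs;                   % A inverse(A^T A) A^T y
    [Q, R] = qr(A, 0);                 % economy QR: the columns of Q are orthonormal
    p2 = Q * (Q' * y);                 % u_1 (u_1 . y) + u_2 (u_2 . y)
    disp(max(abs(p1 - p2)))            % the two projections agree, up to rounding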