Power-law again
Mark Newman and colleagues published a new paper on the Power Law distribution, along with Matlab code for fitting empirical data
A. Clauset, C.R. Shalizi, and M.E.J. Newman, "Power-law distributions in empirical data" E-print (2007). physics/0706.1062
Fitting a power-law distribution
This function implements both the discrete and continuous maximum likelihood estimators for fitting the power-law distribution to data, along with the goodness-of-fit based approach to estimating the lower cutoff for the scaling region. Usage information is included in the file; type ‘help plfit’ at the Matlab prompt for more information.
plfit.m (Matlab)
I have played with the Power-law Distribution before, but it becomes relevant this time as I work on the Google Web 1T 5-Gram corpus. Google censored n-grams (n=1..5) with frequency less than 40 out of 1 Trillion tokens. They did it for a perfect reason — as it stands, the dataset is 24G compressed.
Censorship becomes an issue when I try to estimate p(word| context) based on n-gram, when there are not many words in the corpus in that context. That is, if I did:
p(x| context) = Freq(x|context)/ Freq(context), where
Freq(context) = sum over all w: F(w|context)
and we can only observe a subset of all w, then we get an over estimate of p(x|context). How much bias depends on the cutoff.
What’s needed is to estimate the frequency of all censored words in this context, so that:
Freq(context) = sum over all w: F(w|context) + censored_freq
Under the assumption of the Power Law, this can be done. And we have strong reasons to believe word frequency distribution follows the Power Law
(But what do we know about n-gram distributions? If X={xi} follows Power Law, will X-X combinations follow PL, too? With unrestricted, full combinations, no, the X-X pairs will follow the ____ distribution. But we are dealing with language, where word combinations are governed by syntax, or at least in part by phrasal grammar. If we conceptualize n-grams as meaningful language units, then we can probably expect a Power Law. On the other hand, there is a lot of junk in the n-gram if you looked. So … I don’t know, and it’s an empirical question.)
according to Wikipedia, I can find the prob of censored words by:
Plotting power-law distributions
In general, power-law distributions are plotted on doubly logarithmic axes, which emphasizes the upper tail region. The most convenient way to do this is via the (complementary) cumulative distribution (cdf), P(x) = Pr(X > x),
![]()
to the data
. Given a choice for xmin, a simple derivation by this method yields the estimator equation
![]()
where {xi} are the n data points
. (For a more detailed derivation, see Hall or Newman below.) This estimator exhibits a small finite sample-size bias of order O(n − 1), which is small when n > 100. Further, the uncertainty in the estimation can be derived from the maximum likelihood argument, and has the form
. This estimator is equivalent to the popular Hill estimator from quantitative finance and extreme value theory.
This makes sense, and because of the size of the corpus, mostly likely the frequency of occurance can be seen as a continuous variable (no ties) above 40, which would be the x-min.
Gotta try it sometime.
. Given a choice for
. (For a more detailed derivation, see
. This estimator is equivalent to the popular Hill estimator from