Detecting Power Laws
Binning Samples
MarcusFrean and
NickChapman talked about how to detect
power laws by binning samples. See
bins.gif
This is handy for discrete distributions, and essential for real-valued ones.
If you use exponentially spaced bin boundaries, divide the bin's occupancy by its width, and then plot the log of
that against the bin number, you should get a straight line whose negative slope is the exponent
a.
Nick's going to check it out in practice, by generating artificial "powerlaw" distributions...
RoBt askz: So again I want to be clear about the rationale for exponentially sized bins. Is it because this will support the "scale-free" aspect of the distribution?
Just to check my understanding of what you are then saying: does this mean you then do
not take the log of the independent variable when plotting
to look for a straight line?
MarcusFrean sayz: the (current!) idea is to
# use exponentially sized bins,
# divide the counts by the bin-width to get "adjusted counts"
# ''plot the log of the adjusted counts versus the bin index''. Or, equivalently, the abscissa can be the log of the bin boundary.
This will be a straight line if the distribution is a powerlaw. and in that case the slope is the exponent.
RoBt: Ok, this seems reasonable. I still wonder about effects at the low end though. When we are dealing with things are that
come in discrete sizes, rather than continuous, it seems to me that the tiny bucket sizes make result in lots of noise data
(exactly the kind of thing that bucket size selection normally handles).
I am also unsure, especially at these tiny sizes, whether we are really sure what we are measuring: e.g. classes vs methods vs fields vs lines vs tokens
vs lexemes vs characters vs bits. Surely other people have faced similar issues? It doesn't seem right to change our idea of the distribution
because of edge-effects relating to the tiniest quanta in the data. Comments?
MarcusFrean: Yah, I guess you can start with your first bucket boundary at the smallest value in the dataset, make the first bin width =1, and go from there upwards. I've got a different worry - it seems like this logarithmic binning runs the risk of losing information by sticking all the data from (say) 10 to 15 or whatever into a single bin, when in reality we may have loads of excellent info on 10, 11, 12, 13, 14 and 15. Seems like we're throwing hard-earned data away, in effect.
Provided your data is positive integers though you don't need to do this binning anyway (do you?), and the standard method of ranking and then plotting on log scales (as in the scrips below, and the CACM paper) seems preferable... Agree / disagree ??
Matlab diagnostics
It looks like some of the distributions that ''aren't''
PowerLaw might be
LogNormal.
I've written some quick and dirty diagnostics for the two cases.
Put your raw data (list of samples) into a file "datafile".
Grab this matlab script:
diagnoseDatafile.m
Then it's
followed by
The 10 here is the number of bins used in the diagnosis of logNormals.
You need to fiddle with this... It's not definitive, but if you can find a number of bins such that there's (a) plenty of data points still and (b) they're on a straight line, that's very suggestive of log normal.
It draws two plots. If the LHS one is a straight line you've got a powerlaw idstribution, and if the RHS is straight it's log normal. To get a feel for things, you can make some fake data from either distribution and check the diagnosis out:
- generatePowerlawData.m - Call this with one argument, the exponent. E.g: ''generatePowerlawData(1.8) ''
- generateLognormalData.m - Call this with two arguments, the mode and its "spread". E.g: ''generateLognormalData(10, 0.5) ''
I notice that things get a little hairy for log normals that have a very low mean (nb it's got to be at least 1).
Any other observations, please collect here...
MarcusFrean
I've added a bit to Marcus' diagnose script which tries to fit a straight line to both the powerlaw and log-normal plots. It'll draw the best-fit line onto them and print out the corresponding equation.
diagnoseDatafile.m
JeromeDolman?
Also see
LogNormal distributions.