2013年6月24日星期一

Prime Checker With Cache

如果需要频繁检查某个数是否是素数,可以把查过的素数缓存下来。
 
 

class PrimeCheckerWithCache:
 def __init__(self):
  self.primeSet=set([2,3])
  self.maxTested=4
 def isPrim(self, test):
  #cache hit
  if test<=self.maxTested:
   return test in self.primeSet
  #test and update cache
  for i in range(self.maxTested+1,test+1):
   if not any([i%p==0 for p in self.primeSet]):
    self.primeSet.add(i)
  self.maxTested=test
  return test in self.primeSet 
 
 
正确性证明:
不变量:primeSet中是不超过maxTested的全部素数。
初始:手动设置,成立。
保持:注意到,每次加入primeSet的,总是不能被任何已知素数整除的最小的自然数,也就是primeSet之外最小的素数。因此不变量保持。

因为素数密度很小,所以都缓存下来在规模较小时是可行的。

关于素数密度(不超过n的素数个数/n),我最近学习到有一些有意思的事实
1、素数无穷多
2、素数的密度收敛到0
3、素数的密度比平方数大:)呵呵没想到吧

2013年6月2日星期日

Demographic prediction over aggregated data set

Sometimes you may have no explicit demographic data you desired, P(D|C), where C is the content (page/video/game or whatever you offer) the visitor is consuming and D is the visitor demographic bucket, but you can get some aggregated report from third party, in this schema: P(A,D), here A is the advertisement.
If you own the ad system, you can know P(A|C) because you know where does each ad target. Even if advertisers didn't explicityly target on any content, and you are optimizting for advertiser, you can simulate so as to get P(A|C).
For a fixed demographic bucket, D, we have a linear equation:
P(A|D)=SUM{P(A|C)*P(C|D), foreach C}
We can solve (approximately) P(C|D) using least square.
Finally, we calculate P(D|C) propotional to P(C|D)P(D).
 
To evaluate if the model works, you can use April data to train a model P(C|D), and then predict P(A|D) for May, and compare it to the truth report P(A,D). Here we assume P(D) and P(C|D) is relatively stable over time, while advertisements booked in the system change frequently.
 
A trouble might be that the content in your network also change frequently, then you have to generalize your content from specific item, such as page URL to page content topic or video id to series.
 
Summarize the idea:
Audience has some demographic property and as a result, he/she is interested in some topic of content. As he/she is consuming some content, some ad is shown to him and you get a ad-demo distribution report. The model above is trying to address the connection between demographic and content topic.