Sunday, June 2, 2013

Demographic prediction over an aggregated data set

Sometimes you do not have the demographic data you actually want, P(D|C), where C is the content (page, video, game, or whatever you offer) the visitor is consuming and D is the visitor's demographic bucket. What you can get is an aggregated report from a third party in the form P(A,D), where A is the advertisement. Dividing by the bucket marginal gives P(A|D) = P(A,D)/P(D), with P(D) = SUM{P(A,D), foreach A}.
If you own the ad system, you know P(A|C), because you know where each ad is targeted. Even if advertisers did not explicitly target any content and you are optimizing delivery on their behalf, you can simulate delivery to obtain P(A|C).
For a fixed demographic bucket D, and assuming the ad shown depends on D only through the content C, we have a linear equation:
P(A|D)=SUM{P(A|C)*P(C|D), foreach C}
We can solve for P(C|D) approximately using least squares.
Finally, we compute P(D|C) as proportional to P(C|D)P(D).
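
The steps above can be sketched in a few lines of NumPy. This is only a minimal sketch under assumptions that are not in the post: the function names and the toy numbers are made up, and negative least-squares solutions are simply clipped to zero and renormalized (a constrained solver such as non-negative least squares would be a more careful choice).

```python
import numpy as np

def estimate_content_given_demo(p_a_given_c, p_a_given_d):
    """Solve P(A|D) ~= P(A|C) @ P(C|D) for one fixed bucket D.

    p_a_given_c: shape (n_ads, n_content), column c holds P(A|C=c).
    p_a_given_d: shape (n_ads,), P(A|D) for the fixed bucket.
    """
    x, *_ = np.linalg.lstsq(p_a_given_c, p_a_given_d, rcond=None)
    x = np.clip(x, 0.0, None)  # assumption: drop negative mass
    total = x.sum()
    if total == 0:
        raise ValueError("least-squares solution has no positive mass")
    return x / total  # renormalize so the result sums to 1

def demo_given_content(p_c_given_d, p_d):
    """Bayes step: P(D|C) proportional to P(C|D) * P(D).

    p_c_given_d: shape (n_content, n_demo), column d holds P(C|D=d).
    p_d: shape (n_demo,).
    Returns shape (n_content, n_demo); row c holds P(D|C=c).
    Assumes every content item has some mass in at least one bucket.
    """
    joint = p_c_given_d * p_d  # joint[c, d] = P(C=c|D=d) * P(D=d)
    return joint / joint.sum(axis=1, keepdims=True)

# Toy data (all hypothetical): 4 ads, 3 content items, 2 buckets.
p_a_given_c = np.array([[0.7, 0.1, 0.2],
                        [0.1, 0.6, 0.1],
                        [0.1, 0.2, 0.3],
                        [0.1, 0.1, 0.4]])
true_p_c_given_d = np.array([[0.5, 0.2],
                             [0.3, 0.2],
                             [0.2, 0.6]])
p_d = np.array([0.4, 0.6])

# What the third-party report would imply if the model held exactly.
p_a_given_d = p_a_given_c @ true_p_c_given_d

# One least-squares problem per demographic bucket.
estimated = np.column_stack([
    estimate_content_given_demo(p_a_given_c, p_a_given_d[:, d])
    for d in range(p_d.size)
])
p_d_given_c = demo_given_content(estimated, p_d)
print(p_d_given_c)
```

With more ads than content items and a well-conditioned P(A|C), the system is overdetermined and least squares has a unique solution; with fewer ads than content items, many P(C|D) fit equally well and the estimate should not be trusted without extra constraints.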
 
To evaluate whether the model works, you can train P(C|D) on April data, predict P(A|D) for May, and compare the prediction with the actual report P(A,D). Here we assume that P(D) and P(C|D) are relatively stable over time, while the advertisements booked in the system change frequently.
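
A hold-out check along these lines might look like the sketch below. The post does not name an error measure; total variation distance is used here purely as an assumption, and the function names and numbers are hypothetical.

```python
import numpy as np

def total_variation(p, q):
    """Half the L1 distance between two distributions (0 means identical)."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def evaluate_holdout(p_c_given_d_april, p_a_given_c_may, p_a_given_d_may):
    """Predict May's P(A|D) from April's P(C|D) and score each bucket.

    p_c_given_d_april: shape (n_content, n_demo), fitted on April.
    p_a_given_c_may:   shape (n_ads, n_content), known from the ad system.
    p_a_given_d_may:   shape (n_ads, n_demo), derived from the May report.
    Returns one total variation distance per demographic bucket.
    """
    predicted = p_a_given_c_may @ p_c_given_d_april  # (n_ads, n_demo)
    return np.array([
        total_variation(predicted[:, d], p_a_given_d_may[:, d])
        for d in range(predicted.shape[1])
    ])

# Toy numbers (hypothetical): 2 May ads, 3 content items, 2 buckets.
p_c_given_d_april = np.array([[0.5, 0.2],
                              [0.3, 0.2],
                              [0.2, 0.6]])
p_a_given_c_may = np.array([[0.6, 0.2, 0.1],
                            [0.4, 0.8, 0.9]])
p_a_given_d_may = np.array([[0.40, 0.22],
                            [0.60, 0.78]])

errors = evaluate_holdout(p_c_given_d_april, p_a_given_c_may, p_a_given_d_may)
print(errors)
```

A small distance for every bucket supports the stability assumption; a large distance for one bucket suggests that P(C|D) drifted for that bucket or that the fit was poor to begin with.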
 
One difficulty is that the content in your network may also change frequently. In that case you have to generalize from specific items to coarser groups, for example from page URL to page topic, or from video id to series.
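
One way to do this generalization is to roll impression counts up from items to topics before estimating P(A|C). The sketch below assumes you have raw (ad, content) impression counts and an item-to-topic mapping; both data structures, the "unknown" fallback topic, and all names are hypothetical.

```python
from collections import defaultdict

def ad_given_topic(impressions, topic_of):
    """Aggregate item-level impressions into P(ad | topic).

    impressions: dict mapping (ad, content_id) -> impression count.
    topic_of:    dict mapping content_id -> topic label.
    Returns a dict mapping topic -> {ad: P(ad | topic)}.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for (ad, content_id), n in impressions.items():
        # Items without a known topic fall into a catch-all group.
        topic = topic_of.get(content_id, "unknown")
        counts[topic][ad] += n

    result = {}
    for topic, per_ad in counts.items():
        total = sum(per_ad.values())
        result[topic] = {ad: n / total for ad, n in per_ad.items()}
    return result

# Toy data (hypothetical URLs and ads).
impressions = {
    ("ad1", "/news/a"): 30,
    ("ad2", "/news/a"): 10,
    ("ad1", "/news/b"): 10,
    ("ad2", "/sports/x"): 50,
}
topic_of = {"/news/a": "news", "/news/b": "news", "/sports/x": "sports"}

p_ad_given_topic = ad_given_topic(impressions, topic_of)
print(p_ad_given_topic)
```

The topic then plays the role of C in the linear equation, so new pages or videos only need a topic label to be covered by an already fitted model.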
 
To summarize the idea:
A visitor has some demographic properties and, as a result, is interested in certain content topics. While the visitor consumes content, ads are shown, and you get an ad-demographic distribution report. The model above tries to recover the connection between demographics and content topics.
