machine-learning statistics normalization

Data normalization


To classify links as "good" or "best", I could use like counts from Facebook or retweet counts from Twitter. But some communities have large user bases, so their links get many more likes or retweets. How can I "normalize" these huge-community like counts against, for example, the likes on a similar news link from a much smaller community with a far lower like count?

Is this called normalizing, by the way? And in what kind of books can I learn about these kinds of "quality" algorithms (the quality of an article, for example)? What is what I am trying to do called, anyhow?


Solution

  • You could try a linear regression like this:

    Quality_of_link = alpha + B1 * Number_of_likes + B2 * User_base + error_term

    To determine the parameters (B1 and B2) for the independent variables (Number_of_likes, User_base), you could use historical data (number_of_likes; user_base; quality_of_link) and estimate the parameter values by running a linear regression. You could do this in a statistical program; good statistical programs include R-project and SPSS. (A minimal sketch of this fitting step is included at the end of this answer.)

    Important in this respect is determining Quality_of_link in an objective way. I think you could run a test by having a number of links rated, preferably by the target audience of your site, and then use the average rating each link receives on a scale (e.g. 0-100).

    After you have run the regression in your test phase, you can use it in your final model. This would then be: Quality_of_link = alpha + B1 * Number_of_likes + B2 * User_base. You could then say, for example, that a Quality_of_link above 70 is a good link and one above 90 is a best link (see the scoring sketch at the end of this answer).

    As for good textbooks, it is difficult to point you to a particular book that I haven't read myself. I would first recommend using the knowledge you already have and turning to the internet when some of it needs to be refreshed.
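
    A minimal sketch of the fitting step, assuming you have already collected historical rows of like counts, user-base sizes and an averaged 0-100 rating per link. All column names and numbers below are made-up illustration data, and plain NumPy least squares stands in for a full statistics package such as R-project or SPSS:

        import numpy as np

        # Hypothetical historical data: one row per link, rated as described above.
        number_of_likes = np.array([120, 15, 300, 45, 980, 60], dtype=float)
        user_base       = np.array([5000, 400, 20000, 1500, 80000, 2500], dtype=float)
        quality_rating  = np.array([72, 55, 68, 61, 75, 64], dtype=float)  # averaged 0-100 ratings

        # Design matrix with a column of ones so the intercept (alpha) is estimated too.
        X = np.column_stack([np.ones_like(number_of_likes), number_of_likes, user_base])

        # Ordinary least squares: finds alpha, B1, B2 minimizing the squared error term.
        coeffs, _, _, _ = np.linalg.lstsq(X, quality_rating, rcond=None)
        alpha, b1, b2 = coeffs
        print("alpha =", alpha, "B1 =", b1, "B2 =", b2)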
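
    Continuing the same sketch, the fitted alpha, b1 and b2 from above can then score a new link, using the example thresholds of 70 (good) and 90 (best) suggested earlier; the function and label names are my own, not part of any particular library:

        def quality_of_link(likes, users, alpha, b1, b2):
            """Predicted quality on the same 0-100 scale as the ratings."""
            return alpha + b1 * likes + b2 * users

        def label(score):
            """Map a predicted score to the example thresholds given above."""
            if score > 90:
                return "best"
            if score > 70:
                return "good"
            return "other"

        # Example: a link with 250 likes posted in a community of 10,000 users.
        score = quality_of_link(250, 10_000, alpha, b1, b2)
        print(score, label(score))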