How to become an SEO HERO by fighting webspam?
Our friends who operate search engines have to find techniques to
eliminate the spam polluting their results. There is a lot of very
active research going on in this field, some by the operators
themselves, but also by academics (although the difference between the
two isn’t always clear cut).
Researchers working in Stanford’s Infolab (where Brin and Page come
from) have written the following article. As its title shows, it
summarises the main strategies used to counter spam on the web
(particularly spam invading online communities):
Paul Heymann, Georgia Koutrika, Hector Garcia-Molina. Fighting Spam on
Social Websites: A Survey of Potential Approaches and Future Challenges.
IEEE Internet Computing Special Issue on Social Search, 11(6): 36-45
So what do we discover in this interesting article (although not very
informative from a technical viewpoint)? We find a classification of the
different methods used to fight online spam: demotion, detection and
prevention. Within each, I’ve separated automatic and manual methods.
To demote (or downgrade) spam is to make sure that it doesn’t come up in
the top results of search engine returns. This doesn’t necessarily mean
that the page in question has been recognised as being spam. It merely
means that there is enough suspicion about its contents to warrant
downgrading through the use of a penalty. What kind of penalty can be
imagined? Very simple things, from penalising .info domains to limiting
the weight of connections between similar IPs, or more complicated
setups like the famous TrustRank and SpamRank. These penalties are
applied automatically; applying them by hand would amount to deliberate
spam detection, which is a different objective.
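
To make the idea of demotion concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the Page fields, the penalty values and the function names are all hypothetical, and real engines (TrustRank and SpamRank included) work very differently. The point is simply that a suspicious page keeps its place in the index but loses rank.

```python
from dataclasses import dataclass

# Hypothetical penalty factors; the values are illustrative assumptions only.
SUSPICIOUS_TLD_PENALTY = 0.5  # applied once to pages on a suspicious TLD
SAME_SUBNET_PENALTY = 0.8     # applied once per inbound link from a similar IP


@dataclass
class Page:
    url: str
    base_score: float             # relevance score before any penalty
    tld: str                      # top-level domain, e.g. "com" or "info"
    same_subnet_inlinks: int = 0  # inbound links coming from similar IPs


def demoted_score(page, suspicious_tlds=frozenset({"info"})):
    """Return the score after penalties; the page itself is never removed."""
    score = page.base_score
    if page.tld in suspicious_tlds:
        score *= SUSPICIOUS_TLD_PENALTY
    score *= SAME_SUBNET_PENALTY ** page.same_subnet_inlinks
    return score


def rank(pages):
    """Order pages by their demoted score, best first."""
    return sorted(pages, key=demoted_score, reverse=True)
```

With these made-up values, a .info page with a base score of 1.0 ends up behind a .com page with a base score of 0.8, yet both still appear in the results.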
The idea with detection is to find the spam pages in the total corpus,
which means being able to recognise a page of spam. The easiest way to
do this is by hand: a moderator removes any pages that he/she examines
and judges illegitimate. Radical action here means removing all the
pages belonging to an author identified as a spammer. Extreme action
would be, for example, to remove any websites with a name similar to
that of an identified spammer.
Automatic techniques can also be used. The best known are based on
content analysis (see works of Ntoulas, Najork, Manasse and Fetterly)
but we can also find methods based on link analysis (detecting link
farms, see works of Wu and Davison and many others) as well as on the
analysis of users’ behaviour (here I’m afraid I don’t have any
references, but I assume that it is about monitoring how often a user
publishes, etc.).
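
Since I have no reference for the behaviour-based methods, the following Python sketch is purely my own guess at what "detecting the frequency of publications" could look like. The class name, the thresholds and the sliding-window approach are all assumptions, not something taken from the article.

```python
from collections import defaultdict, deque


class PostingRateMonitor:
    """Flag users who publish more than max_posts within a sliding window."""

    def __init__(self, max_posts=5, window_seconds=60.0):
        # Both thresholds are hypothetical; a real site would tune them.
        self.max_posts = max_posts
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # user -> timestamps of recent posts

    def record(self, user, timestamp):
        """Record one post; return True if the user now looks suspicious."""
        times = self._recent[user]
        times.append(timestamp)
        # Forget posts that have fallen out of the window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_posts
```

A user who posts every second quickly trips the threshold, while someone posting once an hour never does. A flagged account would then be handed to a moderator or to one of the content-based classifiers rather than removed outright.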
The obvious problem with spam detection is that a spam page is usually
very well positioned and stays there until it is detected. This makes
life sweet for the spammer and doesn’t encourage him to give up this
job.
With prevention, the aim is to make it difficult for spam to get online
and, if it does, to make it an expensive enterprise.
Manual techniques are very simple: they make automatic interaction with
the system near impossible, forcing the spammer to spend his own time
interacting with the system by hand (or to pay people to do it). It is
also possible to have users make micropayments for each action (this
wouldn’t be painful for the normal user but would be excruciating for
the spammer who sends messages in great numbers).
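
The micropayment argument is plain arithmetic, so here is a tiny Python sketch of it. The fee and the activity levels are numbers I made up for illustration; the article gives no figures.

```python
# Hypothetical fee charged for every action, in thousandths of a cent.
FEE_PER_ACTION = 1


def daily_cost(actions_per_day, fee=FEE_PER_ACTION):
    """Total fee paid per day, in thousandths of a cent."""
    return actions_per_day * fee


normal_user = daily_cost(20)     # a handful of posts and comments (assumed)
spammer = daily_cost(1_000_000)  # a bulk campaign (assumed)
```

The cost grows linearly with volume, so with these assumed numbers the spammer pays 50,000 times what the normal user pays, for the same per-action fee.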
On the automated side we have the all-pervading CAPTCHAs, as well as
more amusing techniques like limiting access to a community (capping the
number of users, setting conditions for taking part, etc.) and
maximising the amount of personalisation. I like this last technique a
lot because the reasoning behind it is that if each user can personalise
his or her page as much as he or she likes, then there’s no more room
for spammers, seeing that trying to get into each page would be too much
like hard work.
Briefly then: the article contains a pleasant and readable synthesis of
methods used to fight spam. There is nothing new in it, but I encourage
you to take a look (it contains no difficult mathematical formulas).