Tease rating and web crawlers

Foxhawke
Explorer At Heart
Posts: 457
Joined: Fri Aug 10, 2007 8:59 pm

Post by Foxhawke »

We were just discussing how some teases get rated within just a few minutes of being posted. This seems to happen quite often. It just hit me that this could be web crawlers doing the voting. Bots from Google, Yahoo, MSN/Live.com and such will click through all the links they find, trying to index every page (we can see them quite often in the forum's list of online users). And it just might be that some bots go for the last link (5) first, while others go for the first one (1).

This could be fixed a few different ways.

- Change the voting into a form post instead of regular links. Web crawlers don't post forms as they peek around.
- Add rel="nofollow" to the <a> tag for the rating links. This should stop most bots (if not all) from following the links.
- Add vote.php to robots.txt (this site doesn't appear to have one at all). I think all the major web crawlers honor the robots.txt file, but there may be some smaller ones that don't.

The first and last are probably the best (as in most effective) solutions, but the middle one might be easiest to implement.
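
To illustrate the middle option from the crawler's side, here is a minimal Python sketch of a link extractor that skips anything marked rel="nofollow". The page snippet, the showtease.php path and the id/vote parameter names are made up for the example; only vote.php comes from the discussion. It is a sketch of how a well-behaved crawler might treat the attribute, not how any particular search engine works.

```python
from html.parser import HTMLParser


class FollowableLinkParser(HTMLParser):
    """Collects the hrefs a well-behaved crawler would be willing to follow."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel can hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            return  # the site asked crawlers not to follow this link
        href = attributes.get("href")
        if href:
            self.links.append(href)


def followable_links(html):
    """Return every link in `html` that is not marked rel="nofollow"."""
    parser = FollowableLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page fragment: one content link and two rating links.
PAGE = (
    '<a href="/showtease.php?id=1">Read tease</a>'
    '<a rel="nofollow" href="/vote.php?id=1&amp;vote=1">1</a>'
    '<a rel="nofollow" href="/vote.php?id=1&amp;vote=5">5</a>'
)

if __name__ == "__main__":
    print(followable_links(PAGE))
```

Note that nofollow is only a hint: a crawler that ignores it will still reach the rating links, which is why the form and robots.txt options are the stronger ones.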

If you have any questions about this, like how the robots.txt works, feel free to shoot me a PM or ask here. I'll be happy to help!
all2true
Explorer At Heart
Posts: 753
Joined: Fri Feb 08, 2008 3:42 pm

Re: Tease rating and web crawlers

Post by all2true »

I think it's the large number of people who are actually on this web site. Most are not forum posters, so I am not sure how putting voting in the forums would work.
seraph0x
Administrator
Posts: 2666
Joined: Sun Jul 23, 2006 8:58 am

Re: Tease rating and web crawlers

Post by seraph0x »

The robots.txt does exist (note that you don't get a 404); it's one of the ways in which the site detects who's a bot and who's a user. (Other ways include known bot IP ranges, user agents, speed of access and, to some extent, the crawling behavior itself. For the techies: we took the bot detection built into phpBB3 and simply improved it using some additional criteria.)

For bot users, database write access is disabled, which means the bot will get the regular "Thanks for your vote", but the vote won't actually be counted. Also, bots aren't counted in terms of views for the same reason.
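
The behavior described here can be sketched roughly as follows in Python. This is not the site's actual code (which builds on phpBB3); the Session and VoteStore classes, the user-agent markers and the function names are all assumptions for illustration. It only models two of the signals mentioned in the thread (user agent and having fetched robots.txt), plus the rule that logged-in users skip the bot check.

```python
from dataclasses import dataclass, field

# Hypothetical user-agent markers; a real list would be much longer.
KNOWN_BOT_AGENTS = ("googlebot", "slurp", "msnbot")


@dataclass
class Session:
    user_agent: str
    logged_in: bool = False
    fetched_robots_txt: bool = False


@dataclass
class VoteStore:
    """Stand-in for the database: maps a tease id to its list of ratings."""
    votes: dict = field(default_factory=dict)

    def record(self, tease_id, rating):
        self.votes.setdefault(tease_id, []).append(rating)


def looks_like_bot(session):
    """Apply the simplified bot checks; registered users are never flagged."""
    if session.logged_in:
        return False
    if session.fetched_robots_txt:
        return True
    agent = session.user_agent.lower()
    return any(marker in agent for marker in KNOWN_BOT_AGENTS)


def handle_vote(session, store, tease_id, rating):
    """Bots get the same reply as users, but nothing is written for them."""
    if not looks_like_bot(session):
        store.record(tease_id, rating)
    return "Thanks for your vote"


if __name__ == "__main__":
    store = VoteStore()
    handle_vote(Session("Mozilla/5.0"), store, tease_id=1, rating=5)
    handle_vote(Session("Googlebot/2.1"), store, tease_id=1, rating=1)
    print(store.votes)
```

Returning the same message in both cases means a bot gets no signal that it has been detected, at the cost of leaving "thanks for your vote" pages reachable by crawlers.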

Teases get voted on very quickly because many of our users like to skim through teases rather than following them (at least some of the time). And with over a hundred people on the site at all times, there are always some of those folks around. Search engine bots don't actually react very fast to new teases at all; it takes them a day or two to pick 'em up and crawl them.

Part of the reason bots don't pick up new teases very quickly is that Milovana currently is extremely shitty to crawl. Most notably, search engine bots will index thousands of different ways to browse the tease listings and eventually be so swamped with tease lists that they just go "oh fuck this" and give up - without indexing much of the actual tease content. :lol:

The reason I didn't care much about optimizing the site for bots is that we get over 90% of our traffic from users who have already bookmarked Milovana. In other words, once you find this site, you're pretty much hooked for life. The rest is from a few inbound links. If we do get traffic from search engines, it's usually people entering "milovana" or "milovana.com", which is pretty much the same as if they had entered the address directly. So I'd rather work on new features for existing users than cater to the machines.

But since you brought the issue up, I might as well do some basic optimizations. The bots' votes aren't counted, but it's still not a good idea to have "thanks for your vote" pages in search results, since votes from human users who click on those results will in fact be counted. And the "getting swamped in tease browser pages" problem mentioned above wouldn't be too hard to fix either.

Edit: Now added nofollow to the voting links and blocked the vote.php in robots.txt. This should remove any voting links from search results.
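
For anyone who wants to check a rule like this, Python's standard urllib.robotparser can evaluate a robots.txt against a URL. The sketch below is illustrative only: the robots.txt content is a guess at what such a block might look like, and example.com, showtease.php, the query parameters and the "ExampleBot" user agent are placeholders, not the site's real values.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; only the vote.php path comes from the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /vote.php
"""


def allowed(url, user_agent="ExampleBot"):
    """Return True if `user_agent` may fetch `url` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The rating link should be blocked, the content page should not.
    print(allowed("https://example.com/vote.php?id=1&vote=5"))
    print(allowed("https://example.com/showtease.php?id=1"))
```

The Disallow rule is a prefix match on the path, so it covers vote.php with any query string.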

So thanks for the wakeup call! :-)
Foxhawke
Explorer At Heart
Posts: 457
Joined: Fri Aug 10, 2007 8:59 pm

Re: Tease rating and web crawlers

Post by Foxhawke »

Ah, so you had it under control. I guess I may be counted as a bot now that I tried to access robots.txt then? :D

I just figured that some of these early / quick votes could be bots, but I guess it actually is just users clicking through them really quickly.

Since it came up in chat I peeked around, and it seemed to me like there was nothing stopping bots from voting, but that's because you didn't do it the usual way (the stuff I suggested). I do understand that working on other stuff is more fun than optimizing for the bots, and since the bots don't pollute the data I guess it doesn't really matter. :)

Thanks for the reply though, always interesting to learn about new solutions to problems. :)
seraph0x
Administrator
Posts: 2666
Joined: Sun Jul 23, 2006 8:58 am

Re: Tease rating and web crawlers

Post by seraph0x »

Foxhawke wrote: Ah, so you had it under control. I guess I may be counted as a bot now that I tried to access robots.txt then? :D
For that session, yes, unless you were logged in. For registered users the "bot check" is skipped altogether.
Foxhawke wrote: I just figured that some of these early / quick votes could be bots, but I guess it actually is just users clicking through them really quickly.
There could still be bots that slip through our detection, like somebody wget-ting the site with a custom user agent. For that the only practical way would indeed be the option to use a form. Well, maybe I'll change it someday.
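
The form option amounts to accepting votes only via POST, so that merely following a link (a GET request) can never change a rating. A rough Python sketch of that rule, independent of any web framework; the function name, the id/vote parameter names and the status codes chosen are assumptions for illustration, and the 1 to 5 range follows the rating links mentioned earlier in the thread.

```python
def handle_vote_request(method, params, votes):
    """Return (status_code, body). Only POST requests may record a vote.

    `votes` is a plain dict standing in for the database:
    tease id -> list of ratings.
    """
    if method.upper() != "POST":
        # Crawlers and wget-style mirrors issue GET requests when they
        # follow links, so they end up here without touching the data.
        return 405, "Please use the rating form to vote"

    try:
        tease_id = int(params["id"])
        rating = int(params["vote"])
    except (KeyError, ValueError):
        return 400, "Missing or invalid vote parameters"

    if not 1 <= rating <= 5:
        return 400, "Rating must be between 1 and 5"

    votes.setdefault(tease_id, []).append(rating)
    return 200, "Thanks for your vote"


if __name__ == "__main__":
    votes = {}
    print(handle_vote_request("GET", {"id": "1", "vote": "5"}, votes))
    print(handle_vote_request("POST", {"id": "1", "vote": "5"}, votes))
    print(votes)
```

A script that deliberately submits the form would still get through, so this complements the bot detection rather than replacing it.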
Foxhawke
Explorer At Heart
Posts: 457
Joined: Fri Aug 10, 2007 8:59 pm

Re: Tease rating and web crawlers

Post by Foxhawke »

seraph0x wrote: For that session, yes, unless you were logged in. For registered users the "bot check" is skipped altogether.
I was logged in. So all is good then. :)
seraph0x wrote: There could still be bots that slip through our detection, like somebody wget-ting the site with a custom user agent. For that the only practical way would indeed be the option to use a form. Well, maybe I'll change it someday.
But that has to be a pretty small number. I would even imagine that most wget'ers would simply go tease by tease, incrementing the page number until they get 404s and then moving on to the next tease ID, rather than looking for links and following them to the vote page (using the mirror or crawl flags, I forget what they are called). So that may be such a rare occurrence that it's not even worth bothering with. :)