Millions of anonymous ASP.Net profiles?
UPDATE: I've just realised that we are using Google Mini Search to crawl the website so that we can support Google Search. This is bound to be creating an anonymous profile not only for each crawl but maybe even for each page - would that be possible?

Hi all, some advice needed!

Our website receives approximately 50,000 hits a day, and we use anonymous ASP.Net membership profiles/users. This is resulting in millions (currently 4.5m) of "active" profiles, and the database is 'crawling', even though we have a nightly task that cleans up all the inactive ones.

There is no way that we have 4.5m unique visitors (our county's population is only half a million); could this be caused by crawlers and spiders?

Also, if we have to live with this huge number of profiles, is there any way of optimising the DB?

Thanks

Kev

Mame answered 4/5, 2010 at 10:13 Comment(9)
What indexes do you have on your tables? Are you using the default profile provider? – Gayle
@Daniel, I have no additional indexes, just the vanilla .Net Membership setup. We are using a custom profile provider. – Mame
@Mantarok - had an idea, check updated answer. – Linseylinseywoolsey
Re: the update: yes, this could be possible. What you probably want to do is prevent the creation of the anonymous user/profile in the first place if the request comes from a crawler. Put the sample module in place and monitor the behaviour of your search appliance to get a better idea of how to proceed. – Linseylinseywoolsey
@code, one thought: will an HttpModule be fired for a request for robots.txt, seeing as it's not an ASP.Net file? – Mame
Ooh, that is a concern. I tested the module using the VS dev server, which runs everything through ASP.Net. Hmm... are your web servers running IIS7? – Linseylinseywoolsey
Am still thinking about this. Did you come to a solution yet? – Linseylinseywoolsey
I'm using your proposed solution of capturing the spider/crawler with the HttpModule, and I will then try to prune the profiles from that. Sorry, I forgot to mark this as the answer - on it now! Thanks for your help. Oh, and instead of just trapping robots.txt I'm going to check the agent name and use that to filter them. – Mame
No problem. I was just revisiting this and considering whether there was a viable way to catch robots.txt on IIS6. Glad I could help. – Linseylinseywoolsey

Update following conversation:

Might I suggest that you implement a filter that identifies crawlers via their request headers and logs the anonymous cookie; later that same day you can decrypt it and delete the anonymous aspnet_Profile and aspnet_Users records with the associated UserId.

You might be fighting a losing battle but at least you will get a clear idea of where all the traffic is coming from.
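The header-based filter suggested above could be sketched roughly as follows. This is a hypothetical module, not the poster's actual code: the bot token list, the event chosen, and the logging destination are all illustrative assumptions ("gsa-crawler" is the default agent string for the Google Search Appliance family, which includes the Google Mini).

```csharp
using System;
using System.Linq;
using System.Web;

// Hypothetical sketch of a User-Agent based crawler filter.
public class CrawlerFilterModule : IHttpModule
{
    // Illustrative token list; extend it from your own traffic logs.
    private static readonly string[] BotTokens =
        { "googlebot", "bingbot", "slurp", "crawler", "spider", "gsa-crawler" };

    public void Init(HttpApplication context)
    {
        // PostAcquireRequestState runs after the AnonymousIdentificationModule,
        // so Request.AnonymousID is populated by this point.
        context.PostAcquireRequestState += (sender, e) =>
        {
            var app = (HttpApplication)sender;
            string agent = app.Request.UserAgent ?? string.Empty;

            bool isBot = BotTokens.Any(t =>
                agent.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0);

            if (isBot && app.Request.AnonymousID != null)
            {
                // Stash the id somewhere durable (e.g. a "robots" table)
                // so a nightly job can delete the matching profile rows.
                System.Diagnostics.Trace.WriteLine(
                    agent + " -> " + app.Request.AnonymousID);
            }
        };
    }

    public void Dispose() { }
}
```

User-Agent strings are trivially spoofed, so treat this as a way to classify the bulk of the traffic rather than a watertight filter.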


AnonymousId cookies and, by proxy, anonymous profiles are valid for 90 days after last use. This can result in the anon profiles piling up.

A very simple way to handle this is to use ProfileManager.

ProfileManager.DeleteInactiveProfiles(ProfileAuthenticationOption.Anonymous, DateTime.Now.AddDays(-7));

will clear out all the anonymous profiles that have not been accessed in the last 7 days.

But that leaves you with the anonymous records in aspnet_Users. Membership does not expose a method similar to ProfileManager for deleting stale anonymous users.

So...

The best bet is a raw sql attack, deleting from aspnet_Profile where you consider them stale, and then run the same query on aspnet_User where IsAnonymous = 1.
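A minimal sketch of such a cleanup, assuming the stock aspnet_Profile and aspnet_Users tables and their standard LastUpdatedDate / LastActivityDate columns; the 7-day cutoff is illustrative, and you should back up and test on a copy first:

```sql
-- Illustrative cutoff; adjust to whatever you consider "stale".
DECLARE @cutoff DATETIME;
SET @cutoff = DATEADD(DAY, -7, GETUTCDATE());

-- Remove stale anonymous profiles first...
DELETE FROM dbo.aspnet_Profile
WHERE LastUpdatedDate < @cutoff
  AND UserId IN (SELECT UserId FROM dbo.aspnet_Users WHERE IsAnonymous = 1);

-- ...then the stale anonymous user rows left without a profile.
DELETE FROM dbo.aspnet_Users
WHERE IsAnonymous = 1
  AND LastActivityDate < @cutoff
  AND UserId NOT IN (SELECT UserId FROM dbo.aspnet_Profile);
```

Deleting profiles before users keeps the foreign-key relationship between the two tables happy.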

Good luck with that. Once you get it cleaned up, just stay on top of it.


Updated Update:

The code below only works on IIS7, and only if all requests are channelled through ASP.Net.

You could implement a module that watches for requests to robots.txt, grabs the anonymous id cookie, and stashes it in a robots table, which you can then use to safely purge robot records from your membership/profile tables every night. This might help.

Example:

using System;
using System.Diagnostics;
using System.Web;

namespace NoDomoArigatoMisterRoboto
{
    public class RobotLoggerModule : IHttpModule
    {
        #region IHttpModule Members

        public void Init(HttpApplication context)
        {
            context.PreSendRequestHeaders += PreSendRequestHeaders;
        }

        public void Dispose()
        {
            //noop
        }

        #endregion

        private static void PreSendRequestHeaders(object sender, EventArgs e)
        {
            HttpRequest request = ((HttpApplication)sender).Request;

            // A request for robots.txt is a strong signal that this is a crawler.
            bool isRobot =
                request.Url.GetLeftPart(UriPartial.Path).EndsWith("robots.txt", StringComparison.InvariantCultureIgnoreCase);

            string anonymousId = request.AnonymousID;

            if (anonymousId != null && isRobot)
            {
                // log this id for pruning later
                Trace.WriteLine(string.Format("{0} is a robot.", anonymousId));
            }
        }
    }
}

Reference: http://www.codeproject.com/Articles/39026/Exploring-Web-config-system-web-httpModules.aspx


Linseylinseywoolsey answered 4/5, 2010 at 13:46 Comment(5)
I am clearing them up, but I'm using the default inactive time, which I think is around 60 days. I can quite easily change that to 7, but the website manager would rather they stayed for as long as possible because they contain customisations to the home page. So even clearing up 60-day-old profiles still leaves 4.5 million... – Mame
@Mantorok - you are keeping anonymous customisation for users that have not visited your site for 2 months? That sounds like retention of the anal kind. Would you even remember what aesthetic changes you made to a site you visited, anonymously, 2 months ago? Just sayin'... ;-) – Linseylinseywoolsey
No, I completely agree with you; I wanted it to be a week or so, but I had to take orders. I may have to have another little 'chat' with our web manager :-) – Mame
That's an interesting update. Do you have any details on what I should be looking for in the headers? Thanks. – Mame
Thanks for that, I will look into it now. See my last update - d'oh! – Mame

You could try deleting anonymous profiles in the Session_End event in your Global.asax.cs file.
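A rough sketch of that approach, with two assumptions worth flagging: Session_End only fires with InProc session state, and Request is not available inside Session_End, so the anonymous id has to be captured at session start. The "AnonId" key is purely illustrative.

```csharp
using System;
using System.Web;
using System.Web.Profile;

// Hypothetical Global.asax.cs fragment; not tested against the poster's site.
public partial class Global : HttpApplication
{
    protected void Session_Start(object sender, EventArgs e)
    {
        // Request is available here, so remember the anonymous id for later.
        Session["AnonId"] = Request.AnonymousID;
    }

    protected void Session_End(object sender, EventArgs e)
    {
        var anonId = Session["AnonId"] as string;
        if (!string.IsNullOrEmpty(anonId))
        {
            // Anonymous profiles are keyed by the anonymous id, so this
            // removes the profile that was created for this visit.
            ProfileManager.DeleteProfile(anonId);
        }
    }
}
```

Note that this removes the aspnet_Profile row but leaves the anonymous aspnet_Users row behind, so a periodic cleanup of aspnet_Users would still be needed.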

There is every likelihood that your site is being crawled, either by a legitimate search engine crawler and/or by a malicious crawler looking for vulnerabilities that would allow hackers to take control of your site/server. You should look into this regardless of which solution you adopt for removing old profiles.

If you are using the default Profile Provider, which keeps all of the profile information in a single column, you might want to read Scott Guthrie's article on a better-performing table-based profile provider.

Gayle answered 4/5, 2010 at 11:40 Comment(0)
