websecurity@lists.webappsec.org

The Web Security Mailing List


Parallelizing the crawl

Tasos Laskos
Mon, Jan 16, 2012 5:41 PM

Hi guys, it's been a while.

I've got a tricky question for you today, and I hope we can get a
brainstorm going.

I've recently implemented a system for audit distribution in the form of
a high-performance grid (I won't self-promote); however, one area I left
alone (for the time being) was the crawl process.

See, the way it works now, the master instance performs the initial
crawl and then calculates and distributes the audit workload amongst its
slaves, but the crawl itself takes place the old-fashioned way.

As you might have guessed, the major setback is that it's not possible
to determine the workload of the crawl a priori.

I've got a couple of naive ideas to parallelize the crawl just to get me
started:

  • Assign crawl of subdomains to slaves -- no questions asked
  • Briefly scope out the webapp structure and spread the crawl of the
    distinguishable visible directories amongst the slaves.

Or even a combination of the above if applicable.

Both ideas are better than what I've got now, and there aren't any
downsides to them even if the distribution turns out to be suboptimal.
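
To illustrate the second idea, here's a rough sketch of dealing the
visible top-level directories out to slaves (Python; all of the names
are made up for illustration):

from itertools import cycle
from urllib.parse import urlparse

def top_dir(url):
    # '/foo/bar.php' -> 'foo/', '/index.html' -> '/' (the root)
    segments = urlparse(url).path.lstrip("/").split("/")
    return segments[0] + "/" if len(segments) > 1 else "/"

def partition(urls, num_slaves):
    # Group the URLs seen so far by top-level directory, then deal
    # the directories out to the slaves round-robin.
    dirs = {}
    for url in urls:
        dirs.setdefault(top_dir(url), []).append(url)
    assignments = {i: [] for i in range(num_slaves)}
    for slave_id, (_, members) in zip(cycle(range(num_slaves)),
                                      sorted(dirs.items())):
        assignments[slave_id].extend(members)
    return assignments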

I'm curious though, has anyone faced a similar problem?
Any general ideas?

Cheers,
Tasos Laskos.

Sebastian Schinzel
Mon, Jan 16, 2012 10:32 PM

Tasos,

So far, you are implying a unidirectional connection from master to slave.
What about making this bidirectional to get a kind of "on the fly" load
balancing, e.g.

  1. The master starts the crawl and sends URLs to be crawled to the slaves
  2. Slaves continue crawling until their "stack" of URLs to be crawled grows
     above a given boundary, say 20 (these are the "busy slaves")
  3. The busy slaves send the additional URLs back to the master
  4. The master sends URLs to be crawled from the busy slaves to the
     idle slaves
  5. GOTO 2
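
A minimal single-process sketch of one such round (Python; the deques
stand in for whatever transport actually connects master and slaves):

from collections import deque

BUSY_THRESHOLD = 20  # the boundary from step 2

def rebalance(master_frontier, slaves):
    # slaves: slave id -> deque of URLs that slave still has to crawl
    # Steps 2/3: busy slaves hand their overflow back to the master.
    for pending in slaves.values():
        while len(pending) > BUSY_THRESHOLD:
            master_frontier.append(pending.pop())
    # Step 4: the master feeds queued URLs to the non-busy slaves.
    for pending in slaves.values():
        while master_frontier and len(pending) < BUSY_THRESHOLD:
            pending.append(master_frontier.popleft())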

Cheers,
Sebastian


Sebastian Schinzel, M.Sc.

Universität Erlangen-Nürnberg
Lehrstuhl für Informatik 1
IT-Sicherheitsinfrastrukturen

Am Wolfmantel 46
91058 Erlangen
Germany

Tel.: +49 (0) 9131 / 85-69911
Mobil: +49 (0) 151 / 18215206
Fax: +49 (0) 9131 / 85-25319
Web: http://www1.cs.fau.de/
Email: sebastian.schinzel@cs.fau.de
Twitter: http://twitter.com/seecurity

Tasos Laskos
Tue, Jan 17, 2012 7:33 AM

Sure, that sounds good.
No reason not to remedy an unfair distribution on the fly.

What keeps bugging me is this: say you've got a site like so:

index
|
|----- foo/
|----- bar/

So we (the master) parse the index, see the structure, and start by
assigning 'foo/' to ourselves and 'bar/' to a slave.

We may leave it at that and hope for some performance gain in case
'foo/' and 'bar/' differ wildly, but 'bar/' may contain a link to
something under 'foo/'.

The slave can't know whether the master already knows of that path,
whether it should report it, or whether it should just follow it and
let the master figure out what to do with it during the analysis at
the end.

Now, if it chooses to check back with the master, that means I've got
to implement a look-up table, which is a no-no, as the latency can very
well take away any performance gains provided by the distribution.

If it chooses to follow it, then this could lead to both instances
eventually crawling pretty much the same paths.

I fully expect the solution to be a combination/compromise between all
the suggestions voiced in this thread.

Like:

  • all instances converging at intervals on the collective paths
    they've discovered, so that each can keep a local look-up cache (no
    latency introduced, since no one would rely or wait on this being up
    to date or even available)
  • if a URL isn't in the local cache and it is clearly outside an
    instance's scope, perform a queued batch look-up while the instance
    crawls paths that are indisputably its own, to mitigate the effect of
    the look-up latency.
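
Roughly, as a sketch (Python; master.batch_lookup is a placeholder for
whatever RPC would actually do the look-up):

class PathCache:
    # Local look-up cache, merged at intervals from the collective
    # crawl state; nothing ever blocks waiting for it to be fresh.
    def __init__(self):
        self.known = set()
        self.pending_lookups = []

    def merge(self, collective_paths):
        # Called on each convergence interval.
        self.known.update(collective_paths)

    def saw_out_of_scope(self, url):
        # Instead of asking the master synchronously, queue the URL for
        # a batch look-up; we keep crawling our own paths meanwhile.
        if url not in self.known:
            self.pending_lookups.append(url)

    def flush(self, master):
        # Periodic batch look-up, hiding the round-trip latency.
        batch, self.pending_lookups = self.pending_lookups, []
        if batch:
            master.batch_lookup(batch)  # placeholder RPC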

This is an interesting problem.

- Tasos


Richard Hauswald
Tue, Jan 17, 2012 1:05 PM

Tasos,
what prevents you from letting the workers pull the work from the
master instead of pushing it to them? You could then let the workers
pull work packets containing, e.g., 20 work items. When a worker has no
work left, it pushes its results to the master and pulls another work
packet.
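
As a sketch (Python; pull_work, push_results and crawl are placeholders
for whatever the grid actually exposes):

WORK_PACKET_SIZE = 20

def worker_loop(master):
    # Pull-based worker: fetch a packet, crawl it, push the results,
    # repeat until the master has nothing left to hand out.
    while True:
        packet = master.pull_work(WORK_PACKET_SIZE)  # placeholder RPC
        if not packet:
            break
        results = [crawl(url) for url in packet]     # crawl() assumed
        master.push_results(results)                 # placeholder RPC
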
Regards,
Richard


Tasos Laskos
Tue, Jan 17, 2012 1:15 PM

What prevents this is the nature of the crawl process.
What I'm trying to achieve here is not to spread the workload but to
actually find it.

I'm not interested in parsing the pages or in any other sort of
processing; I only want to gather all available paths.

So there's not really any "work" to distribute, actually.

Does this make sense?


merlyn@stonehenge.com
Wed, Jan 18, 2012 2:05 AM

Why does this sound like something I solved about a decade ago.

Oh yeah, because I did.

http://www.stonehenge.com/merlyn/LinuxMag/col16.html

But today, I'd build something with ZeroMQ.  If you haven't figured out
how cool ZeroMQ is, you haven't been paying attention.

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.posterous.com/ for Smalltalk discussion

Tasos Laskos
Wed, Jan 18, 2012 8:04 AM

First of all, I'd encourage you to get off your high horse if you decide
to reply again.
If everyone starts being cocky and saying "Look at my d*ck! You can do
pull-ups with this thing!", that'd get tiring quite quickly.

As for the merits of your response:

On 01/18/2012 04:05 AM, Randal L. Schwartz wrote:

> Why does this sound like something I solved about a decade ago.
>
> Oh yeah, because I did.
>
> http://www.stonehenge.com/merlyn/LinuxMag/col16.html

Did you?
Things get tricky once you kiss your multi-GBps, low-latency bus goodbye.

If I understood the code correctly, you're not doing anything special
and not distributing anything to remote workers, so I don't know how
well it'll scale.

As I've said before about the same approach: I'll give it a shot when
the time comes, to see how it behaves under real-world circumstances,
but I have my reservations.

> But today, I'd build something with ZeroMQ.  If you haven't figured out
> how cool ZeroMQ is, you haven't been paying attention.

I like my EventMachine-based RPC implementation, thanks.

Cheers,
Tasos L.

Andres Riancho
Mon, Jan 30, 2012 6:38 PM

Tasos,

When you create a new "crawl slave" you can assign a number to it.
Each slave has its ID, so before following a link you can do something
very simple, like adding up the ASCII codes of every character in the
URI and taking the result mod the number of slaves. If the result is
the ID of the crawl slave, the URL is followed; if not, the URL is sent
back to the master so it can be forwarded to the correct slave. With
that very simple algorithm you load-balance the crawl in a pretty good
way (since the distribution of the mod over the ASCII sums should be
pretty good).
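
As a sketch (Python; crawl() and master.route_to_owner() are
placeholders):

def responsible_slave(url, num_slaves):
    # Sum of the character codes of the URL, mod the number of slaves.
    return sum(ord(c) for c in url) % num_slaves

def on_link_found(url, my_id, num_slaves, master):
    if responsible_slave(url, num_slaves) == my_id:
        crawl(url)                  # it's ours -- follow it
    else:
        master.route_to_owner(url)  # hand it back to be routed

A proper hash (CRC32, MD5, consistent hashing) would spread the URLs
more evenly than a plain character-code sum, but the routing logic
stays the same.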


--
Andrés Riancho
Director of Web Security at Rapid7 LLC
Founder at Bonsai Information Security
Project Leader at w3af

Tasos Laskos
Mon, Jan 30, 2012 6:49 PM

Elegant, simple, easy to implement... that's enough to make me itch.
I'll certainly give it a shot, but it depends too much on dumb luck,
doesn't it?

It's probably a personal thing, but I'm way too controlling to depend
on that.

Truth be told though, I really, *really* hope that this works, as it'll
be a hell of a lot easier to maintain than my policy idea.

Thanks a lot Andres. :)


Andres Riancho
Mon, Jan 30, 2012 7:02 PM

Tasos,

Yeah, when I was reading your idea about the directories, policies,
etc., my first thought was "this is too complex and bug-friendly". The
only issue I see with my implementation idea is that it will (on
average) send a URL back to the master more often than you'd expect.
For example, in a setup with 4 crawling nodes, node #1 will keep only
1/4 of the URLs for itself, and 3/4 will be sent back to the master.
This shouldn't be an issue in most installations (but you have to
decide that), since all you need to send back to the master is the new
URL and maybe the referrer (less than 200 bytes, 130 bytes if you
compress it).

I would recommend keeping a TCP connection open the whole time for
sending this information back and forth, in order to avoid the TCP
connection-setup overhead.
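
For what it's worth, the forwarded fraction follows directly from the
routing rule; a tiny sanity check (Python):

def expected_forward_fraction(num_slaves):
    # Under roughly uniform routing, a node keeps 1/N of the URLs it
    # discovers and forwards the other (N - 1)/N to the master.
    return (num_slaves - 1) / num_slaves

print(expected_forward_fraction(4))  # -> 0.75, the 3/4 figure above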

Let me know how it works out :)

Regards,

--
Andrés Riancho
Director of Web Security at Rapid7 LLC
Founder at Bonsai Information Security
Project Leader at w3af
