websecurity@lists.webappsec.org

The Web Security Mailing List


Re: [WEB SECURITY] Parallelizing the crawl

RH
Richard Hauswald
Tue, Jan 17, 2012 2:20 PM

Btw, did you intentionally send this e-mail privately?

No, clicked the wrong button inside gmail - sorry.

I thought your intent was to divide URLs by subdomains, run a crawl,
"Briefly scope out the webapp structure and spread the crawl of the
distinguishable visible directories amongst the slaves". This would
not redistribute newly discovered URLs and would require a special
initial setup step (dividing URLs by subdomains). And by pushing tasks
to the workers you'd have to take care of the load on the workers,
which means you'd have to implement a scheduling / load-balancing
policy. By pulling the work this would happen automatically. To make
use of multi-core CPUs you could make your workers detect the system's
number of CPU cores and spawn worker threads in a defined ratio. You
could also run a worker on the master host. This should lead to good
load balancing by default. So it's not really ignoring scheduling details.
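
For illustration only, a minimal Python sketch of that pull model (the
queue stands in for the master and all names are made up), with one
worker thread per detected CPU core:

    # Rough sketch, not taken from any real tool: workers pull URLs from a
    # shared queue, one worker thread per CPU core. Idle threads simply block
    # on the queue, so load balancing falls out of the pull model by itself.
    import os
    import queue
    import threading

    def fetch_and_extract(url):
        # Placeholder for the real page fetch + link extraction.
        return {"url": url, "new_links": []}

    work_queue = queue.Queue()                 # stands in for the master
    work_queue.put("http://example.com/")      # seed URL (made up)
    results, results_lock = [], threading.Lock()

    def worker():
        while True:
            try:
                url = work_queue.get(timeout=2)    # pull instead of push
            except queue.Empty:
                return                             # simplistic termination
            page = fetch_and_extract(url)
            with results_lock:
                results.append(page)
            for link in page["new_links"]:
                work_queue.put(link)               # feed back discovered URLs

    threads = [threading.Thread(target=worker)
               for _ in range(os.cpu_count() or 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()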

On Tue, Jan 17, 2012 at 2:55 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

Yep, that's similar to what I previously posted with the difference that the
master in my system won't be a slacker so he'll be the lucky guy to grab the
seed URL.

After that things are pretty much the same -- ignoring the scheduling
details etc.

Btw, did you intentionally send this e-mail privately?

On 01/17/2012 03:48 PM, Richard Hauswald wrote:

Yes, I didn't understand the nature of your problem the first time ...

So, you could still use the pull principle. To make it simple I'll not
consider work packets/batching. This could be used later to further
improve performance by reducing "latency" / "seek" times.

So what about the following process:

  1. Create a bunch of workers
  2. Create a list of URLs which can be considered the work queue,
    initially filled with one element: the landing page URL in state NEW.
  3. Let all your workers poll the master for a single work item in
    state NEW (pay attention to synchronize this step on the master). One
    of them is the lucky guy and gets the landing page URL. The master
    will update the work item to state PROCESSING (you may append a
    starting time, which could be used for reassigning already assigned
    work items after a timeout). All the other workers will still be idle.
  4. The lucky guy parses the page for new URLs and does whatever else it
    should do.
  5. The lucky guy posts the results + the parsed URLs to the master.
  6. The master stores the results, pushes the new URLs into the work
    queue with state NEW and updates the work item to state COMPLETED. If
    there is only one new URL we are not lucky, but if there are 10 we'd
    now have 10 work items to distribute.
  7. Continue until all work items are in state COMPLETED.

Does this make sense?
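
A rough Python sketch of that state machine (illustrative only; the
class and field names are invented and the timeout value is arbitrary):

    import threading
    import time

    NEW, PROCESSING, COMPLETED = "NEW", "PROCESSING", "COMPLETED"

    class WorkQueue:
        """Master-side queue of URLs with per-item state, as in steps 2-7."""
        def __init__(self, seed_url, timeout=60):
            self.items = {seed_url: {"state": NEW, "started": None}}
            self.timeout = timeout
            self.lock = threading.Lock()       # step 3: synchronized hand-out

        def poll(self):
            """Hand out one NEW (or timed-out PROCESSING) item, or None."""
            with self.lock:
                now = time.time()
                for url, item in self.items.items():
                    stale = (item["state"] == PROCESSING
                             and now - item["started"] > self.timeout)
                    if item["state"] == NEW or stale:
                        item["state"], item["started"] = PROCESSING, now
                        return url
                return None

        def complete(self, url, new_urls):
            """Step 6: mark the item done and enqueue newly discovered URLs."""
            with self.lock:
                self.items[url]["state"] = COMPLETED
                for u in new_urls:
                    self.items.setdefault(u, {"state": NEW, "started": None})

        def done(self):
            """Step 7: everything has reached state COMPLETED."""
            with self.lock:
                return all(i["state"] == COMPLETED for i in self.items.values())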

On Tue, Jan 17, 2012 at 2:15 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

What prevents this is the nature of the crawl process.
What I'm trying to achieve here is not to spread the workload but
actually to find it.

I'm not interested in parsing the pages or any sort of processing, but
only in gathering all available paths.

So there's not really any "work" to distribute, actually.

Does this make sense?

On 01/17/2012 03:05 PM, Richard Hauswald wrote:

Tasos,
what prevents you from letting the workers pull the work from the master
instead of pushing it to the workers? Then you could let the workers
pull work packets containing e.g. 20 work items. After a worker has no
work left, it will push the results to the master and pull another
work packet.
Regards,
Richard

On Mon, Jan 16, 2012 at 6:41 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

Hi guys, it's been a while.

I've got a tricky question for you today and I hope that we sort of
get a brainstorm going.

I've recently implemented a system for audit distribution in the form
of a high-performance grid (won't self-promote), however an area which
I left alone (for the time being) was the crawl process.

See, the way it works now is the master instance performs the initial
crawl and then calculates and distributes the audit workload amongst
its slaves, but the crawl takes place the old fashioned way.

As you might have guessed, the major setback is caused by the fact
that it's not possible to determine the workload of the crawl a priori.

I've got a couple of naive ideas to parallelize the crawl just to get
me started:
 * Assign crawl of subdomains to slaves -- no questions asked
 * Briefly scope out the webapp structure and spread the crawl of the
   distinguishable visible directories amongst the slaves.

Or even a combination of the above if applicable.

Both ideas are better than what I've got now and there aren't any
downsides to them even if the distribution turns out to be suboptimal.

I'm curious though, has anyone faced a similar problem?
Any general ideas?

Cheers,
Tasos Laskos.


The Web Security Mailing List

WebSecurity RSS Feed
http://www.webappsec.org/rss/websecurity.rss

Join WASC on LinkedIn http://www.linkedin.com/e/gis/83336/4B20E4374DBA

WASC on Twitter
http://twitter.com/wascupdates

websecurity@lists.webappsec.org

http://lists.webappsec.org/mailman/listinfo/websecurity_lists.webappsec.org

TL
Tasos Laskos
Tue, Jan 17, 2012 2:44 PM

You're right, it does sound good but it would still suffer from the same
problem, workers visiting the same pages.

Although this could be somewhat mitigated the same way I described in an
earlier post (which I think hasn't been moderated yet).

I've bookmarked it for future reference, thanks man.

On 01/17/2012 04:20 PM, Richard Hauswald wrote:

Btw, did you intentionally send this e-mail privately?

No, clicked the wrong button inside gmail - sorry.

I thought your intent was to divide URLs by subdomains, run a crawl,
"Briefly scope out the webapp structure and spread the crawl of the
distinguishable visible directories amongst the slaves". This would
not redistribute newly discovered URLs and would require a special
initial setup step (dividing URLs by subdomains). And by pushing tasks
to the workers you'd have to take care of the load on the workers,
which means you'd have to implement a scheduling / load-balancing
policy. By pulling the work this would happen automatically. To make
use of multi-core CPUs you could make your workers detect the system's
number of CPU cores and spawn worker threads in a defined ratio. You
could also run a worker on the master host. This should lead to good
load balancing by default. So it's not really ignoring scheduling details.

On Tue, Jan 17, 2012 at 2:55 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

Yep, that's similar to what I previously posted with the difference that the
master in my system won't be a slacker so he'll be the lucky guy to grab the
seed URL.

After that things are pretty much the same -- ignoring the scheduling
details etc.

Btw, did you intentionally send this e-mail privately?

On 01/17/2012 03:48 PM, Richard Hauswald wrote:

Yes, I didn't understand the nature of your problem the first time ...

So, you could still use the pull principle. To make it simple I'll not
consider work packets/batching. This could be used later to further
improve performance by reducing "latency" / "seek" times.

So what about the following process:

  1. Create a bunch of workers
  2. Create a list of URLs which can be considered the work queue,
    initially filled with one element: the landing page URL in state NEW.
  3. Let all your workers poll the master for a single work item in
    state NEW (pay attention to synchronize this step on the master). One
    of them is the lucky guy and gets the landing page URL. The master
    will update the work item to state PROCESSING (you may append a
    starting time, which could be used for reassigning already assigned
    work items after a timeout). All the other workers will still be idle.
  4. The lucky guy parses the page for new URLs and does whatever else it
    should do.
  5. The lucky guy posts the results + the parsed URLs to the master.
  6. The master stores the results, pushes the new URLs into the work
    queue with state NEW and updates the work item to state COMPLETED. If
    there is only one new URL we are not lucky, but if there are 10 we'd
    now have 10 work items to distribute.
  7. Continue until all work items are in state COMPLETED.

Does this make sense?

On Tue, Jan 17, 2012 at 2:15 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

What prevents this is the nature of the crawl process.
What I'm trying to achieve here is not to spread the workload but
actually to find it.

I'm not interested in parsing the pages or any sort of processing, but
only in gathering all available paths.

So there's not really any "work" to distribute, actually.

Does this make sense?

On 01/17/2012 03:05 PM, Richard Hauswald wrote:

Tasos,
what prevents you from letting the workers pull the work from the master
instead of pushing it to the workers? Then you could let the workers
pull work packets containing e.g. 20 work items. After a worker has no
work left, it will push the results to the master and pull another
work packet.
Regards,
Richard

On Mon, Jan 16, 2012 at 6:41 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

Hi guys, it's been a while.

I've got a tricky question for you today and I hope that we sort of
get a brainstorm going.

I've recently implemented a system for audit distribution in the form
of a high-performance grid (won't self-promote), however an area which
I left alone (for the time being) was the crawl process.

See, the way it works now is the master instance performs the initial
crawl and then calculates and distributes the audit workload amongst
its slaves, but the crawl takes place the old fashioned way.

As you might have guessed, the major setback is caused by the fact
that it's not possible to determine the workload of the crawl a priori.

I've got a couple of naive ideas to parallelize the crawl just to get
me started:

  • Assign crawl of subdomains to slaves -- no questions asked
  • Briefly scope out the webapp structure and spread the crawl of the
    distinguishable visible directories amongst the slaves.

Or even a combination of the above if applicable.

Both ideas are better than what I've got now and there aren't any
downsides to them even if the distribution turns out to be suboptimal.

I'm curious though, has anyone faced a similar problem?
Any general ideas?

Cheers,
Tasos Laskos.



RH
Richard Hauswald
Tue, Jan 17, 2012 3:01 PM

Why would workers visit the same pages?

On Tue, Jan 17, 2012 at 3:44 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

You're right, it does sound good but it would still suffer from the same
problem, workers visiting the same pages.

Although this could be somewhat mitigated the same way I described in an
earlier post (which I think hasn't been moderated yet).

I've bookmarked it for future reference, thanks man.

On 01/17/2012 04:20 PM, Richard Hauswald wrote:

Btw, did you intentionally send this e-mail privately?

No, clicked the wrong button inside gmail - sorry.

I thought your intent was to divide URLs by subdomains, run a crawl,
"Briefly scope out the webapp structure and spread the crawl of the
distinguishable visible directories amongst the slaves". This would
not redistribute newly discovered URLs and would require a special
initial setup step (dividing URLs by subdomains). And by pushing tasks
to the workers you'd have to take care of the load on the workers,
which means you'd have to implement a scheduling / load-balancing
policy. By pulling the work this would happen automatically. To make
use of multi-core CPUs you could make your workers detect the system's
number of CPU cores and spawn worker threads in a defined ratio. You
could also run a worker on the master host. This should lead to good
load balancing by default. So it's not really ignoring scheduling details.

On Tue, Jan 17, 2012 at 2:55 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

Yep, that's similar to what I previously posted with the difference that
the
master in my system won't be a slacker so he'll be the lucky guy to grab
the
seed URL.

After that things are pretty much the same -- ignoring the scheduling
details etc.

Btw, did you intentionally send this e-mail privately?

On 01/17/2012 03:48 PM, Richard Hauswald wrote:

Yes, I didn't understand the nature of your problem the first time ...

So, you could still use the pull principle. To make it simple I'll not
consider work packets/batching. This could be used later to further
improve performance by reducing "latency" / "seek" times.

So what about the following process:

  1. Create a bunch of workers
  2. Create a list of URLs which can be considered the work queue,
    initially filled with one element: the landing page URL in state NEW.
  3. Let all your workers poll the master for a single work item in
    state NEW (pay attention to synchronize this step on the master). One
    of them is the lucky guy and gets the landing page URL. The master
    will update the work item to state PROCESSING (you may append a
    starting time, which could be used for reassigning already assigned
    work items after a timeout). All the other workers will still be idle.
  4. The lucky guy parses the page for new URLs and does whatever else it
    should do.
  5. The lucky guy posts the results + the parsed URLs to the master.
  6. The master stores the results, pushes the new URLs into the work
    queue with state NEW and updates the work item to state COMPLETED. If
    there is only one new URL we are not lucky, but if there are 10 we'd
    now have 10 work items to distribute.
  7. Continue until all work items are in state COMPLETED.

Does this make sense?

On Tue, Jan 17, 2012 at 2:15 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

What prevents this is the nature of the crawl process.
What I'm trying to achieve here is not to spread the workload but
actually to find it.

I'm not interested in parsing the pages or any sort of processing, but
only in gathering all available paths.

So there's not really any "work" to distribute, actually.

Does this make sense?

On 01/17/2012 03:05 PM, Richard Hauswald wrote:

Tasos,
what prevents you from letting the workers pull the work from the master
instead of pushing it to the workers? Then you could let the workers
pull work packets containing e.g. 20 work items. After a worker has no
work left, it will push the results to the master and pull another
work packet.
Regards,
Richard

On Mon, Jan 16, 2012 at 6:41 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

Hi guys, it's been a while.

I've got a tricky question for you today and I hope that we sort of
get a brainstorm going.

I've recently implemented a system for audit distribution in the form
of a high-performance grid (won't self-promote), however an area which
I left alone (for the time being) was the crawl process.

See, the way it works now is the master instance performs the initial
crawl and then calculates and distributes the audit workload amongst
its slaves, but the crawl takes place the old fashioned way.

As you might have guessed, the major setback is caused by the fact
that it's not possible to determine the workload of the crawl a priori.

I've got a couple of naive ideas to parallelize the crawl just to get
me started:
 * Assign crawl of subdomains to slaves -- no questions asked
 * Briefly scope out the webapp structure and spread the crawl of the
   distinguishable visible directories amongst the slaves.

Or even a combination of the above if applicable.

Both ideas are better than what I've got now and there aren't any
downsides to them even if the distribution turns out to be suboptimal.

I'm curious though, has anyone faced a similar problem?
Any general ideas?

Cheers,
Tasos Laskos.



TL
Tasos Laskos
Tue, Jan 17, 2012 3:07 PM

Well, websites are a crazy mesh (or mess, sometimes) and lots of pages
link to other pages so workers will eventually end up being redundant.

Of course, I'm basing my response on the assumption that your model has
the workers actually crawl and not simply visit a given number of pages,
parse and then send back the paths they've extracted.
If so then pardon me, I misunderstood.
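
For what it's worth, the usual mitigation is master-side de-duplication:
normalize every reported URL and only enqueue it the first time it is
seen. A rough Python sketch (names invented, normalization deliberately
simplistic):

    from urllib.parse import urldefrag, urlsplit, urlunsplit

    class SeenUrls:
        """Master-side dedup: normalize URLs and enqueue each one only once."""
        def __init__(self):
            self.seen = set()

        @staticmethod
        def normalize(url):
            url, _ = urldefrag(url)            # drop #fragments
            parts = urlsplit(url)
            return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                               parts.path or "/", parts.query, ""))

        def should_enqueue(self, url):
            """Return True only the first time a (normalized) URL shows up."""
            key = self.normalize(url)
            if key in self.seen:
                return False
            self.seen.add(key)
            return True

That way duplicate visits become a de-duplication problem at the master
rather than something each worker has to worry about.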

On 01/17/2012 05:01 PM, Richard Hauswald wrote:

Why would workers visit the same pages?

On Tue, Jan 17, 2012 at 3:44 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

You're right, it does sound good but it would still suffer from the same
problem, workers visiting the same pages.

Although this could be somewhat mitigated the same way I described in an
earlier post (which I think hasn't been moderated yet).

I've bookmarked it for future reference, thanks man.

On 01/17/2012 04:20 PM, Richard Hauswald wrote:

Btw, did you intentionally send this e-mail privately?

No, clicked the wrong button inside gmail - sorry.

I thought your intent was to divide URLs by subdomains, run a crawl,
"Briefly scope out the webapp structure and spread the crawl of the
distinguishable visible directories amongst the slaves". This would
not redistribute newly discovered URLs and would require a special
initial setup step (dividing URLs by subdomains). And by pushing tasks
to the workers you'd have to take care of the load on the workers,
which means you'd have to implement a scheduling / load-balancing
policy. By pulling the work this would happen automatically. To make
use of multi-core CPUs you could make your workers detect the system's
number of CPU cores and spawn worker threads in a defined ratio. You
could also run a worker on the master host. This should lead to good
load balancing by default. So it's not really ignoring scheduling details.

On Tue, Jan 17, 2012 at 2:55 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

Yep, that's similar to what I previously posted with the difference that
the
master in my system won't be a slacker so he'll be the lucky guy to grab
the
seed URL.

After that things are pretty much the same -- ignoring the scheduling
details etc.

Btw, did you intentionally send this e-mail privately?

On 01/17/2012 03:48 PM, Richard Hauswald wrote:

Yes, I didn't understand the nature of your problem the first time ...

So, you could still use the pull principle. To make it simple I'll not
consider work packets/batching. This could be used later to further
improve performance by reducing "latency" / "seek" times.

So what about the following process:

  1. Create a bunch of workers
  2. Create a list of URLs which can be considered the work queue,
    initially filled with one element: the landing page URL in state NEW.
  3. Let all your workers poll the master for a single work item in
    state NEW (pay attention to synchronize this step on the master). One
    of them is the lucky guy and gets the landing page URL. The master
    will update the work item to state PROCESSING (you may append a
    starting time, which could be used for reassigning already assigned
    work items after a timeout). All the other workers will still be idle.
  4. The lucky guy parses the page for new URLs and does whatever else it
    should do.
  5. The lucky guy posts the results + the parsed URLs to the master.
  6. The master stores the results, pushes the new URLs into the work
    queue with state NEW and updates the work item to state COMPLETED. If
    there is only one new URL we are not lucky, but if there are 10 we'd
    now have 10 work items to distribute.
  7. Continue until all work items are in state COMPLETED.

Does this make sense?

On Tue, Jan 17, 2012 at 2:15 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

What prevents this is the nature of the crawl process.
What I'm trying to achieve here is not to spread the workload but
actually to find it.

I'm not interested in parsing the pages or any sort of processing, but
only in gathering all available paths.

So there's not really any "work" to distribute, actually.

Does this make sense?

On 01/17/2012 03:05 PM, Richard Hauswald wrote:

Tasos,
what prevents you from letting the workers pull the work from the master
instead of pushing it to the workers? Then you could let the workers
pull work packets containing e.g. 20 work items. After a worker has no
work left, it will push the results to the master and pull another
work packet.
Regards,
Richard

On Mon, Jan 16, 2012 at 6:41 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

Hi guys, it's been a while.

I've got a tricky question for you today and I hope that we sort of
get a brainstorm going.

I've recently implemented a system for audit distribution in the form
of a high-performance grid (won't self-promote), however an area which
I left alone (for the time being) was the crawl process.

See, the way it works now is the master instance performs the initial
crawl and then calculates and distributes the audit workload amongst
its slaves, but the crawl takes place the old fashioned way.

As you might have guessed, the major setback is caused by the fact
that it's not possible to determine the workload of the crawl a priori.

I've got a couple of naive ideas to parallelize the crawl just to get
me started:

  • Assign crawl of subdomains to slaves -- no questions asked
  • Briefly scope out the webapp structure and spread the crawl of the
    distinguishable visible directories amongst the slaves.

Or even a combination of the above if applicable.

Both ideas are better than what I've got now and there aren't any
downsides to them even if the distribution turns out to be suboptimal.

I'm curious though, has anyone faced a similar problem?
Any general ideas?

Cheers,
Tasos Laskos.



RH
Richard Hauswald
Tue, Jan 17, 2012 3:45 PM

Yeah, you are right. URLs should be unique in the work queue.
Otherwise - in case of circular links between the pages - you could
end up in an endless loop :-o

Whether a worker should just extract paths or do a full crawl depends
on the duration of a full crawl. I can think of 3 different ways,
depending on your situation:

  1. Do a full crawl and post back the results + extracted paths.
  2. Have workers do 2 different jobs: one job is to extract paths, one
    job is to do the actual crawl.
  3. Get a path and post back the whole page content so the master can
    store it. Then have one worker pool assigned for extracting paths and
    one for a full crawl, both based on the stored page content.
    But this really depends on the network speed, the load the workers
    create on the web application being crawled (in case it's not just a
    simple HTML-file-based web site), the duration of a full crawl and
    the number of different paths in the application.

Is there still something I missed, or would one of 1, 2 or 3 solve your problem?
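
As a rough illustration of option 1, a worker's fetch-and-extract step
could look something like this in Python (the link extraction is
deliberately naive and the master endpoints are invented, not part of
any real API):

    import re
    import urllib.request
    from urllib.parse import urljoin

    HREF_RE = re.compile(r"""href=["']([^"']+)["']""", re.IGNORECASE)

    def crawl_one(url):
        """Fetch a single page and return (body, absolute_links)."""
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
        links = [urljoin(url, href) for href in HREF_RE.findall(body)]
        return body, links

    # Worker loop around it (pseudocode, hypothetical master API):
    #   url = master.pull_work()
    #   body, links = crawl_one(url)
    #   master.post_result(url, body, links)   # master de-duplicates + enqueues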

On Tue, Jan 17, 2012 at 4:07 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

Well, websites are a crazy mesh (or mess, sometimes) and lots of pages link
to other pages so workers will eventually end up being redundant.

Of course, I'm basing my response on the assumption that your model has the
workers actually crawl and not simply visit a given number of pages, parse
and then send back the paths they've extracted.
If so then pardon me, I misunderstood.

On 01/17/2012 05:01 PM, Richard Hauswald wrote:

Why would workers visit the same pages?

On Tue, Jan 17, 2012 at 3:44 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

You're right, it does sound good but it would still suffer from the same
problem, workers visiting the same pages.

Although this could be somewhat mitigated the same way I described in an
earlier post (which I think hasn't been moderated yet).

I've bookmarked it for future reference, thanks man.

On 01/17/2012 04:20 PM, Richard Hauswald wrote:

Btw, did you intentionally send this e-mail privately?

No, clicked the wrong button inside gmail - sorry.

I thought your intent was to divide URLs by subdomains, run a crawl,
"Briefly scope out the webapp structure and spread the crawl of the
distinguishable visible directories amongst the slaves". This would
not redistribute newly discovered URLs and would require a special
initial setup step (dividing URLs by subdomains). And by pushing tasks
to the workers you'd have to take care of the load on the workers,
which means you'd have to implement a scheduling / load-balancing
policy. By pulling the work this would happen automatically. To make
use of multi-core CPUs you could make your workers detect the system's
number of CPU cores and spawn worker threads in a defined ratio. You
could also run a worker on the master host. This should lead to good
load balancing by default. So it's not really ignoring scheduling details.

On Tue, Jan 17, 2012 at 2:55 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

Yep, that's similar to what I previously posted with the difference
that
the
master in my system won't be a slacker so he'll be the lucky guy to
grab
the
seed URL.

After that things are pretty much the same -- ignoring the scheduling
details etc.

Btw, did you intentionally send this e-mail privately?

On 01/17/2012 03:48 PM, Richard Hauswald wrote:

Yes, I didn't understand the nature of your problem the first time ...

So, you could still use the pull principle. To make it simple I'll not
consider work packets/batching. This could be used later to further
improve performance by reducing "latency" / "seek" times.

So what about the following process:

  1. Create a bunch of workers
  2. Create a list of URLs which can be considered the work queue,
    initially filled with one element: the landing page URL in state NEW.
  3. Let all your workers poll the master for a single work item in
    state NEW (pay attention to synchronize this step on the master). One
    of them is the lucky guy and gets the landing page URL. The master
    will update the work item to state PROCESSING (you may append a
    starting time, which could be used for reassigning already assigned
    work items after a timeout). All the other workers will still be idle.
  4. The lucky guy parses the page for new URLs and does whatever else it
    should do.
  5. The lucky guy posts the results + the parsed URLs to the master.
  6. The master stores the results, pushes the new URLs into the work
    queue with state NEW and updates the work item to state COMPLETED. If
    there is only one new URL we are not lucky, but if there are 10 we'd
    now have 10 work items to distribute.
  7. Continue until all work items are in state COMPLETED.

Does this make sense?

On Tue, Jan 17, 2012 at 2:15 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

What prevents this is the nature of the crawl process.
What I'm trying to achieve here is not to spread the workload but
actually to find it.

I'm not interested in parsing the pages or any sort of processing, but
only in gathering all available paths.

So there's not really any "work" to distribute, actually.

Does this make sense?

On 01/17/2012 03:05 PM, Richard Hauswald wrote:

Tasos,
what prevents you from letting the workers pull the work from the master
instead of pushing it to the workers? Then you could let the workers
pull work packets containing e.g. 20 work items. After a worker has
no
work left, it will push the results to the master and pull another
work packet.
Regards,
Richard

On Mon, Jan 16, 2012 at 6:41 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

Hi guys, it's been a while.

I've got a tricky question for you today and I hope that we sort of
get a brainstorm going.

I've recently implemented a system for audit distribution in the form
of a high-performance grid (won't self-promote), however an area which
I left alone (for the time being) was the crawl process.

See, the way it works now is the master instance performs the initial
crawl and then calculates and distributes the audit workload amongst
its slaves, but the crawl takes place the old fashioned way.

As you might have guessed, the major setback is caused by the fact
that it's not possible to determine the workload of the crawl a priori.

I've got a couple of naive ideas to parallelize the crawl just to get
me started:
 * Assign crawl of subdomains to slaves -- no questions asked
 * Briefly scope out the webapp structure and spread the crawl of the
   distinguishable visible directories amongst the slaves.

Or even a combination of the above if applicable.

Both ideas are better than what I've got now and there aren't any
downsides to them even if the distribution turns out to be suboptimal.

I'm curious though, has anyone faced a similar problem?
Any general ideas?

Cheers,
Tasos Laskos.



TL
Tasos Laskos
Tue, Jan 17, 2012 3:54 PM

I've leaned towards #2 from the get go and the following could help
reduce redundancy -- from a previous message:

All instances converge at intervals on the collective paths they've
discovered, so that each can keep a local look-up cache (no latency is
introduced, since no one would rely on or wait for this cache to be up to
date or even available).

This info could be pulled from the master along with the links to follow.

It's a weird problem to solve efficiently, this one... :)
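
For illustration, a rough Python sketch of the worker side of that idea --
the names (worker_pass, fetch, extract_paths) are made up and just stand in
for whatever the worker already does:

def worker_pass(links, collective_paths, seen, fetch, extract_paths):
    """One pull/report cycle of a worker keeping a local look-up cache.

    Staleness of the cache is fine -- the master dedups authoritatively.
    """
    seen.update(collective_paths)          # opportunistic cache refresh from the master
    discovered = []
    for url in links:
        for path in extract_paths(fetch(url)):
            if path not in seen:           # skip paths another instance already found
                seen.add(path)
                discovered.append(path)
    return discovered                      # to be posted back to the master

# usage (hypothetical): new_paths = worker_pass(links, master_paths, seen, fetch, extract_paths)
# then report new_paths back to the master, which dedups again before queuing them.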

On 01/17/2012 05:45 PM, Richard Hauswald wrote:

Yeah, you are right. URL's should be unique in the work queue.
Otherwise - in case of circular links between the pages - you could
end up in an endless loop :-o

If a worker should just extract paths or do a full crawl depends on
the duration of a full crawl. I can think of 3 different ways,
depending on your situation:

  1. Do a full crawl and post back results + extracted paths
  2. Have workers do 2 different jobs, 1 job is to extract paths, 1 job
    is to do the actual crawl
  3. Get a path and post back the whole page content so the master can
    store it. Then have a worker pool assigned for extracting paths and
    one for a full crawl, both based on the stored page content.
    But this really depends on the network speed, the load the workers
    create on the web application to crawl(in case its not just a simple
    html file based web site), the duration of a full crawl and the number
    of different paths in the application.

Is there still something I missed or would one of 1,2,3 solve your problem?

On Tue, Jan 17, 2012 at 4:07 PM, Tasos Laskostasos.laskos@gmail.com  wrote:

Well, websites are a crazy mesh (or mess, sometimes) and lots of pages link
to other pages so workers will eventually end up being redundant.

Of course, I'm basing my response on the assumption that your model has the
workers actually crawl and not simply visit a given number of pages, parse
and then sent back the paths they've extracted.
If so then pardon me, I misunderstood.

On 01/17/2012 05:01 PM, Richard Hauswald wrote:

Why would workers visiting the same pages?

On Tue, Jan 17, 2012 at 3:44 PM, Tasos Laskostasos.laskos@gmail.com
wrote:

You're right, it does sound good but it would still suffer from the same
problem, workers visiting the same pages.

Although this could be somewhat mitigated the same way I described in an
earlier post (which I think hasn't been moderated yet).

I've bookmarked it for future reference, thanks man.

On 01/17/2012 04:20 PM, Richard Hauswald wrote:

Btw, did you intentionally send this e-mail privately?

No, clicked the wrong button inside gmail - sorry.

I thought your intend was to divide URLs by subdomains, run a crawl,
"Briefly scope out the webapp structure and spread the crawl of the
distinguishable visible directories amongst the slaves". This would
not redistribute new discovered URLs and require a special initial
setup step (divide URLs by subdomains). And by pushing tasks to the
workers you'd have to take care about the load of the workers which
means you'd have to implement a scheduling / load balancing policy. By
pulling the work this would happen automatically. To make use of multi
core CPUs you could make your workers scan for the systems count of
CPU cores and spawn workers threads in a defined ratio. You could also
run a worker on the master host. This should lead to a good load
balancing by default. So it's not really ignoring scheduling details.

On Tue, Jan 17, 2012 at 2:55 PM, Tasos Laskostasos.laskos@gmail.com
wrote:

Yep, that's similar to what I previously posted with the difference
that
the
master in my system won't be a slacker so he'll be the lucky guy to
grab
the
seed URL.

After that things are pretty much the same -- ignoring the scheduling
details etc.

Btw, did you intentionally send this e-mail privately?

On 01/17/2012 03:48 PM, Richard Hauswald wrote:

Yes, I didn't understand the nature of your problem the first time ...

So, you could still use the pull principle. To make it simple I'll not
consider work packets/batching. This could be used later to further
improve performance by reducing "latency" / "seek" times.

So what about the following process:

  1. Create a bunch of workers
  2. Create a List of URLs which can be considered the work queue.
    Initially filled with one element: the landing page URL in state NEW.
  3. Let all your workers poll the master for a single work item in
    state NEW(pay attention to synchronize this step on the master). One
    of them is the lucky guy and gets the landing page URL. The master
    will update work item to state PROCESSING( you may append a starting
    time, which could be used for reassigning already assigned work items
    after a timeout). All the other workers will still be idle.
  4. The lucky guy parses the page for new URLs and does whatever it
    should also do.
  5. The lucky guy posts the results + the parsed URLs to the master.
  6. The master stores the results, pushes the new URLs into the work
    queue with state NEW and updates the work item to state COMPLETED. If
    there is only one new URL we are not lucky but if there are 10 we'd
    have now 10 work items to distribute.
  7. Continue until all work items are in state COMPLETED.

Does this make sense?

On Tue, Jan 17, 2012 at 2:15 PM, Tasos Laskostasos.laskos@gmail.com
wrote:

What prevents this is the nature of the crawl process.
What I'm trying to achieve here is not spread the workload but
actually
find
it.

I'm not interested in parsing the pages or any sort of processing but
only
gather all available paths.

So there's not really any "work" to distribute actually.

Does this make sense?

On 01/17/2012 03:05 PM, Richard Hauswald wrote:

Tasos,
what prevents you from let the workers pull the work from the master
instead of pushing it to the workers? Then you could let the workers
pull work packets containing e.g. 20 work items. After a worker has
no
work left, it will push the results to the master and pull another
work packet.
Regards,
Richard

On Mon, Jan 16, 2012 at 6:41 PM, Tasos
Laskostasos.laskos@gmail.com
wrote:

Hi guys, it's been a while.

I've got a tricky question for you today and I hope that we sort of
get
a
brainstorm going.

I've recently implemented a system for audit distribution in the
form
of
a
high performance grid (won't self-promote) however an area which I
left
alone (for the time being) was the crawl process.

See, the way it works now is the master instance performs the
initial
crawl
and then calculates and distributes the audit workload amongst its
slaves
but the crawl takes place the old fashioned way.

As you might have guessed the major set back is caused by the fact
that
it's
not possible to determine the workload of the crawl a priori.

I've got a couple of naive ideas to parallelize the crawl just to
get
me
started:

  • Assign crawl of subdomains to slaves -- no questions asked
  • Briefly scope out the webapp structure and spread the crawl of
    the
    distinguishable visible directories amongst the slaves.

Or even a combination of the above if applicable.

Both ideas are better than what I've got now and there aren't any
downsides
to them even if the distribution turns out to be suboptimal.

I'm curious though, has anyone faced a similar problem?
Any general ideas?

Cheers,
Tasos Laskos.


The Web Security Mailing List

WebSecurity RSS Feed
http://www.webappsec.org/rss/websecurity.rss

Join WASC on LinkedIn
http://www.linkedin.com/e/gis/83336/4B20E4374DBA

WASC on Twitter
http://twitter.com/wascupdates

websecurity@lists.webappsec.org

http://lists.webappsec.org/mailman/listinfo/websecurity_lists.webappsec.org

R
Ray
Tue, Jan 17, 2012 5:00 PM

How about this? (comes to mind after reading the previous posts)

Master: only distributes URLs to crawl (crawl pool).  Responsible for
local lookup/deduplication of URLs before they enter the crawl pool.  The
lookup/dedup mechanism can also be used to generate the final list of
crawled URLs.

Slaves: only crawl, extract URLs and report them back to the master.

Iteration #1:
Master is seeded with only one URL (let's say), which is the root/starting
URL for the site.
Master performs local lookup/deduplication, nothing to dedup (only one URL).
Master distributes the URL in the crawl pool to a slave (the number of
slaves to use depends on the max number of URLs to crawl/process per slave).
Slave crawls, extracts and reports extracted URLs to master.

Iteration #2...#n:
Master gets reports of new URLs from slaves.
Master performs local lookup/deduplication, adding unrecognized URLs to
crawl pool and local lookup table.
Master distributes URLs in crawl pool to corresponding number of slaves.
Slaves crawl, extract and report extracted URLs to master.
(Exit condition: crawl pool empty after all working slaves have finished
their current task and dedup completed)
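
To make the flow concrete, a minimal single-process Python sketch of the
master's loop -- the SITE dict and crawl() are made-up stand-ins for real
slaves fetching pages, not an actual implementation:

# Hypothetical link graph standing in for the target site.
SITE = {
    "/":    ["/a", "/b"],
    "/a":   ["/b", "/a/1"],
    "/b":   ["/", "/a"],
    "/a/1": [],
}

def crawl(url):
    # Slave's job: crawl the URL, extract links, report them back.
    return SITE.get(url, [])

def master(seed):
    lookup = {seed}                        # dedup table; also the final list of crawled URLs
    pool = [seed]                          # crawl pool
    while pool:                            # exit: pool empty and all slaves have reported
        batch, pool = pool, []
        for url in batch:                  # in reality, distributed across slaves
            for found in crawl(url):
                if found not in lookup:    # local lookup/dedup before entering the pool
                    lookup.add(found)
                    pool.append(found)
    return sorted(lookup)

print(master("/"))                         # ['/', '/a', '/a/1', '/b']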

Regards,
Ray

On Tue, Jan 17, 2012 at 11:54 PM, Tasos Laskos tasos.laskos@gmail.comwrote:

I've leaned towards #2 from the get go and the following could help reduce
redundancy -- from a previous message:

All instances converging in intervals about the collective paths they've
discovered in order for each to keep a local look-up cache (no latency
introduced since no-one would rely/wait on this to be up to date or even
available)

This info could be pulled along with the links to follow from the master.

It's a weird problem to solve efficiently this one... :)

On 01/17/2012 05:45 PM, Richard Hauswald wrote:

Yeah, you are right. URL's should be unique in the work queue.
Otherwise - in case of circular links between the pages - you could
end up in an endless loop :-o

If a worker should just extract paths or do a full crawl depends on
the duration of a full crawl. I can think of 3 different ways,
depending on your situation:

  1. Do a full crawl and post back results + extracted paths
  2. Have workers do 2 different jobs, 1 job is to extract paths, 1 job
    is to do the actual crawl
  3. Get a path and post back the whole page content so the master can
    store it. Then have a worker pool assigned for extracting paths and
    one for a full crawl, both based on the stored page content.
    But this really depends on the network speed, the load the workers
    create on the web application to crawl(in case its not just a simple
    html file based web site), the duration of a full crawl and the number
    of different paths in the application.

Is there still something I missed or would one of 1,2,3 solve your
problem?

On Tue, Jan 17, 2012 at 4:07 PM, Tasos Laskostasos.laskos@gmail.com
wrote:

Well, websites are a crazy mesh (or mess, sometimes) and lots of pages
link
to other pages so workers will eventually end up being redundant.

Of course, I'm basing my response on the assumption that your model has
the
workers actually crawl and not simply visit a given number of pages,
parse
and then sent back the paths they've extracted.
If so then pardon me, I misunderstood.

On 01/17/2012 05:01 PM, Richard Hauswald wrote:

Why would workers visiting the same pages?

On Tue, Jan 17, 2012 at 3:44 PM, Tasos Laskostasos.laskos@gmail.com
wrote:

You're right, it does sound good but it would still suffer from the
same
problem, workers visiting the same pages.

Although this could be somewhat mitigated the same way I described in
an
earlier post (which I think hasn't been moderated yet).

I've bookmarked it for future reference, thanks man.

On 01/17/2012 04:20 PM, Richard Hauswald wrote:

Btw, did you intentionally send this e-mail privately?

No, clicked the wrong button inside gmail - sorry.

I thought your intend was to divide URLs by subdomains, run a crawl,
"Briefly scope out the webapp structure and spread the crawl of the
distinguishable visible directories amongst the slaves". This would
not redistribute new discovered URLs and require a special initial
setup step (divide URLs by subdomains). And by pushing tasks to the
workers you'd have to take care about the load of the workers which
means you'd have to implement a scheduling / load balancing policy. By
pulling the work this would happen automatically. To make use of multi
core CPUs you could make your workers scan for the systems count of
CPU cores and spawn workers threads in a defined ratio. You could also
run a worker on the master host. This should lead to a good load
balancing by default. So it's not really ignoring scheduling details.

On Tue, Jan 17, 2012 at 2:55 PM, Tasos Laskostasos.laskos@gmail.com
wrote:

Yep, that's similar to what I previously posted with the difference
that
the
master in my system won't be a slacker so he'll be the lucky guy to
grab
the
seed URL.

After that things are pretty much the same -- ignoring the scheduling
details etc.

Btw, did you intentionally send this e-mail privately?

On 01/17/2012 03:48 PM, Richard Hauswald wrote:

Yes, I didn't understand the nature of your problem the first time
...

So, you could still use the pull principle. To make it simple I'll
not
consider work packets/batching. This could be used later to further
improve performance by reducing "latency" / "seek" times.

So what about the following process:

  1. Create a bunch of workers
  2. Create a List of URLs which can be considered the work queue.
    Initially filled with one element: the landing page URL in state
    NEW.
  3. Let all your workers poll the master for a single work item in
    state NEW(pay attention to synchronize this step on the master). One
    of them is the lucky guy and gets the landing page URL. The master
    will update work item to state PROCESSING( you may append a starting
    time, which could be used for reassigning already assigned work
    items
    after a timeout). All the other workers will still be idle.
  4. The lucky guy parses the page for new URLs and does whatever it
    should also do.
  5. The lucky guy posts the results + the parsed URLs to the master.
  6. The master stores the results, pushes the new URLs into the work
    queue with state NEW and updates the work item to state COMPLETED.
    If
    there is only one new URL we are not lucky but if there are 10 we'd
    have now 10 work items to distribute.
  7. Continue until all work items are in state COMPLETED.

Does this make sense?

On Tue, Jan 17, 2012 at 2:15 PM, Tasos Laskos<
tasos.laskos@gmail.com>
wrote:

What prevents this is the nature of the crawl process.
What I'm trying to achieve here is not spread the workload but
actually
find
it.

I'm not interested in parsing the pages or any sort of processing
but
only
gather all available paths.

So there's not really any "work" to distribute actually.

Does this make sense?

On 01/17/2012 03:05 PM, Richard Hauswald wrote:

Tasos,
what prevents you from let the workers pull the work from the
master
instead of pushing it to the workers? Then you could let the
workers
pull work packets containing e.g. 20 work items. After a worker
has
no
work left, it will push the results to the master and pull another
work packet.
Regards,
Richard

On Mon, Jan 16, 2012 at 6:41 PM, Tasos
Laskostasos.laskos@gmail.com
wrote:

Hi guys, it's been a while.

I've got a tricky question for you today and I hope that we sort
of
get
a
brainstorm going.

I've recently implemented a system for audit distribution in the
form
of
a
high performance grid (won't self-promote) however an area which
I
left
alone (for the time being) was the crawl process.

See, the way it works now is the master instance performs the
initial
crawl
and then calculates and distributes the audit workload amongst
its
slaves
but the crawl takes place the old fashioned way.

As you might have guessed the major set back is caused by the
fact
that
it's
not possible to determine the workload of the crawl a priori.

I've got a couple of naive ideas to parallelize the crawl just to
get
me
started:

  • Assign crawl of subdomains to slaves -- no questions asked
  • Briefly scope out the webapp structure and spread the crawl of
    the
    distinguishable visible directories amongst the slaves.

Or even a combination of the above if applicable.

Both ideas are better than what I've got now and there aren't any
downsides
to them even if the distribution turns out to be suboptimal.

I'm curious though, has anyone faced a similar problem?
Any general ideas?

Cheers,
Tasos Laskos.

_______________________________________________
The Web Security Mailing List

WebSecurity RSS Feed
http://www.webappsec.org/rss/websecurity.rss

Join WASC on LinkedIn
http://www.linkedin.com/e/gis/83336/4B20E4374DBA

WASC on Twitter
http://twitter.com/wascupdates

websecurity@lists.webappsec.org

http://lists.webappsec.org/mailman/listinfo/websecurity_lists.webappsec.org

RD
Ryan Dewhurst
Tue, Jan 17, 2012 5:31 PM

Just an idea:

Use search engine results and sitemaps to predict the set of pages up
front and then distribute them. There are obvious disadvantages to this
method, but it could be used.
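
For example, something along these lines could seed the master's crawl pool
from a standard sitemaps.org sitemap before the crawl proper starts (rough
sketch, standard library only; seed_from_sitemap is a made-up name):

# Seed the crawl pool from /sitemap.xml; sitemap format per sitemaps.org.
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def seed_from_sitemap(base_url):
    with urllib.request.urlopen(base_url.rstrip("/") + "/sitemap.xml") as resp:
        tree = ET.parse(resp)
    # Every <loc> entry becomes an initial work item for the master's pool.
    return [loc.text.strip() for loc in tree.findall(".//sm:loc", NS) if loc.text]

# usage (hypothetical): pool = seed_from_sitemap("http://example.com")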

On Tue, Jan 17, 2012 at 3:54 PM, Tasos Laskos tasos.laskos@gmail.com wrote:

I've leaned towards #2 from the get go and the following could help reduce
redundancy -- from a previous message:

All instances converging in intervals about the collective paths they've
discovered in order for each to keep a local look-up cache (no latency
introduced since no-one would rely/wait on this to be up to date or even
available)

This info could be pulled along with the links to follow from the master.

It's a weird problem to solve efficiently this one... :)

On 01/17/2012 05:45 PM, Richard Hauswald wrote:

Yeah, you are right. URL's should be unique in the work queue.
Otherwise - in case of circular links between the pages - you could
end up in an endless loop :-o

If a worker should just extract paths or do a full crawl depends on
the duration of a full crawl. I can think of 3 different ways,
depending on your situation:

  1. Do a full crawl and post back results + extracted paths
  2. Have workers do 2 different jobs, 1 job is to extract paths, 1 job
    is to do the actual crawl
  3. Get a path and post back the whole page content so the master can
    store it. Then have a worker pool assigned for extracting paths and
    one for a full crawl, both based on the stored page content.
    But this really depends on the network speed, the load the workers
    create on the web application to crawl(in case its not just a simple
    html file based web site), the duration of a full crawl and the number
    of different paths in the application.

Is there still something I missed or would one of 1,2,3 solve your
problem?

On Tue, Jan 17, 2012 at 4:07 PM, Tasos Laskostasos.laskos@gmail.com
 wrote:

Well, websites are a crazy mesh (or mess, sometimes) and lots of pages
link
to other pages so workers will eventually end up being redundant.

Of course, I'm basing my response on the assumption that your model has
the
workers actually crawl and not simply visit a given number of pages,
parse
and then sent back the paths they've extracted.
If so then pardon me, I misunderstood.

On 01/17/2012 05:01 PM, Richard Hauswald wrote:

Why would workers visiting the same pages?

On Tue, Jan 17, 2012 at 3:44 PM, Tasos Laskostasos.laskos@gmail.com
 wrote:

You're right, it does sound good but it would still suffer from the
same
problem, workers visiting the same pages.

Although this could be somewhat mitigated the same way I described in
an
earlier post (which I think hasn't been moderated yet).

I've bookmarked it for future reference, thanks man.

On 01/17/2012 04:20 PM, Richard Hauswald wrote:

Btw, did you intentionally send this e-mail privately?

No, clicked the wrong button inside gmail - sorry.

I thought your intend was to divide URLs by subdomains, run a crawl,
"Briefly scope out the webapp structure and spread the crawl of the
distinguishable visible directories amongst the slaves". This would
not redistribute new discovered URLs and require a special initial
setup step (divide URLs by subdomains). And by pushing tasks to the
workers you'd have to take care about the load of the workers which
means you'd have to implement a scheduling / load balancing policy. By
pulling the work this would happen automatically. To make use of multi
core CPUs you could make your workers scan for the systems count of
CPU cores and spawn workers threads in a defined ratio. You could also
run a worker on the master host. This should lead to a good load
balancing by default. So it's not really ignoring scheduling details.

On Tue, Jan 17, 2012 at 2:55 PM, Tasos Laskostasos.laskos@gmail.com
 wrote:

Yep, that's similar to what I previously posted with the difference
that
the
master in my system won't be a slacker so he'll be the lucky guy to
grab
the
seed URL.

After that things are pretty much the same -- ignoring the scheduling
details etc.

Btw, did you intentionally send this e-mail privately?

On 01/17/2012 03:48 PM, Richard Hauswald wrote:

Yes, I didn't understand the nature of your problem the first time
...

So, you could still use the pull principle. To make it simple I'll
not
consider work packets/batching. This could be used later to further
improve performance by reducing "latency" / "seek" times.

So what about the following process:

  1. Create a bunch of workers
  2. Create a List of URLs which can be considered the work queue.
    Initially filled with one element: the landing page URL in state
    NEW.
  3. Let all your workers poll the master for a single work item in
    state NEW(pay attention to synchronize this step on the master). One
    of them is the lucky guy and gets the landing page URL. The master
    will update work item to state PROCESSING( you may append a starting
    time, which could be used for reassigning already assigned work
    items
    after a timeout). All the other workers will still be idle.
  4. The lucky guy parses the page for new URLs and does whatever it
    should also do.
  5. The lucky guy posts the results + the parsed URLs to the master.
  6. The master stores the results, pushes the new URLs into the work
    queue with state NEW and updates the work item to state COMPLETED.
    If
    there is only one new URL we are not lucky but if there are 10 we'd
    have now 10 work items to distribute.
  7. Continue until all work items are in state COMPLETED.

Does this make sense?

On Tue, Jan 17, 2012 at 2:15 PM, Tasos
Laskostasos.laskos@gmail.com
 wrote:

What prevents this is the nature of the crawl process.
What I'm trying to achieve here is not spread the workload but
actually
find
it.

I'm not interested in parsing the pages or any sort of processing
but
only
gather all available paths.

So there's not really any "work" to distribute actually.

Does this make sense?

On 01/17/2012 03:05 PM, Richard Hauswald wrote:

Tasos,
what prevents you from let the workers pull the work from the
master
instead of pushing it to the workers? Then you could let the
workers
pull work packets containing e.g. 20 work items. After a worker
has
no
work left, it will push the results to the master and pull another
work packet.
Regards,
Richard

On Mon, Jan 16, 2012 at 6:41 PM, Tasos
Laskostasos.laskos@gmail.com
 wrote:

Hi guys, it's been a while.

I've got a tricky question for you today and I hope that we sort
of
get
a
brainstorm going.

I've recently implemented a system for audit distribution in the
form
of
a
high performance grid (won't self-promote) however an area which
I
left
alone (for the time being) was the crawl process.

See, the way it works now is the master instance performs the
initial
crawl
and then calculates and distributes the audit workload amongst
its
slaves
but the crawl takes place the old fashioned way.

As you might have guessed the major set back is caused by the
fact
that
it's
not possible to determine the workload of the crawl a priori.

I've got a couple of naive ideas to parallelize the crawl just to
get
me
started:
 * Assign crawl of subdomains to slaves -- no questions asked
 * Briefly scope out the webapp structure and spread the crawl of
the
distinguishable visible directories amongst the slaves.

Or even a combination of the above if applicable.

Both ideas are better than what I've got now and there aren't any
downsides
to them even if the distribution turns out to be suboptimal.

I'm curious though, has anyone faced a similar problem?
Any general ideas?

Cheers,
Tasos Laskos.


The Web Security Mailing List

WebSecurity RSS Feed
http://www.webappsec.org/rss/websecurity.rss

Join WASC on LinkedIn
http://www.linkedin.com/e/gis/83336/4B20E4374DBA

WASC on Twitter
http://twitter.com/wascupdates

websecurity@lists.webappsec.org

http://lists.webappsec.org/mailman/listinfo/websecurity_lists.webappsec.org

Just an idea: Using search engine results and sitemaps to predict and then distribute the pages. There are obvious disadvantages to this method but could be used. On Tue, Jan 17, 2012 at 3:54 PM, Tasos Laskos <tasos.laskos@gmail.com> wrote: > I've leaned towards #2 from the get go and the following could help reduce > redundancy -- from a previous message: > > > All instances converging in intervals about the collective paths they've > discovered in order for each to keep a local look-up cache (no latency > introduced since no-one would rely/wait on this to be up to date or even > available) > > This info could be pulled along with the links to follow from the master. > > It's a weird problem to solve efficiently this one... :) > > > On 01/17/2012 05:45 PM, Richard Hauswald wrote: >> >> Yeah, you are right. URL's should be unique in the work queue. >> Otherwise - in case of circular links between the pages - you could >> end up in an endless loop :-o >> >> If a worker should just extract paths or do a full crawl depends on >> the duration of a full crawl. I can think of 3 different ways, >> depending on your situation: >> 1. Do a full crawl and post back results + extracted paths >> 2. Have workers do 2 different jobs, 1 job is to extract paths, 1 job >> is to do the actual crawl >> 3. Get a path and post back the whole page content so the master can >> store it. Then have a worker pool assigned for extracting paths and >> one for a full crawl, both based on the stored page content. >> But this really depends on the network speed, the load the workers >> create on the web application to crawl(in case its not just a simple >> html file based web site), the duration of a full crawl and the number >> of different paths in the application. >> >> Is there still something I missed or would one of 1,2,3 solve your >> problem? >> >> On Tue, Jan 17, 2012 at 4:07 PM, Tasos Laskos<tasos.laskos@gmail.com> >>  wrote: >>> >>> Well, websites are a crazy mesh (or mess, sometimes) and lots of pages >>> link >>> to other pages so workers will eventually end up being redundant. >>> >>> Of course, I'm basing my response on the assumption that your model has >>> the >>> workers actually crawl and not simply visit a given number of pages, >>> parse >>> and then sent back the paths they've extracted. >>> If so then pardon me, I misunderstood. >>> >>> >>> On 01/17/2012 05:01 PM, Richard Hauswald wrote: >>>> >>>> >>>> Why would workers visiting the same pages? >>>> >>>> On Tue, Jan 17, 2012 at 3:44 PM, Tasos Laskos<tasos.laskos@gmail.com> >>>>  wrote: >>>>> >>>>> >>>>> You're right, it does sound good but it would still suffer from the >>>>> same >>>>> problem, workers visiting the same pages. >>>>> >>>>> Although this could be somewhat mitigated the same way I described in >>>>> an >>>>> earlier post (which I think hasn't been moderated yet). >>>>> >>>>> I've bookmarked it for future reference, thanks man. >>>>> >>>>> >>>>> On 01/17/2012 04:20 PM, Richard Hauswald wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Btw, did you intentionally send this e-mail privately? >>>>>> >>>>>> >>>>>> >>>>>> No, clicked the wrong button inside gmail - sorry. >>>>>> >>>>>> I thought your intend was to divide URLs by subdomains, run a crawl, >>>>>> "Briefly scope out the webapp structure and spread the crawl of the >>>>>> distinguishable visible directories amongst the slaves". This would >>>>>> not redistribute new discovered URLs and require a special initial >>>>>> setup step (divide URLs by subdomains). 
And by pushing tasks to the >>>>>> workers you'd have to take care about the load of the workers which >>>>>> means you'd have to implement a scheduling / load balancing policy. By >>>>>> pulling the work this would happen automatically. To make use of multi >>>>>> core CPUs you could make your workers scan for the systems count of >>>>>> CPU cores and spawn workers threads in a defined ratio. You could also >>>>>> run a worker on the master host. This should lead to a good load >>>>>> balancing by default. So it's not really ignoring scheduling details. >>>>>> >>>>>> On Tue, Jan 17, 2012 at 2:55 PM, Tasos Laskos<tasos.laskos@gmail.com> >>>>>>  wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Yep, that's similar to what I previously posted with the difference >>>>>>> that >>>>>>> the >>>>>>> master in my system won't be a slacker so he'll be the lucky guy to >>>>>>> grab >>>>>>> the >>>>>>> seed URL. >>>>>>> >>>>>>> After that things are pretty much the same -- ignoring the scheduling >>>>>>> details etc. >>>>>>> >>>>>>> Btw, did you intentionally send this e-mail privately? >>>>>>> >>>>>>> >>>>>>> On 01/17/2012 03:48 PM, Richard Hauswald wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Yes, I didn't understand the nature of your problem the first time >>>>>>>> ... >>>>>>>> >>>>>>>> So, you could still use the pull principle. To make it simple I'll >>>>>>>> not >>>>>>>> consider work packets/batching. This could be used later to further >>>>>>>> improve performance by reducing "latency" / "seek" times. >>>>>>>> >>>>>>>> So what about the following process: >>>>>>>> 1. Create a bunch of workers >>>>>>>> 2. Create a List of URLs which can be considered the work queue. >>>>>>>> Initially filled with one element: the landing page URL in state >>>>>>>> NEW. >>>>>>>> 3. Let all your workers poll the master for a single work item in >>>>>>>> state NEW(pay attention to synchronize this step on the master). One >>>>>>>> of them is the lucky guy and gets the landing page URL. The master >>>>>>>> will update work item to state PROCESSING( you may append a starting >>>>>>>> time, which could be used for reassigning already assigned work >>>>>>>> items >>>>>>>> after a timeout). All the other workers will still be idle. >>>>>>>> 4. The lucky guy parses the page for new URLs and does whatever it >>>>>>>> should also do. >>>>>>>> 5. The lucky guy posts the results + the parsed URLs to the master. >>>>>>>> 6. The master stores the results, pushes the new URLs into the work >>>>>>>> queue with state NEW and updates the work item to state COMPLETED. >>>>>>>> If >>>>>>>> there is only one new URL we are not lucky but if there are 10 we'd >>>>>>>> have now 10 work items to distribute. >>>>>>>> 7. Continue until all work items are in state COMPLETED. >>>>>>>> >>>>>>>> Does this make sense? >>>>>>>> >>>>>>>> On Tue, Jan 17, 2012 at 2:15 PM, Tasos >>>>>>>> Laskos<tasos.laskos@gmail.com> >>>>>>>>  wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> What prevents this is the nature of the crawl process. >>>>>>>>> What I'm trying to achieve here is not spread the workload but >>>>>>>>> actually >>>>>>>>> find >>>>>>>>> it. >>>>>>>>> >>>>>>>>> I'm not interested in parsing the pages or any sort of processing >>>>>>>>> but >>>>>>>>> only >>>>>>>>> gather all available paths. >>>>>>>>> >>>>>>>>> So there's not really any "work" to distribute actually. >>>>>>>>> >>>>>>>>> Does this make sense? 
On 01/17/2012 03:05 PM, Richard Hauswald wrote:

Tasos,
What prevents you from letting the workers pull the work from the master instead of pushing it to the workers? Then you could let the workers pull work packets containing e.g. 20 work items. After a worker has no work left, it will push the results to the master and pull another work packet.
Regards,
Richard

On Mon, Jan 16, 2012 at 6:41 PM, Tasos Laskos <tasos.laskos@gmail.com> wrote:

Hi guys, it's been a while.

I've got a tricky question for you today and I hope that we sort of get a brainstorm going.

I've recently implemented a system for audit distribution in the form of a high-performance grid (won't self-promote); however, an area which I left alone (for the time being) was the crawl process.

See, the way it works now is that the master instance performs the initial crawl and then calculates and distributes the audit workload amongst its slaves, but the crawl itself takes place the old-fashioned way.

As you might have guessed, the major setback is caused by the fact that it's not possible to determine the workload of the crawl a priori.

I've got a couple of naive ideas to parallelize the crawl just to get me started:
 * Assign the crawl of subdomains to slaves -- no questions asked
 * Briefly scope out the webapp structure and spread the crawl of the distinguishable visible directories amongst the slaves.

Or even a combination of the above, if applicable.

Both ideas are better than what I've got now, and there aren't any downsides to them even if the distribution turns out to be suboptimal.

I'm curious though, has anyone faced a similar problem?
Any general ideas?

Cheers,
Tasos Laskos.
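For illustration only, a rough sketch of the two "naive" partitioning ideas from the original post, optionally seeded from a sitemap as suggested at the top of the thread; the function names and the sitemap handling are assumptions, not part of any existing tool:

    # Bucket known URLs by subdomain and, within a host, by the first
    # visible directory, then deal the buckets out to slaves round-robin.
    # Seed URLs could come from a sitemap, as suggested in the thread.
    # Everything here is purely illustrative.
    from urllib.parse import urlparse
    from collections import defaultdict
    from itertools import cycle
    import xml.etree.ElementTree as ET

    def seeds_from_sitemap(xml_text):
        """Pull <loc> entries out of a standard sitemap.xml document."""
        ns = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
        root = ET.fromstring(xml_text)
        return [loc.text.strip() for loc in root.iter(ns + "loc") if loc.text]

    def partition(urls, slaves):
        """Group URLs by (subdomain, top-level directory) and assign each
        group to one slave, so a whole directory stays with one worker."""
        buckets = defaultdict(list)
        for url in urls:
            parsed = urlparse(url)
            path = parsed.path.strip("/")
            top_dir = path.split("/", 1)[0] if path else ""
            buckets[(parsed.netloc, top_dir)].append(url)

        assignment = {slave: [] for slave in slaves}
        for slave, (_, bucket) in zip(cycle(slaves), sorted(buckets.items())):
            assignment[slave].extend(bucket)
        return assignment

    # e.g. partition(seeds_from_sitemap(open("sitemap.xml").read()),
    #                ["slave-1", "slave-2", "slave-3"])

The disadvantage raised in the thread still applies: pages in one directory routinely link into another, so a split like this only bounds the initial distribution, and duplicates still have to be caught by the master's de-duplicated queue.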