Hm... that's the same thing Richard was saying at some point, isn't it?
Certainly one of the techniques to try.
On 01/17/2012 07:00 PM, Ray wrote:
How about this? (comes to mind after reading the previous posts)

Master: only distributes URLs to crawl (crawl pool). Responsible for local lookup/deduplication of URLs before they enter the crawl pool. The lookup/dedup mechanism can also be used to generate the final list of crawled URLs.

Slaves: only crawl, extract URLs and report them back to the master.

Iteration #1:
Master is seeded with only one URL (let's say), which is the root/starting URL for the site.
Master performs local lookup/deduplication; nothing to dedup (only one URL).
Master distributes the URL in the crawl pool to a slave (the number of slaves to use depends on the max number of URLs to crawl/process per slave).
Slave crawls, extracts and reports extracted URLs to the master.

Iteration #2...#n:
Master gets reports of new URLs from slaves.
Master performs local lookup/deduplication, adding unrecognized URLs to the crawl pool and the local lookup table.
Master distributes URLs in the crawl pool to the corresponding number of slaves.
Slaves crawl, extract and report extracted URLs to the master.

(Exit condition: crawl pool empty after all working slaves have finished their current task and dedup has completed)
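
To make the loop concrete, here's a rough single-process Python sketch of what the master would be doing (crawl_batch stands in for handing a batch to a slave and collecting its report, so the names and the batching are purely illustrative):

    from collections import deque

    def run_master(seed_url, crawl_batch, urls_per_slave=50):
        seen = {seed_url}              # local lookup table (doubles as the final URL list)
        pool = deque([seed_url])       # crawl pool

        while pool:                    # exit condition: pool empty, all reports processed
            batch = [pool.popleft() for _ in range(min(urls_per_slave, len(pool)))]
            for url in crawl_batch(batch):   # slave crawls, extracts, reports back
                if url not in seen:          # lookup/dedup before entering the pool
                    seen.add(url)
                    pool.append(url)
        return sorted(seen)

With real slaves the batch would of course be fanned out over the wire and the dedup pass would run once all of them have reported back.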
Regards,
Ray
On Tue, Jan 17, 2012 at 11:54 PM, Tasos Laskos <tasos.laskos@gmail.com> wrote:
I've leaned towards #2 from the get-go and the following could help reduce redundancy -- from a previous message:

All instances converging at intervals on the collective paths they've discovered, so that each can keep a local look-up cache (no latency introduced, since no one would rely/wait on this being up to date or even available).

This info could be pulled along with the links to follow from the master.
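
In worker terms it could look something like this rough sketch (pull_work, crawl_and_extract and report are made-up stand-ins for the actual RPCs; pull_work is assumed to return both the next batch and the master's currently known paths as a set):

    def worker_loop(pull_work, crawl_and_extract, report):
        local_cache = set()                    # local look-up cache of known paths
        while True:
            urls, known_paths = pull_work()    # known paths piggy-back on the work pull
            if not urls:
                break
            local_cache |= known_paths         # best effort; fine if stale or empty
            new_paths = set()
            for url in urls:
                for path in crawl_and_extract(url):
                    if path not in local_cache:    # cheap local filter, no extra round-trip
                        local_cache.add(path)
                        new_paths.add(path)
            report(new_paths)                  # the master still does the authoritative dedup

Nothing waits on the cache being current; it just trims the amount of redundant paths reported back.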
It's a weird problem to solve efficiently, this one... :)
On 01/17/2012 05:45 PM, Richard Hauswald wrote:
Yeah, you are right. URLs should be unique in the work queue. Otherwise -- in case of circular links between the pages -- you could end up in an endless loop :-o

Whether a worker should just extract paths or do a full crawl depends on the duration of a full crawl. I can think of 3 different ways, depending on your situation:

1. Do a full crawl and post back results + extracted paths.
2. Have workers do 2 different jobs: 1 job is to extract paths, 1 job is to do the actual crawl.
3. Get a path and post back the whole page content so the master can store it. Then have one worker pool assigned for extracting paths and one for a full crawl, both based on the stored page content.
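
For option 2 the worker-side dispatch could be as trivial as this sketch (extract_paths / full_crawl are just placeholders for whatever the real implementations would be):

    def handle_job(kind, url, extract_paths, full_crawl):
        # Two job kinds served by the same worker pool.
        if kind == "extract":
            return {"url": url, "paths": extract_paths(url)}   # cheap: link harvesting only
        if kind == "crawl":
            return {"url": url, "results": full_crawl(url)}    # expensive: the actual crawl/audit
        raise ValueError("unknown job kind: %s" % kind)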
But this really depends on the network speed, the load the workers create on the web application to crawl (in case it's not just a simple HTML-file-based web site), the duration of a full crawl and the number of different paths in the application.

Is there still something I missed, or would one of 1, 2, 3 solve your problem?
On Tue, Jan 17, 2012 at 4:07 PM, Tasos Laskos <tasos.laskos@gmail.com> wrote:

Well, websites are a crazy mesh (or mess, sometimes) and lots of pages link to other pages, so workers will eventually end up being redundant.

Of course, I'm basing my response on the assumption that your model has the workers actually crawl and not simply visit a given number of pages, parse them and then send back the paths they've extracted.

If so then pardon me, I misunderstood.
On 01/17/2012 05:01 PM, Richard Hauswald wrote:
Why would workers be visiting the same pages?
On Tue, Jan 17, 2012 at 3:44 PM, Tasos Laskos <tasos.laskos@gmail.com> wrote:

You're right, it does sound good, but it would still suffer from the same problem: workers visiting the same pages.

Although this could be somewhat mitigated the same way I described in an earlier post (which I think hasn't been moderated yet).

I've bookmarked it for future reference, thanks man.
On 01/17/2012 04:20 PM, Richard Hauswald wrote:
Btw, did you intentionally send this e-mail privately?

No, clicked the wrong button inside gmail - sorry.

I thought your intent was to divide URLs by subdomains, run a crawl, "Briefly scope out the webapp structure and spread the crawl of the distinguishable visible directories amongst the slaves". This would not redistribute newly discovered URLs and would require a special initial setup step (divide URLs by subdomains). And by pushing tasks to the workers you'd have to take care of the load on the workers, which means you'd have to implement a scheduling / load balancing policy. By pulling the work this would happen automatically. To make use of multi-core CPUs you could make your workers scan for the system's CPU core count and spawn worker threads in a defined ratio. You could also run a worker on the master host. This should lead to good load balancing by default. So it's not really ignoring scheduling details.
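
Purely as an illustration of the core-count idea, in Python that part might be no more than this (worker_loop being the pull-based loop discussed above, and the ratio being whatever suits the workload):

    import os
    import threading

    def spawn_workers(worker_loop, threads_per_core=2):
        # threads_per_core is the "defined ratio"; tune it for the workload.
        count = (os.cpu_count() or 1) * threads_per_core
        threads = [threading.Thread(target=worker_loop, daemon=True) for _ in range(count)]
        for t in threads:
            t.start()
        return threads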
On Tue, Jan 17, 2012 at 2:55 PM, Tasos Laskos <tasos.laskos@gmail.com> wrote:

Yep, that's similar to what I previously posted, with the difference that the master in my system won't be a slacker, so he'll be the lucky guy to grab the seed URL.

After that things are pretty much the same -- ignoring the scheduling details etc.

Btw, did you intentionally send this e-mail privately?
On 01/17/2012 03:48 PM, Richard Hauswald wrote:
Yes, I didn't understand the nature of your problem the first time ...

So, you could still use the pull principle. To make it simple I'll not consider work packets/batching. That could be used later to further improve performance by reducing "latency" / "seek" times.

So what about the following process:

1. Create a bunch of workers.
2. Create a list of URLs which can be considered the work queue. Initially filled with one element: the landing page URL in state NEW.
3. Let all your workers poll the master for a single work item in state NEW (pay attention to synchronize this step on the master). One of them is the lucky guy and gets the landing page URL. The master will update the work item to state PROCESSING (you may append a starting time, which could be used for reassigning already-assigned work items after a timeout). All the other workers will still be idle.
4. The lucky guy parses the page for new URLs and does whatever else it should do.
5. The lucky guy posts the results + the parsed URLs to the master.
6. The master stores the results, pushes the new URLs into the work queue with state NEW and updates the work item to state COMPLETED. If there is only one new URL we are not lucky, but if there are 10 we'd now have 10 work items to distribute.
7. Continue until all work items are in state COMPLETED.
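
A minimal, unsynchronized sketch of the master's side of this, just to make the states and the timeout-based reassignment concrete (a real implementation would need locking around poll() and some RPC layer, so treat the names as illustrative):

    import time

    NEW, PROCESSING, COMPLETED = "NEW", "PROCESSING", "COMPLETED"

    class WorkQueue:
        def __init__(self, landing_page_url, timeout=120):
            self.items = {landing_page_url: {"state": NEW, "started": None}}
            self.timeout = timeout

        def poll(self):
            # Hand out one NEW item, or reassign one stuck in PROCESSING for too long.
            now = time.time()
            for url, item in self.items.items():
                stale = item["state"] == PROCESSING and now - item["started"] > self.timeout
                if item["state"] == NEW or stale:
                    item["state"], item["started"] = PROCESSING, now
                    return url
            return None        # nothing to hand out right now

        def complete(self, url, extracted_urls):
            # Mark the item COMPLETED and queue any newly seen URLs as NEW.
            self.items[url]["state"] = COMPLETED
            for u in extracted_urls:
                self.items.setdefault(u, {"state": NEW, "started": None})

        def done(self):
            return all(item["state"] == COMPLETED for item in self.items.values())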
Does this make sense?
On Tue, Jan 17, 2012 at 2:15 PM, Tasos Laskos <tasos.laskos@gmail.com> wrote:

What prevents this is the nature of the crawl process.

What I'm trying to achieve here is not to spread the workload but actually to find it. I'm not interested in parsing the pages or any sort of processing, only in gathering all available paths.

So there's not really any "work" to distribute, actually.

Does this make sense?
On 01/17/2012 03:05 PM, Richard Hauswald wrote:

Tasos,
What prevents you from letting the workers pull the work from the master instead of pushing it to the workers? Then you could let the workers pull work packets containing e.g. 20 work items. After a worker has no work left, it will push the results to the master and pull another work packet.

Regards,
Richard
On Mon, Jan 16, 2012 at 6:41 PM, Tasos Laskos <tasos.laskos@gmail.com> wrote:

Hi guys, it's been a while.
I've got a tricky question for you today and I hope that we sort of get a brainstorm going.

I've recently implemented a system for audit distribution in the form of a high performance grid (won't self-promote), however an area which I left alone (for the time being) was the crawl process.

See, the way it works now is the master instance performs the initial crawl and then calculates and distributes the audit workload amongst its slaves, but the crawl takes place the old-fashioned way.

As you might have guessed, the major setback is caused by the fact that it's not possible to determine the workload of the crawl a priori.

I've got a couple of naive ideas to parallelize the crawl just to get me started:

* Assign the crawl of subdomains to slaves -- no questions asked.
* Briefly scope out the webapp structure and spread the crawl of the distinguishable visible directories amongst the slaves.

Or even a combination of the above if applicable.
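
Just to show what I mean by the split, the initial grouping could be as dumb as this rough sketch (nothing more than an illustration):

    from collections import defaultdict
    from urllib.parse import urlparse

    def partition(urls, by="subdomain"):
        # Group URLs so each slave gets one subdomain (or one top-level directory).
        groups = defaultdict(list)
        for url in urls:
            parsed = urlparse(url)
            if by == "subdomain":
                key = parsed.netloc
            else:
                segments = [s for s in parsed.path.split("/") if s]
                key = segments[0] if segments else "/"   # first path segment, e.g. "/blog/..." -> "blog"
            groups[key].append(url)
        return dict(groups)

Each group would then be handed to one slave as its seed set.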
Both ideas are better than what I've got now and there aren't any downsides to them, even if the distribution turns out to be suboptimal.

I'm curious though, has anyone faced a similar problem?
Any general ideas?

Cheers,
Tasos Laskos.
I thought mine was slightly different? But whatever the case, just to contribute something to the discussion :)
I think I got it and it turns out to be a composite approach indeed.

If at any point a worker becomes idle, it sends the paths it has discovered back to the master for storage/further processing/whatever.

Thoughts?
Forgot to mention a couple of details:
On 01/18/2012 09:16 PM, Tasos Laskos wrote:
I think I got it and it turns out to be a composite approach indeed.
If at any point a worker becomes idle he sends the paths he has
discovered back to the master for store/further processing/whatever.
Thoughts?