websecurity@lists.webappsec.org

The Web Security Mailing List

Re: [WEB SECURITY] Parallelizing the crawl

Tasos Laskos
Tue, Jan 17, 2012 5:03 PM

Hm... that's the same thing Richard was saying at some point, isn't it?
Certainly one of the techniques to try.

On 01/17/2012 07:00 PM, Ray wrote:

How about this? (comes to mind after reading the previous posts)

Master: only distributes URLs to crawl (crawl pool).  Responsible
for local lookup/deduplication of URLs before they enter the crawl
pool.  The lookup/dedup mechanism can also be used to generate the list
of crawled URLs at the end.

Slaves: only crawl, extract URLs and report them back to the master.

Iteration #1:
Master is seeded with only one URL (let's say), which is the
root/starting URL for the site.
Master performs local lookup/deduplication, nothing to dedup (only one URL).
Master distributes the URL in the crawl pool to a slave (the number of
slaves to use depends on the max number of URLs to crawl/process per slave).
Slave crawls, extracts and reports extracted URLs to the master.

Iteration #2...#n:
Master gets reports of new URLs from slaves.
Master performs local lookup/deduplication, adding unrecognized URLs to
the crawl pool and the local lookup table.
Master distributes the URLs in the crawl pool to a corresponding number of slaves.
Slaves crawl, extract and report extracted URLs to the master.
(Exit condition: the crawl pool is empty after all working slaves have
finished their current tasks and dedup has completed.)
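
A minimal single-process sketch of that loop, just to make the master-side
bookkeeping concrete (not anyone's actual implementation; fetch_and_extract()
is a hypothetical stand-in for a slave crawling one URL and reporting back
the links it found):

    from collections import deque
    from urllib.parse import urldefrag

    def crawl(seed_url, fetch_and_extract, batch_size=4):
        """Master loop: dedup URLs, hand them out, collect the reported links."""
        seen = set()        # the master's local lookup table
        pool = deque()      # the crawl pool

        def dedup_and_enqueue(urls):
            for url in urls:
                url, _ = urldefrag(url)          # normalise before the lookup
                if url not in seen:
                    seen.add(url)
                    pool.append(url)

        dedup_and_enqueue([seed_url])            # iteration #1: the seed

        while pool:                              # exit condition: crawl pool empty
            # "distribute" a batch; here the slaves just run inline
            batch = [pool.popleft() for _ in range(min(batch_size, len(pool)))]
            for url in batch:
                dedup_and_enqueue(fetch_and_extract(url))   # slave reports back

        return sorted(seen)                      # the list of crawled URLs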

Regards,
Ray

On Tue, Jan 17, 2012 at 11:54 PM, Tasos Laskos <tasos.laskos@gmail.com> wrote:

 I've leaned towards #2 from the get-go and the following could help
 reduce redundancy -- from a previous message:


 All instances converging at intervals on the collective paths
 they've discovered, so that each keeps a local look-up cache
 (no latency introduced, since no one would rely/wait on this to be up
 to date or even available).

 This info could be pulled along with the links to follow from the
 master.

 It's a weird problem to solve efficiently, this one... :)


 On 01/17/2012 05:45 PM, Richard Hauswald wrote:

     Yeah, you are right. URLs should be unique in the work queue.
     Otherwise - in case of circular links between the pages - you could
     end up in an endless loop :-o

     Whether a worker should just extract paths or do a full crawl depends
     on the duration of a full crawl. I can think of 3 different ways,
     depending on your situation:
     1. Do a full crawl and post back results + extracted paths
     2. Have workers do 2 different jobs: one job is to extract paths,
     one is to do the actual crawl
     3. Get a path and post back the whole page content so the master can
     store it. Then have a worker pool assigned for extracting paths and
     one for a full crawl, both based on the stored page content.
     But this really depends on the network speed, the load the workers
     create on the web application to crawl (in case it's not just a simple
     HTML file-based web site), the duration of a full crawl and the number
     of different paths in the application.

     Is there still something I missed or would one of 1,2,3 solve
     your problem?

     On Tue, Jan 17, 2012 at 4:07 PM, Tasos Laskos <tasos.laskos@gmail.com>
       wrote:

         Well, websites are a crazy mesh (or mess, sometimes) and lots of
         pages link to other pages, so workers will eventually end up being
         redundant.

         Of course, I'm basing my response on the assumption that your model
         has the workers actually crawl and not simply visit a given number
         of pages, parse and then send back the paths they've extracted.
         If so then pardon me, I misunderstood.


         On 01/17/2012 05:01 PM, Richard Hauswald wrote:


              Why would workers visit the same pages?

              On Tue, Jan 17, 2012 at 3:44 PM, Tasos
              Laskos <tasos.laskos@gmail.com> wrote:


                 You're right, it does sound good but it would still
                 suffer from the same
                 problem, workers visiting the same pages.

                 Although this could be somewhat mitigated the same
                 way I described in an
                 earlier post (which I think hasn't been moderated yet).

                 I've bookmarked it for future reference, thanks man.


                 On 01/17/2012 04:20 PM, Richard Hauswald wrote:



                         Btw, did you intentionally send this e-mail
                         privately?



                      No, I clicked the wrong button inside Gmail - sorry.

                      I thought your intent was to divide URLs by
                      subdomains, run a crawl, "Briefly scope out the
                      webapp structure and spread the crawl of the
                      distinguishable visible directories amongst the
                      slaves". This would not redistribute newly
                      discovered URLs and would require a special
                      initial setup step (divide URLs by subdomains).
                      And by pushing tasks to the workers you'd have
                      to take care of the load on the workers, which
                      means you'd have to implement a scheduling /
                      load-balancing policy. By pulling the work this
                      would happen automatically. To make use of
                      multi-core CPUs you could make your workers scan
                      for the system's CPU core count and spawn worker
                      threads in a defined ratio. You could also run a
                      worker on the master host. This should lead to
                      good load balancing by default. So it's not
                      really ignoring scheduling details.
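
For the core-count ratio idea, something as small as this would do
(ratio and reserve are made-up knobs, not from any existing tool):

    import os

    def workers_to_spawn(ratio=2, reserve=1):
        # Spawn worker threads as a defined ratio of CPU cores, keeping a
        # core or so free for a master running on the same host.
        cores = os.cpu_count() or 1
        return max(1, cores * ratio - reserve)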

                      On Tue, Jan 17, 2012 at 2:55 PM, Tasos
                      Laskos <tasos.laskos@gmail.com> wrote:



                          Yep, that's similar to what I previously
                          posted, with the difference that the master
                          in my system won't be a slacker, so he'll be
                          the lucky guy to grab the seed URL.

                          After that things are pretty much the same
                          -- ignoring the scheduling details etc.

                          Btw, did you intentionally send this e-mail
                          privately?


                         On 01/17/2012 03:48 PM, Richard Hauswald wrote:




                              Yes, I didn't understand the nature of
                              your problem the first time...

                              So, you could still use the pull
                              principle. To make it simple I'll not
                              consider work packets/batching; this
                              could be used later to further improve
                              performance by reducing "latency" /
                              "seek" times.

                              So what about the following process:
                              1. Create a bunch of workers.
                              2. Create a list of URLs which can be
                              considered the work queue, initially
                              filled with one element: the landing
                              page URL in state NEW.
                              3. Let all your workers poll the master
                              for a single work item in state NEW
                              (pay attention to synchronizing this
                              step on the master). One of them is the
                              lucky guy and gets the landing page URL.
                              The master will update the work item to
                              state PROCESSING (you may append a
                              starting time, which could be used for
                              reassigning already-assigned work items
                              after a timeout). All the other workers
                              will still be idle.
                              4. The lucky guy parses the page for new
                              URLs and does whatever else it should do.
                              5. The lucky guy posts the results + the
                              parsed URLs to the master.
                              6. The master stores the results, pushes
                              the new URLs into the work queue with
                              state NEW and updates the work item to
                              state COMPLETED. If there is only one
                              new URL we're not lucky, but if there
                              are 10 we'd now have 10 work items to
                              distribute.
                              7. Continue until all work items are in
                              state COMPLETED.

                             Does this make sense?
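
A rough sketch of that state machine, assuming an in-memory master and
threaded workers; all the names here are made up for illustration:

    import threading
    import time

    NEW, PROCESSING, COMPLETED = "NEW", "PROCESSING", "COMPLETED"

    class Master:
        def __init__(self, landing_page_url):
            self.lock = threading.Lock()           # synchronize the poll step
            self.items = {landing_page_url: NEW}   # URL -> state
            self.results = {}

        def poll(self):
            """Hand out one work item in state NEW, or None if nothing is pending."""
            with self.lock:
                for url, state in self.items.items():
                    if state == NEW:
                        self.items[url] = PROCESSING   # could also record a start time
                        return url
                return None

        def post(self, url, result, extracted_urls):
            """Store results, enqueue newly found URLs as NEW, complete the item."""
            with self.lock:
                self.results[url] = result
                for new_url in extracted_urls:
                    self.items.setdefault(new_url, NEW)
                self.items[url] = COMPLETED

        def done(self):
            with self.lock:
                return all(s == COMPLETED for s in self.items.values())

    def worker(master, process):
        # process(url) stands in for "parse the page and do whatever else it
        # should do"; it returns (result, extracted_urls).
        while not master.done():
            url = master.poll()
            if url is None:
                time.sleep(0.1)      # idle until new work shows up
                continue
            result, extracted = process(url)
            master.post(url, result, extracted)

Reassigning items stuck in PROCESSING after a timeout (the starting-time
idea above) is left out to keep the sketch short.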

                              On Tue, Jan 17, 2012 at 2:15 PM, Tasos
                              Laskos <tasos.laskos@gmail.com> wrote:




                                  What prevents this is the nature of
                                  the crawl process.
                                  What I'm trying to achieve here is
                                  not to spread the workload but
                                  actually to find it.

                                  I'm not interested in parsing the
                                  pages or any sort of processing, but
                                  only in gathering all available paths.

                                  So there's not really any "work" to
                                  distribute, actually.

                                  Does this make sense?


                                 On 01/17/2012 03:05 PM, Richard
                                 Hauswald wrote:





                                      Tasos,
                                      what prevents you from letting
                                      the workers pull the work from
                                      the master instead of pushing it
                                      to the workers? Then you could
                                      let the workers pull work packets
                                      containing e.g. 20 work items.
                                      After a worker has no work left,
                                      it will push the results to the
                                      master and pull another work
                                      packet.
                                      Regards,
                                      Richard
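
Batching that into work packets is a small change on the worker side;
this reuses the hypothetical Master.poll() from the sketch further up:

    def pull_packet(master, packet_size=20):
        # Pull up to `packet_size` NEW work items in one go instead of one by one.
        packet = []
        while len(packet) < packet_size:
            url = master.poll()
            if url is None:
                break
            packet.append(url)
        return packet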

                                      On Mon, Jan 16, 2012 at 6:41 PM,
                                      Tasos Laskos <tasos.laskos@gmail.com>
                                      wrote:





                                         Hi guys, it's been a while.

                                          I've got a tricky question
                                          for you today and I hope that
                                          we sort of get a brainstorm
                                          going.

                                          I've recently implemented a
                                          system for audit distribution
                                          in the form of a
                                          high-performance grid (won't
                                          self-promote); however, an
                                          area I left alone (for the
                                          time being) was the crawl
                                          process.

                                          See, the way it works now is
                                          that the master instance
                                          performs the initial crawl
                                          and then calculates and
                                          distributes the audit
                                          workload amongst its slaves,
                                          but the crawl takes place the
                                          old-fashioned way.

                                          As you might have guessed,
                                          the major setback is caused
                                          by the fact that it's not
                                          possible to determine the
                                          workload of the crawl a
                                          priori.

                                          I've got a couple of naive
                                          ideas to parallelize the
                                          crawl just to get me started:
                                            * Assign the crawl of
                                          subdomains to slaves -- no
                                          questions asked
                                            * Briefly scope out the
                                          webapp structure and spread
                                          the crawl of the
                                          distinguishable visible
                                          directories amongst the
                                          slaves.

                                          Or even a combination of the
                                          above if applicable.

                                          Both ideas are better than
                                          what I've got now and there
                                          aren't any downsides to them,
                                          even if the distribution
                                          turns out to be suboptimal.

                                          I'm curious though, has
                                          anyone faced a similar
                                          problem? Any general ideas?

                                          Cheers,
                                          Tasos Laskos.

 _________________________________________________
 The Web Security Mailing List

 WebSecurity RSS Feed
 http://www.webappsec.org/rss/websecurity.rss

 Join WASC on LinkedIn
 http://www.linkedin.com/e/gis/83336/4B20E4374DBA

 WASC on Twitter
 http://twitter.com/wascupdates

 websecurity@lists.webappsec.org
 http://lists.webappsec.org/mailman/listinfo/websecurity_lists.webappsec.org
Ray
Tue, Jan 17, 2012 5:07 PM

I thought mine was slightly different?  But whichever the case, just to
contribute something to the discussion :)

On Wed, Jan 18, 2012 at 1:03 AM, Tasos Laskos <tasos.laskos@gmail.com> wrote:

Hm...that's the same thing Richard was saying at some point isn't it?
Certainly one of the techniques to try.

On 01/17/2012 07:00 PM, Ray wrote:

How about this? (comes to mind after reading the previous posts)

Master: only distributes URLs to crawl (crawl pool).  Responsible
for local lookup/deduplication of URLs before they enter the crawl

pool.  The lookup/dedup mechanism can also be used to generate the list
of crawled URLs in the end too.

Slaves: only crawls, extracts URLs and reports them back to master

Iteration #1:

Master is seeded with only one URL (let's say), which is the
root/starting URL for the site.
Master performs local lookup/deduplication, nothing to dedup (only one
URL).
Master distributes URL in crawl pool to slave (number of slaves to
use dependent on the max number of URLs to crawl/process per slave).
Slave crawls, extracts and reports extracted URLs to master.

Iteration #2...#n:

Master gets reports of new URLs from slaves.
Master performs local lookup/deduplication, adding unrecognized URLs to
crawl pool and local lookup table.
Master distributes URLs in crawl pool to corresponding number of slaves.
Slaves crawl, extract and report extracted URLs to master.
(Exit condition: crawl pool empty after all working slaves have

finished their current task and dedup completed)

Regards,
Ray

On Tue, Jan 17, 2012 at 11:54 PM, Tasos Laskos <tasos.laskos@gmail.com
mailto:tasos.laskos@gmail.com**> wrote:

I've leaned towards #2 from the get go and the following could help
reduce redundancy -- from a previous message:


All instances converging in intervals about the collective paths
they've discovered in order for each to keep a local look-up cache
(no latency introduced since no-one would rely/wait on this to be up
to date or even available)

This info could be pulled along with the links to follow from the
master.

It's a weird problem to solve efficiently this one... :)


On 01/17/2012 05:45 PM, Richard Hauswald wrote:

    Yeah, you are right. URL's should be unique in the work queue.
    Otherwise - in case of circular links between the pages - you could
    end up in an endless loop :-o

    If a worker should just extract paths or do a full crawl depends on
    the duration of a full crawl. I can think of 3 different ways,
    depending on your situation:
    1. Do a full crawl and post back results + extracted paths
    2. Have workers do 2 different jobs, 1 job is to extract paths,
    1 job
    is to do the actual crawl
    3. Get a path and post back the whole page content so the master

can
store it. Then have a worker pool assigned for extracting paths and
one for a full crawl, both based on the stored page content.
But this really depends on the network speed, the load the workers
create on the web application to crawl(in case its not just a
simple
html file based web site), the duration of a full crawl and the
number
of different paths in the application.

    Is there still something I missed or would one of 1,2,3 solve
    your problem?

    On Tue, Jan 17, 2012 at 4:07 PM, Tasos
    Laskos<tasos.laskos@gmail.com <mailto:tasos.laskos@gmail.com**>>

      wrote:

        Well, websites are a crazy mesh (or mess, sometimes) and
        lots of pages link
        to other pages so workers will eventually end up being
        redundant.

        Of course, I'm basing my response on the assumption that
        your model has the
        workers actually crawl and not simply visit a given number
        of pages, parse
        and then sent back the paths they've extracted.
        If so then pardon me, I misunderstood.


        On 01/17/2012 05:01 PM, Richard Hauswald wrote:


            Why would workers visiting the same pages?

            On Tue, Jan 17, 2012 at 3:44 PM, Tasos
            Laskos<tasos.laskos@gmail.com
            <mailto:tasos.laskos@gmail.com**>>

              wrote:


                You're right, it does sound good but it would still
                suffer from the same
                problem, workers visiting the same pages.

                Although this could be somewhat mitigated the same
                way I described in an
                earlier post (which I think hasn't been moderated yet).

                I've bookmarked it for future reference, thanks man.


                On 01/17/2012 04:20 PM, Richard Hauswald wrote:



                        Btw, did you intentionally send this e-mail
                        privately?



                    No, clicked the wrong button inside gmail - sorry.

                    I thought your intend was to divide URLs by
                    subdomains, run a crawl,
                    "Briefly scope out the webapp structure and
                    spread the crawl of the
                    distinguishable visible directories amongst the
                    slaves". This would
                    not redistribute new discovered URLs and require
                    a special initial
                    setup step (divide URLs by subdomains). And by
                    pushing tasks to the
                    workers you'd have to take care about the load
                    of the workers which
                    means you'd have to implement a scheduling /
                    load balancing policy. By
                    pulling the work this would happen
                    automatically. To make use of multi
                    core CPUs you could make your workers scan for
                    the systems count of
                    CPU cores and spawn workers threads in a defined
                    ratio. You could also
                    run a worker on the master host. This should
                    lead to a good load
                    balancing by default. So it's not really
                    ignoring scheduling details.

                    On Tue, Jan 17, 2012 at 2:55 PM, Tasos
                    Laskos<tasos.laskos@gmail.com
                    <mailto:tasos.laskos@gmail.com**>>

                      wrote:



                        Yep, that's similar to what I previously
                        posted with the difference
                        that
                        the
                        master in my system won't be a slacker so
                        he'll be the lucky guy to
                        grab
                        the
                        seed URL.

                        After that things are pretty much the same
                        -- ignoring the scheduling
                        details etc.

                        Btw, did you intentionally send this e-mail
                        privately?


                        On 01/17/2012 03:48 PM, Richard Hauswald wrote:




                            Yes, I didn't understand the nature of
                            your problem the first time ...

                            So, you could still use the pull
                            principle. To make it simple I'll not
                            consider work packets/batching. This
                            could be used later to further
                            improve performance by reducing
                            "latency" / "seek" times.

                            So what about the following process:
                            1. Create a bunch of workers
                            2. Create a List of URLs which can be
                            considered the work queue.
                            Initially filled with one element: the
                            landing page URL in state NEW.
                            3. Let all your workers poll the master
                            for a single work item in
                            state NEW(pay attention to synchronize
                            this step on the master). One
                            of them is the lucky guy and gets the
                            landing page URL. The master
                            will update work item to state
                            PROCESSING( you may append a starting
                            time, which could be used for
                            reassigning already assigned work items
                            after a timeout). All the other workers
                            will still be idle.
                            4. The lucky guy parses the page for new
                            URLs and does whatever it
                            should also do.
                            5. The lucky guy posts the results + the
                            parsed URLs to the master.
                            6. The master stores the results, pushes
                            the new URLs into the work
                            queue with state NEW and updates the
                            work item to state COMPLETED. If
                            there is only one new URL we are not
                            lucky but if there are 10 we'd
                            have now 10 work items to distribute.
                            7. Continue until all work items are in
                            state COMPLETED.

                            Does this make sense?

                            On Tue, Jan 17, 2012 at 2:15 PM, Tasos
                            Laskos<tasos.laskos@gmail.com
                            <mailto:tasos.laskos@gmail.com**>>

                              wrote:




                                What prevents this is the nature of
                                the crawl process.
                                What I'm trying to achieve here is
                                not spread the workload but
                                actually
                                find
                                it.

                                I'm not interested in parsing the
                                pages or any sort of processing but
                                only
                                gather all available paths.

                                So there's not really any "work" to
                                distribute actually.

                                Does this make sense?


                                On 01/17/2012 03:05 PM, Richard
                                Hauswald wrote:





                                    Tasos,
                                    what prevents you from let the
                                    workers pull the work from the
                                    master
                                    instead of pushing it to the
                                    workers? Then you could let the
                                    workers
                                    pull work packets containing
                                    e.g. 20 work items. After a
                                    worker has
                                    no
                                    work left, it will push the
                                    results to the master and pull
                                    another
                                    work packet.
                                    Regards,
                                    Richard

                                    On Mon, Jan 16, 2012 at 6:41 PM,
                                    Tasos
                                    Laskos<tasos.laskos@gmail.com
                                    <mailto:tasos.laskos@gmail.com**>>

                                      wrote:





                                        Hi guys, it's been a while.

                                        I've got a tricky question
                                        for you today and I hope
                                        that we sort of
                                        get
                                        a
                                        brainstorm going.

                                        I've recently implemented a
                                        system for audit
                                        distribution in the
                                        form
                                        of
                                        a
                                        high performance grid (won't
                                        self-promote) however an
                                        area which I
                                        left
                                        alone (for the time being)
                                        was the crawl process.

                                        See, the way it works now is
                                        the master instance performs

the
initial
crawl
and then calculates and
distributes the audit
workload amongst its
slaves
but the crawl takes place
the old fashioned way.

                                        As you might have guessed
                                        the major set back is caused
                                        by the fact
                                        that
                                        it's
                                        not possible to determine
                                        the workload of the crawl a
                                        priori.

                                        I've got a couple of naive
                                        ideas to parallelize the
                                        crawl just to
                                        get
                                        me
                                        started:
                                          * Assign crawl of
                                        subdomains to slaves -- no
                                        questions asked
                                          * Briefly scope out the
                                        webapp structure and spread
                                        the crawl of
                                        the
                                        distinguishable visible
                                        directories amongst the slaves.

                                        Or even a combination of the
                                        above if applicable.

                                        Both ideas are better than
                                        what I've got now and there
                                        aren't any
                                        downsides
                                        to them even if the
                                        distribution turns out to be
                                        suboptimal.

                                        I'm curious though, has
                                        anyone faced a similar problem?
                                        Any general ideas?

                                        Cheers,
                                        Tasos Laskos.

                                        ______________________________

**___________________

                                        The Web Security Mailing List

                                        WebSecurity RSS Feed
                                        http://www.webappsec.org/rss/_

**_websecurity.rss http://www.webappsec.org/rss/__websecurity.rss

                                        <http://www.webappsec.org/rss/

**websecurity.rss http://www.webappsec.org/rss/websecurity.rss>

                                        Join WASC on LinkedIn
                                        http://www.linkedin.com/e/gis/

**__83336/4B20E4374DBAhttp://www.linkedin.com/e/gis/__83336/4B20E4374DBA

                                        <http://www.linkedin.com/e/**

gis/83336/4B20E4374DBA http://www.linkedin.com/e/gis/83336/4B20E4374DBA

                                        WASC on Twitter
                                        http://twitter.com/wascupdates

                                        websecurity@lists.webappsec.__

org
<mailto:websecurity@lists.

webappsec.org websecurity@lists.webappsec.org>

                                        http://lists.webappsec.org/__*

*mailman/listinfo/websecurity___lists.webappsec.orghttp://lists.webappsec.org/__mailman/listinfo/websecurity___lists.webappsec.org
<http://lists.webappsec.org/

mailman/listinfo/websecurity_**lists.webappsec.orghttp://lists.webappsec.org/mailman/listinfo/websecurity_lists.webappsec.org

______________________________**___________________

The Web Security Mailing List

WebSecurity RSS Feed
http://www.webappsec.org/rss/_**_websecurity.rss<http://www.webappsec.org/rss/__websecurity.rss>

<http://www.webappsec.org/rss/**websecurity.rss<http://www.webappsec.org/rss/websecurity.rss>
Join WASC on LinkedIn
http://www.linkedin.com/e/gis/**__83336/4B20E4374DBA<http://www.linkedin.com/e/gis/__83336/4B20E4374DBA>

<http://www.linkedin.com/e/**gis/83336/4B20E4374DBA<http://www.linkedin.com/e/gis/83336/4B20E4374DBA>
WASC on Twitter
http://twitter.com/wascupdates

websecurity@lists.webappsec.__**org
<mailto:websecurity@lists.**webappsec.org<websecurity@lists.webappsec.org>
http://lists.webappsec.org/__**mailman/listinfo/websecurity__**

lists.webappsec.orghttp://lists.webappsec.org/__mailman/listinfo/websecurity___lists.webappsec.org
<http://lists.webappsec.org/**mailman/listinfo/websecurity
**
lists.webappsec.orghttp://lists.webappsec.org/mailman/listinfo/websecurity_lists.webappsec.org

TL
Tasos Laskos
Wed, Jan 18, 2012 7:16 PM

I think I got it and it turns out to be a composite approach indeed.

  1. Master scopes out the site (follows 10 paths or so) and deduces the webapp structure -- it could be incomplete, that doesn't matter, we just want some seeds.
  2. Master creates a per-directory policy, assigns directories to workers AND sends that policy to them as well.
  3. Workers perform the crawl as usual but also apply that policy to URLs that don't match their own rules, i.e. they send URLs that are outside their scope to the appropriate peer and let it handle them (see the sketch below).
  4. If no policy matches a URL then it is sent back to the master; the master creates a new policy (or policies), stores the work in a queue and then sends an announcement ("There's some work up for grabs").
  5. Busy workers ignore it; idle workers try to pull it and the work is assigned first come, first served, along with the updated policy.
  6. Go to 3.

If at any point a worker becomes idle, it sends the paths it has
discovered back to the master for storage/further processing/whatever.

  • No need for look-ups
  • No decision lag as no-one will need to request permission to
    perform any sort of action
  • Automated load-balancing
  • No redundant crawls
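
To make the policy bit concrete, here's a minimal sketch (Python, with hypothetical names -- PolicyRouter, send_to_peer and send_to_master aren't anything real) of the routing in steps 2-4: a worker crawls URLs its own policy covers, forwards URLs that match a peer's policy, and escalates anything unmatched to the master.

    from urllib.parse import urlparse

    # Sketch of the per-directory policy routing described above.
    # A "policy" is just the set of directory prefixes a worker owns.
    class PolicyRouter:
        def __init__(self, my_id, policies):
            # policies: {worker_id: ["/blog/", "/shop/"], ...} -- assigned by the master
            self.my_id = my_id
            self.policies = policies

        def owner_of(self, url):
            # Return the worker responsible for this URL, or None if no policy matches.
            path = urlparse(url).path
            for worker_id, prefixes in self.policies.items():
                if any(path.startswith(prefix) for prefix in prefixes):
                    return worker_id
            return None

        def route(self, url, send_to_peer, send_to_master):
            # Crawl locally, forward to a peer, or escalate to the master (step 4).
            owner = self.owner_of(url)
            if owner == self.my_id:
                return "crawl"                  # in scope: crawl it ourselves
            if owner is not None:
                send_to_peer(owner, url)        # out of scope: hand it to the right peer
                return "forwarded"
            send_to_master(url)                 # nothing matches: master creates a new policy
            return "escalated"

The master would hold the authoritative policy table and ship an updated copy along with each "work up for grabs" announcement, so a worker's local policies get refreshed whenever it pulls new work.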

Thoughts?

On 01/17/2012 07:07 PM, Ray wrote:

I thought mine was slightly different?  But whichever the case, just to
contribute something to the discussion :)

TL
Tasos Laskos
Wed, Jan 18, 2012 7:23 PM

Forgot to mention a couple of details:

  • The master won't be a slacker; it will also act as a worker.
  • Workers will try to pull work whenever they become idle, even if no
    announcement has been sent beforehand -- just to make sure (see the
    sketch below).
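
A rough sketch of that pull-regardless-of-announcements behaviour, with pull_work, report_paths and crawl standing in for whatever the real master interface and crawler end up being:

    import queue

    # Sketch of the idle-worker loop: block on the announcement queue, but time
    # out and poll the master anyway, so a missed announcement can't leave work
    # stranded. The master runs one of these loops too, so it's also a worker.
    def worker_loop(worker_id, pull_work, report_paths, announcements,
                    crawl, poll_interval=5.0):
        while True:
            try:
                announcements.get(timeout=poll_interval)  # wake on an announcement...
            except queue.Empty:
                pass                                      # ...or just poll after the timeout
            job = pull_work(worker_id)                    # first come, first served
            if job is None:
                continue                                  # nothing up for grabs right now
            report_paths(worker_id, crawl(job))           # ship discovered paths to the master

Each worker (including the one living on the master host) would run this loop in its own thread or process.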
- Tasos

Forgot to mention a couple of details: * the master won't be a slacker but also act as a worker as well. * Idle workers will try to pull work when they become idle even if no announcement has been previously set -- just to make sure. - Tasos On 01/18/2012 09:16 PM, Tasos Laskos wrote: > I think I got it and it turns out to be a composite approach indeed. > > 1. Master scopes out the place (follows 10 paths or so) and deduces the > webapp structure -- could be incomplete, doesn't matter we just want > some seeds. > 2. Master creates a per directory policy and assigns dirs to workers AND > sends that police to them as well > 3. Workers perform the crawl as usual but also implement that policy for > URLs that don't match their own policy rules i.e. send URLs that are out > of their scope to the appropriate peer and let him handle it > 4. If no policy matches a URL then it is sent back to the master; the > master creates a new policy(ies), stores the work in a Queue and then > sends an announcement ("There's some work up for grabs" ) > 5. Busy workers ignore it; idling workers try to pull it and the work is > assigned first-come/first-serve along with the updated policy > 6. Go to 3 > > If at any point a worker becomes idle he sends the paths he has > discovered back to the master for store/further processing/whatever. > > * No need for look-ups > * No decision lag as no-one will need to request permission to perform > any sort of action > * Automated load-balancing > * No redundant crawls > > Thoughts? > > On 01/17/2012 07:07 PM, Ray wrote: >> I thought mine was slightly different? But whichever the case, just to >> contribute something to the discussion :) >> >> On Wed, Jan 18, 2012 at 1:03 AM, Tasos Laskos <tasos.laskos@gmail.com >> <mailto:tasos.laskos@gmail.com>> wrote: >> >> Hm...that's the same thing Richard was saying at some point isn't it? >> Certainly one of the techniques to try. >> >> >> On 01/17/2012 07:00 PM, Ray wrote: >> >> How about this? (comes to mind after reading the previous posts) >> >> *Master*: only *distributes* URLs to crawl (crawl pool). >> Responsible >> for local lookup/*deduplication* of URLs before they enter the crawl >> >> pool. The lookup/dedup mechanism can also be used to generate >> the list >> of crawled URLs in the end too. >> >> *Slaves*: only *crawls*, *extracts* URLs and reports them back >> to master >> >> _Iteration #1:_ >> >> Master is seeded with only one URL (let's say), which is the >> root/starting URL for the site. >> Master performs local lookup/deduplication, nothing to dedup >> (only one URL). >> Master distributes URL in crawl pool to slave (number of slaves to >> use dependent on the max number of URLs to crawl/process per slave). >> Slave crawls, extracts and reports extracted URLs to master. >> >> _Iteration #2...#n:_ >> >> Master gets reports of new URLs from slaves. >> Master performs local lookup/deduplication, adding unrecognized >> URLs to >> crawl pool and local lookup table. >> Master distributes URLs in crawl pool to corresponding number of >> slaves. >> Slaves crawl, extract and report extracted URLs to master. 
>> (*Exit condition*: crawl pool empty after all working slaves have >> >> finished their current task and dedup completed) >> >> Regards, >> Ray >> >> On Tue, Jan 17, 2012 at 11:54 PM, Tasos Laskos >> <tasos.laskos@gmail.com <mailto:tasos.laskos@gmail.com> >> <mailto:tasos.laskos@gmail.com >> <mailto:tasos.laskos@gmail.com>__>> wrote: >> >> I've leaned towards #2 from the get go and the following >> could help >> reduce redundancy -- from a previous message: >> >> >> All instances converging in intervals about the collective paths >> they've discovered in order for each to keep a local look-up >> cache >> (no latency introduced since no-one would rely/wait on this >> to be up >> to date or even available) >> >> This info could be pulled along with the links to follow >> from the >> master. >> >> It's a weird problem to solve efficiently this one... :) >> >> >> On 01/17/2012 05:45 PM, Richard Hauswald wrote: >> >> Yeah, you are right. URL's should be unique in the work >> queue. >> Otherwise - in case of circular links between the pages >> - you could >> end up in an endless loop :-o >> >> If a worker should just extract paths or do a full crawl >> depends on >> the duration of a full crawl. I can think of 3 different >> ways, >> depending on your situation: >> 1. Do a full crawl and post back results + extracted paths >> 2. Have workers do 2 different jobs, 1 job is to extract >> paths, >> 1 job >> is to do the actual crawl >> 3. Get a path and post back the whole page content so >> the master can >> store it. Then have a worker pool assigned for >> extracting paths and >> one for a full crawl, both based on the stored page content. >> But this really depends on the network speed, the load >> the workers >> create on the web application to crawl(in case its not >> just a simple >> html file based web site), the duration of a full crawl >> and the >> number >> of different paths in the application. >> >> Is there still something I missed or would one of 1,2,3 >> solve >> your problem? >> >> On Tue, Jan 17, 2012 at 4:07 PM, Tasos >> Laskos<tasos.laskos@gmail.com >> <mailto:tasos.laskos@gmail.com> <mailto:tasos.laskos@gmail.com >> <mailto:tasos.laskos@gmail.com>__>> >> >> wrote: >> >> Well, websites are a crazy mesh (or mess, sometimes) and >> lots of pages link >> to other pages so workers will eventually end up being >> redundant. >> >> Of course, I'm basing my response on the assumption that >> your model has the >> workers actually crawl and not simply visit a given >> number >> of pages, parse >> and then sent back the paths they've extracted. >> If so then pardon me, I misunderstood. >> >> >> On 01/17/2012 05:01 PM, Richard Hauswald wrote: >> >> >> Why would workers visiting the same pages? >> >> On Tue, Jan 17, 2012 at 3:44 PM, Tasos >> Laskos<tasos.laskos@gmail.com >> <mailto:tasos.laskos@gmail.com> >> <mailto:tasos.laskos@gmail.com <mailto:tasos.laskos@gmail.com>__>> >> >> wrote: >> >> >> You're right, it does sound good but it >> would still >> suffer from the same >> problem, workers visiting the same pages. >> >> Although this could be somewhat mitigated >> the same >> way I described in an >> earlier post (which I think hasn't been >> moderated yet). >> >> I've bookmarked it for future reference, >> thanks man. >> >> >> On 01/17/2012 04:20 PM, Richard Hauswald wrote: >> >> >> >> Btw, did you intentionally send this >> e-mail >> privately? >> >> >> >> No, clicked the wrong button inside >> gmail - sorry. 
>> >> I thought your intend was to divide URLs by >> subdomains, run a crawl, >> "Briefly scope out the webapp structure and >> spread the crawl of the >> distinguishable visible directories >> amongst the >> slaves". This would >> not redistribute new discovered URLs and >> require >> a special initial >> setup step (divide URLs by subdomains). >> And by >> pushing tasks to the >> workers you'd have to take care about >> the load >> of the workers which >> means you'd have to implement a scheduling / >> load balancing policy. By >> pulling the work this would happen >> automatically. To make use of multi >> core CPUs you could make your workers >> scan for >> the systems count of >> CPU cores and spawn workers threads in a >> defined >> ratio. You could also >> run a worker on the master host. This should >> lead to a good load >> balancing by default. So it's not really >> ignoring scheduling details. >> >> On Tue, Jan 17, 2012 at 2:55 PM, Tasos >> Laskos<tasos.laskos@gmail.com >> <mailto:tasos.laskos@gmail.com> >> <mailto:tasos.laskos@gmail.com <mailto:tasos.laskos@gmail.com>__>> >> >> wrote: >> >> >> >> Yep, that's similar to what I previously >> posted with the difference >> that >> the >> master in my system won't be a >> slacker so >> he'll be the lucky guy to >> grab >> the >> seed URL. >> >> After that things are pretty much >> the same >> -- ignoring the scheduling >> details etc. >> >> Btw, did you intentionally send this >> e-mail >> privately? >> >> >> On 01/17/2012 03:48 PM, Richard >> Hauswald wrote: >> >> >> >> >> Yes, I didn't understand the >> nature of >> your problem the first time ... >> >> So, you could still use the pull >> principle. To make it simple >> I'll not >> consider work packets/batching. This >> could be used later to further >> improve performance by reducing >> "latency" / "seek" times. >> >> So what about the following process: >> 1. Create a bunch of workers >> 2. Create a List of URLs which >> can be >> considered the work queue. >> Initially filled with one >> element: the >> landing page URL in state NEW. >> 3. Let all your workers poll the >> master >> for a single work item in >> state NEW(pay attention to >> synchronize >> this step on the master). One >> of them is the lucky guy and >> gets the >> landing page URL. The master >> will update work item to state >> PROCESSING( you may append a >> starting >> time, which could be used for >> reassigning already assigned >> work items >> after a timeout). All the other >> workers >> will still be idle. >> 4. The lucky guy parses the page >> for new >> URLs and does whatever it >> should also do. >> 5. The lucky guy posts the >> results + the >> parsed URLs to the master. >> 6. The master stores the >> results, pushes >> the new URLs into the work >> queue with state NEW and updates the >> work item to state COMPLETED. If >> there is only one new URL we are not >> lucky but if there are 10 we'd >> have now 10 work items to >> distribute. >> 7. Continue until all work items >> are in >> state COMPLETED. >> >> Does this make sense? >> >> On Tue, Jan 17, 2012 at 2:15 PM, >> Tasos >> Laskos<tasos.laskos@gmail.com >> <mailto:tasos.laskos@gmail.com> >> <mailto:tasos.laskos@gmail.com <mailto:tasos.laskos@gmail.com>__>> >> >> wrote: >> >> >> >> >> What prevents this is the >> nature of >> the crawl process. >> What I'm trying to achieve >> here is >> not spread the workload but >> actually >> find >> it. 
>> On Tue, Jan 17, 2012 at 2:15 PM, Tasos Laskos <tasos.laskos@gmail.com> wrote:
>>
>> What prevents this is the nature of the crawl process.
>> What I'm trying to achieve here is not to spread the workload but to
>> actually find it.
>>
>> I'm not interested in parsing the pages or any sort of processing, but only
>> in gathering all available paths.
>>
>> So there's not really any "work" to distribute actually.
>>
>> Does this make sense?
>>
>> On 01/17/2012 03:05 PM, Richard Hauswald wrote:
>>
>> Tasos,
>> what prevents you from letting the workers pull the work from the master
>> instead of pushing it to the workers? Then you could let the workers pull
>> work packets containing e.g. 20 work items. After a worker has no work left,
>> it will push the results to the master and pull another work packet.
>> Regards,
>> Richard
>>
>> On Mon, Jan 16, 2012 at 6:41 PM, Tasos Laskos <tasos.laskos@gmail.com> wrote:
>>
>> Hi guys, it's been a while.
>>
>> I've got a tricky question for you today and I hope that we sort of get a
>> brainstorm going.
>>
>> I've recently implemented a system for audit distribution in the form of a
>> high-performance grid (won't self-promote), however an area which I left
>> alone (for the time being) was the crawl process.
>>
>> See, the way it works now is the master instance performs the initial crawl
>> and then calculates and distributes the audit workload amongst its slaves,
>> but the crawl takes place the old-fashioned way.
>>
>> As you might have guessed, the major setback is caused by the fact that it's
>> not possible to determine the workload of the crawl a priori.
>>
>> I've got a couple of naive ideas to parallelize the crawl just to get me
>> started:
>> * Assign crawl of subdomains to slaves -- no questions asked
>> * Briefly scope out the webapp structure and spread the crawl of the
>> distinguishable visible directories amongst the slaves.
>>
>> Or even a combination of the above if applicable.
>>
>> Both ideas are better than what I've got now and there aren't any downsides
>> to them even if the distribution turns out to be suboptimal.
>>
>> I'm curious though, has anyone faced a similar problem?
>> Any general ideas?
>>
>> Cheers,
>> Tasos Laskos.
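To round this off, a sketch of what a path-gathering worker in the pull model might look like, combining Richard's work packets of e.g. 20 items with the "only gathering all available paths" requirement Tasos describes. The master address, the /pull and /push endpoints and the packet size are assumptions made purely for illustration (Python, using the requests library for HTTP):

    import requests                    # third-party HTTP client; any would do
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    MASTER = "http://master.example:7331"   # hypothetical master endpoint
    PACKET_SIZE = 20                        # Richard's example packet size


    class LinkExtractor(HTMLParser):
        """Collects hrefs only; the worker gathers paths, nothing more."""

        def __init__(self, base_url):
            super().__init__()
            self.base_url, self.links = base_url, []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))


    def work_loop():
        while True:
            # Pull a packet of up to PACKET_SIZE work items from the master.
            packet = requests.get(f"{MASTER}/pull",
                                  params={"max": PACKET_SIZE}).json()
            if not packet:             # empty packet: the crawl pool is empty
                break
            results = {}
            for url in packet:
                try:
                    page = requests.get(url, timeout=10)
                    extractor = LinkExtractor(url)
                    extractor.feed(page.text)
                    results[url] = extractor.links
                except requests.RequestException:
                    results[url] = []  # report the item anyway so it completes
            # Push the extracted paths back, then loop for another packet.
            requests.post(f"{MASTER}/push", json=results)


    if __name__ == "__main__":
        work_loop()

Running one such worker per CPU core, including on the master host, gives the default load balancing Richard describes earlier in the thread.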