Empathy List Archives

Master

 I've leaned towards #2 from the get go and the following could help
 reduce redundancy -- from a previous message:

 All instances converging in intervals about the collective paths
 they've discovered in order for each to keep a local look-up cache
 (no latency introduced since no-one would rely/wait on this to be up
 to date or even available)

 This info could be pulled along with the links to follow from the
 master.

 It's a weird problem to solve efficiently this one... :)

 On 01/17/2012 05:45 PM, Richard Hauswald wrote:

     Yeah, you are right. URL's should be unique in the work queue.
     Otherwise - in case of circular links between the pages - you could
     end up in an endless loop :-o

     If a worker should just extract paths or do a full crawl depends on
     the duration of a full crawl. I can think of 3 different ways,
     depending on your situation:
     1. Do a full crawl and post back results + extracted paths
     2. Have workers do 2 different jobs, 1 job is to extract paths,
     1 job
     is to do the actual crawl
     3. Get a path and post back the whole page content so the master can
     store it. Then have a worker pool assigned for extracting paths and
     one for a full crawl, both based on the stored page content.
     But this really depends on the network speed, the load the workers
     create on the web application to crawl(in case its not just a simple
     html file based web site), the duration of a full crawl and the
     number
     of different paths in the application.

     Is there still something I missed or would one of 1,2,3 solve
     your problem?

     On Tue, Jan 17, 2012 at 4:07 PM, Tasos
     Laskos<tasos.laskos@gmail.com <mailto:tasos.laskos@gmail.com>>
       wrote:

         Well, websites are a crazy mesh (or mess, sometimes) and
         lots of pages link
         to other pages so workers will eventually end up being
         redundant.

         Of course, I'm basing my response on the assumption that
         your model has the
         workers actually crawl and not simply visit a given number
         of pages, parse
         and then sent back the paths they've extracted.
         If so then pardon me, I misunderstood.

         On 01/17/2012 05:01 PM, Richard Hauswald wrote:

             Why would workers visiting the same pages?

             On Tue, Jan 17, 2012 at 3:44 PM, Tasos
             Laskos<tasos.laskos@gmail.com
             <mailto:tasos.laskos@gmail.com>>
               wrote:

                 You're right, it does sound good but it would still
                 suffer from the same
                 problem, workers visiting the same pages.

                 Although this could be somewhat mitigated the same
                 way I described in an
                 earlier post (which I think hasn't been moderated yet).

                 I've bookmarked it for future reference, thanks man.

                 On 01/17/2012 04:20 PM, Richard Hauswald wrote:

                         Btw, did you intentionally send this e-mail
                         privately?

                     No, clicked the wrong button inside gmail - sorry.

                     I thought your intend was to divide URLs by
                     subdomains, run a crawl,
                     "Briefly scope out the webapp structure and
                     spread the crawl of the
                     distinguishable visible directories amongst the
                     slaves". This would
                     not redistribute new discovered URLs and require
                     a special initial
                     setup step (divide URLs by subdomains). And by
                     pushing tasks to the
                     workers you'd have to take care about the load
                     of the workers which
                     means you'd have to implement a scheduling /
                     load balancing policy. By
                     pulling the work this would happen
                     automatically. To make use of multi
                     core CPUs you could make your workers scan for
                     the systems count of
                     CPU cores and spawn workers threads in a defined
                     ratio. You could also
                     run a worker on the master host. This should
                     lead to a good load
                     balancing by default. So it's not really
                     ignoring scheduling details.

                     On Tue, Jan 17, 2012 at 2:55 PM, Tasos
                     Laskos<tasos.laskos@gmail.com
                     <mailto:tasos.laskos@gmail.com>>
                       wrote:

                         Yep, that's similar to what I previously
                         posted with the difference
                         that
                         the
                         master in my system won't be a slacker so
                         he'll be the lucky guy to
                         grab
                         the
                         seed URL.

                         After that things are pretty much the same
                         -- ignoring the scheduling
                         details etc.

                         Btw, did you intentionally send this e-mail
                         privately?

                         On 01/17/2012 03:48 PM, Richard Hauswald wrote:

                             Yes, I didn't understand the nature of
                             your problem the first time ...

                             So, you could still use the pull
                             principle. To make it simple I'll not
                             consider work packets/batching. This
                             could be used later to further
                             improve performance by reducing
                             "latency" / "seek" times.

                             So what about the following process:
                             1. Create a bunch of workers
                             2. Create a List of URLs which can be
                             considered the work queue.
                             Initially filled with one element: the
                             landing page URL in state NEW.
                             3. Let all your workers poll the master
                             for a single work item in
                             state NEW(pay attention to synchronize
                             this step on the master). One
                             of them is the lucky guy and gets the
                             landing page URL. The master
                             will update work item to state
                             PROCESSING( you may append a starting
                             time, which could be used for
                             reassigning already assigned work items
                             after a timeout). All the other workers
                             will still be idle.
                             4. The lucky guy parses the page for new
                             URLs and does whatever it
                             should also do.
                             5. The lucky guy posts the results + the
                             parsed URLs to the master.
                             6. The master stores the results, pushes
                             the new URLs into the work
                             queue with state NEW and updates the
                             work item to state COMPLETED. If
                             there is only one new URL we are not
                             lucky but if there are 10 we'd
                             have now 10 work items to distribute.
                             7. Continue until all work items are in
                             state COMPLETED.

                             Does this make sense?

                             On Tue, Jan 17, 2012 at 2:15 PM, Tasos
                             Laskos<tasos.laskos@gmail.com
                             <mailto:tasos.laskos@gmail.com>>
                               wrote:

                                 What prevents this is the nature of
                                 the crawl process.
                                 What I'm trying to achieve here is
                                 not spread the workload but
                                 actually
                                 find
                                 it.

                                 I'm not interested in parsing the
                                 pages or any sort of processing but
                                 only
                                 gather all available paths.

                                 So there's not really any "work" to
                                 distribute actually.

                                 Does this make sense?

                                 On 01/17/2012 03:05 PM, Richard
                                 Hauswald wrote:

                                     Tasos,
                                     what prevents you from let the
                                     workers pull the work from the
                                     master
                                     instead of pushing it to the
                                     workers? Then you could let the
                                     workers
                                     pull work packets containing
                                     e.g. 20 work items. After a
                                     worker has
                                     no
                                     work left, it will push the
                                     results to the master and pull
                                     another
                                     work packet.
                                     Regards,
                                     Richard

                                     On Mon, Jan 16, 2012 at 6:41 PM,
                                     Tasos
                                     Laskos<tasos.laskos@gmail.com
                                     <mailto:tasos.laskos@gmail.com>>
                                       wrote:

                                         Hi guys, it's been a while.

                                         I've got a tricky question
                                         for you today and I hope
                                         that we sort of
                                         get
                                         a
                                         brainstorm going.

                                         I've recently implemented a
                                         system for audit
                                         distribution in the
                                         form
                                         of
                                         a
                                         high performance grid (won't
                                         self-promote) however an
                                         area which I
                                         left
                                         alone (for the time being)
                                         was the crawl process.

                                         See, the way it works now is
                                         the master instance performs the
                                         initial
                                         crawl
                                         and then calculates and
                                         distributes the audit
                                         workload amongst its
                                         slaves
                                         but the crawl takes place
                                         the old fashioned way.

                                         As you might have guessed
                                         the major set back is caused
                                         by the fact
                                         that
                                         it's
                                         not possible to determine
                                         the workload of the crawl a
                                         priori.

                                         I've got a couple of naive
                                         ideas to parallelize the
                                         crawl just to
                                         get
                                         me
                                         started:
                                           * Assign crawl of
                                         subdomains to slaves -- no
                                         questions asked
                                           * Briefly scope out the
                                         webapp structure and spread
                                         the crawl of
                                         the
                                         distinguishable visible
                                         directories amongst the slaves.

                                         Or even a combination of the
                                         above if applicable.

                                         Both ideas are better than
                                         what I've got now and there
                                         aren't any
                                         downsides
                                         to them even if the
                                         distribution turns out to be
                                         suboptimal.

                                         I'm curious though, has
                                         anyone faced a similar problem?
                                         Any general ideas?

                                         Cheers,
                                         Tasos Laskos.

                                         _________________________________________________
                                         The Web Security Mailing List

                                         WebSecurity RSS Feed
                                         http://www.webappsec.org/rss/__websecurity.rss
                                         <http://www.webappsec.org/rss/websecurity.rss>

                                         Join WASC on LinkedIn
                                         http://www.linkedin.com/e/gis/__83336/4B20E4374DBA
                                         <http://www.linkedin.com/e/gis/83336/4B20E4374DBA>

                                         WASC on Twitter
                                         http://twitter.com/wascupdates

                                         websecurity@lists.webappsec.__org
                                         <mailto:websecurity@lists.webappsec.org>

                                         http://lists.webappsec.org/__mailman/listinfo/websecurity___lists.webappsec.org
                                         <http://lists.webappsec.org/mailman/listinfo/websecurity_lists.webappsec.org>

 _________________________________________________
 The Web Security Mailing List

 WebSecurity RSS Feed
 http://www.webappsec.org/rss/__websecurity.rss
 <http://www.webappsec.org/rss/websecurity.rss>

 Join WASC on LinkedIn
 http://www.linkedin.com/e/gis/__83336/4B20E4374DBA
 <http://www.linkedin.com/e/gis/83336/4B20E4374DBA>

 WASC on Twitter
 http://twitter.com/wascupdates

 websecurity@lists.webappsec.__org
 <mailto:websecurity@lists.webappsec.org>
 http://lists.webappsec.org/__mailman/listinfo/websecurity___lists.webappsec.org
 <http://lists.webappsec.org/mailman/listinfo/websecurity_lists.webappsec.org>

websecurity@lists.webappsec.org

Re: [WEB SECURITY] Parallelizing the crawl