[WEB SECURITY] Re: Notes on Unicode, Encoding/Canonicalization Attacks, & Web Development

Arian J. Evans arian.evans at anachronic.com
Mon Jun 8 21:14:54 EDT 2009


On Mon, Jun 8, 2009 at 4:14 PM, Chris Weber<chris at casabasec.com> wrote:
> I'm giving a presentation titled "Unraveling Unicode" at Black Hat in Vegas.

Awesome. Given what I wrote below though, are you in agreement?

> To generalize the "#3 solutions" for most of these "#2 weakness nodes", a
> Globalization engineer who understands security will tell you to perform
> security checks more than once - first on the input before conversion, and
> then on the output after conversion, and sometimes in the middle. That's
> often the only way to deal with the unknown string-handling stuff that
> happens end to end.

Sure. That is incidentally how WhiteHat Sentinel tests for these types
of things. We still see and find unusual and unexpected behaviors even
after folks think they have tested this thoroughly. The more complex
the app, the more likely to find that sort of behavior (with even 2,
3, or more forms of transcoding or normalization/decoding going on).
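
A minimal sketch of the "check before and after conversion" idea, in
Python. Everything here is an assumption for illustration: the blocked
token list, the helper names, and the choice of NFKC as the conversion
step. A real application may transcode in several places and in other
ways.

```python
import unicodedata
from typing import Tuple

# Hypothetical list of tokens a filter cares about (illustrative only).
BLOCKED = ("'", '"', "<", ">", "../")


def contains_blocked(text: str) -> bool:
    """Return True if any blocked token appears in the text."""
    return any(token in text for token in BLOCKED)


def check_around_conversion(raw: str) -> Tuple[str, bool, bool]:
    """Run the same check on the input and again on the converted output."""
    flagged_before = contains_blocked(raw)
    # Assumed conversion step: NFKC compatibility normalization.
    converted = unicodedata.normalize("NFKC", raw)
    flagged_after = contains_blocked(converted)
    return converted, flagged_before, flagged_after


# U+FF07 FULLWIDTH APOSTROPHE is not an ASCII quote on the way in,
# but NFKC maps it to ' on the way out.
converted, before, after = check_around_conversion("\uff07 OR 1=1")
print(converted, before, after)  # ' OR 1=1 False True
```

A check that runs only before the conversion misses this input; the
second check catches it.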

Essentially you can seed an application with micro-taints. When we see
a micro-taint we recognize as a UID or GUID in a form or state unlike
what we expect it to be in, from there it is pretty easy to trace that
micro-taint backwards. (micro-taint == within a specific webapp or
privilege context)
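
A rough Python sketch of one way such a micro-taint could be seeded and
later recognized. This is an assumption-laden illustration, not a
description of how any particular scanner works: the taint format, the
probe characters, and the function names are all hypothetical.

```python
import unicodedata
import uuid

# Probe characters appended to the taint (illustrative choice):
# U+FF07 fullwidth apostrophe, U+FF1C fullwidth less-than, U+212A Kelvin sign.
PROBES = "\uff07\uff1c\u212a"


def make_taint() -> str:
    """Create a unique, easily searchable marker."""
    return "mt" + uuid.uuid4().hex[:12]


def seed_value(taint: str) -> str:
    """Value to submit to the application: the marker plus probe characters."""
    return taint + PROBES


def classify(taint: str, observed: str) -> str:
    """Guess what happened to the probes wherever the taint shows up again."""
    idx = observed.find(taint)
    if idx < 0:
        return "not found"
    start = idx + len(taint)
    tail = observed[start:start + len(PROBES)]
    if tail == PROBES:
        return "unchanged"
    for form in ("NFC", "NFKC"):
        if tail == unicodedata.normalize(form, PROBES):
            return form
    return "other transformation"


taint = make_taint()
# Simulate the application echoing the value back after NFKC normalization.
echoed = "<td>" + unicodedata.normalize("NFKC", seed_value(taint)) + "</td>"
print(classify(taint, echoed))  # NFKC
```

Because each taint is unique, a hit in an unexpected form points back to
the one request that seeded it.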

Macro-tainting, and finding these issues at a macro level across
multiple apps and users, is interesting and needs to be done.
Multi-app/multi-user interaction can also introduce these issues. This
is needed in situations where users of two different languages are
sharing data between multiple applications, and/or different-language
apps share data via shared MQs and DBs, and they are converting or
normalizing it as they pass it back and forth.

Regarding Macro-tainting across multiple applications -- we are not
quite there yet. We might combine macro-tainting with some other
technology doing runtime monitoring or use network-layer technologies
to trace macro-taints across apps as they transform to narrow down the
"who and where". That could get fairly complex and there are still
simpler things to solve first though.

The good news is that I don't know of any attackers at this level of
sophistication yet. Some of the SQL injection bots used interesting
encoding techniques, but mainly to evade filters/input blacklists, and
normalize some T-SQL characters across protocols (I think).

Actually...given the number of exploitable conditions on the web
without utilizing these techniques, maybe it isn't surprising
attackers are not using these techniques much yet. So once we clean up
the rest of the mess, we'll have these issues to deal with next. :)

Not sure I will be @ BH, but definitely look forward to your
presentation. I suspect next year we will be releasing a lot more
attack-node data on this subject. Besides yourself it seems like few
folks are doing any work here. And yet -- the issues are out there.

Great conversation, cheers


> -----Original Message-----
> From: arian.evans at gmail.com [mailto:arian.evans at gmail.com] On Behalf Of
> Arian J. Evans
> Sent: Monday, June 08, 2009 3:03 PM
> To: Chris Weber; websecurity at webappsec.org
> Subject: Notes on Unicode, Encoding/Canonicalization Attacks, & Web
> Development
> Chris -- After reflecting a bit I think we agree across the
> board...but are looking at the issues from different perspectives.
> It's a good subject and there are a lot of folks new to this on the
> list, so bear with me as I attempt to simplify explanation as much as
> possible.
> I think there are three main ways we can group and classify these issues:
> 1) Attack Nodes
> 2) Weakness Nodes
> 3) Software Development Pattern and Practice Issues/Guidelines
> This is what I mean by these buckets:
> Attack Nodes: Logical groupings of attacks that exploit weaknesses
> (with no knowledge of "why it exists or works" required).
> Weakness Nodes: Logical groupings of root flaws that may be
> exploitable by Attacks. (lack of encoding, inconsistent/lack of
> canonicalization, etc.)
> nota bene: Attacks & Weaknesses usually have many-to-one mappings.
> Meaning more than one type of attack, and many attack vectors, can all
> map to one or two root weaknesses.
> Software Development Pattern & Practice issues/guidelines: Logical
> groupings of issues into Design or Implementation "flaws" or
> "recommendations".
> Design Pattern issues include things like app-global KC/KD
> (NFKC/NFKD) normalization. Implementation practice
> issues/best-programming-practices cover things like making a mistake
> in coding a best-fit unicode-mapping library, and safe practices to
> escape/normalize specific strings, etc.
> ---
> I have been inherently grouping these issues into a notion of
> #1/Attack Node data (and some wild speculation about nodes #2 and #3).
> The majority of my focus these days is around node #1 & Black Box
> testing so I will stick to that.
> All of the above are valid and legitimate (and needed!) ways to look
> at and describe the problem.
> We definitely need folks to map #1 issues over into #2 and #3, (as
> well as vice-versa), especially for developers. I find the Attack Node
> issues surrounding this discussion are some of the most confusing to
> explain to both security folks and developers. These types of attacks
> often wind up looking like black-magic, especially when you get into 2
> and 3 unique layers of encoding and canonicalization or equivalence.
> Thanks for reminding me that there is more than one way to look at all of
> these.
> ---
> I have gotten quite a few offline questions about this subject. For
> folks new to dealing with the silly-string mess that encoding and
> canonicalization issues create -- here are some useful reference
> links:
> Unicode Normalization & Equivalence overview:
> http://en.wikipedia.org/wiki/Unicode_normalization
> Unicode Canonicalization overview:
> http://en.wikipedia.org/wiki/Canonicalization
> Unicode Normalization Forms: These are the normalization guidelines
> that Chris referred to in several places:
> http://unicode.org/reports/tr15/
> Unicode Code Charts for Symbols: These symbols include both
> false-familiar/confusables and symbols that fall into best-fit mapping
> charts.
> http://www.unicode.org/charts/symbols.html
> If you want to start looking for things that may be transcoded into or
> interpreted as a single-quote ' or greater-than > or ../.. these
> charts are a good place to start. I have found a number of visual
> false-familiar mappings when testing multi-language and multi-code
> page supporting software by referencing these charts.
> For those who have mentioned offline that they find all this
> challenging and confusing: you are not alone. Reading unicode docs
> gives me a mind-numbing headache. It's taken me years to try to get my
> head around all of the strange encodings, decodings, and literal
> transcodings in software....and I am positive that I still do not have
> all the answers. :)
> Feel free to ask me questions offline if you want.
> --
> Arian Evans
> When in danger or in doubt: accept all input and process it.
> On Sat, Jun 6, 2009 at 6:39 PM, Arian J. Evans<arian.evans at anachronic.com> wrote:
>> On Sat, Jun 6, 2009 at 5:43 PM, Chris Weber<chris at casabasec.com> wrote:
>>> Your discussion point #2 seems to digress, talking about the confusables
>>> and lookalikes don't seem to lend to the original subject.  Unless, you're
>>> suggesting that they somehow add to the canonicalization of strings that
>>> White Hat is seeing?
>> Yes, that is exactly what I am saying.
>> It is much easier to inject a CAST or a SELECT past a blacklist if
>> there are multiple characters canonicalized to As and Es in the
>> application.
>> And the same goes for things like double-quotes. Many (most?) language
>> character sets have confusables and false-familiars with U000/001
>> Unicode, and Latin/ASCII, and sometimes they are canonicalized as
>> such.
>> I have nothing that tells me, when I see a character conversion, if it
>> is a "best fit" mapping or an attempt to canonicalize confusables or
>> avoid name collision. So I put them all in the same bucket in terms of
>> security measurement/classification.
>> A developer using unicode would probably not put them in the same bucket.
>> -ae
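
To make the "As and Es" point in the quoted message concrete, here is a
small Python sketch. It assumes a hypothetical application that runs a
keyword blacklist first and then folds accented letters to ASCII later
in the pipeline; the blacklist, the folding function, and the payload
are illustrative and not taken from any real product.

```python
import unicodedata

# Hypothetical keyword blacklist (illustrative only).
BLACKLIST = ("SELECT", "CAST")


def passes_blacklist(text: str) -> bool:
    """Return True if none of the blacklisted keywords appear in the text."""
    upper = text.upper()
    return not any(word in upper for word in BLACKLIST)


def fold_to_ascii(text: str) -> str:
    """Assumed later step: decompose, then drop combining marks (accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


payload = "S\u00c9L\u00c8CT"  # S, E-acute, L, E-grave, C, T

print(passes_blacklist(payload))                 # True: the filter sees no keyword
print(fold_to_ascii(payload))                    # SELECT
print(passes_blacklist(fold_to_ascii(payload)))  # False: keyword appears after folding
```

Whether the fold is a best-fit mapping, a normalization, or a
confusable cleanup makes no difference to the filter: the keyword only
exists after it runs.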
