[WEB SECURITY] Notes on Unicode, Encoding/Canonicalization Attacks, & Web Development

Arian J. Evans arian.evans at anachronic.com
Mon Jun 8 18:02:31 EDT 2009

Chris -- After reflecting a bit I think we agree across the
board...but are looking at the issues from different perspectives.
It's a good subject and there are a lot of folks new to this on the
list, so bear with me as I attempt to simplify the explanation as much as
possible.

I think there are three main ways we can group and classify these issues:

1) Attack Nodes
2) Weakness Nodes
3) Software Development Pattern and Practice Issues/Guidelines

This is what I mean by these buckets:

Attack Nodes: Logical groupings of attacks that exploit weaknesses
(with no knowledge of "why it exists or works" required).

Weakness Nodes: Logical groupings of root flaws that may be
exploitable by Attacks. (lack of encoding, inconsistent/lack of
canonicalization, etc.)

nota bene: Attacks & Weaknesses usually have many-to-one mappings,
meaning that more than one type of attack, and many attack vectors, can
all map to one or two root weaknesses.

Software Development Pattern & Practice issues/guidelines: Logical
groupings of issues into Design or Implementation "flaws" or

Design Pattern issues include things like app-global NFKC/NFKD
normalization.  Implementation practice
issues/best-programming-practices cover things like making a mistake
in coding a best-fit unicode-mapping library, and safe practices to
escape/normalize specific strings, etc.
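
As a rough illustration of the app-global normalization idea above, here
is a minimal Python sketch. The function name normalize_input is
hypothetical, and NFKC is only one possible choice of form; the right
form depends on the application.

```python
import unicodedata

def normalize_input(text: str, form: str = "NFKC") -> str:
    """Hypothetical app-wide entry point: normalize each inbound string once."""
    return unicodedata.normalize(form, text)

composed = "caf\u00e9"      # 'e with acute' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
ligature = "\ufb01le"       # LATIN SMALL LIGATURE FI followed by "le"

# The two spellings differ as raw strings but agree after normalization.
assert composed != decomposed
assert normalize_input(composed) == normalize_input(decomposed)

# Compatibility forms (KC/KD) also fold the ligature; canonical NFC does not.
assert normalize_input(ligature) == "file"
assert unicodedata.normalize("NFC", ligature) == ligature
```

The point of doing this in one place is that every later comparison,
filter, and lookup sees the same representation of a string.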


I have been inherently grouping these issues into a notion of
#1/Attack Node data (and some wild speculation about nodes #2 and #3).
The majority of my focus these days is around node #1 & Black Box
testing so I will stick to that.

All of the above are valid and legitimate (and needed!) ways to look
at and describe the problem.

We definitely need folks to map #1 issues over into #2 and #3, (as
well as vice-versa), especially for developers. I find the Attack Node
issues surrounding this discussion are some of the most confusing to
explain to both security folks and developers. These types of attacks
often wind up looking like black-magic, especially when you get into 2
and 3 unique layers of encoding and canonicalization or equivalence.
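
To make the layering less like black magic, here is a small Python
sketch of two layers of URL percent-encoding. The filter, the payload,
and the function names are made up for illustration; the assumption is
that something downstream of the filter decodes the value a second time.

```python
from urllib.parse import unquote

def naive_filter_allows(raw: str) -> bool:
    """Hypothetical check: decode once, then reject path traversal."""
    return "../" not in unquote(raw)

def decode_until_stable(raw: str, max_rounds: int = 5) -> str:
    """Keep decoding until the string stops changing, or give up."""
    current = raw
    for _ in range(max_rounds):
        decoded = unquote(current)
        if decoded == current:
            return current
        current = decoded
    raise ValueError("too many encoding layers")

# "%25" is an encoded "%", so "%252e" decodes to "%2e" and then to ".".
payload = "%252e%252e%252fetc%252fpasswd"

once = unquote(payload)    # one layer removed: still percent-encoded
twice = unquote(once)      # second layer removed: the traversal appears

assert naive_filter_allows(payload)          # the filter sees nothing wrong
assert "../" in twice                        # but a second decode reveals it
assert decode_until_stable(payload) == twice
```

Decoding to a fixed point before checking is one way to compare like
with like; rejecting input that is still encoded after one pass is
another.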

Thanks for reminding me that there is more than one way to look at all of these.



I have gotten quite a few offline questions about this subject. For
folks new to dealing with the silly-string mess that encoding and
canonicalization issues create -- here are some useful references:

Unicode Normalization & Equivalence overview:

Unicode Canonicalization overview:

Unicode Normalization Forms: These are the normalization guidelines
that Chris referred to in several places:


Unicode Code Charts for Symbols: These symbols include both
false-familiar/confusables and symbols that fall into best-fit mapping


If you want to start looking for things that may be transcoded into or
interpreted as a single-quote ' or greater-than > or ../.. these
charts are a good place to start. I have found a number of visual
false-familiar mappings when testing multi-language and multi-code
page supporting software by referencing these charts.
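
A complementary way to build a candidate list is to ask a Unicode
library which code points fold into the characters you care about. The
sketch below uses Python's unicodedata with NFKC as a stand-in; it is an
assumption that the target application behaves similarly, and best-fit
code page mappings will produce a different (often larger) set. The
function name is hypothetical.

```python
import sys
import unicodedata

def find_normalizing_chars(targets, form="NFKC"):
    """Map each target string to the single code points that normalize to it."""
    hits = {target: [] for target in targets}
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates, which are not characters
        ch = chr(cp)
        folded = unicodedata.normalize(form, ch)
        if folded != ch and folded in hits:
            hits[folded].append(ch)
    return hits

# Scan once for several targets, including the two-dot sequence "..".
results = find_normalizing_chars(["'", ">", ".", "..", "/"])

for target, chars in results.items():
    names = ", ".join(
        f"U+{ord(c):04X} {unicodedata.name(c, '?')}" for c in chars
    )
    print(f"{target!r}: {names}")
```

Each hit is only a candidate test string; whether it is actually
converted depends on what the software under test does with it.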

For those who have mentioned offline that they find all this
challenging and confusing: you are not alone. Reading unicode docs
gives me a mind-numbing headache. It's taken me years to try to get my
head around all of the strange encodings, decodings, and literal
transcodings in software....and I am positive that I still do not have
all the answers. :)

Feel free to ask me questions offline if you want.

Arian Evans
When in danger or in doubt: accept all input and process it.

On Sat, Jun 6, 2009 at 6:39 PM, Arian J.
Evans<arian.evans at anachronic.com> wrote:
> On Sat, Jun 6, 2009 at 5:43 PM, Chris Weber<chris at casabasec.com> wrote:
>> Your discussion point #2 seems to digress, talking about the confusables and
>> lookalikes don't seem to lend to the original subject.  Unless, you're
>> suggesting that they somehow add to the canonicalization of strings that
>> White Hat is seeing?
> Yes, that is exactly what I am saying.
> It is much easier to inject a CAST or a SELECT past a blacklist if
> there are multiple characters canonicalized to As and Es in the
> application.
> And the same goes for things like double-quotes. Many (most?) language
> character sets have confusables and false-familiars with U000/001
> Unicode, and Latin/ASCII, and sometimes they are canonicalized as
> such.
> I have nothing that tells me, when I see a character conversion, if it
> is a "best fit" mapping or an attempt to canonicalize confusables or
> avoid name collision. So I put them all in the same bucket in terms of
> security measurement/classification.
> A developer using unicode would probably not put them in the same bucket.
> -ae
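
A minimal Python sketch of the blacklist point in the quoted message,
assuming the application checks input before it canonicalizes it. The
blacklist, the function names, and the use of NFKC are all illustrative
assumptions, not a description of any particular product.

```python
import unicodedata

BLACKLIST = ("select", "cast")  # hypothetical keyword blacklist

def blacklist_allows(text: str) -> bool:
    """Hypothetical filter: reject input containing a blacklisted keyword."""
    lowered = text.lower()
    return not any(word in lowered for word in BLACKLIST)

def canonicalize(text: str) -> str:
    """Stand-in for whatever canonicalization the application performs."""
    return unicodedata.normalize("NFKC", text)

# "SELECT" spelled with fullwidth Latin letters (U+FF21 block), then " 1".
payload = "\uff33\uff25\uff2c\uff25\uff23\uff34 1"

# Checked first, the payload passes: no ASCII "select" is present yet.
assert blacklist_allows(payload)

# Canonicalized afterwards, it becomes the keyword the blacklist wanted to stop.
assert canonicalize(payload) == "SELECT 1"

# Checking after canonicalization catches it.
assert not blacklist_allows(canonicalize(payload))
```

The more characters an application folds into A, E, or a quote, the
more spellings a blacklist placed before that step has to anticipate.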
