monoids-0.1.36: Monoids, specialized containers and a general map/reduce framework
Data.Monoid.Lexical.UTF8.Decoder
Portability: non-portable (MPTCs)
Stability: experimental
Maintainer: ekmett@gmail.com
Description

UTF-8 encoded Unicode characters can be parsed both forwards and backwards, since the start of each Char is clearly marked. This Monoid accumulates information about the characters represented and reduces that information using a CharReducer, which is just a Reducer Monoid that also knows what it wants to do with an invalidChar -- a run of Word8 values that does not form a valid UTF-8 character.
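For intuition, here is a minimal sketch of the kind of reducer the decoder feeds, assuming only what is described above: it collects decoded characters and substitutes U+FFFD for each invalid byte run. The names Decoded, goodChar and invalidBytes are hypothetical; the real class is CharReducer from Data.Monoid.Reducer.Char.

  import Data.Word (Word8)

  -- Toy stand-in for a CharReducer (illustrative names, not the library's API):
  -- decoded characters are collected into a String and every invalid byte run
  -- is replaced by the Unicode replacement character U+FFFD.
  newtype Decoded = Decoded String

  instance Semigroup Decoded where
    Decoded a <> Decoded b = Decoded (a ++ b)

  instance Monoid Decoded where
    mempty = Decoded ""

  goodChar :: Char -> Decoded         -- what to do with a valid Char
  goodChar c = Decoded [c]

  invalidBytes :: [Word8] -> Decoded  -- what to do with a bad byte run
  invalidBytes _ = Decoded "\xFFFD"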

As this monoid parses characters it feeds them upstream to the underlying CharReducer. Efficient left-to-right and right-to-left traversals are supplied so that a lazy ByteString can be parsed efficiently: it is split into strict chunks, the traversal is batched over each chunk, and the chunk edges are then stitched together.
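A rough sketch of the left-to-right chunked reduction, under the assumption that some injection of a single Word8 into the monoid is available (here a plain function argument named inject); the library's real traversal also runs right-to-left and stitches characters that straddle chunk boundaries.

  import qualified Data.ByteString as B
  import qualified Data.ByteString.Lazy as BL
  import Data.Word (Word8)

  -- Fold each strict chunk of a lazy ByteString into the monoid on its own,
  -- then append the per-chunk results.  'inject' is a placeholder for
  -- whatever turns one byte into the monoid.
  reduceChunked :: Monoid m => (Word8 -> m) -> BL.ByteString -> m
  reduceChunked inject = mconcat . map reduceChunk . BL.toChunks
    where
      reduceChunk = B.foldl' (\acc w -> acc <> inject w) mempty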

Because this needs to be a Monoid and should return exactly the same result whether it parses forwards or backwards, it chooses to parse only canonical UTF-8, unlike most Haskell UTF-8 parsers, which blissfully accept illegal overlong encodings of a character.

This also avoids a class of potential security issues that arise in some scenarios:

http://prowebdevelopmentblog.com/content/big-overhaul-java-utf-8-charset
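As a concrete illustration of the canonical-only rule (a sketch, not the library's decoder): for two-byte sequences, canonicality means the lead byte must be at least 0xC2, so the overlong form 0xC0 0xAF of '/' is rejected rather than decoded.

  import Data.Bits (shiftL, (.&.), (.|.))
  import Data.Word (Word8)

  -- Decode a two-byte UTF-8 sequence only if it is canonical: a two-byte
  -- sequence must encode a code point of at least U+0080, so lead bytes
  -- 0xC0 and 0xC1 (overlong encodings) are never accepted.
  canonicalTwoByte :: Word8 -> Word8 -> Maybe Char
  canonicalTwoByte b0 b1
    | b0 >= 0xC2 && b0 <= 0xDF && b1 .&. 0xC0 == 0x80 =
        Just (toEnum ((fromIntegral (b0 .&. 0x1F) `shiftL` 6) .|. fromIntegral (b1 .&. 0x3F)))
    | otherwise = Nothing  -- e.g. canonicalTwoByte 0xC0 0xAF == Nothing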

NB: Due to the naive use of a list to track the tail of an unfinished character, this may exhibit O(n^2) behavior when parsing backwards along an invalid sequence consisting of a large number of bytes that all claim to be continuation bytes of a character.

Documentation
module Data.Monoid.Reducer.Char
data UTF8 m
runUTF8 :: CharReducer m => UTF8 m -> m