Streamly.Internal.Unicode.Stream

Setup

To execute the code examples provided in this module in ghci, please run the following commands first.

>>> :m

>>> import qualified Data.Char as Char
>>> import qualified Streamly.Data.Fold as Fold
>>> import qualified Streamly.Data.Stream as Stream
>>> import qualified Streamly.Data.Unfold as Unfold
>>> import qualified Streamly.Unicode.Stream as Unicode

For APIs that have not been released yet.

>>> :set -XMagicHash
>>> import qualified Streamly.Internal.Unicode.Stream as Unicode

Construction (Decoding)

decodeLatin1 :: Monad m => Stream m Word8 -> Stream m Char

Decode a stream of bytes to Unicode characters by mapping each byte to the Unicode Char with the same code point, in the 0-255 range.
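
A small illustration, relying on the imports from the Setup section. The byte values are chosen for the example: 72 and 105 are ASCII, and 233 maps to U+00E9, which show renders as \233.

>>> Stream.toList $ Unicode.decodeLatin1 (Stream.fromList [72, 105, 233])
"Hi\233"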

UTF-8 Byte Stream Decoding

data CodingFailureMode

Constructors

TransliterateCodingFailure
ErrorOnCodingFailure
DropOnCodingFailure

Instances

Show CodingFailureMode

Defined in Streamly.Internal.Unicode.Stream

Methods

showsPrec :: Int -> CodingFailureMode -> ShowS
show :: CodingFailureMode -> String
showList :: [CodingFailureMode] -> ShowS

writeCharUtf8' :: Monad m => Parser Word8 m Char

parseCharUtf8With :: Monad m => CodingFailureMode -> Parser Word8 m Char

decodeUtf8 :: Monad m => Stream m Word8 -> Stream m Char

Decode a UTF-8 encoded bytestream to a stream of Unicode characters. Any invalid codepoint encountered is replaced with the Unicode replacement character U+FFFD.
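
A sketch of the replacement behavior, with byte values chosen for illustration: E2 82 AC is the UTF-8 encoding of U+20AC, while a lone FF byte never occurs in valid UTF-8 and is expected to decode to U+FFFD (shown as \65533).

>>> Stream.toList $ Unicode.decodeUtf8 (Stream.fromList [0xE2, 0x82, 0xAC])
"\8364"
>>> Stream.toList $ Unicode.decodeUtf8 (Stream.fromList [72, 0xFF, 105])
"H\65533i"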

decodeUtf8' :: Monad m => Stream m Word8 -> Stream m Char

Decode a UTF-8 encoded bytestream to a stream of Unicode characters. The function throws an error if an invalid codepoint is encountered.

decodeUtf8_ :: Monad m => Stream m Word8 -> Stream m Char

Decode a UTF-8 encoded bytestream to a stream of Unicode characters. Any invalid codepoint encountered is dropped.
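
For comparison, a stray FF byte passed through the dropping variant. This assumes the Internal import from the Setup section is in effect, in case decodeUtf8_ is not exported by the released module.

>>> Stream.toList $ Unicode.decodeUtf8_ (Stream.fromList [72, 0xFF, 105])
"Hi"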

UTF-16 Byte Stream Decoding

decodeUtf16le' :: Monad m => Stream m Word16 -> Stream m Char

Similar to decodeUtf16le but throws an error if an invalid codepoint is encountered.

decodeUtf16le :: Monad m => Stream m Word16 -> Stream m Char

Decode a UTF-16 encoded stream to a stream of Unicode characters. Any invalid codepoint encountered is replaced with the Unicode replacement character U+FFFD.

The Word16s are expected to be in the little-endian byte order.

Resumable UTF-8 Byte Stream Decoding

data DecodeError

Constructors

DecodeError !DecodeState !CodePoint

Instances

Show DecodeError

Defined in Streamly.Internal.Unicode.Stream

Methods

showsPrec :: Int -> DecodeError -> ShowS
show :: DecodeError -> String
showList :: [DecodeError] -> ShowS

type DecodeState = Word8

type CodePoint = Int

decodeUtf8Either :: Monad m => Stream m Word8 -> Stream m (Either DecodeError Char)

Pre-release

resumeDecodeUtf8Either :: Monad m => DecodeState -> CodePoint -> Stream m Word8 -> Stream m (Either DecodeError Char)

Pre-release

UTF-8 Array Stream Decoding

decodeUtf8Chunks :: MonadIO m => Stream m (Array Word8) -> Stream m Char

Like decodeUtf8 but for a chunked stream. It may be slightly faster than flattening the stream and then decoding with decodeUtf8.
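
A hedged sketch of decoding a chunked stream. The Array import is an addition not listed in the Setup section, and the chunk boundary is deliberately placed inside the three-byte encoding of U+20AC to illustrate that a character may span two chunks.

>>> import qualified Streamly.Data.Array as Array
>>> Stream.toList $ Unicode.decodeUtf8Chunks $ Stream.fromList [Array.fromList [0xE2, 0x82], Array.fromList [0xAC, 0x21]]
"\8364!"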

decodeUtf8Chunks' :: MonadIO m => Stream m (Array Word8) -> Stream m Char

Like decodeUtf8' but for a chunked stream. It may be slightly faster than flattening the stream and then decoding with decodeUtf8'.

decodeUtf8Chunks_ :: MonadIO m => Stream m (Array Word8) -> Stream m Char

Like decodeUtf8_ but for a chunked stream. It may be slightly faster than flattening the stream and then decoding with decodeUtf8_.

Elimination (Encoding)

Latin1 Encoding to Byte Stream

encodeLatin1 :: Monad m => Stream m Char -> Stream m Word8

Like encodeLatin1' but silently maps input codepoints beyond 255 to arbitrary Latin1 chars in the 0-255 range. No error or exception is thrown when such a mapping occurs.

encodeLatin1' :: Monad m => Stream m Char -> Stream m Word8

Encode a stream of Unicode characters to bytes by mapping each character to a byte in the 0-255 range. Throws an error if the input stream contains characters beyond 255.

encodeLatin1_ :: Monad m => Stream m Char -> Stream m Word8

Like encodeLatin1 but drops the input characters beyond 255.
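
As an illustration with inputs chosen for the example: all characters of "Hi\233" fit in one byte, whereas U+20AC (written \8364) lies beyond 255 and is dropped by encodeLatin1_.

>>> Stream.toList $ Unicode.encodeLatin1' (Stream.fromList "Hi\233")
[72,105,233]
>>> Stream.toList $ Unicode.encodeLatin1_ (Stream.fromList "a\8364b")
[97,98]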

UTF-8 Encoding to Byte Stream

readCharUtf8' :: Monad m => Unfold m Char Word8

readCharUtf8 :: Monad m => Unfold m Char Word8

readCharUtf8_ :: Monad m => Unfold m Char Word8

encodeUtf8 :: Monad m => Stream m Char -> Stream m Word8

Encode a stream of Unicode characters to a UTF-8 encoded bytestream. Any invalid characters (surrogate code points, U+D800-U+DFFF) in the input stream are replaced by the Unicode replacement character U+FFFD.

encodeUtf8' :: Monad m => Stream m Char -> Stream m Word8

Encode a stream of Unicode characters to a UTF-8 encoded bytestream. When any invalid character (a surrogate code point, U+D800-U+DFFF) is encountered in the input stream, the function errors out.

encodeUtf8_ :: Monad m => Stream m Char -> Stream m Word8

Encode a stream of Unicode characters to a UTF-8 encoded bytestream. Any invalid characters (surrogate code points, U+D800-U+DFFF) in the input stream are dropped.
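
A sketch contrasting the replacing and dropping variants. The inputs are illustrative: U+20AC encodes to the bytes E2 82 AC, U+D800 is a surrogate code point and therefore invalid, and EF BF BD (239, 191, 189) is the UTF-8 encoding of U+FFFD.

>>> Stream.toList $ Unicode.encodeUtf8 (Stream.fromList "\8364")
[226,130,172]
>>> Stream.toList $ Unicode.encodeUtf8 (Stream.fromList "a\xD800")
[97,239,191,189]
>>> Stream.toList $ Unicode.encodeUtf8_ (Stream.fromList "a\xD800")
[97]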

encodeStrings :: MonadIO m => (Stream m Char -> Stream m Word8) -> Stream m String -> Stream m (Array Word8)

Encode a stream of String using the supplied encoding scheme. Each string is encoded as an Array Word8.
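
A minimal usage sketch, passing encodeUtf8 as the encoding scheme. The Array import is an addition not listed in the Setup section; Array.toList is used here only to make the resulting arrays printable.

>>> import qualified Streamly.Data.Array as Array
>>> Stream.toList $ fmap Array.toList $ Unicode.encodeStrings Unicode.encodeUtf8 (Stream.fromList ["ab", "c"])
[[97,98],[99]]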

UTF-8 Encoding to Chunk Stream

UTF-16 Encoding to Byte Stream

encodeUtf16le' :: Monad m => Stream m Char -> Stream m Word16

Similar to encodeUtf16le but throws an error if any invalid character is encountered.

encodeUtf16le :: Monad m => Stream m Char -> Stream m Word16

Encode a stream of Unicode characters to a UTF-16 encoded stream. Any invalid characters in the input stream are replaced by the Unicode replacement character U+FFFD.

The resulting Word16s are encoded in little-endian byte order.
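
A round-trip sketch pairing this encoder with decodeUtf16le. The input is illustrative: U+1F600 (written \128512) lies outside the Basic Multilingual Plane, so it is expected to travel as a surrogate pair of two Word16 values.

>>> Stream.toList $ Unicode.decodeUtf16le $ Unicode.encodeUtf16le $ Stream.fromList "Hi\128512"
"Hi\128512"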

Transformation

stripHead :: Monad m => Stream m Char -> Stream m Char

Remove leading whitespace from a character stream.

>>> stripHead = Stream.dropWhile Char.isSpace
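
Usage, as a small sketch (only the leading spaces are removed; the trailing one is kept):

>>> Stream.toList $ Unicode.stripHead (Stream.fromList "  hi ")
"hi "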

Pre-release

lines :: Monad m => Fold m Char b -> Stream m Char -> Stream m b

Fold each line of the stream using the supplied Fold and stream the result.

Definition:

>>> lines f = Stream.foldMany (Fold.takeEndBy_ (== '\n') f)

Usage:

>>> Stream.toList $ Unicode.lines Fold.toList (Stream.fromList "line1\nline2\nline3\n\n\n")
["line1","line2","line3","",""]

Pre-release

words :: Monad m => Fold m Char b -> Stream m Char -> Stream m b

Fold each word of the stream using the supplied Fold.

Definition:

>>> words = Stream.wordsBy Char.isSpace

Usage:

>>> Stream.toList $ Unicode.words Fold.toList (Stream.fromList " ab  cd   ef ")
["ab","cd","ef"]

Pre-release

unlines :: MonadIO m => Unfold m a Char -> Stream m a -> Stream m Char

Unfold the elements of a stream to character streams using the supplied Unfold and concatenate the results, suffixing a newline character \n to each stream.

Definition:

>>> unlines = Stream.unfoldEachEndBy '\n'
>>> unlines = Stream.unfoldEachEndBySeq "\n" Unfold.fromList
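
Usage, as a sketch that unfolds each String with Unfold.fromList (the Unfold import is repeated here in case it is not already in scope):

>>> import qualified Streamly.Data.Unfold as Unfold
>>> Stream.toList $ Unicode.unlines Unfold.fromList (Stream.fromList ["a", "b"])
"a\nb\n"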

Pre-release

unwords :: MonadIO m => Unfold m a Char -> Stream m a -> Stream m Char

Unfold the elements of a stream to character streams using the supplied Unfold and concatenate the results, with a whitespace character infixed between the streams.

>>> unwords = Stream.unfoldEachSepBy ' '
>>> unwords = Stream.unfoldEachSepBySeq " " Unfold.fromList
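
Usage, as a sketch that unfolds each String with Unfold.fromList (the Unfold import is repeated here in case it is not already in scope). The separator appears only between the elements, not after the last one.

>>> import qualified Streamly.Data.Unfold as Unfold
>>> Stream.toList $ Unicode.unwords Unfold.fromList (Stream.fromList ["a", "b"])
"a b"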

Pre-release

StreamD UTF8 Encoding / Decoding transformations.

decodeUtf8D :: Monad m => Stream m Word8 -> Stream m Char

decodeUtf8D' :: Monad m => Stream m Word8 -> Stream m Char

decodeUtf8D_ :: Monad m => Stream m Word8 -> Stream m Char

encodeUtf8D :: Monad m => Stream m Char -> Stream m Word8

See section "3.9 Unicode Encoding Forms" in https://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf

encodeUtf8D' :: Monad m => Stream m Char -> Stream m Word8

encodeUtf8D_ :: Monad m => Stream m Char -> Stream m Word8

decodeUtf8EitherD :: Monad m => Stream m Word8 -> Stream m (Either DecodeError Char)

resumeDecodeUtf8EitherD :: Monad m => DecodeState -> CodePoint -> Stream m Word8 -> Stream m (Either DecodeError Char)

Decoding String Literals

fromStr# :: MonadIO m => Addr# -> Stream m Char

Read UTF-8 encoded bytes as chars from an Addr# until a 0 byte is encountered. The 0 byte is not included in the stream.

Unsafe: The caller is responsible for safe addressing.

Note that this is completely safe when reading from Haskell string literals because they are guaranteed to be NUL-terminated:

>>> Stream.fold Fold.toList (Unicode.fromStr# "Haskell"#)
"Haskell"

Word16 Utilities

mkEvenW8Chunks :: Monad m => Stream m (Array Word8) -> Stream m (Array Word8)

Ensure chunks of even length. This can be used before casting the arrays to Word16. Use this API when interacting with external data.

The chunks are split and merged as needed to create arrays of even length. If the total length of all the arrays in the stream is odd, the trailing byte of the last array is dropped.
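
A hedged sketch of the odd-total case, going by the description above. The Array import is an addition not listed in the Setup section. The two input chunks hold five bytes in total, and the output chunks are flattened with concat because the exact chunk boundaries produced are not specified here.

>>> import qualified Streamly.Data.Array as Array
>>> fmap concat $ Stream.toList $ fmap Array.toList $ Unicode.mkEvenW8Chunks $ Stream.fromList [Array.fromList [1, 2, 3], Array.fromList [4, 5]]
[1,2,3,4]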

swapByteOrder :: Word16 -> Word16

Swap the byte order of a Word16.

swapByteOrder 0xABCD == 0xCDAB
swapByteOrder . swapByteOrder == id

Deprecations

decodeUtf8Lax :: Monad m => Stream m Word8 -> Stream m Char

Deprecated: Please use decodeUtf8 instead

Same as decodeUtf8

encodeLatin1Lax :: Monad m => Stream m Char -> Stream m Word8

Deprecated: Please use encodeLatin1 instead

Same as encodeLatin1

encodeUtf8Lax :: Monad m => Stream m Char -> Stream m Word8

Deprecated: Please use encodeUtf8 instead

Same as encodeUtf8