Unicode.Char

This module provides APIs to access the Unicode character database (UCD) corresponding to Unicode Standard version 15.1.0.

This module re-exports several sub-modules under it. The sub-module structure under Char is largely based on the "Property Index by Scope of Use" in Unicode® Standard Annex #44.

The Unicode.Char.* modules in turn depend on Unicode.Internal.Char.* modules which are programmatically generated from the Unicode standard's Unicode character database files. The module structure under Unicode.Internal.Char is largely based on the UCD text file names from which the properties are generated.

For the original UCD files used in this code, please refer to the UCD section on the Unicode standard page. See https://www.unicode.org/reports/tr44/ to understand the contents and the format of the Unicode character database files.

Documentation

unicodeVersion :: Version Source #

Version of the Unicode standard used by this package: 15.1.0.

Since: 0.3.0

data GeneralCategory Source #

Unicode General Categories.

These classes are defined in the Unicode Character Database, part of the Unicode standard.

Note: the classes must be in the same order they are listed in the Unicode Standard, because some functions (e.g. generalCategory) rely on the Enum instance.

Since: 0.3.0

Constructors

UppercaseLetter

Lu: Letter, Uppercase

LowercaseLetter

Ll: Letter, Lowercase

TitlecaseLetter

Lt: Letter, Titlecase

ModifierLetter

Lm: Letter, Modifier

OtherLetter

Lo: Letter, Other

NonSpacingMark

Mn: Mark, Non-Spacing

SpacingCombiningMark

Mc: Mark, Spacing Combining

EnclosingMark

Me: Mark, Enclosing

DecimalNumber

Nd: Number, Decimal

LetterNumber

Nl: Number, Letter

OtherNumber

No: Number, Other

ConnectorPunctuation

Pc: Punctuation, Connector

DashPunctuation

Pd: Punctuation, Dash

OpenPunctuation

Ps: Punctuation, Open

ClosePunctuation

Pe: Punctuation, Close

InitialQuote

Pi: Punctuation, Initial quote

FinalQuote

Pf: Punctuation, Final quote

OtherPunctuation

Po: Punctuation, Other

MathSymbol

Sm: Symbol, Math

CurrencySymbol

Sc: Symbol, Currency

ModifierSymbol

Sk: Symbol, Modifier

OtherSymbol

So: Symbol, Other

Space

Zs: Separator, Space

LineSeparator

Zl: Separator, Line

ParagraphSeparator

Zp: Separator, Paragraph

Control

Cc: Other, Control

Format

Cf: Other, Format

Surrogate

Cs: Other, Surrogate

PrivateUse

Co: Other, Private Use

NotAssigned

Cn: Other, Not Assigned

Instances

Bounded GeneralCategory Source #
Enum GeneralCategory Source #
Ix GeneralCategory Source #
Show GeneralCategory Source #
Eq GeneralCategory Source #
Ord GeneralCategory Source #

All instances are defined in Unicode.Char.General.

data CodePointType Source #

Types of Code Points.

These classes are defined in section 2.4 “Code Points and Characters” of the Unicode standard.

Since: 0.4.1

Constructors

GraphicType

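Graphic: defined by the following general categories: letters (L), marks (M), numbers (N), punctuation (P), symbols (S) and space separators (Zs, i.e. Space).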

FormatType

Format: invisible but affects neighboring characters.

Defined by the following general categories: LineSeparator, ParagraphSeparator, Format.

ControlType

Control: usage defined by protocols or standards outside the Unicode Standard.

Defined by the general category Control.

PrivateUseType

Private-use: usage defined by private agreement outside the Unicode Standard.

Defined by the general category PrivateUse.

SurrogateType

Surrogate: Permanently reserved for UTF-16.

Defined by the general category Surrogate.

NoncharacterType

Noncharacter: a code point that is permanently reserved for internal use (see definition D14 in the section 3.4 “Characters and Encoding” of the Unicode Standard). Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.

They are a subset of the general category NotAssigned.

ReservedType

Reserved: any code point of the Unicode Standard that is reserved for future assignment (see definition D15 in the section 3.4 “Characters and Encoding” of the Unicode Standard). Also known as an unassigned code point.

They are a subset of the general category NotAssigned.

Instances

Bounded CodePointType Source #
Enum CodePointType Source #
Ix CodePointType Source #
Show CodePointType Source #
Eq CodePointType Source #
Ord CodePointType Source #

All instances are defined in Unicode.Char.General.

generalCategory :: Char -> GeneralCategory Source #

The Unicode general category of the character.

This property is defined in column 2 of the UnicodeData table.

This relies on the Enum instance of GeneralCategory, which must remain in the same order as the categories are presented in the Unicode standard.

show (generalCategory c) == show (Data.Char.generalCategory c)
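As a minimal illustration of the classification, the sketch below uses base's Data.Char.generalCategory, which the equivalence above says agrees with this function on constructor names. The helper name categories is hypothetical and not part of this module.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Hypothetical helper: pair each character with its general category.
-- This uses base's 'generalCategory', not the one from this module.
categories :: String -> [(Char, GeneralCategory)]
categories = map (\c -> (c, generalCategory c))

-- A few stable classifications:
--   'A'  is UppercaseLetter, 'a' is LowercaseLetter,
--   '1'  is DecimalNumber,   ' ' is Space,
--   '\n' is Control.
```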

Since: 0.3.0

isAscii :: Char -> Bool Source #

Selects the first 128 characters of the Unicode character set, corresponding to the ASCII character set.

isLatin1 :: Char -> Bool Source #

Selects the first 256 characters of the Unicode character set, corresponding to the ISO 8859-1 (Latin-1) character set.

isAsciiLower :: Char -> Bool Source #

Selects ASCII lower-case letters, i.e. characters satisfying both isAscii and isLower.

isAsciiUpper :: Char -> Bool Source #

Selects ASCII upper-case letters, i.e. characters satisfying both isAscii and isUpper.

isPunctuation :: Char -> Bool Source #

Selects Unicode punctuation characters, including various kinds of connectors, brackets and quotes.

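This function returns True if its argument has one of the following GeneralCategorys, or False otherwise:

  • ConnectorPunctuation
  • DashPunctuation
  • OpenPunctuation
  • ClosePunctuation
  • InitialQuote
  • FinalQuote
  • OtherPunctuation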

isPunctuation c == Data.Char.isPunctuation c
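A category-based predicate of this kind can be sketched as a pattern match on the general category. The version below assumes the punctuation categories are the seven P* constructors (Pc, Pd, Ps, Pe, Pi, Pf, Po) and relies on base's generalCategory; the name isPunctuationSketch is hypothetical, and this is not necessarily how the library implements it.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Sketch of a punctuation predicate derived from the general category.
-- Built on base's 'generalCategory'; an illustration only.
isPunctuationSketch :: Char -> Bool
isPunctuationSketch c = case generalCategory c of
  ConnectorPunctuation -> True   -- Pc, e.g. '_'
  DashPunctuation      -> True   -- Pd, e.g. '-'
  OpenPunctuation      -> True   -- Ps, e.g. '('
  ClosePunctuation     -> True   -- Pe, e.g. ')'
  InitialQuote         -> True   -- Pi, e.g. U+201C
  FinalQuote           -> True   -- Pf, e.g. U+201D
  OtherPunctuation     -> True   -- Po, e.g. '!'
  _                    -> False
```

The same shape works for the other category-based predicates (isSymbol, isMark, isSeparator) by swapping in their constructor lists.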

Since: 0.3.0

isSymbol :: Char -> Bool Source #

Selects Unicode symbol characters, including mathematical and currency symbols.

This function returns True if its argument has one of the following GeneralCategorys, or False otherwise:

  • MathSymbol
  • CurrencySymbol
  • ModifierSymbol
  • OtherSymbol

isSymbol c == Data.Char.isSymbol c

Since: 0.3.0

isControl :: Char -> Bool Source #

Selects control characters, which are the non-printing characters of the Latin-1 subset of Unicode.

This function returns True if its argument has the GeneralCategory Control.

isControl c == Data.Char.isControl c

Since: 0.3.0

isPrint :: Char -> Bool Source #

Selects printable Unicode characters (letters, numbers, marks, punctuation, symbols and spaces).

This function returns False if its argument has one of the following GeneralCategorys, or True otherwise:

  • LineSeparator
  • ParagraphSeparator
  • Control
  • Format
  • Surrogate
  • PrivateUse
  • NotAssigned

isPrint c == Data.Char.isPrint c

Since: 0.3.0

isMark :: Char -> Bool Source #

Selects Unicode mark characters, for example accents and the like, which combine with preceding characters.

This function returns True if its argument has one of the following GeneralCategorys, or False otherwise:

  • NonSpacingMark
  • SpacingCombiningMark
  • EnclosingMark

isMark c == Data.Char.isMark c

Since: 0.3.0

isSeparator :: Char -> Bool Source #

Selects Unicode space and separator characters.

This function returns True if its argument has one of the following GeneralCategorys, or False otherwise:

  • Space
  • LineSeparator
  • ParagraphSeparator

isSeparator c == Data.Char.isSeparator c

Since: 0.3.0

codePointType :: Char -> CodePointType Source #

Returns the CodePointType of a character.
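The mapping from general categories to code point types described under CodePointType can be sketched as follows. This is an illustration built on base's generalCategory, not the library's implementation; the type CodePointKind, its constructors and the function codePointKind are hypothetical stand-ins for CodePointType and codePointType.

```haskell
import Data.Bits ((.&.))
import Data.Char (GeneralCategory (..), generalCategory, ord)

-- | Hypothetical stand-in for 'CodePointType'.
data CodePointKind
  = GraphicK | FormatK | ControlK | PrivateUseK
  | SurrogateK | NoncharacterK | ReservedK
  deriving (Eq, Show)

-- | Sketch: derive the code point type from the general category,
-- splitting NotAssigned into noncharacters and reserved code points.
codePointKind :: Char -> CodePointKind
codePointKind c = case generalCategory c of
  LineSeparator      -> FormatK
  ParagraphSeparator -> FormatK
  Format             -> FormatK
  Control            -> ControlK
  PrivateUse         -> PrivateUseK
  Surrogate          -> SurrogateK
  NotAssigned
    | isNonchar      -> NoncharacterK
    | otherwise      -> ReservedK
  _                  -> GraphicK   -- letters, marks, numbers, punctuation, symbols, Space
  where
    cp = ord c
    -- U+FDD0..U+FDEF, plus the last two code points of every plane.
    isNonchar = (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp .&. 0xFFFE) == 0xFFFE
```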

Since: 0.6.0

generalCategoryAbbr :: GeneralCategory -> String Source #

Abbreviation of GeneralCategory used in the Unicode standard.

Since: 0.3.0

isAlphabetic :: Char -> Bool Source #

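Returns True for alphabetic Unicode characters (lower-case, upper-case and title-case letters, plus letters of caseless scripts and modifier letters).

Note: this function is not equivalent to isAlpha / isLetter:

  • isAlpha matches the general categories UppercaseLetter (Lu), LowercaseLetter (Ll), TitlecaseLetter (Lt), ModifierLetter (Lm) and OtherLetter (Lo).
  • isAlphabetic matches the Alphabetic property, which covers the Uppercase and Lowercase properties, the general categories Lt, Lm, Lo and LetterNumber (Nl), and the Other_Alphabetic property.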

Since: 0.3.0

isWhiteSpace :: Char -> Bool Source #

Returns True for any whitespace characters, and the control characters \t, \n, \r, \f, \v.

See: Unicode White_Space.

Note: isWhiteSpace is not equivalent to isSpace. isWhiteSpace selects the same characters as isSpace, plus the following:

  • U+0085 NEXT LINE (NEL)
  • U+2028 LINE SEPARATOR
  • U+2029 PARAGRAPH SEPARATOR
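The relationship to isSpace stated in the note can be written down directly. The sketch below assumes base's Data.Char.isSpace and simply adds the three extra code points; isWhiteSpaceSketch is a hypothetical name, and the library itself presumably uses generated lookup data rather than this definition.

```haskell
import Data.Char (isSpace)

-- | Sketch: base's 'isSpace' extended with the three code points
-- that the White_Space property adds on top of it.
isWhiteSpaceSketch :: Char -> Bool
isWhiteSpaceSketch c =
  isSpace c
    || c == '\x0085'  -- NEXT LINE (NEL)
    || c == '\x2028'  -- LINE SEPARATOR
    || c == '\x2029'  -- PARAGRAPH SEPARATOR
```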

Since: 0.3.0

isNoncharacter :: Char -> Bool Source #

Returns True for any noncharacter.

A noncharacter is a code point that is permanently reserved for internal use (see definition D14 in the section 3.4 “Characters and Encoding” of the Unicode Standard).

Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.
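The two ranges above translate into a short arithmetic test. This is a minimal sketch of the definition, not necessarily the library's implementation; isNoncharacterSketch is a hypothetical name.

```haskell
import Data.Bits ((.&.))
import Data.Char (ord)

-- | Sketch: True for U+FDD0..U+FDEF and for the last two code points
-- of each of the 17 planes (U+nFFFE and U+nFFFF).
isNoncharacterSketch :: Char -> Bool
isNoncharacterSketch c =
  (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp .&. 0xFFFE) == 0xFFFE
  where
    cp = ord c
```

Masking with 0xFFFE keeps the low 16 bits except the last one, so both U+nFFFE and U+nFFFF compare equal to 0xFFFE regardless of the plane number n. In total this selects 32 + 2 × 17 = 66 code points.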

Since: 0.6.0

isJamo :: Char -> Bool Source #

Determine whether a character is a jamo L, V or T character.

Since: 0.1.0

jamoNCount :: Int Source #

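Number of precomposed Hangul syllables per leading consonant, i.e. the number of vowel and trailing consonant combinations (the trailing consonant count includes the "no trailing consonant" case).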

jamoNCount = jamoVCount * jamoTCount

Since: 0.1.0

jamoLFirst :: Int Source #

First leading consonant jamo.

Since: 0.1.0

jamoLCount :: Int Source #

Total count of leading consonant jamo.

Since: 0.3.0

jamoLIndex :: Char -> Maybe Int Source #

Given a Unicode character, if it is a leading jamo, return its index in the list of leading jamo consonants, otherwise return Nothing.

Since: 0.1.0

jamoLLast :: Int Source #

Last leading consonant jamo.

Since: 0.1.0

jamoVFirst :: Int Source #

First vowel jamo.

Since: 0.1.0

jamoVCount :: Int Source #

Total count of vowel jamo.

Since: 0.1.0

jamoVIndex :: Char -> Maybe Int Source #

Given a Unicode character, if it is a vowel jamo, return its index in the list of vowel jamo, otherwise return Nothing.

Since: 0.1.0

jamoVLast :: Int Source #

Last vowel jamo.

Since: 0.1.0

jamoTFirst :: Int Source #

The first trailing consonant jamo.

Note that jamoTFirst does not represent a valid T; it represents a missing T, i.e. an LV syllable without a T. See the comments under jamoTIndex.

Since: 0.1.0

jamoTCount :: Int Source #

Total count of trailing consonant jamo.

Since: 0.1.0

jamoTIndex :: Char -> Maybe Int Source #

Given a Unicode character, if it is a trailing jamo consonant, return its index in the list of trailing jamo consonants, otherwise return Nothing.

Note that index 0 is not a valid index for a trailing consonant. Index 0 corresponds to an LV syllable, without a T. See "Hangul Syllable Decomposition" in the Conformance chapter of the Unicode standard for more details.
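To show where index 0 comes from, here is a sketch of the arithmetic Hangul syllable decomposition from the Unicode standard. The constant names follow the standard's notation (SBase, LBase, VBase, TBase and the counts) and are assumptions for this example, as is the function decomposeHangul; the library exposes the corresponding values under the jamo* and hangul* names documented here.

```haskell
import Data.Char (chr, ord)

sBase, lBase, vBase, tBase, lCount, vCount, tCount, nCount, sCount :: Int
sBase  = 0xAC00           -- first precomposed syllable
lBase  = 0x1100           -- first leading consonant
vBase  = 0x1161           -- first vowel
tBase  = 0x11A7           -- one below the first real trailing consonant
lCount = 19
vCount = 21
tCount = 28               -- 27 trailing consonants plus the "no T" slot
nCount = vCount * tCount  -- syllables per leading consonant
sCount = lCount * nCount  -- total precomposed syllables

-- | Sketch: split a precomposed syllable into L, V and optional T jamo.
decomposeHangul :: Char -> Maybe String
decomposeHangul c
  | sIndex < 0 || sIndex >= sCount = Nothing
  | tIndex == 0 = Just [l, v]                        -- LV syllable: no T
  | otherwise   = Just [l, v, chr (tBase + tIndex)]  -- LVT syllable
  where
    sIndex = ord c - sBase
    l      = chr (lBase + sIndex `div` nCount)
    v      = chr (vBase + (sIndex `mod` nCount) `div` tCount)
    tIndex = sIndex `mod` tCount
```

Because tBase sits one code point below the first trailing consonant, a trailing index of 0 never maps to a real T and can be used to mean "no trailing consonant".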

Since: 0.1.0

jamoTLast :: Int Source #

Last trailing consonant jamo.

Since: 0.1.0

hangulFirst :: Int Source #

Codepoint of the first pre-composed Hangul character.

Since: 0.1.0

hangulLast :: Int Source #

Codepoint of the last pre-composed Hangul character.

Since: 0.1.0

isHangul :: Char -> Bool Source #

Determine if the given character is a precomposed Hangul syllable.

Since: 0.1.0

isHangulLV :: Char -> Bool Source #

Determine if the given character is a Hangul LV syllable.

Note: this function requires a precomposed Hangul syllable but does not check it. Use isHangul to check the input character before passing it to isHangulLV.
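The following sketch illustrates why the range check matters. It assumes an LV test based purely on the syllable index modulo the trailing consonant count (28), which is the arithmetic given in the Unicode standard; whether the library uses exactly this formula is an assumption, and all three function names are hypothetical.

```haskell
import Data.Char (ord)

-- | Sketch of a range check for precomposed syllables (U+AC00..U+D7A3).
isPrecomposedHangul :: Char -> Bool
isPrecomposedHangul c = cp >= 0xAC00 && cp <= 0xD7A3
  where
    cp = ord c

-- | Unchecked LV test: the trailing consonant index is 0.
-- Only meaningful for precomposed Hangul syllables.
isLVUnchecked :: Char -> Bool
isLVUnchecked c = (ord c - 0xAC00) `mod` 28 == 0

-- | Checked variant: validate the input before testing it.
isLVChecked :: Char -> Bool
isLVChecked c = isPrecomposedHangul c && isLVUnchecked c
```

In this sketch the unchecked test gives a false positive for 'H' (U+0048), whose offset from U+AC00 happens to be a multiple of 28, while the checked variant rejects it.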

Since: 0.1.0

isLowerCase :: Char -> Bool Source #

Returns True for lower-case characters.

It uses the character property Lowercase.

See: isLower for the legacy predicate.

Since: 0.3.0

isUpperCase :: Char -> Bool Source #

Returns True for upper-case characters.

It uses the character property Uppercase.

Note: it does not match title-cased letters. Those are matched using: generalCategory c == TitlecaseLetter.

See: isUpper for the legacy predicate.

Since: 0.3.0

caseFoldMapping :: Unfold Char Char Source #

Returns the full folded case mapping of a character if the character is changed, else nothing.

It uses the character property Case_Folding.

Since: 0.3.1

toCaseFoldString :: Char -> String Source #

Convert a character to full folded case if defined, else to itself.

This function is mainly useful for performing caseless (also known as case insensitive) string comparisons.

A string x is a caseless match for a string y if and only if:

foldMap toCaseFoldString x == foldMap toCaseFoldString y

The result string may have more than one character, and may differ from applying toLowerString to the input string. For instance, “ﬓ” (U+FB13 Armenian small ligature men now) is case folded to the sequence “մ” (U+0574 Armenian small letter men) followed by “ն” (U+0576 Armenian small letter now), while “µ” (U+00B5 micro sign) is case folded to “μ” (U+03BC Greek small letter mu) instead of itself.

It uses the character property Case_Folding.

toCaseFoldString c == foldMap toCaseFoldString (toCaseFoldString c)

Since: 0.3.1

lowerCaseMapping :: Unfold Char Char Source #

Returns the full lower case mapping of a character if the character is changed, else nothing.

It uses the character property Lowercase_Mapping.

Since: 0.3.1

toLowerString :: Char -> String Source #

Convert a character to full lower case if defined, else to itself.

The result string may have more than one character. For instance, “İ” (U+0130 Latin capital letter I with dot above) maps to the sequence: “i” (U+0069 Latin small letter I) followed by “ ̇” (U+0307 combining dot above).

It uses the character property Lowercase_Mapping.

See: toLower for simple lower case conversion.

toLowerString c == foldMap toLowerString (toLowerString c)

Since: 0.3.1

titleCaseMapping :: Unfold Char Char Source #

Returns the full title case mapping of a character if the character is changed, else nothing.

It uses the character property Titlecase_Mapping.

Since: 0.3.1

toTitleString :: Char -> String Source #

Convert a character to full title case if defined, else to itself.

The result string may have more than one character. For instance, “ﬂ” (U+FB02 Latin small ligature FL) is converted to the sequence: “F” (U+0046 Latin capital letter F) followed by “l” (U+006C Latin small letter L).

It uses the character property Titlecase_Mapping.

See: toTitle for simple title case conversion.

Since: 0.3.1

upperCaseMapping :: Unfold Char Char Source #

Returns the full upper case mapping of a character if the character is changed, else nothing.

It uses the character property Uppercase_Mapping.

Since: 0.3.1

toUpperString :: Char -> String Source #

Convert a character to full upper case if defined, else to itself.

The result string may have more than one character. For instance, the German “ß” (U+00DF Eszett) maps to the two-letter sequence “SS”.

It uses the character property Uppercase_Mapping.

See: toUpper for simple upper case conversion.

toUpperString c == foldMap toUpperString (toUpperString c)

Since: 0.3.1

Utils

showCodePoint :: Char -> ShowS Source #

Show the code point of a character using the Unicode Standard convention: upper-case hexadecimal, left-padded with zeros to at least 4 digits.

>>> showCodePoint '\xf' ""
"000F"
>>> showCodePoint '\x1ffff' ""
"1FFFF"

Since: 0.6.0
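The formatting convention can be reproduced with base alone. The sketch below is an assumption about the behaviour based on the two examples above, not the library's source; showCodePointSketch is a hypothetical name.

```haskell
import Data.Char (ord, toUpper)
import Numeric (showHex)

-- | Sketch: upper-case hexadecimal code point, zero-padded to 4 digits.
showCodePointSketch :: Char -> ShowS
showCodePointSketch c = showString padding . showString hex
  where
    hex     = map toUpper (showHex (ord c) "")
    -- 'replicate' yields "" for non-positive counts, so longer
    -- code points are left unpadded.
    padding = replicate (4 - length hex) '0'
```

Returning ShowS rather than String lets callers chain the result with other show functions without repeated list concatenation.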

Re-export from base

ord :: Char -> Int Source #

The fromEnum method restricted to the type Char.

chr :: Int -> Char Source #

The toEnum method restricted to the type Char.