perluniprops lists the Unicode properties of the version of Unicode it supports. For Perl 5.32.1, that's Unicode 13.0.0.
You can obtain a list of the characters that match a category using Unicode::Tussle's unichars:
$ unichars '\p{Close_Punctuation}'
And the help:
$ unichars --help
Usage:
    unichars [*options*] *criterion* ...

    Each criterion is either a square-bracketed character class, a regex
    starting with a backslash, or an arbitrary Perl expression. See the
    EXAMPLES section below.

    OPTIONS:
      Selection Options:
        --bmp            include the Basic Multilingual Plane (plane 0) [DEFAULT]
        --smp            include the Supplementary Multilingual Plane (plane 1)
        --astral     -a  include planes above the BMP (planes 1-15)
        --unnamed    -u  include various unnamed characters (see DESCRIPTION)
        --locale     -l  specify the locale used for UCA functions
      Display Options:
        --category   -c  include the general category (GC=)
        --script     -s  include the script name (SC=)
        --block      -b  include the block name (BLK=)
        --bidi       -B  include the bidi class (BC=)
        --combining  -C  include the canonical combining class (CCC=)
        --numeric    -n  include the numeric value (NV=)
        --casefold   -f  include the casefold status
        --decimal    -d  include the decimal representation of the code point
      Miscellaneous Options:
        --version    -v  print version information and exit
        --help       -h  this message
        --man        -m  full manpage
        --debug      -d  show debugging of criteria and examined code point span
      Special Functions:
        $_    is the current code point
        ord   is the current code point's ordinal
        NAME  is charname::viacode(ord)
        NUM   is Unicode::UCD::num(ord), not code point number
        CF    is casefold->{status}
        NFD, NFC, NFKD, NFKC, FCD, FCC (normalization)
        UCA, UCA1, UCA2, UCA3, UCA4 (binary sort keys)
        Singleton, Exclusion, NonStDecomp, Comp_Ex
        checkNFD, checkNFC, checkNFKD, checkNFKC, checkFCD, checkFCC
        NFD_NO, NFC_NO, NFC_MAYBE, NFKD_NO, NFKC_NO, NFKC_MAYBE
Other than reading the list of categories from the webpage, is there a way to programmatically get all the possible \p{...} categories?
From the comments, I believe you are trying to port a Perl program using \p regex properties to Python. You don't need a list of all categories (whatever that means); you just need to know which code points each of the properties used by the program matches.
Now, you could get the list of code points from the Unicode database. But a much simpler solution is to use the third-party regex module for Python instead of the standard re module. This will give you access to the same Unicode-defined properties that Perl exposes.
The latest version of the regex module even uses Unicode 13.0.0 just like the latest Perl.
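Both routes can be sketched in Python. This is only a minimal illustration, not a tested port: the function names are made up for the example, `Pe` is the short name of the `Close_Punctuation` general category, and the stdlib route reflects whatever Unicode version your Python ships with (see `unicodedata.unidata_version`), which may differ from Perl's.

```python
import unicodedata

try:
    import regex  # third-party module (pip install regex), not the stdlib re
except ImportError:
    regex = None  # only the stdlib route below is available


def code_points_in_category(gc):
    """Stdlib route: every code point whose General_Category equals gc."""
    return [cp for cp in range(0x110000) if unicodedata.category(chr(cp)) == gc]


def find_close_punctuation(text):
    """regex route: reuse the property syntax from the Perl program."""
    return regex.findall(r"\p{Pe}", text)


if __name__ == "__main__":
    pe = code_points_in_category("Pe")
    print(f"Unicode {unicodedata.unidata_version}: {len(pe)} code points in Pe")
    if regex is not None:
        print(find_close_punctuation("f(x)[0]"))
```

With the regex module the patterns can usually be carried over unchanged, so the explicit code point list is only needed if you must stay with re.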
Note that the program uses \p{IsAlnum}, a long way of writing \p{Alnum}. \p{Alnum} is not a standard Unicode property, but a Perl extension. It's the union of the Unicode properties \p{Alpha} and \p{Nd}. I don't know if the regex module defines Alnum identically, but it probably does.
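One way to settle that question is to compare the two definitions code point by code point. The sketch below assumes the regex module accepts `\p{Alnum}`, `\p{Alpha}` and `\p{Nd}` as property names. `alnum_mismatches` is a hypothetical helper written for this check, and I have not recorded its result here.

```python
try:
    import regex  # third-party module (pip install regex), not the stdlib re
except ImportError:
    regex = None


def alnum_mismatches(code_points):
    # Collect the code points where \p{Alnum} disagrees with Perl's
    # definition, i.e. the union of \p{Alpha} and \p{Nd}.
    alnum = regex.compile(r"\p{Alnum}")
    perl_union = regex.compile(r"[\p{Alpha}\p{Nd}]")
    return [
        cp for cp in code_points
        if bool(alnum.fullmatch(chr(cp))) != bool(perl_union.fullmatch(chr(cp)))
    ]
```

Calling `alnum_mismatches(range(0x110000))` scans every code point. An empty list would mean the two definitions agree for the Unicode version bundled with your regex installation.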