Fields Specification
Apache Avro has Primitive Types
, Complex Types
and Logical Types
, so we need to match these types with python types.
Primitive Types and python representation
The set of primitive type names is:
- null: no value
- boolean: a binary value
- int: 32-bit signed integer
- long: 64-bit signed integer
- float: single precision (32-bit) IEEE 754 floating-point number
- double: double precision (64-bit) IEEE 754 floating-point number
- bytes: sequence of 8-bit unsigned bytes
- string: unicode character sequence
So, the previous types can be matched to:
Avro Type | Python Type |
---|---|
string | str |
long | int |
boolean | bool |
double | float |
null | None |
bytes | bytes |
int | types.Int32 |
float | types.Float32 |
Since Python does not have native int32
or float32
types, use the dataclasses_avroschema.types.Int32
and dataclasses_avroschema.types.Float32
types to annotate your classes. These types are simple wrappers around Python's default int
and float
types.
Note
Primitive type names are also defined type names. Thus, for example, the schema “string” is equivalent to: {"type": "string"}
Warning
If you have to defined an avro schema
with primitive type as defined type then set all the properties like default
, aliases
, etc in the defined type
{
"name": "expirience",
"type": {"type": "int", "unit": "years", "default": 10}
}
Complex Types
Avro supports six kinds of complex types: enums, arrays, maps, fixed, unions and records.
Avro Type | Python Type |
---|---|
enums | tuple |
arrays | list |
maps | dict |
fixed | types.Fixed |
unions | typing.Union |
records | Python Class |
- Enums: Use the type name "enum" and support the following attributes:
- name: a JSON string providing the name of the enum (required).
- namespace: a JSON string that qualifies the name;
- aliases: a JSON array of strings, providing alternate names for this enum (optional).
- doc: a JSON string providing documentation to the user of this schema (optional).
- symbols: a JSON array, listing symbols, as JSON strings (required). All symbols in an enum must be unique; duplicates are prohibited. Every symbol must match the regular expression [A-Za-z_][A-Za-z0-9_] (the same requirement as for names).
When we want to define a enum
type we should specify a default value because we need to define the symbols
In future version we will have a custom enum type to avoid this
- Arrays: Use the type name "array" and support the following attribute:
- name: a JSON string providing the name of the enum (required).
-
items: the schema of the array's items.
-
Maps: Use the type name "map". Map keys are assumed to be string. Support the following attribute:
- name: a JSON string providing the name of the enum (required).
-
values: the schema of the map's values.
-
Fixed uses the type name "fixed" and supports two attributes:
- name: a string naming this fixed (required).
- namespace, a string that qualifies the name;
- aliases: a JSON array of strings, providing alternate names for this enum (optional).
-
size: an integer, specifying the number of bytes per value (required).
-
Unions:
Unions
are represented using JSON arrays. For example,["null", "string"]
declares a schema which may be either a null or string. Under the Avro specifications, if a union field as a default, the type of the default must be the first listed type in the array. Dataclasses-avroschema will automatically generate the appropriate array if a default is provided. Note that an optional field (typing.Optional[T]) generates the union[T, null]
, whereT
is the first element in the union.None
will need to be explicitly declared the default to generate the appropriate schema, if the default should beNone/null
. -
Records:
Records
use the type namerecord
and will represent theSchema
.
Logical Types
A logical type is an Avro primitive or complex type with extra attributes to represent a derived type. The attribute logicalType must always be present for a logical type, and is a string with the name of one of the logical types listed later in this section. Other attributes may be defined for particular logical types.
A logical type is always serialized using its underlying Avro type so that values are encoded in exactly the same way as the equivalent Avro type that does not have a logicalType attribute. Language implementations may choose to represent logical types with an appropriate native type, although this is not required.
Language implementations must ignore unknown logical types when reading, and should use the underlying Avro type. If a logical type is invalid, for example a decimal with scale greater than its precision, then implementations should ignore the logical type and use the underlying Avro type.
-
Date: The date logical type represents a date within the calendar, with no reference to a particular time zone or time of day. A date logical type annotates an Avro int, where the int stores the number of days from the unix epoch, 1 January 1970 (ISO calendar).
-
Time (millisecond precision): The time-millis logical type represents a time of day, with no reference to a particular calendar, time zone or date, with a precision of one millisecond. A time-millis logical type annotates an Avro int, where the int stores the number of milliseconds after midnight, 00:00:00.000.
-
Timestamp (millisecond precision): The timestamp-millis logical type represents an instant on the global timeline, independent of a particular time zone or calendar, with a precision of one millisecond. A timestamp-millis logical type annotates an Avro long, where the long stores the number of milliseconds from the unix epoch, 1 January 1970 00:00:00.000 UTC.
-
UUID: Represents a uuid as a string
-
Decimal: Represents a decimal.Decimal as bytes
Avro Type | Logical Type | Python Type |
---|---|---|
int | date | datetime.date |
int | time-millis | datetime.time |
long | time-micros | types.TimeMicro |
long | timestamp-millis | datetime.datetime |
long | timestamp-micros | types.DateTimeMicro |
string | uuid | uuid.uuid4 |
string | uuid | uuid.UUID |
bytes | decimal | types.condecimal |
Avro Field and Python Types Summary
Python Type | Avro Type | Logical Type |
---|---|---|
str | string | do not apply |
long | int | do not apply |
bool | boolean | do not apply |
double | float | do not apply |
None | null | do not apply |
bytes | bytes | do not apply |
typing.List | array | do not apply |
typing.Tuple | array | do not apply |
typing.Sequence | array | do not apply |
typing.MutableSequence | array | do not apply |
typing.Dict | map | do not apply |
typing.Mapping | map | do not apply |
typing.MutableMapping | map | do not apply |
types.Fixed | fixed | do not apply |
enum.Enum | enum | do not apply |
types.Int32 | int | do not apply |
types.Float32 | float | do not apply |
typing.Union | union | do not apply |
typing.Optional | union (with null ) |
do not apply |
Python class | record | do not apply |
datetime.date | int | date |
datetime.time | int | time-millis |
types.TimeMicro | long | time-micros |
datetime.datetime | long | timestamp-millis |
types.DateTimeMicro | long | timestamp-micros |
decimal.Decimal | bytes | decimal |
uuid.uuid4 | string | uuid |
uuid.UUID | string | uuid |
Python Type | Avro Type | Logical Type |
---|---|---|
str | string | do not apply |
long | int | do not apply |
bool | boolean | do not apply |
double | float | do not apply |
None | null | do not apply |
bytes | bytes | do not apply |
typing.List | array | do not apply |
typing.Tuple | array | do not apply |
typing.Sequence | array | do not apply |
typing.MutableSequence | array | do not apply |
typing.Dict | map | do not apply |
typing.Mapping | map | do not apply |
typing.MutableMapping | map | do not apply |
types.Fixed | fixed | do not apply |
str, enum.Enum | enum | do not apply |
types.Int32 | int | do not apply |
types.Float32 | float | do not apply |
typing.Union | union | do not apply |
typing.Optional | union (with null ) |
do not apply |
Python class | record | do not apply |
datetime.date | int | date |
datetime.time | int | time-millis |
types.TimeMicro | long | time-micros |
datetime.datetime | long | timestamp-millis |
types.DateTimeMicro | long | timestamp-micros |
decimal.Decimal | bytes | decimal |
uuid.uuid4 | string | uuid |
uuid.UUID | string | uuid |
typing.Annotated
All the types can be Annotated so metadata
can be added to the fields. This library will use the python type
to generate the avro field
and it will ignore the extra metadata
.
import dataclasses
import enum
import typing
from dataclasses_avroschema import AvroModel
class FavoriteColor(str, enum.Enum):
BLUE = "BLUE"
YELLOW = "YELLOW"
GREEN = "GREEN"
@dataclasses.dataclass
class UserAdvance(AvroModel):
name: typing.Annotated[str, "string"]
age: typing.Annotated[int, "integer"]
pets: typing.List[typing.Annotated[str, "string"]]
accounts: typing.Dict[str, typing.Annotated[int, "integer"]]
favorite_colors: typing.Annotated[FavoriteColor, "a color enum"]
has_car: typing.Annotated[bool, "boolean"] = False
country: str = "Argentina"
address: typing.Optional[typing.Annotated[str, "string"]] = None
class Meta:
schema_doc = False
UserAdvance.avro_schema()
resulting in
{
"type": "record",
"name": "UserAdvance",
"fields": [
{"name": "name", "type": "string"},
{"name": "age", "type": "long"},
{"name": "pets", "type": {"type": "array", "items": "string", "name": "pet"}},
{"name": "accounts", "type": {"type": "map", "values": "long", "name": "account"}},
{"name": "favorite_colors", "type": {"type": "enum", "name": "FavoriteColor", "symbols": ["BLUE", "YELLOW", "GREEN"]}},
{"name": "has_car", "type": "boolean", "default": false},
{"name": "country", "type": "string", "default": "Argentina"},
{"name": "address", "type": ["null", "string"], "default": null}]
}'
(This script is complete, it should run "as is")
typing.Literal
Fields can be annotated with typing.Literal
in accordance with PEP 586. Note that a literal field with multiple arguments (i.e. of the form typing.Literal[v1, v2, v3]
) is interpreted as a union of literals (i.e. typing.Union[typing.Literal[v1], typing.Literal[v2], typing.Literal[v3]]
) in line with the PEP.
import enum
import typing
from dataclasses import dataclass
from dataclasses_avroschema import AvroModel
class E(enum.Enum):
ONE = "one"
@dataclass
class T(AvroModel):
f: typing.Literal[None, 1, "1", True, b"1", E.ONE]
print(T.avro_schema())
"""
{
"type": "record",
"name": "T",
"fields": [
{
"name": "f",
"type": [
"null",
"long",
"string",
"boolean",
"bytes",
{
"type": "enum",
"name": "E",
"symbols": [
"one"
]
}
]
}
]
}
"""
import enum
import typing
from dataclasses import dataclass
from dataclasses_avroschema import AvroModel
class E(str, enum.Enum):
ONE = "one"
@dataclass
class T(AvroModel):
f: typing.Literal[None, 1, "1", True, b"1", E.ONE]
print(T.avro_schema())
"""
{
"type": "record",
"name": "T",
"fields": [
{
"name": "f",
"type": [
"null",
"long",
"string",
"boolean",
"bytes",
{
"type": "enum",
"name": "E",
"symbols": [
"one"
]
}
]
}
]
}
"""
(This script is complete, it should run "as is")
Adding Custom Field-level Attributes
You may want to add field-level attributes which are not automatically populated according to the typing semantics
listed above. For example, you might want a "doc"
attribute or even a custom attribute (which Avro supports as long
as it doesn't conflict with any field names in the core Avro specification). An example of a custom attribute is a flag
for whether a field contains sensitive data. e.g. "sensitivty"
.
When your Python class is serialised to Avro, each field will contain a number of attributes. Some of these of are common
to all fields such as "name"
and others are specific to the datatype (e.g. array
will have the items
attribute).
In order to add custom fields, you can use the field
descriptor of the built-in dataclasses
package and provide a
dict
of key-value pairs to the metadata
parameter as in dataclasses.field(metadata={'doc': 'foo'})
.
from dataclasses import dataclass, field
from dataclasses_avroschema import AvroModel, types
@dataclass
class User(AvroModel):
"An User"
name: str = field(metadata={'doc': 'bar'})
age: int = field(metadata={'doc': 'foo'})
User.avro_schema()
{
"type": "record",
"name": "User",
"doc": "An User",
"fields": [
{"name": "name", "type": "string", "doc": "bar"},
{"name": "age", "type": "long", "doc": "foo"}
]
}
from dataclasses import dataclass, field
from dataclasses_avroschema import AvroModel, types
@dataclass
class User(AvroModel):
"An User"
name: str = field(metadata={'doc': 'bar', 'sensitivity': 'HIGH'})
age: int = field(metadata={'doc': 'foo', 'sensitivity': 'MEDIUM'})
User.avro_schema()
{
"type": "record",
"name": "User",
"doc": "An User",
"fields": [
{"name": "name", "type": "string", "doc": "bar", "sensitivity": "HIGH"},
{"name": "age", "type": "long", "doc": "foo", "sensitivity": "MEDIUM"}
]
}
(This script is complete, it should run "as is")
Exclude default value from schema
Sometimes it is useful to exclude default
values in the final avro schema
, for example when default values are dynamically generated. In the following example, we have an User with fields id
and created_at
, their default are generated when an instance is created. Because the defaults are dynamic then the avro schema
will change every time that is generated.
It is possible to exclude
the default
value using the exclude_default
in the metadata
. This is applicable when using default
or default_factory
import dataclasses
import datetime
from uuid import UUID, uuid4
from dataclasses_avroschema import AvroModel
@dataclasses.dataclass
class User(AvroModel):
id: UUID = dataclasses.field(default_factory=uuid4)
created_at: datetime.datetime = dataclasses.field(
default_factory=lambda: datetime.datetime.now(tz=datetime.timezone.utc)
)
User.avro_schema_to_python()
{
'type': 'record',
'name': 'User',
'fields': [
{
'name': 'uuid',
'type': {
'type': 'string',
'logicalType': 'uuid'
},
'default': '3f70bc2f-f533-434e-ab9b-982477626419' # IT WILL CHANGE EVERY TIME!!!
},
{
'name': 'created_at',
'type': {
'type': 'long',
'logicalType': 'timestamp-millis'
},
'default': 1705154403499 # IT WILL CHANGE EVERY TIME!!!
}
]
}
import dataclasses
import datetime
from uuid import UUID, uuid4
from dataclasses_avroschema import AvroModel
@dataclasses.dataclass
class User(AvroModel):
id: UUID = dataclasses.field(default_factory=uuid4, metadata={"exclude_default": True})
created_at: datetime.datetime = dataclasses.field(
default_factory=lambda: datetime.datetime.now(tz=datetime.timezone.utc),
metadata={"exclude_default": True}
)
User.avro_schema_to_python()
{
'type': 'record',
'name': 'User',
'fields': [
{
'name': 'uuid',
'type': {
'type': 'string',
'logicalType': 'uuid'
}
},
{
'name': 'created_at',
'type': {
'type': 'long',
'logicalType': 'timestamp-millis'
}
}
]
}
Note
This is also applicable for AvroBaseModel
(pydantic)
Fields with custom inner names
Some avro fields like arrays
, map
and fixed
define an inner name
besides the field name
. For this use cases, the metadata
property inner-name
must be defined, for example:
from dataclasses_avroschema import AvroModel
from dataclasses_avroschema import types
import dataclasses
import typing
@dataclasses.dataclass
class DeliveryBatch(AvroModel):
receivers_payload: typing.List[str] = dataclasses.field(metadata={'inner_name': 'my_custom_name'})
accounts: typing.Dict[str, str] = dataclasses.field(metadata={'inner_name': 'my_account'})
md5: types.confixed(size=16, namespace="md5", aliases=['md5', 'hash']) = dataclasses.field(metadata={'inner_name': 'my_md5'})
which will produce the following schema:
{
"type": "record",
"name": "DeliveryBatch",
"fields": [
{"name": "receivers_payload", "type": {"type": "array", "items": "string", "name": "my_custom_name"}},
{
"name": "accounts",
"type": {"type": "map", "values": "string", "name": "my_account"}
},
{
"name": "md5",
"type": {
"type": "fixed",
"name": "my_md5",
"size": 16,
"namespace": "md5",
"aliases": ["md5", "hash"]
}
},
]
}