<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Earth Data Labs]]></title><description><![CDATA[Sensing our World in real-time]]></description><link>https://eadalabs.com/</link><image><url>https://eadalabs.com/favicon.png</url><title>Earth Data Labs</title><link>https://eadalabs.com/</link></image><generator>Ghost 3.41</generator><lastBuildDate>Fri, 13 Mar 2026 21:55:49 GMT</lastBuildDate><atom:link href="https://eadalabs.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[LoRaWAN Sessions: Why ABP can be more complex than OTAA]]></title><description><![CDATA[<p>The most common way for LoRaWAN devices to establish sessions is to use the Over-The-Air Activation, aka "OTAA".  It uses a join-accept handshake to negotiate session keys with the network. 
It also resets the session parameters such as the frame counter (<a href="https://learn.semtech.com/mod/book/view.php?id=173&amp;chapterid=132">FCnt</a>), the data rate configuration (<a href="https://www.thethingsnetwork.org/docs/lorawan/adaptive-data-rate/">ADR</a>) or the</p>]]></description><link>https://eadalabs.com/lorawan-session-abp-vs-otta/</link><guid isPermaLink="false">66fec83912a59c76022859b4</guid><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Fri, 04 Oct 2024 11:31:23 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1585355850093-f39d4c1ef733?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDF8fHByb3RvY29sfGVufDB8fHx8MTcyODA0MzMyNHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1585355850093-f39d4c1ef733?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=M3wxMTc3M3wwfDF8c2VhcmNofDF8fHByb3RvY29sfGVufDB8fHx8MTcyODA0MzMyNHww&ixlib=rb-4.0.3&q=80&w=2000" alt="LoRaWAN Sessions: Why ABP can be more complex than OTAA"><p>The most common way for LoRaWAN devices to establish sessions is to use Over-The-Air Activation, aka "OTAA". It uses a join-accept handshake to negotiate session keys with the network. It also resets the session parameters such as the frame counter (<a href="https://learn.semtech.com/mod/book/view.php?id=173&amp;chapterid=132">FCnt</a>), the data rate configuration (<a href="https://www.thethingsnetwork.org/docs/lorawan/adaptive-data-rate/">ADR</a>) or the network Id (<a href="https://resources.lora-alliance.org/faq/lorawan-netid-faq">NetId</a>).</p><p>On the other hand, devices can use Activation By Personalization, aka "ABP", to join the network. At a high level, ABP can be seen as a pre-agreed and frozen set of configuration parameters, meaning that there is no need for the initial handshake. However, ABP is not stateless.
It is still possible to dynamically change the configuration: the TTN console provides an option for resetting the session of ABP devices. </p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2024/10/image-7.png" class="kg-image" alt="LoRaWAN Sessions: Why ABP can be more complex than OTAA" srcset="https://eadalabs.com/content/images/size/w600/2024/10/image-7.png 600w, https://eadalabs.com/content/images/size/w1000/2024/10/image-7.png 1000w, https://eadalabs.com/content/images/size/w1600/2024/10/image-7.png 1600w, https://eadalabs.com/content/images/2024/10/image-7.png 1786w" sizes="(min-width: 720px) 720px"></figure><p>This is the description provided by TTN:</p><blockquote>Resetting the session context and MAC state will reset the end device to its initial (factory) state. This includes resetting the frame counters and any other persisted MAC setting on the end device. [...]. Activation-by-personalization (ABP) end devices will only reset the MAC state, while preserving up/downlink queues."</blockquote><p>There are many fields defined in the <a href="https://www.thethingsindustries.com/docs/api/reference/grpc/end_device/#message:MACState">MAC state</a>, and here are a few of the relevant ones:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>field</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>last adr change <code>fcntup</code></td>
<td>Frame counter of the uplink that confirmed the last ADR parameter change.</td>
</tr>
<tr>
<td>last dev status <code>fcntup</code></td>
<td>Frame counter value of last uplink containing DevStatusAns.</td>
</tr>
<tr>
<td>MAC parameters</td>
<td><a href="https://www.thethingsindustries.com/docs/api/reference/grpc/end_device/#message:MACParameters">Parameters</a> such as EIRP, Tx Power, delay, ...</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>There is also a definition for the <a href="https://www.thethingsindustries.com/docs/api/reference/grpc/end_device/#message:Session">Session state</a>:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>field</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>last used uplink <code>fcntup</code></td>
<td>Last uplink frame counter value used. Network Server only. Application Server assumes the Network Server checked it.</td>
</tr>
<tr>
<td>last network downlink <code>n_fcntdown</code></td>
<td>Last network downlink frame counter value used. Network Server only.</td>
</tr>
<tr>
<td>last app downlink <code>a_fcntdown</code></td>
<td>Last application downlink frame counter value used. Application Server only.</td>
</tr>
<tr>
<td>last confirmed downlink <code>conf_fcntdown</code></td>
<td>Frame counter of the last confirmed downlink message sent. Network Server only.</td>
</tr>
</tbody>
</table>
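<p>A note on the counters above: a LoRaWAN frame counter is a 32-bit value, but only its 16 least-significant bits are transmitted in the frame header. The sketch below is an editor's illustration (not TTN or RadioLib code) of how a network server can reconstruct the full 32-bit counter from the 16 bits received on air:</p><pre><code class="language-c++">#include <cstdint>

// Reconstruct a 32-bit LoRaWAN frame counter from the 16 bits sent on air.
// 'lastFCnt' is the last full counter value accepted by the server;
// 'fCnt16' is the 16-bit value found in the FHDR of the new uplink.
// (A real server would also reject duplicates and enforce a maximum gap.)
uint32_t reconstructFCnt(uint32_t lastFCnt, uint16_t fCnt16) {
  uint32_t candidate = (lastFCnt & 0xFFFF0000u) | fCnt16;
  if(candidate <= lastFCnt) {
    candidate += 0x10000u; // the 16-bit field wrapped around
  }
  return candidate;
}</code></pre><p>This is also why a device that reboots and restarts its counter from zero stops being heard: the server keeps expecting values above its last accepted 32-bit counter.</p>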
<!--kg-card-end: markdown--><p>Our focus in this post is on the <strong>frame counters </strong>(<code>fcnt</code>). The reason is that, when restarting a device that uses ABP, the device needs to remember the last frame counter. But is that enough? What about the other counters, such as the downlink counters, and the other MAC parameters? </p><p>In the case of OTAA, when the device restarts from blank memory, the session is automatically reset after re-joining the network. This means that the device only needs to store the keys and nonce in "permanent storage". Other parameters, such as the frame counter and session keys, can be kept in "battery-powered storage", e.g. the RTC memory. If the RTC memory gets flushed (e.g. if the battery is removed or flat), then the OTAA-enabled device can get fresh parameters by re-joining the network. But for ABP, this is not the case, and frame counters <strong>must</strong> be kept in permanent storage. In practice this is workable, but writing to flash each time a packet is sent or received wears the flash out, which one may want to limit.</p><h1 id="practical-implementation-radiolib">Practical Implementation: RadioLib </h1><p>Let's dig into the <a href="https://github.com/jgromes/RadioLib">RadioLib</a> LoRaWAN implementation to better understand how sessions are restored. The function is <code>LoRaWANNode::setBufferSession(uint8_t* buffer)</code>, located in "<a href="https://github.com/jgromes/RadioLib/blob/6e665702419f8baa4740cbd1f05c844dd34364d9/src/protocols/LoRaWAN/LoRaWAN.cpp#L466">src/protocols/LoRaWAN/LoRaWAN.cpp</a>".</p><p></p><h2 id="restoring-sessions">Restoring Sessions</h2><p>The first part of the code checks the integrity of the buffer.</p><pre><code class="language-c++">// the Nonces buffer holds a checksum signature - compare this to the signature that is in the session buffer
uint16_t signatureNonces = LoRaWANNode::ntoh&lt;uint16_t&gt;(&amp;this-&gt;bufferNonces[NONCES_SIGNATURE]);
uint16_t signatureInSession = LoRaWANNode::ntoh&lt;uint16_t&gt;(&amp;session[SESSION_NONCES_SIGNATURE]);
if(signatureNonces != signatureInSession) {
  RADIOLIB_DEBUG_PROTOCOL_PRINTLN("The Session buffer (%04x) does not match the Nonces buffer (%04x)", signatureInSession, signatureNonces);
  return(RADIOLIB_ERR_SESSION_DISCARDED);
}</code></pre><p>Then it reads the device address and the session keys. This code is really designed for OTAA, since for ABP those parameters are all frozen.</p><pre><code class="language-c++">// pull all authentication keys from persistent storage
this-&gt;devAddr = LoRaWANNode::ntoh&lt;uint32_t&gt;(&amp;session[SESSION_DEV_ADDR]);
memcpy(this-&gt;appSKey,     &amp;session[SESSION_APP_SKEY],      RADIOLIB_AES128_BLOCK_SIZE);
memcpy(this-&gt;nwkSEncKey,  &amp;session[SESSION_NWK_SENC_KEY],  RADIOLIB_AES128_BLOCK_SIZE);
memcpy(this-&gt;fNwkSIntKey, &amp;session[SESSION_FNWK_SINT_KEY], RADIOLIB_AES128_BLOCK_SIZE);
memcpy(this-&gt;sNwkSIntKey, &amp;session[SESSION_SNWK_SINT_KEY], RADIOLIB_AES128_BLOCK_SIZE);</code></pre><p>The code then restores the various frame counters, as well as the network Id and the LoRaWAN revision (e.g. LoRaWAN 1.1 or 1.0.x). In the case of ABP, the network Id and revision are frozen, while the other parameters are dynamic.</p><pre><code class="language-c++">// restore session parameters
this-&gt;rev          = ntoh&lt;uint8_t&gt;(&amp;session[SESSION_VERSION]);
RADIOLIB_DEBUG_PROTOCOL_PRINTLN("LoRaWAN session: v1.%d", this-&gt;rev);
this-&gt;homeNetId    = ntoh&lt;uint32_t&gt;(&amp;session[SESSION_HOMENET_ID]);
this-&gt;aFCntDown    = ntoh&lt;uint32_t&gt;(&amp;session[SESSION_A_FCNT_DOWN]);
this-&gt;nFCntDown    = ntoh&lt;uint32_t&gt;(&amp;session[SESSION_N_FCNT_DOWN]);
this-&gt;confFCntUp   = ntoh&lt;uint32_t&gt;(&amp;session[SESSION_CONF_FCNT_UP]);
this-&gt;confFCntDown = ntoh&lt;uint32_t&gt;(&amp;session[SESSION_CONF_FCNT_DOWN]);
this-&gt;adrFCnt      = ntoh&lt;uint32_t&gt;(&amp;session[SESSION_ADR_FCNT]);
this-&gt;fCntUp       = ntoh&lt;uint32_t&gt;(&amp;session[SESSION_FCNT_UP]);</code></pre><p>The rest of the code is used to restore the MAC state and parameters:</p><pre><code>uint8_t cid; // Command ID
uint8_t cLen = 0; // Command Length
uint8_t cOcts[14] = { 0 }; // Command options buffer

// setup the default channels
if(this-&gt;band-&gt;bandType == BAND_DYNAMIC) {
  this-&gt;selectChannelPlanDyn();
} else { ... }

// for dynamic bands,  additional channels must be restored per-channel
if(this-&gt;band-&gt;bandType == BAND_DYNAMIC) { ... }

// restore the state - ADR needs special care, other is straight default
cid = MAC_LINK_ADR;
cLen = 14; // special internal ADR command
memcpy(cOcts, &amp;session[SESSION_LINK_ADR], cLen);
(void)execMacCommand(cid, cOcts, cLen);

uint8_t cids[6] = {
  MAC_DUTY_CYCLE,          MAC_RX_PARAM_SETUP, 
  MAC_RX_TIMING_SETUP,     MAC_TX_PARAM_SETUP,
  MAC_ADR_PARAM_SETUP,     MAC_REJOIN_PARAM_SETUP
};
uint16_t locs[6] = {
  SESSION_DUTY_CYCLE,      SESSION_RX_PARAM_SETUP,
  SESSION_RX_TIMING_SETUP, SESSION_TX_PARAM_SETUP,
  SESSION_ADR_PARAM_SETUP, SESSION_REJOIN_PARAM_SETUP
};

for(uint8_t i = 0; i &lt; 6; i++) {
  (void)this-&gt;getMacLen(cids[i], &amp;cLen, DOWNLINK);
  memcpy(cOcts, &amp;session[locs[i]], cLen);
  (void)execMacCommand(cids[i], cOcts, cLen);
}

// set the available channels
uint16_t chMask = LoRaWANNode::ntoh&lt;uint16_t&gt;(&amp;session[SESSION_AVAILABLE_CHANNELS]);
this-&gt;setAvailableChannels(chMask);

// copy uplink MAC command queue back in place
memcpy(this-&gt;fOptsUp, &amp;session[SESSION_MAC_QUEUE], FHDR_FOPTS_MAX_LEN);
memcpy(&amp;this-&gt;fOptsUpLen, &amp;session[SESSION_MAC_QUEUE_LEN], 1);</code></pre><p>This last part of the code looks a bit barbaric, but what it does in practice is simply to call the following MAC commands:</p><ul><li>LINK_ADR (<em>LinkADRReq</em>, 0x3): Sets the data rate, transmit power, repetition rate or channel.</li><li>DUTY_CYCLE (<em>DutyCycleReq</em>, 0x4): Sets the maximum aggregated transmit duty-cycle of a device.</li><li>RX_PARAM_SETUP (<em>RXParamSetupReq</em>, 0x5): Sets the reception slot parameters.</li><li>RX_TIMING_SETUP (<em>RXTimingSetupReq</em>, 0x8): Sets the timing of the reception slots.</li><li>TX_PARAM_SETUP (<em>TxParamSetupReq</em>, 0x9): Sets the maximum allowed dwell time and Max EIRP of the end-device, based on local regulations.</li><li>ADR_PARAM_SETUP (<em>ADRParamSetupReq</em>, 0xc): Sets the <em>limit</em> and <em>delay</em> parameters defining the ADR back-off algorithm.</li><li>REJOIN_PARAM_SETUP (<em>RejoinParamSetupReq</em>, 0xf): With this command, the network may request the device to periodically send a <em>RejoinReq Type 0</em> message, with a periodicity defined as a time or a number of uplinks.</li><li>NEW_CHANNEL, DL_CHANNEL: Used with dynamic band configurations (the case for EU868/IN865/AS923/KR920, but not US915/AU915). Sets the center frequency of the new channel and the range of uplink data rates usable on this channel.</li></ul><p>Voila, restoring a session seems simpler now. But let's also have a look at the code for initializing a session from scratch. 
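</p><p>Coming back to the flash-wear concern from earlier: a classic mitigation is to write the uplink counter to flash only every N uplinks, and to resume from the next multiple of N after a reboot. The stored value is then always ahead of, or equal to, the real counter, which is safe for an uplink counter since the network accepts a forward jump. This is an editor's sketch around a hypothetical NVM driver, not RadioLib code:</p><pre><code class="language-c++">#include <cstdint>

// Persist the uplink frame counter once every SAVE_EVERY uplinks to limit
// flash wear. nvmWrite()/nvmRead() stand in for a real flash/EEPROM driver.
static const uint32_t SAVE_EVERY = 64;
static uint32_t nvmCell = 0;                       // stand-in for a flash cell
static void nvmWrite(uint32_t v) { nvmCell = v; }
static uint32_t nvmRead() { return nvmCell; }

void onUplinkSent(uint32_t fCntUp) {
  if(fCntUp % SAVE_EVERY == 0) {
    nvmWrite(fCntUp); // one write per 64 uplinks instead of one per packet
  }
}

uint32_t restoreFCntUp() {
  // skip ahead to a value that is guaranteed not to have been used yet
  return nvmRead() + SAVE_EVERY;
}</code></pre><p>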
The function is <code>void LoRaWANNode::<a href="https://github.com/jgromes/RadioLib/blob/6e665702419f8baa4740cbd1f05c844dd34364d9/src/protocols/LoRaWAN/LoRaWAN.cpp#L304">createSession</a>(uint16_t lwMode, uint8_t initialDr)</code> from the same <a href="https://github.com/jgromes/RadioLib/blob/6e665702419f8baa4740cbd1f05c844dd34364d9/src/protocols/LoRaWAN/LoRaWAN.cpp#L304">LoRaWAN.cpp</a> file. </p><h2 id="creating-new-sessions">Creating New Sessions</h2><p>The structure is similar to the restore function. First, it takes care of the bands:</p><pre><code class="language-c++">this-&gt;clearSession();

// setup JoinRequest uplink/downlink frequencies and datarates
if (this-&gt;band-&gt;bandType == RADIOLIB_LORAWAN_BAND_DYNAMIC) {... }

// on fixed bands, the first OTAA uplink (JoinRequest) is sent on fixed datarate
if (this-&gt;band-&gt;bandType == RADIOLIB_LORAWAN_BAND_FIXED &amp;&amp; lwMode == RADIOLIB_LORAWAN_MODE_OTAA) { ... }
else { ... }
</code></pre><p>And then executes the necessary MAC commands:</p><pre><code class="language-c++">uint8_t cOcts[5]; // 5 = maximum downlink payload length
uint8_t cid = RADIOLIB_LORAWAN_MAC_LINK_ADR;
uint8_t cLen = 1;       // only apply Dr/Tx field
cOcts[0] = (drUp &lt;&lt; 4); // set uplink datarate
cOcts[0] |= 0;          // default to max Tx Power
(void)execMacCommand(cid, cOcts, cLen);

cid = RADIOLIB_LORAWAN_MAC_DUTY_CYCLE;
this-&gt;getMacLen(cid, &amp;cLen, RADIOLIB_LORAWAN_DOWNLINK);
uint8_t maxDCyclePower = 0;
switch (this-&gt;band-&gt;dutyCycle)
{
case (3600):  maxDCyclePower = 10; break;
case (36000): maxDCyclePower = 7;  break;
}
cOcts[0] = maxDCyclePower;
(void)execMacCommand(cid, cOcts, cLen);

cid = RADIOLIB_LORAWAN_MAC_RX_PARAM_SETUP;
(void)this-&gt;getMacLen(cid, &amp;cLen, RADIOLIB_LORAWAN_DOWNLINK);
cOcts[0] = (RADIOLIB_LORAWAN_RX1_DR_OFFSET &lt;&lt; 4);
cOcts[0] |= this-&gt;channels[RADIOLIB_LORAWAN_DIR_RX2].dr; // may be set by user, otherwise band's default upon initialization
LoRaWANNode::hton&lt;uint32_t&gt;(&amp;cOcts[1], this-&gt;channels[RADIOLIB_LORAWAN_DIR_RX2].freq, 3);
(void)execMacCommand(cid, cOcts, cLen);

cid = RADIOLIB_LORAWAN_MAC_RX_TIMING_SETUP;
(void)this-&gt;getMacLen(cid, &amp;cLen, RADIOLIB_LORAWAN_DOWNLINK);
cOcts[0] = (RADIOLIB_LORAWAN_RECEIVE_DELAY_1_MS / 1000);
(void)execMacCommand(cid, cOcts, cLen);

cid = RADIOLIB_LORAWAN_MAC_TX_PARAM_SETUP;
(void)this-&gt;getMacLen(cid, &amp;cLen, RADIOLIB_LORAWAN_DOWNLINK);
cOcts[0] = (this-&gt;band-&gt;dwellTimeDn &gt; 0 ? 1 : 0) &lt;&lt; 5;
cOcts[0] |= (this-&gt;band-&gt;dwellTimeUp &gt; 0 ? 1 : 0) &lt;&lt; 4;
uint8_t maxEIRPRaw;
switch (this-&gt;band-&gt;powerMax)
{
case (12):  maxEIRPRaw = 2; break;
case (14):  maxEIRPRaw = 4; break;
...
}
cOcts[0] |= maxEIRPRaw;
(void)execMacCommand(cid, cOcts, cLen);

cid = RADIOLIB_LORAWAN_MAC_ADR_PARAM_SETUP;
(void)this-&gt;getMacLen(cid, &amp;cLen, RADIOLIB_LORAWAN_DOWNLINK);
cOcts[0] = (RADIOLIB_LORAWAN_ADR_ACK_LIMIT_EXP &lt;&lt; 4);
cOcts[0] |= RADIOLIB_LORAWAN_ADR_ACK_DELAY_EXP;
(void)execMacCommand(cid, cOcts, cLen);

cid = RADIOLIB_LORAWAN_MAC_REJOIN_PARAM_SETUP;
(void)this-&gt;getMacLen(cid, &amp;cLen, RADIOLIB_LORAWAN_DOWNLINK);
cOcts[0] = (RADIOLIB_LORAWAN_REJOIN_MAX_TIME_N &lt;&lt; 4);
cOcts[0] |= RADIOLIB_LORAWAN_REJOIN_MAX_COUNT_N;
(void)execMacCommand(cid, cOcts, cLen);</code></pre><p>The MAC commands are, unsurprisingly, the same as for the restore function.</p><h1 id="abp-sessions">ABP Sessions</h1><p>So, what about ABP sessions? What actually needs to be restored, compared to OTAA? Well, just about every single MAC parameter, since the MAC state defines, for example, how much time the gateway has for transmitting a downlink to the device, and on which channel. If the gateway does not talk to the device on the same channels and at the same time, the link won't work. And in this case, only an ADR back-off to the initial settings, or a MAC reset in the TTN console, can help to restore the connection. This is probably one significant drawback of ABP compared to OTAA: with OTAA, the device can dynamically reset the MAC state just by rejoining the network, while with ABP, the device must wait for the ADR back-off on the server side. </p><p>One question remains, about the session parameters. The above discussion is mainly about MAC parameters, which we now know must be restored according to the session. But what about the frame counters? What happens if they are not restored properly - can the server still receive the frames from the device? The answer is no - and to make it worse, this is a common issue with TTN/ABP, as in the discussion "<a href="https://www.thethingsnetwork.org/forum/t/abp-device-packets-in-gateway-view-but-not-arriving-at-application/63047">ABP device packets in gateway view but not arriving at application</a>". And when asking if "<a href="https://www.thethingsnetwork.org/forum/t/is-there-a-way-to-monitor-why-the-network-server-drops-certain-frames/62789">there is a way to monitor why the network server drops certain frames</a>", the answer is pretty straightforward:</p><blockquote>Question: Is there a way to monitor on the LNS why it is rejecting those frames ?</blockquote><blockquote>Answer: For ABP the classic is the frame counters. 
In this situation it’s unlikely to be as they are incrementing. So the refined answer is no, there are no further logs for us to see. If you want to write your own stack, how about doing it against a copy of TTS OS on a local server, then you can poke under the hood as much as you like.</blockquote><p>In short, if there is an issue with ABP, you are on your own. And while using the MAC state reset in the TTN console is "acceptable" for a development configuration, it is definitely not for a production environment. As for the "advice" to write one's own TTN stack, that's honestly the worst recommendation.</p><p>Of course, one may wonder why this matters, since restoring the full MAC and session states using <code>setBufferSession</code> should be enough to restore the frame counters. Well, in actual fact, that sometimes works, but quite often the device gets into the "seen on the gateway but not in the app" state. And when this happens, the only solution that works is to reset the MAC state. Definitely not a suitable solution for production.</p><h1 id="conclusion">Conclusion</h1><p>It will take a bit more time to find out how to properly restore ABP sessions without getting into the "seen on the gateway but not in the app" issue, and I will write later about the possible approaches and solutions. </p><p>Meanwhile, using OTAA is the correct solution. And on a side topic, one question to investigate is why OTAA only needs a 2-way handshake, while TCP needs <a href="https://www.pixelstech.net/article/1727412048-Why-TCP-needs-3-handshakes">3 handshakes</a>. 
Something for yet another future post!</p><p></p>]]></content:encoded></item><item><title><![CDATA[The cost of LoRaWAN: frame encoding efficiency]]></title><description><![CDATA[<p>This is a quick post to document one of the questions I had about LoRaWAN, namely what is the overhead for sending 1 byte of useful application payload when using LoRaWAN, versus using LoRa directly.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://eadalabs.com/content/images/2024/10/image-1.png" class="kg-image" alt srcset="https://eadalabs.com/content/images/size/w600/2024/10/image-1.png 600w, https://eadalabs.com/content/images/size/w1000/2024/10/image-1.png 1000w, https://eadalabs.com/content/images/size/w1600/2024/10/image-1.png 1600w, https://eadalabs.com/content/images/2024/10/image-1.png 1790w" sizes="(min-width: 720px) 720px"><figcaption>LoRaWAN frame payload format</figcaption></figure><p>In the above diagram "8b" refers to 8 bits, while "8B"</p>]]></description><link>https://eadalabs.com/lorawan-overhead/</link><guid isPermaLink="false">66fe474512a59c7602285860</guid><category><![CDATA[lora]]></category><category><![CDATA[lorawan]]></category><category><![CDATA[payload]]></category><category><![CDATA[frame format]]></category><category><![CDATA[encoding]]></category><category><![CDATA[overhead]]></category><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Thu, 03 Oct 2024 10:20:50 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1566217558289-2bc2fb1641cd?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEyfHxhbnRlbm5hfGVufDB8fHx8MTcyNzk0NDA0OHww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1566217558289-2bc2fb1641cd?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEyfHxhbnRlbm5hfGVufDB8fHx8MTcyNzk0NDA0OHww&ixlib=rb-4.0.3&q=80&w=2000" alt="The cost of LoRaWAN: frame encoding 
efficiency"><p>This is a quick post to document one of the questions I had about LoRaWAN, namely what is the overhead for sending 1 byte of useful application payload when using LoRaWAN, versus using LoRa directly.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://eadalabs.com/content/images/2024/10/image-1.png" class="kg-image" alt="The cost of LoRaWAN: frame encoding efficiency" srcset="https://eadalabs.com/content/images/size/w600/2024/10/image-1.png 600w, https://eadalabs.com/content/images/size/w1000/2024/10/image-1.png 1000w, https://eadalabs.com/content/images/size/w1600/2024/10/image-1.png 1600w, https://eadalabs.com/content/images/2024/10/image-1.png 1790w" sizes="(min-width: 720px) 720px"><figcaption>LoRaWAN frame payload format</figcaption></figure><p>In the above diagram "8b" refers to 8 bits, while "8B" refers to 8 bytes. </p><h1 id="lorawan-encoding">LoRaWAN encoding</h1><p>Assuming that the "App Layer" is 1 byte, the extra bytes needed for encoding the frame from the LoRaWAN device to the LoRaWAN gateway consist of the following items:</p><p>LoRaWAN layer:</p><ul><li>4 bytes for the device address (inside the header) - used by the cloud to identify the emitting device.</li><li>2 bytes for the frame counter (inside the header) - used for packet loss detection (and rejoin with OTAA), as well as to prevent replay attacks.</li><li>1 byte of control bits (inside the header), used for adaptive data rate (ADR) and acknowledge request control.</li><li>1 byte for the port - which can be any value from 1 to 223, the other values being reserved for protocol payload handling. </li></ul><p>Medium Access Control (<a href="https://en.wikipedia.org/wiki/Medium_access_control">MAC</a>) layer:</p><ul><li>1 byte for the header - which includes the frame type, and a version number known as the major. 
The most basic header is defined as an uplink data message, with a major equal to zero.</li><li>4 bytes for the message integrity code, aka "MIC". The MIC is not a <a href="https://www.thethingsnetwork.org/forum/t/mic-vs-checksum/25806">CRC</a> (which instead sits at the physical layer level); the MIC is used for <a href="https://en.wikipedia.org/wiki/Message_authentication_code">authentication</a>. It is computed as an <a href="https://www.rfc-editor.org/rfc/rfc4493.html">AES-CMAC</a> over the <a href="https://learn.semtech.com/mod/book/view.php?id=173&amp;chapterid=131">MAC header and payload</a>, keyed with the network session key (<em>NwkSKey</em>).</li></ul><p>So, this is 13 bytes of overhead needed to transmit a single 1-byte useful payload. And considering that out of those 13 bytes, 10 are used for addressing, authentication, and sequence control, this is actually a pretty efficient protocol encoding.</p><h1 id="lora-encoding">LoRa encoding</h1><p>Let's continue the analysis with the LoRa physical frame. </p><p>The overhead when using the LoRa physical frame encoding consists of these additional fields:</p><ul><li>8 bytes for the preamble. The preamble, which is needed for the LoRa receiver to start detecting a frame, can vary in length. For LoRaWAN, it is 8 bytes. Other modulations, such as <a href="https://essay.utwente.nl/96205/1/Lopez_BA_EEMCS.pdf">GFSK</a>, use 5 bytes. </li><li>2.5 bytes for the header - I could not find any explicit description of this header payload, except for this fascinating <a href="https://dl.acm.org/doi/pdf/10.1145/3546869">reverse engineering</a> (section 4.6).</li><li>2 bytes for the CRC.</li></ul><p>Actually, there are two transmission modes for the physical layer: Explicit and Implicit. 
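</p><p>As a quick sanity check of the byte counts so far, here is a small editor-provided sketch (the constants are simply the figures listed above, assuming the explicit PHY mode):</p><pre><code class="language-c++">// Byte counts taken from the lists above (explicit PHY mode)
const double LORAWAN_OVERHEAD = 13.0;  // DevAddr + FCnt + FCtrl + FPort + MHDR + MIC
const double PHY_OVERHEAD     = 12.5;  // preamble (8) + PHY header (2.5) + CRC (2)

// efficiency = PL / (O + PL)
double efficiency(double payload, double overhead) {
  return payload / (overhead + payload);
}

// efficiency(1.0,   25.5)  ->  about 0.038 (3.8%)
// efficiency(102.0, 25.5)  ->  0.8 (80%)</code></pre><p>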
The latter, implicit mode assumes that the payload length, CR, and CRC are known in advance by both the sender and receiver, in which case the CRC and header are not transmitted, thus saving 4.5 bytes. However, while the implicit mode can work for specific frames with known length, it is not suitable for generic frames when ADR is used, since the coding rate can vary dynamically. </p><p>So, assuming an explicit physical frame, it takes 25.5 bytes of overhead to transfer one byte of useful application data.</p><h1 id="conclusion">Conclusion</h1><p>The graph below shows the actual encoding efficiency, measured for the three different encodings. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://eadalabs.com/content/images/2024/10/image-5.png" class="kg-image" alt="The cost of LoRaWAN: frame encoding efficiency" srcset="https://eadalabs.com/content/images/size/w600/2024/10/image-5.png 600w, https://eadalabs.com/content/images/2024/10/image-5.png 841w" sizes="(min-width: 720px) 720px"><figcaption>LoRaWAN frame encoding efficiency</figcaption></figure><p>The efficiency is defined as PL/(O+PL), where PL is the useful application payload and O the overhead.</p><p>As a conclusion, sending 1 byte of useful application data has an efficiency of 3.8% when using LoRaWAN. And to achieve 80% encoding efficiency (i.e. an overhead smaller than one quarter of the application payload), a LoRaWAN application needs to send at least 100 bytes.</p>]]></content:encoded></item><item><title><![CDATA[Getting started with OpenXR and Viulux VR headset]]></title><description><![CDATA[<p>During the past few weeks I have been experimenting with Virtual Reality, and more precisely with the OpenXR standard. My initial objective is simple: to enhance one of the existing <a href="https://eadalabs.com/a-visual-study-of-air-pollution-forecasting/">atmospheric simulations</a> into a 3D model and visualise it using a VR headset. 
</p><p>The initial thinking was to use a</p>]]></description><link>https://eadalabs.com/getting-started-with-openxr/</link><guid isPermaLink="false">62d50ec013353b475e3ab822</guid><category><![CDATA[OpenXR]]></category><category><![CDATA[Viulux]]></category><category><![CDATA[OpenHMD]]></category><category><![CDATA[Oculus]]></category><category><![CDATA[Rift]]></category><category><![CDATA[Monado]]></category><category><![CDATA[VR]]></category><category><![CDATA[XR]]></category><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Mon, 18 Jul 2022 14:05:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1650963310446-011fc6a28367?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDY3fHwzZHxlbnwwfHx8fDE2NTgxODg0NTI&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1650963310446-011fc6a28367?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwxMTc3M3wwfDF8c2VhcmNofDY3fHwzZHxlbnwwfHx8fDE2NTgxODg0NTI&ixlib=rb-1.2.1&q=80&w=2000" alt="Getting started with OpenXR and Viulux VR headset"><p>During the past few weeks I have been experimenting with Virtual Reality, and more precisely with the OpenXR standard. My initial objective is simple: to enhance one of the existing <a href="https://eadalabs.com/a-visual-study-of-air-pollution-forecasting/">atmospheric simulations</a> into a 3D model and visualise it using a VR headset. </p><p>The initial thinking was to use a headset like the Oculus Quest. Unfortunately, that does not really work, and I soon realised that the best option was to use a tethered headset: that is, where the 3D/VR engine runs directly on a Linux server, and the headset is just used as a simple display and orientation sensor. 
</p><p>While looking for an affordable VR headset (aka HMD, for Head Mounted Device), I stumbled on the Viulux V9, which can be bought for $60 on <a href="https://item.taobao.com/item.htm?spm=a1z09.2.0.0.66e92e8dDOTG45&amp;id=660548444713">Taobao</a>. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://eadalabs.com/content/images/2022/07/image-3.png" class="kg-image" alt="Getting started with OpenXR and Viulux VR headset" srcset="https://eadalabs.com/content/images/size/w600/2022/07/image-3.png 600w, https://eadalabs.com/content/images/2022/07/image-3.png 702w"><figcaption>Viulux V9 HMD (Head Mounted Device), aka VR headset</figcaption></figure><p>This HMD has quite an impressive <a href="https://fccid.io/2AGQ9-VIULUXV9/User-Manual/User-Manual-3315190">spec</a> considering its cost: 2*1440*1440 displays with a 120 Hz refresh rate, a 9-DOF orientation sensor, and an extension for external SLAM sensors like NOLA. So, I just bought one from the Taobao shop.</p><p></p><h1 id="inside-the-viulux-v9-hardware">Inside the Viulux V9 hardware</h1><p></p><p>Opening the Viulux V9 does not require a screwdriver: just remove the lid, which is clipped into the headset.</p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2022/07/viuxlux-front.jpg" class="kg-image" alt="Getting started with OpenXR and Viulux VR headset" srcset="https://eadalabs.com/content/images/size/w600/2022/07/viuxlux-front.jpg 600w, https://eadalabs.com/content/images/2022/07/viuxlux-front.jpg 1000w" sizes="(min-width: 720px) 720px"></figure><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2022/07/viuxlux-front-inside.jpg" class="kg-image" alt="Getting started with OpenXR and Viulux VR headset" srcset="https://eadalabs.com/content/images/size/w600/2022/07/viuxlux-front-inside.jpg 600w, https://eadalabs.com/content/images/2022/07/viuxlux-front-inside.jpg 1000w" sizes="(min-width: 720px) 
720px"></figure><p></p><p>The inner PCB is not overly complex: it is composed of a DP-to-MIPI display IC, two STM ICs, and an EEPROM, most likely used to store the display information.</p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2022/07/viuxlux-inside.jpg" class="kg-image" alt="Getting started with OpenXR and Viulux VR headset" srcset="https://eadalabs.com/content/images/size/w600/2022/07/viuxlux-inside.jpg 600w, https://eadalabs.com/content/images/2022/07/viuxlux-inside.jpg 1000w" sizes="(min-width: 720px) 720px"></figure><p>The display IC is the <a href="https://toshiba.semicon-storage.com/us/semiconductor/product/interface-bridge-ics-for-mobile-peripheral-devices/display-interface-bridge-ics/detail.TC358860XBG.html">TC358860</a> from Toshiba and is just used to "convert the Embedded Display Port (eDPTM) video stream into an MIPI® DSI stream". The two other ICs are STM32s, and it is not quite clear why the board needs two of them. Most likely, one is dedicated to handling the real-time sensor data, and the other handles the USB connection and other controls (audio volume, etc). </p><p>The orientation sensors are located on the other side of the PCB. They consist of a 6-DOF accelerometer/gyroscope (<a href="https://www.sparkfun.com/products/retired/11234">MPU-6000</a>) and a 3-DOF compass (<a href="https://www.adafruit.com/product/1746">HMC5883L</a>).</p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2022/07/viuxlux-inside-rear.jpg" class="kg-image" alt="Getting started with OpenXR and Viulux VR headset" srcset="https://eadalabs.com/content/images/size/w600/2022/07/viuxlux-inside-rear.jpg 600w, https://eadalabs.com/content/images/2022/07/viuxlux-inside-rear.jpg 1000w" sizes="(min-width: 720px) 720px"></figure><p>Voila, that's it for the hardware. 
Now, let's have a look at what it takes on the software side to get the Viulux V9 to work.</p><h1 id="openxr-aka-the-ar-vr-standard">OpenXR, aka the AR/VR standard</h1><p>When it comes to Virtual Reality, the standard is called <a href="https://www.khronos.org/OpenXR/">OpenXR</a>. At first glance, it can look quite complex, but to make things simple, one can consider OpenXR as a <em>compositor</em>, which helps a 3D rendering application compose the frames needed for the two displays by providing the application with the scene orientation, position and timing. OpenXR also has an extensive set of APIs for handling auxiliary devices (e.g. controllers) and relative space positioning, but we can skip this part for now.</p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2022/07/image-5.png" class="kg-image" alt="Getting started with OpenXR and Viulux VR headset" srcset="https://eadalabs.com/content/images/size/w600/2022/07/image-5.png 600w, https://eadalabs.com/content/images/size/w1000/2022/07/image-5.png 1000w, https://eadalabs.com/content/images/2022/07/image-5.png 1282w" sizes="(min-width: 720px) 720px"></figure><h2 id="monado-the-openxr-runtime">Monado, the OpenXR runtime</h2><p>To get a VR application running on the Viulux HMD, one first needs an OpenXR runtime which implements the above-mentioned compositor. The very good news is that the team at <a href="https://collabora.com">Collabora</a>, together with <a href="https://www.khronos.org/">Khronos</a>, has released an open-source OpenXR runtime called <a href="https://monado.freedesktop.org/">Monado</a>, which supports many headsets.
The bad news is that the Viulux HMD does not work out of the box but, fortunately, this is easy to fix.</p><p></p><h2 id="getting-monado-to-support-viulux">Getting Monado to support Viulux</h2><p>The Viulux V9 HMD has been designed to be a drop-in replacement for the Oculus Rift HMD, meaning that both USB and video data should be interchangeable. When plugging the USB cable in, one can see the following device (using <code>lsusb</code>):</p><pre><code>Bus 001 Device 046: ID 2833:0001 Oculus VR, Inc. Rift Developer Kit 1
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            0
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  idVendor           0x2833 Oculus VR, Inc.
  idProduct          0x0001 Rift Developer Kit 1
  bcdDevice            0.01
  iManufacturer           1 Inlife 3D, Inc.
  iProduct                2 Tracker DK
  iSerial                 3 AAAAAAAAAAAA
  bNumConfigurations      1</code></pre><p>It does present itself as an Oculus Rift, with a manufacturer defined as "<em>Inlife 3D, Inc.</em>", which is the actual Viulux manufacturer name. </p><p>The way Monado handles the HMD is by using the OpenHMD project, which implements drivers for many headsets, including the Oculus Rift. </p><h2 id="introducing-openhmd">Introducing OpenHMD</h2><p>OpenHMD supports many head-mounted devices (HMDs). The way it detects an HMD is mainly by scanning the USB devices and matching against specific vendor and product IDs. In the case of the Oculus Rift, this is done in the <a href="https://github.com/OpenHMD/OpenHMD/blob/dfac0203376552c5274976c42f0757b31310c483/src/drv_oculus_rift/rift.c#L1084">OpenHMD/ src/ drv_oculus_rift/ rift.c</a><strong> </strong>file:</p><pre><code class="language-C++">static void get_device_list(ohmd_driver* driver, ohmd_device_list* list)
{
	// enumerate HID devices and add any Rifts found to the device list

	rift_devices rd[RIFT_ID_COUNT] = {
		{ "Rift (DK1)", OCULUS_VR_INC_ID, 0x0001,	-1, REV_DK1 },
		{ "Rift (DK2)", OCULUS_VR_INC_ID, 0x0021,	-1, REV_DK2 },
		{ "Rift (DK2)", OCULUS_VR_INC_ID, 0x2021,	-1, REV_DK2 },
        ...
	};

	for(int i = 0; i &lt; RIFT_ID_COUNT; i++){
		struct hid_device_info* devs = hid_enumerate(rd[i].company, rd[i].id);
		struct hid_device_info* cur_dev = devs;

		while (cur_dev) {
			// We need to check the manufacturer because other companies (eg: VR-Tek)
			// are reusing the Oculus DK1 USB ID for their own HMDs
			if(ohmd_wstring_match(cur_dev-&gt;manufacturer_string, L"Oculus VR, Inc.") &amp;&amp;
			   (rd[i].iface == -1 || cur_dev-&gt;interface_number == rd[i].iface)) {
				int id = 0;
				ohmd_device_desc* desc = &amp;list-&gt;devices[list-&gt;num_devices++];</code></pre><p>The issue in the above detection code is that it explicitly checks the manufacturer string. The reason is that other companies (like VR-Tek, or Inlife in our case) use the same <code>idProduct</code> and <code>idVendor</code>, but have different distortion and aberration parameters. So, to avoid using the wrong parameters for non-Oculus-manufactured HMDs, OpenHMD explicitly disables those devices.</p><p>There are many solutions to this issue; one straightforward option is to create a new internal ID for those non-Oculus devices and match against the manufacturer name. This, IMO, is more efficient than just duplicating the Rift driver into a Viulux copy, since 99% of the code is shared between Viulux and Rift. The change proposal can be seen in this <a href="https://github.com/OpenHMD/OpenHMD/commit/01021c426340662e182941776bbd26105a9a5e76">commit</a>.</p><p></p><h2 id="all-fine-let-s-run-the-app-then">All fine, let's run the app then</h2><p>With the previous change, the Viulux headset is automatically detected by OpenHMD. By default, OpenHMD will create a new window (using SDL) on the main screen, and to activate the Viulux display, one needs to "move" the window onto the Viulux display. Using Fedora, this can be done by right-clicking on the window and selecting "Move to the monitor on the (...|right)".  </p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2022/07/screen02-1.png" class="kg-image" alt="Getting started with OpenXR and Viulux VR headset" srcset="https://eadalabs.com/content/images/size/w600/2022/07/screen02-1.png 600w, https://eadalabs.com/content/images/2022/07/screen02-1.png 887w" sizes="(min-width: 720px) 720px"></figure><p></p><p>Voila, that's all you need to get OpenXR to run on the Viulux HMD.
Next, just move the headset and check out the scene!</p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2022/07/screencast-01.gif" class="kg-image" alt="Getting started with OpenXR and Viulux VR headset"></figure><p></p><h1 id="next-steps">Next Steps</h1><p>That was a quick post to get familiar with OpenXR, Monado and OpenHMD. </p><p>The next step is to port the existing 2D <a href="https://eadalabs.com/a-visual-study-of-air-pollution-forecasting/">atmospheric simulation</a> to a 3D simulation running on OpenXR. I will describe this experiment in the next post.</p>]]></content:encoded></item><item><title><![CDATA[When IPv6 routing goes wrong]]></title><description><![CDATA[<p></p><p>I have been using VPS from Linode for more than 10 years, located in several locations world-wide, and apart from network maintenance or the occasional DDoS attack, I have never experienced any direct networking issue. Until last month, when I ran into an interesting routing issue between two VMs located in</p>]]></description><link>https://eadalabs.com/when-ipv6-routing-goes-wrong/</link><guid isPermaLink="false">6219a8947aebff6a8aa9480b</guid><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Fri, 18 Mar 2022 11:02:25 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1606765962248-7ff407b51667?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDE1fHxpbnRlcm5ldCUyMHJvdXRlfGVufDB8fHx8MTY0NzU5NDE0Mw&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1606765962248-7ff407b51667?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwxMTc3M3wwfDF8c2VhcmNofDE1fHxpbnRlcm5ldCUyMHJvdXRlfGVufDB8fHx8MTY0NzU5NDE0Mw&ixlib=rb-1.2.1&q=80&w=2000" alt="When IPv6 routing goes wrong"><p></p><p>I have been using VPS from Linode for more than 10 years, located in several locations world-wide, and apart from network maintenance or the occasional DDoS attack, I have
never experienced any direct networking issue. Until last month, when I ran into an interesting routing issue between two VMs located in two data centres: Tokyo, Japan and London, UK.</p><p>It all started on February 23rd with alerts from a micro-service used to synchronise the data between a server in Tokyo and another one in London. After a quick investigation, it became clear that the issue was related to an abnormal network latency when connecting to the HTTPS API endpoint on the London server from the Tokyo service: the HTTP connection could be established, but getting the result from the API, which would usually take 100ms, would then take more than 30 seconds. </p><h2 id="micro-service-connectivity">Micro-service connectivity </h2><p>Looking further at the REST API micro-service logs from the London-based server did not give any more clues, so the natural next step was to check the connectivity from the Tokyo server down to the London server. A simple <code>curl -vvv</code> gave all the needed information:</p><pre><code class="language-bash"># curl "http://xxx.members.linode.com/service/sync" -vvv
* About to connect() to xxx.members.linode.com port 80 (#0)
*   Trying 2a01:7e00::1234:5678:9abc:def0...
* Connected to xxx.members.linode.com (2a01:7e00::1234:5678:9abc:def0) port 80 (#0)
* Connection timed out
*   Trying 139.123.123.123...
* Connected to xxx.members.linode.com (139.123.123.123) port 80 (#0)
&gt; GET /service/sync HTTP/1.1
&gt; User-Agent: curl/a.b.c
&gt; Host: xxx.members.linode.com
...</code></pre><p>What became clear is that, since the London server had both IPv4 and IPv6 advertised on the DNS, the curl command tried both IPv4 and IPv6, starting with IPv6. Where it went wrong is that the IPv6 connection failed, and only after 30 seconds did curl fall back to IPv4. Fortunately, the IPv4 connection worked fine, and the micro-service did provide the correct API result - ruling out an issue with the micro-service itself.</p><p>As an immediate action to keep the system running, the curl (fetch) command was first updated to enforce IPv4 use, and then the AAAA entry was removed from the DNS. </p><p>That worked fine, yet it did not explain why the IPv6 connectivity failed, especially since there had been no system update on any of the servers for the past 7 days. I then decided to run a few more tests to find out the root cause, and make sure that this situation would not happen again.</p><h1 id="understanding-connectivity">Understanding connectivity</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://eadalabs.com/content/images/2022/03/image-4.png" class="kg-image" alt="When IPv6 routing goes wrong" srcset="https://eadalabs.com/content/images/size/w600/2022/03/image-4.png 600w, https://eadalabs.com/content/images/size/w1000/2022/03/image-4.png 1000w, https://eadalabs.com/content/images/2022/03/image-4.png 1520w" sizes="(min-width: 720px) 720px"><figcaption>snapshot from https://stefansundin.github.io/traceroute-mapper/</figcaption></figure><h2 id="layer-3-connectivity-aka-ip-v6-">Layer 3 Connectivity - aka IP(v6)</h2><p>When it comes to checking the network connectivity between two servers, the common approach is to use ping and trace-route. Those are effective tools, but somewhat dated in 2022.
Nowadays, a better tool, which combines both ping and traceroute along with many more features, is <a href="http://www.bitwizard.nl/mtr/">MTR</a>, aka My Traceroute.</p><pre><code class="language-bash">mtr -6rwbzc 100  2a01:7e00::f03c:....
Start: Fri Feb 25 11:19:39 2022
HOST: Tokyo                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. 2400:8902::4255:....   0.0%   100    1.1   1.3   0.8   8.9   1.1
  2. 2400:8902:f::1         0.0%   100    0.6   1.6   0.4  37.4   4.8
  3. 2400:8902:5::1         0.0%   100    7.6   1.4   0.4  26.3   3.4
  4. 2001:678:34c:6c::1     0.0%   100  103.5 104.2 103.0 137.7   4.7
  5. 2600:3c01:3333:5::2    0.0%   100  103.1 103.7 103.0 115.4   1.9
  6. 2001:678:34c:44::2     0.0%   100  178.9 179.7 178.8 201.0   2.9
  7. 2600:3c04:15:5::1      0.0%   100  178.9 179.4 178.8 197.8   2.4
  8. 2001:678:34c:49::2     0.0%   100  259.4 254.2 252.8 293.5   5.4
  9. 2a01:7e00:7777:18::2   0.0%   100  253.9 253.8 253.5 258.0   0.6
 10. 2a01:7e00::f03c:....   2.0%   100  252.9 253.2 252.9 262.9   1.0</code></pre><p>The MTR result did not show anything abnormal, except for a 2% drop on the London VM hosting the micro-service. This drop is annoying and I will come back to this issue later. But for now, it does not explain why the curl command completely failed, since with a 2% drop, the curl command would still be able to go through. In other words, only a drop close to 100% would justify why the curl command does not work.</p><h2 id="layer-4-connectivity-aka-tcp-udp-and-icmp-">Layer 4 Connectivity - aka TCP, UDP and ICMP.</h2><p>The less-known part of ping and trace-route is that they essentially work using <a href="https://wikipedia.org/wiki/Internet_Control_Message_Protocol">ICMP</a> (aka Internet Control Message Protocol), a special Layer 4 protocol for checking IP connectivity. And in my case, ICMP does not represent the ground truth: the HTTPS micro-service is exposed over TCP (or UDP for QUIC/HTTP3), meaning that when connecting to the micro-service using curl, the real protocol that needs to be checked is TCP, not ICMP. </p><p>Maybe that sounds counter-intuitive: if connectivity can be established using ICMP, why couldn't it be using TCP or UDP? MTR actually has a <a href="https://github.com/traviscross/mtr/blob/ec42ba61f77654e8397e6496095634585f90b26d/man/mtr.8.in#L504">statement</a> about this issue. But before digging further into the details of the L4 connectivity, let's have a quick look at the MTR results using UDP and TCP:</p><pre><code class="language-bash">mtr -6rwbzc 100 --udp 2a01:7e00::f03c:...
HOST: Tokyo                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. 2400:8902::fa66:....   0.0%   100    0.9   1.6   0.7  22.4   2.4
     2400:8902::4255:....
  2. 2400:8902:d::1         0.0%   100   19.4   1.6   0.3  25.1   3.9
  3. 2001:678:34c:6c::1     0.0%   100    0.9  54.4   0.4 131.8  51.9
     2400:8902:5::1
  4. 2600:3c01:3333:5::2    0.0%   100  103.0 103.3 102.9 115.5   1.5
     2001:678:34c:6c::1
  5. 2001:678:34c:44::2     0.0%   100  158.7 130.8 102.9 161.2  27.8
     2600:3c01:3333:5::2
  6. 2001:678:34c:44::2     0.0%   100  158.9 159.4 158.6 175.0   1.9
     2600:3c04:15:5::1
  7. 2001:678:34c:49::2     0.0%   100  169.3 194.5 158.7 246.2  36.7
     2600:3c04:15:5::1
  8. 2a01:7e00:7777:18::2   0.0%   100  232.8 233.9 232.7 244.6   2.0
     2001:678:34c:49::2
  9. 2a01:7e00::f03c:....   0.0%   100  233.4 233.3 232.7 239.3   0.8
     2a01:7e00:7777:18::2</code></pre><p>At first glance, the UDP MTR does not show anything abnormal, at least as far as drops are concerned. The only annoying part is that the routes do not seem to be stable, and since MTR is sorting by hop, each hop can be associated with several IPs. For instance, <code><a href="https://dnslytics.com/ipv6/2a01:7e00:7777:18::2">2a01:7e00:7777:18::2</a></code>, which seems to be the Linode UK front router just before the VM, can be seen on hops 8 and 9. Yet, I cannot really explain why, if this IP can be seen on the last hop, there is no additional hop with the actual VM IP.</p><pre><code class="language-bash">mtr -6rwbzc 100 --tcp 2a01:7e00::f03c:...
HOST: Tokyo                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.2400:8902::4255:....    0.0%   100    0.9   1.1   0.7   2.5   0.2
  2.2400:8902:f::1          0.0%   100    0.5   1.5   0.4  39.9   4.5
  3.2400:8902:5::1          0.0%   100  106.4  59.8   0.4 149.1  53.9
    2001:678:34c:6c::1
  4.2600:3c01:3333:5::2     0.0%   100  106.5 108.5 106.3 144.9   5.9
    2001:678:34c:6c::1
  5.2600:3c01:3333:5::2     0.0%   100  162.2 137.6 106.3 194.8  28.2
    2001:678:34c:44::2
  6.2001:678:34c:44::2      0.0%   100  162.3 162.7 162.1 179.3   1.8
    2600:3c04:15:5::1
  7.2600:3c04:15:5::1       0.0%   100  162.2 193.5 162.2 246.8  36.9
    2001:678:34c:49::2
  8.2001:678:34c:49::2      0.0%   100  236.8 237.4 236.1 251.2   2.1
    2a01:7e00:7777:18::2
  9.2a01:7e00:7777:18::2   48.0%   100  236.9 237.1 236.8 237.9   0.0
 10.???                    100.0   100    0.0   0.0   0.0   0.0   0.0</code></pre><p>This second trace, this time using TCP, is spot on: it clearly shows a complete (100%) drop on the last IP, which is the London-based VM IP. </p><p>So, well.... ICMP and UDP working fine, but not TCP? And only for IPv6? To add to the weirdness of the situation, I checked the connectivity to the London VM from another VM located in Tokyo, and it did not show any issue. This is definitely an interesting situation worth digging into.</p><h2 id="layer-2-connectivity-aka-frames-">Layer 2 Connectivity - aka frames.</h2><p>The way MTR TCP connectivity testing works is by establishing TCP sessions (aka <a href="https://github.com/traviscross/mtr/blob/852e5617fbf331cf292723702161f0ac9afe257c/packet/construct_unix.c#L457">connect</a>) while at the same time setting the socket in non-blocking mode as well as setting the IP<a href="https://github.com/traviscross/mtr/blob/852e5617fbf331cf292723702161f0ac9afe257c/packet/construct_unix.c#L365"> TTL</a>. When the TTL is big enough, the server will respond with a SYN-ACK, while if the TTL is too low, an ICMP Time Exceeded message will be sent back.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://eadalabs.com/content/images/2022/03/image-5.png" class="kg-image" alt="When IPv6 routing goes wrong" srcset="https://eadalabs.com/content/images/size/w600/2022/03/image-5.png 600w, https://eadalabs.com/content/images/size/w1000/2022/03/image-5.png 1000w, https://eadalabs.com/content/images/size/w1600/2022/03/image-5.png 1600w, https://eadalabs.com/content/images/2022/03/image-5.png 2326w" sizes="(min-width: 720px) 720px"><figcaption>From https://www.mdpi.com/2076-3417/6/11/358/htm</figcaption></figure><p>In order to check the TCP SYN/SYN-ACK handshake, I used <code>netcat</code>: one instance on the London server in listen mode, and the other on the Tokyo server, establishing connections.
At the same time, I used tcpdump to capture the raw frames. </p><p>The London-based server did show both the SYN and the SYN-ACK response packets.</p><pre><code class="language-bash">#London &gt; sudo tcpdump -c 10 ip6 host 2400:8902::f03c:.... -n
02:45:10.545147 IP6 2400:8902::f03c:.....54118 &gt; 2a01:7e00::f03c:.....krb524: Flags [S], seq 1241850342, win 64800, options [mss 1440,sackOK,TS val 3582020984 ecr 0,nop,wscale 7], length 0
02:45:10.545222 IP6 2a01:7e00::f03c:.....krb524 &gt; 2400:8902::f03c:.....54118: Flags [S.], seq 4136409442, ack 1241850343, win 64260, options [mss 1440,sackOK,TS val 4039886398 ecr 3582020984,nop,wscale 7], length 0</code></pre><p>The Tokyo-based server, however, did not show any of the SYN-ACK responses.</p><pre><code class="language-bash">#Tokyo&gt; sudo tcpdump -c 10 ip6 host 2a01:7e00::f03c:....
11:45:10.418211 IP6 2400:8902::f03c:.....54118 &gt; 2a01:7e00::f03c:.....krb524: Flags [S], seq 1241850342, win 64800, options [mss 1440,sackOK,TS val 3582020984 ecr 0,nop,wscale 7], length 0
11:45:11.434915 IP6 2400:8902::f03c:.....54118 &gt; 2a01:7e00::f03c:.....krb524: Flags [S], seq 1241850342, win 64800, options [mss 1440,sackOK,TS val 3582022001 ecr 0,nop,wscale 7], length 0
</code></pre><p>This meant that the London server did properly receive the traffic and respond to it, but for some reason the response would never make it back to the Tokyo server. Since this only happened for TCP packets, the next question was whether the server in London could actually establish TCP connections with the server in Tokyo. </p><pre><code class="language-bash">#London &gt; mtr --tcp -r -c 10 2400:8902::f03c:...
Start: Sat Feb 26 04:27:05 2022
HOST: London                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2a01:7e00::208:....  0.0%    10    1.1   1.3   1.1   2.0   0.0
  2.|-- 2a01:7e00:7777:20::1 0.0%    10    3.0   3.8   0.5  15.1   4.6
  3.|-- 2a01:7e00:7777:5::1  0.0%    10    0.5  16.1   0.5  74.5  30.8
  4.|-- 2600:3c04:15:5::2    0.0%    10   76.4  74.8  74.6  76.4   0.5
  5.|-- 2600:3c04:15:5::2    0.0%    10  130.3  97.0  74.6 130.4  28.7
  6.|-- 2600:3c01:3333:5::1  0.0%    10  130.6 131.6 130.3 138.4   2.4
  7.|-- 2001:678:34c:6c::2   0.0%    10  136.3 175.8 130.3 239.8  53.5
  8.|-- 2001:678:34c:6c::2   0.0%    10  236.5 236.8 236.4 237.5   0.0
  9.|-- 2400:8902::f03c:.... 0.0%    10  239.0 263.9 237.1 336.7  42.7</code></pre><p>Unfortunately, the connection in that direction did work fine. If it had not, that could have indicated an issue with the router not properly forwarding packets, for instance because of a corrupted cache (as described in <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/everflow-sigcomm15.pdf">everflow</a>). But no, that was not the case.</p><h2 id="layer-2-neighbour-connectivity">Layer 2 - Neighbour Connectivity</h2><p>After scratching my head for a few more hours, I decided to reach out to the Linode team, who suggested flushing the ARP cache (strictly speaking, for IPv6, the NDP neighbour cache) using <code>ip neigh flush all</code> and checking the result using <code>ip -6 neigh show</code>. </p><p>At first glance, having to flush the ARP entries seems somewhat irrational in the context of this problem: ARP is only about resolving the next-hop MAC address. So, if the traffic can get through using UDP, that would mean that the next hop was properly resolved. In that case, why couldn't the same next-hop work for TCP too?</p><p>I decided anyway to follow the suggestion from the Linode team, and, to my surprise, there were actually a lot of FAILED entries in the ARP table (FAILED indicates that the system could not be reached, while STALE indicates that the connection hasn't been recently verified):</p><pre><code class="language-bash">#London &gt; ip -6 neigh show
fe80::1 dev eth0 lladdr 00:05:xx:xx:xx:xx router REACHABLE
fe80::bace:f6ff:fexx:4aa6 dev eth0 lladdr b8:ce:xx:xx:xx:xx STALE
fe80::8678:acff:fexx:21cc dev eth0 lladdr 84:78:xx:xx:xx:xx router STALE
fe80::bace:f6ff:fexx:5a56 dev eth0  router FAILED
fe80::bace:f6ff:fexx:4b66 dev eth0  router FAILED
fe80::063f:72ff:fexx:5af2 dev eth0  FAILED
fe80::063f:72ff:fexx:53f2 dev eth0  FAILED
fe80::bace:f6ff:fexx:5ee6 dev eth0  FAILED
fe80::bace:f6ff:fexx:4a66 dev eth0  FAILED</code></pre><p>Flushing did not fix anything. But checking the ARP table triggered the idea to also check the IP routing table, to see if any of those failed IPs could appear there. The result, which can be obtained using <code>ip -6 r</code>, came as a big surprise:</p><pre><code class="language-bash">#London &gt; ip -6 r
fe80::/64 dev eth0 proto kernel metric 100 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
fe80::/64 dev tailscale0 proto kernel metric 256 pref medium
default proto ra metric 100
	nexthop via fe80::1 dev eth0 weight 1
	nexthop via fe80::4094 dev eth0 weight 1
	nexthop via fe80::063f:72ff:fexx:52f2 dev eth0 weight 1
	nexthop via fe80::bace:f6ff:fexx:4a66 dev eth0 weight 1
	nexthop via fe80::bace:f6ff:fexx:42c6 dev eth0 weight 1
	nexthop via fe80::bace:f6ff:fexx:5ee6 dev eth0 weight 1
	nexthop via fe80::bace:f6ff:fexx:42e6 dev eth0 weight 1
	nexthop via fe80::bace:f6ff:fexx:5a56 dev eth0 weight 1
	nexthop via fe80::063f:72ff:fexx:53f2 dev eth0 weight 1
	nexthop via fe80::bace:f6ff:fexx:4b66 dev eth0 weight 1
	...
</code></pre><p>There are many questions from the routing output. The first and most obvious is: why are there so many IPs listed as next hops? Usually VMs report only <code>fe80::1</code> or <code>fe80::4094</code> as the next hop, but not this specific server, which listed more than 30 next-hop IPs for the same <code>eth0</code> network interface. </p><p>Second question: why were most of those next-hop IPs the ones marked as FAILED in the ARP entries? The good news is that this could explain the issue with TCP and IPv6: provided the kernel IP routing would "stick" a TCP flow (identified by a directional source-destination IP pair) to a given failed next-hop, that would explain why some flows would fail, but not others.</p><p>At this point, I still do not have the answers to those two questions, and a quick Google search did not give any convincing explanation. It seems that the only way is to dig into the kernel code, in a similar way as this <a href="https://vincent.bernat.ch/en/blog/2017-ipv6-route-lookup-linux">excellent article</a> on IPv6 routing performance. I will keep this for a later post. </p><h2 id="layer-1-physical-connectivity">Layer 1 - Physical Connectivity</h2><p>Meanwhile, there is actually another much more relevant question: in a data-center, how many peer neighbours can a VM have? I was somewhat naïve, thinking that all the VM traffic would go through a single router (VTEP). </p><p>To verify the assumption, I first ran tcpdump to extract the MAC addresses and sort them by frequency. The result came again as a big surprise:</p><pre><code> #London &gt; sudo tcpdump -c 10000  -nnS -e | awk '{ print($2);print($4); }' | grep -v Flags | sed 's/,//g' | sort | uniq -c | sort -n
  10000 f2:3c:91:xx:xx:01
   9367 f2:3c:91:xx:xx:02
    307 00:00:0c:9f:f0:19
     51 00:05:73:a0:0f:ff
     ...</code></pre><p> The first two address <code>f2:3c:91:xx:xx:xx</code> are mac addresses of VMs hosted in the London data-center. Having direct VM L2 connectivity is ok provided that VxLAN is used, which ought to be the case (it is actually quite difficult to find any information from Linode DC topology). </p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2022/03/image-6.png" class="kg-image" alt="When IPv6 routing goes wrong" srcset="https://eadalabs.com/content/images/size/w600/2022/03/image-6.png 600w, https://eadalabs.com/content/images/size/w1000/2022/03/image-6.png 1000w, https://eadalabs.com/content/images/size/w1600/2022/03/image-6.png 1600w, https://eadalabs.com/content/images/2022/03/image-6.png 1885w" sizes="(min-width: 720px) 720px"></figure><p>What is interesting are the other IP addresses. Could they be from different VTEPs, which seems to be a <a href="https://sivasankar.org/2018/2312/vmware-nsx-dmz-anywhere-detailed-design-guide/">recommended</a> load-balancing design? To know the answer, the traffic needs to be classified by mac and IP flows. I implemented a simple golang app (<a href="https://gist.github.com/ronanj/627a6931ad4c0e24f8c225ef79c9bf0f">gist</a>) to classify the packets, and here is the result:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>IP</th>
<th>MAC</th>
</tr>
</thead>
<tbody>
<tr>
<td>ipv4-public-linode-tokyo-1</td>
<td>&gt;00:00:0c:9f:f0:19</td>
</tr>
<tr>
<td>ipv4-public-linode-tokyo-2</td>
<td>&gt;00:00:0c:9f:f0:19</td>
</tr>
<tr>
<td>ipv4-public-linode-london-2</td>
<td>&gt;00:00:0c:9f:f0:19</td>
</tr>
<tr>
<td><strong>ipv6</strong>-public-linode-tokyo-2</td>
<td>&gt;00:05:73:a0:0f:ff</td>
</tr>
<tr>
<td><strong>ipv6</strong>-public-linode-tokyo-1</td>
<td>&gt;00:05:73:a0:0f:ff &gt;00:05:73:a0:0f:fe</td>
</tr>
<tr>
<td><strong>ipv6</strong>-public-digitalocean-uk</td>
<td>&gt;00:05:73:a0:0f:ff</td>
</tr>
<tr>
<td><strong>ipv6</strong>-public-digitalocean-de</td>
<td>&gt;00:05:73:a0:0f:fe</td>
</tr>
<tr>
<td>ipv4-<strong>private</strong>-linode-london-2</td>
<td>&gt;f2:3c:91:37:xx:xx</td>
</tr>
<tr>
<td>ipv4-<strong>private</strong>-linode-london-3</td>
<td>&gt;f2:3c:91:a1:xx:xx</td>
</tr>
<tr>
<td>ipv4-<strong>private</strong>-linode-london-1</td>
<td>&gt;f2:3c:92:a1:xx:xx</td>
</tr>
</tbody>
</table>
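The grouping behind this table boils down to recording, for each IP flow, the set of next-hop (destination) MAC addresses observed. A minimal Go sketch of that idea (illustrative only — the linked gist is the actual implementation, and the `Packet` type here assumes the MAC and IP fields have already been parsed out of the capture):

```go
package main

import "fmt"

// Packet is a minimal view of one captured frame: the destination IP
// of the flow and the destination (next-hop) MAC it was sent to.
type Packet struct {
	DstIP  string
	DstMAC string
}

// classify returns, for each destination IP, the set of next-hop MACs
// observed, so flows using more than one next-hop stand out.
func classify(pkts []Packet) map[string]map[string]bool {
	flows := make(map[string]map[string]bool)
	for _, p := range pkts {
		if flows[p.DstIP] == nil {
			flows[p.DstIP] = make(map[string]bool)
		}
		flows[p.DstIP][p.DstMAC] = true
	}
	return flows
}

func main() {
	pkts := []Packet{
		{"ipv6-tokyo-1", "00:05:73:a0:0f:ff"},
		{"ipv6-tokyo-1", "00:05:73:a0:0f:fe"},
		{"ipv4-london-2", "f2:3c:91:37:00:01"},
	}
	for ip, macs := range classify(pkts) {
		fmt.Printf("%s uses %d next-hop MAC(s)\n", ip, len(macs))
	}
}
```

A flow reported with more than one MAC is the suspicious case worth a closer look.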
<!--kg-card-end: markdown--><p>It shows that all private IP addresses are using their own MAC address, which is the expected behaviour for VxLAN. It also shows that IPv6 and IPv4 are using different next-hops, which is not a problem in itself. And finally, it shows that the connection to the Tokyo-1 server is using two next-hop MAC addresses, which, at first glance, seems suspicious. </p><p>A tcpdump helped to confirm that the two next-hops seem to be distributed among packets based on the <a href="https://networkengineering.stackexchange.com/questions/28723/usage-of-flow-label-in-ipv6-header">flow label</a>.</p><!--kg-card-begin: markdown--><blockquote>
<p>Packet 1 &amp; 2, using flow label <code>774398</code>:</p>
</blockquote>
<pre><code>- Layer 1 (14 bytes) = SrcMAC=f2:3c:91:..:..:.. DstMAC=00:05:73:a0:0f:fe EthernetType=IPv6 Length=0}
- Layer 2 (40 bytes) = IPv6	FlowLabel=774398 Length=32 NextHeader=TCP HopLimit=64 SrcIP=2a01:7e00::f03c:... DstIP=2400:8902::f03c:...

- Layer 1 (14 bytes) = SrcMAC=f2:3c:91:..:..:.. DstMAC=00:05:73:a0:0f:fe EthernetType=IPv6 Length=0}
- Layer 2 (40 bytes) = IPv6	FlowLabel=774398 Length=32 NextHeader=TCP HopLimit=64 SrcIP=2a01:7e00::f03c:... DstIP=2400:8902::f03c:...
</code></pre>
<blockquote>
<p>Packet 3 &amp; 4, using flow label <code>106397</code>:</p>
</blockquote>
<pre><code>- Layer 1 (14 bytes) = SrcMAC=f2:3c:91:..:..:.. DstMAC=00:05:73:a0:0f:ff EthernetType=IPv6 Length=0}
- Layer 2 (40 bytes) = IPv6	FlowLabel=106397 Length=32 NextHeader=TCP HopLimit=64 SrcIP=2a01:7e00::f03c:... DstIP=2400:8902::f03c:...

- Layer 1 (14 bytes) = SrcMAC=f2:3c:91:..:..:.. DstMAC=00:05:73:a0:0f:ff EthernetType=IPv6 Length=0}
- Layer 2 (40 bytes) = IPv6	FlowLabel=106397 Length=394 NextHeader=TCP HopLimit=64 SrcIP=2a01:7e00::f03c:... DstIP=2400:8902::f03c:...
</code></pre>
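The dumps above are consistent with hash-based multipath routing: the flow identifiers (for IPv6, including the flow label) are hashed, and the result picks one of the configured next-hops, so all packets of a flow stick to the same gateway — including a broken one, which is exactly the failure mode seen earlier. A hedged sketch of that selection logic (not the actual Linux implementation; function and field names are illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickNextHop deterministically maps a flow (source, destination and
// IPv6 flow label) onto one of the equal-weight next-hops.
func pickNextHop(src, dst string, flowLabel uint32, nextHops []string) string {
	h := fnv.New32a()
	h.Write([]byte(src))
	h.Write([]byte(dst))
	// Fold the 20-bit flow label into the hash input.
	h.Write([]byte{byte(flowLabel >> 16), byte(flowLabel >> 8), byte(flowLabel)})
	return nextHops[h.Sum32()%uint32(len(nextHops))]
}

func main() {
	hops := []string{"fe80::1", "fe80::4094"}
	// Two packets of the same flow always pick the same next-hop.
	a := pickNextHop("2a01:7e00::1", "2400:8902::1", 774398, hops)
	b := pickNextHop("2a01:7e00::1", "2400:8902::1", 774398, hops)
	fmt.Println(a == b) // always true: same flow hashes to the same next-hop
}
```

With such a scheme, a flow that happens to hash onto a dead next-hop stays black-holed until the route set or the flow label changes.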
<!--kg-card-end: markdown--><p>To understand the reason for those two hops, we need to get back to the ARP entries. Here is what <code>ip -6 neigh show</code> shows for the two MAC addresses:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>IP</th>
<th>dev</th>
<th>lladdr</th>
<th>State</th>
</tr>
</thead>
<tbody>
<tr>
<td>fe80::4094</td>
<td>eth0</td>
<td><code>00:05:73:a0:0f:fe</code></td>
<td>router REACHABLE</td>
</tr>
<tr>
<td>fe80::1</td>
<td>eth0</td>
<td><code>00:05:73:a0:0f:ff</code></td>
<td>router REACHABLE</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>The two IPs are both <a href="https://en.wikipedia.org/wiki/Link-local_address">link-local</a> IPv6 addresses, and <code>fe80::1</code> is the default <a href="https://blogs.infoblox.com/ipv6-coe/fe80-1-is-a-perfectly-valid-ipv6-default-gateway-address/">IPv6 gateway</a>, while <code>fe80::4094</code> is actually a <em>random</em> link-local address. Considering that the two MAC addresses differ by only 1 bit, it should be OK to assume that they come from the same network device having two MAC addresses, maybe used in LAG mode. In any case, since those two MACs are valid and reachable next-hops, it is totally acceptable for the kernel to use either of them for IP routing.</p><p>The more interesting question is why this only happens for a single remote VM, which, coincidentally, is also located in the Tokyo data-centre. To answer this question, I will need to deep-dive into the Linux kernel, with the hint that it has something to do with the flow label. I will keep this for a later post. </p><p>For now, this multi-next-hop routing behaviour is maybe not bad news, because it could definitely explain why the IPv6 connectivity initially got lost between the two VMs in London and Tokyo.</p><h1 id="conclusion">Conclusion</h1><p>I would have liked to be able to do even more post-mortem analysis, but unfortunately, the failing server in London got an automated upgrade to migrate the external disk from SSD to NVMe, which required a reboot. </p><p>The somewhat good news is that the problem got fixed after the reboot. Yet, in terms of <em>devops</em>, that is <strong>not</strong> good news, because having to reboot a server should always remain an exceptional situation.
But given that the issue seems to have been caused by an IPv6 routing corruption, and that flushing the ARP entries did not solve it, there was little alternative left but to reboot.</p><p>Going forward, there is one thing to remember when handling connectivity issues: always check the IP routing table! This should actually be automated and periodically checked by monitoring tools.</p>]]></content:encoded></item><item><title><![CDATA[A new look at SQLite]]></title><description><![CDATA[<p>Probably like many, I used to look at SQLite as a nice tool for working on dev or staging apps, and Postgres or MySQL as <em>the</em> solution for production environments. Until I read the excellent <a href="https://apenwarr.ca/log/20211229">article</a> from Avery Pennarun, where SQLite is described as an open source tool with</p>]]></description><link>https://eadalabs.com/a-new-look-at-sqlite/</link><guid isPermaLink="false">61f359127aebff6a8aa945de</guid><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Fri, 28 Jan 2022 05:04:45 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1483736762161-1d107f3c78e1?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDJ8fGRhdGFiYXNlfGVufDB8fHx8MTY0MzMzODY3OQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1483736762161-1d107f3c78e1?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwxMTc3M3wwfDF8c2VhcmNofDJ8fGRhdGFiYXNlfGVufDB8fHx8MTY0MzMzODY3OQ&ixlib=rb-1.2.1&q=80&w=2000" alt="A new look at SQLite"><p>Probably like many, I used to look at SQLite as a nice tool for working on dev or staging apps, and Postgres or MySQL as <em>the</em> solution for production environments. Until I read the excellent <a href="https://apenwarr.ca/log/20211229">article</a> from Avery Pennarun, where SQLite is described as an open source tool with "... 
<em>quality levels too high to be rational for the market to provide</em>".</p><p>Actually, why couldn't SQLite be used in a production environment? What if my preconceived idea that SQLite is "too light to be performant" wasn't true? What if SQLite could outperform PGX or MySQL? By performance, I mean not only read and write operations per second, but also the disk space needed to store the actual data. </p><h1 id="context">Context</h1><p>This study is done in the context of the time-series database (TSDB) used to store data from <a href="https://eadalabs.com/a-new-look-at-sqlite/waqi.info">waqi.info</a> and <a href="https://eadalabs.com/a-new-look-at-sqlite/aqicn.org/here/">aqicn.org</a>. There is a lot of data, and 99% of it is time-series data, which is currently carefully stored on a dual PGX/MySQL cluster managed by a custom framework called ATSDB. </p><p>This ATSDB framework automates time-series storage, handling not only transparent table time slicing, but also string-to-index mapping, data streaming, data re-compression, L1 and L2 caching, as well as backup, plus a few more extras, all, of course, done in a completely automated way. </p><p>As we are now in the process of adding full support for PGX to ATSDB, the question that popped up is: why couldn't we also add support for SQLite? To get a rational answer, it was decided to first do a few benchmarks.</p><h1 id="benchmarking">Benchmarking</h1><p>To perform the benchmark, we used two table types. The first one contains two columns of type string, mapped to 4-byte and 1-byte indexes respectively. The second table contains six additional columns of type int, each encoded in 4 bytes. Note that both tables contain an implicit 4-byte timestamp, automatically added by ATSDB for time-sliced tables.</p><pre><code class="language-go">type Table2Columns struct {
	Sensor string `atsdb:"map:uint32"`
	Status string `atsdb:"map:uint8"`
}

type Table8Columns struct {
	Sensor   string `atsdb:"map:uint32"`
	Status   string `atsdb:"map:uint8"`
	Count1   int
	Count2   int
	Count3   int
	Count4   int
	Count5   int
	Count6   int
}</code></pre><h2 id="write-speed">Write Speed</h2><p>The test consists of <strong>inserting</strong> data as fast as possible for one minute, using a PGX, a MySQL, or a SQLite backend. The test platform is a 4-core Intel i5-7200U CPU running at 2.50GHz. </p><p>Here are the results for the insertion speed:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>Backend</th>
<th>2 columns</th>
<th>8 columns</th>
</tr>
</thead>
<tbody>
<tr>
<td>PGX</td>
<td>19,459 inserts/sec</td>
<td>17,433 inserts/sec</td>
</tr>
<tr>
<td>MySQL</td>
<td>17,189 inserts/sec</td>
<td>15,511 inserts/sec</td>
</tr>
<tr>
<td>SQLite</td>
<td>187,539 inserts/sec</td>
<td>111,118 inserts/sec</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Wow, I seriously did not expect this kind of result. SQLite is up to 10 times faster than MySQL and PGX, even without any specific optimisation. </p><h2 id="read-speed">Read Speed</h2><p>Next is the read-back speed. This can be important when preloading caches that need, for instance, the past 8 hours of stored data. </p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>Backend</th>
<th>2 columns</th>
<th>8 columns</th>
</tr>
</thead>
<tbody>
<tr>
<td>PGX</td>
<td>790,506 read/sec</td>
<td>402,582 read/sec</td>
</tr>
<tr>
<td>MySQL</td>
<td>782,317 read/sec</td>
<td>326,740 read/sec</td>
</tr>
<tr>
<td>SQLite</td>
<td>478,080 read/sec</td>
<td>207,321 read/sec</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>This time PGX wins, but only by a factor of 2. And SQLite is already able to achieve almost half a million reads per second, which is definitely enough for our use case. Furthermore, ATSDB supports streaming, so when reading really huge amounts of historical data, the data is always progressively extracted. So, this SQLite read-back performance is definitely enough.</p><h2 id="storage-efficiency">Storage Efficiency</h2><p>Last but not least, we need to look at disk usage: when accumulating data for more than 15 years, no one wants to end up paying an exponentially increasing bill to cloud providers for data storage. In other words, the smaller the data store on disk, i.e. the lower the entropy, the better. </p><p>The results below are obtained by dividing the table size by the number of entries stored in the table. For SQLite, it is the file size; for MySQL, it is the sum of the data and index sizes obtained from the <code>information schema</code> metadata; and for PGX, it is <code>pg_total_relation_size (oid)</code> from <code>pg_class c</code>. </p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>Backend</th>
<th>2 columns</th>
<th>8 columns</th>
</tr>
</thead>
<tbody>
<tr>
<td>PGX</td>
<td>44.5 bytes per entry</td>
<td>68.5 bytes per entry</td>
</tr>
<tr>
<td>MySQL</td>
<td>39.8 bytes per entry</td>
<td>71.2 bytes per entry</td>
</tr>
<tr>
<td>SQLite</td>
<td>14.1 bytes per entry</td>
<td>44.6 bytes per entry</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Remember that ATSDB adds an implicit 4 bytes for the timestamp. So, for the 2-column version, the minimum entropy is 4 (timestamp) + 4 (sensor value) + 1 (status value), i.e. 9 bytes per entry, and for the 8-column version, with the additional 6*4 bytes, a total of 35 bytes. Based on those numbers, we can get the DB engine storage overhead, and the lower the better:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>Backend</th>
<th>2 columns</th>
<th>8 columns</th>
</tr>
</thead>
<tbody>
<tr>
<td>PGX</td>
<td>394% overhead</td>
<td>96% overhead</td>
</tr>
<tr>
<td>MySQL</td>
<td>342% overhead</td>
<td>103% overhead</td>
</tr>
<tr>
<td>SQLite</td>
<td>57% overhead</td>
<td>27% overhead</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Yet, to be precise, MySQL has a wonderful feature that allows changing the storage engine. Most commonly InnoDB is used, but MyISAM is much more efficient in terms of storage, at the cost of slower insert speed. So, to be exact, we need to evaluate the storage efficiency of both MyISAM and InnoDB against SQLite:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>Backend</th>
<th>2 columns (bytes/entry)</th>
<th>2 columns (overhead)</th>
<th>8 columns (bytes/ entry)</th>
<th>8 columns (overhead)</th>
</tr>
</thead>
<tbody>
<tr>
<td>SQLite</td>
<td>14.1</td>
<td>57%</td>
<td>44.6</td>
<td>27%</td>
</tr>
<tr>
<td>MySQL (MyISAM)</td>
<td>10.1</td>
<td>12%</td>
<td>35.1</td>
<td>0%</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Using MyISAM, MySQL can outperform SQLite, and that makes it a better choice for archiving warm data, e.g. tables that are mostly read and seldom updated. But for live data, e.g. tables where most operations are inserts or updates, and where reads are handled in memory by the L1 cache, SQLite is definitely a very good choice. </p><p>As for PGX, I do not have enough knowledge about its internal storage structure to say whether it can be optimised, but this <a href="https://ketansingh.me/posts/how-postgres-stores-rows/">post</a> should be a good starting link. I'll blog about that in a later post.</p><h1 id="conclusion">Conclusion</h1><p>The results speak for themselves, and I must admit I was completely biased and wrong: SQLite is not a "light DB", it is a high-performing database! </p><p>And when taking into account how easy it is to deploy an SQLite database compared to deploying MySQL or PGX, this makes SQLite the preferred data-store choice for many applications. 
</p>]]></content:encoded></item><item><title><![CDATA[ESP32G Smart Ethernet Gateway]]></title><description><![CDATA[<p>I recently stumbled on AI Thinker's "ESP32-G" WiFi+BLE+Ethernet smart gateway because of its cost and form factor.</p><p>The gateway is available at 98CNY (~15 USD) on <a href="https://item.taobao.com/item.htm?id=633468623184">Taobao</a>, which makes it even more attractive than the low-cost TP-Link <a href="https://openwrt.org/toh/tp-link/tl-wr703n">TL-WR703N</a> routers, considering the vibrant Espressif eco-system.</p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2022/01/image.png" class="kg-image" alt="ESP32-G Smart Wifi, Ethernet, Bluetooth Gateway" srcset="https://eadalabs.com/content/images/size/w600/2022/01/image.png 600w, https://eadalabs.com/content/images/size/w1000/2022/01/image.png 1000w, https://eadalabs.com/content/images/size/w1600/2022/01/image.png 1600w, https://eadalabs.com/content/images/2022/01/image.png 1988w" sizes="(min-width: 720px) 720px"></figure><p> The primary objective</p>]]></description><link>https://eadalabs.com/esp32g-smart-ethernet-gateway/</link><guid isPermaLink="false">61d8fe2b7aebff6a8aa944eb</guid><category><![CDATA[esp32]]></category><category><![CDATA[ethernet]]></category><category><![CDATA[esp32-g]]></category><category><![CDATA[bridge]]></category><category><![CDATA[gateway]]></category><category><![CDATA[esp-idf]]></category><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Sat, 08 Jan 2022 03:45:38 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1627660080110-20045fd3875d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEwNXx8Ymx1ZXByaW50fGVufDB8fHx8MTY0MTYzMDA3Nw&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img 
src="https://images.unsplash.com/photo-1627660080110-20045fd3875d?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwxMTc3M3wwfDF8c2VhcmNofDEwNXx8Ymx1ZXByaW50fGVufDB8fHx8MTY0MTYzMDA3Nw&ixlib=rb-1.2.1&q=80&w=2000" alt="ESP32G Smart Ethernet Gateway"><p>I recently stumbled on AI Thinker's "ESP32-G" WiFi+BLE+Ethernet smart gateway because of its cost and form factor.</p><p>The gateway is available at 98CNY (~15 USD) on <a href="https://item.taobao.com/item.htm?id=633468623184">Taobao</a>, which makes it even more attractive than the low-cost TP-Link <a href="https://openwrt.org/toh/tp-link/tl-wr703n">TL-WR703N</a> routers, considering the vibrant Espressif eco-system.</p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2022/01/image.png" class="kg-image" alt="ESP32G Smart Ethernet Gateway" srcset="https://eadalabs.com/content/images/size/w600/2022/01/image.png 600w, https://eadalabs.com/content/images/size/w1000/2022/01/image.png 1000w, https://eadalabs.com/content/images/size/w1600/2022/01/image.png 1600w, https://eadalabs.com/content/images/2022/01/image.png 1988w" sizes="(min-width: 720px) 720px"></figure><p> The primary objective of this gateway is to allow external Bluetooth or Wifi-mesh devices to connect to the LAN. The software running on the gateway even comes with an MQTT client used to forward messages from the IoT devices connected to the gateway (see the <a href="https://docs.ai-thinker.com/_media/esp32-g_gateway_user_manual.pdf">doc</a> for more info).</p><p>Anyway, regardless of how nice the preloaded software is, the real question is whether one can flash the gateway with custom software... and the answer is of course yes! 
But, first, let's have a look inside the box:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://eadalabs.com/content/images/2022/01/image-1.png" class="kg-image" alt="ESP32G Smart Ethernet Gateway" srcset="https://eadalabs.com/content/images/size/w600/2022/01/image-1.png 600w, https://eadalabs.com/content/images/2022/01/image-1.png 970w" sizes="(min-width: 720px) 720px"><figcaption>Inside the ESP32-G</figcaption></figure><p>It's actually quite simple, which is not a surprise considering the price. The Ethernet controller is a LAN8720, the same as the one used in the WT32-ETH01. There is also a fully wired CH340 serial TTL converter, which allows flashing the gateway using only the USB port. And the gateway also exposes 6 GPIOs which can be used to add extra sensors or modules. Cool! </p><p>In the previous post, I used the WT32-ETH01 configuration to drive the Ethernet LAN. Let's see how the configuration changes:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://eadalabs.com/content/images/2022/01/image-2.png" class="kg-image" alt="ESP32G Smart Ethernet Gateway" srcset="https://eadalabs.com/content/images/size/w600/2022/01/image-2.png 600w, https://eadalabs.com/content/images/size/w1000/2022/01/image-2.png 1000w, https://eadalabs.com/content/images/size/w1600/2022/01/image-2.png 1600w, https://eadalabs.com/content/images/2022/01/image-2.png 1860w" sizes="(min-width: 720px) 720px"><figcaption>ESP32-G pinout and LAN8720 wiring</figcaption></figure><p>The main difference with the WT32-ETH01 is that there is no external oscillator to drive the LAN8720, so the clock needs to be provided by the ESP32, which it does via GPIO 17. The other difference is the Reset (aka Power) pin, which uses GPIO 5 on the ESP32-G, while it was GPIO 16 on the WT32-ETH01:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>Configuration</th>
<th>ESP32-G</th>
<th>WT32-ETH01</th>
</tr>
</thead>
<tbody>
<tr>
<td>Power (Reset)</td>
<td>5</td>
<td>16</td>
</tr>
<tr>
<td>Phy Address</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>MDIO</td>
<td>18</td>
<td>18</td>
</tr>
<tr>
<td>MDC</td>
<td>23</td>
<td>23</td>
</tr>
<tr>
<td>Clock Mode</td>
<td>OUT</td>
<td>IN</td>
</tr>
<tr>
<td>Clock GPIO</td>
<td>17</td>
<td>0</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>The following initialisation code does work fine:</p><pre><code class="language-C">#define ETH_TYPE ETH_PHY_LAN8720
#define ETH_POWER_PIN 5
#define ETH_MDC_PIN 23
#define ETH_MDIO_PIN 18
#define ETH_ADDR 1

void initialize_ethernet(void)
{
    eth_mac_config_t mac_config = ETH_MAC_DEFAULT_CONFIG();
    eth_phy_config_t phy_config = ETH_PHY_DEFAULT_CONFIG();
    phy_config.phy_addr = ETH_ADDR;
    phy_config.reset_gpio_num = ETH_POWER_PIN;

    mac_config.smi_mdc_gpio_num = ETH_MDC_PIN;
    mac_config.smi_mdio_gpio_num = ETH_MDIO_PIN;

    mac_config.clock_config.rmii.clock_mode = EMAC_CLK_OUT;
    mac_config.clock_config.rmii.clock_gpio = EMAC_CLK_OUT_180_GPIO;

    esp_eth_mac_t *mac = esp_eth_mac_new_esp32(&amp;mac_config);
    esp_eth_phy_t *phy = esp_eth_phy_new_lan8720(&amp;phy_config);

    esp_eth_config_t config = ETH_DEFAULT_CONFIG(mac, phy);
    ESP_ERROR_CHECK(esp_eth_driver_install(&amp;config, &amp;s_eth_handle));
}</code></pre><p>Et voila, that's all that is needed to start using this cool ESP32 gateway. In my case, I have already added an extra $3 GPS module. </p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2022/01/image-4.png" class="kg-image" alt="ESP32G Smart Ethernet Gateway" srcset="https://eadalabs.com/content/images/size/w600/2022/01/image-4.png 600w, https://eadalabs.com/content/images/size/w1000/2022/01/image-4.png 1000w, https://eadalabs.com/content/images/size/w1600/2022/01/image-4.png 1600w, https://eadalabs.com/content/images/2022/01/image-4.png 1776w" sizes="(min-width: 720px) 720px"></figure><p>In the next post, I'll be looking at enabling the Wi-Fi mesh capability.</p>]]></content:encoded></item><item><title><![CDATA[WT32-ETH0 performance analysis using IPerf]]></title><description><![CDATA[<p>In the previous posts, we have been looking at enabling NuttX on the ESP32-based WT32-ETH0 module. In this post, we'll be assessing the performance of the Ethernet port, and verifying whether the ESP32 can really drive traffic at up to 100Mb/s. </p><h1 id="configuration">Configuration</h1><p>The easiest way to get</p>]]></description><link>https://eadalabs.com/esp32-nuttx-performance-analysis-using-iperf/</link><guid isPermaLink="false">605fd2ff76f397346a7f27c6</guid><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Mon, 29 Mar 2021 01:59:42 GMT</pubDate><media:content url="https://eadalabs.com/content/images/2021/03/1000px-World_Speed_Limits.png" medium="image"/><content:encoded><![CDATA[<img src="https://eadalabs.com/content/images/2021/03/1000px-World_Speed_Limits.png" alt="WT32-ETH0 performance analysis using IPerf"><p>In the previous posts, we have been looking at enabling NuttX on the ESP32-based WT32-ETH0 module. In this post, we'll be assessing the performance of the Ethernet port, and verifying whether the ESP32 can really drive traffic at up to 100Mb/s. 
</p><h1 id="configuration">Configuration</h1><p>The easiest way to get a quick performance assessment of a network device is to use IPerf. Using NuttX, it's simple: one just needs to enable the <code>iperf example</code> under the network utilities within the apps.</p><pre><code>Application Configuration  ---&gt;
    Network Utilities  ---&gt;
        [*] iperf example
        (eth0) Wi-Fi Network device</code></pre><p>Note that since we will be testing the Ethernet port, the "Wi-Fi network device" is changed to <code>eth0</code>. </p><p>Also, since performance is critical for this test, we'll get rid of the network traces (aka <code>ninfo</code>), which could seriously impact the performance:</p><pre><code class="language-bash"> Build Setup  ---&gt;
   Debug Options  ---&gt;
     [] Network Informational Output</code></pre><p>After flashing, the iperf app is available from the NuttShell:</p><pre><code class="language-bash">nsh&gt; ?
Builtin Apps:
  dhcpd        dhcpd_stop   nsh          ping6        sh
  dhcpd_start  iperf        ping         renew        wapi</code></pre><h1 id="initial-testing">Initial Testing</h1><p>So, let's run <em>iperf</em> then - starting with UDP mode. On the server (the OpenWRT router in my case), this command is used:</p><pre><code>iperf -s -p 5471 -i 1 -w 416K -u</code></pre><p>And on the ESP32, the <em>iperf</em> client, this command is used:</p><pre><code class="language-bash">iperf -c 192.168.1.1 -p 5471 -i 1 -u</code></pre><p>And here is the result as seen from the server:</p><pre><code class="language-bash">[  3] local 192.168.1.1 port 5471 connected with 192.168.1.183 port 6184
[ ID] Interval       Transfer     Bandwidth        Jitter   Lost/Total Datagrams
[  3]  0.0- 1.0 sec  3.85 MBytes  32.3 Mbits/sec   0.250 ms    1/ 4035
[  3]  1.0- 2.0 sec  3.84 MBytes  32.2 Mbits/sec   0.248 ms    0/ 4022
[  3]  2.0- 3.0 sec  3.84 MBytes  32.2 Mbits/sec   0.248 ms    0/ 4022
[  3]  3.0- 4.0 sec  3.84 MBytes  32.2 Mbits/sec   0.248 ms    0/ 4023
[  3]  4.0- 5.0 sec  3.84 MBytes  32.2 Mbits/sec   0.247 ms    0/ 4022
[  3]  5.0- 6.0 sec  3.84 MBytes  32.2 Mbits/sec   0.248 ms    0/ 4023
[  3]  6.0- 7.0 sec  3.84 MBytes  32.2 Mbits/sec   0.248 ms    0/ 4023
[  3]  7.0- 8.0 sec  3.84 MBytes  32.2 Mbits/sec   0.249 ms    0/ 4022
[  3]  8.0- 9.0 sec  3.84 MBytes  32.2 Mbits/sec   0.248 ms    0/ 4023
[  3]  9.0-10.0 sec  3.84 MBytes  32.2 Mbits/sec   0.248 ms    0/ 4023
[  3] 10.0-11.0 sec  3.84 MBytes  32.2 Mbits/sec   0.249 ms    0/ 4023
[  3] 11.0-12.0 sec  3.84 MBytes  32.2 Mbits/sec   0.248 ms    0/ 4022
[  3] 12.0-13.0 sec  3.84 MBytes  32.2 Mbits/sec   0.249 ms    0/ 4023
[  3] 13.0-14.0 sec  3.84 MBytes  32.2 Mbits/sec   0.248 ms    0/ 4023
</code></pre><p>What we are expecting is up to 100Mb/s, and we get about a third of it out of the box. Not bad! But let's try to find out if there is any configuration that could be updated to improve the performance - before taking this initial result for granted.</p><h1 id="reverse-engineering-the-performance">Reverse Engineering the performance</h1><h2 id="inside-the-iperf-client">Inside the iPerf client</h2><p>The <a href="https://github.com/apache/incubator-nuttx-apps/blob/f3828ccbca3e45319d804cd5c3f91a4764de68c8/netutils/iperf/iperf.c#L446">code</a> used for the iperf client in UDP mode is quite simple. Here is its pseudo code:</p><pre><code>static int iperf_run_udp_client(void)
{
  sockfd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
  setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &amp;opt, sizeof(opt));

  addr.sin_family = AF_INET;
  addr.sin_port = htons(s_iperf_ctrl.cfg.dport);
  addr.sin_addr.s_addr = s_iperf_ctrl.cfg.dip;

  buffer = s_iperf_ctrl.buffer;
  udp = (struct iperf_udp_pkt_t *)buffer;
  want_send = s_iperf_ctrl.buffer_len;

  while (!s_iperf_ctrl.finish)
    {
      actual_send = sendto(sockfd, buffer, want_send, 0,
                           (struct sockaddr *)&amp;addr, sizeof(addr));

      if (actual_send == want_send)
        {
          s_iperf_ctrl.total_len += actual_send;
        }
      else
        {
         .... handle the error ...
         }
    }
}</code></pre><p>So, just call "sendto" as fast as possible! What we have here is a standard producer-consumer pattern, where the producer is the above code calling the blocking <code>sendto</code> function, and the consumer is the Ethernet device consuming data via its DMA interface. </p><h2 id="inside-the-esp32-hardware">Inside the ESP32 hardware</h2><p>The way the DMA interface to the PHY works is explained in the diagram below (based on the spec sheets for the <a href="https://www.espressif.com/sites/default/files/documentation/esp32_technical_reference_manual_en.pdf">ESP32</a> and <a href="https://ip.cadence.com/uploads/1099/TIP_PB_Xtensa_lx7_FINAL-pdf">LX7</a>). </p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2021/03/image-4.png" class="kg-image" alt="WT32-ETH0 performance analysis using IPerf" srcset="https://eadalabs.com/content/images/size/w600/2021/03/image-4.png 600w, https://eadalabs.com/content/images/size/w1000/2021/03/image-4.png 1000w, https://eadalabs.com/content/images/size/w1600/2021/03/image-4.png 1600w, https://eadalabs.com/content/images/2021/03/image-4.png 1880w" sizes="(min-width: 720px) 720px"></figure><p>Note that in the case of the WT32-ETH01, the PHY is a <a href="http://ww1.microchip.com/downloads/en/DeviceDoc/00002165B.pdf">LAN8720A</a> from Microchip, but it could be any other kind of PHY up to 100Mb/s.</p><p>From the ESP32 spec sheet, in the <a href="https://www.espressif.com/sites/default/files/documentation/esp32_technical_reference_manual_en.pdf">Ethernet DMA Features</a> section:</p><blockquote>The DMA has independent Transmit and Receive engines (...) space. The Transmit engine transfers data from the system memory to the device port (MTL), while the Receive engine transmits data from the device port to the system memory. The controller uses descriptors to efficiently move data from source to destination with minimal Host CPU intervention. 
The DMA is designed for packet-oriented data transmission, such as frames in Ethernet. The controller can be programmed to interrupt the Host CPU for normal situations, such as the completion of frame transmission or reception, or when errors occur</blockquote><p>So, all the ESP32 processor has to do is prepare the frames and insert them into the DMA ring controller. Assuming we want to achieve 100Mb/s with 1250-byte frames, that would require preparing 10K packets per second, i.e. one packet every 100 microseconds. Since we are using UDP, which is quite lightweight in terms of protocol overhead - only having to compute checksums - that should not be a problem.</p><h2 id="inside-the-nuttx-udp-stack">Inside the NuttX UDP stack</h2><p>The call to <code>sendto</code> in the iperf client eventually leads to the following calls:</p><pre><code>sendto-&gt; psock_sendto -&gt; inet_sendto -&gt; psock_udp_sendto </code></pre><p>The <code>psock_udp_sendto</code> comes in two flavors: the unbuffered one and the buffered one. By default, the unbuffered one is used, so let's check this one. Here is its pseudo code:</p><pre><code class="language-C++">ssize_t psock_udp_sendto(FAR struct socket *psock, FAR const void *buf,
 size_t len, int flags, FAR const struct sockaddr *to, socklen_t tolen)
{
  FAR struct udp_conn_s *conn;
  struct sendto_s state;
  int ret;

  /* Get the underlying the UDP connection structure.  */
  conn = (FAR struct udp_conn_s *)psock-&gt;s_conn;

  /* Assure that the IPv4 destination address maps to a valid MAC 
   * address in the ARP table.
   */
  if (psock-&gt;s_domain == PF_INET)
    {
      FAR const struct sockaddr_in *into = 
        (FAR const struct sockaddr_in *)to;
      in_addr_t destipaddr = into-&gt;sin_addr.s_addr;

      /* Make sure that the IP address mapping is in the ARP table */
      ret = arp_send(destipaddr);
    }

  /* Initialize the state structure.  This is done with the network
   * locked because we don't want anything to happen until we are
   * ready. */

  net_lock();
  memset(&amp;state, 0, sizeof(struct sendto_s));

  /* This semaphore is used for signaling and, hence, should not have
   * priority inheritance enabled. */

  nxsem_init(&amp;state.st_sem, 0, 0);
  nxsem_set_protocol(&amp;state.st_sem, SEM_PRIO_NONE);
  state.st_buflen = len;
  state.st_buffer = buf;
  state.st_sock = psock;

  /* Get the device that will handle the remote packet transfers */
  state.st_dev = udp_find_raddr_device(conn);

  /* Set up the callback in the connection */
  state.st_cb = udp_callback_alloc(state.st_dev, conn);
  state.st_cb-&gt;flags   = (UDP_POLL | NETDEV_DOWN);
  state.st_cb-&gt;priv    = (FAR void *)&amp;state;
  state.st_cb-&gt;event   = sendto_eventhandler;

  /* Notify the device driver of the availability of TX data */
  netdev_txnotify_dev(state.st_dev);

  /* Wait for either the receive to complete or for an error/timeout to
   * occur. NOTES:  net_timedwait will also terminate if a signal
   * is received. */
  ret = net_timedwait(&amp;state.st_sem, _SO_TIMEOUT(psock-&gt;s_sndtimeo));

  /* Make sure that no further events are processed */
  udp_callback_free(state.st_dev, conn, state.st_cb);

  /* Release the semaphore */
  nxsem_destroy(&amp;state.st_sem);

  /* Unlock the network and return the result of the sendto() 
   * operation */
  net_unlock();
  return ret;
}</code></pre><p>The call to the next layer is done via <code>netdev_txnotify_dev</code>, which calls the driver's <code>emac_txavail</code> callback. There are two things that could be of concern in this code:</p><p>1- The systematic call to arp_send each time a UDP frame needs to be sent. Fortunately, the ARP code is well optimised with an ARP cache, so the ARP frame will only be sent once. Even with this cache, it is still possible to reduce the overhead of the <code>arp_send</code> call by replacing <code>arp_send(destipaddr)</code> with <code>arp_lookup(destipaddr)?0:arp_send(destipaddr);</code>, but that does not give more than a 3Mb/s improvement - so nothing serious to be considered right now.</p><p>2- The systematic call to <code>net_lock</code> and <code>net_unlock</code> while waiting for the frame to be sent. This means that, provided this function is the bottleneck in the UDP performance, it would not be possible to use concurrent clients to improve the performance, since each client would be sequentially scheduled.</p><h2 id="inside-the-nuttx-network-driver">Inside the NuttX network driver</h2><p>Two functions in <code>psock_udp_sendto</code> trigger the actual frame transmission:</p><pre><code>  netdev_txnotify_dev(state.st_dev);
  net_timedwait(&amp;state.st_sem, _SO_TIMEOUT(psock-&gt;s_sndtimeo));
</code></pre><p>The first one (<code>netdev_txnotify_dev</code>) lets the driver know that some TX work is pending - what happens in practice is that the driver notifies the kernel worker queue (via <code>work_queue</code>) that works has to be done - and this work will be executed on the specific kernel worker task. There are two tasks, a low and high priority one, and the network is using the low priority one.  This <a href="https://github.com/apache/incubator-nuttx/blob/e699b6f85f8c1b023b4bcabaf2cca14567fbea97/arch/xtensa/src/esp32/esp32_emac.c#L161">comment</a> in the NuttX code explains why using the high priority queue for the network task is wrong:</p><blockquote>NOTE:  However, the network should NEVER run on the high priority work queue!  That queue is intended only to service short back end interrupt processing that never suspends.  Suspending the high priority work queue may bring the system to its knees!</blockquote><p>The second function (<code>net_timedwait</code>) waits for the first one to be scheduled and executed. What happens behind the scene is a smart "net_lock" manipulation allowing the TX worker, also waiting on the lock, to be executed "synchronously":</p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2021/03/image-5.png" class="kg-image" alt="WT32-ETH0 performance analysis using IPerf" srcset="https://eadalabs.com/content/images/size/w600/2021/03/image-5.png 600w, https://eadalabs.com/content/images/size/w1000/2021/03/image-5.png 1000w, https://eadalabs.com/content/images/2021/03/image-5.png 1504w" sizes="(min-width: 720px) 720px"></figure><p>What the above diagram does not explain is the complexity of the <code>devif_timer</code> - which actually calls the callback <code>sendto_eventhandler</code> defined in the <code>psock_udp_sendto</code> function, and then calls the <code>emac_txpoll</code> which eventually starts the DMA via <code>emac_transmit</code>. 
</p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2021/03/image-6.png" class="kg-image" alt="WT32-ETH0 performance analysis using IPerf" srcset="https://eadalabs.com/content/images/size/w600/2021/03/image-6.png 600w, https://eadalabs.com/content/images/size/w1000/2021/03/image-6.png 1000w, https://eadalabs.com/content/images/2021/03/image-6.png 1416w" sizes="(min-width: 720px) 720px"></figure><p>There is a small detail in the <code>net_timedwait</code> function worth commenting about: It is that it disables both interrupts (via <code><a href="https://github.com/apache/incubator-nuttx/blob/e699b6f85f8c1b023b4bcabaf2cca14567fbea97/sched/irq/irq_csection.c#L159">enter_critical_section</a></code> ) and the addition of any new task in the to be scheduled list (via <code><a href="https://github.com/apache/incubator-nuttx/blob/e699b6f85f8c1b023b4bcabaf2cca14567fbea97/sched/sched/sched_lock.c#L118">sched_lock</a></code>) . Then, when the semaphore is actually locked (via <code>nxsem_wait</code>), the next likely to be scheduled task in the  kernel worker. This way, the scheduling overhead should be limited.</p><h1 id="conclusions">Conclusions</h1><p>That was a quick deep dive into the NuttX network architecture, but an interesting one to learn how the network "full-stack" is  actually implemented. Yet, there are so many details which were not covered, but there is a real feeling of a solid architecture behind NuttX, and that's something very positive if one would decide to capitalise on NuttX for developing a large application eco-system.</p><p>Back to the performance, we did not quite reach the 100Mb/s. So, that will be the objective for the next post - where we will try to implement one of those <a href="https://trex-tgn.cisco.com/">dumb</a> packet blaster by directly bypassing the network stack and inserting frames directly into the DMA queue. 
We'll also be looking at the options for micro-performance analysis - to measure the actual time spent by functions - for instance using the ESP32 <code>RSR</code> instruction which provides up to 4 nanosecond accuracy, or with the <a href="https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/perfmon.html">performance monitor</a> from IDF.</p><p>Image credits: <a href="https://en.wikipedia.org/wiki/Speed_limits_by_country">https://en.wikipedia.org/wiki/Speed_limits_by_country</a></p>]]></content:encoded></item><item><title><![CDATA[Getting IPv6 working with NuttX from Day One]]></title><description><![CDATA[<p>It's already 2021: IPv6 was created more than 20 years ago, and nowadays more than one third of the traffic on the Internet is using IPv6. Yet, in the previous three posts about enabling NuttX on the ESP32, I made a big mistake by forgetting to enable IPv6! </p><p>So,</p>]]></description><link>https://eadalabs.com/ipv6-from-day-1/</link><guid isPermaLink="false">605813f076f397346a7f2587</guid><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Tue, 23 Mar 2021 02:39:34 GMT</pubDate><media:content url="https://eadalabs.com/content/images/2021/03/urban-railway-system.jpeg" medium="image"/><content:encoded><![CDATA[<img src="https://eadalabs.com/content/images/2021/03/urban-railway-system.jpeg" alt="Getting IPv6 working with NuttX from Day One"><p>It's already 2021: IPv6 was created more than 20 years ago, and nowadays more than one third of the traffic on the Internet is using IPv6. Yet, in the previous three posts about enabling NuttX on the ESP32, I made a big mistake by forgetting to enable IPv6! </p><p>So, in this post, we'll try to get an IPv6 configuration working. 
At the same time, we will try to get rid of the IPv4 NAT which was required for the bridge setup, and which would not be necessary any more when using <a href="https://en.wikipedia.org/wiki/Prefix_delegation">IPv6 prefix delegation</a>.</p><h1 id="configuration">Configuration</h1><p>Enabling IPv6 is as simple as:</p><pre><code>Networking Support  ---&gt;
   Internet Protocol Selection  ---&gt;
      [*] IPv6</code></pre><p>But of course, after all the features enabled in the past three posts, the probability that the compilation works the first time is quite small. And indeed, here is the error:</p><pre><code>xtensa-esp32-elf-ld: nuttxspace/nuttx/staging/libarch.a(esp32_wlan.o): in function `wlan_ifup':
esp32_wlan.c:(.text.wlan_ifup+0x45): undefined reference to `winfo'
make[1]: *** [nuttx] Error 1</code></pre><p>Fortunately, that is just a <a href="https://github.com/apache/incubator-nuttx/blob/e03218ab716c99f335761f8f583eab6f9a181a99/arch/xtensa/src/esp32/esp32_wlan.c#L1188">typo issue</a> with a function incorrectly named <code>winfo</code> when it should be <code>ninfo</code>. So, after patching the file, compilation works fine.</p><h1 id="run-time">Run Time</h1><p>After booting, the networking configuration is still not quite right:</p><pre><code>nsh&gt; ifconfig
eth0    Link encap:Ethernet HWaddr a8:03:2a:a1:2c:c4 at UP
        inet addr:192.168.1.182 DRaddr:192.168.1.1 Mask:255.255.255.0
        inet6 addr: fc00::2/112
        inet6 DRaddr: fc00::1/112</code></pre><h2 id="static-ip-address-configuration">Static IP address configuration</h2><p>Before getting into the advanced IPv6 address configuration using DHCPv6 or SLAAC, let's first try to use a predefined IPv6 address instead. To make sure the configuration tool is working, let's just use the same IPv6 configuration as the default one:</p><pre><code>ifconfig wlan0 fc00::1 inet6 gateway fc00::2</code></pre><p>That works fine. However, when trying to input a longer IPv6 address, the NSH console fails to accept the input... That's yet another configuration which should be changed, to allow longer command lines:</p><pre><code>Application Configuration  ---&gt;
    NSH Library  ---&gt;
        Command Line Configuration  ---&gt;
        	(200) Max command line length</code></pre><p>After this change, it is possible to configure the IPv6 address:</p><pre><code>nsh&gt; ifconfig eth0 2888:8201:2848:abcd:0123:0123:0123:d58e inet6 gateway 2888:8201:2848:abcd::1

nsh&gt; ifconfig
eth0    Link encap:Ethernet HWaddr a8:03:2a:a1:2c:c5 at UP
        inet addr:192.168.1.183 DRaddr:192.168.1.1 Mask:255.255.255.0
        inet6 addr: 2408:8206:2640:a833:4e5:c85d:bf9:d58e/64
        inet6 DRaddr: 2408:8206:2640:a833::1/64
</code></pre><h2 id="ping6-application">Ping6 application</h2><p>The default ping NSH command does not support IPv6. To enable <code>ping6</code>, the following needs to be added:</p><pre><code>Networking Support  ---&gt;
   ICMPv6 Networking Support  ---&gt;
      [*] Enable ICMPv6 networking
Application Configuration  ---&gt;
   System Libraries and NSH Add-Ons  ---&gt;
   	  [*] ICMPv6 'ping6' command (NEW)</code></pre><p>Voila, after fixing another small compilation bug due to an <a href="https://github.com/apache/incubator-nuttx/blob/6a6ad96066fd1cad48af41316ea00de466060804/arch/xtensa/src/esp32/esp32_wlan.c#L1214">unreferenced function</a>, we are now ready to ping <code>openwrt.com</code>, aka 2600:3c02:1::2d4f:f40e:</p><pre><code>nsh&gt; ping6 2600:3c02:1::2d4f:f40e
psock_socket: ERROR: socket address family unsupported: domain:10, type:2, protocol:58
socket: ERROR: psock_socket() failed: -106
ERROR: socket() failed: 106
</code></pre><h2 id="icmpv6-socket-support">ICMPv6 socket support</h2><p>Well... that did not work as expected! It fails on domain (<em>family</em>)= <code>PF_INET6</code>, type = <code>SOCK_DGRAM</code> and protocol = <code>IPPROTO_ICMP6</code>. The missing link is the <code>NET_ICMPv6_SOCKET</code> configuration which needs to be enabled too</p><pre><code>Networking Support  ---&gt;
   ICMPv6 Networking Support  ---&gt;
      [*]   IPPROTO_ICMP6 socket support
      [*]   Solicit destination addresses</code></pre><p>We also enable the "<em>solicit destination addresses</em>" option, which implements the solicitation packets of the ICMPv6 Neighbour Discovery (ND) protocol. That's the same as ARP, but for IPv6. Without it, the stack would not be able to resolve the ethernet address associated with the IPv6 gateway.</p><pre><code>nsh&gt; ping6 2600:3c02:1::2d4f:f40e
sendto_request: Outgoing ICMPv6 packet length: 105 (65)
neighbor_dumpentry: Entry found: 2408:8206:2640:a833:0000:0000:0000:0001
neighbor_ethernet_out: Outgoing IPv6 Packet length: 119 (65)
emac_transmit: d_buf=0x3ffb2cb6 d_len=119
emac_recvframe: RX bytes 119
icmpv6_poll_eventhandler: flags: 0002
icmpv6_datahandler: Buffered 94 bytes
icmpv6_readahead: Received 65 bytes (of 94)
56 bytes from 2600:3c02:1::2d4f:f40e icmp_seq=0 time=410 ms
</code></pre><h2 id="dnsv6-support">DNSv6 support</h2><p>Let's try to ping6 with the domain name instead of the direct IP address:</p><pre><code>nsh&gt; ping6 openwrt.com
udp_send: Outgoing UDP packet length: 57
udp_readahead: Received 57 bytes (of 74)
dns_recv_response: ID 37905
dns_recv_response: Query 128
dns_recv_response: Error 0
dns_recv_response: Num questions 1, answers 1, authrr 0, extrarr 0
dns_recv_response: Question: type=001c, class=0001
dns_parse_name: Compressed answer
dns_recv_response: Answer: type=001c, class=0001, ttl=0034ab, length=0010
dns_recv_response: IPv6 address: 0000:0000:0100:0000:0200:3c00:0000:2600
...
udp_readahead: Received 45 bytes (of 62)
dns_recv_response: ID 46525
dns_recv_response: Query 128
dns_recv_response: Error 0
dns_recv_response: Num questions 1, answers 1, authrr 0, extrarr 0
dns_recv_response: Question: type=0001, class=0001
dns_parse_name: Compressed answer
dns_recv_response: Answer: type=0001, class=0001, ttl=0034cf, length=0004
dns_recv_response: IPv4 address: 72.52.179.175
PING6 2600

sendto_request: Outgoing ICMPv6 packet length: 105 (65)
icmpv6_readahead: Received 65 bytes (of 94)
56 bytes from 2600:3c02:1::2d4f:f40e icmp_seq=0 time=450 ms
</code></pre><p>Good news, all works fine, so no special configuration needed here.</p><h1 id="let-s-slaac-">Let's SLAAC ...</h1><p>When it comes to IPv6, the common approach is to use <a href="https://www.networkacademy.io/ccna/ipv6/stateless-address-autoconfiguration-slaac">SLAAC</a> rather than DHCPv6. The main advantage of SLAAC is its simplicity - it only requires ICMPv6 (which we have already confirmed to work), plus support for two Neighbour Discovery (ND) message types: Router Solicitation and Router Advertisement (RS/RA).</p><p>Looking at the NuttX code, one can find the needed <a href="https://github.com/apache/incubator-nuttx/blob/6ff11d8c7659f908597103d33f6b338b3124557c/net/icmpv6/icmpv6_rsolicit.c#L75">code</a> to handle router solicitations:</p><pre><code>/****************************************************************************
 * Name: icmpv6_rsolicit
 *
 * Description:
 *   Set up to send an ICMPv6 Router Solicitation message.  This version
 *   is for a standalone solicitation.  It formats:
 *
 *   - The IPv6 header
 *   - The ICMPv6 Neighbor Router Message
 *
 *   The device IP address should have been set to the link local address
 *   prior to calling this function.
 *
****************************************************************************/

void icmpv6_rsolicit(FAR struct net_driver_s *dev);</code></pre><p>This code is only compiled in when the <code>NET_ICMPv6_AUTOCONF</code> configuration is activated:</p><pre><code>Networking Support  ---&gt;
   ICMPv6 Networking Support  ---&gt;
       [*]   ICMPv6 auto-configuration</code></pre><p>Compilation works fine, but after flashing the new firmware and bringing the interface up using <code>ifconfig eth0 up</code>, nothing happens - only an IPv4 address is obtained via DHCPv4. </p><p>The missing link is the <a href="https://github.com/apache/incubator-nuttx-apps/blob/3bf2f317169f07bcd6f83022e46fc18b01aec9be/netutils/netlib/netlib_autoconfig.c#L58">netlib_icmpv6_autoconfiguration</a> function, which is responsible for sending the <a href="https://github.com/apache/incubator-nuttx/blob/6ff11d8c7659f908597103d33f6b338b3124557c/net/netdev/netdev_ioctl.c#L870"><code>SIOCIFAUTOCONF</code></a> ioctl to the driver, which will in turn call the <a href="https://github.com/apache/incubator-nuttx/blob/6ff11d8c7659f908597103d33f6b338b3124557c/net/icmpv6/icmpv6_autoconfig.c#L268">icmpv6_autoconfig</a> function, which then calls our <code>icmpv6_rsolicit</code>. The problem is that this <a href="https://github.com/apache/incubator-nuttx-apps/blob/3bf2f317169f07bcd6f83022e46fc18b01aec9be/netutils/netlib/netlib_autoconfig.c#L58">netlib_icmpv6_autoconfiguration</a> function is only called during the netlib <a href="https://github.com/apache/incubator-nuttx-apps/blob/3bf2f317169f07bcd6f83022e46fc18b01aec9be/netutils/netinit/netinit.c#L620">initialisation</a> process (aka <em>netinit</em>), but not during the <a href="https://github.com/apache/incubator-nuttx-apps/blob/3bf2f317169f07bcd6f83022e46fc18b01aec9be/nshlib/nsh_netcmds.c#L568">nsh net commands</a>, especially the <code>ifconfig eth0 up</code> one. A quick fix consists in adding these lines:</p><pre><code class="language-diff">git diff nshlib/nsh_netcmds.c                                    08:49:40
diff --git a/nshlib/nsh_netcmds.c b/nshlib/nsh_netcmds.c
index d4633e10..1e4605c4 100644
--- a/nshlib/nsh_netcmds.c
+++ b/nshlib/nsh_netcmds.c
@@ -918,6 +918,12 @@ int cmd_ifconfig(FAR struct nsh_vtbl_s *vtbl, int argc, char **argv)
 #endif /* CONFIG_NET_IPv4 */
 #endif /* CONFIG_NETINIT_DHCPC || CONFIG_NETINIT_DNS */

+#ifdef CONFIG_NET_ICMPv6_AUTOCONF
+  /* Perform ICMPv6 auto-configuration */
+  netlib_icmpv6_autoconfiguration(ifname);
+#endif
+
+
 #if defined(CONFIG_NETINIT_DHCPC)
   /* Get the MAC address of the NIC */
</code></pre><p>Voila, after this fix, the <code>ifconfig eth0 up</code> will successfully acquire it's global IPv6 via SLAAC:</p><pre><code>nsh&gt; ifconfig eth0 up
icmpv6_autoconfig: Auto-configuring eth0
icmpv6_autoconfig: lladdr=fe80:0000:0000:0000:aa03:2aff:fea1:2cc5
icmpv6_solicit: Outgoing ICMPv6 Neighbor Solicitation length: 72 (32)
icmpv6_solicit: Outgoing ICMPv6 Neighbor Solicitation length: 72 (32)
icmpv6_solicit: Outgoing ICMPv6 Neighbor Solicitation length: 72 (32)
icmpv6_solicit: Outgoing ICMPv6 Neighbor Solicitation length: 72 (32)
icmpv6_solicit: Outgoing ICMPv6 Neighbor Solicitation length: 72 (32)
icmpv6_router_eventhandler: flags: 6000 sent: 0
icmpv6_rsolicit: Outgoing ICMPv6 Router Solicitation length: 56 (16)
icmpv6_rwait: Waiting...
icmpv6_rnotify: Notified
icmpv6_setaddresses: preflen=64 netmask=ffff:ffff:ffff:ffff:0000:0000:0000:0000
icmpv6_setaddresses: prefix=2408:8206:2640:a833:0000:0000:0000:0000
icmpv6_setaddresses: IP address=2408:8206:2640:a833:aa03:2aff:fea1:2cc5
icmpv6_setaddresses: DR address=fe80:0000:0000:0000:ee41:18ff:fe0c:53dc
icmpv6_rwait_cancel: Canceling...
icmpv6_autoconfig: Timed out... retrying 1
icmpv6_router_eventhandler: flags: 6000 sent: 0
icmpv6_rsolicit: Outgoing ICMPv6 Router Solicitation length: 56 (16)
icmpv6_rnotify: Notified
icmpv6_setaddresses: preflen=64 netmask=ffff:ffff:ffff:ffff:0000:0000:0000:0000
icmpv6_setaddresses: prefix=2408:8206:2640:a833:0000:0000:0000:0000
icmpv6_setaddresses: IP address=2408:8206:2640:a833:aa03:2aff:fea1:2cc5
icmpv6_setaddresses: DR address=fe80:0000:0000:0000:ee41:18ff:fe0c:53dc
icmpv6_rwait: Waiting...
icmpv6_rwait_cancel: Canceling...

</code></pre><p>Voila, it's now fully working :-)</p><pre><code>nsh&gt; ifconfig
eth0    Link encap:Ethernet HWaddr a8:03:2a:a1:2c:c5 at UP
        inet addr:192.168.1.183 DRaddr:192.168.1.1 Mask:255.255.255.0
        inet6 addr: 2408:8206:2640:a833:aa03:2aff:fea1:2cc5/64
        inet6 DRaddr: fe80::ee41:18ff:fe0c:53dc/64</code></pre><h1 id="conclusion">Conclusion</h1><p>That was not a straightforward configuration, and quite a few patches were needed for the NuttX source code to compile, but fortunately, IPv6 is now working natively on our ESP32 hardware running NuttX. The natural next step would be to remove the IPv4 configuration and run an IPv6-only network, but we'll keep that for a later post.</p><p>In the next post, we'll be looking at enabling the DHCPv6 server and prefix delegation, as a way to remove the need for the IPv4 NAT in our ETH-WIFI bridge setup. </p><p>There are also now quite a few fixes that should be pushed upstream, so I'll make a PR and update this post when the PR is ready.</p><hr><p>Photo credits: <a href="https://www.flickr.com/photos/anniemole/313981428/in/photostream/">flickr.com/photos/anniemole/313981428/in/photostream/</a></p>]]></content:encoded></item><item><title><![CDATA[Esp32, NuttX, and bridged networking]]></title><description><![CDATA[<p> In this blog post, we will be looking at the NuttX configuration needed for bridged networking from the WIFI interface to the Ethernet interface, using an ESP32 based WT32-ETH01. We'll also be looking at the routing options, and the possibility to route via tunnels. 
</p><p>Note that this is follow-up from</p>]]></description><link>https://eadalabs.com/esp32-nuttx-and-bridged-networking/</link><guid isPermaLink="false">604c201576f397346a7f22e9</guid><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Sun, 14 Mar 2021 23:00:00 GMT</pubDate><media:content url="https://eadalabs.com/content/images/2021/03/bridge-building-competition.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://eadalabs.com/content/images/2021/03/bridge-building-competition.jpg" alt="Esp32, NuttX, and bridged networking"><p> In this blog post, we will be looking at the NuttX configuration needed for bridged networking from the WIFI interface to the Ethernet interface, using an ESP32 based WT32-ETH01. We'll also be looking at the routing options, and the possibility to route via tunnels. </p><p>Note that this is follow-up from the <a href="https://eadalabs.com/nuttx-with-esp32-and-ethernet/">previous post</a> about enabling a dual ETH+WIFI networking stack on the WT32-ETH01.</p><h1 id="bridge-as-a-nuttx-application">Bridge as a NuttX application</h1><p>So, let's start with the bridge example provided in the <a href="https://github.com/apache/incubator-nuttx-apps/tree/master/examples/bridge">NuttX apps</a>. It contains two implementation files, one for the hosts, and one for the bridge. The intention is to drive the traffic in this way:</p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2021/03/image.png" class="kg-image" alt="Esp32, NuttX, and bridged networking" srcset="https://eadalabs.com/content/images/size/w600/2021/03/image.png 600w, https://eadalabs.com/content/images/size/w1000/2021/03/image.png 1000w, https://eadalabs.com/content/images/2021/03/image.png 1308w" sizes="(min-width: 720px) 720px"></figure><p>The bridge works by creating two tasks ( using <code>task_create</code> ). 
Each task gets its IP address by <a href="https://github.com/apache/incubator-nuttx-apps/blob/6f75c1b3d61edb126469ff70554dbf41466bfb4c/examples/bridge/bridge_main.c#L269">invoking</a> the DHCP client (so the bridge could work with a single network interface). It then <a href="https://github.com/apache/incubator-nuttx-apps/blob/6f75c1b3d61edb126469ff70554dbf41466bfb4c/examples/bridge/bridge_main.c#L403">listens</a> on a UDP socket on a given port and, for each packet received, forwards it to the outgoing IP address. Two tasks are needed as each one handles unidirectional traffic. </p><p>Unfortunately, the bridge example is not of any help. For a bridge to work, we should be able to intercept any kind of traffic at IP level, and forward it at IP level. The standard way to do that is to use raw sockets, but let's first have a look at the <a href="https://github.com/adamdunkels/uip">uIP</a> networking stack, in case it already has such a bridge capability.</p><h1 id="bridge-as-a-nuttx-system-service">Bridge as a NuttX System Service</h1><p>Looking at the NuttX net folder, one can notice the <a href="https://github.com/apache/incubator-nuttx/tree/master/net/ipforward">ipforward</a> service, described as the "L2 forwarding service". </p><h2 id="l2-forwarding">L2 forwarding</h2><p>That's exactly what we need - so, let's enable it with this configuration:</p><pre><code class="language-bash">Networking Support  ---&gt;
   Internet Protocol Selection  ---&gt;
       [*] Enable L2 forwarding</code></pre><p>By default, the service has a table of 4 forwarding <a href="https://github.com/apache/incubator-nuttx/blob/9523d4bea40e26152af47716e35b04d60b57176e/net/ipforward/ipfwd_alloc.c#L69">entries</a>, which gets populated dynamically as incoming packets come in. When the stack receives a packet with a destination IP address which does not match the interface IP address, it will call the <code>ipv4_forward</code> function, described as:</p><blockquote>This function is called from ipv4_input when a packet is received that is not destined for us.  In this case, the packet may need to be forwarded to another device (or sent back out the same device) depending configuration, routing table information, and the IPv4 networks served by various network devices.</blockquote><p>The <code>ipv4_forward</code> function will first try to find the forwarding interface, using <code>netdev_findby_ripv4addr</code>:</p><pre><code class="language-C">int ipv4_forward(FAR struct net_driver_s *dev, FAR struct ipv4_hdr_s *ipv4)
{
  in_addr_t destipaddr;
  in_addr_t srcipaddr;
  FAR struct net_driver_s *fwddev;
  int ret;

  /* Search for a device that can forward this packet. */

  destipaddr = net_ip4addr_conv32(ipv4-&gt;destipaddr);
  srcipaddr  = net_ip4addr_conv32(ipv4-&gt;srcipaddr);

  fwddev     = netdev_findby_ripv4addr(srcipaddr, destipaddr);
  if (fwddev == NULL)
    {
      nwarn("WARNING: Not routable\n");
      return (ssize_t)-ENETUNREACH;
    }</code></pre><h2 id="routing-table">Routing Table</h2><p>From the <code>netdev_findby_ripv4addr</code> one can notice that there is a <code>CONFIG_NET_ROUTE</code> configuration used for the routing table. It is not enabled by default, so let's activate it under:</p><pre><code class="language-bash">Networking Support  ---&gt;
	Routing Table Configuration  ---&gt;
    	[*] Routing table support</code></pre><p>It also comes with a default of 4 routes, but note that routes are handled at network prefix level, while forwarding entries are at IP level. So, if a route handles a /16 prefix, you could have up to 64K IPs to forward.</p><p>Unfortunately, the compilation does not work the first time, complaining about missing <code>RTF_XXX</code> definitions.</p><pre><code class="language-C++">src/network.c: In function 'wapi_act_route_gw':
src/network.c:141:17: error: 'RTF_UP' undeclared (first use in this function); did you mean 'IFF_UP'?
   rt.rt_flags = RTF_UP | RTF_GATEWAY;
                 ^~~~~~
                 IFF_UP
src/network.c:141:17: note: each undeclared identifier is reported only once for each function it appears in
src/network.c:141:26: error: 'RTF_GATEWAY' undeclared (first use in this function)
   rt.rt_flags = RTF_UP | RTF_GATEWAY;
                          ^~~~~~~~~~~
src/network.c:144:22: error: 'RTF_HOST' undeclared (first use in this function)
       rt.rt_flags |= RTF_HOST;</code></pre><p>The issue is actually from the WAPI wireless configuration shell that was enabled for the WIFI configuration. The good news is that it tells us that WAPI can be used to inspect and configure the routes. The bad news is that it seems we will have to manually patch the code for the compilation to work.</p><p>The reason for the failure is that WAPI wants to use the <code>SIOCADDRT</code> and <code>SIOCDELRT</code> ioctls to configure the routing entries. So, let's have a look at how the net routing component handles those two ioctls. That's done in the <a href="https://github.com/apache/incubator-nuttx/blob/9523d4bea40e26152af47716e35b04d60b57176e/net/netdev/netdev_ioctl.c#L1373">netdev_ioctl.c</a> file:</p><pre><code class="language-C++">#ifdef CONFIG_NET_ROUTE
static int netdev_rt_ioctl(FAR struct socket *psock, int cmd,  FAR struct rtentry *rtentry)
{
  switch (cmd)
    {
      case SIOCADDRT:  /* Add an entry to the routing table */
          return ioctl_add_ipv4route(rtentry);
          break;
          ...
      case SIOCDELRT:  /* Delete an entry from the routing table */
          return ioctl_del_ipv4route(rtentry);
          break;
}
#endif</code></pre><p>Good news, both <code>ioctl_del_ipv4route</code> and <code>ioctl_add_ipv4route</code> ignore the <code>rt_flags</code> which WAPI was trying to set, so we can safely comment out the line <code>rt.rt_flags = RTF_UP | RTF_GATEWAY;</code> in WAPI <a href="https://github.com/apache/incubator-nuttx-apps/blob/6f75c1b3d61edb126469ff70554dbf41466bfb4c/wireless/wapi/src/network.c#L141">network.c</a>. Those <a href="https://www.unix.com/man-page/freebsd/9/rtentry/">flags</a> are used to indicate if the routing entry is a host or a gateway - and we'll figure out later if this matters to us.</p><h2 id="run-time-route-configuration">Run-time route configuration</h2><p>Now that the compilation is working, let's try to configure the routes via WAPI. Here are the logs which can be seen from the NuttShell:</p><pre><code class="language-bash">NuttShell (NSH) NuttX-10.0.1
nsh&gt; ipv4_forward: WARNING: Packet forwarding to same device not supported (srcIP=192.168.1.203 dstIP=224.0.0.251)
udp_input: WARNING: No listener on UDP port 5353</code></pre><p>The message <em>packet forwarding to same device not supported</em>  tells us that a packet has been received from "192.168.1.203" and aimed for "224.0.0.251", which is a multicast IP for the <a href="https://stackoverflow.com/questions/12483717/what-is-the-multicast-doing-on-224-0-0-251">mDNS</a> service running on port 5353. This protocol is not enabled by default, hence the "<em>no listener</em>" warning log. </p><p>So, let's try to check the routes now with a <code>route</code> command:</p><pre><code class="language-bash">nsh&gt; route
SEQ   TARGET          NETMASK         ROUTER
   1. 0.0.0.0         0.0.0.0         192.168.1.1</code></pre><p>The netmask <code>0.0.0.0</code> corresponds to a /0 prefix, meaning that all IP addresses will match this route - it is the default route. As for the router <code>192.168.1.1</code>, this is the OpenWrt router, and this is exactly the expected IP. So, first, let's try to add a route:</p><pre><code class="language-bash">nsh&gt; addroute 8.8.8.8/32 192.168.1.1
nsh&gt; route
SEQ   TARGET          NETMASK         ROUTER
   1. 0.0.0.0         0.0.0.0         192.168.1.1
   2. 8.8.8.8         255.255.255.255 192.168.1.1</code></pre><p>Then let's try to ping the newly added destination (8.8.8.8). It works fine because the gateway is properly defined.</p><pre><code class="language-bash">nsh&gt; ping 8.8.8.8
PING 8.8.8.8 56 bytes of data
56 bytes from 8.8.8.8: icmp_seq=1 time=190 ms</code></pre><p>Now, let's try to add a route to a non-existing gateway. Ping should fail with a "no route available" - not because it cannot find the route, but because it cannot find a way to reach the gateway (to be precise, because the gateway is on the same subnet as the ESP, the ESP will try to ARP the gateway, and since no one replies, it will fail to send the ICMP packet).</p><pre><code class="language-bash">nsh&gt; addroute 4.4.4.4/32 192.168.1.2
nsh&gt; route
SEQ   TARGET          NETMASK         ROUTER
   1. 0.0.0.0         0.0.0.0         192.168.1.1
   2. 8.8.8.8         255.255.255.255 192.168.1.1
   3. 4.4.4.4         255.255.255.255 192.168.1.2
nsh&gt; ping 4.4.4.4
PING 4.4.4.4 56 bytes of data
arp_send_eventhandler: flags: 2000 sent: 0
arp_send: ERROR: arp_wait failed: -116
...
icmp_sendto: ERROR: Not reachable
</code></pre><h2 id="bridge-configuration">Bridge Configuration</h2><p>Now that routes are working, we just need to slightly change the ESP configuration, from a WIFI Station to a WIFI Access Point, which the Mac will connect to, as described in the diagram below:</p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2021/03/image-1.png" class="kg-image" alt="Esp32, NuttX, and bridged networking" srcset="https://eadalabs.com/content/images/size/w600/2021/03/image-1.png 600w, https://eadalabs.com/content/images/size/w1000/2021/03/image-1.png 1000w, https://eadalabs.com/content/images/2021/03/image-1.png 1440w" sizes="(min-width: 720px) 720px"></figure><p>There are a few configurations which need to be changed, and the first is to add a DHCP server so that the MAC can get an IP address when connecting to the WT32 Access Point (AP):</p><pre><code class="language-bash">Application Configuration  ---&gt; 
    Network Utilities  ---&gt;
        [*] DHCP server</code></pre><p>The second is to turn the WIFI from a Station to an AP. Let's try to do that with WAPI:</p><pre><code class="language-bash">nsh&gt; wapi mode wlan0 WAPI_MODE_MASTER</code></pre><p>Unfortunately, that does not work, because... well, the underlying <code>SIOCSIWMODE</code> is not <a href="https://github.com/apache/incubator-nuttx/blob/59a5d038426de9609d386d2ea49f2ed9700bf69c/arch/xtensa/src/esp32/esp32_wlan.c#L1363">implemented</a> in the current esp32 driver. Never mind, let's try the reverse bridge, where the Mac is connected via Ethernet:</p><figure class="kg-card kg-image-card"><img src="https://eadalabs.com/content/images/2021/03/image-2.png" class="kg-image" alt="Esp32, NuttX, and bridged networking" srcset="https://eadalabs.com/content/images/size/w600/2021/03/image-2.png 600w, https://eadalabs.com/content/images/size/w1000/2021/03/image-2.png 1000w, https://eadalabs.com/content/images/2021/03/image-2.png 1438w" sizes="(min-width: 720px) 720px"></figure><p>For this configuration, there should only be a need for the DHCP server. So, after flashing the new firmware configuration, and connecting the Mac to the ESP via the ethernet cable, I could see the following logs:</p><pre><code>udp_input: WARNING: No listener on UDP port 67</code></pre><p>That just tells us that the DHCP server is not started. Looking at the code, the missing link seems to be that we also need to enable the server from the app config. </p><pre><code class="language-bash">Application Configuration  ---&gt;
  Examples  ---&gt;
    [*] DHCP server example</code></pre><p>Voila, let's reflash and start the app from the NuttShell:</p><pre><code>nsh&gt; dhcpd_start eth0
nsh&gt; ifconfig
eth0    Link encap:Ethernet HWaddr a8:03:2a:a1:2c:c5 at UP
        inet addr:10.0.0.1 DRaddr:10.0.0.1 Mask:255.255.255.0

wlan0   Link encap:Ethernet HWaddr a8:03:2a:a1:2c:c4 at UP
        inet addr:192.168.1.182 DRaddr:192.168.1.1 Mask:255.255.255.0</code></pre><p>Bingo, it works - the <code>eth0</code> interface has its own subnet. A quick check on the Mac confirms that all works fine :-)</p><pre><code class="language-bash">⋊&gt; ~/P/m/n/nuttx on master ⨯ ifconfig en5 flags=8863&lt;UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST&gt; mtu 1500
	options=4&lt;VLAN_MTU&gt;
	inet6 fe80::8ca:dbd1:a602:81d6%en5 prefixlen 64 secured scopeid 0xf
	inet 10.0.0.2 netmask 0xffffff00 broadcast 10.0.0.255
	nd6 options=201&lt;PERFORMNUD,DAD&gt;
	media: autoselect (100baseTX &lt;full-duplex&gt;)
	status: active</code></pre><p>Last step, let's try to ping from the Mac and see if it gets through the ESP - for that to work, we should first ensure that the route is properly set on the ESP:</p><pre><code>nsh&gt; addroute default 192.168.1.1 wlan0
nsh&gt; route
SEQ   TARGET          NETMASK         ROUTER
   1. 0.0.0.0         0.0.0.0         192.168.1.1</code></pre><p>The first attempt to <code>ping 8.8.8.8</code> on the Mac resulted in a request timeout. Dumping the traffic on the OpenWRT router showed that frames were correctly forwarded:</p><pre><code>root@OpenWrt:~# tcpdump ether host A8:03:2A:A1:2C:C4 -nS
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-lan, link-type EN10MB (Ethernet), capture size 262144 bytes
04:03:59.040947 IP 10.0.0.2 &gt; 8.8.8.8: ICMP echo request, id 19945, seq 9, length 64
04:04:00.040736 IP 10.0.0.2 &gt; 8.8.8.8: ICMP echo request, id 19945, seq 10, length 64</code></pre><p>Only problem: no response... At first glance, it looks like a routing issue on the OpenWRT router, which needs a route for the 10.x.x.x subnet via the ESP gateway:</p><pre><code>root@OpenWrt:~# route add -net 10.0.0.0/24 gateway 192.168.1.182</code></pre><p>After that, it finally works:</p><pre><code>PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=108 time=293.731 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=108 time=352.465 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=108 time=334.467 ms</code></pre><p>While, technically, the routing is working, this last step is not a desired one. What should happen is that the ESP should <a href="https://en.wikipedia.org/wiki/Network_address_translation">NAT</a> the IP address from the Mac; that's a standard process for routers. </p><p>I am not sure why NuttX does not NAT the outgoing packet - there does not seem to be any trace of "NAT" in the <code>net</code> source code, and the ipforward service does not seem to handle NAT, only decrementing the TTL in the IP header. Let's keep this parked for now, and we'll come back to this NAT configuration issue in a later blog post.</p><h2 id="tunnelling-traffic">Tunnelling Traffic</h2><p>Now that routes are working, the next logical step is to tunnel part of the traffic. The standard way in Posix is to create TAP/TUN virtual interfaces, and fortunately for us, NuttX supports such interfaces. But let's keep this investigation for the next blog.</p><h1 id="conclusion">Conclusion</h1><p>Obviously, the more we dig into NuttX, the more complex it becomes. But at the same time, since NuttX is a Posix compatible system, we can refer to all the online information about Posix to troubleshoot our configuration. That makes it not only much easier, but also much more interesting as all our configuration learnings can be applied to standard Linux too.</p><p>Next step, we will be checking the Ethernet driver performance, to see if we can really reach 100Mb/s, as well as the IP forwarding performance. 
We'll also be looking at setting up the TAP/TUN interface for the tunnelling.</p><figure class="kg-card kg-gallery-card kg-width-wide kg-card-hascaption"><div class="kg-gallery-container"><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://eadalabs.com/content/images/2021/03/diy-vinci-bridge.png" width="2000" height="943" alt="Esp32, NuttX, and bridged networking" srcset="https://eadalabs.com/content/images/size/w600/2021/03/diy-vinci-bridge.png 600w, https://eadalabs.com/content/images/size/w1000/2021/03/diy-vinci-bridge.png 1000w, https://eadalabs.com/content/images/size/w1600/2021/03/diy-vinci-bridge.png 1600w, https://eadalabs.com/content/images/2021/03/diy-vinci-bridge.png 2002w" sizes="(min-width: 1200px) 1200px"></div></div></div><figcaption>a Da Vinci self-supporting bridge - It's a bit like NuttX and ESP: Robust but unstable for beginners</figcaption></figure><p></p><p></p><p> </p>]]></content:encoded></item><item><title><![CDATA[Esp32, NuttX, and Ethernet on a WT32-ETH01]]></title><description><![CDATA[<p>This is a follow-up from the previous post about <a href="https://eadalabs.com/nuttx-esp32-macos/">getting started with ESP32 and NuttX</a>. 
This time, we'll be looking at trying to enable the Ethernet module on the LAN8720-based <a href="http://www.wireless-tag.com/portfolio/wt32-eth01/">WT32-ETH01</a>, and use NuttX micro-IP networking stack as a dual interface stack.</p><p>So, let's get started with the ethernet configuration</p>]]></description><link>https://eadalabs.com/nuttx-with-esp32-and-ethernet/</link><guid isPermaLink="false">6043464276f397346a7f21f4</guid><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Sat, 06 Mar 2021 10:08:02 GMT</pubDate><media:content url="https://eadalabs.com/content/images/2021/03/ethernet-switch.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://eadalabs.com/content/images/2021/03/ethernet-switch.jpg" alt="Esp32, NuttX, and Ethernet on a WT32-ETH01"><p>This is a follow-up from the previous post about <a href="https://eadalabs.com/nuttx-esp32-macos/">getting started with ESP32 and NuttX</a>. This time, we'll be looking at trying to enable the Ethernet module on the LAN8720-based <a href="http://www.wireless-tag.com/portfolio/wt32-eth01/">WT32-ETH01</a>, and use NuttX micro-IP networking stack as a dual interface stack.</p><p>So, let's get started with the ethernet configuration first.</p><h1 id="configuring-the-ethernet-peripheral">Configuring the Ethernet peripheral</h1><p>Looking at the NuttX <a href="https://github.com/apache/incubator-nuttx/blob/master/drivers/net/Kconfig">configuration</a> for network drivers, one can notice that there is a flag for the 8720. So let's try to enable it from the menu config:</p><pre><code>Device Drivers  ---&gt;
	Network Device/PHY Support  ----
    		*** External Ethernet MAC Device Support ***</code></pre><p>Unfortunately, there is nothing under <code>External Ethernet MAC Device Support</code>, and that's where one should enable the LAN8720. Looking a bit further into the config, I noticed that for external Ethernet to be supported, the <code>ARCH_HAVE_PHY</code> <em>config</em> should be enabled. </p><p>However, that's not the case in any of the ESP32 default configs, and there is no trace of an <code>ARCH_HAVE_PHY</code> config in the <a href="https://github.com/apache/incubator-nuttx/blob/master/arch/xtensa/src/esp32/Kconfig">ESP KConfig</a>. But looking at the specific ESP32 config, I noticed that ETH could be enabled by selecting <code>Ethernet MAC</code> under the peripheral selection.</p><pre><code>System Type  ---&gt;
    ESP32 Peripheral Selection  ---&gt;
        Ethernet MAC
    Ethernet configuration  ---&gt;
        (9) RX description number
        (8) TX description number
        (23) MDC Pin
        (18) MDIO Pin
        (5) Reset PHY Pin --&gt;&gt;&gt; Change to 16
        (1) PHY address     
</code></pre><p>The default configuration, when compared to the indications <a href="https://github.com/arendst/Tasmota/issues/9496#issuecomment-750812635">here</a>, is correct, at least for MDC and MDIO. </p><pre><code>mdc_pin: GPIO23
mdio_pin: GPIO18
clk_mode: GPIO0_IN
phy_addr: 1
power_pin: GPIO16</code></pre><p>However, the <em><a href="https://github.com/apache/incubator-nuttx/blob/ed0a1b724b5f152f8a1ffb83af992f52e3baabe3/arch/xtensa/src/esp32/esp32_emac.c#L515">reset phy pin</a></em>, which is referred to as a <em><a href="https://github.com/espressif/arduino-esp32/blob/f7fc8ab37714efc1e00ef640a3e4f51ca647fac5/libraries/WiFi/src/ETH.cpp#L46">power pin</a></em> in esp-idf, is wrong: it should be 16, not the default 5. </p><p>As for the <em>clock mode</em>, it is handled by this <a href="https://github.com/espressif/esp-idf/blob/0bfff0b25a9bf9495edf29dc455a114e60104434/components/ethernet/emac_main.c#L1117">code</a> in the esp-idf: </p><pre><code class="language-c">    if (emac_config.clock_mode != ETH_CLOCK_GPIO0_IN) {
#if CONFIG_SPIRAM_SUPPORT
        // make sure Ethernet won't have conflict with PSRAM
        if (emac_config.clock_mode &gt;= ETH_CLOCK_GPIO16_OUT) {
            if (esp_spiram_is_initialized()) {
                ESP_LOGE(TAG, "GPIO16 and GPIO17 are occupied by PSRAM, please switch to ETH_CLOCK_GPIO_IN or ETH_CLOCK_GPIO_OUT mode");
                ret = ESP_FAIL;
                goto _verify_err;
            } else {
                ESP_LOGW(TAG, "Using GPIO16/17 to output Ethernet RMII clock, make sure you don't have PSRAM on board");
            }
        }
#endif
        // 50 MHz = 40MHz * (6 + 4) / (2 * (2 + 2)) = 400MHz / 8
        rtc_clk_apll_enable(1, 0, 0, 6, 2);
        REG_SET_FIELD(EMAC_EX_CLKOUT_CONF_REG, EMAC_EX_CLK_OUT_H_DIV_NUM, 0);
        REG_SET_FIELD(EMAC_EX_CLKOUT_CONF_REG, EMAC_EX_CLK_OUT_DIV_NUM, 0);

        if (emac_config.clock_mode == ETH_CLOCK_GPIO0_OUT) {
            PIN_FUNC_SELECT(PERIPHS_IO_MUX_GPIO0_U, FUNC_GPIO0_CLK_OUT1);
            REG_WRITE(PIN_CTRL, 6);
            ESP_LOGD(TAG, "EMAC 50MHz clock output on GPIO0");
        } else if (emac_config.clock_mode == ETH_CLOCK_GPIO16_OUT) {
            PIN_FUNC_SELECT(PERIPHS_IO_MUX_GPIO16_U, FUNC_GPIO16_EMAC_CLK_OUT);
            ESP_LOGD(TAG, "EMAC 50MHz clock output on GPIO16");
        } else if (emac_config.clock_mode == ETH_CLOCK_GPIO17_OUT) {
            PIN_FUNC_SELECT(PERIPHS_IO_MUX_GPIO17_U, FUNC_GPIO17_EMAC_CLK_OUT_180);
            ESP_LOGD(TAG, "EMAC 50MHz inverted clock output on GPIO17");
        }
    }</code></pre><p>There is no such thing yet in the NuttX version, but fortunately for us, the NuttX SW assumes that the <a href="https://github.com/apache/incubator-nuttx/blob/ed0a1b724b5f152f8a1ffb83af992f52e3baabe3/arch/xtensa/src/esp32/esp32_emac.c#L137">clock mode is GPIO0</a>.</p><p>So, voila, the configuration looks good - with the exception of the power pin - and we are ready to compile the SW.</p><h1 id="compilation">Compilation</h1><p>Well, after all the struggles met during the first phase when enabling WIFI, expecting the compilation to work without any error the first time would have been utopian. And indeed, here is what the compiler said:</p><pre><code class="language-C"> CC:  chip/esp32_emac.c
chip/esp32_emac.c: In function 'phy_enable_interrupt':
chip/esp32_emac.c:1243:38: error: 'MII_INT_REG' undeclared (first use in this function); did you mean 'MISC_REG'?
   ret = emac_read_phy(EMAC_PHY_ADDR, MII_INT_REG, &amp;regval);
                                      ^~~~~~~~~~~
                                      MISC_REG
chip/esp32_emac.c:1243:38: note: each undeclared identifier is reported only once for each function it appears in
chip/esp32_emac.c:1243:52: error: 'regval' undeclared (first use in this function); did you mean 'sigval'?
   ret = emac_read_phy(EMAC_PHY_ADDR, MII_INT_REG, &amp;regval);
                                                    ^~~~~~
                                                    sigval
chip/esp32_emac.c:1249:39: error: 'MII_INT_CLREN' undeclared (first use in this function)
                            (regval &amp; ~MII_INT_CLREN) | MII_INT_SETEN);
                                       ^~~~~~~~~~~~~
chip/esp32_emac.c:1249:56: error: 'MII_INT_SETEN' undeclared (first use in this function); did you mean 'MII_MSR_ESTATEN'?
                            (regval &amp; ~MII_INT_CLREN) | MII_INT_SETEN);
                                                        ^~~~~~~~~~~~~
                                                        MII_MSR_ESTATEN
chip/esp32_emac.c:1240:12: warning: unused variable 'phyval' [-Wunused-variable]
   uint16_t phyval;
            ^~~~~~
chip/esp32_emac.c: In function 'emac_ioctl':
chip/esp32_emac.c:186:24: warning: initialization of 'struct esp32_emacmac_s *' from incompatible pointer type 'struct esp32_emac_s *' [-Wincompatible-pointer-types]
 #define NET2PRIV(_dev) ((struct esp32_emac_s *)(_dev)-&gt;d_private)
                        ^
chip/esp32_emac.c:2098:38: note: in expansion of macro 'NET2PRIV'
   FAR struct esp32_emacmac_s *priv = NET2PRIV(dev);
                                      ^~~~~~~~
chip/esp32_emac.c:2111:17: warning: implicit declaration of function 'phy_notify_subscribe' [-Wimplicit-function-declaration]
           ret = phy_notify_subscribe(dev-&gt;d_ifname, req-&gt;pid, &amp;req-&gt;event);
                 ^~~~~~~~~~~~~~~~~~~~
chip/esp32_emac.c:2116:21: error: too many arguments to function 'phy_enable_interrupt'
               ret = phy_enable_interrupt(priv);
                     ^~~~~~~~~~~~~~~~~~~~
chip/esp32_emac.c:1238:12: note: declared here
 static int phy_enable_interrupt(void)
            ^~~~~~~~~~~~~~~~~~~~
make[1]: *** [esp32_emac.o] Error 1
make: *** [arch/xtensa/src/libarch.a] Error 2</code></pre><p>Something is obviously wrong in the ESP code here. For example, from the <a href="https://github.com/apache/incubator-nuttx/blob/ed0a1b724b5f152f8a1ffb83af992f52e3baabe3/arch/xtensa/src/esp32/esp32_emac.c#L1238">esp32_emac.c</a>, one can see that there is a conflict in the signature of the <code>phy_enable_interrupt</code> function, which is declared without any argument.</p><pre><code class="language-C">#if defined(CONFIG_NETDEV_PHY_IOCTL) &amp;&amp; defined(CONFIG_ARCH_PHY_INTERRUPT)
static int phy_enable_interrupt(void)
{
  ...
}
#endif</code></pre><p>But later called, in the same file, with an extra <em>priv</em> argument.</p><pre><code class="language-C">static int emac_ioctl(struct net_driver_s *dev, int cmd, unsigned long arg)
{
    ...
#ifdef CONFIG_NETDEV_PHY_IOCTL
#ifdef CONFIG_ARCH_PHY_INTERRUPT
	...
    ret = phy_enable_interrupt(priv);
    ...
#endif</code></pre><p>Most likely, the ESP team has not yet tried to verify the Ethernet driver with the <code>CONFIG_NETDEV_PHY_IOCTL</code> configuration. </p><p>So, after disabling the IO control under <code>Networking support-&gt;Network Device Operations-&gt;Enable PHY ioctl</code>, bingo! The compilation works :-)</p><h1 id="starting-the-ethernet-driver">Starting the Ethernet driver</h1><pre><code>make download ESPTOOL_PORT=/dev/cu.SLAB_USBtoUART ESPTOOL_BAUD=115200 ESPTOOL_BINDIR={$NUTTX_SPACE}/esp-bins</code></pre><p>After a successful flashing, of course, there is no trace of the Ethernet driver initialisation... Looking deeper into the code, the issue is a conflict on the <code>NETDEV_LATEINIT</code> config - which <a href="https://github.com/apache/incubator-nuttx/blob/da2f9f1357ed451b7cf85a1f1a0a34f2834468bf/arch/xtensa/src/esp32/esp32_emac.c#L2264">must</a> be disabled for Ethernet:</p><pre><code class="language-C">#if !defined(CONFIG_NETDEV_LATEINIT)
void up_netinitialize(void)
{
  esp32_emac_init();
}
#endif</code></pre><p> So, let's disable it under <code>Networking Support -&gt;  Link layer support -&gt; Late driver initialization</code> (I wonder why I enabled it in the first place). After that, the interface is finally visible:</p><pre><code class="language-shell">nsh&gt; ifconfig
eth0    Link encap:Ethernet HWaddr b4:e6:2d:95:b1:05 at DOWN
        inet addr:0.0.0.0 DRaddr:0.0.0.0 Mask:0.0.0.0

wlan0   Link encap:Ethernet HWaddr b4:e6:2d:95:b1:05 at UP
        inet addr:192.168.1.218 DRaddr:192.168.1.1 Mask:255.255.255.0</code></pre><p>Both interfaces share the same Ethernet address. This will definitely confuse the router, to which both WIFI and ETH are connected, but let's try anyway. Next step is to get the interface up using the <code>ifup eth0</code> command.</p><pre><code>nsh&gt; ifup eth0
netdev_ifr_ioctl: cmd: 1818
emac_init_phy: PHY register 0x0 is: 0x3100
emac_init_phy: PHY register 0x2 is: 0x0007
emac_init_phy: PHY register 0x3 is: 0xc0f1
emac_wait_linkup: PHY register 0x1 is: 0x782d
ifup eth0...OK</code></pre><p>Good news, it works, and frames can be received on the Ethernet port. But patience, the IP address is still not resolved. </p><p>One of the reasons DHCP does not work for ETH0 is the conflict on the MAC addresses. The way the Ethernet driver gets its address is this code:</p><pre><code class="language-C">static int emac_read_mac(uint8_t *mac)
{
  uint32_t regval[2];
  uint8_t *data = (uint8_t *)regval;
  uint8_t crc;
  int i;

  /* The MAC address in register is from high byte to low byte */

  regval[0] = getreg32(MAC_ADDR0_REG);
  regval[1] = getreg32(MAC_ADDR1_REG);

  crc = data[6];
  for (i = 0; i &lt; 6; i++)
    {
      mac[i] = data[5 - i];
    }

  if (crc != esp_crc8(mac, 6))
    {
      nerr("ERROR: Failed to check MAC address CRC\n");

      return -EINVAL;
    }

  return 0;
}</code></pre><p>There is nothing wrong with this function, except that the WIFI adapter uses the same "copy-pasted" function. There's definitely room for a cleaner network architecture here. But anyway, let's hack a <code>mac[5]+=1</code> just after the <em>CRC</em> check to ensure the MAC is unique.</p><pre><code>nsh&gt; ifconfig
eth0    Link encap:Ethernet HWaddr a8:03:2a:62:5a:89 at DOWN
        inet addr:0.0.0.0 DRaddr:0.0.0.0 Mask:0.0.0.0

wlan0   Link encap:Ethernet HWaddr a8:03:2a:62:5a:88 at UP
        inet addr:192.168.1.111 DRaddr:192.168.1.1 Mask:255.255.255.0
        
nsh&gt; ifup eth0
netdev_ifr_ioctl: cmd: 1818
emac_init_gpio: emac_init_gpio
emac_config: emac_config: got EMAC_SR_E
emac_read_mac: emac_read_mac -&gt; a8:03:2a:62:5a:89
emac_config: emac_config: macaddr=a8:03:2a:62:5a:8a
emac_init_phy: PHY register 0x0 is: 0x3100
emac_init_phy: PHY register 0x2 is: 0x0007
emac_init_phy: PHY register 0x3 is: 0xc0f1
emac_wait_linkup: emac_wait_linkup:
emac_wait_linkup: PHY register 0x1 is: 0x782d
emac_init_dma: emac_init_dma
emac_start: emac_start
ifup eth0...OK

nsh&gt; ifconfig eth0 up
cmd_ifconfig: Host IP: up
...
dhcpc_request: Received ACK
dhcpc_request: Got IP address 192.168.1.112
dhcpc_request: Got netmask 255.255.255.0
dhcpc_request: Got DNS server 192.168.1.1
dhcpc_request: Got default router 192.168.1.1
dhcpc_request: Lease expires in 43200 seconds
...

nsh&gt; ifconfig
eth0    Link encap:Ethernet HWaddr a8:03:2a:62:5a:89 at UP
        inet addr:192.168.1.112 DRaddr:192.168.1.1 Mask:255.255.255.0

wlan0   Link encap:Ethernet HWaddr a8:03:2a:62:5a:88 at UP
        inet addr:192.168.1.111 DRaddr:192.168.1.1 Mask:255.255.255.0</code></pre><p> Et voila, it looks like it's working :-) </p><h1 id="verifying-the-dual-network-stack">Verifying the dual network stack </h1><p>But well, as usual, something went wrong. Pinging the <code>wlan0</code> worked, but not the <code>eth0</code> (no response). Let's have a look at the logs when we ping from the Mac:</p><pre><code>PING 192.168.1.112 (192.168.1.112): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2</code></pre><p>The corresponding log shows a </p><pre><code>wlan_rxpoll: ARP frame</code></pre><p>Good news: the ARP frame is received. Bad news: it is received by the WIFI driver and not the ETH driver. Why would this happen? Because the router thinks the IP <code>192.168.1.112</code> is located on the WIFI interface. For this to happen, the DHCP request must have gone out over WIFI. So, let's look at how the DHCP client actually works:</p><pre><code>nsh&gt; ifconfig eth0 up
dhcpc_open: MAC: a8:03:2a:62:5a:89
dhcpc_request: Broadcast DISCOVER
udp_callback: flags: 0010
sendto_eventhandler: flags: 0010
udp_send: UDP payload: 256 (0) bytes
udp_send: Outgoing UDP packet length: 284
udp_callback: flags: 0010
wlan_rxpoll: IPv4 frame
udp_callback: flags: 0002
udp_eventhandler: flags: 0002
udp_recvfrom_newdata: Received 300 bytes (of 300)
udp_eventhandler: UDP done
dhcpc_request: Received OFFER from c0a80101
...</code></pre><p>The only hint is the <code>wlan_rxpoll: IPv4 frame</code> - that just tells us that the frame is received over WIFI. Since we have a dual-interface system, the Posix solution is to bind the socket used by the DHCP client to the device. So, let's have a look at the <a href="https://github.com/apache/incubator-nuttx-apps/blob/d4259acc156bc193b0581e3ce332d223537954bb/netutils/dhcpc/dhcpc.c#L474">NuttX DHCP Client</a>:</p><pre><code>FAR void *dhcpc_open(FAR const char *interface, FAR const void *macaddr,
                     int maclen)
{
      ...

      /* Create a UDP socket */
      pdhcpc-&gt;sockfd = socket(PF_INET, SOCK_DGRAM, 0);

       ...

#ifdef CONFIG_NET_UDP_BINDTODEVICE
      /* Bind socket to interface, because UDP packets have to be sent 
       * to the broadcast address at a moment when it is not possible 
       * to decide the target network device using the local or 
       * remote address (which is, by definition and purpose of 
       * DHCP, undefined yet).
       */

      ret = setsockopt(pdhcpc-&gt;sockfd, IPPROTO_UDP, UDP_BINDTODEVICE,
                       pdhcpc-&gt;interface, strlen(pdhcpc-&gt;interface));
       ...
#endif

</code></pre><p>Of course, the <code>NET_UDP_BINDTODEVICE</code> config is not enabled by default. So let's try again after enabling it under <code>Networking Support &gt; UDP Networking &gt; UDP Bind-to-device support</code>.</p><p>Bingo, it's working :-)</p><pre><code>nsh&gt; ifconfig eth0 up
dhcpc_open: MAC: a8:03:2a:62:5a:89
dhcpc_request: Broadcast DISCOVER
emac_txavail_work: ifup: 1
udp_callback: flags: 0010
sendto_eventhandler: flags: 0010
udp_send: UDP payload: 256 (0) bytes
udp_send: Outgoing UDP packet length: 284
emac_transmit: d_buf=0x3ffb6dec d_len=298
udp_callback: flags: 0010
emac_recvframe: RX bytes 342
emac_rx_interrupt_work: IPv4 frame
udp_callback: flags: 0002
net_dataevent: No receive on connection
udp_datahandler: Buffered 300 bytes
udp_readahead: Received 300 bytes (of 317)
dhcpc_request: Received OFFER from c0a80101</code></pre><p>Ping is also working :-)</p><pre><code class="language-bash">⋊&gt; ~/P/m/n/nuttx on master ⨯ ping 192.168.1.112
PING 192.168.1.112 (192.168.1.112): 56 data bytes
64 bytes from 192.168.1.112: icmp_seq=0 ttl=64 time=5.438 ms
64 bytes from 192.168.1.112: icmp_seq=1 ttl=64 time=5.306 ms
64 bytes from 192.168.1.112: icmp_seq=2 ttl=64 time=5.176 ms
64 bytes from 192.168.1.112: icmp_seq=3 ttl=64 time=4.650 ms
64 bytes from 192.168.1.112: icmp_seq=4 ttl=64 time=5.147 ms
64 bytes from 192.168.1.112: icmp_seq=5 ttl=64 time=5.436 ms</code></pre><p>And NuttX log also shows a correct <em>icmp input</em> message after the <em>emac_rx interrupt</em> message.</p><pre><code> nsh&gt; emac_recvframe: RX bytes 98
emac_rx_interrupt_work: IPv4 frame
icmp_input: Outgoing ICMP packet length: 84 (84)
emac_transmit: d_buf=0x3ffb3e5c d_len=98
</code></pre><h1 id="conclusion">Conclusion</h1><p>What I really like with NuttX is this Posix feeling - when one has a dual stack, the solution is to bind to the device, and for NuttX, that's the exact same solution - a no-brainer which brings a lot of advantages when it comes to portability. </p><p>The downside is that the ESP32 port is still far from stable, but let's hope for the best and wait for Espressif to implement a nice network driver architecture in the near future.</p><p>Next step will be a deep dive into the NuttX dual-stack micro-IP architecture, and how to enable the bridging mode. That will be the focus of the next blog.</p><figure class="kg-card kg-gallery-card kg-width-wide kg-card-hascaption"><div class="kg-gallery-container"><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://eadalabs.com/content/images/2021/03/nuttx-on-wt32-eth0.jpg" width="961" height="605" alt="Esp32, NuttX, and Ethernet on a WT32-ETH01" srcset="https://eadalabs.com/content/images/size/w600/2021/03/nuttx-on-wt32-eth0.jpg 600w, https://eadalabs.com/content/images/2021/03/nuttx-on-wt32-eth0.jpg 961w" sizes="(min-width: 720px) 720px"></div></div></div><figcaption>The WT32-ETH01 used for this experiment</figcaption></figure>]]></content:encoded></item><item><title><![CDATA[Getting started with NuttX and Esp32 on MacOS]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>In this post, we will be looking at getting an ESP32 working with NuttX and WIFI networking enabled, using a Mac as the development platform. The ESP hardware is a LOLIN32, but any dev-kit C ESP32 hardware will work fine.</p>
<p>The inspiration for this post is based on Sara Monteiro's</p>]]></description><link>https://eadalabs.com/nuttx-esp32-macos/</link><guid isPermaLink="false">603dc687098a6941c452d838</guid><category><![CDATA[esp32]]></category><category><![CDATA[nuttx]]></category><category><![CDATA[iot]]></category><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Sun, 28 Feb 2021 02:54:46 GMT</pubDate><media:content url="https://eadalabs.com/content/images/2021/02/nuttx-esp32-3.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://eadalabs.com/content/images/2021/02/nuttx-esp32-3.png" alt="Getting started with NuttX and Esp32 on MacOS"><p>In this post, we will be looking at getting an ESP32 working with NuttX and WIFI networking enabled, using a Mac as the development platform. The ESP hardware is a LOLIN32, but any dev-kit C ESP32 hardware will work fine.</p>
<p>The inspiration for this post is based on Sara Monteiro's <a href="https://medium.com/the-esp-journal/getting-started-with-esp32-and-nuttx-fd3e1a3d182c">nuttx+esp32 getting started article</a>, but adapted for MacOS and extended to support WIFI networking configuration.</p>
<h1 id="environmentsetup">Environment Setup</h1>
<p>For the sake of simplicity, I will be using an environment variable to point to the folder used for this project. I am also using <a href="https://fishshell.com/">fish</a> as the default shell.</p>
<pre><code class="language-bash">export NUTTX_SPACE=(realpath ~/projects/iot/nuttxspace/)
</code></pre>
<h2 id="downloadnuttx">Download NuttX</h2>
<p>First step is to checkout the source NuttX code:</p>
<pre><code class="language-bash">cd $NUTTX_SPACE
git clone https://bitbucket.org/nuttx/tools.git
git clone https://github.com/apache/incubator-nuttx.git nuttx
git clone https://github.com/apache/incubator-nuttx-apps.git apps
</code></pre>
<p>Build the <a href="https://www.kernel.org/doc/html/latest/kbuild/index.html">kconfig</a> configuration tool:</p>
<pre><code class="language-bash">cd $NUTTX_SPACE/tools/kconfig-frontends
./configure --enable-mconf
make
make install
</code></pre>
<h2 id="bootloader">Bootloader</h2>
<pre><code class="language-bash">mkdir {$NUTTX_SPACE}/esp-bins
curl -L &quot;https://github.com/espressif/esp-nuttx-bootloader/releases/download/latest/bootloader-esp32.bin&quot; -o $NUTTX_SPACE/esp-bins/bootloader-esp32.bin
curl -L &quot;https://github.com/espressif/esp-nuttx-bootloader/releases/download/latest/partition-table-esp32.bin&quot; -o $NUTTX_SPACE/esp-bins/partition-table-esp32.bin
</code></pre>
<p>Alternatively, building your own version of the boot-loader can be done quite easily, provided you have docker installed.</p>
<pre><code class="language-bash">git clone https://github.com/espressif/esp-nuttx-bootloader.git {$NUTTX_SPACE}/esp-bootloader
docker run --rm -v {$NUTTX_SPACE}/esp-bootloader:/work -w /work espressif/idf:release-v4.3 ./build.sh
</code></pre>
<p>If all works fine, you should be able to see the built files in the <code>out</code> folder:</p>
<pre><code class="language-bash">ls -la {$NUTTX_SPACE}/out/
drwxr-xr-x   8 ron  staff    256 28 Feb 10:19 ./
drwxr-xr-x  14 ron  staff    448 28 Feb 10:18 ../
-rw-r--r--   1 ron  staff  23824 28 Feb 10:17 bootloader-esp32.bin
-rw-r--r--   1 ron  staff  18528 28 Feb 10:19 bootloader-esp32c3.bin
-rw-r--r--   1 ron  staff   3072 28 Feb 10:17 partition-table-esp32.bin
-rw-r--r--   1 ron  staff   3072 28 Feb 10:19 partition-table-esp32c3.bin
-rw-r--r--   1 ron  staff  36748 28 Feb 10:17 sdkconfig-esp32
-rw-r--r--   1 ron  staff  33935 28 Feb 10:19 sdkconfig-esp32c3
</code></pre>
<h2 id="espidf">ESP-IDF</h2>
<p>Assuming that you have already installed the ESP-IDF, you should be able to load its environment:</p>
<pre><code class="language-bash">export IDF_PATH=(realpath ~/projects/iot/esp-idf/)
. $IDF_PATH/export.fish
</code></pre>
<hr>
<h1 id="buildingtheapp">Building the app</h1>
<h2 id="appconfiguration">App Configuration</h2>
<p>Generate the kernel/app configuration for the ESP32 platform.</p>
<pre><code class="language-bash">cd {$NUTTX_SPACE}/nuttx
./tools/configure.sh esp32-devkitc:nsh
</code></pre>
<p>This will create the file <code>.config</code> which contains all the necessary flags for ESP32. For example, this is the generated config for the devkit-C:</p>
<pre><code class="language-bash">CONFIG_ARCH_XTENSA=y
CONFIG_ARCH=&quot;xtensa&quot;
CONFIG_ARCH_CHIP=&quot;esp32&quot;
CONFIG_ARCH_BOARD=&quot;esp32-devkitc&quot;
CONFIG_ARCH_CHIP_ESP32=y
CONFIG_ARCH_FAMILY_LX6=y
CONFIG_XTENSA_CP_INITSET=0x0001
CONFIG_XTENSA_DUMPBT_ON_ASSERT=y
CONFIG_XTENSA_BTDEPTH=50

#
# ESP32 Configuration Options
#
CONFIG_ARCH_CHIP_ESP32WROVER=y
CONFIG_ESP32_DUAL_CPU=y
CONFIG_ESP32_FLASH_4M=y
CONFIG_ESP32_PSRAM_8M=y
CONFIG_ESP32_ESP32DXWDXX=y
CONFIG_ESP32_DEFAULT_CPU_FREQ_240=y
CONFIG_ESP32_DEFAULT_CPU_FREQ_MHZ=240
</code></pre>
<h2 id="build">Build</h2>
<pre><code class="language-bash">make
</code></pre>
<p>If all goes fine, you should be able to see this at the end of the compilation</p>
<pre><code class="language-bash">AR (create): libboard.a   esp32_boot.o esp32_bringup.o esp32_appinit.o
LD: nuttx
CP: nuttx.hex
CP: nuttx.bin
MKIMAGE: ESP32 binary
esptool.py --chip esp32 elf2image --flash_mode dio --flash_size &quot;4MB&quot; -o nuttx.bin nuttx
esptool.py v3.1-dev
Generated: nuttx.bin (ESP32 compatible)
</code></pre>
<h2 id="flash">Flash</h2>
<p>I am using a Wemos LOLIN32 1.0</p>
<pre><code class="language-bash">make download ESPTOOL_PORT=/dev/cu.SLAB_USBtoUART ESPTOOL_BAUD=115200 ESPTOOL_BINDIR={$NUTTX_SPACE}/esp-bins
</code></pre>
<pre><code class="language-bash">Connecting........_
Chip is ESP32-D0WDQ6 (revision 1)
Features: WiFi, BT, Dual Core, 240MHz, VRef calibration in efuse, Coding Scheme None
Crystal is 40MHz
MAC: b4:e6:2d:95:b1:05
Uploading stub...
Running stub...
Stub running...
Configuring flash size...
Compressed 23824 bytes to 14851...
Wrote 23824 bytes (14851 compressed) at 0x00001000 in 1.6 seconds (effective 118.2 kbit/s)...
Hash of data verified.
Compressed 3072 bytes to 69...
Wrote 3072 bytes (69 compressed) at 0x00008000 in 0.1 seconds (effective 454.4 kbit/s)...
Hash of data verified.
Compressed 124896 bytes to 52034...
Wrote 124896 bytes (52034 compressed) at 0x00010000 in 4.8 seconds (effective 207.1 kbit/s)...
Hash of data verified.

Leaving...
Hard resetting via RTS pin...
</code></pre>
<h2 id="connectingtonuttxshell">Connecting to NuttX shell</h2>
<pre><code class="language-bash">screen /dev/cu.SLAB_USBtoUART 115200
</code></pre>
<p>Just write <code>help</code> and you should see this:</p>
<pre><code class="language-bash">help usage:  help [-v] [&lt;cmd&gt;]

  .         cd        echo      hexdump   mh        rm        time      xd
  [         cp        exec      kill      mount     rmdir     true
  ?         cmp       exit      ls        mv        set       uname
  basename  dirname   false     mb        mw        sleep     umount
  break     dd        free      mkdir     ps        source    unset
  cat       df        help      mkrd      pwd       test      usleep

Builtin Apps:
  nsh  sh
nsh&gt;
</code></pre>
<p>Voila, that's it for the basic config. Next step is to enable the WIFI connectivity.</p>
<hr>
<h1 id="enablingwifinetworking">Enabling WIFI Networking</h1>
<p>NuttX uses the <a href="https://en.wikipedia.org/wiki/UIP_(micro_IP)">uIP</a> networking stack, unlike ESP-IDF, which uses <a href="https://en.wikipedia.org/wiki/LwIP">lwIP</a>.</p>
<h2 id="basicnetworkconfig">Basic Network config</h2>
<p>Enabling WIFI can be done by configuring the nuttx app:</p>
<pre><code class="language-bash">make -C {$NUTTX_SPACE}/nuttx distclean
{$NUTTX_SPACE}/nuttx/tools/configure.sh esp32-devkitc:nsh
make -C {$NUTTX_SPACE}/nuttx menuconfig
</code></pre>
<p>Go to <code>Networking Support</code> and enable it, as well as</p>
<pre><code>    * Networking Support: yes
        * Link layer support
            * Late driver initialization: yes
    * System Type
        * ESP32 Peripheral Selection
            * Wireless: yes
    * RTOS Features
        * Work queue support
            * Generic work notifier
            * High priority (kernel) worker thread
        * Pthread Options
            *  Enable mutex types
    * Device Drivers
        * Wireless Device Support
            * IEEE 802.11 Device Support
</code></pre>
<p>To make it easier to debug, I also enabled the traces from:</p>
<pre><code>    * Build Setup
        * Debug Options
            * Enable Error Output
            * Enable Debug Features
            * Network Debug Features
            * Wireless Debug Features
</code></pre>
<p>After flashing, the following logs can be seen:</p>
<pre><code class="language-bash">I (266) boot: Disabling RNG early entropy source...
esp32_rng_initialize: Initializing RNG
esp32_net_initialize: B4:E6:2D:95:B1:05
netdev_register: Registered MAC: b4:e6:2d:95:b1:05 as dev: wlan0
I (16) wifi:wifi driver task: 5, prio:253, stack:3584, core=0
I (18) wifi:wifi firmware version: 3cc2254
I (18) wifi:wifi certification version: v7.0
I (18) wifi:config NVS flash: disabled
I (22) wifi:config nano formating: disabled
I (26) wifi:Init data frame dynamic rx buffer num: 32
I (31) wifi:Init management frame dynamic rx buffer num: 32
I (36) wifi:Init management short buffer num: 32
I (40) wifi:Init dynamic tx buffer num: 32
I (44) wifi:Init static rx buffer size: 1600
I (48) wifi:Init static rx buffer num: 10
I (52) wifi:Init dynamic rx buffer num: 32
netdev_ifr_ioctl: cmd: 1794
tcp_callback: flags: 0040
netdev_ifr_ioctl: cmd: 1796
tcp_callback: flags: 0040
netdev_ifr_ioctl: cmd: 1800
tcp_callback: flags: 0040
netdev_ifr_ioctl: cmd: 1818
wlan_ifup: Bringing up: 10.0.0.2
tcp_callback: flags: 0040

NuttShell (NSH) NuttX-10.0.1
nsh&gt; ifconfig
wlan0   Link encap:Ethernet HWaddr b4:e6:2d:95:b1:05 at UP
        inet addr:10.0.0.2 DRaddr:10.0.0.1 Mask:255.255.255.0
</code></pre>
<h2 id="accessingwapi">Accessing WAPI</h2>
<p><a href="https://github.com/vy/wapi">WAPI</a> is a lightweight wrapper for the iwconfig, wlanconfig, ifconfig, and route commands, and that's the command we need to use to scan the available access points.</p>
<p>The first attempt resulted in a failure due to the missing <code>ioctl</code> support.</p>
<pre><code class="language-bash">nsh&gt; wapi scan wlan0
netdev_ifr_ioctl: cmd: 35608
ioctl(SIOCSIWSCAN): 25
ERROR: Process command (scan) failed.
</code></pre>
<p>After enabling the wireless ioctl under:</p>
<pre><code>    * Networking support
        * Network Device Operations
            * Enable Wireless ioctl()
</code></pre>
<p>Unfortunately, at the time of writing (Feb. 2021), one can see from <a href="https://github.com/apache/incubator-nuttx/blob/59a5d038426de9609d386d2ea49f2ed9700bf69c/arch/xtensa/src/esp32/esp32_wlan.c#L1306">esp32_wlan.c</a> that scanning is not currently supported. So, the only possibility is to connect manually:</p>
<pre><code class="language-bash">wapi psk wlan0 access_point_password 0
wapi essid wlan0 access_point_ssid 0
</code></pre>
<p>From my access point running OpenWRT, I could see that the ESP32 was connected and had been allocated an IP address. However, from the NSH CLI, the IP address remained unchanged:</p>
<pre><code class="language-bash">nsh&gt; ifconfig
wlan0   Link encap:Ethernet HWaddr b4:e6:2d:95:b1:05 at UP
        inet addr:10.0.0.2 DRaddr:10.0.0.1 Mask:255.255.255.0
</code></pre>
<p>Also, pinging the device from my Mac laptop at the IP address mentioned on the OpenWRT router would result in frames being dropped (or unanswered), since uIP did not get the correct IP configuration.</p>
<pre><code class="language-bash">nsh&gt; wlan_rxpoll: ARP frame
arp_arpin: ARP request for IP da01a8c0
</code></pre>
<h2 id="enablingdhcpsupport">Enabling DHCP support</h2>
<p>To enable the DHCP client:</p>
<pre><code>    * Networking Support
        * UDP Networking
            * UDP Networking
            * UDP broadcast Rx support
        * Socket Support
            * Socket options
    * Library Routines
        * NETDB Support
            * DNS Name resolution
    * Application
        * System Libraries and NSH Add-Ons
            * DHCP Address Renewal (NEW)
        * Wireless Libraries and NSH Add-Ons
            * IEEE 802.11 Configuration Library
                * IEEE 802.11 Command Line Tool
</code></pre>
<p>After that, and referring to <a href="https://esp32.com/viewtopic.php?f=2&amp;t=19481">this post on esp32.com</a>, it is possible to enable DHCP for the network initialization using:</p>
<pre><code>    * Application
        * Network Utilities
            * DHCP client
            * Network initialization
                * IP Address Configuration
                    * Use DHCP to get IP address
                    * Router IPv4 address: 0xc0a80101
</code></pre>
<p>It is quite odd to have to configure the router IPv4 address, and furthermore to have to do it in hexadecimal (<em>0xc0a80101</em> in my case), but well, that's the only way to get things working.</p>
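<p>For the record, the hexadecimal value is just the four octets of the router address packed big-endian into a 32-bit word. A quick Go sketch to double-check the value (the helper name is mine, not part of NuttX):</p>
<pre><code class="language-go">package main

import "fmt"

// ip4ToHex packs a dotted-quad IPv4 address into the 32-bit value
// that the NuttX menuconfig expects (e.g. 192.168.1.1 -> 0xc0a80101).
func ip4ToHex(a, b, c, d uint8) uint32 {
	return uint32(a)&lt;&lt;24 | uint32(b)&lt;&lt;16 | uint32(c)&lt;&lt;8 | uint32(d)
}

func main() {
	fmt.Printf("0x%08x\n", ip4ToHex(192, 168, 1, 1)) // prints 0xc0a80101
}
</code></pre>
<p>For 192.168.1.1: 192 = 0xc0, 168 = 0xa8, 1 = 0x01, hence <em>0xc0a80101</em>.</p>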
<p>Unfortunately, that was not enough. After trying to set the SSID and passkey from WAPI, it always ended up with errors:</p>
<pre><code class="language-bash">NuttShell (NSH) NuttX-10.0.1
nsh&gt; wapi psk wlan0 my-ap-password 1
netdev_ifr_ioctl: cmd: 35636
nsh&gt; wapi essid wlan0 my-ap-ssid 1
netdev_ifr_ioctl: cmd: 35610
phy_version: 4500, 0cd6843, Sep 17 2020, 15:37:07, 0, 2
wifi_set_intr: cpu_no=0, intr_source=0, intr_num=0, intr_prio=1I (6220) wifi:mode : sta (b4:e6:2d:95:b1:05)
I (6221) wifi:enable tsf
esp_event_post: Event: base=WIFI_EVENT id=2 data=0 data_size=0 ticks=4294967295
I (6231) wifi:Set ps type: 0

I (6597) wifi:new:&lt;3,0&gt;, old:&lt;1,0&gt;, ap:&lt;255,255&gt;, sta:&lt;3,0&gt;, prof:1
I (7249) wifi:state: init -&gt; auth (b0)
I (7256) wifi:state: auth -&gt; assoc (0)
I (7262) wifi:state: assoc -&gt; run (10)
I (7276) wifi:connected with my-ap-ssid, aid = 2, channel 3, BW20, bssid = ee:41:18:0c:53:dd
I (7277) wifi:security: WPA2-PSK, phy: bgn, rssi: -29
I (7278) wifi:pm start, type: 0

esp_event_post: Event: base=WIFI_EVENT id=4 data=0x3ffe5bb0 data_size=44 ticks=4294967295
nsh&gt; I (7301) wifi:AP's beacon interval = 102400 us, DTIM period = 2
</code></pre>
<p>The only way to overcome this issue was to set the SSID and passkey directly in the menu config, under <code>Application -&gt; Network Utilities -&gt; Network initialization -&gt; WAPI Configuration (SSID / Passphrase)</code>. Fortunately, after that, it was possible to get the correct IP address:</p>
<pre><code class="language-bash">nsh&gt; ifconfig
wlan0   Link encap:Ethernet HWaddr b4:e6:2d:95:b1:05 at UP
        inet addr:192.168.1.218 DRaddr:192.168.1.1 Mask:255.255.255.0
</code></pre>
<h2 id="pingingtheinternet">Pinging the Internet</h2>
<p>Unfortunately, even with the correct IP configuration, ping would still not work, failing with <code>socket address family unsupported</code>.</p>
<pre><code class="language-bash">nsh&gt; ping baidu.com
dns_recv_response: ID 36690
dns_recv_response: Query 128
dns_recv_response: Error 0
dns_recv_response: Num questions 1, answers 2, authrr 0, extrarr 0
dns_recv_response: Question: type=0001, class=0001
dns_parse_name: Compressed answer
dns_recv_response: Answer: type=0001, class=0001, ttl=0000bf, length=0004
dns_recv_response: IPv4 address: 220.181.38.148
dns_parse_name: Compressed answer
dns_recv_response: Answer: type=0001, class=0001, ttl=0000bf, length=0004
dns_recv_response: IPv4 address: 39.156.69.79
psock_socket: ERROR: socket address family unsupported: 2
socket: ERROR: psock_socket() failed: -106
</code></pre>
<p>The missing link was to enable <code>IPPROTO_ICMP socket support</code> under <code>Networking Support -&gt; ICMP Networking Support</code>.</p>
<pre><code class="language-bash">NuttShell (NSH) NuttX-10.0.1
nsh&gt;
nsh&gt; ping baidu.com
sendto_eventhandler: Send ICMP request
sendto_request: Outgoing ICMP packet length: 84 (84)
icmp_poll_eventhandler: flags: 0002
icmp_datahandler: Buffered 81 bytes
icmp_readahead: Received 64 bytes (of 81)
56 bytes from 39.156.69.79: icmp_seq=0 time=40 ms
</code></pre>
<p>Voila, finally, it's working :-)</p>
<p>Also, for the record, I could see some random failures during the ping: <code>up_assert: Assertion failed at file:mm_heap/mm_free.c line: 170 task: ping</code></p>
<hr>
<h1 id="conclusions">Conclusions</h1>
<p>NuttX is definitely a promising solution, especially considering the ecosystem that is forming around it. However, at this stage the ESP32 support is quite limited. The good news is that Espressif seems to be proactively adding support for their chips, so let's hope that within a few weeks Wi-Fi - and other drivers - will be completely supported.</p>
<p><a href="https://eadalabs.com/nuttx-with-esp32-and-ethernet/">Next step</a> is to try to enable Ethernet on the <a href="http://www.wireless-tag.com/wp-content/uploads/2020/08/WT32-ETH01%E8%A7%84%E6%A0%BC%E4%B9%A6V1.2EN%EF%BC%88%E8%8B%B1%E6%96%87%EF%BC%89.pdf">WT32-ETH01</a>.</p>
<p><img src="https://eadalabs.com/content/images/2021/03/w32-eth01.png" alt="Getting started with NuttX and Esp32 on MacOS"></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Benchmarking Mysql, Postgres, MongoDB performance in NoSQL mode  using Golang]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>When it comes to storing significantly large volumes of time-series data, with several thousands of rows added every second, one of the first questions that comes to one's mind is whether traditional databases such as MySQL, Postgres or SQLite can handle such data volumes. Or whether it is needed to use</p>]]></description><link>https://eadalabs.com/benchmarking-mysql-postgres-and-mongodb-performance-in-nosql-mode/</link><guid isPermaLink="false">603dc687098a6941c452d837</guid><category><![CDATA[mysql]]></category><category><![CDATA[postgress]]></category><category><![CDATA[nosql]]></category><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Thu, 08 Mar 2018 09:00:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1465447142348-e9952c393450?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;s=30c2837564064f98a3a3b79633cfcaed" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://images.unsplash.com/photo-1465447142348-e9952c393450?ixlib=rb-0.3.5&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=1080&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ&s=30c2837564064f98a3a3b79633cfcaed" alt="Benchmarking Mysql, Postgres, MongoDB performance in NoSQL mode  using Golang"><p>When it comes to storing significantly large volumes of time-series data, with several thousands of rows added every second, one of the first questions that comes to one's mind is whether traditional databases such as MySQL, Postgres or SQLite can handle such data volumes. Or whether it is needed to use dedicated platforms and architectures such as Hadoop, MongoDB, ClickHouse or InfluxDB.</p>
<p>The code associated with this post is available on GitHub: <a href="https://github.com/atsdb/db-perf-test">https://github.com/atsdb/db-perf-test</a></p>
<h1 id="introduction">Introduction</h1>
<h2 id="benchmarksrelevantmeasurements">Benchmarks-relevant measurements</h2>
<p>There are several aspects to answering this question: first, the performance of inserting new data, but also the performance of reading back from the database to find specific data. If the assumption is that the incoming volume of data is on the order of several thousand inserts per second, then the natural question is whether traditional databases like MySQL or Postgres can do it. And the next question is: if they can handle 1K rows per second, what is the maximum speed at which they can ingest data? 10K, 50K, 100K?</p>
<p>The second aspect is related to the size of the data stored by the database engine. For an application that collects just 1K rows per second, where one row is defined using 20 bytes (5 times 4-byte integers), how long would it take to fill a 200GB disk? Assuming that the database stores the data without any extra overhead, and that about 180GB of the disk is usable, it would take <code>180*(1024^3) / ((1000*20)*3600*24)</code> = 110 days. But in practice, the overhead from the database storage engine can be quite fat, or even very fat... so that the disk could fill up in one month or even less.</p>
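<p>The arithmetic above can be sketched in Go, with the storage-engine overhead as a parameter (a factor of 1.0 means no overhead; the disk-usage table later in this post suggests roughly 37/20 ≈ 1.85 for InnoDB on the slim table). The function name and parameterization are mine, for illustration:</p>
<pre><code class="language-go">package main

import "fmt"

// daysToFill estimates how long it takes rowsPerSec inserts of
// bytesPerRow bytes each to fill diskGiB gibibytes of storage, given
// a storage-engine overhead factor (1.0 = the raw row size).
func daysToFill(diskGiB, rowsPerSec, bytesPerRow, overhead float64) float64 {
	bytesPerDay := rowsPerSec * bytesPerRow * overhead * 3600 * 24
	return diskGiB * 1024 * 1024 * 1024 / bytesPerDay
}

func main() {
	fmt.Printf("%.0f days\n", daysToFill(180, 1000, 20, 1.0))     // no overhead: ~112 days
	fmt.Printf("%.0f days\n", daysToFill(180, 1000, 20, 37.0/20)) // with InnoDB slim-table overhead
}
</code></pre>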
<p>The third aspect, less important but still critical for assessing the various databases, is the performance of the database server in terms of CPU and memory usage. For instance, for a database able to ingest 10K rows/sec, how much CPU would it use? 10%, 50% or 100%? In the rest of this article, we will be looking at the CPU usage, CPU load and system memory usage.</p>
<h2 id="performingthetestusingnosqlschema">Performing the test using NoSQL schema.</h2>
<p>There can be a lot of buzz around what NoSQL means, so what it refers to in this article is the usage of database tables without any kind of join, where one table is used to store as many rows as possible (100 million or more).</p>
<p>There are 3 kinds of table schema used for the test:</p>
<ul>
<li><strong>slim table</strong>: contains 2 columns of type integer (4 bytes), without any index.</li>
<li><strong>light table with key and index</strong>: This table contains two more columns compared to the first one: one used as an auto-increment primary key, and the other as a non-primary index. This is useful to test the index performance of each database engine.</li>
<li><strong>large table with key</strong>: It contains 10 columns: the first is the primary key, the second is a variable-length string of up to 200 characters, and the last 8 are integers. This last table is useful to assess the performance of a fat table.</li>
</ul>
<p>Example for the first slim table without index:</p>
<pre><code>CREATE TABLE `test-slim-table` (
  timestamp int(11) DEFAULT NULL,
  value int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
</code></pre>
<p>Example for the second light table with index:</p>
<pre><code>CREATE TABLE `test-light-table-index` (
  idx int(10) unsigned NOT NULL AUTO_INCREMENT,
  timestamp int(11) DEFAULT NULL,
  val1 smallint(6) DEFAULT NULL,
  val2 smallint(6) DEFAULT NULL,
  PRIMARY KEY (idx),
  KEY val1 (val1)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
</code></pre>
<h1 id="testingstrategyandscenarios">Testing strategy and scenarios</h1>
<p>In order to maximize the speed at which database can ingest rows, three different strategies are defined:</p>
<h2 id="naveinserts">Naïve inserts</h2>
<p>The naive insert works by first calling <code>prepare</code> on an insert query, and then calling <code>exec</code> on the prepared statement as fast as possible, using random data for each column. The insert query looks like:</p>
<pre><code>insert into test-light-table-index (index,timestamp,val1,val2) (0,4384932083409,87,39804)
</code></pre>
<p>Using Go, the naive insert is implemented using these functions:</p>
<pre><code class="language-go">func (db *DB) Prepare(query string) (*Stmt, error)
func (db *DB) Exec(query string, args ...interface{}) (Result, error)
</code></pre>
<h2 id="transactionbasedinserts">Transaction based inserts</h2>
<p>The drawback of the first, naive method is that the database has to keep the ACID promise for each inserted row, thus <em>typically</em> having to flush the data to disk and recompute the indexes on each <code>exec</code>. A better way is to use a transaction, inserting the rows into the transaction and issuing a <code>commit</code> every second. The advantage is that the ACID promise only needs to be kept at commit time, thus reducing the disk-flushing and index-recomputation overhead.</p>
<p>Using Go, the transaction can be handled easily using these two functions:</p>
<pre><code class="language-go">func (db *DB) Begin() (*Tx, error)
func (tx *Tx) Commit() error
</code></pre>
<p>Note that the prepared statement needs to be <em>re-prepared</em> after each new transaction is created.</p>
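<p>A minimal sketch of this strategy, assuming a registered <code>database/sql</code> driver (e.g. go-sql-driver/mysql) and the light table from above; the function names are mine, not the benchmark's actual code:</p>
<pre><code class="language-go">package main

import (
	"database/sql"
	"fmt"
	"math/rand"
	"time"
)

// insertSQL builds the statement text for the light table; the table
// name is whichever table (or shard) is being benchmarked.
func insertSQL(table string) string {
	return fmt.Sprintf("insert into `%s` (timestamp,val1,val2) values (?,?,?)", table)
}

// txInsertLoop inserts random rows as fast as possible, committing once
// per second until the deadline, so the ACID promise is only paid at
// each Commit instead of on every Exec.
func txInsertLoop(db *sql.DB, table string, duration time.Duration) (rows int64, err error) {
	deadline := time.Now().Add(duration)
	for time.Now().Before(deadline) {
		tx, err := db.Begin()
		if err != nil {
			return rows, err
		}
		// the statement must be re-prepared on every new transaction
		stmt, err := tx.Prepare(insertSQL(table))
		if err != nil {
			return rows, err
		}
		commitAt := time.Now().Add(time.Second)
		for time.Now().Before(commitAt) {
			if _, err := stmt.Exec(time.Now().Unix(), rand.Intn(32768), rand.Intn(32768)); err != nil {
				return rows, err
			}
			rows++
		}
		stmt.Close()
		if err := tx.Commit(); err != nil {
			return rows, err
		}
	}
	return rows, nil
}

func main() {
	fmt.Println(insertSQL("test-light-table-index"))
}
</code></pre>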
<h2 id="shardsinserts">Shards inserts</h2>
<p>The last strategy consists of <em>splitting</em> the incoming data into several schema-identical tables. The way the data is split is not critical for this article, assuming that the art of finding a well-balanced hash algorithm is business specific. What matters most is whether the usage of table shards can improve the performance and, if yes, what the optimal number of tables is.</p>
<p>For this scenario, the <em>transaction based insert</em> is used for each table, and the test consists of running several transaction based inserts in parallel. Only the overall insert performance is being looked at, i.e. only the sum of all tables' inserts/second will be investigated.</p>
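<p>The sharding idea can be sketched as follows: hash each key to pick one of N schema-identical tables, and run one worker (one transaction-based inserter in the real test) per shard. The hash choice (FNV-1a), names, and channel-based fan-out are my own illustration, not the benchmark's actual code:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shardFor picks which of n schema-identical tables a key goes to;
// any reasonably well-balanced hash works.
func shardFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

// insertSharded fans keys out to one worker per shard and returns the
// per-shard row counts (a real worker would insert into table shard i).
func insertSharded(keys []string, n int) []int {
	buckets := make([]chan string, n)
	counts := make([]int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		buckets[i] = make(chan string, len(keys))
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			for range buckets[i] { // stand-in for the per-table insert loop
				counts[i]++
			}
		}(i)
	}
	for _, k := range keys {
		buckets[shardFor(k, n)] <- k
	}
	for i := 0; i < n; i++ {
		close(buckets[i])
	}
	wg.Wait()
	return counts
}

func main() {
	keys := make([]string, 10000)
	for i := range keys {
		keys[i] = fmt.Sprintf("sensor-%d", i)
	}
	fmt.Println(insertSharded(keys, 4)) // roughly balanced across the 4 shards
}
</code></pre>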
<h1 id="preliminaryresult">Preliminary result</h1>
<p>Tests are run for 10 minutes on an Azure A8 instance (28GB RAM). All the numbers below are expressed in rows/second, and represent the overall rps (i.e. the total number of inserted rows divided by the test duration).</p>
<p>Database configuration is off-the-shelf, except for the following MySQL / InnoDB tunings:</p>
<ul>
<li><code>innodb_buffer_pool_size</code>: 11GB</li>
<li><code>innodb_file_per_table</code>: 1</li>
</ul>
<h2 id="insertspersecondperformance">Inserts per Second performance</h2>
<p>Note that MySQL/MyISAM <em>transaction insert</em> result is the same as the <em>naive insert</em> result since MyISAM does not support transactions.</p>
<table style="border:3px solid black;width:100%">
<thead><tr><th>Database Engine</th><th>naive 
insert</th><th>transaction insert</th><th>shard insert</th></tr></thead>
<tr><td colspan="4" style="text-align:center;background-color:#eef;">Table Schema: Slim
</td></tr><tr><td>MySQL myISAM</td><td>18,104</td><td>18,104</td><td>58,897
</td></tr><tr><td>MySQL InnoDB</td><td>731</td><td>9,107</td><td>53,399
</td></tr><tr><td>Postgres    </td><td>2,120</td><td>10,147</td><td>63,481
</td></tr><tr><td colspan="4" style="text-align:center;background-color:#eef;">Table Schema: Light with index
</td></tr><tr><td>MySQL myISAM    </td><td>12,317      </td><td>12,317</td><td>26,026
</td></tr><tr><td>MySQL InnoDB    </td><td>660</td><td>6,815   </td><td>23,415
</td></tr><tr><td>Postgres    </td><td>1,607   </td><td>3,845   </td><td>30,022
</td></tr><tr><td colspan="4" style="text-align:center;background-color:#eef;">Table Schema: Large
</td></tr><tr><td>MySQL myISAM    </td><td>14,377      </td><td>14,377</td><td>24,786
</td></tr><tr><td>MySQL InnoDB    </td><td>674</td><td>6,060   </td><td>17,350
</td></tr><tr><td>Postgres    </td><td>1,745</td><td>10,147</td><td>24,019
</td></tr></table> 
<p>Data for MongoDB, as well as ClickHouse, SQLite and InfluxDB, will be added in a later post.</p>
<h2 id="diskusageperformance">Disk usage performance</h2>
<p>Numbers below are expressed in average size in bytes per row (including data and index).</p>
<table style="border:3px solid black;width:100%">
<thead><tr><th>Database Engine</th><th>Slim</th><th>Light w. Index</th><th>Large</th></tr></thead>
<tr><td>MyISAM</td><td>9</td><td>37</td><td>244
</td></tr><tr><td>InnoDB</td><td>37</td><td>62</td><td>328
</td></tr><tr><td>Postgres</td><td>36</td><td>44</td><td>318
</td></tr></table>
<p>MyISAM is outstanding compared to the others: it stores the data in an optimal way, and yet performs better in terms of writes/second.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Gridded Population of the World]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>The <a href="https://sedac.ciesin.columbia.edu/data/collection/gpw-v4">Gridded Population of the World</a>, also known as GPW, can be used as a way to estimate the amount of anthropogenic pollution.</p>
<p>The following animation is applied to Northern India, and based on GPW version 4 combined with real-time wind forecast data from the <a href="https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs">Global Forecast System</a>.</p>]]></description><link>https://eadalabs.com/gridded-population-of-the-world/</link><guid isPermaLink="false">603dc687098a6941c452d836</guid><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Sun, 13 Mar 2016 13:46:14 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1479837524808-8bfbd8b0ce8d?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;s=bc8c66e4529e6cdeb157c7e4644deb86" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://images.unsplash.com/photo-1479837524808-8bfbd8b0ce8d?ixlib=rb-0.3.5&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=1080&fit=max&s=bc8c66e4529e6cdeb157c7e4644deb86" alt="Gridded Population of the World"><p>The <a href="https://sedac.ciesin.columbia.edu/data/collection/gpw-v4">Gridded Population of the World</a>, also known as GPW, can be used as a way to estimate the amount of anthropogenic pollution.</p>
<p>The following animation is applied to Northern India, and based on GPW version 4 combined with real-time wind forecast data from the <a href="https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs">Global Forecast System</a>.</p>
<script>
function resizeFrame() {}
document.write('<iframe id="aqi-forecast" src="https://aqicn.org/aqicn/view/faq/snippets/using-gpw-in-northern-india/?_='+(new Date().getTime())+'" scrolling="no" style="width:100%;height:800px;border:none;" onLoad="resizeFrame();"></iframe>');
</script>
<p>The full explanation is available from <a href="https://aqicn.org/faq/2016-02-28/air-quality-forecasting-in-northern-india/">aqicn.org/faq/2016-02-28/air-quality-forecasting-in-northern-india/</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[A visual study of air pollution Forecasting]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>We have been writing quite a few times about the influence of wind on air pollution, and how strong winds (or, to be more precise, strong ventilation) can help to clean the air in a very short time. But we never had the opportunity to create a dynamic visualization</p>]]></description><link>https://eadalabs.com/a-visual-study-of-air-pollution-forecasting/</link><guid isPermaLink="false">603dc687098a6941c452d835</guid><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Fri, 20 Nov 2015 11:43:02 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1495401220594-550313c3046b?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;s=0fa76484e942f3c6dee95e86f80a16cd" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://images.unsplash.com/photo-1495401220594-550313c3046b?ixlib=rb-0.3.5&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=1080&fit=max&s=0fa76484e942f3c6dee95e86f80a16cd" alt="A visual study of air pollution Forecasting"><p>We have been writing quite a few times about the influence of wind on air pollution, and how strong winds (or, to be more precise, strong ventilation) can help to clean the air in a very short time. But we never had the opportunity to create a dynamic visualization of this phenomenon, so this is what this article is about.</p>
<script>
document.write('<iframe id="aqi-forecast" src="//aqicn.org/aqicn/view/faq/snippets/air-quality-forecasting/?_='+(new Date().getTime())+'" scrolling="no" style="width:100%;height:800px;border:none;" ></iframe>');
</script>
<p>The full explanation is available from this link: <a href="http://aqicn.org/faq/2015-11-05/a-visual-study-of-wind-impact-on-pm25-concentration/">http://aqicn.org/faq/2015-11-05/a-visual-study-of-wind-impact-on-pm25-concentration/</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Real-time volcano ash forecast]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>We have recently been working with the Quito Ambiente team on forecasting in real time the volcanic ash from Quito's 3 closest volcanoes.</p>
<p>The full project explanation is available <a href="http://aqicn.org/faq/2015-06-08/wind-and-air-pollution-forecast-in-quito/">from this link</a>. The current result is:</p>
<iframe id="quito-ash-forecast" src="https://aqicn.org/aqicn/view/faq/snippets/quito-wind-and-eruption-forecast/" scrolling="no" width="100%" height="800" style="border:1px solid black;"></iframe>
<!--kg-card-end: markdown-->]]></description><link>https://eadalabs.com/real-time-volcano-ash-forecast/</link><guid isPermaLink="false">603dc687098a6941c452d834</guid><dc:creator><![CDATA[Ron J]]></dc:creator><pubDate>Thu, 18 Jun 2015 06:47:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1506467493604-25d7861a6703?ixlib=rb-0.3.5&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;s=0b993dce173955673b94a4d7512d645a" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://images.unsplash.com/photo-1506467493604-25d7861a6703?ixlib=rb-0.3.5&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=1080&fit=max&s=0b993dce173955673b94a4d7512d645a" alt="Real-time volcano ash forecast"><p>We have been recently working with the Quito Ambiente team for forecasting in real-time the volcano ash from Quito's 3 closest Volcanos.</p>
<p>The full project explanation is available <a href="http://aqicn.org/faq/2015-06-08/wind-and-air-pollution-forecast-in-quito/">from this link</a>. The current result is:</p>
<iframe id="quito-ash-forecast" src="https://aqicn.org/aqicn/view/faq/snippets/quito-wind-and-eruption-forecast/" scrolling="no" width="100%" height="800" style="border:1px solid black;"></iframe>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>