Stop Talking, Start Doing

EDDL: How do we train neural networks on limited edge devices - PART 1

2021-10-13T16:53:20-04:00

This post introduces a previous milestone of our "Edge trainer" project: the publication of the paper "EDDL: A Distributed Deep Learning System for Resource-limited Edge Computing Environment." As the first part of this series, it focuses only on the motivation and a summary of our work. Details of the design and implementation will follow in later posts.

Why do we need training on the edge?

The cloud is no longer trustworthy. More and more evidence shows that breaches in the cloud happen more frequently than before. As more sensitive personal data is uploaded to cloud data centers, tech companies can come to know users better than the users know themselves.

Researchers in both industry and academia are working on ways to keep learning from users' data while leaving the raw sensitive data under the users' control. Many publications have already shown that it is feasible to share only the trained model instead of the raw data. One popular recent line of work is Google's federated learning.

While investigating this problem, we found that letting end users train on their own data is safe but sacrifices efficiency. Since a single end device has limited resources, training time and power consumption can be disappointing. We believe there must be a workable trade-off between privacy and efficiency in some target scenarios.

Fortunately, we observed that users who belong to the same campus, plant, firm, or community often share similar interests. These co-located users therefore have similar demands on AI-driven routines. Co-located users are also easily targeted by the same types of threats, such as ransomware aimed at financial practitioners.

Consider sending the features of a new malware app to a cloud service to train the neural network used by an antivirus program. This process may take a long time, and a small number of samples may not be recognized by the global model. A customized local model trained and deployed on the edge can counter this problem. Using edge training as a supplement to cloud training can achieve better response time and make the whole system more flexible.

Why is training on the edge hard?

Since any co-located user's device can be used for edge training, several issues and challenges arise when deploying such a distributed system.

The first challenge is straggling workers. Training devices are heterogeneous, ranging from limited IoT cameras to high-end media centers with powerful GPUs, and they are not designed for machine learning. A good edge-based distributed learning framework must therefore handle widely varying training speeds.

The second challenge is scaling up clusters. On a campus, thousands of devices or more may contribute computing resources to the same training task. However, these devices may be far apart, whether physically or in the network topology. Making good use of them without spending endless time on transmission remains a challenge.

The third issue is devices frequently joining and leaving. We cannot rely on every device to work faithfully on training tasks instead of its original workload. Smart scheduling for work balance and handling of joins and exits also need to be considered.

Our proposal

Dynamic training data distribution and runtime profiler

We designed a dynamic training data distribution mechanism that helps with both the first and the third challenges. Preprocessed data can be transmitted without leaking raw sensitive information. This helps with straggling workers: they can train on smaller batches, so that every worker uploads its parameters after a similar training time. For extremely slow devices, and when devices join or exit, dynamic data distribution together with the runtime profiler helps keep the global training parameters from being polluted or becoming stale.
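To make the idea concrete, here is a minimal Python sketch, assuming the runtime profiler reports each worker's throughput in samples per second. The function name assign_batch_sizes and the proportional rule are illustrative assumptions, not the exact policy from the paper: each worker receives a share of the global batch proportional to its speed, so all workers should finish a round in roughly the same time.

```python
def assign_batch_sizes(throughput, global_batch):
    """Split global_batch across workers in proportion to profiled throughput.

    throughput: dict mapping worker name -> samples per second (hypothetical profiler output).
    Returns a dict mapping worker name -> number of samples for the next round.
    """
    total = sum(throughput.values())
    if total <= 0:
        raise ValueError("need at least one worker with positive throughput")
    # Ideal (fractional) share for each worker.
    ideal = {w: global_batch * t / total for w, t in throughput.items()}
    sizes = {w: int(share) for w, share in ideal.items()}
    # Give the samples lost to rounding to the workers with the largest remainders,
    # so the sizes always add up to global_batch.
    leftover = global_batch - sum(sizes.values())
    by_remainder = sorted(ideal, key=lambda w: ideal[w] - sizes[w], reverse=True)
    for w in by_remainder[:leftover]:
        sizes[w] += 1
    return sizes
```

When a device joins or exits, the profiler's dictionary simply changes, and the split is recomputed for the next round.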

To counter heterogeneity, we applied further approaches in our later research; the runtime profiler is described in more detail in later work.

Asynchronous and synchronous aggregation enabled

We found that asynchronous and synchronous parameter updates each have pros and cons. Staying synchronous all the time leaves the straggling-worker problem unsolved, while asynchronous updates harm accuracy and convergence time, which also needs attention. We propose choosing carefully between these two update policies at runtime to take advantage of both.

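As an illustration only, the sketch below switches between the two policies with a simple hypothetical rule: if the slowest worker's iteration time exceeds the fastest's by more than a chosen ratio, waiting for it every round is too costly, so use async; otherwise stay sync. The function name and the threshold of 2.0 are assumptions, not the decision rule from the paper.

```python
def choose_update_policy(iteration_times, straggler_ratio=2.0):
    """Pick "sync" or "async" for the next round.

    iteration_times: dict mapping worker name -> seconds per iteration (profiled).
    straggler_ratio: hypothetical threshold on slowest / fastest.
    """
    fastest = min(iteration_times.values())
    slowest = max(iteration_times.values())
    if fastest <= 0:
        raise ValueError("iteration times must be positive")
    # A synchronous round runs at the pace of the slowest worker.
    if slowest / fastest > straggler_ratio:
        return "async"
    # Small spread: sync keeps every update fresh, which helps accuracy and convergence.
    return "sync"
```
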
Leader role splitting

The idea is to let worker devices with higher bandwidth take the leader role during training. Parameter updating does not require much computation, only bandwidth. Devices with sufficient bandwidth can therefore also act as virtual leaders while still training. This approach minimizes the number of physical devices we use, and adding more leaders further raises the limit on the number of workers.
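A minimal sketch of the selection step, assuming each device reports its bandwidth in Mbps: the highest-bandwidth devices become leaders (they may keep training as virtual leaders), and the remaining workers are spread round-robin across them so each leader aggregates a similar number of workers. The function split_leaders and the round-robin grouping are hypothetical, not the paper's algorithm.

```python
def split_leaders(bandwidth_mbps, num_leaders):
    """Choose leaders by bandwidth and assign every other device to one of them.

    Returns a dict mapping leader name -> list of worker names it aggregates.
    """
    if num_leaders < 1:
        raise ValueError("need at least one leader")
    ranked = sorted(bandwidth_mbps, key=bandwidth_mbps.get, reverse=True)
    leaders = ranked[:num_leaders]
    groups = {leader: [] for leader in leaders}
    # Round-robin the remaining devices across the leaders.
    for i, worker in enumerate(ranked[num_leaders:]):
        groups[leaders[i % num_leaders]].append(worker)
    return groups
```

Adding a leader is a single parameter change here, which mirrors the point above: more leaders let the cluster accept more workers.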

Generate Word Cloud Figures with Chinese Tokenization and the WordCloud Python Library

2020-09-15T22:00:14-04:00

Let's generate a word cloud like this one. Not understanding the language is not a big deal. If your language is written in the Latin alphabet (or any other script that puts spaces between words), you can skip tokenization.

Background

Recently, I set up a web-based RSS client for retrieving and organizing daily news. I used Tiny Tiny RSS (TTRSS), a popular RSS client that is Docker-friendly. Thanks to the developer HenryQW, a well-written Nginx-based Docker configuration is already available on Docker Hub. As I added more feeds, I found that some of them do not need to be checked every day. So I decided to write a script that automatically lists all keywords that appeared in a recent period and generates a heat-map-style figure from them.

Before going further, here is an overview of my setup.

My first step is to read all the text from TTRSS's PostgreSQL database. I then use jieba, a Chinese NLP library, to extract all keywords with their occurrence frequencies. Finally, WordCloud, a Python library, generates and displays the word cloud figure. More details are discussed in later sections.

Get the RSS feeds' text

My first idea was to generate a keyword heat map of the past week's economy news. Since this post focuses on Chinese tokenization and drawing the word cloud, I'll just leave my code here for reference. The SQL connector I used is psycopg2, an easy-to-use PostgreSQL library.

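import psycopg2


class FeedReader:  # illustrative class name; the original snippet shows only the methods
    def __init__(self):
        # DB_HOST, DB_PORT, DB_NAME, DB_USER and DB_PASS are your own connection settings.
        self.dbe = psycopg2.connect(
            host=DB_HOST, port=DB_PORT, database=DB_NAME, user=DB_USER, password=DB_PASS)

    def get_1w_of_feed_byid(self, id=1) -> list:
        """Return the content of every entry of feed `id` updated during the last week."""
        # Replace DB_TABLE_NAME with the table that maps entries to feeds.
        query = """
            SELECT content FROM public.ttrss_entries
            WHERE date_updated > now() - interval '1 week'
              AND id IN (SELECT int_id FROM DB_TABLE_NAME WHERE feed_id = %s)
            ORDER BY id ASC
        """
        with self.dbe.cursor() as cur:
            # Pass the feed id as a query parameter instead of concatenating it into the SQL.
            cur.execute(query, (id,))
            return cur.fetchall()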

Most arguments are intuitive. The only exception is the argument of get_1w_of_feed_byid: id is the index of the feed in my subscriptions.

Tokenize with frequency

I tried two popular tokenization libraries and chose jieba after a few comparisons. Before cutting the sentences, we first need to remove all punctuation marks.

def remove_biaodian(text: str) -> str:
    punct = set(u''':!),.:;?]}¢'"、。〉》」』】〕〗〞︰︱︳﹐､﹒
                ﹔﹕﹖﹗﹚﹜﹞！），．：；？｜｝︴︶︸︺︼︾﹀﹂﹄﹏､～￠
                々‖•·ˇˉ―--′’”([{£¥'"‵〈《「『【〔〖（［｛￡￥〝︵︷︹︻
                ︽︿﹁﹃﹙﹛﹝（｛“‘-—_…''')
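    # Keep every character that is not in the punctuation set. The spaces and line
    # breaks inside the literal above are members of the set too, so whitespace is removed.
    return "".join(ch for ch in text if ch not in punct)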

Once we have a punctuation-free string, we can call jieba. The function jieba.posseg.cut (with or without paddle mode) returns each word together with its part of speech. As you can see in the following code, I also did two more things.

First, in the if statement, I keep only nouns in certain categories. Abbreviations such as "nr" and "ns" denote different parts of speech; the categories I used are listed in the following table. You can find more details at this link.

Second, I keep only words that are at least 2 characters long. Unlike Latin writing systems, Chinese puts no spaces between words. As a result, some single-character words, such as conjunctions, are easily misrecognized as proper nouns, and this misrecognition causes even more single characters to be tagged as proper nouns. Since I am not able to improve the NLP method, I used a simple fix: removing every word shorter than 2 characters.

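import jieba.posseg as pseg

# Tags to keep (see the table below); LOC is deliberately left out.
# PER and ORG are only produced when jieba runs in paddle mode.
NOUN_FLAGS = {'nr', 'ns', 'nt', 'nw', 'nz', 'PER', 'ORG', 'x'}


def get_noun_jieba(self, content: str) -> list:
    content = remove_biaodian(content)
    words = pseg.cut(content)  # jieba.posseg.cut yields (word, flag) pairs

    ret = []
    for word, flag in words:
        if flag in NOUN_FLAGS:
            ret.append(word)
    # Punctuation and whitespace were already removed above, so only the length
    # filter is left: drop words shorter than 2 characters.
    return [w for w in ret if len(w) >= 2]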

Word category names and abbreviations

Abbreviation	Part of speech
nr	Person name
ns	Place name
nt	Organization name
nw	Title of a work
nz	Other proper noun
PER	Person name (paddle mode)
ORG	Organization name (paddle mode)
x	Non-morpheme word

With all words extracted, we can easily calculate their frequencies. Then we can use the following code to print a sorted result and check that it is correct.

noun = seg.get_noun_jieba(test_content)
# ... Calculate frequency of above word list ...
print(sorted(a_dict.items(), key=lambda x: x[1]))
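The frequency step is left out above. One straightforward way to fill it in, shown here as a sketch rather than the code I actually used, is collections.Counter from the standard library. The helper name count_frequencies and the sample word list are illustrative stand-ins for the output of get_noun_jieba.

```python
from collections import Counter


def count_frequencies(words):
    """Map each keyword to the number of times it occurs in the word list."""
    return dict(Counter(words))


noun = ["新冠", "北京", "新冠", "疫苗", "新冠", "北京"]  # stand-in for get_noun_jieba output
a_dict = count_frequencies(noun)
print(sorted(a_dict.items(), key=lambda x: x[1]))
# [('疫苗', 1), ('北京', 2), ('新冠', 3)]
```

Counter also offers most_common(n) if you only want the top keywords.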

Draw word cloud

With a dictionary that maps keywords to frequencies, we can simply call the wordcloud library's built-in functions to generate the figure.

First, we initialize an instance of the WordCloud class. As you can see in my code, I set six parameters: the width and height of the canvas, the maximum number of words used in the figure, the font, the background color, and the margin between words.

With the instance ready, we call generate_from_frequencies and pass the keyword dictionary to it. This method returns the WordCloud object itself, which can be rendered as an image; matplotlib can then plot it on your screen.

I tested my plot on the Ubuntu subsystem on Windows 10. Unfortunately, matplotlib under the subsystem needs an X11 server to open a window, and Windows does not provide one by default, so we need to install one. I used Xming.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

font_path = "./font/haipai.ttf"
output_path = "./font/out.png"


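def show_figure_with_frequency(keywords: dict):
    # Canvas size, word limit, a font that can render Chinese,
    # background color and the margin between words.
    wc = WordCloud(width=828, height=1792, max_words=200, font_path=font_path,
                   background_color="white", margin=1).generate_from_frequencies(keywords)
    wc.to_file(output_path)  # also save the image to output_path
    plt.imshow(wc)  # a WordCloud object can be drawn directly as an image
    plt.axis('off')
    plt.show()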

If everything works, a word cloud figure will appear in a new window. Mine looks like this.

This word cloud reflects the most popular keywords in economy news for the week starting 06-28-2020. The two largest words in the figure are "新冠" and "新冠病毒", both of which refer to COVID-19 (that week saw the second COVID-19 surge in Beijing, China). The image size fits my phone screen, and I use an app to automatically sync it to my phone's wallpaper. However, too many location nouns appear in this image; this is something I can improve in the future.

Xv6 introduction

2017-07-28T14:56:55-04:00

In this post, you will learn a few basic concepts of xv6. The learning path closely follows the first project assignment I gave when I was a teaching assistant for OS classes. The first half covers what a system call is and how to implement a simple one. The second half discusses how to debug xv6 with gdb.

Xv6 System Calls

To add a system call, we first declare a user-mode function in user.h that serves as the interface to the kernel.

void function (void);

Next, the function name (here, function) must be added to usys.S with the SYSCALL macro. When a user program calls the function, the stub generated in usys.S loads the system call number SYS_function into %eax and traps into the kernel. The kernel's syscall() in syscall.c then reads that number and checks whether such a system call exists. So we must also define the system call number in syscall.h and register a kernel function with the corresponding name in syscall.c.

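SYSCALL(function)               // usys.S: generates the user-mode stub
#define SYS_function ##         // syscall.h: ## is the system call number
[SYS_function]  sys_function,   // syscall.c, syscalls[] table: real system function name
extern int sys_function(void);  // syscall.c: real system function declaration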

After adding these lines, implement the real function, sys_function, in the kernel source file where it fits best (for example, sysproc.c).
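The dispatch path described above can be modeled in a few lines. This is a toy Python model for illustration, not xv6 code: the dict stands in for the syscalls[] array in syscall.c, and proc["eax"] stands in for the saved %eax in the process's trap frame. The numbers SYS_getpid = 11 and SYS_function = 22 and the body of sys_function are placeholders; the real numbers live in syscall.h.

```python
SYS_getpid = 11    # placeholder numbers; xv6 defines the real ones in syscall.h
SYS_function = 22


def sys_getpid(proc):
    return proc["pid"]


def sys_function(proc):
    # Placeholder body for the new system call.
    return 0


# Stands in for the syscalls[] table in syscall.c: system call number -> kernel function.
syscalls = {SYS_getpid: sys_getpid, SYS_function: sys_function}


def syscall(proc):
    num = proc["eax"]  # the usys.S stub put the system call number in %eax
    handler = syscalls.get(num)
    if handler is None:
        print(f"{proc['pid']} {proc['name']}: unknown sys call {num}")
        proc["eax"] = -1
    else:
        # The return value travels back to user space in %eax.
        proc["eax"] = handler(proc)
```

Forgetting any of the three registrations above shows up in this model as the "unknown sys call" branch, which is also what xv6 prints in that case.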

Sometimes we need to pass arguments to a system call. In this case, the arguments cannot be passed directly as parameters of sys_function, which takes void. When a user program invokes a system call, its arguments are pushed onto the current process's stack, and syscall.c provides several helper functions (argint, argptr, argstr) to fetch them from the process. I won't spend time explaining how to use these functions, since the source code already has clear and detailed comments. I will explain how processes are organized and how they work in xv6 in future articles.
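The address arithmetic behind argint is simple: in xv6 it reads a 32-bit value at tf->esp + 4 + 4*n, where the extra 4 bytes skip the return address pushed when the user program called the usys.S stub. The Python below only models that arithmetic on a byte buffer; fetchint here is a simplified stand-in that raises an error instead of returning -1 like the kernel version.

```python
import struct


def fetchint(mem: bytes, addr: int) -> int:
    """Read a 32-bit little-endian signed int from 'user memory' (simplified model)."""
    if addr < 0 or addr + 4 > len(mem):
        raise ValueError("address outside the process's memory")
    return struct.unpack_from("<i", mem, addr)[0]


def argint(mem: bytes, esp: int, n: int) -> int:
    """Fetch the n-th 32-bit system call argument.

    esp points at the return address, so argument 0 starts 4 bytes above it,
    argument 1 starts 8 bytes above it, and so on.
    """
    return fetchint(mem, esp + 4 + 4 * n)
```

For example, a stack laid out as struct.pack("<Iii", 0x1234, 7, -5) with esp = 0 yields 7 for argument 0 and -5 for argument 1.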

Debug xv6 with gdb

Please make sure you have used gdb before. If you have never used it, write a simple C program of 50-100 lines and practice using gdb on it first.

To make sure gdb support is enabled for xv6, check that the .gdbinit.tmpl file exists. It is used to generate the .gdbinit file, which you can think of as a configuration file for gdb.

Before running xv6 in QEMU, note that gdb must attach to xv6 remotely. This is because xv6 runs inside QEMU, and the emulated machine is isolated from the host. When you start debugging, QEMU opens a gdb server for the gdb client to connect to.

When you are ready to start, use the following command to compile and run xv6:

$ make qemu-nox-gdb
*** Now run 'gdb'.
qemu-system-i386 -nographic -drive file=fs.img,index=1,media=disk,format=raw -drive file=xv6.img,index=0,media=disk,format=raw -smp 2 7

At this point xv6 seems to be stuck. This is because QEMU is waiting for a gdb client to connect. You can use .gdbinit to make this remote connection automatically by typing the following command in another terminal:

$ gdb -x .gdbinit
GNU gdb (Debian 8.2.1-2+b3) 8.2.1

...

The target architecture is assumed to be i8086
[f000:fff0]    0xffff0: ljmp   $0x3630,$0xf000e05b
0x0000fff0 in ?? ()
+ symbol-file kernel
warning: A handler for the OS ABI "GNU/Linux" is not built into this configuration
of GDB.  Attempting to continue with the default i8086 settings.

(gdb) 

In this gdb shell, type 'c' to continue, and you will see xv6 start executing in the first terminal.

From here, you can add breakpoints to check whether your code is implemented correctly.

One more thing: if you open the .gdbinit file, you'll find that by default it connects to a localhost target. If the target (QEMU) and the gdb client are not running on the same machine, change localhost to the corresponding IP address. Note that logging in over ssh to the same domain name may land you on different physical machines, because a load balancer is used. To check a machine's IP address, use the ip command (for example, ip addr).

target remote localhost:28467
# target remote [ip-addr]:28467
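If you retarget often, a small helper can rewrite the active target line for you. This is a hypothetical convenience script, not part of xv6: it replaces the host in every uncommented "target remote HOST:PORT" line and keeps the port, leaving commented-out lines alone.

```python
import re

# Matches an uncommented "target remote HOST:PORT" line; group 2 is ":PORT".
TARGET_LINE = re.compile(r"^(target remote )[^:\s]+(:\d+)", re.MULTILINE)


def retarget_gdbinit(text: str, host: str) -> str:
    """Return .gdbinit text with the remote host replaced and the port kept."""
    return TARGET_LINE.sub(lambda m: m.group(1) + host + m.group(2), text)
```

Usage: read .gdbinit, call retarget_gdbinit(text, "10.0.0.5") with the address reported by ip addr on the QEMU machine, and write the result back.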

Some of my previous experimental work: 2016

2016-10-28T12:27:33-04:00

This post is only a basic record of my work. For some specific topics, I will write separate, dedicated posts.

2016-10

Time Experiment of rsync

The patch is based on rsync version 3.1.2. [Rsync|Patch]

How to collect data

The patched rsync prints the transmission time, the computation time, and the overall time to the console. But we also need some bash scripts to collect data across files of different random sizes, with different modifications applied to them.

From 8K to 64M, modified at the beginning [Bash script]
From 8K to 64M, modified at the end [Bash script]
From 8K to 64M, modified at a random place with a (slow) Python script [Bash script|Python program]
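The original scripts are linked above. As a rough sketch of the same test setup, the Python below creates random files and overwrites a few bytes at the beginning, the end, or a random offset. The helper names, the doubling step between sizes, and the 16-byte modification length are all assumptions, not what my scripts used.

```python
import os
import random


def file_sizes(start=8 * 1024, stop=64 * 1024 * 1024, factor=2):
    """Yield test file sizes from 8K up to 64M; the growth factor is an assumption."""
    size = start
    while size <= stop:
        yield size
        size *= factor


def make_random_file(path, size):
    with open(path, "wb") as f:
        f.write(os.urandom(size))


def modify(path, where="beginning", length=16):
    """Overwrite `length` bytes at the beginning, the end, or a random offset.

    Returns the offset that was modified.
    """
    size = os.path.getsize(path)
    if where == "beginning":
        offset = 0
    elif where == "end":
        offset = max(size - length, 0)
    else:  # random position
        offset = random.randrange(0, max(size - length, 0) + 1)
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(os.urandom(length))
    return offset
```

Each file would be synced once, modified, and synced again, and the timings from the patched rsync collected for the second run.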

Time Experiment of seafile

The patch is based on seafile 5.1.4. You can find the release in the official seafile repo, and you can follow the official compile instructions here. [Patch no longer available; a new version is in the following sections]

How to collect data

Again, everything needs to be done with scripts. This time, however, I spaced the file sizes further apart.

From 8K to 16M, growing 4 times at each step, modified at the beginning or at 1024 different places with a Python script. [Bash Script|Python program]
After running this automated test script, all output is written to seafile's log file, located at ~/.ccnet/log/seafile.log.
We can use this simple awk command, plus some editing in vim, to extract the data.

# CDC: content defined chunks
# HUT: HTTP upload traffic
# ALL: overall time of one commit & upload
awk '/CDC|HUT|ALL/ {print $4,$5}' ~/.ccnet/log/seafile.log > results.stat
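Instead of the vim step, the extracted lines can be grouped with a few lines of Python. This sketch assumes that field $4 is the tag (CDC, HUT, or ALL, possibly followed by a colon) and field $5 is a number; I have not verified that against the actual log format, so adjust the parsing if the fields differ.

```python
from collections import defaultdict


def parse_stats(lines):
    """Group 'TAG VALUE' lines (as written by the awk command) into lists per tag."""
    stats = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip anything that does not look like 'TAG VALUE'
        tag, value = parts
        try:
            stats[tag.rstrip(":")].append(float(value))
        except ValueError:
            continue  # the value was not a number
    return dict(stats)
```

Usage: with open("results.stat") as f: stats = parse_stats(f), after which stats["ALL"] holds the overall times in log order.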

Install Seafile on odroid xu

Because my attempt to cross-compile seafile for Android failed, I used a development board as a replacement platform for testing seafile on ARM, with an odroid xu as the hardware. Since all I need is an ARM platform, ARM Ubuntu is enough. Building prototypes on a board is more fun than coding, but I won't say much about it this time; I'll start a separate post about some really cool stuff I built for a strange goal.

Installing Ubuntu with a GUI was all the preparation I needed. I found two ways to do this.

armhf is a website for ARM-based Ubuntu, with detailed instructions to follow here. It offers Ubuntu 12.04/14.04 and Debian 7.5. Unfortunately, the odroid xu's HDMI output is not supported by the native Ubuntu firmware, so a system installed with ubuntu-desktop may boot without video output.
Burning an image is a much easier way to install a precompiled Ubuntu system. I found one on the odroid xu forum, which offers a Xubuntu image [download] for the odroid xu. With this image, you just need to use the dd command to write the whole system image to an SD card.

# If .img end with xz, use this command to uncompress first
unxz ubuntu-14.04lts-xubuntu-odroid-xu-20140714.img.xz    
# Burn the image onto the SD card (check the device name first, e.g. with lsblk: dd overwrites /dev/sdb entirely)
sudo dd if=ubuntu-14.04lts-xubuntu-odroid-xu-20140714.img of=/dev/sdb bs=1M conv=fsync
sync

2016-11

Android Kernel

How to build an Android Kernel?

I won't explain much in this part; I'll just list some related links and point out some mistakes and their solutions.

Google Official Guide: if you don't have the AOSP sources, the prebuilt toolchains recommended in this guide might not be correct for you. Use the following link to choose fitting tools: AOSP git root, under "/platform/prebuilts/gcc".
Packing and Flashing a Boot.img [highly recommend]

2016-12

Android Kernel

How to compile with ftrace?

ftrace is a great tool for debugging on Android. However, it is not available with the default Android kernel configuration. The device's defconfig lives under arch/arm64/configs, and we need to add a few lines to it.

CONFIG_STRICT_MEMORY_RWX=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_PERSISTENT_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_PREEMPT_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_STACK_TRACER=y
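Editing the defconfig by hand works, but the same option may already appear as "# CONFIG_... is not set" or "=n". A small hypothetical helper can drop any existing setting for each option and append it as =y. Kconfig may still silently drop an option whose dependencies are not met when the defconfig is loaded, so check the generated .config afterwards.

```python
import re

# Matches "CONFIG_X=..." and "# CONFIG_X is not set"; group 1 is the option name.
OPTION_LINE = re.compile(r"^(?:# )?(CONFIG_[A-Za-z0-9_]+)(?:=| is not set)")


def enable_options(defconfig: str, options) -> str:
    """Return defconfig text with every option in `options` set to =y exactly once."""
    wanted = set(options)
    kept = []
    for line in defconfig.splitlines():
        m = OPTION_LINE.match(line)
        if m and m.group(1) in wanted:
            continue  # drop the old setting; it is re-added below as =y
        kept.append(line)
    kept.extend(f"{opt}=y" for opt in options)
    return "\n".join(kept) + "\n"
```

Usage: read the device's defconfig, pass it with the list of options above, and write the result back before building.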

How to extract Android images: dump an image

If we want to keep root after flashing boot, we need to extract an image from the Android device. First, use the following commands to find which block device each partition maps to. Among several references, this article offers three ways to dump an image; I picked the easiest one.

adb shell
ls -al /dev/block/platform/$SOME_DEVICE../../by-name # {Partitions} -> {Device Block}

# dump the boot partition (mmcblk0p37 on my device; use the block device found above)
su
dd if=/dev/block/mmcblk0p37 of=/sdcard/boot.img
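The by-name listing is a set of symlinks such as "boot -> /dev/block/mmcblk0p37", and the block number differs between devices. A small hypothetical helper can turn saved ls -al output into a name-to-block map, so the dd source is not guessed by hand.

```python
def parse_by_name(listing: str) -> dict:
    """Map partition names to block devices from `ls -al .../by-name` output."""
    partitions = {}
    for line in listing.splitlines():
        if " -> " not in line:
            continue  # skip 'total' and other lines that are not symlinks
        left, target = line.split(" -> ", 1)
        name = left.split()[-1]  # the link name is the last field before '->'
        partitions[name] = target.strip()
    return partitions
```

For example, parse_by_name(listing)["boot"] gives the device to use in dd if=... of=/sdcard/boot.img.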

Using Charles Proxy to monitor mobile SSL traffic

2016-10-27T22:50:33-04:00

In this post, I will talk about how to use the proper tools to monitor the SSL traffic of mobile devices. Currently, I can only deal with SSL traffic that trusts a certificate we can replace, such as the system root certificates. Some applications do not use the system root certificates, or do not provide a way to modify their own certificates. I haven't found a good solution for these cases yet, but I'll update this post if I find one.
My current solution is to use an access point to forward all SSL traffic to a proxy. Charles Proxy is my first choice (at my professor's request). It is non-free software that is still being updated. So I'll mainly talk about how to set up Charles as an SSL proxy.

Preparations

Monitoring device: a Linux machine with a wireless adapter
Download the newest version (4.0.1) of Charles
A target Android device with root privileges

Install and configure Charles

You have to install Charles first. After downloading Charles Proxy, extract the archive and configure some basic settings.

# open charles first
./bin/charles  

Save Charles' private and public keys

Under Help -> SSL Proxying -> Export Charles Root Certificate and Private Key, enter a password and save the public and private keys in *.p12 format.
You also need to save the Charles Root Certificate, which is in the same menu. For convenience, save it in *.pem format.

Set Proxy and SSL Proxy

Stop Talking is the worst title for a blog

2016-10-26T22:50:33-04:00