Add new post about eddl part 2

2026-06-12 23:58:11 -07:00 · 2021-10-31 18:18:46 -04:00
parent 6f4c32d4fb
commit f86bce1774
20 changed files with 1572 additions and 153 deletions
@@ -3,6 +3,7 @@ title:  "Xv6 introduction"
 date:   2017-07-28 14:56:55 -0400
 tags: xv6
 author: Pengzhan Hao
+cover: '/static/2021-10/Xv6_LS_Command_Output.png'
 ---

 In this post, you will learn a few basic concepts of xv6. Learning path will be closed coupled to first project assignment I gave when I assisted in teaching OS classes.
@@ -14,66 +14,66 @@ More details in design and implementation can be found in late posts.

 ## Why do we need training on edge?

-Cloud is not trustworthy anymore. More and more facts supports that breach on cloud happens frequently than before.
-Nowadays, with more generated personal sensitive data has been uploaded to the cloud center, tech company know better to someones than user themselves.
+The cloud is no longer trustworthy. Growing evidence shows that cloud breaches happen more frequently than before.
+Nowadays, as more and more sensitive personal data is uploaded to cloud data centers, tech companies know users better than users know themselves.
  
-Researchers, no matter in industry on academia, are working in a way that still learning from users' data but also keeping raw sensitive data under users' control.
-Many publications already showed feasibility of only sharing after-trained model instead of raw data.
+Researchers in both industry and academia are looking for ways to keep learning from users' data while keeping the raw sensitive data under users' control.
+Many publications have already shown the feasibility of sharing only the trained model instead of raw data.
 One recent popular study on this is google's [federated learning](https://ai.googleblog.com/2017/04/federated-learning-collaborative.html).
  
-During investigated this problem, we found that let end user train their own data is safe, but sacrifice efficiency.
+While investigating this problem, we found that letting end users train on their own data is safe but sacrifices efficiency.
 Since one end device has limited resources, training time and power consumption can be disappointing.
-We believe there must have a leverage between privacy and efficiency in some target scenarios.
+We believe there must be a balance between privacy and efficiency in some target scenarios.

-Fortunately, we observed that users who belongs to the same campus, plant, firm and community always share similar interests.
+Fortunately, we observed that users who belong to the same campus, plant, firm, and community always share similar interests.
 Therefore, these co-located users have similar demands in using AI-involved routines.
-Also, co-located users are easily targeted by same type of threats, such as ransomware to financial practitioners.
+Also, co-located users are easily targeted by the same type of threats, such as ransomware aimed at financial practitioners.

-Think about this, sending features of a new malware app to cloud services in order to train a neural networks used by antivirus program.
-This process may takes long time and small amount of samples may not be recognized by the global neural networks model.
-With a customized local model trained and deployed on the edge can successfully counter the problem.
-With edge training as a supplement of cloud training can achieve better response time and let the whole system more flexible.
+Consider sending features of a new malware app to cloud services to train the neural networks used by antivirus programs.
+This process may take a long time, and a small number of samples may not be recognized by the global neural network model.
+A customized local model trained and deployed on the edge can successfully counter the problem.
+Using edge training as a supplement to cloud training can achieve better response time and make the whole system more flexible.

 ## Why training on edge is hard?

-Since all co-located users' device can be used for an edge training, issues and challenges occur as deploying this distributed system.
+Since all co-located users' devices can be used for edge training, issues and challenges arise when deploying this distributed system.

 The first challenge is **struggling workers**.
-Training devices are heterogeneity, from limited IoT camera to high-end media center with powerful GPU.
-They are not designed to do machine learnings.
-So, a good edge-based distributed learning framework must can handle variety speeds in training tasks.
+Training devices are heterogeneous, from limited IoT cameras to high-end media centers with powerful GPUs.
+They are not designed to do machine learning.
+So, a good edge-based distributed learning framework must be able to handle a variety of speeds in training tasks.

 The second challenge is how to **scale up** clusters.
-In a campus, thousands and more devices may contribute computing resources to the same training tasks.
-However, these devices may located in far not matter in physical or in network topology. 
-How can we well use them well, without struggled with endless transmission time remains a challenge.
+On a campus, thousands or more devices may contribute computing resources to the same training tasks.
+However, these devices may be far apart, whether physically or in network topology.
+How to use them well without being bogged down by endless transmission time remains a challenge.

 The third issue is frequently **joining and exiting** of devices.
-We can't rely on each devices to faithfully working on training tasks rather than their original workload.
+We can't rely on each device to faithfully work on training tasks rather than their original workload.
 Smartly schedule work balance and handle join/exit issues also need under consideration.

 ## Our proposal

 - Dynamic training data distribution and runtime profiler

-    We design a dynamic training data distribution mechanism that helps to both the first and the third challenges.
-    Preprocessing data can be transmitted without leakage of raw sensitive information. 
-    This can helps with struggling workers who can train small batches in order to upload parameters with a similar training time.
-    Also, for extremely slow devices, join and exit of devices cases, dynamic data distribution and profiler can helps with keep global training parameters from polluted and staleness.
+    We design a dynamic training data distribution mechanism that helps with both the first and the third challenges.
+    Preprocessed data can be transmitted without leaking raw, sensitive information.
+    This helps struggling workers, which can train on smaller batches so that they upload parameters after a similar training time.
+    Also, for extremely slow devices and for devices that join or exit, dynamic data distribution and the profiler help keep the global training parameters from being polluted or going stale.

-    To counter heterogeneity's, more approaches were applied in our later research.
-    More details were introduced to runtime profiler in the later works. 
+    To counter heterogeneity, more approaches were applied in our later research.
+    More details were introduced to the runtime profiler in the later works.

 - Asynchronous and synchronous aggregation enabled

-    In our findings, asynchronous and synchronous parameter update have their pros and cons. 
-    Keeping sync all the time leads struggling worker issue unsolvable.
-    However, async's harm to accuracy and convergence time also need attentions.
-    To carefully chose between these two update policies at the runtime is what we proposed to make use of their own advantages.
+    In our findings, asynchronous and synchronous parameter updates each have their pros and cons.
+    Staying synchronous all the time leaves the struggling-worker issue unsolvable.
+    However, async's harm to accuracy and convergence time also needs attention.
+    We propose carefully choosing between these two update policies at runtime to make use of the advantages of each.

 - Leader role splitting

-    The idea is to let worker devices with higher bandwidth taking leader role during training.
-    Parameter updating does not require much computation but only need bandwidth. 
+    The idea is to let worker devices with higher bandwidth take the leader role during training.
+    Parameter updating does not require much computation but needs a great deal of bandwidth.
    Devices with sufficient bandwidth can also work as virtual leader devices.
-    This approach helps with minimize physical devices we used and more leaders can further scale up workers limits.
+    This approach helps minimize the number of physical devices used, and more leaders can further raise the limit on the number of workers.
@@ -0,0 +1,109 @@
+---
+title:  "EDDL: How do we train neural networks on limited edge devices - PART 2"
+date:   2021-10-31 13:01:14 -0400
+tags: Research
+author: Pengzhan Hao
+cover: '/static/2021-10/f.5_Impl_leader_worker.png'
+mathjax: true
+---
+
+The previous post, part 1, gave a general overview of our idea of distributed learning in an edge environment.
+I explained why edge distributed learning is needed and what improvements it can achieve.
+In this post, I will talk about our motivation study and how our framework works.
+  
+## How does data support us training on edge?
+
+Before designing and implementing our framework, we first needed to confirm that training on resource-limited edge devices is worthwhile.
+We used a malware detection neural network to show why a small, customized neural network can be better.
+
+We collected features from 32000+ mobile apps as the global dataset.
+With these data records, we trained a multilayer perceptron called "PerNet" to determine whether a given set of features belongs to a benign or a malware app.
+We called this **detection**.
+PerNet can also classify malware apps into different types of attacks.
+We called this **classification**.
+The global model achieves a recall rate above 93% and an accuracy above 96.93%.
+
+From this data, we selected two community app-usage sub-datasets for generating local models.
+
+- Large categories (Scenario 1)
+    We chose the 5 largest categories of apps (Entertainment, Tools, Brain & Puzzle, Lifestyle, and Education) as well as the 5 largest malware categories.
+    Altogether, 12000+ apps were included in this sub-dataset, split almost 50/50 between benign and malware.
+
+- Campus-community categories (Scenario 2)
+    We chose the 5 categories most downloaded by college students as the benign groups, along with a similar number of apps from 5 malware categories.
+    To ensure that malware apps are represented in the 5 benign categories, we also considered synthesizing additional malware apps within those 5 most-downloaded (benign) categories.
+
+With these two types of sub-datasets, we used the same PerNet to generate multiple local models.
+In each scenario's experiment, we compared the global and local models on the held-out test dataset.
+For classification, the local models beat the global model in every scenario.
+For detection, the local models matched the accuracy of the global model.
+
+![Inference results](/static/2021-10/t.3_inference_result.png)
+
+In summary, local models are trained for specific scenarios.
+In those scenarios, a global model achieves no better accuracy than the local models.
+The local models may be better because of overfitting.
+I believe the machine learning community has considered this issue as well and addresses it with [transfer learning](https://en.wikipedia.org/wiki/Transfer_learning),
+a technique that adapts a global model to a specific scenario by performing more training on it once it has been shipped to the local side.
+
+## Design and Implementation
+
+### Overall design
+
+The basic EDDL distributed training setup consists of 3 parts.
+**EDDL training cluster**, a device cluster that consists of edge or mobile devices that are participating in training.
+**EDDL manager**, the initial driver program that collects training data, relays data to training devices, and initializes training clusters.
+**Training data entry (TDE)**, a data store for all training data.
+
+### Dynamic training data distribution
+
+Existing distributed DNN training solutions usually statically partition training data among workers.
+This becomes a problem when training nodes join and exit.
+We designed our framework to distribute training data dynamically during learning.
+Before each training batch starts, a batch of data from the TDE is sent to the devices.
+
+In our experiments, we found that applying this design shortened the overall training time.
+Especially with a large number of devices, training time with this optimization can be 50% less than with static partitioning.
+
+### Scaling up cluster size
+
+Our framework was designed to support both sync and async parameter aggregation.
+Asynchronous aggregation allows a high throughput of training batches, but at the cost of convergence time.
+Synchronous aggregation converges in fewer epochs, but it can't ensure performance when there is a struggling worker.
+
+As our experiments showed, convergence time dominates overall training time, so we chose sync as the default.
+However, we also considered the possibility that async with more workers can achieve a similar overall training time.
+
+We introduced a formula to determine whether adding more training nodes can help or not.
+We define the bandwidth usage coefficient (BUC) as
+
+$$ BUC = \dfrac{n}{T_{sync}} $$
+
+In this formula, $$n$$ is the number of devices, and $$T_{sync}$$ is the transmission time of parameters.
+As workers are added, $$n$$ increases linearly but the transmission time does not.
+When $$BUC$$ increases, the cluster can speed up training time by adding workers.
+Otherwise, adding more workers won't help with overall training time.
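+
+As a rough illustration with made-up numbers (hypothetical values, not measurements from our experiments), write $$BUC_n$$ as shorthand for the coefficient with $$n$$ workers. Suppose going from 4 to 8 workers raises the transmission time from 2 s to 3 s, and going from 8 to 16 workers raises it to 8 s:
+
+$$ BUC_{4} = \dfrac{4}{2} = 2, \quad BUC_{8} = \dfrac{8}{3} \approx 2.67, \quad BUC_{16} = \dfrac{16}{8} = 2 $$
+
+Under these assumed values, $$BUC$$ rises from 4 to 8 workers, so adding workers should help. It falls from 8 to 16 workers, so the extra workers would not shorten the overall training time.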
+
+### Adaptive leader role splitting
+
+The idea of role splitting is simple: a device can act as both a worker and a leader.
+The advantage is straightforward: one fewer set of parameters has to be transferred, so training time is shortened.
+
+However, in our current setting it doesn't help much, since there is only 1 leader role in a cluster.
+We expect to benefit from this in future work.
+
+### Overall architecture
+
+![Implementation](/static/2021-10/f.5_Impl_leader_worker.png)
+
+Details are given in the figure above.
+
+### Prototype hardware and software
+
+EDDL was designed to run on two embedded single-board computer platforms.
+One such platform is [ODROID-XU4](https://www.hardkernel.com/shop/odroid-xu4-special-price/), which is equipped with a 2.1/1.4 GHz 32-bit ARM processor and 2GB memory.
+The other platform is the [Raspberry Pi 3 Model B board](https://www.raspberrypi.com/products/raspberry-pi-3-model-b/), which comes with an ARM 1.2 GHz 64-bit quad-core processor and 1GB memory.
+
+The operating system running on the above platforms is Ubuntu 18.04 with Linux kernel 4.14.
+We used [Dlib](http://dlib.net/), a C++ library that provides implementations for a wide range of machine learning algorithms.
+We chose the Dlib library because it is written in C/C++, and can be easily and natively used in embedded devices.