
# Bravo! Accepted by GSoC

As a postgraduate student at Nanchang Hangkong University, I have been accepted by GSoC and will be working with Mozilla. It's a superb honor for me, so I've decided to record what I do for GSoC, and all my open source activity, in this blog.

Google Summer of Code, often abbreviated to GSoC, is an international annual program, first held from May to August 2005, in which Google awards stipends (of US$5,500, as of 2016) to students who successfully complete a requested free and open-source software coding project during the summer.

## Mozilla and the A-team

I believe every computer fan has heard of Mozilla, and you probably already know about Firefox, the independent, people-first browser made by Mozilla. Since its name has long been familiar, I will only talk about the A-team and the project I am involved in. The A-team is the nickname of the "Automation and Tools team", a team of engineers focused on improving the quality and productivity of engineering at Mozilla. For me, it's an amazing experience to work with the people there and learn from them. There is a lot of awesome stuff there for automated testing and performance display, and my GSoC project is about rewriting SETA, a tool for finding extraneous tests. Some more detail is here. My mentors are jmaher and armenzg.

# Relative entropy and mutual information

Consider some unknown distribution p(x), and suppose that we have modelled it using an approximating distribution q(x). If we use q(x) to construct a coding scheme for the purpose of transmitting values of x to a receiver, then the average additional amount of information required to specify the value of x as a result of using q(x) instead of the true distribution p(x) is given by:

$KL(p||q) = -\int p(x)\ln q(x)\,dx - \left(-\int p(x)\ln p(x)\,dx\right) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx$

This is known as relative entropy, or Kullback-Leibler (KL) divergence. For a discrete variable you can also define it as

$KL(p||q) = \sum_{x \in X} p(x) \cdot \log\frac{p(x)}{q(x)}$

We can draw some conclusions here:

1. The value of KL is zero if p(x) and q(x) are exactly the same function.
2. The larger the difference between p(x) and q(x), the bigger the relative entropy; as the difference gets smaller, the relative entropy decreases.
3. If p(x) and q(x) are distribution functions, the relative entropy can be used to measure the difference between them.

One thing to point out is that relative entropy is not a symmetrical quantity, that is to say $KL(p||q) \neq KL(q||p)$.

Now consider the joint distribution between two sets of variables x and y given by p(x, y). If the sets of variables are independent, their joint distribution factorizes into the product of their marginals, p(x, y) = p(x)p(y). If the variables are not independent, we can gain some idea of whether they are 'close' to being independent by considering the KL divergence between the joint distribution and the product of the marginals:

$I[x, y] = \sum_{x \in X,\, y \in Y} p(x, y)\log\frac{p(x, y)}{p(x)p(y)}$

or, equivalently, $I(X; Y) = H(X) - H(X|Y)$.

# Start from Information Theory

It has been a long time since the last update; part of the reason is that I need to work on my postgraduate paper, but, well, I'm a lazy man anyway 😛 Recently I've been reading *Pattern Recognition and Machine Learning* and *The Beauty of Mathematics*, which made me want to write some things down to help me understand them better 🙂

The first thing I want to talk about is information theory. If you need to predict whether an event will happen, the most straightforward way is to use historical data to get its probability distribution p(x) over the values x. So, if we now want to evaluate the information content of x, we should find a quantity h(x) that is a monotonic function of the probability p(x) and expresses that information content. The way to find h(x) is this: take two events x and y that are unrelated to each other; then the information gained from observing both of them should be the sum of the information gained from each of them separately.
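The definitions above are easy to check numerically. Here is a small Python sketch (illustrative only, not part of SETA) computing the discrete KL divergence and mutual information; the function names are mine:

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence: sum over x of p(x) * log2(p(x)/q(x))."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def mutual_information(joint):
    """I[x, y] = sum p(x,y) * log2(p(x,y) / (p(x)p(y))) over a joint table."""
    px = [sum(row) for row in joint]               # marginal of x (rows)
    py = [sum(col) for col in zip(*joint)]         # marginal of y (columns)
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]
print(kl_divergence(p, p))                       # 0.0: identical distributions
print(kl_divergence(p, q), kl_divergence(q, p))  # positive, and asymmetric

# independent variables: joint equals the product of marginals, so I = 0
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))
# perfectly correlated binary variables: I = 1 bit
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))
```

Note that KL(p||q) and KL(q||p) come out different, matching conclusion that relative entropy is not symmetric.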
So h(x, y) = h(x) + h(y) and p(x, y) = p(x)p(y), from which we get $h(x) = -\log_{2}p(x)$. You can see that h(x) is actually measured in "bits".

Now, suppose that a sender wishes to transmit the value of a random variable to a receiver. The average amount of information that they transmit in the process is obtained by taking the expectation of h(x) with respect to the distribution p(x):

$H[x] = -\sum_{x}p(x)\log_{2}p(x)$

This important quantity is called the entropy of the random variable x. Consider a random variable x having 8 possible states, each of which is equally likely. In order to communicate the value of x to a receiver, we would need to transmit a message of length 3 bits, and the entropy is $H[x] = -8 \times \frac{1}{8}\log_{2}\frac{1}{8} = 3\text{ bits}$. Furthermore, if the 8 possible states {a, b, c, d, e, f, g, h} are not equally likely, we can give the more probable states shorter codes, such as 0, 10, 110, 1110, 111100, …

Now that we have the idea of entropy, let's look at another kind: differential entropy. If we discretize a continuous variable x into bins of width $\Delta$, the discrete entropy picks up an extra term $-\ln\Delta$, which diverges in the limit $\Delta \rightarrow 0$. For a density defined over multiple continuous variables, denoted collectively by the vector x, the differential entropy is given by

$H[x] = -\int p(x)\ln p(x)\,dx$

which reflects the fact that specifying a continuous variable very precisely requires a large number of bits.

Suppose we have a joint distribution p(x, y) from which we draw pairs of values of x and y. If a value of x is already known, then the additional information needed to specify the corresponding value of y is given by $-\ln p(y|x)$. Thus the average additional information needed to specify y can be written as

$H[y|x] = -\iint p(y, x)\ln p(y|x)\,dy\,dx$

which is called the conditional entropy of y given x.
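The entropy formula is easy to play with in code. A tiny Python sketch (illustrative, not from the book) computing entropy in bits:

```python
import math

def entropy(probs):
    """H[x] = -sum p(x) * log2(p(x)), in bits; terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 8 equally likely states need 3 bits, matching the example above
print(entropy([1/8] * 8))  # 3.0

# a skewed distribution over the same 8 states has lower entropy,
# which is why shorter codes for likelier states pay off on average
print(entropy([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]))  # 2.0
```

The skewed probabilities here are my own illustrative choice; the point is only that unequal probabilities push the entropy below 3 bits.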
It is easily seen, using the product rule, that the conditional entropy satisfies the relation $H[x, y] = H[y|x] + H[x]$.

# Summary of the SETA rewrite

It's been three months since GSoC started and it has come to an end. First of all, I greatly appreciate my mentors (jmaher and armenzg). I have learned tons of things from them, and it would have been impossible to get here without their help. I also must say thank you to dustin, martiansideofthemoon, and everyone who has helped me during these past three months. I really enjoyed working with the people in the A-team and the Mozilla community, and I believe I will keep being involved as long as possible!

This GSoC project involved several parts:

1. Refactor SETA's code and make it robust
2. Make SETA work on Heroku
3. Reduce database size and use data from other systems
4. Integrate with the Gecko decision task

For code refactoring, we fixed many existing pep8 errors and added flake8 support in #PR87. We also verified and removed some redundant packages from requirements.txt because we no longer needed them. Furthermore, in order to make the codebase more readable and easier to test, we started using SQLAlchemy for database operations instead of embedded SQL statements, and added some tests for it in #PR109. SQLAlchemy turned out to be a fantastic choice for us: not only does it make tests easier to write, it also gives us faster database querying and storing. In #PR84, we used fetch_json instead of the pushlog endpoint, which makes things cleaner. As a bonus, we fixed insert key errors in #PR92 and added a failure column in #PR82. We also made SETA display job results appropriately in #PR90. In #PR91, we made linux64 debug jobs visible on SETA. At the moment this is useful not only for linux64 debug jobs but also for other job data that comes from Taskcluster.

The second part of this project was making SETA work on Heroku; all the related PRs are included in the MikeLing/ouija-rewrite branch.
First of all, we needed to migrate our database from MySQL to PostgreSQL (the default database on Heroku), and things became much easier after we switched to SQLAlchemy [PR]. Secondly, we needed to make updatedb.py and failures.py (the two scripts we use to update our database and store our analysis results) run automatically [PR]. Then we added a staging server for SETA, which can do pre-deployment validation, as is done in Treeherder, and avoid accidentally breaking something on the target server. I must say thank you to armenzg again, because he gave me a lot of help on this and helped me fork the repo to the staging server. We couldn't have gotten things working well on Heroku without Armen's help 🙂

The next step was about reducing database size and using data from another system. In #PR88, we made SETA store only high value jobs instead of low value jobs (because we only need around 165 high value jobs, while there are about 2000 low value jobs) and store 90 days of data instead of 180 days in our database. As Joel said, it's a big win for reducing our database size :). In #PR89 we got rid of 'logfile' in the testjobs table because it wasn't being used in the analysis of failures. Then, in #PR93 and #PR100, we started using the runnable API instead of the uniquejobs table and cached it locally as runnablejobs.json. This allows us to query all job types and related information more accurately and in real time. As a bonus, we used underscore.js to simplify our JavaScript and make it more readable. Other related PRs are #PR112, #PR99, #PR105, and #PR106.

The final piece of this project was to integrate with the Gecko decision task. On the server side, we separated Taskcluster jobs from Buildbot jobs and started listing all low value jobs to ensure that we run brand new jobs by default, in #PR101 and #PR113.
TaskCluster can query the low value job list from the server side and create a decision task based on it (you can check it out at http://seta-dev.herokuapp.com/data/setadetails/?taskcluster=1). We also found a way to identify new jobs from runnablejobs.json and remove expired jobs from preseed.json in #PR112. In bug 1287018, we are trying to figure out how to make TaskCluster use data from SETA to schedule tasks, and I have committed several patches for it. The Gecko decision task is the vital part of our task scheduling, and a lot of things still need to be discussed there. This is now a stretch goal for this project, and I will keep working on it after the GSoC work period :).

# SETA rewrite: use SQLAlchemy instead of hardcoded SQL

Since we have already made SETA support Taskcluster, the next thing we should do is make the code more robust and readable; that is, we should add more tests and documentation for it. The first step, before we start writing tests, is to stop using hardcoded SQL in SETA. For example:

```python
sql = """insert into testjobs (slave, result, duration, platform,
                               buildtype, testtype, bugid, branch, revision,
                               date, failure_classification, failures)
         values ('%s', '%s', %s, '%s', '%s', '%s',
                 '%s', '%s', '%s', '%s', %s, '%s')""" % \
      (slave, result, duration, platform, buildtype, testtype,
       bugid, branch, revision, date, failure_classification,
       ','.join(failures))
```

This makes it hard to write tests, and it caused some bugs when we needed to switch databases between environments (because we have the master MySQL branch and the Heroku PostgreSQL branch). Therefore, the solution I found is to use SQLAlchemy, which is a powerful and useful database toolkit for Python. You can find more information on its official website.
Alright, to use SQLAlchemy in SETA, we first make it connect to the database by creating an engine for it:

```python
engine = create_engine('mysql+mysqldb://root:root@localhost/ouija2', echo=False)
```

The standard calling form is to send the URL as the first positional argument, usually a string that indicates the database dialect and connection arguments. The string form of the URL is dialect[+driver]://user:password@host/dbname[?key=value..], where dialect is a database name such as mysql, oracle, postgresql, etc., and driver is the name of a DBAPI, such as psycopg2, pyodbc, cx_oracle, etc. Alternatively, the URL can be an instance of URL.

We also need to create some database tables for our new database. Classes mapped using the Declarative system are defined in terms of a base class which maintains a catalog of classes and tables relative to that base; this is known as the declarative model. We can use MetaData to collect all the models we define and bind it to our database.

```python
class Seta(MetaBase):
    __tablename__ = 'seta'

    id = Column(Integer, primary_key=True)
    jobtype = Column(String(256), nullable=False)
    date = Column(DateTime, nullable=False, index=True)

    def __init__(self, jobtype, date):
        self.jobtype = jobtype
        self.date = date
```

We're now ready to start talking to the database. The ORM's "handle" to the database is the Session. When we first set up the application, at the same level as our create_engine() statement, we define a Session class which serves as a factory for new Session objects, and we use it to query and operate on the database. In config.py:

```python
Session = sessionmaker(engine)
session = Session()
```

In failures.py (the file where we need the database operations):

```python
session.query(Seta).filter(Seta.date == date).delete(synchronize_session='fetch')
```

That's it! You can read more detail in this PR. The next step for the SETA rewrite is to write more tests for it!
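Putting the pieces above together, here is a self-contained sketch of the same flow. I've swapped the MySQL URL for an in-memory SQLite one so it runs anywhere, and used SQLAlchemy's declarative_base in place of the MetaBase used in SETA; otherwise the model mirrors the Seta class above:

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Seta(Base):
    __tablename__ = 'seta'
    id = Column(Integer, primary_key=True)
    jobtype = Column(String(256), nullable=False)
    date = Column(DateTime, nullable=False, index=True)

# in-memory SQLite stands in for the MySQL/Postgres URL from the post
engine = create_engine('sqlite://', echo=False)
Base.metadata.create_all(engine)       # emit CREATE TABLE for all models
Session = sessionmaker(bind=engine)
session = Session()

day = datetime(2016, 8, 1)
session.add(Seta(jobtype='linux64,debug,mochitest-1', date=day))
session.commit()
print(session.query(Seta).count())     # 1

# the same bulk delete used in failures.py
session.query(Seta).filter(Seta.date == day).delete(synchronize_session='fetch')
session.commit()
print(session.query(Seta).count())     # 0
```

The nice part, as the post says, is that the same code runs unchanged against MySQL, Postgres, or SQLite; only the engine URL differs.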
# The first part of the SETA rewrite is finished

With the end of the GSoC midterm evaluation, I finally remembered the blog and the weekly update requirement. I was busy working on [making SETA use the runnable API](https://github.com/mozilla/ouija/pull/93), so I left the blog aside for a while. However, with most of the SETA rewriting work finished now, I think it's a good time to pick this up and write something.

The first thing I would like to mention here, of course, is that we now use the runnable API instead of the uniquejobs table! Thank you Kalpesh and armenzg! I couldn't have made this happen without your help! 🙂

The uniquejobs table in SETA held all the jobtypes available on Buildbot (we were unable to recognise Taskcluster jobs before we used the runnable API). The jobtypes were stored in the table as three-element lists like ['build platform', 'build type', 'test type']. Here is a piece of data in the table:

But the table was created manually via the '/data/createjobtype' endpoint before SETA used it, which made it somewhat hardcoded and caused some bugs (e.g. this PR fixes the linux64 display issue). Therefore, we needed the runnable jobs API to replace it. The runnable API can return both Buildbot jobs and Taskcluster jobs as JSON, so we can simply extract the corresponding pieces of data and assemble them into the job information we want. Here is the PR. And SETA can show more job data than before now:

The next step is to move on to the task decision tree and make it work with SETA data. I will write up a description of the task decision tree as I understand it. Sorry again for the late blog update; I will try my best to update once a week.

# How SETA works

Even though the work period has been underway for a while, I was still confused about how SETA works and how we find those high value jobs (such a shame :(). So jmaher (my mentor) decided to have a voice meeting with me and walk me through all these ideas. I'm very grateful Joel made it happen! The whole meeting was fun and I learned a ton of things from it!
In case I forget these ideas in the future, I would like to write them down and sort through them again here 🙂

The core idea of SETA is to find the high value jobs and discard the low value jobs. But how do we determine whether a job is a necessary one (a high value job) or not so important? The answer is regressions. Even though there may be a lot of failures in a push, we only care about the failures fixed by a commit. So the "baseline" here becomes: **we only need jobs that can catch a regression**.

But how does SETA actually do that? SETA builds a list of all root-cause failures in a hash table, with the revision as the key and all failed jobs as an array (under the weighted_by_jobtype mode), like:

```python
# we have 18 failures in total
failures = {
    'rev1':  [job1, job2, job3, job4, job5],
    'rev2':  [job1, job2, job3, job4, job5, job6, job7],
    'rev3':  [job4, job5, job6],
    ...
    'rev18': [job1, job3, job5],
}
```

Then, for each job type, we temporarily remove it from the failure lists and check whether we still detect all the failures:

```python
# remove job1 -> total_failures = 18
temp_failures = {
    'rev1':  [job2, job3, job4, job5],
    'rev2':  [job2, job3, job4, job5, job6, job7],
    'rev3':  [job5, job6],
    ...
    'rev18': [job3, job5],
}

# remove job2 -> total_failures = 18
temp_failures = {
    'rev1':  [job3, job4, job5],
    'rev2':  [job3, job4, job5, job6, job7],
    'rev3':  [job5, job6],
    ...
    'rev18': [job3, job5],
}

# remove job3 -> total_failures = 18
temp_failures = {
    'rev1':  [job4, job5],
    'rev2':  [job4, job5, job6, job7],
    'rev3':  [job5, job6],
    ...
    'rev18': [job5],
}

# remove job5 -> total_failures = 17!
temp_failures = {
    'rev1':  [job4],
    'rev2':  [job4, job6, job7],
    'rev3':  [job6],
    ...
    'rev18': [],   # <-- empty!
}
```

Oh! Now there is no job left in the 'rev18' list, and total_failures has become 17. This is the "we can't detect all failures" moment, and we call job5 a necessary job, or high value job. Then we add job5 to the master_root_cause job list and repeat until the whole active_job list has been iterated.
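My understanding of the walkthrough above can be sketched as a small greedy loop. The names here (find_high_value_jobs, active_jobs) are mine, and the real SETA implementation differs in detail, but the logic follows the example:

```python
def find_high_value_jobs(failures, active_jobs):
    """Greedily drop jobs whose removal still leaves every failure detectable.

    failures: dict mapping revision -> list of job names that failed there.
    Returns the jobs we must keep (the "high value" jobs).
    """
    total = len(failures)              # every failing revision must stay detectable
    removed = set()                    # low value jobs dropped so far
    high_value = []
    for job in active_jobs:
        candidate = removed | {job}
        # a revision is still detected if at least one of its failed jobs remains
        detected = sum(1 for jobs in failures.values()
                       if any(j not in candidate for j in jobs))
        if detected == total:
            removed.add(job)           # low value: nothing is missed without it
        else:
            high_value.append(job)     # removing it would hide a regression
    return high_value

failures = {
    'rev1':  ['job1', 'job2', 'job3', 'job4', 'job5'],
    'rev2':  ['job1', 'job2', 'job3', 'job4', 'job5', 'job6', 'job7'],
    'rev3':  ['job4', 'job5', 'job6'],
    'rev18': ['job1', 'job3', 'job5'],
}
active_jobs = ['job1', 'job2', 'job3', 'job4', 'job5', 'job6', 'job7']
print(find_high_value_jobs(failures, active_jobs))  # ['job5']
```

On this toy data, job5 is the only job whose removal leaves 'rev18' undetected, so it is the one high value job, just as in the walkthrough.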
# SETA rewrite: database migration

Before I write the GSoC note, I really want to say this week was really messed up. My health has been like a roller coaster: my head felt heavy at times, and I got an allergy (maybe). I can barely remember when I last had a skin allergy (maybe when I was 6 or 7 years old), and I don't think seeing a doctor is a good choice, because the only thing they can do is give you some anti-allergy medicine, which makes me dizzy all day 😛 Whatever, let's flip to the next page.

Warning: due to my poor SQL knowledge, I have little idea how to describe this weekend's work. I kept making mistakes on very basic, simple things, and it took me a lot of time to figure them out. So this post may look like a patchwork. I'm sorry about that. 😦

## "If Time Could Roll Back, ……"

Alright, this title is all I thought about during the migration work. "If time could roll back, I really should have focused in my database class." "If time could roll back, I really should have looked into it more before asking that dumb question." Etc.

First of all, once I could check the Heroku logs for SETA, I found we had an error message like:

Error R10 (Boot timeout) -> Web process failed to bind to $PORT within 60 seconds of launch

After I googled it, I found this error occurs because I use a 'web' dyno to boot my application, and a "web" type of application MUST listen on some port. So we capture $PORT from the environment variables when booting the server. Now Ouija can 'run' on Heroku. But that's far from enough.
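For reference, the fix boils down to reading the port Heroku injects rather than hardcoding one. A minimal sketch (Ouija's actual server setup may differ):

```python
import os

# Heroku tells a 'web' dyno which port to bind via the PORT environment
# variable; binding a hardcoded port is what triggers the R10 boot timeout.
# 8157 here is just a hypothetical local-development fallback.
port = int(os.environ.get('PORT', 8157))
print(port)
```

The server then binds to this port (and host 0.0.0.0) instead of a fixed value.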

Problems kept showing up when I moved on to the database part. The first one was that I didn't know how to use it on Heroku 😛 Everything looked good to me, except that no data was actually being written into the database! I spent a long time figuring out why I could connect and run SQL commands against the database, yet no data was written afterwards. Finally, I found I had forgotten to 'commit' after executing the SQL commands (Boom!)
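The same "forgot to commit" trap exists in any DB-API driver, not just psycopg2. Here is a quick sketch reproducing it with Python's built-in sqlite3 module (the table name echoes the post; the rest is illustrative):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.db')

writer = sqlite3.connect(path)
writer.execute('create table testjobs (id integer primary key, branch text)')
writer.commit()

# insert WITHOUT committing: a second connection sees nothing
writer.execute("insert into testjobs (branch) values ('mozilla-inbound')")
reader = sqlite3.connect(path)
print(reader.execute('select count(*) from testjobs').fetchone()[0])  # 0

# after commit, the row becomes visible to other connections
writer.commit()
print(reader.execute('select count(*) from testjobs').fetchone()[0])  # 1
```

By default, DB-API connections open an implicit transaction on write, and nothing is durable until commit() is called.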

OK, we can import data into the database now… oh wait, not yet. Errors kept showing up when I used failures.py to update jobtypes. The reason is that in Postgres we can't store a string array as a varchar (which is what we did in MySQL). PostgreSQL, being an object-relational database, has many data types that MySQL doesn't. For example, we stored 'jobtype' (e.g. ['android-4-3-armv7-api15', 'debug', 'crashtest-10']) as a varchar in MySQL, but Postgres has an 'array' type, so we can simply define the column as a text array. Furthermore, it told me psycopg2.ProgrammingError: no results to fetch on this line. The documentation says a ProgrammingError is raised if the previous call to execute*() did not produce any result set, or no call was issued yet, but the same code was fine when we used MySQLdb 😛
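As a sketch of what that text-array column could look like with SQLAlchemy (a hypothetical, cut-down testjobs table; the DDL is compiled for the Postgres dialect, so no server is needed to see it):

```python
from sqlalchemy import Column, Integer, MetaData, Table, Text
from sqlalchemy.dialects import postgresql
from sqlalchemy.schema import CreateTable

metadata = MetaData()

# On Postgres, 'jobtype' can be a real text array instead of the
# comma-joined varchar we had to use on MySQL.
testjobs = Table(
    'testjobs', metadata,
    Column('id', Integer, primary_key=True),
    Column('jobtype', postgresql.ARRAY(Text)),
)

ddl = str(CreateTable(testjobs).compile(dialect=postgresql.dialect()))
print(ddl)  # the jobtype column comes out as TEXT[]
```

With this column type, a Python list like ['android-4-3-armv7-api15', 'debug', 'crashtest-10'] round-trips directly, with no joining or splitting of strings.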

After resolving all the problems mentioned above, we can visit Ouija on Heroku now (\o/)