CS647: Distributed Systems
General Information
Course Coordinator(s): | Dr. Colin S. Gordon |
---|---|
Instructor Contact Information (phone, email, website): |
csgordon@drexel.edu https://www.cs.drexel.edu/~csgordon/ |
Office Hours, Location, Mailbox: | Office hours by appointment, in 3675 Market Street Office 1137 for those near campus, or online for remote students. If you have anything you'd like to discuss, please do not hesitate to schedule an appointment! It really is not an imposition. There are no recurring hours because historically office hours associated with evening classes have not been well-attended because of student scheduling conflicts, and the presence of recurring hours tends to discourage students who can't make them from reaching out. |
Student Learning Information
Course Description
In-depth discussion of fundamental concepts of distributed computer systems. Covers development techniques and runtime challenges, with a focus on reliability and system validation techniques. Subjects discussed include: interprocess communication, remote procedure calls and method invocation, middleware, distributed services, coordination, transactions, replication and weak data consistency models. Significant system-building term project in Java or similar language.
Course Purpose within a Program of Study
Within the revised MSSE program, this will serve as one of 6 possible CS electives (3 required along with 3/6 IS electives) to provide broader knowledge knowledge of software engineering. The MSSE degree emphasizes modern practices and techniques to produce reliable software that functions as desired, in a timely manner. Distributed systems are an increasingly important domain and more software systems move to shared cloud infrastructure.
Within the CS PhD program, the course will serve as an elective suitable for any graduate student, but particularly for those with research interests in software engineering, systems, or programming languages.
Statement of Expected Learning
The course objectives are to:
- Teach students the core aspects of distributed systems that make their design and implementation more challenging than sequential software running on a single machine
- Communicate core theoretical results (outcomes, not necessarily proof) establishing hard limits on what is possible in distributed systems (e.g., CAP and its limitations, FLP)
- Teach students how weakening notions of correctness (specifically, notions of serializability and consistency) can help programs achieve fault tolerance
- Prepare students to validate running distributed systems
As learning outcomes, students completing this course should be able to:
- Understand how independent network and machine failure conspire to make reliable distributed systems difficult to achieve
- Articulate and understand the trade-offs between different well-known data consistency models
- Be able to design a distributed system for a specified level of fault tolerance
- Understand how to reason about the execution of a distributed system
- Understand the basic concepts of how to validate a running distributed system
- Apply cutting edge tools to investigate the behavior of systems under failure
- Understand practical design considerations for some modern distributed systems frameworks, and their limitations
- Recognize cases when distribution is not necessary
Course Materials
Required and Recommended Texts, Readings, and Resources
Required: No textbooks, instructor-selected research papers.
Recommended: Designing Data-Intensive Applications, by Martin Kleppmann. This book is optional, but will provide a nice extended discussion for much of the course material. In addition to the natural option to purchase a hardcopy, it is available in electronic form (with and without DRM, see the book's site), and via O'Reilly's Safari Books Online platform. You can access this via Drexel Libraries' subscription by going here, clicking on "Full Text Online" and signing in with your Drexel credentials.
We may also read excerpts from Concurrency Control and Recovery in Database Systems, which is freely available in PDF format from the authors.
Required and Supplemental Materials and Technologies
This course will be programming-intensive; you should expect to write a moderate amount of very challenging code. Assignments must be completed using one of the following actor frameworks:
- You may use Akka on the JVM, through either Java or Scala
- You may use Akka.NET on Mono or .NET Core, through either C# or F#
Akka.NET is a .NET port of the original JVM implementation of Akka, so both provide nearly identical interfaces. I'm happy to read and grade code in any of the four languages above, though I'm less familiar with F# than the others, so I'll be less helpful debugging F#-specific issues.
Projects in the course will require you to write full self-contained projects using a build file for a cross-platform build tool that handles fetching dependencies, compiling your code, and executing the code. On the JVM, any of the major tools is acceptable (ant
+ivy
, maven
, gradle
, or sbt
). For .NET, msbuild
via the console, or dotnet
for .NET Core projects are acceptable (update: also fake
for F# projects). Projects must build from the command line: projects that only build via an IDE (Eclipse, Visual Studio, etc.) will be penalized, though you are welcome to use whatever editors or IDEs you like for writing your code.
The JVM version of Akka has the best documentation. Akka.NET's documentation seems a bit more sparse, but since it attempts to expose nearly the same API as the JVM version, you could largely adapt the Java version of the JVM Akka documentation for C# if you needed to.
Both versions of Akka have reasonable books available to you through the Drexel Libraries:
- Akka in Action for the JVM
- Reactive Applications with Akka.NET for .NET. This book was just recently published, so it doesn't show up in the library's own catalog yet. To access it, you'll need to get into Safari Books Online differently (e.g., via the other book), then search in the catalog. You can register an account once you're there, making it possible to bookmark the book.
- For F# specifically, the book Mastering F# (also on Safari) contains a brief introduction
Assignments, Assessments, and Evaluations
Graded Assignments and Learning Activities
The course grading is focused on responses to readings, as well as short homework assignments.
Readings
Each week (except the first) you will need to respond to two research papers, no later than 6pm the day before class. Late responses are not accepted, but below there is a policy allowing you to skip a few of these during the term without penalty.
For each paper you will submit a commentary that covers your understanding of the major points and contributions of the paper, the paper’s limitations, things you might not have understood (which we can explain in class!), ways to extend or improve the paper’s work, or other problems this paper’s solutions might provide insights to. Note that not all of these questions make sense for every paper we read. I mainly want to see (a) how well you understood the main ideas from the paper, and (b) that you’ve thought about how well or poorly it might really work.
You should consider:
- A short (3-4 sentence) summary
- List the paper’s major contributions (these may or may not match what the authors claim)
- What’s the greatest technical strength in the paper?
- What are some limitations of the techniques described?
- What are some ways the paper’s results could be extended, improved, or better evaluated?
- Does this paper suggest solutions (or approaches) to other problems?
- Are there parts you didn’t understand?
Not all of these will make sense for every paper, but most of them are sensible to cover for most papers.
Note that different people will often have different takes on the same paper, disagree on whether a choice made by the authors is a strength or weakness, or find different things clarifying or confusing. This is all okay! Everyone has different backgrounds. If you are confused about some part of the paper, don’t be shy: almost certainly someone else found it confusing, too, and maybe they’re too timid to mention it. By pointing out what was confusing or difficult, we raise the opportunity to discuss it in class and help everyone understand better. I found some parts of these papers difficult or confusing the first time (or two) I read them - this is normal.
You may skip up to 4 reading responses during the term, by submitting, instead of a response, a sentence saying you are using a skip --- by the deadline. Do not email me a request, just submit it. You may do this for any 4 responses, distributed any time in the term. You may skip one response in each of four weeks, both responses in each of two weeks, or take the in-between option of skipping one week completely and half the work on two additional weeks. Assignments for those weeks will simply be omitted when calculating your reading response grade for the term. One thing to consider, though: the skip only means you do not need to write a response. If it's a paper that is useful for your homework, you may still find it beneficial to read the paper, even if you skip the response. You may apply skips retroactively; if at the end of the term you have a 0 or 1 you'd like to remove, and you have skips remaining, you can apply the skip to that response by emailing me a request in the final week of the term (I will not do this automatically). There is no bonus for having unused skips at the end of the term.
A Note on Proofs: Many of the papers you will read this term include formal proofs of correctness for an algorithm or protocol. You won't be asked to produce proofs in this class. But, you will need to understand them. Some of your homework assignments will have you implementing algorithms from these papers, and understanding the proofs of correctness will help you think about the code you write. More broadly, outside this class, some of the proofs are fundamental impossibility results. It's all well and good to understand that X is impossible, but you're rarely asked to do X that is known to be impossible. Instead, you're sometimes asked to do Y, where Y and X have some strong similarities. Sometimes Y is simpler than X in a key way that makes it possible. Sometimes Y is actually a variation on X. Understanding the proof for why X is impossible will help you recognize when you see variations on it.
Homeworks
The homeworks are tentatively on the following:
- Basic distributed systems programming model and storage consistency
- Distributed transactions
- Consensus
- Big data
The late policy for homeworks is as follows: for the term, you have 5 late days to distribute between homework assignments at your discretion, with one restriction: the final homework may not be submitted after the end of the final week of classes.
Each homework should only require a modest amount of code, but that code might be very difficult to write and debug. in addition to coding, each homework will include some kind of reflection or analysis of what you did, generally in an open-ended way.
Grading Matrix
- 50% Reading Assignment Assessments
- 50% Homework (split evenly across assignments)
Responses are graded on a scale of 0 to 2, with the possibility earn a 3/2 on any summary:
- 0 - No summary, or your summary looks like you did not read the material.
- 1 - You read the material, but essentially summarized without reflecting
- 2 - You wrote a solid, thoughtful response
- 3 - You wrote a particularly insightful response
As noted above, the grading for responses emphasizes that you have made a serious effort to understand the paper and consider its strengths, limitations, etc. The response grading does not necessarily depend on correctness of your understanding, because most of you will lack the background to fully understand all parts of the paper, which can lead to misunderstandings and lack of understanding for parts. That's okay and expected, and won't hurt your grade unless your response suggests you have very fundamental misunderstandings of what the papers are even trying to accomplish. Historically, this has only occurred when people have not actually made a serious effort to read the papers. The responses are due the day before each lecture to allow me time to read the responses, and update the lecture material to both address common or important questions, and clarify any misunderstandings I see.
The late/skip policies were described above.
In addition the late and skip policies, extensions are possible for good reason with reasonable notice. I am aware that students have jobs, family matters, paper deadlines for their PhD, etc., which can interfere with completing assignments. I want your grade to reflect your mastery of the material and quality of work you hand in, not whether or not you were fortunate enough to avoid major life events during the term. If something comes up during the term, let me know. If it's unexpected (e.g., you end up in the ER when you were planning to work on coursework), let me know when you can and we'll figure it out. If it's something you know about in advance (e.g., you must travel for work), let me know as soon as you know, and we can discuss whether we should give you an extension on an assignment. I reserve the right to request supporting evidence for your stated need for an extension (only so far as justifying the existence of a good excuse; e.g., I might ask for a note confirming existence of a health issue interfering with attendance or assignment completion, but I don't need to know the details of the particular health issue).
Attendance
If you're in the in-person section: Drexel's stated policy is that course attendance is mandatory for students in the in-person section. I will not take attendance explicitly every class, because it's tedious and takes away time from actual material. But I do expect you to come, and if you are absent on a regular basis it will negatively affect your term grade, beyond the grading percentages above. That said, I understand things happen (you might get sick, your car might break down). If you're in the in-person section and will miss class, please let me know.
If you're in the online section: I will work to make it possible for you to ask questions live during class, for those interested in watching live. Otherwise, enjoy the flexibility of online learning.
Academic Integrity, Cheating, and Plagiarism
The list of links at the end of the syllabus include a link to the University's academic integrity policy. If you haven't actually read it before, you should, because not meaning to plagiarize is not an excuse for plagiarism. This includes not realizing that something needed to be quoted, or being unfamiliar with the idea that paraphrased sentences still require citation (and possibly quotes), or opting to reuse someone else's words or code because you're not confident in the quality of your own.
The general idea is that you should not submit work that is not your own --- code or written prose --- that is not properly attributed. This includes, but is not limited, to things like putting direct quotes from someone else's writing in quotes and citing the source, and giving the source for small snippets of code you might have taken from StackOverflow or similar. Again, you should read the actual university integrity policy.
The University leaves the penalty for cheating, plagiarism, etc. in a course up to the professor. If you cheat in this class, I will give you an F for the term. I realize that most cheating is a consequence of poor time management, or unexpected or hard-to-manage obligations beyond the class. That is exactly why you have late days for homeworks, skips for readings, and the course has a fairly flexible extension policy - I want you to succeed, but I want you to do so honestly. If you have any doubts about whether something might cross the line into cheating, please ask me before you do it. The worst I'll say is "No, don't do that." And I'll be glad you asked. This is far better than an F for the term.
A final note: If you quote or reuse other sources (properly, with attribution) so heavily I feel like you haven't actually done the work for the assignment/response, I'll give you a 0 (because you didn't do the work). But you are allowed to take small snippets (e.g., setting up an actor, the basis for a build file) from external sources for small things that are not central to the assignment.
Grade Scale
The following scale will be used to convert points to letter grades:
Grade | |||||
---|---|---|---|---|---|
97-100 | A+ | 82-86.99 | B | 70-71.99 | C- |
92-96.99 | A | 80-81.99 | B- | 67-69.99 | D+ |
90-91.99 | A- | 77-79.99 | C+ | 60-66.99 | D |
87-89.99 | B+ | 72-76.99 | C | 0-59.99 | F |
Note that the instructor may revise this conversion if/when necessary.
Course Schedule
(This schedule is tentative and may change during the course.)Most weeks attempt to pair:
- An older/classic paper with a modern paper
- A theory paper with a systems paper
Currently the syllabus is final up to and including week 9.
Week by week:
- Introduction, Overview, Actors
- Slides
- No readings due
- Challenges and Time in distributed systems
- Slides
- A Note on Distributed Computing. Sun Microsystems Laboratories, Inc. Technical Report SMLI TR-94-29. November 1994.
- Lamport, Leslie. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7), July 1978.
- Strong Consistency
- Slides
- Chandy, K. Mani and Lamport, Leslie. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems 3(1), February 1985.
- Spanner: Google’s Globally-Distributed Database. ACM Transactions on Computer Systems 31(3), August 2013.
- Optional: Chain Replication for Supporting High Throughput and Availability
- Consensus (Paxos, Raft, etc.)
- Slides
- Ongaro, Diego and Ousterhout, John. In Search of an Understandable Consensus Algorithm. USENIX Annual Technical Conference (ATC), 2014.
- See "Ongaro PDF" link near the bottom of the page
- Paxos Made Live - An Engineering Perspective. ACM Symposium on Principles of Distributed Computing (PODC), 2007.
- CAP, FLP, and other impossibilities
- Slides
- Fischer, Michael J., and Lynch, Nancy A., and Paterson, Michael S. Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM 32(2), April 1985.
- Brewer, Eric. CAP Twelve Years Later: How the "Rules" Have Changed. IEEE Computer 45, February 2012.
- Weak and Eventual Consistency
- Slides
- Managing update conflicts in Bayou, a weakly connected replicated storage system. ACM Symposium on Operating Systems Principles (SOSP), 1995.
- Conflict-Free Replicated Data Types. Symposium on Self-Stabilizing Systems (SSS), 2011.
- Getting Things Right
- Slides
- Disciplined Inconsistency with Consistency Types. In ACM Symposium on Cloud Computing (SoCC), 2016.
- Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems. In ACM Symposium on Operating Systems Principles (SOSP), 2015.
- Large scale data storage and processing: Hadoop & Spark
- Slides
- Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
- Distributed Resource Management
- Slides
- Apache Hadoop YARN: Yet Another Resource Manager. In the 4th Annual Symposium on Cloud Computing (SoCC), 2013.
- Large-scale Cluster Management at Google with Borg. In the 10th European Conference on Computer Systems (EuroSys), 2015.
- Cryptocurrency (a.k.a., Consensus round 2)
- Slides
- Choose 2 of the following for responses:
- Bitcoin: A Peer-to-Peer Electronic Cash System
- Blockchains from a Distributed Computing Perspective. Communications of the ACM, February 2019.
- SoK: Research Perspectives and Challenges for Bitcoin and Cryptocurrencies. In IEEE Symposium on Security and Privacy (S&P), 2015.
- From Viewstamped Replication to Byzantine Fault Tolerance. Chapter from Replication: Theory and Practice, 2010. (Under the keywords look for "Download to read the full chapter text" to get just this chapter; the "Download book PDF" button in the top right downloads the entire book, courtesy of Drexel Library's subscription.)
- Practical Byzantine Fault Tolerance
Academic Policies
This course follows university, college, and department policies, including but not limited to:
- Academic Integrity, Plagiarism, Dishonesty and Cheating Policy: http://www.drexel.edu/provost/policies/academic_dishonesty.asp
- Student Life Honesty Policy from Judicial Affairs: http://www.drexel.edu/provost/policies/academic-integrity
- Students with Disability Statement: http://drexel.edu/oed/disabilityResources/students/
- Course Add/Drop Policy: http://www.drexel.edu/provost/policies/course-add-drop
- Course Withdrawal Policy: http://drexel.edu/provost/policies/course-withdrawal
- Department Academic Integrity Policy: http://drexel.edu/cci/resources/current-students/undergraduate/policies/cs-academic-integrity/
- Drexel Student Learning Priorities: http://drexel.edu/provost/assessment/outcomes/dslp/
- Office of Disability Resources: http://www.drexel.edu/ods/student_reg.html
Students requesting accommodations due to a disability at Drexel University need to request a current Accommodations Verification Letter (AVL) in the ClockWork database before accommodations can be made. These requests are received by Disability Resources (DR), who then issues the AVL to the appropriate contacts. For additional information, visit the DR website at drexel.edu/oed/disabilityResources/overview/, or contact DR for more information by phone at 215.895.1401, or by email at disability@drexel.edu.