Thesis topics - ideas.
Title: Dynamic Analysis of Large-Scale Programs
Thesis: Many kinds of static and dynamic analyses have been explored in research. However, these analyses are usually run on small or mid-sized benchmark programs. We conjecture that interesting concurrency properties arise in large-scale software. To explore this, we propose building a dynamic analysis framework that scales well to large programs, for discovering and exploiting novel concurrency properties.
Qualifying Examination Committee (Thesis Proposal):
Cormac
Steve: gone Thanksgiving week, and OSDI is Dec 8-10
Scott
Jim (Chair): should be ok, but best before quarter ends
Qualifying Examination Time, Date and Place
Dissertation Reading Committee:
Dissertation Proposal (Table of Contents)
10 ~ 15 pages of prior work and strategies for success, w/ slide presentation
What the problem is
Concurrent code is hard to construct correctly
What other people have done
Proof techniques
Hoare logic and predicate transformer semantics
Owicki-Gries reasoning: Hoare logic w/ interleavings
Rely-Guarantee reasoning: Hoare logic w/ environment
Separation logic: Hoare logic w/ heap
Lamport's TLA: temporal logic assertions and rely-guarantee reasoning
Reduction theory: commute-based equality on traces
Model checking
Thread-modular reasoning
A Calculus of Atomic Actions: Iterative abstraction and reduction
Static analyses
Find deadlocks
Find data races
RacerX: Lockset analysis that is static, flow-sensitive, interprocedural
Find atomicity violations
Flanagan & Qadeer 1): If the program type-checks, it is atomic
Type-based approaches
Annotation-based approaches
ESC/Java
Annotation inference: Houdini, Daikon
Pattern-based approaches
FindBugs
Tools
Klocwork K7
Dynamic analyses
Find deadlocks
Find data races
Happens-before absence: vector clocks, precise
Eraser algorithm: Lock set algorithm, lossy
Goldilocks: Fast and precise lock set algorithm
RaceTrack: “Adaptive” lock set algorithm
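To make the lockset idea above concrete, here is a minimal sketch of Eraser's core refinement rule: each shared variable's candidate lockset starts as "all locks" and is intersected with the locks held at each access; an empty set flags a potential race. This is an illustrative toy, not any real tool's implementation, and all names in it are mine.

```python
# Sketch of Eraser's lockset refinement rule: C(v) := C(v) ∩ locks_held(t),
# warn when C(v) becomes empty. Illustrative only; names are hypothetical.

class LocksetChecker:
    ALL = None  # sentinel meaning "every lock" before the first access

    def __init__(self):
        self.candidates = {}  # variable name -> candidate lockset

    def access(self, var, held_locks):
        """Record an access to `var` with `held_locks` held.
        Returns True if a potential race is reported (empty lockset)."""
        held = frozenset(held_locks)
        c = self.candidates.get(var, LocksetChecker.ALL)
        c = held if c is LocksetChecker.ALL else (c & held)
        self.candidates[var] = c
        return len(c) == 0  # no single lock consistently protects var
```

This also shows why the approach is lossy in the sense noted above: accesses ordered by other means (e.g., fork/join) still shrink the lockset and can produce false positives, which the precise vector-clock approach avoids.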
Find atomicity violations
Atomizer
Stoller & Wang
Velodrome
Find determinacy violations
SingleTrack
Model checking: Check that program assertions hold
Java Pathfinder
CHESS
Tavla
Tools and frameworks
RoadRunner
Valgrind
Categorization of concurrency errors
Properties: DRF, deadlock-free, starvation-free, liveness, linearizability, atomicity, determinacy, immutability, etc.
Idioms: double-checked locking, objects escape local scope, wait gets lost, etc.
Other properties, or stuff to shoe-horn in
Linearizability: Composable correctness condition for concurrent objects
Immutability: Intrinsically thread-safe objects
Transactional memory: Atomicity baked into language semantics
Cilk: Fork-join parallelism
Map-reduce (fork-join), functional parallel programming
What I plan to do
“Here is something nobody has done before.”
Hypotheses of what I will find, to guide structure of exploration
E.g., what kind of things do I expect to find?
One way to guide discovery is to learn properties about running programs.
Patterns of liveness? Patterns of communication? How often, how much, and with which resources/threads do threads interact?
Server running patterns should prove interesting to examine. PRIOR ART in this?
Publication strategy
Since the work is heavily code-based, it may take much effort to find the interesting bits - may not get to publish as much
What conferences to target?
PLDI, POPL, OOPSLA, SOSP, OSDI
CFP deadlines?
Others?
Document a taxonomy of errors while exploring large code bases
Do properties in small programs scale to large programs?
What errors are introduced at the large scale?
Why I think this is a good idea
“There is a lot of future work to be done here.”
Eclipse under RoadRunner
Sor barrier code
Jigsaw
What I have done to prepare
Dynamic atomicity analysis (Velodrome): PLDI 2008
Critical path analysis: Performance analysis, class project
Direct-execution model checking (Tavla): writing up as master's project
Meta-questions
Do I understand the area well enough to identify a problem and potential solutions?
Do I have an actual problem that is real, worth solving, solvable, an unmet need in the literature?
Do I have an idea of how to go about solving the problem?
The problem is lack of knowledge in this area
By the end, we should have a better idea of the area
Have 4 or 5 things that we can expect to gain by the end of my research
E.g., patterns of concurrency for large-scale programs
Better understanding of errors that come out, comparison with small-scale errors
Concrete tools for analysis and tools to correct
Preliminary results very good for being convincing, at this stage
Show some difference in properties between small and large scale programs, even just one or two
The exam is just as much about the problem as it is about my preparedness. Is the problem worth spending a couple of years' worth of effort exploring? Does it have real-world relevance?
Old notes
Dynamic analysis of multithreaded code
Extending Velodrome
Introduce generality without losing completeness
Implement Velodrome instrumentation in Haskell, use QuickCheck to ferret out bugs
False positives - correctness condition to deal with the other 90%?
Interaction with scheduler
Angelic schedulers for production, demonic schedulers for debugging
Angelic schedulers would avoid deadlocks, avoid race conditions, and execute code in a sequentially consistent (or atomic) manner, even on architectures with weak memory models. It can do this by reasoning about the programming language's synchronization constructs, and about the architectural memory model too. The additional constraints/overhead might induce a performance hit, but hopefully not too much.
Such a scheduler would need to know some things about PL, and some guiding heuristics. What are these?
A principled approach is desired, really.
Essentially introducing limited clairvoyance to the scheduler.
Preempt on compiler-injected yields for efficiency?
The scheduler could own a particular core for computations and such.
Can we systematically avoid race conditions? Either serialize a method, or given a read, delay another read until the next write in the same method.
Essentially have the scheduler interpret the code in a desired way.
What can be guaranteed by this approach?
We'd like a compositional way of dealing with yield points.
How is this approach different from Grace 4)?
Grace incorporates STM and process VM isolation in a runtime environment to provide sequential semantics with respectable performance degradation. However, this approach does not work for programs that are “inherently concurrent” 5).
Angelic scheduling, and its dual, demonic scheduling, would incorporate programming-language reasoning in the runtime to provide correctness guarantees and quick debugging on programs with synchronization constructs that Grace cannot handle.
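The angelic/demonic distinction can be sketched as a toy interleaving explorer: each thread is a list of small steps on shared state, an angelic scheduler picks an interleaving that preserves an invariant, and a demonic one picks a counterexample. This is a conceptual sketch only (a real scheduler would not enumerate exhaustively), and every name in it is mine.

```python
from itertools import combinations

# Toy angelic vs. demonic scheduling over the classic non-atomic increment:
# each thread runs "t := x; x := t + 1". Illustrative sketch only.

def interleavings(a, b):
    """Yield every merge of step lists a and b preserving per-thread order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        out, ai, bi, pos = [], iter(a), iter(b), set(positions)
        for i in range(n):
            out.append(next(ai) if i in pos else next(bi))
        yield out

def run(schedule):
    state = {"x": 0, "t1": 0, "t2": 0}
    for step in schedule:
        step(state)
    return state

def make_thread(tmp):
    read = lambda s: s.__setitem__(tmp, s["x"])       # t := x
    write = lambda s: s.__setitem__("x", s[tmp] + 1)  # x := t + 1
    return [read, write]

def find_schedule(want_ok):
    """Angelic (want_ok=True): a schedule where x == 2 at the end.
    Demonic (want_ok=False): a schedule breaking that invariant."""
    for sched in interleavings(make_thread("t1"), make_thread("t2")):
        x = run(sched)["x"]
        if (x == 2) == want_ok:
            return x
    return None
```

The angelic query finds a serial schedule (final x == 2), while the demonic query finds a racy interleaving where both reads happen before either write (final x == 1); the "limited clairvoyance" idea above amounts to making such good/bad-trace choices online instead of by enumeration.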
What other prior work is there? Surely somebody must have done something like this before.
Typical concerns of schedulers include fairness, efficiency, support for real-time deadlines, etc.
Low-overhead back-end scheduling on top of JVM
Scheduling good traces automatically, as opposed to manual synchronization
Only acquire locks, and steal locks as needed
Run in parallel, except where we think there is racy code
Auto-add barriers to model-checked code, if you know something bad will happen
Using the RBED 7) scheduler's hard real-time guarantees for performance isolation
QuickCheck-ish ease of specifying and testing 8) 9)
Quickly generate 100 traces which might break invariants and postconditions, based on syntactic content, and execute them
Requires a priori annotation of code… hmm. Perhaps postulate background invariants, like “methods are atomic”, “globals are protected by some lock consistently”, etc.
Onus would be on the programmer to show that code is correct, by either fixing code (add synchronization statements) or adding annotations (like “I know this is the only method which writes to x”)
Continuum of specification/testing:
Restoring “continuity” to programs
Hypothesis - testing is enough for most programs
Correctness idioms in multithreaded code
Memory model issues
Multithreaded specification
“On Compositionality and Its Limitations” 15) shows that if your logic can express EG p (“there is a path such that all the nodes of the path have the property p”), then parallel composition fails, i.e., the total system specs cannot be verified by independently verifying component specs.
Related papers: Conjoining Specifications 16), Modular verification of multithreaded programs 17)
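For reference, the modality at issue, written in standard CTL notation (a sketch of the textbook definition, not taken from the cited paper):

```latex
% EG p holds at a state s iff some (infinite) path from s satisfies p everywhere:
s \models \mathrm{EG}\,p
  \;\iff\;
  \exists\, \pi = s_0 s_1 s_2 \cdots \text{ with } s_0 = s
  \text{ such that } \forall i \ge 0:\; s_i \models p
```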
The onus should be on the added code to show/prove it interacts correctly with the existing code in a multithreaded way. This looks to the past. There is a notion of monotonicity that could be explored.
Looking to the future, one would also like to show/prove that it will interact correctly with anticipated code, that falls under reasonable assumptions like race-freedom, etc.
Do any current specification techniques deal with time and existing modules?
Assume/Guarantee reasoning deals with interference by positing an environment predicate to be satisfied.
ESC/Java is a “logical truth maintenance system” about specifications based on first-order logic. Perhaps base the specification on non-monotonic logic 18)?
It may be a better way of dealing with the closed-world assumption of thread-safety properties.
The specification problem involves tweaking specs based on code changes, and thus the set of conclusions that can be drawn may change as the specs are modified. Nonmonotonic reasoning may be better able to capture the logic of this refinement process.
“Aspect-oriented” specification, where certain invariants are posited in a “cross-cutting” way? 19)