Thesis topics - ideas.

Title: Dynamic Analysis of Large-Scale Programs

Thesis: Many kinds of static and dynamic analyses have been explored in research. However, these analyses are usually run only on small or mid-sized benchmark programs. We conjecture that interesting concurrency properties arise in large-scale software. To explore this conjecture, we propose building a dynamic analysis framework that scales to large programs, for discovering and exploiting novel concurrency properties.

Qualifying Examination Committee (Thesis Proposal):

  • Cormac
  • Steve: gone Thanksgiving week, and OSDI is Dec 8-10
  • Scott
  • Jim (Chair): should be OK, but best before the quarter ends

Qualifying Examination Time, Date and Place

  • Dec 12, 11 AM, E2 280.
  • Proposal needs to be distributed to committee by November 12!

Dissertation Reading Committee:

  • Cormac
  • Steve
  • Jim? Scott?

Dissertation Proposal (Table of Contents)

  1. 10-15 pages of prior work and strategies for success, w/ slide presentation
  2. What the problem is
    1. Concurrent code is hard to construct correctly
  3. What other people have done
    1. Proof techniques
      1. Hoare logic and predicate transformer semantics
      2. Owicki-Gries reasoning: Hoare logic w/ interleavings
      3. Rely-Guarantee reasoning: Hoare logic w/ environment (see the rule sketch after this outline)
      4. Separation logic: Hoare logic w/ heap
      5. Lamport's TLA: temporal logic assertions and rely-guarantee reasoning
      6. Reduction theory: commutativity-based equivalence on traces (see the mover sketch after this outline)
      7. Model checking
        • Abstraction, model extraction
        • Validate that temporal logic assertions hold
      8. Thread-modular reasoning
      9. A Calculus of Atomic Actions: Iterative abstraction and reduction
    2. Static analyses
      1. Find deadlocks
      2. Find data races
        1. RacerX: Lockset analysis that is static, flow-sensitive, interprocedural
      3. Find atomicity violations
        1. Flanagan & Qadeer 1): If program type-checks, it is atomic
        2. Wang & Stoller 2)
      4. Type-based approaches
      5. Annotation-based approaches
        1. ESC/Java
        2. Annotation inference: Houdini, Daikon
      6. Pattern-based approaches
        1. FindBugs
      7. Tools
        1. Klocwork K7
    3. Dynamic analyses
      1. Find deadlocks
      2. Find data races
        1. Absence of happens-before ordering between conflicting accesses: vector clocks, precise
        2. Eraser: lockset algorithm, lossy but cheap (see the lockset sketch after this outline)
        3. Goldilocks: Fast and precise lock set algorithm
        4. RaceTrack: “Adaptive” lock set algorithm
      3. Find atomicity violations
        1. Atomizer
        2. Stoller & Wang
        3. Velodrome
      4. Find determinacy violations
        1. SingleTrack
      5. Model checking: Check that program assertions hold
        1. Java Pathfinder
        2. Chess
        3. Tavla
      6. Tools and frameworks
        1. RoadRunner
        2. Valgrind
        3. Coverity 3)
    4. Categorization of concurrency errors
      1. Properties: DRF, deadlock-free, starvation-free, liveness, linearizability, atomicity, determinacy, immutability, etc.
      2. Idioms: double-checked locking (see the example after this outline), objects escaping local scope, lost wake-ups (a notify arrives before its wait), etc.
    5. Other properties, or stuff to shoe-horn in
      1. Linearizability: composable correctness condition for concurrent objects
      2. Immutability: Intrinsically thread-safe objects
      3. Transactional memory: Atomicity baked into language semantics
      4. Cilk: Fork-join parallelism
      5. Map-reduce (fork-join), functional parallel programming
  4. What I plan to do
    1. “Here is something nobody has done before.”
    2. Hypotheses of what I will find, to guide structure of exploration
      1. E.g., what kind of things do I expect to find?
      2. One way to guide discovery is to learn properties about running programs.
        1. Patterns of liveness? Patterns of communication? How often, how much, and with which resources/threads do threads interact?
        2. Running patterns of servers should prove interesting to examine. Is there PRIOR ART on this?
    3. Publication strategy
      1. Since the work is heavily code-based, it may take much effort to find the interesting bits, so I may not get to publish as much
      2. What conferences to target?
        1. PLDI, POPL, OOPSLA, SOSP, OSDI
          1. CFP deadlines?
        2. Others?
    4. Document a taxonomy of errors while exploring large code bases
      1. Do properties in small programs scale to large programs?
      2. What errors are introduced at the large scale?
  5. Why I think this is a good idea
    1. “There is a lot of future work to be done here.”
    2. Eclipse under RoadRunner
    3. SOR barrier code
    4. Jigsaw
  6. What I have done to prepare
    1. Dynamic atomicity analysis (Velodrome): PLDI 2008
    2. Critical path analysis: Performance analysis, class project
    3. Direct-execution model checking (Tavla): writing it up as a master's project
  7. Meta-questions
    1. Do I understand the area well enough to identify a problem and potential solutions?
    2. Do I have an actual problem that is real, worth solving, solvable, and an unmet need in the literature?
    3. Do I have an idea of how to go about solving the problem?
      1. The problem is lack of knowledge in this area
      2. By the end, we should have a better idea of the area
    4. Have 4 or 5 things that we can expect to gain by the end of my research
      1. E.g., patterns of concurrency for large-scale programs
      2. Better understanding of errors that come out, comparison with small-scale errors
      3. Concrete tools for analysis and tools for correcting errors
    5. Preliminary results would go a long way toward being convincing at this stage
      1. Show some difference in properties between small- and large-scale programs, even just one or two
    6. The exam is just as much about the problem as it is about my preparedness. Is the problem worth spending a couple of years' worth of effort exploring? Does it have real-world relevance?
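
A minimal sketch of the rely-guarantee judgment referenced in the proof-techniques list above, in one common formulation (notation varies across presentations):

\[
R,\, G \vdash \{P\}\; c\; \{Q\}
\]

reads: if the environment's interference is bounded by the rely \(R\), then \(c\) meets its pre/postcondition and every step of \(c\) satisfies the guarantee \(G\). The parallel rule composes two threads whose guarantees are tolerated by each other's relies:

\[
\frac{R \vee G_2,\; G_1 \vdash \{P_1\}\, c_1\, \{Q_1\}
      \qquad
      R \vee G_1,\; G_2 \vdash \{P_2\}\, c_2\, \{Q_2\}}
     {R,\; G_1 \vee G_2 \vdash \{P_1 \wedge P_2\}\; c_1 \parallel c_2\; \{Q_1 \wedge Q_2\}}
\]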
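A worked sketch of the reduction ("mover") idea from the same list, following Lipton: lock acquires are right-movers (R), releases are left-movers (L), race-free accesses are both-movers (B), and any block matching the pattern

\[
(R \mid B)^{*} \;\; N^{?} \;\; (L \mid B)^{*}
\]

(where \(N\) is at most one non-mover) is serializable, i.e., atomic. For example, \(\mathit{acq}(m)\ \mathit{rd}(x)\ \mathit{wr}(x)\ \mathit{rel}(m)\), with \(x\) consistently protected by \(m\), matches \(R\,B\,B\,L\) and is therefore atomic.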
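A minimal sketch of the Eraser-style lockset algorithm from the dynamic-analyses list above. The class and method names are hypothetical instrumentation hooks; real detectors also track a per-variable state machine for thread-local and read-shared phases, omitted here:

  import java.util.*;

  class LocksetChecker {
    // Locks currently held by each thread.
    private final Map<Thread, Set<Object>> held = new HashMap<>();
    // Candidate lockset C(x) for each monitored shared variable x.
    private final Map<Object, Set<Object>> candidates = new HashMap<>();

    synchronized void acquire(Thread t, Object lock) {
      held.computeIfAbsent(t, k -> new HashSet<>()).add(lock);
    }

    synchronized void release(Thread t, Object lock) {
      Set<Object> locks = held.get(t);
      if (locks != null) locks.remove(lock);
    }

    // On each access to x by t: C(x) := C(x) ∩ locksHeld(t).
    // If C(x) becomes empty, no single lock consistently protects x,
    // so report a potential race (lossy: this can be a false alarm).
    synchronized void access(Thread t, Object x) {
      Set<Object> locksHeld = held.getOrDefault(t, Collections.emptySet());
      Set<Object> c = candidates.get(x);
      if (c == null) {
        candidates.put(x, new HashSet<>(locksHeld)); // first access to x
      } else {
        c.retainAll(locksHeld);
        if (c.isEmpty()) System.err.println("potential race on " + x);
      }
    }
  }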
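And the double-checked locking idiom from the idioms list, in its classic broken Java form: under the original Java memory model the unsynchronized first check can observe a partially constructed object; on Java 5+, declaring the field volatile repairs it.

  class Singleton {
    private static Singleton instance;   // broken; fix: private static volatile
    private int state;

    private Singleton() { state = 42; }

    static Singleton get() {
      if (instance == null) {            // first check, no lock: racy read
        synchronized (Singleton.class) {
          if (instance == null) {        // second check, under the lock
            instance = new Singleton();  // publication may be reordered
          }                              // with the constructor's writes
        }
      }
      return instance;                   // may see a half-built object
    }
  }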



Old notes

  1. Dynamic analysis of multithreaded code
    • Extending Velodrome
      • Introduce generality without losing completeness
      • Implement Velodrome instrumentation in Haskell, use QuickCheck to ferret out bugs
    • False positives: is there a correctness condition to deal with the other 90%?
    • Interaction with scheduler
      • Angelic schedulers for production, demonic schedulers for debugging
        • Angelic schedulers would avoid deadlocks, avoid race conditions, and execute code in a sequentially consistent (or atomic) manner, even on architectures that implement weird memory models. They can do this by reasoning about the synchronization constructs of the programming language and about the architectural memory model. The additional constraints/overhead might induce a performance hit, but hopefully not too much. (See the scheduler sketch after this list.)
        • Such a scheduler would need to know some things about the programming language, plus some guiding heuristics. What are these?
          • A principled approach is desired, really.
          • Essentially introducing limited clairvoyance to the scheduler.
          • Preempt on compiler-injected yields for efficiency?
          • The scheduler could own a particular core for computations and such.
          • Can we systematically avoid race conditions? Either serialize a method, or given a read, delay another read until the next write in the same method.
          • Essentially have the scheduler interpret the code in a desired way.
          • What can be guaranteed by this approach?
          • We'd like a compositional way of dealing with yield points.
        • How is this approach different from Grace? 4)
          • Grace incorporates STM and process-VM isolation in a runtime environment to provide sequential semantics with acceptable performance degradation. However, this approach does not work for programs that are “inherently concurrent” 5).
          • Angelic scheduling, and its dual, demonic scheduling, would incorporate programming language reasoning in the runtime to provide correctness guarantees and quick debugging on programs with synchronization constructs that Grace cannot handle.
            • Would need to be an inherently parallel scheduler, a tough problem!
        • What other prior work is there? Surely somebody must have done something like this before.
          • Cilk papers - focuses on fork-join parallelism and work stealing
          • CMU papers - focuses on space efficiency
          • Scott Mahlke (UMich) has a POPL 2009 paper 6), “The theory of deadlock avoidance via discrete control”, that seems to be along these lines.
        • Typical concerns of schedulers include fairness, efficiency, support for real-time deadlines, etc.
      • Low-overhead back-end scheduling on top of JVM
      • Scheduling good traces automatically, as opposed to manual synchronization
      • Only acquire locks, and steal locks as needed
      • Run in parallel, except where we think there is racy code
        • Paint certain sections “red” where we think there is a race
      • Auto-add barriers to model-checked code, if you know something bad will happen
      • Using RBED 7) scheduler's hard real-time guarantees for performance isolation
    • Quickcheck-ish ease of specifying and testing 8) 9)
      • Quickly generate 100 traces which might break invariants and postconditions, based on syntactic content, and execute them (see the harness sketch after this list)
      • Requires a priori annotation of code… hmm. Perhaps postulate background invariants, like “methods are atomic”, “globals are protected by some lock consistently”, etc.
      • Onus would be on the programmer to show that code is correct, by either fixing code (add synchronization statements) or adding annotations (like “I know this is the only method which writes to x”)
    • Continuum of specification/testing:
      • (corrective execution) → (unit testing) → (dynamic analysis) → (types) → (ESC) → (Coq)
    • Restoring “continuity” to programs
      • Small bugs should give rise to small errors
    • Hypothesis: testing is enough for most programs
  2. Correctness idioms in multithreaded code
    • False alarms: a different kind of correctness property?
  3. Memory model issues
    • Tools to help reason about effect of specific memory models on given program
      • Prior art - Sebastian Burckhardt's dissertation 10)
      • Prior art - Madan and Sebastian's work on store buffer relaxations in model checking 11) 12)
      • POPL 2009 accepted two memory model papers 13)14)
      • The area of memory models and verification seems on its way to becoming well populated. I can imagine that in five years, a model checker can run x86-compliant demonic traces on programs to discover errors.
  4. Multithreaded specification
    • “On Compositionality and Its Limitations”15) shows that if your logic can express EG p (“there is a path such that all the nodes of the path have the property p”), then parallel composition fails, i.e., the total system specs cannot be verified by independently verifying component specs.
      • Related papers: Conjoining Specifications16), Modular verification of multithreaded programs17)
    • The onus should be on the added code to show/prove it interacts correctly with the existing code in a multithreaded way. This looks to the past. There is a notion of monotonicity that could be explored.
      • Does Assume/Guarantee reasoning enforce this condition?
    • Looking to the future, one would also like to show/prove that it will interact correctly with anticipated code, that falls under reasonable assumptions like race-freedom, etc.
    • Do any current specification techniques deal with time and existing modules?
    • Assume/Guarantee reasoning deals with interference by positing an environment predicate to be satisfied.
    • ESC/Java is a “logical truth maintenance system” about specifications based on first-order logic. Perhaps base the specification on non-monotonic logic18)?
      • It may be a better way of dealing with the closed-world assumption of thread-safety properties.
      • The specification problem involves tweaking specs based on code changes, and thus the set of conclusions that can be drawn may change based on modifications to specs. Nonmonotonic reasoning may be better able to capture the logic of this refinement process.
    • “Aspect-oriented” specification, where certain invariants are posited in a “cross-cutting” way? 19)
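
A minimal sketch of one piece of the angelic-scheduler idea above: regions the analysis has painted “red” (suspected racy) are serialized under a single global lock, while everything else runs in parallel. All names here are hypothetical; a real implementation would live in the runtime/scheduler rather than in library code.

  import java.util.concurrent.locks.ReentrantLock;

  class RedRegions {
    // One global lock shared by all "red" regions, so no two red
    // regions ever interleave; non-red code is unaffected.
    private static final ReentrantLock RED = new ReentrantLock();

    static void runRed(Runnable block) {
      RED.lock();
      try {
        block.run();
      } finally {
        RED.unlock();
      }
    }
  }

The instrumentation would rewrite a suspected-racy block b into RedRegions.runRed(() -> b). This trades parallelism within red regions for race freedom between them; it does not help with races between red and unpainted code, which is one reason a principled approach is needed.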
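A minimal sketch of the QuickCheck-ish testing idea above, as a hypothetical Java harness: generate many trials under randomized interleavings (approximated crudely here with random start order and yields) and check a postulated background invariant, in this case “increments are not lost”.

  import java.util.Random;

  class InterleavingTest {
    static int counter;

    static void racyIncrement() {
      int tmp = counter;   // read
      Thread.yield();      // widen the race window
      counter = tmp + 1;   // write: classic lost-update race
    }

    public static void main(String[] args) throws InterruptedException {
      Random rnd = new Random();
      for (int trial = 0; trial < 100; trial++) {  // "generate 100 traces"
        counter = 0;
        Thread a = new Thread(InterleavingTest::racyIncrement);
        Thread b = new Thread(InterleavingTest::racyIncrement);
        if (rnd.nextBoolean()) { a.start(); b.start(); }
        else                   { b.start(); a.start(); }
        a.join();
        b.join();
        if (counter != 2)
          System.err.println("invariant broken on trial " + trial
                             + ": counter = " + counter);
      }
    }
  }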
1) Types for Atomicity
2) Static Analysis of Atomicity for Programs with Non-Blocking Synchronization
5) Here, they mean that a program's serial elision of Cilk constructs results in a program with different semantics
6) A prior paper, “The Application of Supervisory Control to Deadlock Avoidance in Concurrent Software”, at http://cccp.eecs.umich.edu/papers/wang-wodes08.pdf
9) Another QuickCheck for Java: http://sourceforge.net/projects/qc4j/
10) Memory model sensitive analysis of concurrent data types, 2007
13) The Semantics of Multiprocessor Machine Code http://www.cl.cam.ac.uk/~pes20/weakmemory/
14) Relaxed memory models: an operational approach
18) Where additional information may invalidate conclusions; see http://www.informatik.uni-leipzig.de/~brewka/papers/NMchapter.pdf
19) Aspect-Oriented Specification Architectures for Distributed Real-Time Systems http://www2.computer.org/portal/web/csdl/doi/10.1109/ICECCS.2001.930177