Trang 1Modern programming languages:
ByteCode and Virtual Machines
CSE 6329, Spring 2011 Christoph Csallner, UTA
Trang 3Old days: No Virtual Machine
You write: Program in source language, e.g., MyProg.cpp
Source language specification, e.g., book: “The C++ programming language”
Source-to-machine compiler, e.g., (old) Microsoft Visual Studio
– Internally: Program in Intermediate Representation
Program in machine code, e.g., MyProg.exe in MS Windows x86 binary
Machine instruction set
Trang 4Today: Virtual Machines popular
You write: Program in source language, e.g., MyProg.java
Source language spec: Java Specification
Src-to-bytecode compiler: javac MyProg.java
Program in bytecode: MyProg.class
Bytecode language spec: JVM Specification
Virtual machine: java MyProg
Trang 5Program Analysis today
• Many programs compiled to bytecode
– Virtual machine executes bytecode
• Bytecode has advantages over source language
• Many Program Analyses analyze bytecode
– Results translated back to your original Java/C#/… source program
• Example program analyses that are very easy to use:
– For Java: FindBugs: http://findbugs.sourceforge.net/
– For C#: Pex for fun: http://www.pexforfun.com/
Trang 6Big picture
You write: MyProg.java
Source to bytecode compiler
E.g.: javac, MS Visual Studio
Program analysis E.g.: FindBugs, Pex
Bytecode: MyProg.class
Virtual machine, e.g.:
JVM, .NET runtime
Trang 7Why is bytecode good for Program
Analysis?
Trang 9Simple yet powerful
• Bytecode is simpler than source language
– Similar to compiler IR
– Simplifies analysis
– Java, C#, VB, F#, etc are far more complex
• Retains most information of source language
– Similar to compiler IR
– Enables meaningful analysis
Trang 12• Fewer language elements = less “syntactic sugar”
• Example: Explicit loop constructs in Java
– Sourcecode: 4
• while, do (“until”), basic for, enhanced for
– Bytecode: 0
• All 4 are mapped to jumps
– Makes program analysis easier to implement
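To make the "4 in source code, 0 in bytecode" point concrete, here is a minimal sketch that writes the same sum with each of the four loop constructs. The class and method names are hypothetical and not part of the slides.

```java
// Hypothetical demo class: the same sum written with each of Java's four loop constructs.
final class LoopForms {
    static int withWhile(int[] a) {
        int sum = 0, i = 0;
        while (i < a.length) { sum += a[i]; i++; }
        return sum;
    }

    static int withDoWhile(int[] a) {
        if (a.length == 0) return 0;   // a do-while body always runs at least once
        int sum = 0, i = 0;
        do { sum += a[i]; i++; } while (i < a.length);
        return sum;
    }

    static int withBasicFor(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i];
        return sum;
    }

    static int withEnhancedFor(int[] a) {
        int sum = 0;
        for (int x : a) sum += x;
        return sum;
    }
}
```

Compiling this class and inspecting it with `javap -c LoopForms` should show conditional branch and `goto` instructions in all four methods rather than any loop-specific instruction. The exact listing depends on the compiler version.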
Trang 13• Still a non-trivial, Turing-complete language
– At least as expressive as Java source language
– Supports all legal Java source programs (and more)
• Bytecode retains most information of original source program
– Allows automatic reconstruction of source from bytecode
– “Dis-assembler” fast, powerful, and convenient
Trang 14• Several “dis-assembler” libraries provide a nice API to retrieve and even change bytecode
– Beyond capability of Java or C# built-in reflection
– BCEL and ASM for Java bytecode
– ExtendedReflection (part of Pex) for .NET bytecode
Trang 15Documented Standard
• Carefully designed and specified
– Better than most compiler IR
• Java Virtual Machine specification
Trang 16Shared Standard
• Shared standard among different languages
– Java, C#, VB, F#, etc all compiled to same bytecode
– Programs in many source languages can be checked with single Program Analysis tool
• Shared standard among different operating systems
– Cell phones, mainframe, etc all run same bytecode
– Programs on many OS can be checked with single tool
Trang 17Old days: Typically no shared intermediate language
You write: MyProg.cpp → Visual Studio → MyProg.exe in Windows x86 → Windows
You write: MyProg.ada → MyProg in Linux x86 → Linux
Trang 18Bytecode: Shared intermediate language
You write: MyProg.java You write: MyProg.cs You write: MyProg.ada
Trang 19Many software engineering papers focus on combination of Java source with Java bytecode
• Probably easiest to understand
• Other combinations work similarly
• Well documented, many research papers
• Industrial-strength, but still relatively simple
– C# started with Java-like features
– But C# grew faster and is now more complex
– C++ more complex than Java
– Other combinations more obscure
Trang 20javac compiler implements our source-bytecode combination
Bytecode: MyProg.class
Java virtual
machine
JVM spec
Trang 21• Following overview gives a flavor
– Slightly simplified: Details may differ from JVM
– Omits several parts: Exceptions, floating point, …
• May be intimidating
– But remember that you can typically use a powerful
disassembler to help with bytecode
• Following mostly copied from Java virtual machine specification 2nd edition:
http://java.sun.com/docs/books/jvms/second_edition/html/VMSpecTOC.doc.html
Trang 23Structure of the Java Virtual Machine
= Sections of chapter 3 of JVM Spec
1 The class file format
2 Data types
3 Primitive types and values
4 Reference types and values
5 Runtime data areas
6 Frames
7 …
Trang 24Class file format
• Standard format for Java bytecode
• JVM accepts bytecode only in class file format
• JVM Spec, Section 4, defines class file format
– Contents
– Order
– Representation
– Verification [Section 4.9]
Trang 25Class file format
• Binary format
• Independent of hardware and OS
– Fixes byte order (“endianness”),
regardless of byte order of current machine
• Independent of actual files, despite the name
• Class may arrive at runtime as a byte array from
elsewhere
– From a class generator
– From the web
Trang 26Class/interface class file
• 1:1 mapping between (class or interface) and class file
– Class file can define a class or an interface
– Each class is defined in its own class file
– Each interface is defined in its own class file
• Applies to top-level types and nested types
– Java compiler creates a separate class file for each nested class
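The 1:1 mapping for nested types can be observed from the binary name that the runtime reports. The following is a small sketch with hypothetical class names `Outer` and `Inner`.

```java
// Hypothetical names: a nested class gets its own binary name,
// and javac writes it to its own class file (Outer$Inner.class).
final class Outer {
    static final class Inner { }

    // Binary name of the nested class, as reported by reflection.
    static String innerBinaryName() {
        return Inner.class.getName();
    }
}
```

After compiling, the output directory should contain two files, `Outer.class` and `Outer$Inner.class`.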
Trang 27Basic organization
• Class file = stream of bytes, 1 byte = 8 bits
• Multibyte items stored in big-endian
= High byte first
• Read consecutive bytes
• Interpret consecutive bytes as unsigned number
– 8 bit item = 1 byte [0 .. 255]
– 16 bit item = 2 consecutive bytes [0 .. 65,535]
– 32 bit item = 4 consecutive bytes [0 .. 4,294,967,295]
– 64 bit item = 8 consecutive bytes [0 .. 18,446,744,073,709,551,615]
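Reading a multibyte item "high byte first" can be sketched as follows. The helper name `readUnsigned` is an assumption made for illustration. The result is held in a `long` because Java has no unsigned 32-bit primitive type.

```java
// Sketch: interpret consecutive bytes as one unsigned big-endian number.
final class BigEndianItems {
    // Combines `length` bytes starting at `offset`, high byte first.
    // Intended for lengths 1 to 4; an 8-byte item could exceed the signed long range.
    static long readUnsigned(byte[] bytes, int offset, int length) {
        long value = 0;
        for (int i = 0; i < length; i++) {
            // Mask with 0xFF because Java bytes are signed.
            value = (value << 8) | (bytes[offset + i] & 0xFF);
        }
        return value;
    }
}
```

In practice `java.io.DataInputStream` offers the same big-endian behavior through `readUnsignedByte()`, `readUnsignedShort()`, and `readInt()`.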
Trang 28Class file data types
• Own simple data types
– Different from Java data types
– Different from JVM data types
– Neither “byte” nor “int” nor “long”
• Just three types
– u1 = unsigned byte
– u2 = unsigned 2 consecutive bytes: (high, low)
– u4 = unsigned 4 consecutive bytes
Trang 29Class file structure
Trang 30Class/Interface
Header
Trang 32• Magic number
• First four bytes of a Java class file
• Each class file has the same magic number
• Helps OS recognize this file as a Java class file
• Value is 3405691582 = CAFEBABE in hex
• More on CafeBabe:
– http://www.artima.com/insidejvm/whyCAFEBABE.html
Trang 33minor_version, major_version
• Together define the version of the class file format used in the class file
• Tells JVM if it understands the format of the class file
– An older JVM can refuse to load a class file whose format was defined after that JVM was released
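The header fields described above (magic number, then minor_version, then major_version) can be read with a few lines. This is a sketch, not how any particular JVM does it. The class name `ClassFileHeader` and the check `isSupportedBy` are hypothetical.

```java
import java.nio.ByteBuffer;

// Sketch: parse the first 8 bytes of a class file.
final class ClassFileHeader {
    final long magic;
    final int minor;
    final int major;

    private ClassFileHeader(long magic, int minor, int major) {
        this.magic = magic;
        this.minor = minor;
        this.major = major;
    }

    static ClassFileHeader parse(byte[] classFileBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classFileBytes); // ByteBuffer is big-endian by default
        long magic = buf.getInt() & 0xFFFFFFFFL;          // u4 magic
        int minor = buf.getShort() & 0xFFFF;              // u2 minor_version
        int major = buf.getShort() & 0xFFFF;              // u2 major_version
        if (magic != 0xCAFEBABEL) {
            throw new IllegalArgumentException("not a Java class file");
        }
        return new ClassFileHeader(magic, minor, major);
    }

    // Simplified assumption: a JVM accepts every major version up to the newest one it knows.
    boolean isSupportedBy(int newestKnownMajor) {
        return major <= newestKnownMajor;
    }
}
```

For example, a class file compiled for Java 6 carries major version 50, so a JVM that only knows versions up to 49 would reject it.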
Trang 34Constant Pool
of this Class/Interface
Trang 36Constant Pool
• Constants from user source program
– Constant String objects, int, float, long, double
• Internal String values
– Unicode character sequences
• Names and signatures of
– Classes, interfaces, methods, fields
Trang 38Constant Pool
• constant_pool_count = Number of entries in the
constant_pool (+ 1)
• constant_pool = Sequence of cp_info items
• cp_info = {u1 tag; u1 info[]; }
• Tag byte defines the kind of cp_info, e.g.:
– 3 indicates a CONSTANT_Integer_info
• Info array holds the actual data, e.g.:
– Info array of CONSTANT_Integer_info is one u4
Trang 39Index into Constant Pool
• u2 value
– Greater than zero
– Less than constant_pool_count
• Example
– constant_pool_count = 7
– 1 = Index of first element
– 6 = Index of last element
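The tag byte and the index rules can be illustrated with a small constant pool walker. This is a sketch under assumptions: it handles only the tags defined in the 2nd edition of the JVM specification, the buffer must be positioned at constant_pool_count, and the names `ConstantPoolTags`, `readTags`, and `isValidIndex` are hypothetical.

```java
import java.nio.ByteBuffer;

// Sketch: collect the tag of every constant pool entry.
final class ConstantPoolTags {
    // Returns an array where tags[i] is the tag at cp index i; tags[0] stays unused.
    static int[] readTags(ByteBuffer buf) {
        int count = buf.getShort() & 0xFFFF;      // constant_pool_count
        int[] tags = new int[count];
        for (int i = 1; i < count; i++) {         // valid indices: 1 .. count - 1
            int tag = buf.get() & 0xFF;
            tags[i] = tag;
            switch (tag) {
                case 1: {                         // Utf8: u2 length, then that many bytes
                    int length = buf.getShort() & 0xFFFF;
                    buf.position(buf.position() + length);
                    break;
                }
                case 3: case 4:                   // Integer, Float: one u4
                    buf.position(buf.position() + 4);
                    break;
                case 5: case 6:                   // Long, Double: 8 bytes, occupies two indices
                    buf.position(buf.position() + 8);
                    i++;
                    break;
                case 7: case 8:                   // Class, String: one u2 index
                    buf.position(buf.position() + 2);
                    break;
                case 9: case 10: case 11: case 12: // Fieldref, Methodref, InterfaceMethodref, NameAndType: two u2
                    buf.position(buf.position() + 4);
                    break;
                default:
                    throw new IllegalArgumentException("tag not handled in this sketch: " + tag);
            }
        }
        return tags;
    }

    // An index into the constant pool is greater than zero and less than constant_pool_count.
    static boolean isValidIndex(int index, int constantPoolCount) {
        return index > 0 && index < constantPoolCount;
    }
}
```

Newer class file versions define additional tags, which this sketch deliberately rejects.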
Trang 40Constant String Objects
• Declared in the user program as constant objects of the type String, e.g.:
– String s = "CSE 6329 rocks";
• CONSTANT_String_info = {
u2 string_index; } // index into cp
• cp at string_index must be a CONSTANT_Utf8_info
Trang 41Internal String Values
• Holds a character sequence
– Each character is a Unicode character
– Each character represented by 1, 2, or 3 bytes
• Used for both user program constant objects and
internal Strings (method signatures, etc.)
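The "1, 2, or 3 bytes per character" rule can be observed with `DataOutputStream.writeUTF`, which is documented to write modified UTF-8. Treating it as a stand-in for the class file encoding is an assumption of this sketch, and the class name `ModifiedUtf8` is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch: count how many bytes modified UTF-8 needs for a string.
final class ModifiedUtf8 {
    // Number of encoded bytes, excluding the 2-byte length prefix that writeUTF adds.
    static int encodedLength(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size() - 2;
    }
}
```

With this helper, "A" takes 1 byte, "\u00e9" takes 2 bytes, and "\u20ac" takes 3 bytes. The null character "\u0000" also takes 2 bytes, so that encoded strings never contain a zero byte.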
Trang 42Access Rights
of this Class/Interface
Trang 44Class/Interface Access Rights: access_flags
• Bit mask – each bit represents a flag
• Each flag represents an access permission or a
property of this class or interface
– Flag = (class/interface) was declared …
Trang 45Class/Interface Access Rights:
Public or Default
• Class/interface either has public flag set or not
– No “private” or “protected” flags
• Public flag set
– Access from within or outside its package
• Default access rights, if public flag not set
– Access only from within its package
Trang 46Direct Subclass Relation
Trang 47Name and Direct Super-Class
of this Class/Interface
Trang 48– Name of class or interface
– In “internal” notation: Replace “.” with “/”
– Example: “java/lang/Object”
Trang 49• If this class file defines a class,
– super_class must be zero or an index into the cp
• If super_class is zero
– This class file must represent java.lang.Object – the root class of the Java class hierarchy
• If super_class is non-zero,
– cp at super_class must be a CONSTANT_Class_info
representing the direct super class
Trang 50• If this class file defines an interface
– super_class must be an index into the cp
– cp at super_class must be a CONSTANT_Class_info for
java.lang.Object
• This is a bit confusing
– An interface does not have a super class
– E.g., the instance method getSuperclass() of java.lang.Class returns null if invoked on an interface
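The difference between the class file entry and what reflection reports can be checked directly. This is a tiny sketch with a hypothetical wrapper class `SuperclassDemo`.

```java
// Sketch: what reflection reports as the direct super class.
final class SuperclassDemo {
    static Class<?> superOf(Class<?> type) {
        return type.getSuperclass();
    }
}
```

Here `superOf(Runnable.class)` is null because Runnable is an interface, `superOf(Object.class)` is null because Object is the root class, and `superOf(String.class)` is `Object.class`.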
Trang 51Direct Interfaces
of this Class/Interface
Trang 53• interfaces_count = Number of direct super interfaces
• interfaces = Array of indices into cp
• cp at each index must be a CONSTANT_Class_info that represents a direct super interface
Trang 55Fields of this Class/Interface
Trang 56• fields_count = Number of fields declared by this class
or interface
– Includes static fields and instance fields
– Does not include any inherited fields
• fields = Sequence of field_info items
– Each field_info represents one field declared by this class
or interface
Trang 57• field_info = {
– u2 access_flags; // Access rights
– u2 name_index; // Simple name
– u2 descriptor_index; // Type
– u2 attributes_count; // Attributes
– attribute_info attributes[attributes_count]; }
Trang 58Field Access Rights field_info access_flags
• Flag = Field was declared …
– The field is accessible …
Trang 59Field Access Rights
• Only one of the access flags (public, private,
protected) may be set
• “Default” access, if no access flag is set
– Only within its package
• Reminder from Java Spec: Class X can access a field C.f only if it can access class C.
– Public field f may not be accessible for class X
Trang 60More Field Access Rights field_info access_flags
• Flag = Field was declared …
• 0x0008 = static
– Class field (one per class)
– Not an instance field (one per instance)
• 0x0010 = final
– No further assignment after initialization
• 0x0040 = volatile
• 0x0080 = transient
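The flag values above can be checked against what reflection reports for a field, since `Field.getModifiers()` uses the same bit values for these modifiers. The class and its sample fields are hypothetical.

```java
import java.lang.reflect.Field;

// Sketch: read the access flags of two sample fields via reflection.
final class FieldFlagsDemo {
    static final int ACC_PRIVATE   = 0x0002;
    static final int ACC_STATIC    = 0x0008;
    static final int ACC_FINAL     = 0x0010;
    static final int ACC_VOLATILE  = 0x0040;
    static final int ACC_TRANSIENT = 0x0080;

    // Hypothetical sample fields, declared only so their flags can be inspected.
    static volatile int counter;
    private final transient Object cache = null;

    // Bit mask of the modifiers of the named field of this class.
    static int flagsOf(String fieldName) {
        try {
            Field field = FieldFlagsDemo.class.getDeclaredField(fieldName);
            return field.getModifiers();
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("no such field: " + fieldName, e);
        }
    }

    // A flag is set if its bit is set in the mask.
    static boolean isSet(int flags, int flag) {
        return (flags & flag) != 0;
    }
}
```

For `counter` the mask is 0x0048 (static plus volatile, default access). For `cache` it is 0x0092 (private plus final plus transient).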
Trang 61Field Signature
• cp[name_index] is a CONSTANT_Utf8_info
– Simple name of field, e.g.:
– double[] foo; // “foo”
– static Object bar; // “bar”
Trang 62Descriptor Notation
• Cryptic type notation used in Java bytecode
– Notation | Java type | Interpretation
– C | char | Unicode character
– L<name>; | reference | Instance of <name>
– [ | reference | One array dimension
Trang 63Descriptor Notation
– Notation | Java type | Interpretation
– B | byte | 8 bit signed integer
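The descriptor notation can be sketched as a small recursive function. The letters for the primitive types not listed on the slides (D, F, I, J, S, Z) are taken from the JVM specification, and the class name `Descriptors` is hypothetical.

```java
// Sketch: compute the field descriptor of a Java type.
final class Descriptors {
    static String of(Class<?> type) {
        if (type.isArray()) {
            return "[" + of(type.getComponentType()); // one "[" per array dimension
        }
        if (type == byte.class)    return "B";
        if (type == char.class)    return "C";
        if (type == double.class)  return "D";
        if (type == float.class)   return "F";
        if (type == int.class)     return "I";
        if (type == long.class)    return "J";
        if (type == short.class)   return "S";
        if (type == boolean.class) return "Z";
        if (type == void.class)    return "V";       // only legal as a return type
        // Reference type: internal name, with "." replaced by "/".
        return "L" + type.getName().replace('.', '/') + ";";
    }
}
```

For the field examples above, `double[] foo` has descriptor "[D" and `Object bar` has descriptor "Ljava/lang/Object;".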
Trang 64Field Attributes:
attributes[attributes_count]
• attributes_count = Number of attributes for this field
• attributes = Sequence of attribute_info items
– Each attribute_info represents one attribute
• Examples:
– @Deprecated int myDeprecatedField = 0;
– @Deprecated @MyAttribute int otherField = 1;
Trang 65Methods of this Class/Interface
Trang 67• methods_count = Number of methods declared by this class or interface
– Includes static methods and instance methods
– Includes constructors and static initializers
– Does not include inherited methods
• methods = Sequence of method_info items
– Each method_info represents one method declared by this class or interface
Trang 68• method_info {
– u2 access_flags; // Method access rights
– u2 name_index; // Simple name
– u2 descriptor_index; // Signature
– u2 attributes_count; // Attributes
– attribute_info attributes[attributes_count]; }
Trang 69Method Access Rights method_info access_flags
• Next two slides identical to field access rights
– Public, private, protected, default
• Fields, constructors, methods are all “members” of a class or interface
– Similar access right rules
Trang 70Method Access Rights method_info access_flags
– The method is accessible …
Trang 71Method Access Rights
• Only one of the access flags (public, private,
protected) may be set
• “Default” access, if no access flag is set
– Only within its package
• Reminder from Java Spec: Class X can access a
method C.m only if it can access class C.
– Public method m may not be accessible for class X
Trang 72More Method Access Rights method_info access_flags
• Flag = Method was declared …
• 0x0008 = static
– Class method (called independent of instance)
– Not an instance method (which needs an instance as a
“receiver instance” or “this parameter”)
• instance.method(p2, p3, …)
• 0x0010 = final
– May not be overridden by sub-classes
Trang 73More Method Access Rights method_info access_flags
• Flag = Method was declared …
Trang 75Method Signature
• cp[descriptor_index] is CONSTANT_Utf8_info
– (Parameter types) Return type
– In same cryptic notation as field types
– “V” = void is also a legal return type
– Never includes a “receiver type”
• Examples
– public int foo() { } // “()I” instance method
– MyClass(long p) {} // “(J)V” constructor
– static { bar = 5; } // “()V”
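The method descriptors in these examples can be reproduced with the standard `java.lang.invoke.MethodType` class. The wrapper class `MethodDescriptors` is hypothetical and only keeps the calls short.

```java
import java.lang.invoke.MethodType;

// Sketch: build a method descriptor from a return type and parameter types.
final class MethodDescriptors {
    static String of(Class<?> returnType, Class<?>... parameterTypes) {
        return MethodType.methodType(returnType, parameterTypes).toMethodDescriptorString();
    }
}
```

`of(int.class)` gives "()I" and `of(void.class, long.class)` gives "(J)V". The receiver type never appears, which is why an instance method and a static method with the same parameters share one descriptor.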
Trang 76Method Attributes:
attributes[attributes_count]
• attributes_count = Number of attributes for this
method
• attributes = Sequence of attribute_info items
– Each attribute_info represents one attribute
– Code attribute, present iff the method is neither abstract nor native
– Exceptions attribute, lists declared exceptions
– @Deprecated attribute
Trang 77Code of a method/constructor/clinit:
In a Code Attribute
• Code_attribute {
– { u2 start_pc; u2 end_pc; u2 handler_pc; u2 catch_type; }
exception_table[exception_table_length];
– attribute_info attributes[attributes_count]; }
Trang 78Attributes of this Class/Interface
Trang 80Class/Interface Attributes:
attributes[attributes_count]
• attributes_count = Number of attributes for this class
or interface
• attributes = Sequence of attribute_info items
– Each attribute_info represents one attribute
Trang 81Referring to fields/methods
in other classes
Trang 82• So far: How to define the elements of a class
– Class name
– Access rights of the class
– Fields of the class
Trang 83CONSTANT_Fieldref_info CONSTANT_Methodref_info
• Reference to a field/method/constructor
• CONSTANT_Fieldref_info { // similar for all
– u1 tag;
– u2 class_index; // type declaring this member
// CONSTANT_Class_info
– u2 name_and_type_index; // simple name and descriptor
// CONSTANT_NameAndType_info
}