Thursday, May 12, 2016

REST vs SOAP

REST

Pros

  • lightweight in every sense: no server- or client-side extensions are needed, and no big chunks of XML have to be transferred back and forth
  • free choice of the data format - it's up to you whether to use plain text, JSON, XML, or even your own data format
  • most current data formats (and even XML, if used) ensure that only the really required amount of data is transferred over HTTP, whereas with SOAP you need 1 kB of XML junk for 5 bytes of data (exaggerated, of course, but you get the point)

Cons

  • even though there are tools that can generate documentation from docblock comments, those comments must be written very descriptively if you want to end up with good documentation

SOAP

Pros

  • has a WSDL that can be generated from even basic docblock comments (in many languages even without them) and works well as documentation
    • there are even tools that can use the WSDL to offer an enhanced "try this request" interface (while I do not know of any such tool for REST)
  • strict data structure

Cons

  • strict data structure
  • uses XML (only!) for data transfers, so each request contains a lot of junk and the response contains five times more junk
  • the need for external libraries (for client and/or server; nowadays such libraries are already a native part of many languages, yet people still tend to use third-party ones)
To conclude, I do not see a big reason to prefer SOAP over REST (and JSON). Both can do the same, there is native support for JSON encoding and decoding in almost every popular web programming language, and with JSON you have more freedom and the HTTP transfers are free of useless junk. If I were to build an API now, I would use REST with JSON.
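To make the overhead point concrete, here is the same tiny payload once as a typical REST/JSON response and once wrapped in an illustrative SOAP envelope (the element names are made up for the example):

{"msg": "hello"}

<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Header/>
  <soap:Body>
    <m:getMessageResponse xmlns:m="http://example.com/messages">
      <m:msg>hello</m:msg>
    </m:getMessageResponse>
  </soap:Body>
</soap:Envelope>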

Encryption Algorithms

Algorithms

There are a few dozen standard algorithms. The JCA engine classes we're most likely to be interested in are:

Symmetric Cipher

  • KeyGenerator – creates symmetric keys
  • SecretKeyFactory – converts between symmetric keys and raw bytes
  • Cipher – encryption cipher
  • AlgorithmParameters – algorithm parameters
  • AlgorithmParameterGenerator – generates algorithm parameters

Asymmetric Cipher

  • KeyPairGenerator – creates public/private key pairs
  • KeyFactory – converts between key pairs and raw bytes
  • Cipher – encryption cipher
  • Signature – digital signatures
  • AlgorithmParameters – algorithm parameters
  • AlgorithmParameterGenerator – generates algorithm parameters

Digests

  • MessageDigest – digest (MD5, SHA1, etc.)
  • Mac – HMAC. Like a message digest, but requires an encryption key as well so it can't be forged by an attacker

Certificates and KeyStores

  • KeyStore – JKS, PKCS, etc.
  • CertStore – like keystore but only stores certs.
  • CertificateFactory – converts between digital certificates and raw bytes.
It is critical to remember that most algorithms are provided for backward compatibility and should not be used in greenfield development. As I write this the generally accepted advice is:
  • Use a variant of AES. Only use AES-ECB if you know with absolute certainty that you will never encrypt more than one blocksize (16 bytes) of data.
  • Always use a good random IV even if you’re using AES-CBC. Do not use the same IV or an easily predicted one.
  • Do not use less than 2048 bits in an asymmetric key.
  • Use SHA-256 or better. MD5 is considered broken; SHA-1 will be considered broken in the near future.
  • Use PBKDF2WithHmacSHA1 to create AES keys from passwords/passphrases; a sketch putting these rules together follows this list. (See also Creating Password-Based Encryption Keys.)
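As a minimal Java sketch of the advice above (the key size, iteration count, and all inputs are illustrative choices, not recommendations for your system): derive an AES key from a passphrase with PBKDF2, then encrypt with AES-CBC under a fresh random IV.

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

public class AesCbcSketch {
    public static void main(String[] args) throws Exception {
        char[] passphrase = "correct horse battery staple".toCharArray(); // illustrative
        SecureRandom random = new SecureRandom();

        // Derive a 128-bit AES key from the passphrase with PBKDF2 (random salt, many iterations).
        byte[] salt = new byte[16];
        random.nextBytes(salt);
        SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
        byte[] keyBytes = factory.generateSecret(new PBEKeySpec(passphrase, salt, 65536, 128)).getEncoded();
        SecretKey key = new SecretKeySpec(keyBytes, "AES");

        // Encrypt with AES-CBC using a fresh random IV; never reuse or hard-code the IV.
        byte[] iv = new byte[16];
        random.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ciphertext = cipher.doFinal("secret message".getBytes(StandardCharsets.UTF_8));

        // The salt and IV are not secret; store them alongside the ciphertext for decryption.
        System.out.println("ciphertext bytes: " + ciphertext.length);
    }
}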
Some people might want to use one of the other AES-candidate ciphers (e.g., twofish). These ciphers are probably safe but you might run into problems if you’re sharing files with other parties since they’re not in the required cipher suite.

In practice many if not most people use a third-party cryptographic library like BouncyCastle.

Final Notes

  1. Storing plain-text passwords without hashing is the most dangerous thing for application security today.
  2. MD5 provides basic hashing for generating a password hash, and adding a salt makes it somewhat stronger.
  3. MD5 generates a 128-bit hash. To make it more secure, use a SHA algorithm, which generates hashes from 160 bits to 512 bits long; 512-bit is the strongest.
  4. Even SHA-hashed passwords can be cracked with today's fast hardware. To beat that, you need algorithms that slow brute-force attacks down and minimize their impact; PBKDF2, BCrypt and SCrypt are such algorithms.
  5. Give careful thought to choosing an appropriate security algorithm before applying it; a sketch follows this list.
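As a sketch of notes 4 and 5, hashing a password with PBKDF2 from the standard library (the iteration count, key length, and storage format are illustrative choices):

import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class PasswordHashSketch {
    public static void main(String[] args) throws Exception {
        char[] password = "s3cret".toCharArray(); // illustrative

        // A random per-user salt defeats precomputed (rainbow-table) attacks.
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);

        // A high iteration count is what makes brute force slow.
        PBEKeySpec spec = new PBEKeySpec(password, salt, 100000, 256);
        byte[] hash = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1")
                .generateSecret(spec).getEncoded();

        // Store the salt, iteration count, and hash; never store the plain-text password.
        System.out.println(Base64.getEncoder().encodeToString(salt) + ":"
                + Base64.getEncoder().encodeToString(hash));
    }
}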
Further reading:
  • Generate Secure Password Hash: MD5, SHA, PBKDF2, BCrypt Examples
  • How to Encrypt User Passwords
  • Symmetric and Asymmetric Encryption Overview
  • Symmetric vs Asymmetric Encryption

Wednesday, May 11, 2016

JAVA FAQ

1. What are the principal concepts of OOPS?

These are the four principal concepts of object-oriented design and programming:
  • Abstraction
  • Polymorphism
  • Inheritance
  • Encapsulation

2. How does abstraction differ from encapsulation?

  • Abstraction focuses on the interface of an object, whereas encapsulation prevents clients from seeing its inside view, i.e. where the behavior of the abstraction is implemented.
  • Abstraction solves the problem on the design side, while encapsulation is about the implementation.
  • Encapsulation is the deliverable of abstraction: it groups the data and behavior behind your abstraction so that clients depend only on the interface, not on the internals.

3. What is an immutable object? How do you create one in Java?

Immutable objects are those whose state cannot be changed once they are created; any "modification" results in a new object, e.g. String, Integer, and the other wrapper classes. To create one in Java, declare the class final, make all fields private and final, set them only in the constructor, and provide no setters.
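A minimal sketch of such a class (the Point class is made up for illustration):

// Immutable: final class, private final fields, no setters.
public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int getX() { return x; }
    public int getY() { return y; }

    // "Modification" returns a new object, just as String does.
    public Point translate(int dx, int dy) {
        return new Point(x + dx, y + dy);
    }
}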

4. What are the differences between processes and threads?

  • A process is an execution of a program whereas a Thread is a single execution sequence within a process. A process can contain multiple threads.
  • Thread is at times called a lightweight process.

5. What is the purpose of garbage collection in Java? When is it used?

The purpose of garbage collection is to identify and discard the objects that are no longer needed by the application to facilitate the resources to be reclaimed and reused.

6. What is Polymorphism?

Polymorphism is briefly described as “one interface, many implementations”. Polymorphism is a characteristic of being able to assign a different meaning or usage to something in different contexts – specifically, to allow an entity such as a variable, a function, or an object to have more than one form. There are two types of polymorphism:
  • Compile time polymorphism
  • Run time polymorphism.
Compile-time polymorphism is method overloading. Runtime polymorphism is achieved through inheritance and interfaces.

7. In Java, what is the difference between method overloading and method overriding?

Method overloading in Java occurs when two or more methods in the same class have the exact same name, but different parameters. On the other hand, method overriding is defined as the case when a child class redefines the same method as a parent class. Overridden methods must have the same name, argument list, and return type. The overriding method may not limit the access of the method it overrides.
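A short sketch contrasting the two (the Shape and Circle classes are made up for illustration):

class Shape {
    // Overloading: same name, different parameter lists, resolved at compile time.
    double area(double radius) { return Math.PI * radius * radius; }
    double area(double width, double height) { return width * height; }

    void describe() { System.out.println("generic shape"); }
}

class Circle extends Shape {
    // Overriding: same signature as the parent method, resolved at runtime.
    @Override
    void describe() { System.out.println("circle"); }
}

public class PolymorphismDemo {
    public static void main(String[] args) {
        Shape s = new Circle();
        s.describe();                    // prints "circle" (dynamic dispatch)
        System.out.println(s.area(2.0)); // overload resolution picks area(double)
    }
}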

8. How do you differentiate abstract class from interface?

  • The abstract keyword is used to create an abstract class; interface is the keyword for interfaces.
  • Abstract classes can have method implementations, whereas interfaces cannot (before Java 8's default methods).
  • A class can extend only one abstract class, but it can implement multiple interfaces.
  • You can run an abstract class if it has a main() method, but not an interface.

9. Can you override a private or static method in Java?

You cannot override a private or static method in Java. If you create a static method with the same signature in a child class, it will hide the superclass method; this is known as method hiding. Similarly, you cannot override a private method in a subclass because it is not accessible there; what you can do is create another private method with the same name in the child class.

10. What is Inheritance in Java?

Inheritance in Java is a mechanism in which one object acquires all the properties and behaviors of the parent object. The idea behind inheritance in Java is that you can create new classes building upon existing classes. When you inherit from an existing class, you can reuse methods and fields of parent class, and you can also add new methods and fields.
Inheritance represents the IS-A relationship, also known as parent-child relationship.
Inheritance is used for:
  • Method Overriding (so runtime polymorphism can be achieved)
  • Code Reusability

11. What is super in Java?

The super keyword in Java is a reference variable that is used to refer to the immediate parent class of the current object. When you create an instance of a subclass, the parent-class portion of that object is constructed as well, and super refers to it.
Java super Keyword is used to refer:
  • Immediate parent class instance variable
  • Immediate parent class constructor
  • Immediate parent class method

12. What is constructor?

A constructor in Java is a special type of method that is used to initialize an object. It is invoked at the time of object creation, and it is called a constructor because it constructs the initial values of the object. Rules for creating a Java constructor:
  • Constructor name must be same as its class name
  • Constructor must have no explicit return type
Types of Java constructors:
  • Default constructor (no-arg constructor)
  • Parameterized constructor

13. What is the purpose of default constructor?

A constructor that has no parameters is known as a default (no-arg) constructor.
Syntax of default constructor:
ClassName() {}

14. What kinds of variables can a class contain?

A class can contain local variables, instance variables, and class (static) variables.
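For illustration (the class and names are made up):

public class VariableKinds {
    static int classVariable;   // class (static) variable: one copy shared by the whole class
    int instanceVariable;       // instance variable: one copy per object

    void method() {
        int localVariable = 0;  // local variable: exists only for the duration of the call
        System.out.println(localVariable + instanceVariable + classVariable);
    }
}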

15. What is the default value of the local variables?

The local variables are not initialized to any default value; neither primitives nor object references.

16. What are the differences between path and classpath variables?

PATH is an environment variable used by the operating system to locate the executables. This is the reason we need to add the directory location in the PATH variable when we install Java or want any executable to be found by OS.
Classpath is specific to Java and used by Java executables to locate class files. We can provide the classpath location while running a Java application and it can be a directory, ZIP file or JAR file.

17. What does the ‘static’ keyword mean? Is it possible to override private or static method in Java?

The static keyword denotes that a member variable or method can be accessed without requiring an instantiation of the class to which it belongs. You cannot override static methods in Java, because method overriding is based upon dynamic binding at runtime, whereas static methods are statically bound at compile time. A static method is not associated with any instance of a class, so the concept is not applicable.

18. What are the differences between Heap and Stack Memory?

The major differences between Heap and Stack memory are:
  • Heap memory is used by all the parts of the application whereas stack memory is used only by one thread of execution.
  • When an object is created, it is always stored in the Heap space and stack memory contains the reference to it.
  • Stack memory only contains local primitive variables and reference variables to objects in heap space.
  • Memory management in stack is done in LIFO manner; it is more complex in Heap memory as it is used globally.

19. Explain different ways of creating a Thread. Which one would you prefer and why?

There are three ways of creating a Thread:
1) A class may extend the Thread class
2) A class may implement the Runnable interface
3) An application can use the Executor framework, in order to create a thread pool.
The Runnable interface is generally preferred, because it does not force the class to extend Thread and so leaves it free to extend another class; a sketch follows below.
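A minimal sketch of all three approaches (Java 8 lambda syntax is used for brevity):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadDemo {
    public static void main(String[] args) {
        // 1) Extend the Thread class.
        new Thread() {
            @Override
            public void run() { System.out.println("extended Thread"); }
        }.start();

        // 2) Implement Runnable (preferred).
        Runnable task = () -> System.out.println("Runnable task");
        new Thread(task).start();

        // 3) Use the Executor framework to create a thread pool.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(task);
        pool.shutdown();
    }
}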

20. What is synchronization?

Synchronization relates to multi-threading. A synchronized block of code can be executed by only one thread at a time. As Java supports the execution of multiple threads, two or more threads may access the same fields or objects; synchronization keeps such concurrent threads in sync and avoids memory consistency errors caused by inconsistent views of shared memory. When a method is declared as synchronized, the thread holds the monitor for that method's object; if another thread is executing the synchronized method, your thread is blocked until that thread releases the monitor.

21. How can we achieve thread safety in Java?

The ways of achieving thread safety in Java are:
  • Synchronization
  • Atomic concurrent classes
  • Implementing concurrent Lock interface
  • Using volatile keyword
  • Using immutable classes
  • Thread safe classes.

22. What are the uses of synchronized keyword?

The synchronized keyword can be applied to static or non-static methods or to a block of code. Only one thread at a time can access synchronized methods, and if multiple threads try to access the same method, the others have to wait for the executing thread to finish. The synchronized keyword takes a lock on the object and thus prevents race conditions.
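A small sketch of both forms, on a made-up Counter class:

public class Counter {
    private int count = 0;

    // Synchronized method: only one thread at a time may run this on a given instance.
    public synchronized void increment() {
        count++;
    }

    public int get() {
        // Synchronized block: same lock (this), narrower scope.
        synchronized (this) {
            return count;
        }
    }
}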

23. What are the differences between wait() and sleep()?

  • wait() is a method of the Object class; sleep() is a method of the Thread class.
  • sleep() sends the thread into the sleep state for x milliseconds. When a thread goes into the sleep state it does not release the lock.
  • wait() makes the thread release the lock and go into a suspended state. The thread becomes active again only when notify() or notifyAll() is called on the same object.

24. How does HashMap work in Java ?

A HashMap in Java stores key-value pairs. The HashMap requires a hash function and uses the hashCode and equals methods to put and retrieve elements from the collection. When the put method is invoked, the HashMap calculates the hash value of the key and stores the pair at the appropriate index inside the collection. If the key already exists, its value is updated with the new value. Some important characteristics of a HashMap are its capacity, its load factor, and its resize threshold.

25. What are the differences between String, StringBuffer and StringBuilder?

String is immutable and final in Java, so a new String is created whenever we do String manipulation. As String manipulations are resource consuming, Java provides two utility classes: StringBuffer and StringBuilder.
  • StringBuffer and StringBuilder are mutable classes. StringBuffer operations are thread-safe and synchronized, whereas StringBuilder operations are not thread-safe.
  • StringBuffer is to be used when multiple threads are working on the same String, and StringBuilder in single-threaded environments.
  • StringBuilder performance is faster compared to StringBuffer because it has no synchronization overhead.

Spring FAQ

1. What is Spring?

The Spring framework is described as "an application framework and inversion of control container for the Java platform". Spring is essentially a lightweight, integrated framework that can be used for developing enterprise applications in Java.

2. Name the different modules of the Spring framework.

The Spring framework has the following modules:
  • JDBC module
  • ORM module
  • OXM module
  • JMS module
  • Transaction module
  • Web module
  • Web-Servlet module
  • Web-Struts module
  • Web-Portlet module

4. Explain Dependency Injection in the context of Spring framework.

Dependency Injection is a design pattern that removes hard-coded dependencies and keeps the application loosely coupled, extendable and maintainable. It moves dependency resolution from compile time to runtime; a small sketch follows below.
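A minimal sketch of constructor injection (the PaymentService and PaymentGateway types are made up for illustration):

// The dependency is supplied from outside, so PaymentService is not
// hard-wired to any particular gateway implementation.
interface PaymentGateway {
    void charge(String account, long cents);
}

class PaymentService {
    private final PaymentGateway gateway;

    PaymentService(PaymentGateway gateway) { // the dependency is injected here
        this.gateway = gateway;
    }

    void pay(String account, long cents) {
        gateway.charge(account, cents);
    }
}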

7. List some of the important annotations in annotation-based Spring configuration.

  • @Required, @Autowired, @Qualifier, @Resource, @PostConstruct, @PreDestroy

8. In the context of Spring framework, explain aspect-oriented programming.

Aspect-Oriented Programming breaks program logic down into smaller chunks called "concerns". Functionality that is needed across multiple points of an application is called a cross-cutting concern, and such concerns operate independently of the application's core business logic. Some of the important aspects in the context of the Spring framework are logging, auditing, caching and declarative transactions.

10. List the different Scopes of Spring bean.

There are five Scopes defined in Spring beans. They are:
  • singleton
  • prototype
  • request
  • session
  • global-session

13. What are the different types of Spring bean autowiring?

  • autowire byName
  • autowire byType
  • autowire by constructor
  • autowire with @Autowired and @Qualifier annotations

14. Explain the role of DispatcherServlet and ContextLoaderListener.

DispatcherServlet is basically the front controller in the Spring MVC application as it loads the spring bean configuration file and initializes all the beans that have been configured. If annotations are enabled, it also scans the packages to configure any bean annotated with @Component, @Controller, @Repository or @Service annotations.
ContextLoaderListener, on the other hand, is the listener that starts up and shuts down Spring's root WebApplicationContext. Some of its important functions include tying the lifecycle of the ApplicationContext to the lifecycle of the ServletContext and automating the creation of the ApplicationContext.

15. Explain the role of InternalResourceViewResolver and MultipartResolver.

InternalResourceViewResolver is one of the implementations of the ViewResolver interface; it resolves view names to pages using a directory prefix and a file suffix configured through its bean properties.
MultipartResolver, on the other hand, is the interface used for uploading files; CommonsMultipartResolver and StandardServletMultipartResolver are two implementations provided by the Spring framework for file uploading.

16. How do you create ApplicationContext in a standalone Java program?

We can do this in three ways (a short example follows the list):
  • AnnotationConfigApplicationContext: If we are using annotations for configuration, then we can use this to initialize the container and get the bean objects.
  • ClassPathXmlApplicationContext: If we have spring bean configuration xml file in the Java application, then we can use this class to load the file and retrieve the container object.
  • FileSystemXmlApplicationContext: This is quite similar to ClassPathXmlApplicationContext except for the fact that the xml configuration file can be loaded from any location in the file system.
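For example, the XML flavor looks roughly like this (the file name beans.xml and the bean id myService are placeholders):

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class StandaloneApp {
    public static void main(String[] args) {
        // Loads bean definitions from an XML file on the classpath.
        ApplicationContext ctx = new ClassPathXmlApplicationContext("beans.xml");
        Object service = ctx.getBean("myService");
        System.out.println(service);
    }
}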

17. Explain the uses of the JDBC template in Spring.

Spring simplifies database access handling with the Spring JDBC Template.
The Spring JDBC Template has many advantages compared to the standard JDBC:
  • The Spring JDBC template automatically cleans up resources, such as releasing database connections.
  • The Spring JDBC template converts the checked JDBC SQLExceptions into runtime exceptions (Spring's DataAccessException hierarchy), which makes errors easier to identify and handle; see the sketch below.
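A minimal sketch of a DAO built on the JDBC template (the table and class names are made up):

import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class EmployeeDao {
    private final JdbcTemplate jdbcTemplate;

    public EmployeeDao(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    // No manual connection handling: the template opens and releases resources
    // and translates SQLExceptions into runtime DataAccessExceptions.
    public int countEmployees() {
        return jdbcTemplate.queryForObject("SELECT COUNT(*) FROM employee", Integer.class);
    }
}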

18. What kinds of transaction management support does Spring offer?

  • Programmatic transaction management: suited to a small number of transactional operations, and
  • Declarative transaction management: suited to a large number of transactions.

19. Explain the difference between Concern and Cross-cutting concern in Spring AOP.

Simply put, Concern is the desired behavior in a module of an application. It is the core functionality the programmer wants to implement.
Cross-cutting concern, on the other hand, is the Concern that is applicable across the entire application. Examples of Cross-cutting concern would be security, data transfer, logging etc.

20. Explain Advice, in the context of Spring.

Advice is the action an aspect takes at a join point; it is what gets inserted into an application at join points. There are different types of advice, including "around", "before" and "after".

21. What is a JoinPoint, in the context of Spring?

A JoinPoint is an opportunity within the code to which we can apply an Aspect. In Spring programming, a Join Point always represents a method execution.

22. What kind of JoinPoints does Spring support?

The Spring framework supports method execution join points only.

23. What is a Pointcut?

A Pointcut is a predicate that matches join points. A pointcut defines at which join points an advice should be applied.

24. What is a Target Object?

A Target Object is an object being advised by one or more aspects. Since Spring AOP is implemented using runtime proxies, the target is always a proxied object.

25. What is Weaving?

Weaving is the process of linking aspects with other application types or objects to create an advised object.
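Tying these terms together, a rough sketch of an aspect (the package com.example.service is a placeholder): the pointcut expression selects the join points, i.e. method executions, and the before advice is woven in at those points.

import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;

@Aspect
public class LoggingAspect {

    // Pointcut: execution of any method of any type in com.example.service (and subpackages).
    @Before("execution(* com.example.service..*.*(..))")
    public void logBefore(JoinPoint joinPoint) { // "before" advice
        System.out.println("Entering: " + joinPoint.getSignature());
    }
}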

Spark FAQ

1. What is Apache Spark?

Wikipedia defines Apache Spark as “an open source cluster computing framework originally developed in the AMPLab at University of California, Berkeley but was later donated to the Apache Software Foundation where it remains today. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s multi-stage in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well-suited to machine learning algorithms.”
Spark is essentially a fast and flexible data processing framework. It has an advanced execution engine supporting cyclic data flow with in-memory computing functionality. Apache Spark can run on Hadoop, standalone, or in the cloud, and is capable of accessing diverse data sources including HDFS, HBase, and Cassandra, among others.

2. Explain the key features of Spark.

  • Spark allows integration with Hadoop and files stored in HDFS.
  • It has an independent language (Scala) interpreter and hence comes with an interactive language shell.
  • It consists of RDDs (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
  • It supports multiple analytic tools for interactive query analysis, real-time analysis and graph processing. Additionally, some of the salient features of Spark include:
Lightning-fast processing: When it comes to Big Data processing, speed always matters, and Spark runs Hadoop clusters way faster than others. Spark makes this possible by reducing the number of read/write operations to the disk; it stores the intermediate processing data in memory.
Support for sophisticated analytics: In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms. This allows users to combine all these capabilities in a single workflow.
Real-time stream processing: Spark can handle real-time streaming. MapReduce primarily handles and processes previously stored data, and even though there are other frameworks to obtain real-time streaming, Spark does this in the best way possible.

3. What is “RDD”?

RDD stands for Resilient Distributed Datasets: a collection of fault-tolerant operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed in nature.

4. How does one create RDDs in Spark?

In Spark, parallelized collections are created by calling the SparkContext “parallelize” method on an existing collection in your driver program.
                val data = Array(4,6,7,8)
                val distData = sc.parallelize(data)
Text file RDDs can be created using SparkContext’s “textFile” method. Spark has the ability to create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, among others. Spark supports text files, “SequenceFiles”, and any other Hadoop “InputFormat” components.
                 val inputfile = sc.textFile("input.txt")

5. What does the Spark Engine do?

Spark Engine is responsible for scheduling, distributing and monitoring the data application across the cluster.

6. Define “Partitions”.

A “Partition” is a smaller, logical division of data, similar to a “split” in MapReduce. Partitioning is the process that helps derive logical units of data in order to speed up data processing.
Here’s an example:  val someRDD = sc.parallelize( 1 to 100, 4)
Here an RDD of 100 elements is created in four partitions; Spark can then distribute a map task to each partition in parallel before collecting the elements back to the driver program.

7. What operations does the “RDD” support?

  • Transformations
  • Actions

8. Define “Transformations” in Spark.

“Transformations” are functions applied to an RDD that produce a new RDD. A transformation does not execute until an action occurs (evaluation is lazy). map() and filter() are examples of “transformations”: map() applies the function passed to it to each element of the RDD and produces another RDD, while filter() creates a new RDD by selecting only the elements of the current RDD that satisfy a predicate; see the sketch after the next question.

9. Define “Action” in Spark.

An “action” brings data back from the RDD to the local machine; the execution of an action triggers all the transformations created previously. reduce() is an action that applies the function passed to it over and over until only one value is left. The take(n) action, by contrast, fetches the first n values from the RDD to the local node.
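A minimal sketch of transformations versus actions using Spark's Java API (the app name and data are illustrative):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TransformationDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("demo").setMaster("local[*]"));

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Transformations: lazily describe new RDDs; nothing executes yet.
        JavaRDD<Integer> doubled = numbers.map(n -> n * 2);
        JavaRDD<Integer> big = doubled.filter(n -> n > 4);

        // Action: triggers execution and returns a result to the driver.
        int sum = big.reduce((a, b) -> a + b);
        System.out.println(sum); // 6 + 8 + 10 = 24
        sc.close();
    }
}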

10. What are the functions of “Spark Core”?

“Spark Core” performs an array of critical functions like memory management, monitoring jobs, fault tolerance, job scheduling and interaction with storage systems.
It is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic input and output functionality. RDDs in Spark Core make it fault tolerant; an RDD is a collection of items distributed across many nodes that can be manipulated in parallel, and Spark Core provides many APIs for building and manipulating these collections.

11. What is an “RDD Lineage”?

Spark does not replicate data in memory. In the event of data loss, it is rebuilt using the “RDD Lineage”, the record of the transformations used to build an RDD, which allows lost data partitions to be reconstructed.

12. What is a “Spark Driver”?

The “Spark Driver” is the program that runs on the master node of the cluster and declares transformations and actions on data RDDs. The driver also delivers the RDD graphs to the “Master”, where the standalone cluster manager runs.

13. What is SparkContext?

“SparkContext” is the main entry point for Spark functionality. A “SparkContext” represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.

14. What is Hive on Spark?

Hive is a data warehouse component (shipped, for example, as part of Hortonworks’ Data Platform, HDP) that provides an SQL-like interface to stored data. Spark users will automatically get the complete set of Hive’s rich features, including any new features that Hive might introduce in the future.
The main task in implementing a Spark execution engine for Hive lies in query planning, where Hive operator plans from the semantic analyzer are translated into a task plan that Spark can execute. It also includes query execution, where the generated Spark plan actually gets executed on the Spark cluster.

15. Name a few commonly used Spark Ecosystems.

  • Spark SQL (Shark)
  • Spark Streaming
  • GraphX
  • MLlib
  • SparkR

16. What is “Spark Streaming”?

Spark Streaming is essentially an extension of the Spark API that allows stream processing of live data streams. Data from sources like Flume and HDFS is streamed in and processed out to file systems, live dashboards and databases. It is similar to batch processing in that the input data is divided into small batches (micro-batches).
Business use cases for Spark Streaming: each Spark component has its own use case. Whenever you want to analyze data with a latency of less than 15 minutes but greater than 2 minutes, i.e. in near real time, Spark Streaming is the tool to use.

17. What is “GraphX” in Spark?

“GraphX” is a component in Spark which is used for graph processing. It helps to build and transform interactive graphs.

18. What is the function of “MLlib”?

“MLlib” is Spark’s machine learning library. It aims at making machine learning easy and scalable, with common learning algorithms and real-life use cases including clustering, regression, filtering, and dimensionality reduction, among others.

19. What is “Spark SQL”?

Spark SQL is a Spark interface for working with structured as well as semi-structured data. It can load data from multiple structured sources like text files, JSON files, and Parquet files, among others. Spark SQL provides a special type of RDD called SchemaRDD, composed of row objects where each object represents a record.
Here’s how you can create an SQL context in Spark SQL:
        SQL context: scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
        HiveContext: scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)

20. What is a “Parquet” in Spark?

“Parquet” is a columnar file format supported by many data processing systems. Spark SQL performs both read and write operations on “Parquet” files.

21. What is an “Accumulator”?

“Accumulators” are Spark’s offline debuggers. Similar to “Hadoop Counters”, “Accumulators” count the number of “events” in a program.
Accumulators are variables that can only be added to through associative operations. Spark natively supports accumulators of numeric value types and standard mutable collections. aggregateByKey() and combineByKey() use accumulators.
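A minimal sketch of counting “events” with an accumulator via the Java API (the app name and data are illustrative):

import java.util.Arrays;
import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class AccumulatorDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("acc-demo").setMaster("local[*]"));

        // Counts events across all tasks, like a Hadoop counter.
        Accumulator<Integer> evens = sc.accumulator(0);

        sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
          .foreach(n -> { if (n % 2 == 0) evens.add(1); });

        System.out.println("even numbers seen: " + evens.value()); // 2
        sc.close();
    }
}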

22. Which file systems does Spark support?

  • Hadoop Distributed File System (HDFS)
  • Local File system
  • S3

23. What is “YARN”?

“YARN” (Yet Another Resource Negotiator) is a large-scale, distributed operating system for big data applications. It is one of the cluster managers Spark can run on, providing a central resource management platform to deliver scalable operations across the cluster.

24. List the benefits of Spark over MapReduce.

  • Due to in-memory processing, Spark runs workloads around 10-100x faster than Hadoop MapReduce.
  • Unlike MapReduce, Spark provides built-in libraries to perform multiple kinds of tasks from the same core: batch processing, streaming, machine learning, and interactive SQL queries, among others.
  • MapReduce is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
  • Spark is capable of iterative computation while MapReduce is not.
Additionally, Spark stores data in memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault tolerance while Spark uses a different data storage model, resilient distributed datasets (RDDs), with a clever way of guaranteeing fault tolerance that minimizes network input and output.

25. What is a “Spark Executor”?

When “SparkContext” connects to a cluster manager, it acquires an “Executor” on the cluster nodes. “Executors” are Spark processes that run computations and store the data on the worker node. The final tasks by “SparkContext” are transferred to executors.

26. List the various types of “Cluster Managers” in Spark.

The Spark framework supports three kinds of Cluster Managers:
  • Standalone
  • Apache Mesos
  • YARN

27. What is a “worker node”?

“Worker node” refers to any node that can run the application code in a cluster.

28. Define “PageRank”.

“PageRank” measures the importance of each vertex in a graph.

29. Can we do real-time processing using Spark SQL?

Not directly but we can register an existing RDD as a SQL table and trigger SQL queries on top of that.

30. What is the biggest shortcoming of Spark?

Spark utilizes more storage space compared to Hadoop and MapReduce.
Also, Spark streaming is not actually streaming, in the sense that some of the window functions cannot properly work on top of micro batching.

Sunday, May 1, 2016

String vs StringBuffer vs StringBuilder

In Java, almost all developers use String for character string operations (note that String is a class with special language support, not a primitive type). Other than that, Sun provided StringBuffer and StringBuilder.

What are the differences between String, StringBuffer, and StringBuilder?
1. String: A string is immutable, which means that once a String is created, its value cannot be changed.
String s = "Hello";
 
s = s + " World!";
 
System.out.println(s);
When executing this code, the string Hello World! will be printed. How?
Java created a new String object and stored "Hello World!" as its value. If this style of coding is used often in a program, the program will have performance problems due to the many intermediate String objects it allocates.
2. StringBuffer: A StringBuffer is mutable, which means that once a StringBuffer object is created, we just append content to the value of the object instead of creating a new object. Its methods are synchronized where necessary, so a StringBuffer can be used safely from multiple threads, but it runs slower in a single-threaded program.
StringBuffer sb = new StringBuffer("Hello");
sb.append(" World!");
System.out.println(sb.toString()); //Hello World!
3. StringBuilder: The StringBuilder is essentially the same as StringBuffer, but it is not thread-safe: its methods are not synchronized. Of the three classes, StringBuilder runs the fastest.
StringBuilder sb = new StringBuilder("Hello");
sb.append(" World!");
System.out.println(sb.toString()); //Hello World!

Comparator and Comparable


The Comparator and Comparable interfaces are two interfaces available in Java for sorting user-defined objects.
For example, if you want to sort the List of Employee Objects based on Employee Id, name, or address, you would need to use either the comparable interface or the comparator interface.

Comparable vs Comparator

  • Both interfaces are used for comparing two different objects of the same class.
  • If the source code of the class (the object about to be sorted) is accessible and modifiable, then we can implement the Comparable interface. Otherwise, if the source code is not available, we can make use of the Comparator interface.
The examples below demonstrate how to use the Comparator and Comparable interfaces for sorting Employee objects by id.

Using the Comparable Interface 

Overriding the compareTo() method, when the source code is available, is the proper way to use the Comparable interface.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
 
class Employee implements Comparable<Employee> {
    int id;
    String name;

    public Employee(int id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public int compareTo(Employee emp) {
        // Integer.compare avoids the integer overflow that plain subtraction can cause.
        return Integer.compare(this.id, emp.id);
    }
}
 
public class EmployeeSorting {
    public static void main(String[] args) {
        List<Employee> employees = new ArrayList<>();
        Employee emp1 = new Employee(3, "Jerome");
        Employee emp2 = new Employee(1, "Albert");
        Employee emp3 = new Employee(2, "Samiya");
        Employee emp4 = new Employee(5, "Stella");
        Employee emp5 = new Employee(4, "Kent");
        employees.add(emp1);
        employees.add(emp2);
        employees.add(emp3);
        employees.add(emp4);
        employees.add(emp5);
         
        Collections.sort(employees);
         
        for (Employee employee : employees) {
            System.out.println(employee.id + ", " + employee.name);
        }
    }
}
After executing the class in the example above, the output will be as follows:
     1, Albert
     2, Samiya
     3, Jerome
     4, Kent
     5, Stella
The compareTo() method compares this object to another object and returns an int value. The rules for the return value are as follows:
if the method returns:
  • a negative value, then this object is smaller than the other object.
  • 0, then this object is equal to the other object.
  • a positive value, then this object is larger than the other object.

Using Comparator

When a source code is not available for the object that we want to sort, we can use the comparator interface for sorting.
In this example, we introduce a new class that implements the Comparator interface, override its compare() method, and pass a Comparator object to the Collections.sort() method.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
 
class EmployeeComparator implements Comparator<Employee>
{
    @Override
    public int compare(Employee emp1, Employee emp2) {
        // Integer.compare avoids the integer overflow that plain subtraction can cause.
        return Integer.compare(emp1.id, emp2.id);
    }
}
}
 
public class EmployeeSorting {
    public static void main(String[] args) {
        List<Employee> employees = new ArrayList<>();
        Employee emp1 = new Employee(3, "Jerome");
        Employee emp2 = new Employee(1, "Albert");
        Employee emp3 = new Employee(2, "Samiya");
        Employee emp4 = new Employee(5, "Stella");
        Employee emp5 = new Employee(4, "Kent");
        employees.add(emp1);
        employees.add(emp2);
        employees.add(emp3);
        employees.add(emp4);
        employees.add(emp5);
         
        Collections.sort(employees, new EmployeeComparator());
         
        for (Employee employee : employees) {
            System.out.println(employee.id + ", " + employee.name);
        }
    }
}
The compare() method compares one object with another object and returns an int value, following the same sign convention as compareTo().