A GPU Accelerated Parallel Heuristic for the 2D Knapsack Problem with Rectangular Pieces



Abstract: The 2D knapsack problem is a typical NP-hard problem in combinatorial optimization: easy to describe but hard to solve. Given a set of rectangular pieces and a rectangular container, the two-dimensional knapsack problem (2D-KP) consists of orthogonally packing a subset of the pieces within the container such that the sum of the values of the packed pieces is maximized. If the value of a piece is given by its area, the objective is to maximize the covered area of the container. Solving large problem instances remains very time-consuming on CPUs. As graphics processing units (GPUs) have evolved to support general-purpose computing, they have become a major accelerator in scientific and industrial computing. In this paper, we present an optimized parallel heuristic efficiently accelerated on GPUs; the proposed optimization algorithm provides higher-quality solutions for 2D-KP within a reasonable computational time.


Keywords: GPU, parallel heuristics, combinatorial optimization, algorithms, 2D Knapsack Problem, local search, greedy algorithm



The two-dimensional knapsack problem (2D-KP) with rectangular pieces is a well-studied and popular class of combinatorial optimization problems because it has many real-world applications in the paper, glass, wood and metal industries. This paper deals with the 2D-KP given a set of small rectangular pieces (R1, R2, ..., Rn), where Ri = (wi, hi), and a larger rectangle, known as a container. The task is to find a feasible arrangement of a subset of the pieces in the container that maximizes the total value of the packed pieces. If the value of a piece is given by its area, the aim is to maximize the covered area of the container. An arrangement of pieces, also known as a packing plan, is regarded as feasible if each piece is placed orthogonally (i.e. parallel to the container edges), lies completely inside the container and overlaps no other placed piece.

This problem has several real-world applications and is proven to be NP-hard. 2D packing problems arise, for example, when goods have to be packed on pallets in horizontal layers. The space-saving arrangement of advertisements on newspaper pages and the effective positioning of components on chips when designing integrated circuits also lead to 2D packing problems.

Solving combinatorial optimization problems is complex and time-consuming for CPUs, especially for large-scale problems like 2D-KP. Metaheuristic methods can be used to obtain satisfactory solutions (approximate optima) in a reasonable time. Local search metaheuristics (LSMs) are among the most widely researched single-solution-based approaches, with many variants such as iterated local search, genetic algorithms, hill climbing, tabu search and simulated annealing. Parallelism is a way not only to reduce time complexity but also to improve the quality of the solutions obtained.

Today, graphics processing units (GPUs) have evolved from fixed-function rendering devices into programmable, parallel processors. Market demand for real-time, high-definition 3D graphics has driven GPUs to become highly parallel, multithreaded, many-core processors with tremendous computational power and high-bandwidth memory. The GPU architecture is therefore designed such that more transistors are devoted to data processing than to data caching and flow control. GPU accelerated heuristics can be more computationally efficient than CPU-based LSMs. Optimizing LSM algorithms for the best efficiency on GPUs is thus an important problem.

In this paper, we propose an optimized parallel heuristic on GPUs and test the algorithm on a typical 2D-KP instance. Our parallel GPU approach, which incorporates a genetic algorithm, a local search method and a greedy algorithm, is an effective parallel heuristic for finding improved solutions to 2D-KP.

978-1-5386-7693-6/18/$31.00 ©2018 IEEE


1)    Genetic Algorithm: The genetic algorithm [7] is an optimization technique based on natural evolution that incorporates the idea of survival of the fittest into a search algorithm. In nature, the fittest individuals have the best chance of surviving and produce better-adapted offspring, which will be healthier than their parents. The same idea is applied by creating a new generation of solutions that are closer to the desired solution. A genetic algorithm consists of these steps: 1) encoding, 2) evaluation, 3) selection, 4) crossover, 5) mutation, 6) decoding. The basic genetic algorithm is as follows:


Basic Genetic Algorithm:


Step-1: Initialize population with random solutions.

Step-2: Evaluate each candidate.

Step-3: Repeat (iterations):

a)  Select parents.

b)  Recombine pairs of parents.

c)  Mutate the resulting children.

d)  Evaluate children.

e)  Select individuals for the next generation.

Until the terminating condition is satisfied.
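The steps above can be sketched in C++ on a toy OneMax objective (maximize the number of 1-bits); the genome encoding, binary tournament selection and one-point crossover used here are illustrative textbook choices, not the paper's 2D-KP operators:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Minimal genetic-algorithm sketch on the toy OneMax problem (maximize the
// number of 1-bits), not the paper's 2D-KP encoding.
using Genome = std::vector<int>;

int fitness(const Genome& g) {
    int s = 0;
    for (int b : g) s += b;          // OneMax: count the 1-bits
    return s;
}

Genome run_ga(int genes, int pop_size, int generations, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> bit(0, 1);
    std::uniform_int_distribution<int> pick(0, pop_size - 1);
    std::uniform_int_distribution<int> cut(1, genes - 1);
    std::uniform_int_distribution<int> pos(0, genes - 1);
    auto worse = [](const Genome& a, const Genome& b) {
        return fitness(a) < fitness(b);
    };

    // Step-1: initialize population with random solutions.
    std::vector<Genome> pop(pop_size, Genome(genes));
    for (auto& g : pop) for (auto& b : g) b = bit(rng);

    for (int gen = 0; gen < generations; ++gen) {   // Step-3: repeat
        std::vector<Genome> next;
        while ((int)next.size() < pop_size) {
            // a) select parents by binary tournament.
            const Genome& p1 = std::max(pop[pick(rng)], pop[pick(rng)], worse);
            const Genome& p2 = std::max(pop[pick(rng)], pop[pick(rng)], worse);
            // b) recombine with one-point crossover.
            int c = cut(rng);
            Genome child(p1.begin(), p1.begin() + c);
            child.insert(child.end(), p2.begin() + c, p2.end());
            // c) mutate: flip one random gene.
            child[pos(rng)] ^= 1;
            next.push_back(std::move(child));  // d)/e) children form next generation
        }
        pop = std::move(next);
    }
    return *std::max_element(pop.begin(), pop.end(), worse);
}
```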


2)    Local Search Heuristic: Local search [8] is a heuristic method for solving computationally hard optimization problems. Local search can be used on problems that can be formulated as finding a solution maximizing a criterion among a number of candidate solutions. Local search algorithms move from solution to solution in the space of candidate solutions by applying local changes, until a solution deemed optimal is found or a time bound has elapsed. Fields within local search include hill climbing, tabu search and simulated annealing.

Local search methods have a tendency to become stuck in suboptimal regions or on plateaus where many solutions are equally fit. Tabu search enhances the performance of local search by relaxing its basic rule. First, at each step worsening moves can be accepted if no improving move is available (as when the search is stuck at a strict local minimum). In addition, prohibitions (hence the term tabu) are introduced to discourage the search from returning to previously visited solutions. Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function; specifically, it is a metaheuristic for approximating global optimization in a large search space. It is often used when the search space is discrete (e.g., all tours that visit a given set of cities) and when finding an approximate global optimum in a fixed amount of time is more important than finding a precise local optimum.
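As an illustration, here is a minimal simulated-annealing sketch in C++ on a hypothetical one-dimensional objective (maximize f(x) = -(x - 7)^2); the +-1 neighbourhood, geometric cooling schedule and Metropolis acceptance rule are standard textbook choices, not tied to 2D-KP:

```cpp
#include <cmath>
#include <random>

// Toy objective to maximize: f(x) = -(x - 7)^2, with its peak at x = 7.
double f(int x) { return -double(x - 7) * (x - 7); }

int simulated_annealing(int start, double t0, double cooling,
                        int steps, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    int cur = start;
    double t = t0;
    for (int i = 0; i < steps; ++i) {
        int cand = cur + (u(rng) < 0.5 ? 1 : -1);       // random neighbour
        double delta = f(cand) - f(cur);
        // Always accept improving moves; accept worsening moves with
        // probability exp(delta / t), which shrinks as t cools.
        if (delta >= 0 || u(rng) < std::exp(delta / t)) cur = cand;
        t *= cooling;                                    // geometric cooling
    }
    return cur;
}
```

Accepting occasional worsening moves while the temperature is high is exactly what lets SA escape the suboptimal regions that plain hill climbing gets stuck in.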

3)    Greedy Algorithm: A greedy algorithm is an algorithmic paradigm that follows the problem-solving heuristic of making the locally optimal choice at each stage in the hope of finding a global optimum. In mathematical optimization, greedy algorithms are known to solve optimally those combinatorial problems that have the properties of matroids.

In general, greedy algorithms have five components:

-  A candidate set, from which a solution is created

-  A selection function, which chooses the best candidate to be added to the solution

-   A feasibility function, that is used to determine if a candidate can be used to contribute to a solution

-    An objective function, which assigns a value to a solution, or a partial solution, and

-    A solution function, which will indicate when we have discovered a complete solution

Greedy algorithms produce good solutions for some mathematical problems. We make whatever greedy choice seems best at the moment and then solve the subproblems that arise later. The choice made by a greedy algorithm may depend on choices made so far, but not on future choices or on the solutions to future subproblems. The algorithm iteratively makes one greedy choice after another, reducing each given problem to a smaller one.
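The five components can be illustrated with a deliberately simplified sketch (a hypothetical one-dimensional area-capacity relaxation, not the paper's full 2D packing): the candidate set is the piece list, the selection function prefers larger areas, the feasibility function is a capacity test, and the objective is the total covered area:

```cpp
#include <algorithm>
#include <vector>

// Greedy sketch on a simplified 1-D relaxation of packing: fill a container
// of a given total area by repeatedly taking the largest piece that still fits.
struct Piece { int w, h; };
int area(const Piece& p) { return p.w * p.h; }

// Returns the total area packed greedily.
int greedy_fill(std::vector<Piece> candidates, int capacity) {
    // Selection function: consider larger-area candidates first.
    std::sort(candidates.begin(), candidates.end(),
              [](const Piece& a, const Piece& b) { return area(a) > area(b); });
    int used = 0;                          // objective: total covered area
    for (const Piece& p : candidates) {
        if (used + area(p) <= capacity)    // feasibility function
            used += area(p);               // commit the locally best choice
    }
    return used;                           // solution: covered area achieved
}
```

Note how each choice is committed immediately and never revisited; this is what makes greedy methods fast but only locally optimal in general.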

  1. GPU ARCHITECTURAL CONSTRAINTS

Until recently, the only viable platform for parallel programming was the conventional CPU processor, be it single- or multi-core. Usually many such processors were arranged either tightly as multiprocessors sharing a single memory space, or loosely as multicomputers. A CPU does not have many arithmetic-logic units (ALUs), but it has a large cache and a sophisticated control unit; it is therefore specialized for managing multiple different tasks that require lots of data, with data stored in a cache to accelerate access. Graphics processing units (GPUs) are nowadays available in nearly all personal computers and offer a potential reduction in computational time. GPUs have evolved into a highly parallel, multithreaded, many-core environment. The modern GPU architecture offers outstanding performance with respect to power consumption, price and space occupied.

A GPU consists of a set of streaming multiprocessors (SMs); each one features multiple processing units capable of executing thousands of operations concurrently. Inside an SM, threads are grouped into warps. A warp executes 32 threads in a SIMD (Single Instruction, Multiple Data) manner; that is, all threads within a warp execute the same operation on multiple data points. There are at least two types of memory: global and shared. Shared memory is usually limited to a few kilobytes per SM, while global memory can store a large amount of data (e.g., 6 GB).

Fig. 1. GPU Architecture

Fig. 2. GPU threads model

A thread on a GPU can be seen as an element of data to be processed. Compared to CPU threads, GPU threads are lightweight: switching context between two threads is not a costly operation. Regarding their spatial organization, threads are organized into so-called thread blocks.

A kernel is executed by multiple equally-shaped thread blocks. Blocks can be organized into a one-dimensional or two-dimensional grid of thread blocks, and threads inside a block are arranged in a similar way. Each thread has a unique id that can be used to operate on different data. The advantage of this grouping is that the number of blocks processed simultaneously by the GPU is closely linked to hardware resources. All threads belonging to the same thread block are assigned as a group to a single multiprocessor, while different thread blocks can be assigned to different multiprocessors. Hence, a major issue is to control thread parallelism to meet the memory constraints. From a hardware point of view, graphics cards consist of streaming multiprocessors, each with processing units, registers and on-chip memory. Since multiprocessors are organized according to the SPMD model, threads share the same code and have access to different memory areas.
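The per-thread id computation implied above can be written out directly; the names mirror CUDA's blockIdx/blockDim/threadIdx built-ins, but this is plain host-side C++ arithmetic, not device code:

```cpp
// Global thread id of a thread in a 1-D grid: blocks of blockDim threads
// each; this thread is threadIdx within block blockIdx.
int global_thread_id(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

// Index of the warp (group of 32 threads) a thread belongs to inside its block.
int warp_of(int threadIdxInBlock) { return threadIdxInBlock / 32; }
```

Each thread typically uses this global id to select the data element it processes, which is how all threads can run the same kernel code on different data.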

Fig. 3. CPU – GPU communications

The CPU is considered the host and the GPU is used as a device coprocessor. Each GPU has its own memory and processing elements, separate from those of the host computer. Data must be transferred between the memory space of the host and the memory of the GPU during program execution.

Each processing device on the GPU supports the single program, multiple data (SPMD) model, i.e. multiple autonomous processors simultaneously execute the same program on different data. To achieve this, the concept of a kernel is defined. A kernel is a function callable from the host and executed on the specified device simultaneously by several processors in parallel. Communication between the CPU host and its device is done through the global memory.


Andreas Bortfeldt and Tobias Winter [1] proposed a genetic algorithm (GA) that addresses both the guillotine and the non-guillotine case of 2D-KP. Moreover, an orientation constraint may optionally be taken into account, and the given piece set may be constrained or unconstrained. The GA was subjected to extensive testing on well-known benchmark instances and, in comparison to recently published methods, yields competitive results.

Defu Zhang, Furong Ye, Yain Whar Si and Stephen C. H. Leung [2] proposed a hybrid metaheuristic that combines an improved constructive heuristic with a variable neighbourhood search. Their algorithm has three interesting features. First, a least-waste strategy is used to improve the constructive heuristics. Second, a better sorting sequence is selected to generate an initial solution. Finally, different neighbourhoods are constructed based on the concept of block patterns. Computational results on a diverse set of problem instances show that the proposed algorithm performs better than algorithms reported in the literature for most of the problem sets compared.

I (Mohammad Rashid) [3] previously proposed a hybrid algorithm for 2D-KP that integrates a genetic algorithm with a local search heuristic and a greedy algorithm. That approach keeps the evolutionary technique of genetic algorithms, using operators such as mutation, crossover and selection of the fittest element, but the crossover operator is enhanced with a local search heuristic. The local search heuristic is also used for its strong climbing ability and its efficiency at finding local optima. The greedy algorithm is responsible for generating the starting generation as well as carrying out post-optimization of the best solution found.


In this work we present a GPU accelerated parallel heuristic for 2D-KP that integrates a genetic algorithm with a local search heuristic and a greedy algorithm, as in our previous work. In the new approach, we parallelize the previously proposed heuristic on the GPU by leveraging hundreds of GPU threads, establish efficient CPU-GPU communication and find better solutions than in all our previous work.


Proposed Parallel Heuristic:


Step-1: Allocate 200 GPU threads.

Step-2: For each GPU thread:

a)  Run the main 2D-KP algorithm in parallel to find the best packing plan.

b)  Send each thread's packing plan (sum of the values of the packed pieces) to the CPU.

Step-3: The CPU compares the packing plans sent by all 200 GPU threads and determines the best one, i.e. the plan that maximizes the covered area of the container.
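A minimal C++ sketch of this scheme, with a sequential loop standing in for the 200 concurrent GPU threads, and random_search an invented placeholder for the real per-thread 2D-KP solver:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Placeholder for "run the main 2D-KP algorithm": each simulated thread
// samples candidate packing values and keeps the best one it sees.
// This stand-in is NOT the real solver; it only shapes the data flow.
int random_search(unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> value(0, 23);
    int best = 0;
    for (int i = 0; i < 100; ++i) best = std::max(best, value(rng));
    return best;
}

int parallel_best(int n_threads) {
    std::vector<int> results(n_threads);
    for (int t = 0; t < n_threads; ++t)   // on the GPU these run concurrently,
        results[t] = random_search(t);    // one thread per seed (Step-2a/2b)
    // Step-3: the CPU's comparison is a simple max-reduction over all plans.
    return *std::max_element(results.begin(), results.end());
}
```

The essential pattern is embarrassingly parallel search followed by a host-side max-reduction, which is why the scheme maps well onto independent GPU threads.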


Below is our previously proposed algorithm for finding optimal solutions to 2D-KP.


Our Main 2D-KP Algorithm:


Step-1: Generate a series of random initial solutions (packing plans, each a sequence of rectangular pieces).

Step-2: Calculate the fitness (height and width) of every rectangular piece of the initial solutions.

Step-3: Repeat (iterations):

a)  Pick a random solution and apply the local search heuristic to select parents from the initial population for crossover and mutation.

b)  Apply the greedy heuristic GreedyPacking() to generate a greedy solution from the parents.

c)  Calculate the fitness of the newly generated solution and add it to the initial population.

d)  Sort the combined population (initial population + new greedy solution) by height/width and select the best solutions (packing plans) for the next population.

e)  Set initial population = next population.

Until the terminating condition is satisfied (a given number of generations).

Step-4: Call the GreedyPostOptimization() algorithm to carry out post-optimization of the best solution found.
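The loop above can be sketched compactly in C++: solutions are packing sequences (permutations of piece indices), and covered_area is an invented surrogate evaluator standing in for the real packing heuristic, so the names and the swap-based perturbation here are illustrative assumptions:

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

using Seq = std::vector<int>;

// Toy surrogate for the real evaluator: reward sequences that keep piece
// indices in ascending order (a real implementation would run the packer).
int covered_area(const Seq& s) {
    int score = 0;
    for (size_t i = 1; i < s.size(); ++i) if (s[i - 1] < s[i]) ++score;
    return score;
}

Seq hybrid_search(int n_pieces, int pop_size, int iterations, unsigned seed) {
    std::mt19937 rng(seed);
    // Step-1: random initial packing sequences.
    Seq base(n_pieces);
    std::iota(base.begin(), base.end(), 0);
    std::vector<Seq> pop(pop_size);
    for (auto& s : pop) { s = base; std::shuffle(s.begin(), s.end(), rng); }

    std::uniform_int_distribution<int> pick(0, pop_size - 1);
    std::uniform_int_distribution<int> pos(0, n_pieces - 1);
    for (int it = 0; it < iterations; ++it) {       // Step-3: repeat
        // a)/b) perturb a random solution with a swap move (a stand-in
        // for the local-search + greedy-crossover step).
        Seq cand = pop[pick(rng)];
        std::swap(cand[pos(rng)], cand[pos(rng)]);
        // c)/d) add the candidate, sort by fitness, keep the best pop_size.
        pop.push_back(std::move(cand));
        std::sort(pop.begin(), pop.end(),
                  [](const Seq& a, const Seq& b) {
                      return covered_area(a) > covered_area(b);
                  });
        pop.resize(pop_size);                        // e) next population
    }
    return pop.front();
}
```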


In our greedy approach, we considered the following conditions and fitness values while building the greedy heuristic.

When h1 >= h2:

if w = wi and h1 = li, then fitness = 6
if w = wi and h2 = li, then fitness = 5
if w = wi and h1 < li, then fitness = 4
if w > wi and h1 = li, then fitness = 3
if w = wi and h1 > li, then fitness = 2
if w > wi and h2 = li, then fitness = 1
otherwise, fitness = 0

When h1 < h2:

if w = wi and h2 = li, then fitness = 6
if w = wi and h1 = li, then fitness = 5
if w = wi and h2 < li, then fitness = 4
if w > wi and h2 = li, then fitness = 3
if w = wi and h2 > li, then fitness = 2
if w > wi and h1 = li, then fitness = 1
otherwise, fitness = 0

For the cases with fitness = 3, fitness = 1 or fitness = 0, if piece i is packed into a space such that the remaining width (w - wi) is less than the minimum width of all remaining unpacked pieces, waste will obviously occur. In this case, we apply the least-waste strategy: select an unpacked piece j whose width is maximal amongst the unpacked pieces, such that wj is less than or equal to w and the perimeter of j is not less than that of piece i. This strategy can therefore make up for the disadvantage of the scoring strategy and may reduce waste.
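The scoring rules transcribe directly into a C++ function. Here w is read as the width of the target free space, h1/h2 as its two reference heights, and wi/li as the width and height of piece i (an interpretation of the notation, which the text does not define), with the final rule treated as a catch-all fitness of 0:

```cpp
// Fitness of placing piece (wi x li) into a free space of width w with
// reference heights h1 and h2, following the rule tables above. Rules are
// checked in the listed order; anything unmatched scores 0.
int piece_fitness(int w, int h1, int h2, int wi, int li) {
    if (h1 >= h2) {
        if (w == wi && h1 == li) return 6;
        if (w == wi && h2 == li) return 5;
        if (w == wi && h1 <  li) return 4;
        if (w >  wi && h1 == li) return 3;
        if (w == wi && h1 >  li) return 2;
        if (w >  wi && h2 == li) return 1;
    } else {
        if (w == wi && h2 == li) return 6;
        if (w == wi && h1 == li) return 5;
        if (w == wi && h2 <  li) return 4;
        if (w >  wi && h2 == li) return 3;
        if (w == wi && h2 >  li) return 2;
        if (w >  wi && h1 == li) return 1;
    }
    return 0;   // fallback: no exact-fit or flush-fit condition holds
}
```

The ordering matters: a piece matching both a height-6 and a height-2 condition takes the higher score, mirroring the order the rules are listed in.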

The proposed heuristic can be outlined as follows:


Our GreedyPacking() Algorithm:


Step-1: For all unpacked pieces do:

Step-2: Select the piece i with the highest fitness according to the conditions and fitness values above.

Step-3: If the fitness of piece i is 3, 1 or 0, then

select the piece j according to the least-waste strategy;

if such a piece j can be found, then pack piece j instead of piece i,

else pack piece i;

else pack piece i.

Step-4: Return the greedy solution.


In addition, we proposed the following GreedyPostOptimization() algorithm to carry out post-optimization of the best solution found. This algorithm stores a list of free rectangles representing the free area left in the bin at a given packing step.


Our GreedyPostOptimization() Algorithm:


foreach free rectangle:

compute the free area left in the bin

select an unpacked rectangle whose height and width are smaller than those of the free area

pack that piece into the free area


We set up a CUDA programming environment in Visual Studio to build the parallel heuristic on an NVIDIA GPU using C++ to find optimal solutions for 2D-KP. After parallelizing our proposed heuristic on the GPU, we obtained the following improved results:

GPU Threads    Covered Area
      1             19
      2             21
     10             18
     30             17
     80             20
    140             23
    200             22

Maximized covered area (best over all threads): 23

We can see that different GPU threads return different covered areas. Considering only the maximum, the covered area improves from 17 in the worst case to 23, an increase of approximately 35.29% ((23 - 17) / 17 = 0.3529). We can therefore say that parallelizing the heuristic on the GPU yields a much better solution.


In this work, we have presented a GPU accelerated parallel heuristic that integrates a greedy approach into a genetic algorithm with local search. The experimental results show the improvement achieved by the new parallel heuristic over previous work, as detailed in the previous section: the covered area is improved by approximately 35.29%. This improvement is due to the parallelization on the GPU.


[1] Andreas Bortfeldt and Tobias Winter, A Genetic Algorithm for the Two-Dimensional Knapsack Problem with Rectangular Pieces, Germany, 2008.

[2] Defu Zhang, Furong Ye, Yain Whar Si and Stephen C. H. Leung, A Hybrid Algorithm Based on Variable Neighbourhood for the Strip Packing Problem, Journal of Combinatorial Optimization, China, 2016.

[3] Mohammad Harun Rashid, A Greedy Local Search Metaheuristic with Genetic Algorithm for the 2D Knapsack Problem with Rectangular Pieces, Pace University, USA, 2017.

[4] Jakob Puchinger and Günther Raidl, An Evolutionary Algorithm for Column Generation in Integer Programming: An Effective Approach for 2D Bin Packing, UK, 2016.

[5] Jukka Jylänki, A Thousand Ways to Pack the Bin – A Practical Approach to Two-Dimensional Rectangle Bin Packing, 2010.

[6] E. Hopper and B. C. H. Turton, A Review of the Application of Meta-Heuristic Algorithms to 2D Strip Packing Problems, Artificial Intelligence Review, UK, 2001.

[7] Andrea Lodi, Algorithms for Two-Dimensional Bin Packing and Assignment Problems, Università degli Studi di Bologna, 1999.


