SQL for computing h-score (h-index)
Asked Answered
H

4

7

According to wikipedia:

A scientist has index h if h of his/her Np papers have at least h citations each, and the other (Np − h) papers have no more than h citations each.

Imagine we have SCIENTISTS, PAPERS, CITATIONS tables with 1-n relation between SCIENTISTS and PAPERS and 1-n relation between PAPERS and CITATION TABLES. How to write a SQL statement that would compute h-score for each scientist in SCIENTISTS table?

To present some research effort I did here is a SQL computing number of citations for each paper:

SELECT COUNT(CITATIONS.id) AS citations_count
FROM PAPERS
LEFT OUTER JOIN CITATIONS ON (PAPERS.id = CITATIONS.paper_id)
GROUP BY PAPERS.id
ORDER BY citations_count DESC;
Halfcocked answered 13/9, 2013 at 12:19 Comment(2)
What database are you using? And, are you sure that the h-score is well-defined?Yanez
Any database, this should be generic. I don't understand second question.Halfcocked
Y
5

What the h-value is doing is counting the citations in two ways. Let's say a scientist has the following citation counts:

10
 8
 5
 5
 2
 1

Let's the number that have that many or more citations, and the difference between the two:

10    1    9
 8    2    6
 5    3    2
 5    3    2
 2    5    -3
 1    6    -5

The number you want is where this is 0. In this case, the number is 4.

The fact that the number is 4 makes this hard, because it is not in the original data. That makes the calculation harder, because you need to generate a numbers table.

The following does this using SQL Server syntax for generating a table with 100 numbers:

with numbers as (
      select 1 as n
      union all
      select n+1
      from numbers
      where n < 100
     ),
     numcitations as (
      SELECT p.scientistid, p.id, COUNT(c.id) AS citations_count
      FROM PAPERS p LEFT OUTER JOIN
           CITATIONS c
           ON p.id = c.paper_id
      GROUP BY p.scientist, p.id
     ),
     hcalc as (
      select scientistid, numbers.n,
             (select count(*)
              from numcitations nc
              where nc.scientistid = s.scientistid and
                    nc.citations_count >= numbers.n
             ) as hval
      from numbers cross join
           (select scientistid from scientist) s
     )
select *
from hcalc
where hval = n;

EDIT:

There is a way to do this without using the numbers table. The h-score is the count of cases where the number of citations is greater than or equal to the citation count. This is much easier to calculate:

select scientistid, count(*)
from (SELECT p.scientistid, p.id, COUNT(c.id) AS citations_count,
             rank() over (partition by p.scientistid, p.id order by count(c.id) desc) as ranking
      FROM PAPERS p LEFT OUTER JOIN
           CITATIONS c
           ON p.id = c.paper_id
      GROUP BY p.scientist, p.id
     ) t
where ranking <= citations_count
group by scientistid;
Yanez answered 13/9, 2013 at 13:20 Comment(2)
Thanks, I will definitely test both versions!Halfcocked
It seems your second approach does not work dbfiddle.uk/…Zak
S
1

"Maybe" late but here is another solution that does not rely on analytical functions. Hope it's useful for anybody else with a similar question.

(my previous answer didn't take into considerations your db design)

The query would be:

with aux as ( -- Amount of citations
  select p.id, p.scientist_id, p.title, count(*) as citations
  from paper p join citation c2 on (p.id = c2.paper_id)
  group by 1,2,3 )
select id, name, min(h) as h_index from (
  select c.id,
         c.name,
         aux1.title,
         aux1.citations,
         -- Replacing the use of analytical functions
         -- Eq. to: count(*) over (parition by c.id order by p.citations desc)
         ( select count(*) from aux aux2
           where aux2.citations >= aux1.citations
             and aux2.scientist_id = aux1.scientist_id) as h
  from scientist c join aux aux1 on (c.id = aux1.scientist_id)
) where h >= citations
group by id, name;

And the supporting tables and data to run the examples would be:

create table scientist (
  id integer primary key, -- Alias to ROWID SQLite
  name varchar );

create table paper (
  id integer primary key,
  scientist_id integer not null,
  title varchar );

create table citation (
  id integer primary key,
  paper_id integer not null,
  citation_details varchar ); -- not used

-- Data insertion

insert into scientist (name) values
  ('Mr Doom'),
  ('Batman'),
  ('Superman');

insert into paper (scientist_id, title) values
  (1,'Doomsday'), -- 12 citations
  (1,'Bad Day'), -- 11 citations
  (1,'Sad Day'), -- 10 citations
  (1,'Programming Day'), -- 5 citations
  (1,'Debugin bug #! Error'), -- 2 citations
  (1,'Happy Coding'), -- 0 citations
  (2,'I have money'), -- 1 citations
  (3,'I have no planet'); -- 0 citations (it's not very popular)


insert into citation (paper_id) values
  (1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),
  (2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),
  (3),(3),(3),(3),(3),(3),(3),(3),(3),(3),
  (4),(4),(4),(4),(4),
  (5),(5),
  (7);

I've run it using SQLite3.8. Source code and SQLite example db could be found in my github repository.

Situate answered 17/11, 2015 at 22:55 Comment(0)
B
0

This is a MS SQL solution.

/*
The h-index is calculated by counting the number of publications for which an author has been 
cited by other authors at least that same number of times.  For instance, an h-index of 17 means 
that the scientist has published at least 17 papers that have each been cited at least 17 times.  
If the scientist's 18th most cited publication was cited only 10 times, the h-index would remain at 17.  
If the scientist's 18th most cited publication was cited 18 or more times, the h-index would rise to 18.
*/
declare @num_pubs int = 4;
declare @sited_seed int = 10;
declare @publication table
    (
     scientist_id int,
     publication_title int,
     publication_cited int
    );

with numbers as (
      select 1 as n
      union all
      select n + 1
      from numbers
      where n < @num_pubs
     )

insert into @publication select 1 as scientist_id, n as publication_title, ABS(Checksum(NewID()) % @sited_seed) as publication_cited from numbers

select * from @publication

-- data sample for scientist#1
-- scientist_id     publication      sited
-- 1                pub 1            2
-- 1                pub 2            0
-- 1                pub 3            1
-- 1                pub 4            3
select scientist_id, max(pub_row_number) as h_index
from (
        select p.scientist_id, pub_row_number --, count(p.publication_cited)
        from @publication as p
            inner join  
            (
                select scientist_id, publication_title, publication_cited, 
                        ROW_NUMBER() OVER (PARTITION BY scientist_id ORDER BY publication_cited) as pub_row_number from @publication
                -- create a unique index for publications, and using it to triangulate against the number of cited publications
                -- scientist_id     publication      sited  pub_row_number
                -- 1                pub 1            2      1
                -- 1                pub 2            0      2
                -- 1                pub 3            1      3
                -- 1                pub 4            3      4
            ) as c
        on pub_row_number <= p.publication_cited and p.scientist_id = c.scientist_id
                -- triangulation {pub_row_number <= p.publication_cited} solves two problems
                -- 1. it removes all publications with sited equals to zero
                -- 2. for every publication (pub_row_number) it creates a set of all available sited publications, so that it is possible to   
                --    count the number of publications that has been cited at least same number of times
                -- scientist_id  pub_row_number       publication_cited
                -- 1             1         >          0  >> filtered out
                -- 1             1         <=         1
                -- 1             1         <=         2
                -- 1             1         <=         3
                -- 1             2         <=         2
                -- 1             2         <=         3
                -- 1             3         <=         3
        group by p.scientist_id, pub_row_number --, p.publication_cited
        having pub_row_number <= count(p.publication_cited)
                -- {pub_count <= count(p.publication_cited)} this tiangulation creates a final count
                -- there are 3 publications sited at least once, 2 - sited at least 2 times, and one sited at least 3 times
                -- scientist_id  pub_row_number       count(publication_cited)
                -- 1             1         <=         3
                -- 1             2         <=         2 
                -- 1             3         >          1  >> filtered out via tiangulation
    ) as final 
                -- finally, max(pub_count) pulls the answare
                -- scientist_id  h_index   
                -- 1             2           
group by scientist_id

UNION 
-- include scientist without publications
select scientist_id, 0 from scientist where scientist_id not in (select scientist_id from publication) 

UNION 
-- include scientist with publications but without citation
select scientist_id, 0 from publication group by scientist_id having sum(publication_cited) = 0
Bitterroot answered 13/7, 2018 at 20:25 Comment(0)
H
-1
SELECT F.id,F.name,SUM(H_INDEX) AS  H_INDEX
FROM (
    SELECT a.id,a.name 
    ,CASE WHEN h.HI IS NULL then 0 
          WHEN cited>=HI then 1 else 0 end AS H_INDEX
    ,HI,cited
    FROM author a
    LEFT JOIN (select publication2.author_id,publication2.title,publication2.cited, count(1) AS HI
                from publication publication2
                INNER JOIN publication publication1 on publication2.cited >= publication1.cited
                           and publication2.author_id = publication1.author_id
                GROUP BY publication2.author_id,publication2.title,publication2.cited
              ) h on a.id=h.author_id
) F GROUP BY F.id,F.name ORDER BY H_INDEX DESC
Hottentot answered 29/6, 2016 at 11:5 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.